Banana Slug Genomics

De novo Assembly III

Guest lecturer: Stefan Prost, stefan.prost@berkeley.edu

Error Correction (EC) using the k-mer spectrum

BLESS
SGA
Musket
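
As a rough illustration of what "k-mer spectrum" means here, the following sketch counts k-mers in a handful of toy reads and flags the rare ones, which are the k-mers most likely to contain sequencing errors. The function names, the toy reads, and the cutoff of 2 are assumptions made for this example; this is not how BLESS, SGA, or Musket are implemented.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())

def weak_kmers(counts, min_count):
    """Return the k-mers seen fewer than min_count times."""
    return {kmer for kmer, n in counts.items() if n < min_count}

# Toy data (assumed): the last read differs from the others in its final base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, k=4)
spectrum = kmer_spectrum(counts)
weak = weak_kmers(counts, min_count=2)
print(spectrum)
print(weak)
```

In real data the spectrum usually shows a peak of low-count k-mers (mostly errors) separate from the main coverage peak, and the cutoff is chosen between the two.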

EC Using Kmer Counts

RACER

Adapter and Low Quality Base Trimming

Skewer (Jiang et al. 2014)
AdapterRemoval (Lindgreen 2012)
Trimmomatic (Bolger et al. 2014)
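
To make low quality base trimming concrete, here is a minimal sketch that removes low-quality bases from the 3' end of a read. It assumes Phred+33 encoded quality strings and a threshold of 20, both chosen for illustration. The function names are hypothetical, and the tools listed above do considerably more (adapter matching, sliding windows, paired-end handling).

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in quality_string]

def trim_trailing(seq, qual, min_quality=20):
    """Drop bases from the 3' end while their quality is below min_quality."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], qual[:end]

# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
trimmed_seq, trimmed_qual = trim_trailing("ACGTACGT", "IIIII#I#")
print(trimmed_seq, trimmed_qual)
```

Note that only the trailing run of low-quality bases is removed; the low-quality base in the middle of the read is kept.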

Contamination Filtering

Blast
Allpaths-LG
- Removing Low Frequency k-mers
- Discarding Scaffolds shorter than 1kb
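
The length filter in the last item can be sketched as below. This is a stand-alone illustration with a hypothetical minimal FASTA parser, not the code Allpaths-LG runs; the 1 kb cutoff is the one from the notes.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def drop_short(records, min_length=1000):
    """Keep only sequences that are at least min_length bases long."""
    return {n: s for n, s in records.items() if len(s) >= min_length}

# Toy input (assumed): one 1,200 bp scaffold and one 300 bp scaffold.
fasta = ">scaffold_1 toy\n" + "A" * 600 + "\n" + "C" * 600 + "\n>scaffold_2\n" + "G" * 300 + "\n"
kept = drop_short(parse_fasta(fasta))
print(sorted(kept))
```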

de Bruijn Graph

Most assemblers use kmers to create a de Bruijn graph
“In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has m^n vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence” - http://en.wikipedia.org/wiki/De_Bruijn_graph
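
A minimal sketch of how reads are turned into a de Bruijn graph, assuming the common formulation in which each k-mer becomes an edge from its first k-1 bases to its last k-1 bases. Unlike the full graph in the definition above, only k-mers that actually occur in the reads are included. The function name and the toy read are assumptions for this example, and real assemblers also handle reverse complements, errors, and repeats.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix
    return dict(graph)

graph = de_bruijn(["ACGTC"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)

# For DNA (m = 4 symbols) the full graph on length-3 sequences has 4**3 vertices.
print(4 ** 3)
```

The toy read contains the 3-mers ACG, CGT and GTC, so the graph is the simple path AC -> CG -> GT -> TC. A k-mer seen several times adds a repeated edge, which is one way to keep track of coverage.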

Short Read Assembler

de Bruijn Graph:
- Allpaths-LG
- Abyss (a good first-guess assembler: fast but not very accurate, and it must be run separately for every kmer size)
- SOAPdenovo
- Platanus
SSAKE
Overlap, Layout, Consensus:
- SGA
- Celera

Software Specifics

Allpaths-LG is only for long read data
Platanus was made for heterozygous genomes
SOAPdenovo and Allpaths-LG will help you find a suitable kmer size
Abyss has tools to assess quality (abyss-fac)
Once you have run the assembly, you can get rid of almost everything except the contigs and scaffolds (unitigs are the genome without any gap regions)

N50

N50: at least 50% of the bases are in scaffolds of this size or larger
A contig N50 is calculated by first ordering every contig by length from longest to shortest. Next, starting from the longest contig, the lengths of each contig are summed until this running sum equals one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the length of the shortest contig in this list (Yandell and Ence 2012).
Not always a good indicator - will not detect misassemblies
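
The N50 calculation described above can be sketched as follows. The function name and the toy contig lengths are assumptions for this example. Since the running sum rarely lands exactly on one-half of the total, the sketch stops at the first contig where the sum reaches or passes it.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length  # shortest contig needed to reach half the total
    raise ValueError("no contig lengths given")

# Toy lengths: total 290, half 145; 80 + 70 = 150 reaches it, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

The same function gives a scaffold N50 when it is fed scaffold lengths instead of contig lengths.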

Assembly Quality Assessment

How do you know which kmer size is better? Some people just use N50
CEGMA (recommended) searches for genes that should be found in all eukaryotes in the final assembly.
- how many are complete, partial
- percent completeness
- orthology (how many times was a gene present more than once)
- Lots of orthology in partial genes because they’re broken between contigs
Using RNA transcripts: Baa.pl (Ryan 2013/4, Arxiv)
De novo likelihood-based measures (LAP; Ghodsi et al. 2013)
- How likely it is that reads map to your assembled genome
Feature Response Curve
- Mate-pair orientations and separations
- Repeat content by kmer analysis
- Depth-of-coverage
- Correlated polymorphism in the read alignments
- Read alignment Breakpoint

Gap Filling

GapCloser (Luo et al. 2012)
- Most gap closing is done on the first cycle (run CEGMA > GapCloser > CEGMA again)
GapFiller (Nadalin et al. 2011)
Tries to extend alignments (not done by assemblers)
Can run multiple times, but first cycle is the most effective
Complete genes (CEGMA) much better after gap closing
Works better on some assemblies than others

Resolving Mis-Assemblies

REAPR (Hunt et al. 2013) better than NxRepair
- Run CEGMA to figure out the best assembly and then run REAPR
- AKA “Grim Reapr” - screws up your scaffold N50
NxRepair (Murphy et al. 2014)
Breaks down scaffolds without enough evidence
Run REAPR on the best assembly candidate
Need at least one paired library
Looks for areas with low coverage
Compares expected coverage to actual coverage
Really low coverage is a misassembly
Will remove those areas if you let it, but those are often areas that have been gap filled
Run CEGMA before and after, usually score doesn’t change (finding real misassemblies)
ALWAYS:
- Run CEGMA
- Pick best candidate and run REAPR
- Run CEGMA again to check quality
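
The idea of comparing expected coverage to actual coverage can be illustrated with a small sketch. It is a deliberate simplification and not REAPR's actual method: it takes the median per-base depth as the expected coverage and flags fixed-size windows whose mean depth falls well below it. The function name, window size, and 20% cutoff are assumptions chosen for the example.

```python
from statistics import median

def low_coverage_windows(depth, window=5, fraction=0.2):
    """Return (start, end) windows whose mean depth is below fraction * median depth."""
    expected = median(depth)  # stand-in for the expected coverage
    flagged = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        if sum(chunk) / len(chunk) < fraction * expected:
            flagged.append((start, start + len(chunk)))
    return flagged

# Toy per-base depth (assumed): 30x everywhere except a 5 bp dip to 2x.
depth = [30] * 10 + [2] * 5 + [30] * 10
print(low_coverage_windows(depth))
```

Flagged windows are candidates only. As noted above, low-coverage regions are often gap-filled sequence, so they are worth checking before breaking a scaffold there.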

Genome Merging

Metassembler (Wences and Schatz 2014, Arxiv)
- newest assembler out
- ranks assemblies by N50
- choose primary assembly (best N50)
- compares two assemblies and builds consensus sequence
GAM-NGS (Vicedomini et al. 2013)

Overall

Allpaths/abyss/soap → gap closer → REAPR → gap closer → final
Read clipping marks the quality of the read. There are two options: soft masking (low-quality bases are written as lowercase characters) and hard masking (low-quality bases are replaced by the N character)
Assemblies are never finished - still over 100 gaps in human genome
What is a good genome?
- Contig N50 close to or bigger than average gene size
- Number of scaffolds close to number of chromosomes
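
The soft and hard masking mentioned earlier in this section can be sketched as follows, assuming Phred+33 quality strings and a quality threshold of 20. Both values and the function name are chosen for illustration only.

```python
def mask_low_quality(seq, qual, min_quality=20, hard=False, offset=33):
    """Soft-mask (lowercase) or hard-mask (N) bases whose quality is below min_quality."""
    masked = []
    for base, q in zip(seq, qual):
        if ord(q) - offset < min_quality:
            masked.append("N" if hard else base.lower())
        else:
            masked.append(base.upper())
    return "".join(masked)

# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
print(mask_low_quality("ACGT", "I#I#"))             # soft masking
print(mask_low_quality("ACGT", "I#I#", hard=True))  # hard masking
```

Soft masking keeps the base calls, so a downstream tool can still choose to use them; hard masking discards them.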

Downstream Processing

Repeat annotation
Gene annotation
Mapping to get Diploid Genome (most assemblers won't call polymorphisms)
