lecture_notes:04-10-2015
De novo Assembly III
Guest lecturer: Stefan Prost, stefan.prost@berkeley.edu
Error Correction (EC) using the k-mer spectrum
EC Using Kmer Counts
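A minimal sketch of the idea behind k-mer-count error correction, assuming the usual reasoning that k-mers seen only once or twice in high-coverage data are more likely sequencing errors than real genome sequence. The function names, the toy reads, and the `min_count` cutoff are made up for illustration; real correctors also handle reverse complements and pick the cutoff from the spectrum itself.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())

def weak_kmers(counts, min_count):
    """k-mers seen fewer than min_count times (candidate errors)."""
    return {kmer for kmer, c in counts.items() if c < min_count}

# Toy data: three identical reads plus one read with a single-base error.
reads = ["ACGTACGA", "ACGTACGA", "ACGTACGA", "ACGTTCGA"]
counts = count_kmers(reads, k=4)
spectrum = kmer_spectrum(counts)
weak = weak_kmers(counts, min_count=2)
print(sorted(spectrum.items()))  # [(1, 4), (3, 4), (4, 1)]
print(sorted(weak))              # ['CGTT', 'GTTC', 'TCGA', 'TTCG']
```

The four k-mers with count 1 all overlap the erroneous base, which is the signal a corrector uses to locate and fix it.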
Adapter and Low Quality Base Trimming
Skewer (Jiang et al. 2014)
AdapterRemoval (Lindgreen 2012)
Trimmomatic (Bolger et al. 2014)
Contamination Filtering
de Bruijn Graph
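A toy sketch of a de Bruijn graph, assuming the common formulation where nodes are (k-1)-mers and every k-mer in the reads adds an edge from its prefix to its suffix. Function names and reads are made up; edge multiplicities, reverse complements and error handling are left out, so this is not how any particular assembler implements it.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

def walk_unbranched(graph, start):
    """Extend a sequence from start while each node has exactly one successor."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        path += nxt[-1]
        seen.add(nxt)
        node = nxt
    return path

# Two overlapping toy reads from the sequence ACGTCA.
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=3)
contig = walk_unbranched(graph, "AC")
print(contig)  # ACGTCA
```

The walk stops at the first node with zero or several successors, which is roughly where repeats and heterozygous sites break contigs.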
Short Read Assembler
Software Specifics
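ALLPATHS-LG is for large genomes ("LG") and works on short-read data, not long reads; it needs specific libraries (an overlapping fragment library plus a jumping library)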
Platanus was made for heterozygous genomes
SOAP and ALLPATHS-LG will help you find a proper k-mer size
ABySS has tools to assess assembly quality (abyss-fac, which reports contiguity statistics such as N50)
Once the assembly has run, you can get rid of almost everything except the contigs and scaffolds (unitigs are the unambiguously assembled sequences, without any gap regions)
N50
N50: at least 50% of the bases are in scaffolds of this size or bigger
A contig N50 is calculated by first ordering every contig by length from longest to shortest. Next, starting from the longest contig, the lengths of each contig are summed, until this running sum equals one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the length of the shortest contig in this list (Yandell and Ence 2012).
Not always a good indicator - will not detect misassemblies
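The N50 procedure described above can be sketched as follows. The function name and the contig lengths are made up; the running sum stops as soon as it reaches or passes half the total, since it rarely lands exactly on one-half.

```python
def n50(lengths):
    """Length of the shortest contig in the smallest set of longest
    contigs that together cover at least half of the total length."""
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length

contigs = [80, 70, 50, 40, 30, 20, 10]  # total 300, half is 150
print(n50(contigs))                     # 70  (80 + 70 = 150)

# Wrongly joining the two longest contigs (80 + 70 -> 150) raises N50.
misjoined = [150, 50, 40, 30, 20, 10]
print(n50(misjoined))                   # 150
```

The second call shows why N50 alone is risky: a misassembly that glues contigs together makes the number look better.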
Assembly Quality Assessment
How do you know which kmer size is better? Some people just use N50
CEGMA (recommended) searches for genes that should be found in all eukaryotes in the final assembly.
how many are complete, partial
percent completeness
orthology (how many genes were found more than once)
Lots of orthology in partial genes because they’re broken between contigs
Using RNA transcripts: Baa.pl (Ryan 2013/4, arXiv)
De novo likelihood-based measures (LAP; Ghodsi et al. 2013)
Feature Response Curve
Mate-pair orientations and separations
Repeat content by kmer analysis
Depth-of-coverage
Correlated polymorphism in the read alignments
Read alignment Breakpoint
Gap Filling
GapCloser (Luo et al. 2012)
GapFiller (Nadalin et al. 2011)
Tries to extend alignments (not done by assemblers)
Can run multiple times, but first cycle is the most effective
Complete genes (CEGMA) much better after gap closing
Works better on some assemblies than others
Resolving Mis-Assemblies
REAPR (Hunt et al. 2013) better than NxRepair
NxRepair (Murphy et al. 2014)
Breaks down scaffolds without enough evidence
Run REAPR on the best assembly candidate
Need at least one paired library
Looks for areas with low coverage
Compares expected coverage to actual coverage
Really low coverage points to a misassembly
Will remove those areas if you let it, but those are often areas that have been gap filled
Run CEGMA before and after; the score usually doesn’t change, which suggests the breaks are at real misassemblies
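A rough sketch of the "expected vs. actual coverage" idea, assuming per-base read depth along one scaffold is already available as a list. This is not REAPR's actual scoring (REAPR works from paired-read fragment coverage); the function name, the `min_fraction` and `min_length` parameters, and the depth values are made up for illustration.

```python
def low_coverage_regions(depth, expected, min_fraction=0.1, min_length=1):
    """Return half-open (start, end) intervals where depth stays below
    min_fraction * expected for at least min_length positions."""
    threshold = expected * min_fraction
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d < threshold:
            if start is None:
                start = pos          # a low-coverage run begins here
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the scaffold.
    if start is not None and len(depth) - start >= min_length:
        regions.append((start, len(depth)))
    return regions

depth = [30, 28, 31, 2, 0, 1, 29, 30, 27, 0]  # toy per-base depth
print(low_coverage_regions(depth, expected=30))                # [(3, 6), (9, 10)]
print(low_coverage_regions(depth, expected=30, min_length=2))  # [(3, 6)]
```

Flagged intervals are only candidates for breaking; as noted above, gap-filled regions often show up here too, so check before removing them.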
ALWAYS:
Genome Merging
Overall
Allpaths/abyss/soap → gap closer → REAPR → gap closer → final
Read clipping marks the low-quality parts of a read. Two ways to mark them: soft masking (low-quality bases become lowercase characters) and hard masking (low-quality bases are replaced by the N character)
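A small sketch of the two masking styles, assuming each base comes with a Phred quality score and that anything below a cutoff counts as low quality. The function names, the `min_q=20` cutoff, and the example read are made up.

```python
def _check_lengths(seq, quals):
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must be the same length")

def soft_mask(seq, quals, min_q=20):
    """Lowercase every base whose quality is below min_q."""
    _check_lengths(seq, quals)
    return "".join(b.lower() if q < min_q else b.upper()
                   for b, q in zip(seq, quals))

def hard_mask(seq, quals, min_q=20):
    """Replace every base whose quality is below min_q with N."""
    _check_lengths(seq, quals)
    return "".join("N" if q < min_q else b for b, q in zip(seq, quals))

seq = "ACGTACGT"
quals = [35, 34, 12, 30, 8, 33, 36, 5]  # toy Phred scores
soft = soft_mask(seq, quals)
hard = hard_mask(seq, quals)
print(soft)  # ACgTaCGt
print(hard)  # ACNTNCGN
```

Soft masking keeps the original base call recoverable, while hard masking throws it away.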
Assemblies are never finished - still over 100 gaps in human genome
What is a good genome?
Downstream Processing
lecture_notes/04-10-2015.txt · Last modified: 2015/04/17 17:19 by almussel