lecture_notes:04-10-2015

De novo Assembly III

Guest lecturer: Stefan Prost, stefan.prost@berkeley.edu

Error Correction (EC) using the k-mer spectrum
  • BLESS
  • SGA
  • Musket
EC Using Kmer Counts
  • RACER
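The idea behind k-mer-based error correction can be sketched as follows: k-mers that contain a sequencing error tend to be rare, while correct k-mers are seen about as often as the coverage depth. This is only a minimal illustration, not how BLESS, SGA, Musket or RACER are implemented; the function names, the toy reads and the `min_count` threshold are all assumptions made for the example.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(counts, min_count):
    """Return the k-mers seen fewer than min_count times (candidate errors).

    min_count is a hypothetical threshold; real tools derive it from the
    k-mer spectrum of the data set.
    """
    return {kmer for kmer, n in counts.items() if n < min_count}

# Toy data: the last read carries one substitution (third base, G -> T).
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACTTACGT"]
counts = count_kmers(reads, k=4)
suspects = weak_kmers(counts, min_count=2)
print(sorted(suspects))
```

Every k-mer that overlaps the substituted base appears only once, so all of them fall below the threshold, while the error-free k-mers are counted several times.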
Adapter and Low Quality Base Trimming
  • Skewer (Jiang et al. 2014)
  • AdapterRemoval (Lindgreen 2012)
  • Trimmomatic (Bolger et al. 2014)
Contamination Filtering
  • Blast
  • Allpaths-LG
    • Removing Low Frequency k-mers
    • Discarding Scaffolds shorter than 1kb
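Discarding scaffolds shorter than 1 kb is a simple length filter. The sketch below assumes the scaffolds have already been loaded into a dictionary of name to sequence; the function name and the toy sequences are made up for illustration and are not part of Allpaths-LG.

```python
def drop_short_scaffolds(scaffolds, min_len=1000):
    """Keep only scaffolds whose sequence is at least min_len bases long."""
    return {name: seq for name, seq in scaffolds.items() if len(seq) >= min_len}

# Hypothetical scaffolds: one long, one borderline, one short.
scaffolds = {
    "scaffold_1": "A" * 2500,
    "scaffold_2": "C" * 1000,
    "scaffold_3": "G" * 400,
}
kept = drop_short_scaffolds(scaffolds)
print(sorted(kept))
```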
de Bruijn Graph
  • Most assemblers use kmers to create a de Bruijn graph
  • “In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has m^n vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence” - http://en.wikipedia.org/wiki/De_Bruijn_graph
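As a rough sketch of how reads become a de Bruijn graph, the example below uses the formulation common in assemblers: each k-mer is an edge from its first k-1 bases to its last k-1 bases. Unlike the complete graph in the definition above, only k-mers actually observed in the reads are stored. The function name and the toy reads are assumptions, and real assemblers add reverse complements, coverage counts and error pruning on top of this.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix (they overlap by k-2).
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Two overlapping toy reads.
graph = de_bruijn(["ACGTT", "CGTTA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the edges from AC spells out ACGTTA, the sequence both reads came from, which is the basic way contigs are read off the graph.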
Short Read Assembler
  • de Bruijn Graph:
    • Allpaths-LG
    • Abyss (good guessing assembler…fast but not too accurate but must run for every kmer)
    • SOAPdenovo
    • Platanus
  • SSAKE
  • Overlap, Layout, Consensus
    • SGA
    • Celera
Software Specifics
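  • Allpaths-LG has strict library requirements: it needs a short-insert (overlapping fragment) library plus at least one long-insert (jumping) library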
  • Platanus was made for heterozygous genomes
  • SOAP, Allpaths-LG will help you find proper kmer size
  • Abyss has tools to assess quality (abyss-fac)
  • Once the assembly has run, you can delete almost everything except the contigs and scaffolds (the unitigs are the assembly without any gap regions)
N50
  • N50: at least 50% of the assembled bases are in scaffolds of this size or larger
  • A contig N50 is calculated by first ordering every contig by length from longest to shortest. Next, starting from the longest contig, the lengths of each contig are summed until this running sum equals one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the length of the shortest contig in this list (Yandell and Ence 2012).
  • Not always a good indicator - will not detect misassemblies
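The N50 procedure described above can be written out directly. This is a minimal sketch with made-up contig lengths; the function name is an assumption, and it returns the length at which the running sum first reaches or passes half of the total.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)   # longest first
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

contig_lengths = [80, 70, 50, 40, 30, 20, 10]   # made-up lengths, total 300
print(n50(contig_lengths))
```

Here the two longest contigs (80 + 70) already cover half of the 300 bases, so the N50 is 70. The number says nothing about whether those contigs are joined correctly.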
Assembly Quality Assessment
  • How do you know which kmer size is better? Some people just use N50
  • CEGMA (recommended) searches for genes that should be found in all eukaryotes in the final assembly.
    • how many are complete, partial
    • percent completeness
    • orthology (how many times was a gene present more than once)
    • Lots of orthology in partial genes because they’re broken between contigs
  • Using RNA transcripts: Baa.pl (Ryan 2013/4, Arxiv)
  • De novo likelihood-based measures (LAP; Ghodsi et al. 2013)
    • How likely it is that reads map to your assembled genome
  • Feature Response Curve
    • Mate-pair orientations and separations
    • Repeat content by kmer analysis
    • Depth-of-coverage
    • Correlated polymorphism in the read alignments
    • Read alignment Breakpoint
Gap Filling
  • GapCloser (Luo et al. 2012)
    • Most gap closing is done on the first cycle (run CEGMA > GapCloser > CEGMA again)
  • GapFiller (Nadalin et al. 2011)
  • Tries to extend alignments (not done by assemblers)
  • Can run multiple times, but first cycle is the most effective
  • Complete genes (CEGMA) much better after gap closing
  • Works better on some assemblies than others
Resolving Mis-Assemblies
  • REAPR (Hunt et al. 2013) better than NxRepair
    • Run CEGMA to figure out the best assembly and then run REAPR
    • AKA “Grim Reapr” - screws up your scaffold N50
  • NxRepair (Murphy et al. 2014)
  • Breaks down scaffolds without enough evidence
  • Run REAPR on the best assembly candidate
  • Need at least one paired library
  • Looks for areas with low coverage
  • Compares expected coverage to actual coverage
  • Really low coverage is a misassembly
  • Will remove those areas if you let it, but those are often areas that have been gap filled
  • Run CEGMA before and after, usually score doesn’t change (finding real misassemblies)
  • ALWAYS:
    • Run CEGMA
    • Pick best candidate and run REAPR
    • Run CEGMA again to check quality
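The "compare expected coverage to actual coverage" step can be illustrated with a simple window scan over per-base read depth. This is only a sketch of the idea, not REAPR's or NxRepair's actual method; the function name, the window size and the `min_fraction` cutoff are assumptions, and the depth values are made up.

```python
def low_coverage_windows(depth, window, expected, min_fraction=0.2):
    """Return (start, end) of non-overlapping windows whose mean depth is
    below min_fraction * expected. Coordinates are 0-based, end-exclusive."""
    flagged = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        if sum(chunk) / len(chunk) < min_fraction * expected:
            flagged.append((start, start + len(chunk)))
    return flagged

# Made-up per-base depth along one scaffold: a dip in the middle.
depth = [30] * 10 + [2] * 5 + [30] * 10
flagged = low_coverage_windows(depth, window=5, expected=30)
print(flagged)
```

Flagged windows are only candidates for breaking a scaffold. As noted above, gap-filled regions often look like this too, so checking CEGMA before and after remains the safeguard.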
Genome Merging
  • Metassembler (Wences and Schatz 2014, Arxiv)
    • newest assembler out
    • ranks assemblies by N50
    • choose primary assembly (best N50)
    • compares two assemblies and builds consensus sequence
  • GAM-NGS (Vicedomini et al. 2013)
Overall
  • Allpaths/abyss/soap → gap closer → REAPR → gap closer → final
  • Read clipping marks the low-quality parts of a read, either by soft masking (low-quality bases are written in lowercase) or by hard masking (low-quality bases are replaced by N)
  • Assemblies are never finished - still over 100 gaps in human genome
  • What is a good genome?
    • Contig N50 close to or bigger than average gene size
    • Number of scaffolds close to number of chromosomes
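The difference between the soft and hard masking mentioned under Overall is easiest to see on a single read. The sketch below assumes the qualities are already decoded to integer Phred scores; the function name and the `min_qual` cutoff of 20 are illustrative choices, not a recommendation from the lecture.

```python
def mask(seq, quals, min_qual, hard=False):
    """Mask bases whose quality is below min_qual.

    Soft masking lowercases the base; hard masking replaces it with N.
    """
    out = []
    for base, q in zip(seq, quals):
        if q >= min_qual:
            out.append(base.upper())
        elif hard:
            out.append("N")
        else:
            out.append(base.lower())
    return "".join(out)

seq = "ACGTAC"
quals = [35, 36, 8, 34, 5, 30]   # made-up Phred scores
soft = mask(seq, quals, min_qual=20)
hard = mask(seq, quals, min_qual=20, hard=True)
print(soft, hard)
```

Soft masking keeps the original base call recoverable, whereas hard masking discards it.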
Downstream Processing
  • Repeat annotation
  • Gene annotation
  • Mapping to get Diploid Genome (most assemblers won't call polymorphisms)
lecture_notes/04-10-2015.txt · Last modified: 2015/04/17 17:19 by almussel