lecture_notes:04-10-2015

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-10-2015 [2015/04/10 10:44]
jolespin
lecture_notes:04-10-2015 [2015/04/17 10:19] (current)
almussel
Line 1: Line 1:
-De novo Assembly III +=====De novo Assembly III===== 
-Guest lecturer: Stefan Prost, stefan.prost@berkley.edu+====Guest lecturer: Stefan Prost, stefan.prost@berkley.edu====
  
-**Best rated assemblers**+==Error Correction (EC) using the k-mer spectrum== 
 +    * BLESS 
 +    * SGA 
 +    * Musket
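
The k-mer spectrum idea behind these tools can be sketched in a few lines of Python. This is only an illustration, not how BLESS, SGA, or Musket are implemented: the function names, the toy reads, and the `min_count` cutoff are all assumptions. K-mers that occur only rarely are treated as likely sequencing errors.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


def weak_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times (candidate errors)."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data (hypothetical): the last read carries a single substitution.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = kmer_counts(reads, k=4)
spectrum = kmer_spectrum(counts)
suspects = weak_kmers(counts, min_count=2)
```

On real data the cutoff is usually read off the valley between the low-count error peak and the main coverage peak of the spectrum; the fixed value used here is just for the toy example.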
  
-#Error Correction (EC) using the k-mer spectrum +==EC Using Kmer Counts== 
- . BLESS +    ​* ​RACER
- . SGA +
- . Musket +
-#EC Using Kmer Counts +
- RACER+
  
-#Adapter and Low Quality Base Trimming +==Adapter and Low Quality Base Trimming== 
- Skewer (Jiang et al. 2014) +    ​* ​Skewer (Jiang et al. 2014) 
- AdapterRemoval (Lindgreen 2012) +    ​* ​AdapterRemoval (Lindgreen 2012) 
- Trimmomatic (Bolger et al. 2014) +    ​* ​Trimmomatic (Bolger et al. 2014) 
-#Contamination Filtering +==Contamination Filtering== 
- Blast +    ​* ​Blast 
- Allpaths-LG +    ​* ​Allpaths-LG 
- Removing Low Frequency k-mers +        ​* ​Removing Low Frequency k-mers 
- Discarding Scaffolds shorter than 1kb +        ​* ​Discarding Scaffolds shorter than 1kb 
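
A length filter like the 1 kb scaffold cutoff above could look roughly like the following. It is a minimal sketch that assumes plain FASTA text as input; `parse_fasta`, `drop_short`, and the example records are hypothetical names and data, not part of Allpaths-LG.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_short(records, min_length=1000):
    """Keep only records whose sequence is at least min_length bases long."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Hypothetical input: one 1200 bp scaffold and one 200 bp scaffold.
example = ">scaffold_1\n" + "ACGT" * 300 + "\n>scaffold_2\n" + "ACGT" * 50 + "\n"
kept = drop_short(parse_fasta(example), min_length=1000)
```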
-#de Bruijn Graph +==de Bruijn Graph== 
- Most assemblers use kmers to create a de Bruijn graph  +    ​* ​Most assemblers use kmers to create a de Bruijn graph  
- . “In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has m^n vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence” - http://en.wikipedia.org/wiki/De_Bruijn_graph +    * “In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has m^n vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence” - http://en.wikipedia.org/wiki/De_Bruijn_graph 
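
To make the graph construction concrete, here is a small sketch that builds a de Bruijn graph from reads. It assumes the convention commonly used for assembly, where nodes are (k-1)-mers and every distinct k-mer contributes one edge from its prefix to its suffix; real assemblers add error handling, reverse complements, and coverage tracking that are left out. The function name and the toy reads are illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each distinct k-mer is an edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix (they overlap by k-2 bases).
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Two overlapping toy reads (hypothetical data).
graph = de_bruijn_graph(["ACGTC", "CGTCA"], k=3)
```

Walking the edges from `AC` spells out the merged sequence `ACGTCA`, which is the sense in which an assembler reads contigs off paths in the graph.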
  
-#Short Read Assembler +==Short Read Assembler== 
- Allpaths-LG +    * de Bruijn Graph: 
- Abyss (good guessing assembler…fast but not too accurate but must run for every kmer) +        * Allpaths-LG 
- SOAPdenovo +        ​* ​Abyss (good guessing assembler…fast but not too accurate but must run for every kmer) 
- Platanus +        ​* ​SOAPdenovo 
- SSAKE +        ​* ​Platanus 
- SGA +        * SSAKE 
- Celera+    * Overlap, Layout, Consensus 
 +        * SGA 
 +        ​* ​Celera
  
-#Abyss Output files +==Software Specifics== 
- . unitigs.fa  +    * Allpaths-LG ​is only for long read data 
- . “A unitig is a special kind of contig. Ideally, it is fully consistent with all the data including reads, overlaps, and mate constraints. In practice, unitigs can only be consistent with most of the data. Conceptually, a unitig is a high-confidence contig. Maximal unitigs should contain either (1) unique sequence up to repeat boundaries, with less than a read-length of repeat on each end, or (2) nearly the full extent of a genomic repeat.” - http://wgs-assembler.sourceforge.net/wiki/index.php/Celera_Assembler_Terminology +    * Platanus was made for heterozygous genomes 
- . contigs.fa +    * SOAP, Allpaths-LG will help you find proper kmer size 
- . scaffolds.fa+    * Abyss has tools to assess quality (abyss-fac) 
 +    * Once you run the assembly, you can get rid of almost everything except the contigs and scaffolds (unitigs is the genome w/o any gap regions)
  
-#N50 +==N50== 
- N50 at least 50% of the bases are in scaffolds this size or bigger +    * N50: at least 50% of the bases are in scaffolds of this size or bigger 
- A contig N50 is calculated by first ordering every contig by lenght from longest to shortest.  Next, starting from the longest contig, lenghts of each contig are summed, until this running sum equals one-half o the total length of all contigs in the assemly.  The contig N50 of the assembly is the length of the shortest contig in this list (Yandel and Ence 2012). +    * A contig N50 is calculated by first ordering every contig by length from longest to shortest. Next, starting from the longest contig, the lengths of each contig are summed until this running sum equals one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the length of the shortest contig in this list (Yandell and Ence 2012). 
 +    * Not always a good indicator - will not detect misassemblies
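
The N50 recipe above translates directly into code. The sketch below follows the described steps, with one assumption made explicit: the running sum stops as soon as it reaches or exceeds half of the total, since it rarely lands exactly on one-half. The function name and the example lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total is the N50.
        if running >= half:
            return length
    raise ValueError("n50 needs at least one contig length")


# Total is 290, half is 145; 80 + 70 = 150 reaches it, so the N50 is 70.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

The same function gives a scaffold N50 when it is fed scaffold lengths instead of contig lengths.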
  
-#Assembly Quality Assessment +==Assembly Quality Assessment== 
- . By annotation: CEGMA (Parra et al. 2007) looks for certain genes with the sequence to measure quality. (http://korflab.ucdavis.edu/datasets/cegma/) +    * How do you know which kmer size is better? Some people just use N50 
- Using RNA transcripts: Baa.pl (Ryan 2013/4, Arxiv) +    * CEGMA (recommended) searches the final assembly for genes that should be found in all eukaryotes. 
- De novo liklihood-based measures (LAP; Ghodsi et al. 2013) +        * how many are complete, partial 
- Feature Response Curve +        * percent completeness 
- Mate-pair orientations and separations +        * orthology (how many times was a gene present more than once) 
- Repeat content by kmer analysis +        * Lots of orthology in partial genes because they’re broken between contigs 
- Depth-of-coverage +    * Using RNA transcripts:​ Baa.pl (Ryan 2013/4, Arxiv) 
- Correlated polymorphism in the read alignments +    * De novo likelihood-based measures (LAP; Ghodsi et al. 2013) 
- Read alignment Breakpoint+        * How likely it is that reads map to your assembled genome 
 +    * Feature Response Curve 
 +        ​* ​Mate-pair orientations and separations 
 +        ​* ​Repeat content by kmer analysis 
 +        ​* ​Depth-of-coverage 
 +        ​* ​Correlated polymorphism in the read alignments 
 +        ​* ​Read alignment Breakpoint
  
-#Gap Filling +==Gap Filling== 
- GapCloser (Luo et al. 2012)  +    ​* ​GapCloser (Luo et al. 2012)  
- Most gap closing is done on first cycle (Run CEGMA > GapCloser > Cegma again)  +        * Most gap closing is done on first cycle (Run CEGMA > GapCloser > CEGMA again) 
- GapFiller (Nadalin et al. 2011)+    ​* ​GapFiller (Nadalin et al. 2011) 
 +    * Tries to extend alignments (not done by assemblers) 
 +    * Can run multiple times, but first cycle is the most effective 
 +    * Complete genes (CEGMA) much better after gap closing 
 +    * Works better on some assemblies than others
  
-#Resolving Mis-Assemblies +==Resolving Mis-Assemblies== 
- REAPR (Hunt et al. 2013) better than NxRepair +    ​* ​REAPR (Hunt et al. 2013) better than NxRepair 
- Run Cegma to figure out best assembly and then run reaper +        * Run CEGMA to figure out the best assembly and then run REAPR 
- NxRepair (Murphy et al. 2014)+        * AKA "Grim Reapr" - screws up your scaffold N50 
 +    * NxRepair (Murphy et al. 2014) 
 +    * Breaks down scaffolds without enough evidence 
 +    * Run REAPR on the best assembly candidate 
 +    * Need at least one paired library 
 +    * Looks for areas with low coverage 
 +    * Compares expected coverage to actual coverage 
 +    * Really low coverage is a misassembly 
 +    * Will remove those areas if you let it, but those are often areas that have been gap filled 
 +    * Run CEGMA before and after, usually score doesn’t change (finding real misassemblies) 
 +    * ALWAYS: 
 +        * Run CEGMA 
 +        * Pick best candidate and run REAPR 
 +        * Run CEGMA again to check quality
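
The "compare expected coverage to actual coverage" idea can be illustrated with a toy windowed scan. This is not REAPR's algorithm, only a sketch under stated assumptions: per-base depths are already available as a list, the expected depth is taken to be the median, and a window is flagged when its mean depth falls below a hypothetical `min_fraction` of that expectation.

```python
from statistics import median


def low_coverage_windows(depths, window, min_fraction=0.2):
    """Return (start, end, mean_depth) for windows with unusually low coverage."""
    expected = median(depths)  # assumption: the median stands in for expected depth
    flagged = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        if mean_depth < min_fraction * expected:
            flagged.append((start, start + len(chunk), mean_depth))
    return flagged


# Hypothetical per-base depths with a dip between positions 10 and 15.
depths = [30] * 10 + [2] * 5 + [30] * 10
flagged = low_coverage_windows(depths, window=5)
```

As the notes warn, flagged regions are often places that were gap filled, so they are candidates for inspection rather than automatic removal.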
  
-#Genome Merging +==Genome Merging== 
- Metassembler (Wences and Schatz 2014, Arxiv) +    ​* ​Metassembler (Wences and Schatz 2014, Arxiv) 
- newest assembler out +        ​* ​newest assembler out 
- ranks assemblies by N50 +        ​* ​ranks assemblies by N50 
- . Merges the different assemblies to form optimal ​assembly +        * choose primary ​assembly ​(best N50) 
- GAM-NGS (Vicedomini et al. 2013)+        * compares two assemblies and builds consensus sequence 
 +    ​* ​GAM-NGS (Vicedomini et al. 2013)
  
-Read clipping marks the quality of the read.  Soft masking (low quality gets lowercase character) and hard masking (low quality gets replaced by N character)+==Overall== 
 +    * Allpaths/​abyss/​soap -> gap closer -> REAPR -> gap closer -> final 
 +    * Read clipping marks the quality of the read.  Soft masking (low quality gets lowercase character) and hard masking (low quality gets replaced by N character) 
 +    * Assemblies are never finished - still over 100 gaps in human genome 
 +    * What is a good genome? 
 +        * Contig N50 close to or bigger than average gene size 
 +        * Number of scaffolds close to number of chromosomes
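
The difference between soft and hard masking mentioned in this list is easy to show in code. The sketch below assumes one numeric quality score per base and a hypothetical `min_qual` threshold; the function names are illustrative and not taken from any of the tools above.

```python
def soft_mask(seq, quals, min_qual):
    """Lowercase every base whose quality is below min_qual."""
    if len(seq) != len(quals):
        raise ValueError("sequence and qualities must have the same length")
    return "".join(
        base.lower() if q < min_qual else base.upper()
        for base, q in zip(seq, quals)
    )


def hard_mask(seq, quals, min_qual):
    """Replace every base whose quality is below min_qual with N."""
    if len(seq) != len(quals):
        raise ValueError("sequence and qualities must have the same length")
    return "".join(
        "N" if q < min_qual else base.upper()
        for base, q in zip(seq, quals)
    )


# Hypothetical read with two low-quality positions.
seq = "ACGTAC"
quals = [30, 30, 5, 30, 8, 30]
soft = soft_mask(seq, quals, min_qual=10)
hard = hard_mask(seq, quals, min_qual=10)
```

Soft masking keeps the original base recoverable, while hard masking discards it.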
  
-. Number of scaffolds shold be close to number chromosomes (ideally but rare) +==Downstream Processing== 
-. Contig N50 close to or bigger 2x avg(gene size)+    * Repeat annotation 
 +    * Gene annotation 
 +    * Mapping to get Diploid Genome (most assemblers won't call polymorphisms)
  
-#Downstream Processing 
- . Repeat annotation 
- . Gene annotation 
- . Mapping to get Diploid Genome 
- 
-. Efasta [AT{G,C}A] 
lecture_notes/04-10-2015.1428687881.txt.gz · Last modified: 2015/04/10 10:44 by jolespin