lecture_notes:04-10-2015

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-10-2015 [2015/04/10 10:44]
jolespin created
lecture_notes:04-10-2015 [2015/04/17 10:19] (current)
almussel
Line 1: Line 1:
-De novo Assembly III
-Guest lecturer: Stefan Prost, stefan.prost@berkley.edu
+=====De novo Assembly III=====
+====Guest lecturer: Stefan Prost, stefan.prost@berkley.edu====
  
-**Best rated assemblers**
-#Error Correction (EC) using the k-mer spectrum
- . BLESS
- . SGA
- . Musket
-#EC Using Kmer Counts
- . RACER
+==Error Correction (EC) using the k-mer spectrum==
+    * BLESS
+    * SGA
+    * Musket
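The tools listed above work from the k-mer spectrum: k-mers that occur only once or twice in a deep data set are more likely to come from sequencing errors than from the genome. The following is a minimal sketch of that idea only, not how BLESS, SGA or Musket are implemented; the function names and the `min_count` threshold are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Histogram: multiplicity -> number of distinct k-mers seen that often."""
    return Counter(counts.values())


def weak_kmers(counts, min_count=2):
    """k-mers seen fewer than min_count times (hypothetical error threshold)."""
    return {kmer for kmer, c in counts.items() if c < min_count}


# Toy data: the third read differs from the others in its last base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, k=4)
print(kmer_spectrum(counts))
print(weak_kmers(counts))
```

In this toy data set the only weak k-mer is the one that covers the altered last base, which is the kind of signal an error corrector would act on.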
  
-#Adapter and Low Quality Base Trimming
- . Skewer (Jiang et al. 2014)
- . AdapterRemoval (Lindgreen 2012)
- . Trimmomatic (Bolger et al. 2014)
-#Contamination Filtering
- . Blast
- . Allpaths-LG
- . Removing Low Frequency k-mers
- . Discarding Scaffolds shorter than 1kb
-#de Bruijn Graph
- . Most assemblers use kmers to create a de Bruijn graph
- . “In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has m^n vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence” - http://en.wikipedia.org/wiki/De_Bruijn_graph
+==EC Using Kmer Counts==
+    * RACER
  
-#Short Read Assembler
- . Allpaths-LG
- . Abyss (good guessing assembler…fast but not too accurate but must run for every kmer)
- . SOAPdenovo
- . Platanus
- . SSAKE
- . SGA
- . Celera
+==Adapter and Low Quality Base Trimming==
+    * Skewer (Jiang et al. 2014)
+    * AdapterRemoval (Lindgreen 2012)
+    * Trimmomatic (Bolger et al. 2014)
+==Contamination Filtering==
+    * Blast
+    * Allpaths-LG
+        * Removing Low Frequency k-mers
+        * Discarding Scaffolds shorter than 1kb
+==de Bruijn Graph==
+    * Most assemblers use kmers to create a de Bruijn graph
+    * “In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has m^n vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence” - http://en.wikipedia.org/wiki/De_Bruijn_graph
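The quoted definition describes the full graph over all possible sequences. Assemblers typically build only the part that is observed in the reads: each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The following is a minimal sketch under that assumption; the function name `de_bruijn` and the toy reads are hypothetical, and real assemblers also handle reverse complements, errors and coverage.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph from reads.

    Nodes are (k-1)-mers; every k-mer in a read adds an edge
    from its prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping toy reads; with k=3 the nodes are 2-mers.
graph = de_bruijn(["ACGTC", "CGTCA"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))

# For DNA (m = 4 symbols) there are 4**n possible length-n nodes.
print(4 ** 2)
```

Walking the edges AC -> CG -> GT -> TC -> CA spells out ACGTCA, the sequence that both toy reads came from.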
  
-#Abyss Output files
- . unitigs.fa
- . “A unitig is a special kind of contig. Ideally, it is fully consistent with all the data including reads, overlaps, and mate constraints. In practice, unitigs can only be consistent with most of the data. Conceptually, a unitig is a high-confidence contig. Maximal unitigs should contain either (1) unique sequence up to repeat boundaries, with less than a read-length of repeat on each end, or (2) nearly the full extent of a genomic repeat.” - http://wgs-assembler.sourceforge.net/wiki/index.php/Celera_Assembler_Terminology
- . contigs.fa
- . scaffolds.fa
+==Short Read Assembler==
+    * de Bruijn Graph:
+        * Allpaths-LG
+        * Abyss (good guessing assembler…fast but not too accurate but must run for every kmer)
+        * SOAPdenovo
+        * Platanus
+        * SSAKE
+    * Overlap-Layout-Consensus
+        * SGA
+        * Celera
  
-#N50
- . N50: at least 50% of the bases are in scaffolds of this size or bigger
- . A contig N50 is calculated by first ordering every contig by length from longest to shortest.  Next, starting from the longest contig, lengths of each contig are summed, until this running sum equals one-half of the total length of all contigs in the assembly.  The contig N50 of the assembly is the length of the shortest contig in this list (Yandell and Ence 2012).
+==Software Specifics==
+    * Allpaths-LG is only for long read data
+    * Platanus was made for heterozygous genomes
+    * SOAP, Allpaths-LG will help you find proper kmer size
+    * Abyss has tools to assess quality (abyss-fac)
+    * Once you run the assembly, you can get rid of almost everything except the contigs and scaffolds (unitigs are the genome w/o any gap regions)
  
-#Assembly Quality Assessment
- . By annotation: CEGMA (Parra et al. 2007) looks for certain genes within the sequence to measure quality (http://korflab.ucdavis.edu/datasets/cegma/)
- . Using RNA transcripts: Baa.pl (Ryan 2013/4, Arxiv)
- . De novo likelihood-based measures (LAP; Ghodsi et al. 2013)
- . Feature Response Curve
- . Mate-pair orientations and separations
- . Repeat content by kmer analysis
- . Depth-of-coverage
- . Correlated polymorphism in the read alignments
- . Read alignment Breakpoint
+==N50==
+    * N50: at least 50% of the bases are in scaffolds of this size or bigger
+    * A contig N50 is calculated by first ordering every contig by length from longest to shortest.  Next, starting from the longest contig, lengths of each contig are summed, until this running sum equals one-half of the total length of all contigs in the assembly.  The contig N50 of the assembly is the length of the shortest contig in this list (Yandell and Ence 2012).
+    * Not always a good indicator; will not detect misassemblies
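The N50 procedure described in these notes (sort contigs longest to shortest, sum until half the total length is reached, report the last contig added) can be sketched as follows. This is a minimal illustration; the function name `n50` and the example lengths are assumptions, and tools such as abyss-fac may apply extra rules (for example a minimum contig length) that are not modelled here.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running sum first reaches at least half
    of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths; total is 290, so half is 145.
# 80 alone is below 145, 80 + 70 = 150 reaches it, so the N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

Because only lengths enter the calculation, joining contigs incorrectly raises the N50, which is why it cannot detect misassemblies.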
  
-#Gap Filling
- . GapCloser (Luo et al. 2012)
- . Most gap closing is done on first cycle (Run CEGMA > GapCloser > CEGMA again)
- . GapFiller (Nadalin et al. 2011)
+==Assembly Quality Assessment==
+    * How do you know which kmer size is better? Some people just use N50
+    * CEGMA (recommended) searches for genes that should be found in all eukaryotes in the final assembly.
+        * how many are complete, partial
+        * percent completeness
+        * orthology (how many times was a gene present more than once)
+        * Lots of orthology in partial genes because they’re broken between contigs
+    * Using RNA transcripts: Baa.pl (Ryan 2013/4, Arxiv)
+    * De novo likelihood-based measures (LAP; Ghodsi et al. 2013)
+        * How likely it is that reads map to your assembled genome
+    * Feature Response Curve
+        * Mate-pair orientations and separations
+        * Repeat content by kmer analysis
+        * Depth-of-coverage
+        * Correlated polymorphism in the read alignments
+        * Read alignment Breakpoint
  
-#Resolving Mis-Assemblies
- . REAPR (Hunt et al. 2013) better than NxRepair
- . Run CEGMA to figure out best assembly and then run REAPR
- . NxRepair (Murphy et al. 2014)
+==Gap Filling==
+    * GapCloser (Luo et al. 2012)
+        * Most gap closing is done on first cycle (Run CEGMA > GapCloser > CEGMA again)
+    * GapFiller (Nadalin et al. 2011)
+    * Tries to extend alignments (not done by assemblers)
+    * Can run multiple times, but first cycle is the most effective
+    * Complete genes (CEGMA) much better after gap closing
+    * Works better on some assemblies than others
  
-#Genome Merging
- . Metassembler (Wences and Schatz 2014, Arxiv)
- . newest assembler out
- . ranks assemblies by N50
- . Merges the different assemblies to form optimal assembly
- . GAM-NGS (Vicedomini et al. 2013)
+==Resolving Mis-Assemblies==
+    * REAPR (Hunt et al. 2013) better than NxRepair
+        * Run CEGMA to figure out best assembly and then run REAPR
+        * AKA "Grim Reapr" - screws up your scaffold N50
+    * NxRepair (Murphy et al. 2014)
+    * Breaks down scaffolds without enough evidence
+    * Run REAPR on the best assembly candidate
+    * Need at least one paired library
+    * Looks for areas with low coverage
+    * Compares expected coverage to actual coverage
+    * Really low coverage is a misassembly
+    * Will remove those areas if you let it, but those are often areas that have been gap filled
+    * Run CEGMA before and after, usually score doesn’t change (finding real misassemblies)
+    * ALWAYS:
+        * Run CEGMA
+        * Pick best candidate and run REAPR
+        * Run CEGMA again to check quality
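The coverage idea in the notes above (compare expected coverage to actual coverage and treat very low coverage as suspicious) can be illustrated with a short sketch. This is not REAPR's algorithm, which also uses read-pair information; the function name, the `min_fraction` threshold and the depth values are hypothetical.

```python
def low_coverage_regions(depth, expected, min_fraction=0.2):
    """Return half-open (start, end) intervals of suspiciously low coverage.

    A position is flagged when its read depth is below
    min_fraction * expected (an assumed, illustrative threshold).
    """
    regions = []
    start = None
    for pos, d in enumerate(depth):
        low = d < min_fraction * expected
        if low and start is None:
            start = pos                    # a low-coverage run begins
        elif not low and start is not None:
            regions.append((start, pos))   # the run ended before pos
            start = None
    if start is not None:                  # run extends to the scaffold end
        regions.append((start, len(depth)))
    return regions


# Hypothetical per-base depth along a scaffold, expected coverage 30x.
depth = [30, 28, 31, 2, 1, 0, 29, 30, 3]
print(low_coverage_regions(depth, expected=30))
```

As the notes warn, flagged regions are candidates only: gap-filled stretches often have low coverage too, so breaking at every flagged interval can discard correct sequence.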
  
-. Read clipping marks the quality of the read.  Soft masking (low quality gets lowercase character) and hard masking (low quality gets replaced by N character)
+==Genome Merging==
+    * Metassembler (Wences and Schatz 2014, Arxiv)
+        * newest assembler out
+        * ranks assemblies by N50
+        * choose primary assembly (best N50)
+        * compares two assemblies and builds consensus sequence
+    * GAM-NGS (Vicedomini et al. 2013)
  
-. Number of scaffolds should be close to number chromosomes (ideally but rare)
-. Contig N50 close to or bigger 2x avg(gene size)
+==Overall==
+    * Allpaths/abyss/soap -> gap closer -> REAPR -> gap closer -> final
+    * Read clipping marks the quality of the read.  Soft masking (low quality gets lowercase character) and hard masking (low quality gets replaced by N character)
+    * Assemblies are never finished - still over 100 gaps in human genome
+    * What is a good genome?
+        * Contig N50 close to or bigger than average gene size
+        * Number of scaffolds close to number of chromosomes
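The difference between soft and hard masking mentioned in these notes can be shown on a single read. This is a minimal sketch; the function names, the `min_qual` cutoff of 20 and the example qualities are assumptions chosen for illustration, not settings from any particular trimming tool.

```python
def soft_mask(seq, quals, min_qual=20):
    """Soft masking: low-quality bases become lowercase, others uppercase."""
    return "".join(
        base.lower() if q < min_qual else base.upper()
        for base, q in zip(seq, quals)
    )


def hard_mask(seq, quals, min_qual=20):
    """Hard masking: low-quality bases are replaced by N."""
    return "".join(
        "N" if q < min_qual else base.upper()
        for base, q in zip(seq, quals)
    )


# Hypothetical read with per-base Phred qualities.
seq = "ACGTAC"
quals = [30, 30, 10, 5, 30, 12]
print(soft_mask(seq, quals))
print(hard_mask(seq, quals))
```

Soft masking keeps the base call, so downstream tools can still choose to use it; hard masking discards it.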
  
-#Downstream Processing
- . Repeat annotation
- . Gene annotation
- . Mapping to get Diploid Genome
+==Downstream Processing==
+    * Repeat annotation
+    * Gene annotation
+    * Mapping to get Diploid Genome (most assemblers won't call polymorphisms)
  
-. Efasta [AT{G,C}A] 
lecture_notes/04-10-2015.1428687861.txt.gz · Last modified: 2015/04/10 10:44 by jolespin