====== Lecture Notes for April 19, 2010 ======

===== Presentation =====
** {{:lecture_notes:20100419_banana_slug_seminar.ppt|Assembling and comparing genomes or transcriptomes using de Bruijn graphs}} ** \\
Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler

==== Overview ====
  * De Bruijn graphs
  * Repeat resolution
  * Practical considerations
  * Extensions to Velvet

==== De Bruijn graphs ====

=== Overlap-Layout-Consensus (traditional) ===
  * Traditional assemblers: Phrap, Arachne, Celera, etc.
  * Short reads: Edena
  * Generally more expensive computationally.
    * Only clean left-to-right and right-to-left overlaps are allowed.
    * Transitivity
      * Need to remove the redundant connections.
      * Well-studied graph theory problem
  * Use whole-read alignments

=== Quick Overlap-Layout-Consensus (OLC) example ===
  * Filter out all the reads that are identical or self-contained in another read. Only left-to-right, etc. are allowed.
  * Compute all their overlaps.
  * Get many transitive redundant connections.
    * Many algorithms are available for transitive reduction (a brute-force sketch follows this list).
  * You obtain simplified paths through the reads.
    * Get 4 big chunks of reads from the example.
    * Hard to distinguish which reads belong to which copy of the repeat.
    * This is how contigs were first built.
  * Have to compare all reads against each other.
    * Quadratic. This has become difficult.
    * Correction: quadratic behavior can be reduced to about O(n log n) by doing a clustering before aligning: just like in several fast search techniques, you only attempt to align sequences that share seed hits to relatively short k-mers (much shorter than those needed for de Bruijn graphs, since they only need to reduce the number of hits, not get unique locations in the genome). --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/24 07:51//
  * Next generation sequencing reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
  * At least the OLC method uses all the information in the read.
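
A brute-force sketch of the overlap and transitive-reduction steps above. This is an illustration of the idea only, not any particular assembler's implementation: the read sequences and the minimum-overlap cutoff are invented, only exact suffix-prefix overlaps are found, and the all-vs-all loop is exactly the quadratic comparison the notes mention.

<code python>
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of read a that is a prefix of read b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """All-vs-all comparison of the reads: quadratic in the number of reads."""
    return {(a, b): n for a, b in permutations(reads, 2)
            if (n := overlap(a, b, min_len))}

def transitive_reduction(edges):
    """Drop the redundant edge a->c whenever a->b->c also connects them."""
    reduced = dict(edges)
    for (a, c) in edges:
        if any(x == a and b != c and (b, c) in edges for (x, b) in edges):
            reduced.pop((a, c), None)
    return reduced

reads = ["ATGGCGTA", "GCGTACGG", "CGTACGGA"]
edges = overlap_graph(reads)                # three edges, one redundant
print(sorted(transitive_reduction(edges)))  # the a->c shortcut is removed
</code>

Karplus's O(n log n) point above corresponds to replacing the ''permutations(reads, 2)'' loop with only those read pairs that share a short seed k-mer.
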
=== Comparison of different assembly techniques ===

== Greedy hash table searches (~2007) ==
  * Some examples: SSAKE, SHARCGS, VCAKE
  * Much faster & leaner than the previous approaches
  * Less robust to variation and noise
  * Essentially simplified de Bruijn graph assemblers
  * Build a dictionary of words in the data set (see the sketch after this list).
    * His example uses 4-mers, but typically 20-mers and longer.
  * For each word, create a node in the hash table.
    * Big dataset, so typically just a binary marker to say whether a word is present.
  * Sometimes a branch has two possible extensions.
  * The greedy approach is essentially a de Bruijn graph with local info only. In this structure you could have many paths through.
  * Can throw in reads of different lengths with no problems.
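
A toy version of the greedy dictionary-and-extend idea, assuming exact k-mers and a plain presence set. The 4-mer size mirrors the talk's example; real tools use 20-mers and up, plus quality and reverse-complement handling, none of which is shown here.

<code python>
K = 4  # toy value from the talk's example; SSAKE-style tools use ~20+

def kmer_table(reads, k=K):
    """Binary-presence dictionary of every k-mer seen in the reads."""
    return {read[i:i + k] for read in reads
            for i in range(len(read) - k + 1)}

def greedy_extend(seed, table, k=K, max_len=10_000):
    """Extend rightwards while exactly one k-mer continues the contig.
    Stops at a dead end or a branch (two possible extensions); the
    max_len guard stands in for the cycle handling real tools need."""
    contig = seed
    while len(contig) < max_len:
        nexts = [b for b in "ACGT" if contig[-(k - 1):] + b in table]
        if len(nexts) != 1:
            break
        contig += nexts[0]
    return contig

reads = ["ATGGC", "TGGCA", "GGCAT"]
print(greedy_extend("ATGG", kmer_table(reads)))  # -> ATGGCAT
</code>
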
== De Bruijn graph assemblers (~2009) ==
  * Currently the most common for NGS: Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo
  * Pavel Pevzner popularized them in a paper from 2001.
  * Why do we use them?
    * High-throughput, cheap, short reads with errors
  * Basically a compromise between the two previous approaches:
    * Just a k-mer hash...
    * ...but processed globally
  * Graph construction and correction is similar in all assemblers.
  * ABySS handles parallel implementation.

=== Working with de Bruijn graphs in Velvet ===

== Fixing errors ==
  * Errors in the reads create spurs/tips in the graph that can be easily removed (a toy clipping routine is sketched after this list).
  * When finding bubbles, remap the branch with the lower coverage back onto the branch with the higher coverage (so as not to lose the information).
  * Loops can represent tandem repeats.
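
To make the tip/spur removal concrete, here is a toy clipping routine over successor/predecessor adjacency sets. It is only the shape of the idea: Velvet's real criterion also checks tip length against 2k and compares coverage before deleting anything, which is omitted here, and the graph representation is invented for the example.

<code python>
def find_tip(start, succ, pred, max_len=3):
    """If start is a dead end hanging off a branch point within
    max_len nodes, return the chain of nodes to clip, else []."""
    if succ.get(start):
        return []                      # has successors: not a dead end
    path, v = [start], start
    while len(path) <= max_len:
        parents = pred.get(v, set())
        if len(parents) != 1:
            return []                  # a source or a convergence: keep
        p = next(iter(parents))
        if len(succ[p]) > 1:
            return path                # p is a branch point: clip the path
        path.append(p)
        v = p
    return []                          # too long to look like a mere error

def clip_tips(succ, pred, max_len=3):
    """Repeatedly remove short dead-end chains (tips/spurs) in place."""
    clipped = True
    while clipped:
        clipped = False
        for node in list(set(succ) | set(pred)):
            for n in find_tip(node, succ, pred, max_len):
                for p in pred.pop(n, set()):
                    succ[p].discard(n)
                succ.pop(n, None)
                clipped = True

# main path A -> B -> C -> D with an erroneous spur B -> X:
succ = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}}
pred = {"B": {"A"}, "C": {"B"}, "X": {"B"}, "D": {"C"}}
clip_tips(succ, pred)
print(succ)  # {'A': {'B'}, 'B': {'C'}, 'C': {'D'}}
</code>
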
== Creating de Bruijn graphs ==
  * First, concatenate the simple graphs without branches into nodes which compact the non-branching runs into a single item (a minimal construction is sketched after this list).
  * Biochemistry of short reads means errors tend to be near the ends.
  * If you have errors in the second half of the reads, it creates millions of tips, but they are at least easy to remove later.
  * Each read is a path through the graph. The head of the read is still kept, so we're not losing the read placement information.
  * We are still left with bubbles.
    * Possibly an accumulation of overlapping spurs/tips. More often real variation (e.g. SNPs).
    * Would like to keep that information where possible.
    * Could also be a long read. But we don't want to cut the long read in the middle.
  * Now we have loops that go back on themselves (not bubbles).
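
A minimal construction of the graph and the concatenation step above, under simplifying assumptions (error-free reads, a single strand, no reverse complements, a toy k): nodes are (k-1)-mers, and maximal non-branching runs are spelled out as single sequences.

<code python>
from collections import defaultdict

def build_graph(reads, k=4):
    """Nodes are (k-1)-mers; every k-mer in a read contributes an edge."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            a, b = read[i:i + k - 1], read[i + 1:i + k]
            succ[a].add(b)
            pred[b].add(a)
    return succ, pred

def compact(succ, pred):
    """Concatenate maximal non-branching runs into single sequences.
    (Isolated cycles are ignored in this toy version.)"""
    nodes = set(succ) | set(pred)
    def interior(v):   # one way in, one way out: interior of a run
        return len(pred[v]) == 1 and len(succ[v]) == 1
    contigs = []
    for v in nodes:
        if interior(v):
            continue
        for w in succ[v]:              # start a run at each out-edge
            path = [v, w]
            while interior(path[-1]):
                path.append(next(iter(succ[path[-1]])))
            # spell the run: first node plus the last base of each step
            contigs.append(path[0] + "".join(p[-1] for p in path[1:]))
    return contigs

succ, pred = build_graph(["ATGGCGA", "GGCGATT"])  # two overlapping reads
print(compact(succ, pred))  # ['ATGGCGATT']: one compacted, merged run
</code>
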
=== So what's the difference? ===
  * Algebraic difference:
    * Reads in the OLC methods are atomic (i.e. an arc)
    * Reads in the DB graph are sequential paths through the graph
  * …this leads to practical differences
    * DB graphs allow for a greater variety of overlaps
    * Overlaps in the OLC approach require a global alignment, not just a shared k-mer
    * OLC methods get error correction from long reads, while DB graphs require more work to correct.

=== Simulations: Tour Bus - errors ===
  * Compare a 5Mb chunk from 4 species: //E. coli//, yeast, //C. elegans//, human
  * Increased coverage from 0 to 50x
  * N50 on the y-axis (a quick reference implementation of N50 follows this list)
  * First there is exponential growth as expected, but then the complexity causes trouble unless you have other kinds of information.
  * Not using paired-read information
  * You see that the human is a little more complex than //E. coli//.
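
Since N50 is the y-axis here and appears again in the chromosome X numbers below, here is a quick reference implementation of the definition (the contig length L such that contigs of length at least L cover at least half of the total assembly); the toy lengths are invented.

<code python>
def n50(contig_lengths):
    """Smallest contig length L such that contigs >= L bp long
    account for at least half of the total assembled length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# toy assembly: total 280bp, so N50 is reached once 140bp is covered
print(n50([100, 80, 50, 30, 20]))  # -> 80
</code>
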
==== Repeat resolution ====

=== Scaling up the hard way: chromosome X ===
  * 548 million Solexa reads were generated from a flow-sorted human X chromosome.
    * Fit in 70GB of RAM
    * Numerous short contigs
      * Many contigs: 898,401 contigs
      * Short contigs: 260bp N50 (however, a max of 6,956bp)
    * Overall length: 130Mb
    * How to handle repeats?
  * Moral: there are engineering issues to be resolved but the complexity of the graph needs to be handled accordingly.

== Handling complexity of the graph ==
  * Reduced representation (Margulies et al.)
    * Reduced representation using restriction enzymes and a gel to cut the problem into tractable bits.
    * Cut chunks from the gel and then sequence each gel slice separately, or at least in different lanes.
    * Then the assembly problem is smaller and more localized instead of whole-genome.
  * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleasance et al.)
    * Map to a reference or similar genome, then re-map locally some stuff that is unique.
    * MAQ (Mapping and Assembly with Quality) and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches.
    * Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangements.
  * Code parallelization (Birol et al., Jeffrey Cook)
    * Only ABySS has published yet
  * Improved indexing (Zamin Iqbal and Mario Caccamo)
    * Fit whole-genome shotgun onto 200GB
  * Use of intermediate remapping (Matthias Haimel)
    * Wanted to parallelize the second-level work of building scaffolds, but those are mostly engineering issues.

=== Repeats in a de Bruijn graph ===
  * The de Bruijn graph for a small organism looks like long-ish contigs, but still makes a hair-ball.

=== Rock Band: Using long and short reads together ===
  * If all of the paths along the short reads move from one long read (e.g. 100bp-1kb) to another long read, then you know the sequence from the short reads.
  * Long reads are mapped seamlessly into the graph.
  * 35bp reads at 50x, plus some long reads (100bp-1kb) at 0-20x
  * Played with the length
  * The longer your long reads are, the more bridgeable the graph is and the longer the contigs.
    * Someone might ask: "Why not just assemble the long reads?"

=== Different approaches to repeat resolution ===

== Theoretical: spectral graph analysis ==
  * Equivalent to a Principal Component Analysis or a heat conduction model
    * Find the vector that threads its way through the graph
  * Relies on a (massive) matrix diagonalization
  * Comprehensive: all data integrated at once
    * Takes all the data at once in one huge matrix, so it is not too affected by noise. No one has actually tried this approach, as it requires too much computation.

== Traditional scaffolding ==
  * E.g. Arachne, Celera, BAMBUS
  * Heuristic approach similar to that used in traditional overlap-layout-consensus contigs
    * The traditional approach is contigs with paired-end reads.
    * Find chains connected by contig reads.
  * Build a big graph of pairwise connections, simplify, extract obvious linear components
    * Your coverage is huge.
    * Each base pair has to handle 1000x coverage of read-pairs. Long reads are separated by a mess in the middle. Check possible paths; find which are acceptable.
    * Find all the reads that start and end on the two long reads, then consider them all together: much more efficient.
    * 5% or more of mate-pairs can be wrong
    * Hopefully will be able to filter them out.
    * In Velvet, use a likelihood approach (a toy version is sketched after this list)
    * Have a distance estimate
    * Coverage of A and B: how many pairs would we expect to see? If we see too few, we discard the connection.
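
The expected-count test in the last bullets can be sketched as follows. Everything here is illustrative: the parameter names, the rough expectation formula, and the 10% threshold are invented stand-ins, and Velvet's actual likelihood test is more careful than this.

<code python>
def expected_pairs(cov_a, cov_b, spannable_bp, pairs_per_base=0.01):
    """Rough expectation: spanning pairs scale with the number of
    positions an insert can straddle and with the contigs' coverage
    (geometric mean). All constants are invented for the example."""
    return pairs_per_base * spannable_bp * (cov_a * cov_b) ** 0.5

def keep_connection(observed, cov_a, cov_b, spannable_bp, min_frac=0.1):
    """Discard the A-B connection if far fewer pairs than expected."""
    return observed >= min_frac * expected_pairs(cov_a, cov_b, spannable_bp)

# two 30x contigs with ~200bp of spannable positions: expect ~60 pairs
print(keep_connection(25, cov_a=30, cov_b=30, spannable_bp=200))  # True
print(keep_connection(2,  cov_a=30, cov_b=30, spannable_bp=200))  # False
</code>
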
== In NGS assemblers ==
  * Euler: for each pair of reads, find all possible paths from one read to another
  * ABySS: same as above, but the read-pairs are bundled into node-to-node connections to reduce calculations
  * ALLPATHS: same as above, but the search is limited to localized clouds around pre-computed scaffolds
    * ALLPATHS did better using small localized problems of a single small scaffold for parallelization.

== Using the differences between insert lengths ==
  * The Shorty algorithm uses the variance between read pairs anchored on a common contig or k-mer.
    * The projected image of the other ends creates the possibility of assembling a smaller set, the projection from a node.

=== Pebble: Handling paired-end reads ===
  * Pick one unique node, follow their paths
    * Velvet used the reverse idea, the projections to a node
  * Some of them are unique, pick out those nodes
    * Hit the nodes in the vicinity of the starting contig, to get distance information on the immediate vicinity.
    * Uses a greedy algorithm. Find the way to the next unique node.
  * Having a more precise insert size could help
  * Velvet assumed a 10% spread around the insert size
    * Solexa might have been different
  * SOLiD up to 2k, Solexa just 300
  * It is difficult to have enough DNA of a narrow range.
  * Don't love barcoding because it uses a lot of reagents.

==== Practical considerations ====

=== Colorspace ===
  * Di-base encoding has a 4-letter alphabet, but very different behavior to sequence space
    * Double-encoded files use ACGT.
    * Be careful not to confuse it!
    * Different rules for complementarity
      * Just reverse, don't complement them.
  * Direct conversion to sequence space is simple but wasteful
    * One error messes up all the remaining bases (demonstrated in the sketch after this list). This is a huge loss, so do everything in colorspace and make a complete assembly, then finally convert to base calls later.
  * Conversion must therefore be done at the very end of the process, when the reads are aligned
    * You can then use the transition rules to detect errors. Typically the error is detected after 1 or 2 bp, then corrected.
    * SOLiD has developed tools
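
The error-propagation point can be shown directly with the standard SOLiD di-base code, where each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3); the sequences here are invented.

<code python>
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode(seq):
    """Each color is the XOR of the codes of two adjacent bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def decode(primer, colors):
    """Walk forward from the known primer base. Because every base
    depends on the previous one, a single wrong color corrupts all
    of the bases decoded after it."""
    seq = [primer]
    for c in colors:
        seq.append(BASE[CODE[seq[-1]] ^ c])
    return "".join(seq)

colors = encode("TACGGATT")
print(decode("T", colors))       # TACGGATT: clean round trip
bad = list(colors)
bad[2] ^= 1                      # one miscalled color...
print(decode("T", bad))          # ...and every base after it is wrong
</code>

The complementarity rule from the list also falls out of this encoding: complementary bases have codes differing by 3, and the 3s cancel in the XOR, so the colors of a reverse complement are just the reversed colors. That is why colorspace reads are reversed, not complemented.
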
=== Different error models ===
  * When using different technologies, you have to take into account their different error models
    * Trivial when doing OLC assembly
    * Much more tricky when doing de Bruijn graph assembly, since k-mers are not assigned to reads
    * Different assemblers had different settings (e.g. Newbler)
  * 454 had homopolymer issues in flow-space.
  * Illumina or ABI are different.
    * Makes it hard to handle errors.
    * Especially in de Bruijn graphs. Applying the correction rules becomes much more complex.
  * Newbler may be better for 454.

=== Pre-filtering the reads ===
  * Some assemblers have in-built filtering of the reads (e.g. Euler), but this is not a generality
  * Efficient filtering of low-quality bases can cut down on the computational cost (memory and time)
    * User-specific, data-space-specific solutions
    * No tried-and-true consistent approach
  * Beware that some assemblers require reads of identical length

==== Extensions to Velvet ====

** Oases: //de novo// transcriptome assembly ** \\
How to study mRNA reads which do not map

=== De Bruijn graphs ~ splice graphs ===
  * People are using next generation sequencing generally because it's cheaper.
  * Would have just one path except for alternative splicing.
  * Pavel Pevzner mentioned in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply the algorithms developed for splice graphs to de Bruijn graphs.

=== Oases: Process ===
  * Create clusters of contigs:
    * Connecting reads
    * Connecting read pairs
    * The genetic locus should create a connected cluster of contigs.
    * Sometimes one cluster gets broken into two, but it works pretty well.
  * Re-implement traditional algorithms to run on graphs, instead of a reference genome
    * Greedy transitive reduction (Myers, 2005)
    * Motif searches
    * Dynamic assembly of transcripts (Lee, 2003)
      * Find the path of highest weight (a small dynamic-programming sketch follows this list)
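
The highest-weight-path step can be illustrated with a small dynamic program over a DAG. The toy graph below (nodes as exons/contigs, edge weights as read support) is invented for the example and is not Oases's actual data structure.

<code python>
from functools import lru_cache

# Toy splice-graph-like DAG: edge weights stand in for read support.
EDGES = {
    "e1": [("e2", 5), ("e3", 2)],
    "e2": [("e4", 4)],
    "e3": [("e4", 1)],
    "e4": [],
}

@lru_cache(maxsize=None)
def best_from(node):
    """Highest total edge weight of any path starting at this node,
    returned together with the path itself (DAG assumed: no cycles)."""
    options = [(w + best_from(nxt)[0], [node] + best_from(nxt)[1])
               for nxt, w in EDGES[node]]
    return max(options) if options else (0, [node])

weight, path = best_from("e1")
print(weight, path)  # 9 ['e1', 'e2', 'e4']: the best-supported transcript
</code>
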
=== Oases: Output ===
  * Full-length transcripts - handles alternative splicing.
    * With a skipped exon and one artifact.
  * Transcripts from different genes - handles varying coverage levels.
  * Alternative splicing events: when possible, distinguishes between common AS events.
    * Find donor and acceptor sites on or around junctions.

=== Oases: Results ===
  * Benchmarked within the RGASP competition
  * Already 100+ users, some reporting very promising results:
    * N50 = 1886 bp (Monica Britton, UC Davis)
    * 99.8% of aligned transcripts against UniGene (Micha Bayer, Scottish Crop Research Institute)

=== Oases: Results of Drosophila PE-reads ===
  * The number of transcripts and loci are in the ballpark of what they were expecting
  * Read length 36, k=21, insert size 200
  * ~90% map to the genome

=== Oases: Examples ===
  * Found an unannotated gene with 7 exons
  * The assembler has no notion of the reference genome, but still found one example that mapped very well.

=== Mapping vs. //de novo// ===
  * Mapping
    * Many good, mature tools available
    * Easy to compute
    * Simpler to analyze, comes with genomic coords
  * //De novo//
    * Harder than mapping
    * May require special hardware
    * No reference needed
    * Handles novel structures without //a priori// knowledge
  * How to get the best of both worlds?

=== Columbus: Velvet module ===
  * Solution: the reference-based de Bruijn graph
    * Just throw in the reference genome as long reads.
    * Chucking in long reads and getting short contigs.
    * Preserves the structure of the reference sequences. Not allowed to overlap with self.
      * What they really wanted was to start from a reference genome, then handle it as a little insertion.
    * Integrates alignment information (SAM, BAM files from BWA, Bowtie)
    * Edits, breaks up, reconnects, and extends the "alignment contigs" as necessary
  * Applications
    * Comparative transcriptome assembly
    * Cancer RNAseq
    * Comparative assembly
  
=== De Bruijn graph convention for next generation sequencing ===
  * For transcriptomes, they use k=21.
  * Can use reads as short as 25bp.
  * How much coverage is needed?
  * How many genes do you want to catch?
  * With a genome, you generally want long inserts.
  * With a transcriptome, short inserts are more useful. They have been using 200bp tiny stuff.
  
===== After presentation =====
  
Kevin on Wednesday will be talking about:
  * Using SOLiD data for ordering and orientation of //Pog// contigs
  * Special-purpose scripts
  * The data was so noisy