====== Lecture Notes for April 19, 2010 ======

===== Presentation =====
** {{:lecture_notes:20100419_banana_slug_seminar.ppt|Assembling and comparing genomes or transcriptomes using de Bruijn graphs}} ** \\
Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler

==== Overview ====
  * De Bruijn graphs
  * Practical considerations
  * Extensions to Velvet

==== De Bruijn graphs ====
=== Overlap-Layout-Consensus (traditional) ===
  * Traditional assemblers: Phrap, Arachne, Celera, etc.
  * Short reads: Edena
  * Generally more expensive computationally.
  * Overlaps only allow clean right-to-left and left-to-right overlaps.
  * Transitivity
    * Need to remove the redundant connections.
    * A well-studied graph theory problem.
  * Uses whole-read alignments.

=== Quick Overlap-Layout-Consensus (OLC) example ===
  * Filter out all the reads that are identical to, or contained in, another read. Only clean left-to-right overlaps, etc. are allowed.
  * Compute all their overlaps (a minimal sketch follows this list).
  * This yields many transitive, redundant connections.
  * Many algorithms are available for transitive reduction.
  * You obtain simplified paths through the reads.
  * The example yields 4 big chunks of reads.
  * It is hard to distinguish which reads belong to which copy of a repeat.
  * This is how contigs were first built.
  * All reads have to be compared against each other, which is quadratic in the number of reads; this has become impractical at NGS scale.
    * Correction: quadratic behavior can be reduced to about O(n log n) by doing a clustering before aligning: just like in several fast search techniques, you only attempt to align sequences that share seed hits to relatively short k-mers (much shorter than those needed for de Bruijn graphs, since they only need to reduce the number of hits, not get unique locations in the genome). --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/24 07:51//
  * Next generation sequencing reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
  * At least the OLC method uses all the information in each read.
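
A minimal sketch of the overlap step on toy reads (pure Python, illustrative only; real assemblers use banded alignments and indexing rather than this brute-force scan):

<code python>
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of read a matching a prefix of read b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

reads = ["ATGGCGT", "GCGTGCA", "GTGCAAT"]
# The all-vs-all comparison below is the quadratic step discussed above.
overlaps = {(i, j): suffix_prefix_overlap(a, b)
            for i, a in enumerate(reads)
            for j, b in enumerate(reads) if i != j}
print({k: v for k, v in overlaps.items() if v})  # {(0, 1): 4, (1, 2): 5}
</code>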
| + | |||
| + | === Comparison of different assembly techniques === | ||
| + | |||
| + | == Greedy hash table searches (~2007) == | ||
| + | * Some examples: SSAKE, SHARCGS, VCAKE | ||
| + | * Much faster & leaner than the previous | ||
| + | * Less robust to variation and noise | ||
| + | * Essentially simplified de Bruijn graph assemblers | ||
| + | * Build a dictionary of words in the data set. | ||
| + | * His example is 4mer, but typically 20mer and longer. | ||
| + | * For each, create a node in the hash table. | ||
| + | * Big dataset. Typically a binary marker to say if present. | ||
| + | * Sometimes at branch have two possible extensions. | ||
| + | * Greedy approach essentially de Bruijn graph with local info only. In this structure you could have many paths through. | ||
| + | * Can throw in reads of different lengths with no problems. | ||
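
A toy sketch of greedy extension from a k-mer dictionary (the 4-mer size and the stop-at-ties rule are illustrative choices of mine, not any specific assembler's policy):

<code python>
from collections import Counter

def greedy_extend(seed, kmer_counts, k):
    """Extend a seed rightwards, always taking the most frequent k-mer;
    stop at a tie or a dead end (local information only, as noted above)."""
    contig = seed
    while True:
        prefix = contig[-(k - 1):]
        choices = [(kmer_counts[prefix + b], b) for b in "ACGT"
                   if prefix + b in kmer_counts]
        if not choices:
            return contig  # dead end
        choices.sort(reverse=True)
        if len(choices) > 1 and choices[0][0] == choices[1][0]:
            return contig  # ambiguous branch: greedy methods give up here
        contig += choices[0][1]

reads = ["ATGGC", "TGGCA", "GGCAT"]
k = 4
counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
print(greedy_extend("ATGG", counts, k))  # ATGGCAT
</code>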
| + | |||
| + | == De Bruijn graph assemblers (~2009) == | ||
| + | * Currently the most common for NGS: Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo | ||
| + | * Pavel Pevzner popularized in paper from 2001 | ||
| + | * Why do we use them? | ||
| + | * High through-put, cheap, short reads with errors | ||
| + | * Basically a compromise between two previous approaches: | ||
| + | * Just a k-mer hash | ||
| + | * ... but processed globally | ||
| + | * Graph construction and correction is similar in all assemblers. | ||
| + | * ABySS handles parallel implementation. | ||
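
A minimal sketch of de Bruijn graph construction, with (k-1)-mers as nodes and k-mers as edges (toy k for readability; real assemblies use k around 21-31, as mentioned later in the talk):

<code python>
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read contributes one edge.
    Each read then corresponds to a walk through the graph."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

for node, succs in sorted(build_de_bruijn(["ATGGCGT", "GGCGTAC"], 4).items()):
    print(node, "->", ", ".join(sorted(succs)))
# ATG -> TGG, CGT -> GTA, GCG -> CGT, GGC -> GCG, GTA -> TAC, TGG -> GGC
</code>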
| + | |||
| + | === Working with de Bruijn graphs in Velvet === | ||
| + | |||
| + | == Fixing errors == | ||
| + | * Errors that are created in assembly create spurs/tips that can be easily removed. | ||
| + | * When finding bubbles, remap the branch with the lower coverage back onto the branch with the higher coverage (as not to lose the information). | ||
| + | * Loops can represent tandem repeats. | ||
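
A simplified sketch of tip clipping on the node graph. Velvet's real criterion also considers tip length relative to k and coverage; this version only checks that the dead end hangs off a branch point:

<code python>
def clip_tips(graph, max_len=3):
    """Remove short dead-end chains hanging off a branch point (tips/spurs).
    graph: node -> set of successor nodes; modified in place."""
    preds = {}
    for u, vs in graph.items():
        for v in vs:
            preds.setdefault(v, set()).add(u)
    for _ in range(max_len):
        for u in [u for u, vs in graph.items() if not vs]:
            # Clip only if a predecessor has another way out (a branch),
            # so that genuine contig ends are left alone.
            if any(len(graph[p]) > 1 for p in preds.get(u, set())):
                for p in preds.get(u, set()):
                    graph[p].discard(u)
                del graph[u]
    return graph

g = {"A": {"B", "X"}, "B": {"C"}, "X": set(), "C": set()}
print(clip_tips(g))  # the spur X is gone; the real end C survives
</code>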
| + | |||
| + | == Creating de Bruijn graphs == | ||
| + | * First, concatenate the simple graphs without branches into nodes which compact the non-branching runs into a single item. | ||
| + | * Biochemistry of short reads means errors tend to be near the ends. | ||
| + | * If you have errors in the reads in the second half, it creates millions of tips, but they are at least easy to remove later. | ||
| + | * Each read is a path through the graph. The head of the read is still kept, so we're not losing the read placement information. | ||
| + | * We are still left with bubbles. | ||
| + | * Possibly accumulation of overlapping spurs/tips. More often variation (e.g. SNPs or variation). | ||
| + | * Would like to keep that information where possible. | ||
| + | * Could also be a long read. But we don't want to cut the long read in the middle. | ||
| + | * Now we have loops that goes back on itself (not a bubble). | ||
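
A sketch of the compaction step: collapsing maximal non-branching runs of (k-1)-mer nodes into unitig strings (ignores isolated cycles for brevity; not Velvet's actual data structure):

<code python>
def unitigs(graph):
    """Collapse maximal non-branching paths into unitig strings.
    graph: (k-1)-mer node -> set of successor nodes."""
    indeg = {}
    for vs in graph.values():
        for v in vs:
            indeg[v] = indeg.get(v, 0) + 1
    out = []
    for u in graph:
        if indeg.get(u, 0) == 1 and len(graph[u]) == 1:
            continue  # internal node of some run; not a unitig start
        for v in graph[u]:
            path = [u, v]
            while indeg.get(v, 0) == 1 and len(graph.get(v, ())) == 1:
                v = next(iter(graph[v]))
                path.append(v)
            # successive nodes overlap by k-2, so each adds one character
            out.append(path[0] + "".join(p[-1] for p in path[1:]))
    return out

g = {"ATG": {"TGG"}, "TGG": {"GGC"}, "GGC": {"GCG"}, "GCG": {"CGT"}}
print(unitigs(g))  # ['ATGGCGT']
</code>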
| + | |||
| + | === So what's the difference? === | ||
| * Algebraic difference: | * Algebraic difference: | ||
| * Reads in the OLC methods are atomic (i.e. an arc) | * Reads in the OLC methods are atomic (i.e. an arc) | ||
| * Reads in the DB graph are sequential paths through the graph | * Reads in the DB graph are sequential paths through the graph | ||
| * …this leads to practical differences | * …this leads to practical differences | ||
| - | * DB graphs allow for greater variety in overlap ** INCOMPLETE ** | + | * DB graphs allow for greater variety of overlaps |
| - | * ** INCOMPLETE ** | + | * Overlaps in the OLC approach require a global alignment, not just a shared k-mer |
| + | * OLC methods gets error correction from long reads, while DB graphs requires more work to correct. | ||
| + | |||
| + | === Simulations: Tour Bus - error === | ||
| + | * Compare 5mb chunk from 4 species: //E. coli//, yeast, //C. elegans//, human | ||
| + | * Increased coverage from 0 to 50x | ||
| + | * N50 on y-axis | ||
| + | * First there is exponential growth as expected, but then the complexity causes trouble unless you have other kinds of information. | ||
| + | * Not using paired-reads information | ||
| + | * You see that the human is a little more complex than //E. coli//. | ||
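
For reference, a small sketch of the N50 metric used on the y-axis of these plots:

<code python>
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L contain at least
    half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

print(n50([100, 80, 50, 30, 20]))  # 80: the 100bp + 80bp contigs cover half
</code>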

==== Repeat resolution ====

=== Scaling up the hard way: chromosome X ===
  * 548 million Solexa reads were generated from a flow-sorted human X chromosome.
  * Fit in 70GB of RAM
  * Numerous short contigs:
    * Many contigs: 898,401 contigs
    * Short contigs: 260bp N50 (however, a maximum of 6,956bp)
  * Overall length: 130Mb
  * How to handle repeats?
  * Moral: there are engineering issues to be resolved, but the complexity of the graph needs to be handled accordingly.

== Handling complexity of the graph ==
  * Reduced representation (Margulies et al.)
    * Reduced representation uses restriction enzymes and a gel to cut the problem into tractable bits.
    * Chunks are cut from the gel and then sequenced separately, or at least in different lanes.
    * The assembly problem is then smaller and more localized, instead of whole-genome.
  * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleasance et al.)
    * Map to a reference or similar genome, then re-map locally the parts that are unique.
    * MAQ (Mapping and Assembly with Quality) and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches.
    * You can see reads failing to map in a small area with too many SNPs or breakpoint rearrangements.
  * Code parallelization (Birol et al., Jeffrey Cook)
    * Only ABySS has published yet.
  * Improved indexing (Zamin Iqbal and Mario Caccamo)
    * Fits whole-genome shotgun into 200GB.
  * Use of intermediate remapping (Matthias Haimei)
    * They wanted to parallelize the second-level work of building scaffolds, but those are mostly engineering issues.
| + | |||
| + | === Repeats in a de Bruijn graph === | ||
| + | * The de Bruijn graph for a small organism looks like long-ish contigs, but still makes a hair-ball. | ||
| + | |||
| + | === Rock band: Using long and short reads together === | ||
| * If all of the paths along the short reads move from one long read (e.g. 100-1kb) to another long read, then you know the sequence from the short reads | * If all of the paths along the short reads move from one long read (e.g. 100-1kb) to another long read, then you know the sequence from the short reads | ||
| - | Different approaches to repeat resolution | + | * Long reads which are mapped seamlessly into the graph. |
| - | * Theoretical: spectral graph analysis | + | * 35bp reads 50x, some long reads (100-1kb) 0-20x |
| - | * Equivalent to a Principle Component Analysis or a heat conduction model | + | * Played with the length |
| - | * Relies on a (massive) matrix diagonialization | + | * The longer your long reads are, the more bridgeable it is and the longer the contigs. |
| - | * Comprehensive: all data integrated at once | + | * Someone might ask: "Why not just assemble the long reads?" |

=== Different approaches to repeat resolution ===

== Theoretical: spectral graph analysis ==
  * Equivalent to a Principal Component Analysis or a heat conduction model
  * Find the vector that threads its way through the graph.
  * Relies on a (massive) matrix diagonalization
  * Comprehensive: all data integrated at once
  * Takes all the data at once in one huge matrix, so it is not too affected by noise. No one has actually tried this approach, as it requires too much computation.

== Traditional scaffolding ==
  * E.g. Arachne, Celera, BAMBUS
  * Heuristic approach similar to that used in traditional overlap-layout-consensus contigs
  * The traditional approach is contigs with paired-end reads.
    * Find chains of contigs connected by read pairs.
    * Build a big graph of pairwise connections, simplify, extract obvious linear components.
  * Your coverage is huge.
    * Each base pair can be covered by on the order of 1000 read pairs. Long reads are separated by a mess in the middle; check the possible paths and find which are acceptable.
    * Find all the reads that start and end on the two long reads, then consider them all together; much more efficient.
  * 5% or more of mate-pairs can be wrong.
    * Hopefully these can be filtered out.
  * Velvet uses a likelihood approach (a sketch follows this list).
    * It has a distance estimate.
    * Given the coverage of A and B, how many pairs would we expect to see? If we see too few, discard the connection.
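
A sketch of that expected-versus-observed check. The Poisson tail test and the numbers are illustrative assumptions of mine; Velvet's actual statistic differs:

<code python>
import math

def keep_connection(observed, expected, alpha=0.01):
    """Keep a contig-to-contig link unless the observed number of bridging
    mate pairs is improbably low given the expected count (Poisson model)."""
    # P(X <= observed) under Poisson(expected)
    p = sum(math.exp(-expected) * expected ** k / math.factorial(k)
            for k in range(observed + 1))
    return p > alpha

# If coverage predicts ~10 bridging pairs, 1 observed pair is suspicious
# while 7 is acceptable (illustrative numbers).
print(keep_connection(1, 10.0))  # False: drop the connection
print(keep_connection(7, 10.0))  # True
</code>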
| + | |||
| + | == In NGS assemblers: == | ||
| + | * Euler: for each pair of reads, find all possible paths from one read to another | ||
| + | * ABySS: Same as above, but the read-pairs are bundled into node-to-node connections to reduce calculations | ||
| + | * ALLPATHS: Same as above, but the search is limited to localized clouds around pre-computed scaffolds | ||
| + | * ALLPATHS did better using small localized problems of a single small scaffold for parallelization. | ||
| + | |||
| + | == Using the differences between insert length == | ||
| + | * The Shorty algorithm uses the variance between read pairs anchored on a common contig on k-mer | ||
| + | * Projected image of the other ends creates possibility of assembling a smaller set, the projection from a node. | ||
| + | |||
| + | === Peeble: Handling paired-end reads === | ||
| * Pick one unique node, follow their paths | * Pick one unique node, follow their paths | ||
| + | * Velvet used the reverse idea, the projections to a node | ||
| * Some of them are unique, pick out those nodes | * Some of them are unique, pick out those nodes | ||
| + | * Hit the nodes in the vicinity of the starting contig, got distance information on the immediate vicinity. | ||
| + | * Uses a greedy algorithm. Find way to next unique node. | ||
| + | * Having a more-precise insert size could help | ||
| + | * Velvet did a size assuming 10% distribution | ||
| + | * Solexa might have been different | ||
| + | * SOLiD up to 2k, Solexa just 300 | ||
| + | * It is difficult to have enough DNA of a narrow range. | ||
| + | * Don't love barcoding because it uses a lot of reagents. | ||
| + | |||
| ==== Practical considerations ==== | ==== Practical considerations ==== | ||
| - | Colorspace | + | |
| + | === Colorspace === | ||
| * Di-base encoding has a 4 letter alphabet, but very different behavior to sequence space | * Di-base encoding has a 4 letter alphabet, but very different behavior to sequence space | ||
| + | * Double-encoded files use ACGT. | ||
| + | * Be careful not to confuse it! | ||
| * Different rules for complementarity | * Different rules for complementarity | ||
| + | * Just reverse, don't complement them. | ||
| * Direct conversion to sequence space is simple but wasteful | * Direct conversion to sequence space is simple but wasteful | ||
| - | * One error messes up all the remaining bases | + | * One error messes up all the remaining bases. This is a huge loss, so do all in colorspace and make a complete assembly, then finally convert to base calls later. |
| * Conversion must therefore be done at the very end of the process, when the reads are aligned | * Conversion must therefore be done at the very end of the process, when the reads are aligned | ||
| - | * You can then use the transition rules to detect errors | + | * You can then use the transition rules to detect errors. Typically it detects the error after 1 or 2 bp, then corrects. |
| - | Different error models | + | * SOLiD has developed tools |
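
A small sketch of di-base decoding using the published SOLiD color code (each color encodes the transition between adjacent bases), showing why one bad color corrupts everything downstream:

<code python>
# SOLiD di-base code: color i joins bases whose 2-bit codes XOR to i
# (A=0, C=1, G=2, T=3).
CODE = {0: {"A": "A", "C": "C", "G": "G", "T": "T"},
        1: {"A": "C", "C": "A", "G": "T", "T": "G"},
        2: {"A": "G", "G": "A", "C": "T", "T": "C"},
        3: {"A": "T", "T": "A", "C": "G", "G": "C"}}

def decode(primer_base, colors):
    """Convert a colorspace read to bases, starting from the known primer
    base. An error in one color shifts every base after it."""
    seq, base = [], primer_base
    for c in colors:
        base = CODE[c][base]
        seq.append(base)
    return "".join(seq)

print(decode("T", [3, 2, 0, 1]))  # AGGT
print(decode("T", [1, 2, 0, 1]))  # GAAC: one bad color, all bases wrong
</code>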
| + | |||
| + | === Different error models === | ||
| * When using different technologies, you have to take into account different technologies | * When using different technologies, you have to take into account different technologies | ||
| * Trivial when doing OLC assembly | * Trivial when doing OLC assembly | ||
| - | * Much more tricky when doing De Bruijn graph assemble, since k-mers are not assigned to reads | + | * Much more tricky when doing de Bruijn graph assemble, since k-mers are not assigned to reads |
| * Different assemblers had different settings (e.g. Newbler) | * Different assemblers had different settings (e.g. Newbler) | ||
| - | Pre-filtering the reads | + | * 454 had homopolymer issues in flow-space. |
| + | * Illumina or ABI are different. | ||
| + | * Makes it hard to handle errors. | ||
| + | * Especially in de Bruijn graphs. Applying the correction rules becomes much more complex. | ||
| + | * Newbler may be better for 454. | ||
| + | |||
| + | === Pre-filtering the reads === | ||
| * Some assemblers have in-built filtering of the reads (e.g. Euler) but not a generality | * Some assemblers have in-built filtering of the reads (e.g. Euler) but not a generality | ||
| * Efficient filtering of low quality bases can cut down on the computational cost (memory and time) | * Efficient filtering of low quality bases can cut down on the computational cost (memory and time) | ||
| + | * User-specific, data-space specific solutions | ||
| + | * No tried and true consistent approach | ||
| * Beware that some assemblers require reads of identical length | * Beware that some assemblers require reads of identical length | ||
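
A generic sketch of quality trimming (the thresholds are arbitrary assumptions, not any assembler's built-in filter; note the caveat above, since trimming produces reads of unequal length):

<code python>
def trim_read(seq, quals, q_min=20, min_len=25):
    """Trim the low-quality 3' tail (errors cluster near read ends) and
    drop the read entirely if too little survives."""
    end = len(seq)
    while end > 0 and quals[end - 1] < q_min:
        end -= 1
    return seq[:end] if end >= min_len else None

print(trim_read("ATGGCGTAC", [30, 32, 31, 30, 28, 25, 12, 9, 8], min_len=5))
# ATGGCG
</code>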
| + | |||
| ==== Extensions to Velvet ==== | ==== Extensions to Velvet ==== | ||
| - | Oases: //de novo// transcriptome assembly (How to study mRNA reads which do not map) | + | ** Oases: //de novo// transcriptome assembly ** \\ |
| - | De Brujin graphs ~ splice graphs | + | How to study mRNA reads which do not map |
| - | * Pavel Pevzner mentioned that De Bruijn graphs are more or less like splice graphs | + | |
| - | Oases: Process | + | === De Brujin graphs ~ splice graphs === |
| + | * People are using next generation sequencing generally because it's cheaper. | ||
| + | * Would have just one path except for alternate splicing. | ||
| + | * Pavel Pevzner mentioned in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply those algorithms developed for splice graphs to de Bruijn graphs. | ||
| + | |||
| + | === Oases: Process === | ||
| * Create clusters of contigs: | * Create clusters of contigs: | ||
| * Connecting reads | * Connecting reads | ||
| * Connecting read pairs | * Connecting read pairs | ||
| + | * The genetic locus should create a connected locus, cluster of contigs. | ||
| + | * Sometimes one cluster gets broken into two, but works pretty well. | ||
| * Re-implement traditional algorithms to run on graphs, instead of a reference genome | * Re-implement traditional algorithms to run on graphs, instead of a reference genome | ||
| * Greedy transitive reduction (Myers, 2005) | * Greedy transitive reduction (Myers, 2005) | ||
| * Motif searches | * Motif searches | ||
| * Dynamic assembly of transcripts (Lee, 2003) | * Dynamic assembly of transcripts (Lee, 2003) | ||
| - | Oases: Output | + | * find the path of highest weight |
| + | |||
| + | === Oases: Output === | ||
| * Full length transcripts - handles alternative splicing. | * Full length transcripts - handles alternative splicing. | ||
| + | * With skipped exon and one artifact. | ||
| * Transcripts from different genes - handles varying coverage levels. | * Transcripts from different genes - handles varying coverage levels. | ||
| * Alternative splicing events: when possible, distinguished between common AS events. | * Alternative splicing events: when possible, distinguished between common AS events. | ||
| - | Oases: Results: | + | * Find donor and acceptor sites on or around junctions. |

=== Oases: Results ===
  * Benchmarked within the RGASP competition.
  * Already 100+ users, some reporting very promising results:
    * N50 = 1886 bp (Monica Britton, UC Davis)
    * 99.8% of aligned transcripts against UniGene (Micha Bayer, Scottish Crop Research Institute)
| + | |||
| + | === Oases: results of Drosophila PE-reads === | ||
| + | * The number of transcripts and loci are in the ballpark of what they were expecting. | ||
| + | * rl 36, k21 ins size 200 | ||
| + | * Map to genome ~90% | ||
| + | |||
| + | === Oases: examples === | ||
| + | * Found an unannotated gene with 7 exons | ||
| + | * Assembler has no notion of the reference genome, but still found one example that mapped very well. | ||
| + | |||
| + | === Mapping vs. de novo === | ||
| * Mapping | * Mapping | ||
| - | * Many tools available | + | * Many good mature tools available |
| * Easy to compute | * Easy to compute | ||
| - | * Simpler to analyze | + | * Simpler to analyze, comes with genetic coords |
| * //De novo// | * //De novo// | ||
| + | * Harder than mapping | ||
| + | * May require special hardware | ||
| * No reference needed | * No reference needed | ||
| * Handles novel structures without //a priori// | * Handles novel structures without //a priori// | ||
| - | Columbus: Velvet module | + | * How to get the best of both worlds? |
| + | |||
| + | === Columbus: Velvet module === | ||
| * Solution: the reference-based de Bruijn graph | * Solution: the reference-based de Bruijn graph | ||
| - | * Preserves the structure of the reference sequences | + | * Just throw in the reference genome as long reads. |
| - | * Integrates alignment information (SAM, BAM file) | + | * Chucking in long reads and getting short contigs. |
| - | * Edits, breaks up, reconnects ** INCOMPLETE ** | + | * Preserves the structure of the reference sequences. Not allowed to overlap with self. |
| + | * What they wanted really was to start from a reference genome, then handle it as a little insertion. | ||
| + | * Integrates alignment information (SAM, BAM file with BWA, Bowtie) | ||
| + | * Edits, breaks up, reconnects, and extends the "alignment contigs" as necessary | ||
| + | * Applications | ||
| + | * Comparative transcriptome assembly | ||
| + | * Cancer RNAseq | ||
| + | * Comparative assembly | ||
| + | |||
| + | === De Bruijn graph convention for next generation sequencing === | ||
| + | * For transcriptome, using k=21. | ||
| + | * Can use reads as short as 25. | ||
| + | * How many X coverage needed? | ||
| + | * How many genes do you want to catch? | ||
| + | * With genome, want long inserts generally. | ||
| + | * With transcriptome, short inserts are more useful. They have been using 200bp tiny stuff. | ||
| + | |||
| + | ===== After presentation ===== | ||
| + | |||
| + | Kevin on Wednesday will be talking about: | ||
| + | * Use SOLiD data for ordering and orientation of //Pog// contigs | ||
| + | * Special purpose scripts | ||
| + | * The data was so noisy | ||