====== Lecture Notes for April 19, 2010 ======

===== Presentation =====
** {{:lecture_notes:20100419_banana_slug_seminar.ppt|Assembling and comparing genomes or transcriptomes using de Bruijn graphs}} ** \\
Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler

==== Overview ====
  * De Bruijn graphs
  * Repeat resolution
  * Practical considerations
  * Extensions to Velvet

==== De Bruijn graphs ====

=== Overlap-Layout-Consensus (traditional) ===
  * Traditional assemblers: Phrap, Arachne, Celera, etc.
  * Short reads: Edena
  * Generally more expensive computationally.
  * Overlaps only allow clean right-to-left and left-to-right overlaps.
  * Transitivity
    * Need to remove the redundant connections.
    * Well studied graph theory problem
  * Use whole-read alignments

=== Quick OLC example ===
  * Filter out all the reads that are identical to or contained in another read. Only clean left-to-right (etc.) overlaps are allowed.
  * Compute all their overlaps.
  * This yields many transitively redundant connections.
  * Many algorithms are available for transitive reduction (see the sketch after this list).
  * You obtain simplified paths through the reads.
  * The example yields 4 big chunks of reads.
  * It is hard to distinguish which reads belong to which copy of a repeat.
  * This is how contigs were first built.
  * Every read has to be compared against every other read: quadratic cost, which has become a real difficulty.
  * NGS reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
  * At least the OLC method uses all the information in each read.

=== Comparison of different assembly techniques ===

== Greedy hash table searches (~2007) ==
  * Some examples: SSAKE, SHARCGS, VCAKE
  * Much faster & leaner than the previous approach
  * Less robust to variation and noise
  * Essentially simplified de Bruijn graph assemblers:
    * Build a dictionary of words in the data set.
    * The talk's example uses 4-mers, but typically 20-mers and longer.
    * For each, create a node in the hash table.
    * The dataset is big, so typically just a binary marker records whether a word is present.
    * Sometimes a branch offers two possible extensions.
  * The greedy approach is essentially a de Bruijn graph with local information only; the structure can admit many paths through it (a sketch follows below).
  * Can throw in reads of different lengths with no problems.

== De Bruijn graph assemblers (~2009) ==
  * Currently the most common for NGS: Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo
  * Pavel Pevzner popularized them in a paper from 2001.
  * Why do we use them?
    * High-throughput, cheap, short reads with errors
  * Basically a compromise between the two previous approaches:
    * Just a k-mer hash...
    * ...but processed globally
  * Graph construction and correction is similar in all assemblers.
  * ABySS handles parallel implementation.

=== Working with de Bruijn graphs in Velvet ===

== Fixing errors ==
  * Errors that are created in assembly create spurs/tips that can be easily removed (a toy illustration follows below).
  * When finding bubbles, remap the branch with the lower coverage back onto the branch with the higher coverage (so as not to lose the information).
  * Loops can represent tandem repeats.

== Creating de Bruijn graphs ==
  * First, concatenate the simple paths without branches into nodes, compacting each non-branching run into a single item (see the sketch after this list).
  * The biochemistry of short reads means errors tend to be near the ends.
  * Errors in the second half of the reads create millions of tips, but those are at least easy to remove later.
  * Each read is a path through the graph. The head of the read is still kept, so we are not losing the read placement information.
  * We are still left with bubbles:
    * Possibly an accumulation of overlapping spurs/tips; more often variation (e.g. SNPs).
    * We would like to keep that information where possible.
    * It could also be a long read, and we don't want to cut a long read in the middle.
  * We also get loops that go back on themselves (not bubbles).

=== So what's the difference? ===
  * Algebraic difference:
    * Reads in the OLC methods are atomic (i.e. an arc).
    * Reads in the DB graph are sequential paths through the graph.
  * ...this leads to practical differences:
    * DB graphs allow for a greater variety of overlaps.
    * Overlaps in the OLC approach require a global alignment, not just a shared k-mer.
    * OLC methods get error correction from long reads, while DB graphs require more work to correct.

=== Simulations: Tour Bus - error ===
  * Compare 5 Mb chunks from 4 species: //E. coli//, yeast, //C. elegans//, human
  * Increased coverage from 0 to 50x
  * N50 on the y-axis (a small helper for computing it follows below)
  * At first there is exponential growth as expected, but then the complexity causes trouble unless you have other kinds of information.
  * Not using paired-read information
  * You see that the human is a little more complex than //E. coli//.
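
For reference, the N50 plotted on the y-axis, in its standard definition:

<code python>
def n50(contig_lengths):
    """Smallest length L such that contigs of length >= L contain
    at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

print(n50([100, 80, 60, 40, 20]))   # -> 80
</code>
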
==== Repeat resolution ====

=== Scaling up the hard way: chromosome X ===
  * 548 million Solexa reads were generated from a flow-sorted human X chromosome.
  * Fit in 70 GB of RAM
  * Numerous short contigs:
    * Many contigs: 898,401 contigs
    * Short contigs: 260 bp N50 (though with a maximum of 6,956 bp)
  * Overall length: 130 Mb
  * How to handle repeats?
  * Moral: there are engineering issues to be resolved, but the complexity of the graph needs to be handled accordingly.

== Handling complexity of the graph ==
  * Reduced representation (Margulies et al.)
    * Uses restriction enzymes and a gel to cut the problem into tractable bits.
    * Chunks are cut from the gel, and each gel slice is sequenced separately, or at least in different lanes.
    * The assembly problem is then smaller and more localized instead of whole-genome.
  * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleasance et al.)
    * Map to a reference or similar genome, then re-map locally the stuff that is unique.
    * MAQ and BWA only allow 3 or 4 mismatches.
    * You can spot reads failing to map in a small area with too many SNPs or breakpoint rearrangements.
  * Code parallelization (Birol et al., Jeffrey Cook)
    * Only ABySS has published yet.
  * Improved indexing (Zamin Iqbal and Mario Caccamo)
    * Fit WGS into 200 GB.
  * Use of intermediate remapping (Matthias Haimel)
    * Wanted to parallelize the second-level work of building scaffolds, but those are mostly engineering issues.

=== Repeats in a de Bruijn graph ===
  * The de Bruijn graph for a small organism looks like long-ish contigs, but still makes a hair-ball.

=== Rock Band: using long and short reads together ===
  * If all of the paths along the short reads move from one long read to another long read, then you know the sequence between them from the short reads (a toy sketch follows below).
  * Long reads are mapped seamlessly into the graph.
  * 35 bp reads at 50x, plus some long reads (100 bp to 1 kb) at 0-20x.
  * They played with the length: the longer your long reads are, the more bridgeable the graph is and the longer the contigs.
  * "Why not just assemble the long reads?" someone might ask.

- | * E.g. Arachne, Celera, BAMBUS | + | === Different approaches to repeat resolution === |
- | * Heuristic approach similar to ** INCOMPLETE ** | + | |
- | * In NGS assemblers: | + | == Theoretical: spectral graph analysis == |
- | * Euler: for each pair of reads, find all possible paths from one read to another | + | * Equivalent to a Principle Component Analysis or a heat conduction model |
- | * ABySS: Same as above, but the read-pairs are bundled into node-to-node connections to reduce calculations | + | * Find what is the vector that threads its way |
- | * ALLPATHS: Same as above, but the search is limited to localized clouds around pre-computed scaffolds | + | * Relies on a (massive) matrix diagonialization |
- | * Using the differences in ** INCOMPLETE ** | + | * Comprehensive: all data integrated at once |
- | * ** INCOMPLETE ** | + | * Takes all the data at once in one huge matrix, not too affected by noise. No one has actually tried this approach as it requires too much computation. |
- | Peeble: Handling paired-end reads | + | |
== Traditional scaffolding ==
  * E.g. Arachne, Celera, BAMBUS
  * Heuristic approach similar to that used in traditional overlap-layout-consensus contigs
  * The traditional approach is contigs plus paired-end reads:
    * Find chains of contigs connected by read pairs.
    * Build a big graph of pairwise connections, simplify it, and extract the obvious linear components.
  * Your coverage is huge:
    * Each base pair has to handle read pairs at ~1000x coverage. Long reads are separated by a mess in the middle; check the possible paths and find which ones are acceptable.
    * Find all the reads that start and end on the two long reads, then consider them all together; this is much more efficient.
  * 5% or more of mate-pairs can be wrong.
    * Hopefully they can be filtered out.
    * Velvet uses a likelihood approach (sketched below):
      * Have a distance estimate.
      * Given the coverage of A and B, how many pairs would we expect to see? If we see too few, we discard the connection.

== In NGS assemblers ==
  * Euler: for each pair of reads, find all possible paths from one read to the other.
  * ABySS: same as above, but the read pairs are bundled into node-to-node connections to reduce calculations.
  * ALLPATHS: same as above, but the search is limited to localized clouds around pre-computed scaffolds.
    * ALLPATHS did better by using small localized problems of a single small scaffold for parallelization.

== Using the differences between insert lengths ==
  * The Shorty algorithm uses the variance between read pairs anchored on a common contig or k-mer.
  * The projected image of the other ends creates the possibility of assembling a smaller set: the projection from a node.

=== Pebble: handling paired-end reads ===
  * Pick one unique node and follow the paths of its reads.
  * Velvet uses the reverse idea: the projections //to// a node.
  * Some of the nodes hit are unique; pick out those nodes.
  * Hitting the nodes in the vicinity of the starting contig gives distance information on the immediate vicinity (a toy projection follows below).
  * Uses a greedy algorithm: find the way to the next unique node.
  * Having a more precise insert size could help:
    * Velvet assumed a size distribution of about 10%.
    * Solexa might have been different.
    * SOLiD inserts go up to 2 kb, Solexa just 300 bp.
    * It is difficult to get enough DNA of a narrow size range.
    * They don't love barcoding because it uses a lot of reagents.

==== Practical considerations ====

=== Colorspace ===
  * Di-base encoding has a 4-letter alphabet, but very different behavior from sequence space (a decoding sketch follows below).
    * Double-encoded files use ACGT.
    * Be careful not to confuse them!
  * Different rules for complementarity:
    * Just reverse, don't complement.
  * Direct conversion to sequence space is simple but wasteful:
    * One error messes up all the remaining bases. This is a huge loss, so do everything in colorspace, make a complete assembly, and only convert to base calls at the end.
  * Conversion must therefore be done at the very end of the process, when the reads are aligned.
    * You can then use the transition rules to detect errors. Typically an error is detected after 1 or 2 bp and then corrected.
  * SOLiD has developed tools.

=== Different error models ===
  * When using different technologies, you have to take their different error models into account.
    * Trivial when doing OLC assembly
    * Much more tricky when doing de Bruijn graph assembly, since k-mers are not assigned to reads
  * Different assemblers have different settings (e.g. Newbler):
    * 454 has homopolymer issues in flow-space.
    * Illumina and ABI errors are different.
    * This makes it hard to handle errors, especially in de Bruijn graphs, where applying the correction rules becomes much more complex.
    * Newbler may be better for 454.

=== Pre-filtering the reads ===
  * Some assemblers have built-in filtering of the reads (e.g. Euler), but this is not general.
  * Efficient filtering of low-quality bases can cut down on the computational cost (memory and time); a generic trimming sketch follows below.
    * Solutions are user-specific and data-space-specific.
    * There is no tried-and-true consistent approach.
  * Beware that some assemblers require reads of identical length.

==== Extensions to Velvet ====
** Oases: //de novo// transcriptome assembly ** \\
How to study mRNA reads which do not map

=== De Bruijn graphs ~ splice graphs ===
  * Pavel Pevzner mentioned that de Bruijn graphs are more or less like splice graphs.

=== Oases: Process ===
  * Create clusters of contigs:
    * Connecting reads
  * Motif searches
  * Dynamic assembly of transcripts (Lee, 2003)

=== Oases: Output ===
  * Full-length transcripts: handles alternative splicing.
  * Transcripts from different genes: handles varying coverage levels.
  * Alternative splicing events: when possible, distinguishes between common AS events.

=== Oases: Results ===
  * Benchmarked within the RGASP competition
  * Already 100+ users, some reporting very promising results:
    * N50 = 1886 bp (Monica Britton, UC Davis)
    * 99.8% of transcripts aligned against UniGene (Micha Bayer, Scottish Crop Research Institute)

=== Oases: results for Drosophila PE reads ===
  * The number of transcripts and loci are in the ballpark of what they were expecting.

=== Mapping vs. //de novo// ===
  * Mapping
    * Many tools available
  * //De novo//
    * No reference needed
    * Handles novel structures without //a priori// knowledge

=== Columbus: Velvet module ===
  * Solution: the reference-based de Bruijn graph
    * Preserves the structure of the reference sequences
    * Integrates alignment information (SAM, BAM files)
    * Edits, breaks up, reconnects, and extends the "alignment contigs" as necessary

===== Galt's Notes =====

==== OASES ====
useful. They have been using 200bp tiny stuff.

===== End-of-class business =====
Kevin on Wed. will be talking about:
  * Use SOLiD data for ordering and orientation of //Pog// contigs
  * Special purpose scripts
  * The data was so noisy