User Tools

Site Tools


lecture_notes:04-19-2010

Lecture Notes for April 19, 2010

Presentation

Assembling and comparing genomes or transcriptomes using de Bruijn graphs
Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler

Overview

  • De Bruijn graphs
  • Repeat resolution
  • Practical considerations
  • Extensions to Velvet

De Bruijn graphs

Overlap-Layout-Consensus (traditional)

  • Traditional assemblers: Phrap, Arachne, Celera, etc.
  • Short reads: Edena
  • Generally more expensive computationally.
    • Overlaps only allow clean right-to-left and left-to-right overlaps.
    • Transitivity
      • Need to remove the redundant connections.
      • Well studied graph theory problem
  • Use whole-read alignments

Quick Overlap-Layout-Consensus (OLC) example

  • Filter out all the reads that are identical or self-contained in another read. Only left-to-right, etc. are allowed.
  • Compute all their overlaps.
  • Get many transitive redundant connections.
    • Many algorithms are available for transitive reduction.
  • You obtain simplified paths through the reads.
    • Get 4 big chunks of reads from the example.
    • Hard to distinguish which reads belong to which copy of the repeat.
  • This is how contigs were first built.
  • Have to compare all reads against each other.
    • Quadratic. This has become difficult.
    • Correction: quadratic behavior can be reduced to about O(n log n) by doing a clustering before aligning: just like in several fast search techniques, you only attempt to align sequences that share seed hits to relatively short k-mers (much shorter than those needed for de Bruijn graphs, since they only need to reduce the number of hits, not get unique locations in the genome). — Kevin Karplus 2010/04/24 07:51
  • Next generation sequencing reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
  • OLC method uses all information in the read at least.

Comparison of different assembly techniques

Greedy hash table searches (~2007)
  • Some examples: SSAKE, SHARCGS, VCAKE
  • Much faster & leaner than the previous
  • Less robust to variation and noise
  • Essentially simplified de Bruijn graph assemblers
  • Build a dictionary of words in the data set.
    • His example is 4mer, but typically 20mer and longer.
  • For each, create a node in the hash table.
    • Big dataset. Typically a binary marker to say if present.
  • Sometimes at branch have two possible extensions.
  • Greedy approach essentially de Bruijn graph with local info only. In this structure you could have many paths through.
  • Can throw in reads of different lengths with no problems.
De Bruijn graph assemblers (~2009)
  • Currently the most common for NGS: Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo
  • Pavel Pevzner popularized in paper from 2001
  • Why do we use them?
    • High through-put, cheap, short reads with errors
  • Basically a compromise between two previous approaches:
    • Just a k-mer hash
    • … but processed globally
  • Graph construction and correction is similar in all assemblers.
  • ABySS handles parallel implementation.

Working with de Bruijn graphs in Velvet

Fixing errors
  • Errors that are created in assembly create spurs/tips that can be easily removed.
  • When finding bubbles, remap the branch with the lower coverage back onto the branch with the higher coverage (as not to lose the information).
  • Loops can represent tandem repeats.
Creating de Bruijn graphs
  • First, concatenate the simple graphs without branches into nodes which compact the non-branching runs into a single item.
  • Biochemistry of short reads means errors tend to be near the ends.
  • If you have errors in the reads in the second half, it creates millions of tips, but they are at least easy to remove later.
  • Each read is a path through the graph. The head of the read is still kept, so we're not losing the read placement information.
  • We are still left with bubbles.
    • Possibly accumulation of overlapping spurs/tips. More often variation (e.g. SNPs or variation).
    • Would like to keep that information where possible.
    • Could also be a long read. But we don't want to cut the long read in the middle.
  • Now we have loops that goes back on itself (not a bubble).

So what's the difference?

  • Algebraic difference:
    • Reads in the OLC methods are atomic (i.e. an arc)
    • Reads in the DB graph are sequential paths through the graph
  • …this leads to practical differences
    • DB graphs allow for greater variety of overlaps
    • Overlaps in the OLC approach require a global alignment, not just a shared k-mer
    • OLC methods gets error correction from long reads, while DB graphs requires more work to correct.

Simulations: Tour Bus - error

  • Compare 5mb chunk from 4 species: E. coli, yeast, C. elegans, human
  • Increased coverage from 0 to 50x
  • N50 on y-axis
  • First there is exponential growth as expected, but then the complexity causes trouble unless you have other kinds of information.
  • Not using paired-reads information
  • You see that the human is a little more complex than E. coli.

Repeat resolution

Scaling up the hard way: chromosome X

  • 548 million Solexa reads were generated from a flow-sorted human X chromosome.
    • Fit in 70GB of RAM
    • Numerous short contigs
      • Many contigs: 898,401 contigs
      • Shirt contigs: 260bp N50 (however max of 6,956bp)
    • Overall length: 130Mb
    • How to handle repeats?
  • Moral: there are engineering issues to be resolved but the complexity of the graph needs to be handled accordingly.
Handling complexity of the graph
  • Reduced representation (Margulies et al.).
    • Reduced representation using restriction enzymes and gel to cut the problem into tractable bits.
    • Cut chunks from the gel and then sequenced the gel slice separately or at least in different lanes.
    • Then the assembly problem is smaller and more localized instead of whole genome.
  • Combined re-mapping and de novo sequencing (Cheetham et al., Pleaseance et al.).
    • Map to reference or similar genome, then re-map locally some stuff that is unique.
    • MAQ (Mapping and Assembly with Quality) and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches.
    • Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangments.
  • Code parallelization (Birol et al., Jeffrey Cook).
    • Only ABySS has published yet
  • Improved indexing (Zamin Iqbal and Mario Caccamo).
    • Fit whole genome shotgun onto 200GB
  • Use of intermediate remapping (Matthias Haimei).
    • Wanted to parallelize the second level work of building scaffolds, but those are mostly engineering issues.

Repeats in a de Bruijn graph

  • The de Bruijn graph for a small organism looks like long-ish contigs, but still makes a hair-ball.

Rock band: Using long and short reads together

  • If all of the paths along the short reads move from one long read (e.g. 100-1kb) to another long read, then you know the sequence from the short reads
  • Long reads which are mapped seamlessly into the graph.
  • 35bp reads 50x, some long reads (100-1kb) 0-20x
  • Played with the length
  • The longer your long reads are, the more bridgeable it is and the longer the contigs.
    • Someone might ask: “Why not just assemble the long reads?”

Different approaches to repeat resolution

Theoretical: spectral graph analysis
  • Equivalent to a Principle Component Analysis or a heat conduction model
    • Find what is the vector that threads its way
  • Relies on a (massive) matrix diagonialization
  • Comprehensive: all data integrated at once
    • Takes all the data at once in one huge matrix, not too affected by noise. No one has actually tried this approach as it requires too much computation.
Traditional scaffolding
  • E.g. Arachne, Celera, BAMBUS
  • Heuristic approach similar to that used in traditional overlap layout-consensus contigs
    • The traditional approach is contigs with paired-end reads.
    • Find chains connected by contig reads.
  • Build a big graph of pairwise connections, simplify, extract obvious linear components
    • Your coverage is huge.
    • Each base pair has to handle 1000 coverage read-pairs. Long reads separated by a mess in the middle. Check possible paths. Find which are acceptable.
    • Find all the reads that start and end on the two long reads. Then consider them all together, much more efficient.
    • 5% or more of mate-pairs can be wrong
    • Hopefully will be able to filter them out.
    • In Velvet use a likelihood approach
    • Have a distance estimate
    • Coverage of A and B, how many would we expect to see? If we see too few, we discard the connection.
In NGS assemblers:
  • Euler: for each pair of reads, find all possible paths from one read to another
  • ABySS: Same as above, but the read-pairs are bundled into node-to-node connections to reduce calculations
  • ALLPATHS: Same as above, but the search is limited to localized clouds around pre-computed scaffolds
    • ALLPATHS did better using small localized problems of a single small scaffold for parallelization.
Using the differences between insert length
  • The Shorty algorithm uses the variance between read pairs anchored on a common contig on k-mer
    • Projected image of the other ends creates possibility of assembling a smaller set, the projection from a node.

Peeble: Handling paired-end reads

  • Pick one unique node, follow their paths
    • Velvet used the reverse idea, the projections to a node
  • Some of them are unique, pick out those nodes
    • Hit the nodes in the vicinity of the starting contig, got distance information on the immediate vicinity.
    • Uses a greedy algorithm. Find way to next unique node.
  • Having a more-precise insert size could help
  • Velvet did a size assuming 10% distribution
    • Solexa might have been different
  • SOLiD up to 2k, Solexa just 300
  • It is difficult to have enough DNA of a narrow range.
  • Don't love barcoding because it uses a lot of reagents.

Practical considerations

Colorspace

  • Di-base encoding has a 4 letter alphabet, but very different behavior to sequence space
    • Double-encoded files use ACGT.
    • Be careful not to confuse it!
    • Different rules for complementarity
      • Just reverse, don't complement them.
  • Direct conversion to sequence space is simple but wasteful
    • One error messes up all the remaining bases. This is a huge loss, so do all in colorspace and make a complete assembly, then finally convert to base calls later.
  • Conversion must therefore be done at the very end of the process, when the reads are aligned
    • You can then use the transition rules to detect errors. Typically it detects the error after 1 or 2 bp, then corrects.
    • SOLiD has developed tools

Different error models

  • When using different technologies, you have to take into account different technologies
    • Trivial when doing OLC assembly
    • Much more tricky when doing de Bruijn graph assemble, since k-mers are not assigned to reads
    • Different assemblers had different settings (e.g. Newbler)
  • 454 had homopolymer issues in flow-space.
  • Illumina or ABI are different.
    • Makes it hard to handle errors.
    • Especially in de Bruijn graphs. Applying the correction rules becomes much more complex.
  • Newbler may be better for 454.

Pre-filtering the reads

  • Some assemblers have in-built filtering of the reads (e.g. Euler) but not a generality
  • Efficient filtering of low quality bases can cut down on the computational cost (memory and time)
    • User-specific, data-space specific solutions
    • No tried and true consistent approach
  • Beware that some assemblers require reads of identical length

Extensions to Velvet

Oases: de novo transcriptome assembly
How to study mRNA reads which do not map

De Brujin graphs ~ splice graphs

  • People are using next generation sequencing generally because it's cheaper.
  • Would have just one path except for alternate splicing.
  • Pavel Pevzner mentioned in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply those algorithms developed for splice graphs to de Bruijn graphs.

Oases: Process

  • Create clusters of contigs:
    • Connecting reads
    • Connecting read pairs
    • The genetic locus should create a connected locus, cluster of contigs.
    • Sometimes one cluster gets broken into two, but works pretty well.
  • Re-implement traditional algorithms to run on graphs, instead of a reference genome
    • Greedy transitive reduction (Myers, 2005)
    • Motif searches
    • Dynamic assembly of transcripts (Lee, 2003)
      • find the path of highest weight

Oases: Output

  • Full length transcripts - handles alternative splicing.
    • With skipped exon and one artifact.
  • Transcripts from different genes - handles varying coverage levels.
  • Alternative splicing events: when possible, distinguished between common AS events.
    • Find donor and acceptor sites on or around junctions.

Oases: Results

  • Benchmarked within the RGASP competition
  • Already 100+ users, some reporting very promising results:
    • N50 = 1886 bp (Monica Britton, UC Davis)
    • 99.8% of aligned transcripts against unigene (Micha Bayer - Scottish Crop Research Institute)

Oases: results of Drosophila PE-reads

  • The number of transcripts and loci are in the ballpark of what they were expecting.
  • rl 36, k21 ins size 200
  • Map to genome ~90%

Oases: examples

  • Found an unannotated gene with 7 exons
  • Assembler has no notion of the reference genome, but still found one example that mapped very well.

Mapping vs. de novo

  • Mapping
    • Many good mature tools available
    • Easy to compute
    • Simpler to analyze, comes with genetic coords
  • De novo
    • Harder than mapping
    • May require special hardware
    • No reference needed
    • Handles novel structures without a priori
  • How to get the best of both worlds?

Columbus: Velvet module

  • Solution: the reference-based de Bruijn graph
    • Just throw in the reference genome as long reads.
    • Chucking in long reads and getting short contigs.
    • Preserves the structure of the reference sequences. Not allowed to overlap with self.
      • What they wanted really was to start from a reference genome, then handle it as a little insertion.
    • Integrates alignment information (SAM, BAM file with BWA, Bowtie)
    • Edits, breaks up, reconnects, and extends the “alignment contigs” as necessary
  • Applications
    • Comparative transcriptome assembly
    • Cancer RNAseq
    • Comparative assembly

De Bruijn graph convention for next generation sequencing

  • For transcriptome, using k=21.
  • Can use reads as short as 25.
  • How many X coverage needed?
  • How many genes do you want to catch?
  • With genome, want long inserts generally.
  • With transcriptome, short inserts are more useful. They have been using 200bp tiny stuff.

After presentation

Kevin on Wednesday will be talking about:

  • Use SOLiD data for ordering and orientation of Pog contigs
  • Special purpose scripts
  • The data was so noisy
You could leave a comment if you were logged in.
lecture_notes/04-19-2010.txt · Last modified: 2010/04/24 07:54 by karplus