User Tools

Site Tools


Lecture Notes for April 19, 2010


Assembling and comparing genomes or transcriptomes using de Bruijn graphs
Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler


  • De Bruijn graphs
  • Repeat resolution
  • Practical considerations
  • Extensions to Velvet

De Bruijn graphs

Overlap-Layout-Consensus (traditional)

  • Traditional assemblers: Phrap, Arachne, Celera, etc.
  • Short reads: Edena
  • Generally more expensive computationally.
    • Overlaps only allow clean right-to-left and left-to-right overlaps.
    • Transitivity
      • Need to remove the redundant connections.
      • Well studied graph theory problem
  • Use whole-read alignments

Quick Overlap-Layout-Consensus (OLC) example

  • Filter out all the reads that are identical or self-contained in another read. Only left-to-right, etc. are allowed.
  • Compute all their overlaps.
  • Get many transitive redundant connections.
    • Many algorithms are available for transitive reduction.
  • You obtain simplified paths through the reads.
    • Get 4 big chunks of reads from the example.
    • Hard to distinguish which reads belong to which copy of the repeat.
  • This is how contigs were first built.
  • Have to compare all reads against each other.
    • Quadratic. This has become difficult.
    • Correction: quadratic behavior can be reduced to about O(n log n) by doing a clustering before aligning: just like in several fast search techniques, you only attempt to align sequences that share seed hits to relatively short k-mers (much shorter than those needed for de Bruijn graphs, since they only need to reduce the number of hits, not get unique locations in the genome). — Kevin Karplus 2010/04/24 07:51
  • Next generation sequencing reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
  • OLC method uses all information in the read at least.

Comparison of different assembly techniques

Greedy hash table searches (~2007)
  • Some examples: SSAKE, SHARCGS, VCAKE
  • Much faster & leaner than the previous
  • Less robust to variation and noise
  • Essentially simplified de Bruijn graph assemblers
  • Build a dictionary of words in the data set.
    • His example is 4mer, but typically 20mer and longer.
  • For each, create a node in the hash table.
    • Big dataset. Typically a binary marker to say if present.
  • Sometimes at branch have two possible extensions.
  • Greedy approach essentially de Bruijn graph with local info only. In this structure you could have many paths through.
  • Can throw in reads of different lengths with no problems.
De Bruijn graph assemblers (~2009)
  • Currently the most common for NGS: Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo
  • Pavel Pevzner popularized in paper from 2001
  • Why do we use them?
    • High through-put, cheap, short reads with errors
  • Basically a compromise between two previous approaches:
    • Just a k-mer hash
    • … but processed globally
  • Graph construction and correction is similar in all assemblers.
  • ABySS handles parallel implementation.

Working with de Bruijn graphs in Velvet

Fixing errors
  • Errors that are created in assembly create spurs/tips that can be easily removed.
  • When finding bubbles, remap the branch with the lower coverage back onto the branch with the higher coverage (as not to lose the information).
  • Loops can represent tandem repeats.
Creating de Bruijn graphs
  • First, concatenate the simple graphs without branches into nodes which compact the non-branching runs into a single item.
  • Biochemistry of short reads means errors tend to be near the ends.
  • If you have errors in the reads in the second half, it creates millions of tips, but they are at least easy to remove later.
  • Each read is a path through the graph. The head of the read is still kept, so we're not losing the read placement information.
  • We are still left with bubbles.
    • Possibly accumulation of overlapping spurs/tips. More often variation (e.g. SNPs or variation).
    • Would like to keep that information where possible.
    • Could also be a long read. But we don't want to cut the long read in the middle.
  • Now we have loops that goes back on itself (not a bubble).

So what's the difference?

  • Algebraic difference:
    • Reads in the OLC methods are atomic (i.e. an arc)
    • Reads in the DB graph are sequential paths through the graph
  • …this leads to practical differences
    • DB graphs allow for greater variety of overlaps
    • Overlaps in the OLC approach require a global alignment, not just a shared k-mer
    • OLC methods gets error correction from long reads, while DB graphs requires more work to correct.

Simulations: Tour Bus - error

  • Compare 5mb chunk from 4 species: E. coli, yeast, C. elegans, human
  • Increased coverage from 0 to 50x
  • N50 on y-axis
  • First there is exponential growth as expected, but then the complexity causes trouble unless you have other kinds of information.
  • Not using paired-reads information
  • You see that the human is a little more complex than E. coli.

Repeat resolution

Scaling up the hard way: chromosome X

  • 548 million Solexa reads were generated from a flow-sorted human X chromosome.
    • Fit in 70GB of RAM
    • Numerous short contigs
      • Many contigs: 898,401 contigs
      • Shirt contigs: 260bp N50 (however max of 6,956bp)
    • Overall length: 130Mb
    • How to handle repeats?
  • Moral: there are engineering issues to be resolved but the complexity of the graph needs to be handled accordingly.
Handling complexity of the graph
  • Reduced representation (Margulies et al.).
    • Reduced representation using restriction enzymes and gel to cut the problem into tractable bits.
    • Cut chunks from the gel and then sequenced the gel slice separately or at least in different lanes.
    • Then the assembly problem is smaller and more localized instead of whole genome.
  • Combined re-mapping and de novo sequencing (Cheetham et al., Pleaseance et al.).
    • Map to reference or similar genome, then re-map locally some stuff that is unique.
    • MAQ (Mapping and Assembly with Quality) and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches.
    • Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangments.
  • Code parallelization (Birol et al., Jeffrey Cook).
    • Only ABySS has published yet
  • Improved indexing (Zamin Iqbal and Mario Caccamo).
    • Fit whole genome shotgun onto 200GB
  • Use of intermediate remapping (Matthias Haimei).
    • Wanted to parallelize the second level work of building scaffolds, but those are mostly engineering issues.

Repeats in a de Bruijn graph

  • The de Bruijn graph for a small organism looks like long-ish contigs, but still makes a hair-ball.

Rock band: Using long and short reads together

  • If all of the paths along the short reads move from one long read (e.g. 100-1kb) to another long read, then you know the sequence from the short reads
  • Long reads which are mapped seamlessly into the graph.
  • 35bp reads 50x, some long reads (100-1kb) 0-20x
  • Played with the length
  • The longer your long reads are, the more bridgeable it is and the longer the contigs.
    • Someone might ask: “Why not just assemble the long reads?”

Different approaches to repeat resolution

Theoretical: spectral graph analysis
  • Equivalent to a Principle Component Analysis or a heat conduction model
    • Find what is the vector that threads its way
  • Relies on a (massive) matrix diagonialization
  • Comprehensive: all data integrated at once
    • Takes all the data at once in one huge matrix, not too affected by noise. No one has actually tried this approach as it requires too much computation.
Traditional scaffolding
  • E.g. Arachne, Celera, BAMBUS
  • Heuristic approach similar to that used in traditional overlap layout-consensus contigs
    • The traditional approach is contigs with paired-end reads.
    • Find chains connected by contig reads.
  • Build a big graph of pairwise connections, simplify, extract obvious linear components
    • Your coverage is huge.
    • Each base pair has to handle 1000 coverage read-pairs. Long reads separated by a mess in the middle. Check possible paths. Find which are acceptable.
    • Find all the reads that start and end on the two long reads. Then consider them all together, much more efficient.
    • 5% or more of mate-pairs can be wrong
    • Hopefully will be able to filter them out.
    • In Velvet use a likelihood approach
    • Have a distance estimate
    • Coverage of A and B, how many would we expect to see? If we see too few, we discard the connection.
In NGS assemblers:
  • Euler: for each pair of reads, find all possible paths from one read to another
  • ABySS: Same as above, but the read-pairs are bundled into node-to-node connections to reduce calculations
  • ALLPATHS: Same as above, but the search is limited to localized clouds around pre-computed scaffolds
    • ALLPATHS did better using small localized problems of a single small scaffold for parallelization.
Using the differences between insert length
  • The Shorty algorithm uses the variance between read pairs anchored on a common contig on k-mer
    • Projected image of the other ends creates possibility of assembling a smaller set, the projection from a node.

Peeble: Handling paired-end reads

  • Pick one unique node, follow their paths
    • Velvet used the reverse idea, the projections to a node
  • Some of them are unique, pick out those nodes
    • Hit the nodes in the vicinity of the starting contig, got distance information on the immediate vicinity.
    • Uses a greedy algorithm. Find way to next unique node.
  • Having a more-precise insert size could help
  • Velvet did a size assuming 10% distribution
    • Solexa might have been different
  • SOLiD up to 2k, Solexa just 300
  • It is difficult to have enough DNA of a narrow range.
  • Don't love barcoding because it uses a lot of reagents.

Practical considerations


  • Di-base encoding has a 4 letter alphabet, but very different behavior to sequence space
    • Double-encoded files use ACGT.
    • Be careful not to confuse it!
    • Different rules for complementarity
      • Just reverse, don't complement them.
  • Direct conversion to sequence space is simple but wasteful
    • One error messes up all the remaining bases. This is a huge loss, so do all in colorspace and make a complete assembly, then finally convert to base calls later.
  • Conversion must therefore be done at the very end of the process, when the reads are aligned
    • You can then use the transition rules to detect errors. Typically it detects the error after 1 or 2 bp, then corrects.
    • SOLiD has developed tools

Different error models

  • When using different technologies, you have to take into account different technologies
    • Trivial when doing OLC assembly
    • Much more tricky when doing de Bruijn graph assemble, since k-mers are not assigned to reads
    • Different assemblers had different settings (e.g. Newbler)
  • 454 had homopolymer issues in flow-space.
  • Illumina or ABI are different.
    • Makes it hard to handle errors.
    • Especially in de Bruijn graphs. Applying the correction rules becomes much more complex.
  • Newbler may be better for 454.

Pre-filtering the reads

  • Some assemblers have in-built filtering of the reads (e.g. Euler) but not a generality
  • Efficient filtering of low quality bases can cut down on the computational cost (memory and time)
    • User-specific, data-space specific solutions
    • No tried and true consistent approach
  • Beware that some assemblers require reads of identical length

Extensions to Velvet

Oases: de novo transcriptome assembly
How to study mRNA reads which do not map

De Brujin graphs ~ splice graphs

  • People are using next generation sequencing generally because it's cheaper.
  • Would have just one path except for alternate splicing.
  • Pavel Pevzner mentioned in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply those algorithms developed for splice graphs to de Bruijn graphs.

Oases: Process

  • Create clusters of contigs:
    • Connecting reads
    • Connecting read pairs
    • The genetic locus should create a connected locus, cluster of contigs.
    • Sometimes one cluster gets broken into two, but works pretty well.
  • Re-implement traditional algorithms to run on graphs, instead of a reference genome
    • Greedy transitive reduction (Myers, 2005)
    • Motif searches
    • Dynamic assembly of transcripts (Lee, 2003)
      • find the path of highest weight

Oases: Output

  • Full length transcripts - handles alternative splicing.
    • With skipped exon and one artifact.
  • Transcripts from different genes - handles varying coverage levels.
  • Alternative splicing events: when possible, distinguished between common AS events.
    • Find donor and acceptor sites on or around junctions.

Oases: Results

  • Benchmarked within the RGASP competition
  • Already 100+ users, some reporting very promising results:
    • N50 = 1886 bp (Monica Britton, UC Davis)
    • 99.8% of aligned transcripts against unigene (Micha Bayer - Scottish Crop Research Institute)

Oases: results of Drosophila PE-reads

  • The number of transcripts and loci are in the ballpark of what they were expecting.
  • rl 36, k21 ins size 200
  • Map to genome ~90%

Oases: examples

  • Found an unannotated gene with 7 exons
  • Assembler has no notion of the reference genome, but still found one example that mapped very well.

Mapping vs. de novo

  • Mapping
    • Many good mature tools available
    • Easy to compute
    • Simpler to analyze, comes with genetic coords
  • De novo
    • Harder than mapping
    • May require special hardware
    • No reference needed
    • Handles novel structures without a priori
  • How to get the best of both worlds?

Columbus: Velvet module

  • Solution: the reference-based de Bruijn graph
    • Just throw in the reference genome as long reads.
    • Chucking in long reads and getting short contigs.
    • Preserves the structure of the reference sequences. Not allowed to overlap with self.
      • What they wanted really was to start from a reference genome, then handle it as a little insertion.
    • Integrates alignment information (SAM, BAM file with BWA, Bowtie)
    • Edits, breaks up, reconnects, and extends the “alignment contigs” as necessary
  • Applications
    • Comparative transcriptome assembly
    • Cancer RNAseq
    • Comparative assembly

De Bruijn graph convention for next generation sequencing

  • For transcriptome, using k=21.
  • Can use reads as short as 25.
  • How many X coverage needed?
  • How many genes do you want to catch?
  • With genome, want long inserts generally.
  • With transcriptome, short inserts are more useful. They have been using 200bp tiny stuff.

After presentation

Kevin on Wednesday will be talking about:

  • Use SOLiD data for ordering and orientation of Pog contigs
  • Special purpose scripts
  • The data was so noisy
You could leave a comment if you were logged in.
lecture_notes/04-19-2010.txt · Last modified: 2010/04/24 07:54 by karplus