Lecture Notes for April 19, 2010
Presentation
Overview
De Bruijn graphs
Repeat resolution
Practical considerations
Extensions to Velvet
De Bruijn graphs
Overlap-Layout-Consensus (traditional)
Traditional assemblers: Phrap, Arachne, Celera, etc.
Short reads: Edena
Generally more expensive computationally.
Use whole-read alignments
Quick Overlap-Layout-Consensus (OLC) example
Filter out reads that are identical to, or fully contained within, another read. Only proper left-to-right overlaps, etc. are allowed.
Compute all their overlaps.
Get many transitive redundant connections.
You obtain simplified paths through the reads.
This is how contigs were first built.
Have to compare all reads against each other.
Next generation sequencing reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
The OLC method at least uses all the information in each read.
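The filtering and overlap steps above can be sketched naively as follows (a toy illustration only; the read data and cutoff are made up, and real OLC assemblers use indexed, approximate matching rather than all-vs-all exact comparison):

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` matching a prefix of `b`."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a.endswith(b[:k]):
            best = k
    return best

reads = ["ACGTAC", "GTACGG", "ACGGTT", "CGTA"]  # toy data
# Filter out reads contained in another read ("CGTA" is inside "ACGTAC").
kept = [r for r in reads if not any(r != s and r in s for s in reads)]
# Compute all remaining left-to-right overlaps above the cutoff.
edges = {(a, b): overlap(a, b)
         for a in kept for b in kept
         if a != b and overlap(a, b) >= 3}
```

Note that every read must still be compared against every other, which is where the computational cost comes from.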
Comparison of different assembly techniques
Greedy hash table searches (~2007)
Some examples: SSAKE, SHARCGS, VCAKE
Much faster & leaner than the previous approach
Less robust to variation and noise
Essentially simplified de Bruijn graph assemblers
Build a dictionary of words in the data set.
For each, create a node in the hash table.
Sometimes at branch have two possible extensions.
The greedy approach is essentially a de Bruijn graph using only local information; such a structure could admit many paths through it.
Can throw in reads of different lengths with no problems.
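In the spirit of SSAKE/VCAKE, the greedy extension can be sketched like this (function names and toy data are my own; the real tools differ in details such as quality weighting):

```python
from collections import Counter

def kmer_counts(reads, k):
    """Build the dictionary of k-mers seen in the data set."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def greedy_extend(seed, counts, k, max_len=10_000):
    """Extend a seed one base at a time using local k-mer support only."""
    contig = seed
    while len(contig) < max_len:
        support = {b: counts[contig[-(k - 1):] + b] for b in "ACGT"}
        best = max(support.values())
        winners = [b for b, n in support.items() if n == best]
        if best == 0 or len(winners) > 1:
            break  # dead end, or a branch with two possible extensions
        contig += winners[0]
    return contig
```

Because only the last k-1 bases matter at each step, reads of different lengths mix freely, but the decision at each branch is purely local.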
De Bruijn graph assemblers (~2009)
Currently the most common for NGS: Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo
Pavel Pevzner popularized them in a paper from 2001
Why do we use them?
Basically a compromise between two previous approaches:
Just a k-mer hash
… but processed globally
Graph construction and correction is similar in all assemblers.
ABySS provides a parallel implementation.
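The common construction step can be sketched minimally (an illustration of the "k-mer hash, processed globally" compromise above, not any particular assembler's code):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each observed k-mer contributes one edge."""
    graph = defaultdict(set)
    for r in reads:
        for i in range(len(r) - k + 1):
            kmer = r[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix -> suffix edge
    return graph
```

Unlike the greedy approach, the whole graph is held at once, so path decisions can use global structure rather than local counts only.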
Working with de Bruijn graphs in Velvet
Fixing errors
Sequencing errors create spurs/tips in the graph that can be easily removed.
When finding bubbles, remap the branch with the lower coverage back onto the branch with the higher coverage (so as not to lose the information).
Loops can represent tandem repeats.
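A hypothetical sketch of tip clipping on a toy successor-map graph (the representation and length cutoff are assumptions; Velvet's actual Tour Bus machinery differs):

```python
def clip_tips(graph, max_tip_len):
    """Remove dead-end branches of length <= max_tip_len that hang off
    a branching node. `graph` maps every node to a set of successors."""
    for node in list(graph):
        if len(graph.get(node, ())) < 2:
            continue  # clip tips only where an alternative branch exists
        for start in list(graph[node]):
            path, cur = [], start
            # Walk the branch while it neither ends nor re-branches.
            while len(path) < max_tip_len and len(graph[cur]) == 1:
                path.append(cur)
                cur = next(iter(graph[cur]))
            path.append(cur)
            if not graph[cur] and len(path) <= max_tip_len:
                graph[node].discard(start)  # clip the short dead end
                for n in path:
                    graph.pop(n, None)
    return graph
```

Clipping only short branches is what keeps the genuine end of a long path safe while the error-induced spurs disappear.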
Creating de Bruijn graphs
First, compact the non-branching runs of the graph, concatenating each into a single node.
Biochemistry of short reads means errors tend to be near the ends.
Errors in the second half of the reads create millions of tips, but these are at least easy to remove later.
Each read is a path through the graph. The head of the read is still kept, so we're not losing the read placement information.
We are still left with bubbles.
Possibly an accumulation of overlapping spurs/tips; more often real variation (e.g. SNPs).
Would like to keep that information where possible.
Could also be a long read. But we don't want to cut the long read in the middle.
We can also have a loop that goes back on itself (not a bubble).
So what's the difference?
Simulations: Tour Bus - error
Compared a 5 Mb chunk from each of 4 species: E. coli, yeast, C. elegans, human
Increased coverage from 0 to 50x
N50 on y-axis
First there is exponential growth as expected, but then the complexity causes trouble unless you have other kinds of information.
Not using paired-reads information
You see that the human is a little more complex than E. coli.
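For reference, the N50 statistic plotted here can be computed as follows (the standard definition, not anything Velvet-specific):

```python
def n50(lengths):
    """N50: the contig length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

So a single long contig dominating the assembly pushes N50 up sharply, which is why it grows so fast with coverage at first.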
Repeat resolution
Scaling up the hard way: chromosome X
Handling complexity of the graph
Reduced representation (Margulies et al.).
Reduced representation using restriction enzymes and gel to cut the problem into tractable bits.
Chunks were cut from the gel and each slice was then sequenced separately, or at least in different lanes.
Then the assembly problem is smaller and more localized instead of whole genome.
Combined re-mapping and de novo sequencing (Cheetham et al., Pleasance et al.).
Map to a reference or similar genome, then locally re-map the parts that are unique.
MAQ (Mapping and Assembly with Quality) and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches.
Reads may fail to map in a small region that has too many SNPs or breakpoint rearrangements.
Code parallelization (Birol et al., Jeffrey Cook).
Improved indexing (Zamin Iqbal and Mario Caccamo).
Use of intermediate remapping (Matthias Haimel).
Repeats in a de Bruijn graph
Rock band: Using long and short reads together
If all of the short-read paths lead from one long read (e.g. 100bp-1kb) to another long read, then the intervening sequence is determined by the short reads.
Long reads which are mapped seamlessly into the graph.
35bp reads at 50x, plus some long reads (100bp-1kb) at 0-20x
Played with the length
The longer your long reads, the more repeats can be bridged and the longer the contigs.
Different approaches to repeat resolution
Theoretical: spectral graph analysis
Equivalent to a Principal Component Analysis or a heat conduction model
Relies on a (massive) matrix diagonalization
Comprehensive: all data integrated at once
Traditional scaffolding
E.g. Arachne, Celera, BAMBUS
Heuristic approach similar to that used to build contigs in traditional overlap-layout-consensus
Build a big graph of pairwise connections, simplify, extract obvious linear components
Coverage is huge: each base pair may have to handle on the order of 1000 read-pairs.
Long reads are separated by a mess in the middle; check the possible paths through it and find which are acceptable.
Find all the read-pairs that start and end on the two long reads, then consider them all together, which is much more efficient.
5% or more of mate-pairs can be wrong
Hopefully will be able to filter them out.
In Velvet use a likelihood approach
Have a distance estimate
Given the coverage of contigs A and B, how many connecting pairs would we expect to see? If we see too few, we discard the connection.
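A toy version of this expected-vs-observed test (the window formula and the 10% threshold are illustrative assumptions, not Velvet's actual likelihood statistic):

```python
def expected_bridging_pairs(pair_coverage, insert_len, read_len, gap):
    """Crude expectation for the number of read-pairs bridging two
    contigs separated by `gap`: only inserts whose first read starts
    within a limited window can span the junction."""
    window = insert_len - 2 * read_len - gap
    if window <= 0:
        return 0.0
    return pair_coverage * window / insert_len

def keep_connection(observed, expected, min_fraction=0.1):
    """Discard connections supported by far fewer pairs than expected."""
    return expected > 0 and observed >= min_fraction * expected
```

With 5% or more of mate-pairs being wrong, a handful of stray links between unrelated contigs is normal, so the threshold has to tolerate noise without accepting spurious joins.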
In NGS assemblers:
Euler: for each pair of reads, find all possible paths from one read to another
ABySS: Same as above, but the read-pairs are bundled into node-to-node connections to reduce calculations
ALLPATHS: Same as above, but the search is limited to localized clouds around pre-computed scaffolds
Using the differences between insert lengths
Pebble: Handling paired-end reads
Pick one unique node and follow the paths leading out of it.
Some of the nodes reached are themselves unique; pick those out.
Hitting nodes in the vicinity of the starting contig gives distance information about the immediate neighborhood.
Pebble uses a greedy algorithm: find the way to the next unique node.
Having a more-precise insert size could help
Velvet assumed an insert-size spread of about 10%.
SOLiD inserts go up to 2kb; Solexa just 300bp
It is difficult to have enough DNA of a narrow range.
Don't love barcoding because it uses a lot of reagents.
Practical considerations
Colorspace
Di-base encoding has a 4-letter alphabet, but behaves very differently from sequence space
Double-encoded files use ACGT.
Be careful not to confuse it!
Different rules for complementarity
Direct conversion to sequence space is simple but wasteful
Conversion must therefore be done at the very end of the process, when the reads are aligned
You can then use the transition rules to detect errors. Typically it detects the error after 1 or 2 bp, then corrects.
SOLiD has developed tools
Different error models
When using different technologies, you have to take their different error models into account.
Trivial when doing OLC assembly
Much trickier when doing de Bruijn graph assembly, since k-mers are not assigned to reads
Different assemblers had different settings (e.g. Newbler)
454 had homopolymer issues in flow-space.
Illumina or ABI are different.
Newbler may be better for 454.
Pre-filtering the reads
Some assemblers have built-in filtering of the reads (e.g. Euler), but this is not universal
Efficient filtering of low quality bases can cut down on the computational cost (memory and time)
Beware that some assemblers require reads of identical length
Extensions to Velvet
Oases: de novo transcriptome assembly
How to study mRNA reads which do not map
De Bruijn graphs ~ splice graphs
People are using next generation sequencing generally because it's cheaper.
Would have just one path except for alternate splicing.
Pavel Pevzner mentioned in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply those algorithms developed for splice graphs to de Bruijn graphs.
Oases: Process
Create clusters of contigs:
Connecting reads
Connecting read pairs
Each genetic locus should produce a connected cluster of contigs.
Sometimes one cluster gets broken into two, but works pretty well.
Re-implement traditional algorithms to run on graphs, instead of a reference genome
Greedy transitive reduction (Myers, 2005)
Motif searches
Dynamic assembly of transcripts (Lee, 2003)
Oases: Output
Full length transcripts - handles alternative splicing.
Transcripts from different genes - handles varying coverage levels.
Alternative splicing events: when possible, distinguishes between common AS events.
Oases: Results
Oases: results of Drosophila PE-reads
Oases: examples
Found an unannotated gene with 7 exons
Assembler has no notion of the reference genome, but still found one example that mapped very well.
Mapping vs. de novo
Columbus: Velvet module
De Bruijn graph convention for next generation sequencing
For transcriptome, using k=21.
Can use reads as short as 25.
How many X coverage needed?
How many genes do you want to catch?
With genome, want long inserts generally.
With transcriptome, short inserts are more useful; they have been using tiny ~200bp inserts.
After presentation
Kevin on Wednesday will be talking about: