====== Lecture Notes for April 19, 2010 ======

===== Presentation =====

** {{:lecture_notes:20100419_banana_slug_seminar.ppt|Assembling and comparing genomes or transcriptomes using de Bruijn graphs}} ** \\ Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler

==== Overview ====

  * De Bruijn graphs
  * Repeat resolution
  * Practical considerations
  * Extensions to Velvet

==== De Bruijn graphs ====

=== Overlap-Layout-Consensus (traditional) ===

  * Traditional assemblers: Phrap, Arachne, Celera, etc.
  * Short reads: Edena
  * Generally more expensive computationally.
  * Overlaps only allow clean right-to-left and left-to-right overlaps.
  * Transitivity
  * Need to remove the redundant connections.
  * Well-studied graph theory problem
  * Use whole-read alignments

=== Quick Overlap-Layout-Consensus (OLC) example ===

  * Filter out all the reads that are identical or contained in another read. Only left-to-right overlaps, etc. are allowed.
  * Compute all their overlaps.
  * This gives many transitive, redundant connections.
  * Many algorithms are available for transitive reduction.
  * You obtain simplified paths through the reads.
  * In the example you get 4 big chunks of reads.
  * Hard to distinguish which reads belong to which copy of a repeat.
  * This is how contigs were first built.
  * Have to compare all reads against each other.
  * Quadratic, so this has become difficult.
  * Next-generation sequencing reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
  * At least the OLC method uses all the information in the read.

=== Comparison of different assembly techniques ===

== Greedy hash table searches (~2007) ==

  * Some examples: SSAKE, SHARCGS, VCAKE
  * Much faster and leaner than the previous approach
  * Less robust to variation and noise
  * Essentially simplified de Bruijn graph assemblers
  * Build a dictionary of words in the data set.
  * The example uses 4-mers, but typically 20-mers and longer are used.
  * For each word, create a node in the hash table.
  * Big data set, so typically only a binary marker records whether a word is present.
  * Sometimes a branch has two possible extensions.
  * The greedy approach is essentially a de Bruijn graph with local information only. In this structure you could have many paths through.
  * Can throw in reads of different lengths with no problems.

== De Bruijn graph assemblers (~2009) ==

  * Currently the most common for NGS: Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo
  * Pavel Pevzner popularized them in a paper from 2001.
  * Why do we use them?
  * High-throughput, cheap, short reads with errors
  * Basically a compromise between the two previous approaches:
  * Just a k-mer hash
  * ... but processed globally
  * Graph construction and correction is similar in all assemblers.
  * ABySS handles the parallel implementation.

=== Working with de Bruijn graphs in Velvet ===

== Fixing errors ==

  * Errors that are created in assembly create spurs/tips that can be easily removed.
  * When finding bubbles, remap the branch with the lower coverage back onto the branch with the higher coverage (so as not to lose the information).
  * Loops can represent tandem repeats.

== Creating de Bruijn graphs ==

  * First, concatenate the simple chains without branches into nodes, compacting each non-branching run into a single item.
  * The biochemistry of short reads means errors tend to be near the ends.
  * If you have errors in the second half of the reads, it creates millions of tips, but they are at least easy to remove later.
  * Each read is a path through the graph. The head of the read is still kept, so we're not losing the read placement information.
  * We are still left with bubbles.
  * Possibly an accumulation of overlapping spurs/tips, but more often variation (e.g. SNPs).
  * Would like to keep that information where possible.
  * Could also be a long read, and we don't want to cut the long read in the middle.
  * We can also have loops that go back on themselves (not a bubble).
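To make the k-mer dictionary and node-compaction steps above concrete, here is a minimal Python sketch. This is not Velvet's implementation; the reads, the k value and the function names are illustrative, and reverse complements and coverage tracking are ignored.

<code python>
from collections import defaultdict

def build_de_bruijn_graph(reads, k=21):
    """Collect every k-mer in the reads and link k-mers that overlap
    by k-1 bases (i.e. consecutive words within a read)."""
    edges = defaultdict(set)          # k-mer -> set of successor k-mers
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            nodes.add(kmer)
            if i > 0:
                edges[read[i - 1:i - 1 + k]].add(kmer)
    return nodes, edges

def compact_linear_paths(nodes, edges):
    """Merge non-branching runs of k-mers into single contig nodes;
    each successive k-mer in a run contributes its last base."""
    indegree = defaultdict(int)
    for src, dests in edges.items():
        for dst in dests:
            indegree[dst] += 1
    contigs = []
    for node in nodes:
        # only start contigs at nodes that are not internal to a run
        if indegree[node] == 1 and len(edges[node]) == 1:
            continue
        for nxt in edges[node]:
            contig = node
            while indegree[nxt] == 1 and len(edges[nxt]) == 1:
                contig += nxt[-1]
                nxt = next(iter(edges[nxt]))
            contig += nxt[-1]
            contigs.append(contig)
    return contigs

reads = ["ACGTACGTGACC", "CGTGACCTTAAG", "GACCTTAAGGCA"]
nodes, edges = build_de_bruijn_graph(reads, k=5)
print(compact_linear_paths(nodes, edges))   # ['ACGTACGTGACCTTAAGGCA']
</code>

With error-free, overlapping reads the whole set compacts into one contig; tips and bubbles caused by sequencing errors would show up as extra branches in ''edges''.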
=== So what's the difference? ===

  * Algebraic difference:
  * Reads in the OLC methods are atomic (i.e. an arc)
  * Reads in the de Bruijn graph are sequential paths through the graph
  * ... this leads to practical differences
  * De Bruijn graphs allow for a greater variety of overlaps
  * Overlaps in the OLC approach require a global alignment, not just a shared k-mer
  * OLC methods get error correction from long reads, while de Bruijn graphs require more work to correct.

=== Simulations: Tour Bus - error ===

  * Compare a 5 Mb chunk from 4 species: //E. coli//, yeast, //C. elegans//, human
  * Increased coverage from 0 to 50x
  * N50 on the y-axis
  * First there is exponential growth as expected, but then the complexity causes trouble unless you have other kinds of information.
  * Not using paired-read information
  * You see that the human is a little more complex than //E. coli//.

==== Repeat resolution ====

=== Scaling up the hard way: chromosome X ===

  * 548 million Solexa reads were generated from a flow-sorted human X chromosome.
  * Fit in 70GB of RAM
  * Numerous short contigs
  * Many contigs: 898,401 contigs
  * Short contigs: 260bp N50 (however a maximum of 6,956bp)
  * Overall length: 130Mb
  * How to handle repeats?
  * Moral: there are engineering issues to be resolved, but the complexity of the graph needs to be handled accordingly.

== Handling complexity of the graph ==

  * Reduced representation (Margulies et al.).
  * Reduced representation uses restriction enzymes and a gel to cut the problem into tractable bits.
  * Cut chunks from the gel and then sequence each gel slice separately, or at least in different lanes.
  * Then the assembly problem is smaller and more localized instead of whole-genome.
  * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleasance et al.).
  * Map to a reference or similar genome, then re-map locally the parts that are unique.
  * MAQ (Mapping and Assembly with Quality) and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches.
  * You can see reads not mapping in a small area with too many SNPs or breakpoint rearrangements.
  * Code parallelization (Birol et al., Jeffrey Cook).
  * Only ABySS has published yet.
  * Improved indexing (Zamin Iqbal and Mario Caccamo).
  * Fit whole-genome shotgun into 200GB
  * Use of intermediate remapping (Matthias Haimel).
  * Wanted to parallelize the second-level work of building scaffolds, but those are mostly engineering issues.

=== Repeats in a de Bruijn graph ===

  * The de Bruijn graph for a small organism looks like long-ish contigs, but still makes a hair-ball.

=== Rock Band: Using long and short reads together ===

  * If all of the paths along the short reads move from one long read (e.g. 100bp-1kb) to another long read, then you know the sequence from the short reads.
  * Long reads are mapped seamlessly into the graph.
  * 35bp reads at 50x, some long reads (100bp-1kb) at 0-20x
  * Played with the length
  * The longer your long reads are, the more bridgeable the graph is and the longer the contigs.
  * Someone might ask: "Why not just assemble the long reads?"
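The bridging idea can be sketched as a bounded search for paths between two nodes anchored by long reads: if exactly one chain of short-read contigs connects them, the region in between is resolved. This is only an illustration of the principle, not Velvet's Rock Band code; the toy graph, node names and depth limit are invented.

<code python>
def paths_between(graph, start, end, max_depth=20):
    """Enumerate simple paths from one long-read anchor node to another,
    following short-read edges up to a depth limit."""
    stack = [(start, [start])]
    found = []
    while stack:
        node, path = stack.pop()
        if node == end:
            found.append(path)
            continue
        if len(path) >= max_depth:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:          # keep paths simple (no cycles)
                stack.append((nxt, path + [nxt]))
    return found

# toy contig graph: anchors A and D, with a branch node R in between
graph = {"A": ["R"], "R": ["B", "C"], "B": ["D"], "C": ["E"]}
bridges = paths_between(graph, "A", "D")
if len(bridges) == 1:
    print("unique bridge:", " -> ".join(bridges[0]))   # A -> R -> B -> D
</code>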
=== Different approaches to repeat resolution ===

== Theoretical: spectral graph analysis ==

  * Equivalent to a Principal Component Analysis or a heat conduction model
  * Find the vector that threads its way through the graph
  * Relies on a (massive) matrix diagonalization
  * Comprehensive: all data integrated at once
  * Takes all the data at once in one huge matrix, so it is not too affected by noise. No one has actually tried this approach as it requires too much computation.

== Traditional scaffolding ==

  * E.g. Arachne, Celera, BAMBUS
  * Heuristic approach similar to that used in traditional overlap-layout-consensus contigs
  * The traditional approach is contigs with paired-end reads.
  * Find chains of contigs connected by read pairs.
  * Build a big graph of pairwise connections, simplify it, and extract the obvious linear components.
  * Your coverage is huge.
  * Each base pair has to handle on the order of 1000x coverage in read pairs. Long reads are separated by a mess in the middle; check the possible paths and find which are acceptable.
  * Find all the reads that start and end on the two long reads, then consider them all together; this is much more efficient.
  * 5% or more of mate pairs can be wrong.
  * Hopefully we will be able to filter them out.
  * Velvet uses a likelihood approach.
  * It has a distance estimate.
  * Given the coverage of A and B, how many pairs would we expect to see? If we see too few, we discard the connection.

== In NGS assemblers ==

  * Euler: for each pair of reads, find all possible paths from one read to the other
  * ABySS: same as above, but the read pairs are bundled into node-to-node connections to reduce calculations
  * ALLPATHS: same as above, but the search is limited to localized clouds around pre-computed scaffolds
  * ALLPATHS did better by using small localized problems of a single small scaffold for parallelization.

== Using the differences between insert lengths ==

  * The Shorty algorithm uses the variance between read pairs anchored on a common contig or k-mer.
  * The projected image of the other ends creates the possibility of assembling a smaller set, the projection from a node.

=== Pebble: Handling paired-end reads ===

  * Pick one unique node and follow its paths.
  * Velvet used the reverse idea, the projections onto a node.
  * Some of the nodes are unique; pick out those nodes.
  * Hit the nodes in the vicinity of the starting contig, and get distance information on the immediate vicinity.
  * Uses a greedy algorithm: find the way to the next unique node.
  * Having a more precise insert size could help.
  * Velvet assumed roughly a 10% spread around the given insert size.
  * Solexa might have been different.
  * SOLiD goes up to 2kb, Solexa just 300bp.
  * It is difficult to get enough DNA within a narrow size range.
  * They don't love barcoding because it uses a lot of reagents.

==== Practical considerations ====

=== Colorspace ===

  * Di-base encoding has a 4-letter alphabet, but very different behavior from sequence space.
  * Double-encoded files use ACGT.
  * Be careful not to confuse them!
  * Different rules for complementarity
  * Just reverse the colors, don't complement them.
  * Direct conversion to sequence space is simple but wasteful.
  * One error messes up all the remaining bases. This is a huge loss, so do everything in colorspace and make a complete assembly, then finally convert to base calls later.
  * Conversion must therefore be done at the very end of the process, when the reads are aligned.
  * You can then use the transition rules to detect errors. Typically an error is detected after 1 or 2 bp and then corrected.
  * SOLiD has developed tools for this.
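As an illustration of the di-base transition rules mentioned above, here is a small Python sketch assuming the standard SOLiD color code; it is not any vendor's tool, and the example read is made up. It shows why reverse complementation is just a reversal in colorspace, and why a single color error corrupts every downstream base when decoding.

<code python>
# transition rules: for a previous base, color i selects the next base
_COLOR = {"A": "ACGT", "C": "CATG", "G": "GTAC", "T": "TGCA"}

def decode_colorspace(primer_base, colors):
    """Convert a colorspace read (primer base + digits) to base space.
    Each color, combined with the previous base, determines the next base,
    so one wrong color shifts every base that follows it."""
    seq, prev = [], primer_base
    for c in colors:
        prev = _COLOR[prev][int(c)]
        seq.append(prev)
    return "".join(seq)

def reverse_colorspace(colors):
    # the reverse complement of a colorspace read is simply the reversed colors
    return colors[::-1]

print(decode_colorspace("T", "320010"))   # AGGGTT
print(reverse_colorspace("320010"))       # 010023
</code>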
=== Different error models ===

  * When using different technologies, you have to take their different error models into account.
  * Trivial when doing OLC assembly
  * Much more tricky when doing de Bruijn graph assembly, since k-mers are not assigned to reads
  * Different assemblers have different settings (e.g. Newbler).
  * 454 had homopolymer issues in flow-space.
  * Illumina or ABI errors are different.
  * This makes it hard to handle errors.
  * Especially in de Bruijn graphs, applying the correction rules becomes much more complex.
  * Newbler may be better for 454.

=== Pre-filtering the reads ===

  * Some assemblers have built-in filtering of the reads (e.g. Euler), but it is not a generality.
  * Efficient filtering of low-quality bases can cut down on the computational cost (memory and time).
  * User-specific, data-space-specific solutions
  * No tried-and-true consistent approach
  * Beware that some assemblers require reads of identical length.

==== Extensions to Velvet ====

** Oases: //de novo// transcriptome assembly ** \\ How to study mRNA reads which do not map

=== De Bruijn graphs ~ splice graphs ===

  * People are using next-generation sequencing generally because it is cheaper.
  * There would be just one path except for alternative splicing.
  * Pavel Pevzner mentioned in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply the algorithms developed for splice graphs to de Bruijn graphs.

=== Oases: Process ===

  * Create clusters of contigs:
  * Connecting reads
  * Connecting read pairs
  * A genetic locus should create a connected cluster of contigs.
  * Sometimes one cluster gets broken into two, but it works pretty well.
  * Re-implement traditional algorithms to run on graphs, instead of on a reference genome:
  * Greedy transitive reduction (Myers, 2005)
  * Motif searches
  * Dynamic assembly of transcripts (Lee, 2003)
  * Find the path of highest weight.
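A minimal sketch of the "path of highest weight" step, assuming a toy splice graph of exon nodes weighted by coverage; this is not Oases' actual implementation, and the node names and weights are invented.

<code python>
def heaviest_path(nodes, edges, weight):
    """Dynamic programming over a DAG (nodes given in topological order):
    best[v] = weight[v] + max(best[u] for each edge u -> v)."""
    best, back = {}, {}
    for v in nodes:
        preds = [u for u in nodes if v in edges.get(u, ())]
        best[v], back[v] = weight[v], None
        for u in preds:
            if best[u] + weight[v] > best[v]:
                best[v] = best[u] + weight[v]
                back[v] = u
    # trace back from the best-scoring node
    end = max(best, key=best.get)
    path = [end]
    while back[path[-1]] is not None:
        path.append(back[path[-1]])
    return list(reversed(path)), best[end]

# toy splice graph: exon2 is skipped in one isoform
nodes = ["exon1", "exon2", "exon3"]
edges = {"exon1": ["exon2", "exon3"], "exon2": ["exon3"]}
weight = {"exon1": 50, "exon2": 10, "exon3": 45}
print(heaviest_path(nodes, edges, weight))   # (['exon1', 'exon2', 'exon3'], 105)
</code>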
=== Oases: Output ===

  * Full-length transcripts - handles alternative splicing.
  * With a skipped exon and one artifact.
  * Transcripts from different genes - handles varying coverage levels.
  * Alternative splicing events: when possible, it distinguishes between common AS events.
  * Find donor and acceptor sites on or around junctions.

=== Oases: Results ===

  * Benchmarked within the RGASP competition
  * Already 100+ users, some reporting very promising results:
  * N50 = 1886 bp (Monica Britton, UC Davis)
  * 99.8% of aligned transcripts against UniGene (Micha Bayer, Scottish Crop Research Institute)

=== Oases: Results for Drosophila PE reads ===

  * The number of transcripts and loci are in the ballpark of what they were expecting.
  * Read length 36, k = 21, insert size 200
  * ~90% map to the genome.

=== Oases: Examples ===

  * Found an unannotated gene with 7 exons.
  * The assembler has no notion of the reference genome, but it still found one example that mapped very well.

=== Mapping vs. de novo ===

  * Mapping
  * Many good, mature tools available
  * Easy to compute
  * Simpler to analyze, comes with genomic coordinates
  * //De novo//
  * Harder than mapping
  * May require special hardware
  * No reference needed
  * Handles novel structures without //a priori// assumptions
  * How to get the best of both worlds?

=== Columbus: Velvet module ===

  * Solution: the reference-based de Bruijn graph
  * Just throw in the reference genome as long reads.
  * Chucking in long reads and getting short contigs out.
  * Preserves the structure of the reference sequences; they are not allowed to overlap with themselves.
  * What they really wanted was to start from a reference genome and then handle things like a small insertion.
  * Integrates alignment information (SAM/BAM files from BWA, Bowtie)
  * Edits, breaks up, reconnects, and extends the "alignment contigs" as necessary
  * Applications:
  * Comparative transcriptome assembly
  * Cancer RNA-seq
  * Comparative assembly

=== De Bruijn graph conventions for next-generation sequencing ===

  * For transcriptomes, they use k=21.
  * Can use reads as short as 25bp.
  * How much coverage is needed?
  * How many genes do you want to catch?
  * With a genome, you generally want long inserts.
  * With a transcriptome, short inserts are more useful; they have been using 200bp inserts.

===== After presentation =====

Kevin on Wednesday will be talking about:

  * Using SOLiD data for ordering and orientation of //Pog// contigs
  * Special-purpose scripts
  * The data was so noisy
