====== Lecture Notes for April 19, 2010 ======

Need to fill in any ** INCOMPLETE ** portions

{{:lecture_notes:20100419_banana_slug_seminar.ppt|Daniel's Powerpoint}}

===== Presentation =====

==== Overview ====

  * De Bruijn graphs
  * Repeat resolution
  * Practical considerations
  * Extensions to Velvet

==== De Bruijn graphs ====

Overlap-Layout-Consensus (traditional)
  * Traditional assemblers: Phrap, Arachne, Celera, etc.
  * Short reads: Edena
  * Generally more expensive computationally
  * Overlaps only allow clean right-to-left and left-to-right overlaps
  * Transitivity
    * Need to remove the redundant connections
    * Well-studied graph theory problem

Comparison of different assembly techniques
  * Greedy hash table searches (~2007)
    * Some examples: SSAKE, SHARCGS, VCAKE
    * ** INCOMPLETE **
  * De Bruijn graph assemblers (~2009)
    * Currently the most common for NGS: Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo
    * ** INCOMPLETE **
    * Can throw in reads of different lengths with no problems

In Velvet:
  * Errors introduced during assembly create tips (spurs) that can be easily removed
  * When finding bubbles, remap the branch with the lower coverage back onto the branch with the higher coverage (so as not to lose the information)
  * When encountering loops, ** INCOMPLETE **

So what's the difference?
  * Algebraic difference:
    * Reads in the OLC methods are atomic (i.e. an arc)
    * Reads in the DB graph are sequential paths through the graph
  * ... this leads to practical differences
    * DB graphs allow for greater variety in overlap ** INCOMPLETE **
    * ** INCOMPLETE **

==== Repeat resolution ====

Scaling up the hard way: chromosome X
  * 548 million Solexa reads were generated from a flow-sorted human X chromosome
  * Fit in 70GB of RAM
  * Many contigs: 898,401 contigs
  * Short contigs: 260bp N50 (however, a max of 6,956bp)
  * Overall length: 130Mb
  * Moral: there are engineering issues to be resolved, but the complexity of the graph needs to be handled accordingly

Approaches to scaling:
  * Reduced representation (Margulies et al.)
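As a reminder of how the contig statistic quoted above is defined, here is a minimal sketch (the function name ''n50'' is mine, not from the talk): N50 is the largest length L such that contigs of length >= L together cover at least half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Toy example: total = 20bp, so N50 is reached once >= 10bp is covered.
print(n50([2, 3, 4, 5, 6]))  # prints 5
```

So a 260bp N50 for a 130Mb assembly means half the assembled sequence sits in contigs no longer than a few hundred bases.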
  * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleasance et al.)
  * Code parallelization (Birol et al., Jeffrey Cook)
  * Improved indexing (Zamin Iqbal and Mario Caccamo)
  * Use of intermediate remapping (Matthias Haimei)

Rock Band: using long and short reads together
  * If all of the paths along the short reads move from one long read (e.g. 100bp-1kb) to another long read, then you know the sequence from the short reads

Different approaches to repeat resolution
  * Theoretical: spectral graph analysis
    * Equivalent to a Principal Component Analysis or a heat conduction model
    * Relies on a (massive) matrix diagonalization
    * Comprehensive: all data integrated at once
  * Traditional scaffolding
    * E.g. Arachne, Celera, BAMBUS
    * Heuristic approach similar to ** INCOMPLETE **
  * In NGS assemblers:
    * Euler: for each pair of reads, find all possible paths from one read to another
    * ABySS: same as above, but the read pairs are bundled into node-to-node connections to reduce calculations
    * ALLPATHS: same as above, but the search is limited to localized clouds around pre-computed scaffolds
    * Using the differences in ** INCOMPLETE **
    * ** INCOMPLETE **

Pebble: handling paired-end reads
  * Pick one unique node and follow its paths
  * Some of them are unique; pick out those nodes

==== Practical considerations ====

Colorspace
  * Di-base encoding has a 4-letter alphabet, but very different behavior from sequence space
    * Different rules for complementarity
  * Direct conversion to sequence space is simple but wasteful
    * One error messes up all the remaining bases
  * Conversion must therefore be done at the very end of the process, when the reads are aligned
    * You can then use the transition rules to detect errors

Different error models
  * When mixing different technologies, you have to take their different error models into account
  * Trivial when doing OLC assembly
  * Much more tricky when doing De Bruijn graph assembly, since k-mers are not assigned to reads
  * Different assemblers had different settings (e.g. Newbler)

Pre-filtering the reads
  * Some assemblers have built-in filtering of the reads (e.g. Euler), but this is not the general case
  * Efficient filtering of low-quality bases can cut down on the computational cost (memory and time)
  * Beware that some assemblers require reads of identical length

==== Extensions to Velvet ====

Oases: //de novo// transcriptome assembly (how to study mRNA reads which do not map)

De Bruijn graphs ~ splice graphs
  * Pavel Pevzner mentioned that De Bruijn graphs are more or less like splice graphs

Oases: process
  * Create clusters of contigs:
    * Connecting reads
    * Connecting read pairs
  * Re-implement traditional algorithms to run on graphs, instead of a reference genome
    * Greedy transitive reduction (Myers, 2005)
    * Motif searches
    * Dynamic assembly of transcripts (Lee, 2003)

Oases: output
  * Full-length transcripts - handles alternative splicing
  * Transcripts from different genes - handles varying coverage levels
  * Alternative splicing events: when possible, distinguishes between common AS events

Oases: results
  * ** INCOMPLETE **

Oases: results on Drosophila PE reads
  * The number of transcripts and loci are in the ballpark of what was expected

Mapping vs. //de novo//
  * Mapping
    * Many tools available
    * Easy to compute
    * Simpler to analyze
  * //De novo//
    * No reference needed
    * Handles novel structures without //a priori// knowledge

Columbus: Velvet module
  * Solution: the reference-based De Bruijn graph
  * Preserves the structure of the reference sequences
  * Integrates alignment information (SAM, BAM files)
  * Edits, breaks up, reconnects ** INCOMPLETE **
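For reference, the basic k-mer graph construction that Velvet, Columbus, and the other De Bruijn assemblers above build on can be sketched in a few lines. This is a toy illustration of the idea only, not Velvet's actual implementation (which uses compressed, bidirected graphs and tracks read paths through the nodes):

```python
from collections import defaultdict

def debruijn_graph(reads, k):
    """Build a toy De Bruijn graph: nodes are (k-1)-mers, and each
    k-mer observed in a read adds an edge from its prefix to its suffix.
    """
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return edges

# Reads of different lengths can be thrown in with no problems,
# since everything is reduced to k-mers before graph construction.
g = debruijn_graph(["ATGGCGT", "GGCGTGCAA"], k=4)
print(sorted(g["GGC"]))  # the 4-mer GGCG yields the edge GGC -> GCG
```

This also shows why errors are costly in this representation: a single miscalled base creates up to k spurious k-mers, which appear in the graph as the low-coverage tips and bubbles the talk describes.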

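The di-base transition rules mentioned in the colorspace section above can be made concrete with a small decoder. The table below is the standard SOLiD two-bit XOR encoding (each color is the XOR of the 2-bit codes of two adjacent bases); the function name is mine:

```python
# SOLiD colorspace: each color (0-3) encodes the transition between two
# adjacent bases. Row = previous base, column index = color value.
TRANSITIONS = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}

def decode_colorspace(primer_base, colors):
    """Convert a colorspace read to sequence space, base by base.

    Each decoded base depends on the previous one, which is exactly why
    early conversion is wasteful: a single wrong color corrupts every
    base downstream of it.
    """
    seq = []
    prev = primer_base
    for color in colors:
        prev = TRANSITIONS[prev][int(color)]
        seq.append(prev)
    return "".join(seq)

print(decode_colorspace("T", "0320"))  # prints TAGG
```

Flipping just the first color (''"1320"'' instead of ''"0320"'') changes every decoded base, illustrating the error-propagation point above and why conversion is deferred until after alignment.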
lecture_notes/04-19-2010.1271721684.txt.gz · Last modified: 2010/04/20 00:01 by galt