====== Lecture Notes for April 19, 2010 ======

Need to fill in any ** INCOMPLETE ** portions

{{:lecture_notes:20100419_banana_slug_seminar.ppt|Daniel's Powerpoint}}

===== Presentation =====

==== Overview ====

  * De Bruijn graphs
  * Repeat resolution
  * Practical considerations
  * Extensions to Velvet

==== De Bruijn graphs ====

Overlap-Layout-Consensus (traditional)
  * Traditional assemblers: Phrap, Arachne, Celera, etc.
  * Short reads: Edena
  * Generally more expensive computationally
  * Overlaps only allow clean right-to-left and left-to-right overlaps
  * Transitivity
    * Need to remove the redundant connections
    * Well-studied graph theory problem

Comparison of different assembly techniques
  * Greedy hash table searches (~2007)
    * Some examples: SSAKE, SHARCGS, VCAKE
    * ** INCOMPLETE **
  * De Bruijn graph assemblers (~2009)
    * Currently the most common for NGS: Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo
    * ** INCOMPLETE **
    * Can throw in reads of different lengths with no problems

In Velvet:
  * Errors created during assembly produce tips (spurs) that can be easily removed
  * When finding bubbles, remap the branch with the lower coverage back onto the branch with the higher coverage (so as not to lose the information)
  * When encountering loops, ** INCOMPLETE **

So what's the difference?
  * Algebraic difference:
    * Reads in the OLC methods are atomic (i.e. an arc)
    * Reads in the DB graph are sequential paths through the graph
  * …this leads to practical differences
    * DB graphs allow for greater variety in overlap ** INCOMPLETE **
    * ** INCOMPLETE **

==== Repeat resolution ====

Scaling up the hard way: chromosome X
  * 548 million Solexa reads were generated from a flow-sorted human X chromosome.
  * Fit in 70GB of RAM
  * Many contigs: 898,401 contigs
  * Short contigs: 260bp N50 (however, a maximum of 6,956bp; see the N50 sketch after this list)
  * Overall length: 130Mb
  * Moral: there are engineering issues to be resolved, but the complexity of the graph needs to be handled accordingly.
  * Reduced representation (Margulies et al.)
  * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleasance et al.)
  * Code parallelization (Birol et al., Jeffrey Cook)
  * Improved indexing (Zamin Iqbal and Mario Caccamo)
  * Use of intermediate remapping (Matthias Haimel)
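For reference, N50 is the contig length such that contigs at least that long contain half of the total assembled bases. A minimal sketch of the computation (the contig lengths below are made up, not the chrX figures):

<code python>
def n50(lengths):
    """Return the N50 of a list of contig lengths: the largest length L such
    that contigs of length >= L cover at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy contig lengths (made up), in bp.
contigs = [900, 800, 700, 600, 500, 400, 300, 200, 100]
print(n50(contigs))  # prints 700
</code>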
Rock Band: using long and short reads together
  * If all of the paths along the short reads move from one long read (e.g. 100bp-1kb) to another long read, then you know the sequence from the short reads

Different approaches to repeat resolution
  * Theoretical: spectral graph analysis
    * Equivalent to a Principal Component Analysis or a heat conduction model
    * Relies on a (massive) matrix diagonalization
    * Comprehensive: all data integrated at once
  * Traditional scaffolding
    * E.g. Arachne, Celera, BAMBUS
    * Heuristic approach similar to ** INCOMPLETE **
  * In NGS assemblers:
    * Euler: for each pair of reads, find all possible paths from one read to another
    * ABySS: same as above, but the read pairs are bundled into node-to-node connections to reduce calculations
    * ALLPATHS: same as above, but the search is limited to localized clouds around pre-computed scaffolds
    * Using the differences in ** INCOMPLETE **
    * ** INCOMPLETE **

Pebble: handling paired-end reads
  * Pick one unique node and follow its paths
  * Some of them are unique; pick out those nodes

==== Practical considerations ====

Colorspace
  * Di-base encoding has a 4-letter alphabet, but very different behavior from sequence space
  * Different rules for complementarity
  * Direct conversion to sequence space is simple but wasteful
    * One error messes up all the remaining bases
  * Conversion must therefore be done at the very end of the process, when the reads are aligned
  * You can then use the transition rules to detect errors

Different error models
  * When mixing technologies, you have to take each technology's error model into account
  * Trivial when doing OLC assembly
  * Much trickier when doing De Bruijn graph assembly, since k-mers are not assigned to reads
  * Different assemblers have different settings (e.g. Newbler)

Pre-filtering the reads
  * Some assemblers have built-in filtering of the reads (e.g. Euler), but this is not general
  * Efficient filtering of low-quality bases can cut down on the computational cost (memory and time)
  * Beware that some assemblers require reads of identical length

==== Extensions to Velvet ====

Oases: //de novo// transcriptome assembly (how to study mRNA reads which do not map)

De Bruijn graphs ~ splice graphs
  * Pavel Pevzner mentioned that De Bruijn graphs are more or less like splice graphs

Oases: process
  * Create clusters of contigs:
    * Connecting reads
    * Connecting read pairs
  * Re-implement traditional algorithms to run on graphs, instead of a reference genome
    * Greedy transitive reduction (Myers, 2005)
    * Motif searches
    * Dynamic assembly of transcripts (Lee, 2003)

Oases: output
  * Full-length transcripts - handles alternative splicing.
  * Transcripts from different genes - handles varying coverage levels.
  * Alternative splicing events: when possible, distinguishes between common AS events.

Oases: results
  * ** INCOMPLETE **

Oases: results on Drosophila PE reads
  * The number of transcripts and loci are in the ballpark of what they were expecting

Mapping vs. //de novo//
  * Mapping
    * Many tools available
    * Easy to compute
    * Simpler to analyze
  * //De novo//
    * No reference needed
    * Handles novel structures without //a priori// knowledge

Columbus: Velvet module
  * Solution: the reference-based de Bruijn graph
  * Preserves the structure of the reference sequences
  * Integrates alignment information (SAM, BAM file)
  * Edits, breaks up, reconnects ** INCOMPLETE **

===== Galt's Notes =====

Assembling and comparing genomes or transcriptomes using de Bruijn graphs. Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler. Slides will be made available.

Talking about:
  * ways to build graphs
  * handling repeats
  * practical considerations
  * extensions to Velvet

==== OLC ====

The traditional approach is overlap-layout-consensus: Phrap, etc.\\
Short reads: Edena\\
More expensive computationally. Uses whole-read alignments.
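To make the overlap step concrete, here is a minimal sketch of the naive all-against-all suffix-prefix overlap computation, which is where the quadratic cost discussed below comes from. The reads and the minimum overlap length are made up, and a real OLC assembler would use an index and tolerate mismatches rather than exact comparison:

<code python>
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of read a that exactly matches a prefix of read b."""
    best = 0
    for length in range(min_len, min(len(a), len(b)) + 1):
        if a[-length:] == b[:length]:
            best = length
    return best

# Toy reads (made up); every ordered pair is compared, hence the quadratic cost.
reads = ["ACGTTGCA", "TGCAAGGT", "AGGTCCTA"]
for a in reads:
    for b in reads:
        if a is not b:
            ov = suffix_prefix_overlap(a, b)
            if ov:
                print(a, "->", b, "overlap", ov)
</code>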
=== Quick OLC example ===

Filter out all the reads that are identical to or self-contained in another read. Only left-to-right etc. overlaps are allowed. Compute all their overlaps. You get many transitive, redundant connections. Many algorithms are available for transitive reduction. You obtain simplified paths through the reads. You get 4 big chunks of reads from the example. It is hard to distinguish which reads belong to which copy of the repeat. This is how contigs were first built. You have to compare all reads against each other, which is quadratic; this has become difficult. NGS reads are short, and overlaps between reads are short. They are getting a little longer with newer technology. The OLC method at least uses all of the information in the read.

==== DBG ====

De Bruijn graphs, which Pavel Pevzner popularized in a 2001 paper. Why do we use them? High-throughput, cheap, short reads with errors. Around 2007 there were many next-gen papers - SSAKE, SHARCGS, VCAKE - that use a k-mer hash; much faster and leaner, but you get noise. Build a dictionary of words in the data set. His example uses 4-mers, but typically 20-mers and longer are used. For each, create a node in the hash table. It is a big dataset, so typically a binary marker says whether a k-mer is present. Sometimes at a branch there are two possible extensions. The greedy approach is essentially a De Bruijn graph with local information only. In this structure you could have many paths through, and reads of different lengths can be used. Later Euler, ALLPATHS, Velvet, ABySS, and SOAPdenovo came out. They souped up the k-mer hash, but processed it more globally. Graph construction and correction is similar in all assemblers. ABySS handles a parallel implementation.

First, concatenate the simple chains without branches into nodes, which compacts the non-branching runs into a single item. The biochemistry of short reads means errors tend to be near the ends. If you have errors in the second half of the reads, they create millions of tips, but those are at least easy to remove later. Each read is a path through the graph. The head of the read is still kept, so we're not losing the read placement information. We are still left with bubbles: possibly an accumulation of overlapping spur-tips, but more often variation, e.g. a SNP. We would like to keep that information where possible. It could also be a long read, and we don't want to cut the long read in the middle. When you find a bubble in Velvet, remap it back onto the higher-coverage branch. A loop that goes back on itself (not a bubble), e.g. tandem repeats, is a separate case. OLC has atomic arcs; the DBG has a greater variety of overlaps. OLC gets error correction from long reads; the DBG requires more work to correct. A small sketch of the k-mer dictionary and graph construction follows below.
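A minimal sketch of the k-mer dictionary and graph construction described above, assuming error-free toy reads and k=4 (real assemblers use k of roughly 20 or more, hash the k-mers compactly, and handle reverse complements). Here nodes are (k-1)-mers and edges are the observed k-mers, which is one common convention:

<code python>
from collections import defaultdict

def build_debruijn(reads, k=4):
    """Build a De Bruijn graph: nodes are (k-1)-mers, edges are k-mers seen in the reads."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])   # (k-1)-mer prefix -> (k-1)-mer suffix
    return edges

# Toy reads (made up); reads of different lengths are fine, as noted above.
reads = ["ACGTACGT", "CGTACG", "GTACGTT"]
graph = build_debruijn(reads, k=4)
for node, nexts in sorted(graph.items()):
    print(node, "->", ", ".join(sorted(nexts)))
</code>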
==== Tour Bus ====

Errors. Compare a 5Mb chunk from 4 species: E. coli, yeast, C. elegans, human. Coverage is increased from 0 to 50x, with N50 on the y-axis. At first there is exponential growth, as expected; then the complexity causes trouble, unless you have other kinds of information. This is not using paired-read information. You see that human is a little more complex than E. coli.

Scaling up the hard way: chrX. Half a billion reads from flow-sorted human chrX. Fit in 70GB of RAM. Overall length 130Mb. Short contigs, and numerous. How to handle the repeats?

**Margulies** used reduced representation: restriction enzymes and a gel cut the problem into tractable bits. Cut chunks from the gel, then sequence each gel slice separately, or at least in different lanes. Then the assembly problem is smaller and more localized instead of whole-genome.

Combined re-mapping and de novo sequencing (**Cheetham and Pleasance**). Map to a reference or similar genome, then re-map locally the stuff that's unique. MAQ and BWA only allow 3 or 4 mismatches, so you can see reads not mapping in a small area with too many SNPs or a breakpoint rearrangement.

Code parallelization (**Birol and Cook**). Only ABySS has published yet.

Improved indexing: fit WGS onto 200GB.

Use of intermediate re-mapping (**Haimel**). Wanted to parallelize the second-level work of building scaffolds. But those are mostly engineering issues. The DBG for a small organism looks like long-ish contigs, but still makes a hair-ball.

==== Rock Band ====

Use long and short reads together: long reads which are mapped seamlessly into the graph. 35bp reads at 50x, plus some long reads (100bp-1kb) at 0-20x. They played with the length: the longer your long reads are, the more bridgeable the graph is and the longer the contigs. Why not just assemble the long reads, someone might ask? It is a different approach.

Spectral graph analysis: find the vector that threads its way through. It has a fancy theory with a massive matrix diagonalization. It takes all the data at once in one huge matrix and is not too affected by noise. No one has actually tried this approach - too much computation.

The traditional approach is contigs with paired-end reads. Find chains connected by contig reads. Your coverage is huge: each basepair has to handle 1000x coverage of read pairs. Long reads are separated by a mess in the middle. Check possible paths and find which are acceptable. Find all the reads that start and end on the two long reads, then consider them all together - much more efficient. 5% or more of mate pairs can be wrong; hopefully they can be filtered out. In Velvet a likelihood approach is used: given a distance estimate and the coverage of A and B, how many read pairs would we expect to see? If we see too few, we discard the connection. ALLPATHS did better by using small localized problems of a single small scaffold for parallelization. The Shorty algorithm uses the variance between read pairs anchored on a common contig on a k-mer; the projected image of the other ends creates the possibility of assembling a smaller set, the projection from a node.

==== PEBBLE ====

Velvet used the reverse idea: the projections to a node. Hit the nodes in the vicinity of the starting contig and get distance information on the immediate vicinity. It is a greedy algorithm: find the way to the next unique node. Having a more precise insert size could help. Velvet assumed roughly a 10% spread in the size distribution. Solexa might have been different: SOLiD inserts go up to 2kb, Solexa just 300bp. It is difficult to get enough DNA of a narrow size range. They don't love barcoding because it uses a lot of reagents.

==== COLOR-SPACE ====

A 4-letter alphabet. Double-encoded files use ACGT - be careful not to confuse them. There are different rules for complementarity: just reverse, don't complement. The assemblers and mappers should in theory adapt to this. You could convert to base space, but one error will ruin the rest of the read. This is a huge loss, so do everything in colorspace, make a complete assembly, and only convert to base calls at the end. Typically an error is detected after 1 or 2 bp and then corrected. SOLiD has developed tools. (A small decoding sketch follows at the end of this section.)

Different error models: 454 had homopolymer issues (flowspace); Illumina and ABI are different. This makes it hard to handle errors, especially in a DBG, where applying the correction rules becomes much more complex. Newbler may be better for 454.

Pre-filtering reads: some assemblers have built-in filtering of reads. Filtering low-quality bases can cut down on the computational cost. These are user-specific, data-specific solutions; there is no tried-and-true consistent approach. Some assemblers require reads of identical lengths.
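As an illustration of the transition rules above, here is a minimal sketch of decoding a colorspace read, assuming the standard SOLiD di-base code (each color is the XOR of the 2-bit codes of adjacent bases) and a known leading base. It also shows why converting to base space too early is costly: a single color error changes every base downstream.

<code python>
# Standard SOLiD di-base code: color = XOR of the 2-bit base codes (A=0, C=1, G=2, T=3).
BASE2BIT = {"A": 0, "C": 1, "G": 2, "T": 3}
BIT2BASE = "ACGT"

def decode_colorspace(first_base, colors):
    """Decode a colorspace read (leading base plus a list of colors 0-3) into bases."""
    bases = [first_base]
    for color in colors:
        bases.append(BIT2BASE[BASE2BIT[bases[-1]] ^ color])
    return "".join(bases)

# Toy read (made up): primer base 'T' followed by colors.
print(decode_colorspace("T", [3, 2, 0, 1]))  # correct decoding
# Flipping a single early color changes every base after it:
print(decode_colorspace("T", [3, 3, 0, 1]))
</code>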
----

==== OASES ====

De-novo transcriptome assembly. People are using NGS generally because it's cheaper. There would be just one path through a locus except for alternative splicing. Pevzner in 2002 mentioned that the DBG is equivalent to a splice graph, so the algorithms developed for splice graphs can be applied to the DBG.

=== Oases Process ===

Each genetic locus should create a connected cluster of contigs. Sometimes one cluster gets broken into two, but it works pretty well. Dynamic assembly: find the path of highest weight (see the sketch below).
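The "path of highest weight" step can be pictured as a maximum-weight path search on a directed acyclic graph of contigs. A minimal sketch with made-up contig names and weights (only an illustration of the idea; the actual Oases scoring, based on read support, is not reproduced here):

<code python>
from collections import defaultdict

def heaviest_path(edges):
    """Max-weight path in a DAG given as {node: [(next_node, weight), ...]}."""
    nodes = set(edges) | {v for outs in edges.values() for v, _ in outs}
    order, seen = [], set()

    def visit(u):                     # depth-first post-order: successors first
        if u in seen:
            return
        seen.add(u)
        for v, _ in edges.get(u, []):
            visit(v)
        order.append(u)

    for u in nodes:
        visit(u)

    best = {u: (0, None) for u in nodes}      # node -> (best downstream weight, next hop)
    for u in order:                           # successors are already computed
        for v, w in edges.get(u, []):
            cand = w + best[v][0]
            if cand > best[u][0]:
                best[u] = (cand, v)

    start = max(nodes, key=lambda u: best[u][0])
    path = [start]
    while best[path[-1]][1] is not None:
        path.append(best[path[-1]][1])
    return best[start][0], path

# Toy contig graph (made-up names; weights stand in for read support).
graph = {"c1": [("c2", 5), ("c3", 2)], "c2": [("c4", 3)], "c3": [("c4", 7)]}
print(heaviest_path(graph))   # -> (9, ['c1', 'c3', 'c4'])
</code>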
=== Oases Output ===

With a skipped exon, and one artifact. It handles contigs with different coverage levels. Find donor and acceptor sites on or around junctions.

Drosophila reads: read length 36, k=21, insert size 200. 90% map to the genome. Found an unannotated gene with 7 exons. The assembler has no notion of the reference genome, but it still found one example that mapped very well.

==== COLUMBUS ====

Mapping: either map, or do de novo. There are many good, mature mapping tools; mapping is easier to analyze and comes with genomic coordinates. De novo is harder and may require special hardware, but it works without a reference and handles novel structures. How to get the best of both worlds? Just throw in the reference genome as long reads - chucking in long reads and getting short contigs. What they really wanted was to start from a reference genome, and then handle a difference as a little insertion. It takes a DBG and SAM/BAM alignments (BWA, Bowtie). It keeps the reference sequences, which are not allowed to overlap with themselves. They are starting with comparative transcriptome assembly. Would like to ? cancer. DBG conventions for NGS: for transcriptome they are using k=21, and reads as short as 25bp can be used. How much coverage is needed? It depends on how many genes you want to catch. With a genome you generally want long inserts; with a transcriptome, short inserts are more useful - they have been using tiny 200bp stuff.

==== End-of-class Business ====

We can put Daniel's slides on the wiki. Galt will send Daniel a link to the wiki. Kevin on Wednesday will be talking about: using SOLiD data for ordering and orientation of Pog contigs; special-purpose scripts. The data was so noisy.