====== Lecture Notes for April 19, 2010 ======

===== Presentation =====
** {{:lecture_notes:20100419_banana_slug_seminar.ppt|Assembling and comparing genomes or transcriptomes using de Bruijn graphs}} ** \\
Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler

==== Overview ====
  * De Bruijn graphs
  * Repeat resolution
  * Practical considerations
  * Extensions to Velvet

==== De Bruijn graphs ====

=== Overlap-Layout-Consensus (traditional) ===
  * Traditional assemblers: Phrap, Arachne, Celera, etc.
  * Short reads: Edena
  * Generally more expensive computationally.
  * Overlaps only allow clean right-to-left and left-to-right overlaps.
  * Transitivity
    * Need to remove the redundant connections.
    * A well-studied graph theory problem
  * Use whole-read alignments

=== Quick Overlap-Layout-Consensus (OLC) example ===
  * Filter out all the reads that are identical to or contained in another read. Only left-to-right overlaps, etc. are allowed.
  * Compute all their overlaps.
  * Get many transitive redundant connections.
  * Many algorithms are available for transitive reduction.
  * You obtain simplified paths through the reads.
  * Get 4 big chunks of reads from the example.
  * Hard to distinguish which reads belong to which copy of the repeat.
  * This is how contigs were first built.
  * Have to compare all reads against each other.
  * Quadratic. This has become difficult. (See the sketch after this list.)
    * Correction: quadratic behavior can be reduced to about O(n log n) by doing a clustering before aligning: just like in several fast search techniques, you only attempt to align sequences that share seed hits to relatively short k-mers (much shorter than those needed for de Bruijn graphs, since they only need to reduce the number of hits, not get unique locations in the genome). --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/24 07:51//
  * Next generation sequencing reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
  * The OLC method at least uses all the information in the read.
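A minimal Python sketch of the pipeline above (all names are illustrative, not from any real assembler): drop contained reads, pair up candidates through shared seed k-mers as in the correction above, compute suffix-prefix overlaps, and transitively reduce.

<code python>
from collections import defaultdict

def suffix_prefix_overlap(a, b, min_olap=3):
    """Length of the longest suffix of read a matching a prefix of read b."""
    for n in range(min(len(a), len(b)), min_olap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_olap=3, seed_k=4):
    # Filter reads that are identical to, or contained in, another read.
    uniq = sorted(set(reads), key=len, reverse=True)
    kept = [r for i, r in enumerate(uniq) if not any(r in s for s in uniq[:i])]
    # Cluster before aligning (the ~O(n log n) trick in the correction
    # above): index short seed k-mers and only align pairs sharing one.
    seeds = defaultdict(set)
    for i, r in enumerate(kept):
        for j in range(len(r) - seed_k + 1):
            seeds[r[j:j + seed_k]].add(i)
    edges = set()
    for i, a in enumerate(kept):
        cands = {j for p in range(len(a) - seed_k + 1)
                 for j in seeds[a[p:p + seed_k]] if j != i}
        for j in cands:
            if suffix_prefix_overlap(a, kept[j], min_olap):
                edges.add((i, j))   # read i runs left-to-right into read j
    # Transitive reduction: i->k is redundant when i->j and j->k exist.
    redundant = {(i, k) for i, j in edges for j2, k in edges
                 if j == j2 and (i, k) in edges and i != k}
    return kept, edges - redundant
</code>

For example, ''overlap_graph(["ATGGCG", "TGGCGT", "GGCGTA"])'' keeps the chain 0 -> 1 -> 2 and drops the redundant direct 0 -> 2 connection.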

=== Comparison of different assembly techniques ===

== Greedy hash table searches (~2007) ==
  * Some examples: SSAKE, SHARCGS, VCAKE
  * Much faster & leaner than the previous approaches
  * Less robust to variation and noise
  * Essentially simplified de Bruijn graph assemblers (see the sketch below)
    * Build a dictionary of words in the data set.
    * His example is 4mers, but typically 20mers and longer.
    * For each, create a node in the hash table.
    * Big dataset. Typically a binary marker to say if a word is present.
  * Sometimes at a branch there are two possible extensions.
  * The greedy approach is essentially a de Bruijn graph with local info only. In this structure you could have many paths through.
  * Can throw in reads of different lengths with no problems.
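A toy version of that dictionary-plus-greedy-extension idea (illustrative, not taken from any of the tools named above):

<code python>
def kmer_table(reads, k=4):
    """Dictionary of every k-mer word in the data set; set membership
    stands in for the binary presence marker mentioned above."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def greedy_extend(seed, kmers):
    """Extend a seed k-mer rightwards using local information only.
    Stops at a dead end, at an ambiguous branch (two possible
    extensions), or when a k-mer repeats (so it cannot loop forever)."""
    contig, cur, seen = seed, seed, {seed}
    while True:
        nexts = [b for b in "ACGT" if cur[1:] + b in kmers]
        if len(nexts) != 1:            # dead end or branch: stop here
            return contig
        cur = cur[1:] + nexts[0]
        if cur in seen:                # looped back on a repeat: stop
            return contig
        seen.add(cur)
        contig += nexts[0]
</code>

E.g. ''greedy_extend("ATGG", kmer_table(["ATGGC", "TGGCA"]))'' returns ''ATGGCA''; with a second possible extension in the table it would stop at the branch instead.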

== De Bruijn graph assemblers (~2009) ==
  * Currently the most common for NGS: Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo
  * Pavel Pevzner popularized them in a 2001 paper
  * Why do we use them?
    * High-throughput, cheap, short reads with errors
  * Basically a compromise between the two previous approaches:
    * Just a k-mer hash
    * ... but processed globally
  * Graph construction and correction is similar in all assemblers.
  * ABySS handles parallel implementation.

=== Working with de Bruijn graphs in Velvet ===

== Fixing errors ==
  * Errors that are created in assembly create spurs/tips that can be easily removed (see the sketch below).
  * When finding bubbles, remap the branch with the lower coverage back onto the branch with the higher coverage (so as not to lose the information).
  * Loops can represent tandem repeats.
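A rough sketch of tip clipping on a toy graph (dict of node -> set of successors). The length-only rule is a simplification: Velvet also weighs coverage before deciding which branch to cut.

<code python>
def clip_tips(graph, max_tip=4):
    """Remove 'tips': dead-end chains of at most max_tip nodes hanging
    off a node that still has another way out."""
    preds = {}
    for n, succ in graph.items():
        for m in succ:
            preds.setdefault(m, set()).add(n)
    for start in [n for n in graph if not graph[n]]:     # dead ends
        chain, cur = [start], start
        while len(chain) <= max_tip:
            ps = preds.get(cur, set())
            if len(ps) != 1:            # orphan or convergence: keep it
                break
            parent = next(iter(ps))
            if len(graph[parent]) > 1:  # branching root found: clip chain
                for node in chain:
                    graph.pop(node, None)
                graph[parent] -= set(chain)
                break
            chain.append(parent)
            cur = parent
    return graph
</code>

On ''{"A": {"B", "X"}, "B": {"C"}, "C": set(), "X": set()}'', ''clip_tips(graph, max_tip=1)'' clips the one-node spur ''X'' and keeps the longer path through ''B'' and ''C''. (A real implementation compares the competing branches instead of clipping greedily.)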

== Creating de Bruijn graphs ==
  * First, concatenate the simple paths without branches into nodes which compact the non-branching runs into a single item (see the sketch below).
  * The biochemistry of short reads means errors tend to be near the ends.
  * If you have errors in the second half of the reads, they create millions of tips, but those are at least easy to remove later.
  * Each read is a path through the graph. The head of the read is still kept, so we're not losing the read placement information.
  * We are still left with bubbles.
    * Possibly an accumulation of overlapping spurs/tips. More often variation (e.g. SNPs).
    * Would like to keep that information where possible.
    * A bubble could also be a long read. But we don't want to cut the long read in the middle.
  * Now we have loops that go back on themselves (not bubbles).
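Both steps in miniature (a sketch, not Velvet's actual data structures): build the graph with (k-1)-mer nodes, then concatenate the non-branching runs into single items.

<code python>
from collections import defaultdict

def de_bruijn(reads, k=4):
    """Nodes are (k-1)-mers; each k-mer in a read is an edge."""
    g = defaultdict(set)
    for r in reads:
        for i in range(len(r) - k + 1):
            g[r[i:i + k - 1]].add(r[i + 1:i + k])
    return g

def compact(g):
    """Concatenate the non-branching runs of nodes into single items."""
    indeg = defaultdict(int)
    for n, succ in g.items():
        for m in succ:
            indeg[m] += 1
    def simple(n):                       # exactly one way in and out
        return len(g.get(n, ())) == 1 and indeg[n] == 1
    unitigs = []
    for n in list(g):
        if simple(n):
            continue                     # interior of some run
        for m in g[n]:                   # start a run on each branch edge
            path = [n, m]
            while simple(path[-1]):
                path.append(next(iter(g[path[-1]])))
            unitigs.append(path[0] + "".join(p[-1] for p in path[1:]))
    return unitigs
</code>

''compact(de_bruijn(["ATGGCGTGCA"]))'' returns ''["ATGGCGTGCA"]'': one unbranched run, compacted back into a single item. (Isolated perfect cycles are skipped by this sketch.)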

=== So what's the difference? ===
  * Algebraic difference:
    * Reads in the OLC methods are atomic (i.e. an arc)
    * Reads in the DB graph are sequential paths through the graph
  * …this leads to practical differences
    * DB graphs allow for a greater variety of overlaps
    * Overlaps in the OLC approach require a global alignment, not just a shared k-mer
    * OLC methods get error correction from long reads, while DB graphs require more work to correct.

=== Simulations: Tour Bus - error ===
  * Compare 5Mb chunks from 4 species: //E. coli//, yeast, //C. elegans//, human
  * Increased coverage from 0 to 50x
  * N50 on the y-axis (see the sketch below for how N50 is computed)
  * First there is exponential growth as expected, but then the complexity causes trouble unless you have other kinds of information.
  * Not using paired-read information
  * You see that the human is a little more complex than //E. coli//.
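N50 itself is small enough to pin down exactly (the metric on the y-axis):

<code python>
def n50(contig_lengths):
    """Smallest length L such that contigs of length >= L hold at least
    half of all assembled bases."""
    total, running = sum(contig_lengths), 0
    for n in sorted(contig_lengths, reverse=True):
        running += n
        if 2 * running >= total:
            return n

print(n50([10, 8, 4, 3, 2]))   # 8, since 10 + 8 = 18 >= 27/2
</code>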

==== Repeat resolution ====

=== Scaling up the hard way: chromosome X ===
  * 548 million Solexa reads were generated from a flow-sorted human X chromosome.
  * Fit in 70GB of RAM
  * Numerous short contigs
    * Many contigs: 898,401 contigs
    * Short contigs: 260bp N50 (however, a max of 6,956bp)
  * Overall length: 130Mb
  * How to handle repeats?
  * Moral: there are engineering issues to be resolved, but the complexity of the graph needs to be handled accordingly.

== Handling complexity of the graph ==
  * Reduced representation (Margulies et al.)
    * Reduced representation uses restriction enzymes and a gel to cut the problem into tractable bits.
    * Cut chunks from the gel and then sequence each gel slice separately, or at least in different lanes.
    * Then the assembly problem is smaller and more localized instead of whole-genome.
  * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleasance et al.)
    * Map to a reference or similar genome, then re-map locally the stuff that is unique.
    * MAQ (Mapping and Assembly with Quality) and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches.
    * Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangements.
  * Code parallelization (Birol et al., Jeffrey Cook)
    * Only ABySS has published so far
  * Improved indexing (Zamin Iqbal and Mario Caccamo)
    * Fit whole-genome shotgun onto 200GB
  * Use of intermediate remapping (Matthias Haimel)
    * Wanted to parallelize the second-level work of building scaffolds, but those are mostly engineering issues.

=== Repeats in a de Bruijn graph ===
  * The de Bruijn graph for a small organism looks like long-ish contigs, but still makes a hair-ball.

=== Rock Band: Using long and short reads together ===
  * If all of the paths along the short reads move from one long read (e.g. 100-1kb) to another long read, then you know the sequence from the short reads
  * Long reads are mapped seamlessly into the graph.
  * 35bp reads at 50x, plus some long reads (100-1kb) at 0-20x
  * Played with the length
    * The longer your long reads are, the more bridgeable the graph is and the longer the contigs.
  * Someone might ask: "Why not just assemble the long reads?"

=== Different approaches to repeat resolution ===

== Theoretical: spectral graph analysis ==
  * Equivalent to a Principal Component Analysis or a heat conduction model
  * Find the vector that threads its way through the graph
  * Relies on a (massive) matrix diagonalization
  * Comprehensive: all data integrated at once
    * Takes all the data at once in one huge matrix, so it is not too affected by noise. No one has actually tried this approach, as it requires too much computation.

== Traditional scaffolding ==
  * E.g. Arachne, Celera, BAMBUS
  * Heuristic approach similar to that used in traditional overlap-layout-consensus contigs
  * The traditional approach is contigs with paired-end reads.
    * Find chains connected by contig reads.
    * Build a big graph of pairwise connections, simplify, extract obvious linear components
  * Your coverage is huge.
    * Each base pair has to handle 1000x coverage of read-pairs. Long reads are separated by a mess in the middle. Check possible paths and find which are acceptable.
    * Find all the reads that start and end on the two long reads, then consider them all together; much more efficient.
  * 5% or more of mate-pairs can be wrong
    * Hopefully we will be able to filter them out.
  * In Velvet, use a likelihood approach (see the sketch below)
    * Have a distance estimate
    * Given the coverage of A and B, how many pairs would we expect to see? If we see too few, we discard the connection.
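The "how many pairs would we expect" test might look like the sketch below. The window geometry and the Poisson cutoff are guesses at a minimal model, not Velvet's actual likelihood computation.

<code python>
import math

def expected_links(cov, read_len, insert, gap, len_a, len_b):
    """Crude expected number of read pairs bridging contigs A and B:
    pair starts per base, times the window of left-read start positions
    whose mate lands fully on B. Assumes a fixed insert size; a real
    scaffolder integrates over the insert-size distribution."""
    pair_density = cov / (2.0 * read_len)
    window = min(insert - gap - 2 * read_len + 1,
                 len_a - read_len, len_b - read_len)
    return pair_density * max(0, window)

def keep_connection(observed, expected, p_min=0.01):
    """Discard the A-B connection when seeing this few links is too
    improbable under a Poisson model with the expected count."""
    p_low = sum(math.exp(-expected) * expected ** i / math.factorial(i)
                for i in range(observed + 1))
    return p_low > p_min
</code>

With an expectation of ~20 bridging pairs, observing only 3 gives a vanishing left-tail probability and the connection is discarded.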

== In NGS assemblers: ==
  * Euler: for each pair of reads, find all possible paths from one read to another
  * ABySS: same as above, but the read-pairs are bundled into node-to-node connections to reduce calculations
  * ALLPATHS: same as above, but the search is limited to localized clouds around pre-computed scaffolds
    * ALLPATHS did better using small localized problems of a single small scaffold for parallelization.

== Using the differences between insert lengths ==
  * The Shorty algorithm uses the variance between read pairs anchored on a common contig or k-mer
  * The projected image of the other ends creates the possibility of assembling a smaller set: the projection from a node.

=== Pebble: Handling paired-end reads ===
  * Pick one unique node, follow its paths
  * Velvet used the reverse idea: the projections to a node
  * Some of the nodes are unique; pick out those nodes
  * Hit the nodes in the vicinity of the starting contig, and get distance information on the immediate vicinity (see the sketch below).
  * Uses a greedy algorithm. Find the way to the next unique node.
  * Having a more precise insert size could help
    * Velvet assumed a distribution with a 10% spread
    * Solexa might have been different
    * SOLiD up to 2k, Solexa just 300
    * It is difficult to get enough DNA in a narrow size range.
    * Don't love barcoding because it uses a lot of reagents.
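The distance information might be derived roughly as below (a guessed sketch of the idea, not Pebble's actual estimator): each bridging read pair gives one noisy estimate of the gap, and averaging them shrinks the error.

<code python>
def estimate_gap(bridging_pairs, insert_mean):
    """bridging_pairs holds (tail_a, head_b) offsets: how far the left
    read starts before contig A's right end, and how far the right read
    ends after contig B's left end. Each pair implies
    gap = insert - tail_a - head_b; average over all pairs."""
    estimates = [insert_mean - ta - hb for ta, hb in bridging_pairs]
    return sum(estimates) / len(estimates)

# Three pairs from a 200bp library suggesting a ~40bp gap:
print(estimate_gap([(90, 70), (100, 62), (80, 78)], 200))   # 40.0
</code>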

==== Practical considerations ====

=== Colorspace ===
  * Di-base encoding has a 4-letter alphabet, but very different behavior to sequence space (see the sketch below)
    * Double-encoded files use ACGT.
    * Be careful not to confuse them!
  * Different rules for complementarity
    * Just reverse the colors; don't complement them.
  * Direct conversion to sequence space is simple but wasteful
    * One error messes up all the remaining bases. This is a huge loss, so do everything in colorspace and make a complete assembly, then convert to base calls at the very end.
  * Conversion must therefore be done at the very end of the process, when the reads are aligned
    * You can then use the transition rules to detect errors. Typically it detects the error after 1 or 2 bp, then corrects.
  * SOLiD has developed tools for this
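The di-base code is small enough to sketch (the standard SOLiD color matrix happens to be XOR on base indices). These helpers are illustrative; note that reversal never complements, and that one bad color corrupts every later base on decoding, which is why conversion waits until the end.

<code python>
BASES = "ACGT"

def encode(seq):
    """Base space -> color space; the leading primer base is kept."""
    return seq[0] + "".join(str(BASES.index(a) ^ BASES.index(b))
                            for a, b in zip(seq, seq[1:]))

def decode(cs):
    """Color space -> base space. A single wrong color shifts every
    base after it, so assemble in color space and decode at the end."""
    out = [cs[0]]
    for c in cs[1:]:
        out.append(BASES[BASES.index(out[-1]) ^ int(c)])
    return "".join(out)

def reverse_colors(colors):
    """Reverse complement in color space: just reverse the colors,
    never complement them (the di-base code is strand-symmetric)."""
    return colors[::-1]

assert decode(encode("ATGGC")) == "ATGGC"   # round trip: "A3103" and back
</code>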

=== Different error models ===
  * When using different technologies, you have to take their different error models into account
  * Trivial when doing OLC assembly
  * Much more tricky when doing de Bruijn graph assembly, since k-mers are not assigned to reads
  * Different assemblers have different settings (e.g. Newbler)
    * 454 had homopolymer issues in flow-space.
    * Illumina and ABI are different.
    * Makes it hard to handle errors.
    * Especially in de Bruijn graphs: applying the correction rules becomes much more complex.
    * Newbler may be better for 454.

=== Pre-filtering the reads ===
  * Some assemblers have built-in filtering of the reads (e.g. Euler), but this is not general
  * Efficient filtering of low-quality bases can cut down on the computational cost (memory and time); see the sketch below
    * User-specific, data-space-specific solutions
    * No tried-and-true consistent approach
  * Beware that some assemblers require reads of identical length
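A minimal pre-filter along those lines (thresholds are illustrative, not recommendations): cut each read at its first low-quality base and drop it if too little remains. The optional fixed-length mode is for assemblers that require identical read lengths.

<code python>
def trim(seq, quals, min_q=20, min_len=25, fixed_len=None):
    """Cut at the first base with quality < min_q (errors cluster toward
    the read ends, so this keeps the trustworthy head), then drop or
    standardize reads that are too short."""
    cut = len(seq)
    for i, q in enumerate(quals):
        if q < min_q:
            cut = i
            break
    seq = seq[:cut]
    if fixed_len is not None:            # identical-length mode
        return seq[:fixed_len] if len(seq) >= fixed_len else None
    return seq if len(seq) >= min_len else None
</code>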

==== Extensions to Velvet ====
** Oases: //de novo// transcriptome assembly ** \\
How to study mRNA reads which do not map

=== De Bruijn graphs ~ splice graphs ===
  * People are using next generation sequencing generally because it's cheaper.
  * Would have just one path except for alternative splicing.
  * Pavel Pevzner mentioned in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply the algorithms developed for splice graphs to de Bruijn graphs.

=== Oases: Process ===
  * Create clusters of contigs:
    * Connecting reads
    * Connecting read pairs
    * A genetic locus should create a connected cluster of contigs.
    * Sometimes one cluster gets broken into two, but it works pretty well.
  * Re-implement traditional algorithms to run on graphs, instead of a reference genome
    * Greedy transitive reduction (Myers, 2005)
    * Motif searches
    * Dynamic assembly of transcripts (Lee, 2003)
      * Find the path of highest weight (see the sketch below)
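The highest-weight path step is a classic dynamic program over a DAG. A sketch of the idea, with node weights standing in for coverage (not Oases's actual code):

<code python>
def heaviest_path(nodes, edges, weight):
    """Highest-weight path through a DAG. nodes must be listed in
    topological order; edges maps node -> successors; weight maps
    node -> its weight (e.g. coverage)."""
    best = {n: (weight[n], None) for n in nodes}   # (score, predecessor)
    for u in nodes:                                # relax in topo order
        for v in edges.get(u, ()):
            cand = best[u][0] + weight[v]
            if cand > best[v][0]:
                best[v] = (cand, u)
    end = max(nodes, key=lambda n: best[n][0])     # best endpoint
    path = [end]
    while best[path[-1]][1] is not None:           # backtrack
        path.append(best[path[-1]][1])
    return path[::-1]

# Toy: a -> {b, c}, b -> c; the weights favor going through b.
print(heaviest_path(["a", "b", "c"],
                    {"a": ["b", "c"], "b": ["c"]},
                    {"a": 5, "b": 1, "c": 2}))     # ['a', 'b', 'c']
</code>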

=== Oases: Output ===
  * Full-length transcripts - handles alternative splicing.
    * With a skipped exon and one artifact.
  * Transcripts from different genes - handles varying coverage levels.
  * Alternative splicing events: when possible, distinguishes between common AS events.
    * Find donor and acceptor sites on or around junctions.

=== Oases: Results ===
  * Benchmarked within the RGASP competition
  * Already 100+ users, some reporting very promising results:
    * N50 = 1886 bp (Monica Britton, UC Davis)
    * 99.8% of transcripts aligned against UniGene (Micha Bayer, Scottish Crop Research Institute)

=== Oases: results of Drosophila PE-reads ===
  * The number of transcripts and loci are in the ballpark of what they were expecting.
  * Read length 36, k=21, insert size 200
  * ~90% map to the genome

=== Oases: examples ===
  * Found an unannotated gene with 7 exons
  * The assembler has no notion of the reference genome, but still found one example that mapped very well.

=== Mapping vs. //de novo// ===
  * Mapping
    * Many good, mature tools available
    * Easy to compute
    * Simpler to analyze; comes with genomic coordinates
  * //De novo//
    * Harder than mapping
    * May require special hardware
    * No reference needed
    * Handles novel structures without //a priori// knowledge
  * How to get the best of both worlds?

=== Columbus: Velvet module ===
  * Solution: the reference-based de Bruijn graph
    * Just throw in the reference genome as long reads.
    * Chucking in long reads and getting short contigs.
    * Preserves the structure of the reference sequences. Not allowed to overlap with itself.
    * What they really wanted was to start from a reference genome, then handle variation as a little insertion.
  * Integrates alignment information (SAM/BAM files, from BWA or Bowtie)
  * Edits, breaks up, reconnects, and extends the "alignment contigs" as necessary
  * Applications
    * Comparative transcriptome assembly
    * Cancer RNAseq
    * Comparative assembly

=== De Bruijn graph convention for next generation sequencing ===
  * For the transcriptome, using k=21.
  * Can use reads as short as 25bp.
  * How many X coverage is needed? (see the sketch below)
    * How many genes do you want to catch?
  * With a genome, you generally want long inserts.
  * With a transcriptome, short inserts are more useful. They have been using 200bp tiny stuff.
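One rule of thumb behind the coverage question: the effective k-mer coverage is lower than the nominal read coverage, because a read of length L contributes only L - k + 1 k-mers.

<code python>
def kmer_coverage(read_cov, read_len, k):
    """Effective k-mer coverage: C_k = C * (L - k + 1) / L."""
    return read_cov * (read_len - k + 1) / read_len

# 50x of 36bp reads at k=21 is only ~22x in k-mer space:
print(kmer_coverage(50, 36, 21))   # 22.2...
</code>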

===== After presentation =====
Kevin on Wednesday will be talking about:
  * Use of SOLiD data for ordering and orienting //Pog// contigs
  * Special-purpose scripts
  * The data was so noisy