This shows you the differences between two versions of the page.
| Both sides previous revision Previous revision Next revision | Previous revision | ||
|
lecture_notes:04-19-2010 [2010/04/24 02:26] cbrumbau Started to merge notes (incomplete), cleaned up formatting, filled in missing notes from presentation |
lecture_notes:04-19-2010 [2010/04/24 14:54] (current) karplus added correction about quadratic behavior of overlap-consensus |
||
|---|---|---|---|
| Line 23: | Line 23: | ||
| * Use whole-read alignments | * Use whole-read alignments | ||
| - | === Quick OLC example === | + | === Quick Overlap-Layout-Consensus (OLC) example === |
| * Filter out all the reads that are identical or self-contained in another read. Only left-to-right, etc. are allowed. | * Filter out all the reads that are identical or self-contained in another read. Only left-to-right, etc. are allowed. | ||
| * Compute all their overlaps. | * Compute all their overlaps. | ||
| Line 34: | Line 34: | ||
| * Have to compare all reads against each other. | * Have to compare all reads against each other. | ||
| * Quadratic. This has become difficult. | * Quadratic. This has become difficult. | ||
| - | * NGS reads are short, and overlaps between reads are short. They are getting a little longer with newer technology. | + | * Correction: quadratic behavior can be reduced to about O(n log n) by doing a clustering before aligning: just like in several fast search techniques, you only attempt to align sequences that share seed hits to relatively short k-mers (much shorter than those needed for de Bruijn graphs, since they only need to reduce the number of hits, not get unique locations in the genome). --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/24 07:51// |
| + | * Next generation sequencing reads are short, and overlaps between reads are short. They are getting a little longer with newer technology. | ||
| * OLC method uses all information in the read at least. | * OLC method uses all information in the read at least. | ||
| Line 49: | Line 50: | ||
| * Big dataset. Typically a binary marker to say if present. | * Big dataset. Typically a binary marker to say if present. | ||
| * Sometimes at branch have two possible extensions. | * Sometimes at branch have two possible extensions. | ||
| - | * Greedy approach essentially De Bruijn graph with local info only. In this structure you could have many paths through. | + | * Greedy approach essentially de Bruijn graph with local info only. In this structure you could have many paths through. |
| * Can throw in reads of different lengths with no problems. | * Can throw in reads of different lengths with no problems. | ||
| Line 117: | Line 118: | ||
| * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleaseance et al.). | * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleaseance et al.). | ||
| * Map to reference or similar genome, then re-map locally some stuff that is unique. | * Map to reference or similar genome, then re-map locally some stuff that is unique. | ||
| - | * MAQ and BWA only allow 3 or 4 mismatches. | + | * MAQ (Mapping and Assembly with Quality) and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches. |
| * Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangments. | * Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangments. | ||
| * Code parallelization (Birol et al., Jeffrey Cook). | * Code parallelization (Birol et al., Jeffrey Cook). | ||
| * Only ABySS has published yet | * Only ABySS has published yet | ||
| * Improved indexing (Zamin Iqbal and Mario Caccamo). | * Improved indexing (Zamin Iqbal and Mario Caccamo). | ||
| - | * Fit WGS onto 200GB | + | * Fit whole genome shotgun onto 200GB |
| * Use of intermediate remapping (Matthias Haimei). | * Use of intermediate remapping (Matthias Haimei). | ||
| * Wanted to parallelize the second level work of building scaffolds, but those are mostly engineering issues. | * Wanted to parallelize the second level work of building scaffolds, but those are mostly engineering issues. | ||
| Line 135: | Line 136: | ||
| * Played with the length | * Played with the length | ||
| * The longer your long reads are, the more bridgeable it is and the longer the contigs. | * The longer your long reads are, the more bridgeable it is and the longer the contigs. | ||
| - | * "Why not just assemble the long reads?" someone might ask | + | * Someone might ask: "Why not just assemble the long reads?" |
| === Different approaches to repeat resolution === | === Different approaches to repeat resolution === | ||
| Line 201: | Line 202: | ||
| * When using different technologies, you have to take into account different technologies | * When using different technologies, you have to take into account different technologies | ||
| * Trivial when doing OLC assembly | * Trivial when doing OLC assembly | ||
| - | * Much more tricky when doing De Bruijn graph assemble, since k-mers are not assigned to reads | + | * Much more tricky when doing de Bruijn graph assemble, since k-mers are not assigned to reads |
| * Different assemblers had different settings (e.g. Newbler) | * Different assemblers had different settings (e.g. Newbler) | ||
| * 454 had homopolymer issues in flow-space. | * 454 had homopolymer issues in flow-space. | ||
| Line 221: | Line 222: | ||
| === De Brujin graphs ~ splice graphs === | === De Brujin graphs ~ splice graphs === | ||
| - | * Pavel Pevzner mentioned that De Bruijn graphs are more or less like splice graphs | + | * People are using next generation sequencing generally because it's cheaper. |
| + | * Would have just one path except for alternate splicing. | ||
| + | * Pavel Pevzner mentioned in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply those algorithms developed for splice graphs to de Bruijn graphs. | ||
| === Oases: Process === | === Oases: Process === | ||
| Line 227: | Line 230: | ||
| * Connecting reads | * Connecting reads | ||
| * Connecting read pairs | * Connecting read pairs | ||
| + | * The genetic locus should create a connected locus, cluster of contigs. | ||
| + | * Sometimes one cluster gets broken into two, but works pretty well. | ||
| * Re-implement traditional algorithms to run on graphs, instead of a reference genome | * Re-implement traditional algorithms to run on graphs, instead of a reference genome | ||
| * Greedy transitive reduction (Myers, 2005) | * Greedy transitive reduction (Myers, 2005) | ||
| * Motif searches | * Motif searches | ||
| * Dynamic assembly of transcripts (Lee, 2003) | * Dynamic assembly of transcripts (Lee, 2003) | ||
| + | * find the path of highest weight | ||
| === Oases: Output === | === Oases: Output === | ||
| * Full length transcripts - handles alternative splicing. | * Full length transcripts - handles alternative splicing. | ||
| + | * With skipped exon and one artifact. | ||
| * Transcripts from different genes - handles varying coverage levels. | * Transcripts from different genes - handles varying coverage levels. | ||
| * Alternative splicing events: when possible, distinguished between common AS events. | * Alternative splicing events: when possible, distinguished between common AS events. | ||
| + | * Find donor and acceptor sites on or around junctions. | ||
| === Oases: Results === | === Oases: Results === | ||
| Line 245: | Line 253: | ||
| === Oases: results of Drosophila PE-reads === | === Oases: results of Drosophila PE-reads === | ||
| * The number of transcripts and loci are in the ballpark of what they were expecting. | * The number of transcripts and loci are in the ballpark of what they were expecting. | ||
| + | * rl 36, k21 ins size 200 | ||
| + | * Map to genome ~90% | ||
| + | |||
| + | === Oases: examples === | ||
| + | * Found an unannotated gene with 7 exons | ||
| + | * Assembler has no notion of the reference genome, but still found one example that mapped very well. | ||
| === Mapping vs. de novo === | === Mapping vs. de novo === | ||
| * Mapping | * Mapping | ||
| - | * Many tools available | + | * Many good mature tools available |
| * Easy to compute | * Easy to compute | ||
| - | * Simpler to analyze | + | * Simpler to analyze, comes with genetic coords |
| * //De novo// | * //De novo// | ||
| + | * Harder than mapping | ||
| + | * May require special hardware | ||
| * No reference needed | * No reference needed | ||
| * Handles novel structures without //a priori// | * Handles novel structures without //a priori// | ||
| + | * How to get the best of both worlds? | ||
| === Columbus: Velvet module === | === Columbus: Velvet module === | ||
| * Solution: the reference-based de Bruijn graph | * Solution: the reference-based de Bruijn graph | ||
| - | * Preserves the structure of the reference sequences | + | * Just throw in the reference genome as long reads. |
| - | * Integrates alignment information (SAM, BAM file) | + | * Chucking in long reads and getting short contigs. |
| + | * Preserves the structure of the reference sequences. Not allowed to overlap with self. | ||
| + | * What they wanted really was to start from a reference genome, then handle it as a little insertion. | ||
| + | * Integrates alignment information (SAM, BAM file with BWA, Bowtie) | ||
| * Edits, breaks up, reconnects, and extends the "alignment contigs" as necessary | * Edits, breaks up, reconnects, and extends the "alignment contigs" as necessary | ||
| + | * Applications | ||
| + | * Comparative transcriptome assembly | ||
| + | * Cancer RNAseq | ||
| + | * Comparative assembly | ||
| - | ===== Galt's Notes ===== | + | === De Bruijn graph convention for next generation sequencing === |
| - | + | * For transcriptome, using k=21. | |
| - | ==== OASES ==== | + | * Can use reads as short as 25. |
| - | + | * How many X coverage needed? | |
| - | De-novo Transcriptome Assembly. | + | * How many genes do you want to catch? |
| - | People are using NGS generally because it's cheaper. | + | * With genome, want long inserts generally. |
| - | Would have just one path except for alternate splicing. | + | * With transcriptome, short inserts are more useful. They have been using 200bp tiny stuff. |
| - | Pevzner in 2002 mentioned DBG is equivalent to a | + | |
| - | splice-graph. So can apply those algorithms developed for | + | |
| - | Splice-graphs to DBG. | + | |
| - | + | ||
| - | === Oases Process === | + | |
| - | The genetic locus should create a connected locus, cluster of contigs. | + | |
| - | Sometimes one cluster gets broken into two, but works pretty well. | + | |
| - | + | ||
| - | Dynamic assembly, find the path of highest weight. | + | |
| - | + | ||
| - | === Oases Output === | + | |
| - | With skipped exon, and one artifact. | + | |
| - | Handles contigs with different coverage levels. | + | |
| - | Find donor and acceptor sites on or around junctions. | + | |
| - | + | ||
| - | Drosophila reads. | + | |
| - | rl 36, k21 ins size 200. | + | |
| - | Map to genome 90% | + | |
| - | Found an unannotated gene with 7 exons. | + | |
| - | + | ||
| - | Assembler has no notion of the reference genome, | + | |
| - | but still found one example that mapped very well. | + | |
| - | + | ||
| - | + | ||
| - | ==== COLUMBUS ==== | + | |
| - | Mapping. | + | |
| - | Either map, or do de-novo. | + | |
| - | Many good mature mapping tools. | + | |
| - | Easier to analyze, comes with genomic coords. | + | |
| - | De-novo harder. May require special hardware. | + | |
| - | Works without reference. Handles novel structures. | + | |
| - | + | ||
| - | How to get the best of both worlds? | + | |
| - | Just throw in the ref genome as long reads. | + | |
| - | Chucking in long reads and getting short contigs. | + | |
| - | What they wanted really was to start from a | + | |
| - | reference genome. Then handle it as a little insertion. | + | |
| - | Takes DBG and SAM/BAM (BWA, bowtie) | + | |
| - | Keeps ref sequences. Not allowed to overlap with self. | + | |
| - | Starting with comparative transcriptome assembly. | + | |
| - | Would like to ? cancer. | + | |
| - | + | ||
| - | DBG convention for NGS. | + | |
| - | + | ||
| - | For transcriptome, using k=21. | + | |
| - | Can use reads as short as 25. | + | |
| - | How many X coverage needed? | + | |
| - | How many genes do you want to catch. | + | |
| - | + | ||
| - | With genome, want long inserts generally. | + | |
| - | With transcriptome, short inserts are more | + | |
| - | useful. They have been using 200bp tiny stuff. | + | |
| - | ===== End-of-class business ===== | + | ===== After presentation ===== |
| - | Kevin on Wed. will be talking about: | + | Kevin on Wednesday will be talking about: |
| * Use SOLiD data for ordering and orientation of //Pog// contigs | * Use SOLiD data for ordering and orientation of //Pog// contigs | ||
| * Special purpose scripts | * Special purpose scripts | ||
| * The data was so noisy | * The data was so noisy | ||