User Tools

Site Tools


lecture_notes:04-19-2010

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
lecture_notes:04-19-2010 [2010/04/24 02:26]
cbrumbau Started to merge notes (incomplete), cleaned up formatting, filled in missing notes from presentation
lecture_notes:04-19-2010 [2010/04/24 14:54]
karplus added correction about quadratic behavior of overlap-consensus
Line 23: Line 23:
   * Use whole-read alignments   * Use whole-read alignments
  
-=== Quick OLC example ===+=== Quick Overlap-Layout-Consensus (OLCexample ===
   * Filter out all the reads that are identical or self-contained in another read. Only left-to-right,​ etc. are allowed.   * Filter out all the reads that are identical or self-contained in another read. Only left-to-right,​ etc. are allowed.
   * Compute all their overlaps.   * Compute all their overlaps.
Line 34: Line 34:
   * Have to compare all reads against each other.   * Have to compare all reads against each other.
     * Quadratic. This has become difficult.     * Quadratic. This has become difficult.
-  ​NGS reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.+    ​Correction: quadratic behavior can be reduced to about O(n log n) by doing a clustering before aligning: just like in several fast search techniques, you only attempt to align sequences that share seed hits to relatively short k-mers (much shorter than those needed for de Bruijn graphs, since they only need to reduce the number of hits, not get unique locations in the genome). --- //​[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/24 07:51// 
 +  * Next generation sequencing ​reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
   * OLC method uses all information in the read at least.   * OLC method uses all information in the read at least.
  
Line 49: Line 50:
     * Big dataset. Typically a binary marker to say if present.     * Big dataset. Typically a binary marker to say if present.
   * Sometimes at branch have two possible extensions.   * Sometimes at branch have two possible extensions.
-  * Greedy approach essentially ​De Bruijn graph with local info only. In this structure you could have many paths through.+  * Greedy approach essentially ​de Bruijn graph with local info only. In this structure you could have many paths through.
   * Can throw in reads of different lengths with no problems.   * Can throw in reads of different lengths with no problems.
  
Line 117: Line 118:
   * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleaseance et al.).   * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleaseance et al.).
     * Map to reference or similar genome, then re-map locally some stuff that is unique.     * Map to reference or similar genome, then re-map locally some stuff that is unique.
-    * MAQ and BWA only allow 3 or 4 mismatches.+    * MAQ (Mapping and Assembly with Quality) ​and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches.
     * Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangments.     * Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangments.
   * Code parallelization (Birol et al., Jeffrey Cook).   * Code parallelization (Birol et al., Jeffrey Cook).
     * Only ABySS has published yet     * Only ABySS has published yet
   * Improved indexing (Zamin Iqbal and Mario Caccamo).   * Improved indexing (Zamin Iqbal and Mario Caccamo).
-    * Fit WGS onto 200GB+    * Fit whole genome shotgun ​onto 200GB
   * Use of intermediate remapping (Matthias Haimei).   * Use of intermediate remapping (Matthias Haimei).
     * Wanted to parallelize the second level work of building scaffolds, but those are mostly engineering issues.     * Wanted to parallelize the second level work of building scaffolds, but those are mostly engineering issues.
Line 135: Line 136:
   * Played with the length   * Played with the length
   * The longer your long reads are, the more bridgeable it is and the longer the contigs.   * The longer your long reads are, the more bridgeable it is and the longer the contigs.
-    * "Why not just assemble the long reads?" ​someone might ask+    * Someone might ask: "Why not just assemble the long reads?"​
  
 === Different approaches to repeat resolution === === Different approaches to repeat resolution ===
Line 201: Line 202:
   * When using different technologies,​ you have to take into account different technologies   * When using different technologies,​ you have to take into account different technologies
     * Trivial when doing OLC assembly     * Trivial when doing OLC assembly
-    * Much more tricky when doing De Bruijn graph assemble, since k-mers are not assigned to reads+    * Much more tricky when doing de Bruijn graph assemble, since k-mers are not assigned to reads
     * Different assemblers had different settings (e.g. Newbler)     * Different assemblers had different settings (e.g. Newbler)
   * 454 had homopolymer issues in flow-space.   * 454 had homopolymer issues in flow-space.
Line 221: Line 222:
  
 === De Brujin graphs ~ splice graphs === === De Brujin graphs ~ splice graphs ===
-  * Pavel Pevzner mentioned that De Bruijn graphs are more or less like splice graphs+  ​* People are using next generation sequencing generally because it's cheaper. 
 +  * Would have just one path except for alternate splicing. 
 +  ​* Pavel Pevzner mentioned ​in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply those algorithms developed for splice graphs to de Bruijn graphs.
  
 === Oases: Process === === Oases: Process ===
Line 227: Line 230:
     * Connecting reads     * Connecting reads
     * Connecting read pairs     * Connecting read pairs
 +    * The genetic locus should create a connected locus, cluster of contigs. ​
 +    * Sometimes one cluster gets broken into two, but works pretty well.
   * Re-implement traditional algorithms to run on graphs, instead of a reference genome   * Re-implement traditional algorithms to run on graphs, instead of a reference genome
     * Greedy transitive reduction (Myers, 2005)     * Greedy transitive reduction (Myers, 2005)
     * Motif searches     * Motif searches
     * Dynamic assembly of transcripts (Lee, 2003)     * Dynamic assembly of transcripts (Lee, 2003)
 +      * find the path of highest weight
  
 === Oases: Output === === Oases: Output ===
   * Full length transcripts - handles alternative splicing.   * Full length transcripts - handles alternative splicing.
 +    * With skipped exon and one artifact.
   * Transcripts from different genes - handles varying coverage levels.   * Transcripts from different genes - handles varying coverage levels.
   * Alternative splicing events: when possible, distinguished between common AS events.   * Alternative splicing events: when possible, distinguished between common AS events.
 +    * Find donor and acceptor sites on or around junctions.
  
 === Oases: Results === === Oases: Results ===
Line 245: Line 253:
 === Oases: results of Drosophila PE-reads === === Oases: results of Drosophila PE-reads ===
   * The number of transcripts and loci are in the ballpark of what they were expecting.   * The number of transcripts and loci are in the ballpark of what they were expecting.
 +  * rl 36, k21 ins size 200
 +  * Map to genome ~90%
 +
 +=== Oases: examples ===
 +  * Found an unannotated gene with 7 exons
 +  * Assembler has no notion of the reference genome, but still found one example that mapped very well.
  
 === Mapping vs. de novo === === Mapping vs. de novo ===
   * Mapping   * Mapping
-    * Many tools available+    * Many good mature ​tools available
     * Easy to compute     * Easy to compute
-    * Simpler to analyze+    * Simpler to analyze, comes with genetic coords
   * //De novo//   * //De novo//
 +    * Harder than mapping
 +    * May require special hardware
     * No reference needed     * No reference needed
     * Handles novel structures without //a priori//     * Handles novel structures without //a priori//
 +  * How to get the best of both worlds?
  
 === Columbus: Velvet module === === Columbus: Velvet module ===
   * Solution: the reference-based de Bruijn graph   * Solution: the reference-based de Bruijn graph
-    * Preserves the structure of the reference sequences +    ​* Just throw in the reference genome as long reads. 
-    * Integrates alignment information (SAM, BAM file)+    * Chucking in long reads and getting short contigs. 
 +    ​* Preserves the structure of the reference sequences. Not allowed to overlap with self. 
 +      * What they wanted really was to start from a reference genome, then handle it as a little insertion. 
 +    * Integrates alignment information (SAM, BAM file with BWA, Bowtie)
     * Edits, breaks up, reconnects, and extends the "​alignment contigs"​ as necessary     * Edits, breaks up, reconnects, and extends the "​alignment contigs"​ as necessary
 +  * Applications
 +    * Comparative transcriptome assembly
 +    * Cancer RNAseq
 +    * Comparative assembly
  
-===== Galt's Notes ===== +=== De Bruijn ​graph convention ​for next generation sequencing ​=== 
- +  ​For transcriptome,​ using k=21. 
-==== OASES ==== +  ​* ​Can use reads as short as 25. 
- +  ​* ​How many X coverage needed? 
-De-novo Transcriptome Assembly. +  ​* ​How many genes do you want to catch? 
-People are using NGS generally because it's cheaper. +  ​* ​With genome, want long inserts generally. 
-Would have just one path except for alternate splicing. +  ​* ​With transcriptome,​ short inserts are more useful. They have been using 200bp tiny stuff.
-Pevzner in 2002 mentioned DBG is equivalent to a +
-splice-graph.  So can apply those algorithms developed ​for  +
-Splice-graphs to DBG. +
- +
-=== Oases Process ​=== +
-The genetic locus should create a connected locus, cluster of contigs.  +
-Sometimes one cluster gets broken into two, but works pretty well. +
- +
-Dynamic assembly, find the path of highest weight. +
- +
-=== Oases Output === +
-With skipped exon, and one artifact. +
-Handles contigs with different coverage levels. +
-Find donor and acceptor sites on or around junctions. +
- +
-Drosophila reads. +
-rl 36, k21 ins size 200. +
-Map to genome 90% +
-Found an unannotated gene with 7 exons. +
- +
-Assembler has no notion of the reference genome, +
-but still found one example that mapped very well. +
- +
- +
-==== COLUMBUS ==== +
-Mapping.  +
-Either map, or do de-novo. +
-Many good mature mapping tools. +
-Easier to analyze, comes with genomic coords. +
-De-novo harder. ​ May require special hardware. +
-Works without reference. ​ Handles novel structures. +
- +
-How to get the best of both worlds? +
-Just throw in the ref genome as long reads. +
-Chucking in long reads and getting short contigs. +
-What they wanted really was to start from a  +
-reference genome. ​ Then handle it as a little insertion. +
-Takes DBG and SAM/BAM (BWA, bowtie) +
-Keeps ref sequences. ​ Not allowed to overlap with self. +
-Starting with comparative transcriptome assembly. +
-Would like to ? cancer. +
- +
-DBG convention for NGS. +
- +
-For transcriptome,​ using k=21. +
-Can use reads as short as 25. +
-How many X coverage needed? +
-How many genes do you want to catch+
- +
-With genome, want long inserts generally. +
-With transcriptome,​ short inserts are more  +
-useful. ​ They have been using 200bp tiny stuff.+
  
-===== End-of-class business ​=====+===== After presentation ​=====
  
-Kevin on Wed. will be talking about:+Kevin on Wednesday ​will be talking about:
   * Use SOLiD data for ordering and orientation of //Pog// contigs   * Use SOLiD data for ordering and orientation of //Pog// contigs
   * Special purpose scripts   * Special purpose scripts
   * The data was so noisy   * The data was so noisy
lecture_notes/04-19-2010.txt · Last modified: 2010/04/24 14:54 by karplus