User Tools

Site Tools


lecture_notes:04-19-2010

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Last revision Both sides next revision
lecture_notes:04-19-2010 [2010/04/24 02:26]
cbrumbau Started to merge notes (incomplete), cleaned up formatting, filled in missing notes from presentation
lecture_notes:04-19-2010 [2010/04/24 08:08]
cbrumbau Merged rest of notes, expanded some acronyms
Line 23: Line 23:
   * Use whole-read alignments   * Use whole-read alignments
  
-=== Quick OLC example ===+=== Quick Overlap-Layout-Consensus (OLCexample ===
   * Filter out all the reads that are identical or self-contained in another read. Only left-to-right,​ etc. are allowed.   * Filter out all the reads that are identical or self-contained in another read. Only left-to-right,​ etc. are allowed.
   * Compute all their overlaps.   * Compute all their overlaps.
Line 34: Line 34:
   * Have to compare all reads against each other.   * Have to compare all reads against each other.
     * Quadratic. This has become difficult.     * Quadratic. This has become difficult.
-  * NGS reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.+  * Next generation sequencing ​reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
   * OLC method uses all information in the read at least.   * OLC method uses all information in the read at least.
  
Line 49: Line 49:
     * Big dataset. Typically a binary marker to say if present.     * Big dataset. Typically a binary marker to say if present.
   * Sometimes at branch have two possible extensions.   * Sometimes at branch have two possible extensions.
-  * Greedy approach essentially ​De Bruijn graph with local info only. In this structure you could have many paths through.+  * Greedy approach essentially ​de Bruijn graph with local info only. In this structure you could have many paths through.
   * Can throw in reads of different lengths with no problems.   * Can throw in reads of different lengths with no problems.
  
Line 117: Line 117:
   * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleaseance et al.).   * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleaseance et al.).
     * Map to reference or similar genome, then re-map locally some stuff that is unique.     * Map to reference or similar genome, then re-map locally some stuff that is unique.
-    * MAQ and BWA only allow 3 or 4 mismatches.+    * MAQ (Mapping and Assembly with Quality) ​and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches.
     * Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangments.     * Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangments.
   * Code parallelization (Birol et al., Jeffrey Cook).   * Code parallelization (Birol et al., Jeffrey Cook).
     * Only ABySS has published yet     * Only ABySS has published yet
   * Improved indexing (Zamin Iqbal and Mario Caccamo).   * Improved indexing (Zamin Iqbal and Mario Caccamo).
-    * Fit WGS onto 200GB+    * Fit whole genome shotgun ​onto 200GB
   * Use of intermediate remapping (Matthias Haimei).   * Use of intermediate remapping (Matthias Haimei).
     * Wanted to parallelize the second level work of building scaffolds, but those are mostly engineering issues.     * Wanted to parallelize the second level work of building scaffolds, but those are mostly engineering issues.
Line 135: Line 135:
   * Played with the length   * Played with the length
   * The longer your long reads are, the more bridgeable it is and the longer the contigs.   * The longer your long reads are, the more bridgeable it is and the longer the contigs.
-    * "Why not just assemble the long reads?" ​someone might ask+    * Someone might ask: "Why not just assemble the long reads?"​
  
 === Different approaches to repeat resolution === === Different approaches to repeat resolution ===
Line 201: Line 201:
   * When using different technologies,​ you have to take into account different technologies   * When using different technologies,​ you have to take into account different technologies
     * Trivial when doing OLC assembly     * Trivial when doing OLC assembly
-    * Much more tricky when doing De Bruijn graph assemble, since k-mers are not assigned to reads+    * Much more tricky when doing de Bruijn graph assemble, since k-mers are not assigned to reads
     * Different assemblers had different settings (e.g. Newbler)     * Different assemblers had different settings (e.g. Newbler)
   * 454 had homopolymer issues in flow-space.   * 454 had homopolymer issues in flow-space.
Line 221: Line 221:
  
 === De Brujin graphs ~ splice graphs === === De Brujin graphs ~ splice graphs ===
-  * Pavel Pevzner mentioned that De Bruijn graphs are more or less like splice graphs+  ​* People are using next generation sequencing generally because it's cheaper. 
 +  * Would have just one path except for alternate splicing. 
 +  ​* Pavel Pevzner mentioned ​in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply those algorithms developed for splice graphs to de Bruijn graphs.
  
 === Oases: Process === === Oases: Process ===
Line 227: Line 229:
     * Connecting reads     * Connecting reads
     * Connecting read pairs     * Connecting read pairs
 +    * The genetic locus should create a connected locus, cluster of contigs. ​
 +    * Sometimes one cluster gets broken into two, but works pretty well.
   * Re-implement traditional algorithms to run on graphs, instead of a reference genome   * Re-implement traditional algorithms to run on graphs, instead of a reference genome
     * Greedy transitive reduction (Myers, 2005)     * Greedy transitive reduction (Myers, 2005)
     * Motif searches     * Motif searches
     * Dynamic assembly of transcripts (Lee, 2003)     * Dynamic assembly of transcripts (Lee, 2003)
 +      * find the path of highest weight
  
 === Oases: Output === === Oases: Output ===
   * Full length transcripts - handles alternative splicing.   * Full length transcripts - handles alternative splicing.
 +    * With skipped exon and one artifact.
   * Transcripts from different genes - handles varying coverage levels.   * Transcripts from different genes - handles varying coverage levels.
   * Alternative splicing events: when possible, distinguished between common AS events.   * Alternative splicing events: when possible, distinguished between common AS events.
 +    * Find donor and acceptor sites on or around junctions.
  
 === Oases: Results === === Oases: Results ===
Line 245: Line 252:
 === Oases: results of Drosophila PE-reads === === Oases: results of Drosophila PE-reads ===
   * The number of transcripts and loci are in the ballpark of what they were expecting.   * The number of transcripts and loci are in the ballpark of what they were expecting.
 +  * rl 36, k21 ins size 200
 +  * Map to genome ~90%
 +
 +=== Oases: examples ===
 +  * Found an unannotated gene with 7 exons
 +  * Assembler has no notion of the reference genome, but still found one example that mapped very well.
  
 === Mapping vs. de novo === === Mapping vs. de novo ===
   * Mapping   * Mapping
-    * Many tools available+    * Many good mature ​tools available
     * Easy to compute     * Easy to compute
-    * Simpler to analyze+    * Simpler to analyze, comes with genetic coords
   * //De novo//   * //De novo//
 +    * Harder than mapping
 +    * May require special hardware
     * No reference needed     * No reference needed
     * Handles novel structures without //a priori//     * Handles novel structures without //a priori//
 +  * How to get the best of both worlds?
  
 === Columbus: Velvet module === === Columbus: Velvet module ===
   * Solution: the reference-based de Bruijn graph   * Solution: the reference-based de Bruijn graph
-    * Preserves the structure of the reference sequences +    ​* Just throw in the reference genome as long reads. 
-    * Integrates alignment information (SAM, BAM file)+    * Chucking in long reads and getting short contigs. 
 +    ​* Preserves the structure of the reference sequences. Not allowed to overlap with self. 
 +      * What they wanted really was to start from a reference genome, then handle it as a little insertion. 
 +    * Integrates alignment information (SAM, BAM file with BWA, Bowtie)
     * Edits, breaks up, reconnects, and extends the "​alignment contigs"​ as necessary     * Edits, breaks up, reconnects, and extends the "​alignment contigs"​ as necessary
 +  * Applications
 +    * Comparative transcriptome assembly
 +    * Cancer RNAseq
 +    * Comparative assembly
  
-===== Galt's Notes ===== +=== De Bruijn ​graph convention ​for next generation sequencing ​=== 
- +  ​For transcriptome,​ using k=21. 
-==== OASES ==== +  ​* ​Can use reads as short as 25. 
- +  ​* ​How many X coverage needed? 
-De-novo Transcriptome Assembly. +  ​* ​How many genes do you want to catch? 
-People are using NGS generally because it's cheaper. +  ​* ​With genome, want long inserts generally. 
-Would have just one path except for alternate splicing. +  ​* ​With transcriptome,​ short inserts are more useful. They have been using 200bp tiny stuff.
-Pevzner in 2002 mentioned DBG is equivalent to a +
-splice-graph.  So can apply those algorithms developed ​for  +
-Splice-graphs to DBG. +
- +
-=== Oases Process ​=== +
-The genetic locus should create a connected locus, cluster of contigs.  +
-Sometimes one cluster gets broken into two, but works pretty well. +
- +
-Dynamic assembly, find the path of highest weight. +
- +
-=== Oases Output === +
-With skipped exon, and one artifact. +
-Handles contigs with different coverage levels. +
-Find donor and acceptor sites on or around junctions. +
- +
-Drosophila reads. +
-rl 36, k21 ins size 200. +
-Map to genome 90% +
-Found an unannotated gene with 7 exons. +
- +
-Assembler has no notion of the reference genome, +
-but still found one example that mapped very well. +
- +
- +
-==== COLUMBUS ==== +
-Mapping.  +
-Either map, or do de-novo. +
-Many good mature mapping tools. +
-Easier to analyze, comes with genomic coords. +
-De-novo harder. ​ May require special hardware. +
-Works without reference. ​ Handles novel structures. +
- +
-How to get the best of both worlds? +
-Just throw in the ref genome as long reads. +
-Chucking in long reads and getting short contigs. +
-What they wanted really was to start from a  +
-reference genome. ​ Then handle it as a little insertion. +
-Takes DBG and SAM/BAM (BWA, bowtie) +
-Keeps ref sequences. ​ Not allowed to overlap with self. +
-Starting with comparative transcriptome assembly. +
-Would like to ? cancer. +
- +
-DBG convention for NGS. +
- +
-For transcriptome,​ using k=21. +
-Can use reads as short as 25. +
-How many X coverage needed? +
-How many genes do you want to catch+
- +
-With genome, want long inserts generally. +
-With transcriptome,​ short inserts are more  +
-useful. ​ They have been using 200bp tiny stuff.+
  
-===== End-of-class business ​=====+===== After presentation ​=====
  
-Kevin on Wed. will be talking about:+Kevin on Wednesday ​will be talking about:
   * Use SOLiD data for ordering and orientation of //Pog// contigs   * Use SOLiD data for ordering and orientation of //Pog// contigs
   * Special purpose scripts   * Special purpose scripts
   * The data was so noisy   * The data was so noisy
lecture_notes/04-19-2010.txt · Last modified: 2010/04/24 14:54 by karplus