Take the fix mode script from /projects/compbio/bin/scripts and replace the protein user group with the BME 235 user group.
  
Next week we will have a reference genome (//Pyrobaculum oguniense//, aka "//Pog//") to use for testing the tools on.
For the most part //Pog// is done; however, there is still some uncertainty with 8 SNPs left. It is definitely past the MIAMI standard at this point. (//Pog// assembly is down to only 8 SNPs & one potentially variable insert.)
  
Note about sequencing platform quality scores: most platforms are trying to use the Phred quality score[(cite:phred>[[wp>Phred quality score]])], so the quality score is theoretically comparable between platforms and runs (although calibration causes scores to vary between runs and instruments nonetheless).
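As a reminder of what the score means, Q = -10 log<sub>10</sub>(P(error)). A minimal Python sketch (the ASCII offset of 33 is the Sanger/Illumina 1.8+ FASTQ convention; the example quality string is made up):

<code python>
import math

def phred_from_error_prob(p):
    """Convert an error probability into a Phred quality score: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)

def error_prob_from_phred(q):
    """Invert the Phred relationship: p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def decode_fastq_quality(qual_string, offset=33):
    """Decode a FASTQ quality string (Sanger/Illumina 1.8+ use an ASCII offset of 33)."""
    return [ord(c) - offset for c in qual_string]

# Q20 ~ 1% error probability, Q30 ~ 0.1% error probability
print(round(phred_from_error_prob(0.01)))   # 20
print(error_prob_from_phred(30))            # 0.001
print(decode_fastq_quality("II?5+"))        # [40, 40, 30, 20, 10]
</code>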
  
It can be informative, once reads are mapped, to look at the quality scores for reads with observed errors.
  
Lior Pachter (from UC Berkeley) is visiting on Monday to speak about the Bowtie/TopHat/Cufflinks algorithms. (Bowtie: mapping; TopHat/Cufflinks: find splice junctions and predict spliced transcripts. Bowtie is used in a lot of the assembly algorithms.)
  
===== Main lecture: Assembler graphs =====
    * <- <- (B -> A)
  * 3 different edge types (see the sketch after this list):
    * Same dir: A to B / B to A
    * Tail-to-tail (convergent): A' to B
    * Head-to-head (divergent): A to B'
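As a small illustration of how these can be recorded on overlap edges (names and numbers are made up, not from any particular assembler):

<code python>
from enum import Enum

class EdgeType(Enum):
    SAME_DIR = "same direction"        # A to B / B to A
    TAIL_TO_TAIL = "convergent"        # A' to B
    HEAD_TO_HEAD = "divergent"         # A to B'

# An overlap edge just records the two reads, the edge type, and an overlap length.
edges = [
    ("read_17", "read_42", EdgeType.SAME_DIR, 35),
    ("read_17", "read_99", EdgeType.TAIL_TO_TAIL, 28),
    ("read_42", "read_99", EdgeType.HEAD_TO_HEAD, 31),
]

for a, b, etype, olen in edges:
    print(f"{a} -- {b}: {etype.value} overlap of {olen} bp")
</code>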
  
Need to have some tolerance for error because the reads are noisy. When creating read overlaps, if you require 100% pairing, you’ll miss a lot of data. Also, overlaps include the read ends, where quality falls off, so you need an “overlap quality score”.
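One simple way to build such a score (a hypothetical scheme, not a specific assembler's formula) is to weight each overlap position by its base qualities, so disagreements at low-quality read ends cost less:

<code python>
def overlap_quality(seq_a, qual_a, seq_b, qual_b):
    """
    Score an ungapped overlap between two read regions. Matches add the minimum
    base quality; mismatches subtract it, so disagreements at low-quality
    positions (typically the read ends) are penalized less.
    Illustrative scoring only, not a particular assembler's formula.
    """
    score = 0
    for a, qa, b, qb in zip(seq_a, qual_a, seq_b, qual_b):
        q = min(qa, qb)
        score += q if a == b else -q
    return score

# Two 10 bp overlap regions that disagree only at a low-quality final base.
a = "ACGTACGTAA";  qa = [38, 38, 37, 36, 35, 30, 25, 20, 10, 5]
b = "ACGTACGTAC";  qb = [39, 38, 38, 36, 34, 31, 26, 19, 12, 4]
print(overlap_quality(a, qa, b, qb))   # high score despite the one end mismatch
</code>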
  
Can’t do all-vs-all searches (n<sup>2</sup> algorithms are not a good idea with billions of reads). So how do you search what to overlap? Most algorithms do a BLAST-like filter before trying to align edges (~n log n).
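A minimal sketch of that filtering idea, loosely in the spirit of BLAST seeding (the value of k and the reads are made up): index reads by k-mers and only align pairs that share a seed, instead of all n<sup>2</sup> pairs:

<code python>
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k=5):
    """Return read-index pairs that share at least one k-mer (candidates for alignment)."""
    index = defaultdict(set)                 # k-mer -> set of read indices containing it
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            index[read[j:j + k]].add(i)
    pairs = set()
    for hits in index.values():
        for a, b in combinations(sorted(hits), 2):
            pairs.add((a, b))
    return pairs

reads = ["ACGTACGTTGCA", "CGTTGCAGGTAC", "TTTTTTTTTTTT", "GCAGGTACCCGA"]
print(candidate_pairs(reads, k=5))           # only reads sharing a 5-mer are paired
</code>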
  
Side Note: For transcriptome libraries, if done properly, reads should have known strandedness, so they can’t be run through algorithms which make strandedness arbitrary (story about problems with a prominent yeast microarray transcriptome analysis incorrectly finding a lot of “antisense” mRNAs due to a library prep error).
  
  
  
Realistically, there are issues:

== End of contig boundaries ==

What if A->B and A->C and A->D //BUT// A->B and A->C are inconsistent with each other?
… A becomes “end of contig”, because you aren’t sure where to go next.
Also end of contig if there are no more edges from the node.
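A minimal sketch of this stopping rule (the graph representation is illustrative; here any branching is treated as inconsistent, which is a simplification):

<code python>
def is_contig_end(graph, node):
    """
    A node ends a contig if it has no outgoing edges, or if it has several
    outgoing edges and you aren't sure which one to follow next.
    `graph` maps node -> list of successor nodes.
    """
    successors = graph.get(node, [])
    return len(successors) == 0 or len(successors) > 1

def walk_contig(graph, start):
    """Follow unambiguous edges from `start` until an end-of-contig node is hit."""
    contig, seen = [start], {start}
    node = start
    while not is_contig_end(graph, node):
        node = graph[node][0]
        if node in seen:                  # guard against cycles (see "Loop" below)
            break
        contig.append(node)
        seen.add(node)
    return contig

graph = {"A": ["B"], "B": ["C"], "C": ["D", "E"]}   # C branches, so the contig stops there
print(walk_contig(graph, "A"))                      # ['A', 'B', 'C']
</code>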

== Spurs ==
<code>
kmer -> kmer -> kmer -> kmer -> kmer
    \-> kmer -> kmer -> kmer (off to nowhere)
</code>
Path diverges but does not reconverge, resulting in source/sink dead-ends (these are likely due to read errors).
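A minimal sketch of spotting such tips (illustrative k-mer graph representation, not any particular assembler's): a short branch off a junction that dead-ends is treated as a spur:

<code python>
def find_spurs(graph, max_len=3):
    """
    Find spur (tip) branches: short paths that leave a branching node and dead-end
    without reconverging. `graph` maps each k-mer node to its list of successors.
    Branches longer than max_len are assumed to be real sequence, not read errors.
    """
    spurs = []
    for node, succs in graph.items():
        if len(succs) < 2:
            continue                          # spurs hang off branching nodes
        for start in succs:
            path, cur = [start], start
            while len(graph.get(cur, [])) == 1 and len(path) <= max_len:
                cur = graph[cur][0]
                path.append(cur)
            if not graph.get(cur):            # dead-ends within max_len steps: a spur
                spurs.append((node, path))
    return spurs

# Main path k1..k7 with a short branch off k2 that goes nowhere (likely a read error).
graph = {"k1": ["k2"], "k2": ["k3", "x1"], "k3": ["k4"], "k4": ["k5"],
         "k5": ["k6"], "k6": ["k7"], "k7": [],
         "x1": ["x2"], "x2": []}
print(find_spurs(graph))                      # [('k2', ['x1', 'x2'])]
</code>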

== Bubbles ==
<code>
    /-> kmer -> kmer -> kmer -\
kmer                           kmer
    \-> kmer -> kmer -> kmer -/
</code>
  
The path splits due to a SNP but then converges. This can happen with real SNPs, read error SNPs, and real repeats which differ by a SNP or two.
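A minimal sketch of detecting the simplest kind of bubble (illustrative representation only): a node whose two branches reconverge at a common node:

<code python>
def chain_from(graph, node, limit=20):
    """Follow single-exit nodes from `node`, returning the chain of nodes visited."""
    chain = [node]
    while len(graph.get(node, [])) == 1 and len(chain) < limit:
        node = graph[node][0]
        chain.append(node)
    return chain

def find_simple_bubbles(graph):
    """
    Find the simplest bubbles: a node with two outgoing branches whose linear
    paths meet again at a common node. Real assemblers handle messier cases;
    this only shows the idea. `graph` maps node -> list of successors.
    """
    bubbles = []
    for node, succs in graph.items():
        if len(succs) != 2:
            continue
        chain_a = chain_from(graph, succs[0])
        chain_b = set(chain_from(graph, succs[1]))
        merge = next((n for n in chain_a if n in chain_b), None)
        if merge is not None:
            bubbles.append((node, merge))     # bubble from `node` to the merge point
    return bubbles

# Two paths from 'k1' (differing by a SNP) that reconverge at 'k5'.
graph = {"k1": ["a2", "b2"], "a2": ["a3"], "b2": ["b3"],
         "a3": ["k5"], "b3": ["k5"], "k5": ["k6"], "k6": []}
print(find_simple_bubbles(graph))             # [('k1', 'k5')]
</code>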

== Loop ==
<code>
kmer -> kmer -> kmer -> kmer -> kmer -> kmer -> kmer
                            \- kmers <-/
</code>
Tandem repeats will generate a circle, but have edges in and out; hard to disambiguate copy number though. If the data is really clean (i.e. the in/out edges are at ~10 read-depth with low SD, and the inside of the circle is at ~20 read-depth with low SD), we can guess that there might be 2 copies of the repeat, but this is not highly reliable.
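A minimal sketch of that coverage-ratio reasoning, using the made-up depths from the note (~10x on the flanks, ~20x inside the loop):

<code python>
from statistics import mean, stdev

def estimate_repeat_copies(flank_depths, loop_depths, max_cv=0.2):
    """
    Guess the tandem-repeat copy number from read depth: depth inside the loop
    divided by depth on the in/out edges. Only report a guess when both depth
    sets are "clean" (coefficient of variation below max_cv); even then it is
    not highly reliable.
    """
    flank_mean, loop_mean = mean(flank_depths), mean(loop_depths)
    if stdev(flank_depths) / flank_mean > max_cv or stdev(loop_depths) / loop_mean > max_cv:
        return None                            # too noisy to guess
    return round(loop_mean / flank_mean)

# ~10x on the in/out edges, ~20x inside the circle -> guess 2 copies.
flanks = [9, 10, 11, 10, 10]
loop = [19, 21, 20, 22, 18]
print(estimate_repeat_copies(flanks, loop))    # 2
</code>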

== Multiple paths ==
<code>
A                      B
</code>
Largest bias usually comes from PCR for amplification.
  
=== Assembly ===

Algorithms (both overlap and de Bruijn) need to collapse bubbles and trim spurs.\\
Spurs: Discard if their read count is low.\\
Bubbles: Tricky, because they can represent real, divergent paths.
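A minimal sketch of the spur rule above (the read counts and the threshold are made up): keep a spur only if enough reads support it, since a well-supported dead end may be real sequence rather than a read error.

<code python>
def trim_spurs(spur_read_counts, min_reads=3):
    """
    Decide which spurs to discard: a spur supported by fewer than `min_reads`
    reads is treated as read error and removed; better-supported spurs are kept.
    Bubbles are not handled this way, because both arms can be real (divergent) paths.
    """
    keep, discard = [], []
    for spur, count in spur_read_counts.items():
        (discard if count < min_reads else keep).append(spur)
    return keep, discard

# Hypothetical spurs with the number of reads supporting each one.
counts = {"spur_1": 1, "spur_2": 2, "spur_3": 7}
kept, dropped = trim_spurs(counts, min_reads=3)
print("kept:", kept)        # kept: ['spur_3']
print("dropped:", dropped)  # dropped: ['spur_1', 'spur_2']
</code>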
  
===== References =====
<refnotes>notes-separator: none</refnotes>
~~REFNOTES cite~~