lecture_notes:04-09-2010

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-09-2010 [2010/04/11 14:57]
learithe
lecture_notes:04-09-2010 [2010/04/12 14:37] (current)
cbrumbau Changed for reference format, italics, punctuation, spelling, etc.
Line 7: Line 7:
 Take fix mode script from /​projects/​compbio/​bin/​scripts and replace protein user group with BME 235 user group. Take fix mode script from /​projects/​compbio/​bin/​scripts and replace protein user group with BME 235 user group.
  
-Next week will have a reference genome (POG) to use for testing the tools on. +Next week will have a reference genome (//​Pyrobaculum oguniense//,​ aka "//​Pog//"​) to use for testing the tools on. 
-For the most part POG is done; however, there are still some uncertainty with 8 SNPs left. It is definitely past the MIAMI standard at this point. +For the most part //Pog// is done; however, there is still some uncertainty about the 8 remaining SNPs. It is definitely past the MIAMI standard at this point. (//Pog// assembly is down to only 8 SNPs & one potentially variable insert.)
  
-Note about sequencing platform quality scores: most platforms are trying to use the phred quality score((http://​en.wikipedia.org/​wiki/​Phred_quality_score)), so the quality score is comparable between the platforms and runs+Note about sequencing platform quality scores: most platforms are trying to use the Phred quality score[(cite:phred>​[[wp>​Phred quality score]])], so the quality score is theoretically ​comparable between the platforms and runs (although calibration causes scores to vary between runs and instruments nonetheless).
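As a rough illustration of why Phred scores are comparable in principle, a score Q corresponds to an error probability P via Q = -10 log10(P). The sketch below is not from the lecture; the function names are hypothetical, and the ASCII offset of 33 is an assumption (some platforms of this era used an offset of 64).

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to the implied base-call error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score."""
    return -10 * math.log10(p)

def decode_quality_string(quality_string, offset=33):
    """Decode an ASCII-encoded quality string (offset 33 is assumed here)."""
    return [ord(c) - offset for c in quality_string]

# Q20 means roughly a 1-in-100 chance that the base call is wrong.
print(phred_to_error_prob(20))
print(decode_quality_string("I!5"))
```

Because calibration differs between runs and instruments, the implied probabilities are best treated as estimates rather than exact error rates.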
  
 It can be informative,​ once reads are mapped, to look at the quality scores for reads with observed errors. It can be informative,​ once reads are mapped, to look at the quality scores for reads with observed errors.
  
-//Pog// assembly is down to only 8 snps & one potentially variable insert +Lior Pachter (from UC Berkeley) is visiting on Monday to speak about the Bowtie/TopHat/Cufflinks algorithms. (Bowtie: mapping; TopHat/Cufflinks: find splice junctions, predict spliced transcripts. Bowtie is used in a lot of the assembly algorithms.)
  
 ===== Main lecture: Assembler graphs ===== ===== Main lecture: Assembler graphs =====
Line 20: Line 20:
 Types of assembler graphs: Types of assembler graphs:
   * Overlap graph   * Overlap graph
-  * de Bruijn graph+  * de Bruijn graph  ​(pronounced like "De Broin"​)
  
 Differences are "What are the nodes?"​ Differences are "What are the nodes?"​
Line 36: Line 36:
             B             B
 </​code>​ </​code>​
-The problem is the direction of the reads when aligning: +The problem with edges between contig nodes is in defining the direction of the reads when aligning:
   * 4 different edge scenarios:   * 4 different edge scenarios:
     * -> -> (A -> B)     * -> -> (A -> B)
Line 43: Line 43:
     * <- <- (B -> A)     * <- <- (B -> A)
   * 3 different edge types:   * 3 different edge types:
-    * same dir: A to B / B to A +    * Same dir: A to B / B to A 
-    * tail-to-tail: A' to B  +    * Tail-to-tail ​(convergent): A' to B  
-    * head-to-head: A to B'+    * Head-to-head ​(divergent): A to B' 
 + 
 +Need to have some tolerance for error because the reads are noisy. When creating read overlaps, if you require 100% pairing, you’ll miss a lot of data. Also, overlaps include the read ends, where quality falls off, so you need an “overlap quality score”.
 + 
 +Can’t do all-vs-all searches (n<sup>2</sup> algorithms are not a good idea with billions of reads). So how do you choose which reads to try to overlap? Most algorithms do a BLAST-like filter before trying to align edges (~nlogn).
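One way to picture the BLAST-like filter is a shared k-mer index: only read pairs that share at least one exact k-mer are passed on to the expensive alignment step. This is a minimal sketch, not any particular assembler's method; `candidate_overlap_pairs` and the choice of `k` are hypothetical, and reverse-complement matches are ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations

def candidate_overlap_pairs(reads, k=4):
    """Return pairs of read indices that share at least one exact k-mer."""
    # Index every k-mer to the set of reads that contain it.
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    # Any two reads listed under the same k-mer become a candidate pair.
    pairs = set()
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            pairs.add((a, b))
    return pairs

reads = ["ACGTACGG", "TACGGTTA", "CCCCCCCC"]
print(candidate_overlap_pairs(reads))
```

Highly repeated k-mers can still produce very many candidate pairs, so a real filter would likely cap or skip over-represented k-mers.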
 + 
 +Side Note: For transcriptome libraries, if done properly, reads should have known strandedness,​ so it can’t be run through algorithms which make strandedness arbitrary (story about problems with a prominent yeast microarray transcriptome analysis incorrectly finding a lot of “antisense” mRNAs due to library prep error).
  
-Need to have some tolerance for error because the reads are noisy. 
  
 === de Bruijn graphs === === de Bruijn graphs ===
Line 96: Line 101:
 Realistically,​ there are issues: Realistically,​ there are issues:
  
-Spurs:+== End of contig boundaries == 
 + 
 +What if A->B and A->C and A->D //BUT// A->B and A->C are inconsistent with each other? 
 +… A becomes “end of contig”, because you aren’t sure where to go next. 
 +Also end of contig if there are no more edges from the node. 
 + 
 +== Spurs == 
 <​code>​ <​code>​
 kmer -> kmer -> kmer -> kmer -> kmer kmer -> kmer -> kmer -> kmer -> kmer
     \-> kmer -> kmer -> kmer (off to nowhere)     \-> kmer -> kmer -> kmer (off to nowhere)
 </​code>​ </​code>​
 +Path diverges but does not reconverge, resulting in source/sink dead-ends (these are likely due to read errors).
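The spur picture above can be sketched in code as follows. This is only an illustration under stated assumptions: the graph is a plain dict of successor sets, `coverage` holds a per-node read count, `min_coverage` is an arbitrary threshold, and only sink-side dead ends hanging off a branch point are handled.

```python
def trim_spurs(edges, coverage, min_coverage=3):
    """Remove low-coverage dead-end branches (sink side only) from a graph.

    edges: dict mapping node -> set of successor nodes
    coverage: dict mapping node -> read count (hypothetical input)
    """
    nodes = set(edges)
    for successors in edges.values():
        nodes.update(successors)
    succs = {n: set(edges.get(n, ())) for n in nodes}
    preds = {n: set() for n in nodes}
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)

    removed = set()
    for sink in [n for n in nodes if not succs[n]]:
        path, node = [sink], sink
        # Walk backwards along the unbranched chain until a branch point.
        while len(preds[node]) == 1:
            parent = next(iter(preds[node]))
            if len(succs[parent]) != 1:
                break  # parent is where the path diverged
            path.append(parent)
            node = parent
        else:
            continue  # chain does not hang off a branch point; leave it alone
        if sum(coverage[n] for n in path) / len(path) < min_coverage:
            removed.update(path)

    return {u: {v for v in vs if v not in removed}
            for u, vs in succs.items() if u not in removed}

edges = {"a": {"b", "x"}, "b": {"c"}, "c": {"d"}, "d": {"e"},
         "x": {"y"}, "y": {"z"}}
coverage = {"a": 10, "b": 10, "c": 10, "d": 10, "e": 10,
            "x": 1, "y": 1, "z": 1}
print(trim_spurs(edges, coverage))
```

Using coverage rather than branch length as the criterion is a design choice here; a real assembler may combine both.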
 +
 +== Bubbles ==
  
-Collapse bubbles: 
 <​code>​ <​code>​
     /-> kmer -> kmer -> kmer -\     /-> kmer -> kmer -> kmer -\
Line 108: Line 122:
 </​code>​ </​code>​
  
-Other issues:+The path splits due to a SNP but then converges. This can happen with real SNPs, read error SNPs, and real repeats which differ by a SNP or two. 
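To see how a single SNP produces a bubble, the sketch below builds a small de Bruijn graph from two sequences that differ at one base and reports where the path branches and where it merges again. The sequences, `k=4`, and the function names are made up for illustration; real reads would also bring errors and reverse complements.

```python
from collections import defaultdict

def de_bruijn_edges(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    succs = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            succs[kmer[:-1]].add(kmer[1:])
    return succs

def branch_and_merge_nodes(succs):
    """Return (nodes with >1 outgoing edge, nodes with >1 incoming edge)."""
    indegree = defaultdict(int)
    for u, vs in succs.items():
        for v in vs:
            indegree[v] += 1
    branches = sorted(u for u, vs in succs.items() if len(vs) > 1)
    merges = sorted(v for v, d in indegree.items() if d > 1)
    return branches, merges

# Two toy sequences differing by one base (C vs A) in the middle.
graph = de_bruijn_edges(["AACGTCTGGA", "AACGTATGGA"], k=4)
print(branch_and_merge_nodes(graph))
```

The branch node and merge node bracket the bubble; the graph alone does not say whether the SNP is real, a read error, or a near-identical repeat.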
 + 
 +== Loop ==
  
-Loop: 
 <​code>​ <​code>​
 kmer -> kmer -> kmer -> kmer -> kmer -> kmer -> kmer kmer -> kmer -> kmer -> kmer -> kmer -> kmer -> kmer
                             \- kmers <-/                             \- kmers <-/
 </​code>​ </​code>​
 +Tandem repeats will generate a circle, but have edges in and out; hard to disambiguate copy number though. If the data is really clean (i.e. in/out edges are ~10 read-depth with low SD, and inside circle has ~20 read-depth with low SD), we can guess that there might be 2 copies of the repeat, but this is not highly reliable.
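The read-depth guess described above amounts to a simple ratio. A minimal sketch, assuming per-node depths for the flanking edges and for the loop are already available (the function name and inputs are hypothetical):

```python
from statistics import mean

def estimate_repeat_copies(flank_depths, loop_depths):
    """Guess the tandem-repeat copy number from the loop/flank depth ratio."""
    return round(mean(loop_depths) / mean(flank_depths))

# Flanks at ~10x and loop at ~20x suggest about 2 copies.
print(estimate_repeat_copies([10, 9, 11], [20, 21, 19]))
```

As the notes say, this is only plausible when the depths have low SD, and even then it is not highly reliable.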
  
-Take the loop?+== Multiple paths ==
  
-Multiple paths: 
 <​code>​ <​code>​
 A                      B A                      B
Line 132: Line 147:
 Largest bias usually comes from PCR for amplification. Largest bias usually comes from PCR for amplification.
  
-Need to collapse the graph (both overlap and de Bruijn) to assemble the reads.+=== Assembly === 
 + 
 +Algorithms ​(both overlap and de Bruijn) ​need to collapse bubbles and trim spurs.\\ 
 +Spurs: Discard if their read count is low.\\ 
 +Bubbles: Tricky, because they can represent real, divergent paths. 
 + 
 +===== References ===== 
 +<​refnotes>​notes-separator:​ none</​refnotes>​ 
 +~~REFNOTES cite~~
lecture_notes/04-09-2010.1271023043.txt.gz · Last modified: 2010/04/11 14:57 by learithe