lecture_notes:03-30-2011

Differences

This shows you the differences between two versions of the page.

lecture_notes:03-30-2011 [2011/03/30 23:05]
eyliaw created
lecture_notes:03-30-2011 [2011/04/01 19:20] (current)
svohr [Coverage] slight corrections
Line 5:

 Inputs: Sequencing data from various machines. Some of the characteristics of these machines/techniques:
-  * Sanger capillary
-  * 454
-  * SoLiD
-  * Illumina
-  * *
-  * Ion Torrent
-  * Pac Bio
 +==== Sanger capillary ====
 +  * ~800bp reads[(cite:wikisanger>http://en.wikipedia.org/wiki/Microfluidic_Sanger_sequencing)].
 +  * Q (quality value) ~30
 +  * ~$1/read, expensive because primers must be attached to each read.
 +==== 454 ====
 +  * ~400bp reads[(cite:wiki454>http://en.wikipedia.org/wiki/454_Life_Sciences)].
 +  * Pyrosequencing
 +  * Q ~20 
 +  * $5000/​run/​1M reads, no downscaling (numbers approximate). 
 +==== SoLiD ==== 
 +  * 2x25bp or 1x50bp reads 
 +  * Paired end reads: ​ ligation with adapter, cleaves 25bp from adapter using restriction enzyme. 
 +  * Potential for double ligation: two unrelated sequences ligating. 
 +  * $2000/​run/​100M reads (numbers approximate). 
 +==== Illumina ​==== 
 +  * 2x50bp or 2x100bp reads (?)
 +  * Paired end reads
 +  * Potential errors: innies (ligated region not between sequenced regions) or chimeric (sequence passes ligated region) 
 +  * Cheaper than SoLiD, 10K Genomes project uses it. 
 +==== Ion Torrent ​==== 
 +  * 2x100 base pairs 
 +  * ~50,000 to 5,000,000 reads depending on Sequencing Chip [(cite:​ionTorrent>​http://​www.iontorrent.com/​technology-how-does-it-perform/​)]. 
 +  * Ion semiconductor sequencing. No optics or modified bases are required. 
 +==== Pac Bio ==== 
 +  * Very long, single-molecule reads (~10 Kbp)
 +  * High error rates (~5%) 
 +  * Useful when mapping to a reference. 
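
The approximate prices quoted above can be put on a common per-read and per-base footing. The following is a minimal sketch, not part of the lecture: the dictionary layout and function names are made up for illustration, Sanger is treated as a "run" of one read at ~$1, and a SoLiD read is assumed to yield 50bp in total (2x25bp or 1x50bp).

```python
# Approximate figures taken from the notes above (all numbers are rough).
# Assumptions: Sanger is modelled as a one-read "run"; a SoLiD read yields
# 50bp in total, whether sequenced as 2x25bp or 1x50bp.
PLATFORMS = {
    "Sanger": {"cost_per_run": 1.0, "reads_per_run": 1, "read_length_bp": 800},
    "454": {"cost_per_run": 5000.0, "reads_per_run": 1_000_000, "read_length_bp": 400},
    "SoLiD": {"cost_per_run": 2000.0, "reads_per_run": 100_000_000, "read_length_bp": 50},
}


def cost_per_read(platform):
    """Dollars per read: run cost divided by reads per run."""
    return platform["cost_per_run"] / platform["reads_per_run"]


def cost_per_base(platform):
    """Dollars per sequenced base: cost per read divided by read length."""
    return cost_per_read(platform) / platform["read_length_bp"]


if __name__ == "__main__":
    for name, platform in PLATFORMS.items():
        print(f"{name}: ${cost_per_read(platform):.2g}/read, "
              f"${cost_per_base(platform):.2g}/base")
```

Under these assumptions the per-base cost drops by roughly two orders of magnitude from Sanger to 454, and again from 454 to SoLiD. The trade-off is much shorter reads.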
 +===== Coverage ===== 
 +We briefly discussed how much sequence data would be required to assemble the genome. First, we considered the probability of seeing a particular base ''​i''​ in a single read ''​j''​. 
 +   
 +  P( seeing base i in read j ) = L/G 
 + 
 +where ''​L''​ is the read length and ''​G''​ is the total size of the genome. If we have ''​R''​ reads, then  
 + 
 +  P( never seeing base i ) = (1 - L/G)^R 
 + 
 +We can multiply ''L/G'' by ''R/R'' to get ''((L*R) / G) / R'', or ''C / R'', where ''C = (L*R) / G'' is our coverage of the genome. Holding ''C'' fixed, we take the limit of ''(1 - C/R)^R'' as
 +''R'' goes to infinity:
 + 
 +  lim R->inf (1 - C/R)^R = e^-C
 + 
 +Thus we can expect to miss ''​G*e^-C''​ bases. 
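
As a quick numerical check of the derivation above, the exact expression ''G*(1 - L/G)^R'' can be compared with the limiting form ''G*e^-C''. This is a sketch only: the function name and the example genome size, read length, and read count are illustrative choices, not values from the lecture.

```python
import math


def expected_missed_bases(genome_size, read_length, num_reads):
    """Return (coverage, exact, approx) for the expected number of unseen bases.

    exact  uses G * (1 - L/G)^R
    approx uses the limiting form G * e^(-C), with C = L*R/G
    """
    coverage = read_length * num_reads / genome_size
    exact = genome_size * (1 - read_length / genome_size) ** num_reads
    approx = genome_size * math.exp(-coverage)
    return coverage, exact, approx


if __name__ == "__main__":
    # Hypothetical example: 1 Mbp genome, 100bp reads, 50,000 reads (C = 5).
    coverage, exact, approx = expected_missed_bases(1_000_000, 100, 50_000)
    print(f"C = {coverage}: exact = {exact:.1f}, approx = {approx:.1f}")
```

Because ''L/G'' is tiny for any realistic genome, the two values agree closely. This is why the ''e^-C'' form is normally used directly.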
 + 
 +We cannot assemble an entire chromosome if we are missing bases. However, we can construct contiguous stretches of bases or //contigs// and later 
 +assemble them into //​scaffolds//​ using other information,​ such as long distance physical maps. 
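
To see how missing bases break an assembly into contigs, one can place reads uniformly at random on a linear genome and merge the covered intervals. This is a toy simulation under stated assumptions, not an assembler: reads are error-free, their positions are known, and two reads are joined whenever no base is missing between them (a real assembler needs an actual overlap). The function name and parameters are hypothetical.

```python
import random


def simulate_contigs(genome_size, read_length, num_reads, seed=0):
    """Place reads at random and return (contigs, missed_bases).

    contigs is a list of [start, end) intervals of covered bases.
    """
    rng = random.Random(seed)
    # Each read must fit entirely inside the linear genome.
    starts = sorted(
        rng.randrange(genome_size - read_length + 1) for _ in range(num_reads)
    )
    contigs = []
    for start in starts:
        end = start + read_length
        if contigs and start <= contigs[-1][1]:
            # No uncovered base between this read and the current contig.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            # A gap of at least one base: start a new contig.
            contigs.append([start, end])
    covered = sum(end - start for start, end in contigs)
    return contigs, genome_size - covered


if __name__ == "__main__":
    # Hypothetical example: 10 kbp genome, 100bp reads, 5x coverage.
    contigs, missed = simulate_contigs(10_000, 100, 500, seed=1)
    print(f"{len(contigs)} contigs, {missed} bases never sequenced")
```

Rerunning with more reads (higher ''C'') should leave fewer uncovered bases and fewer, longer contigs. The gaps that remain are what scaffolding with long-range information has to bridge.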
 + 
 + 
 + 
 +===== References ===== 
 +<​refnotes>​notes-separator:​ none</​refnotes>​ 
 +~~REFNOTES cite~~
lecture_notes/03-30-2011.1301526310.txt.gz · Last modified: 2011/03/30 23:05 by eyliaw