This shows you the differences between two versions of the page.
lecture_notes:03-30-2011 [2011/03/30 23:05] eyliaw created |
lecture_notes:03-30-2011 [2011/04/01 19:20] (current) svohr [Coverage] slight corrections |
||
---|---|---|---|
Line 5: | Line 5: | ||
Inputs: Sequencing data from various machines. Some of the characteristics of these machines/techniques: | Inputs: Sequencing data from various machines. Some of the characteristics of these machines/techniques: | ||
- | * Sanger capillary | + | ==== Sanger capillary ==== |
- | * 454 | + | * ~800bp reads[(cite:wikisanger>http://en.wikipedia.org/wiki/Microfluidic_Sanger_sequencing)]. |
- | * SoLiD | + | * Q (quality value) ~30 |
- | * Illumina | + | * ~$1/read, expensive because primers must be attached to each read. |
- | * * | + | ==== 454 ==== |
- | * Ion Torrent | + | * ~400bp reads[(cite:wiki454>http://en.wikipedia.org/wiki/454_Life_Sciences)]. |
- | * Pac Bio | + | * Pyrosequencing |
+ | * Q ~20 | ||
+ | * ~$5000 per run of ~1M reads; runs cannot be scaled down (numbers approximate). | ||
+ | ==== SOLiD ==== | ||
+ | * 2x25bp or 1x50bp reads | ||
+ | * Paired-end reads: the fragment is ligated to an adapter, and a restriction enzyme then cleaves 25bp from the adapter. | ||
+ | * Potential for double ligation: two unrelated sequences can ligate together. | ||
+ | * ~$2000 per run of ~100M reads (numbers approximate). | ||
+ | ==== Illumina ==== | ||
+ | * 2x50bp or 2x100bp reads (?) | ||
+ | * Paired-end reads | ||
+ | * Potential errors: innies (the ligated region is not between the sequenced regions) or chimeric reads (the read passes through the ligated region). | ||
+ | * Cheaper than SOLiD; the 10K Genomes project uses it. | ||
+ | ==== Ion Torrent ==== | ||
+ | * 2x100 base pairs | ||
+ | * ~50,000 to 5,000,000 reads depending on Sequencing Chip [(cite:ionTorrent>http://www.iontorrent.com/technology-how-does-it-perform/)]. | ||
+ | * Ion semiconductor sequencing. No optics or modified bases are required. | ||
+ | ==== Pac Bio ==== | ||
+ | * Very long, single-molecule reads (~10Kbp) | ||
+ | * High error rates (~5%) | ||
+ | * Useful when mapping to a reference. | ||
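
To compare the approximate prices listed above for Sanger capillary, 454, and SOLiD, it helps to convert them to a cost per sequenced base. The following is a rough sketch only: the dollar figures and read lengths are the approximate ones from these notes, the function name ''cost_per_base'' is made up for illustration, and a SOLiD read is assumed to contribute 50bp (either 2x25bp or 1x50bp). Error rates and read usefulness are ignored.

```python
# Approximate figures from the notes: (cost in dollars, number of reads, bases per read).
# These are lecture-note estimates, not vendor pricing.
platforms = {
    "Sanger capillary": (1.0, 1, 800),
    "454": (5000.0, 1_000_000, 400),
    "SOLiD": (2000.0, 100_000_000, 50),  # assumption: 50bp per read (2x25 or 1x50)
}


def cost_per_base(cost, reads, bases_per_read):
    """Dollars spent per sequenced base (hypothetical helper)."""
    return cost / (reads * bases_per_read)


for name, (cost, reads, bases) in platforms.items():
    print(f"{name}: ${cost_per_base(cost, reads, bases):.2e} per base")
```

Under these assumptions each step from Sanger to 454 to SOLiD lowers the cost per base by well over an order of magnitude, at the price of shorter or lower-quality reads.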
+ | ===== Coverage ===== | ||
+ | We briefly discussed how much sequence data would be required to assemble the genome. First, we considered the probability of seeing a particular base ''i'' in a single read ''j''. | ||
+ | |||
+ | P( seeing base i in read j ) = L/G | ||
+ | |||
+ | where ''L'' is the read length and ''G'' is the total size of the genome. If we have ''R'' reads, each placed independently of the others, then | ||
+ | |||
+ | P( never seeing base i ) = (1 - L/G)^R | ||
+ | |||
+ | We can multiply ''L/G'' by ''R/R'' to get ''((L*R) / G) / R'', or ''C / R'', where ''C = (L*R) / G'' is our coverage of the genome. Holding ''C'' fixed, we take the limit of ''(1 - C/R)^R'' as ''R'' goes to infinity: | ||
+ | |||
+ | lim R->inf (1 - C/R)^R = e^-C | ||
+ | |||
+ | Each base is therefore missed with probability about ''e^-C'', so across all ''G'' bases we can expect to miss ''G*e^-C'' bases. | ||
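
As a quick numerical check of the derivation, the sketch below compares the exact expression ''(1 - L/G)^R'' with the limit ''e^-C'' and computes the expected number of missed bases. The genome size, read length, and read count are hypothetical values chosen only for illustration, and the function names are not from the lecture.

```python
import math


def coverage(read_length, num_reads, genome_size):
    """Coverage C = (L * R) / G."""
    return read_length * num_reads / genome_size


def p_never_seen_exact(read_length, num_reads, genome_size):
    """Exact probability that a given base is in none of the R reads: (1 - L/G)^R."""
    return (1.0 - read_length / genome_size) ** num_reads


def expected_missed_bases(genome_size, cov):
    """Expected number of uncovered bases using the limit: G * e^-C."""
    return genome_size * math.exp(-cov)


# Hypothetical example: a 3 Mbp genome sequenced with 60,000 reads of 400bp.
G = 3_000_000
L = 400
R = 60_000

C = coverage(L, R, G)
exact = p_never_seen_exact(L, R, G)
approx = math.exp(-C)

print(f"coverage C           = {C}")
print(f"(1 - L/G)^R          = {exact:.6e}")
print(f"e^-C                 = {approx:.6e}")
print(f"expected missed bases = {expected_missed_bases(G, C):.1f}")
```

For realistic values of ''R'' the exact and limiting expressions agree closely, which is why the simpler ''e^-C'' form is used. Because the missed fraction falls off exponentially with coverage, a modest increase in ''C'' removes most of the remaining gaps.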
+ | |||
+ | We cannot assemble an entire chromosome if we are missing bases. However, we can construct contiguous stretches of bases, or //contigs//, and later | ||
+ | assemble them into //scaffolds// using other information, such as long-distance physical maps. | ||
+ | |||
+ | |||
+ | |||
+ | ===== References ===== | ||
+ | <refnotes>notes-separator: none</refnotes> | ||
+ | ~~REFNOTES cite~~ |