This shows you the differences between two versions of the page.
| Both sides previous revision Previous revision Next revision | Previous revision | ||
|
lecture_notes:03-30-2011 [2011/03/30 23:09] eyliaw |
lecture_notes:03-30-2011 [2011/04/01 19:20] (current) svohr [Coverage] slight corrections |
||
|---|---|---|---|
| Line 6: | Line 6: | ||
| Inputs: Sequencing data from various machines. Some of the characteristics of these machines/techniques: | Inputs: Sequencing data from various machines. Some of the characteristics of these machines/techniques: | ||
| ==== Sanger capillary ==== | ==== Sanger capillary ==== | ||
| - | ~ | + | * ~800bp reads[(cite:wikisanger>http://en.wikipedia.org/wiki/Microfluidic_Sanger_sequencing)]. |
| + | * Q (quality value) ~30 | ||
| + | * ~$1/read, expensive because primers must be attached to each read. | ||
| ==== 454 ==== | ==== 454 ==== | ||
| + | * ~400bp reads[(cite:wiki454>http://en.wikipedia.org/wiki/454_Life_Sciences)]. | ||
| + | * Pyrosequencing | ||
| + | * Q ~20 | ||
| + | * $5000/run/1M reads, no downscaling (numbers approximate). | ||
| ==== SoLiD ==== | ==== SoLiD ==== | ||
| + | * 2x25bp or 1x50bp reads | ||
| + | * Paired end reads: ligation with adapter, cleaves 25bp from adapter using restriction enzyme. | ||
| + | * Potential for double ligation: two unrelated sequences ligating. | ||
| + | * $2000/run/100M reads (numbers approximate). | ||
| ==== Illumina ==== | ==== Illumina ==== | ||
| + | * 2x50bp or 2x100bp reads (lengths approximate) | ||
| + | * Paired end reads | ||
| + | * Potential errors: innies (ligated region not between sequenced regions) or chimeric (sequence passes ligated region) | ||
| + | * Cheaper than SoLiD, 10K Genomes project uses it. | ||
| ==== Ion Torrent ==== | ==== Ion Torrent ==== | ||
| + | * 2x100 base pairs | ||
| + | * ~50,000 to 5,000,000 reads depending on Sequencing Chip [(cite:ionTorrent>http://www.iontorrent.com/technology-how-does-it-perform/)]. | ||
| + | * Ion semiconductor sequencing. No optics or modified bases are required. | ||
| ==== Pac Bio ==== | ==== Pac Bio ==== | ||
| + | * Very long, single molecule reads (~10Kbp) | ||
| + | * High error rates (~5%) | ||
| + | * Useful when mapping to a reference. | ||
| + | ===== Coverage ===== | ||
| + | We briefly discussed how much sequence data would be required to assemble the genome. First, assuming reads are sampled uniformly at random from the genome, we considered the probability that a single read ''j'' covers a particular base ''i''. | ||
| + | | ||
| + | P( seeing base i in read j ) = L/G | ||
| + | |||
| + | where ''L'' is the read length and ''G'' is the total size of the genome. If we have ''R'' reads, then | ||
| + | |||
| + | P( never seeing base i ) = (1 - L/G)^R | ||
| + | |||
| + | Multiplying ''L/G'' by ''R/R'' gives ''((L*R) / G) / R'', or ''C / R'', where ''C = (L*R) / G'' is our coverage of the genome. The probability above becomes ''(1 - C/R)^R''; holding ''C'' fixed, we take the limit as | ||
| + | ''R'' goes to infinity: | ||
| + | |||
| + | lim R->inf (1 - C/R)^R = e^-C | ||
| + | |||
| + | Since each base is missed with probability about ''e^-C'', we can expect to miss about ''G*e^-C'' bases. | ||
| + | |||
| + | We cannot assemble an entire chromosome if we are missing bases. However, we can construct contiguous stretches of bases or //contigs// and later | ||
| + | assemble them into //scaffolds// using other information, such as long distance physical maps. | ||
| + | |||
| + | |||
| ===== References ===== | ===== References ===== | ||
| <refnotes>notes-separator: none</refnotes> | <refnotes>notes-separator: none</refnotes> | ||
| ~~REFNOTES cite~~ | ~~REFNOTES cite~~ | ||
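
The coverage estimate from the Coverage section can be checked numerically. The following is a minimal sketch, not part of the lecture: the function names and the example genome size, read length, and read count are illustrative assumptions. It compares the exact miss probability ''(1 - L/G)^R'' with the ''e^-C'' approximation and computes the expected number of missed bases.

```python
import math


def coverage(read_length, num_reads, genome_size):
    """Coverage C = (L * R) / G, using the symbols from the notes."""
    return read_length * num_reads / genome_size


def prob_missed_exact(read_length, num_reads, genome_size):
    """P(never seeing base i) = (1 - L/G)^R, assuming independent reads."""
    return (1.0 - read_length / genome_size) ** num_reads


def prob_missed_approx(c):
    """Large-R approximation of the miss probability: e^-C."""
    return math.exp(-c)


def expected_missed_bases(genome_size, c):
    """Expected number of uncovered bases: G * e^-C."""
    return genome_size * math.exp(-c)


if __name__ == "__main__":
    # Hypothetical example values, chosen only for illustration.
    G = 1_000_000   # genome size in bases
    L = 400         # read length in bases
    R = 25_000      # number of reads

    C = coverage(L, R, G)
    print("coverage C:", C)
    print("exact miss probability:", prob_missed_exact(L, R, G))
    print("approximate miss probability:", prob_missed_approx(C))
    print("expected missed bases:", expected_missed_bases(G, C))
```

With these assumed values the exact and approximate probabilities agree closely, which is why the ''e^-C'' form is convenient when ''R'' is large and ''L'' is small relative to ''G''.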