Differences

This shows you the differences between two versions of the page.

--- lecture_notes:03-30-2011 [2011/03/30 23:05]
eyliaw created
+++ lecture_notes:03-30-2011 [2011/04/01 19:20] (current)
svohr [Coverage] slight corrections
@@ Line 5: / Line 5: @@
 Inputs:  Sequencing data from various machines.  Some of the characteristics of these machines/techniques:
-  * Sanger capillary
+==== Sanger capillary ====
-  * 454
+  * ~800bp reads[(cite:wikisanger>http://en.wikipedia.org/wiki/Microfluidic_Sanger_sequencing)].
-  * SoLiD
+  * Q (quality value) ~30
-  * Illumina
+  * ~$1/read, expensive because primers must be attached to each read.
-  * *
+==== 454 ====
-  * Ion Torrent
+  * ~400bp reads[(cite:wiki454>http://en.wikipedia.org/wiki/454_Life_Sciences)].
-  * Pac Bio
+  * Pyrosequencing
+  * Q ~20
+  * ~$5000 per run of ~1M reads; runs cannot be scaled down (numbers approximate).
+==== SOLiD ====
+  * 2x25bp or 1x50bp reads
+  * Paired-end reads: the fragment is ligated to an adapter, then a restriction enzyme cuts 25bp away from the adapter.
+  * Potential for double ligation: two unrelated sequences ligating to each other.
+  * ~$2000 per run of ~100M reads (numbers approximate).
+==== Illumina ====
+  * 2x50bp or 2x100bp reads (lengths uncertain)
+  * Paired end reads
+  * Potential errors: "innies" (the ligated region is not between the sequenced regions) or chimeric reads (the sequence passes through the ligated region).
+  * Cheaper than SOLiD; the 10K Genomes project uses it.
+==== Ion Torrent ====
+  * 2x100 base pairs
+  * ~50,000 to 5,000,000 reads depending on Sequencing Chip [(cite:ionTorrent>http://www.iontorrent.com/technology-how-does-it-perform/)].
+  * Ion semiconductor sequencing. No optics or modified bases are required.
+==== Pac Bio ====
+  * Very long, single-molecule reads (~10 kbp)
+  * High error rates (~5%)
+  * Useful when mapping to a reference.
+===== Coverage =====
+We briefly discussed how much sequence data would be required to assemble the genome. First, we considered the probability of seeing a particular base ''i'' in a single read ''j''.
+  P( seeing base i in read j ) = L/G
+where ''L'' is the read length and ''G'' is the total size of the genome. If we have ''R'' reads, then
+  P( never seeing base i ) = (1 - L/G)^R
+We can multiply ''L/G'' by ''R/R'' to get ''((L*R) / G) / R'', or ''C / R'', where ''C = (L*R) / G'' is our coverage of the genome. We take the limit of this as
+''R'' goes to infinity:
+  lim R->inf (1 - C/R)^R = e^-C
+Thus each base is missed with probability about ''e^-C'', and over all ''G'' bases we can expect to miss about ''G*e^-C'' bases.
+We cannot assemble an entire chromosome if we are missing bases. However, we can construct contiguous stretches of bases or //contigs// and later
+assemble them into //scaffolds// using other information, such as long distance physical maps.
+===== References =====
+<refnotes>notes-separator: none</refnotes>
+~~REFNOTES cite~~
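
The coverage estimate in the Coverage section of the diff can be checked numerically. The following is only a minimal sketch under the same simplifying assumption as the notes (each read covers a given base with probability ''L/G'', independently of the other reads). The function names and the example genome size, read length, and read count are hypothetical and chosen for illustration; they do not come from the lecture.

```python
import math


def coverage(read_length, genome_size, num_reads):
    """Coverage C = (L * R) / G."""
    return read_length * num_reads / genome_size


def prob_base_missed(read_length, genome_size, num_reads):
    """Exact P(never seeing base i) = (1 - L/G)^R under the notes' model."""
    return (1.0 - read_length / genome_size) ** num_reads


def expected_missed_bases(genome_size, cov):
    """Approximate expected number of uncovered bases, G * e^-C."""
    return genome_size * math.exp(-cov)


def coverage_for_missed(genome_size, max_missed):
    """Coverage at which G * e^-C equals max_missed, i.e. C = ln(G / max_missed)."""
    return math.log(genome_size / max_missed)


if __name__ == "__main__":
    # Hypothetical example values (assumptions, not from the lecture).
    G = 1_000_000_000   # genome size in bases
    L = 100             # read length in bases
    R = 100_000_000     # number of reads

    C = coverage(L, G, R)
    print("coverage C:", C)
    print("exact (1 - L/G)^R:", prob_base_missed(L, G, R))
    print("limit e^-C:", math.exp(-C))
    print("expected missed bases:", expected_missed_bases(G, C))
    print("coverage for <= 1 missed base:", coverage_for_missed(G, 1.0))
```

With these example values the coverage is 10, the exact probability and the limit ''e^-C'' agree closely, and the estimate leaves on the order of tens of thousands of bases uncovered. Under this model, missing about one base in expectation would need a coverage of about ''ln(G)''.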

Banana Slug Genomics
