lecture_notes:03-30-2011


====== Brief overview of goals and data input characteristics ======

Kevin laid out some of the logistics of the class.

A broad goal: For each chromosome in the slug, we want the full sequence in DNA bases. Since it is unlikely to be completed in the timeframe of one quarter, some smaller goals: build contigs and have an idea of the scaffold to arrange the contigs in.

Inputs: Sequencing data from various machines. Some of the characteristics of these machines/techniques:

==== Sanger capillary ====

  * ~800bp reads[(cite:wikisanger>http://en.wikipedia.org/wiki/Microfluidic_Sanger_sequencing)].
  * Q (quality value) ~30
  * ~$1/read, expensive because primers must be attached to each read.

==== 454 ====

  * ~400bp reads[(cite:wiki454>http://en.wikipedia.org/wiki/454_Life_Sciences)].
  * Pyrosequencing
  * Q ~20
  * $5000/run/1M reads, no downscaling (numbers approximate).

==== SoLiD ====

  * 2x25bp or 1x50bp reads
  * Paired end reads: ligation with adapter, cleaves 25bp from adapter using restriction enzyme.
  * Potential for double ligation: two unrelated sequences ligating.
  * $2000/run/100M reads (numbers approximate).

==== Illumina ====

  * 2x50, 2x100bps ?
  * Paired end reads
  * Potential errors: innies (ligated region not between sequenced regions) or chimeric (sequence passes ligated region)
  * Cheaper than SoLiD, 10K Genomes project uses it.

==== Ion Torrent ====

  * 2x100 base pairs
  * ~50,000 to 5,000,000 reads depending on Sequencing Chip [(cite:ionTorrent>http://www.iontorrent.com/technology-how-does-it-perform/)].
  * Ion semiconductor sequencing. No optics or modified bases are required.

==== Pac Bio ====

  * Very long, single molecule reads (~10K)
  * High error rates (~5%)
  * Useful when mapping to a reference.

===== Coverage =====

We briefly discussed how much sequence data would be required to assemble the genome.
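To get a feel for how much data that is, here is a minimal sketch that turns the approximate per-run figures quoted for 454 and SoLiD into fold coverage, using the usual definition of coverage as read length times number of reads divided by genome size. The genome size below is a placeholder assumption (the notes do not give one for the slug), and the function and variable names are made up for illustration.

```python
def fold_coverage(read_length_bp, num_reads, genome_size_bp):
    """Return fold coverage: (read length * number of reads) / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return read_length_bp * num_reads / genome_size_bp


# ASSUMPTION: placeholder genome size, not a figure from the lecture.
ASSUMED_GENOME_SIZE_BP = 2_000_000_000

# Approximate (read length in bp, reads per run) as quoted in the notes.
# SoLiD is taken in its 1x50bp mode.
RUNS = {
    "454": (400, 1_000_000),
    "SoLiD": (50, 100_000_000),
}


def coverage_per_run(runs, genome_size_bp):
    """Map each platform name to the fold coverage of a single run."""
    return {
        name: fold_coverage(length, reads, genome_size_bp)
        for name, (length, reads) in runs.items()
    }


if __name__ == "__main__":
    for name, cov in coverage_per_run(RUNS, ASSUMED_GENOME_SIZE_BP).items():
        print(f"{name}: {cov:.2f}x coverage per run")
```

Changing ''ASSUMED_GENOME_SIZE_BP'' to a better estimate rescales every result, since coverage is inversely proportional to genome size.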
First, we considered the probability of seeing every base in the genome. For a single read,

''P( seeing base i in read j ) = L/G''

where ''L'' is the read length and ''G'' is the total size of the genome. If we have ''R'' reads, then

''P( never seeing base i ) = (1 - L/G)^R''

We can multiply ''L/G'' by ''R/R'' to get ''((L*R) / G) / R'', or ''C / R'', where ''C = (L*R) / G'' is our coverage of the genome. Holding ''C'' fixed, we take the limit as ''R'' goes to infinity:

''lim R->inf (1 - C/R)^R = e^-C''

Thus we can expect to miss about ''G*e^-C'' bases.

We cannot assemble an entire chromosome if we are missing bases. However, we can construct contiguous stretches of bases, or //contigs//, and later assemble them into //scaffolds// using other information, such as long distance physical maps.

===== References =====

<refnotes>notes-separator: none</refnotes>
~~REFNOTES cite~~
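Returning to the estimate in the Coverage section: the sketch below compares the exact expression ''(1 - L/G)^R'' with its limit ''e^-C'' and computes the expected number of missed bases ''G*e^-C''. It uses the same simplified model as the notes (each read covers a given base with probability ''L/G'', independently of the others). The function names and the numbers in the usage lines are illustrative assumptions, not values from the lecture.

```python
import math


def p_base_missed_exact(read_length_bp, num_reads, genome_size_bp):
    """Exact model from the notes: (1 - L/G)^R."""
    return (1.0 - read_length_bp / genome_size_bp) ** num_reads


def p_base_missed_approx(coverage):
    """Limiting form for large R at fixed coverage C: e^-C."""
    return math.exp(-coverage)


def expected_missed_bases(coverage, genome_size_bp):
    """Expected number of bases never covered by any read: G * e^-C."""
    return genome_size_bp * math.exp(-coverage)


if __name__ == "__main__":
    # ASSUMPTION: toy values chosen only to exercise the formulas.
    G, L, R = 1_000_000, 100, 50_000
    C = L * R / G
    print("coverage:", C)
    print("exact P(missed): ", p_base_missed_exact(L, R, G))
    print("approx P(missed):", p_base_missed_approx(C))
    print("expected missed bases:", expected_missed_bases(C, G))
```

Because the missed fraction falls off exponentially in ''C'', each additional unit of coverage cuts the expected number of missed bases by a factor of ''e''.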

lecture_notes/03-30-2011.1301684823.txt.gz · Last modified: 2011/04/01 19:07 by svohr