lecture_notes:04-05-2010 [2010/04/05 21:02] hyjkim created |
lecture_notes:04-05-2010 [2010/04/16 01:16] (current) karplus fixed citations to use Refnotes syntax |
+ | Mon Apr 05 2010: Hyunsung John Kim | ||
+ | |||
+ | Please add to or modify this page as you see fit! | ||
+ | =====Homework===== | ||
+ | Pick two or three assemblers. | ||
+ | * Find out where to get them | ||
+ | * How to install them | ||
+ | * What papers there are about them | ||
+ | * Create the wiki page about them | ||
+ | * And possibly install them | ||
+ | |||
+ | Volunteers: | ||
+ | * Phrap: Galt and Shyamini | ||
+ | * Velvet: Hyunsung and Galt | ||
+ | * ABySS: Galt and Chris | ||
+ | * AMOS: Shyamini and Herbert | ||
+ | * Arachne: John and Michael | ||
+ | * CAP3/PCAP: Michael and Galt | ||
+ | * Celera: Shyamini and Hyunsung | ||
+ | * Euler/Euler-sr: Herbert and John | ||
+ | * MIRA1: Herbert and Michael | ||
+ | * TIGR Assembler: John and Shyamini | ||
+ | * SHARCGS: Michael and Chris | ||
+ | * SSAKE: Herbert and Hyunsung | ||
+ | * Staden gap4 package: Michael and Hyunsung | ||
+ | * VCAKE: Chris and John | ||
+ | * Phusion: Shyamini and Michael | ||
+ | * QSRA: Herbert and Chris | ||
+ | * SOLiD System Tools (Corona_lite, etc): Hyunsung and Chris | ||
+ | * Newbler documentation: Galt and Herbert | ||
+ | * SOAPdenovo: Galt and Jenny | ||
+ | |||
+ | |||
+ | Assembly Review Articles: | ||
+ | * Jason R. Miller, Sergey Koren and Granger Sutton [(cite:Miller2010>Jason R. Miller, Sergey Koren, Granger Sutton, Assembly algorithms for next-generation sequencing data, Genomics, In Press, Corrected Proof, Available online 6 March 2010, ISSN 0888-7543, DOI: 10.1016/j.ygeno.2010.03.001 http://www.sciencedirect.com/science/article/B6WG1-4YJ6GD8-1/2/ae6c957910e4ea658cdebff4a0ce9793)] \\ Covers these assemblers: SSAKE, SHARCGS, VCAKE, Newbler, Celera, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. Compares de Bruijn graph assemblers to overlap/layout/consensus assemblers. | ||
+ | | ||
+ | |||
=====Assembly Overview===== | =====Assembly Overview===== | ||
+ | |||
+ | * What is assembly? | ||
+ | * Sequence data consists of many short reads (the inputs to the assembler). | ||
+ | * Sanger reads can be 1200 bp long. | ||
+ | * These are still short reads compared to a small prokaryotic genome or a larger eukaryotic genome. | ||
+ | * 454 reads are about 400 bp long. | ||
+ | * These are not considered short reads by sequencing standards. | ||
+ | * Long enough to be uniquely mappable to most regions except for very large repetitive regions. | ||
+ | * SOLiD reads are about 50 bp long for shotgun libraries or 2x25 bp for paired-end reads. | ||
+ | * Paired reads are multiple reads from the same strand | ||
+ | * Illumina reads can be long, but quality begins to drop off after 50 base pairs. | ||
+ | * All the second-generation sequencers produce noisy reads. | ||
+ | * 454 may have the lowest error rate at about 0.2%. | ||
+ | * SOLiD has a variable error rate. One run on the POG genome had an error rate between 0.5% and 6%. | ||
+ | * Error rate should increase linearly with length and position in the read. | ||
+ | * Mate pair error rates should be similar to shotgun error rates with a higher error rate for the second fragment read. | ||
+ | * POG dataset had two large spikes at two positions on the second read. | ||
+ | * The quality measure output by the SOLiD sequencer mirrored the error rate. | ||
+ | * No error data on Illumina platform. | ||
+ | * Expect half your reads to have an error in them. | ||
+ | * Contiguous chromosomes with a low error rate (the outputs from the assembler). | ||
+ | * The Bermuda standard for a finished genome calls for an error rate of 1 x 10^-5 per base. [(cite:Bermuda1>[[http://www.genome.gov/page.cfm?pageID=10506376]])] [(cite:Bermuda2>[[http://www.ornl.gov/sci/techresources/Human_Genome/research/bermuda.shtml]])] | ||
+ | * To reduce error rate in short reads, stack up many reads and take the most common base at each position. | ||
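The idea of stacking reads and taking the most common base at each position can be sketched as a column-wise majority vote. This is only an illustration, assuming the reads have already been aligned to the same coordinates and padded to equal length; the `-` padding character, the `N` placeholder for uncovered columns, and the function name `column_consensus` are assumptions made for this sketch, not part of any particular assembler.

```python
from collections import Counter


def column_consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over reads aligned to the same coordinates.

    Every read is a string of the same length; `gap` marks positions that a
    read does not cover (a convention assumed for this sketch).
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        # "N" marks a column that no read covers.
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


# Toy pile-up: two reads carry a single miscalled base each.
reads = [
    "ACGTAC--",
    "ACGAACGT",
    "--GTACGT",
    "ACGTACGA",
]
consensus = column_consensus(reads)
print(consensus)
```

A plain majority vote ignores base quality values; a real base caller would normally weight each read's vote by its quality score.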
+ | * How much data do we have? | ||
+ | * Let R = number of reads | ||
+ | * Let l = read length measured in bases | ||
+ | * Let G = genome size measured in bases | ||
+ | * Two issues to worry about: | ||
+ | * Do we read every base in the chromosome? | ||
+ | * Probability of covering base i with 1 read assuming a uniform distribution: | ||
+ | * let p = l / G, the probability of covering base i with 1 read | ||
+ | * 1 - p, the probability of not covering base i with 1 read | ||
+ | * (1 - p) ^ R, probability of not covering base i with R reads | ||
+ | * c = R * l / G, coverage | ||
+ | * R = c / p | ||
+ | * lim ( (1-p)^(c/p), p -> 0 ) = e^-c | ||
+ | * Expected number of missed bases = G * e^-c | ||
+ | * How much coverage do we need to get less than one missing base per genome? | ||
+ | * G * e^-c < 1 | ||
+ | * e^-c < 1 / G | ||
+ | * c > ln (G) | ||
+ | * For the human genome (about 3 gigabases), you need approximately 22x coverage, or 66 gigabases of sequencing data, to expect fewer than one missed base. | ||
+ | * It is reasonable to expect that we will not have enough coverage to see every base in the genome. | ||
+ | * On just 454 data, we expect to have between 1x and 0.1x coverage for the banana slug (assuming the banana slug genome is between 700 Mb and 7 Gb). | ||
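The coverage arithmetic above can be checked numerically. The sketch below uses the same symbols (R, l, G, c) and the same uniform-coverage assumption as the notes; the function names and the 3 Gb human genome size are assumptions introduced for illustration.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """c = R * l / G."""
    return num_reads * read_length / genome_size


def prob_base_missed_exact(num_reads, read_length, genome_size):
    """(1 - p)^R with p = l / G, before taking the limit."""
    p = read_length / genome_size
    return (1.0 - p) ** num_reads


def expected_missed_bases(genome_size, c):
    """G * e^-c, the expected number of bases covered by no read."""
    return genome_size * math.exp(-c)


def min_coverage_to_see_every_base(genome_size):
    """Coverage at which G * e^-c drops to one missed base: c = ln(G)."""
    return math.log(genome_size)


# Human genome, assumed here to be 3e9 bases.
human_genome = 3e9
c_needed = min_coverage_to_see_every_base(human_genome)
bases_needed = c_needed * human_genome
print(f"coverage needed: {c_needed:.1f}x, bases needed: {bases_needed:.2e}")

# 1x coverage of a 700 Mb genome with 400 bp reads (the optimistic 454 case).
slug_genome = 700_000_000
slug_reads = 1_750_000
c_slug = coverage(slug_reads, 400, slug_genome)
missed = expected_missed_bases(slug_genome, c_slug)
print(f"{c_slug:.1f}x coverage leaves about {missed:.2e} bases unread")
```

At 1x coverage roughly e^-1, or about 37%, of the bases are expected to be missed, which is why the notes do not expect 454 data alone to cover the genome.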
+ | * Scaffolds | ||
+ | * What is a scaffold? | ||
+ | * Given a set of contiguous segments of DNA (contigs), order and orient them. A scaffold is a set of contigs with order and orientation information. | ||
+ | * Paired-end libraries are helpful for joining gaps between contigs. | ||
+ | * Short reads from paired-end libraries are seldom used for forming contigs. | ||
+ | * It is possible to cluster short reads at the end of contigs. | ||
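One way to picture how paired-end libraries join contigs is to count the read pairs whose two ends land on different contigs. The sketch below assumes the pairs have already been mapped to contigs; the input format (a list of contig-name pairs), the `min_links` threshold, and the function name `contig_links` are hypothetical choices for illustration. It ignores orientation and gap-size estimation, which a real scaffolder would also need.

```python
from collections import Counter


def contig_links(pair_hits, min_links=2):
    """Count read pairs that bridge two different contigs.

    `pair_hits` holds one (contig_of_read1, contig_of_read2) tuple per mapped
    read pair. Only contig pairs supported by at least `min_links` read pairs
    are kept, so that a single mis-mapped pair does not create a join.
    """
    counts = Counter()
    for contig_a, contig_b in pair_hits:
        if contig_a != contig_b:
            counts[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}


pair_hits = [
    ("contig1", "contig2"),
    ("contig2", "contig1"),
    ("contig2", "contig3"),
    ("contig2", "contig3"),
    ("contig2", "contig3"),
    ("contig1", "contig1"),  # both reads inside one contig: no scaffolding information
    ("contig3", "contig4"),  # a single link, treated as noise in this sketch
]
links = contig_links(pair_hits)
print(links)
```

The surviving links suggest the adjacency contig1, contig2, contig3; contig4 stays unplaced because only one pair supports it.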
+ | * Assembly overview (Students will need to verify these teachings with papers on different assemblers) | ||
+ | - Cluster the reads. | ||
+ | - Stick reads together. | ||
+ | - Base calling provides a consensus for a contig. | ||
+ | - Can be a greedy algorithm or dynamic programming algorithm. | ||
+ | - Order and orient. | ||
+ | - May have mapping to find out where reads go. | ||
+ | - May also try to form contigs out of leftover reads. | ||
+ | - Can find repeat regions using paired-end data. | ||
+ | * Most resequencing projects map reads to scaffolds and create contigs based on the mapping. Sections with missing read data can be assumed to be a deletion or an alteration relative to the existing scaffold. | ||
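The "stick reads together" step, in its greedy form, can be sketched as repeatedly merging the two sequences with the longest suffix-to-prefix overlap. This is a toy illustration under strong assumptions (error-free reads, all on the same strand, no repeats); the names `overlap` and `greedy_assemble` and the `min_len` threshold are made up for this sketch and do not describe how any specific assembler works.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap >= min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
contigs = greedy_assemble(reads)
print(contigs)
```

The all-pairs comparison is quadratic in the number of reads, so this only works for toy inputs; it also leaves reads that are fully contained inside another read unmerged, and a greedy choice can mis-join across repeats.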
+ | |||
+ | |||
+ | ===== References ===== | ||
+ | <refnotes>notes-separator: none</refnotes> | ||
+ | ~~REFNOTES cite~~ | ||