Mon Apr 05 2010: Hyunsung John Kim

Please add to or modify this page as you see fit!

=====Homework=====

Pick two or three assemblers.
  * Find out where to get them
  * How to install them
  * What papers there are about them
  * Create the wiki page about them
  * And possibly install them

Volunteers:
  * Phrap: Galt and Shyamini
  * Velvet: Hyunsung and Galt
  * ABySS: Galt and Chris
  * AMOS: Shyamini and Herbert
  * Arachne: John and Michael
  * CAP3/PCAT: Michael and Galt
  * Celera: Shyamini and Hyunsung
  * Euler/Euler-sr: Herbert and John
  * MIRA1: Herbert and Michael
  * TIGR Assembler: John and Shyamini
  * SHARCGS: Michael and Chris
  * SSAKE: Herbert and Hyunsung
  * Staden gap4 package: Michael and Hyunsung
  * VCAKE: Chris and John
  * Phusion: Shyamini and Michael
  * QSRA: Herbert and Chris
  * SOLiD System Tools (Corona_lite, etc): Hyunsung and Chris
  * Newbler documentation: Galt and Herbert
  * SOAPdenovo: Galt and Jenny

Assembly Review Articles:
  * [[http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WG1-4YJ6GD8-1&_user=10&_coverDate=03%2F06%2F2010&_rdoc=1&_fmt=high&_orig=search&_sort=d&_docanchor=&view=c&_searchStrId=1282691739&_rerunOrigin=google&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=32c08d11cc10fd1eefca0f8a8def738b|Assembly algorithms for next-generation sequencing data]] Jason R. Miller, Sergey Koren and Granger Sutton

Covers these assemblers: SSAKE, SHARCGS, VCAKE, Newbler, Celera, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. Compares de Bruijn graph to overlap/layout/consensus.

Jason R. Miller, Sergey Koren, Granger Sutton, Assembly algorithms for next-generation sequencing data, Genomics, In Press, Corrected Proof, Available online 6 March 2010, ISSN 0888-7543, DOI: 10.1016/j.ygeno.2010.03.001. (http://www.sciencedirect.com/science/article/B6WG1-4YJ6GD8-1/2/ae6c957910e4ea658cdebff4a0ce9793)

Keywords: Genome assembly algorithms; Next-generation sequencing

=====Assembly Overview=====

  * What is assembly?
    * Sequence data consists of many short reads.
    * These short reads are the inputs to the assembler.
      * Sanger reads can be 1200 bp long.
        * These are still short reads compared to a small prokaryotic genome or a larger eukaryotic genome.
      * 454 reads are about 400 bp long.
        * These are not considered short reads by sequencing standards.
        * They are long enough to be uniquely mappable to most regions except for very large repetitive regions.
      * SOLiD reads are about 50 bp long for shotgun libraries or 2x25 bp for paired-end reads.
        * Paired reads are multiple reads from the same strand.
      * Illumina reads can be long, but quality begins to drop off after 50 base pairs.
      * All the second-generation sequencers produce noisy reads.
        * 454 may have the lowest error rate at about 0.2%.
        * SOLiD has a variable error rate. One run on the POG genome had an error rate between 0.5% and 6%.
          * Error rate should increase linearly with length and position in the read.
          * Mate pair error rates should be similar to shotgun error rates, with a higher error rate for the second fragment read.
            * The POG dataset had two large spikes at two positions on the second read.
          * The quality measure output by the SOLiD sequencer mirrored the error rate.
        * No error data on the Illumina platform.
        * Expect half your reads to have an error in them.
    * The output from assemblers is contiguous chromosomes with a low error rate.
      * The Bermuda standard for a finished genome calls for an error rate of 1 x 10^-5 per base (see comment below).
      * To reduce the error rate in short reads, stack up many reads and take the most common base at each position.
  * How much data do we have?
    * Let R = number of reads
    * Let l = read length measured in bases
    * Let G = genome size measured in bases
    * Two issues to worry about:
      * Do we read every base in the chromosome?
        * Probability of covering base i with 1 read, assuming a uniform distribution:
          * let p = l / G, the probability of covering base i with 1 read
          * 1 - p, the probability of not covering base i with 1 read
          * (1 - p) ^ R, the probability of not covering base i with R reads
          * c = R * l / G, coverage
          * R = c / p
          * lim ( (1-p)^(c/p), p -> 0 ) = e^-c
          * Expected number of missed bases = G * e^-c
        * How much coverage do we need to get less than one missing base per genome?
          * G * e^-c < 1
          * e^-c < 1 / G
          * c > ln (G)
          * For the human genome, you need approximately 22x coverage, or 66 gigabases of sequencing data, to read each base at least once.
        * It is reasonable to expect that we will not have enough coverage to see every base in the genome.
        * On just 454 data, we expect to have between 1x and 0.1x coverage for the banana slug (assuming the banana slug genome is between 700 Mb and 7 Gb).
  * Scaffolds
    * What is a scaffold?
      * Given a set of contiguous segments of DNA (contigs), order and orient them. A scaffold is a set of contigs with order and orientation information.
      * Paired-end libraries are helpful for bridging gaps between contigs.
      * Short reads from paired-end libraries are seldom used for forming contigs.
      * It is possible to cluster short reads at the ends of contigs.
  * Assembly overview (Students will need to verify these teachings with papers on different assemblers)
    - Cluster the reads.
    - Stick reads together.
      - Base calling provides a consensus for a contig.
      - Can be a greedy algorithm or dynamic programming algorithm.
    - Order and orient.
      - May have mapping to find out where reads go.
      - May also try to form contigs out of leftover reads.
      - Can find repeat regions using paired-end data.
  * Most resequencing projects map reads to scaffolds and create contigs based upon mapping.
    * Sections with missing read data can be assumed to be a deletion or an alteration relative to the existing scaffold.
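
The coverage argument in the notes (p = l / G, c = R * l / G, expected missed bases = G * e^-c) can be checked numerically. The following is a minimal sketch in Python, assuming uniformly placed reads as the notes do. The function names and the 3-gigabase human genome size are illustrative assumptions, not part of the original notes.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Coverage c = R * l / G."""
    return num_reads * read_length / genome_size


def expected_missed_bases(genome_size, c):
    """Expected number of uncovered bases, G * e^-c (the p -> 0 limit)."""
    return genome_size * math.exp(-c)


def exact_uncovered_probability(num_reads, read_length, genome_size):
    """(1 - p)^R with p = l / G, before taking the limit."""
    p = read_length / genome_size
    return (1.0 - p) ** num_reads


def coverage_for_one_missed_base(genome_size):
    """Coverage at which G * e^-c equals 1, i.e. c = ln(G)."""
    return math.log(genome_size)


if __name__ == "__main__":
    G = 3_000_000_000  # assumed human genome size in bases
    c = coverage_for_one_missed_base(G)
    print(f"coverage needed: {c:.2f}x")
    print(f"bases of sequence needed: {c * G:.3e}")
    print(f"expected missed bases at 10x: {expected_missed_bases(G, 10):.0f}")
```

For a 3-gigabase genome, ln(G) is a little under 22, which is where the 22x figure in the notes comes from. Comparing `exact_uncovered_probability` with `math.exp(-c)` for realistic read lengths shows how close the limit is to the exact expression.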
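
The notes suggest reducing error by stacking up many reads and taking the most common base at each position. A minimal Python sketch of that majority vote is shown here. It assumes the reads have already been placed at known offsets (how the offsets are found, by overlap or by mapping, is outside the sketch), and the function name and input layout are hypothetical choices made for illustration.

```python
from collections import Counter


def majority_consensus(placed_reads):
    """Majority-vote consensus over reads placed at known offsets.

    placed_reads is a list of (offset, sequence) pairs. Positions with
    no read coverage are reported as "N". Ties are broken arbitrarily.
    """
    columns = {}
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            # Count every base observed at this consensus position.
            columns.setdefault(offset + i, Counter())[base] += 1
    if not columns:
        return ""
    consensus = []
    for pos in range(min(columns), max(columns) + 1):
        counts = columns.get(pos)
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


if __name__ == "__main__":
    reads = [
        (0, "ACGTAC"),
        (2, "GTACGG"),
        (0, "ACCTAC"),  # carries one error at position 2
        (1, "CGTACG"),
    ]
    print(majority_consensus(reads))
```

In the example, three of the four reads covering position 2 agree, so the single erroneous base is outvoted. With low coverage the vote has few participants, which is one reason depth matters for the final error rate.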