lecture_notes:04-05-2010

lecture_notes:04-05-2010 [2010/04/05 21:02]
hyjkim created
lecture_notes:04-05-2010 [2010/04/13 14:27]
learithe
-=====Assembly Overview=====
 +Mon Apr 05 2010: Hyunsung John Kim
  
 +Please add to or modify this page as you see fit!
 +=====Homework=====
 +Pick two or three assemblers.
 +  * Find out where to get them
 +  * How to install them
 +  * What papers there are about them
 +  * Create the wiki page about them
 +  * And possibly install them
 +
 +Volunteers:
 +  * Phrap: Galt and Shyamini
 +  * Velvet: Hyunsung and Galt 
 +  * ABySS: Galt and Chris
 +  * AMOS: Shyamini and Herbert
 +  * Arachne: John and Michael
 +  * CAP3/PCAP: Michael and Galt
 +  * Celera: Shyamini and Hyunsung
 +  * Euler/Euler-sr: Herbert and John
 +  * MIRA1: Herbert and Michael
 +  * TIGR Assembler: John and Shyamini
 +  * SHARCGS: Michael and Chris
 +  * SSAKE: Herbert and Hyunsung
 +  * Staden gap4 package: Michael and Hyunsung
 +  * VCAKE: Chris and John
 +  * Phusion: Shyamini and Michael
 +  * QSRA: Herbert and Chris
 +  * SOLiD System Tools (Corona_lite, etc): Hyunsung and Chris
 +  * Newbler documentation: Galt and Herbert
 +  * SOAPdenovo: Galt and Jenny
 +
 +
 +Assembly Review Articles:
 +  * [[http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WG1-4YJ6GD8-1&_user=10&_coverDate=03%2F06%2F2010&_rdoc=1&_fmt=high&_orig=search&_sort=d&_docanchor=&view=c&_searchStrId=1282691739&_rerunOrigin=google&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=32c08d11cc10fd1eefca0f8a8def738b|Assembly algorithms for next-generation sequencing data]]
 +
 +  Jason R. Miller, Sergey Koren and Granger Sutton
 +  
 +  Covers these assemblers: SSAKE, SHARCGS, VCAKE, Newbler, Celera, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo.
 +  
 +  Compares de Bruijn graph approaches to overlap/layout/consensus.
 +  
 +  Jason R. Miller, Sergey Koren, Granger Sutton, Assembly algorithms for next-generation sequencing data, Genomics,
 +  In Press, Corrected Proof, Available online 6 March 2010, ISSN 0888-7543, DOI: 10.1016/j.ygeno.2010.03.001.
 +  (http://www.sciencedirect.com/science/article/B6WG1-4YJ6GD8-1/2/ae6c957910e4ea658cdebff4a0ce9793)
 +  Keywords: Genome assembly algorithms; Next-generation sequencing
 +
 +
 +
 +=====Assembly Overview=====
  
 +  * What is assembly?
 +    * Sequence data consist of many short reads (the inputs to the assembler).
 +      * Sanger reads can be 1200 bp long.
 +        * These are still short compared to a small prokaryotic genome, let alone a larger eukaryotic genome.
 +      * 454 reads are about 400 bp long. 
 +        * These are not considered short reads by sequencing standards.
 +        * Long enough to be uniquely mappable to most regions except for very large repetitive regions.
 +      * SOLiD reads are about 50 bp long for shotgun libraries or 2x25 bp for paired-end reads.
 +        * Paired reads are two reads from the same DNA fragment, taken from the same strand.
 +      * Illumina reads can be long, but quality begins to drop off after 50 base pairs.
 +    * All the second-generation sequencers produce noisy reads.
 +      * 454 may have the lowest error rate at about 0.2%.
 +      * SOLiD has a variable error rate. One run on the POG genome had an error rate between 0.5% and 6%.
 +        * Error rate should increase linearly with length and position in the read.
 +        * Mate pair error rates should be similar to shotgun error rates with a higher error rate for the second fragment read.
 +        * POG dataset had two large spikes at two positions on the second read.
 +        * The quality measure output by the SOLiD sequencer mirrored the error rate.
 +      * No error data were available for the Illumina platform.
 +      * Expect half your reads to have an error in them.
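The "half your reads" rule of thumb above can be sanity-checked with a short calculation. This is only a rough sketch: it assumes errors are independent and occur at a constant per-base rate, whereas the notes say the rate rises with position in the read. The function name `prob_read_has_error` is illustrative, not taken from any tool.

```python
def prob_read_has_error(per_base_error, read_length):
    """Probability that a read contains at least one miscalled base.

    Assumes independent errors at a constant per-base rate (a simplification).
    """
    return 1.0 - (1.0 - per_base_error) ** read_length


# Error rates and read lengths quoted in the notes.
platforms = [
    ("454, 0.2% error, 400 bp", 0.002, 400),
    ("SOLiD, 0.5% error, 50 bp", 0.005, 50),
    ("SOLiD, 6% error, 50 bp", 0.06, 50),
]
for name, rate, length in platforms:
    print(f"{name}: {prob_read_has_error(rate, length):.2f}")
```

Under these assumptions the 454 case comes out a little over one half, which is consistent with the rule of thumb.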
 +  * Contiguous chromosomes with a low error rate (the output from assemblers).
 +    * The Miami standard for a finished genome calls for an error rate of 1 x 10^-5 errors per base. FIXME
 +    * To reduce error rate in short reads, stack up many reads and take the most common base at each position.
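The idea of stacking reads and taking the most common base can be sketched as a majority vote per column. The sketch assumes the reads have already been aligned to common coordinates and padded with `-` where a read does not cover a position. That input format and the function name `consensus` are assumptions made for illustration.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over reads aligned to the same coordinates.

    '-' marks positions a read does not cover; uncovered columns become 'N'.
    """
    length = max(len(read) for read in aligned_reads)
    called = []
    for i in range(length):
        column = [r[i] for r in aligned_reads if i < len(r) and r[i] != "-"]
        if column:
            called.append(Counter(column).most_common(1)[0][0])
        else:
            called.append("N")
    return "".join(called)


reads = [
    "ACGTAC--",
    "ACGAACGT",  # miscall at position 3
    "--GTACGT",
    "ACGTACGA",  # miscall at position 7
]
print(consensus(reads))
```

With enough reads per column, isolated miscalls are outvoted. A real base caller would also weight each vote by its quality score.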
 +  * How much data do we have?
 +    * Let R = number of reads
 +    * Let l = read length, measured in bases
 +    * Let G = genome size measured in bases
 +  * Two issues to worry about:
 +    * Do we read every base in the chromosome?
 +    * Probability of covering base i with 1 read, assuming reads are placed uniformly at random:
 +      * let p = l / G, the probability of covering base i with 1 read
 +      * 1 - p, the probability of not covering base i with 1 read
 +      * (1 - p) ^ R, probability of not covering base i with R reads
 +      * c = R * l / G, coverage
 +      * R = c / p
 +      * lim ( (1-p)^(c/p), p -> 0 ) = e^-c
 +      * Expected number of missed bases = G * e^-c
 +    * How much coverage do we need to get less than one missing base per genome?
 +      * G * e^-c < 1
 +      * e^-c < 1 /G
 +      * c > ln (G)
 +    * For the human genome (G of about 3 gigabases), you need approximately 22x coverage, or 66 gigabases of sequencing data, to expect every base to be read at least once.
 +      * It is reasonable to expect that we will not have enough coverage to see every base in the genome.
 +    * On just 454 data, we expect to have between 0.1x and 1x coverage for the banana slug (assuming the banana slug genome is between 700 Mb and 7 Gb).
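The coverage formulas above can be checked numerically. The sketch below uses the notes' symbols (G, l, R, c) under the same uniform-placement assumption. The function names and the 3-gigabase human genome size are assumptions made for illustration.

```python
import math


def expected_missed_bases(genome_size, coverage):
    """Approximate number of bases covered by no read: G * e^-c."""
    return genome_size * math.exp(-coverage)


def exact_missed_bases(genome_size, read_length, num_reads):
    """The same quantity before taking the limit: G * (1 - p)^R with p = l / G."""
    p = read_length / genome_size
    return genome_size * (1.0 - p) ** num_reads


def coverage_for_no_gaps(genome_size):
    """Coverage at which fewer than one base is expected to be missed: c > ln(G)."""
    return math.log(genome_size)


HUMAN_GENOME = 3e9  # assumed size in bases
c = coverage_for_no_gaps(HUMAN_GENOME)
print(f"human: c > {c:.1f}x, about {math.ceil(c) * HUMAN_GENOME / 1e9:.0f} Gb of reads")

# Banana slug range from the notes: 700 Mb at 1x, or 7 Gb at 0.1x.
for genome_size, coverage in [(7e8, 1.0), (7e9, 0.1)]:
    missed = expected_missed_bases(genome_size, coverage)
    print(f"G = {genome_size:.0e}, c = {coverage}x: {missed / genome_size:.0%} of bases unread")
```

At 1x coverage roughly 37% of bases are expected to be unread, and at 0.1x roughly 90%. This is why 454 data alone would leave most of the banana slug genome unseen.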
 +  * Scaffolds
 +    * What is a scaffold?
 +      * Given a set of contiguous segments of DNA (contigs), order and orient them. A scaffold is a set of contigs with order and orientation information.
 +      * Paired-end libraries are helpful for bridging gaps between contigs.
 +        * Short reads from paired-end libraries are seldom used for forming contigs.
 +        * It is possible to cluster short reads at the end of contigs.
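One way to picture how paired reads help with scaffolding is to count the pairs whose two mates land on different contigs. The sketch below covers only that counting step, not ordering or orientation, and it ignores insert sizes. The input format `pair_placements`, the `min_support` threshold, and the function name are hypothetical, chosen for illustration.

```python
from collections import Counter


def contig_links(pair_placements, min_support=2):
    """Count read pairs whose two mates landed on different contigs.

    pair_placements: iterable of (contig_of_mate1, contig_of_mate2) tuples,
    a hypothetical summary of where each pair mapped.
    Returns {(contig_a, contig_b): count} for links seen at least min_support times.
    """
    links = Counter()
    for first, second in pair_placements:
        if first != second:  # pairs inside one contig say nothing about gaps
            links[tuple(sorted((first, second)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_support}


placements = [
    ("contig1", "contig1"),
    ("contig1", "contig2"),
    ("contig2", "contig1"),
    ("contig2", "contig3"),
    ("contig1", "contig2"),
]
print(contig_links(placements))
```

Requiring more than one supporting pair is a simple guard against mis-mapped reads. A real scaffolder would also use mate orientation and library insert size to decide contig order and direction.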
 +  * Assembly overview (Students will need to verify these teachings with papers on different assemblers)
 +    - Cluster the reads.
 +    - Stick reads together.
 +      - Base calling provides a consensus for a contig.
 +      - Can be a greedy algorithm or dynamic programming algorithm.
 +    - Order and orient.
 +    - May use mapping to find out where reads go.
 +    - May also try to form contigs out of leftover reads.
 +    - Can find repeat regions using paired-end data.
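The "stick reads together" step with a greedy algorithm might look like the toy sketch below. It is not the method of any particular assembler from the homework list. It assumes error-free reads from a single strand, with no reverse complements, no quality values and no repeats, and the names `overlap`, `greedy_assemble` and `min_len` are illustrative.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"]))
```

The all-pairs overlap search is quadratic in the number of reads. That is fine for a handful of reads, but real assemblers index the reads, with k-mers for example, before looking for overlaps.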
 +  * Most resequencing projects map reads to scaffolds and create contigs based upon the mapping. Sections with missing read data can be assumed to be a deletion or an alteration relative to the existing scaffold.
lecture_notes/04-05-2010.txt · Last modified: 2010/04/16 01:16 by karplus