Mon Apr 05 2010: Hyunsung John Kim
Please add to or modify this page as you see fit!
Homework
Pick two or three assemblers.
Find out where to get them
How to install them
What papers there are about them
Create the wiki page about them
And possibly install them
Volunteers:
Phrap: Galt and Shyamini
Velvet: Hyunsung and Galt
ABySS: Galt and Chris
AMOS: Shyamini and Herbert
Arachne: John and Michael
CAP3/PCAP: Michael and Galt
Celera: Shyamini and Hyunsung
Euler/Euler-sr: Herbert and John
MIRA1: Herbert and Michael
TIGR Assembler: John and Shyamini
SHARCGS: Michael and Chris
SSAKE: Herbert and Hyunsung
Staden gap4 package: Michael and Hyunsung
VCAKE: Chris and John
Phusion: Shyamini and Michael
QSRA: Herbert and Chris
SOLiD System Tools (Corona_lite, etc): Hyunsung and Chris
Newbler documentation: Galt and Herbert
SOAPdenovo: Galt and Jenny
Assembly Review Articles:
Jason R. Miller, Sergey Koren, and Granger Sutton
Covers these assemblers: SSAKE, SHARCGS, VCAKE, Newbler, Celera, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. Compares the de Bruijn graph approach to overlap/layout/consensus.
Assembly Overview
What is assembly?
Reconstructing contiguous chromosome sequence with a low error rate (the output of an assembler).
The Bermuda standard for a finished genome requires an error rate below 1 in 10,000 bases (1 x 10^-4).1)
To reduce error rate in short reads, stack up many reads and take the most common base at each position.
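The "stack up many reads and take the most common base" idea can be sketched as a per-column majority vote. This is only an illustration: it assumes the reads are already aligned to the same columns and padded with '-' where a read does not cover a position, and the function name consensus and the toy reads are made up for this example.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over reads aligned to the same columns.

    Each read is a string of equal length; '-' marks a column the read
    does not cover (an assumption of this sketch, not a standard format).
    """
    length = len(aligned_reads[0])
    result = []
    for i in range(length):
        # Collect the bases that actually cover this column.
        column = [r[i] for r in aligned_reads if r[i] != "-"]
        if not column:
            result.append("N")  # no read covers this position
        else:
            result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)

# Toy stack of reads with two sequencing errors (columns 3 and 7).
reads = [
    "ACGTAC--",
    "ACGAACGT",
    "-CGTACGT",
    "--GTACGA",
]
print(consensus(reads))  # ACGTACGT
```

Each error is outvoted because the other reads covering that column agree, which is why deeper coverage lowers the consensus error rate.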
How much data do we have?
Two issues to worry about:
Do we read every base in the chromosome?
Probability of covering base i with 1 read assuming a uniform distribution:
let l = read length, G = genome size, R = number of reads
p = l / G, the probability of covering base i with 1 read
1 - p, the probability of not covering base i with 1 read
(1 - p) ^ R, the probability of not covering base i with R reads
c = R * l / G, the coverage
R = c / p
lim ( (1-p)^(c/p), p → 0 ) = e^-c, so the probability of missing base i is approximately e^-c
Expected number of missed bases = G * e^-c
How much coverage do we need to get less than one missing base per genome?
G * e^-c < 1
e^-c < 1 /G
c > ln (G)
For the human genome (about 3 gigabases), you need approximately 22x coverage, or 66 gigabases of sequencing data, to expect every base to be read at least once.
With just the 454 data, we expect between 0.1x and 1x coverage for the banana slug (assuming the banana slug genome is between 700 Mb and 7 Gb).
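The coverage arithmetic above can be checked with a few lines of Python. This is a sketch under the same uniform-coverage assumption as the derivation; the function names are made up here, and the 3e9 human genome size is an approximation.

```python
import math

def expected_missed_bases(genome_size, coverage):
    """Expected number of uncovered bases, G * e^-c."""
    return genome_size * math.exp(-coverage)

def coverage_for_no_missed_bases(genome_size):
    """Coverage c at which G * e^-c = 1, i.e. c = ln(G)."""
    return math.log(genome_size)

human = 3e9  # approximate human genome size in bases (assumption)
c = coverage_for_no_missed_bases(human)
print(f"required coverage: {c:.1f}x")
print(f"sequence needed: {c * human / 1e9:.1f} Gb")

# Fraction of the genome expected to be missed at low coverage,
# as in the banana slug estimate (0.1x to 1x).
for cov in (0.1, 1.0):
    print(f"{cov}x coverage misses about {math.exp(-cov):.0%} of bases")
```

The required coverage grows only with the logarithm of genome size, so even a 10x larger genome needs only about 2.3x more coverage by this estimate.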
Scaffolds
What is a scaffold?
Given a set of contiguous segments of DNA (contigs), order and orient them. A scaffold is a set of contigs with order and orientation information.
Paired-end libraries are helpful for joining gaps between contigs.
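As a rough illustration of how a paired-end library helps bridge a gap, the sketch below estimates the gap between two contigs from read pairs whose two ends land in different contigs. It assumes both contigs are already in forward orientation and that the library insert size is known; the function names, parameters, and numbers are all hypothetical.

```python
from statistics import median

def estimate_gap(insert_size, left_contig_len, left_read_start, right_read_end):
    """Estimate the gap between two contigs from one spanning read pair.

    insert_size:      expected end-to-end span of the pair (library property)
    left_read_start:  0-based start of the first read in the left contig
    right_read_end:   end (exclusive) of the mate's alignment in the right contig
    """
    covered_left = left_contig_len - left_read_start  # bases of the insert inside the left contig
    covered_right = right_read_end                    # bases of the insert inside the right contig
    return insert_size - covered_left - covered_right

# Three hypothetical pairs from a 3 kb library linking contig A (10,000 bp) to contig B.
pairs = [(9200, 700), (9500, 1050), (8900, 380)]
gaps = [estimate_gap(3000, 10000, start, end) for start, end in pairs]
print(gaps)          # [1500, 1450, 1520]
print(median(gaps))  # 1500

# A scaffold can then be stored as ordered (contig, orientation, gap_after) records.
scaffold = [("contigA", "+", median(gaps)), ("contigB", "+", 0)]
```

Real insert sizes vary from pair to pair, so taking the median over many linking pairs is one simple way to get a steadier estimate than any single pair gives.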
Assembly overview (Students will need to verify these teachings with papers on different assemblers)
Cluster the reads.
Stick reads together.
Base calling provides a consensus for a contig.
Can be a greedy algorithm or dynamic programming algorithm.
Order and orient.
May have mapping to find out where reads go.
May also try to form contigs out of leftover reads.
Can find repeat regions using paired-end data.
Most resequencing projects map reads to existing scaffolds and create contigs based upon the mapping. Sections with missing read data can be assumed to be a deletion or an alteration relative to the existing scaffold.
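The "stick reads together" step with a greedy algorithm can be sketched as follows: repeatedly merge the two sequences with the longest exact suffix-prefix overlap. This is a toy under strong assumptions (error-free reads, one strand, no repeats) and is not the method of any particular assembler listed above; the function names and the min_len parameter are made up for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)

reads = ["ACGTTGC", "TTGCATG", "CATGGA"]
print(greedy_assemble(reads))  # ['ACGTTGCATGGA']
```

The all-pairs search is quadratic in the number of reads at every merge, which is fine for a toy but is one reason real assemblers index overlaps or use de Bruijn graphs instead.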
References
Discussion
You are correct, I meant the Bermuda standard—bad memory strikes again!
Do you really mean “MIAMI standard for a finished genome”? It was my understanding that the genome finishing standard was still the “Bermuda Principles”1)2) from a meeting in Bermuda in 1997, which defined an acceptable error rate of less than 1 in 10,000 (i.e., < 1*10^-4), and the MIAME (Minimum Information About a Microarray Experiment) standard3) was for microarray data? There is also the MIGS Standard4) for sequenced genomes, but that is more about providing information about what was actually sequenced (such as culture conditions, etc)…