Banana Slug Genomics

Homework

Pick two or three assemblers.

- Find out where to get them.
- Find out how to install them.
- Find out what papers there are about them.
- Create the wiki page about them.
- Possibly install them.

Volunteers:

Phrap: Galt and Shyamini
Velvet: Hyunsung and Galt
ABySS: Galt and Chris
AMOS: Shyamini and Herbert
Arachne: John and Michael
CAP3/PCAT: Michael and Galt
Celera: Shyamini and Hyunsung
Euler/Euler-sr: Herbert and John
MIRA1: Herbert and Michael
TIGR Assembler: John and Shyamini
SHARCGS: Michael and Chris
SSAKE: Herbert and Hyunsung
Staden gap4 package: Michael and Hyunsung
VCAKE: Chris and John
Phusion: Shyamini and Michael
QSRA: Herbert and Chris
SOLiD System Tools (Corona_lite, etc): Hyunsung and Chris
Newbler documentation: Galt and Herbert
SOAPdenovo: Galt and Jenny

Assembly Review Articles:

Jason R. Miller, Sergey Koren and Granger Sutton [1]
Covers these assemblers: SSAKE, SHARCGS, VCAKE, Newbler, Celera, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. Compares de Bruijn graph approaches to overlap/layout/consensus.

Assembly Overview

What is assembly?
- Sequence data consists of many short reads (the inputs to the assembler).
  - Sanger reads can be 1200 bp long.
    - These are still short reads compared to a small prokaryotic genome or a larger eukaryotic genome.
  - 454 reads are about 400 bp long.
    - These are not considered short reads by sequencing standards.
    - Long enough to be uniquely mappable to most regions except for very large repetitive regions.
  - SOLiD reads are about 50 bp long for shotgun libraries or 2x25 bp for paired-end reads.
    - Paired reads are two reads taken from the same DNA fragment.
  - Illumina reads can be long, but quality begins to drop off after 50 base pairs.
- All the second-generation sequencers produce noisy reads.
  - 454 may have the lowest error rate at about 0.2%.
  - SOLiD has a variable error rate. One run on the POG genome had an error rate between 0.5% and 6%.
    - Error rate should increase linearly with length and position in the read.
    - Mate pair error rates should be similar to shotgun error rates with a higher error rate for the second fragment read.
    - POG dataset had two large spikes at two positions on the second read.
    - The quality measure output by the SOLiD sequencer mirrored the error rate.
  - No error data is available for the Illumina platform.
  - Expect half your reads to have an error in them.
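The "half your reads" estimate can be checked with a short calculation. The sketch below assumes, purely for illustration, that every base in a read is miscalled independently at the same per-base rate; the notes above say real error rates rise along the read, so this is only a rough model. The function name is hypothetical.

```python
def prob_read_has_error(per_base_error_rate: float, read_length: int) -> float:
    """Probability that a read contains at least one miscalled base.

    Illustrative assumption: each base fails independently with the
    same per-base error rate.
    """
    return 1.0 - (1.0 - per_base_error_rate) ** read_length


# 454-like read from the notes: about 0.2% error rate, about 400 bp long.
p_454 = prob_read_has_error(0.002, 400)
print(f"454-like read with at least one error: {p_454:.2f}")

# SOLiD-like 50 bp read at the low and high ends of the observed range.
for rate in (0.005, 0.06):
    p = prob_read_has_error(rate, 50)
    print(f"50 bp read at {rate:.1%} per-base error: {p:.2f}")
```

Under these assumptions the 454-like case comes out a little over one half, which is consistent with the rule of thumb above.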
Contiguous chromosomes with a low error rate (output from assemblers).
- The Bermuda standard for a finished genome requires an error rate of less than 1 x 10^-4 (fewer than 1 error per 10,000 bases). [2] [3]
- To reduce error rate in short reads, stack up many reads and take the most common base at each position.
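A minimal sketch of the "stack up reads and take the most common base" idea follows. It assumes the reads have already been aligned to one another and padded to equal length, with "-" marking positions a read does not cover; real assemblers also weigh base quality scores, which this toy version ignores. The function name and the example reads are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def consensus(aligned_reads: Sequence[str]) -> str:
    """Majority-vote consensus over reads that are already aligned.

    Assumes every read has the same length; '-' marks a position
    that a read does not cover.
    """
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        # Take the most frequent base, or 'N' when no read covers the column.
        result.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(result)


reads = [
    "ACGTACGT",
    "ACGTACCT",  # one miscalled base near the end
    "ACGAACGT",  # one miscalled base in the middle
    "ACGTACGT",
]
print(consensus(reads))  # prints ACGTACGT
```

With four reads covering each position, a single miscall in any one read is outvoted by the other three.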
How much data do we have?
- Let R = number of reads
- Let l = read length measured in bases
- Let G = genome size measured in bases
Two issues to worry about:
- Do we read every base in the chromosome?
- Probability of covering base i with 1 read assuming a uniform distribution:
  - let p = l / G, the probability of covering base i with 1 read
  - 1 - p, the probability of not covering base i with 1 read
  - (1 - p) ^ R, probability of not covering base i with R reads
  - c = R * l / G, coverage
  - R = c / p
  - lim_{p → 0} (1 - p)^(c/p) = e^-c
  - Expected number of missed bases = G * e^-c
- How much coverage do we need to get less than one missing base per genome?
  - G * e^-c < 1
  - e^-c < 1 / G
  - c > ln(G)
- For the human genome (about 3 gigabases), you need approximately 22x coverage, or 66 gigabases of sequencing data, to expect every base to be read at least once.
  - It is reasonable to expect that we will not have enough coverage to see every base in the genome.
- On 454 data alone, we expect between 0.1x and 1x coverage for the banana slug (assuming the banana slug genome is between 700 Mb and 7 Gb).
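The coverage formulas above can be turned into a few lines of Python. This is a sketch under the same uniform-sampling assumption as the derivation. The function names are hypothetical, and the human genome size of 3 x 10^9 bases is an assumed round figure chosen to match the 22x and 66 gigabase numbers.

```python
import math


def coverage(num_reads: float, read_length: float, genome_size: float) -> float:
    """c = R * l / G."""
    return num_reads * read_length / genome_size


def expected_missed_bases(genome_size: float, c: float) -> float:
    """G * e^-c, the expected number of bases that no read covers."""
    return genome_size * math.exp(-c)


def coverage_for_no_missed_bases(genome_size: float) -> float:
    """Coverage at which G * e^-c equals 1, i.e. c = ln(G)."""
    return math.log(genome_size)


human_genome = 3e9  # assumed genome size in bases
c_needed = coverage_for_no_missed_bases(human_genome)
print(f"coverage needed: {c_needed:.1f}x")
print(f"sequence needed: {c_needed * human_genome / 1e9:.1f} gigabases")

# Fraction of bases left unread at the coverage range expected for the slug.
for c in (0.1, 1.0):
    print(f"at {c}x coverage, fraction of bases unread: {math.exp(-c):.2f}")
```

At 1x coverage the model leaves roughly 37% of bases unread, and at 0.1x roughly 90%, so 454 data alone would cover only part of the banana slug genome.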
Scaffolds
- What is a scaffold?
  - Given a set of contiguous segments of DNA (contigs), order and orient them. A scaffold is a set of contigs with order and orientation information.
  - Paired-end libraries are helpful for joining gaps between contigs.
    - Short reads from paired-end libraries are seldom used for forming contigs.
    - It is possible to cluster short reads at the end of contigs.
Assembly overview (Students will need to verify these teachings with papers on different assemblers)
1. Cluster the reads.
2. Stick reads together.
  1. Base calling provides a consensus for a contig.
  2. Can be a greedy algorithm or dynamic programming algorithm.
3. Order and orient.
4. May have mapping to find out where reads go.
5. May also try to form contigs out of leftover reads.
6. Can find repeat regions using paired-end data.
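As a toy illustration of the greedy option for step 2, the sketch below repeatedly merges the two sequences with the longest exact suffix-prefix overlap. It is not how any particular assembler from the volunteer list works. It assumes error-free reads that all come from the same strand, and it ignores repeats, reverse complements, and paired-end information. The function names, the `min_len` parameter, and the example reads are all made up for illustration.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to join
        # Join the pair, keeping the overlapping bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


example_reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
print(greedy_assemble(example_reads))  # prints ['ATGGCGTGCAATC']
```

The all-pairs overlap search is quadratic in the number of reads, which is fine for a handful of reads but not for real data sets.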
Most resequencing projects map reads to existing scaffolds and create contigs based on the mapping. Sections with missing read data can be assumed to be a deletion or an alteration relative to the existing scaffold.

References

1. Jason R. Miller, Sergey Koren, Granger Sutton. Assembly algorithms for next-generation sequencing data. Genomics, In Press, Corrected Proof, available online 6 March 2010. ISSN 0888-7543. DOI: 10.1016/j.ygeno.2010.03.001. http://www.sciencedirect.com/science/article/B6WG1-4YJ6GD8-1/2/ae6c957910e4ea658cdebff4a0ce9793

2. http://www.genome.gov/page.cfm?pageID=10506376

3. http://www.ornl.gov/sci/techresources/Human_Genome/research/bermuda.shtml

Discussion

Kevin Karplus, 2010/04/16 00:59

You are correct, I meant the Bermuda standard—bad memory strikes again!

Jenny Draper, 2010/04/12 03:02

Do you really mean “MIAMI standard for a finished genome”? It was my understanding that the genome finishing standard was still the “Bermuda Principles”¹⁾²⁾ from a meeting in Bermuda in 1997, which defined an acceptable error rate of less than 1 in 10,000 (i.e., < 1 x 10^-4), and that the MIAME (Minimum Information About a Microarray Experiment) standard³⁾ was for microarray data? There is also the MIGS Standard⁴⁾ for sequenced genomes, but that is more about providing information about what was actually sequenced (such as culture conditions, etc.)…