lecture_notes:04-05-2010

Mon Apr 05 2010: Hyunsung John Kim

Please add to or modify this page as you see fit!

Homework

Pick two or three assemblers.

  • Find out where to get them
  • How to install them
  • What papers there are about them
  • Create the wiki page about them
  • And possibly install them

Volunteers:

  • Phrap: Galt and Shyamini
  • Velvet: Hyunsung and Galt
  • ABySS: Galt and Chris
  • AMOS: Shyamini and Herbert
  • Arachne: John and Michael
  • CAP3/PCAP: Michael and Galt
  • Celera: Shyamini and Hyunsung
  • Euler/Euler-sr: Herbert and John
  • MIRA1: Herbert and Michael
  • TIGR Assembler: John and Shyamini
  • SHARCGS: Michael and Chris
  • SSAKE: Herbert and Hyunsung
  • Staden gap4 package: Michael and Hyunsung
  • VCAKE: Chris and John
  • Phusion: Shyamini and Michael
  • QSRA: Herbert and Chris
  • SOLiD System Tools (Corona_lite, etc): Hyunsung and Chris
  • Newbler documentation: Galt and Herbert
  • SOAPdenovo: Galt and Jenny

Assembly Review Articles:

  • Jason R. Miller, Sergey Koren, and Granger Sutton [1]
    Covers these assemblers: SSAKE, SHARCGS, VCAKE, Newbler, Celera, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. Compares de Bruijn graph assembly to overlap/layout/consensus.

Assembly Overview

  • What is assembly?
    • Sequence data consists of many short reads (the inputs to the assembler).
      • Sanger reads can be 1200 bp long.
        • These are still short reads compared to a small prokaryotic genome or a larger eukaryotic genome.
      • 454 reads are about 400 bp long.
        • These are not considered short reads by sequencing standards.
        • Long enough to be uniquely mappable to most regions except for very large repetitive regions.
      • SOLiD reads are about 50 bp long for shotgun libraries or 2x25 bp for paired-end reads.
        • Paired reads are multiple reads from the same strand
      • Illumina reads can be long, but quality begins to drop off after 50 base pairs.
    • All the second-generation sequencers produce noisy reads.
      • 454 may have the lowest error rate at about 0.2%.
      • SOLiD has a variable error rate. One run on the POG genome had an error rate between 0.5% and 6%.
        • Error rate should increase linearly with length and position in the read.
        • Mate-pair error rates should be similar to shotgun error rates, with a higher error rate for the second fragment read.
        • The POG dataset had two large error spikes at two positions on the second read.
        • The quality measure output by the SOLiD sequencer mirrored the error rate.
      • No error data on the Illumina platform.
      • Expect half your reads to have an error in them.
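
The "half your reads" rule of thumb can be checked with a short calculation. The sketch below assumes errors are independent and occur at a constant per-base rate, which is a simplification (the notes above say the error rate rises along the read). The function names are made up for illustration. With the 454 figures quoted above (0.2% error, 400 bp), it gives roughly a 55% chance that a read contains at least one error.

```python
def prob_read_has_error(per_base_error, read_length):
    """Probability that a read contains at least one erroneous base.

    Assumes independent errors at a constant per-base rate (a simplification).
    """
    return 1.0 - (1.0 - per_base_error) ** read_length


def error_rate_for_half_bad_reads(read_length):
    """Per-base error rate at which exactly half of the reads contain an error."""
    return 1.0 - 0.5 ** (1.0 / read_length)


if __name__ == "__main__":
    # 454-like reads: about 0.2% error, about 400 bp (figures from the notes).
    print(prob_read_has_error(0.002, 400))
    # Per-base error rate at which half of all 50 bp reads have an error.
    print(error_rate_for_half_bad_reads(50))
```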
  • Contiguous chromosomes with a low error rate (the output from assemblers).
    • The Bermuda standard for a finished genome calls for an error rate below 1 in 10,000 bases (1 x 10^-4). [2] [3]
    • To reduce error rate in short reads, stack up many reads and take the most common base at each position.
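
Stacking reads and taking the most common base at each position amounts to a column-wise majority vote. A minimal sketch, assuming the reads have already been aligned to common coordinates and padded with "-" where a read does not cover a position. The function name, the padding convention, and the "N" placeholder are illustrative choices, not taken from any particular assembler.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over reads aligned to the same coordinates.

    Positions a read does not cover are marked with `gap` and ignored.
    Columns with no coverage are reported as "N". Ties go to the base
    seen first in the column.
    """
    if not aligned_reads:
        return ""
    width = max(len(read) for read in aligned_reads)
    result = []
    for i in range(width):
        column = [r[i] for r in aligned_reads if i < len(r) and r[i] != gap]
        if column:
            result.append(Counter(column).most_common(1)[0][0])
        else:
            result.append("N")
    return "".join(result)


if __name__ == "__main__":
    stacked = [
        "ACGTAC",
        "ACGAAC",  # one sequencing error at position 3
        "ACGTAC",
        "--GTAC",  # read starts two bases later
    ]
    print(consensus(stacked))
```

A real base caller would also weight each base by its quality score rather than counting votes equally.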
  • How much data do we have?
    • Let R = number of reads
    • Let l = read length, measured in bases
    • Let G = genome size, measured in bases
  • Two issues to worry about:
    • Do we read every base in the chromosome?
    • Probability of covering base i with 1 read assuming a uniform distribution:
      • let p = l / G, the probability of covering base i with 1 read
      • 1 - p, the probability of not covering base i with 1 read
      • (1 - p) ^ R, probability of not covering base i with R reads
      • c = R * l / G, coverage
      • R = c / p, so (1 - p)^R = (1 - p)^(c/p)
      • lim ( (1-p)^(c/p), p → 0 ) = e^-c
      • Expected number of missed bases = G * e^-c
    • How much coverage do we need to get less than one missing base per genome?
      • G * e^-c < 1
      • e^-c < 1 / G
      • c > ln (G)
    • For the human genome (G ≈ 3 gigabases), ln(G) ≈ 21.8, so you need approximately 22x coverage, or 66 gigabases of sequencing data, to expect to read every base at least once.
      • It is reasonable to expect that we will not have enough coverage to see every base in the genome.
    • On 454 data alone, we expect between 0.1x and 1x coverage for the banana slug (assuming the banana slug genome is between 700 Mb and 7 Gb).
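
The coverage arithmetic above can be reproduced numerically. This is a sketch under the same uniform-sampling assumption as the derivation. The function names are made up for illustration, and the 3 Gb human genome size is an approximation. At 1x coverage the model predicts that about 63% of bases are covered at least once, and at 0.1x about 9.5%.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """c = R * l / G."""
    return num_reads * read_length / genome_size


def prob_base_missed(c):
    """Limit of (1 - p)^(c/p) as p -> 0: a given base is covered by no read."""
    return math.exp(-c)


def expected_missed_bases(genome_size, c):
    """G * e^-c."""
    return genome_size * prob_base_missed(c)


def fraction_covered(c):
    """Expected fraction of the genome covered by at least one read."""
    return 1.0 - prob_base_missed(c)


def min_coverage_for_every_base(genome_size):
    """Coverage at which the expected number of missed bases drops to one: ln(G)."""
    return math.log(genome_size)


if __name__ == "__main__":
    human = 3e9  # approximate human genome size in bases (assumption)
    c = min_coverage_for_every_base(human)
    print("coverage needed:", c)
    print("bases to sequence:", c * human)
    for slug_c in (0.1, 1.0):  # the banana slug range from the notes
        print(slug_c, "x covers fraction", fraction_covered(slug_c))
```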
  • Scaffolds
    • What is a scaffold?
      • Given a set of contiguous segments of DNA (contigs), order and orient them. A scaffold is a set of contigs with order and orientation information.
      • Paired-end libraries are helpful for joining gaps between contigs.
        • Short reads from paired-end libraries are seldom used for forming contigs.
        • It is possible to cluster short reads at the end of contigs.
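
One way to picture a scaffold is as an ordered list of contig placements, each with an orientation and an estimated gap to the next contig. The sketch below is only an illustration of that data structure. The `Placement` class, its field names, and the convention of filling gaps with "N" are assumptions, and in practice the gap estimates would have to come from something like paired-end insert sizes.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Placement:
    contig_id: str      # key into the contig dictionary
    forward: bool       # True if the contig is used as given
    gap_after: int = 0  # estimated unknown bases before the next contig


def render_scaffold(placements, contigs):
    """Join contigs in scaffold order, flipping reversed ones and padding gaps with N."""
    parts = []
    for i, p in enumerate(placements):
        seq = contigs[p.contig_id]
        parts.append(seq if p.forward else reverse_complement(seq))
        if i < len(placements) - 1:
            parts.append("N" * max(p.gap_after, 0))
    return "".join(parts)


if __name__ == "__main__":
    contigs = {"c1": "ACGT", "c2": "GGTA"}
    scaffold = [Placement("c1", True, gap_after=3), Placement("c2", False)]
    print(render_scaffold(scaffold, contigs))
```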
  • Assembly overview (Students will need to verify these teachings with papers on different assemblers)
    1. Cluster the reads.
    2. Stick reads together.
      1. Base calling provides a consensus for a contig.
      2. Can be a greedy algorithm or dynamic programming algorithm.
    3. Order and orient.
    4. May have mapping to find out where reads go.
    5. May also try to form contigs out of leftover reads.
    6. Can find repeat regions using paired-end data.
  • Most resequencing projects map reads to scaffolds and create contigs based on the mapping. Sections with missing read data can be assumed to be a deletion from, or an alteration to, the existing scaffold.
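
The "stick reads together" step with a greedy algorithm can be sketched as repeatedly merging the two sequences with the longest suffix-to-prefix overlap. This is a toy version for intuition, not how any of the assemblers listed above is confirmed to work. It assumes error-free reads, exact-match overlaps, and a single strand (no reverse complements), and the function names and the `min_overlap` parameter are made up for illustration.

```python
from itertools import permutations


def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Greedily merge the best-overlapping pair until no overlap >= min_overlap remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(range(len(contigs)), 2):
            k = overlap(contigs[a], contigs[b], min_overlap)
            if k > best_len:
                best_len, best_a, best_b = k, a, b
        if best_len == 0:
            return contigs
        merged = contigs[best_a] + contigs[best_b][best_len:]
        contigs = [c for i, c in enumerate(contigs) if i not in (best_a, best_b)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACG", "TACGGAT"]
    print(greedy_assemble(reads))
```

Each pass compares every pair of sequences, so this only scales to a handful of reads. Greedy merging can also join reads wrongly across repeats, which is one reason paired-end data matters.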

References

1. Jason R. Miller, Sergey Koren, Granger Sutton. Assembly algorithms for next-generation sequencing data. Genomics, In Press, Corrected Proof, available online 6 March 2010. ISSN 0888-7543. DOI: 10.1016/j.ygeno.2010.03.001. http://www.sciencedirect.com/science/article/B6WG1-4YJ6GD8-1/2/ae6c957910e4ea658cdebff4a0ce9793

Discussion

2010/04/16 00:59

You are correct, I meant the Bermuda standard—bad memory strikes again!

2010/04/12 03:02

Do you really mean “MIAMI standard for a finished genome”? It was my understanding that the genome finishing standard was still the “Bermuda Principles” from a meeting in Bermuda in 1997, which defined an acceptable error rate of less than 1 in 10,000 (i.e., < 1*10^-4), and the MIAME (Minimum Information for a Microarray Experiment) standard was for microarray data? There is also the MIGS Standard for sequenced genomes, but that is more about providing information about what was actually sequenced (such as culture conditions, etc)…

lecture_notes/04-05-2010.txt · Last modified: 2010/04/16 01:16 by karplus