
Brief overview of goals and data input characteristics

Kevin laid out some of the logistics of the class.

A broad goal: for each chromosome in the slug, we want the full sequence of DNA bases. Since this is unlikely to be completed within one quarter, we set some smaller goals: build contigs and get an idea of the scaffolds in which to arrange them.

Inputs: Sequencing data from various machines. Some of the characteristics of these machines/techniques:

Sanger capillary

  • ~800bp reads[1].
  • Q (quality value) ~30
  • ~$1/read, expensive because primers must be attached to each read.


454

  • ~400bp reads[2].
  • Pyrosequencing
  • Q ~20
  • ~$5000 per run of ~1M reads; runs cannot be scaled down (numbers approximate).


SOLiD

  • 2x25bp or 1x50bp reads
  • Paired-end reads: the fragment is ligated to an adapter, and a restriction enzyme then cleaves 25bp away from the adapter.
  • Potential for double ligation: two unrelated sequences ligating.
  • ~$2000 per run of ~100M reads (numbers approximate).


Illumina

  • 2x50bp or 2x100bp reads (read lengths uncertain)
  • Paired-end reads
  • Potential errors: innies (ligated region not between the sequenced regions) or chimeric reads (sequence passes through the ligated region)
  • Cheaper than SOLiD; the 10K Genomes project uses it.

Ion Torrent

  • 2×100 base pairs
  • ~50,000 to 5,000,000 reads depending on Sequencing Chip [3].
  • Ion semiconductor sequencing. No optics or modified bases are required.

Pac Bio

  • Very long, single-molecule reads (~10K bases)
  • High error rates (~5%)
  • Useful when mapping to a reference.
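
The approximate prices quoted above for three of the platforms can be put on a common footing by converting them to a cost per megabase. The sketch below is only arithmetic on the figures in these notes. The dictionary keys and the function name `cost_per_megabase` are made up for illustration, and it assumes every read reaches its nominal length, which real runs will not.

```python
# Approximate figures copied from the notes above (not vendor pricing).
# Each entry: cost of one run in dollars, reads per run, bases per read.
PLATFORMS = {
    "Sanger capillary (~800bp reads)": {
        "cost_per_run": 1.0, "reads_per_run": 1, "bases_per_read": 800},
    "pyrosequencing (~400bp reads)": {
        "cost_per_run": 5000.0, "reads_per_run": 1_000_000, "bases_per_read": 400},
    "ligation-based (2x25bp reads)": {
        "cost_per_run": 2000.0, "reads_per_run": 100_000_000, "bases_per_read": 50},
}


def cost_per_megabase(cost_per_run, reads_per_run, bases_per_read):
    """Dollars per million sequenced bases, assuming full-length reads."""
    bases_per_run = reads_per_run * bases_per_read
    return cost_per_run / bases_per_run * 1_000_000


if __name__ == "__main__":
    for name, figures in PLATFORMS.items():
        print(f"{name}: ${cost_per_megabase(**figures):,.2f} per Mb")
```

Cost per base ignores read length and error rate, so it is only one axis of comparison: short reads are cheap per base but harder to assemble.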


We briefly discussed how much sequence data would be required to assemble the genome. First, we considered the probability of seeing a particular base i in a single read j.

P( seeing base i in read j ) = L/G

where L is the read length and G is the total size of the genome. If we have R reads, then

P( never seeing base i ) = (1 - L/G)^R

Multiplying L/G by R/R gives ((L*R) / G) / R = C / R, where C = (L*R) / G is our coverage of the genome. Holding C fixed, we take the limit as R goes to infinity:

lim R->inf (1 - C/R)^R = e^-C

Thus we can expect to miss about G*e^-C of the G bases in the genome.
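
The estimate above is easy to check numerically. The following is a minimal sketch, not part of the lecture: the function names are invented for illustration, and it carries the same assumptions as the derivation (reads start at uniformly random positions, end effects are ignored, and every read has length L). It compares the exact expression (1 - L/G)^R with the e^-C approximation and inverts G*e^-C to estimate how many reads are needed to leave at most a given number of bases uncovered.

```python
import math


def coverage(read_len, genome_size, num_reads):
    """Coverage C = L*R/G."""
    return read_len * num_reads / genome_size


def prob_base_missed_exact(read_len, genome_size, num_reads):
    """P(never seeing base i) = (1 - L/G)^R."""
    return (1.0 - read_len / genome_size) ** num_reads


def expected_missed_bases(read_len, genome_size, num_reads):
    """Expected number of uncovered bases, G * e^-C."""
    c = coverage(read_len, genome_size, num_reads)
    return genome_size * math.exp(-c)


def reads_needed(read_len, genome_size, max_missed_bases):
    """Smallest R with G * e^-C <= max_missed_bases.

    Solving for C gives C >= ln(G / max_missed_bases), and R = C*G/L.
    """
    c = math.log(genome_size / max_missed_bases)
    return math.ceil(c * genome_size / read_len)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only; the slug's genome size
    # is not given in these notes.
    L, G, R = 100, 1_000_000, 50_000
    print("coverage:", coverage(L, G, R))
    print("exact P(miss):", prob_base_missed_exact(L, G, R))
    print("approx P(miss):", math.exp(-coverage(L, G, R)))
    print("expected missed bases:", expected_missed_bases(L, G, R))
    print("reads for <= 1 missed base:", reads_needed(L, G, 1))
```

Because the required coverage grows only as ln(G), driving the expected number of missed bases below one takes modest coverage even for large genomes, though this says nothing about sequencing errors or repeats.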

We cannot assemble an entire chromosome if we are missing bases. However, we can construct contiguous stretches of bases, or contigs, and later assemble them into scaffolds using other information, such as long-distance physical maps.


lecture_notes/03-30-2011.txt · Last modified: 2011/04/01 19:20 by svohr