
Brief overview of goals and data input characteristics

Kevin laid out some of the logistics of the class.

A broad goal: for each chromosome in the slug, we want the full sequence of DNA bases. Since that is unlikely to be completed within one quarter, we set some smaller goals: build contigs and work out a scaffold in which to arrange them.

Inputs: Sequencing data from various machines. Some of the characteristics of these machines/techniques:

Sanger capillary

454

SOLiD

Illumina

Ion Torrent

PacBio

Coverage

We briefly discussed how much sequence data would be required to assemble the genome. First, we considered the probability of seeing a particular base i in a single read j.

P( seeing base i in read j ) = L/G

where L is the read length and G is the total size of the genome. If we have R reads, then

P( never seeing base i ) = (1 - L/G)^R

We can multiply L/G by R/R to get ((L*R) / G) / R, or C / R, where C = (L*R) / G is our coverage of the genome. Holding C fixed, we take the limit as R goes to infinity:

lim R->inf (1 - C/R)^R = e^-C

Each of the G bases is therefore missed with probability about e^-C, so we can expect to miss about G*e^-C bases.
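To get a feel for these numbers, here is a minimal Python sketch of the estimate above. It compares the exact expression G*(1 - L/G)^R with the approximation G*e^-C, and also inverts the approximation to find the coverage at which fewer than one base is expected to be missed. The genome size and read length are hypothetical placeholders, not values from the class, and the function names are made up for illustration. Like the derivation, the sketch assumes reads land uniformly at random and ignores effects at the ends of chromosomes.

```python
import math


def expected_missed_bases(genome_size, read_length, num_reads):
    """Return (exact, approx) expected number of bases covered by no read.

    exact  = G * (1 - L/G)^R
    approx = G * e^-C, with coverage C = L*R/G
    """
    coverage = read_length * num_reads / genome_size
    # log1p keeps (1 - L/G)^R accurate when L/G is tiny and R is huge.
    log_p_miss = num_reads * math.log1p(-read_length / genome_size)
    exact = genome_size * math.exp(log_p_miss)
    approx = genome_size * math.exp(-coverage)
    return exact, approx


def coverage_for_expected_misses(genome_size, max_missed=1.0):
    """Coverage C at which G * e^-C equals max_missed, i.e. C = ln(G / max_missed)."""
    return math.log(genome_size / max_missed)


if __name__ == "__main__":
    G = 2_000_000_000  # hypothetical genome size in bases (assumption)
    L = 100            # hypothetical read length in bases (assumption)
    for C in (1, 5, 10, 20):
        R = C * G // L  # number of reads needed for coverage C
        exact, approx = expected_missed_bases(G, L, R)
        print(f"C={C:2d}  reads={R:,}  exact={exact:.1f}  approx={approx:.1f}")
    needed = coverage_for_expected_misses(G)
    print(f"coverage for < 1 expected missed base: {needed:.1f}x")
```

Because the number of missed bases falls off exponentially with C, the coverage needed to expect fewer than one missed base grows only as ln(G) under this model.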

We cannot assemble an entire chromosome if we are missing bases. However, we can construct contiguous stretches of bases, or contigs, and later arrange them into scaffolds using other information, such as long-distance physical maps.
