De novo Assembly II
Guest lecturer: Stefan Prost, email@example.com
Illumina Paired-end Sequencing Libraries
Illumina Mate-Pair Sequencing Libraries
Idea: get paired reads that are much farther apart, which gives more long-range information for scaffolding
Basically the same idea as paired-end reads, except that the middle section of the fragment is missing and the reads are oriented the opposite way (pointing outward instead of inward).
Genomic DNA > Fragment (2-5 kb) > biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > …
Cut DNA, attach a biotin tag to both ends of the target molecule
Circularize target molecule
The circularized molecule is fragmented, and the fragment carrying the biotin tag (i.e. the two ends of the original target joined to each other backwards, with the tags in between) is enriched
Then: end repair, A-tailing, adapter ligation, amplification, sequencing
Using mate pairs depends on inferring the insert size, which can be tricky
Most companies can deliver 8 kb inserts; with skill you can get up to 20 kb
It is a complicated protocol, and things can go wrong at the intermediate steps
Important difference between paired ends and mate-pairs: ends are oriented the opposite way.
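The orientation difference can be sketched with a toy simulation. This is a simplified model, not the actual library protocol: it ignores the biotin label, chimeric junctions, and sequencing errors, and the helper names (`paired_end_reads`, `mate_pair_reads`, `locate`) and the sizes used are made up for illustration.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def paired_end_reads(fragment, read_len):
    """Read 1 comes from the left end (forward strand),
    read 2 from the right end (reverse strand): reads point inward."""
    return fragment[:read_len], revcomp(fragment[-read_len:])

def mate_pair_reads(fragment, read_len, junction_half):
    """Toy mate-pair model: circularize the long fragment, keep the short
    piece spanning the junction (end of fragment + start of fragment),
    then sequence that piece like an ordinary paired-end library."""
    junction = fragment[-junction_half:] + fragment[:junction_half]
    return paired_end_reads(junction, read_len)

def locate(read, reference):
    """Return (position, strand) of a read on the reference."""
    pos = reference.find(read)
    if pos >= 0:
        return pos, "+"
    return reference.find(revcomp(read)), "-"

random.seed(0)
fragment = "".join(random.choice("ACGT") for _ in range(3000))  # a 3 kb insert

pe = [locate(r, fragment) for r in paired_end_reads(fragment, 50)]
mp_reads = mate_pair_reads(fragment, 50, 250)
mp = [locate(r, fragment) for r in mp_reads]
# Reverse-complementing both mate-pair reads gives paired-end-like orientation.
mp_flipped = [locate(revcomp(r), fragment) for r in mp_reads]

print(pe)          # forward read on the left, reverse read on the right (inward)
print(mp)          # forward read on the right, reverse read on the left (outward)
print(mp_flipped)  # inward again, but the two reads are about 2.5 kb apart
```

In this sketch the mate-pair reads sit near the two ends of the 3 kb fragment even though only a 500 bp junction piece was sequenced, which is where the extra scaffolding distance comes from.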
BAC (Bacterial Artificial Chromosome) and Fosmid Libraries
Read Quality Assessment
Base quality: Phred scores reported by sequencer.
FASTQ files: FASTA-style sequences, plus a Phred score for each base encoded as an ASCII character
Per-base quality is not the whole story; the sequence context matters to the signal processing too
Quality decreases toward the end of a read
PacBio (Pacific Biosciences) sequencing doesn't have GC/AT bias
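As a minimal sketch of how the Phred scores in a FASTQ file can be read back: a Phred score Q corresponds to an error probability of 10^(-Q/10), and each quality character encodes Q as its ASCII code minus an offset. The offset of 33 used here (the Sanger / Illumina 1.8+ convention) is an assumption; older Illumina pipelines used 64. The function names and the two example reads are made up for illustration, and the parser assumes well-formed 4-line records.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def decode_quality(quality_line, offset=33):
    """Turn a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]

def parse_fastq(lines):
    """Yield (name, sequence, qualities) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, decode_quality(qual)

def mean_quality_by_position(records):
    """Average Phred score at each read position across all reads."""
    totals, counts = [], []
    for _, _, quals in records:
        for i, q in enumerate(quals):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

example = [
    "@read1", "ACGT", "+", "IIII",   # 'I' encodes Q40
    "@read2", "ACGT", "+", "II5#",   # '5' encodes Q20, '#' encodes Q2
]
records = list(parse_fastq(example))
per_position = mean_quality_by_position(records)
print(per_position)  # [40.0, 40.0, 30.0, 21.0]
```

A per-position average like this is the simplest way to see quality dropping off toward the end of the reads.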
Estimating Genome Size from Read Data
G = p·n·(l − k + 1) / λ_k
G = genome size
p·n = number of correct reads (n = number of reads, p = proportion of reads that are correct)
l = read length
k = k-mer length
λ_k = mode of the k-mer count histogram
Simpson 2013, arXiv
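A rough sketch of the estimate in Python, reading the formula as G = p·n·(l − k + 1)/λ_k with n reads of length l (this reading of the notes is an assumption). The function names, the `min_count` cutoff used to skip the low-count error k-mers, and the numbers in the example are all hypothetical.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def histogram_mode(counts, min_count=2):
    """Most frequent k-mer multiplicity, ignoring multiplicities below
    min_count (assumed here to be dominated by sequencing errors)."""
    hist = Counter(c for c in counts.values() if c >= min_count)
    if not hist:
        raise ValueError("no k-mers with count >= min_count")
    # On ties, prefer the larger multiplicity.
    return max(hist, key=lambda c: (hist[c], c))

def estimate_genome_size(n_reads, read_len, k, mode, p_correct=1.0):
    """G = p * n * (l - k + 1) / lambda_k."""
    return p_correct * n_reads * (read_len - k + 1) / mode

# Worked example with made-up numbers: 1,000,000 reads of 100 bp,
# k = 31, histogram mode at 35x, all reads assumed correct.
size = estimate_genome_size(1_000_000, 100, 31, 35)
print(size)  # 2000000.0
```

Each read contributes l − k + 1 k-mers, so the numerator is the total number of correct k-mers sequenced, and dividing by the typical k-mer coverage λ_k gives the number of distinct genomic positions.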
Simulated contig lengths in the de Bruijn graph can be used to estimate the best k-mer size for assembly, based on the contig N50.
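For reference, N50 is the contig length such that contigs of that length or longer contain at least half of the total assembled bases. A minimal sketch of computing it and picking the k with the highest N50 follows; the `best_k_by_n50` helper and the contig lengths are hypothetical, and this only does the comparison step, not the contig-length simulation itself.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contigs")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length

def best_k_by_n50(contigs_by_k):
    """Pick the k whose (simulated) contig lengths give the highest N50."""
    return max(contigs_by_k, key=lambda k: n50(contigs_by_k[k]))

# Made-up contig lengths for three candidate k values.
contigs_by_k = {
    21: [100, 50],
    31: [200, 100],
    41: [150, 150],
}
chosen_k = best_k_by_n50(contigs_by_k)
print(chosen_k)  # 31
```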