Overview of Assembly
Kevin outlined the processes involved in assembling a genome.
Clean Up Reads
There are two distinct parts of data cleanup: error correction and contaminant removal.
Error Correction
K-mer Counting
Count the number of occurrences of each K-mer in the reads.
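Remove reads that contain low-count K-mers, or correct the individual bases responsible. Low counts usually indicate sequencing errors.
The K-mer size must be large enough that the counts are not trivial (very short K-mers occur throughout the genome by chance), but small enough that the table of counts fits in memory.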
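The counting and filtering steps above can be sketched in a few lines of Python. This is only an illustration, not how Jellyfish or any real error corrector works: the function names and the min_count threshold are made up for the example, reverse complements are ignored, and the whole table is held in an ordinary in-memory dictionary. Since there are up to 4^K possible K-mers, that table is what grows with K.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_reads(reads, k, min_count):
    """Keep only reads whose K-mers all occur at least min_count times.

    min_count is a hypothetical threshold chosen for illustration.
    """
    counts = count_kmers(reads, k)
    kept = []
    for read in reads:
        kmers = (read[i:i + k] for i in range(len(read) - k + 1))
        if all(counts[kmer] >= min_count for kmer in kmers):
            kept.append(read)
    return kept


reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
clean = filter_reads(reads, k=4, min_count=2)
```

In this toy input the last read differs from the others by one base, so two of its K-mers (CGTT and GTTC) are seen only once and the read is dropped.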
Contaminant Removal
Contamination can come from many sources:
Human (dust)
Bacterial
Viral (hard to remove)
Use BLAST to remove sequences that are unexpected.
This risks removing parts of the target genome that are very similar to the contaminant.
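One way the BLAST step might be applied is sketched below. It assumes the reads have already been searched against a database of suspected contaminants and that the hits were written in BLAST's tabular format (-outfmt 6), where the first four columns are query ID, subject ID, percent identity, and alignment length. The function names and both thresholds are hypothetical. Raising the thresholds lowers the risk of discarding genuine target sequence, at the cost of letting more contamination through.

```python
def contaminant_read_ids(blast_lines, min_identity=95.0, min_length=50):
    """Return the IDs of reads with a strong hit to a contaminant.

    blast_lines: lines of BLAST tabular output (-outfmt 6).
    min_identity, min_length: illustrative thresholds, not recommendations.
    """
    flagged = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        read_id = fields[0]
        identity = float(fields[2])
        length = int(fields[3])
        if identity >= min_identity and length >= min_length:
            flagged.add(read_id)
    return flagged


def remove_contaminants(reads, flagged):
    """reads: dict mapping read ID to sequence. Drops flagged reads."""
    return {rid: seq for rid, seq in reads.items() if rid not in flagged}
```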
Cluster Reads and Build Contigs
Build a graph for the reads
De Bruijn Graph
Overlap-Layout-Consensus
Reads that don't quite fit can be error corrected.
Ideally use high-quality data (Sanger, 454). The recent trend is to use cheaper data.
Result: Contigs
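The De Bruijn approach can be illustrated with a toy example. In the sketch below, nodes are (K-1)-mers and every K-mer in a read adds an edge from its prefix to its suffix. A contig is then read off by walking forward while the path is unambiguous. The function names are made up for this example. Real assemblers also handle reverse complements, sequencing errors, and repeats, none of which are attempted here.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk forward from start while each step has exactly one way on.

    Stops at a branch, a merge (in-degree other than 1), a dead end,
    or a node already visited.
    """
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if in_degree[nxt] != 1 or nxt in visited:
            break
        contig += nxt[-1]  # each step adds one new base
        visited.add(nxt)
        node = nxt
    return contig


reads = ["ACGTG", "CGTGA", "GTGAT"]
graph = build_de_bruijn(reads, k=3)
contig = extend_contig(graph, "AC")
```

Here the three overlapping reads collapse into the single contig ACGTGAT.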
Order and Orient Contigs
Homework
Learn about the Jellyfish tool for K-mer counting. Try running it on the Pyrobaculum data.
Use different parameters and monitor its memory usage. Fill in the wiki page for Jellyfish.