Overview of Assembly

Kevin outlined the processes involved in assembling a genome:

Clean Up Reads
Cluster Reads and Build Contigs
Order and Orient Contigs

Clean Up Reads

There are two distinct parts of data cleanup: error correction and contaminant removal.

Error Correction

May not be necessary for all types of data (e.g. Sanger or 454).
Can be done before or after contig assembly.
- Before: K-mer Counting
- After: Map reads to consensus sequences from contigs.
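The "after" approach can be illustrated with a minimal sketch. It assumes each read has already been placed at a known offset on its contig's consensus with no gaps; the function name `correct_read` and the `max_mismatches` threshold are hypothetical choices for illustration, not part of any particular tool.

```python
def correct_read(read, consensus, offset, max_mismatches=2):
    """Replace a read with the consensus it aligns to, if it is close enough.

    Assumes an ungapped placement of the read at `offset` on the consensus.
    Reads that disagree at more than `max_mismatches` positions are left
    unchanged, since they may be misplaced rather than erroneous.
    """
    target = consensus[offset:offset + len(read)]
    if len(target) != len(read):
        return read  # read hangs off the end of the contig; leave it alone
    mismatches = sum(a != b for a, b in zip(read, target))
    return target if mismatches <= max_mismatches else read


consensus = "TTACGTACGG"
fixed = correct_read("ACGTTC", consensus, offset=2)      # one mismatch
untouched = correct_read("GGGGGG", consensus, offset=2)  # too different
```

A real pipeline would get the placements from a read mapper and would weigh base quality scores rather than counting mismatches alone.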

K-mer Counting

Count the number of occurrences of each K-mer in the reads.
Remove reads, or correct individual bases, that give rise to K-mers with low counts.
K-mer size must be large enough not to produce trivial counts, but small enough to fit memory constraints.
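As a rough illustration of the counting step, the sketch below counts K-mers in a toy set of reads and flags those seen fewer than a chosen number of times. The names `count_kmers`, `weak_kmers` and `min_count` are made up for this example, and it ignores reverse complements, which a real counter would normally merge. Since there are up to 4^k possible K-mers, the table can grow quickly as k increases, which is where the memory constraint comes from.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(counts, min_count):
    """Return the k-mers seen fewer than min_count times."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data: the last read differs from the others at one base.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
suspect = weak_kmers(counts, min_count=2)
```

Here the K-mers that overlap the differing base end up with a count of one, so they are the ones flagged for removal or correction.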

Contaminant Removal

Contamination can come from many sources:
- Human (dust)
- Bacterial
- Viral (hard to remove)
Use BLAST to identify unexpected sequences so they can be removed.
- Expensive to run.
- BLAST contigs instead of individual reads.
- Strategies:
  - Look for specific contaminants (Human, E. coli).
  - Examine ribosomal sequences to identify possible contaminants.
  - Look for things we would not expect to see (e.g. eukaryotic sequence in prokaryotes or vice versa).
- Once a contig is identified as contaminant:
  - Remove the contig and reads that map to it.
  - Rebuild the contigs.
We risk removing parts of the target genome that are very similar to the contaminant.
- Example from class: nitrogen fixing genes common to two bacterial strains.
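The removal step can be sketched as follows, assuming we already have a mapping from read IDs to contig IDs and a set of contigs that BLAST flagged as contaminant. Both inputs and the function name `remove_contaminants` are hypothetical; the sketch only shows the bookkeeping, not the BLAST search itself.

```python
def remove_contaminants(read_to_contig, contaminant_contigs):
    """Split read IDs into those to keep and those to discard.

    read_to_contig maps each read ID to the contig it maps to, or None
    if the read is unplaced. Reads on a contaminant contig are discarded;
    everything else is kept and can go back into contig building.
    """
    kept, discarded = [], []
    for read_id, contig_id in read_to_contig.items():
        if contig_id in contaminant_contigs:
            discarded.append(read_id)
        else:
            kept.append(read_id)
    return kept, discarded


read_to_contig = {"r1": "c1", "r2": "c2", "r3": "c2", "r4": None}
kept, discarded = remove_contaminants(read_to_contig, {"c2"})
```

Keeping the discarded list around makes it possible to check afterwards whether any removed reads actually belong to the target genome.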

Cluster Reads and Build Contigs

Build a graph for the reads
- De Bruijn Graph
  - Typically used for short reads.
- Overlap-Layout-Consensus
  - Generally used for longer reads.
  - Larger memory requirements.
Reads that don't quite fit can be error corrected.
Ideally use high-quality data (Sanger, 454). The recent trend is to use cheaper data.
Result: Contigs
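To make the De Bruijn graph idea concrete, here is a minimal sketch that builds one from a few toy reads: each K-mer becomes an edge from its first k-1 bases to its last k-1 bases. The function name `de_bruijn` is an assumption for this example, and the sketch omits reverse complements, error handling, and the graph traversal that would actually produce contigs.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a De Bruijn graph as {prefix (k-1)-mer: set of suffix (k-1)-mers}."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = de_bruijn(["ACGT", "CGTA"], k=3)
```

Because overlapping reads share K-mers, they collapse onto the same edges, so the graph size depends on the number of distinct K-mers rather than on the number of reads.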

Order and Orient Contigs

Iterative process:
- Use new data when available.
- Map reads to the draft to improve the draft version.
- Leftover reads are sent back for clustering.
Mate-pair data is useful for bridging contigs that are adjacent but do not overlap (due to missing data, repeat sequence, etc.).
Result: Scaffolds
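One way to picture how mate pairs bridge contigs is to count how many pairs have their two reads on different contigs. The sketch below does only that counting; the names `count_links`, `mate_pairs` and `read_to_contig` are hypothetical, and a real scaffolder would also use read orientation and insert size to order and orient the contigs.

```python
from collections import Counter


def count_links(mate_pairs, read_to_contig):
    """Count mate pairs whose two reads map to different contigs.

    Returns a Counter keyed by a sorted (contig, contig) tuple, so the
    same pair of contigs is counted together regardless of read order.
    """
    links = Counter()
    for left, right in mate_pairs:
        a = read_to_contig.get(left)
        b = read_to_contig.get(right)
        if a is not None and b is not None and a != b:
            links[tuple(sorted((a, b)))] += 1
    return links


read_to_contig = {"r1a": "c1", "r1b": "c2", "r2a": "c2", "r2b": "c1", "r3a": "c1", "r3b": "c1"}
mate_pairs = [("r1a", "r1b"), ("r2a", "r2b"), ("r3a", "r3b"), ("r4a", "r4b")]
links = count_links(mate_pairs, read_to_contig)
```

Contig pairs supported by many links are good candidates for being neighbours in a scaffold; a single link is weak evidence and could come from a chimeric or mismapped pair.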

Homework

Learn about the Jellyfish tool for K-mer counting. Try running it on the Pyrobaculum data. Use different parameters and monitor its memory usage. Fill in the wiki page for Jellyfish.

Banana Slug Genomics