Overview of Assembly

Kevin outlined the processes involved in assembling a genome:

  • Clean Up Reads
  • Cluster Reads and Build Contigs
  • Order and Orient Contigs

Clean Up Reads

Data cleanup has two distinct parts: error correction and contaminant removal.

Error Correction

  • May not be necessary for all types of data (e.g. Sanger or 454 reads).
  • Can be done before or after contig assembly.
    • Before: K-mer Counting
    • After: Map reads to consensus sequences from contigs.

K-mer Counting

  • Count the number of occurrences of each K-mer in the reads.
  • Remove reads that contain low-count K-mers, or correct the individual bases responsible for them.
  • K-mer size must be large enough that K-mers do not recur trivially often by chance, but small enough that the counts fit within memory constraints.
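
The counting step above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names, the toy reads, and the `min_count` threshold are hypothetical, and it ignores reverse complements and memory efficiency, both of which a dedicated K-mer counter has to handle.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every K-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_reads(reads, k, min_count):
    """Return reads containing at least one K-mer seen fewer than min_count times."""
    counts = count_kmers(reads, k)
    flagged = []
    for read in reads:
        kmers = (read[i:i + k] for i in range(len(read) - k + 1))
        if any(counts[kmer] < min_count for kmer in kmers):
            flagged.append(read)
    return flagged


# Toy data (hypothetical): three identical reads and one with a single-base error.
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC", "ACGTACCTAC"]
flagged = suspicious_reads(reads, k=4, min_count=2)
print(flagged)
```

The read carrying the error produces K-mers that occur only once, so it is the one flagged. With a very small k almost every K-mer would be common and nothing would stand out, while the number of possible K-mers grows as 4^k, which is where the memory constraint comes from.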

Contaminant Removal

  • Contamination can come from many sources:
    • Human (dust)
    • Bacterial
    • Viral (hard to remove)
  • Use BLAST to remove sequences that are unexpected.
    • Expensive to run.
    • Blast contigs instead of individual reads.
    • Strategies:
      • Look for specific contaminants (Human, E. coli).
      • Examine ribosomal sequences to identify possible contaminants.
      • Look for things we would not expect to see (e.g. eukaryotic sequence in prokaryotes or vice versa).
    • Once a contig is identified as contaminant:
      • Remove the contig and reads that map to it.
      • Rebuild the contigs.
  • We risk removing parts of the target genome that are very similar to the contaminant.
    • Example from class: nitrogen fixing genes common to two bacterial strains.
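
The "remove the contig and its reads" step can be sketched as follows. It assumes the BLAST search and the read mapping have already been done elsewhere; the data structures (`contigs`, `read_to_contig`, `contaminant_ids`) and all identifiers in the example are hypothetical.

```python
def remove_contaminants(contigs, read_to_contig, contaminant_ids):
    """Drop contaminant contigs and every read that maps to one of them.

    contigs:          dict of contig id -> sequence
    read_to_contig:   dict of read id -> contig id (None if the read is unmapped)
    contaminant_ids:  set of contig ids identified as contaminants
    """
    clean_contigs = {
        cid: seq for cid, seq in contigs.items() if cid not in contaminant_ids
    }
    clean_reads = [
        rid for rid, cid in read_to_contig.items() if cid not in contaminant_ids
    ]
    return clean_contigs, clean_reads


# Hypothetical example: contig c2 was identified as a contaminant.
contigs = {"c1": "ACGTACGT", "c2": "TTTTGGGG"}
read_to_contig = {"r1": "c1", "r2": "c2", "r3": "c1", "r4": None}
clean_contigs, clean_reads = remove_contaminants(contigs, read_to_contig, {"c2"})
print(clean_contigs, clean_reads)
```

The surviving reads would then be handed back to the assembler to rebuild the contigs. Unmapped reads are kept here, which is a design choice of this sketch rather than something stated in class.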

Cluster Reads and Build Contigs

  • Build a graph for the reads
    • De Bruijn Graph
      • Typically used for short reads.
    • Overlap-Layout-Consensus
      • Generally used for longer reads.
      • Larger memory requirements.
  • Reads that don't quite fit can be error corrected.
  • Ideally use high-quality data (Sanger, 454). The recent trend is to use cheaper data.
  • Result: Contigs
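
To make the De Bruijn graph idea concrete, here is a minimal sketch, assuming error-free reads on a single strand. Nodes are (k-1)-mers and each distinct K-mer contributes one edge. The function names and toy reads are hypothetical, and the path walk is deliberately naive: it only checks out-degree, so it is not a full contig-building algorithm.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a De Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def spell_path(graph, start):
    """Follow edges from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop if the path loops back on itself (a repeat)
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Two overlapping toy reads (hypothetical).
graph = de_bruijn(["ACGTT", "CGTTA"], k=3)
contig = spell_path(graph, "AC")
print(contig)
```

The two reads share the K-mers CGT and GTT, so their edges merge in the graph and the walk spells out a single sequence covering both reads.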

Order and Orient Contigs

  • Iterative process:
    • Use new data when available.
    • Map reads to the draft to improve the draft version.
    • Leftover reads are sent back for clustering.
  • Mate-pair data is useful for bridging contigs that are adjacent but do not overlap (due to missing data, repeat sequence, etc.)
  • Result: Scaffolds
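
The way mate pairs bridge contigs can be sketched as a simple link count: whenever the two reads of a pair map to different contigs, that pair is evidence that the contigs are neighbors. The sketch below assumes a precomputed read-to-contig mapping and uses a hypothetical `min_support` threshold. It ignores orientation and insert size, both of which a real scaffolder needs in order to order and orient the contigs.

```python
from collections import Counter


def contig_links(mate_pairs, read_to_contig, min_support=2):
    """Count mate pairs whose two reads map to different contigs."""
    support = Counter()
    for left, right in mate_pairs:
        a, b = read_to_contig.get(left), read_to_contig.get(right)
        if a is not None and b is not None and a != b:
            # Sort the pair so (c1, c2) and (c2, c1) count as the same link.
            support[tuple(sorted((a, b)))] += 1
    # Keep only links supported by enough independent mate pairs.
    return {pair: n for pair, n in support.items() if n >= min_support}


# Hypothetical mapping: two pairs bridge c1 and c2, one pair bridges c1 and c3.
mate_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
read_to_contig = {"r1": "c1", "r2": "c2", "r3": "c2", "r4": "c1", "r5": "c1", "r6": "c3"}
links = contig_links(mate_pairs, read_to_contig)
print(links)
```

Requiring more than one supporting pair is a simple guard against chimeric or mis-mapped pairs creating false joins.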


Learn about the Jellyfish tool for K-mer counting. Try running it on the Pyrobaculum data. Use different parameters and monitor its memory usage. Fill in the wiki page for Jellyfish.

lecture_notes/04-04-2011.txt · Last modified: 2015/07/28 15:12 by