lecture_notes:04-15-2015

Sequence Assembly and K-mer Analysis

Overlap, Layout, Consensus Assembly

  1. Compare every read against every other read and calculate the overlap between them. An overlap indicates that two reads might be from the same part of the genome
  2. Build an adjacency matrix recording which reads overlap each other - O(n^2) in the number of reads
  3. Find the connected components of the matrix
  4. Build a consensus sequence for each component to get contigs
  • Modern data is too big for this method to be practical (overlap step is quadratic in number of reads)
  • Still used for long read data
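The first three steps above can be sketched in a few lines. This is a minimal illustration, not a production assembler: it assumes error-free reads and exact suffix/prefix overlaps, it leaves out the consensus step, and the names `overlap_length`, `overlap_matrix`, `connected_components` and the `min_overlap` parameter are made up for this example.

```python
def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_overlap.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_matrix(reads, min_overlap=3):
    """All-against-all comparison: this is the O(n^2) step."""
    n = len(reads)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = overlap_length(reads[i], reads[j], min_overlap)
    return matrix


def connected_components(matrix):
    """Group read indices that are linked by an overlap in either direction."""
    n = len(matrix)
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for other in range(n):
                linked = matrix[node][other] or matrix[other][node]
                if other not in seen and linked:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components


# Toy reads: the first two overlap each other, and so do the last two.
reads = ["AACGTAC", "GTACGGA", "TTTTCCC", "CCCAAAT"]
matrix = overlap_matrix(reads)
components = connected_components(matrix)
```

Each component would then be laid out and collapsed into a consensus to give one contig. The nested loop in `overlap_matrix` is what makes the method impractical for very large read sets.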

Reference Guided Assembly

  • Align all reads to a reference sequence, then build the consensus from the aligned reads
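As a rough illustration of the alignment step, the sketch below places each read at the ungapped offset with the fewest mismatches. This brute-force scan is an assumption chosen for brevity; real aligners use indexes and allow gaps. The function name `best_alignment` and the toy sequences are hypothetical.

```python
def best_alignment(read, reference):
    """Return (offset, mismatches) for the ungapped placement with the fewest mismatches.

    Ties are resolved in favor of the leftmost offset.
    """
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for x, y in zip(read, window) if x != y)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


reference = "AACGTAGCTTGACC"
# The third read carries a single substitution relative to the reference.
reads = ["CGTAGC", "TTGACC", "GTAGGT"]
placements = [best_alignment(read, reference) for read in reads]
```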

de Bruijn Graph Assembly

  • Make k-mers the unit of assembly
  • Divide each sequence into k-mers of a given length and construct a graph in which each node contains one k-mer; a directed edge from one node to another means that the first k-mer can be extended into the second (they overlap by k-1 bases)
    • EX (k = 5): AACGT->ACGTA->CGTAG->GTAGC-> ...
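  • Construction work grows linearly with the number of reads (no all-against-all comparison)
  • Ceiling on graph size is the size of the genome (sequencing errors aside), since a repeated k-mer maps to the same node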
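  • There are 4^k distinct k-mers - we want 4^k to be much larger than the size of the genome so that k-mers from different parts of the genome rarely coincide by chance
    • EX: k = 20 gives 4^20 ≈ 1.1 × 10^12 possible k-mers; for the human genome (about 3 × 10^9 bases) there is roughly a 1/300 chance that a random k-mer is present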
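The graph construction and the 4^k arithmetic above can be sketched as follows. This is a minimal version using the node-per-k-mer formulation from these notes; it ignores reverse complements and sequencing errors, and the names `kmers`, `build_de_bruijn` and `random_kmer_hit_chance` are invented for this example. The human genome size of 3 × 10^9 bases is an approximation.

```python
def kmers(sequence, k):
    """All k-mers of a sequence, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each k-mer (node) to the set of k-mers that directly follow it in some read."""
    graph = {}
    for read in reads:
        read_kmers = kmers(read, k)
        for kmer in read_kmers:
            graph.setdefault(kmer, set())
        # Consecutive k-mers in a read overlap by k-1 bases, giving one edge each.
        for left, right in zip(read_kmers, read_kmers[1:]):
            graph[left].add(right)
    return graph


def random_kmer_hit_chance(genome_size, k):
    """Rough chance that a random k-mer occurs in the genome.

    Approximation: treats every genome position as a distinct k-mer.
    """
    return genome_size / 4 ** k


# Two overlapping reads share the nodes CGTAG and GTAGC, so the graph
# has 6 nodes rather than 8.
reads = ["AACGTAGC", "CGTAGCTT"]
graph = build_de_bruijn(reads, 5)

human_chance = random_kmer_hit_chance(3_000_000_000, 20)
```

Each read is processed once and shared k-mers collapse into a single node, which is why the work scales with the number of reads while the graph itself is bounded by the genome.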

Kmer Spectra

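  • How many times does each unique k-mer in the genome occur in the reads? The spectrum is the histogram of these counts
  • There is a peak at the point of average coverage; dividing the total number of k-mers in the reads by that coverage tells you approximately the genome size
  • K-mer spectra can be used for error correction: k-mers with counts far below the peak are likely to contain sequencing errors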
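A small sketch of how a spectrum might be computed and used. It assumes a toy genome with no repeated k-mers and uniform coverage; the function names and the `min_depth` cutoff for separating error k-mers from genomic ones are hypothetical choices for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(counts, min_depth=2):
    """Total k-mers at or above min_depth, divided by the peak depth."""
    spectrum = kmer_spectrum(counts)
    trusted = {depth: n for depth, n in spectrum.items() if depth >= min_depth}
    peak_depth = max(trusted, key=trusted.get)
    total_kmers = sum(depth * n for depth, n in trusted.items())
    return total_kmers / peak_depth


def likely_error_kmers(counts, min_depth=2):
    """K-mers seen fewer than min_depth times, which are candidates for errors."""
    return sorted(kmer for kmer, n in counts.items() if n < min_depth)


genome = "ACGTTGCAAGCTTAGGCTAA"  # toy genome, 16 distinct 5-mers
# Three clean copies (3x coverage) plus one read with a single substitution.
reads = [genome] * 3 + ["ACGTAGCAAG"]

counts = count_kmers(reads, 5)
spectrum = kmer_spectrum(counts)
size_estimate = estimate_genome_size(counts)
suspects = likely_error_kmers(counts)
```

Here the single substitution creates five k-mers that appear only once, well below the peak at 3, and the size estimate comes out close to the 16 k-mer positions in the toy genome.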
lecture_notes/04-15-2015.txt · Last modified: 2015/04/17 17:38 by almussel