lecture_notes:04-15-2015



Notes on Sequence Assembly and K-mer Analysis

There is an overarching scheme to assembling genomes from DNA sequences:

  1. Compare the sequences and calculate the overlap between each sequence and every other sequence. An overlap indicates that two reads might come from the same part of the genome.
  2. Create a layout of the overlapping reads, where each layout represents a contiguous part of the genome.
  3. Analyze the reads in each layout and call a consensus sequence.

The rate-limiting step in this process is calculating the overlaps. Because every sequence must be compared against every other sequence, the number of comparisons grows quadratically with the number of sequences in the data set.
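To illustrate why the overlap step dominates, here is a minimal sketch of an all-against-all overlap calculation. The function names, the `min_len` threshold, and the toy reads are assumptions made for illustration; real assemblers use indexing and tolerate sequencing errors, which this exact-match version does not.

```python
from itertools import combinations

def suffix_prefix_overlap(a, b, min_len=3):
    """Return the length of the longest suffix of `a` that exactly matches
    a prefix of `b`, or 0 if no match of at least `min_len` bases exists."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def all_overlaps(reads, min_len=3):
    """Compare every read with every other read, in both orientations.
    For n reads this visits n*(n-1)/2 unordered pairs."""
    overlaps = {}
    for i, j in combinations(range(len(reads)), 2):
        for x, y in ((i, j), (j, i)):
            n = suffix_prefix_overlap(reads[x], reads[y], min_len)
            if n:
                overlaps[(x, y)] = n
    return overlaps

# Hypothetical toy reads, chosen so that consecutive reads overlap by 4 bases.
reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
overlaps = all_overlaps(reads)
pair_count = len(list(combinations(range(len(reads)), 2)))
```

With these three reads, `overlaps` records that read 0 runs into read 1 and read 1 runs into read 2. Doubling the number of reads roughly quadruples the number of pairs the loop has to visit.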

Assembling genomes with a De Bruijn graph circumvents this problem: each read is broken into k-mers and added to the graph on its own, so the assembler can extend the genome without comparing every read against every other read. To assemble a genome with a De Bruijn graph, you must select a k-mer size such that the genome contains few or no repeated k-mers of that size.
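One way to see how the choice of k affects repeats is to count the k-mers of a sequence and list those that occur more than once. The sketch below does this for a made-up sequence; the helper names and the example string are illustrative assumptions. In practice the genome is unknown and the counts come from the reads.

```python
from collections import Counter

def kmers(seq, k):
    """Return every k-mer of `seq`, in order, sliding one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def repeated_kmers(seq, k):
    """Return a dict of the k-mers that occur more than once, with their counts."""
    counts = Counter(kmers(seq, k))
    return {kmer: n for kmer, n in counts.items() if n > 1}

# Hypothetical toy sequence containing the repeat "ATGC".
genome = "ATGCATGCTTAG"
repeats_by_k = {k: repeated_kmers(genome, k) for k in (3, 4, 5)}
```

For this toy sequence, k = 3 and k = 4 still produce repeated k-mers, while at k = 5 every k-mer is unique.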

The graph is built by dividing each sequence into k-mers of a given length and creating one node for each distinct k-mer. A directed edge from one node to another means that the first k-mer can be extended into the second: the last k-1 bases of the first k-mer match the first k-1 bases of the second.
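The construction described above can be sketched as follows, using the convention from these notes that nodes are k-mers (many references instead use (k-1)-mers as nodes and k-mers as edges). The function names and toy reads are assumptions for illustration. The walk handles only the simplest case, a single unbranched path, and ignores sequencing errors and reverse complements.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read.
    Consecutive k-mers in a read overlap by k-1 bases."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            src = read[i:i + k]
            dst = read[i + 1:i + k + 1]
            graph[src].add(dst)
            graph[dst]  # touch dst so it exists as a node even with no successors
    return graph

def walk_unbranched(graph):
    """Start at a node with no incoming edge and extend one base at a time
    while the current node has exactly one successor."""
    has_incoming = {dst for targets in graph.values() for dst in targets}
    starts = [node for node in graph if node not in has_incoming]
    if not starts:
        return ""  # every node has a predecessor (e.g. a cycle): no clear start
    node = starts[0]
    contig = node
    seen = {node}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop around a repeat
            break
        contig += nxt[-1]  # the successor contributes one new base
        seen.add(nxt)
        node = nxt
    return contig

# Hypothetical toy reads; no read is ever compared with another read.
reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
graph = build_graph(reads, k=4)
contig = walk_unbranched(graph)
```

Each read is processed independently, and reads that share a k-mer are joined automatically because they contribute edges to the same node. When a k-mer is repeated in the genome, its node gains more than one successor and this simple walk stops there, which is why the choice of k matters.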

lecture_notes/04-15-2015.1429137088.txt.gz · Last modified: 2015/04/15 15:31 by chkcole