====== Notes on Sequence Assembly and K-mer Analysis ======

There is an overarching scheme to assembling genomes from DNA sequences:

  - Compare the sequences and calculate the overlap between every pair of reads. An overlap indicates that two reads might come from the same part of the genome.
  - Create a layout of the overlapping reads, in which each chain of overlaps represents a contiguous part of the genome.
  - Analyze the reads in each layout and call a consensus sequence.

The rate-limiting step for this process is calculating the overlaps: every read must be compared with every other read, so the processing time grows quadratically (not linearly) with the number of sequences in the data set. Assembling genomes with a De Bruijn graph circumvents this problem by allowing the assembler to extend the genome one k-mer at a time, without comparing each read against every other read in the data.

In order to assemble the genome with a De Bruijn graph, you must select a k-mer size such that the genome being assembled contains few or no repeats when divided into k-mers of that size.
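
The k-mer size criterion can be illustrated with a small sketch. The code below is not taken from any particular assembler: the function names and the toy sequence are made up for illustration, it works on a single known sequence rather than on reads, and it ignores reverse complements and sequencing errors. It lists the k-mers that repeat for a given k, finds the smallest k at which none repeat, and builds a minimal De Bruijn graph in which each k-mer links its (k-1)-base prefix to its (k-1)-base suffix.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def repeated_kmers(seq, k):
    """Return the k-mers that occur more than once in seq, with their counts."""
    counts = Counter(kmers(seq, k))
    return {kmer: n for kmer, n in counts.items() if n > 1}


def smallest_repeat_free_k(seq, k_min=2):
    """Return the smallest k >= k_min at which no k-mer of seq repeats.

    Returns None if seq is shorter than k_min.
    """
    for k in range(k_min, len(seq) + 1):
        if not repeated_kmers(seq, k):
            return k
    return None


def de_bruijn_graph(seq, k):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes following it."""
    graph = defaultdict(list)
    for kmer in kmers(seq, k):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    toy_genome = "ATGCGATGCA"  # hypothetical sequence, for illustration only
    for k in (3, 4, 5):
        print(k, repeated_kmers(toy_genome, k))
    print("smallest repeat-free k:", smallest_repeat_free_k(toy_genome))
    print(de_bruijn_graph(toy_genome, 5))
```

In the toy sequence the k-mers ''ATG'' and ''TGC'' each occur twice at k=3, so the graph would branch there; at k=5 every k-mer is unique and the graph is a single unbranched path. A real data set would need the same check run on k-mers counted from the reads, since the genome itself is not known in advance.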