lecture_notes:04-15-2015
Sequence Assembly and K-mer Analysis
Overlap, Layout, Consensus Assembly
Compare every read against every other read and calculate the overlap between each pair. An overlap indicates that two reads might be from the same part of the genome
Build an adjacency matrix of which reads overlap each other - O(n^2) comparisons for n reads
Find connected components of the overlap graph described by the matrix
Build a consensus sequence for each component to get contigs
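The first three steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not from the lecture: overlaps are exact suffix/prefix matches (no sequencing errors, no reverse complements), and the names overlap_length, overlap_matrix, connected_components and the min_len threshold are hypothetical. The consensus step is left out because it needs a multiple alignment of the reads in each component.

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_matrix(reads, min_len=3):
    """All-against-all comparison: adj[i][j] is True if read i overlaps read j."""
    n = len(reads)
    adj = [[False] * n for _ in range(n)]
    for i in range(n):          # O(n^2) pairs of reads
        for j in range(n):
            if i != j and overlap_length(reads[i], reads[j], min_len) > 0:
                adj[i][j] = True
    return adj


def connected_components(adj):
    """Group read indices that are linked by overlaps in either direction."""
    n = len(adj)
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if (adj[i][j] or adj[j][i]) and j not in seen:
                    seen.add(j)
                    stack.append(j)
        components.append(sorted(component))
    return components


reads = ["ACGTAC", "TACGGA", "GGATTC", "CCCCAAAA", "AAAATTTT"]
components = connected_components(overlap_matrix(reads))
print(components)
```

Each component is a candidate contig: here the first three reads chain together through 3-base overlaps and the last two share a 4-base overlap, so two groups come out.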
Reference Guided Assembly
de Bruijn Graph Assembly
Make kmers the unit of assembly
Divide each sequence into k-mers of a given length and construct nodes such that each node contains a k-mer. A directed edge from one node to another means that the first k-mer can be extended into the second, i.e. the last k-1 bases of the first are the first k-1 bases of the second
AACGT->ACGTA->CGTAG->GTAGC-> ...
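Work to build the graph grows linearly with the number of reads, rather than O(n^2)
Ceiling on graph size is the size of the genome, since error-free reads contain no k-mers beyond those in the genome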
There are 4^k possible kmers of length k - we want 4^k to be way larger than the size of the genome, so that kmers rarely repeat by chance and spurious overlaps are limited
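A minimal sketch of the construction described above, with k-mers as nodes and an edge wherever one k-mer is followed by the next in a read. The function names kmers, build_graph and walk are hypothetical, and the sketch assumes error-free reads from a single strand. It reproduces the AACGT->ACGTA->CGTAG->GTAGC chain from the notes.

```python
def kmers(seq, k):
    """All k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Map each k-mer (node) to the set of k-mers that follow it in some read.

    Consecutive k-mers in a read overlap by k-1 bases, so each edge means
    "this k-mer can be extended by one base into that k-mer".
    """
    graph = {}
    for read in reads:            # one pass over the reads: linear work
        ks = kmers(read, k)
        for km in ks:
            graph.setdefault(km, set())
        for u, v in zip(ks, ks[1:]):
            graph[u].add(v)
    return graph


def walk(graph, start):
    """Follow edges while the path is unambiguous and spell out the sequence."""
    seq = start
    node = start
    visited = {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in visited:        # stop instead of looping around a cycle
            break
        seq += nxt[-1]            # each step adds exactly one new base
        visited.add(nxt)
        node = nxt
    return seq


reads = ["AACGTAG", "CGTAGC"]
graph = build_graph(reads, 5)
contig = walk(graph, "AACGT")
print(contig)

# Sequencing the same region more deeply adds no new nodes.
deeper = build_graph(reads * 3, 5)
print(len(graph), len(deeper))
```

With k = 5 there are 4^5 = 1024 possible k-mers, which is plenty for this toy example. A real genome needs a much larger k for 4^k to exceed the genome size by a wide margin.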
Kmer Spectra
How many times does each unique kmer in the genome occur in the reads?
There is a peak at the point of average coverage. Dividing the total number of kmers in the reads by the coverage at the peak tells you approximately the genome size
Kmer spectra can be used for error correction: kmers that occur far less often than the coverage peak are likely to contain sequencing errors
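The points above can be sketched as follows. The function names and the min_count cutoff are assumptions made for illustration, not something given in the lecture, and the toy "genome" is made up. The estimate counts distinct genomic k-mers, which for a genome of length G is about G - k + 1.

```python
from collections import Counter


def count_kmers(reads, k):
    """How many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(counts, min_count=2):
    """Return (peak coverage, total solid k-mers // peak coverage).

    k-mers seen fewer than min_count times are treated as likely errors
    and ignored (min_count is a hypothetical cutoff).
    """
    solid = {m: n for m, n in spectrum(counts).items() if m >= min_count}
    peak = max(solid, key=solid.get)
    total = sum(m * n for m, n in solid.items())
    return peak, total // peak


def likely_errors(counts, min_count=2):
    """k-mers below the cutoff, candidates for error correction."""
    return sorted(km for km, c in counts.items() if c < min_count)


genome = "ACGTTGCAAGGCTA"                      # made-up toy sequence
clean_counts = count_kmers([genome] * 5, 4)    # 5x coverage, no errors
noisy_counts = count_kmers([genome] * 5 + ["ACGTTGCATGGCTA"], 4)  # one read with an A->T error

print(estimate_genome_size(clean_counts))
print(dict(spectrum(noisy_counts)))
print(likely_errors(noisy_counts))
```

In the noisy case the single-base error creates four k-mers that each appear once, well below the coverage peak, which is the signal that spectrum-based error correction relies on.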
lecture_notes/04-15-2015.txt · Last modified: 2015/04/17 17:38 by almussel