lecture_notes:04-15-2015

====== Sequence Assembly and K-mer Analysis ======

===Overlap, Layout, Consensus Assembly===
  - Compare the reads and calculate the overlap between every pair of sequences. An overlap indicates that two reads might be from the same part of the genome.
  - Build an adjacency matrix recording which reads overlap each other. This step is O(n^2) in the number of reads.
  - Find the connected components of the matrix.
  - Build a consensus sequence for each component to get contigs.

  * Modern data sets are too big for this method to be practical (the overlap step is quadratic in the number of reads).
  * Still used for long-read data.

===Reference Guided Assembly===
  * Align all reads to the reference sequence.

===de Bruijn Graph Assembly===
  * Make k-mers the unit of assembly.
  * Divide each sequence into k-mers of a given length and construct nodes such that each node contains one k-mer. A directed edge from one node to another means that the first k-mer can be extended into the second: the two overlap by k-1 bases.
<code>
AACGT->ACGTA->CGTAG->GTAGC-> ...
</code>
  * The work grows linearly with the number of reads, rather than quadratically.
  * The ceiling on graph size is the size of the genome (ignoring extra k-mers created by sequencing errors).
  * There are 4^k possible k-mers. We want 4^k to be much larger than the size of the genome, so that unrelated parts of the genome rarely share a k-mer by chance.
  * Example: with k = 20 there are 4^20 (about 1.1 x 10^12) possible k-mers, so for the human genome (about 3 x 10^9 bases) there is roughly a 1/300 chance that a random k-mer is present.

===Kmer Spectra===
  * How many times does each unique k-mer in the genome occur in the reads? The k-mer spectrum is the histogram of these counts.
  * There is a peak at the point of average coverage. Dividing the total number of k-mers in the reads by that coverage gives the approximate genome size.
  * K-mer spectra can be used for error correction: k-mers with counts far below the peak are likely to come from sequencing errors.
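The de Bruijn construction and the k-mer spectrum described above can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular assembler. The function names (''kmers'', ''de_bruijn_edges'', ''kmer_spectrum'', ''estimate_genome_size'') and the ''min_mult'' cutoff are made up for this example. It also assumes reads come from a single strand, so reverse complements are ignored. The genome size estimate uses the rule of thumb that total k-mers divided by the peak coverage is roughly the number of k-mer positions in the genome.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def de_bruijn_edges(reads, k):
    """Return the set of directed edges (kmer, next_kmer).

    Nodes are k-mers; an edge joins two k-mers that are adjacent in
    some read, so they overlap by k-1 bases.
    """
    edges = set()
    for read in reads:
        ks = list(kmers(read, k))
        edges.update(zip(ks, ks[1:]))
    return edges


def kmer_spectrum(reads, k):
    """Return (counts, spectrum).

    counts maps each k-mer to its number of occurrences in the reads;
    spectrum maps a multiplicity to the number of distinct k-mers
    seen that many times.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    spectrum = Counter(counts.values())
    return counts, spectrum


def estimate_genome_size(counts, spectrum, min_mult=2):
    """Estimate the number of k-mer positions in the genome.

    The peak is taken as the multiplicity shared by the most distinct
    k-mers, skipping multiplicities below min_mult (a hypothetical
    cutoff for error k-mers). Raises ValueError if no multiplicity
    reaches min_mult.
    """
    peak = max((m for m in spectrum if m >= min_mult),
               key=lambda m: spectrum[m])
    return sum(counts.values()) / peak


if __name__ == "__main__":
    # Toy data: the same 8-base "genome" sequenced three times, no errors.
    reads = ["AACGTAGC"] * 3
    counts, spectrum = kmer_spectrum(reads, k=5)
    print(sorted(de_bruijn_edges(reads, k=5)))
    print(dict(spectrum))
    print(estimate_genome_size(counts, spectrum))
```

On real data the counting pass is a single sweep over the reads, which is where the linear scaling comes from. The set of distinct k-mers stays bounded by the genome size plus whatever extra k-mers sequencing errors introduce.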

lecture_notes/04-15-2015.txt · Last modified: 2015/04/17 17:38 by almussel