lecture_notes:04-15-2015

- Compare every read with every other read and calculate the overlap between each pair. An overlap indicates that two reads might be from the same part of the genome
- Build an adjacency matrix of which reads overlap each other - O(n^2)
- Find the connected components of the graph described by the matrix
- Build a consensus sequence for each component to get contigs
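
The first three steps above can be sketched roughly as follows. This is only an illustration, assuming error-free reads and exact suffix/prefix matches; the names overlap_length, overlap_components and min_len are made up for this sketch, and the consensus step is left out.

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best match is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_components(reads, min_len=3):
    """Group read indices into connected components of the overlap graph."""
    n = len(reads)
    # Adjacency matrix: adj[i][j] is True when reads i and j overlap
    # in either direction. Filling it takes O(n^2) comparisons.
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if (overlap_length(reads[i], reads[j], min_len)
                    or overlap_length(reads[j], reads[i], min_len)):
                adj[i][j] = adj[j][i] = True

    # Connected components by depth-first search over the matrix.
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in range(n):
                if adj[u][v] and v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(component))
    return components


reads = ["ACGTAC", "TACGGA", "GGATTT", "CCCCCC"]
print(overlap_components(reads))
```

The nested loop over all read pairs is the quadratic step; each component would then be handed to a consensus builder to produce one contig.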

- Modern data is too big for this method to be practical (overlap step is quadratic in number of reads)
- Still used for long read data

- Align all reads to the reference sequence (reference-guided assembly, when a reference is available)

- Make kmers the unit of assembly
- Divide each sequence into k-mers of a given length and construct a graph in which each node contains a k-mer, and a directed edge from one node to another means that the first k-mer can be extended into the second (they overlap by k-1 bases)

AACGT->ACGTA->CGTAG->GTAGC-> ...
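
A minimal sketch of building such a graph, assuming error-free reads and ignoring reverse complements. The function names kmers and build_kmer_graph are illustrative, not taken from any particular assembler.

```python
def kmers(seq, k):
    """All k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly extend it."""
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        # Every k-mer becomes a node, even if it has no outgoing edge.
        for km in ks:
            graph.setdefault(km, set())
        # Consecutive k-mers in a read overlap by k-1 bases: add an edge.
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


graph = build_kmer_graph(["AACGTAG", "CGTAGC"], k=5)
for node, successors in graph.items():
    print(node, "->", sorted(successors))
```

Both reads contain CGTAG, so they share that node and the graph has only four nodes. Each read is processed once, with no read-against-read comparison.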

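- The work to build the graph grows linearly with the number of reads (no all-against-all comparison)
- The ceiling on graph size is the size of the genome: reads that cover the same region add no new k-mers (ignoring sequencing errors)
- There are 4^k possible kmers. We want 4^k to be far larger than the size of the genome, to limit chance overlaps between unrelated parts of the genome
- EX: k = 20 gives 4^20 (about 1.1 x 10^12) possible kmers, so for the human genome (about 3 x 10^9 bases) there is roughly a 1/300 chance that a random kmer is there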
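
The arithmetic behind that example can be checked with a few lines. This is a back-of-the-envelope sketch: the genome length of 3 billion bases is an assumed round figure, only one strand is counted, and the helper name random_kmer_hit_fraction is made up here.

```python
def random_kmer_hit_fraction(k, genome_size):
    """Upper bound on the fraction of all possible k-mers present in a genome.

    A genome of genome_size bases holds at most about genome_size distinct
    k-mers, out of 4**k possible ones.
    """
    return min(1.0, genome_size / 4 ** k)


HUMAN_GENOME_BASES = 3_000_000_000  # assumed round figure

fraction = random_kmer_hit_fraction(20, HUMAN_GENOME_BASES)
print(f"possible 20-mers: {4 ** 20:,}")
print(f"about 1 in {1 / fraction:.0f} random 20-mers would be in the genome")
```

The result is a few hundred, the same order of magnitude as the 1/300 quoted in the notes.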

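- kmer spectrum: how many times does each unique kmer in the genome occur in the reads? (a histogram of kmer counts)
- There is a peak at the point of average coverage; dividing the total number of kmers in the reads by that coverage tells you approximately the genome size
- kmer spectra can be used for error correction (kmers seen far fewer times than the peak are likely sequencing errors)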
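
A toy sketch of a kmer spectrum and the genome size estimate it gives. The function names and the min_multiplicity cutoff for discarding likely error kmers are assumptions made for this illustration; real data would need a more careful choice of peak.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(counts, min_multiplicity=2):
    """Total k-mers divided by the peak multiplicity of the spectrum.

    K-mers seen fewer than min_multiplicity times are treated as likely
    errors and ignored (an assumed, simplistic cutoff).
    """
    spectrum = kmer_spectrum(counts)
    solid = {m: n for m, n in spectrum.items() if m >= min_multiplicity}
    peak = max(solid, key=solid.get)           # most common multiplicity
    total = sum(m * n for m, n in solid.items())
    return total / peak


genome = "ACGTTGCAAT"                           # toy genome with 7 distinct 4-mers
reads = [genome] * 3 + ["ACGA"]                 # 3x coverage plus one erroneous read
counts = kmer_counts(reads, k=4)
print(dict(kmer_spectrum(counts)))
print(estimate_genome_size(counts))
```

In this toy case the spectrum has its peak at multiplicity 3 (the coverage), the single kmer seen once is the erroneous one, and the estimate equals the number of distinct kmers in the toy genome.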


lecture_notes/04-15-2015.txt · Last modified: 2015/04/17 17:38 by almussel