This shows you the differences between two versions of the page.
| Both sides previous revision Previous revision | |||
|
lecture_notes:04-15-2015 [2015/04/15 22:31] chkcole |
lecture_notes:04-15-2015 [2015/04/17 17:38] (current) almussel |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ====== Notes on Sequence Assembly and K-mer Analysis ====== | + | ====== Sequence Assembly and K-mer Analysis ====== |
| - | There is an overarching scheme to assembling genomes from DNA sequences: | + | ===Overlap, Layout, Consensus Assembly=== |
| - Compare sequences and calculate the overlap between each sequence and each other sequence. Overlap indicates that two reads might e from the same part of the genome | - Compare sequences and calculate the overlap between each sequence and each other sequence. Overlap indicates that two reads might e from the same part of the genome | ||
| - | - Create layout of read overlap which represent contiguous part of the genome | + | - Build an adjacency matrix of which reads overlap each other - O(n^2) |
| - | - Analyze each read and call a consensus sequence. | + | - find connected components of the matrix |
| - | The rate-limiting step for this process is calculating the overlap between each sequence because the process time increases exponentially with the number of sequences in the data set. | + | - build consensus sequences of each component to get contigs |
| - | Assembling genomes with a De Bruijn graph circumvents this problem by allowing the assembler to extend the genome independently of any other sequence. In order to assemble the genome with a De Bruijn graph, you must select a k-mer size such that the genome being assembled contains few or no repeats when divided into k-mers of that size. | + | * Modern data is too big for this method to be practical (overlap step is quadratic in number of reads) |
| + | * Still used for long read data | ||
| - | The graph is built by dividing each sequence into k-mers of a given length and constructing nodes such that each node contains a k-mer, and a directed edge from one node to another means that one k-mer can be extended into another k-mer. | + | ===Reference Guided Assembly=== |
| + | * align all reads to the reference sequence | ||
| + | |||
| + | ===de Bruijn Graph Assembly=== | ||
| + | * Make kmers the unit of assembly | ||
| + | * Divides each sequence into k-mers of a given length and constructing nodes such that each node contains a k-mer, and a directed edge from one node to another means that one k-mer can be extended into another k-mer | ||
| + | <code> | ||
| + | AACGT->ACGTA->CGTAG->GTAGC-> ... | ||
| + | </code> | ||
| + | * Grows linearly with number of reads | ||
| + | * Ceiling on graph size is the size of the genome | ||
| + | * There are 4^k distinct kmers - we want way more kmers than the size of the genome to limit the overlaps | ||
| + | * EX: k = 20, for the human genome there is a 1/300 chance a random kmer is there | ||
| + | |||
| + | ===Kmer Spectra=== | ||
| + | * How many times does each unique kmer in the genome occur in the reads? | ||
| + | * There is a peak at the point of average coverage, which tells you approximately the genome size | ||
| + | * kmer spectra can be used for error correction | ||