This shows you the differences between two versions of the page.
Both sides previous revision Previous revision | |||
lecture_notes:04-15-2015 [2015/04/15 22:31] chkcole |
lecture_notes:04-15-2015 [2015/04/17 17:38] (current) almussel |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Notes on Sequence Assembly and K-mer Analysis ====== | + | ====== Sequence Assembly and K-mer Analysis ====== |
- | There is an overarching scheme to assembling genomes from DNA sequences: | + | ===Overlap, Layout, Consensus Assembly=== |
- Compare sequences and calculate the overlap between each sequence and each other sequence. Overlap indicates that two reads might e from the same part of the genome | - Compare sequences and calculate the overlap between each sequence and each other sequence. Overlap indicates that two reads might e from the same part of the genome | ||
- | - Create layout of read overlap which represent contiguous part of the genome | + | - Build an adjacency matrix of which reads overlap each other - O(n^2) |
- | - Analyze each read and call a consensus sequence. | + | - find connected components of the matrix |
- | The rate-limiting step for this process is calculating the overlap between each sequence because the process time increases exponentially with the number of sequences in the data set. | + | - build consensus sequences of each component to get contigs |
- | Assembling genomes with a De Bruijn graph circumvents this problem by allowing the assembler to extend the genome independently of any other sequence. In order to assemble the genome with a De Bruijn graph, you must select a k-mer size such that the genome being assembled contains few or no repeats when divided into k-mers of that size. | + | * Modern data is too big for this method to be practical (overlap step is quadratic in number of reads) |
+ | * Still used for long read data | ||
- | The graph is built by dividing each sequence into k-mers of a given length and constructing nodes such that each node contains a k-mer, and a directed edge from one node to another means that one k-mer can be extended into another k-mer. | + | ===Reference Guided Assembly=== |
+ | * align all reads to the reference sequence | ||
+ | |||
+ | ===de Bruijn Graph Assembly=== | ||
+ | * Make kmers the unit of assembly | ||
+ | * Divides each sequence into k-mers of a given length and constructing nodes such that each node contains a k-mer, and a directed edge from one node to another means that one k-mer can be extended into another k-mer | ||
+ | <code> | ||
+ | AACGT->ACGTA->CGTAG->GTAGC-> ... | ||
+ | </code> | ||
+ | * Grows linearly with number of reads | ||
+ | * Ceiling on graph size is the size of the genome | ||
+ | * There are 4^k distinct kmers - we want way more kmers than the size of the genome to limit the overlaps | ||
+ | * EX: k = 20, for the human genome there is a 1/300 chance a random kmer is there | ||
+ | |||
+ | ===Kmer Spectra=== | ||
+ | * How many times does each unique kmer in the genome occur in the reads? | ||
+ | * There is a peak at the point of average coverage, which tells you approximately the genome size | ||
+ | * kmer spectra can be used for error correction |