# Banana Slug Genomics

# Sequence Assembly and K-mer Analysis

#### Overlap, Layout, Consensus Assembly

1. Compare every read against every other read and calculate the overlap between them. An overlap indicates that two reads might come from the same part of the genome
2. Build an adjacency matrix recording which reads overlap each other; this takes O(n^2) comparisons for n reads
3. Find the connected components of the matrix
4. Build a consensus sequence for each component to get contigs
• Modern data is too big for this method to be practical (overlap step is quadratic in number of reads)
• Still used for long read data
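
Steps 1 to 3 above can be sketched in a few lines of Python. This is only an illustration on toy reads, not how a production assembler works: the function names (`overlap_length`, `overlap_graph`, `connected_components`) and the `min_len` threshold are made up for this example, overlaps are exact suffix/prefix matches with no error tolerance, and the consensus step is left out.

```python
from itertools import permutations


def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """All-against-all comparison; this is the quadratic step."""
    adj = {i: set() for i in range(len(reads))}
    for i, j in permutations(range(len(reads)), 2):
        if overlap_length(reads[i], reads[j], min_len) > 0:
            # Direction does not matter for finding components,
            # so store the overlap as an undirected edge.
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Group read indices that are linked by a chain of overlaps."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


# Toy reads (hypothetical data): two overlapping groups and one isolated read.
reads = ["ACGTAC", "TACGGA", "GGATTC", "TTTTTT", "CCCAAA", "AAAGGG"]
adj = overlap_graph(reads)
components = connected_components(adj)
print(components)
```

Each component would then be laid out and collapsed into a contig. The nested comparison in `overlap_graph` is what makes the method impractical for large short-read data sets.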

#### Reference Guided Assembly

• Align all reads to the reference sequence

#### de Bruijn Graph Assembly

• Make kmers the unit of assembly
• Divide each sequence into k-mers of a given length and construct a graph in which each node contains a k-mer, and a directed edge from one node to another means that the first k-mer can be extended into the second (they overlap by k-1 bases)
`AACGT->ACGTA->CGTAG->GTAGC-> ...`
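• Work grows linearly with the number of reads, since each read is split into k-mers once
• Ceiling on graph size is the size of the genome (ignoring k-mers created by sequencing errors)
• There are 4^k possible k-mers. We want 4^k to be much larger than the size of the genome, so that the same k-mer rarely occurs at unrelated positions by chance
• EX: k = 20 gives 4^20 ≈ 1.1 × 10^12 possible k-mers; the human genome has about 3 × 10^9 positions, so there is roughly a 1/300 chance that a random 20-mer occurs in it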
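
The construction described above can be sketched as follows. This is a minimal illustration, assuming error-free reads and a single strand; the helper names (`kmers`, `de_bruijn`, `walk_unbranched`) and the two toy reads are invented for this example, chosen so that the graph reproduces the `AACGT->ACGTA->CGTAG->GTAGC` path shown in the notes.

```python
def kmers(seq, k):
    """All k-mers of `seq`, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(reads, k):
    """Nodes are k-mers; an edge links consecutive k-mers within a read."""
    graph = {}
    for read in reads:
        nodes = kmers(read, k)
        for node in nodes:
            graph.setdefault(node, set())  # register nodes with no outgoing edge too
        for left, right in zip(nodes, nodes[1:]):
            graph[left].add(right)
    return graph


def walk_unbranched(graph, start):
    """Follow edges from `start` while there is exactly one way forward."""
    path = [start]
    node = start
    while len(graph[node]) == 1:
        node = next(iter(graph[node]))
        if node in path:  # stop on a cycle
            break
        path.append(node)
    return path


# Toy reads (hypothetical data) that share the k-mer CGTAG.
reads = ["AACGTAG", "CGTAGC"]
graph = de_bruijn(reads, k=5)

path = walk_unbranched(graph, "AACGT")
# Spell the contig: first k-mer plus the last base of every following k-mer.
contig = path[0] + "".join(node[-1] for node in path[1:])
print("->".join(path))
print(contig)
```

Because identical k-mers from different reads collapse into one node, adding more reads of the same region adds work but no new nodes, which is why the graph size is bounded by the genome rather than by the number of reads.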

#### Kmer Spectra

• How many times does each unique kmer in the genome occur in the reads?
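• There is a peak at the point of average coverage. Dividing the total number of k-mers in the reads by that coverage tells you approximately the genome size
• k-mer spectra can be used for error correction: k-mers seen only once or twice, far below the peak, most likely contain sequencing errors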
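
A k-mer spectrum can be computed with a couple of counters. The sketch below is a toy illustration under stated assumptions: the 12-base circular "genome", the read length, the `min_depth` cutoff, and the function names are all hypothetical, and the genome-size estimate uses the common rule of thumb of total k-mers divided by peak depth, which the lecture notes only imply.

```python
from collections import Counter


def count_kmers(reads, k):
    """How many times each distinct k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(counts, min_depth=2):
    """Total trusted k-mers divided by the depth at the spectrum peak."""
    spectrum = kmer_spectrum(counts)
    trusted = {depth: n for depth, n in spectrum.items() if depth >= min_depth}
    peak_depth = max(trusted, key=trusted.get)
    total = sum(c for c in counts.values() if c >= min_depth)
    return total / peak_depth


# Toy circular genome (hypothetical) whose 4-mers are all distinct.
genome = "AACCGGTTACGT"
k, read_len = 4, 6

# One error-free read starting at every position, wrapping around the circle.
doubled = genome * 2
reads = [doubled[i:i + read_len] for i in range(len(genome))]
# One extra read with a single substitution error (C -> T at the fourth base).
reads.append("AACTGG")

counts = count_kmers(reads, k)
spectrum = kmer_spectrum(counts)
estimate = estimate_genome_size(counts)
suspect = sorted(kmer for kmer, c in counts.items() if c < 2)

print(dict(spectrum))
print(estimate)
print(suspect)
```

In this toy data every true k-mer is seen three times, so the peak sits at depth 3, while the k-mers introduced by the erroneous read appear once and fall below the cutoff. Real spectra are much noisier, and the cutoff between error k-mers and genuine ones has to be chosen from the shape of the histogram.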
lecture_notes/04-15-2015.txt · Last modified: 2015/04/17 10:38 by almussel