Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-15-2015 [2015/04/15 21:05]
chkcole
+++ lecture_notes:04-15-2015 [2015/04/17 17:38] (current)
almussel
@@ Line 1: / Line 1: @@
-====== Notes on Sequence Assembly and K-mer Analysis ======
+====== Sequence Assembly and K-mer Analysis ======
-There is an overarching scheme to assembling genomes from DNA sequences:
+===Overlap, Layout, Consensus Assembly===
   - Compare sequences and calculate the overlap between each sequence and each other sequence. Overlap indicates that two reads might e from the same part of the genome
-  - Create layout of read overlap which represent contiguous part of the genome
+  - Build an adjacency matrix of which reads overlap each other - O(n^2)
-  - Analyze each read and call a consensus sequence.
+  - find connected components of the matrix
-The rate-limiting step for this process is calculating the overlap between each sequence because the process time increases exponentially with the number of sequences in the data set.
+  - build consensus sequences of each component to get contigs
-Assembling genomes with a De Bruijn graph circumvents this problem by allowing the assembler to extend the genome independently of any other sequence. In order to assemble the genome with a De Bruijn graph, you must select a k-mer size such that the genome being assembled contains few or no repeats when divided into k-mers of that size.
+    * Modern data is too big for this method to be practical (overlap step is quadratic in number of reads)
+    * Still used for long read data
+===Reference Guided Assembly===
+    * align all reads to the reference sequence
+===de Bruijn Graph Assembly===
+    * Make kmers the unit of assembly
+    * Divides each sequence into k-mers of a given length and constructing nodes such that each node contains a k-mer, and a directed edge from one node to another means that one k-mer can be extended into another k-mer
+<code>
+AACGT->ACGTA->CGTAG->GTAGC-> ...
+</code>
+    * Grows linearly with number of reads
+    * Ceiling on graph size is the size of the genome
+    * There are 4^k distinct kmers - we want way more kmers than the size of the genome to limit the overlaps
+        * EX: k = 20, for the human genome there is a 1/300 chance a random kmer is there
+===Kmer Spectra===
+    * How many times does each unique kmer in the genome occur in the reads?
+    * There is a peak at the point of average coverage, which tells you approximately the genome size
+    * kmer spectra can be used for error correction

Banana Slug Genomics

User Tools

Site Tools

Differences

Page Tools