User Tools

Site Tools


lecture_notes:04-15-2015

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
lecture_notes:04-15-2015 [2015/04/15 14:05]
chkcole
lecture_notes:04-15-2015 [2015/04/17 10:38]
almussel
Line 1: Line 1:
-====== ​Notes on Sequence Assembly and K-mer Analysis ======+====== Sequence Assembly and K-mer Analysis ======
  
-There is an overarching scheme to assembling genomes from DNA sequences:+===Overlap, Layout, Consensus Assembly===
   - Compare sequences and calculate the overlap between each sequence and each other sequence. Overlap indicates that two reads might e from the same part of the genome   - Compare sequences and calculate the overlap between each sequence and each other sequence. Overlap indicates that two reads might e from the same part of the genome
-  - Create layout ​of read overlap ​which represent contiguous part of the genome +  - Build an adjacency matrix ​of which reads overlap ​each other - O(n^2) 
-  - Analyze each read and call a consensus ​sequence. +  - find connected components ​of the matrix 
-The rate-limiting step for this process is calculating the overlap between ​each sequence because the process time increases exponentially with the number of sequences in the data set.+  - build consensus ​sequences of each component to get contigs
  
-Assembling genomes with a De Bruijn graph circumvents ​this problem by allowing the assembler ​to extend the genome independently ​of any other sequence. In order to assemble ​the genome with a De Bruijn ​graphyou must select ​a k-mer size such that the genome ​being assembled contains few or no repeats when divided into k-mers of that size.+    * Modern data is too big for this method ​to be practical (overlap step is quadratic in number ​of reads) 
 +    * Still used for long read data 
 + 
 +===Reference Guided Assembly=== 
 +    * align all reads to the reference sequence 
 + 
 +===de Bruijn ​Graph Assembly=== 
 +    * Make kmers the unit of assembly  
 +    * Divides each sequence into k-mers of a given length and constructing nodes such that each node contains a k-merand directed edge from one node to another means that one k-mer can be extended into another k-mer 
 +<​code>​ 
 +AACGT->​ACGTA->​CGTAG->​GTAGC->​ ... 
 +</​code>​ 
 +    * Grows linearly with number of reads 
 +    * Ceiling on graph size is the size of the genome 
 +    * There are 4^distinct kmers we want way more kmers than the size of the genome to limit the overlaps 
 +        * EX: k = 20, for the human genome there is a 1/300 chance a random kmer is there 
 + 
 +===Kmer Spectra=== 
 +    * How many times does each unique kmer in the genome occur in the reads? 
 +    * There is a peak at the point of average coverage, which tells you approximately the genome ​size 
 +    * kmer spectra can be used for error correction
lecture_notes/04-15-2015.txt · Last modified: 2015/04/17 10:38 by almussel