archive:bioinformatic_tools:quake

Differences

This shows you the differences between two versions of the page.

archive:bioinformatic_tools:quake [2011/05/07 11:52] eyliaw
archive:bioinformatic_tools:quake [2011/05/09 18:02] eyliaw [Running Quake]
Line 15:
 Finally, correct the reads:

-   correct -f [fastq list file] -k [k-mer size] -c [cutoff] -m [counts file] -p [number of cores]
+   correct -f [fastq file list] -k [k-mer size] -c [cutoff] -m [counts file] -p [number of cores] -z (gzips the output)
 + 
+In the file list, put the two files of each paired-end library on the same line, separated by a tab. Also, be sure that every "." in the read sequences is written as "N".
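The two preparation steps above can be sketched in Python. This is only an illustration, not part of Quake: the helper names `write_file_list` and `dots_to_n` are made up here, and the sketch assumes uncompressed FASTQ with exactly four lines per record. Only the sequence line is rewritten, because a "." in the quality line is a legitimate quality character.

```python
from pathlib import Path


def write_file_list(list_path, paired, unpaired=()):
    """Write a file list: paired files share a line (tab-separated), single files get their own line."""
    lines = [f"{r1}\t{r2}" for r1, r2 in paired] + list(unpaired)
    Path(list_path).write_text("\n".join(lines) + "\n")


def dots_to_n(fastq_in, fastq_out):
    """Copy a 4-line-per-record FASTQ file, replacing '.' with 'N' in sequence lines only."""
    with open(fastq_in) as src, open(fastq_out, "w") as dst:
        for i, line in enumerate(src):
            if i % 4 == 1:  # the second line of each record is the sequence
                line = line.replace(".", "N")
            dst.write(line)
```

For example, `write_file_list("reads.list", [("lib1_1.fastq", "lib1_2.fastq")], ["single.fastq"])` would produce a two-line list whose first line holds both mates of the pair (the file names are placeholders).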
  
 [Kevin] I think that we want to select the k-mer size manually, rather than relying on Quake. Their default cutoff is very conservative, and we'll probably do better over-correcting than under-correcting.
 If we look at the hugely over-represented k-mers (like the adapter sequences) and compare the true sequences to ones that differ by one base, we see that the true ones are about 30 times as frequent. Thus Quake's idea of correcting only the rarely seen k-mers isn't quite right. What we really want to correct are those k-mers that are close neighbors of much more frequent k-mers. I've not yet figured out precisely what "much more frequent" should mean.
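A rough sketch of that neighbor-based idea, in Python. This is not how Quake works. The function names are hypothetical, and the `ratio=30` default is only a placeholder borrowed from the adapter observation above, since what "much more frequent" should mean is still open. A k-mer is flagged when some k-mer one substitution away is at least `ratio` times as frequent.

```python
from collections import Counter

BASES = "ACGT"


def neighbors(kmer):
    """Yield every k-mer at Hamming distance 1 from kmer."""
    for i, base in enumerate(kmer):
        for alt in BASES:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def suspect_kmers(counts, ratio=30):
    """Map each suspect k-mer to its most frequent neighbor, if that neighbor is >= ratio times as frequent."""
    suspects = {}
    for kmer, n in counts.items():
        best = max(neighbors(kmer), key=lambda nb: counts.get(nb, 0))
        if counts.get(best, 0) >= ratio * n:
            suspects[kmer] = best
    return suspects


# Toy counts: ACGTT is one base away from the much more frequent ACGTA.
counts = Counter({"ACGTA": 300, "ACGTT": 9, "GGGCC": 12})
print(suspect_kmers(counts))
```

In the toy data, GGGCC is rarer than many real k-mers would be but has no frequent neighbor, so it is left alone. That is the difference from a plain count cutoff.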
  
-    Can the k-mer counts be pre-filtered to save space?

-    Well sort of. Once you've decided on a cutoff, Quake ignores all of the k-mers below that cutoff. So sure, you can filter the file to save some disk space. But having all of the k-mer counts is best for choosing the cutoff. My cov_model.py script to automatically choose the cutoff requires them.
+K-mer counts can be pre-filtered to save space.
+
+Quake dev:
+Once you've decided on a cutoff, Quake ignores all of the k-mers below that cutoff. So sure, you can filter the file to save some disk space. But having all of the k-mer counts is best for choosing the cutoff. My cov_model.py script to automatically choose the cutoff requires them.
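A minimal Python sketch of such a pre-filter. It assumes the counts file has one k-mer per line followed by whitespace and its count (integer or fractional); that layout should be checked against the actual counts file. The name `filter_counts` is made up for this example. Keep the unfiltered file if the cutoff might still be revisited.

```python
def filter_counts(in_path, out_path, cutoff):
    """Copy only the lines whose count is >= cutoff; return (kept, dropped)."""
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            if float(fields[1]) >= cutoff:
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```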
 ===== Methods =====
 The paper gives an example, below, of the distribution typically seen in k-mer counting. They fit a Gamma distribution to the untrusted k-mers and a Gaussian + zeta mixture (the zeta component covers the high-frequency repeats) to the trusted k-mers. The trusted counts would be expected to follow a Poisson distribution, but sequencing biases make the variance significantly larger than the mean. Their example is a high-coverage run, and they note that the two distributions will overlap, as we've seen, at lower coverage, which makes finding the cutoff trickier. They chose the cutoff at a point where the likelihood ratio of untrusted to trusted k-mer was high, around 1000:1.
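The likelihood-ratio rule can be illustrated with a simplified Python sketch. This is not cov_model.py. It drops the zeta component, takes the mixture parameters as given rather than fitting them, and every name and parameter value is hypothetical. It scans coverage upward and returns the first coverage at which the weighted Gamma (untrusted) likelihood is no longer at least `ratio` times the weighted Gaussian (trusted) likelihood.

```python
import math


def log_gamma_pdf(x, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at x > 0."""
    return (shape - 1) * math.log(x) - x / scale - math.lgamma(shape) - shape * math.log(scale)


def log_gauss_pdf(x, mean, sd):
    """Log density of a Gaussian(mean, sd) distribution at x."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def choose_cutoff(p_error, shape, scale, mean, sd, ratio=1000.0, max_cov=500):
    """Return the first coverage where the untrusted:trusted likelihood ratio drops below ratio."""
    for cov in range(1, max_cov + 1):
        log_ratio = (math.log(p_error) + log_gamma_pdf(cov, shape, scale)
                     - math.log(1 - p_error) - log_gauss_pdf(cov, mean, sd))
        if log_ratio < math.log(ratio):
            return cov
    return None  # the ratio never dropped below the threshold in the scanned range


# Made-up parameters: half of the distinct k-mers are errors, trusted mean coverage is 40.
strict = choose_cutoff(0.5, shape=0.8, scale=1.5, mean=40, sd=8, ratio=1000)
loose = choose_cutoff(0.5, shape=0.8, scale=1.5, mean=40, sd=8, ratio=1)
print(strict, loose)
```

A high ratio such as 1000:1 gives a lower, more conservative cutoff than the point where the two curves cross (ratio 1). The gap between the two widens as the distributions overlap more at lower coverage.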
archive/bioinformatic_tools/quake.txt · Last modified: 2015/07/28 06:26 by ceisenhart