Differences

This shows you the differences between two versions of the page.

--- archive:bioinformatic_tools:quake [2011/04/24 20:55]
eyliaw [Running Quake]
+++ archive:bioinformatic_tools:quake [2015/07/28 06:26] (current)
ceisenhart ↷ Page moved from bioinformatic_tools:quake to archive:bioinformatic_tools:quake
@@ Line 1: / Line 1: @@
 ====== Quake ======
-[[http://www.cbcb.umd.edu/software/quake/index.html|Quake]] is a sequencing error correction program that uses kmer count distributions to make the distinction between trusted and untrusted reads.[(cite:quake>Kelley, David R; Schatz, Michael C; and Salzberg, Steven L. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 2010, 11:R116 doi:[[http://dx.doi.org/10.1186/gb-2010-11-11-r116|10.1186/gb-2010-11-11-r116]])]
+[[http://www.cbcb.umd.edu/software/quake/index.html|Quake]] is a sequencing error correction program that uses kmer count distributions to make the distinction between trusted (suspected true) and untrusted (suspected erroneous) reads.[(cite:quake>Kelley, David R; Schatz, Michael C; and Salzberg, Steven L. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 2010, 11:R116 doi:[[http://dx.doi.org/10.1186/gb-2010-11-11-r116|10.1186/gb-2010-11-11-r116]])]
 ===== Running Quake =====
@@ Line 9: / Line 9: @@
    jellyfish stats -t -c -o [counts file]
-Then, build the covariance model to find the cutoff, default is 200:1 likelihood ratio of erroneous to true reads (same as plotting the histogram from Jellyfish):
+Then, build the covariance model to find the cutoff, default is 200:1 likelihood ratio of untrusted to trusted reads (same as plotting the histogram from Jellyfish):
    cov_model.py [counts file]
@@ Line 15: / Line 15: @@
 Finally, correct the reads:
-   correct -f [fastq list file] -k [k-mer size] -c [cutoff] -m [counts file] -p [number of cores]
+   correct -f [fastq file list] -k [k-mer size] -c [cutoff] -m [counts file] -p [number of cores] -z (gzips the output)
+In the file list, you should tab-separate paired end reads.  Also, be sure that all .'s in the sequence are written as N's.  If the quality scores are written as Phred + 64, you can use -q 64 to handle it.
+K-mer counts can also be pre-filtered to save space.
+Quake dev:
+Once you've decided on a cutoff, Quake ignores all of the k-mers below that cutoff. So sure, you can filter the file to save some disk space. But having all of the k-mer counts is best for choosing the cutoff. My cov_model.py script to automatically choose the cutoff requires them.
+[Kevin] I think that we want to select the k-mer size manually, rather than relying on quake.  Their default cutoff is very conservative, and we'll probably do better over-correcting than under-correcting.
+If we look at the hugely over-represented k-mers (like the adapter sequences), and compare the true sequences to one that are one base different, we see that the true ones are about 30 times as frequent.  Thus quake's idea of correcting only the rarely seen k-mers isn't quite right.  What we really want to correct are those k-mers that are close neighbors of much more frequent k-mers.  I've not figured out yet precisely what "much more frequent" should mean.
+===== Potential Problems ======
+  * Input files need to have an extension, or Quake will throw a substr error when trying to merge hidden files into a result.
+  * With paired-end input, Quake will output two files for each paired-end read.  One will be the cor.fastq file, which contains corrected, paired reads.  The other will be the cor_single.fastq file, which contains reads where only one pair could be corrected.  You can treat the cor_single.fastq file as a single read file.
 ===== Methods =====

Banana Slug Genomics

User Tools

Site Tools

Differences

Page Tools