archive:bioinformatic_tools:quake

Differences

This shows you the differences between two versions of the page.

archive:bioinformatic_tools:quake [2011/05/27 21:12]
eyliaw
archive:bioinformatic_tools:quake [2011/05/27 21:41]
eyliaw
Line 27: Line 27:
 [Kevin] I think that we want to select the k-mer size manually, rather than relying on Quake. Their default cutoff is very conservative, and we'll probably do better over-correcting than under-correcting.
 If we look at the hugely over-represented k-mers (like the adapter sequences), and compare the true sequences to ones that are one base different, we see that the true ones are about 30 times as frequent. Thus Quake's idea of correcting only the rarely seen k-mers isn't quite right. What we really want to correct are those k-mers that are close neighbors of much more frequent k-mers. I've not figured out yet precisely what "much more frequent" should mean.
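A minimal sketch of the neighbor-based idea in the paragraph above, under stated assumptions: "close neighbor" is taken to mean Hamming distance 1, and the `ratio=30` default just reuses the roughly 30x observation, since the note leaves "much more frequent" undefined. The function names are hypothetical and this is not how Quake itself selects k-mers.

```python
from collections import Counter

BASES = "ACGT"

def count_kmers(reads, k):
    """Count every k-mer occurring in an iterable of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def neighbors(kmer):
    """Yield all k-mers that differ from `kmer` at exactly one position."""
    for i, base in enumerate(kmer):
        for alt in BASES:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]

def suspect_kmers(counts, ratio=30):
    """Map each suspect k-mer to its most frequent observed neighbor.

    A k-mer is a suspect when some one-base neighbor was seen at least
    `ratio` times as often. `ratio` is an assumed threshold.
    """
    suspects = {}
    for kmer, n in counts.items():
        seen = [nb for nb in neighbors(kmer) if nb in counts]
        if not seen:
            continue
        best = max(seen, key=lambda nb: counts[nb])
        if counts[best] >= ratio * n:
            suspects[kmer] = best
    return suspects

# Toy data: one abundant k-mer and a rare one-base variant of it.
reads = ["ACGTA"] * 60 + ["ACCTA"] * 2
counts = count_kmers(reads, k=5)
suspects = suspect_kmers(counts)
```

With real data the counts would more likely come from a k-mer counter such as jellyfish than from a Python loop; the loop is only here to keep the example self-contained.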
 +
 +===== Potential Problems =====
 +  * Input files need to have an extension, or Quake will throw a substr error when trying to merge hidden files into a result.
 +  * With paired-end input, Quake will output two files for each paired-end input file. One is the cor.fastq file, which contains corrected reads that are still paired. The other is the cor_single.fastq file, which contains reads where only one read of the pair could be corrected. You can treat the cor_single.fastq file as a single-end read file.
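The first problem in the list above can be caught before starting a run. The sketch below is a hypothetical pre-flight check, not part of Quake; the function name and the example paths are made up for illustration.

```python
import os

def missing_extension(paths):
    """Return the input paths whose file name has no extension."""
    return [
        p for p in paths
        if not os.path.splitext(os.path.basename(p))[1]
    ]

# Hypothetical input names; only the file name part is inspected,
# so a dot in a directory name does not count as an extension.
inputs = ["run1/reads_1.fastq", "run1/reads_2", "lane.1/reads"]
bad = missing_extension(inputs)
```

Anything returned in `bad` would need renaming (for example by adding `.fastq`) before being handed to Quake.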
  
 ===== Methods ===== ===== Methods =====
Line 34: Line 38:
  
 They also recommend a k-mer probability of 0.01 in a random sequence that is as long as the genome. That is, 2*G/4^k ~ 0.01, where G is the size of the sequenced genome and k is the size of the k-mer. Simplified, k ~ log4(200*G), which is about 19 for our prediction of the banana slug genome size, {{https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish|2.042e+09}}.
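The rule of thumb 2*G/4^k ~ 0.01 is easy to check numerically. The sketch below solves it for k and also evaluates the left-hand side at a given integer k. The function names are hypothetical, and how to round the result is an assumption, since the text only says "about 19".

```python
import math

def suggested_k(genome_size, target=0.01):
    """Solve 2*G / 4**k = target for k, i.e. k = log4(2*G / target)."""
    return math.log(2 * genome_size / target, 4)

def expected_hits(genome_size, k):
    """Evaluate 2*G / 4**k for a concrete k-mer size."""
    return 2 * genome_size / 4 ** k

G = 2.042e9                 # predicted banana slug genome size from the text
k_exact = suggested_k(G)    # a little over 19
k = round(k_exact)          # assumed rounding to the nearest integer
p19 = expected_hits(G, 19)
p20 = expected_hits(G, 20)
```

At k = 19 the value is somewhat above the 0.01 target and at k = 20 it is well below it, which is why the estimate sits between the two.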
- 
-===== Potential Problems =====
-  * Input files need to have an extension, or Quake will throw a substr error when trying to merge hidden files into a result.
-  * With paired-end input, Quake will output two files for each paired-end input file. One is the cor.fastq file, which contains corrected reads that are still paired. The other is the cor_single.fastq file, which contains reads where only one read of the pair could be corrected. You can treat the cor_single.fastq file as a single-end read file.
archive/bioinformatic_tools/quake.txt · Last modified: 2015/07/28 06:26 by ceisenhart