archive:bioinformatic_tools:jellyfish

Differences

This shows you the differences between two versions of the page.

archive:bioinformatic_tools:jellyfish [2011/04/18 00:25]
svohr added run-1 illumina
archive:bioinformatic_tools:jellyfish [2015/07/28 06:23]
ceisenhart ↷ Page moved from bioinformatic_tools:jellyfish to archive:bioinformatic_tools:jellyfish
Line 1: Line 1:
 +====== Jellyfish ======
 +The current version installed on campusrocks is 1.1 (official release).
 +
 Jellyfish is a tool for fast, memory-efficient counting of k-mers in DNA [[http://www.cbcb.umd.edu/software/jellyfish/]][(cite:jellyfish>Marçais, Guillaume and Kingsford, Carl. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): 764-770 first published online January 7, 2011 doi:10.1093/bioinformatics/btr011)]
 +
 +The Jellyfish "stats" option allows for a bounded dump of the k-mer table, using -L for the lower bound and -U for the upper bound. Using this, we can examine high-frequency k-mers for abnormalities.
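
The effect of such a bounded dump can be sketched in Python as a filter over (k-mer, count) pairs. This only illustrates the -L/-U idea as described above and is not how Jellyfish implements it; the function name bounded_dump, the inclusive bounds, and the sample counts are all assumptions made for the example.

```python
def bounded_dump(kmer_counts, lower=None, upper=None):
    """Yield (kmer, count) pairs whose count lies in [lower, upper].

    Hypothetical helper: bounds are assumed inclusive; None means unbounded.
    """
    for kmer, count in kmer_counts:
        if lower is not None and count < lower:
            continue
        if upper is not None and count > upper:
            continue
        yield kmer, count


# Made-up counts, for illustration only.
sample = [("ACGTA", 1), ("CGTAC", 12), ("GTACG", 50000), ("TACGT", 3)]

# Keep only the anomalously frequent k-mers.
high = list(bounded_dump(sample, lower=1000))
print(high)
```

Setting only a lower bound isolates the high-count k-mers for inspection; setting both bounds selects a band of multiplicities.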
  
 The documentation is at
Line 27: Line 32:
  
 {{:bioinformatic_tools:slug-fit-gamma-illumina1.png|}}
 +
 +For run1, the numbers of distinct k-mers at the first few multiplicities (list item n is multiplicity n) are
 +  - 970,576,481 (19-mers)  1,242,303,036 (22-mers)
 +  - 95,088,167 (19-mers)  100,246,200 (22-mers)
 +  - 55,353,345 (19-mers)  67,039,962 (22-mers)
 +  - 60,129,122 (19-mers)  77,381,432 (22-mers)
 +Total distinct: 2,298,220,805 (19-mers) 2,699,479,169 (22-mers)
 +These counts were done before running SeqPrep, so they include adapter reads.
 +
 +After running SeqPrep, using all the Illumina data produced:
 +
 +{{:bioinformatic_tools:fit-gamma-illumina-all-seqprep.png|}}
 +
 +We have 2,196,163,636 distinct 19-mers. If we use a multiplicity of 2 or less as the criterion for calling a k-mer a sequencing error, we get 1,222,498,009 distinct k-mers, close to our previous estimates.
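
As a rough worked example of the same 2-or-less criterion, the sketch below applies it to the run1 19-mer counts listed earlier (the pre-SeqPrep data, not the all-data run). It assumes that the position of each list item is its multiplicity; the names run1_19mers and distinct_above are made up for the illustration.

```python
# Distinct 19-mer counts for run1, keyed by multiplicity.
# Assumption: list item n in the text corresponds to multiplicity n.
run1_19mers = {1: 970_576_481, 2: 95_088_167, 3: 55_353_345, 4: 60_129_122}
total_distinct_19mers = 2_298_220_805


def distinct_above(histogram, total_distinct, threshold):
    """Distinct k-mers left after discarding multiplicities <= threshold."""
    discarded = sum(n for mult, n in histogram.items() if mult <= threshold)
    return total_distinct - discarded


kept = distinct_above(run1_19mers, total_distinct_19mers, threshold=2)
print(f"{kept:,}")  # 1,232,556,157
```

Only the counts at multiplicities 1 and 2 are subtracted, so the histogram does not need to be complete beyond the threshold.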
 +
 +The fit-gamma-illumina-all-seqprep.gnuplot script gives an estimated coverage of 10.247. If we divide the total number of k-mers (23,731,306,715) by the approximate coverage, we get a genome length of 2.3159 Gbases.
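
The division above can be checked with a few lines of Python. The variable names are chosen for this example only; both numbers are the ones quoted in the text.

```python
total_kmers = 23_731_306_715  # total number of k-mers quoted above
coverage = 10.247             # coverage estimate from the gnuplot fit

# Genome length estimate = total k-mers / coverage.
genome_length = total_kmers / coverage
print(f"{genome_length / 1e9:.4f} Gbases")  # 2.3159 Gbases
```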
 +
  
 ====== Gamma distribution is wrong ======
Line 41: Line 63:
   * Jellyfish only accepts FASTA format files, which are not the native format for any of the sequencing platforms. The 454 produces SFF files, which are probably not a good idea to use as input for a k-mer counter, since a base caller is needed to interpret them. Illumina uses a "qseq" format which includes both the sequence data and quality information. Many other programs use FASTQ format. Jellyfish should read both the qseq and the FASTQ formats, but does not. The preprocessing time and disk space to convert to FASTA format may negate any speed advantages that Jellyfish has. According to John St. John, there is a newer release that will accept FASTQ input.
   * Jellyfish is not quality-aware. While some applications (like the estimation of sequencing error above) need to count all the k-mers, for other applications we only want to count the k-mers that are probably correct. Doing end-trimming of reads based on quality before counting k-mers or filtering k-mers based on the minimum quality of the bases in the k-mer would result in much smaller hash tables and more robust error correction. With Jellyfish, this has to be done by expensive pre-filtering, when a simple command-line parameter and an input parser that understands formats that include quality information would be much more valuable.
-  * The Jellyfish "stats" option provides a dump of the full table, but not a partial dump of just some range of multiplicities. I had one run with a few thousand anomalously high count k-mers. I would like to dump just those k-mers and see what they are. For error-correction, I only need the reasonably frequent k-mers, not the low-count ones, so the output could be reduced by a factor of five with a simple threshold check.
   * I'd like to have a more user-friendly way to specify the hash table size. Rather than giving the number of slots for the table, I'd like to be able to specify the amount of RAM to use, and let the program figure out how much it can pack in there. I can look up how much RAM a node on the cluster has (about 15 Gbytes), but computing what value to pass to Jellyfish is a bit mysterious.
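
The quality-aware filtering wished for in the list above (keeping a k-mer only if its lowest-quality base passes a threshold) can be sketched as follows. This is a minimal illustration, not a Jellyfish feature: it assumes the per-base quality scores have already been decoded to integers, and the function name and threshold value are hypothetical.

```python
def kmers_with_min_quality(seq, quals, k, min_q):
    """Yield the k-mers of seq whose bases all have quality >= min_q.

    seq:   read sequence (string)
    quals: per-base integer quality scores, same length as seq
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for i in range(len(seq) - k + 1):
        # A k-mer is kept only if its worst base is good enough.
        if min(quals[i:i + k]) >= min_q:
            yield seq[i:i + k]


# Made-up read with one low-quality base at index 3.
read = "ACGTACGT"
quals = [30, 30, 30, 2, 30, 30, 30, 30]
good = list(kmers_with_min_quality(read, quals, k=3, min_q=20))
print(good)  # ['ACG', 'ACG', 'CGT']
```

Every k-mer overlapping the low-quality base is dropped, so a single bad base removes up to k entries from the table. The window minimum is recomputed naively here; a sliding-window minimum would avoid the repeated work on long reads.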
  
archive/bioinformatic_tools/jellyfish.txt · Last modified: 2015/07/28 06:23 by ceisenhart