Differences

This shows you the differences between two versions of the page.

--- archive:bioinformatic_tools:jellyfish [2011/04/09 01:22]
karplus Added complaint about not being able to dump just part of the k-mer table.
+++ archive:bioinformatic_tools:jellyfish [2015/07/28 06:23] (current)
ceisenhart ↷ Page moved from bioinformatic_tools:jellyfish to archive:bioinformatic_tools:jellyfish
@@ Line 1: / Line 1: @@
+====== Jellyfish ======
+The current version installed on campusrocks is 1.1 (official release).
 Jellyfish is a tool for fast, memory efficient counting of K-mers in DNA [[http://www.cbcb.umd.edu/software/jellyfish/]][(cite:jellyfish>Marçais, Guillaume and Kingsford, Carl. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): 764-770 first published online January 7, 2011 doi:10.1093/bioinformatics/btr011)]
+The Jellyfish "stats" option allows for a bounded dump of the kmer table, using -L for the lower bound and -U for the upper bound.  Using this, we can examine high frequency kmers for abnormalities.
 The documentation is at
 [[http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.pdf|www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.pdf]]
+====== Pyrobaculum oguniense ======
 Jellyfish was run on the Pyrobaculum oguniense 454 data for 10-mers, 15-mers, 20-mers, 25-mers, 30-mers, and 31-mers.  Jelly fish runs quite quickly for all of these, but takes forever for 32-mers.  It seems like 31-mers are a hard limit (probably related to 64-bit word sizes).
@@ Line 15: / Line 21: @@
 {{bioinformatic_tools:fit-gamma.png}}
+====== Banana Slug ======
+Similarly, we ran Jellyfish on the first run of Illumina sequencing for banana slug to count 22-mers. This plot shows the same pattern of a high counts for unique reads that descends to a minimum before peaking and continuing out in the long tail. The minimum occurs around 3 so we can consider 22-mers with a count of 4 or higher to be correct. From this we find that 7.61% of 22-mers have at least 1 error so the per-base error rate is estimated to be 0.35%.
+Using the number of distinct 22-mers, we estimate the genome size to be 1.289e+09 bases ( 2,699,479,169 total - 1,410,480,198 "bad" 22-mers ). The estimate of the coverage from mean of the gamma fit is 9.786x. When we divide the total number of good 22-mers by the coverage, the estimated genome size is 2.042+09 bases.
+The large difference between these two estimated sizes is most likely due to repeat sequences. These are not taken into account when estimating the
+size using only distinct 22-mers. The larger estimate is probably closer to the actual size.
+{{:bioinformatic_tools:slug-fit-gamma-illumina1.png|}}
+For run1, the first few distinct kmer with the specified multiplicities are
+  - 970,576,481 (19-mers)  1,242,303,036 (22-mers)
+  - 95,088,167 (19-mers)  100,246,200 (22-mers)
+  - 55,353,345 (19-mers)  67,039,962 (22-mers)
+  - 60,129,122 (19-mers)  77,381,432 (22-mers)
+Total distinct: 2,298,220,805 (19-mers) 2,699,479,169 (22-mers)
+These counts were done before running SeqPrep, so include adapter reads.
+After running SeqPrep, using all the illumina data produced
+{{:bioinformatic_tools:fit-gamma-illumina-all-seqprep.png|}}
+We have 2196163636 distinct 19-mers. If we use 2-or-less as the criterion for calling a k-mer a sequencing error, we get 1,222,498,009 distinct k-mers---close to our previous estimates.
+The fit-gamma-illumina-all-seqprep.gnuplot script gives an estimated coverage of 10.247.  If we divide the total number of k-mers (23731306715) by the approximate coverage, we get a genome length of 2.3159 Gbases.
+====== Gamma distribution is wrong ======
+Of course, the gamma distribution is almost certainly the //wrong// distribution to use here.  We probably should be fitting a more realistic model of the problem.  We expect to see multiple peaks, from different causes.  The peak at low multiplicity should come mainly from sequencing errors, though it could also come from copying errors in the PCR amplification in creating the initial library, and from polymorphisms in the initial population of cells.  (For metagenomic libraries, that polymorphism may be a major contributor to the initial peak.)  The main peak at medium multiplicity should be from multiple copies of the same location in the genome.  If a k-mer occurs n times in the genome, we would expect to see it n times as often in the sequencing, so there should be additional peaks for k-mers that occur in repeats.
+If we have N error-free k-mers sampled uniformly from a universe of D distinct kmers (the no-repeat situation), we would expect a binomial distribution, with D (N choose m)p^m(1-p)^(N-m) distinct k-mers with multiplicity m, where p=1/D.  We can approximate this as  normal(mean=c, variance=c), where c is the coverage=N/D).
+If we have D_j k-mers that occur j times in the genome, with sum_j j D_j = G (the genome size), we expect about N j D_j/G of our samples to be from k-mers that repeat j times, and we can model the distribution of error-free reads as  sum_j D_j normal(mean=j c, variance=j c), where c is now N/G.  I (Kevin Karplus) tried this model and it is a terrible fit to the data.  The coverage estimate is ok (around 45), but the variance for the normal distribution is much smaller than than the observed distribution, and we get very distinct peaks that are much too narrow.  So my reasoning about the model must be wrong, but I'm not sure where yet.
+If I allow the variance to be fit separately, I can get a pretty good fit with D1 normal(c,v)+ D2 normal(2c,2v), with the variance about 5 times bigger than I expected. Adding a third peak doesn't help much.
 ===== Problems with Jellyfish =====
   * Jellyfish only accepts FASTA format file, which are not the native format for any of the sequencing platforms.  The 454 produces SFF files, which are probably not a good idea to use as input for a k-mer counter, since a base caller is needed to interpret them.  Illumina uses a "qseq" format which includes both the sequence data and quality information.  Many other programs use FASTQ format.  Both the qseq and the FASTQ format should be read by Jellyfish (and aren't).  The preprocessing time and disk space to convert to FASTA format may negate any speed advantages that Jellyfish has.  According to John St. John, there is a newer release that will accept FASTQ input.
   * Jellyfish is not quality-aware.  While some applications (like the estimation of sequencing error above) need to count all the k-mers, for other applications we only want to count the k-mers that are probably correct.  Doing end-trimming of reads based on quality before counting k-mers or filtering k-mers based on the minimum quality of the bases in the k-mer would result in much smaller hash tables and more robust error correction.  With Jellyfish, this has to be done by expensive pre-filtering, when a simple command-line parameter and an input parser that understands formats that include quality information would be much more valuable.
-  * The Jellyfish "stats" option provides a dump of the full table, but not a partial dump of just some range of multiplicities.  I had one run with a few thousand anomalously high count k-mers.  I would like to dump just those k-mers and see what they are.  For error-correction, I only need the reasonably frequent k-mers, not the low-count ones, so the output could be reduced by a factor of five with a simple threshold check.
+  * I'd like to have a more user-friendly way to specify the hash table size.  Rather than giving the number of slots for the table, I'd like to be able to specify the amount of RAM to use, and let the program figure out how much it can pack in there.  I can look up how much RAM a node on the cluster has (about 15Gbytes), but computing what value to pass to Jellyfish is a bit mysterious.

Banana Slug Genomics

User Tools

Site Tools

Differences

Page Tools