This shows you the differences between two versions of the page.
archive:bioinformatic_tools:jellyfish [2011/04/24 19:22] eyliaw |
archive:bioinformatic_tools:jellyfish [2015/07/28 06:23] (current) ceisenhart ↷ Page moved from bioinformatic_tools:jellyfish to archive:bioinformatic_tools:jellyfish |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Jellyfish ====== | ||
+ | The current version installed on campusrocks is 1.1 (official release). | ||
+ | |||
Jellyfish is a tool for fast, memory efficient counting of K-mers in DNA [[http://www.cbcb.umd.edu/software/jellyfish/]][(cite:jellyfish>Marçais, Guillaume and Kingsford, Carl. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): 764-770 first published online January 7, 2011 doi:10.1093/bioinformatics/btr011)] | Jellyfish is a tool for fast, memory efficient counting of K-mers in DNA [[http://www.cbcb.umd.edu/software/jellyfish/]][(cite:jellyfish>Marçais, Guillaume and Kingsford, Carl. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): 764-770 first published online January 7, 2011 doi:10.1093/bioinformatics/btr011)] | ||
- | The Jellyfish "stats" option allow for a bounded dump of the kmer table, using -L for the lower bound and -U for the upper bound. Using this, we can examine high frequency kmers for abnormalities. | + | The Jellyfish "stats" option allows for a bounded dump of the kmer table, using -L for the lower bound and -U for the upper bound. Using this, we can examine high frequency kmers for abnormalities. |
The documentation is at | The documentation is at | ||
Line 29: | Line 32: | ||
{{:bioinformatic_tools:slug-fit-gamma-illumina1.png|}} | {{:bioinformatic_tools:slug-fit-gamma-illumina1.png|}} | ||
+ | |||
+ | For run1, the numbers of distinct k-mers at the first few multiplicities (list position = multiplicity) are | ||
+ | - 970,576,481 (19-mers) 1,242,303,036 (22-mers) | ||
+ | - 95,088,167 (19-mers) 100,246,200 (22-mers) | ||
+ | - 55,353,345 (19-mers) 67,039,962 (22-mers) | ||
+ | - 60,129,122 (19-mers) 77,381,432 (22-mers) | ||
+ | Total distinct: 2,298,220,805 (19-mers) 2,699,479,169 (22-mers) | ||
+ | These counts were made before running SeqPrep, so they include adapter reads. | ||
+ | |||
+ | Using all the Illumina data after running SeqPrep produced the following fit: | ||
+ | |||
+ | {{:bioinformatic_tools:fit-gamma-illumina-all-seqprep.png|}} | ||
+ | |||
+ | We have 2,196,163,636 distinct 19-mers. If we use a multiplicity of 2 or less as the criterion for calling a k-mer a sequencing error, we get 1,222,498,009 distinct k-mers, which is close to our previous estimates. | ||
+ | |||
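The same 2-or-less criterion can be applied to the run1 counts listed earlier as a rough cross-check. The sketch below is only illustrative: it assumes that the position of each entry in the run1 list is its multiplicity, and the names `RUN1_COUNTS`, `RUN1_TOTAL_DISTINCT`, and `distinct_above` are made up for this example. The page does not say whether this run1 figure is one of the "previous estimates".

```python
# Distinct k-mer counts for run1, copied from the list on this page.
# Assumption: list position (1, 2, 3, 4) is the k-mer multiplicity.
RUN1_COUNTS = {
    19: [970_576_481, 95_088_167, 55_353_345, 60_129_122],
    22: [1_242_303_036, 100_246_200, 67_039_962, 77_381_432],
}
RUN1_TOTAL_DISTINCT = {19: 2_298_220_805, 22: 2_699_479_169}


def distinct_above(k, max_error_multiplicity=2):
    """Distinct k-mers left after discarding multiplicity <= threshold."""
    discarded = sum(RUN1_COUNTS[k][:max_error_multiplicity])
    return RUN1_TOTAL_DISTINCT[k] - discarded


for k in (19, 22):
    print(f"{k}-mers kept with multiplicity > 2: {distinct_above(k):,}")
```

For run1 this leaves about 1.23 billion distinct 19-mers, the same order as the 1,222,498,009 reported for the full SeqPrep-cleaned data set.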
+ | The fit-gamma-illumina-all-seqprep.gnuplot script gives an estimated coverage of 10.247. If we divide the total number of k-mers (23,731,306,715) by this approximate coverage, we get an estimated genome length of 2.3159 Gbases. | ||
+ | |||
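The genome-length arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the two numbers quoted on this page. The function name `estimate_genome_length` is hypothetical, and the coverage value is simply taken from the gnuplot fit rather than recomputed.

```python
# Values quoted on this page (all Illumina data, after SeqPrep).
TOTAL_KMERS = 23_731_306_715   # total (non-distinct) k-mer count
ESTIMATED_COVERAGE = 10.247    # from the gamma fit; taken as given here


def estimate_genome_length(total_kmers, coverage):
    """Genome length estimate: total k-mers divided by k-mer coverage."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return total_kmers / coverage


length_bases = estimate_genome_length(TOTAL_KMERS, ESTIMATED_COVERAGE)
print(f"Estimated genome length: {length_bases / 1e9:.4f} Gbases")
```

The estimate scales inversely with the fitted coverage, so any error in the fit carries straight through to the genome length.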
====== Gamma distribution is wrong ====== | ====== Gamma distribution is wrong ====== |