Banana Slug Genomics

**This is an old revision of the document!** ----

A PCRE internal error occured. This might be caused by a faulty plugin

Jellyfish is a tool for fast, memory efficient counting of K-mers in DNA [[http://www.cbcb.umd.edu/software/jellyfish/]][(cite:jellyfish>Marçais, Guillaume and Kingsford, Carl. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): 764-770 first published online January 7, 2011 doi:10.1093/bioinformatics/btr011)] The documentation is at [[http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.pdf|www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.pdf]] Jellyfish was run on the Pyrobaculum oguniense 454 data for 10-mers, 15-mers, 20-mers, 25-mers, 30-mers, and 31-mers. Jelly fish runs quite quickly for all of these, but takes forever for 32-mers. It seems like 31-mers are a hard limit (probably related to 64-bit word sizes). Here is a plot of the frequency of multiplicity for k-mers in the Pog data. Note that 10-mers are too short and many occur with high multiplicity. For 15-mers and longer, there is a nice peak around 47 (drifting downward a little for longer k-mers). The curves can be fit pretty well with a gamma distribution. Below a multiplicity of around 16, the counts shoot way up due to sequencing errors. There is a heavy tail of high multiplicity k-mers, due to repeat regions and the Pog virus. We can estimate the sequencing error rate by counting all kmers (not just distinct ones) with multiplicity less than where the minimum occurs, and divide by the total number of kmers to get the fraction of bad kmers. Dividing again by the kmer length gives us a per-base error rate, which for the Pog 454 data is about 0.5% (no matter which of the long-enough kmer distributions we use). We can also estimate the genome size from the coverage estimate and total number of non-error kmers. The number of unique 31-mers that have coverage >=17 is 2,428,316. Using the 31-mer gamma fit gives a mean coverage of 47.11x. The total number of good 31-mers is 114.75e+06, dividing this by the coverage gives an estimate of 2,435,702. Both estimates give us a genome length of 2.43e06, very close to the final sizes for the v4c assembly (chr=2436036, ece=16887). {{bioinformatic_tools:fit-gamma.png}}

Discussion

Louis Letourneau, 2011/05/10 19:39

I found an answer, it works nicely with gnuplot.

Banana slug, 2011/05/19 09:35

Hey Louis,

Can you please share the GNU-plot solution here with the rest as well :-) ?

Thank you.

Louis Letourneau, 2011/05/19 15:22

I'm sorry I can't get it to format the comment like I want.

Basically I ran jellyfish count with kmers from 11 to 31. I then ran jellyish histo for each of them.

Finally I wrote a script to generate a png from gnuplot.

gnuplot < scriptFile.gnu > image.png

scriptFile.gnu :

set terminal png nocrop size 1280,1025 set format y '10^(%.0f)'

plot \

“kmer_histo_11.txt” using 1:($2 < 1 ? 1 : log10($2)) title “bezier_11” smooth bezier,\

“kmer_histo_15.txt” using 1:($2 < 1 ? 1 : log10($2)) title “bezier_15” smooth bezier,\

“kmer_histo_17.txt” using 1:($2 < 1 ? 1 : log10($2)) title “bezier_17” smooth bezier,\

“kmer_histo_19.txt” using 1:($2 < 1 ? 1 : log10($2)) title “bezier_19” smooth bezier,\

“kmer_histo_21.txt” using 1:($2 < 1 ? 1 : log10($2)) title “bezier_21” smooth bezier,\

“kmer_histo_25.txt” using 1:($2 < 1 ? 1 : log10($2)) title “bezier_25” smooth bezier,\

“kmer_histo_27.txt” using 1:($2 < 1 ? 1 : log10($2)) title “bezier_27” smooth bezier,\

“kmer_histo_31.txt” using 1:($2 < 1 ? 1 : log10($2)) title “bezier_31” smooth bezier;

Louis Letourneau, 2011/05/19 15:23

I also have the same “effect” of this page, any kmer lower than 15 is pretty much useless. Well not useless, the it doesn't bring our the information of what is erroneous in the the reads.

Louis Letourneau, 2011/05/04 03:18

May I ask what exactly you used to plot the graphs? Did you plot the output of jellyfish histo directly? Thanks

Edward Liaw, 2011/04/09 07:44

I added in two python scripts to BME235/bin: qseq2fasta and qseq2fastq, to parse the Illumina files. There is a faster implementation of the quality score conversion using a numpy array instead of ord→chr individually but I was having trouble installing numpy.

Be sure to use the -f option when running with Jellyfish. -f will filter out reads that did not pass the Illumina's filter.

John St. John, 2011/04/08 20:22

Jellyfish has a new version that allows for fastq file input. I posted it in my public dropbox folder as the developer hasn't posted it to his website yet.

http://dl.dropbox.com/u/3907635/jellyfish-1.1.tar.gz

To use it with a bunch of fastq files that are gziped for example:

zcat folders_*/s_*_*.fastq.gz | jellyfish count -m 19 -C -o kmer_hash -s 1000000000 -t 32 /dev/fd/0

the last part /dev/fd/0 tells jellyfish to read from stdin. This is nice because you can run a program that converts qseq to fastq and pipes to stdout and then pipe that to jellyfish without having to worry about making tmp files if you don't want to make fastq files in the mean time.

You could leave a comment if you were logged in.

Banana Slug Genomics

User Tools

Site Tools

Discussion

Page Tools