This is an old revision of the document!
We talked about the Jellyfish tool. Details of the space-efficient algorithm were explained, and we looked at how to schedule a run on the Campusrocks cluster.
jellyfish -m 22 -o output -c 3
qsub -hard -l num_proc=# jellyfish_script
The initial kmers (4^k possible) are first changed to keys with an invertible binary matrix. The keys are then reduced by a modulus M=2^l, so that they range from [0, M-1], condensing the hash table. The remainder 2k-l keys are stored in front of the values. Because the keys are condensed, collisions are to be expected for dissimilar keys, so the algorithm uses a reprobe function: i(i+1)/2, where i is the number of reprobes; to resolve them. The first key that hashes to a position will store a reprobe value of 1 and following dissimilar keys will go to the next reprobe position and store i+1. There is a maximum limit to the number of reprobes, after which the algorithm will write the hash to disk(?) instead of reprobing.
The other issue is that the value field for the counts has a limited size. When it has filled, it will reprobe and store the higher order bits in the extension.