Kevin - Jellyfish Notes
Basic Kmer Counting
- Naïve python way (See Figure 1)
○ Okay if you're in a microbial genome
○ Runs out of memory
§ ~60 bytes per entry
§ Every error creates a unique kmer and therefore a unique entry
- How to go faster
○ Want a faster language
○ Memory-mapped I/O
§ Move data as a disk block to memory
○ Process Kmers one character at a time to avoid building big strings
§ Look at sequences a kmer at a time
○ How to compress results (See Figure 2)
§ Strings are inefficient way to store kmers
§ Want to compress keys and values
□ We have lots of 1s and a few large numbers
§ We want a special purpose hash table
□ Make a numeric code for a Kmer
® w/ 4 letters you can encode it with 2 bits. But we have wildcards and N. You need 4 bits. What do we care about?
® Jellyfish only encodes for AGTC with 2 bits. We are ignoring then ambiguous or unknown sequences.
® Can restart sequences at N to avoid ambiguity problems
□ Shifts to minimize machine commands
□ We want spread out hash values
® Jellyfish limits array size to a power of 2
□ Invertible mapping - Can unscramble our scrambled and spread out key
® Allows you to discard part of the scrambled key since when you know where you are in the array you know where you are vis a vis M.
◊ In a 64 bit system. We can store the key in 33 bits. We have 32 bits left for other things.
□ We won't have perfect hashes (no collisions) since it is computationally expensive.
® Can use a new hash function when you have one
® Can increment by one to change hash -Linear probing
® Jellyfish uses quadratic probing 2, 8, etc
◊ Uses first open cell
◊ 90% = too full
} Can make a bigger table (not used here)
} Jellyfish dumps the table to disk and uses a new table. Means you need to eventually needs to merge the tables.
} We need to know the # of reprobes to figure out where it was in memory. To get the scrambled key and to reverse the process. Max # reprobes dictates how much space you need for this info. Max reprobes reached = dump to disk.
□ What do we do when the value exceeds our left over space?
® We can indicate in a field that there is over flow with 1111 -> 0000
◊ We look at key for value 1. If there is indicated over flow we keep looking till we find the end
○ We need to sync multi core CPUs to avoid collisions in data structure.
§ Lock structure would cause serialization bottleneck.
□ Wait in line model
§ Want something with no costs during no-collisions while handling collisions that happen
§ CPUs have CAS - Compare and Save
□ Address, old value, new value (a, x, y)
□ Look at address. If value at mem address is old value write new value. Returns old value.
® If x = y OK
® If x does not equal y Some sort of collision.
◊ If it’s a memory collision reprobe for mem
◊ If it’s a hash collision re hash
□ Only one can succeed. Supported by hardware.
§ Almost linear speed increase.
§ Need to sync input side too. Not sure how they do that.
□ Can interleave - straight through most efficient for I/O
□ Can start in widely spaced pages
○ Small kmers it uses direct addressing to speed things up
Figure 1
Figure 2