Kevin - Jellyfish Notes

Basic Kmer Counting
- Naïve python way (See Figure 1; sketch below)
  ○ Okay if you're in a microbial genome
  ○ Runs out of memory
    § ~60 bytes per entry
    § Every error creates a unique kmer and therefore a unique entry
- How to go faster
  ○ Want a faster language
  ○ Memory-mapped I/O
    § Move data as a disk block to memory
  ○ Process kmers one character at a time to avoid building big strings
    § Look at sequences a kmer at a time
  ○ How to compress results (See Figure 2; encoding sketch below)
    § Strings are an inefficient way to store kmers
    § Want to compress keys and values
      □ We have lots of 1s and a few large numbers
    § We want a special-purpose hash table
      □ Make a numeric code for a kmer
        ® With 4 letters you can encode a base with 2 bits, but wildcards and N would need 4 bits. What do we care about?
        ® Jellyfish only encodes A, C, G, T with 2 bits; ambiguous or unknown bases are ignored.
        ® Can restart sequences at N to avoid ambiguity problems
      □ Shifts to minimize machine commands
      □ We want spread-out hash values
        ® Jellyfish limits array size to a power of 2
      □ Invertible mapping - can unscramble our scrambled and spread-out key
        ® Allows you to discard part of the scrambled key, since when you know where you are in the array you know where you are vis-à-vis M.
          ◊ In a 64-bit system we can store the key in 33 bits and have 32 bits left for other things.
      □ We won't have perfect hashes (no collisions), since that is computationally expensive.
        ® Can use a new hash function when you have a collision
        ® Can increment by one to the next slot - linear probing
        ® Jellyfish uses quadratic probing (2, 8, etc.)
          ◊ Uses first open cell
          ◊ 90% full = too full
            } Can make a bigger table (not used here)
            } Jellyfish dumps the table to disk and uses a new table. Means you eventually need to merge the tables.
            } We need to know the # of reprobes to figure out where the entry was in memory, to get back the scrambled key and reverse the process. The max # of reprobes dictates how much space you need for this info. Max reprobes reached = dump to disk.
      □ What do we do when the value exceeds our leftover space?
        ® We can indicate in a field that there is overflow, with 1111 -> 0000
          ◊ We look at the key's entry for the value; if overflow is indicated we keep looking till we find the end
  ○ We need to sync multi-core CPUs to avoid collisions in the data structure.
    § A lock structure would cause a serialization bottleneck.
      □ Wait-in-line model
    § Want something with no cost when there are no collisions, while still handling the collisions that do happen
    § CPUs have CAS - compare-and-swap (sketch below)
      □ Address, old value, new value (a, x, y)
      □ Look at the address: if the value at the memory address is the old value x, write the new value y. Returns the value that was there.
        ® If the returned value equals x, OK
        ® If it does not equal x, some sort of collision
          ◊ If it's a memory collision, reprobe for memory
          ◊ If it's a hash collision, rehash
      □ Only one can succeed. Supported by hardware.
    § Almost linear speed increase.
    § Need to sync the input side too. Not sure how they do that.
      □ Can interleave - straight through is most efficient for I/O
      □ Can start in widely spaced pages
  ○ For small kmers it uses direct addressing to speed things up (sketch below)

Figure 1
{{:lecture_notes:lecture051915_fig1.jpg?400|}}
----
Figure 2
{{:lecture_notes:lecture051915_fig2.jpg?200|}}
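----

A minimal sketch of the naïve Python approach (in the spirit of Figure 1, though the figure's exact code may differ): every kmer becomes a Python string key in a dict, so memory grows with the number of distinct kmers, and every sequencing error adds a brand-new entry.

<code python>
from collections import defaultdict

def count_kmers_naive(sequences, k):
    """Count kmers the simple way: string slices as dict keys."""
    counts = defaultdict(int)          # tens of bytes of overhead per entry
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1  # builds a new string for every kmer
    return counts

# Tiny example with two reads and k = 3
print(count_kmers_naive(["ACGTACGT", "ACGTN"], 3))
</code>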
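The 2-bit encoding, the shift-based one-character-at-a-time scan, and the restart-at-N rule can be sketched as below. This is my own illustration of the idea, not Jellyfish's source; the function name kmer_codes is hypothetical.

<code python>
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # 2 bits per base, A/C/G/T only

def kmer_codes(seq, k):
    """Yield the 2k-bit numeric code of each kmer without building substrings."""
    mask = (1 << (2 * k)) - 1   # keep only the low 2k bits
    code = 0
    valid = 0                   # consecutive encodable bases seen so far
    for base in seq:
        b = CODE.get(base)
        if b is None:           # N or other ambiguity code: restart the window
            code, valid = 0, 0
            continue
        code = ((code << 2) | b) & mask   # one shift + OR + mask per base
        valid += 1
        if valid >= k:
            yield code

# Example with k = 3: the window restarts after the N
print(list(kmer_codes("ACGTNACG", 3)))   # [6, 27, 6] = ACG, CGT, ACG
</code>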
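The invertible-mapping idea - scramble the kmer code with a bijective hash, index the power-of-two array with the low bits, and store only the remaining high bits in the cell - can be illustrated with a simple invertible hash. The multiply-by-an-odd-constant hash below is a stand-in of my own (Jellyfish uses its own invertible hash), and the names and sizes are hypothetical.

<code python>
K = 22                                   # example kmer size
BITS = 2 * K                             # 44-bit kmer code
MOD = 1 << BITS
C = 0x9E3779B97F4A7C15 % MOD | 1         # odd multiplier => invertible mod 2**BITS
C_INV = pow(C, -1, MOD)                  # modular inverse (Python 3.8+)

ARRAY_BITS = 30                          # array size is a power of two
POS_MASK = (1 << ARRAY_BITS) - 1

def scramble(code):                      # spread-out but reversible key
    return (code * C) % MOD

def unscramble(h):
    return (h * C_INV) % MOD

def split(code):
    """Array position (low bits) plus the only bits that must be stored."""
    h = scramble(code)
    return h & POS_MASK, h >> ARRAY_BITS

def join(pos, stored):
    """Recover the original kmer code from position + stored bits."""
    return unscramble((stored << ARRAY_BITS) | pos)

code = 0x2F00DBEEF % MOD
pos, stored = split(code)
assert join(pos, stored) == code         # round trip; only 44 - 30 = 14 bits stored
</code>

With reprobing, the cell an entry ends up in is no longer exactly the low bits of its scrambled key, which is why the notes say the number of reprobes must also be recorded to reverse the process.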
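The reprobing behaviour - power-of-two table, first open cell, give up and dump to disk when the table is ~90% full or the reprobe limit is reached - might look roughly like the sketch below. The class name, probe offsets, and limits are illustrative choices of mine, not Jellyfish's actual values.

<code python>
class ReprobingCounter:
    """Open-addressing counter with quadratic-style reprobing."""

    def __init__(self, array_bits=20, max_reprobes=62, load_limit=0.9):
        self.size = 1 << array_bits          # power-of-two array
        self.mask = self.size - 1
        self.keys = [None] * self.size
        self.vals = [0] * self.size
        self.used = 0
        self.max_reprobes = max_reprobes
        self.load_limit = load_limit

    def add(self, scrambled_key):
        """Return True if counted; False means 'dump to disk, start a new table'."""
        pos = scrambled_key & self.mask
        for reprobe in range(self.max_reprobes + 1):
            slot = (pos + reprobe * reprobe) & self.mask   # quadratic offsets
            if self.keys[slot] is None:                    # first open cell
                if self.used >= self.load_limit * self.size:
                    return False                           # ~90% full: too full
                self.keys[slot] = scrambled_key
                self.vals[slot] = 1
                self.used += 1
                return True
            if self.keys[slot] == scrambled_key:
                self.vals[slot] += 1
                return True
        return False                          # max reprobes reached

t = ReprobingCounter(array_bits=4)
assert t.add(0b1011) and t.add(0b1011)        # same key counted twice
</code>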
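The compare-and-swap update the notes describe has no direct user-level equivalent in Python, so the sketch below only emulates CAS with a lock in order to show the retry logic: try to claim an empty cell with your key; if another thread already put the same key there, use that cell; if a different key is there, reprobe. All names here are my own.

<code python>
import threading

EMPTY = None

class CasCell:
    def __init__(self):
        self._value = EMPTY
        self._lock = threading.Lock()   # stand-in for the hardware atomic

    def compare_and_swap(self, old, new):
        """CAS(a, x, y): write new only if the cell still holds old; return what was seen."""
        with self._lock:
            seen = self._value
            if seen == old:
                self._value = new
            return seen

def claim_slot(cells, scrambled_key, mask, max_reprobes=8):
    """Return the index of the cell holding scrambled_key, or None if we give up."""
    pos = scrambled_key & mask
    for reprobe in range(max_reprobes + 1):
        slot = (pos + reprobe * reprobe) & mask
        seen = cells[slot].compare_and_swap(EMPTY, scrambled_key)
        if seen is EMPTY or seen == scrambled_key:
            return slot               # we claimed it, or it was already ours
        # a different key got there first: memory collision, keep reprobing
    return None                       # max reprobes reached

cells = [CasCell() for _ in range(16)]
print(claim_slot(cells, 0b1011, mask=15))   # 11 on an empty table
</code>

Incrementing the per-cell count needs the same retry treatment; only one thread can win each CAS, and on real hardware the uncontended path is essentially a single atomic instruction, which is what keeps the no-collision case cheap.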
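Finally, the direct-addressing shortcut for small kmers: with only 4**k possible codes, a plain list indexed by the 2-bit code replaces the hash table entirely. A minimal sketch of my own (it slices strings for brevity; a real scan would use the rolling code above):

<code python>
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def count_small_kmers(sequences, k):
    counts = [0] * (4 ** k)                       # one slot per possible kmer
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if all(b in CODE for b in kmer):      # skip windows containing N
                code = 0
                for b in kmer:
                    code = (code << 2) | CODE[b]  # the code is the array index
                counts[code] += 1
    return counts

print(count_small_kmers(["ACGTACGT", "ACGTN"], 2)[0b0001])   # count of "AC" -> 3
</code>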