User Tools

Site Tools

Kevin - Jellyfish Notes

Basic Kmer Counting
	- Naïve python way (See Figure 1)
		○ Okay if you're in a microbial genome
		○ Runs out of memory
			§ ~60 bytes per entry
			§ Every error creates a unique kmer and therefore a unique entry
	- How to go faster
		○ Want a faster language
		○ Memory-mapped I/O
			§ Move data as a disk block to memory
		○ Process Kmers one character at a time to avoid building big strings
			§ Look at sequences a kmer at a time
		○ How to compress results (See Figure 2)
			§ Strings are inefficient way to store kmers
			§ Want to compress keys and values
				□ We have lots of 1s and a few large numbers
			§ We want a special purpose hash table
				□ Make a numeric code for a Kmer
					® w/ 4 letters you can encode it with 2 bits. But we have wildcards and N. You need 4 bits. What do we care about? 
					® Jellyfish only encodes for AGTC with 2 bits. We are ignoring then ambiguous or unknown sequences. 
					® Can restart sequences at N to avoid ambiguity problems
				□ Shifts to minimize machine commands
				□ We want spread out hash values
					® Jellyfish limits array size to a power of 2
				□ Invertible mapping - Can unscramble our scrambled and spread out key
					® Allows you to discard part of the scrambled key since when you know where you are in the array you know where you are vis a vis M. 
						◊ In a 64 bit system. We can store the key in 33 bits. We have 32 bits left for other things.
				□ We won't have perfect hashes (no collisions) since it is computationally expensive.
					® Can use a new hash function when you have one
					® Can increment by one to change hash -Linear probing
					® Jellyfish uses quadratic probing 2, 8, etc
						◊ Uses first open cell
						◊ 90% = too full
							} Can make a bigger table (not used here)
							} Jellyfish dumps the table to disk and uses a new table. Means you need to eventually needs to merge the tables.
							} We need to know the # of reprobes to figure out where it was in memory. To get the scrambled key and to reverse the process. Max # reprobes dictates how much space you need for this info. Max reprobes reached = dump to disk.
				□ What do we do when the value exceeds our left over space?
					® We can indicate in a field that there is over flow with 1111 -> 0000
						◊ We look at key for value 1. If there is indicated over flow we keep looking till we find the end 
		○ We need to sync multi core CPUs to avoid collisions in data structure.
			§ Lock structure would cause serialization bottleneck. 
				□ Wait in line model
			§ Want something with no costs during no-collisions while handling collisions that happen
			§ CPUs have CAS - Compare and Save
				□ Address, old value, new value (a, x, y)
				□ Look at address. If value at mem address is old value write new value. Returns old value. 
					® If x = y OK
					® If x does not equal y Some sort of collision.
						◊ If it’s a memory collision reprobe for mem
						◊ If it’s a hash collision re hash
				□ Only one can succeed. Supported by hardware.
			§ Almost linear speed increase. 
			§ Need to sync input side too. Not sure how they do that. 
				□ Can interleave - straight through most efficient for I/O
				□ Can start in widely spaced pages
		○ Small kmers it uses direct addressing to speed things up

Figure 1

Figure 2

You could leave a comment if you were logged in.
lecture_notes/05-29-2015.txt · Last modified: 2015/05/29 18:37 by chkan