lecture_notes:04-07-2011

This shows you the differences between two versions of the page.

Both sides previous revision Previous revision Next revision | Previous revision | ||

lecture_notes:04-07-2011 [2011/04/08 01:43] eyliaw [Hash Algorithm] |
lecture_notes:04-07-2011 [2015/07/28 08:12] (current) 149.155.218.110 ↷ Links adapted because of a move operation |
||
---|---|---|---|

Line 1: | Line 1: | ||

====== Jellyfish ====== | ====== Jellyfish ====== | ||

- | We talked about the [[bioinformatic_tools:jellyfish|Jellyfish]] tool. Details of the space-efficient algorithm were explained, and we looked at how to schedule a run on the Campusrocks cluster. | + | We talked about the [[archive:bioinformatic_tools:jellyfish|Jellyfish]] tool. Details of the space-efficient algorithm were explained, and we looked at how to schedule a run on the Campusrocks cluster. |

===== Scheduling a run on Campusrocks ===== | ===== Scheduling a run on Campusrocks ===== | ||

- | jellyfish -m 22 -o output -c 3 | + | Examples: |

- | qsub -hard -l num_proc=# jellyfish_script | + | Jellyfish ex. args: jellyfish -m 22 -o output -c 3 |

+ | | ||

+ | Scheduling process args: qsub -hard -l num_proc=8 jellyfish_script | ||

===== Hash Algorithm ===== | ===== Hash Algorithm ===== | ||

- | The initial kmers (4^k possible) are first changed to keys with an invertible binary matrix. The keys are then reduced by a modulus M=2^l, so that they range from [0, M-1], condensing the hash table. The remainder 2k-l keys are stored in front of the values. Because the keys are condensed, collisions are to be expected for dissimilar keys, so the algorithm uses a reprobe function: i(i+1)/2, where i is the number of reprobes; to resolve them. The first key that hashes to a position will store a reprobe value of 1 and following dissimilar keys will go to the next reprobe position and store i+1. There is a maximum limit to the number of reprobes, after which the algorithm will write the hash to disk(?) instead of reprobing. | + | The initial kmers (4^k possible) are first changed to keys with an invertible binary matrix. The keys are then reduced by a modulus M=2^l, so that they range from [0, M-1], condensing the hash table. The remainder 2k-l portion of the key is stored in front of each value. Because the keys are condensed, collisions are to be expected for dissimilar keys, so the algorithm uses a reprobe function: i(i+1)/2, where i is the number of reprobes; to resolve them. The first key that hashes to a position will store a reprobe value of 1 and following dissimilar keys will go to the next reprobe position and store i+1. There is also a maximum limit to the number of reprobes (depending on the length of the reprobe counter), after which the algorithm will write the hash to disk instead of reprobing. |

- | | + | |

- | The algorithm also limits the size of the value field, making use of the fact that the average number of reads is the coverage c. When it has filled, it will reprobe and store the higher order bits in the extension. In the reprobed position, it avoids repeating the 2k-l key and instead stores the number of reprobes from the original position, which is a shorter value. The extra space is used for the value field instead. | + | |

- | | + | |

- | ===== Multithreading Tricks ===== | + | |

+ | The algorithm also limits the size of the value field, making use of the fact that the average number of reads is the coverage c. When it has filled, it will reprobe and store the higher order bits in the extension. In the reprobed position, it avoids repeating the 2k-l key and instead stores the number of reprobes from the original position, which is a shorter value. The extra space is used for the value field. | ||

+ | ===== Multithreading ===== | ||

+ | In order to parallelize updates to the hash table, the algorithm uses the compare and swap (CAS) instruction, which will fail if two threads attempt to modify the same location at the same time. The CAS operation will look at a location and only replace the value stored there if it equals an expected value. |

lecture_notes/04-07-2011.1302252195.txt.gz · Last modified: 2011/04/08 01:43 by eyliaw