lecture_notes:05-11-2015
Meraculous Update!!
Hash algorithm!
Hash goals: no keys stored; perfect static hash (no collisions, immutable key set); the value [ACTG][ACTG] represents the forward and backward extensions
Only (k+2)-mers are hashed
Uses a series of hash functions to prevent collisions (4 steps)
Documentation does not mention the novel lightweight hash – need to look into the source
Couldn't find hash functions in source code!
Currently using one large hash with multithreading (with boost)
Prints to stdout as UFX.FN (FN of kmer key)
73 GB of UFX files with lines of the form [kmer] [ACTG][ACTG] (X = no extension, F = multiple extensions)
Memory usage worse than original since all extensions are hashed, not just U-U
Many kmers stored have invalid bases that are not used in making contigs (still hashed, though)
Packed DNA object pads the sequence length to a multiple of 4, divides the sequence into blocks of 4 bases, and maps each block to an integer (supposedly cuts storage from 1 byte per base to 1 byte per 4 bases, which is not helpful)
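A minimal sketch of the packing idea in the last note, to make the pad / block / integer steps concrete. This is not the Meraculous code: the 2-bit code table, the pad base, and the names pack_dna and unpack_dna are assumptions made for illustration only.

```python
# Hypothetical 2-bit code per base; the mapping Meraculous actually uses may differ.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_dna(seq, pad="A"):
    """Pad seq to a multiple of 4 bases, then pack each block of 4 into one byte."""
    padded = seq + pad * (-len(seq) % 4)
    packed = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for base in padded[i:i + 4]:
            # Shift the previous bases left by 2 bits and append this base's code.
            byte = (byte << 2) | CODE[base]  # raises KeyError on N or other invalid bases
        packed.append(byte)
    return bytes(packed)


def unpack_dna(packed, length):
    """Reverse pack_dna; the original length is needed to drop the padding."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

With this scheme a 21-mer takes 6 bytes instead of 21, and the original length has to be stored alongside the packed bytes because the padding is indistinguishable from real bases.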
Update on UX!
Last time – successful install and test assembly; was fixing the kmer assembly step (memory allocation was not being specified correctly)
qsub does not like “-w e” option while specifying memory
Section of the Perl script that submits the job to qsub (using more memory than the specified v_mem will kill the program)
Changed to mem_free, which was not responsive (gibberish will kill qsub)
coverage issues - insufficient k-mer depth
changed coverage requirement in perl script from 15x to 0
UUtigs.fa (contig file) created successfully
bubble popping stage failed due to inability to allocate memory (probably because it's running on the head node)
Results!
Pre-popping stats:
28610138 total contigs
542137 (1.89%) contigs >1000bp
15 (5.2e-5%) contigs >10000bp
the majority of contigs are single-read contigs (uncorrected unique kmers)
(to do: cumulative histogram, starting from the largest kmers and working down, with 0 at the high end)
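A rough sketch of the cumulative histogram suggested in the parenthetical note, assuming it is taken over contig lengths. The function names and the input list are hypothetical; nothing here reads the actual UUtigs.fa file.

```python
from collections import Counter


def cumulative_length_histogram(lengths):
    """Return (length, number of contigs at least this long) pairs, longest first.

    The running count is 0 above the longest contig and grows as the
    length cutoff moves down.
    """
    counts = Counter(lengths)
    running = 0
    rows = []
    for length in sorted(counts, reverse=True):
        running += counts[length]
        rows.append((length, running))
    return rows


def percent_longer_than(lengths, threshold):
    """Percentage of contigs strictly longer than threshold (as in the '>1000bp' stat)."""
    return 100.0 * sum(1 for n in lengths if n > threshold) / len(lengths)
```

For example, cumulative_length_histogram([100, 100, 1500, 12000]) returns [(12000, 1), (1500, 2), (100, 4)], and percent_longer_than on the same list with a threshold of 1000 returns 50.0.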
Quake error correction appears to run to completion, but does not produce output file
BLESS installation happening now.
BLESS uses a minimum-sized Bloom filter (a space-efficient probabilistic data structure)
Racer uses a hard, arbitrary threshold (/campusdata/BME235/bin/racer)
Bloom filter - old technique used for spell checkers; a way of storing strings so you can test whether a string belongs to the dictionary. It is a huge table of bits with several hash functions: each stored string (kmer) is run through the hash functions and the corresponding bits are set in the table. A lookup checks the same bits, so it never gives a false negative but can give a false positive with some probability (space-efficient but inexact)
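A minimal Bloom filter sketch to go with the description above. It is not how BLESS implements its filter: the class name, the use of SHA-256, and the double-hashing trick for deriving several bit positions from one digest are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Bit table plus several hash positions per item; no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes positions from one SHA-256 digest via double hashing
        # (an illustrative choice, not taken from any particular tool).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        # Set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if all bits are set: the item is either present or a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

After bf = BloomFilter(4096, 3) and bf.add("ACGTACGT"), the test "ACGTACGT" in bf is always True. A kmer that was never added usually tests False, but can test True when its bits happen to have been set by other kmers; a larger table or fewer stored items makes that less likely.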
Next steps!
Finish bubble popping step
Assess with CEGMA
Re-run (with error-corrected data and incorporated scaffold)
Run GapCloser and REAPR
Meta-Assembly
Annotate Genome
lecture_notes/05-11-2015.txt · Last modified: 2015/05/11 17:30 by thjmatthsoe