lecture_notes:05-11-2015

Meraculous Update!!

Hash algorithm!

  • Hash goals: no keys stored, perfect static hash (no collisions, immutable keys), [ACTG][ACTG] represents forward and backward extensions
  • Only hashes k+2 mers
  • Series of hash functions to prevent collisions (4 step)
  • Documentation does not mention the novel lightweight hash; need to look into the source
  • Couldn't find hash functions in source code!
  • Currently using one large hash with multithreading (with boost)
  • Prints to stdout as UFX.FN (FN of kmer key)
  • 73GB of UFX files with lines of [kmer] [ACTG][ACTG] (X = no extensions, F = multiple extensions)
  • Memory usage worse than original since all extensions are hashed, not just U-U
  • Many kmers stored have invalid bases that are not used in making contigs (still hashed, though)
  • Packed DNA object pads the length to be divisible by 4, divides the sequence into blocks of 4, and maps each block to an integer (supposedly cuts down from 1 byte per base to 1 byte per 4 nucleotides, which is not helpful)
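
The packing described in the last bullet can be sketched as follows. This is only an illustration of the idea (pad to a multiple of 4, then map each block of 4 bases to one integer that fits in a byte); the 2-bit code, the padding base, and the function names are assumptions, not taken from the Meraculous source.

```python
# Hypothetical 2-bit code; the real mapping used by the packed DNA object
# is not given in these notes.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq, pad="A"):
    """Pack a DNA string into bytes, four bases per byte."""
    # Pad so the length is divisible by 4.
    padded = seq + pad * (-len(seq) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        value = 0
        for base in padded[i:i + 4]:
            # Raises KeyError on invalid bases such as N.
            value = (value << 2) | CODE[base]
        out.append(value)
    return bytes(out)

def unpack(packed, length):
    """Recover the first `length` bases from the packed bytes."""
    bases = []
    for value in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(value >> shift) & 3])
    return "".join(bases[:length])
```

The original length has to be kept alongside the packed bytes, since the padding bases are otherwise indistinguishable from real ones.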

Update on UX!

  • Last time: successful install and test assembly; fixing the kmer assembly (memory allocation was not being specified correctly)
  • qsub does not like the “-w e” option when memory is also specified
  • Section of the perl script that submits the job to qsub: memory use over the specified v_mem will kill the program
  • Changed to mem_free, which was not responsive (gibberish will kill qsub)
  • Coverage issues: insufficient k-mer depth
  • Changed the coverage requirement in the perl script from 15x to 0
  • UUtigs.fa (contig file) created successfully
  • Bubble popping stage failed due to inability to allocate memory (probably because it's running on the head node)

Results!

Pre-popping stats:

  • 28610138 total contigs
  • 542137 (1.89%) contigs >1000bp
  • 15 (5.2e-5%) contigs >10000bp
  • The majority of contigs are single-read contigs (uncorrected unique kmers)
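
As a quick check, the percentages above follow directly from the counts. The helper name below is made up for illustration.

```python
def percent(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total

TOTAL_CONTIGS = 28610138  # total contigs reported above

print(f"{percent(542137, TOTAL_CONTIGS):.2f}%")  # contigs >1000bp
print(f"{percent(15, TOTAL_CONTIGS):.1e}%")      # contigs >10000bp
```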

(To do: cumulative histogram, starting from the largest kmers and working down, with 0 at the high end)
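
One possible way to build such a cumulative histogram is sketched below. It assumes the histogram is over contig lengths and that the lengths have already been read from the contig file into a list; the function name is hypothetical.

```python
from collections import Counter

def cumulative_from_largest(lengths):
    """Return (length, number of contigs at least that long) pairs,
    starting at the largest length and working down."""
    counts = Counter(lengths)
    running = 0  # the count is 0 above the largest length
    rows = []
    for length in sorted(counts, reverse=True):
        running += counts[length]
        rows.append((length, running))
    return rows

# Toy example with made-up lengths.
for length, n_contigs in cumulative_from_largest([5, 3, 5, 1]):
    print(length, n_contigs)
```

Because the count accumulates from the long end, the rare long contigs stay visible instead of being swamped by the single-read contigs.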

Other error correction tools!

  • Quake error correction appears to run to completion, but does not produce an output file
  • BLESS installation happening now.
  • BLESS uses minimum-sized Bloom filter (space-efficient probabilistic data structure)
  • Racer uses hard arbitrary threshold (/campusdata/BME235/bin/racer)
  • Bloom filter: an old technique used for spell checkers, a method of storing strings so that a query can be tested for membership in a dictionary. It is a huge table of bits with hash functions: each string (kmer) is run through the hash functions and the resulting positions are set in the bit table. A lookup won't give a false negative but can give a false positive, so the answer is only the probable presence of that kmer (efficient but inexact)
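
A minimal Bloom filter sketch for kmers, to make the bit-table idea concrete. This is not how BLESS implements it; the table size, the number of hash functions, and the use of salted BLAKE2b as the hash family are all arbitrary choices made for illustration.

```python
import hashlib

class BloomFilter:
    """Bit table plus several hash functions (illustrative sizes only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive num_hashes different hash functions by salting BLAKE2b.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode("ascii"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer):
        # Set one bit per hash function.
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # True means "probably present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(kmer)
        )

bf = BloomFilter()
bf.add("ACGTACGTACGT")
print("ACGTACGTACGT" in bf)
```

A kmer that was added always tests as present, while a kmer that was never added can occasionally test as present if other kmers happen to have set all of its bits.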

Next steps!

  1. Finish bubble popping step
  2. Assess with CEGMA
  3. Re-run (with error-corrected data and incorporated scaffold)
  4. Run GapCloser and REAPR
  5. Meta-Assembly
  6. Annotate Genome

Discussion

2015/05/11 19:01

Wow this is already really well done!

lecture_notes/05-11-2015.txt · Last modified: 2015/05/11 17:30 by thjmatthsoe