======Team 1 report: assembly with Meraculous======

=====Basic features=====

Published by the US Department of Energy. Meraculous was initially designed for haploid assembly, but currently supports diploid assembly as well. The advantages of this assembler include multi-threaded and parallelized computation, absence of an error-correction step for faster processing, compatibility with paired-end short reads (e.g., Illumina), efficient and conservative traversal of subgraphs of the de Bruijn graph, selection of the kmer set, production of a set of maximal linear sub-paths of the de Bruijn graph, alignment of reads to the assembly in order to identify useful read-pair information, and closure of gaps. Meraculous has been used to assemble the 15.4 Mb //Pichia stipitis// genome from 75 bp paired reads at 425x coverage. The resulting assembly covered 95% of the genome with an N50 of 101 kb.

=====Meraculous algorithm=====

  * Counts occurrences of each kmer in the data set
  * Removes kmers whose frequency is below the user threshold
  * For each kmer, counts the number of high-quality single-base extensions
  * Classifies the 5' and 3' ends of each kmer
  * Stores the extensions of classified kmers in a hash
  * Removes non-reciprocal linkages between kmers
  * Selects kmers at random and extends outwards to produce contigs
  * Aligns all reads to contigs via BLAST
  * Assembles contigs into scaffolds using paired-end data
  * Searches unaligned reads as potential gap-closers using mate-pair data

=====Meraculous limitations=====

  * The assembler relies on high-quality data in order to avoid error correction
  * Initial release did not support polyploid genome assembly because it only allowed for linear subgraphs of the de Bruijn graph
  * Low memory footprint

=====User experience=====

  * Requires an array of other scripts in other languages
  * Most of the high-level scripts are written in Perl
  * Tested the program on a small dataset and obtained contigs

=====Installation=====

  * Main issue was getting all dependencies together
  * One non-standard Perl module was needed
  * Some scripts contain errors, but they aren't hard to fix

=====Running Meraculous=====

  * Execute the run_meraculous.sh script along with the configuration file
  * The configuration file contains info on where the data is and what format it comes in
  * A run creates a timestamped folder that includes directories containing the results of each step and executables to modify the run
  * After a failed run, you can check the errors that caused the failure and resume the run
  * Logs are informative

=====Overall impression=====

  * Straightforward to figure out what went wrong, requiring only a basic understanding of Perl
  * Handles all directory creation for you
  * Logs are very useful

=====Error correction=====

  * Meraculous has no error-correction step of its own, so reads require error correction and adapter removal beforehand. Trimming is unnecessary.
  * High error rates stop the assembler, so errors need to be removed first
  * The kmer size chosen directly affects assembly quality

=====KmerGenie=====

Meraculous requires an optimal kmer size for runs.
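
To make the role of kmer size a bit more concrete, the sketch below counts kmers in a toy read set for a few values of k and tabulates an abundance histogram (how many distinct kmers occur once, twice, and so on). It is only an illustration under simplifying assumptions (forward strand only, no quality scores), not how Meraculous or KmerGenie work internally; the function names and the reads are made up for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map each abundance value to the number of distinct kmers with that abundance."""
    return Counter(counts.values())


def filter_kmers(counts, min_count):
    """Keep only kmers seen at least min_count times (a simple frequency cutoff)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Hypothetical toy reads, far shorter than real Illumina data.
    reads = ["ACGTACGT", "CGTACGTA", "ACGTTCGT"]
    for k in (3, 4, 5):
        counts = count_kmers(reads, k)
        hist = abundance_histogram(counts)
        print(f"k={k}: {len(counts)} distinct kmers, histogram={sorted(hist.items())}")
```

The same counts are what a frequency cutoff would be applied to, in the spirit of the kmer-removal step in the algorithm list above; here that cutoff is the hypothetical ''min_count'' argument.
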
KmerGenie is a program that estimates the optimal kmer size for assembly by generating abundance histograms for many values of k. Here is a link that helped me understand KmerGenie: http://kmergenie.bx.psu.edu/.

=====Musket=====

=====Previous analysis=====

=====Future directions=====