Team 1 report: assembly with Meraculous

Basic features

Meraculous is published by the Joint Genome Institute, part of the US Department of Energy. It was initially designed for haploid assembly, but currently supports diploid assembly as well. The advantages of this assembler include multi-threaded and parallelized computation, no error-correction step (for faster processing), compatibility with paired-end short reads (e.g., Illumina), efficient and conservative traversal of subgraphs of the de Bruijn graph, selection of the kmer set, production of a set of maximal linear sub-paths of the de Bruijn graph, and alignment of reads to the assembly in order to identify useful read-pair information and close gaps. Meraculous has been used to assemble the 15.4 Mb Pichia stipitis genome using 75 bp paired reads at 425x coverage. The resulting assembly covered 95% of the genome and had an N50 of 101 kb.
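To make the N50 and coverage figures above concrete, here is a minimal Python sketch. The function names and the toy contig lengths are made up for illustration; only the genome size, read length, and coverage come from the text. N50 is taken here as the contig length at which the running total of contig lengths, sorted longest first, reaches half of all assembled bases.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted longest first; N50 is the length of the contig
    at which the running total reaches half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


def reads_for_coverage(genome_size, read_length, coverage):
    """Approximate number of reads needed for a target depth of coverage."""
    return round(coverage * genome_size / read_length)


# Toy contig lengths, invented for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print("toy N50:", n50(toy_contigs))

# Figures quoted above: 15.4 Mb genome, 75 bp reads, 425x coverage.
print("reads needed:", reads_for_coverage(15_400_000, 75, 425))
```

By this arithmetic, 425x coverage of a 15.4 Mb genome with 75 bp reads corresponds to roughly 87 million reads.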

Meraculous algorithm

Counts occurrences of each kmer in the data set.
Removes kmers whose frequency is below a threshold provided by the user.
For each kmer, counts the number of high-quality single-base extensions
Classifies the 5' and 3' ends of each kmer as X, U, or F, corresponding to having zero, one, or multiple high-quality single-base extensions
Stores the extensions of kmers with a classification in a hash
Removes non-reciprocal U-U extensions between kmers (i.e. an extension where the end of one kmer is marked as U but the facing end of the other is marked F).
Stores the linear subgraph of U-U extensions
Selects kmers at random and extends them outwards to produce contigs
Aligns all reads to contigs via BLAST
Assembles contigs into scaffolds using paired-end data
Searches unaligned reads as potential gap-closers using mate-pair data
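
The counting, filtering, classification, and contig-extension steps above can be sketched in Python as follows. This is a simplified illustration, not the Meraculous implementation. It makes several assumptions: reverse complements are ignored, base quality scores are replaced by a support count (an extension counts only if it is seen at least min_count times), and all function and constant names are invented here. Each kmer end is labeled by how many supported bases extend it: NO_EXT for none, FORK for several, or the extending base itself when there is exactly one.

```python
from collections import Counter, defaultdict

NO_EXT = "none"   # no supported extension at this end
FORK = "fork"     # more than one supported extension at this end


def classify_extensions(reads, k, min_count):
    """Map each retained kmer to (left_code, right_code).

    A code is NO_EXT, FORK, or the single extending base when unique.
    Assumption: support counts stand in for base quality scores.
    """
    counts = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[kmer] += 1
            if i > 0:
                left[kmer][read[i - 1]] += 1
            if i + k < len(read):
                right[kmer][read[i + k]] += 1

    def code(base_counts):
        supported = [b for b, n in base_counts.items() if n >= min_count]
        if not supported:
            return NO_EXT
        return supported[0] if len(supported) == 1 else FORK

    # Drop kmers below the abundance threshold, classify the rest.
    return {kmer: (code(left[kmer]), code(right[kmer]))
            for kmer, n in counts.items() if n >= min_count}


def build_contig(seed, ext):
    """Extend a seed kmer both ways along reciprocal unique extensions."""
    contig, seen = seed, {seed}
    kmer = seed
    while ext[kmer][1] not in (NO_EXT, FORK):      # rightward walk
        nxt = kmer[1:] + ext[kmer][1]
        # The neighbor must exist, be unvisited, and point back uniquely.
        if nxt not in ext or nxt in seen or ext[nxt][0] != kmer[0]:
            break
        contig += nxt[-1]
        seen.add(nxt)
        kmer = nxt
    kmer = seed
    while ext[kmer][0] not in (NO_EXT, FORK):      # leftward walk
        prev = ext[kmer][0] + kmer[:-1]
        if prev not in ext or prev in seen or ext[prev][1] != kmer[-1]:
            break
        contig = prev[0] + contig
        seen.add(prev)
        kmer = prev
    return contig


# Toy reads, invented for illustration only.
reads = ["ACGTTGCA"] * 2 + ["GTTGCATT"] * 2
ext = classify_extensions(reads, k=4, min_count=2)
print(build_contig("TTGC", ext))
```

The walk stops at any end with no extension or with a fork, which is why the resulting contigs are conservative: ambiguous branches in the graph are never guessed at.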

Meraculous limitations

The assembler relies on high-quality data in order to avoid error correction, and it also requires high coverage
The initial release did not support polyploid genome assembly because it allowed only linear subgraphs of the de Bruijn graph
High disk space usage

User experience

Requires an array of other scripts in other languages
Most of the high-level scripts are written in Perl
Tested the program with the packaged test data and obtained contigs

Installation

The main issue was the new version of GCC and getting all the dependencies together (~16 hrs)
One non-standard Perl module was needed
Some files contained carriage returns
Some scripts contain errors, but they aren't hard to fix.
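
For the carriage-return problem, a minimal Python sketch of one possible cleanup is shown below. It assumes the stray carriage returns were Windows-style CRLF line endings, which the notes above do not confirm, and the function name is invented for illustration.

```python
from pathlib import Path


def strip_carriage_returns(path):
    """Convert CRLF line endings to LF in place.

    Returns True if the file was changed, False if it was already clean.
    """
    target = Path(path)
    data = target.read_bytes()
    cleaned = data.replace(b"\r\n", b"\n")
    if cleaned == data:
        return False
    target.write_bytes(cleaned)
    return True
```

Working on bytes rather than text avoids any guess about the file's encoding.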

Running Meraculous

Execute the run_meraculous.sh script along with a user-provided configuration file
The configuration file contains info on where the data is and what format it comes in
Creates a timestamped folder that includes directories containing results of each step and executables to suspend, resume, or restart the run from that step
Thorough error-logging at each step, allowing you to check the errors that made a run fail and then resume the run after fixing the errors
SGE-aware: handles qsub submission and job monitoring

Overall impression

Straightforward to figure out what went wrong; it just requires a basic understanding of Perl
Handles all directory creation for you
Logs are very useful

Error correction

Meraculous performs no error correction of its own, so reads need error correction and adapter removal before assembly. Trimming is unnecessary, as low-quality reads are ignored during contig formation.
High error rates bog down the assembler, so errors need to be removed.
The kmer size chosen directly affects assembly quality

KmerGenie

Meraculous requires an optimal kmer size for runs. KmerGenie is a program that suggests an optimal assembly kmer size by generating abundance histograms for many values of k. Here is a link that helped me understand KmerGenie: http://kmergenie.bx.psu.edu/.
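
To show what an abundance histogram is, here is a minimal Python sketch that builds one for each of several values of k. It is only an illustration of the histograms themselves: it is not KmerGenie's code, it does not cover how KmerGenie chooses the best k from them, and the function names and toy reads are invented.

```python
from collections import Counter


def abundance_histogram(reads, k):
    """Return {abundance: number of distinct kmers seen that many times}."""
    kmer_counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return dict(sorted(Counter(kmer_counts.values()).items()))


def histograms_for(reads, k_values):
    """Build one abundance histogram per candidate kmer size."""
    return {k: abundance_histogram(reads, k) for k in k_values}


# Toy reads, invented for illustration only.
toy_reads = ["ACGTACGT", "ACGTACGT"]
for k, hist in histograms_for(toy_reads, [4, 6]).items():
    print(k, hist)
```

In real data, kmers that appear only once or twice are mostly sequencing errors, while genuine genomic kmers cluster around the coverage depth, so the shape of the histogram changes with k.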

Musket

Previous analysis

Future directions

Banana Slug Genomics
