PE-Assembler is a genome assembly tool that makes use of short, paired-end reads. Unlike many similar tools, PE-Assembler uses a 3' extension approach instead of building a de Bruijn graph.
PE-Assembler uses no error correction but still requires a pool of error-free, non-repetitive reads for the initial seed-building step. To obtain this pool, reads are selected if the frequencies of their kmers fall between the “solid kmer threshold” and the “repeat kmer threshold”. Both thresholds are read off the kmer frequency histogram. The solid kmer threshold is the trough between the low-frequency peak of erroneous kmers and the peak that closely follows the expected coverage. The repeat kmer threshold is the trough between the peak for kmers that appear once in the genome and the peak for kmers that appear twice. In practice, this second trough can be difficult to detect.
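The read-selection step can be sketched in a few lines of Python. This is a minimal illustration, not PE-Assembler's code: the function names are hypothetical, troughs are detected as strict local minima of an unsmoothed histogram (real histograms are noisy and would likely need smoothing), reverse complements are ignored, and it assumes a read is kept only when every one of its kmers lies between the two thresholds.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every kmer of length k across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def find_troughs(histogram):
    """Return frequencies that are strict local minima of the histogram.

    histogram maps a kmer frequency to the number of distinct kmers that
    occur with that frequency. Assumption: no smoothing is applied.
    """
    max_freq = max(histogram)
    return [
        f for f in range(2, max_freq)
        if histogram.get(f, 0) < histogram.get(f - 1, 0)
        and histogram.get(f, 0) < histogram.get(f + 1, 0)
    ]


def select_seed_reads(reads, counts, k, solid_threshold, repeat_threshold):
    """Keep reads whose kmers all have a frequency within both thresholds."""
    selected = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if kmers and all(
            solid_threshold <= counts[kmer] <= repeat_threshold
            for kmer in kmers
        ):
            selected.append(read)
    return selected


# Usage: build the histogram from the counts, then look for the two troughs.
reads = ["AAAC", "AAAC", "AAAC", "AAAG"]
counts = count_kmers(reads, k=3)
histogram = Counter(counts.values())
```

On real data the first trough returned would be a candidate for the solid kmer threshold and the next one a candidate for the repeat kmer threshold, which is exactly the one that tends to be shallow and hard to find.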
Next, seeds are constructed from the “solid” kmers by 3' extension. For a contig to be successfully built, it must be longer than “MaxSpan” (the maximum distance between the two reads of a pair) and its ends must be verified by at least one paired-end read that overlaps the end of the contig.
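A greedy 3' extension can be sketched as follows. This is a simplified illustration under stated assumptions, not PE-Assembler's actual procedure: the function names and the `min_overlap` and `max_length` parameters are hypothetical, reads are treated as error-free single-end sequences on the forward strand, and the MaxSpan and paired-end verification checks described above are not modeled.

```python
from collections import Counter


def next_base_votes(contig, reads, min_overlap):
    """Collect votes for the base that follows the 3' end of the contig.

    A read votes if one of its prefixes (at least min_overlap long) matches
    the contig's suffix and the read continues past the contig end.
    """
    votes = Counter()
    for read in reads:
        longest = min(len(read) - 1, len(contig))
        # Try the longest possible overlap first.
        for overlap in range(longest, min_overlap - 1, -1):
            if contig.endswith(read[:overlap]):
                votes[read[overlap]] += 1
                break
    return votes


def extend_3prime(seed, reads, min_overlap, max_length=10_000):
    """Extend the seed one base at a time while the next base is unambiguous."""
    contig = seed
    while len(contig) < max_length:
        votes = next_base_votes(contig, reads, min_overlap)
        if len(votes) != 1:  # no overlapping read, or conflicting bases
            break
        contig += next(iter(votes))
    return contig
```

Stopping on any disagreement is a deliberately conservative choice for the sketch; a real assembler would need some way to resolve conflicting extensions, for example by using the mate of each read.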
Once the seeds are constructed, an iterative process is used to extend the contigs using overlap extension.
Scaffolds are constructed using “chimeric” paired-end reads (reads that map to two different contigs) of different lengths to order and orient the contigs.
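The link-counting part of scaffolding can be illustrated with a short Python sketch. It assumes each read pair has already been mapped and is represented as a hypothetical `(contig_of_read1, contig_of_read2)` tuple; the `min_links` cutoff is also an assumption, and contig orientation and gap-size estimation are left out.

```python
from collections import Counter


def count_links(pair_mappings):
    """Count read pairs whose two reads map to two different contigs."""
    links = Counter()
    for contig_a, contig_b in pair_mappings:
        if contig_a != contig_b:  # pairs inside one contig carry no link
            links[frozenset((contig_a, contig_b))] += 1
    return links


def supported_adjacencies(links, min_links):
    """Return contig pairs joined by at least min_links read pairs."""
    return sorted(
        tuple(sorted(pair)) for pair, n in links.items() if n >= min_links
    )
```

Requiring more than one linking pair is a common way to keep a single mis-mapped read from joining unrelated contigs; the right cutoff depends on coverage.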
Finally, gaps between contigs are filled in using a less-stringent minimum overlap length. Adjacent contigs whose gap can be successfully bridged are merged into a single contig.
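The effect of relaxing the minimum overlap can be shown with a small sketch of the final merge decision. This models only the last step, where the ends of two adjacent contigs already overlap; the read-based extension into the gap is not shown, and the function name and threshold values are hypothetical.

```python
def merge_if_overlapping(left, right, min_overlap):
    """Merge two adjacent contigs if the 3' end of left overlaps the 5' end
    of right by at least min_overlap bases; return None otherwise."""
    longest = min(len(left), len(right))
    for overlap in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:overlap]):
            return left + right[overlap:]
    return None


# A stringent threshold leaves the contigs separate; a relaxed one merges them.
strict = merge_if_overlapping("ACGTTGCA", "TGCAAGG", min_overlap=5)
relaxed = merge_if_overlapping("ACGTTGCA", "TGCAAGG", min_overlap=4)
```

Lowering the threshold raises the chance of a spurious join, which is presumably why it is applied only to contigs that scaffolding has already placed next to each other.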
PE-Assembler requires long insert sizes and high coverage, and it was evaluated on smaller bacterial genomes, so it would not be the best tool for assembling the banana slug genome with our data.