SOAP = Short Oligonucleotide Analysis Package
SOAPdenovo assembles short oligonucleotide reads into contigs and scaffolds.
It performs de novo assembly of short reads using de Bruijn graphs.
It can use paired-end data with a hierarchy of insert sizes.
It has been successfully used to assemble the panda and human genomes.
Quality seems good. They ran the panda genome on a 512 GB workstation with 32 CPUs.
There is a description which contains a download link.
Created by BGI (Beijing Genomics Institute).
The sequence and de novo assembly of the giant panda genome[1]
Downloaded the binaries for SOAPdenovo and GapCloser. I copied these binaries to the /bin folder since there are only two, and they at least correctly display a help message when you run them without arguments.
The method for SOAPdenovo is described in the paper "De novo assembly of human genomes with massively parallel short read sequencing" by Li et al.
Fragment and paired-end libraries are sequenced using various insert sizes. They used read lengths from 35 to 75 bp and insert sizes of 140 bp, 440 bp, 2.6 kb, 6 kb, and 9.6 kb.
Basic error correction was performed on the reads using K-mer counting to reduce the memory usage when constructing the de Bruijn graph. This was done by identifying low frequency (occurring <3 times) 17-mers and correcting these K-mers to the candidate with the highest frequency.
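The k-mer counting step can be sketched roughly as follows. This is a minimal illustration, not the code from the paper: the function names are hypothetical, and it assumes that "correcting to the candidate with the highest frequency" means trying every single-base substitution of a rare k-mer and keeping the most frequent one. The paper used 17-mers and a threshold of 3; the toy usage note below uses 5-mers.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer that occurs in the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_kmer(kmer, counts, min_count=3):
    """Replace a rare k-mer with its most frequent single-base variant.

    Assumption: candidates differ from the k-mer by exactly one base.
    The k-mer is returned unchanged if it is already frequent enough
    or if no candidate reaches min_count.
    """
    if counts[kmer] >= min_count:
        return kmer
    best, best_count = kmer, counts[kmer]
    for pos in range(len(kmer)):
        for base in "ACGT":
            if base == kmer[pos]:
                continue
            candidate = kmer[:pos] + base + kmer[pos + 1:]
            if counts[candidate] > best_count:
                best, best_count = candidate, counts[candidate]
    return best if best_count >= min_count else kmer
```

For example, with five copies of the read ACGTACGTAC and one copy containing an error (ACGTACGTTC), the rare 5-mer CGTTC is corrected to CGTAC, while frequent k-mers are left alone.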
The next step is to build the de Bruijn graph to represent the overlaps between k-mers. 25-mers were used in their assembly. Only the single-end reads and the paired-end reads with short insert sizes (<1 kb) were used in the graph, due to the high probability of chimeric reads in the long-insert pairs from the circularization and fragmentation process. Further error correction is done using the de Bruijn graph.
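To make the graph construction concrete, here is a small sketch of a de Bruijn graph in which each node is a k-mer and an edge joins two k-mers that are adjacent in a read (and therefore overlap by k-1 bases). It is a simplification for illustration only: the function name is hypothetical, reverse complements are ignored, and nothing here reflects SOAPdenovo's actual data structures.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Consecutive k-mers in a read overlap by k-1 bases.
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph
```

With the read ACGTAC and k=3, the graph is the simple path ACG -> CGT -> GTA -> TAC. Reads that share k-mers end up sharing nodes, which is how overlaps between reads are represented without comparing reads pairwise.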
Unable to get SOAPdenovo to read any kind of FASTA file, despite the FASTA examples in the documentation. Tried many variants of the FASTA file, and tried all 5 versions available for download, but could not get it to work. The other example in the documentation uses FASTQ. Found and installed the sff2fastq utility and made a FASTQ version of the 454 data. SOAPdenovo was then finally able to run.
Perhaps it just won't take the fasta input by itself. It might work if you include a qual file with your fasta.
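If a FASTA file and a matching quality file are both available, another route to FASTQ is to merge them directly. The sketch below is only an illustration of that conversion: the function names are hypothetical, it assumes the quality file uses the common layout of ">name" headers followed by space-separated integer Phred scores, and it writes Phred+33 quality characters. It has not been tested against SOAPdenovo.

```python
def read_records(lines):
    """Yield (name, body_lines) for each '>' record in FASTA-style input."""
    name, body = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, body
            header = line[1:].split()
            name, body = (header[0] if header else ""), []
        else:
            body.append(line)
    if name is not None:
        yield name, body


def fasta_qual_to_fastq(fasta_lines, qual_lines):
    """Merge FASTA sequences and integer quality scores into FASTQ text."""
    quals = {name: " ".join(body).split()
             for name, body in read_records(qual_lines)}
    out = []
    for name, body in read_records(fasta_lines):
        seq = "".join(body)
        scores = [int(q) for q in quals[name]]
        if len(scores) != len(seq):
            raise ValueError(f"length mismatch for read {name}")
        # Assumption: Phred+33 encoding for the output quality string.
        qual_string = "".join(chr(q + 33) for q in scores)
        out.append(f"@{name}\n{seq}\n+\n{qual_string}\n")
    return "".join(out)
```

Both arguments can be open file handles or lists of lines. All quality records are held in memory, which is fine for a sketch but may be too much for a full 454 run.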
SOAPdenovo has three executables, each tuned for a different range of k-mer sizes (SOAPdenovo-31mer, SOAPdenovo-63mer, SOAPdenovo-127mer). For example, SOAPdenovo-31mer works best on k-mer sizes up to and including 31. For k-mer sizes larger than 31 and smaller than 64, use SOAPdenovo-63mer.
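When scripting runs over several k-mer sizes, the choice of executable can be wrapped in a small helper. This is a hypothetical convenience function, not part of SOAPdenovo. The upper bound of 127 for SOAPdenovo-127mer is an assumption taken from the executable's name.

```python
def pick_soapdenovo_binary(k):
    """Return the name of the SOAPdenovo executable suited to k-mer size k."""
    if k < 1:
        raise ValueError("k-mer size must be positive")
    if k <= 31:
        return "SOAPdenovo-31mer"
    if k <= 63:
        return "SOAPdenovo-63mer"
    if k <= 127:  # assumed limit, inferred from the executable name
        return "SOAPdenovo-127mer"
    raise ValueError("no SOAPdenovo executable listed for k > 127")
```

For the 25-mers used in the paper this returns SOAPdenovo-31mer.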
SOAPdenovo requires a configuration file that describes the libraries that will be used in the assembly. A library entry is required for each read file or pair of read files in the case of paired-end reads. Here is an example of the 5 library entries for 1 lane of run1.
[LIB]
#average insert size
avg_ins=150
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#in which order the reads are used while scaffolding
rank=1
#fastq file for read 1
q1=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_1_qseq_seqprep.cor.fastq.gz
#fastq file for read 2 always follows fastq file for read 1
q2=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_2_qseq_seqprep.cor.fastq.gz
[LIB]
reverse_seq=0
asm_flags=3
rank=1
q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_1_qseq_seqprep.cor_single.fastq.gz
[LIB]
reverse_seq=1
asm_flags=3
rank=1
q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_2_qseq_seqprep.cor_single.fastq.gz
[LIB]
reverse_seq=0
asm_flags=3
rank=1
q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_merged_qseq_seqprep.cor.fastq.gz
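Since every lane needs the same set of library entries with different file names, the config file can be generated rather than written by hand. The following is a minimal sketch: the function names and the example file names are hypothetical, and it only reproduces the key=value layout shown in the example entries, without validating the settings.

```python
def format_library(settings):
    """Format one [LIB] entry from a dict of setting names and values."""
    lines = ["[LIB]"]
    lines.extend(f"{key}={value}" for key, value in settings.items())
    return "\n".join(lines)


def format_config(libraries):
    """Format a list of library dicts as the text of a config file."""
    return "\n".join(format_library(lib) for lib in libraries) + "\n"


# Hypothetical file names, for illustration only.
example_libraries = [
    {"avg_ins": 150, "reverse_seq": 0, "asm_flags": 3, "rank": 1,
     "q1": "reads_1.fastq.gz", "q2": "reads_2.fastq.gz"},
    {"reverse_seq": 0, "asm_flags": 3, "rank": 1,
     "q": "reads_single.fastq.gz"},
]

print(format_config(example_libraries), end="")
```

The settings are written in the order they appear in each dict, so q1 should be listed before q2 to keep the read 2 file directly after the read 1 file.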
cd /campusdata/BME235/programs
wget http://soap.genomics.org.cn/down/SOAPdenovo-v1.04.tgz
tar xfz SOAPdenovo-v1.04.tgz
mv SOAPdenovo_Release1.04 SOAPdenovo
mv SOAPdenovo-v1.04.tgz SOAPdenovo/
cd SOAPdenovo
cp SOAPdenovo ../../bin/
Discussion
I ran SOAPdenovo on the 454 Pog reads. I tried all 5 available versions of SOAPdenovo, and none of them could read even the simplest FASTA file, despite the documentation saying that FASTA is supported. The only other format the documentation shows is FASTQ, so I found a free program that converts SFF to FASTQ and ran that. SOAPdenovo then ran, but the largest contig it made was only 4k. Just as with Velvet, it does better with the data from set 1 or set 2 alone; with both sets of 454 data together it performs worse, i.e. smaller contigs are generated. This seems unexpected.
I now think it might read a FASTA file if you also provide a quality file. Perhaps it cannot use FASTA alone.