SOAPdenovo

Overview

SOAP = Short Oligonucleotide Analysis Package
SOAPdenovo performs de novo assembly of short reads (short oligonucleotides) into contigs and scaffolds using de Bruijn graphs.
It can use a hierarchy of paired-end libraries with different insert sizes.
It has been used successfully to assemble the giant panda and human genomes.
Assembly quality seems good. The panda genome assembly was run on a workstation with 512 GB of memory and 32 CPUs.

The SOAPdenovo website (listed under Website below) has a description and a download link.

Created by BGI (Beijing Genomics Institute).

The sequence and de novo assembly of the giant panda genome[1]

Downloaded the binaries for SOAPdenovo and GapCloser. I copied these binaries to the /bin folder since there are only two, and they at least correctly display a help message when you run them without arguments.

Method

The method for SOAPdenovo is described in the paper "De novo assembly of human genomes with massively parallel short read sequencing" by Li et al.

Short Read Data

Fragment and paired-end libraries are sequenced using various insert sizes. They used read lengths from 35 to 75 bp and insert sizes of 140 bp, 440 bp, 2.6 kb, 6 kb, and 9.6 kb.

Basic error correction was performed on the reads using k-mer counting, which reduces memory usage when constructing the de Bruijn graph. Low-frequency 17-mers (those occurring fewer than 3 times) were identified and corrected to the candidate 17-mer with the highest frequency.
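The counting-and-correcting idea can be sketched in a few lines of Python. This is a minimal illustration, not SOAPdenovo's own code: the function names are hypothetical, and the candidate search (trying every single-base substitution) is an assumption, since the notes above only say that a rare k-mer is corrected to the highest-frequency candidate.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_kmer(kmer, counts, min_count=3):
    """Return the most frequent trusted single-base variant of a rare k-mer.

    A k-mer seen at least min_count times is returned unchanged, and so is
    a rare k-mer that has no trusted single-substitution neighbour.
    """
    if counts[kmer] >= min_count:
        return kmer
    best, best_count = kmer, 0
    for pos in range(len(kmer)):
        for base in "ACGT":
            if base == kmer[pos]:
                continue
            # Assumption: candidates differ from the rare k-mer by one base.
            candidate = kmer[:pos] + base + kmer[pos + 1:]
            count = counts.get(candidate, 0)
            if count >= min_count and count > best_count:
                best, best_count = candidate, count
    return best
```

With k=17 and min_count=3 this matches the thresholds quoted above; a small k is easier to follow when experimenting by hand.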

De Bruijn Graph

The next step is to build the de Bruijn graph, which represents the overlaps between k-mers. 25-mers were used in their assembly. Only the single-end reads and the paired-end reads with short insert sizes (<1 kb) were used in the graph, due to the high probability of chimeric reads in the long-insert pairs produced by the circularization and fragmentation process. Further error correction is done using the de Bruijn graph.

Clip tips (low coverage paths that lead to dead ends)
Remove low coverage links
Resolve tiny repeats greater than K, but less than the read lengths.
Merge bubbles (paths with the same start and end). These can represent an error or a true polymorphism.
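To make the graph and the first cleanup step concrete, here is a small Python sketch of building a de Bruijn graph and clipping tips. It is a simplification under stated assumptions, not SOAPdenovo's implementation: nodes are (k-1)-mers, coverage is ignored, a tip is judged only by its length in nodes, and the names build_de_bruijn, clip_tips, and max_tip_len are hypothetical.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def clip_tips(graph, max_tip_len):
    """Return a copy of the graph without short dead-end branches.

    A tip is a path of at most max_tip_len nodes that ends in a node with
    no successors and hangs off a node that has another way forward.
    """
    succ = {node: set(targets) for node, targets in graph.items()}
    pred = defaultdict(set)
    for node, targets in succ.items():
        for target in targets:
            pred[target].add(node)
    dead_ends = sorted(n for n in pred if not succ.get(n))
    for end in dead_ends:
        path, node = [end], end
        # Walk backwards along the unbranched path leading to the dead end.
        while len(path) <= max_tip_len and len(pred[node]) == 1:
            parent = next(iter(pred[node]))
            if len(succ[parent]) > 1:
                # Branch point reached: drop the short path hanging off it.
                succ[parent].discard(node)
                for tip_node in path:
                    succ.pop(tip_node, None)
                    pred.pop(tip_node, None)
                break
            path.append(parent)
            node = parent
    return succ
```

Because this sketch looks only at length, max_tip_len has to stay well below the length of genuine paths, otherwise the end of a real contig could be clipped instead of the erroneous branch. A real assembler would also weigh coverage, as the list above indicates.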

Quirks

NO FASTA, ONLY FASTQ

Unable to get SOAPdenovo to read any kind of FASTA file, despite the FASTA examples in the documentation. Tried many variants of the FASTA file, and even tried all 5 versions available for download, but could not get it to work. The other documentation example uses FASTQ, so found and installed the sff2fastq utility and made a FASTQ version of the 454 data. With that, SOAPdenovo finally ran.

Perhaps it just won't take FASTA input by itself. It might work if you include a qual file with your FASTA file.
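If only a FASTA file is available, one possible workaround is to wrap it as FASTQ with placeholder qualities. The Python sketch below is an untested idea in that direction, not something confirmed to satisfy SOAPdenovo: the function names are hypothetical, the constant quality value is made up rather than measured, and Phred+33 encoding is assumed.

```python
def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def fasta_to_fastq(text, phred=30):
    """Convert FASTA text to FASTQ text with one constant quality score.

    Assumption: qualities are encoded as Phred+33, so a score of 30
    becomes the character '?'. The score is a placeholder, not real data.
    """
    qual_char = chr(phred + 33)
    records = []
    for name, seq in parse_fasta(text):
        records.append(f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n")
    return "".join(records)
```

Real quality values, such as those sff2fastq extracts from the 454 data, are preferable whenever they exist.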

Using SOAPdenovo

SOAPdenovo has three executables, each tuned for a different range of k-mer sizes (SOAPdenovo-31mer, SOAPdenovo-63mer, SOAPdenovo-127mer). For example, SOAPdenovo-31mer works best for k-mer sizes up to and including 31. For k-mer sizes from 32 to 63, use SOAPdenovo-63mer.
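The choice of executable can be captured in a small helper. This is only a convenience sketch: the function name is hypothetical, and the upper limit of 127 for SOAPdenovo-127mer is assumed from the executable's name rather than stated above.

```python
def pick_soapdenovo_binary(k):
    """Return the name of the executable suited to k-mer size k."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    if k <= 31:
        return "SOAPdenovo-31mer"
    if k <= 63:
        return "SOAPdenovo-63mer"
    if k <= 127:
        # Assumption: the 127mer build handles k up to and including 127.
        return "SOAPdenovo-127mer"
    raise ValueError("no SOAPdenovo executable listed for k > 127")
```

For the 25-mers used in the human assembly described above, this returns SOAPdenovo-31mer.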

SOAPdenovo requires a configuration file that describes the libraries that will be used in the assembly. A library entry is required for each read file, or for each pair of read files in the case of paired-end reads. Here is an example of the library entries for 1 lane of run1: four [LIB] sections covering 5 read files.

[LIB]
#average insert size
avg_ins=150

#if sequence needs to be reversed 
reverse_seq=0

#in which part(s) the reads are used
asm_flags=3

#in which order the reads are used while scaffolding
rank=1

#fastq file for read 1
q1=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_1_qseq_seqprep.cor.fastq.gz
#fastq file for read 2 always follows fastq file for read 1
q2=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_2_qseq_seqprep.cor.fastq.gz

[LIB]
reverse_seq=0
asm_flags=3
rank=1
q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_1_qseq_seqprep.cor_single.fastq.gz

[LIB]
reverse_seq=1
asm_flags=3
rank=1
q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_2_qseq_seqprep.cor_single.fastq.gz

[LIB]
reverse_seq=0
asm_flags=3
rank=1
q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_merged_qseq_seqprep.cor.fastq.gz
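Before launching a long assembly it can help to check that a configuration file lists the libraries you expect. The Python sketch below reads the [LIB] sections of a config like the one above into dictionaries. It is a hypothetical helper, not part of SOAPdenovo, and it makes simplifying assumptions: each key appears at most once per section, comments start at the beginning of a line, and any lines before the first [LIB] are ignored.

```python
def parse_soap_config(text):
    """Return one dict of key/value strings per [LIB] section."""
    libs = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line == "[LIB]":
            current = {}
            libs.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return libs

def read_files(lib):
    """List the read files named by one library entry (q1, q2, then q)."""
    return [lib[key] for key in ("q1", "q2", "q") if key in lib]
```

Summing len(read_files(lib)) over all parsed libraries gives the number of read files the assembly will use, which is a quick sanity check against the files on disk.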

Installing

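# work in the shared programs directory
cd /campusdata/BME235/programs
# download and unpack release 1.04
wget http://soap.genomics.org.cn/down/SOAPdenovo-v1.04.tgz
tar xfz SOAPdenovo-v1.04.tgz
# rename the unpacked directory and keep the tarball inside it
mv SOAPdenovo_Release1.04 SOAPdenovo
mv SOAPdenovo-v1.04.tgz SOAPdenovo/
# copy the SOAPdenovo binary into the shared bin directory
cd SOAPdenovo
cp SOAPdenovo ../../bin/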

Website

http://soap.genomics.org.cn/soapdenovo.html

Source with Binaries and Documentation

http://soap.genomics.org.cn/down/

References

1. The sequence and de novo assembly of the giant panda genome. Nature 463, 311-317 (21 January 2010). doi:10.1038/nature08696. Received 19 August 2009; accepted 24 November 2009; published online 13 December 2009.
