

===== SOAPdenovo =====

==== Overview ====

SOAP = **S**hort **O**ligonucleotide **A**nalysis **P**ackage\\
SOAPdenovo assembles short oligonucleotide reads into contigs and scaffolds, doing de novo assembly of short reads with de Bruijn graphs.\\
It can use paired-end data with a hierarchy of insert sizes.\\
It has been successfully used to assemble the giant panda and human genomes.\\
Quality seems good. They ran the panda genome assembly on a 512 GB workstation with 32 CPUs.

There is a [[http://soap.genomics.org.cn/soapdenovo.html|description]] which contains a download link.

Created by BGI - [[wp>Beijing_Genomics_Institute]].

**The sequence and de novo assembly of the giant panda genome** [(cite:panda> The sequence and de novo assembly of the giant panda genome Nature 463, 311-317 (21 January 2010)\\ doi:[[http://dx.doi.org/10.1038/nature08696|10.1038/nature08696]];\\ Received 19 August 2009; Accepted 24 November 2009; Published online 13 December 2009 )]

Downloaded the binaries for SOAPdenovo and [[http://soap.genomics.org.cn/about.html#resource2|GapCloser]]. I copied these binaries to the /bin folder since there are only two, and they at least correctly display a help message when you run them without arguments.

==== Method ====

The method for SOAPdenovo is described in the paper [[http://genome.cshlp.org/content/20/2/265.full|"De novo assembly of human genomes with massively parallel short read sequencing"]] by Li et al.

=== Short Read Data ===

Fragment and paired-end libraries are sequenced using various insert sizes. They used read lengths from 35 to 75 bp and insert sizes of 140 bp, 440 bp, 2.6 kb, 6 kb, and 9.6 kb.

Basic error correction was performed on the reads using k-mer counting, to reduce memory usage when constructing the de Bruijn graph. This was done by identifying low-frequency 17-mers (occurring <3 times) and correcting each of these k-mers to the candidate with the highest frequency.

=== De Bruijn Graph ===

The next step is to build the de Bruijn graph to represent overlap of k-mers.
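
As a rough illustration of what "overlap of k-mers" means, here is a minimal Python sketch. It is not SOAPdenovo's actual data structure or code: the function name ''build_kmer_graph'', the toy reads, and the tiny k are all made up for the example, and it ignores reverse complements, which a real assembler has to handle. Each k-mer is a node, and consecutive k-mers in a read (which overlap by k-1 bases) are linked by an edge.

```python
from collections import defaultdict


def build_kmer_graph(reads, k):
    """Toy de Bruijn-style graph (hypothetical helper, not SOAPdenovo code).

    Nodes are k-mers. An edge links two k-mers that appear consecutively
    in some read, i.e. they overlap by k-1 bases.
    Returns (edges, coverage): successor sets and per-k-mer counts.
    """
    edges = defaultdict(set)
    coverage = defaultdict(int)
    for read in reads:
        # Slide a window of width k along the read.
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            coverage[kmer] += 1
        # Link each k-mer to the next one in the same read.
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return edges, coverage


# Made-up reads; k is kept tiny only so the graph is easy to read.
reads = ["ACGTAC", "CGTACG"]
edges, coverage = build_kmer_graph(reads, k=4)
for node in sorted(edges):
    print(node, "->", sorted(edges[node]), "coverage:", coverage[node])
```

The per-k-mer counts are kept because steps such as removing low-coverage links need some notion of coverage. How SOAPdenovo actually stores and uses it is not shown here.
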
25-mers were used in their assembly. Only the single-end reads and the paired-end reads with short insert sizes (<1 kb) were used in the graph, due to the high probability of chimeric reads in the long-insert pairs from the circularizing and fragmentation process.

Further error correction is done using the de Bruijn graph.

  * Clip tips (low-coverage paths that lead to dead ends)
  * Remove low-coverage links
  * Resolve tiny repeats longer than K but shorter than the read length
  * Merge bubbles (paths with the same start and end). These can represent an error or a true polymorphism.

==== Quirks ====

=== NO FASTA, ONLY FASTQ ===

Unable to get SOAPdenovo to read any kind of FASTA file, despite the FASTA examples in the documentation. Tried many variants of the FASTA file, and even tried all 5 versions available for download, but could not get it to work. The other example shows the use of FASTQ. Found and installed the sff2fastq utility and made a FASTQ version of the 454 data. With that, SOAPdenovo finally ran.

Perhaps it just won't take FASTA input by itself. It might work if you include a qual file with your FASTA.

==== Installing ====

  cd /campusdata/BME235/programs
  wget http://soap.genomics.org.cn/down/SOAPdenovo-v1.04.tgz
  tar xfz SOAPdenovo-v1.04.tgz
  mv SOAPdenovo_Release1.04 SOAPdenovo
  mv SOAPdenovo-v1.04.tgz SOAPdenovo/
  cd SOAPdenovo
  cp SOAPdenovo ../../bin/

==== Website ====

[[http://soap.genomics.org.cn/soapdenovo.html]]

==== Source with Binaries and Documentation ====

[[http://soap.genomics.org.cn/down/]]

===== References =====

<refnotes>notes-separator: none</refnotes>
~~REFNOTES cite~~

Discussion

, 2010/04/19 19:35

I ran SOAPdenovo on the 454 Pog reads. I tried all 5 versions of SOAPdenovo that are available, and none of them can read even the simplest FASTA file, despite all the documentation saying that it is supported. The only other format it shows is FASTQ, so I found a free program that converts SFF to FASTQ and ran that. After it was done, the largest contig SOAPdenovo made was only 4k. And just like with velvet, it actually does better with the 454 data from set 1 or set 2 alone; with both sets together it performs worse, i.e. smaller contigs are generated. Seems unexpected.

I now think it might read a fasta if you also provide a quality file. It perhaps cannot use fasta alone.
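
If the FASTA-plus-quality idea does not pan out, another route is to merge the two files into a FASTQ. Below is a minimal Python sketch of that merge, under some assumptions: the quality file is FASTA-style with space-separated integer Phred scores (as 454 .qual files usually are), the scores are encoded with an ASCII offset of 33, and both files use the same record IDs. The function names are made up for the example, and whether SOAPdenovo accepts the output or expects that offset has not been checked here.

```python
def parse_fasta_like(lines):
    """Yield (record_id, body_lines) from FASTA-style text lines."""
    name, body = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, body
            header = line[1:].split()
            # Record ID is the first word of the header line.
            name, body = (header[0] if header else ""), []
        else:
            body.append(line)
    if name is not None:
        yield name, body


def fasta_qual_to_fastq(fasta_lines, qual_lines, offset=33):
    """Merge sequences and integer Phred scores into FASTQ text.

    offset=33 is an assumption (Sanger-style encoding), not something
    confirmed from the SOAPdenovo documentation.
    """
    quals = {
        name: [int(q) for part in body for q in part.split()]
        for name, body in parse_fasta_like(qual_lines)
    }
    out = []
    for name, body in parse_fasta_like(fasta_lines):
        seq = "".join(body)
        scores = quals[name]  # KeyError if the read has no quality record
        if len(scores) != len(seq):
            raise ValueError(f"length mismatch for read {name}")
        qual_str = "".join(chr(s + offset) for s in scores)
        out.append(f"@{name}\n{seq}\n+\n{qual_str}\n")
    return "".join(out)


# Made-up one-read example.
fasta = [">r1 example read", "ACGT"]
qual = [">r1", "40 40 30 20"]
print(fasta_qual_to_fastq(fasta, qual), end="")
```

For real files, pass open file handles in place of the two lists, since the parser only needs an iterable of lines.
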

archive/bioinformatic_tools/soapdenovo.1305752927.txt.gz · Last modified: 2011/05/18 21:08 by svohr