Differences

This shows you the differences between two versions of the page.

--- archive:bioinformatic_tools:soapdenovo [2010/04/14 10:44]
galt
+++ archive:bioinformatic_tools:soapdenovo [2015/07/28 06:26] (current)
ceisenhart ↷ Page moved from bioinformatic_tools:soapdenovo to archive:bioinformatic_tools:soapdenovo
@@ Line 3: / Line 3: @@
 ==== Overview ====
-SOAP = Short_Oligonucleotide_Analysis_Package\\
+SOAP = **S**hort **O**ligonucleotide **A**nalysis **P**ackage\\
 SOAPdenovo assembles short oligonucleotides into contigs and scaffolds
 for de-novo assembly of short-reads using de Bruijn graphs.\\
 Can use a hierarchy of sizes of paired-end data.\\
 Has been successfully used to sequence the Panda and Human genomes.\\
-Quality seems good.  They ran Panda genome on 256GB workstation with 32 CPUs.
+Quality seems good.  They ran Panda genome on 512GB workstation with 32 CPUs.
 There is a [[http://soap.genomics.org.cn/soapdenovo.html|description]] which contains a
 download link.
-Created by BGI - Beijing Genomics Institute [[http://en.wikipedia.org/wiki/Beijing_Genomics_Institute|wiki]].
+Created by BGI - [[wp>Beijing_Genomics_Institute]].
 **The sequence and de novo assembly of the giant panda genome**
 [(cite:panda>
-  Nature 463, 311-317 (21 January 2010)\\
+The sequence and de novo assembly of the giant panda genome
-  doi:[[10.1038/nature08696|10.1038/nature08696]]; Received 19 August 2009; Accepted 24 November 2009; Published online 13 December 2009
+Nature 463, 311-317 (21 January 2010)\\
-  The sequence and de novo assembly of the giant panda genome
+doi:[[http://dx.doi.org/10.1038/nature08696|10.1038/nature08696]];\\
-[[http://www.nature.com/nature/journal/v463/n7279/full/nature08696.html|article]]
+Received 19 August 2009; Accepted 24 November 2009; Published online 13 December 2009
+)]
 Downloaded the binaries for SOAPdenovo and [[http://soap.genomics.org.cn/about.html#resource2|GapCloser]]. I copied these binaries to the /bin folder since there are only two, and they at least correctly display a help message when you run them without arguments.
+==== Method ====
+The method for SOAPdenovo is described in the paper [[http://genome.cshlp.org/content/20/2/265.full|"De novo assembly of human genomes with massively parallel short read sequencing"]] by Li et al.
+=== Short Read Data ===
+Fragment and paired-end libraries are sequenced using various insert sizes. They used read lengths from 35 to 75 bp and insert sizes of 140 bp, 440 bp, 2.6 kb, 6 kb, and 9.6 kb.
+Basic error correction was performed on the reads using K-mer counting to reduce the memory usage when constructing the de Bruijn graph. This was done by identifying low frequency (occurring <3 times) 17-mers and correcting these K-mers to the candidate with the highest frequency.
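The k-mer counting step described above can be sketched in a few lines of Python. This is only an illustration, not the SOAPdenovo implementation: the function names are hypothetical, candidates are assumed to be k-mers one substitution away from the low-frequency k-mer, and reverse complements are ignored.

```python
from collections import Counter


def count_kmers(reads, k=17):
    """Count every k-mer (17-mers by default) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_frequency_kmers(counts, min_count=3):
    """Return the k-mers seen fewer than min_count times (likely errors)."""
    return {kmer for kmer, n in counts.items() if n < min_count}


def best_correction(kmer, counts, min_count=3):
    """Return the most frequent k-mer one substitution away, or None.

    Assumption: only single-base substitutions are tried, and a candidate
    must itself occur at least min_count times to be trusted.
    """
    best, best_n = None, min_count - 1
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = kmer[:i] + alt + kmer[i + 1:]
            n = counts.get(candidate, 0)
            if n > best_n:
                best, best_n = candidate, n
    return best
```

A real corrector works on whole reads and has to keep the table of 17-mers in memory for a full genome, which is where most of the engineering effort goes.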
+=== De Bruijn Graph ===
+The next step is to build the de Bruijn graph to represent overlap of k-mers. 25-mers were used in their assembly.
+Only the single-end and paired-end reads with short insert sizes (<1 kb) were used in the graph due to the high probability of chimeric reads in the long-insert pairs from the circularization and fragmentation process. Further error correction is done using the de Bruijn graph.
+    * Clip tips (low coverage paths that lead to dead ends)
+    * Remove low coverage links
+    * Resolve tiny repeats greater than K, but less than the read lengths.
+    * Merge bubbles (paths with the same start and end). These can represent an error or a true polymorphism.
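To make the graph cleanup steps above more concrete, here is a minimal Python sketch of a de Bruijn graph with per-link coverage, low-coverage link removal, and dead-end detection (the starting point for tip clipping). The function names are hypothetical, reverse complements are ignored, and repeat resolution and bubble merging are not attempted; it is not how SOAPdenovo itself stores the graph.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to {successor node: coverage}.

    Every k-mer in a read contributes one link from its prefix
    to its suffix; repeated k-mers raise the link's coverage.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def remove_low_coverage_links(graph, min_cov):
    """Return a copy of the graph without links below min_cov."""
    return {
        node: {succ: cov for succ, cov in links.items() if cov >= min_cov}
        for node, links in graph.items()
    }


def dead_ends(graph):
    """Return nodes that are reached by a link but have no outgoing link."""
    reached = {succ for links in graph.values() for succ in links}
    return {node for node in reached if not graph.get(node)}
```

In a real assembler a dead end is only clipped as a tip when the path leading to it is short and poorly covered; the true end of a contig is also a dead end and must be kept.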
+==== Quirks ====
+=== NO FASTA, ONLY FASTQ ===
+Unable to get SOAPdenovo to read any kind of FASTA file, despite
+the documentation FASTA examples.  Tried many variants of the FASTA file,
+even tried all 5 versions available for download, but
+could not get it to work.  The other example shows the use of FASTQ.
+Found and installed sff2fastq utility. Made FASTQ version of the 454 data.
+Was able to get SOAPdenovo to run finally.
+Perhaps it just won't take FASTA input by itself.
+It might work if you include a qual file with your FASTA.
+==== Using SOAPdenovo ====
+SOAPdenovo has three executables, each tuned for a different range of k-mer sizes (''SOAPdenovo-31mer'', ''SOAPdenovo-63mer'', ''SOAPdenovo-127mer''). For example, ''SOAPdenovo-31mer'' works best on k-mer sizes up to and including 31. For k-mer sizes from 32 to 63, use ''SOAPdenovo-63mer''.
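The choice of executable can be captured in a small helper. This is a sketch with a hypothetical function name; the upper limit of 127 for ''SOAPdenovo-127mer'' is an assumption based on the executable's name, not something stated above.

```python
def choose_soapdenovo_binary(k):
    """Return the name of the executable suited to k-mer size k.

    Assumption: SOAPdenovo-127mer covers k from 64 to 127, by analogy
    with the 31mer and 63mer executables.
    """
    if k < 1:
        raise ValueError("k-mer size must be positive")
    if k <= 31:
        return "SOAPdenovo-31mer"
    if k <= 63:
        return "SOAPdenovo-63mer"
    if k <= 127:
        return "SOAPdenovo-127mer"
    raise ValueError("no SOAPdenovo executable handles k > 127")
```

For example, ''choose_soapdenovo_binary(25)'' returns ''SOAPdenovo-31mer'', matching the 25-mers used in the paper's assembly.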
+SOAPdenovo requires a configuration file that describes the libraries that will be used in the assembly. A library entry is required for each read file, or for each pair of read files in the case of paired-end reads. Here is an example with four library entries, covering the 5 read files for 1 lane of run1.
+<code>
+[LIB]
+#average insert size
+avg_ins=150
+#if sequence needs to be reversed
+reverse_seq=0
+#in which part(s) the reads are used
+asm_flags=3
+#in which order the reads are used while scaffolding
+rank=1
+#fastq file for read 1
+q1=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_1_qseq_seqprep.cor.fastq.gz
+#fastq file for read 2 always follows fastq file for read 1
+q2=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_2_qseq_seqprep.cor.fastq.gz
+[LIB]
+reverse_seq=0
+asm_flags=3
+rank=1
+q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_1_qseq_seqprep.cor_single.fastq.gz
+[LIB]
+reverse_seq=1
+asm_flags=3
+rank=1
+q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_2_qseq_seqprep.cor_single.fastq.gz
+[LIB]
+reverse_seq=0
+asm_flags=3
+rank=1
+q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_merged_qseq_seqprep.cor.fastq.gz
+</code>
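Before launching a long assembly it can help to sanity-check the configuration file. The sketch below reads the ''[LIB]'' blocks into a list of dictionaries; the function name is hypothetical and the parser only understands the syntax shown in the example above (''[LIB]'' headers, ''#'' comments, ''key=value'' lines). Python's ''configparser'' is not used because it rejects repeated section names by default.

```python
def parse_soapdenovo_libs(text):
    """Parse [LIB] blocks of a SOAPdenovo config into a list of dicts.

    Sketch only: lines outside any [LIB] block are ignored, and a key
    repeated inside one block keeps its last value.
    """
    libs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "[LIB]":
            libs.append({})  # start a new library entry
            continue
        key, sep, value = line.partition("=")
        if not sep or not libs:
            continue  # not key=value, or not inside a [LIB] block
        libs[-1][key.strip()] = value.strip()
    return libs


def paired_entries_complete(libs):
    """Check that every entry with q1 also has q2, and vice versa."""
    return all(("q1" in lib) == ("q2" in lib) for lib in libs)
```

Run on the example configuration above, this should yield four entries, the first of which is the only paired-end one.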
+==== Installing ====
+  cd /campusdata/BME235/programs
+  wget http://soap.genomics.org.cn/down/SOAPdenovo-v1.04.tgz
+  tar xfz SOAPdenovo-v1.04.tgz
+  mv SOAPdenovo_Release1.04 SOAPdenovo
+  mv SOAPdenovo-v1.04.tgz SOAPdenovo/
+  cd SOAPdenovo
+  cp SOAPdenovo ../../bin/
 ==== Website ====
-[[http://?]]
+[[http://soap.genomics.org.cn/soapdenovo.html]]
 ==== Source with Binaries and Documentation ====
-[[http://?]]
+[[http://soap.genomics.org.cn/down/]]
 ===== References =====

Banana Slug Genomics
