archive:bioinformatic_tools:soapdenovo

Differences

This shows you the differences between two versions of the page.

archive:bioinformatic_tools:soapdenovo [2010/04/09 19:22]
galt created
archive:bioinformatic_tools:soapdenovo [2015/07/28 06:26] (current)
ceisenhart ↷ Page moved from bioinformatic_tools:soapdenovo to archive:bioinformatic_tools:soapdenovo
Line 1: Line 1:
 ===== SOAPdenovo =====
-====High Level Overview==== 
  
-SOAP = Short_Oligonucleotide_Analysis_Package\\
 +==== Overview ====
 + 
 +SOAP = **S**hort **O**ligonucleotide **A**nalysis **P**ackage\\
 SOAPdenovo assembles short oligonucleotides into contigs and scaffolds
 for de novo assembly of short reads using de Bruijn graphs.\\
 It can use paired-end libraries with a hierarchy of insert sizes.\\
-Has been successfully used to sequence the Panda and Human genomes.
 +It has been successfully used to assemble the giant panda and human genomes.\\
 +Quality seems good. They ran the panda genome assembly on a 512 GB workstation with 32 CPUs.
  
-There is a [[http://soap.genomics.org.cn/soapdenovo.html|description]] which contains
 +There is a [[http://soap.genomics.org.cn/soapdenovo.html|description]] which contains a
 download link.
  
-Created by BGI - Beijing Genomics Institute [[http://en.wikipedia.org/wiki/Beijing_Genomics_Institute|wiki]].
 +Created by BGI - [[wp>Beijing_Genomics_Institute]].
 + 
 +**The sequence and de novo assembly of the giant panda genome** 
 +[(cite:​panda>​ 
 +The sequence and de novo assembly of the giant panda genome 
 +Nature 463, 311-317 (21 January 2010)\\ 
 +doi:[[http://dx.doi.org/10.1038/nature08696|10.1038/​nature08696]];\\ 
 +Received 19 August 2009; Accepted 24 November 2009; Published online 13 December 2009 
 +)] 
 + 
 + 
 +Downloaded the binaries for SOAPdenovo and [[http://​soap.genomics.org.cn/​about.html#​resource2|GapCloser]]. I copied these binaries to the /bin folder since there are only two, and they at least correctly display a help message when you run them without arguments. 
 + 
 +==== Method ==== 
 +The method for SOAPdenovo is described in the paper [[http://​genome.cshlp.org/​content/​20/​2/​265.full|"​De novo assembly of human genomes with massively parallel short read sequencing"​]] by Li et al. 
 + 
 +=== Short Read Data === 
 +Fragment and paired-end libraries are sequenced using various insert sizes. They used read lengths from 35 to 75 bp and insert sizes of 140 bp, 440 bp, 2.6 kb, 6 kb, and 9.6 kb. 
 + 
 +Basic error correction was performed on the reads using k-mer counting, which reduces memory usage when constructing the de Bruijn graph. Low-frequency 17-mers (occurring fewer than 3 times) were identified and corrected to the candidate k-mer with the highest frequency.
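
The k-mer counting idea can be sketched in a few lines of Python. This is only an illustration, not the code used by SOAPdenovo: the function names are made up for this page, and the choice of candidates (k-mers that differ by a single base substitution) is an assumption, since the exact candidate search is not described here.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer that occurs in the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_kmer(kmer, counts, min_count=3):
    """Return kmer unchanged if it is frequent enough; otherwise return the
    most frequent trusted k-mer that differs by one substitution (assumed
    candidate set), or kmer itself if no such candidate exists."""
    if counts[kmer] >= min_count:
        return kmer
    best, best_count = kmer, 0
    for i in range(len(kmer)):
        for base in "ACGT":
            if base == kmer[i]:
                continue
            candidate = kmer[:i] + base + kmer[i + 1:]
            c = counts.get(candidate, 0)
            if c >= min_count and c > best_count:
                best, best_count = candidate, c
    return best

# Toy data: five copies of a correct read and one read with a single error.
reads = ["ACGTAC"] * 5 + ["ACGTTC"]
counts = count_kmers(reads, 4)
print(correct_kmer("CGTT", counts))
```

The paper used k=17 and a threshold of 3; the toy example uses k=4 only so that the reads stay short.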
 + 
 +=== De Bruijn Graph ===  
 +The next step is to build the de Bruijn graph, which represents the overlaps between k-mers. 25-mers were used in their assembly.
 +Only the single-end reads and the paired-end reads with short insert sizes (<1 kb) were used to build the graph, because the circularization and fragmentation steps used for long-insert libraries give those pairs a high probability of being chimeric. Further error correction is done using the de Bruijn graph:
 + 
 +    * Clip tips (low coverage paths that lead to dead ends) 
 +    * Remove low coverage links 
 +    * Resolve tiny repeats that are longer than k but shorter than the read length.
 +    * Merge bubbles (paths with the same start and end). These can represent an error or a true polymorphism. 
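
To make the graph construction and the low-coverage link removal above more concrete, here is a minimal Python sketch. It is a simplification under stated assumptions, not SOAPdenovo's implementation: nodes are (k-1)-mers, reverse complements are ignored, tip clipping and bubble merging are left out, and the function names and the ''min_cov'' parameter are invented for this example.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Build a de Bruijn graph from reads.

    Each k-mer contributes one edge from its (k-1)-mer prefix to its
    (k-1)-mer suffix. Returns the adjacency map and per-edge coverage."""
    graph = defaultdict(set)
    coverage = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            graph[left].add(right)
            coverage[(left, right)] += 1
    return graph, coverage

def remove_low_coverage_links(graph, coverage, min_cov):
    """Return a copy of the graph without edges seen fewer than min_cov times."""
    pruned = defaultdict(set)
    for left, rights in graph.items():
        for right in rights:
            if coverage[(left, right)] >= min_cov:
                pruned[left].add(right)
    return pruned

# Toy data: three consistent reads and one read with an error in the last base.
reads = ["ACGTA"] * 3 + ["ACGTT"]
graph, coverage = build_de_bruijn(reads, 3)
pruned = remove_low_coverage_links(graph, coverage, min_cov=2)
print(sorted(graph["GT"]), sorted(pruned["GT"]))
```

In the toy data the erroneous read creates a second, weakly supported edge out of node ''GT''; removing links with coverage below 2 leaves only the well-supported path.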
 + 
 +==== Quirks ==== 
 + 
 +=== NO FASTA, ONLY FASTQ === 
 +I was unable to get SOAPdenovo to read any kind of FASTA file, despite
 +the FASTA examples in the documentation. I tried many variants of the FASTA file,
 +and even tried all 5 versions available for download, but
 +could not get it to work. The other documentation example uses FASTQ, so I
 +found and installed the sff2fastq utility and made a FASTQ version of the 454 data.
 +With that, SOAPdenovo finally ran.
 + 
 +Perhaps it just won't take the fasta input by itself. 
 +It might work if you include a qual file with your fasta. 
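
If only FASTA is available and there is no qual file, one possible workaround is to write a FASTQ file with a placeholder quality for every base. The sketch below is an untested idea with respect to SOAPdenovo itself: the function names are invented here, and the constant quality character ''I'' (Phred 40 in the Phred+33 encoding) is an arbitrary assumption. Real qualities, for example from sff2fastq, are preferable whenever they exist.

```python
def fastq_record(name, sequence, quality_char):
    """Return the four FASTQ lines for one record with a constant quality."""
    return ["@" + name, sequence, "+", quality_char * len(sequence)]

def fasta_to_fastq(fasta_lines, quality_char="I"):
    """Yield FASTQ lines for each FASTA record in fasta_lines.

    Multi-line sequences are joined; quality_char is a placeholder,
    not a measured base quality."""
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield from fastq_record(name, "".join(chunks), quality_char)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield from fastq_record(name, "".join(chunks), quality_char)

# Example with two small records; the second sequence line of r1 is a continuation.
example = [">r1", "ACG", "T", ">r2", "GG"]
print("\n".join(fasta_to_fastq(example)))
```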
 + 
 +==== Using SOAPdenovo ==== 
 + 
 +SOAPdenovo has three executables, each tuned for a different range of k-mer sizes (''SOAPdenovo-31mer'', ''SOAPdenovo-63mer'', ''SOAPdenovo-127mer''). For example, ''SOAPdenovo-31mer'' works best on k-mer sizes up to and including 31. For k-mer sizes larger than 31 and smaller than 64, use ''SOAPdenovo-63mer''.
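
The rule for picking an executable can be written down as a small helper, which may be handy in a wrapper script. The function name is made up for this page, and the upper limit of 127 for ''SOAPdenovo-127mer'' is an assumption based on the executable's name rather than something tested here.

```python
def choose_soapdenovo_binary(k):
    """Return the name of the SOAPdenovo executable suited to k-mer size k."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    if k <= 31:
        return "SOAPdenovo-31mer"
    if k <= 63:
        return "SOAPdenovo-63mer"
    if k <= 127:  # assumed limit, inferred from the executable name
        return "SOAPdenovo-127mer"
    raise ValueError("k is larger than any executable listed here supports")

for k in (25, 31, 33, 63, 65):
    print(k, choose_soapdenovo_binary(k))
```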
 + 
 +SOAPdenovo requires a configuration file that describes the libraries that will be used in the assembly. A library entry is required for each read file, or for each pair of read files in the case of paired-end reads. Here is an example with four library entries (covering five read files) for one lane of run1.
 + 
 +<​code>​ 
 +[LIB] 
 +#average insert size 
 +avg_ins=150 
 + 
 +#if sequence needs to be reversed  
 +reverse_seq=0 
 + 
 +#in which part(s) the reads are used 
 +asm_flags=3 
 + 
 +#in which order the reads are used while scaffolding 
 +rank=1 
 + 
 +#fastq file for read 1 
 +q1=/​campusdata/​BME235/​data/​slug/​clean/​run1_seqprep_quake/​s_1_1_qseq_seqprep.cor.fastq.gz 
 +#fastq file for read 2 always follows fastq file for read 1 
 +q2=/​campusdata/​BME235/​data/​slug/​clean/​run1_seqprep_quake/​s_1_2_qseq_seqprep.cor.fastq.gz 
 + 
 +[LIB] 
 +reverse_seq=0 
 +asm_flags=3 
 +rank=1 
 +q=/​campusdata/​BME235/​data/​slug/​clean/​run1_seqprep_quake/​s_1_1_qseq_seqprep.cor_single.fastq.gz 
 + 
 +[LIB] 
 +reverse_seq=1 
 +asm_flags=3 
 +rank=1 
 +q=/​campusdata/​BME235/​data/​slug/​clean/​run1_seqprep_quake/​s_1_2_qseq_seqprep.cor_single.fastq.gz 
 + 
 +[LIB] 
 +reverse_seq=0 
 +asm_flags=3 
 +rank=1 
 +q=/​campusdata/​BME235/​data/​slug/​clean/​run1_seqprep_quake/​s_1_merged_qseq_seqprep.cor.fastq.gz 
 + 
 +</​code>​ 
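
When there are many lanes, writing these entries by hand gets tedious, so it may be easier to generate the file. The following Python sketch renders ''[LIB]'' sections from dictionaries. The keys are the ones used in the example above; the function names and the file paths in the usage example are hypothetical.

```python
def render_library(settings):
    """Render one [LIB] section from a dict of key/value settings."""
    lines = ["[LIB]"]
    lines.extend(f"{key}={value}" for key, value in settings.items())
    return "\n".join(lines)

def render_config(libraries):
    """Render a whole configuration file from a list of library dicts."""
    return "\n\n".join(render_library(lib) for lib in libraries) + "\n"

# Hypothetical paths; replace them with the real read files.
libraries = [
    {"avg_ins": 150, "reverse_seq": 0, "asm_flags": 3, "rank": 1,
     "q1": "lane1_read1.fastq.gz", "q2": "lane1_read2.fastq.gz"},
    {"reverse_seq": 0, "asm_flags": 3, "rank": 1,
     "q": "lane1_read1_single.fastq.gz"},
]
print(render_config(libraries), end="")
```

Keeping ''q1'' before ''q2'' in each dictionary preserves the rule that the file for read 2 follows the file for read 1.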
 + 
 + 
 + 
 +==== Installing ====
  
-  Nature 463, 311-317 (21 January 2010) | doi:10.1038/nature08696; Received 19 August 2009; Accepted 24 November 2009; Published online 13 December 2009
-  The sequence and de novo assembly of the giant panda genome
-[[http://www.nature.com/nature/journal/v463/n7279/full/nature08696.html|article]]
 +  cd /campusdata/BME235/programs
 +  wget http://soap.genomics.org.cn/down/SOAPdenovo-v1.04.tgz
 +  tar xfz SOAPdenovo-v1.04.tgz
 +  mv SOAPdenovo_Release1.04 SOAPdenovo 
 +  mv SOAPdenovo-v1.04.tgz SOAPdenovo/ 
 +  cd SOAPdenovo 
 +  cp SOAPdenovo ../../bin/
  
  
 +==== Website ====
 +[[http://​soap.genomics.org.cn/​soapdenovo.html]]
  
 +==== Source with Binaries and Documentation ====
 +[[http://​soap.genomics.org.cn/​down/​]]
  
 +===== References =====
 +<​refnotes>​notes-separator:​ none</​refnotes>​
 +~~REFNOTES cite~~
  
archive/bioinformatic_tools/soapdenovo.1270840930.txt.gz · Last modified: 2010/04/09 19:22 by galt