This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
archive:bioinformatic_tools:soapdenovo [2010/04/14 08:40] galt |
archive:bioinformatic_tools:soapdenovo [2015/07/28 06:26] (current) ceisenhart ↷ Page moved from bioinformatic_tools:soapdenovo to archive:bioinformatic_tools:soapdenovo |
||
---|---|---|---|
Line 1: | Line 1: | ||
===== SOAPdenovo ===== | ===== SOAPdenovo ===== | ||
- | ====High Level Overview==== | ||
- | SOAP = Short_Oligonucleotide_Analysis_Package\\ | + | ==== Overview ==== |
+ | |||
+ | SOAP = **S**hort **O**ligonucleotide **A**nalysis **P**ackage\\ | ||
SOAPdenovo assembles short oligonucleotides into contigs and scaffolds | SOAPdenovo assembles short oligonucleotides into contigs and scaffolds | ||
for de-novo assembly of short-reads using de Bruijn graphs.\\ | for de-novo assembly of short-reads using de Bruijn graphs.\\ | ||
Can use a hierarchy of sizes of paired-end data.\\ | Can use a hierarchy of sizes of paired-end data.\\ | ||
- | Has been successfully used to sequence the Panda and Human genomes. | + | Has been successfully used to sequence the Panda and Human genomes.\\ |
- | Quality seems good. They ran Panda genome on 256GB workstation with 32 CPUs. | + | Quality seems good. They ran Panda genome on 512GB workstation with 32 CPUs. |
There is a [[http://soap.genomics.org.cn/soapdenovo.html|description]] which contains a | There is a [[http://soap.genomics.org.cn/soapdenovo.html|description]] which contains a | ||
download link. | download link. | ||
- | Created by BGI - Beijing Genomics Institute [[http://en.wikipedia.org/wiki/Beijing_Genomics_Institute|wiki]]. | + | Created by BGI - [[wp>Beijing_Genomics_Institute]]. |
- | Nature 463, 311-317 (21 January 2010) | doi:10.1038/nature08696; Received 19 August 2009; Accepted 24 November 2009; Published online 13 December 2009 | + | **The sequence and de novo assembly of the giant panda genome** |
- | The sequence and de novo assembly of the giant panda genome | + | [(cite:panda> |
- | [[http://www.nature.com/nature/journal/v463/n7279/full/nature08696.html|article]] | + | The sequence and de novo assembly of the giant panda genome |
+ | Nature 463, 311-317 (21 January 2010)\\ | ||
+ | doi:[[http://dx.doi.org/10.1038/nature08696|10.1038/nature08696]];\\ | ||
+ | Received 19 August 2009; Accepted 24 November 2009; Published online 13 December 2009 | ||
+ | )] | ||
Downloaded the binaries for SOAPdenovo and [[http://soap.genomics.org.cn/about.html#resource2|GapCloser]]. I copied these binaries to the /bin folder since there are only two, and they at least correctly display a help message when you run them without arguments. | Downloaded the binaries for SOAPdenovo and [[http://soap.genomics.org.cn/about.html#resource2|GapCloser]]. I copied these binaries to the /bin folder since there are only two, and they at least correctly display a help message when you run them without arguments. | ||
+ | ==== Method ==== | ||
+ | The method for SOAPdenovo is described in the paper [[http://genome.cshlp.org/content/20/2/265.full|"De novo assembly of human genomes with massively parallel short read sequencing"]] by Li et al. | ||
+ | |||
+ | === Short Read Data === | ||
+ | Fragment and paired-end libraries are sequenced using various insert sizes. They used read lengths from 35 to 75 bp and insert sizes of 140 bp, 440 bp, 2.6 kb, 6 kb, and 9.6 kb. | ||
+ | |||
+ | Basic error correction was performed on the reads using K-mer counting to reduce the memory usage when constructing the de Bruijn graph. This was done by identifying low frequency (occurring <3 times) 17-mers and correcting these K-mers to the candidate with the highest frequency. | ||
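The k-mer counting idea above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names are made up here, the search is limited to single-base substitutions (an assumption, since the page does not say how candidates are chosen), and the toy example uses k=5 instead of 17 to keep it readable.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_kmer(kmer, counts, min_count=3):
    """Return the most frequent trusted single-base variant of a rare k-mer.

    A k-mer seen at least min_count times is returned unchanged, and so is a
    rare k-mer that has no variant seen at least min_count times.
    """
    if counts[kmer] >= min_count:
        return kmer
    best, best_count = kmer, 0
    for i in range(len(kmer)):
        for base in "ACGT":
            if base == kmer[i]:
                continue
            candidate = kmer[:i] + base + kmer[i + 1:]
            candidate_count = counts.get(candidate, 0)
            if candidate_count >= min_count and candidate_count > best_count:
                best, best_count = candidate, candidate_count
    return best


# Toy data: five copies of a correct read and one read with a single error.
reads = ["ACGTACGTAC"] * 5 + ["ACGTTCGTAC"]
counts = count_kmers(reads, k=5)
print(correct_kmer("ACGTT", counts))
```

Here the rare 5-mer ''ACGTT'' is one substitution away from the frequent ''ACGTA'', so it is replaced by it; frequent k-mers pass through untouched.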
+ | |||
+ | === De Bruijn Graph === | ||
+ | The next step is to build the de Bruijn graph to represent overlap of k-mers. 25-mers were used in their assembly. | ||
+ | Only the single-end and paired-end reads with short insert sizes (<1 kb) were used in the graph due to the high probability of chimeric reads in the long-insert pairs from the circularizing and fragmentation process. Further error correction is done using the de Bruijn graph. | ||
+ | |||
+ | * Clip tips (low coverage paths that lead to dead ends) | ||
+ | * Remove low coverage links | ||
+ | * Resolve tiny repeats greater than K, but less than the read lengths. | ||
+ | * Merge bubbles (paths with the same start and end). These can represent an error or a true polymorphism. | ||
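As a rough illustration of the graph construction and the first cleanup step (tip clipping), here is a minimal Python sketch. It is not how SOAPdenovo is implemented: the function names are invented for this example, nodes are (k-1)-mers with one edge per k-mer, and tips are judged by length only, whereas the description above also uses coverage. Low-coverage link removal, repeat resolution, and bubble merging are not shown.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
            graph.setdefault(kmer[1:], set())  # make sure dead ends are nodes too
    return graph


def clip_tips(graph, max_tip_len):
    """Remove dead-end branches of at most max_tip_len nodes (in place).

    Simplification: only path length is checked, not coverage.
    Returns the list of removed nodes.
    """
    preds = defaultdict(set)
    for node, succs in graph.items():
        for succ in succs:
            preds[succ].add(node)

    removed = []
    dead_ends = [node for node, succs in graph.items() if not succs]
    for end in dead_ends:
        path, node = [end], end
        # Walk backwards along the unbranched path that leads to the dead end.
        while len(path) <= max_tip_len and len(preds[node]) == 1:
            parent = next(iter(preds[node]))
            if len(graph[parent]) > 1:
                # parent is a branch point, so path is a short tip hanging off it
                graph[parent].discard(node)
                for tip_node in path:
                    del graph[tip_node]
                removed.extend(path)
                break
            path.append(parent)
            node = parent
    return removed


# One "true" read plus a read whose last base is an error, with k=4.
reads = ["AAGCTTGCA", "AGCTA"]
graph = build_de_bruijn(reads, k=4)
removed = clip_tips(graph, max_tip_len=2)
print(removed)
```

In this toy input the erroneous read adds a one-node branch (''CTA'') off node ''GCT''; clipping removes it and leaves the longer main path intact. With a large ''max_tip_len'' a length-only rule can remove the wrong branch, which is one reason coverage matters in practice.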
+ | |||
+ | ==== Quirks ==== | ||
+ | |||
+ | === NO FASTA, ONLY FASTQ === | ||
+ | Unable to get SOAPdenovo to read any kind of FASTA file, despite | ||
+ | the FASTA examples in the documentation. Tried many variants of the FASTA file, | ||
+ | and even tried all 5 versions available for download, but | ||
+ | could not get it to work. The other documentation example uses FASTQ, so I | ||
+ | found and installed the sff2fastq utility and made a FASTQ version of the 454 data. | ||
+ | With that, SOAPdenovo finally ran. | ||
+ | |||
+ | Perhaps it just won't take FASTA input by itself. | ||
+ | It might work if you include a qual file with your FASTA file. | ||
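If only FASTA and qual files are available, another possible workaround is to merge them into FASTQ before running the assembler. The sketch below is an assumption-laden illustration and has not been tested against SOAPdenovo: the function name is hypothetical, the qual file is assumed to hold space-separated integer Phred scores under the same record names and in the same order as the FASTA file, and qualities are written with the Phred+33 encoding.

```python
def fasta_qual_to_fastq(fasta_text, qual_text):
    """Merge FASTA and qual text into FASTQ text (Phred+33 encoding)."""

    def records(text):
        # Yield (name, list of content lines) for each '>' record.
        name, chunks = None, []
        for line in text.splitlines():
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, chunks
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
        if name is not None:
            yield name, chunks

    out = []
    pairs = zip(records(fasta_text), records(qual_text))
    for (name, seq_chunks), (qual_name, qual_chunks) in pairs:
        if name != qual_name:
            raise ValueError(f"record mismatch: {name} vs {qual_name}")
        seq = "".join(seq_chunks)
        scores = [int(q) for q in " ".join(qual_chunks).split()]
        if len(scores) != len(seq):
            raise ValueError(f"{name}: {len(seq)} bases but {len(scores)} scores")
        quals = "".join(chr(score + 33) for score in scores)
        out.append(f"@{name}\n{seq}\n+\n{quals}\n")
    return "".join(out)


fastq = fasta_qual_to_fastq(">r1\nACGT\n", ">r1\n40 40 30 2\n")
print(fastq, end="")
```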
+ | |||
+ | ==== Using SOAPdenovo ==== | ||
+ | |||
+ | SOAPdenovo has three executables, each tuned for a different range of k-mer sizes (''SOAPdenovo-31mer'', ''SOAPdenovo-63mer'', ''SOAPdenovo-127mer''). ''SOAPdenovo-31mer'' works best on k-mer sizes up to and including 31. For k-mer sizes from 32 to 63, use ''SOAPdenovo-63mer''. | ||
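When scripting runs over several k-mer sizes, the choice of executable can be wrapped in a small helper. This is a sketch with a hypothetical function name; the 31 and 63 cut-offs come from the text above, while the 64 to 127 range for ''SOAPdenovo-127mer'' is assumed by analogy with the executable's name.

```python
def pick_soapdenovo_binary(k):
    """Return the executable name suited to k-mer size k."""
    if k < 1:
        raise ValueError("k must be positive")
    if k <= 31:
        return "SOAPdenovo-31mer"
    if k <= 63:
        return "SOAPdenovo-63mer"
    if k <= 127:  # assumed upper limit, inferred from the executable name
        return "SOAPdenovo-127mer"
    raise ValueError("no executable listed for k > 127")


for k in (25, 31, 45, 99):
    print(k, pick_soapdenovo_binary(k))
```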
+ | |||
+ | SOAPdenovo requires a configuration file that describes the libraries that will be used in the assembly. A library entry is required for each read file, or for each pair of read files in the case of paired-end reads. Here is an example of the library entries for one lane of run1. | ||
+ | |||
+ | <code> | ||
+ | [LIB] | ||
+ | #average insert size | ||
+ | avg_ins=150 | ||
+ | |||
+ | #if sequence needs to be reversed | ||
+ | reverse_seq=0 | ||
+ | |||
+ | #in which part(s) the reads are used | ||
+ | asm_flags=3 | ||
+ | |||
+ | #in which order the reads are used while scaffolding | ||
+ | rank=1 | ||
+ | |||
+ | #fastq file for read 1 | ||
+ | q1=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_1_qseq_seqprep.cor.fastq.gz | ||
+ | #fastq file for read 2 always follows fastq file for read 1 | ||
+ | q2=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_2_qseq_seqprep.cor.fastq.gz | ||
+ | |||
+ | [LIB] | ||
+ | reverse_seq=0 | ||
+ | asm_flags=3 | ||
+ | rank=1 | ||
+ | q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_1_qseq_seqprep.cor_single.fastq.gz | ||
+ | |||
+ | [LIB] | ||
+ | reverse_seq=1 | ||
+ | asm_flags=3 | ||
+ | rank=1 | ||
+ | q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_2_qseq_seqprep.cor_single.fastq.gz | ||
+ | |||
+ | [LIB] | ||
+ | reverse_seq=0 | ||
+ | asm_flags=3 | ||
+ | rank=1 | ||
+ | q=/campusdata/BME235/data/slug/clean/run1_seqprep_quake/s_1_merged_qseq_seqprep.cor.fastq.gz | ||
+ | |||
+ | </code> | ||
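With many lanes, writing these entries by hand gets repetitive, so they can be generated. The following Python sketch builds ''[LIB]'' sections using only the keys that appear in the example above (''avg_ins'', ''reverse_seq'', ''asm_flags'', ''rank'', ''q1''/''q2'', ''q''). The function name and the file names in the usage lines are placeholders, not real paths.

```python
def lib_entry(reads, avg_ins=None, reverse_seq=0, asm_flags=3, rank=1):
    """Format one [LIB] section.

    reads is either a single path (single-end, written as q=) or a
    (read1, read2) tuple (paired-end, written as q1= followed by q2=).
    """
    lines = ["[LIB]"]
    if avg_ins is not None:
        lines.append(f"avg_ins={avg_ins}")
    lines += [
        f"reverse_seq={reverse_seq}",
        f"asm_flags={asm_flags}",
        f"rank={rank}",
    ]
    if isinstance(reads, str):
        lines.append(f"q={reads}")
    else:
        read1, read2 = reads
        lines += [f"q1={read1}", f"q2={read2}"]  # q2 always follows q1
    return "\n".join(lines) + "\n"


# Placeholder file names, for illustration only.
config = "\n".join([
    lib_entry(("lane1_1.fastq.gz", "lane1_2.fastq.gz"), avg_ins=150),
    lib_entry("lane1_single.fastq.gz"),
])
print(config, end="")
```

The resulting text can be written to a file and passed to SOAPdenovo as its configuration file.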
+ | |||
+ | |||
+ | |||
+ | ==== Installing ==== | ||
+ | |||
+ | cd /campusdata/BME235/programs | ||
+ | wget http://soap.genomics.org.cn/down/SOAPdenovo-v1.04.tgz | ||
+ | tar xfz SOAPdenovo-v1.04.tgz | ||
+ | mv SOAPdenovo_Release1.04 SOAPdenovo | ||
+ | mv SOAPdenovo-v1.04.tgz SOAPdenovo/ | ||
+ | cd SOAPdenovo | ||
+ | cp SOAPdenovo ../../bin/ | ||
+ | |||
+ | |||
+ | ==== Website ==== | ||
+ | [[http://soap.genomics.org.cn/soapdenovo.html]] | ||
+ | |||
+ | ==== Source with Binaries and Documentation ==== | ||
+ | [[http://soap.genomics.org.cn/down/]] | ||
+ | ===== References ===== | ||
+ | <refnotes>notes-separator: none</refnotes> | ||
+ | ~~REFNOTES cite~~ | ||