Velvet was developed by Daniel R. Zerbino and Ewan Birney.
Zerbino DR, Birney E (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18:821-829.[1]
Velvet may be downloaded free from the EBI site (GPL license).
On Wikipedia: Velvet.
Daniel Zerbino's PhD Thesis on Velvet
Velvet has support for colorspace (SOLiD) data, and may be the only de novo short-read de Bruijn graph assembler that does at this time. The colorspace version of Velvet (the _de binaries) expects all data to be double-encoded; mixed-space assembly is not directly supported.
Velvet has support for long-read data.
Velvet will accept sequence data from FASTQ input files, but does not use the quality information.
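For example, a FASTQ file is given to velveth like any other input, and the quality values are simply discarded (the file name and k here are illustrative):

velveth out 21 -fastq -short reads.fastq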
This is done by the pre-processor. The primer base is removed from the colorspace read, followed by the first color, since that color was tied to the primer base. In the case of mate-paired reads, the F3 read is reversed. The remaining colors are then all converted to bases, for the benefit of software that doesn't parse colorspace input. Thus "double-encoded" means reads sequenced in colorspace and then re-encoded as if they were bases in base-space.
denovo_preprocessor - converts colorspace reads into double-encoded 24-base reads that can be given to velvet_de.
velveth_de - colorspace version of velveth; hashes the reads.
velvetg_de - colorspace version of velvetg; builds the de Bruijn graph and produces contigs.
denovo_postprocessor - converts Velvet's double-encoded output back into colorspace contigs.
denovo_adp - adapter program that converts colorspace to base-space while reducing read errors in colorspace as much as possible.
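As a rough illustration of what double-encoding looks like, assuming the standard 0→A, 1→C, 2→G, 3→T mapping (a sketch, not the actual denovo_preprocessor):

# drop the primer base and the first color, then recode colors as bases
echo "T32010231" | cut -c3- | tr '0123' 'ACGT'
# prints GACAGTC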
Strategy:
For 454 long reads, this was our best result:
velveth out 31 -short 454/?.TCA.454Reads.fna -long 454/?.TCA.454Reads.fna
velvetg out -exp_cov 60 -cov_cutoff 13
Final graph has 1755 nodes and n50 of 41723, max 142286, total 2468925, using 778257/782604 reads
The contributed utility VelvetOptimiser (in velvet/contrib/) is intended to help find the critical parameters k, exp_cov, and cov_cutoff. However, although it found k, it got stuck on a local maximum in coverage and failed to produce anything useful.
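For reference, a VelvetOptimiser run looks something like this (flags from memory, so check the script's usage message; the read file is just an example):

VelvetOptimiser.pl -s 19 -e 31 -f '-short 454/1.TCA.454Reads.fna'

-s and -e give the range of hash lengths (k) to scan, and -f is the velveth input specification.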
Wondering if homopolymer errors in the 454 data could cause trouble for the de Bruijn graph, I made a utility called pseudoFlow.c that shortens all homopolymers longer than 6 down to 6 (in the range 1 to 6, 454 is known to be accurate). In any case, the pseudoFlow version of the data did not perform better; in fact, it was a little worse.
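The same transformation can be approximated with a one-liner (a sketch, not the actual pseudoFlow.c; this naive version would also shorten runs inside FASTA header lines):

sed -E 's/(.)\1{6}\1*/\1\1\1\1\1\1/g' 454/1.TCA.454Reads.fna > 1.hp6.fna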
ssh campusrocks.cse.ucsc.edu
cd /campusdata/BME235/programs
wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_0.7.62.tgz
tar xfz velvet_0.7.62.tgz
mv velvet_0.7.62 velvet
mv velvet_0.7.62.tgz velvet/
cd velvet
make
make color   # color versions work with SOLiD and have the _de extension
# install to the bin dir
cp velveth velvetg velveth_de velvetg_de /campusdata/BME235/bin/
Discussion
excuse me, where is the velvet-estimate-exp_cov.pl script?
velveth out 21 -short 454/?.TCA.454Reads.fna -long 454/?.TCA.454Reads.fna
velvetg out -exp_cov 50 -cov_cutoff 13
Final graph has 3281 nodes and n50 of 28393, max 82889, total 2492962, using 773102/782604 reads
Wow! This is by far the best one I have gotten yet.
The expected coverage and coverage cutoff were estimated with a utility in contrib/:
velvetg out -exp_cov auto
velvet-estimate-exp_cov.pl out/stats.txt
but you still have to look at the graph.
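One quick way to eyeball the coverage distribution yourself is a length-weighted histogram over stats.txt (assuming short1_cov is column 6, as in the stats.txt header; adjust if yours differs):

awk 'NR > 1 { hist[int($6)] += $2 } END { for (c in hist) print c, hist[c] }' out/stats.txt | sort -n

Each output line is a coverage bin followed by the total contig length at that coverage; exp_cov should sit near the main peak.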
It seems that using long reads alone did not make it happy, as if it were designed so that the long reads merely supplement contigs built from the short reads. So in this case I just fed the same input in twice, once as -short and once as -long. I had tried that before, but I guess I didn't have exp_cov and cov_cutoff tuned well enough.
New record for velvet!
k=31 should do better with long reads and it does:
velveth out 31 -short 454/?.TCA.454Reads.fna -long 454/?.TCA.454Reads.fna
velvetg out -exp_cov 50 -cov_cutoff 13
Final graph has 1790 nodes and n50 of 31924, max 125345, total 2472961, using 775986/782604 reads
WOW! This is the best yet!
velveth out 31 -short 454/?.TCA.454Reads.fna -long 454/?.TCA.454Reads.fna
velvetg out -exp_cov 60 -cov_cutoff 13
Final graph has 1755 nodes and n50 of 41723, max 142286, total 2468925, using 778257/782604 reads
Also: lowering the cov_cutoff to 9 or raising it to 20 both gave poorer results.
And raising the exp_cov to 70 made no improvement.
I tried velvet_de on the double-encoded files Kevin provided for the Pog SOLiD reads. This did manage to produce a contig as large as 40kb. I haven't had time yet to try it with a variety of settings.
I tried velvet with -exp_cov 50 and now the largest contig is 8kb. Still not great.
I have tried various settings and never got the largest contig above 14kb, which is pretty small.
I did finally get the simulated test data that comes with velvet (a 100kb genome with shortPaired and long reads) to assemble, but it was necessary to drop the default k=31 to k=21. With short reads of 35bp, you don't get the needed k-mer coverage at the longer k-value.
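This follows from how k-mer coverage scales with k: a read of length L contributes L - k + 1 k-mers, so the effective k-mer coverage is roughly

Ck = C * (L - k + 1) / L

With L = 35, k=31 keeps only 5/35 of the nucleotide coverage C, while k=21 keeps 15/35, three times as much.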
Of course, the smaller k-value did no good whatsoever for the Pog data. I tried lots of combinations of k, exp_cov, and cov_cutoff, and nothing seemed to do any good. I actually get a better assembly if I use just 1.TCA*.fna or just 2.*fna; that's when I can get the biggest contig up to 14kb. If I put both .fna files in, it only gets to 10kb. I have tried the data as long, as short, and as both, and it made no difference.
I thought perhaps the 454 data has enough homopolymer errors to throw Velvet off from making a good graph, so I tried creating a special copy of the two 454 .fna files with all homopolymers restricted to length 6 or less. But when I ran it through velvet, it made no difference at all.
I tried a very simple run of velvet last night with just default settings and just the Pog 454 FASTA data, but I did not get very good results: whether I told it the data was long or short, I never got a contig bigger than 741 bases. Of course there may be other settings that would help, and maybe there's a way to use the SOLiD mate-pair data. The only good thing about it was that the run on campusrocks took only 5 minutes, so I was able to try 3 or 4 things quickly.
Can Newbler use the (apparently?) low-coverage 454 data and produce a decent assembly from it alone?
Yes. According to the lecture on Friday, Newbler can use the 454 Pog data to make a 2.4MB assembly with only 51 contigs and an N50 length of 220kb. It also depends on which version of Newbler.
ok, re-arranged and citations corrected.
Formatting suggestions: Move the discussion about how to use Velvet with mixed-space reads into the main body of the article. Fix the citation to use refnotes format correctly.
Question: can Velvet mix colorspace reads and flow-space (454) reads in the same assembly?

Not directly. The colorspace version of Velvet (_de) expects all data to be double-encoded. Obvious ideas to try:
1. Convert the 454 data to colorspace and run velvet{h,g}_de (see the sketch below).
2. Correct the colorspace reads using the SAET SOLiD utility, which uses quality info and the k-mer spectral distribution to correct reads.
3. Find some other tool that can take a 454 base-space assembly and use the mate-paired SOLiD data to resolve contigs and repeats.
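For idea 1, the dibase encoding itself is simple. Below is a minimal sketch of converting base-space reads to colorspace (it assumes uppercase, single-line FASTA records with no Ns, and the standard SOLiD dibase code; a real converter would need to handle wrapped sequences and ambiguity codes):

awk 'BEGIN {
  # SOLiD dibase code: 0 = AA/CC/GG/TT, 1 = AC/CA/GT/TG, 2 = AG/GA/CT/TC, 3 = AT/TA/CG/GC
  n = split("AA0 CC0 GG0 TT0 AC1 CA1 GT1 TG1 AG2 GA2 CT2 TC2 AT3 TA3 CG3 GC3", m, " ")
  for (i = 1; i <= n; i++) code[substr(m[i], 1, 2)] = substr(m[i], 3, 1)
}
/^>/ { print; next }                 # pass FASTA headers through unchanged
{
  s = "T" $0                         # prepend a primer base, as SOLiD reads have
  out = substr(s, 1, 1)
  for (i = 1; i < length(s); i++)    # one color per overlapping base pair
    out = out code[substr(s, i, 2)]
  print out
}' 454/1.TCA.454Reads.fna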