Attached is the ALLPATHS 3 manual (documentation version 1.0), converted from .docx to .pdf. You can download it via the following link: allpathsv3_manual.
Designed to work with 100+ bp paired-end reads from a minimum of one short-fragment and one long-fragment library. The program also expects 40x coverage from each of those libraries, and the documentation states a minimum of 32 GB of RAM, which I assume means shared memory, so it may not work on our cluster. It might still be useful to run on small portions of our data, though.
ALLPATHS is the most recent (as of this writing) assembler developed by the Broad Institute for shotgun sequence data[1]. The Broad Institute claims that version 3 of the program can assemble genomes up to mammalian size if the reads are at least 100 bp long[1]. Version 3 (currently 3.2) may be downloaded from here; that folder also contains documentation on how to use the program, and the program ships with test data that you can assemble.
In class I mentioned that a fellow student, Amie, had a lot of trouble assembling via Euler. That program was actually ALLPATHS, but when she used it at the beginning of last quarter it was still at version 1.0. According to Broad, version 1.0 was only expected to work on test data[1] (although Amie couldn't even get that working). Since then they have released two major revisions and one minor revision, and they claim the current version works on real data, so I think it is definitely worth a shot. — John St. John 2010/04/06 08:21
On the Broad site they mention that they have tested the algorithm on very short reads (~30 bp) and short reads (100+ bp). They mainly target sequencing strategies using 100+ bp reads, with the argument that this will be the norm in the future[1].
I am not sure exactly what this means, so I'll quote it directly; it seems relevant. — John St. John 2010/04/05 17:50
We have developed and tested a method for assembling very short (~30 base) paired reads using the ALLPATHS algorithm. This method requires high coverage from two libraries, one from fragments of size 3-4 kb, and one from shorter fragments.[1]
They are mainly developing the algorithm for slightly longer reads. According to the site, they are targeting a combined approach consisting of 45x coverage of 100 bp reads from 180 bp fragments, 45x coverage of 100 bp reads from 3,000 bp fragments, and “additional sequence from longer fragments for large genomes”[1].
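As a rough sanity check on the data volume this implies (my arithmetic, not a figure from the Broad docs): reads needed = coverage × genome size ÷ read length. For a mammalian-sized genome, say 3 Gb:

  # reads = coverage * genome_size / read_length (the 3 Gb genome size is my example, not theirs)
  awk 'BEGIN { printf "%d reads (%d pairs)\n", 45 * 3e9 / 100, 45 * 3e9 / 100 / 2 }'
  # -> 1350000000 reads (675000000 pairs)

So ~1.35 billion reads (~675 million pairs) for each of the two 45x libraries.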
Perhaps our data will not behave nicely with this algorithm? They also didn't say whether they combine the reads from those three sources or analyze them separately and merge the results at some point. Reading further into their documentation or publications may answer some of these questions. — John St. John 2010/04/05 17:50
Installed statically on isla.cse.ucsc.edu and transferred over.
Installed with boost_1.38 and gcc version 4.4.
The install placed the binaries here:
/projects/lowelab/users/jstjohn/allpaths/bin
and then I zipped them and transferred them via scp over to campusrocks.
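The transfer amounted to something like the following (the archive name and destination directory are my own choices here, not recorded at the time):

  cd /projects/lowelab/users/jstjohn/allpaths
  zip -r allpaths_bin.zip bin/        # archive name is hypothetical
  scp allpaths_bin.zip campusrocks:   # lands in the home directory on campusrocks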
To compile statically I ran configure with “CXXFLAGS=-static”.
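In other words, assuming the stock autotools sequence (boost_1.38 and gcc 4.4 have to be the ones configure picks up; how you point it at them depends on your environment):

  ./configure CXXFLAGS=-static   # static linking, so the binaries run on campusrocks without local libs
  make
  make install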
They also installed graphviz on the computer, which is necessary to view some of the output, although I am pretty sure this isn't a prerequisite for compiling the program, as the configure script never bugged me about specifying the graphviz source and/or binaries.
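If the graph output is in graphviz's DOT format (an assumption on my part; check the actual file extensions), it can be rendered with:

  dot -Tpng assembly_graph.dot -o assembly_graph.png   # input filename is hypothetical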
Discussion
30 of the boxes have 4 cores with 16 GB each; 32 boxes have 2 cores with 16 GB each. The CPUs are in the 2 GHz range.
So if the cores are sharing 64 GB, we're in luck, but if it really is 16 GB separately for each core (or worse, 16 GB per box) we may be in trouble.
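The quickest way to settle this would be to get a shell on a compute node and check directly (standard Linux commands):

  free -g                       # memory totals in GB; this is the per-box shared memory
  grep MemTotal /proc/meminfo   # same figure in kB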
Would I send the IT request to SOE? Who manages campusrocks?
It looks like the experimental parallel extension to libstdc++ started with gcc 4.3. You'll have to file an IT request to get a newer gcc installed (and even that might not work). Remember to explain why you need a newer gcc.
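Before filing the request, it's worth checking what's already on campusrocks (standard commands; libstdc++ parallel mode needs gcc >= 4.3, and the build above used 4.4):

  g++ --version
  # if the compiler is new enough, parallel mode is enabled per compile with:
  #   g++ -D_GLIBCXX_PARALLEL -fopenmp ...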