
===== ALLPATHS =====

Attached is the Allpaths3 version 1.0 documentation, converted from .docx to .pdf. You can download it by clicking the following link: {{:bioinformatic_tools:allpathsv3_manual_r1.0_2.pdf|allpathsv3_manual}}.

==== Potential Pitfalls ====

ALLPATHS is designed to work with 100+ bp paired-end reads from a minimum of one short and one long set of libraries. The program also expects 40X coverage from each of those libraries! Additionally, the documentation says it requires a minimum of 32 GB of RAM, which I assume means shared memory, so it may not work on our cluster. Maybe it would still be useful to run on small portions of our data?

==== High Level Overview ====

ALLPATHS is the most recent (as of this writing) tool developed by the Broad Institute to assemble shotgun sequences[(cite:broad>http://www.broadinstitute.org/science/programs/genome-biology/computational-rd/computational-research-and-development)]. The Broad Institute claims that version 3 of the program can assemble genomes up to mammalian size if the reads are at least 100 base pairs long[(cite:broad)]. Version 3 (currently 3.2) of the program may be downloaded from [[ftp://ftp.broad.mit.edu/pub/crd/ALLPATHS/Release-3-0/|here]]; that folder also contains some documentation on how to use the program. The program also ships with test data that you can assemble.

In class I mentioned that a fellow student, Amie, had a lot of trouble assembling with Euler. Actually that program was ALLPATHS, but when she used it at the beginning of last quarter it was still at version 1.0. According to Broad, version 1.0 was only expected to work on test data[(cite:broad)] (although Amie couldn't even get that working). Since then they have come out with two major revisions and one minor revision, and they claim that this version works on real data, so I think it is definitely worth a shot.

 --- //[[jstjohn@soe.ucsc.edu|John St. John]] 2010/04/06 08:21//

=== Expected Data Quality ===

On the Broad site they mention that they have tested the algorithm on very short reads (30+ bp) and short reads (100+ bp). They mainly target sequencing strategies that use 100+ bp reads, arguing that such reads will be the norm in the future[(cite:broad)].

== Very Short Reads (30+ bp paired reads) ==

I am not sure exactly what this means, so I'll quote it directly; it seems relevant.

 --- //[[jstjohn@soe.ucsc.edu|John St. John]] 2010/04/05 17:50//

>We have developed and tested a method for assembling very short (~30 base) paired reads using the ALLPATHS algorithm. This method requires high coverage from two libraries, one from fragments of size 3-4 kb, and one from shorter fragments.[(cite:broad)]

== Short Reads (~100+ bp reads) ==

They are mainly developing the algorithm for slightly longer reads. According to the site, they are targeting a combination approach consisting of 45x coverage of 100 bp reads from 180 bp fragments, 45x coverage of 100 bp reads from 3000 bp fragments, and "additional sequence from longer fragments for large genomes"[(cite:broad)]. Perhaps our data will not behave nicely with this algorithm? They also didn't say whether they combine the reads from those three sources, or analyze them separately and merge them at some point. Perhaps reading further into their documentation or publications will answer some of these questions.

 --- //[[jstjohn@soe.ucsc.edu|John St. John]] 2010/04/05 17:50//

===== Installation =====

Installing into ~/programs/allpaths.

Configure error: it requires Boost with at least the Boost.System binaries installed. Now I am installing that...

Boost successfully installed. Now installing ALLPATHS.

Configure successful and currently building.

 --- //[[jstjohn@soe.ucsc.edu|John St. John]] 2010/04/07 00:53//

Build unsuccessful:

./ParallelVecUtilities.h: In function 'void ParallelSort(vec<T>&)':\\
./ParallelVecUtilities.h:27: error: '__gnu_parallel' has not been declared\\
./ParallelVecUtilities.h: In function 'void ParallelSort(vec<T>&, StrictWeakOrdering)':\\
./ParallelVecUtilities.h:35: error: '__gnu_parallel' has not been declared\\
./ParallelVecUtilities.h: In function 'void ParallelReverseSort(vec<T>&)':\\
./ParallelVecUtilities.h:42: error: '__gnu_parallel' has not been declared\\
./ParallelVecUtilities.h: In function 'void ParallelReverseSort(vec<T>&, StrictWeakOrdering)':\\
./ParallelVecUtilities.h:50: error: '__gnu_parallel' has not been declared\\
./ParallelVecUtilities.h: In function 'void ParallelWhatPermutation(const V&, vec<T3>&, C, bool)':\\
./ParallelVecUtilities.h:316: error: '__gnu_parallel' has not been declared

Reading deeper into the documentation (the PDF attached to this page), I see that it requires gcc-4.3+. Campusrocks currently has gcc-4.1 installed. Perhaps if we compile the latest gcc we can install this program?

Installed gcc-4.5! (I had to do it myself; the sys admins wouldn't try.)

The gcc/g++-4.5 libraries are installed in:

  * /campusdata/BME235/lib
  * /campusdata/BME235/lib64

To compile with the gcc 4.5 compilers, your environment must be set up so that everything knows where to look for the linked libraries. I did this by setting my LD_LIBRARY_PATH variable as follows in my .profile:

''LD_LIBRARY_PATH=/campusdata/BME235/lib:/campusdata/BME235/lib64:$LD_LIBRARY_PATH''\\
''export LD_LIBRARY_PATH''

Note: if you want to run the install via a script, an example script that sets up the environment variables is here:

/campusdata/BME235/programs/allpaths/allpaths3-3.2/installallpaths.sh

===== References =====

<refnotes>notes-separator: none</refnotes>
~~REFNOTES cite~~

Discussion

, 2010/04/08 21:02

30 of the boxes have 4 cores with 16GB each. 32 boxes have 2 cores with 16GB each. The CPUs are in the 2 GHz range.

So if the four cores on a box share 64 GB, we're in luck, but if it really is 16 GB separately for each core (or worse, 16 GB per box) we may be in trouble.
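
One way to settle the memory question on a given node might be to read what the kernel reports. This is only a rough sketch: it assumes a Linux node that exposes /proc/meminfo, the function name mem_enough is made up for illustration, and the 32 GB threshold is the minimum quoted on this page. MemTotal is usually a little below the installed RAM, so treat a borderline result with care.

```shell
#!/bin/sh
# Return success if a MemTotal value (in kB) covers the required number of GB.
# Uses 1 GB = 1024 * 1024 kB. The name mem_enough is hypothetical.
mem_enough() {
    total_kb="$1"
    required_gb="$2"
    [ "$total_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# Minimum RAM quoted on this page for ALLPATHS.
required_gb=32

if [ -r /proc/meminfo ]; then
    # MemTotal is reported in kB on Linux.
    node_kb="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
    if mem_enough "$node_kb" "$required_gb"; then
        echo "MemTotal ${node_kb} kB: meets the ${required_gb} GB minimum"
    else
        echo "MemTotal ${node_kb} kB: below the ${required_gb} GB minimum"
    fi
else
    echo "/proc/meminfo is not readable on this machine"
fi
```

Run on a compute node rather than the head node, this would show whether the box has 16 GB or 64 GB in total.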

, 2010/04/08 02:50

Would I send the IT request to SOE? Who manages campusrocks?

, 2010/04/08 01:13

It looks like the experimental parallel extension to libstdc++ started with gcc 4.3. You'll have to do an IT request to get a newer gcc installed (and even that might not work). Remember to explain why you need a newer gcc.
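
Before filing a request, it may help to confirm which gcc a node actually picks up. The following is a minimal sketch, not a definitive check: the helper name version_ge is made up, it assumes GNU sort (for the -V version-sort option), and the 4.3 minimum is the requirement mentioned on this page.

```shell
#!/bin/sh
# Return success if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort). The name version_ge is hypothetical.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Minimum gcc version needed for the parallel extension, per this page.
required="4.3"

if command -v gcc >/dev/null 2>&1; then
    found="$(gcc -dumpversion)"
    if version_ge "$found" "$required"; then
        echo "gcc $found is new enough (need >= $required)"
    else
        echo "gcc $found is too old (need >= $required)"
    fi
else
    echo "gcc not found on PATH"
fi
```

The output line could be pasted into the IT request to show which version is installed and which is needed.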

archive/bioinformatic_tools/allpaths.1271654898.txt.gz · Last modified: 2010/04/19 05:28 by jstjohn