ABySS

Overview

ABySS[1] stands for Assembly By Short Sequences.

ABySS is a de novo parallel, paired-end, short read DNA sequence assembler.
The single processor version can assemble genomes of up to 100 Mbases.[2]
The parallel version uses MPI and can assemble larger genomes.[2]
It was used for assembly of a transcriptome from the tumor tissue of a patient with follicular lymphoma.[3]

ABySS supports kmer sizes greater than 31.

Note that Illumina also recommends ABySS as an assembler for large genomes (see the Illumina Technote Paper).

Installing

Get the appropriate source files to be compiled:

cd /campusdata/BME235/programs
wget http://www.bcgsc.ca/downloads/abyss/abyss-1.1.2.tar.gz
wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.1.tar.gz
wget http://google-sparsehash.googlecode.com/files/sparsehash-1.7.tar.gz
tar xfz abyss-1.1.2.tar.gz
tar xfz openmpi-1.4.1.tar.gz
tar xfz sparsehash-1.7.tar.gz
mv abyss-1.1.2.tar.gz abyss-1.1.2/
mv openmpi-1.4.1.tar.gz openmpi-1.4.1/
mv sparsehash-1.7.tar.gz sparsehash-1.7/

First, OpenMPI and Google sparsehash need to be compiled and installed for ABySS.

cd /campusdata/BME235/programs/openmpi-1.4.1
./configure --prefix=/campusdata/BME235
make
make install
cd /campusdata/BME235/programs/sparsehash-1.7
./configure --prefix=/campusdata/BME235
make
make install

Next, a patch needs to be applied so that ABySS can properly compile with support for Google sparsehash 1.7. This issue will be fixed in the next release of Google sparsehash.

cd /campusdata/BME235/include/google/sparsehash
wget -O deallocate.diff 'http://google-sparsehash.googlecode.com/issues/attachment?aid=-5666329961626930947&name=deallocate.diff'
patch < deallocate.diff

Now ABySS can be compiled with OpenMPI and Google sparsehash support.

cd /campusdata/BME235/programs/abyss-1.1.2
./configure --prefix=/campusdata/BME235 CPPFLAGS=-I/campusdata/BME235/include
make
make install

Alternate Install

Attempt installing against campusdata's openmpi, which is already configured to work with SGE. Note that to force inclusion of the correct mpi.h file, I specify the openmpi include path first in CPPFLAGS:

cd /campusdata/BME235/programs/abyss_tmp/abyss-1.1.2
./configure --prefix=/campusdata/BME235/programs/abyss_tmp/ CPPFLAGS='-I/opt/openmpi/include -I/campusdata/BME235/include'  --with-mpi=/opt/openmpi

Next I qlogin into a node and run make install in parallel:

qlogin
make -j8 install

The installation failed because of a warning (-Werror was enabled). I modified configure.ac so that -Werror is no longer enabled:

#AC_SUBST(AM_CXXFLAGS, '-Wall -Wextra -Werror')
AC_SUBST(AM_CXXFLAGS, '-Wall -Wextra')
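
The same edit can be scripted. What follows is only a minimal sketch, assuming configure.ac contains the AM_CXXFLAGS line exactly as shown above. The function name strip_werror is made up for this example, and the demo works on a temporary stand-in file rather than the real configure.ac.

```shell
#!/bin/bash
# strip_werror FILE: remove the -Werror flag from the AM_CXXFLAGS line in FILE.
# Hypothetical helper; the original is kept as FILE.orig.
strip_werror() {
    cp "$1" "$1.orig"
    sed '/AM_CXXFLAGS/s/ -Werror//' "$1.orig" > "$1"
}

# Demo on a temporary file standing in for configure.ac.
demo_dir=$(mktemp -d)
printf '%s\n' "AC_SUBST(AM_CXXFLAGS, '-Wall -Wextra -Werror')" > "$demo_dir/configure.ac"
strip_werror "$demo_dir/configure.ac"
cat "$demo_dir/configure.ac"
```

Keeping the .orig copy makes it easy to diff the two files before re-running autoreconf.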

Next I run autoconf to work with the modified configure.ac file:

/campus/BME235/bin/autoconf/bin/autoreconf
/campus/BME235/bin/autoconf/bin/autoconf

Finally I re-do the configure, and install:

./configure --prefix=/campusdata/BME235/programs/abyss_tmp/ CPPFLAGS='-I/opt/openmpi/include -I/campusdata/BME235/include'  --with-mpi=/opt/openmpi
make -j8 install
cd ../
fixmode . &

Yet Another Install (1.2.7)

On 7 Jun 2011, Kevin Karplus tried installing abyss-1.2.7 from /campusdata/BME235/programs/abyss-1.2.7/ using

./configure --prefix=/campusdata/BME235 \
        CPPFLAGS='-I/opt/openmpi/include -I/campusdata/BME235/include' \
        LDFLAGS=-L/campusdata/BME235/lib  \
        CC=gcc44 CXX=g++44 \
        --with-mpi=/opt/openmpi     

Before this configure could work, sparsehash-1.10 was installed from /campusdata/BME235/programs/google/sparsehash-1.10/

Websites

ABySS
OpenMPI
Google sparsehash
Boost

Sources with Binaries and Documentation

ABySS
OpenMPI
Google sparsehash
Boost

Slug Assembly

Attempt1

In the directory:

/campus/BME235/assemblies/slug/ABySS-assembly1

I ran the following command to start the assembly process in parallel MPI mode. Note that the binaries for abyss were installed with open-mpi 1.4, but I am using mpirun 1.3. Once open-mpi 1.4 is re-installed with SGE support, I will re-run this with that version if there are problems. Here is the command executed to start the process:

/campus/BME235/assemblies/slug/ABySS-assembly1

And here are the contents of the script I use to run everything:

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#$ -l mem_free=15g
# 
/opt/openmpi/bin/mpirun -np $NSLOTS abyss-pe -j j=2 np=$NSLOTS n=8 k=25 name=slugAbyss lib='lane1 lane2 lane3 lane5 lane6 lane7 lane8' lane1='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_1_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_1_2_all_qseq.fastq' lane2='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_2_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_2_2_all_qseq.fastq' lane3='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_3_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_3_2_all_qseq.fastq' lane5='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_5_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_5_2_all_qseq.fastq' lane6='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_6_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_6_2_all_qseq.fastq' lane7='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_7_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_7_2_all_qseq.fastq' lane8='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_8_1_all_qseq.fastq  /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_8_2_all_qseq.fastq'
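
The lane parameters in the script above follow a regular pattern, so they could be generated instead of typed out. The following is a rough sketch, assuming the read files are always named s_N_1_all_qseq.fastq and s_N_2_all_qseq.fastq as in the script. The function name lane_params is made up for this example and is not part of ABySS.

```shell
#!/bin/bash
# Directory holding the reads, taken from the script above.
reads=/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads

# lane_params LANE...: print one abyss-pe parameter per line: first the
# lib= list, then a laneN= entry naming the two read files of each lane.
# (Hypothetical helper for illustration.)
lane_params() {
    libs=""
    for n in "$@"; do
        libs="${libs:+$libs }lane$n"
    done
    printf '%s\n' "lib=$libs"
    for n in "$@"; do
        printf '%s\n' "lane$n=$reads/s_${n}_1_all_qseq.fastq $reads/s_${n}_2_all_qseq.fastq"
    done
}

lane_params 1 2 3 5 6 7 8

# In bash the lines can be collected into an array and handed over unchanged:
#   mapfile -t params < <(lane_params 1 2 3 5 6 7 8)
#   abyss-pe np=$NSLOTS n=8 k=25 name=slugAbyss "${params[@]}"
```

Each printed line is one complete argument, so the embedded spaces survive without extra quoting when the array form is used.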

Unfortunately, this command crashes. The error states that LD_LIBRARY_PATH might need to be set to point to the shared MPI libraries. It would also probably be best to use our own version of “mpirun” once we get it compiled with SGE support.
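
One way to extend LD_LIBRARY_PATH in a job script is sketched below. It uses the two library directories that appear in the Attempt 2 script; whether they are the right ones for a given node is an assumption. The helper name append_dir is made up for this example. It avoids creating an empty entry when the variable starts out unset, since the dynamic loader treats an empty entry as the current directory.

```shell
#!/bin/bash
# append_dir CURRENT DIR: print CURRENT with DIR appended, colon-separated.
# If CURRENT is empty, print DIR alone so that no empty entry is created.
# (Hypothetical helper for illustration.)
append_dir() {
    printf '%s\n' "${1:+$1:}$2"
}

LD_LIBRARY_PATH=$(append_dir "${LD_LIBRARY_PATH:-}" /opt/openmpi/lib)
LD_LIBRARY_PATH=$(append_dir "$LD_LIBRARY_PATH" /campus/BME235/lib)
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```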

Attempt 2

I modified the script to use the other parallel version of ABySS that I installed as described above. I ran it in the same directory, since the last attempt was entirely unsuccessful:

/campus/BME235/assemblies/slug/ABySS-assembly1/run1_abyss_mpi.sh:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#$ -l mem_free=15g
#
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi/lib:/campus/BME235/lib
#make the new MPI version of abyss prioritized
export PATH=/campus/BME235/programs/abyss_tmp/bin:$PATH
/opt/openmpi/bin/mpirun -np $NSLOTS abyss-pe -j j=2 np=$NSLOTS n=8 k=25 name=slugAbyss lib='lane1 lane2 lane3 lane5 lane6 lane7 lane8' lane1='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_1_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_1_2_all_qseq.fastq' lane2='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_2_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_2_2_all_qseq.fastq' lane3='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_3_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_3_2_all_qseq.fastq' lane5='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_5_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_5_2_all_qseq.fastq' lane6='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_6_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_6_2_all_qseq.fastq' lane7='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_7_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_7_2_all_qseq.fastq' lane8='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_8_1_all_qseq.fastq  /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_8_2_all_qseq.fastq'

And I run the script using the following qsub command:

qsub -pe orte 40 run1_abyss_mpi.sh

FAIL ARRRG!!

Out of curiosity I decided to follow the example for how to run qsub on an MPI job over sun grid engine as documented on the campusrocks page. To see the test and results look into the following directory on campusrocks:

/campus/jastjohn/test

I followed the example exactly and I get the following error (which is exactly the same as the one I get when trying to run ABySS!)

error: error: ending connection before all data received
error: 
error reading job context from "qlogin_starter"
--------------------------------------------------------------------------
A daemon (pid 9204) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        campusrocks-0-7.local - daemon did not report back when launched
        campusrocks-0-14.local - daemon did not report back when launched

I am now submitting an IT request to have someone look into this.

Attempt 3

For this attempt I manually set up an openmpi environment and execute the program in parallel over this environment.

There are three main steps to setting up an openmpi environment and executing ABySS over this environment:

  1. Set up ssh-key so you can log into other nodes on campusrocks over ssh without typing in a password.
  2. Create a “machine file” that lists all node names you wish this openmpi run to have access to
  3. Choose a head node in your machine file list (probably campusrocks-0-6.local due to its large available memory) and issue the command to abyss-pe to get the job going.

SSH-key

Setting up the ssh-key was the most difficult part for me to get right. I probably shouldn't comment on this further; other people in the class seem much more confident in setting this up, so I'll let one of them fill this in.
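
Until someone fills this in, here is a minimal sketch of one common approach. It assumes that the home directory (and so ~/.ssh) is shared across all campusrocks nodes, which this page does not confirm. In that case, authorizing your own public key once is enough to log into every node without a password. The helper name authorize_key and the stand-in key are made up for this example; key generation is shown only as a comment.

```shell
#!/bin/bash
# Typical key generation (run once by hand; not executed in this sketch):
#   ssh-keygen -t rsa -f ~/.ssh/id_rsa
#
# authorize_key PUBKEY_FILE AUTH_FILE: append the public key to AUTH_FILE
# unless an identical line is already there, then restrict the permissions,
# since sshd usually ignores an authorized_keys file that others can write to.
# (Hypothetical helper for illustration.)
authorize_key() {
    mkdir -p "$(dirname "$2")"
    touch "$2"
    grep -qxF -f "$1" "$2" || cat "$1" >> "$2"
    chmod 600 "$2"
}

# Demo with a stand-in key in a temporary directory.
ssh_demo=$(mktemp -d)
echo "ssh-rsa AAAAexample user@campusrocks" > "$ssh_demo/id_rsa.pub"
authorize_key "$ssh_demo/id_rsa.pub" "$ssh_demo/authorized_keys"
authorize_key "$ssh_demo/id_rsa.pub" "$ssh_demo/authorized_keys"   # second call adds nothing
cat "$ssh_demo/authorized_keys"
```

For real use the call would be authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys, followed by a test login to one of the nodes.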

Machine File

An example machine file might look like this:

campusrocks-0-6.local
campusrocks-1-0.local
campusrocks-1-0.local
#campusrocks-1-15.local
campusrocks-1-15.local

The above snippet illustrates several key points about an openmpi machine file. First, each entry corresponds to one core on the respective node. Note that for the head node I am using a single core, so that this core may take full advantage of the available memory. Also note that I am telling openmpi to use two cores on 'campusrocks-1-0.local' by listing that node twice. Finally, one of the instances of 'campusrocks-1-15.local' is commented out in this example, which means that the line is skipped. Commenting out lines in a machine file is a quick way to enable or disable nodes: you can list all cores on all nodes in a machine file if you like, and comment out the resources that are already in use or that you don't want to use. For these high-memory applications it is probably best to use only one core on each node, as the memory will probably still be used up entirely.
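
To double-check what a machine file actually offers before starting a run, the active entries can be counted per node. The following is a small sketch using the example file from above. The function names slots_per_host and total_slots are made up for this example.

```shell
#!/bin/bash
# slots_per_host FILE: print "count host" for every active host in a machine
# file, ignoring blank lines and lines commented out with '#'.
# (Hypothetical helper for illustration.)
slots_per_host() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" | sort | uniq -c | awk '{print $1, $2}'
}

# total_slots FILE: print the number of active entries (cores offered to openmpi).
total_slots() {
    grep -v -c -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Demo using the example machine file from above.
machines=$(mktemp)
cat > "$machines" <<'EOF'
campusrocks-0-6.local
campusrocks-1-0.local
campusrocks-1-0.local
#campusrocks-1-15.local
campusrocks-1-15.local
EOF
slots_per_host "$machines"
echo "total: $(total_slots "$machines")"
```

For the example file this reports one core on campusrocks-0-6.local, two on campusrocks-1-0.local, one on campusrocks-1-15.local, and a total of 4.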

Running abyss-pe

abyss-pe is a makefile that handles the abyss pipeline. The fact that abyss-pe is a makefile is great because it enables you to simply re-issue the same command if your assembly crashes, and it will pick up where it left off!

First I went to the campusrocks ganglia web page to check which nodes were free, and I modified my machine file accordingly, allocating one core per node I wanted to run my job on that was relatively free per ganglia. My machine file in this case is stored under the name “machines” and is located in the base directory of this assembly.

Next I added the appropriate abyss bin directory to the head of my path once I ssh'ed into my chosen openmpi head node (campusrocks-0-6.local). As of this writing the abyss installation in BME235/bin is still the non-mpi version. The mpi enabled version of abyss may be found here:

/campus/BME235/programs/abyss_tmp/bin/

and I modified my path by issuing the following command:

export PATH=/campus/BME235/programs/abyss_tmp/bin:$PATH

Alternatively I could have simply added this into my '.profile' but for now this is sufficient, especially because we are working on getting the parallel version installed into BME235/bin as a more permanent solution.

Finally I start screen (this process will take several days to run, and it is nice to check back on the progress) and issue the command from screen. Note that there is a nice screen tutorial that I use to remind me of the basic screen commands and usage.

screen

/campus/BME235/programs/abyss_tmp/bin/abyss-pe -j j=2 mpirun="/opt/openmpi/bin/mpirun -machinefile machines -x PATH=/campus/BME235/bin/programs/abyss_tmp/bin:$PATH" np=60 n=8 k=28 name=slugAbyss lib='lane1 lane2 lane3 lane5 lane6 lane7 lane8'  lane1='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_1_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_1_2_all_qseq.fastq'  lane2='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_2_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_2_2_all_qseq.fastq' lane3='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_3_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_3_2_all_qseq.fastq' lane5='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_5_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_5_2_all_qseq.fastq' lane6='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_6_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_6_2_all_qseq.fastq' lane7='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_7_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_7_2_all_qseq.fastq' lane8='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_8_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_8_2_all_qseq.fastq'

Note that I originally started this process on campusrocks-1-0.local with two cores allocated per available compute node on campusrocks. The process was eventually killed, either by a cluster admin or by something else. I then decided to re-run the program from campusrocks-0-6.local because of its larger amount of available RAM. The assembly picked up where it left off, and the step that had previously been killed finished within a fairly short period of time.

After completely crashing campusrocks-0-6.local with my process, I realized that the makefile was taking the -j j=2 option, probably ignoring the j=2 part, and parallelizing as much as possible at each step (on each core). On my head node I was running 8 huge processes simultaneously, which probably led to node 6 going down. I almost did the same to node 1-20 before I realized what was going on and stopped the script. I have reissued the makefile with the following command, which doesn't try to pump more parallelization out of the head node:

/campus/BME235/programs/abyss_tmp/bin/abyss-pe mpirun="/opt/openmpi/bin/mpirun -machinefile machines -x PATH=/campus/BME235/bin/programs/abyss_tmp/bin:$PATH" np=60 n=8 k=28 name=slugAbyss lib='lane1 lane2 lane3 lane5 lane6 lane7 lane8'  lane1='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_1_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_1_2_all_qseq.fastq'  lane2='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_2_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_2_2_all_qseq.fastq' lane3='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_3_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_3_2_all_qseq.fastq' lane5='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_5_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_5_2_all_qseq.fastq' lane6='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_6_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_6_2_all_qseq.fastq' lane7='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_7_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_7_2_all_qseq.fastq' lane8='/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_8_1_all_qseq.fastq /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads/s_8_2_all_qseq.fastq'

Also because the makefile crashed, it didn't get a chance to clean up the output from the previous step. I had to manually delete the lane-x-3.hist files (which were all of size 0 anyway). After doing this the makefile was able to pick up where it left off and re-generate the lane-x-3.hist files.
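
The cleanup can be done in one step. The following is a sketch, assuming the leftover histogram files sit directly in the assembly directory and end in -3.hist. The function name remove_empty_hist and the demo file names are made up for this example. Only zero-size files are removed, so a histogram that was written completely is left alone.

```shell
#!/bin/bash
# remove_empty_hist DIR: delete zero-size *-3.hist files directly inside DIR
# and print the name of each file removed. Non-empty files are kept.
# (Hypothetical helper for illustration.)
remove_empty_hist() {
    find "$1" -maxdepth 1 -type f -name '*-3.hist' -size 0 -print -exec rm -f {} +
}

# Demo in a temporary directory with stand-in files.
hist_demo=$(mktemp -d)
: > "$hist_demo/lane1-3.hist"              # empty leftover from the crash
: > "$hist_demo/lane2-3.hist"              # empty leftover from the crash
echo "100 5" > "$hist_demo/lane3-3.hist"   # stand-in for a finished histogram
remove_empty_hist "$hist_demo"
ls "$hist_demo"
```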

Campusrocks-0-6.local is back up, so I am restarting this task. At its peak the KAligner step (where it crashed previously when the -j option was enabled) requires quite a lot of RAM. I am hoping that the 30GB available on this node is sufficient.

Attempt 4

I have access to kolossus, which has 1.1TB of RAM. I will now run the program on kolossus to see if it will assemble there…

Step1: Install ABySS on kolossus. I followed exactly the same process as listed above, except with --prefix=/scratch/jstjohn on kolossus. The installation was straightforward and went without a hitch.

Binaries and libraries are located here:

/scratch/jstjohn/bin
/scratch/jstjohn/lib

Step2: Galt has already copied the banana slug Illumina reads to /scratch/galt/bananaSlug; I added the 454 fastq reads to that folder as well.

Step3: from screen on kolossus execute the following command:

set path = ( /scratch/jstjohn/bin $path )
abyss-pe -j j=4 k=35 n=2 mpirun="/scratch/jstjohn/bin/mpirun -machinefile machinefile -x PATH=/scratch/jstjohn/bin:$PATH" np=30 lib='lib1' lib1='/scratch/galt/bananaSlug/slug_1.fastq /scratch/galt/bananaSlug/slug_2.fastq' se='/scratch/galt/bananaSlug/GAZ7HUX02.fastq /scratch/galt/bananaSlug/GAZ7HUX03.fastq /scratch/galt/bananaSlug/GAZ7HUX04.fastq /scratch/galt/bananaSlug/GCLL8Y406.fastq' name=slugAbyss3

Note that this run combines both the Illumina runs and the 454 data for banana slug. I am also experimenting with k=35, since Galt had better luck with a kmer size of 31 than with a kmer size of 23 using SOAPdenovo; perhaps the trend continues into larger kmers. If this doesn't work for whatever reason, I will also try shorter and longer kmers.

We combined all fastq files into two large files representing the two read pairs. Each of these files is approximately 50GB and contains roughly 20GB of reads. Even on kolossus I am getting some out-of-disk-space errors in the following step:

KAligner   -j4 -k35 /scratch/galt/bananaSlug/slug_1.fastq /scratch/galt/bananaSlug/slug_2.fastq slugAbyss3-3.fa \
                |ParseAligns  -k35 -h lib1-3.hist \
                |sort -nk2,2 \
                |gzip >lib1-3.pair.gz

At its peak I have observed this step using about 50GB of RAM, but the issue appears to be the space available to the sort algorithm in kolossus's /tmp/ directory. I am trying this again so I can copy down the error and send it to cluster-admin, because kolossus should have around 400GB free of local HD space on top of its 1.1TB of RAM. (kolossus has more RAM than HD space: 1.1TB of RAM vs. 750GB of HD.)

To get around the issue of sort running out of space in its temp directory, I found an alternate form of the command in which you can supply your own temp directory to sort. Since there is plenty of room left on the hive, I issue the following command to generate the files myself. The nice thing is that since this is a makefile, once I have done this I can simply restart the assembler, and it will see the files I have manually generated and move on to the next step.

KAligner   -j4 -k35 /scratch/galt/bananaSlug/slug_1.fastq /scratch/galt/bananaSlug/slug_2.fastq slugAbyss3-3.fa \
                |ParseAligns  -k35 -h lib1-3.hist \
                |sort -T /hive/users/jstjohn/slugAssembly/tmp -nk2,2 \
                |gzip >lib1-3.pair.gz
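
The effect of -T can be tried on a tiny input first. The following sketch uses made-up read names and a temporary directory in place of the hive path. It also shows the TMPDIR environment variable, which sort falls back on when -T is not given; that may help when the sort call is buried inside a makefile that you would rather not edit.

```shell
#!/bin/bash
# Sort numerically on the second field, keeping sort's temporary files in a
# directory of our choosing instead of the default /tmp.
sort_tmp=$(mktemp -d)
sorted=$(printf '%s\n' 'readA 30' 'readB 4' 'readC 12' | sort -T "$sort_tmp" -nk2,2)
echo "$sorted"

# Same result via the TMPDIR environment variable, without touching the
# sort command line itself.
sorted_env=$(printf '%s\n' 'readA 30' 'readB 4' 'readC 12' | TMPDIR="$sort_tmp" sort -nk2,2)
echo "$sorted_env"
```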

This step takes a lot of time. After running for approximately 24 hours it finally finished, and then when I tried to restart the makefile I accidentally executed the previous KAligner command and the lib1-3.pair.gz file was overwritten… For now I am going to let the Ray assembly finish on Kolossus, and then I will re-run this step. Note that since campusrocks-0-6.local is back online I am also re-trying this stage in “Attempt 3” above. Since the run on campusrocks is split up between the 7 lanes rather than one large run, it is possible that it will work even with limited RAM. One of the selling points of ABySS is that it is supposed to run on “commodity hardware”, so we will see if it lives up to that claim.
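
A guard against this kind of accident is sketched below. It relies on the shell's noclobber option, which makes a redirection fail if the target file already exists. The helper name write_once is made up for this example, and the demo writes plain text to a stand-in file in a temporary directory.

```shell
#!/bin/bash
# write_once FILE: copy standard input to FILE, but refuse (non-zero status)
# if FILE already exists. noclobber is set inside a subshell so the option
# does not leak into the rest of the session.
# (Hypothetical helper for illustration.)
write_once() {
    ( set -o noclobber; cat > "$1" ) 2>/dev/null
}

guard_demo=$(mktemp -d)
echo "first result" | write_once "$guard_demo/lib1-3.pair.gz"
if ! echo "second result" | write_once "$guard_demo/lib1-3.pair.gz"; then
    echo "refused to overwrite existing file"
fi
cat "$guard_demo/lib1-3.pair.gz"
```

In the pipeline above, the final stage would then read gzip | write_once lib1-3.pair.gz instead of gzip >lib1-3.pair.gz.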

References

1. Jared T. Simpson, Kim Wong, Shaun D. Jackman, Jacqueline E. Schein, Steven J.M. Jones, and İnanç Birol. ABySS: A parallel assembler for short read sequence data. Genome Res. June 2009 19: 1117-1123; Published in Advance February 27, 2009, doi:10.1101/gr.089532.108.
3. İnanç Birol, Shaun D. Jackman, Cydney B. Nielsen, Jenny Q. Qian, Richard Varhol, Greg Stazyk, Ryan D. Morin, Yongjun Zhao, Martin Hirst, Jacqueline E. Schein, Doug E. Horsman, Joseph M. Connors, Randy D. Gascoyne, Marco A. Marra, and Steven J. M. Jones. De novo transcriptome assembly with ABySS. Bioinformatics 25: 2872-2877. Advance Access published on November 1, 2009, doi:10.1093/bioinformatics/btp367.