====== Celera Assembler ======
The Celera Assembler is a De novo whole genome sequence assembler, designed by Celera Genomics. It was created in 1999 and remained proprietary until its release under the GNU license in 2004. With its open-source release came a name change to wgs-assembler. It was also included in the 454 pipeline under the name CABOG in 2008.
Here is a link to the [[http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page|Source Forge Page]]
===== Installation instructions =====
Celera assembler is now installed under /campusdata/BME235/programs/Celera following the instructions given at [[http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Version_5.4_Release_Notes | the Source Forge Release Notes Page]].
Following steps were followed to install :
% bzip2 -dc wgs-6.0-beta.tar.bz2 | tar -xf -
% cd wgs-6.0-beta
% cd kmer
% sh configure.sh
% gmake install
% cd ../src
% gmake
% cd ..
Binary distributions need only be unpacked.
% bzip2 -dc wgs-6.0-beta-Linux-amd64.tar.bz2 | tar -xf -
===== Assembly of Slug Illumina =====
First I converted the illumina .txt output files into .fastq files using "illuminaToFastq" which is a c program I wrote with source located here:
programs/johnScripts/illuminaToFastq.c
This program takes the two files of each corresponding read pair, checks that both reads in each pair pass illumina's quality filter, and outputs them both into their two respective .fastq files if they both pass the quality filter. Simply judging from the file size of the output, there weren't too many reads weeded out in this step. The reads that pass illumina's quality filter can still be pretty crappy.
After generating these .fastq files, I concatenated the files corresponding to one end of the paired end from each library into a single mate from a single lane. This is necessary in case we want to exclude certain lanes, or tell celera that different lanes have different insert lengths (which it looks like we do want to do).
Finally I generate celera formated files (simply pointers to the .fastq files with additional information tacked on) using the fastqToCA script and the corresponding insert length for each lane.
I exclude lane 4 because it is the controle.
Lane 1-3 have a library size of 202-320. Each read is approximately 75 bases, totaling 150 bases for the read and a corresponding insert length range of (202-150) to (320-150) = 52-170bp insert. So our mean insert is 111 +/- 59.
Lane 5-8 have a library size of 225-263. Each read is approximately 75 bases, totaling 150 bases for the read and a corresponding insert length range of (225-150)-(263-150) = 75 to 113 which gives a mean of 94 +/- 19.
I used these numbers to define the insert size and deviation for the corresponding lane pair files to generate .frg files for celera.
All of these files including .frg output and concatenated fastq reads may be found here:
/campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads
==== Generate ".frg" files from fastq files ====
fastqToCA -insertsize 111 59 -libraryname i1l1 -type illumina -fastq `pwd`/s_1_1_all_qseq.fastq,`pwd`/s_1_2_all_qseq.fastq >s_1_all_all_qseq.frg
fastqToCA -insertsize 111 59 -libraryname i1l2 -type illumina -fastq `pwd`/s_2_1_all_qseq.fastq,`pwd`/s_2_2_all_qseq.fastq >s_2_all_all_qseq.frg
fastqToCA -insertsize 111 59 -libraryname i1l3 -type illumina -fastq `pwd`/s_3_1_all_qseq.fastq,`pwd`/s_3_2_all_qseq.fastq >s_3_all_all_qseq.frg
fastqToCA -insertsize 94 19 -libraryname i1l5 -type illumina -fastq `pwd`/s_5_1_all_qseq.fastq,`pwd`/s_5_2_all_qseq.fastq >s_5_all_all_qseq.frg
fastqToCA -insertsize 94 19 -libraryname i1l6 -type illumina -fastq `pwd`/s_6_1_all_qseq.fastq,`pwd`/s_6_2_all_qseq.fastq >s_6_all_all_qseq.frg
fastqToCA -insertsize 94 19 -libraryname i1l7 -type illumina -fastq `pwd`/s_7_1_all_qseq.fastq,`pwd`/s_7_2_all_qseq.fastq >s_7_all_all_qseq.frg
fastqToCA -insertsize 94 19 -libraryname i1l8 -type illumina -fastq `pwd`/s_8_1_all_qseq.fastq,`pwd`/s_8_2_all_qseq.fastq >s_8_all_all_qseq.frg
==== Run on high memory machine ====
Celera requires lots of Memory and Hard disk space to build up its substantial database of overlaps with the high number of reads that we have in our illumina dataset.
After several unsuccessful attempts at getting celera to work on our data, despite the fact that they claim it should be capable given sufficient resources, they finally posted an example settings file that they claimed would work on our data. Their example didn't work for me, but it got me close enough that I was able to tweek some settings resulting in the following settings file:
Working spec file:
# -------------------------------------
# SCIENCE
# -------------------------------------
#
# Expected rate of sequencing error. Allow pairwise alignments up to this rate.
# Sanger reads can use values less than one. Titanium reads require 3% in unitig.
#
shell = /bin/bash
utgErrorRate=0.03
ovlErrorRate=0.06 # Larger than utg to allow for correction.
cnsErrorRate=0.10 # Larger than utg to avoid occasional consensus failures
cgwErrorRate=0.10 # Larger than utg to allow contig merges across high-error ends
#
merSize = 22 # default=22; use lower to combine across heterozygosity, higher to separate near-identical repeat copies
overlapper=mer # the mer overlapper for 454-like data is insensitive to homopolymer problems but requires more RAM and disk
merOverlapperThreads= 2
merOverlapperSeedBatchSize= 5000000 #integer (default=100000) The number of fragments used per batch of seed finding. The amount of memory used is directly proportional to the number of fragments. (sorry, no documentation on what that relationship is, yet)
merOverlapperExtendBatchSize= 5000000 #integer (default=75000) The number of fragments used per batch of seed extension. The amount of memory used is directly proportional to the number of fragments. See option for hits, but use those numbers with caution fragments. See option frgCorrBatchSize for hits, but use those numbers with caution
merOverlapperSeedConcurrency= 30 #integer (default=1) If not on the grid, run this many seed finding processes on the local machine at the same time
merOverlapperExtendConcurrency= 30 # integer (default=1) If not on the grid, run this many seed extension processes on the local machine at the same time
#
unitigger = bog
utgBubblePopping = 1
# utgGenomeSize = # not set!
#
# -------------------------------------
# OFF-GRID ENGINEERING
# -------------------------------------
# MERYL calculates K-mer seeds
#merylMemory = 44000
merylMemory = 512000
merylThreads = 25
#
# OVERLAPPER calculates overlaps
#ovlMemory = 8GB
#ovlThreads = 2
#ovlConcurrency = 30
#ovlHashBlockSize = 2000000
#ovlRefBlockSize = 32000000
#
# OVERLAP STORE build the database
ovlStoreMemory = 131072 # This is single-process
#
# ERROR CORRECTION applied to overlaps
frgCorrThreads = 10
frgCorrConcurrency = 3
ovlCorrBatchSize = 1000000
ovlCorrConcurrency = 25
#
# UNITIGGER configuration
#
# CONSENSUS configuration
cnsConcurrency = 16
useGrid = 0
scriptOnGrid = 0
/scratch/galt/bananaSlug/GAZ7HUX02.frg
/scratch/galt/bananaSlug/GAZ7HUX03.frg
/scratch/galt/bananaSlug/GAZ7HUX04.frg
/scratch/galt/bananaSlug/slug_pair.frg #the frg file is made from the fastqToCA utility
/scratch/galt/bananaSlug/GCLL8Y406.frg
With this configuration it seems to be performing decently well on kolossus. The max memory I have observed the program consume is around 400GB, so we definitely need a high memory system to work with a large dataset on a large genome.
To execute the program I saved the above settings into a file called run1.spec. To run the file with these settings I issue the following command:
runCA -d celeraSlug1 -p slugCelera -s run1.spec
This command tells the program to make/work in the celeraSlug1 directory, and append the slugCelera prefix to its output, using the settings in run1.spec.