Banana Slug Genomics



====== Celera Assembler ======

The Celera Assembler is a de novo whole-genome sequence assembler designed by Celera Genomics. It was created in 1999 and remained proprietary until its release under the GNU General Public License in 2004. With the open-source release came a name change to wgs-assembler. In 2008 the pipeline was revised to handle 454 reads under the name CABOG (Celera Assembler with the Best Overlap Graph). Here is a link to the [[http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page|SourceForge page]].

===== Installation instructions =====

Celera Assembler is now installed under /campusdata/BME235/programs/Celera, following the instructions given at [[http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Version_5.4_Release_Notes | the SourceForge release notes page]]. The following steps were used to install it:

<code>
% bzip2 -dc wgs-6.0-beta.tar.bz2 | tar -xf -
% cd wgs-6.0-beta
% cd kmer
% sh configure.sh
% gmake install
% cd ../src
% gmake
% cd ..

Binary distributions need only be unpacked.

% bzip2 -dc wgs-6.0-beta-Linux-amd64.tar.bz2 | tar -xf -
</code>

===== Assembly of Slug Illumina =====

First I converted the Illumina .txt output files into .fastq files using "illuminaToFastq", a C program I wrote whose source is located here: programs/johnScripts/illuminaToFastq.c

This program takes the two files of each corresponding read pair, checks that both reads in each pair pass Illumina's quality filter, and writes them to their two respective .fastq files only if both pass. Judging simply from the file size of the output, not many reads were weeded out in this step. Reads that pass Illumina's quality filter can still be of quite low quality.

After generating these .fastq files, I concatenated the files corresponding to one end of the paired end from each library into a single file per mate per lane.
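The pair-filtering step described above can be sketched in Python. This is not the source of illuminaToFastq.c; it is a minimal illustration that assumes the input files are in Illumina's 11-column, tab-separated qseq layout (the last column is the filter flag, 1 for pass), that both files list the reads in the same order, and that quality strings are copied through unchanged. The function names and the read-naming scheme are hypothetical.

```python
def qseq_to_fastq(line):
    """Turn one qseq line into (passed_filter, fastq_record).

    Assumes the 11-column qseq layout: machine, run, lane, tile, x, y,
    index, read number, sequence, quality, filter flag.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 tab-separated qseq fields, got %d" % len(fields))
    machine, run, lane, tile, x, y, index, read_no, seq, qual, passed = fields
    # Hypothetical naming scheme; the real program may name reads differently.
    name = "%s_%s:%s:%s:%s:%s/%s" % (machine, run, lane, tile, x, y, read_no)
    seq = seq.replace(".", "N")  # qseq writes uncalled bases as '.'
    record = "@%s\n%s\n+\n%s\n" % (name, seq, qual)
    return passed == "1", record


def filter_pairs(qseq1, qseq2, out1, out2):
    """Write a pair to out1/out2 only if BOTH mates pass the filter.

    qseq1 and qseq2 are iterables of lines (for example open files) in the
    same read order; zip stops at the end of the shorter one.
    Returns (pairs_kept, pairs_seen).
    """
    kept = total = 0
    for line1, line2 in zip(qseq1, qseq2):
        total += 1
        ok1, rec1 = qseq_to_fastq(line1)
        ok2, rec2 = qseq_to_fastq(line2)
        if ok1 and ok2:
            out1.write(rec1)
            out2.write(rec2)
            kept += 1
    return kept, total
```

Dropping both mates together keeps the two output files in step, so record N of one file always pairs with record N of the other.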
Keeping one file per mate per lane is necessary in case we want to exclude certain lanes, or tell Celera that different lanes have different insert lengths (which it looks like we do want to do). Finally, I generated Celera-formatted files (simply pointers to the .fastq files with additional information tacked on) using the fastqToCA script and the corresponding insert length for each lane. I excluded lane 4 because it is the control.

Lanes 1-3 have a library size of 202-320 bp. Each read is approximately 75 bases, totaling 150 bases for the pair, which gives an insert length range of (202-150) to (320-150) = 52-170 bp. So our mean insert is 111 +/- 59.

Lanes 5-8 have a library size of 225-263 bp. Each read is approximately 75 bases, totaling 150 bases for the pair, which gives an insert length range of (225-150) to (263-150) = 75-113 bp, for a mean of 94 +/- 19.

I used these numbers to define the insert size and deviation for the corresponding lane pair files when generating the .frg files for Celera. All of these files, including the .frg output and the concatenated fastq reads, may be found here: /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads

==== Generate ".frg" files from fastq files ====

<code>
fastqToCA -insertsize 111 59 -libraryname i1l1 -type illumina -fastq `pwd`/s_1_1_all_qseq.fastq,`pwd`/s_1_2_all_qseq.fastq > s_1_all_all_qseq.frg
fastqToCA -insertsize 111 59 -libraryname i1l2 -type illumina -fastq `pwd`/s_2_1_all_qseq.fastq,`pwd`/s_2_2_all_qseq.fastq > s_2_all_all_qseq.frg
fastqToCA -insertsize 111 59 -libraryname i1l3 -type illumina -fastq `pwd`/s_3_1_all_qseq.fastq,`pwd`/s_3_2_all_qseq.fastq > s_3_all_all_qseq.frg
fastqToCA -insertsize 94 19 -libraryname i1l5 -type illumina -fastq `pwd`/s_5_1_all_qseq.fastq,`pwd`/s_5_2_all_qseq.fastq > s_5_all_all_qseq.frg
fastqToCA -insertsize 94 19 -libraryname i1l6 -type illumina -fastq `pwd`/s_6_1_all_qseq.fastq,`pwd`/s_6_2_all_qseq.fastq > s_6_all_all_qseq.frg
fastqToCA -insertsize 94 19 -libraryname i1l7 -type illumina -fastq `pwd`/s_7_1_all_qseq.fastq,`pwd`/s_7_2_all_qseq.fastq > s_7_all_all_qseq.frg
fastqToCA -insertsize 94 19 -libraryname i1l8 -type illumina -fastq `pwd`/s_8_1_all_qseq.fastq,`pwd`/s_8_2_all_qseq.fastq > s_8_all_all_qseq.frg
</code>

==== Run on high memory machine ====

Celera has basically stalled out on campusrocks. It crashes outright when you try to use the entire slug dataset. As a result I have experimented with running it on kolossus. The initial runs on kolossus caused even that machine to crash, although the problem was narrowed down to a faulty driver for the system's network card (NIC), which this program was somehow overloading. The program is like ABySS in that it can be restarted and will pick up where it left off. Here is the settings file I am currently using on kolossus to do the assembly:

<code>
useGrid = 0
scriptOnGrid = 0
merylMemory = 150GB -segments 20 -threads 20
ovlMemory = 8GB --hashload 0.8 --hashstrings 100000
ovlThreads = 16
ovlHashBlockSize = 180000
ovlRefBlockSize = 2000000
shell = /bin/bash
overlapper = mer
obtOverlapper = mer
ovlOverlapper = mer
ovlStoreMemory = 40960
merSize = 25
obtMerSize = 25
ovlMerSize = 25
merOverlapperThreads = 16
frgCorrBatchSize = 1000000
frgCorrThreads = 16
doToggle = 1
closureOverlaps = 0
closurePlacement = 2
utgErrorRate = 0.015
/scratch/galt/bananaSlug/GAZ7HUX02.frg
/scratch/galt/bananaSlug/GAZ7HUX03.frg
/scratch/galt/bananaSlug/GAZ7HUX04.frg
/scratch/galt/bananaSlug/slug_pair.frg
/scratch/galt/bananaSlug/GCLL8Y406.frg
</code>

Since the program doesn't have access to Sun Grid Engine on this machine, those aspects of its parallelization are disabled, although the settings do enable multi-threading.
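To make the layout of the settings file concrete, here is a small Python sketch that splits a spec into its options and its list of fragment files. It is not Celera's own parser. The rules are assumptions read off the file shown on this page: a line containing `=` is an option, any other non-blank line is a .frg path, and lines starting with `#` are skipped as comments. The name `parse_spec` is hypothetical.

```python
def parse_spec(text):
    """Split spec-file text into (options, frg_files).

    Assumed rules: 'key = value' lines are options (split on the first
    '='), '#' lines are comments, and every other non-blank line is the
    path of a fragment file.
    """
    options = {}
    frg_files = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "=" in line:
            key, _, value = line.partition("=")
            options[key.strip()] = value.strip()
        else:
            frg_files.append(line)
    return options, frg_files
```

A quick check such as `options, frgs = parse_spec(open("run1.spec").read())` makes it easy to confirm, before launching a long run, that every listed .frg file exists and that the thread counts match the machine.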
Save the above settings in a file called run1.spec and execute the following commands to start the assembly, or to restart it after a crash:

<code>
set path = ( /scratch/jstjohn/wgs-6.1/Linux-amd64/bin $path )
runCA-OBT.pl -p slugCelera -d celeraSlug1 -s run1.spec
</code>

Currently the program is at the meryl overlapper stage of the assembly. Note that this program requires quite a large system to do a reasonably sized assembly. I am currently using approximately 160 GB of disk space for the storage of overlap information and whatever else the algorithm caches on disk, and the maximum memory usage I have observed so far is 50 GB of RAM.
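Because the same command both starts and resumes the assembly, the launch can be wrapped in a small script that checks free disk space first and re-runs the command if it exits with an error. The sketch below is an illustration only: the command line is the one given on this page, the 160 GB threshold is the disk usage observed so far rather than a known requirement, and the function names and the retry limit are hypothetical.

```python
import shutil
import subprocess


def build_command(prefix, directory, spec):
    """Return the assembler command from this page as an argument list."""
    return ["runCA-OBT.pl", "-p", prefix, "-d", directory, "-s", spec]


def has_free_space(path, required_gb):
    """True if the filesystem holding `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


def run_with_restarts(cmd, max_attempts=3):
    """Run cmd, re-running it after a non-zero exit.

    Relies on the assembler resuming where it left off when the same
    command is issued again. Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return attempt
    raise RuntimeError("assembly still failing after %d attempts" % max_attempts)
```

A typical call would check `has_free_space("/scratch", 160)` and then pass `build_command("slugCelera", "celeraSlug1", "run1.spec")` to `run_with_restarts`. A retry limit is used so that a persistent failure, such as the network driver problem described above, does not loop forever.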
