====== Celera Assembler ======

The Celera Assembler is a de novo whole-genome sequence assembler designed by Celera Genomics. It was created in 1999 and remained proprietary until its release under the GNU General Public License in 2004. With its open-source release came a name change to wgs-assembler. In 2008 a revised pipeline that also handles 454 reads was released under the name CABOG. Here is a link to the [[http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page|SourceForge page]].

===== Installation instructions =====

Celera Assembler is now installed under /campusdata/BME235/programs/Celera, following the instructions given at [[http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Version_5.4_Release_Notes|the SourceForge release notes page]]. The following steps were used to install it:

<code>
% bzip2 -dc wgs-6.0-beta.tar.bz2 | tar -xf -
% cd wgs-6.0-beta
% cd kmer
% sh configure.sh
% gmake install
% cd ../src
% gmake
% cd ..

# Binary distributions need only be unpacked.
% bzip2 -dc wgs-6.0-beta-Linux-amd64.tar.bz2 | tar -xf -
</code>

===== Assembly of Slug Illumina =====

First I converted the Illumina .txt output files into .fastq files using "illuminaToFastq", a C program I wrote, with source located here: programs/johnScripts/illuminaToFastq.c

This program takes the two files of each corresponding read pair, checks that both reads in each pair pass Illumina's quality filter, and writes both reads to their respective .fastq files only if both pass. Judging simply from the file size of the output, not many reads were weeded out in this step. The reads that pass Illumina's quality filter can still be of pretty poor quality.

After generating these .fastq files, I concatenated the files for each end of the read pair so that each lane has a single .fastq file per mate.
Keeping the lanes separate is necessary in case we want to exclude certain lanes or tell Celera that different lanes have different insert lengths (which it looks like we do want to do). Finally, I generated Celera-formatted .frg files (simply pointers to the .fastq files with additional information tacked on) using the fastqToCA script and the corresponding insert length for each lane. I excluded lane 4 because it is the control.

Lanes 1-3 have a library size of 202-320 bp. Each read is approximately 75 bases, totaling 150 bases for the read pair, which gives an insert length range of (202-150) to (320-150) = 52-170 bp. So the mean insert is 111 +/- 59 bp. Lanes 5-8 have a library size of 225-263 bp. With the same 150 bases per read pair, the insert length range is (225-150) to (263-150) = 75-113 bp, which gives a mean of 94 +/- 19 bp.

I used these numbers to define the insert size and deviation for the corresponding lane pair files when generating the .frg files for Celera. All of these files, including the .frg output and the concatenated .fastq reads, may be found here: /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads
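
The insert-size arithmetic for the two lane groups can be reproduced with a short Python sketch. The function name insert_size and the dictionary lane_groups are made up for illustration, and the 75-base read length is the approximate value quoted in the text. The "deviation" here is simply half the width of the range, as in the text, not a measured standard deviation.

```python
def insert_size(library_min, library_max, read_length=75):
    """Return (mean, deviation) of the insert range, following the text.

    The bases covered by the two reads of a pair (2 * read_length) are
    subtracted from both ends of the library size range. The mean is the
    midpoint of the remaining range and the deviation is half its width.
    """
    paired_bases = 2 * read_length
    low = library_min - paired_bases
    high = library_max - paired_bases
    mean = (low + high) / 2
    deviation = (high - low) / 2
    return mean, deviation


# Library size ranges quoted in the text; lane 4 (the control) is left out.
lane_groups = {
    "lanes 1-3": (202, 320),
    "lanes 5-8": (225, 263),
}

for name, (lib_min, lib_max) in lane_groups.items():
    mean, deviation = insert_size(lib_min, lib_max)
    print(f"{name}: {mean:g} +/- {deviation:g}")
```

This prints 111 +/- 59 for lanes 1-3 and 94 +/- 19 for lanes 5-8, matching the values above. Whether the assembler expects this distance between the reads or the full fragment length is worth checking against the fastqToCA documentation before reusing the numbers.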

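The pair-filtering step described at the start of this section can be sketched in Python as well. This is not the illuminaToFastq C program, only an illustration of the "keep a pair only if both reads pass" rule. It assumes qseq-style input (11 tab-separated fields per line, with the sequence, quality string, and filter flag in the last three fields); the text does not say which Illumina .txt format was actually used. The function names are hypothetical.

```python
def qseq_to_fastq(line):
    """Convert one qseq-style line to (fastq_record, passed_filter).

    ASSUMPTION: 11 tab-separated fields, with the read number in field 8,
    the sequence in field 9, the quality string in field 10, and the
    filter flag ("1" = pass) in field 11.
    """
    fields = line.rstrip("\n").split("\t")
    name = ":".join(fields[0:7]) + "/" + fields[7]
    sequence = fields[8].replace(".", "N")  # qseq writes unknown bases as "."
    quality = fields[9]  # copied unchanged; no re-encoding is attempted
    passed = fields[10] == "1"
    return f"@{name}\n{sequence}\n+\n{quality}\n", passed


def filter_pairs(read1_lines, read2_lines):
    """Yield (fastq1, fastq2) only for pairs where both reads pass the filter.

    The two inputs must list the reads in the same order, one line per read.
    """
    for line1, line2 in zip(read1_lines, read2_lines):
        record1, passed1 = qseq_to_fastq(line1)
        record2, passed2 = qseq_to_fastq(line2)
        if passed1 and passed2:
            yield record1, record2


# Tiny in-memory example: the second pair is dropped because read 1 fails.
read1 = [
    "M\t1\t1\t1\t10\t20\t0\t1\tACG.\thhhh\t1\n",
    "M\t1\t1\t1\t11\t21\t0\t1\tACGT\thhhh\t0\n",
]
read2 = [
    "M\t1\t1\t1\t10\t20\t0\t2\tTTTT\thhhh\t1\n",
    "M\t1\t1\t1\t11\t21\t0\t2\tGGGG\thhhh\t1\n",
]
kept = list(filter_pairs(read1, read2))
print(len(kept))
```

Dropping both reads whenever either one fails keeps the two output .fastq files in step with each other, so the reads of each pair stay on matching lines.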
archive/bioinformatic_tools/celera.1273010736.txt.gz · Last modified: 2010/05/04 22:05 by jstjohn