fastqToCA -insertsize 94 19 -libraryname i1l7 -type illumina -fastq `pwd`/s_7_1_all_qseq.fastq,`pwd`/s_7_2_all_qseq.fastq >s_7_all_all_qseq.frg
fastqToCA -insertsize 94 19 -libraryname i1l8 -type illumina -fastq `pwd`/s_8_1_all_qseq.fastq,`pwd`/s_8_2_all_qseq.fastq >s_8_all_all_qseq.frg
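# Notes on the fastqToCA flags above (my reading of the utility; double-check against your CA release):
#   -insertsize 94 19   mean insert size of 94 bp with a standard deviation of 19 bp
#   -libraryname i1l7   library name recorded in the resulting .frg file
#   -type illumina      Illumina-style quality encoding in the input fastq
#   -fastq r1,r2        comma-separated paired-end mate files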
- | |||
- | |||
==== Run on high memory machine ====
Celera requires a lot of memory and hard disk space to build its substantial database of overlaps, given the high number of reads in our Illumina dataset.
After several unsuccessful attempts at getting Celera to work on our data, despite the developers' claim that it should be capable given sufficient resources, they finally posted an example settings file that they said would work for us. Their example did not work for me as-is, but it got me close enough that I was able to tweak a few settings, resulting in the spec file below.
Working spec file:
<code>
# -------------------------------------
# SCIENCE
# -------------------------------------
#
# Expected rate of sequencing error. Allow pairwise alignments up to this rate.
# Sanger reads can use values less than one. Titanium reads require 3% in unitig.
#
shell = /bin/bash
utgErrorRate=0.03
ovlErrorRate=0.06  # Larger than utg to allow for correction.
cnsErrorRate=0.10  # Larger than utg to avoid occasional consensus failures
cgwErrorRate=0.10  # Larger than utg to allow contig merges across high-error ends
#
merSize = 22    # default=22; use lower to combine across heterozygosity, higher to separate near-identical repeat copies
overlapper=mer  # the mer overlapper for 454-like data is insensitive to homopolymer problems but requires more RAM and disk
merOverlapperThreads= 2
merOverlapperSeedBatchSize= 5000000    # integer (default=100000) The number of fragments used per batch of seed finding. The amount of memory used is directly proportional to the number of fragments. (sorry, no documentation on what that relationship is, yet)
merOverlapperExtendBatchSize= 5000000  # integer (default=75000) The number of fragments used per batch of seed extension. The amount of memory used is directly proportional to the number of fragments. See option frgCorrBatchSize for hints, but use those numbers with caution.
merOverlapperSeedConcurrency= 30    # integer (default=1) If not on the grid, run this many seed finding processes on the local machine at the same time
merOverlapperExtendConcurrency= 30  # integer (default=1) If not on the grid, run this many seed extension processes on the local machine at the same time
#
unitigger = bog
utgBubblePopping = 1
# utgGenomeSize =   # not set!
#
# -------------------------------------
# OFF-GRID ENGINEERING
# -------------------------------------
# MERYL calculates K-mer seeds
#merylMemory = 44000
merylMemory = 512000
merylThreads = 25
#
# OVERLAPPER calculates overlaps
#ovlMemory = 8GB
#ovlThreads = 2
#ovlConcurrency = 30
#ovlHashBlockSize = 2000000
#ovlRefBlockSize = 32000000
#
# OVERLAP STORE build the database
ovlStoreMemory = 131072  # This is single-process
#
# ERROR CORRECTION applied to overlaps
frgCorrThreads = 10
frgCorrConcurrency = 3
ovlCorrBatchSize = 1000000
ovlCorrConcurrency = 25
#
# UNITIGGER configuration
#
# CONSENSUS configuration
cnsConcurrency = 16

useGrid = 0
scriptOnGrid = 0
/scratch/galt/bananaSlug/GAZ7HUX02.frg
/scratch/galt/bananaSlug/GAZ7HUX03.frg
/scratch/galt/bananaSlug/GAZ7HUX04.frg
/scratch/galt/bananaSlug/slug_pair.frg  # this frg file was made with the fastqToCA utility
/scratch/galt/bananaSlug/GCLL8Y406.frg
</code>
With this configuration it seems to be performing decently well on kolossus. The most memory I have observed the program consume is around 400GB, so we definitely need a high-memory system to work with a large dataset on a large genome.
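A quick way to keep an eye on that memory footprint while the assembler runs (plain GNU ps on Linux, nothing Celera-specific; the 60-second interval is arbitrary):
<code>
# list the five largest processes by resident memory, refreshed every 60 seconds
watch -n 60 'ps -eo rss,vsz,comm --sort=-rss | head -n 6'
</code>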
+ | |||
+ | To execute the program I saved the above settings into a file called run1.spec. To run the file with these settings I issue the following command: | ||
<code>
runCA -d celeraSlug1 -p slugCelera -s run1.spec
</code>
+ | |||
+ | This command tells the program to make/work in the celeraSlug1 directory, and append the slugCelera prefix to its output, using the settings in run1.spec. | ||
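If you want the run to survive logging out of kolossus, plain nohup works (the log filename here is just an example):
<code>
nohup runCA -d celeraSlug1 -p slugCelera -s run1.spec > celeraSlug1.log 2>&1 &
</code>
runCA can also be restarted after a crash and will pick up roughly where it left off, so re-issuing the same command resumes the run.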