archive:bioinformatic_tools:celera

This page was moved from bioinformatic_tools:celera to archive:bioinformatic_tools:celera on 2015/07/28 06:22 by ceisenhart; the content below was written 2010/05/19 17:55 by jstjohn.
  fastqToCA -insertsize 94 19 -libraryname i1l8 -type illumina -fastq `pwd`/s_8_1_all_qseq.fastq,`pwd`/s_8_2_all_qseq.fastq > s_8_all_all_qseq.frg
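For reference, the -insertsize option takes the library's mean insert length and standard deviation in base pairs (94 ± 19 bp for this lane-8 library), and the two comma-separated fastq files are the forward and reverse mates. A second paired-end library would follow the same pattern (lane 7 here is purely illustrative, not one of our real files; its insert-size numbers would come from that library's own prep):

<code>
# Hypothetical lane-7 library; substitute that library's own mean/stddev.
fastqToCA -insertsize 94 19 -libraryname i1l7 -type illumina \
  -fastq `pwd`/s_7_1_all_qseq.fastq,`pwd`/s_7_2_all_qseq.fastq > s_7_all_all_qseq.frg
</code>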
==== Run on high memory machine ====
Celera requires a lot of memory and hard disk space to build its substantial database of overlaps, given the high number of reads in our Illumina dataset.
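Before launching a run it is worth checking that the machine actually has the headroom; a minimal pre-flight sketch, assuming a Linux host with /proc/meminfo and that the assembly runs in the current directory:

<code>
# Report total RAM and free disk in GB so we know the machine can
# hold Celera's overlap database before committing to a long run.
awk '/MemTotal/ {printf "total RAM: %d GB\n", $2/1024/1024}' /proc/meminfo
df -Pk . | awk 'NR==2 {printf "free disk: %d GB\n", $4/1024/1024}'
</code>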
  
After several unsuccessful attempts at getting Celera to work on our data, despite the developers' claim that it should be capable given sufficient resources, they finally posted an example settings file that they said would work on our data. Their example didn't work for me, but it got me close enough that I was able to tweak some settings, resulting in the following spec file.

Working spec file:
<code>
# -------------------------------------
#  SCIENCE
# -------------------------------------
#
#  Expected rate of sequencing error. Allow pairwise alignments up to this rate.
#  Sanger reads can use values less than one. Titanium reads require 3% in unitig.
#
shell = /bin/bash
utgErrorRate = 0.03
ovlErrorRate = 0.06   # Larger than utg to allow for correction.
cnsErrorRate = 0.10   # Larger than utg to avoid occasional consensus failures.
cgwErrorRate = 0.10   # Larger than utg to allow contig merges across high-error ends.
#
merSize = 22          # default=22; lower merges across heterozygosity, higher separates near-identical repeat copies
overlapper = mer      # the mer overlapper for 454-like data is insensitive to homopolymer problems but needs more RAM and disk
merOverlapperThreads = 2
merOverlapperSeedBatchSize = 5000000    # default=100000; fragments per seed-finding batch; memory use grows with this number
merOverlapperExtendBatchSize = 5000000  # default=75000; fragments per seed-extension batch; see frgCorrBatchSize for hints, but use those numbers with caution
merOverlapperSeedConcurrency = 30       # default=1; off-grid, run this many seed-finding processes locally at once
merOverlapperExtendConcurrency = 30     # default=1; off-grid, run this many seed-extension processes locally at once
#
unitigger = bog
utgBubblePopping = 1
# utgGenomeSize =   # not set!
#
# -------------------------------------
#  OFF-GRID ENGINEERING
# -------------------------------------
#  MERYL calculates k-mer seeds
#merylMemory = 44000
merylMemory = 512000
merylThreads = 25
#
#  OVERLAPPER calculates overlaps
#ovlMemory = 8GB
#ovlThreads = 2
#ovlConcurrency = 30
#ovlHashBlockSize = 2000000
#ovlRefBlockSize = 32000000
#
#  OVERLAP STORE builds the database
ovlStoreMemory = 131072   # this is single-process
#
#  ERROR CORRECTION applied to overlaps
frgCorrThreads = 10
frgCorrConcurrency = 3
ovlCorrBatchSize = 1000000
ovlCorrConcurrency = 25
#
#  CONSENSUS configuration
cnsConcurrency = 16
#
useGrid = 0
scriptOnGrid = 0
/scratch/galt/bananaSlug/GAZ7HUX02.frg
/scratch/galt/bananaSlug/GAZ7HUX03.frg
/scratch/galt/bananaSlug/GAZ7HUX04.frg
/scratch/galt/bananaSlug/slug_pair.frg   # this frg file is made with the fastqToCA utility
/scratch/galt/bananaSlug/GCLL8Y406.frg
</code>
  
With this configuration it seems to be performing decently well on kolossus. The most memory I have observed the program consume is around 400GB, so we definitely need a high-memory system to work with a large dataset on a large genome.

To execute the program I saved the above settings into a file called run1.spec and issued the following command:
<code>
runCA -d celeraSlug1 -p slugCelera -s run1.spec
</code>

This command tells the program to create and work in the celeraSlug1 directory, prefix its output files with slugCelera, and use the settings in run1.spec.
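One operational note from earlier runs: runCA checkpoints its progress inside the -d directory, so if a run dies partway it can be restarted and will pick up where it left off. Re-issuing the identical command resumes from the last completed stage rather than restarting the assembly:

<code>
# Re-running the same command resumes the existing assembly in
# celeraSlug1/ instead of recomputing finished stages.
runCA -d celeraSlug1 -p slugCelera -s run1.spec
</code>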
  
archive/bioinformatic_tools/celera.1274291735.txt.gz · Last modified: 2010/05/19 17:55 by jstjohn