archive:bioinformatic_tools:celera

Differences

This shows you the differences between two versions of the page.

archive:bioinformatic_tools:celera [2010/05/04 22:23]
jstjohn
archive:bioinformatic_tools:celera [2015/07/28 06:22] (current)
ceisenhart ↷ Page moved from bioinformatic_tools:celera to archive:bioinformatic_tools:celera
Line 45: Line 45:
 All of these files including .frg output and concatenated fastq reads may be found here:
   /campus/BME235/data/slug/Illumina/illumina_run_1/CeleraReads
-==== Generate fastq ====
+==== Generate ".frg" files from fastq files ====
   fastqToCA -insertsize 111 59 -libraryname i1l1 -type illumina -fastq `pwd`/s_1_1_all_qseq.fastq,`pwd`/s_1_2_all_qseq.fastq >s_1_all_all_qseq.frg
   fastqToCA -insertsize 111 59 -libraryname i1l2 -type illumina -fastq `pwd`/s_2_1_all_qseq.fastq,`pwd`/s_2_2_all_qseq.fastq >s_2_all_all_qseq.frg
Line 54: Line 53:
   fastqToCA -insertsize 94 19 -libraryname i1l7 -type illumina -fastq `pwd`/s_7_1_all_qseq.fastq,`pwd`/s_7_2_all_qseq.fastq >s_7_all_all_qseq.frg
   fastqToCA -insertsize 94 19 -libraryname i1l8 -type illumina -fastq `pwd`/s_8_1_all_qseq.fastq,`pwd`/s_8_2_all_qseq.fastq >s_8_all_all_qseq.frg
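The per-lane commands differ only in the lane number and the two insert-size values, so they can be generated from a small table. The following is a minimal sketch, not the script that was actually used: `frg_command` is a hypothetical helper, it only prints each command (a dry run), and it assumes the two `-insertsize` values are the mean and standard deviation of the insert size. Only the lanes visible in this diff are listed; values for other lanes would have to be added from the full page.

```shell
#!/bin/sh
# Print the fastqToCA command for one paired-end lane (dry run; nothing is executed).
# Arguments: lane number, insert-size mean, insert-size std. dev., directory holding the fastq files.
frg_command() {
    local lane="$1" mean="$2" stddev="$3" dir="$4"
    printf 'fastqToCA -insertsize %s %s -libraryname i1l%s -type illumina -fastq %s/s_%s_1_all_qseq.fastq,%s/s_%s_2_all_qseq.fastq > s_%s_all_all_qseq.frg\n' \
        "$mean" "$stddev" "$lane" "$dir" "$lane" "$dir" "$lane" "$lane"
}

# Lane number and insert-size values, copied from the commands shown above.
while read -r lane mean stddev; do
    frg_command "$lane" "$mean" "$stddev" "$PWD"
done <<'EOF'
1 111 59
2 111 59
7 94 19
8 94 19
EOF
```

Piping the printed lines to `sh` would run them; reviewing them first guards against a wrong lane or insert-size pairing.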
 +==== Run on a high-memory machine ====
 +Celera requires a lot of memory and hard disk space to build its substantial database of overlaps from the high number of reads in our Illumina dataset.
 +
 +After several unsuccessful attempts at getting Celera to work on our data (even though its developers claim it should be capable given sufficient resources), the developers finally posted an example settings file that they said would work on our data. Their example didn't work for me, but it got me close enough that I was able to tweak some settings, resulting in the following settings file:
 +
 +Working spec file:
 +<code>
 +# -------------------------------------
 +#  SCIENCE
 +# -------------------------------------
 +#
 +#  Expected rate of sequencing error. Allow pairwise alignments up to this rate.
 +#  Sanger reads can use values less than one. Titanium reads require 3% in unitig.
 +#  ​
 +shell = /bin/bash
 +utgErrorRate=0.03
 +ovlErrorRate=0.06 # Larger than utg to allow for correction.
 +cnsErrorRate=0.10 # Larger than utg to avoid occasional consensus failures
 +cgwErrorRate=0.10 # Larger than utg to allow contig merges across high-error ends
 +
 +merSize = 22 # default=22; use lower to combine across heterozygosity, higher to separate near-identical repeat copies
 +overlapper=mer # the mer overlapper for 454-like data is insensitive to homopolymer problems but requires more RAM and disk
 +merOverlapperThreads= 2
 +merOverlapperSeedBatchSize= 5000000 # integer (default=100000) The number of fragments used per batch of seed finding. The amount of memory used is directly proportional to the number of fragments. (sorry, no documentation on what that relationship is, yet)
 +
 +merOverlapperExtendBatchSize= 5000000 # integer (default=75000) The number of fragments used per batch of seed extension. The amount of memory used is directly proportional to the number of fragments. See option frgCorrBatchSize for hints, but use those numbers with caution
 +
 +merOverlapperSeedConcurrency= 30 # integer (default=1) If not on the grid, run this many seed finding processes on the local machine at the same time
 +
 +merOverlapperExtendConcurrency= 30 # integer (default=1) If not on the grid, run this many seed extension processes on the local machine at the same time
 +
 +#
 +
 +unitigger = bog
 +utgBubblePopping = 1
 +# utgGenomeSize = # not set!
 +#
 +# -------------------------------------
 +# OFF-GRID ENGINEERING
 +# -------------------------------------
 +#  MERYL calculates K-mer seeds
 +#merylMemory = 44000
 +merylMemory = 512000
 +merylThreads = 25
 +#
 +#  OVERLAPPER calculates overlaps
 +#ovlMemory = 8GB
 +#ovlThreads = 2
 +#ovlConcurrency = 30
 +#ovlHashBlockSize = 2000000
 +#ovlRefBlockSize = 32000000
 +#
 +#  OVERLAP STORE build the database
 +ovlStoreMemory = 131072 # This is single-process
 +
 +#  ERROR CORRECTION applied to overlaps
 +frgCorrThreads = 10
 +frgCorrConcurrency = 3
 +ovlCorrBatchSize = 1000000
 +ovlCorrConcurrency = 25
 +
 +#  UNITIGGER configuration
 +
 +#  CONSENSUS configuration
 +cnsConcurrency = 16
 +
 +useGrid = 0
 +scriptOnGrid = 0
 +
 +
 +/scratch/galt/bananaSlug/GAZ7HUX02.frg
 +/scratch/galt/bananaSlug/GAZ7HUX03.frg
 +/scratch/galt/bananaSlug/GAZ7HUX04.frg
 +/scratch/galt/bananaSlug/slug_pair.frg #the frg file is made from the fastqToCA utility
 +/scratch/galt/bananaSlug/GCLL8Y406.frg
 +</code>
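When tuning a spec file like the one above, it helps to confirm which value is actually active for a key, since commented-out alternatives sit next to live settings. The following is a minimal sketch: `spec_get` is a hypothetical helper and not part of Celera, and it assumes that text after `#` is a comment and that the last assignment of a key is the one that counts. How runCA itself resolves repeated keys is not confirmed here.

```shell
#!/bin/sh
# Print the active value of one key in a "key = value  # comment" style spec file.
# Assumption of this sketch: the last uncommented assignment of the key wins.
spec_get() {
    local key="$1" spec="$2"
    awk -v key="$key" '
        { sub(/#.*/, "") }                        # drop comments, including commented-out settings
        index($0, "=") {
            k = substr($0, 1, index($0, "=") - 1)
            v = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)        # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)        # trim whitespace around the value
            if (k == key) val = v
        }
        END { if (val != "") print val }
    ' "$spec"
}

# Example (assumes the settings were saved as run1.spec):
#   spec_get merylMemory run1.spec
```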
 +
 +With this configuration, Celera seems to perform reasonably well on kolossus. The maximum memory I have observed the program consume is around 400 GB, so we definitely need a high-memory system to assemble a large dataset for a large genome.
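Given that peak usage, a quick pre-flight check can save a long run from failing partway through. The following is a minimal sketch for Linux only: `mem_total_gb` and `have_enough_memory` are hypothetical helpers, and the 400 threshold is simply the peak reported above, not a documented requirement.

```shell
#!/bin/sh
# Print total RAM in whole GiB, read from a /proc/meminfo-style file (Linux only).
# MemTotal is reported in kB (1024-byte units), so dividing by 1048576 gives GiB.
mem_total_gb() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 }' "$meminfo" 2>/dev/null
}

# Succeed if the machine has at least the requested number of GiB.
# Arguments: required GiB, optional path to a meminfo-style file.
have_enough_memory() {
    local required_gb="$1" total
    total="$(mem_total_gb "${2:-/proc/meminfo}")"
    [ -n "$total" ] && [ "$total" -ge "$required_gb" ]
}

# 400 is roughly the peak observed above; treat it as a rule of thumb.
if have_enough_memory 400; then
    echo "memory check passed"
else
    echo "warning: less than 400 GiB of RAM detected (or /proc/meminfo unavailable)" >&2
fi
```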
 +
 +To execute the program, I saved the above settings into a file called run1.spec. To run the assembler with these settings, I issue the following command:
 +<code>
 +runCA -d celeraSlug1 -p slugCelera -s run1.spec
 +</code>
  
 +This command tells the program to create (if necessary) and work in the celeraSlug1 directory, and to prefix its output files with slugCelera, using the settings in run1.spec.
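Because an assembly of this size runs for a long time, it is convenient to launch it detached from the terminal with its output captured in a log. The following is a minimal sketch: `launch_runca`, the `DRY_RUN` variable, and the log file name are hypothetical and not part of Celera; only the `-d`, `-p`, and `-s` options shown above are used.

```shell
#!/bin/sh
# Launch runCA in the background with stdout/stderr captured in a log file.
# With DRY_RUN=1 the command is only printed, which is useful for checking arguments.
# Arguments: working directory, output prefix, spec file.
launch_runca() {
    local dir="$1" prefix="$2" spec="$3"
    if [ ! -r "$spec" ]; then
        echo "cannot read spec file: $spec" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "runCA -d $dir -p $prefix -s $spec"
    else
        # nohup keeps the job alive after the terminal session ends.
        nohup runCA -d "$dir" -p "$prefix" -s "$spec" > "$prefix.runCA.log" 2>&1 &
        echo "started runCA (pid $!), logging to $prefix.runCA.log"
    fi
}

# Example:
#   DRY_RUN=1 launch_runca celeraSlug1 slugCelera run1.spec
```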
  
  
archive/bioinformatic_tools/celera.1273011823.txt.gz · Last modified: 2010/05/04 22:23 by jstjohn