archive:priceti

[jolespin@campusrocks2 PriceWithDocsAndSampleJob140408]$ pwd
/afs/cats.ucsc.edu/users/b/jolespin/PriceWithDocsAndSampleJob140408
[jolespin@campusrocks2 PriceWithDocsAndSampleJob140408]$ ./PriceTI --help
PRICE Assembler v1.2


These are the command options for the PRICE assembler. 
For more details, see the accompanying README.txt file. 
Contact: price@derisilab.ucsf.edu 

Usage: ./PriceTI [args] 

INPUT FILES: 
 accepted formats are fasta (appended .fa, .fasta, .fna, .ffn, .frn), fastq (.fq, .fastq, or _sequence.txt), or priceq (.pq or .priceq) 
  NOTE ABOUT FASTQ ENCODING: multiple encodings are currently used for fastq quality scores.  The traditional encoding is Phred+33,
                             and PRICE will interpret scores from any .fq or .fastq file according to that encoding.  The Phred+64 
                             encoding has been used extensively by Illumina, and so it is applied to Illumina's commonly-used _sequence.txt
                             file append.  Please make sure that your encoding matches your file append.
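
The encoding note above can be made concrete with a small sketch. It assumes the standard Phred definition (Q = ASCII code minus offset, error probability = 10^(-Q/10)) and takes the suffix-to-offset mapping from the note. The function names `phred_offset` and `prob_correct` are hypothetical and are not part of PRICE.

```python
def phred_offset(filename):
    """Return the quality offset implied by the file suffix, per the note above."""
    # Check the Illumina-style suffix first, since it also ends in ".txt".
    if filename.endswith("_sequence.txt"):
        return 64  # Phred+64
    if filename.endswith((".fq", ".fastq")):
        return 33  # traditional Phred+33
    raise ValueError("not a recognized fastq suffix: " + filename)


def prob_correct(qual_char, offset):
    """Convert one quality character to the probability that the base call is correct."""
    q = ord(qual_char) - offset          # Phred quality score
    return 1.0 - 10 ** (-q / 10.0)       # 1 - P(error)


if __name__ == "__main__":
    # The same quality (Q40) is written as 'I' in Phred+33 and 'h' in Phred+64.
    print(prob_correct("I", phred_offset("reads.fastq")))
    print(prob_correct("h", phred_offset("s_1_1_sequence.txt")))
```

Reading a Phred+64 file as Phred+33 would inflate every score by 31, which is why the suffix has to match the encoding.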
INPUT READ FILES: 
  NOTE: these flags can be used multiple times in the same command to include multiple read datasets. 
  (default % ID = 90%) 
 PAIRED-END FILES (reads are 3p of one another on opposite strands, i.e. pointing towards one another)
 -fp a b c [d e [f]]: (a,b)input file pair, (c)amplicon insert size (including read) 
                  (d,e,f) are optional; (d)the num. cycles to be skipped before this file is used;
                  if (f) is provided, then the file will alternate between being used for (e) cycles and not used for (f) cycles;
                  otherwise, the file will be used for (e) cycles then will not be used again. 
 -fpp a b c d [e f [g]]: (a,b)input file pair, (c)amplicon insert size (including read), (d)required % identity for match (25-100 allowed) 
                  (e,f,g) are optional; (e)the num. cycles to be skipped before this file is used;
                  if (g) is provided, then the file will alternate between being used for (f) cycles and not used for (g) cycles;
                  otherwise, the file will be used for (f) cycles then will not be used again. 
 -fs a b [c d [e]]: (a)input paired-end file (alternating entries are paired-end reads), (b)amplicon insert size (including read) 
                  (c,d,e) are optional; (c)the num. cycles to be skipped before this file is used;
                  if (e) is provided, then the file will alternate between being used for (d) cycles and not used for (e) cycles;
                  otherwise, the file will be used for (d) cycles then will not be used again. 
 -fsp a b c [d e [f]]: (a)input paired-end file (alternating entries are paired-end reads), (b)amplicon insert size (including read),
                  (c)required % identity for match (25-100 allowed) 
                  (d,e,f) are optional; (d)the num. cycles to be skipped before this file is used;
                  if (f) is provided, then the file will alternate between being used for (e) cycles and not used for (f) cycles;
                  otherwise, the file will be used for (e) cycles then will not be used again. 
 MATE-PAIR FILES (reads are 5p of one another on opposite strands, i.e. pointing away from one another)
 -mp a b c [d e [f]]: like -fp above, but with reads in the opposite orientation.
 -mpp a b c d [e f [g]]: like -fpp above, but with reads in the opposite orientation.
 -ms a b [c d [e]]: like -fs above, but with reads in the opposite orientation.
 -msp a b c [d e [f]]: like -fsp above, but with reads in the opposite orientation.
 FALSE PAIRED-END FILES (unpaired reads are split into paired ends, with the scores of double-use nucleotides halved)
 -spf a b c [d e [f]]: (a)input file, (b)the length of the 'reads' that will be taken from each side of the input reads,
                  (c)amplicon insert size (including read) 
                  (d,e,f) are optional; (d)the num. cycles to be skipped before this file is used;
                  if (f) is provided, then the file will alternate between being used for (e) cycles and not used for (f) cycles;
                  otherwise, the file will be used for (e) cycles then will not be used again. 
 -spfp a b c d [e f [g]]: (a)input file, (b)the length of the 'reads' that will be taken from each side of the input reads,
                  (c)amplicon insert size (including read), (d)required % identity for match (25-100 allowed) 
                  (e,f,g) are optional; (e)the num. cycles to be skipped before this file is used;
                  if (g) is provided, then the file will alternate between being used for (f) cycles and not used for (g) cycles;
                  otherwise, the file will be used for (f) cycles then will not be used again. 
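
The optional scheduling arguments shared by the read-file flags above (skip, used-for, not-used-for) can be sketched as follows. This is one reading of the help text, not PRICE's own code: it assumes cycles are 1-indexed and that a file with no scheduling arguments is used in every cycle. The function name `file_in_use` and its parameter names are hypothetical.

```python
def file_in_use(cycle, skip=0, on=None, off=None):
    """Return True if a read file would be used in the given 1-indexed cycle.

    skip: number of cycles skipped before the file is first used
    on:   number of consecutive cycles the file is used (None = always)
    off:  number of cycles the file then rests before being used again
          (None = after `on` cycles the file is never used again)
    """
    if cycle <= skip:
        return False
    if on is None:
        return True
    if on <= 0:
        return False
    pos = cycle - skip - 1            # 0-based position after the skipped cycles
    if off is None:
        return pos < on               # used once for `on` cycles, then dropped
    return pos % (on + off) < on      # alternate: `on` cycles used, `off` cycles idle


if __name__ == "__main__":
    # e.g. "-fp a b c 2 3 1": skip 2 cycles, then 3 cycles on / 1 cycle off
    print([c for c in range(1, 11) if file_in_use(c, 2, 3, 1)])
```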
INPUT INITIAL CONTIG FILES: 
  NOTE: these flags can be used multiple times in the same command to include multiple initial contig datasets. 
 -icf a b c d: (a)initial contig file, (b)number of addition steps, (c)number of cycles per step,
               (d)const by which to multiply quality scores 
 -picf a b c d e: (a)num of initial contigs from this file, (b)initial contig file, (c)num addition steps,
                  (d)num cycles per step (e)const by which to multiply quality scores 
 -icfNt/-picfNt: same as -icf/-picf, but if target mode is invoked, contigs with matches to these input sequences will not necessarily
                 be retained 
OUTPUT FILES: 
 accepted formats are fasta (.fa or .fasta) or priceq (.pq or .priceq) 
 -o a: (a)output file name (.fasta or .priceq) 
 -nco a: (a)num. cycles that pass in between output files being written (default=1) 
OTHER PARAMS: 
 -nc a: (a)num. of cycles 
 -link a: (a)max. number of contigs that are allowed to replace a read in a mini-assembly (default=2)
 -mol a: (a)minimum overlap length for mini-assembly (default=35) 
 -tol a: (a)threshold seq num for scaling overlap for overhang assemblies (default=20)
 NOTE: -mol and -tol do not affect the parameters for de-Bruijn-graph-based assembly.
 -mpi a: (a)minimum % identity for mini-assembly (default=85) 
 -tpi a: (a)threshold seq num for scaling % ID for mini-assemblies (default=20)
 -MPI, -TPI : same as above, but for meta-assembly (-MPI default=85, -TPI default=1000)
 NOTE: there is no minimum overlap value for meta-assembly
 -dbmax a: (a) the maximum length sequence that will be fed into de Bruijn assembly 
           (default=100; recommended: max paired-end read length)
 -dbk a: (a) the k-mer size for de Bruijn assembly (default=20; keep less than the read length)
 -dbms a: (a) the minimum number of sequences to which de Bruijn assembly will be applied (default=3)
 -r a: (a) alignment score reward for a nucleotide match; should be a positive integer (default=1)
 -q a: (a) alignment score penalty for a nucleotide mismatch; should be a negative integer (default=-2)
 -G a: (a) alignment score penalty for opening a gap; should be a negative integer (default=-5)
 -E a: (a) alignment score penalty for extending a gap; should be a negative integer (default=-2)
FILTERING READS: 
 -rqf a b [c d]: filters pairs of reads if either has an unacceptably high number of low-quality nucleotides, as defined
                 by the provided quality scores (only applies to files whose formats include quality score information). 
                 (a) the percentage of nucleotides in a read that must be high-quality; (b) the minimum allowed probability
                 of a nucleotide being correct (must be between 0 and 1, and will usually be a decimal value close to 1);
                 (c) and (d) optionally constrain this filter to use after (c) cycles have passed, to run for (d) cycles.
                 This flag may be called multiple times to generate variable behavior across a PRICE run.
 -rnf a [b c]: filters pairs of reads if either has an unacceptably high number of uncalled nucleotides (Ns or other ambiguous
                 IUPAC codes).  Like -rqf, but will also filter fasta-format data.  (a) the percentage of nucleotides in a read
                 that must be called; (b) and (c) optionally constrain this filter to use after (b) cycles have passed, to run
                 for (c) cycles.  This flag may be called multiple times to generate variable behavior across a PRICE run.
 -maxHp a: filters out a pair of reads if either read has a homo-polymer track >(a) nucleotides in length.
 -maxDi a: filters out a pair of reads if either read has a repeating di-nucleotide track >(a) nucleotides in length.
           NOTE: this will also catch mono-nucleotide repeats of the specified length (a string of A's is also a string
           of AA's), so calling -maxHp in addition to -maxDi is superfluous unless -maxHp is given a smaller max value.
 -badf a b: prevents reads with a match of at least (b)% identity to a sequence in file (a) from being mapped to contigs.
 -repmask a b c d e f [g]: uses coverage levels of constructed and/or input contigs to find repetitive elements and mask them 
                           as if they were sequences input using -badf.
                           (a) = cycle number (1-indexed) at which repeats will be detected.
                           (b) = 's' if repeats will be sought at the start of the cycle or 'f' if they will be sought at the finish.
                           (c) = the min. number of variance units above the median that will be counted as high-coverage.
                           (d) = the min. fold increase in coverage above the median that will be counted as high-coverage.
                           (e) = the min. size in nt for a detected repeat. 
                           (f) = reads with a match of at least this % identity to a repeat will not be mapped to contigs.
                           (g) = an optional output file (.fasta or .priceq) to which the detected repeats will be written.
 -reset a [b c d...]: re-introduces contigs that were previously not generating assembly jobs of their own
                      (a) is the one-indexed cycle where the contigs will be reset.  Same with b, c, d. 
                      Any number of args may be added.
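
The note under -maxDi (a homopolymer is also a di-nucleotide repeat) can be checked with a short sketch. It assumes a track is measured in nucleotides as the longest stretch in which each base equals the base one position earlier (homopolymer) or two positions earlier (di-nucleotide). How PRICE measures tracks internally is not stated in the help text, and the function names here are hypothetical.

```python
def longest_repeat_track(seq, period):
    """Length (nt) of the longest stretch where each base repeats the base `period` positions back."""
    if len(seq) < period:
        return 0
    best = run = period
    for i in range(period, len(seq)):
        # Extend the current track if the repeat unit continues, else restart.
        run = run + 1 if seq[i] == seq[i - period] else period
        best = max(best, run)
    return best


def passes_track_filters(seq, max_hp=None, max_di=None):
    """Mimic -maxHp / -maxDi: reject a read whose track is longer than the given limit."""
    if max_hp is not None and longest_repeat_track(seq, 1) > max_hp:
        return False
    if max_di is not None and longest_repeat_track(seq, 2) > max_di:
        return False
    return True


if __name__ == "__main__":
    read = "GGAAAAAACT"  # contains a 6-nt run of A
    print(longest_repeat_track(read, 1), longest_repeat_track(read, 2))
    print(passes_track_filters(read, max_di=5))
```

Under this definition a run of six A's is reported as a 6-nt track for both periods, so a -maxDi limit alone already rejects it.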
FILTERING INITIAL CONTIGS: 
 -icbf a b [c]: prevents input sequences with a match of at least (b)% identity to a sequence in file (a) from being used.
                This filter is optionally not applied to sequences of length greater than (c) nucleotides.
 -icmHp a [b]: filters out an initial contig if it has a homo-polymer track >(a) nucleotides in length.
               This filter is optionally not applied to sequences of length greater than (b) nucleotides.
 -icmDi a [b]: filters out an initial contig if it has a repeating di-nucleotide track >(a) nucleotides in length.
               NOTE: this will also catch mono-nucleotide repeats of the specified length (a string of A's is also a string
               of AA's), so calling -icmHp in addition to -icmDi is superfluous unless -icmHp is given a smaller max value.
               This filter is optionally not applied to sequences of length greater than (b) nucleotides.
 -icqf a b [c]: filters out an initial contig if it has an unacceptably high number of low-quality nucleotides, as defined
                by the provided quality scores (only applies to files whose formats include quality score information). 
                (a) the percentage of nucleotides in a read that must be high-quality; (b) the minimum allowed probability
                of a nucleotide being correct (must be between 0 and 1, and will usually be a decimal value close to 1);
                This filter is optionally not applied to sequences of length greater than (c) nucleotides.
 -icnf a [b]: filters out an initial contig if it has an unacceptably high number of uncalled nucleotides (Ns or other ambiguous
              IUPAC codes).  Like -icqf, but will also filter fasta-format data.  (a) the percentage of nucleotides in a read
              that must be called. This filter is optionally not applied to sequences of length greater than (b) nucleotides.
FILTERING/PROCESSING ASSEMBLED CONTIGS: 
 -lenf a b: filters out contigs shorter than (a) nt at the end of every cycle, after skipping (b) cycles.
            NOTE: multiple -lenf commands can be entered; for any cycle, the most recently-initiated filter is used.
            Example: -lenf 50 2 -lenf 300 4 -lenf 200 6 => no filter for the first two cycles, then a 50nt filter for cycles
                     3 & 4, then a 300nt filter for cycles 5 & 6, then a 200nt filter for cycles 7 onwards.
 -trim a b [c]: at the end of the (a)th cycle (indexed from 1), trim off the edges of contigs until reaching the minimum coverage
                level (b), optionally deleting contigs shorter than (c) after trimming; this flag may be used repeatedly.
 -trimB a b [c]: basal trim; after skipping (a) cycles, trim off the edges of contigs until reaching the minimum coverage
                 level (b) at the end of EVERY cycle, optionally deleting contigs shorter than (c) after trimming.
                 -trimB may be called many times, and multiple calls will interact in the same way as multiple -lenf calls
                 (explained above). A call to -trim will override the basal trim values for that specified cycle only.
 -trimI a [b]: initial trim; input initial contigs are trimmed before being used by PRICE to seed assemblies. Contigs are
               trimmed from their outside edges until reaching the minimum coverage level (a), optionally deleting contigs
               shorter than (b) after trimming. This flag is most appropriate for .priceq input, can be appropriate for
               .fastq input, and is inappropriate for .fasta input.  It will be equally applied to ALL input contigs.
 -target a b [c d]: limit output contigs to those with matches to input initial contigs at the end of each cycle.
                    (a) % identity to an input initial contig to count as a match (ungapped); (b)num cycles to skip
                    before applying this filter.  [c and d are optional, but must both be provided if either is]
                    After target filtering has begun, target-filtered/-unfiltered cycles will alternate with (c)
                    filtered cycles followed by (d) unfiltered cycles.
 -targetF a b [c d]: the same as -target, but now matches to all reads in the input set will be specified, not just
                    the ones that have been introduced up to that point (this is FullFile mode).
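
The rule for multiple -lenf calls described above (the most recently initiated filter wins) can be sketched as a lookup per cycle. Cycles are taken as 1-indexed, matching the worked example in the help text. The function name `lenf_threshold` is hypothetical and the code is an illustration of the rule, not PRICE's implementation.

```python
def lenf_threshold(cycle, filters):
    """Return the minimum contig length in force at a 1-indexed cycle, or None.

    filters: list of (min_len, skip) pairs, one per -lenf call, in command order.
    A filter becomes active once `skip` cycles have passed; among the active
    filters, the one with the largest `skip` (most recently initiated) is used.
    """
    active = None
    best_skip = -1
    for min_len, skip in filters:
        if skip < cycle and skip >= best_skip:
            active, best_skip = min_len, skip
    return active


if __name__ == "__main__":
    # -lenf 50 2 -lenf 300 4 -lenf 200 6
    calls = [(50, 2), (300, 4), (200, 6)]
    for c in range(1, 9):
        print(c, lenf_threshold(c, calls))
```

For the help text's example this gives no filter in cycles 1 and 2, 50 nt in cycles 3 and 4, 300 nt in cycles 5 and 6, and 200 nt from cycle 7 onwards.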
COMPUTATIONAL EFFICIENCY: 
 -a x: (x)num threads to use (default=1) 
 -mtpf a: (a)max threads per file (default=1) 
USER INTERFACE: 
 -log a: determines the type of stdout output (the verbose option adds many time-stamp tags) 
         (a) = c: concise stdout (default)
         (a) = n: no stdout 
         (a) = v: verbose stdout 
 -logf a: (a)the name of an output file for verbose log info to be written (doesn't change stdout format) 
 -, -h, or --help: user interface info. 

number of cycles MUST be specified with -nc flag
No assembly was run; help message printed instead.
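
As the last two lines show, a run needs at least -nc. A minimal sketch of assembling a complete argument list from the flags documented above follows. The file names (`reads_1.fq`, `reads_2.fq`, `seeds.fa`, `out.fasta`), the insert size, and the helper name `price_command` are placeholders chosen for illustration. The command is only built and printed, not executed.

```python
import shlex


def price_command(read1, read2, insert_size, seeds, cycles, output, threads=1):
    """Build a PriceTI argument list using only flags listed in the help text."""
    return [
        "./PriceTI",
        "-fp", read1, read2, str(insert_size),   # paired-end files + amplicon insert size
        "-icf", seeds, "1", "1", "5",            # seed contigs: 1 addition step, 1 cycle/step, quality x5
        "-nc", str(cycles),                      # number of cycles (required)
        "-o", output,                            # output file (.fasta or .priceq)
        "-a", str(threads),                      # number of threads
    ]


cmd = price_command("reads_1.fq", "reads_2.fq", 300, "seeds.fa", 10, "out.fasta", threads=4)

if __name__ == "__main__":
    print(shlex.join(cmd))
```

Keeping the arguments as a list lets the same value be passed to `subprocess.run` later without any shell quoting concerns.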