contributors:team_3:filter
$ sga filter --help
Usage: sga filter [OPTION] ... READSFILE
Remove reads from a data set.
The currently available filters are removing exact-match duplicates
and removing reads with low-frequency k-mers.
Automatically rebuilds the FM-index without the discarded reads.

    --help                           display this help and exit
    -v, --verbose                    display verbose output
    -p, --prefix=PREFIX              use PREFIX for the names of the index files (default: prefix of the input file)
    -o, --outfile=FILE               write the qc-passed reads to FILE (default: READSFILE.filter.pass.fa)
    -t, --threads=NUM                use NUM threads to compute the overlaps (default: 1)
    -d, --sample-rate=N              use occurrence array sample rate of N in the FM-index. Higher values use significantly
                                     less memory at the cost of higher runtime. This value must be a power of 2 (default: 128)
    --no-duplicate-check             turn off duplicate removal
    --substring-only                 when removing duplicates, only remove substring sequences, not full-length matches
    --no-kmer-check                  turn off the kmer check
    --kmer-both-strand               minimum kmer coverage is required for both strands
    --homopolymer-check              check reads for homopolymer run length sequencing errors
    --low-complexity-check           filter out low complexity reads

K-mer filter options:
    -k, --kmer-size=N                The length of the kmer to use. (default: 27)
    -x, --kmer-threshold=N           Require at least N kmer coverage for each kmer in a read. (default: 3)

Report bugs to js18@sanger.ac.uk
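
The k-mer check described in the help text (-k and -x) can be illustrated with a small sketch. This is not how sga implements it: sga queries an FM-index, while the code below counts k-mers naively with a Counter, looks at the forward strand only, and ignores the duplicate filter. The function names kmers and kmer_filter and the toy reads are made up for illustration.

```python
from collections import Counter


def kmers(read, k):
    """Yield every substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def kmer_filter(reads, k=27, threshold=3):
    """Keep reads in which every k-mer occurs at least `threshold` times.

    Naive illustration only: counts are taken over the whole read set on the
    forward strand, with no reverse complements and no FM-index.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    passed = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        # Assumption of this sketch: reads shorter than k are discarded.
        if read_kmers and all(counts[km] >= threshold for km in read_kmers):
            passed.append(read)
    return passed


# Toy data: three identical reads and one read with two rare 4-mers.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTT"]
print(kmer_filter(reads, k=4, threshold=3))
```

With these toy reads the last read is dropped because its k-mers CGTT and GTTT each occur only once, which is below the threshold of 3.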

Discussion

2015/04/28 03:38
$ qsub -V -cwd -pe mpi 32 initialrun.sh
[jolespin@campusrocks2 SGA]$ cat initialrun.sh.e55293
[timer - sga::filter] wall clock: 23599.21s CPU: 62984.74s
[jolespin@campusrocks2 SGA]$ cat initialrun.sh.o55293
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

RLBWT info:
Large Sample rate: 8192
Small Sample rate: 128
Contains 19924474285 symbols in 4702993309 runs (4.2366 symbols per run)
Marker Memory -- Small Markers: 1867919484 (1781.4 MB) Large Markers: 116745024 (111.3 MB)
Total Memory -- Markers: 1984664508 (1892.7 MB) Str: 4702993309 (4485.1 MB) Misc: 152 Total: 6687657969 (6377.847642 MB)
N: 19924474285 Bytes per symbol: 0.335650
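
The figures in the log above can be cross-checked with a little arithmetic. The sketch below only re-derives the ratios from the numbers as printed; it assumes that "MB" in the log means 2^20 bytes, which is consistent with the printed values. The variable names are made up, and the contents of initialrun.sh (including any -t value passed to sga) are not shown on this page, so the CPU to wall-clock ratio is only an average number of busy cores, not a statement about how the job was configured.

```python
# Figures copied from the job log.
symbols = 19_924_474_285       # "Contains ... symbols"
runs = 4_702_993_309           # "... in ... runs"
small_markers = 1_867_919_484  # bytes
large_markers = 116_745_024    # bytes
str_bytes = 4_702_993_309
misc_bytes = 152
total_bytes = 6_687_657_969
wall_s = 23599.21              # timer: wall clock seconds
cpu_s = 62984.74               # timer: CPU seconds

MIB = 1024 ** 2  # assumption: the log's "MB" is 2**20 bytes

marker_bytes = small_markers + large_markers
symbols_per_run = symbols / runs
bytes_per_symbol = total_bytes / symbols
total_mb = total_bytes / MIB
wall_hours = wall_s / 3600
avg_busy_cores = cpu_s / wall_s

print(f"markers total:    {marker_bytes} bytes")
print(f"components sum:   {marker_bytes + str_bytes + misc_bytes} bytes")
print(f"symbols per run:  {symbols_per_run:.4f}")
print(f"bytes per symbol: {bytes_per_symbol:.6f}")
print(f"total memory:     {total_mb:.6f} MB")
print(f"wall clock:       {wall_hours:.2f} hours")
print(f"CPU / wall clock: {avg_busy_cores:.2f}")
```

The marker, string and misc sizes add up to the reported total, and the derived ratios agree with the values printed in the log.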
contributors/team_3/filter.txt · Last modified: 2015/09/02 16:24 by ceisenhart