User Tools

Site Tools


archive:bioinformatic_tools:pluck-scripts:look-for-exit

look-for-exit

look-for-exit reads.fastq reads.fastq … -r repeat.fasta

Look-for-exit can be used for several tasks:

  • interactively looking for the best extensions of seed sequences,
  • looking for exits from a repeat region (substrings connected to kmers in the repeat but which are not themselves in the repeat)
  • counting reads supporting each kmer in the repeat region (to try to find possible unique primers)
  • looking for variants in a genome that are supported by multiple reads (for example, not-yet-identified repeats)

The python program takes any number of fastq files of reads as arguments and indexes them (and their reverse-complements) in a suffix array. The suffix array code is currently implemented using the Python module tools_karkkainen_sanders, also known as pysuffix. Because the creation of the suffix array for the reads is so slow, interactive commands are used to control the choice of kmer size and other parameters of the process. Although the reads are read as fastq files, look-for-exit currently ignores the quality values.

The optional –repeat_file argument specifies a fasta file of a repeat region (or whole genome), which is also indexed (though not its reverse complement) in a suffix array. Note: the repeat is not wrapped around, so it is usually a good idea to have at least 2 copies of the repeat in the input.

(Bug: both the reads and the repeat_file arguments should accept either fasta or fastq format and should accept gzipped files. This would be a simple fix, but would need some modification to the fastq.py module.)

Interactive commands:

    q to quit
    h for help
    > to search forward (just sets search direction, doesn't do search)
    < to search backward (just sets search direction, doesn't do search)
    s agct      set seed to search from
    agct        add to seed (in search direction)
                empty line extends seed by 1 in search direction
    -           shrinks seed by 1 in search direction (undoes previous extension)
    r           take reverse complement of seed
    e 5,3       extend the seed as far as possible, as long as the support for
                the best extension is >=5 and no other extension has support 
                >=3.  (The parameters are remembered, so subsequent uses can 
                be just "e".)
    w 5,10000   "walk" the seed, doing extensions until the
                seed would have <5 reads in support, then shrinking
                the back end until extensions can continue.
                Stop when a previously seen seed is repeated, or
                when 10000 bases have been walked over.
                (The parameters are remembered.)
The following commands only make sense if there is a --repeat_file option.    
    x           to look for exit from repeat
    k 31        to set a kmer size
    m 10        to set the minimum number of reads needed 
                to continue a search
    d 30        maximum depth to search (if minimum # reads not hit first)
    l 20        minimum length of a kmer extending out of repeat to be worth reporting
    c file_prefix       count reads supporting each kmer (size specified by k)
                of the repeat region and output them (sorted by number
                of reads) to file named file_prefix followed by k.
                Warning: the crude command parser discards case
                information, so the file prefix will end up all lower-case.

Options:

  -h, --help            show this help message and exit
  -r REPEAT_FILE, --repeat_file=REPEAT_FILE
                        file name for a fasta file that contains a repeat
                        sequence.
You could leave a comment if you were logged in.
archive/bioinformatic_tools/pluck-scripts/look-for-exit.txt · Last modified: 2015/07/28 06:21 by ceisenhart