look-for-exit reads.fastq reads.fastq … -r repeat.fasta
Look-for-exit can be used for several tasks:
The python program takes any number of fastq files of reads as arguments and indexes them (and their reverse-complements) in a suffix array. The suffix array code is currently implemented using the Python module tools_karkkainen_sanders, also known as pysuffix. Because the creation of the suffix array for the reads is so slow, interactive commands are used to control the choice of kmer size and other parameters of the process. Although the reads are read as fastq files, look-for-exit currently ignores the quality values.
The optional –repeat_file argument specifies a fasta file of a repeat region (or whole genome), which is also indexed (though not its reverse complement) in a suffix array. Note: the repeat is not wrapped around, so it is usually a good idea to have at least 2 copies of the repeat in the input.
(Bug: both the reads and the repeat_file arguments should accept either fasta or fastq format and should accept gzipped files. This would be a simple fix, but would need some modification to the fastq.py module.)
Interactive commands:
q to quit h for help > to search forward (just sets search direction, doesn't do search) < to search backward (just sets search direction, doesn't do search) s agct set seed to search from agct add to seed (in search direction) empty line extends seed by 1 in search direction - shrinks seed by 1 in search direction (undoes previous extension) r take reverse complement of seed e 5,3 extend the seed as far as possible, as long as the support for the best extension is >=5 and no other extension has support >=3. (The parameters are remembered, so subsequent uses can be just "e".) w 5,10000 "walk" the seed, doing extensions until the seed would have <5 reads in support, then shrinking the back end until extensions can continue. Stop when a previously seen seed is repeated, or when 10000 bases have been walked over. (The parameters are remembered.) The following commands only make sense if there is a --repeat_file option. x to look for exit from repeat k 31 to set a kmer size m 10 to set the minimum number of reads needed to continue a search d 30 maximum depth to search (if minimum # reads not hit first) l 20 minimum length of a kmer extending out of repeat to be worth reporting c file_prefix count reads supporting each kmer (size specified by k) of the repeat region and output them (sorted by number of reads) to file named file_prefix followed by k. Warning: the crude command parser discards case information, so the file prefix will end up all lower-case.
Options:
-h, --help show this help message and exit -r REPEAT_FILE, --repeat_file=REPEAT_FILE file name for a fasta file that contains a repeat sequence.