pluck_scripts
This is a set of Python scripts written by Kevin Karplus, originally created in /projects/lowelab/users/course/karplus (hence the acronym “pluck”). The scripts perform a variety of useful tasks. Each one has minimal internal documentation, which is displayed by running the script with the --help option.
Scripts
check-hypotheses-with-solid given a set of hypotheses about possible changes to a scaffold, checks to see whether the SOLiD mate-pair reads provide more support for the original scaffold or the modified one. The hypotheses are in the form “AGT:contig00001 345:AGTA”, that is, a string in the scaffold, where the string is located, and what the string is replaced with. Either string can be “-” to represent an insertion or deletion. The insertion “-:contig00001 345:AA” would mean the insertion of two As before position 345 of contig00001, with numbering starting at 1 in the contig.
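The hypothesis format described above can be read and applied with a few lines of Python. The following is only a minimal sketch, not code from check-hypotheses-with-solid: the names Hypothesis, parse_hypothesis, and apply_hypothesis are hypothetical, and it assumes that the position refers to the first base of the old string and that contig names contain no colons or spaces.

```python
from typing import NamedTuple


class Hypothesis(NamedTuple):
    old: str     # string currently in the scaffold ("" for an insertion)
    contig: str  # contig name
    pos: int     # 1-based position in the contig
    new: str     # replacement string ("" for a deletion)


def parse_hypothesis(text: str) -> Hypothesis:
    """Parse 'OLD:CONTIG POS:NEW', where '-' stands for the empty string."""
    old, location, new = text.strip().split(":")
    contig, pos = location.split()
    return Hypothesis(
        old="" if old == "-" else old,
        contig=contig,
        pos=int(pos),
        new="" if new == "-" else new,
    )


def apply_hypothesis(seq: str, hyp: Hypothesis) -> str:
    """Return seq with the hypothesized change applied."""
    start = hyp.pos - 1  # convert the 1-based position to a 0-based index
    end = start + len(hyp.old)
    if seq[start:end] != hyp.old:
        raise ValueError(f"expected {hyp.old!r} at position {hyp.pos}")
    return seq[:start] + hyp.new + seq[end:]
```

With this sketch, an insertion such as “-:contig00001 3:AA” applied to the sequence ACGT places the two As before position 3.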
check-inversions
classify-blast-reads
differences2stitcher Given a reference and difference format, output changes in stitcher format. Can have problems with many SNPs close together. Will try to change multiple SNPs in a single expression if possible.
extract-fragments
filter-blat
find-dna-differences compares a genome (or set of contigs) to a reference genome and reports differences in three formats: alignments of matching regions in a human-readable format, bed format for the location in the reference genome (loses some information about long insertions or replacements), and a short form that gives old_seq:reference location:new_seq for each change. This program is only intended for small sets of differences, not for large rearrangements or distant relationships. It may be buggy at the moment, as some of the Pog contigs that mapped completely to the genome were reported as not mapping (perhaps they were exact repeats?).
find-frequent-color-kmers
kstitcher David Bernick originally created a program called “stitcher” which would stitch together newbler contigs. Stitcher format: “@name {+contigname|-contigname}+ 15*N” starts a new contig called “name comment”. This format allows for nested parentheses. There is no operator precedence; i.e., 15*(-contig1) and 15*-contig1 have very different results. The numeric value does not need to be an integer: 0.5*contig will report the first half of the contig. Use parentheses whenever possible. Contig names should not be made solely of bases (e.g. GATACA). The unusual operator “expr1 < expr2 > expr3” is analogous to “expr1 =~ s/expr2/expr3/” iff expr2 is unique. If expr2 is not unique, the replacement will not take place.
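The replace-only-if-unique behaviour of the “expr1 < expr2 > expr3” operator can be illustrated in Python. This is a sketch under stated assumptions, not kstitcher's code: replace_if_unique is a hypothetical name, overlapping occurrences are counted as non-unique, and only the forward strand is searched (whether kstitcher also checks the reverse complement is not stated here).

```python
def replace_if_unique(seq: str, old: str, new: str) -> str:
    """Replace old with new only when old occurs exactly once in seq."""
    first = seq.find(old)
    # A second hit starting anywhere after the first one (including an
    # overlapping hit) makes the match ambiguous.
    if first == -1 or seq.find(old, first + 1) != -1:
        return seq  # absent or ambiguous: leave the sequence unchanged
    return seq[:first] + new + seq[first + len(old):]
```

For example, replacing GTT with GAT in ACGTTGCA succeeds, while replacing AC in ACACAC leaves the sequence untouched because AC occurs more than once.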
look-for-exit used in the mitochondrial genome to find variants of the repeats and exits from the repeats.
make-contig-lengths
make-inversion-hypotheses
make-pseudoreads
make-scaffold-from-blat Takes a psl file as input and tries to create a scaffold. Useful for compiling data from many sources that have a lot of overlap. For example, after initial contigs are made, many reads are left over. De novo assembly of those reads can create new contigs which span the gaps between the initially assembled contigs. make-scaffold-from-blat can create new scaffolds from these two sets of contigs.
map-colorspace
pair-contigs
select-by-color-kmers
tsv2gnuplot
Python modules
aligner.py for doing local or global alignment. This is a rather slow way to align things and should only be used for short sequences (like showing a small indel in context).
compress.py This was an attempt to merge color-space and flow space, by reducing all runs of 0s to a single 0. It turned out not to be as useful as I'd hoped, and further exploration was abandoned. (Kevin Karplus, 2010/04/15 22:19)
fasta.py Input/Output module for fasta files, together with alphabet definitions and utility functions like reverse complement.
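As an illustration of the kind of utility mentioned above, a reverse complement can be computed with a translation table. This is a generic sketch, not the function in fasta.py; the name reverse_complement and its handling of only A, C, G, and T (other characters such as N pass through unchanged) are assumptions.

```python
# Translation table for the four DNA bases, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]
```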
subst_matrix.py for creating DNA substitution matrices from a small number of parameters.
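One common way to build a DNA substitution matrix from a small number of parameters is to give one score each to matches, transitions (purine to purine or pyrimidine to pyrimidine), and transversions. The sketch below shows that idea; it is not the parameterization used by subst_matrix.py, and the function name, parameter names, and default scores are all hypothetical.

```python
def make_dna_matrix(match=1.0, transition=-1.0, transversion=-2.0):
    """Return a dict mapping (base, base) pairs to substitution scores."""
    purines = {"A", "G"}
    matrix = {}
    for a in "ACGT":
        for b in "ACGT":
            if a == b:
                score = match
            elif (a in purines) == (b in purines):
                score = transition    # A<->G or C<->T
            else:
                score = transversion  # purine <-> pyrimidine
            matrix[a, b] = score
    return matrix
```

A dictionary keyed by base pairs keeps the lookup explicit, e.g. make_dna_matrix()["A", "G"] gives the transition score.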