~/bin/pluck-scripts/README contains documentation on many python scripts.
make-contig-lengths
fasta.py
map-colorspace
pair-contigs
analyze-joins
check-hypothesis-with-solid
Checks differences between reference and solid reads given the output format of find-dna-differences or map-colorspace.
Looks for a number of reads that support or reject the hypothesis.
Can report when no data supports either the null or the alternative hypothesis.
Utilizes paired-end data to map inversions/deletions.
generate-homopolymer-hypothesis
Came about because there was doubt about newbler's ability to accurately output homopolymers.
This finds all occurrences of homopolymers in the assembly and checks the number of bases within each region.
Takes parameters that allow short homopolymers to be screened out.
In pog, there weren't many homopolymers.
454 data was good at predicting homopolymer regions up to 12 bases in length.
With 25bp solid reads, it is difficult to disambiguate homopolymers longer than 18-19 bases.
In H. pylori, there are many homopolymer regions that stretch far beyond solid read lengths.
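A minimal sketch of the homopolymer scan described above, assuming the assembly is already loaded as a plain string. The function name find_homopolymers and the min_length parameter are illustrative only and are not the script's actual interface.

```python
import re


def find_homopolymers(sequence, min_length=3):
    """Yield (start, length, base) for every run of one base that is
    at least min_length long. Positions are 0-based."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # ([ACGT]) captures one base; \1{n,} requires n or more repeats of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    # Upper-casing keeps positions unchanged and handles soft-masked input.
    for match in pattern.finditer(sequence.upper()):
        yield match.start(), match.end() - match.start(), match.group(1)


runs = list(find_homopolymers("ACGTTTTACAAAG", min_length=3))
```

Raising min_length is one way to screen out short homopolymers, in the spirit of the screening parameters mentioned above.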
generate-integration-hypothesis
These checking programs, which use solid data, do not take into account amplification biases created during library preparation.
Output posterior probabilities on the various hypotheses being checked by these programs.
There is a correction that you can do: use only pairs that map uniquely. I.e., duplicate reads should only be counted once.
This should be done after mapping using reference coordinates rather than raw reads.
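A minimal sketch of that correction, assuming each mapped pair has already been reduced to a tuple of reference coordinates (contig, position and strand of each mate). The tuple layout and the name deduplicate_pairs are hypothetical; the real mapping output may look different.

```python
def deduplicate_pairs(mapped_pairs):
    """Keep the first pair seen at each set of reference coordinates.

    Each pair is assumed to be a hashable tuple:
    (contig, pos1, strand1, pos2, strand2).
    """
    seen = set()
    unique = []
    for pair in mapped_pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


pairs = [
    ("contig1", 100, "+", 1600, "-"),
    ("contig1", 100, "+", 1600, "-"),  # same coordinates: likely a duplicate
    ("contig1", 105, "+", 1600, "-"),
]
unique_pairs = deduplicate_pairs(pairs)
```

Keying on mapped coordinates rather than raw read sequence means two copies of the same template still collapse to one even if they differ by a sequencing error.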
generate-integration-hypotheses
Given a viral and a reference genome, generates hypotheses for the viral insertion at every location that matches the first f and last l bases.
This program is modeled after the POG virus's (PIV) insertion mechanism. [citation needed Bernick]
In H. pylori, cleaning transposons out of the raw data was very important for assembling the genome.
check-inversions
Written because using check-hypothesis-location would be cumbersome due to the format it takes in.
This program takes CircOS format, which gives the two endpoints of each inversion.
Looks for the amount of evidence at the low-end and the high-end of each orientation.
Comes up with an estimate of inversion frequency within the population.
Still needs to be fixed for deduplicating the templates.
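One way the frequency estimate could be computed, shown only as a sketch. It assumes the evidence has already been reduced to four counts (reads supporting the reference and the inverted orientation at the low and high endpoints) and simply pools them; the function name, the argument names and the pooling are assumptions, not the script's actual method.

```python
def estimate_inversion_frequency(low_ref, low_inv, high_ref, high_inv):
    """Return the fraction of reads supporting the inverted orientation,
    pooled over both endpoints, or None when there is no evidence."""
    inverted = low_inv + high_inv
    total = inverted + low_ref + high_ref
    if total == 0:
        return None  # no reads cover either endpoint
    return inverted / total


frequency = estimate_inversion_frequency(low_ref=30, low_inv=10,
                                         high_ref=28, high_inv=12)
```

The counts fed in here would still be inflated by duplicate templates until the deduplication fix noted above is made.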
find-common-short-reads
Looks for identical short reads that have coverage above expected values.
Hashes each short read as an exact string and looks at the counts.
This program can be improved by using multiple passes and random sampling.
First pass looks at every 1000 or so reads and hashes them.
On second pass, add to the first sample hash table.
Should be reimplemented in a memory-efficient language.
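A minimal sketch of the two-pass idea, under one interpretation of the notes above: the first pass seeds the hash table with a sample of the reads, and the second pass counts only reads already in that table, so memory is bounded by the sample size. The function name and the parameters sample_every and min_count are illustrative, and reads is assumed to be something that can be iterated twice (for a file, it would be reopened for the second pass).

```python
def count_common_reads(reads, sample_every=1000, min_count=2):
    """Return {read: count} for sampled reads seen at least min_count times."""
    # Pass 1: seed the table with every sample_every-th read.
    counts = {}
    for index, read in enumerate(reads):
        if index % sample_every == 0:
            counts[read] = 0
    # Pass 2: count exact-string matches, but only for sampled reads.
    for read in reads:
        if read in counts:
            counts[read] += 1
    return {read: n for read, n in counts.items() if n >= min_count}


reads = ["AAA", "CCC", "AAA", "GGG", "AAA", "CCC"]
common = count_common_reads(reads, sample_every=2, min_count=2)
```

A read that is very common is likely to land in the sample, while rare reads mostly never enter the table, which is where the memory saving would come from.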
make-inversion-hypotheses
generate-inversion-hypotheses
Takes in a regular expression that may be recognized by some invertase.
Looks for the regular expression on the positive strand and the reverse complement on the opposite strand.
Looks for possible sites for inversion.
Used to check whether integrase sites are actually inversion sites in pog (they're not).
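A minimal sketch of the two-strand search, assuming the genome is a plain string of A, C, G and T. The names reverse_complement and find_sites are illustrative, and the example pattern is made up. Matches on the opposite strand are reported in forward-strand coordinates.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def find_sites(genome, pattern):
    """Return sorted (start, end, strand) tuples for non-overlapping
    matches of pattern on both strands, in forward-strand coordinates."""
    regex = re.compile(pattern)
    length = len(genome)
    sites = [(m.start(), m.end(), "+") for m in regex.finditer(genome)]
    for m in regex.finditer(reverse_complement(genome)):
        # Convert reverse-strand offsets back to forward-strand offsets.
        sites.append((length - m.end(), length - m.start(), "-"))
    return sorted(sites)


sites = find_sites("TTGATCCAAAAGGATCTT", "GATCC")
```

A "+" site followed by a "-" site forms an inverted repeat, so such a pair could serve as the two candidate endpoints of an inversion hypothesis; how the real script pairs sites is not stated above.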
count-kmers
find-frequency-color-kmers
Similar to find-common-short-reads
Showed a large number of all-0, all-1, all-2, or all-3 reads.
Adapter sequences were also very common.
Was not a great filter; find-common-short-reads was better at filtering out noisy data.
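A minimal sketch of color k-mer counting, assuming each read is a string of the color digits 0 to 3 with any leading primer base already removed. The function names and the choice of k are illustrative.

```python
from collections import Counter


def count_color_kmers(reads, k):
    """Count every length-k substring across all colorspace reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def is_single_color(read):
    """True for all-0, all-1, all-2, or all-3 reads."""
    return len(read) > 0 and len(set(read)) == 1


kmer_counts = count_color_kmers(["0000", "0120"], k=2)
top_kmer = kmer_counts.most_common(1)
```

Reads flagged by is_single_color correspond to the all-0, all-1, all-2, or all-3 reads mentioned above.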
filter-blat
make-scaffold-from-blat
Takes output of filter-blat and makes a scaffold from the information.
Better matches are chosen.
Can raise the priority of certain contigs.
Replaces the current assembly with a newer assembly utilizing the new data.
Handles overlapping contigs (for blunt-ended contigs, use kstitcher).