Pluck-scripts
  * ~/bin/pluck-scripts/README contains documentation on many Python scripts.
  * make-contig-lengths
    * Extracts reads and reads/base if they are stored in the header of each FASTA record.
  * fasta.py
    * A Python library that contains commonly used methods and functions for reading and manipulating FASTA sequences.
  * map-colorspace
    * Generates a lot of information as a consequence of mapping colorspace reads to a reference.
  * pair-contigs
    * Takes in one of the outputs of map-colorspace (trim%-cross.rdb) and outputs mapped reads between contigs.
  * analyze-joins
    * Takes output from pair-contigs and connects contigs.
    * Can make mistakes; needs better heuristics.
  * check-hypothesis-with-solid
    * Checks differences between the reference and SOLiD reads, given input in the output format of find-dna-differences or map-colorspace.
    * Looks for a number of reads that support or reject the hypothesis.
    * Can report when no data support either the null or the alternative hypothesis.
    * Utilizes paired-end data to map inversions/deletions.
  * generate-homopolymer-hypothesis
    * Came about because there was doubt about Newbler's ability to output homopolymers accurately.
    * Finds all occurrences of homopolymers in the assembly and checks the number of bases within each region.
    * Takes in some parameters that allow for screening out short homopolymers.
    * In pog, there weren't many homopolymers.
      * 454 data was good at predicting homopolymer regions up to 12 bases in length.
      * With 25bp SOLiD reads, it is difficult to disambiguate homopolymers longer than 18-19 bases.
    * In H. pylori, there are many homopolymer regions that stretch far beyond SOLiD read lengths.
  * generate-integration-hypothesis
  * These checking programs, which utilize SOLiD data, do not take into account amplification biases created during library preparation.
    * Output posterior probabilities on the various hypotheses being checked by these programs.
    * There is a correction that you can do: use only pairs that map uniquely, i.e., duplicate reads should only be counted once.
    * This should be done after mapping, using reference coordinates rather than raw reads.
  * generate-integration-hypotheses
    * Given a viral and a reference genome, generates hypotheses for inserting the virus at every location that matches the first f and last l bases.
      * The virus should be represented as linear DNA, with the first base and last base representing the insertion site.
    * This program is modeled after the POG virus's (PIV) insertion mechanism. [citation needed Bernick]
      * The transposons were incorporating at virtually every possible site within the genome.
      * There was at least some evidence for hundreds of integration sites.
    * In H. pylori, cleaning the raw data of transposons was very important for assembling the genome.
      * There was no single transposon site that was always present in the population.
      * Eliminating all transposons allowed the genome to be assembled without a problem.
        * Then, by using SOLiD and 454 data, the transposons were inserted.
        * Transposons were found in only 4 places.
  * check-inversions
    * Written because using check-hypothesis-location would be cumbersome due to the format it takes in.
    * This program takes input in CircOS format, which gives the two endpoints of each inversion.
    * Looks for the amount of evidence at the low end and the high end of each orientation.
    * Comes up with an estimate of inversion frequency within the population.
    * Still needs to be fixed for deduplicating the templates.
  * find-common-short-reads
    * Looks for identical short reads that have coverage above expected values.
    * Hashes each short read as an exact string and looks at the counts.
    * This program can be improved by using multiple passes and random sampling.
      * First pass looks at every 1000 or so reads and hashes them.
      * On second pass, add to the first sample hash table.
      * Should be reimplemented in a memory-efficient language.
  * make-inversion-hypotheses
    * Converts the trim%inversions.rdb file to the format needed for check-inversions. Each inversion in the rdb file is converted to three separate regions.
      * Gives the shortest, medium, and longest inversions by looking for mate-pair reads that cross the inversion point in both orientations.
  * generate-inversion-hypotheses
    * Takes in a regular expression that may be recognized by some invertase.
    * Looks for the regular expression on the positive strand and the reverse complement on the opposite strand.
    * Looks for possible sites for inversion.
    * Used to check whether integrase sites are actually inversion sites in pog (they're not).
  * count-kmers
    * Counts the frequency of different k-mers within a genome.
  * find-frequency-color-kmers
    * Similar to find-common-short-reads
    * Showed a large number of all-0, all-1, all-2, or all-3 reads.
    * Adapter sequences were also very common.
    * Was not a great filter; find-common-short-reads was better at filtering out noisy data.
  * filter-blat
    * Takes the psl file that comes out of blat and extracts the biggest and best contig matches.
    * Sorts by start site of each alignment in the reference genome.
  * make-scaffold-from-blat
    * Takes output of filter-blat and makes a scaffold from the information.
    * Better matches are chosen.
    * Can raise the priority of certain contigs.
    * Replaces the current assembly with a newer assembly utilizing the new data.
    * Handles overlapping contigs (for blunt-ended contigs, use kstitcher).
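
The counting approach described for find-common-short-reads (hash each read as an exact string, then look at counts) can be sketched as below. This is a minimal sketch, not the actual script: the function names, the `step` and `threshold` parameters, and the reading of the two-pass idea (seed the table from a sample, then count only sequences already in the table) are all assumptions made for illustration.

```python
from collections import Counter


def count_exact_reads(reads):
    """Single pass: hash every read as an exact string and count it."""
    return Counter(reads)


def count_sampled_reads(reads, step=1000):
    """Two-pass variant (assumed interpretation of the notes).

    Pass 1 seeds the table with every `step`-th read.
    Pass 2 counts occurrences only for sequences already in the table,
    so memory is bounded by the sample size, not the number of distinct reads.
    """
    reads = list(reads)
    counts = dict.fromkeys(reads[::step], 0)  # pass 1: sample
    for read in reads:                        # pass 2: count sampled sequences
        if read in counts:
            counts[read] += 1
    return counts


def overrepresented(counts, threshold):
    """Keep only reads seen more often than `threshold` (hypothetical cutoff)."""
    return {read: n for read, n in counts.items() if n > threshold}
```

A rare read is unlikely to land in the sample, which is fine here because only reads with coverage above expected values are of interest.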
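
The duplicate correction noted above (count duplicate pairs only once, after mapping, using reference coordinates rather than raw reads) might look roughly like this. The `MappedPair` record and its fields are hypothetical; the real programs read map-colorspace output, whose layout is not reproduced here.

```python
from typing import NamedTuple


class MappedPair(NamedTuple):
    """Hypothetical record for one mapped read pair."""
    name: str      # read-pair identifier
    contig: str    # reference sequence the pair maps to
    start1: int    # mapped start of the first read
    strand1: str   # "+" or "-"
    start2: int    # mapped start of the second read
    strand2: str   # "+" or "-"


def deduplicate_pairs(pairs):
    """Keep one pair per distinct set of reference coordinates.

    Pairs with the same coordinates and strands are treated as
    amplification duplicates of one template, whatever their names.
    """
    seen = set()
    unique = []
    for pair in pairs:
        key = (pair.contig, pair.start1, pair.strand1, pair.start2, pair.strand2)
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique
```

Keying on mapped coordinates rather than read sequence means two reads that differ only by a sequencing error still collapse to one template.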
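
The homopolymer scan described for generate-homopolymer-hypothesis can be approximated with a backreference regular expression. This is a sketch under assumptions: `min_length` stands in for the script's screening parameters, which are not documented here, and coordinates are 0-based, half-open.

```python
import re


def find_homopolymers(sequence, min_length=5):
    """Return (start, end, base, length) for each run of one base.

    Only runs of at least `min_length` bases are reported, which is one
    way to screen out short homopolymers.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # One base followed by at least (min_length - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.end(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(sequence.upper())
    ]
```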
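
The strand search described for generate-inversion-hypotheses (the regular expression on the positive strand, its reverse complement on the opposite strand) can be sketched by searching the reverse-complemented genome and mapping the hits back to forward coordinates. The function names are hypothetical, and pairing each "+" site with a downstream "-" site is an assumed way of proposing inversion intervals, not necessarily what the script does.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(genome, pattern):
    """Return sorted (start, end, strand) hits in forward coordinates.

    Coordinates are 0-based and half-open. "-" hits are found by matching
    the pattern against the reverse complement of the genome.
    """
    regex = re.compile(pattern)
    n = len(genome)
    sites = [(m.start(), m.end(), "+") for m in regex.finditer(genome)]
    for m in regex.finditer(reverse_complement(genome)):
        sites.append((n - m.end(), n - m.start(), "-"))
    return sorted(sites)


def inversion_candidates(sites):
    """Pair each '+' site with every downstream '-' site (assumed rule)."""
    plus = [s for s in sites if s[2] == "+"]
    minus = [s for s in sites if s[2] == "-"]
    return [(p[0], m[1]) for p in plus for m in minus if m[0] >= p[1]]
```

For example, with the genome `AAGATCCTTTTGGATCAA` and the pattern `GATCC`, `find_sites` reports one site on each strand and `inversion_candidates` proposes the interval spanning both.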

Problems with assemblers
  * Not a single genome assembler is designed for circular chromosomes.
    * Is there some sort of eukaryotic bias?
    * Most assemblers throw away or ignore data that indicates circular chromosomes.