Pluck-scripts
* ~/bin/pluck-scripts/README contains documentation on many Python scripts.
* make-contig-lengths
  * Extracts reads and reads/base if they are stored in the header of each FASTA sequence.
* fasta.py
  * A Python library of commonly used methods and functions for reading and manipulating FASTA sequences.
* map-colorspace
  * Generates a lot of information as a consequence of mapping colorspace reads to a reference.
* pair-contigs
  * Takes one of the outputs of map-colorspace (trim%-cross.rdb) and outputs the reads mapped between contigs.
* analyze-joins
  * Takes the output of pair-contigs and connects contigs.
  * Can make mistakes; needs better heuristics.
* check-hypothesis-with-solid
  * Checks differences between the reference and SOLiD reads, given the output format of find-dna-differences or map-colorspace.
  * Looks for the number of reads that support or reject the hypothesis.
  * Can report if no data supports either the null or the alternative hypothesis.
  * Utilizes paired-end data to map inversions/deletions.
* generate-homopolymer-hypothesis
  * Came about because there was doubt about Newbler's ability to accurately output homopolymers.
  * Finds all occurrences of homopolymers in the assembly and checks the number of bases within each region.
  * Takes some parameters that allow for screening short homopolymers.
  * In pog, there weren't many homopolymers.
  * 454 data was good at predicting homopolymer regions up to 12 bases in length.
  * With 25 bp SOLiD reads, it is difficult to disambiguate homopolymers longer than 18-19 bases.
  * In H. pylori, there are many homopolymer regions that stretch far beyond SOLiD read lengths.
* generate-integration-hypothesis
  * These checking programs, which utilize SOLiD data, do not take into account amplification biases created during library preparation.
  * Output posterior probabilities on the various hypotheses being checked by these programs.
  * There is a correction that you can do: use only pairs that map uniquely.
    That is, duplicate reads should only be counted once.
  * This should be done after mapping, using reference coordinates rather than raw reads.
* generate-integration-hypotheses
  * Given a viral and a reference genome, generates hypotheses for a viral insertion at every location that matches the first f and last l bases.
  * The virus should be represented as linear DNA, with the first base and last base representing the insertion site.
  * This program is modeled after the POG virus's (PIV) insertion mechanism. [citation needed Bernick]
  * The transposons were incorporating at virtually every possible site within the genome.
  * There was at least some evidence for hundreds of integration sites.
  * In H. pylori, cleaning the raw data of transposons was very important for assembling the genome.
  * There was no single transposon site that was always present in the population.
  * Eliminating all transposons allowed the genome to be assembled without a problem.
  * Then, by using SOLiD and 454 data, the transposons were inserted.
  * The transposons were found in only 4 places.
* check-inversions
  * Written because using check-hypothesis-location would be cumbersome due to the format it takes in.
  * Takes CircOS format, which gives the two endpoints of each inversion.
  * Looks for the amount of evidence at the low end and the high end of each orientation.
  * Comes up with an estimate of inversion frequency within the population.
  * Still needs to be fixed to deduplicate the templates.
* find-common-short-reads
  * Looks for identical short reads that have coverage above expected values.
  * Hashes each short read as an exact string and looks at the counts.
  * This program can be improved by using multiple passes and random sampling.
    * The first pass looks at every 1000th read or so and hashes it.
    * On the second pass, add to the hash table built from the first sample.
  * Should be reimplemented in a memory-efficient language.
* make-inversion-hypotheses
  * Converts the trim%inversions.rdb file to the format needed for check-inversions. Each inversion in the rdb file is converted to three separate regions.
  * Gives the shortest, medium, and longest inversions by looking for mate-pair reads that cross the inversion point in both orientations.
* generate-inversion-hypotheses
  * Takes a regular expression that may be recognized by some invertase.
  * Looks for the regular expression on the positive strand and the reverse complement on the opposite strand.
  * Looks for possible sites for inversion.
  * Used to check whether integrase sites are actually inversion sites in pog (they're not).
* count-kmers
  * Counts the frequency of different k-mers within a genome.
* find-frequency-color-kmers
  * Similar to find-common-short-reads.
  * Showed a large number of all-0, all-1, all-2, or all-3 reads.
  * Adapter sequences were also very common.
  * Was not a great filter; find-common-short-reads was better at filtering out noisy data.
* filter-blat
  * Takes the PSL file that comes out of BLAT and picks out the biggest and best contig matches.
  * Sorts by the start site of each alignment in the reference genome.
* make-scaffold-from-blat
  * Takes the output of filter-blat and makes a scaffold from the information.
  * Better matches are chosen.
  * Can raise the priority of certain contigs.
  * Replaces the current assembly with a newer assembly utilizing the new data.
  * Handles overlapping contigs (for blunt-ended contigs use kstitcher).

Problems with assemblers
* Not a single genome assembler is designed for circular chromosomes.
  * Is there some sort of eukaryotic bias?
* Most assemblers throw away or ignore data that indicates circular chromosomes.
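
The multi-pass improvement suggested for find-common-short-reads could look roughly like the sketch below. It is one reading of those notes, not the existing script: the first pass keeps every `step`-th read as a candidate, and the second pass counts only reads already in that table, so memory scales with the sample rather than with the number of distinct reads. The function names, the `step` and `min_count` parameters, and the in-memory sequence of reads are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Sequence


def sample_reads(reads: Iterable[str], step: int = 1000) -> Dict[str, int]:
    """First pass: keep every `step`-th read as a candidate with a zero count."""
    return {read: 0 for i, read in enumerate(reads) if i % step == 0}


def count_sampled(reads: Iterable[str], candidates: Dict[str, int]) -> Dict[str, int]:
    """Second pass: count only the reads that the first pass picked up."""
    counts = dict(candidates)  # copy so the candidate table is not mutated
    for read in reads:
        if read in counts:
            counts[read] += 1
    return counts


def common_reads(reads: Sequence[str], step: int = 1000, min_count: int = 100) -> Dict[str, int]:
    """Return sampled reads whose exact-string count is at least `min_count`.

    `reads` is iterated twice; with a real file this would mean reading it
    twice rather than holding every read in memory (assumption).
    """
    candidates = sample_reads(reads, step)
    counts = count_sampled(reads, candidates)
    return {read: n for read, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Toy colorspace-like reads: one heavily over-represented read plus noise.
    toy = ["0000"] * 50 + ["0123", "3210", "1122"] * 10
    print(common_reads(toy, step=5, min_count=20))
```

A rare read will usually be missed by the sample, which seems acceptable here because only heavily over-represented reads are of interest.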