lecture_notes:05-26-2010

Pluck-scripts

~/bin/pluck-scripts/README contains documentation on many of the Python scripts.
make-contig-lengths
- Extracts read counts and reads/base when they are stored in the header of each FASTA record.
fasta.py
- A Python library that contains commonly used methods and functions for reading and manipulating FASTA sequences.
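As an illustration of the kind of helper such a library provides, here is a minimal sketch of a FASTA reader. The function name `read_fasta` is an assumption made for this example; it is not taken from fasta.py itself.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; not the actual fasta.py API.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    demo = [">contig1 demo", "ACGT", "TTGA", ">contig2", "GG"]
    for name, seq in read_fasta(demo):
        print(name, len(seq))
```

Because it accepts any iterable of lines, the same function works on an open file handle or on a list of strings.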
map-colorspace
- Generates a lot of information as a consequence of mapping colorspace reads to a reference.
pair-contigs
- Takes in one of the outputs of map-colorspace (trim%-cross.rdb) and outputs mapped reads between contigs.
analyze-joins
- Takes output from pair-contigs and connects contigs.
- Can make mistakes; needs better heuristics.
check-hypothesis-with-solid
- Checks differences between reference and solid reads given the output format of find-dna-differences or map-colorspace.
- Looks for a number of reads that support or reject the hypothesis.
- Can report when no data support either the null or the alternative hypothesis.
- Utilizes paired-end data to map inversions/deletions.
generate-homopolymer-hypothesis
- Came about because there was doubt in newbler's ability to accurately output homopolymers.
- This finds all occurrences of homopolymers in the assembly and checks the number of bases within the region.
- Takes in some parameters that allow short homopolymers to be screened out.
- In pog, there weren't many homopolymers.
  - 454 data was good at predicting homopolymer regions up to 12 bases in length.
  - With 25bp solid reads, it is difficult to disambiguate homopolymers longer than 18-19 bases.
- In H. pylori, there are many homopolymer regions that stretch far beyond solid read lengths.
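The homopolymer search described above can be sketched with a back-referencing regular expression. This is a minimal illustration, not the actual generate-homopolymer-hypothesis code; the function name and the `min_length` parameter are assumptions standing in for the screening parameters mentioned in the notes.

```python
import re


def find_homopolymers(seq, min_length=4):
    """Return (start, length, base) for every run of a single base
    that is at least min_length long (0-based start coordinates)."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # ([ACGT]) captures one base; \1{n,} requires n or more repeats of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.end() - m.start(), m.group(1))
        for m in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    for start, length, base in find_homopolymers("ACGTTTTTGAAAC", min_length=3):
        print(start, length, base)
```

Raising `min_length` is how short runs would be screened out in this sketch.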
generate-integration-hypothesis
These checking programs, which use solid data, do not take into account amplification biases created during library preparation.
- Output posterior probabilities on the various hypotheses being checked by these programs.
- There is a correction that you can do: use only pairs that map uniquely. I.e., duplicate reads should only be counted once.
- This should be done after mapping using reference coordinates rather than raw reads.
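The correction described above (count each duplicate only once, keyed on reference coordinates) can be sketched as follows. The tuple layout `(contig, pos1, strand1, pos2, strand2, read_id)` is a hypothetical format chosen for this example; the real map-colorspace output is laid out differently.

```python
def deduplicate_pairs(mapped_pairs):
    """Keep the first pair seen for each distinct set of reference coordinates.

    Each pair is assumed (for this sketch) to be a tuple:
    (contig, pos1, strand1, pos2, strand2, read_id).
    """
    seen = set()
    unique = []
    for pair in mapped_pairs:
        key = pair[:5]  # mapped coordinates only; the read id is ignored
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


if __name__ == "__main__":
    pairs = [
        ("ref", 100, "+", 1600, "-", "r1"),
        ("ref", 100, "+", 1600, "-", "r2"),  # likely amplification duplicate of r1
        ("ref", 100, "+", 1650, "-", "r3"),
    ]
    print([p[5] for p in deduplicate_pairs(pairs)])
```

Keying on mapped positions rather than raw read strings means two reads that differ only by a sequencing error still collapse to one template.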
generate-integration-hypotheses
- Given a viral and a reference genome, generates hypotheses for inserting the viral sequence at every location that matches the first f and last l bases.
  - The virus should be represented as linear DNA, with the first and last bases marking the insertion site.
- This program is modeled after the POG virus's (PIV) insertion mechanism. [citation needed Bernick]
  - The transposons were incorporating at virtually every possible site within the genome.
  - There was at least some evidence for hundreds of integration sites.
- In H. pylori, cleaning the raw data of transposons was very important for assembling the genome.
  - There was no single transposon site that was always present in the population.
  - Eliminating all transposons allowed the genome to be assembled without a problem.
    - Then by using solid and 454 data, the transposons were inserted.
    - The transposons were found in only 4 places.
check-inversions
- Written because using check-hypothesis-location would be cumbersome due to the format it takes in.
- This program takes in Circos format, which specifies the two endpoints of each inversion.
- Looks for the amount of evidence at the low-end and the high-end of each orientation.
- Comes up with an estimate of inversion frequency within the population.
- Still needs to be fixed for deduplicating the templates.
find-common-short-reads
- Looks for identical short reads that have coverage above expected values.
- Hashes each short read as an exact string and looks at the counts.
- This program can be improved by using multiple passes and random sampling.
  - The first pass samples roughly every 1000th read and hashes it.
  - On the second pass, add to the hash table built from the first-pass sample.
  - Should be reimplemented in a memory-efficient language.
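One possible reading of the two-pass idea above, sketched in Python: the first pass hashes a sample of the reads, and the second pass counts only reads already in that table, so memory is bounded by the sample size. The function name, the `open_reads` callable, and the `sample_every` parameter are all assumptions made for this example.

```python
def count_common_reads(open_reads, sample_every=1000):
    """Count occurrences of reads that appear in a first-pass sample.

    open_reads is a callable returning a fresh iterator over read strings,
    so the data can be streamed twice without being held in memory.
    """
    # Pass 1: hash every sample_every-th read as a candidate.
    counts = {}
    for i, read in enumerate(open_reads()):
        if i % sample_every == 0:
            counts[read] = 0
    # Pass 2: count every occurrence of the sampled reads only.
    for read in open_reads():
        if read in counts:
            counts[read] += 1
    return counts


if __name__ == "__main__":
    reads = ["AAAA"] * 5 + ["ACGT"] + ["AAAA"] * 4
    print(count_common_reads(lambda: iter(reads), sample_every=3))
```

A read that is truly over-represented is very likely to be sampled in the first pass, while rare reads mostly never enter the table.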
make-inversion-hypotheses
- Converts the trim%inversions.rdb file to the format needed for check-inversions. Each inversion in the rdb file is converted to three separate regions.
  - Gives the shortest, medium, and longest inversions by looking for mate-pair reads that cross the inversion point in both orientations.
generate-inversion-hypotheses
- Takes in a regular expression that may be recognized by some invertase.
- Looks for the regular expression on the positive strand and the reverse complement on the opposite strand.
- Looks for possible sites for inversion.
- Used to check whether integrase sites are actually inversion sites in pog (they're not).
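The two-strand search described above can be sketched as follows: the pattern is matched against the forward sequence and against its reverse complement, and reverse-strand hits are converted back to forward-strand coordinates. This is an illustration under assumed names (`find_sites`, `reverse_complement`), not the actual generate-inversion-hypotheses code.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(genome, pattern):
    """Return sorted (start, end, strand) tuples in forward-strand,
    0-based, end-exclusive coordinates for matches on either strand."""
    regex = re.compile(pattern)
    n = len(genome)
    sites = [(m.start(), m.end(), "+") for m in regex.finditer(genome)]
    for m in regex.finditer(reverse_complement(genome)):
        # Position p on the reverse strand corresponds to n - p on the forward strand.
        sites.append((n - m.end(), n - m.start(), "-"))
    return sorted(sites)


if __name__ == "__main__":
    print(find_sites("TTGGATCAAAGATCCTT", "GGAT[AC]"))
```

Note that `finditer` reports non-overlapping matches only; overlapping candidate sites would need a lookahead pattern. Pairs of hits on opposite strands would then be the candidate inversion endpoints.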
count-kmers
- Counts the frequency of different k-mers within a genome.
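A minimal sketch of k-mer counting with a sliding window, assuming a single forward-strand sequence held in memory; the function name mirrors the script but is not its actual code.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (forward strand only)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = seq.upper()
    # Slide a window of width k across the sequence, one base at a time.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    for kmer, n in count_kmers("ACGACG", 3).most_common():
        print(kmer, n)
```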
find-frequency-color-kmers
- Similar to find-common-short-reads.
- Showed a large number of all-0, all-1, all-2, or all-3 reads.
- Adapter sequences were also very common.
- Was not a great filter; find-common-short-reads was better at filtering out noisy data.
filter-blat
- Takes the PSL file that comes out of BLAT and pulls out the biggest and best contig matches.
- Sorts by start site of each alignment in the reference genome.
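The filtering and sorting steps above can be sketched against the standard 21-column PSL layout. Using the `matches` column as the measure of "biggest and best" is an assumption made for this example; the real filter-blat criteria are not given in these notes, and the function name is hypothetical.

```python
def best_hits_from_psl(lines):
    """Keep the highest-scoring alignment per query contig and sort the
    survivors by target name and target start."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # PSL header lines do not begin with an integer match count.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        hit = {
            "matches": int(fields[0]),   # column 1: number of matching bases
            "strand": fields[8],         # column 9
            "query": fields[9],          # column 10: qName
            "target": fields[13],        # column 14: tName
            "t_start": int(fields[15]),  # column 16: tStart
            "t_end": int(fields[16]),    # column 17: tEnd
        }
        current = best.get(hit["query"])
        if current is None or hit["matches"] > current["matches"]:
            best[hit["query"]] = hit
    return sorted(best.values(), key=lambda h: (h["target"], h["t_start"]))
```

The sorted list gives contigs in reference order, which is the shape of input a scaffolding step would want.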
make-scaffold-from-blat
- Takes output of filter-blat and makes a scaffold from the information.
- Better matches are chosen.
- Can raise the priority of certain contigs.
- Replaces the current assembly with a newer assembly utilizing the new data.
- Handles overlapping contigs (for blunt-ended contigs, use kstitcher).

Problems with assemblers

Not a single genome assembler is designed for circular chromosomes.
- Is there some sort of eukaryotic bias?
- Most assemblers throw away or ignore data that indicates circular chromosomes.