lecture_notes:05-26-2010

Pluck-scripts

  • ~/bin/pluck-scripts/README contains documentation on many python scripts.
  • make-contig-lengths
    • Extracts contig lengths, plus reads and reads/base if they are stored in the header of each fasta record.
  • fasta.py
    • A python library that contains commonly used methods and functions for reading and manipulating fasta sequences
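As a rough illustration of the kind of helper fasta.py might provide, and of the per-contig length listing that make-contig-lengths suggests, here is a minimal sketch. The names `read_fasta` and `contig_lengths` are hypothetical and are not taken from pluck-scripts. The reads and reads/base fields are not parsed, because the notes do not give the header layout.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contig_lengths(handle: TextIO) -> List[Tuple[str, int]]:
    """Return (contig id, length) for each record; the id is the first word of the header."""
    result = []
    for header, seq in read_fasta(handle):
        words = header.split()
        result.append((words[0] if words else "", len(seq)))
    return result


# Usage with an in-memory file; a real run would pass open("contigs.fa").
example = io.StringIO(">contig1 extra\nACGT\nAC\n>contig2\nGGG\n")
lengths = contig_lengths(example)
```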
  • map-colorspace
    • Generates a lot of information as a consequence of mapping colorspace reads to a reference.
  • pair-contigs
    • Takes in one of the outputs of map-colorspace (trim%-cross.rdb) and outputs mapped reads between contigs.
  • analyze-joins
    • Takes output from pair-contigs and connects contigs.
    • Can make mistakes; needs better heuristics.
  • check-hypothesis-with-solid
    • Checks differences between reference and solid reads given the output format of find-dna-differences or map-colorspace.
    • Looks for a number of reads that support or reject the hypothesis.
    • Can report when no data support either the null or the alternative hypothesis.
    • Utilizes paired-end data to map inversions/deletions.
  • generate-homopolymer-hypothesis
    • Came about because there was doubt in newbler's ability to accurately output homopolymers.
    • Finds all occurrences of homopolymers in the assembly and checks the number of bases within each region.
    • Takes parameters that allow short homopolymers to be screened out.
    • In pog, there weren't many homopolymers.
      • 454 data was good at predicting homopolymer regions up to 12 bases in length.
      • With 25bp solid reads, it is difficult to disambiguate homopolymers longer than 18-19 bases.
    • In H. pylori, there are many homopolymer regions that stretch far beyond solid read lengths.
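The scanning step described for generate-homopolymer-hypothesis can be sketched with a regular expression. This is an assumption about one way to do it, not the script's actual code; `find_homopolymers` and its `min_length` parameter are hypothetical names standing in for the screening parameters mentioned above.

```python
import re


def find_homopolymers(sequence, min_length=5):
    """Return (start, end, base, run_length) for each homopolymer run.

    Coordinates are 0-based and end-exclusive. Runs shorter than
    min_length are screened out.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # One base followed by at least (min_length - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.end(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(sequence.upper())
    ]
```

For example, `find_homopolymers("ACGTTTTTTACAAAAG", min_length=5)` reports only the run of six T bases, while lowering `min_length` to 4 also reports the run of four A bases.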
  • generate-integration-hypothesis
  • The checking programs that use solid data do not take into account amplification biases created during library preparation.
    • They output posterior probabilities for the various hypotheses being checked.
    • There is a correction you can apply: use only pairs that map uniquely, i.e., duplicate reads should be counted only once.
    • This should be done after mapping, using reference coordinates rather than raw reads.
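The deduplication correction can be sketched as follows. The record layout (read id followed by contig, position, and strand for each mate) is an assumption made for illustration; the notes do not specify the real format.

```python
def deduplicate_pairs(mapped_pairs):
    """Keep the first mapped pair seen at each set of reference coordinates.

    Each pair is assumed (hypothetically) to be a tuple:
    (read_id, contig1, pos1, strand1, contig2, pos2, strand2).
    Pairs are compared by where they map, not by their raw sequence,
    so amplification duplicates count only once.
    """
    seen = set()
    unique = []
    for pair in mapped_pairs:
        key = pair[1:]  # everything except the read id
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique
```

Keying on reference coordinates rather than read sequence means two duplicates that differ only by a sequencing error still collapse to one template.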
  • generate-integration-hypotheses
    • Given a viral genome and a reference genome, generates hypotheses for inserting the virus at every location that matches the first f and last l bases.
      • The virus should be represented as linear DNA, with the first and last bases marking the insertion site.
    • This program is modeled after the POG virus's (PIV) insertion mechanism. [citation needed Bernick]
      • The transposons were incorporating at virtually every possible site within the genome.
      • There was at least some evidence for hundreds of integration sites.
    • In H. pylori, cleaning transposons out of the raw data was very important for assembling the genome.
      • There was no single transposon site that was always present in the population.
      • Eliminating all transposons allowed the genome to be assembled without a problem.
        • Then by using solid and 454 data, the transposons were inserted.
        • The transposons were found in only 4 places.
  • check-inversions
    • Written because using check-hypothesis-location would be cumbersome due to the format it takes in.
    • This program takes in Circos format, which specifies the two endpoints of each inversion.
    • Looks for the amount of evidence at the low-end and the high-end of each orientation.
    • Comes up with an estimate of inversion frequency within the population.
    • Still needs to be fixed for deduplicating the templates.
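The notes do not say how check-inversions turns its evidence into a frequency. One simple possibility, shown purely as an assumption, is to pool the counts from the low and high ends and take the fraction that supports the inverted orientation. All names here are hypothetical.

```python
def inversion_frequency(low_ref, low_inv, high_ref, high_inv):
    """Pooled fraction of evidence supporting the inverted orientation.

    low_* and high_* are counts of read pairs at the low-end and
    high-end breakpoints; *_ref supports the reference orientation and
    *_inv supports the inversion. Returns None when there is no evidence.
    """
    inverted = low_inv + high_inv
    total = inverted + low_ref + high_ref
    if total == 0:
        return None
    return inverted / total
```

Counts fed to a function like this should already be deduplicated by template, otherwise amplification bias inflates whichever orientation happened to be amplified.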
  • find-common-short-reads
    • Looks for identical short reads that have coverage above expected values.
    • Hashes each short read as an exact string and looks at the counts.
    • This program can be improved by using multiple passes and random sampling.
      • First pass looks at every 1000 or so reads and hashes them.
      • On second pass, add to the first sample hash table.
      • Should be reimplemented in a memory-efficient language.
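The proposed two-pass improvement could look roughly like the sketch below. It assumes the reads can be iterated twice (a list here; a real version would reopen the read file), and `sample_every` and `min_count` are hypothetical parameters.

```python
from itertools import islice


def find_common_reads(reads, sample_every=1000, min_count=2):
    """Two-pass count of common reads with a bounded hash table.

    reads must be re-iterable (for example a list).
    Pass 1 samples every sample_every-th read as a candidate.
    Pass 2 counts all reads, but only those already in the table,
    so memory is bounded by the sample size rather than by the
    number of distinct reads.
    """
    # Pass 1: sampled reads become the only keys in the table.
    counts = {read: 0 for read in islice(reads, 0, None, sample_every)}
    # Pass 2: add every occurrence of a sampled read to its count.
    for read in reads:
        if read in counts:
            counts[read] += 1
    return {read: n for read, n in counts.items() if n >= min_count}
```

A read that is truly over-represented is likely to be hit by the sample, while rare reads mostly never enter the table, which is the point of sampling first.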
  • make-inversion-hypotheses
    • Converts the trim%inversions.rdb file to the format needed for check-inversions. Each inversion in the rdb file is converted to three separate regions.
      • Gives the shortest, medium, and longest inversions by looking for mate-pair reads that cross the inversion point in both orientations.
  • generate-inversion-hypotheses
    • Takes in a regular expression that may be recognized by some invertase.
    • Looks for the regular expression on the positive strand and the reverse complement on the opposite strand.
    • Looks for possible sites for inversion.
    • Used to check whether integrase sites are actually inversion sites in pog (they're not).
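The two-strand search can be sketched by matching the pattern on the sequence and again on its reverse complement, then converting the second set of hits back to forward-strand coordinates. This is an illustrative assumption about the approach; `find_sites` is a hypothetical name, and pairing the hits into candidate inversions is left out.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(genome, pattern):
    """Return sorted (start, end, strand) hits in forward-strand coordinates.

    Coordinates are 0-based and end-exclusive. Only non-overlapping
    matches are reported on each strand, as with re.finditer.
    """
    regex = re.compile(pattern)
    n = len(genome)
    sites = [(m.start(), m.end(), "+") for m in regex.finditer(genome)]
    # Search the opposite strand by matching on the reverse complement,
    # then flip the coordinates back onto the forward strand.
    for m in regex.finditer(reverse_complement(genome)):
        sites.append((n - m.end(), n - m.start(), "-"))
    return sorted(sites)
```

A site on the "+" strand paired with a site on the "-" strand is the arrangement an invertase would need, so such pairs are the natural candidates to pass on for checking.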
  • count-kmers
    • Counts the frequency of different k-mers within a genome.
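A minimal sketch of k-mer counting on a single strand, assuming the genome fits in memory as one string; `count_kmers` here is an illustrative function, not the script itself.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in sequence (single strand only)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    sequence = sequence.upper()
    # A sequence of length n has n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

Calling `count_kmers(genome, k).most_common(10)` would list the ten most frequent k-mers.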
  • find-frequency-color-kmers
    • Similar to find-common-short-reads
    • Showed a large number of all-0, all-1, all-2, or all-3 reads.
    • Adapter sequences were also very common.
    • Was not a great filter; find-common-short-reads was better at filtering out noisy data.
  • filter-blat
    • Takes the psl file that comes out of blat and pulls out the biggest and best contig matches.
    • Sorts by start site of each alignment in the reference genome.
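The filtering and sorting can be sketched as below, using the standard 21-column PSL layout (matches in column 1, query name in column 10, target name and start in columns 14 and 16). Using the match count as the "biggest and best" score is an assumption; the notes do not say how filter-blat ranks alignments.

```python
def parse_psl(lines):
    """Yield one dict per PSL alignment row, skipping header lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Alignment rows have 21 tab-separated columns, the first numeric.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        yield {
            "matches": int(fields[0]),
            "q_name": fields[9],
            "q_size": int(fields[10]),
            "t_name": fields[13],
            "t_start": int(fields[15]),
            "t_end": int(fields[16]),
        }


def best_hits(lines):
    """Keep the highest-scoring alignment per contig, sorted by reference start."""
    best = {}
    for hit in parse_psl(lines):
        current = best.get(hit["q_name"])
        # Assumed score: number of matching bases.
        if current is None or hit["matches"] > current["matches"]:
            best[hit["q_name"]] = hit
    return sorted(best.values(), key=lambda h: (h["t_name"], h["t_start"]))
```

The sorted result gives contigs in reference order, which is the form a scaffolding step would consume.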
  • make-scaffold-from-blat
    • Takes output of filter-blat and makes a scaffold from the information.
    • Better matches are chosen.
    • Can raise the priority of certain contigs.
    • Replaces the current assembly with a newer assembly utilizing the new data.
    • Handles overlapping contigs (for blunt-ended contigs, use kstitcher).

Problems with assemblers

  • Not a single genome assembler is designed for circular chromosomes.
    • Is there some sort of eukaryotic bias?
    • Most assemblers throw away or ignore data that indicates circular chromosomes
lecture_notes/05-26-2010.txt · Last modified: 2010/05/26 22:19 by hyjkim