lecture_notes:05-19-2010

ALLPATHS

Presented by Thomas

  • ALLPATHS was created to improve reference genomes.
  • The version described here is optimized for 100-base Illumina reads.
  • Works with paired-end reads.
  • Requires high coverage: at least 40x raw read coverage for each library.
  • Requires a minimum of 2 paired-end libraries: one short and one long.
    • The short separation size must be less than twice the read size.
    • The distribution of sizes should be as narrow as possible, with a std dev of < 20%.
    • The long library insert size should be approximately 4000 bases and can have a wider size distribution.
  • Installation
    • Requires the Boost libraries and an up-to-date C++ compiler.
    • Installation is slow: over 2 hours of compilation time.
    • Download and extract the tarball
    • autoconf
    • ./configure
    • make -j8 (parallel compilation)
    • make install scripts
  • Pipeline/Modules
    • All binaries are located in /bin
    • RunAllpaths3g controls the entire pipeline.
    • Directories are created for each new job so different assemblies can be compared.
      • Reference
        • Contains the reference genome
      • Data
        • Read FASTA, qual, and pairs files.
        • May contain many run directories, each representing a particular attempt to assemble the original data using a different set of parameters.
      • Run
        • Intermediate files.
      • Assemblies
        • Finished assemblies are stored in this directory.
      • SubDir
      • OptionsFile
        • There are many options.
  • Preparing read data
    • Ploidy file: 1 for haploid, 2 for diploid.
    • Fragment library reads are expected to be oriented towards each other.
    • Jumping library reads are expected to be oriented away from each other.
  • Differences between ALLPATHS v1 and v2
    • v1: high-quality assemblies from simulated short reads.
    • v2: high-quality assemblies can be obtained from real data.
      • Beat Velvet and EULER-SR.
  • Input:
    • Three different bacteria
      • S. aureus
  • Output
    • A graph of continuous paths.
      • Shows paths between contigs.
      • Each component is its own scaffold
  • Some other things
    • Reads that are >90% A are removed; these are claimed to be an artifact of the Illumina sequencing platform.
  • Runtime
    • Scales almost linearly with genome size. E. coli: ~8.2 h.
    • May be too slow for large genomes (> 1 Gb).
  • Error rate:
    • Beats Velvet and EULER-SR in all categories (measured in 10 kb windows).
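
The library requirements listed above for ALLPATHS can be turned into a quick sanity check before a run is started. The sketch below is not part of ALLPATHS: the `Library` class, its field names, and `check_short_library` are hypothetical names chosen for illustration. It also assumes that the 20% limit is measured relative to the mean separation and that "separation size" refers to the mean fragment size reported for the library, neither of which the notes state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Library:
    """Summary statistics for one paired-end library (hypothetical structure)."""
    name: str
    read_length: int        # bases per read
    mean_separation: float  # mean separation size in bases (assumed: fragment size)
    std_dev: float          # standard deviation of the separation size in bases
    coverage: float         # raw read coverage, e.g. 45.0 for 45x


def check_short_library(lib, min_coverage=40.0, max_rel_std=0.20):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    # Rule from the notes: 40x+ raw read coverage for each library.
    if lib.coverage < min_coverage:
        problems.append(f"{lib.name}: coverage {lib.coverage}x is below {min_coverage}x")
    # Rule from the notes: short separation must be less than twice the read size.
    if lib.mean_separation >= 2 * lib.read_length:
        problems.append(f"{lib.name}: separation is not below 2 x read length")
    # Rule from the notes: std dev < 20% (assumed here: 20% of the mean separation).
    if lib.std_dev / lib.mean_separation >= max_rel_std:
        problems.append(f"{lib.name}: std dev is not below 20% of the mean separation")
    return problems


if __name__ == "__main__":
    fragment = Library("fragment", read_length=100, mean_separation=180.0,
                       std_dev=18.0, coverage=45.0)
    for line in check_short_library(fragment) or ["no problems found"]:
        print(line)
```

Returning a list of messages rather than raising on the first failure lets all violations for a library be reported at once.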

Mira

Presented by Michael Cusack

  • Designed to work with difficult genomes (lots of repeats or other sequence aberrations)
  • Hybrid Assembly
    • Can combine several data types
      • Does not work with SOLiD
      • Can take trace data from Sanger in addition to base calls
      • Position-specific confidence values
      • A stretch in each sequence marked as a high-confidence region
      • General properties such as directionality
  • Mira is an Iterative Process
    • Read scanning with a fast, error-tolerant pairwise comparison. (Both methods are less sensitive than Smith-Waterman.)
      • DNA-Shift-AND
        • Align small words within a read
        • O(c*n), c=# allowed errors
        • Must find 2 of 3 words to establish a relationship
      • Zebra
        • Transcribe, Divide, Reorganize, Concentrate and Conquer strategy
        • Hashes each octet of bases into a 16-bit int and creates a hash-index table.
    • More thorough comparison to establish the type of relationship
      • Uses a modified Smith-Waterman alignment.
      • Uses banding.
      • Uses information generated from DNA-SAND/ZEBRA.
    • Building graph
      • Overlap alignment + complementary data (orientation, overlap region etc.)
    • Iterative Processing
      • Start with highest quality.
        • Split each read into high confidence and low confidence regions by quality clipping.
        • Only high confidence regions are used to build initial contigs.
        • Low confidence regions are used “cautiously”
    • Creating Contigs
      • Pathfinder
        • Finds best nodes and uses them as anchors
        • Extends while minimizing uncertainties of consensus bases
        • Uses an n, m-step recursive look-ahead algorithm to detect repeats.
      • Contig Builder
        • Once a path is decided, each contig must be compiled and approved
        • If a read is too different from the existing consensus, despite a high-scoring overlap, it is rejected and the pathfinder is run again from that point.
    • Independent Observations
      • “One central pillar of the quality calculation in MIRA is the rule that independent observations of a base confirm this base better than non-independent observations. When a base was read from both directions, one can assume independence of observations: it's not the whole truth, but close enough. As a side note: observing a…” something…
    • Handling repeats
      • Can take in known repetitive elements.
        • When these reads are detected, much stricter control mechanisms can be applied.
      • When there is a discrepancy in a read matching a repeated element, signal processing of the trace is used to determine if the error is explainable.
      • If the percentage of unexplainable errors is greater than a threshold (default: 1%), reads are rejected from the consensus and returned to the assembly graph.
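
The Zebra bullet above describes hashing each octet of bases into a 16-bit integer and building a hash-index table. A minimal sketch of that idea follows, assuming a 2-bit code per base (8 bases x 2 bits = 16 bits). It is not MIRA's code: `encode_octet`, `build_index`, and `candidate_pairs` are hypothetical names, and octets containing anything other than A, C, G, or T are simply skipped here.

```python
from collections import defaultdict

# Assumed 2-bit encoding; any fixed assignment of the four bases would work.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
K = 8  # octet length: 8 bases x 2 bits fit exactly in 16 bits


def encode_octet(octet):
    """Pack 8 bases into an integer in the range 0..65535."""
    value = 0
    for base in octet:
        value = (value << 2) | BASE_CODE[base]
    return value


def build_index(reads):
    """Map each 16-bit octet code to the (read_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - K + 1):
            octet = seq[pos:pos + K]
            if all(base in BASE_CODE for base in octet):
                index[encode_octet(octet)].append((read_id, pos))
    return index


def candidate_pairs(index):
    """Return pairs of read ids that share at least one octet."""
    pairs = set()
    for hits in index.values():
        ids = sorted({read_id for read_id, _ in hits})
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "GTACGTACGG", "TTTTTTTTTT"]
    print(candidate_pairs(build_index(reads)))
```

Read pairs found this way are only candidates for an overlap; in the notes, the banded Smith-Waterman step is what establishes the actual relationship.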
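
The DNA-Shift-AND bullet above gives a cost of O(c*n) for c allowed errors. The sketch below shows how a bit-parallel Shift-AND search reaches that cost by keeping one bit-vector per error level. It is the textbook Shift-AND extended to mismatches only; whether MIRA's variant also handles insertions and deletions is not stated in these notes, and the function name `shift_and_mismatches` is hypothetical.

```python
def shift_and_mismatches(pattern, text, max_errors):
    """Return end positions in text where pattern matches with <= max_errors mismatches.

    Bit i of states[d] is set when pattern[0..i] matches the text ending at the
    current position with at most d mismatches.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    full = (1 << m) - 1      # keeps the bit-vectors m bits wide
    accept = 1 << (m - 1)    # bit that signals a complete match

    # masks[base] has bit i set wherever pattern[i] == base.
    masks = {}
    for i, base in enumerate(pattern):
        masks[base] = masks.get(base, 0) | (1 << i)

    states = [0] * (max_errors + 1)
    ends = []
    for pos, base in enumerate(text):
        char_mask = masks.get(base, 0)
        prev_old = states[0]  # previous-column value of the level below
        states[0] = ((states[0] << 1) | 1) & char_mask
        for d in range(1, max_errors + 1):
            current_old = states[d]
            matched = ((current_old << 1) | 1) & char_mask  # extend with a match
            substituted = ((prev_old << 1) | 1) & full      # extend by spending one mismatch
            states[d] = matched | substituted
            prev_old = current_old
        if states[max_errors] & accept:
            ends.append(pos)
    return ends


if __name__ == "__main__":
    print(shift_and_mismatches("ACGT", "TTAGGTTT", 1))
```

Each text character costs one constant-time update per error level, so the total work grows with the number of allowed errors times the text length, provided the word fits in a machine word.
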
lecture_notes/05-19-2010.txt · Last modified: 2010/05/19 15:14 by hyjkim