lecture_notes:05-19-2010


ALLPATHS

  • ALLPATHS was created to improve reference genomes.
  • The version described here is optimized for 100-base Illumina reads.
  • Works with paired-end reads.
  • Requires high coverage: at least 40x raw read coverage for each library.
  • Requires a minimum of two paired-end libraries: one short and one long.
    • The short library's separation size must be less than twice the read size.
    • The distribution of sizes should be as narrow as possible, with a std dev of < 20%.
    • The long library's insert size should be approximately 4000 bases and can have a wider size distribution.
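
The library requirements above can be checked before starting a run. The following is a minimal sketch, not part of ALLPATHS: the function names are hypothetical, raw coverage is computed as total read bases divided by genome size, and the 20% limit is assumed to be relative to the mean separation size.

```python
def raw_coverage(num_reads, read_length, genome_size):
    """Raw read coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def check_libraries(read_length, short_mean, short_sd, coverages, min_coverage=40.0):
    """Return a list of problems found; an empty list means all checks passed.

    coverages maps a library name to its raw coverage (e.g. from raw_coverage).
    Assumption: the 20% std dev limit is taken relative to the mean size.
    """
    problems = []
    if short_mean >= 2 * read_length:
        problems.append("short library size must be less than twice the read size")
    if short_sd >= 0.20 * short_mean:
        problems.append("short library std dev should be below 20% of the mean")
    for name, coverage in coverages.items():
        if coverage < min_coverage:
            problems.append(f"library '{name}' has less than {min_coverage:g}x coverage")
    return problems


if __name__ == "__main__":
    # Hypothetical numbers: 2 million 100-base reads over a 5 Mb genome.
    cov = raw_coverage(2_000_000, 100, 5_000_000)
    print(check_libraries(100, 180, 18, {"short": cov, "long": 50.0}))
```

The long library is not checked here because the notes give only an approximate target (about 4000 bases) for it.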
  • Installation
    • Requires the Boost libraries and an up-to-date C++ compiler.
    • Very long installation, over 2 hours of compilation time.
    • Download and extract the tarball
    • autoconf
    • ./configure
    • make -j8 (parallel compilation)
    • make install scripts
  • Pipeline/Modules
    • All binaries are located in /bin
    • RunAllpaths3g controls the entire pipeline.
    • Directories are created for each new job so different assemblies can be compared.
      • Reference
        • Contains the reference genome
      • Data
        • Contains the reads as fasta, qual, and pairs files.
        • May contain many run directories, each representing a particular attempt to assemble the original data using a different set of parameters.
      • Run
        • Intermediate files.
      • Assemblies
        • Finished assemblies are stored in this directory.
      • SubDir
      • OptionsFile
        • There are many options.
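
To make the per-job directory layout concrete, here is a minimal sketch that builds such a tree. It is not the ALLPATHS tooling: `make_job_dirs` and all directory names are hypothetical, and the sketch assumes the directories nest in the order listed above (the notes state only that a Data directory may hold many Run directories; the deeper nesting is an assumption).

```python
import tempfile
from pathlib import Path


def make_job_dirs(root, reference, data, run, subdir="test"):
    """Create root/reference/data/run/Assemblies/subdir and return the paths.

    Assumption: each level nests inside the previous one, so several runs
    with different parameters can share one Data directory.
    """
    reference_dir = Path(root) / reference
    data_dir = reference_dir / data
    run_dir = data_dir / run
    sub_dir = run_dir / "Assemblies" / subdir
    sub_dir.mkdir(parents=True, exist_ok=True)
    return {
        "reference": reference_dir,
        "data": data_dir,
        "run": run_dir,
        "subdir": sub_dir,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Two hypothetical runs over the same read data, to be compared later.
        first = make_job_dirs(root, "S.aureus", "reads_40x", "run_a")
        second = make_job_dirs(root, "S.aureus", "reads_40x", "run_b")
        print(first["run"].name, second["run"].name)
```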
  • Preparing read data
    • ploidy file: 1 for haploid, 2 for diploid
    • Fragment library reads are expected to be oriented towards each other.
    • Jumping library reads are expected to be oriented away from each other.
  • Differences between ALLPATHS v1 and v2
    • v1: high-quality assemblies from simulated short reads.
    • v2: high-quality assemblies can be obtained from real data.
      • Beats Velvet and EULER-SR.
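
The read preparation steps can be sketched as follows. This is an illustration only: the file name `ploidy` and both helper names are assumptions, not taken from the notes. Reverse-complementing both reads of a pair is one common way to flip a pair between the two orientations described above.

```python
import tempfile
from pathlib import Path

# Translation table for complementing bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def write_ploidy(data_dir, ploidy):
    """Write a ploidy file: 1 for haploid, 2 for diploid.

    Assumption: the file is named 'ploidy' and holds a single number.
    """
    if ploidy not in (1, 2):
        raise ValueError("ploidy must be 1 (haploid) or 2 (diploid)")
    path = Path(data_dir) / "ploidy"
    path.write_text(f"{ploidy}\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        print(write_ploidy(data_dir, 1).read_text().strip())
    print(reverse_complement("AACGT"))
```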
  • Input:
    • Three different bacteria
      • S. aureus
  • Output
    • A graph of continuous paths.
      • Shows paths between contigs.
      • Each component is its own scaffold
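
The idea that each connected component of the output graph is its own scaffold can be illustrated with a small breadth-first search. This is a generic sketch, not the ALLPATHS data structure: contigs are plain string names and `links` is a hypothetical list of contig pairs joined by a path.

```python
from collections import defaultdict, deque


def scaffolds_from_links(contigs, links):
    """Group contigs into connected components; each component is one scaffold."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    scaffolds = []
    for start in contigs:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        component = []
        queue = deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        scaffolds.append(sorted(component))
    return scaffolds


if __name__ == "__main__":
    contigs = ["c1", "c2", "c3", "c4", "c5", "c6"]
    links = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
    print(scaffolds_from_links(contigs, links))
```

A contig with no links, such as `c6` in the usage example, ends up as a scaffold of its own.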
  • Some other things
    • Reads that are >90% A are removed; this is claimed to be an artifact of the Illumina sequencing platform.
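
A filter of this kind could look like the sketch below. It is not the ALLPATHS implementation; the function names are hypothetical and reads are assumed to be plain strings of bases.

```python
def a_fraction(read):
    """Fraction of bases in the read that are A (case-insensitive)."""
    if not read:
        return 0.0
    return read.upper().count("A") / len(read)


def drop_poly_a(reads, max_a_fraction=0.90):
    """Keep only reads whose A fraction does not exceed the threshold."""
    return [read for read in reads if a_fraction(read) <= max_a_fraction]


if __name__ == "__main__":
    reads = ["A" * 95 + "C" * 5, "ACGT" * 25]
    # Only the mixed-base read survives the filter.
    print(len(drop_poly_a(reads)))
```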
  • Runtime
    • Scales almost linearly with genome size; E. coli takes ~8.2 h.
    • May be too slow for large genomes (> 1 Gb).
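
A rough extrapolation shows why large genomes are a concern. The sketch below assumes strictly linear scaling from the E. coli figure and an E. coli genome size of about 4.6 Mb; the genome size is an assumption not given in the notes, and since the scaling is only "almost" linear the result is an order-of-magnitude estimate at best.

```python
ECOLI_GENOME_BP = 4_600_000  # assumed approximate E. coli genome size
ECOLI_RUNTIME_H = 8.2        # runtime quoted in the notes


def estimate_runtime_hours(genome_bp):
    """Linearly extrapolate runtime from the E. coli data point."""
    return ECOLI_RUNTIME_H * genome_bp / ECOLI_GENOME_BP


if __name__ == "__main__":
    hours = estimate_runtime_hours(1_000_000_000)  # a 1 Gb genome
    print(f"{hours:.0f} h, about {hours / 24:.0f} days")
```

Under these assumptions a 1 Gb genome would take on the order of 1,800 hours, which is well over two months.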
  • Error rate:
    • Beats Velvet and EULER-SR in all categories (measured in 10 kb windows).
lecture_notes/05-19-2010.1274306318.txt.gz · Last modified: 2010/05/19 14:58 by hyjkim