Banana Slug Genomics

ALLPATHS

Presented by Thomas

ALLPATHS was created to improve reference genomes.
The version described here is optimized for 100-base Illumina reads.
It supports paired-end reads.
It requires high coverage: 40x or more raw read coverage for each library.
It needs a minimum of 2 paired-end libraries, one short and one long:
- The short library's separation size must be less than twice the read size.
- The distribution of sizes should be as narrow as possible, with a std dev of < 20%.
- The long library's insert size should be approximately 4000 bases and can have a wider size distribution.
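The short-library requirements above can be written down as a small sanity check. This is only a sketch, not part of ALLPATHS: the `Library` class and `check_short_library` function are hypothetical names, and it assumes that "std dev of < 20%" means 20% of the mean separation size.

```python
from dataclasses import dataclass


@dataclass
class Library:
    """Hypothetical description of one paired-end library."""
    name: str
    read_length: int        # bases per read
    mean_separation: float  # mean separation size in bases
    std_dev: float          # standard deviation of the separation size
    coverage: float         # raw read coverage, e.g. 45.0 for 45x


def check_short_library(lib, min_coverage=40.0, max_rel_std=0.20):
    """Return a list of problems; an empty list means the library passes."""
    problems = []
    if lib.coverage < min_coverage:
        problems.append(f"{lib.name}: coverage {lib.coverage}x is below {min_coverage}x")
    if lib.mean_separation >= 2 * lib.read_length:
        problems.append(f"{lib.name}: separation is not less than twice the read size")
    # Assumption: the 20% limit is relative to the mean separation size.
    if lib.std_dev >= max_rel_std * lib.mean_separation:
        problems.append(f"{lib.name}: std dev is not below {max_rel_std:.0%} of the mean")
    return problems
```

For example, a library with 100-base reads, a mean separation of 180 with a std dev of 18, and 50x coverage passes all three checks.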
Installation
- Requires the Boost libraries and an up-to-date C compiler.
- Installation is very long: over 2 hours of compilation time.
- Download and extract the tarball
- autoconf
- ./configure
- make -j8 (parallel compilation)
- make install scripts
Pipeline/Modules
- All binaries are located in /bin
- RunAllpaths3g controls the entire pipeline.
- Directories are created for each new job so different assemblies can be compared.
  - Reference
    - Contains the reference genome
  - Data
    - Contains the reads as FASTA, qual, and pairs files.
    - May contain many run directories, each representing a particular attempt to assemble the original data using a different set of parameters.
  - Run
    - Intermediate files.
  - Assemblies
    - Finished assemblies are stored in this directory.
  - SubDir
  - OptionsFile
    - There are many options.
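As a rough illustration of how the per-job directories keep different assemblies apart, the sketch below composes one path per attempt. The nesting order (Reference, then Data, then Run, then Assemblies, then SubDir) is an assumption read off the notes above, and `assembly_dir` and all directory names in the example are hypothetical.

```python
from pathlib import Path


def assembly_dir(root, reference, data, run, subdir):
    """Build the directory for one assembly attempt.

    Assumption: each level nests inside the previous one, so two runs on
    the same data differ only in the `run` component of the path.
    """
    return Path(root) / reference / data / run / "Assemblies" / subdir


# Two attempts on the same reads with different parameter sets.
first = assembly_dir("/work", "ecoli", "reads_v1", "run_a", "test")
second = assembly_dir("/work", "ecoli", "reads_v1", "run_b", "test")
```

Because both attempts share the same Data directory, the input reads are stored once while the intermediate files and finished assemblies stay separate.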
Preparing read data
- ploidy file: 1 for haploid, 2 for diploid
- Fragment library reads are expected to be oriented towards each other.
- Jumping library reads are expected to be oriented away from each other.
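To make the two orientations concrete, here is a minimal sketch that classifies a pair from where its reads map on a reference. It is not taken from ALLPATHS; `pair_orientation` is a hypothetical helper, and it assumes each read is described by its leftmost mapped coordinate and a strand of '+' or '-'.

```python
def pair_orientation(pos1, strand1, pos2, strand2):
    """Classify a read pair as 'inward', 'outward', or 'same-strand'.

    pos1/pos2 are leftmost mapped coordinates; strands are '+' or '-'.
    """
    if strand1 == strand2:
        return "same-strand"
    # Order the two reads by position on the reference.
    (_, left_strand), (_, _) = sorted([(pos1, strand1), (pos2, strand2)])
    # If the left read is on the forward strand, the reads point at each
    # other (fragment-style); otherwise they point apart (jumping-style).
    return "inward" if left_strand == "+" else "outward"
```

A fragment pair such as `pair_orientation(100, '+', 300, '-')` is classified as inward, while a jumping pair such as `pair_orientation(100, '-', 4000, '+')` is classified as outward.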
Differences between ALLPATHS v1 and v2
- v1: high quality assemblies from simulated short reads
- v2: high quality assemblies can be obtained from real data
  - Beat Velvet and EULER-SR
Input:
- Three different bacteria
  - S. aureus
Output
- A graph of continuous paths.
  - Shows paths between contigs.
  - Each component is its own scaffold
Some other things
- Reads that are >90% A are removed. The authors claim these are an artifact of the Illumina sequencing platform.
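The filter described above is simple enough to sketch directly. This is an illustration rather than the ALLPATHS code: `is_poly_a` and `drop_poly_a` are hypothetical names, and reads are assumed to be plain strings of bases.

```python
def is_poly_a(read, threshold=0.90):
    """Return True if more than `threshold` of the bases in the read are A."""
    if not read:
        return False
    return read.upper().count("A") / len(read) > threshold


def drop_poly_a(reads, threshold=0.90):
    """Keep only the reads that are not dominated by A."""
    return [read for read in reads if not is_poly_a(read, threshold)]
```

A read of 95 A's followed by 5 C's is dropped, while a read that is exactly 90% A is kept because the cutoff is strict.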
Runtime
- Scales almost linearly with genome size; E. coli took ~8.2 h.
- May be too slow for large genomes (> 1 Gb).
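A back-of-the-envelope extrapolation shows why large genomes are a concern. The sketch assumes perfectly linear scaling and an E. coli genome of roughly 4.6 Mb (a figure not given in the notes), and it ignores memory limits and hardware differences, so treat the result as an order of magnitude only.

```python
ECOLI_GENOME_BP = 4.6e6  # assumed approximate E. coli genome size
ECOLI_RUNTIME_H = 8.2    # E. coli runtime quoted in the notes


def estimated_runtime_hours(genome_bp):
    """Linearly scale the E. coli runtime to a genome of `genome_bp` bases."""
    return ECOLI_RUNTIME_H * genome_bp / ECOLI_GENOME_BP


hours_for_1gb = estimated_runtime_hours(1e9)
```

Under these assumptions a 1 Gb genome comes out at roughly 1,800 hours, which is more than two months of compute.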
Error rate:
- Beats Velvet and EULER-SR in all categories (measured in 10 kb windows).

Mira

Presented by Michael Cusack

Designed to work with difficult genomes (lots of repeats or other sequence aberrations)
Hybrid Assembly
- Can combine several data types
  - Does not work with SOLiD
  - Can take trace data from Sanger in addition to base calls
  - Position-specific confidence values
  - A stretch in each sequence marked as a high confidence region
  - General properties such as directionality
Mira is an Iterative Process
- Read scanning with a fast, error-tolerant pairwise comparison. (Both algorithms below are less sensitive than Smith-Waterman.)
  - DNA-Shift-AND
    - Align small words within a read
    - O(c*n), c=# allowed errors
    - Must find 2 of 3 words to establish a relationship
  - Zebra
    - Transcribe, Divide, Reorganize, Concentrate and Conquer strategy
    - Hashes each octet of bases into a 16-bit int and creates a hash-index table.
- More thorough comparison to establish the type of relationship
  - Uses a modified Smith-Waterman alignment.
  - Uses banding.
  - Uses information generated from DNA-Shift-AND/Zebra.
- Building graph
  - Overlap alignment + complementary data (orientation, overlap region etc.)
- Iterative Processing
  - Start with highest quality.
    - Split each read into high confidence and low confidence regions by quality clipping.
    - Only high confidence regions are used to build initial contigs.
    - Low confidence regions are used “cautiously”
- Creating Contigs
  - Pathfinder
    - Finds best nodes and uses them as anchors
    - Extends while minimizing uncertainties of consensus bases
    - Uses an n, m-step recursive look-ahead algorithm to detect repeats.
  - Contig Builder
    - Once a path is decided, each contig must be compiled and approved
    - If a read is too different from the existing consensus, despite a high-scoring overlap, it is rejected and the pathfinder is run again from that point.
- Independent Observations
  - “One central pillar of the quality calculation in MIRA is the rule that independent observations of a base confirm this base better than non-independent observations. When a base was read from both directions, one can assume independence of observations: it's not the whole truth, but close enough. As a side note: observing a…”
- Handling repeats
  - Can take in known repetitive elements.
    - When these reads are detected, much stricter control mechanisms can be applied.
  - When there is a discrepancy in a read matching a repeated element, signal processing of the trace is used to determine whether the error is explainable.
  - If the percentage of unexplainable errors is greater than a threshold (default: 1%), reads are rejected from the consensus and returned to the assembly graph.
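The Zebra idea from the read-scanning step (hash each octet of bases into a 16-bit int and build a hash-index table) can be sketched as follows. This is not MIRA's implementation: the 2-bits-per-base encoding is an assumption that happens to fit 8 bases into 16 bits, indexing every overlapping octet is also an assumption, and all function names are hypothetical.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding
K = 8  # one octet of bases


def hash_octet(octet):
    """Pack 8 bases into a 16-bit int; return None if a base is not A/C/G/T."""
    if len(octet) != K:
        raise ValueError("expected exactly 8 bases")
    value = 0
    for base in octet.upper():
        code = BASE_CODE.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value


def build_hash_index(reads):
    """Map each octet hash to the (read id, position) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - K + 1):
            h = hash_octet(seq[pos:pos + K])
            if h is not None:
                index[h].append((read_id, pos))
    return index


def candidate_pairs(index):
    """Return pairs of read ids that share at least one octet."""
    pairs = set()
    for hits in index.values():
        ids = sorted({read_id for read_id, _ in hits})
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs
```

Reads that share an octet become candidates for the more thorough banded Smith-Waterman comparison, so the expensive alignment only runs on pairs that already show some evidence of overlap.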
