=====ALLPATHS=====
===Presented by Thomas===
  * ALLPATHS was created to improve reference genomes.
  * The version described here is optimized for 100-base Illumina reads.
  * Supports paired-end reads.
  * Requires high coverage: 40x or more raw read coverage for each library.
  * Requires a minimum of 2 paired-end libraries: one short and one long.
    * The short library's separation size must be less than twice the read size.
    * The size distribution should be as narrow as possible, with a std dev of < 20%.
    * The long library's insert size should be approximately 4000 bases and can have a wider size distribution.
  * Installation
    * Requires the Boost libraries and an up-to-date C compiler.
    * Installation is very long: over 2 hours of compilation time.
    * Download and extract the tarball
    * autoconf
    * ./configure
    * make -j8 (parallel compilation)
    * make install scripts
  * Pipeline/Modules
    * All binaries are located in /bin.
    * RunAllpaths3g controls the entire pipeline.
    * Directories are created for each new job so different assemblies can be compared.
      * Reference
        * Contains the reference genome.
      * Data
        * Reads fasta, qual, and pairs files.
        * May contain many run directories, each representing a particular attempt to assemble the original data using a different set of parameters.
      * Run
        * Intermediate files.
      * Assemblies
        * Finished assemblies are stored in this directory.
      * SubDir
    * OptionsFile
      * There are many options.
  * Preparing read data
    * Ploidy file: 1 for haploid, 2 for diploid.
    * Fragment library reads are expected to be oriented towards each other.
    * Jumping library reads are expected to be oriented away from each other.
  * Differences between ALLPATHS v1 and v2
    * v1: high-quality assemblies from simulated short reads.
    * v2: high-quality assemblies can be obtained from real read data.
    * Beats Velvet and Euler-SR.
  * Input:
    * Three different bacteria
    * S. aureus
  * Output
    * A graph of continuous paths.
    * Shows paths between contigs.
    * Each component is its own scaffold.
  * Some other things
    * Removes reads that are > 90% A; the authors claim these are an artifact of the Illumina sequencing platform.
  * Runtime
    * Scales almost linearly with genome size. E. coli: ~8.2 h.
    * May be too slow for large genomes (> 1 Gb).
  * Error rate:
    * Beats Velvet and Euler-SR in all categories (measured in 10 kb windows).

=====Mira=====
===Presented by Michael Cusack===
  * Designed to work with difficult genomes (lots of repeats or other sequence aberrations).
  * Hybrid Assembly
    * Can combine several data types.
    * Does not work with SOLiD.
    * Can take trace data from Sanger in addition to base calls.
    * Position-specific confidence values
      * A stretch in each sequence is marked as a high-confidence region.
    * General properties such as directionality.
  * Mira is an Iterative Process
    * Read scanning with a fast, error-tolerant pairwise comparison (both methods are less sensitive than Smith-Waterman).
      * DNA-Shift-AND
        * Aligns small words within a read.
        * O(c*n), where c = number of allowed errors.
        * Must find 2 of 3 words to establish a relationship.
      * Zebra
        * Transcribe, Divide, Reorganize, Concentrate and Conquer strategy.
        * Hashes each octet of bases into a 16-bit int and creates a hash-index table.
    * More thorough comparison to establish the type of relationship
      * Uses a modified Smith-Waterman alignment.
      * Uses banding.
      * Uses information generated from DNA-SAND/ZEBRA.
    * Building graph
      * Overlap alignment + complementary data (orientation, overlap region, etc.)
    * Iterative Processing
      * Start with the highest quality.
      * Split each read into high-confidence and low-confidence regions by quality clipping.
      * Only high-confidence regions are used to build initial contigs.
      * Low-confidence regions are used "cautiously".
    * Creating Contigs
      * Pathfinder
        * Finds the best nodes and uses them as anchors.
        * Extends while minimizing uncertainties of consensus bases.
        * Uses an n, m-step recursive look-ahead algorithm to detect repeats.
      * Contig Builder
        * Once a path is decided, each contig must be compiled and approved.
        * If a read differs too much from the existing consensus, despite a high-scoring overlap, it is rejected and the pathfinder is run again from that point.
  * Independent Observations
    * "One central pillar of the quality calculation in MIRA is the rule that independent observations of a base confirm this base better than non-independent observations. When a base was read from both directions, one can assume independence of observations: it's not the whole truth, but close enough. As a side note: observing a..." something...
  * Handling repeats
    * Can take in known repetitive elements.
    * When these reads are detected, much stricter control mechanisms can be applied.
    * When there is a discrepancy in a read matching a repeated element, signal processing of the trace is used to determine whether the error is explainable.
      * If the percentage of unexplainable errors is greater than a threshold (default: 1%), the reads are rejected from the consensus and returned to the assembly graph.
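
The Zebra read-scanning step noted above (each octet of bases hashed into a 16-bit int, collected in a hash-index table) can be sketched roughly as follows. This is only an illustration, not MIRA's actual code: the 2-bit encoding (A=0, C=1, G=2, T=3), the function names ''hash_octet'' and ''build_hash_index'', and the decision to skip words containing ambiguous bases are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed 2-bit encoding: 8 bases x 2 bits = 16 bits per word.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
WORD_LEN = 8


def hash_octet(word):
    """Pack 8 bases into a 16-bit integer; return None if a base is ambiguous."""
    if len(word) != WORD_LEN:
        raise ValueError("expected exactly 8 bases")
    value = 0
    for base in word.upper():
        code = BASE_CODE.get(base)
        if code is None:
            # Assumption: words with N or other ambiguity codes are skipped.
            return None
        value = (value << 2) | code
    return value


def build_hash_index(reads):
    """Map each 16-bit hash to the (read_id, offset) positions where it occurs."""
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for offset in range(len(seq) - WORD_LEN + 1):
            h = hash_octet(seq[offset:offset + WORD_LEN])
            if h is not None:
                index[h].append((read_id, offset))
    return index


def candidate_pairs(index):
    """Return pairs of distinct reads that share at least one hashed word."""
    pairs = set()
    for hits in index.values():
        read_ids = sorted({read_id for read_id, _ in hits})
        for i, a in enumerate(read_ids):
            for b in read_ids[i + 1:]:
                pairs.add((a, b))
    return pairs
```

Reads that share a hash become candidates for the more thorough banded Smith-Waterman comparison; the table only narrows down which pairs are worth aligning.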