=====ALLPATHS=====

  * ALLPATHS was created to improve reference genomes.
  * The version described here is optimized for 100-base Illumina reads.
  * Supports paired-end reads.
  * Requires high coverage: at least 40x raw read coverage for each library.
  * Requires a minimum of two paired-end libraries: one short and one long.
    * The short separation size must be less than twice the read size.
    * The distribution of sizes should be as narrow as possible, with a std dev of < 20%.
    * The long library insert size should be approximately 4000 bases and can have a wider size distribution.
  * Installation
    * Requires the Boost libraries and an up-to-date C++ compiler.
    * Installation is very long: over 2 hours of compilation time.
    * Download and extract the tarball
    * autoconf
    * ./configure
    * make -j8 (parallel compilation)
    * make install scripts
  * Pipeline/Modules
    * All binaries are located in /bin
    * RunAllpaths3g controls the entire pipeline.
    * Directories are created for each new job so different assemblies can be compared.
      * Reference
        * Contains the reference genome.
      * Data
        * Reads as fasta, qual, and pairs files.
        * May contain many run directories, each representing a particular attempt to assemble the original data using a different set of parameters.
      * Run
        * Intermediate files.
      * Assemblies
        * Finished assemblies are stored in this directory.
      * SubDir
    * OptionsFile
      * There are many options.
  * Preparing read data
    * Ploidy file: 1 for haploid, 2 for diploid.
    * Fragment library reads are expected to be oriented towards each other.
    * Jumping library reads are expected to be oriented away from each other.
  * Differences between ALLPATHS v1 and v2
    * v1: high-quality assemblies from simulated short reads.
    * v2: high-quality assemblies can be obtained from real read data.
    * Beats Velvet and Euler-SR.
  * Input:
    * Three different bacteria
    * S. aureus
  * Output
    * A graph of continuous paths.
    * Shows paths between contigs.
    * Each component is its own scaffold.
  * Some other things
    * Reads that are >90% A are removed. These are claimed to be an artifact of the Illumina sequencing platform.
  * Runtime
    * Scales almost linearly with genome size. E. coli: ~8.2 h.
    * May be too slow for large genomes (> 1 Gb).
  * Error rate:
    * Beats Velvet and Euler-SR in all categories (measured in 10 kb windows).
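
The library requirements listed above (40x+ coverage per library, short separation under twice the read size, std dev under 20%) can be checked before starting a long run. The following is only a rough sketch and is not part of ALLPATHS: the ''Library'' class, its field names, and ''check_library'' are hypothetical, coverage is assumed to be (number of reads x read length) / genome size, and the 20% limit is assumed to be relative to the mean separation and to apply to the short library only.

```python
from dataclasses import dataclass


@dataclass
class Library:
    """Hypothetical description of one paired-end library (not an ALLPATHS type)."""
    name: str
    num_reads: int          # total number of reads, both mates counted
    read_length: int        # bases per read
    mean_separation: float  # mean separation size in bases
    sd_separation: float    # standard deviation of the separation size
    is_short: bool          # True for the short library, False for the long one


def coverage(lib: Library, genome_size: int) -> float:
    """Raw read coverage, assumed here to be total sequenced bases / genome size."""
    return lib.num_reads * lib.read_length / genome_size


def check_library(lib: Library, genome_size: int) -> list[str]:
    """Return a list of problems; an empty list means the listed rules are met."""
    problems = []
    cov = coverage(lib, genome_size)
    if cov < 40:
        problems.append(f"{lib.name}: coverage {cov:.1f}x is below 40x")
    if lib.is_short:
        # Short library: separation must be less than twice the read size.
        if lib.mean_separation >= 2 * lib.read_length:
            problems.append(f"{lib.name}: separation is not < 2x read size")
        # Assumption: the 20% limit is relative to the mean separation.
        if lib.sd_separation >= 0.20 * lib.mean_separation:
            problems.append(f"{lib.name}: std dev is not < 20% of the mean")
    return problems


# Made-up example values, for illustration only.
good = Library("frag", 2_000_000, 100, 180.0, 20.0, is_short=True)
bad = Library("frag_bad", 1_000_000, 100, 250.0, 60.0, is_short=True)

for lib in (good, bad):
    for problem in check_library(lib, genome_size=4_600_000):
        print(problem)
```

The long library is only checked for coverage here, since the notes allow it a wider size distribution without giving a limit.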
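
The removal of reads that are >90% A, mentioned under "Some other things", could look roughly like the sketch below. This is not the ALLPATHS implementation: ''a_fraction'' and ''drop_poly_a'' are hypothetical names, reads are assumed to be plain sequence strings, and a read with exactly 90% A is assumed to be kept.

```python
def a_fraction(seq: str) -> float:
    """Fraction of bases in the read that are A (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("A") / len(seq)


def drop_poly_a(reads: list[str], threshold: float = 0.90) -> list[str]:
    """Keep only reads whose A fraction does not exceed the threshold."""
    return [r for r in reads if a_fraction(r) <= threshold]


# Made-up reads for illustration: all A, exactly 90% A, and a mixed read.
reads = ["A" * 100, "A" * 90 + "C" * 10, "ACGT" * 25]
kept = drop_poly_a(reads)
```

With these example reads only the all-A read is dropped, because the filter removes reads strictly above the threshold.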