contributors:team_4_page

Differences

This shows you the differences between two versions of the page.

contributors:team_4_page [2015/05/07 22:47]
JaredC created
contributors:team_4_page [2015/05/08 22:32]
sihussai
Line 1: Line 1:
-====== ABySS ======
 +======Team 4 | ABySS ======
 + 
 +**A**ssembly **By** **S**hort **S**equences - a //de novo//, parallel, paired-end sequence assembler 
 + 
 + 
 +=====Team composition===== 
 + 
 +| Name | Email |  
 +| Jared Copher | jcopher@ucsc.edu | 
 +| Emilio Feal | efeal@ucsc.edu | 
 +| Sidra Hussain | sihussai@ucsc.edu | 
 + 
 +=====ABySS overview===== 
 + 
 +ABySS is published by Canada's Michael Smith Genome Sciences Centre. It was the first //de novo// assembler for large genomes that Illumina recommended, in [[http://www.illumina.com/Documents/products/technotes/technote_denovo_assembly_ecoli.pdf|this technical note]], for use with Illumina data alone. The ABySS team are active members on [[https://www.biostars.org/t/Abyss/|BioStars]], where they recommend all technical questions be asked.
 + 
 +[[http://​www.bcgsc.ca/​platform/​bioinfo/​software/​abyss | ABySS main site]] 
 + 
 +[[http://​genome.cshlp.org/​content/​19/​6/​1117.full.pdf| ABySS paper]] 
 + 
 +[[https://​github.com/​bcgsc/​abyss| ABySS manual and source code]] 
 + 
 +=====General notes===== 
 +  - ABySS can run in serial mode, but that is impractical for a genome this large.
 +  - The documentation recommends creating assemblies with several values of k and selecting the "best" one.
 +  - ABySS performs its own error correction.
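
The k sweep recommended in the second note could be scripted roughly as follows. This is only a sketch, not the team's actual procedure: the k values, the ''name=slug'' prefix and the reads path are placeholders, and ''ABYSS_PE'' is a hypothetical override variable introduced here so the loop can be rehearsed with ''echo''. Since ABySS was configured with ''--enable-maxk=96'', k must not exceed 96.

```shell
#!/usr/bin/env bash
# Sketch: run one assembly per k value, each in its own directory.
# ABYSS_PE is a hypothetical override (not an ABySS setting); set it to
# "echo" to print the commands instead of running them.
ABYSS_PE=${ABYSS_PE:-abyss-pe}

sweep_k() {
    local reads=$1; shift        # absolute path to the input reads (placeholder)
    local k
    for k in "$@"; do
        mkdir -p "k$k"
        # -C is the make option that changes into the output directory
        $ABYSS_PE -C "k$k" name=slug k="$k" in="$reads"
    done
}

# Example (not run here): sweep_k /path/to/reads.fastq.gz 51 59 67
# Afterwards, compare contiguity with: abyss-fac k*/slug-contigs.fa
```

Keeping each k in its own directory means the assemblies cannot overwrite one another, and abyss-fac can summarize all of them in one table.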
 + 
 +=====Installing ABySS===== 
 + 
 +The ABySS source code was downloaded from GitHub:
 +<​code>​ 
 +% lftpget https://​github.com/​bcgsc/​abyss/​archive/​master.zip 
 +</​code>​ 
 +ABySS needs to be configured with its dependencies. Note that the value of --enable-maxk must be a multiple of 32.
 +<code>
 +% ./autogen.sh
 +% ./configure --prefix=/campusdata/BME235 \
 +    --enable-maxk=96 \
 +    --enable-dependency-tracking \
 +    --with-boost=/campusdata/BME235/include/boost \
 +    --with-mpi=/campusdata/BME235/include \
 +    CC=gcc-4.9.2 CXX=g++-4.9.2 \
 +    CPPFLAGS=-I/campusdata/BME235/include/sparsehash
 +</code>
 +Then ABySS can be installed via the makefile 
 +<​code>​ 
 +% make 
 +% make install 
 +</​code>​ 
 + 
 +=====ABySS parameters===== 
 + 
 +Parameters of the driver script, abyss-pe, and their [default value] 
 + 
 +  * a: maximum number of branches of a bubble [2] 
 +  * b: maximum length of a bubble (bp) [10000] 
 +  * c: minimum mean k-mer coverage of a unitig [sqrt(median)] 
 +  * d: allowable error of a distance estimate (bp) [6] 
 +  * e: minimum erosion k-mer coverage [sqrt(median)] 
 +  * E: minimum erosion k-mer coverage per strand [1] 
 +  * j: number of threads [2] 
 +  * k: size of k-mer (bp) [no default] 
 +  * l: minimum alignment length of a read (bp) [k] 
 +  * m: minimum overlap of two unitigs (bp) [30] 
 +  * n: minimum number of pairs required for building contigs [10] 
 +  * N: minimum number of pairs required for building scaffolds [n] 
 +  * p: minimum sequence identity of a bubble [0.9] 
 +  * q: minimum base quality [3] 
 +  * s: minimum unitig size required for building contigs (bp) [200] 
 +  * S: minimum contig size required for building scaffolds (bp) [s] 
 +  * t: minimum tip size (bp) [2k] 
 +  * v: use v=-v for verbose logging, v=-vv for extra verbose [disabled] 
 +Please see the abyss-pe manual page for more information on assembly parameters. 
 + 
 +An abyss-pe parameter can have the same name as an existing environment variable. Such a parameter cannot be used as intended until the environment variable is unset. To detect these clashes, run:
 +<​code>​ 
 +abyss-pe env [options] 
 +</​code>​ 
 +This command reports every abyss-pe parameter that is set, whatever its origin, but does not run any ABySS programs.
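
As a complement to ''abyss-pe env'', a quick pre-flight check can list which single-letter parameter names are already present in the environment. This is a sketch under the assumption that only the parameters listed above matter; the function name ''clashing_params'' is made up for this example and is not part of ABySS.

```shell
#!/usr/bin/env bash
# Sketch: list abyss-pe single-letter parameters that are already set as
# environment variables. The names are copied from the parameter list above.
clashing_params() {
    local name
    for name in a b c d e E j k l m n N p q s S t v; do
        # printenv exits 0 only if the variable is in the environment
        if printenv "$name" >/dev/null; then
            echo "$name"
        fi
    done
}

# Print any clashes; unset them before running abyss-pe.
clashing_params
```

Any name it prints can be cleared with ''unset'' before the assembly is started.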
 + 
 +=====Running ABySS===== 
 + 
 +abyss-pe is a driver script implemented as a Makefile. Any option of make may be used with abyss-pe. Particularly useful options are: 
 +<​code>​ 
 +-C dir, --directory=dir 
 +</​code>​ 
 +Change to the directory dir and store the results there. 
 +<​code>​ 
 +-n, --dry-run 
 +</​code>​ 
 +Print the commands that would be executed, but do not execute them. 
 + 
 +===Commands of abyss-pe=== 
 +  * default: Equivalent to "scaffolds scaffolds-dot stats".
 +  * unitigs: Assemble unitigs. 
 +  * unitigs-dot:​ Output the unitig overlap graph. 
 +  * pe-sam: Map paired-end reads to the unitigs and output a SAM file. 
 +  * pe-bam: Map paired-end reads to the unitigs and output a BAM file. 
 +  * pe-index: Generate an index of the unitigs used by abyss-map. 
 +  * contigs: Assemble contigs. 
 +  * contigs-dot:​ Output the contig overlap graph. 
 +  * mp-sam: Map mate-pair reads to the contigs and output a SAM file. 
 +  * mp-bam: Map mate-pair reads to the contigs and output a BAM file. 
 +  * mp-index: Generate an index of the contigs used by abyss-map. 
 +  * scaffolds: Assemble scaffolds. 
 +  * scaffolds-dot:​ Output the scaffold overlap graph. 
 +  * stats: Display assembly contiguity statistics. 
 +  * clean: Remove intermediate files. 
 +  * version: Display the version of abyss-pe. 
 +  * versions: Display the versions of all programs used by abyss-pe. 
 +  * help: Display a helpful message. 
 + 
 +===Programs in pipeline=== 
 +abyss-pe uses the following programs, which must be found in your PATH: 
 + 
 +  * ABYSS: de Bruijn graph assembler 
 +  * ABYSS-P: parallel (MPI) de Bruijn graph assembler 
 +  * AdjList: find overlapping sequences 
 +  * DistanceEst:​ estimate the distance between sequences 
 +  * MergeContigs:​ merge sequences 
 +  * MergePaths: merge overlapping paths 
 +  * Overlap: find overlapping sequences using paired-end reads 
 +  * PathConsensus:​ find a consensus sequence of ambiguous paths 
 +  * PathOverlap:​ find overlapping paths 
 +  * PopBubbles: remove bubbles from the sequence overlap graph 
 +  * SimpleGraph:​ find paths through the overlap graph 
 +  * abyss-fac: calculate assembly contiguity statistics 
 +  * abyss-filtergraph:​ remove shim contigs from the overlap graph 
 +  * abyss-fixmate:​ fill the paired-end fields of SAM alignments 
 +  * abyss-map: map reads to a reference sequence (Burrows-Wheeler transform)
 +  * abyss-scaffold:​ scaffold contigs using distance estimates 
 +  * abyss-todot:​ convert graph formats and merge graphs 
 +New to Version 1.3.5 (Mar 05, 2013) 
 +  * abyss-mergepairs:​ Merges overlapping read pairs. 
 +  * abyss-layout:​ Layout contigs using the sequence overlap graph. 
 +  * abyss-samtobreak:​ Calculate contig and scaffold contiguity and correctness metrics. 
 +New to Version 1.5.2 (Jul 10, 2014) 
 +  * konnector: fill the gaps between paired-end reads by building a Bloom filter de Bruijn graph and searching for paths between paired-end reads within the graph 
 +  * abyss-bloom: construct reusable Bloom filter files for input to konnector
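
Because abyss-pe expects these programs in PATH, it can help to verify that before submitting a long job. The check below is a minimal sketch; ''missing_programs'' is a hypothetical helper name, and the program names are copied from the main list above.

```shell
#!/usr/bin/env bash
# Sketch: report which of the given programs cannot be found in PATH.
missing_programs() {
    local prog
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || echo "$prog"
    done
}

# Program names copied from the list above.
missing_programs ABYSS ABYSS-P AdjList DistanceEst MergeContigs MergePaths \
    Overlap PathConsensus PathOverlap PopBubbles SimpleGraph abyss-fac \
    abyss-filtergraph abyss-fixmate abyss-map abyss-scaffold abyss-todot
```

No output means every listed program was found; otherwise add the install prefix's bin directory to PATH.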
 + 
 +=====ABySS pipeline===== 
 + 
 + 
 +{{ :​bioinformatic_tools:​abysspipeline.png?​nolink |}} 
 + 
 +=====Test run===== 
 + 
 +This run was done using version 1.5.2. The assembly used k=59, 10 processes, and requested mem_free=15g from qsub. The assembly was done using the SW018 and SW019 libraries only. Specifically,​ the files used were: 
 +  * /​campusdata/​BME235/​Spring2015Data/​adapter_trimming/​SeqPrep/​SW018_S1_L007_R1_001_trimmed.fastq.gz 
 +  * /​campusdata/​BME235/​Spring2015Data/​adapter_trimming/​SeqPrep/​SW018_S1_L007_R2_001_trimmed.fastq.gz 
 +  * /​campusdata/​BME235/​Spring2015Data/​adapter_trimming/​SeqPrep/​SW019_S1_L001_R1_001_trimmed.fastq.gz 
 +  * /​campusdata/​BME235/​Spring2015Data/​adapter_trimming/​SeqPrep/​SW019_S1_L001_R2_001_trimmed.fastq.gz 
 +  * /​campusdata/​BME235/​Spring2015Data/​adapter_trimming/​SeqPrep/​SW019_S2_L008_R1_001_trimmed.fastq.gz 
 +  * /​campusdata/​BME235/​Spring2015Data/​adapter_trimming/​SeqPrep/​SW019_S2_L008_R2_001_trimmed.fastq.gz 
 + 
 +The files had adapters trimmed using SeqPrep (see the data pages for more details). SW019_S1 and SW019_S2 were treated as two separate libraries.  
 + 
 +The output and log files for this assembly are in /​campusdata/​BME235/​S15_assemblies/​abyss/​sidra/​test_run/​singleK. 
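
For orientation, an abyss-pe invocation consistent with the description above might look like the sketch below. It is not the command that was actually run (see the log files for that). The library names ''sw018'', ''sw019s1'' and ''sw019s2'' are invented for this example, ''np=10'' assumes the 10 processes were MPI processes, and ''name=slug'' is inferred from the output file names.

```shell
#!/usr/bin/env bash
# Sketch of abyss-pe arguments matching the test run's description:
# k=59, 10 MPI processes, three paired-end libraries.
# The library names (sw018, sw019s1, sw019s2) are made up for this example.
dir=/campusdata/BME235/Spring2015Data/adapter_trimming/SeqPrep

args=(
    k=59 np=10 name=slug
    lib='sw018 sw019s1 sw019s2'
    sw018="$dir/SW018_S1_L007_R1_001_trimmed.fastq.gz $dir/SW018_S1_L007_R2_001_trimmed.fastq.gz"
    sw019s1="$dir/SW019_S1_L001_R1_001_trimmed.fastq.gz $dir/SW019_S1_L001_R2_001_trimmed.fastq.gz"
    sw019s2="$dir/SW019_S2_L008_R1_001_trimmed.fastq.gz $dir/SW019_S2_L008_R2_001_trimmed.fastq.gz"
)

# Show the command with shell quoting; -n asks make for a dry run, so even
# when pasted into a shell it would only print the pipeline steps.
printf '%q ' abyss-pe -n "${args[@]}"
echo
```

Dropping ''-n'' from the printed command would start the real assembly, so the dry run is a cheap way to check the library definitions first.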
 + 
 + 
 +====Results==== 
 + 
 +Note: N50 and the related statistics only count contigs of at least 500 bp (I believe shorter contigs are left out of the calculation).
 + 
 +There are 10.23 * 10^6 contigs. The N50 contig size is 2,669 bp. The number of contigs at least as long as the N50 (n:N50) is 174,507. The maximum contig size is 31,605 bp, and the total number of bp (in contigs >= 500 bp) is 1.557 * 10^9.
 + 
 +Here are the stats summarized for the contigs and also for scaffolds and unitigs. n:500 is the number of contigs/​unitigs/​scaffolds at least as long as 500 bp. sum is the number of bases in all the contigs/​unitigs/​scaffolds at least as long as 500 bp combined.  
 + 
 +| n | n:500 | n:N50 | min | N80 | N50 | N20 | E-size | max | sum | name | 
 +| 11.95e6 | 993409 | 247109 | 500 | 962  | 1795 | 3327 | 2296 | 30520 | 1.456e9 | slug-unitigs.fa | 
 +| 10.23e6 | 785054 | 174507 | 500 | 1320 | 2669 | 5079 | 3433 | 31605 | 1.557e9 | slug-contigs.fa |  
 +| 10.11e6 | 711022 | 153036 | 500 | 1490 | 3063 | 5870 | 3945 | 37466 | 1.573e9 | slug-scaffolds.fa | 
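
To make the N50 and n:N50 columns concrete, here is a small sketch that computes both from a list of sequence lengths, using the common definition of N50 (the length at which the longest sequences, taken in descending order, first cover at least half of the total). It applies the same 500 bp cutoff as the table. abyss-fac may differ in edge cases such as ties, and the helper name ''n50'' is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: compute N50 and n:N50 from sequence lengths (one per line on stdin),
# ignoring sequences shorter than a cutoff (default 500 bp).
n50() {
    local cutoff=${1:-500}
    awk -v t="$cutoff" '$1 + 0 >= t + 0' | sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += len[i]
                # stop at the first length where the running sum reaches half
                if (run * 2 >= total) { print len[i], i; exit }
            }
        }'
}

# Toy data: the 100 bp sequence is dropped, the rest sum to 3700 bp.
printf '%s\n' 100 500 600 700 900 1000 | n50 500   # prints: 900 2
```

In the toy data, 1000 + 900 = 1900 bp is the first running sum to reach half of 3700 bp, so the N50 is 900 and n:N50 is 2.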
 + 
 + 
 +====Notes==== 
 + 
 +The success of this run means we are probably ready to do a run with all the data (not including the mate-pair data, which can be used for scaffolding later). For that run, the different trimmed files for each library should be concatenated, so that the run involves only the actual number of libraries we had (I believe 4?). It should also use many more than 10 processes.
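
The concatenation step could be done directly on the gzipped files, since concatenated gzip streams form a valid gzip file. The sketch below assumes the two SW019 runs belong to the same library; ''merge_library'' and the output file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: merge several gzipped FASTQ files of one library into a single file.
# Concatenated gzip streams are themselves valid gzip, so no decompression
# is needed.
merge_library() {
    local out=$1; shift
    cat "$@" > "$out"
}

# Hypothetical usage for the two SW019 runs. R1 and R2 are merged separately
# and in the same order, so that read pairs stay in sync:
#   merge_library SW019_R1.fastq.gz SW019_S1_L001_R1_001_trimmed.fastq.gz SW019_S2_L008_R1_001_trimmed.fastq.gz
#   merge_library SW019_R2.fastq.gz SW019_S1_L001_R2_001_trimmed.fastq.gz SW019_S2_L008_R2_001_trimmed.fastq.gz
```

The merged files can then be given to abyss-pe as a single library, which reduces the library count to the number actually sequenced.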
contributors/team_4_page.txt · Last modified: 2015/07/18 20:52 by 92.247.181.31