Discovar de novo manual

Team 5: Discovar de novo

Introduction

The information below is a summary of the Discovar de novo manual which can be found at: http://www.broadinstitute.org/software/discovar/blog/?page_id=19

DISCOVAR de novo is a new fully de novo genome assembler. Its inputs are designed to optimize quality while keeping costs low. Currently it takes as input Illumina reads of length 250 or longer produced on MiSeq or HiSeq 2500 and from a single PCR-free library. These data enable a level of completeness and continuity that was not previously possible.

The best source of current news and information on DISCOVAR de novo is the Broad Institute blog: http://www.broadinstitute.org/software/discovar/blog/

Here you will find announcements, FAQ, links to the latest code, manual and test data, build requirements and instructions. We recommend that our blog page be your starting point whenever you have problems, questions or are just looking for the latest version.

You should also consider joining the DISCOVAR user forum: https://groups.google.com/a/broadinstitute.org/forum/?hl=en&fromgroups#!forum/discovar-user-forum

The help section of our blog should be your starting point if you encounter problems: http://www.broadinstitute.org/software/discovar/blog/?page_id=19

Requirements

To compile and run DISCOVAR you will need a 64-bit Linux/UNIX system with at least 32 GB of RAM. Our software does not run on 32-bit machines.
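
The two requirements above can be checked before building. The following is a minimal sketch, assuming a Linux host where /proc/meminfo is available; the variable names are illustrative and not part of DISCOVAR.

```shell
# Sketch only: check the stated requirements (64-bit system, at least 32 GB RAM).
# Assumption: Linux, where /proc/meminfo reports MemTotal in kB.

min_gb=32                                   # minimum RAM stated above

bits=$(getconf LONG_BIT)                    # word size of the system: 32 or 64
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
mem_gb=$(( ${mem_kb:-0} / 1024 / 1024 ))    # whole GB, rounded down

if [ "$bits" = "64" ]; then
  echo "OK: 64-bit system"
else
  echo "FAIL: ${bits}-bit system"
fi

# MemTotal is slightly below the installed RAM, so a machine with exactly
# 32 GB installed may report 31 here; treat a near miss as a warning only.
if [ "$mem_gb" -ge "$min_gb" ]; then
  echo "OK: ${mem_gb} GB RAM"
else
  echo "WARN: only ${mem_gb} GB RAM reported (need about ${min_gb} GB)"
fi
```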

The DISCOVAR source code is available for download via our ftp site: ftp://ftp.broadinstitute.org/pub/crd/DiscovarDeNovo/

We do not issue official releases. Instead, please download the latest version from our nightly builds. Only builds that pass our internal tests are made available in this way - we do not release broken builds.

The g++ compiler from GCC, version 4.7.0 or higher. http://gcc.gnu.org/

The GMP library compiled with the C++ interface. Your GCC installation may already include GMP. http://gmplib.org/

The jemalloc replacement MALLOC library, version 3.6.0 or higher. http://www.canonware.com/jemalloc/

The SAMtools command-line utilities for SAM and BAM file manipulation. http://samtools.sourceforge.net/

We also recommend:

The graph command dot from the GraphViz package - to visualize assembly graphs. http://www.graphviz.org/

The Picard set of Java-based command-line utilities for SAM and BAM file manipulation. http://picard.sourceforge.net/
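
A quick way to see which of the command-line tools listed above are already on your PATH is sketched below. It assumes GNU sort (for the -V version ordering option); the helper name version_ge is hypothetical and not part of any of these packages. It does not check for GMP, jemalloc or Picard, which are libraries or Java archives rather than commands.

```shell
# Sketch only: report which command-line tools are on PATH and whether
# g++ meets the stated minimum version (4.7.0).
# Assumption: GNU sort, which supports version ordering with -V.

# version_ge A B: succeeds if version A is greater than or equal to version B.
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

for tool in g++ samtools dot; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done

if command -v g++ >/dev/null 2>&1; then
  gxx_version=$(g++ -dumpversion)           # e.g. 4.8.5, or just the major number
  if version_ge "$gxx_version" 4.7.0; then
    echo "g++ $gxx_version is recent enough"
  else
    echo "g++ $gxx_version is older than 4.7.0"
  fi
fi
```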

Building

See instructions in the file: INSTALL

Performance

On the systems we have tested, allowing per-thread memory management improves computational performance. If it is not already enabled by default, you can enable it using:

setenv MALLOC_PER_THREAD 1
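
The setenv command above is csh/tcsh syntax. In sh, bash or zsh the same environment variable would be set as follows (a sketch of the equivalent shell syntax only; the variable name and value are taken from the line above):

```shell
# Equivalent of "setenv MALLOC_PER_THREAD 1" for sh, bash and zsh.
# The variable must be exported so that programs started from this
# shell inherit it.
export MALLOC_PER_THREAD=1
```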

Testing

Example data, along with instructions, are available via our FTP site. Please try these examples before attempting to run DISCOVAR with your own data:

ftp://ftp.broadinstitute.org/pub/crd/DiscovarDeNovo/

Generating sequencing data

DISCOVAR has specific requirements for input data, and will likely fail if you do not meet them.

DISCOVAR requires a single Illumina fragment (paired end) library. The fragment size should be approximately 450 bp, from which 250 base paired end reads are generated using either Illumina MiSeq or HiSeq 2500 genome sequencers. Illumina reads longer than 250 bases also work. The reads should be inward facing - reading towards the center of the fragment.

We strongly recommend using data generated by a PCR-free protocol. As per the Illumina protocol, you should not use a gel to size select.

The recommended coverage is about 60x. Somewhat higher or lower coverage is fine.

DISCOVAR does not require a jumping library and cannot currently use one. Nor can it use 100 base Illumina reads, or reads from other sequencing technologies at this time. However, we are investigating new technologies that might extend DISCOVAR's capabilities.

Sequencing data requirements summary:

         Illumina MiSeq or HiSeq 2500 genome sequencers
         PCR-free library preparation
         250 base paired end reads (or longer)
         ~450 base pair fragment size
         ~60x coverage
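
As a rough guide to how much sequencing the summary above implies, coverage is the total number of sequenced bases divided by the genome size, so the number of read pairs needed is coverage x genome size / (2 x read length). The sketch below works this through for a hypothetical 3 Gb genome; the genome size is an illustrative assumption, not a value from the manual.

```shell
# Sketch only: estimate the sequencing needed for a target coverage.
# genome_size is a hypothetical example value; substitute your own.

genome_size=3000000000   # bases (about 3 Gb)
read_length=250          # bases per read
coverage=60              # target fold coverage

total_bases=$(( coverage * genome_size ))
read_pairs=$(( total_bases / (2 * read_length) ))   # two reads per pair

echo "bases needed:      $total_bases"    # 180000000000
echo "read pairs needed: $read_pairs"     # 360000000
```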

Input files

DISCOVAR requires the raw reads from the sequencer in one or more BAM files. The name of the BAM file is specified with the required argument READS:

READS=filename

Multiple BAM files are specified using a comma separated list:

READS=filename1,filename2,…

Alternatively, the BAM files can be specified in a separate file containing a list of BAM filenames, one per line:

READS=@list-filename
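
One way to produce such a list file is sketched below. The function name make_bam_list, the directory ./bams and the file name bam_list.txt are hypothetical. Whether relative paths in the list are resolved the way you expect is not stated here, so absolute paths may be the safer choice.

```shell
# Sketch only: write the BAM files found in one directory to a list file,
# one filename per line, for use with READS=@list-filename.

# make_bam_list DIR LIST_FILE
make_bam_list() {
  local bam_dir=$1 list_file=$2
  find "$bam_dir" -maxdepth 1 -type f -name '*.bam' | sort > "$list_file"
}

# Example use (hypothetical paths):
#   make_bam_list ./bams bam_list.txt
#   Discovar READS=@bam_list.txt OUT_DIR=my_assembly
```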

DISCOVAR calls SAMtools internally to extract reads from the BAM. If you encounter any issues importing your BAM files into DISCOVAR, try examining your BAMs using SAMtools directly.
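
A minimal sketch of such a check is shown below, using two standard SAMtools subcommands: view -H prints the BAM header and flagstat summarizes read counts, including how many reads are flagged as paired. The function name check_bam is hypothetical.

```shell
# Sketch only: basic sanity check of a BAM file with SAMtools.

# check_bam FILE: print the first header lines and the flagstat summary.
check_bam() {
  local bam=$1
  if [ ! -r "$bam" ]; then
    echo "cannot read $bam" >&2
    return 1
  fi
  if ! command -v samtools >/dev/null 2>&1; then
    echo "samtools not found on PATH" >&2
    return 1
  fi
  samtools view -H "$bam" | head -n 5   # header lines; fails if the BAM is unreadable
  samtools flagstat "$bam"              # read counts by flag, including paired reads
}

# Example use:
#   check_bam reads.bam
```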

The reads in the BAM files may be mapped or unmapped - any alignment information present in the BAMs is not used in the assembly process.

Running DISCOVAR de novo

DISCOVAR can currently de novo assemble genomes up to ~3 Gb in size. All that is required are paired end reads, contained within one or more BAM files. See the previous section for details on generating the appropriate sequence data and the BAM file requirements.

The syntax for DISCOVAR de novo assembly is:

Discovar READS=bam-filenames OUT_DIR=output-dir

For example:

Discovar READS=reads.bam OUT_DIR=my_assembly

This will take as input all the reads in the BAM file reads.bam, generate an assembly, then write the output to the directory my_assembly. The location of the final assembly files is: my_assembly/a.final/
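
The example above can be wrapped in a small guard so that it only runs when the executable and the input are present, and then lists the final assembly directory. This is a sketch: the file and directory names are those of the example, and the variable names are illustrative.

```shell
# Sketch only: run the example assembly if possible, then list the output.

reads=reads.bam
out_dir=my_assembly
final_dir="$out_dir/a.final"     # location of the final assembly files

if command -v Discovar >/dev/null 2>&1 && [ -r "$reads" ]; then
  Discovar READS="$reads" OUT_DIR="$out_dir" && ls -l "$final_dir"
else
  echo "skipping: Discovar not on PATH or $reads not readable" >&2
fi
```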

Viewing a DISCOVAR de novo assembly

The assembly graph produced by DISCOVAR de novo can be explored using the tool NhoodInfo, which is part of the DISCOVAR package. Please see the NhoodInfo manual for more details.

Brief guide to the assembly output

A DISCOVAR de novo assembly is a graph whose edges represent DNA sequences. Within any assembly one can find regions that are essentially linear. We call these lines.

For example, a line may have two cells, each with two paths across it. (The figure that illustrated this in the manual is not reproduced here.)

Multiple paths within a cell may reflect biological differences, such as heterozygous sites, or somatic mutations. Similarly a line could represent multiple, highly similar loci (which would be reflected in the observed copy number). Partial phasing will sometimes lead to more than two paths. However multiple paths can also represent non-biological differences, such as those arising at loci that are very hard to sequence, and for which consequently the assembly is unable to determine the exact sequence, instead providing alternatives.

We allow cells having no paths across, representing captured gaps, and displayed in files below using 100 Ns.

DISCOVAR de novo assemblies are symmetric: for each edge, there is a reverse complement edge, and for each line, there is a reverse complement line.

DISCOVAR de novo provides several output forms from which you can select:

- a.fasta = fasta file of edges

- a.lines = binary file of lines, mathematically a vec<vec<vec<vec<int>>>>, in which the ints are edge ids.

- a.lines.efasta = standard scaffold efasta file, which shows {s1,…,sn} for the ALTERNATIVES associated with a given cell. *

- a.lines.fasta = standard scaffold fasta file, obtained by taking the highest coverage path through each cell; LOSES INFORMATION! *

- a.lines.src = human-readable form of a.lines, represented using nested brackets {…}

* 'Duplicate' reverse complement lines have been removed from these files. Also, for circular chromosomes or episomes, the header line is labeled 'circular' and the ends of the sequence overlap by exactly K-1 bases (K = 200).
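
If a downstream tool expects circular sequences without the repeated end, the K-1 overlapping bases can be removed. The sketch below assumes only that the word 'circular' appears somewhere in the header line, as described above; the exact header format is not specified here, and the function name trim_circular is hypothetical. Each sequence is written on a single line.

```shell
# Sketch only: drop the last K-1 bases from records whose header
# contains the word "circular". Other records are passed through.

# trim_circular FASTA [K]    (K defaults to 200)
trim_circular() {
  awk -v k="${2:-200}" '
    function emit() {
      if (hdr == "") return
      # Remove the K-1 base overlap at the end of circular sequences.
      if (hdr ~ /circular/ && length(seq) >= k)
        seq = substr(seq, 1, length(seq) - (k - 1))
      print hdr
      print seq
    }
    /^>/ { emit(); hdr = $0; seq = ""; next }   # new record starts
         { seq = seq $0 }                       # join wrapped sequence lines
    END  { emit() }
  ' "$1"
}

# Example use:
#   trim_circular my_assembly/a.final/a.lines.fasta > a.lines.trimmed.fasta
```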