NEWBLER a.k.a. GS De Novo Assembler Software

Newbler is a proprietary assembler provided by 454 Roche.

It works in flow-space to reduce the impact of the most common 454 sequencing error (uncertainty about the length of homopolymers).
Roche claims it can assemble a 3 Gbp genome in one day and can use paired-end information to construct scaffolds from contigs. Currently the paired-end data must have at least 50 bases in each end, so only 454 paired-end libraries are accepted. It would be good if they relaxed that constraint so that 454 data could be mixed with paired-end data from other platforms.

Roche 454 info about Newbler

Wiki article.
Documentation: http://xyala.cap.ed.ac.uk/Gene_Pool/454_software/ (probably not supposed to be free on the web, though…)
First description: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1464427/
A review article with a decent description: http://www.ncbi.nlm.nih.gov/pubmed/20211242

Installation on campusrocks

The following tools were installed in BME235/bin/ on 15 April 2010:

  • addRun
  • createProject
  • doAmplicon
  • fnafile
  • getProjAlignData
  • gsAmplicon
  • gsAssembler
  • gsMapper
  • newAssembly
  • newbler
  • newMapping
  • removeRun
  • runAssembly
  • runMapping
  • runProject
  • setRef
  • sff2scf
  • sfffile
  • sffinfo
  • sffrescore
  • stopRun

Installed and being tested

De novo assembly

The standard approach for de novo assembly is to create a new directory, and in that directory create a Makefile that includes a target to execute the following commands:

newAssembly .
addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ01.sff
addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ02.sff
runProject -e 50 -nobig -rst 0 .

Of course, different sff files will be used on different runs.
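Since the list of sff files changes from run to run, it can help to generate the command list rather than edit it by hand. The following is a minimal sketch, not part of the class Makefiles: the function name print_assembly_commands is hypothetical, and it only prints the commands shown above (it does not run Newbler), so the output can be reviewed before being pasted into a Makefile target or piped to sh.

```shell
# Print the Newbler de novo assembly commands for a given expected coverage
# and list of SFF files. Nothing is executed; the output is plain text.
# (print_assembly_commands is a hypothetical helper, not a Newbler tool.)
print_assembly_commands() {
    coverage=$1
    shift
    echo "newAssembly ."
    for sff in "$@"; do
        echo "addRun . $sff"
    done
    echo "runProject -e $coverage -nobig -rst 0 ."
}

# Example: list the commands for the two Pog SFF files used above.
print_assembly_commands 50 \
    /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ01.sff \
    /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ02.sff
```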

The “-e” value is the expected coverage. For the Pog 454 data, that should be about 60. For the banana-slug data, it is very much smaller (0.05?).
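Expected coverage is the total number of sequenced bases divided by the genome size. As a rough sketch of that arithmetic, assuming the reads are available as a FASTA file and the genome size is known or estimated (the helper names fasta_bases and expected_coverage, and the file name reads.fa, are made up for illustration):

```shell
# Sum the lengths of all sequence lines (everything that is not a '>' header).
fasta_bases() {
    awk '!/^>/ { total += length($0) } END { printf "%.0f\n", total }' "$1"
}

# Expected coverage = total bases / genome size (both in bases).
expected_coverage() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.2f\n", bases / genome }'
}

# Example (reads.fa and the 2 Mbp genome size are placeholders):
# expected_coverage "$(fasta_bases reads.fa)" 2000000
```

The result would be the starting point for the -e value; it is only an estimate, since reads that are trimmed or discarded do not contribute to the real coverage.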

The -nobig parameter suppresses the generation of big output files.

The -rst 0 parameter (repeat score threshold) says that a read should be labeled as uniquely mapped if its best hit scores more than 0 above its next-best hit. The default value is 12, which means that many reads get labeled as repeats even though their hits can distinguish between similar repeat regions.
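The threshold rule as described here can be written as a one-line comparison. This is only an illustration of the description above, not Newbler's actual code; the function name is_uniquely_mapped and the integer scores are assumptions.

```shell
# Succeed (exit status 0) if the best hit beats the next-best hit by more
# than the repeat score threshold, i.e. the read would count as unique.
# Arguments: best score, next-best score, threshold (all integers).
is_uniquely_mapped() {
    best=$1
    second=$2
    rst=$3
    [ $((best - second)) -gt "$rst" ]
}

# With hypothetical scores 100 and 95, the read is unique at -rst 0
# but would be labeled a repeat at the default threshold of 12.
is_uniquely_mapped 100 95 0  && echo "unique at rst=0"
is_uniquely_mapped 100 95 12 || echo "repeat at rst=12"
```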

A Makefile that illustrates the use of Sun Grid Engine to avoid running on the head node is in /campusdata/BME235/assemblies/Pog/newbler-assembly2/Makefile

Note: earlier versions of Newbler provided serially numbered contigs, but version 2.3 seems to skip numbers rather arbitrarily, so that the range of the numbers is larger than the size of the set of contigs. Look at the counts (in assembly/454NewblerMetrics.txt) or run a program to count the contigs, rather than relying on the largest contig number.
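One simple way to count the contigs directly is to count the FASTA header lines in the contig file. A minimal sketch, assuming the contigs are in assembly/454AllContigs.fna (adjust the path if your project writes them elsewhere); count_contigs is a made-up helper name:

```shell
# Count FASTA records by counting header lines (lines starting with '>').
count_contigs() {
    grep -c '^>' "$1"
}

# Example (path is an assumption about the project layout):
# count_contigs assembly/454AllContigs.fna
```

Because this counts records rather than reading contig names, gaps in the numbering do not affect the result.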

For large genomes you can pass the -large argument to runProject, which makes it take some time-saving shortcuts. A Makefile recipe that uses it looks like this:

        ${BIN}/newAssembly .
        ${BIN}/addRun . ${SFFS_IN}
        ${BIN}/addRun . ${FA_IN}
        ${BIN}/runProject -e ${EXPECTED_COVERAGE} -large -rst 0 -noace .

Mapping to existing genome

Many genomes can be assembled by mapping reads to an existing “close enough” genome. This “close enough” genome can even be one assembled de novo by Newbler! As with de novo assembly, the standard approach is to create a new directory, and in that directory create a Makefile that includes a target to execute the following commands:

newMapping .
setRef . /campusdata/BME235/data/Pog/finished/Pog.chr.v3.fa
setRef . /campusdata/BME235/data/Pog/finished/Pog.ece.v3.fa 
addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ01.sff
addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ02.sff
runProject -e 50 -nobig -rst 0 .
archive/bioinformatic_tools/gs_de_novo_assembler.txt · Last modified: 2015/07/27 23:23 by ceisenhart