
archive:bioinformatic_tools:gs_de_novo_assembler

Differences

This shows you the differences between two versions of the page.

archive:bioinformatic_tools:gs_de_novo_assembler [2010/04/09 19:35]
galt
archive:bioinformatic_tools:gs_de_novo_assembler [2015/07/28 06:23] (current)
ceisenhart ↷ Page moved from bioinformatic_tools:gs_de_novo_assembler to archive:bioinformatic_tools:gs_de_novo_assembler
Line 4: Line 4:
  
 It works in flow-space to reduce the impact of its most common
-sequencing error (uncertainty about the length of homopolymers).
+sequencing error (uncertainty about the length of homopolymers).\\
 +It claims it can assemble a 3 Gb genome in one day and can use paired-end
 +information to construct scaffolds from contigs. Currently the paired-end data must have at least 50 bases in each end, so only 454 paired-end libraries are accepted. It would be good if they relaxed that constraint so that their data could be mixed with data from other platforms.
  
 Roche 454 info about [[http://454.com/products-solutions/analysis-tools/gs-de-novo-assembler.asp|Newbler]]
  
-Wiki [[http://en.wikipedia.org/wiki/Newbler|article]].
 +Wiki [[http://en.wikipedia.org/wiki/Newbler|article]].\\
 +Documentation: [[http://xyala.cap.ed.ac.uk/Gene_Pool/454_software/]] (probably not supposed to be free on the web, though...)\\
 +First description: [[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1464427/]]\\
 +A review article with a decent description: [[http://www.ncbi.nlm.nih.gov/pubmed/20211242]]
 + 
 +== Installation on campusrocks == 
 + 
 +The following tools were installed in BME235/bin/ on 15 April 2010:
 +  * addRun 
 +  * createProject 
 +  * doAmplicon 
 +  * fnafile 
 +  * getProjAlignData 
 +  * gsAmplicon 
 +  * gsAssembler 
 +  * gsMapper 
 +  * newAssembly 
 +  * newbler 
 +  * newMapping 
 +  * removeRun 
 +  * runAssembly 
 +  * runMapping 
 +  * runProject 
 +  * setRef 
 +  * sff2scf 
 +  * sfffile 
 +  * sffinfo 
 +  * sffrescore 
 +  * stopRun 
 + 
 +The tools are installed and are being tested.
 + 
 +== De novo assembly == 
 + 
 +The standard approach for de novo assembly is to create a new directory, and in that directory create a Makefile that includes a target to execute the following commands:
 +<code>
 +newAssembly .
 +addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ01.sff
 +addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ02.sff
 +runProject -e 50 -nobig -rst 0 .
 +</code>
 +Of course, different sff files will be used on different runs. 
 + 
 +The "-e" value is the expected coverage. For the Pog 454 data, that should be about 60. For the banana-slug data, it is very much smaller (0.05?).
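
Expected coverage is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below shows that arithmetic, assuming both numbers are already known. The helper name `expected_coverage` and the figures in the usage note are hypothetical, not measurements from the Pog or banana-slug data.

```shell
# Estimate expected coverage as total sequenced bases / genome size.
# expected_coverage is a hypothetical helper, not part of Newbler.
expected_coverage() {
    total_bases=$1
    genome_size=$2
    # awk does the floating-point division; a non-positive genome size is an error.
    awk -v b="$total_bases" -v g="$genome_size" \
        'BEGIN { if (g <= 0) exit 1; printf "%.2f\n", b / g }'
}
```

For example, `expected_coverage 120000000 2000000` prints `60.00`, which could then be rounded and passed to runProject as the "-e" value.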
 + 
 +The -nobig parameter suppresses the generation of big output files. 
 + 
 +The -rst 0 parameter (repeat score threshold) says that a read should be labeled as uniquely mapped if its best hit scores more than 0 above the next-best hit. The default value is 12, which means that a lot of hits get labeled as repeats even though they can distinguish between similar repeat regions.
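
The rule for the repeat score threshold can be written as a simple comparison. This only illustrates the rule as stated on this page and is not Newbler's actual implementation. The helper name `label_read` and the scores are hypothetical.

```shell
# Label a read "unique" when its best hit beats the next-best hit by
# more than the repeat score threshold, otherwise "repeat".
# label_read is a hypothetical illustration, not a Newbler command.
label_read() {
    best=$1
    next_best=$2
    rst=$3
    if [ $((best - next_best)) -gt "$rst" ]; then
        echo unique
    else
        echo repeat
    fi
}
```

With a score difference of 5, `label_read 40 35 0` prints `unique`, while `label_read 40 35 12` (the default threshold) prints `repeat`.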
 + 
 +A Makefile that illustrates the use of the SunGrid to avoid running on the head node is shown in /campusdata/BME235/assemblies/Pog/newbler-assembly2/Makefile
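
As a rough idea of what submitting the assembly to the grid might look like, the sketch below builds a `qsub` command line and prints it instead of running it. It is not taken from the Makefile mentioned above. The helper name `sge_submit_line` and the job name are hypothetical, and the choice of options (`-cwd`, `-V`, `-b y`, `-N`) is an assumption about a typical Sun Grid Engine setup.

```shell
# Print (rather than run) a qsub command line that would execute a
# command on a compute node from the current directory.
# sge_submit_line is a hypothetical helper; the option set is an assumption.
sge_submit_line() {
    job_name=$1
    shift
    # -cwd: run in the current directory; -V: export the environment;
    # -b y: treat the command as a binary, not a script; -N: job name.
    printf '%s\n' "qsub -cwd -V -b y -N $job_name $*"
}
```

For example, `sge_submit_line newbler runProject -e 50 -nobig -rst 0 .` prints `qsub -cwd -V -b y -N newbler runProject -e 50 -nobig -rst 0 .`, which can be inspected before being run by hand or from a Makefile target.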
 + 
 +Note: earlier versions of Newbler provided serially numbered contigs, but version 2.3 seems to skip numbers rather arbitrarily, so that the range of the numbers is larger than the size of the set of contigs. Look at the counts (in assembly/454NewblerMetrics.txt) or run a program to count the contigs, rather than relying on the largest contig number.
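
One way to count the contigs directly is to count the header lines in the contig FASTA file. A minimal sketch follows. The helper name `count_contigs` is hypothetical, and the file name in the usage note is an assumption about where the assembly writes its contigs, so substitute the actual contig FASTA file.

```shell
# Count sequences in a FASTA file by counting header lines (those starting with ">").
# count_contigs is a hypothetical helper, not part of Newbler.
count_contigs() {
    grep -c '^>' "$1"
}
```

For example, `count_contigs assembly/454AllContigs.fna` prints the number of contigs regardless of how they are numbered.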
 + 
 +For large genomes, you can pass the -large argument to runProject and it will take some time-saving shortcuts.
 +<code>
 +        ${BIN}/newAssembly .
 +        ${BIN}/addRun . ${SFFS_IN}
 +        ${BIN}/addRun . ${FA_IN}
 +        ${BIN}/runProject -e ${EXPECTED_COVERAGE} -large -rst 0 -noace .
 +</code>
 + 
 +== Mapping to existing genome == 
 + 
 +Many genomes can be assembled by mapping reads to an existing "close-enough" genome. This "close-enough" genome can even be one assembled de novo by Newbler!
 +The standard approach for a mapping assembly is to create a new directory, and in that directory create a Makefile that includes a target to execute the following commands:
 +<code>
 +newMapping .
 +setRef . /campusdata/BME235/data/Pog/finished/Pog.chr.v3.fa
 +setRef . /campusdata/BME235/data/Pog/finished/Pog.ece.v3.fa
 +addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ01.sff
 +addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ02.sff
 +runProject -e 50 -nobig -rst 0 .
 +</code>
 + 
 + 
 + 
archive/bioinformatic_tools/gs_de_novo_assembler.1270841731.txt.gz · Last modified: 2010/04/09 19:35 by galt