archive:bioinformatic_tools:gs_de_novo_assembler

Differences

This shows you the differences between two versions of the page.

archive:bioinformatic_tools:gs_de_novo_assembler [2010/04/20 14:17]
karplus Added warning about non-serial contig numbering, added -noace
archive:bioinformatic_tools:gs_de_novo_assembler [2015/07/28 06:23]
ceisenhart ↷ Page moved from bioinformatic_tools:gs_de_novo_assembler to archive:bioinformatic_tools:gs_de_novo_assembler
Line 44: Line 44:
 == De novo assembly ==
 
-The standard commands for de novo assembly are to create a new directory, and in that directory create a Makefile that includes a target to execute the following commands:
+The standard approach for de novo assembly is to create a new directory, and in that directory create a Makefile that includes a target to execute the following commands:
 <code>
 newAssembly .
 addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ01.sff
 addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ02.sff
-runProject -e 50 -noace -rst 0 .
+runProject -e 50 -nobig -rst 0 .
 </code>
 Of course, different sff files will be used on different runs.
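Because the sff files change from run to run, it can help to generate the addRun lines from a directory listing rather than typing them by hand. The following is a minimal sketch, not part of the original page: the helper name print_addruns is hypothetical, and it only prints the commands (a dry run) so they can be checked before being pasted into a Makefile target.

```shell
# Print one addRun command per .sff file in a directory (dry run only).
# print_addruns is a hypothetical helper, not part of Newbler.
print_addruns() {
    for sff in "$1"/*.sff; do
        [ -e "$sff" ] || continue   # skip if the glob matched no files
        echo "addRun . $sff"
    done
}

# Usage (the directory is the one used in the commands above):
#   print_addruns /campusdata/BME235/data/Pog/454_run/sff
```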
+
+The "-e" value is the expected depth of coverage. For the Pog 454 data, that should be about 60. For the banana-slug data, it is very much smaller (0.05?).
+
+The -nobig parameter suppresses the generation of big output files.
+
+The -rst 0 parameter (repeat score threshold) says that a read should be labeled as uniquely mapped if its best hit scores more than 0 above the next-best hit. The default value is 12, which means that a lot of hits get labeled as repeats even though they can distinguish between similar repeat regions.
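A rough way to choose the "-e" value is to divide the total number of sequenced bases by the estimated genome size. The sketch below illustrates only that arithmetic. The helper name expected_coverage is hypothetical, and the numbers are made up for illustration; they are not measurements from the Pog or banana-slug data.

```shell
# Estimate expected coverage as (total read bases) / (genome size).
# expected_coverage is a hypothetical helper, not part of Newbler.
expected_coverage() {
    total_bases=$1
    genome_size=$2
    awk -v b="$total_bases" -v g="$genome_size" 'BEGIN { printf "%.2f\n", b / g }'
}

# Made-up example: 120 Mbases of reads over a 2 Mbase genome.
expected_coverage 120000000 2000000   # prints 60.00
```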
  
 A Makefile that illustrates the use of the SunGrid to avoid running on the head node is shown in /campusdata/BME235/assemblies/Pog/newbler-assembly2/Makefile
 
 Note: earlier versions of Newbler provided serially numbered contigs, but version 2.3 seems to skip numbers rather arbitrarily, so that the range of the numbers is larger than the size of the set of contigs. Look at the counts (in assembly/454NewblerMetrics.txt) or run a program to count the contigs, rather than relying on the largest contig number.
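One simple way to "run a program to count the contigs" is to count the FASTA header lines in the contig file. This is a minimal sketch that assumes the contigs are stored as FASTA. The helper name count_contigs is hypothetical, and the file name in the usage comment is an assumption; substitute whichever contig file your Newbler run wrote.

```shell
# Count contigs by counting FASTA header lines (lines starting with '>').
# count_contigs is a hypothetical helper, not part of Newbler.
count_contigs() {
    grep -c '^>' "$1"
}

# Usage (file name is an assumption; adjust to your assembly directory):
#   count_contigs assembly/454AllContigs.fna
```

The count does not depend on the contig names, so gaps in the numbering do not affect it.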
+
+For large genomes, you can pass the -large argument to runProject, and it will take some time-saving shortcuts.
+<code>
+        ${BIN}/newAssembly .
+        ${BIN}/addRun . ${SFFS_IN}
+        ${BIN}/addRun . ${FA_IN}
+        ${BIN}/runProject -e ${EXPECTED_COVERAGE} -large -rst 0 -noace .
+</code>
  
 == Mapping to existing genome ==
Line 67: Line 81:
 addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ01.sff
 addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ02.sff
-runProject -e 50 -noace -rst 0 .
+runProject -e 50 -nobig -rst 0 .
 </code>
  
archive/bioinformatic_tools/gs_de_novo_assembler.txt · Last modified: 2015/07/28 06:23 by ceisenhart