====== Newbler assembly on POG ======
====== Overview ======
This page outlines how Kevin assembled the 454 data for Pyrobaculum oguniense (POG) using Newbler version 2.3.
===== Key points =====
* Kevin installed Newbler version 2.3 on the Campusrocks cluster under /campusdata/BME235/programs/DataAnalysis_2.3.
* The Newbler GUI is not installed, because there were problems unpacking it.
* Kevin ran the assembly tool on POG 454 data under /campusdata/BME235/assemblies/Pog.
* The README file in the directory contains important information about the assembly.
* Info about tools installed is listed in bioinformatic_tools [[https://banana-slug.soe.ucsc.edu/bioinformatic_tools:gs_de_novo_assembler | GS De Novo Assembler]]. Info about how to run the De novo as well as Mapping assembly tools is also included there.
* Currently, tools are installed under /campusdata/BME235/bin/old_Newbler/.
* Tools with prefix "gs" are not supposed to be run directly.
* Kevin has written several Python scripts (for Python 2.6) that aid in building and analyzing genomes. Currently, these scripts do not work on Campusrocks, because the installed version of Python is 2.4; it is in the process of being updated to version 2.6. (Python2.6.5 has now been installed in /campusdata/BME235/bin/ --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/19 20:03//)
* The Newbler assembly tools take .sff files (flowgram and quality data) as input and convert them into .fna files (FASTA files with nucleotide sequences).
* Newbler works well only with 454 data, and does poorly on reads shorter than 50 bases.
* Example code to run the De novo tool on data is shown below. The code is taken from [[https://banana-slug.soe.ucsc.edu/bioinformatic_tools:gs_de_novo_assembler | GS De Novo Assembler]].
newAssembly .
addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ01.sff
addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ02.sff
runProject -e 50 .
* Here -e 50 is an important parameter: it gives the expected coverage, and it defaults to 50.
* Currently, only the de novo assembly has been done on POG; mapping has not been done yet.
* Output: generated in a separate directory called "assembly". The main outputs are the .fna and .qual files. Look at "/campusdata/BME235/assemblies/Pog/newbler-assembly1/assembly".
* make.log - keeps track of what happened.
* Mapping to an existing genome: an example from Kevin Karplus, in /pluck/Vc/map23_scaffold
newMapping .
addRun . /projects/lowelab/users/course/karplus/Vc/sequencing/sff/*.sff
setRef . Vc.scaffold
runProject -e 25 -rst 0 -noace .
* Where -e 25 is specific to the Vibrio sequence coverage, and -noace prevents the building of an ace file (large file) which is used with CONSED.
* Slug is AT-rich, so Illumina data may be better than 454.
* rdb files were described as a useful, simple-to-create form of relational database. An example of rdb file generation with a makefile is given below, as implemented by Kevin in /pluck/rachel/combined_cleaning1/Makefile . Note that this example was **not** given in class, and is intended for pulling out a subset of the contigs, not making an rdb file for all contigs.
%.stats: %.ids
	echo "name length numreads" > $@
	echo "S N N" >> $@
	grep '^>' < contigs_all.fa \
	  | grep -f $*.ids \
	  | sed 's/=/ /g' \
	  | sed 's/>//' \
	  | awk '{printf "%s\t%d\t%d\n", $$1, $$3, $$5}' \
	  >> $@
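For readers more comfortable with Python than with shell pipelines, the same filtering step can be sketched roughly as follows. This is only an illustration, not Kevin's script. It assumes contig header lines of the form ">contig00001  length=1234   numreads=56", which is what the sed and awk steps above appear to expect. The function name contig_stats_rdb is made up for this sketch, and it writes a tab-separated header. Unlike grep -f, which matches each line of the ids file as a pattern anywhere in the header, the sketch matches contig names exactly.

```python
def contig_stats_rdb(fasta_lines, wanted_ids):
    """Return rdb-style text (name, length, numreads) for selected contigs.

    fasta_lines: iterable of lines from a contig FASTA file (assumed header
                 format: ">name  length=N   numreads=M").
    wanted_ids:  iterable of contig names to keep (exact match, which is
                 stricter than the pattern matching done by grep -f).
    """
    wanted = set(wanted_ids)
    rows = ["name\tlength\tnumreads", "S\tN\tN"]
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        # ">c1 length=12 numreads=3" -> ["c1", "length", "12", "numreads", "3"]
        fields = line[1:].replace("=", " ").split()
        if len(fields) < 5 or fields[0] not in wanted:
            continue
        rows.append("%s\t%d\t%d" % (fields[0], int(fields[2]), int(fields[4])))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Made-up header lines, for illustration only.
    example = [
        ">contig00001  length=1234   numreads=56\n",
        "ACGTACGT\n",
        ">contig00002  length=10   numreads=2\n",
    ]
    print(contig_stats_rdb(example, ["contig00002"]))
```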
* If anyone finds good user-oriented documentation or tutorials (as opposed to feature-based documentation), please share them with the group.
* Don't copy .sff files; use soft links to the data files instead.
* Useful output can be found in /assembly/454NewblerMetrics.txt . It lists the inputs, reads, bases (to calculate coverage = bases / genome size), readAlignmentResults, inferredReadError (0.8% = OK), estimatedGenomeSize, and consensusResults (largeContigMetrics, allContigs, ...).
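The coverage calculation mentioned in the last point (coverage = bases / genome size) is simple arithmetic. A minimal sketch is given here, assuming you have copied the total base count out of 454NewblerMetrics.txt by hand and have a genome size estimate. The function name and the numbers in the example are made up for illustration and are not taken from the POG run.

```python
def coverage(total_bases, genome_size):
    """Average coverage (depth) = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / float(genome_size)


if __name__ == "__main__":
    # Hypothetical values: 100 Mbases of reads on a 2.5 Mbase genome.
    print(coverage(100000000, 2500000))
```

The result can be compared against the value passed to runProject with -e, which is the expected coverage.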
====== Things to remember while running assembly tools ======
* All the assemblies should be listed under /campusdata/BME235/assemblies.
* Include .cshrc file in your path.
* It's better to run the tool in the current working directory.
* Create a README file in each new directory; it should contain everything needed to run the assembly tool.
* Create a Makefile for each assembly tool. (The Makefile for the newbler_assembly tool is in /campusdata/BME235/assemblies/Pog/newbler-assembly1/ ). You can use it as a template and modify the data source and the expected coverage as required. The Makefile should be considered "a book for lab protocols".
* It is always better to append to make.log in the Makefile, rather than overwrite it.
* Wiki page for assembly tools should contain a summary of how to run the tool and other things that might be useful to look at.
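As an illustration of the advice to use soft links rather than copies of the data files, here is a minimal sketch that links every .sff file from a data directory into a working directory. The function name link_sff_files and the directory names in the usage comment are hypothetical; nothing like this script is known to exist in the class directories.

```python
import glob
import os


def link_sff_files(data_dir, work_dir):
    """Create soft links in work_dir to every .sff file in data_dir.

    Returns the list of link paths. Existing entries in work_dir are left
    alone, so the function can be re-run safely.
    """
    if not os.path.isdir(work_dir):
        os.makedirs(work_dir)
    links = []
    for src in sorted(glob.glob(os.path.join(data_dir, "*.sff"))):
        dest = os.path.join(work_dir, os.path.basename(src))
        if not os.path.lexists(dest):
            # Link to the absolute path so the link works from any directory.
            os.symlink(os.path.abspath(src), dest)
        links.append(dest)
    return links


# Hypothetical usage (paths are examples only):
# link_sff_files("/campusdata/BME235/data/Pog/454_run/sff", "sff_links")
```

The links can then be passed to addRun in place of the original paths, so no copy of the read data is ever made.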