Kevin installed Newbler version 2.3 on the Campusrocks cluster under /campusdata/BME235/programs/DataAnalysis_2.3.
The Newbler GUI is not installed, as it has some issues with unpacking.
Kevin ran the assembly tool on POG 454 data under /campusdata/BME235/assemblies/Pog.
The README file in the directory contains important information about the assembly.
Info about the installed tools is listed in bioinformatic_tools (GS De Novo Assembler). Info about how to run the de novo and mapping assembly tools is also included there.
Currently, tools are installed under /campusdata/BME235/bin/old_Newbler/.
Tools with prefix “gs” are not supposed to be run directly.
Kevin has written several Python (version 2.6) scripts that aid in building and analyzing genomes. Currently, these scripts do not work on Campusrocks, as the installed version of Python is 2.4 and it is in the process of being updated to version 2.6. (Python 2.6.5 has now been installed in /campusdata/BME235/bin/. Kevin Karplus 2010/04/19 20:03)
Newbler assembly tools take .sff files (flowgram and quality data) as input and convert them into .fna files (FASTA files with nucleotide information).
Newbler works well only with 454 data, and does not do well on reads shorter than 50 bases.
Example code to run the de novo tool on data is shown below. The code is taken from GS De Novo Assembler.
newAssembly .
addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ01.sff
addRun . /campusdata/BME235/data/Pog/454_run/sff/FUIPDCZ02.sff
runProject -e 50 .
Here, -e 50 is an important parameter: it sets the expected coverage, which defaults to 50.
Currently, de novo assembly has been done on POG; mapping has not been done yet.
Output: generated in a separate directory called “assembly”. The main outputs are .fna files and .qual files. Look at “/campusdata/BME235/assemblies/Pog/newbler-assembly1/assembly”.
make.log keeps track of what happened.
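The command sequence above can also be generated for an arbitrary list of .sff files. The following is a minimal sketch in Python 3, not one of Kevin's scripts: the function name denovo_commands is hypothetical, it uses only the three commands shown above (newAssembly, addRun, runProject -e), and it only builds the command lines without running anything.

```python
import shlex


def denovo_commands(project_dir, sff_files, expected_coverage=50):
    """Build the command lines for a de novo Newbler project.

    Hypothetical helper: it mirrors the newAssembly / addRun / runProject
    sequence from the example and does not execute anything.
    """
    commands = [["newAssembly", project_dir]]
    for sff in sff_files:
        # One addRun per .sff file, as in the example.
        commands.append(["addRun", project_dir, sff])
    commands.append(["runProject", "-e", str(expected_coverage), project_dir])
    # Quote each argument so that unusual paths survive a shell.
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]
```

Printing the returned lines into a shell script keeps a record of exactly which runs went into an assembly.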
Mapping to an existing genome: an example from Kevin Karplus, in /pluck/Vc/map23_scaffold
newMapping .
addRun . /projects/lowelab/users/course/karplus/Vc/sequencing/sff/*.sff
setRef . Vc.scaffold
runProject -e 25 -rst 0 -noace .
Where -e 25 is specific to the Vibrio sequence coverage, and -noace prevents the building of an ace file (large file) which is used with CONSED.
Slug is AT-rich, so Illumina data may be better than 454.
rdb files were described as useful, simple-to-create relational databases. An example of rdb file generation with a makefile is given below, as implemented by Kevin in /pluck/rachel/combined_cleaning1/Makefile . Note that this example was not given in class, and is intended for pulling out a subset of the contigs, not for making an rdb file for all contigs.
%.stats: %.ids
	echo "name length numreads" > $@
	echo "S N N" >> $@
	grep '^>' < contigs_all.fa \
	| grep -f $*.ids \
	| sed 's/=/ /g' \
	| sed 's/>//' \
	| awk '{printf "%s\t%d\t%d\n", $$1, $$3, $$5}' \
	>> $@
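For readers more comfortable with Python than with sed and awk, the same extraction can be sketched as below. This is an illustration under assumptions, not Kevin's code: the function names are hypothetical, the header layout (">name length=N numreads=N") is inferred from the sed/awk pipeline above, the rdb header is written tab-separated, and ids are matched exactly against the contig name, whereas grep -f matches each id as a pattern anywhere in the header line.

```python
import re

# Header layout inferred from the makefile rule: ">name length=N numreads=N".
HEADER_RE = re.compile(r"^>(\S+)\s+length=(\d+)\s+numreads=(\d+)")


def contig_stats(fasta_lines, wanted_ids):
    """Return (name, length, numreads) for each wanted contig header."""
    wanted = set(wanted_ids)
    rows = []
    for line in fasta_lines:
        match = HEADER_RE.match(line)
        # Sequence lines and unwanted contigs are skipped.
        if match and match.group(1) in wanted:
            name, length, numreads = match.groups()
            rows.append((name, int(length), int(numreads)))
    return rows


def format_rdb(rows):
    """Format rows as rdb text: column names, column types, then data."""
    lines = ["name\tlength\tnumreads", "S\tN\tN"]
    lines.extend("%s\t%d\t%d" % row for row in rows)
    return "\n".join(lines) + "\n"
```

Calling format_rdb(contig_stats(open("contigs_all.fa"), ids)) would produce the text that the makefile rule writes to the .stats file, up to the matching difference noted above.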
If anyone finds good user-oriented documentation or tutorials (as opposed to feature-based documentation), please share them with the group.
Don't copy .sff files; use soft links to the data files.
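One way to set up the soft links for a whole directory of runs is sketched below in Python 3. The function name link_sff_files and both directory arguments are hypothetical; the sketch links every file ending in .sff and leaves already existing links alone.

```python
import os


def link_sff_files(source_dir, dest_dir):
    """Create soft links in dest_dir for every .sff file in source_dir.

    Hypothetical helper: the data files themselves are never copied.
    """
    os.makedirs(dest_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(source_dir)):
        if not name.endswith(".sff"):
            continue
        # Use an absolute target so the link works from any directory.
        target = os.path.abspath(os.path.join(source_dir, name))
        link = os.path.join(dest_dir, name)
        if not os.path.lexists(link):
            os.symlink(target, link)
        linked.append(link)
    return linked
```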
Useful output can be found in /assembly/454NewblerMetrics.txt . It reports the inputs, reads, bases (to calculate coverage = bases / genome size), readAlignmentResults, inferredReadError (0.8% = OK), estimatedGenomeSize, and consensusResults (largeContigMetrics, allContigs, …).
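The coverage calculation mentioned above is a single division. A minimal sketch, with a hypothetical function name and with the two numbers read off the metrics file by hand:

```python
def estimated_coverage(total_bases, genome_size):
    """Return coverage as total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size
```

For example, 100,000,000 bases over a 4,000,000-base genome (made-up numbers, not taken from any of the assemblies here) gives a coverage of 25, which is the kind of value to pass to runProject with -e.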