There will be a guest lecture on Friday, April 23rd, 2010. Please do not be late!
By Friday, April 23rd, 2010, all published sources that contain information pertaining to banana slugs should be added to the Banana Slug Biology page. Is this the right page?
Expect 40-50 different resources (papers, books, PhD theses).
The bibliography must be annotated, though annotations are not required this Friday.
As you run assembly tools, keep the assemblies/ wiki page updated!
Continuation of Newbler Assembler methods
By running Newbler using the settings described last Friday, Kevin obtained 43 contigs and ~2.4 megabases of DNA.
This measure is not bad compared to most other assemblers
Overview of directories
Newbler-clean
Not attempting to assemble data.
Used to remove contamination from H. pylori, which was sequenced on the same machine in the same run.
If you have a reference genome for your contaminants, you can try to clean them out computationally by mapping and removing matches.
Makefile Overview
Set up a new mapping
Maps reads from 454 Pog data to the contaminant genome
Interesting output: ReadStatus.txt
Contains the mapping status of each read (mapped, unmapped, partially mapped, too short)
Reads mapped to contaminants should be filtered out of the 454 data and should not be used for assembly.
Create a new SFF file (using sfffile utility) with unmapped data in file “no_Hyp.sff”
It may be beneficial to hang onto “too short” reads for use in assemblers which utilize shorter reads.
You can also use Megablast and NCBI's Taxonomy Report to identify contaminants.
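The filtering step above can be sketched in Python. This is a minimal sketch, not the course Makefile: it assumes ReadStatus.txt is tab-separated with the read accession in the first column and the status in the second, and the status labels ("Unmapped", "TooShort", and so on) are placeholders that should be checked against the real file. The function name select_reads is hypothetical.

```python
import csv
import io


def select_reads(handle, keep_statuses=("Unmapped",)):
    """Return accessions of reads whose status is in keep_statuses.

    Assumes (hypothetically) a tab-separated file with the accession in
    column 1 and the mapping status in column 2, plus one header line.
    """
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0] == "Accno":
            continue  # skip blank lines and the assumed header line
        accession, status = row[0], row[1]
        if status in keep_statuses:
            kept.append(accession)
    return kept


# Toy input; the status labels here are assumptions, not verified output.
sample = "Accno\tStatus\nr1\tMapped\nr2\tUnmapped\nr3\tTooShort\nr4\tPartial\n"
clean = select_reads(io.StringIO(sample))
with_short = select_reads(io.StringIO(sample), ("Unmapped", "TooShort"))
print(clean)
print(with_short)
```

The resulting accession list could be written to a text file and handed to sfffile to build the cleaned SFF file. Passing "TooShort" as well keeps the short reads for assemblers that can use them.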
Newbler-assembly2
Uses clean data (no H. pylori) from newbler-clean1
One less small contig, mostly the same
Newbler *
Changed expected coverage to 60x. This number is closer to the real coverage, based on better estimates of genome size.
Still reported 41 contigs (or maybe one less?)
Good news: all contigs map to the reference genome
Map-colorspace3 (Map-colorspace directories begin indexing at 3 rather than 1)
The scripts run in this directory were originally intended for finding inversions from mate-pair reads.
New features have been added since then.
Now looks for reads between contigs
Tries to orient contigs
Newbler-partial3
Attempts to use only partially mapped/unmapped reads
The plan is to later map contigs from this assembly to the full contigs from before (extend edges?)
Megablast, blastn, blat, and find dna differences are four methods for mapping partially mapped reads onto contigs created by Newbler.
Megablast and blastn showed very similar results
blat
Output is a psl file
Nice results
Very sensitive but also very slow
Shows matches down to 14 bases in size
Can handle introns and splice sites
May be too sensitive for some purposes
There are some parameters you can tweak which will alter the output.
Kevin thought it would be best to keep blat's default parameters and filter output.
A script was written to filter output and find the best matches.
For a resequencing project, take all contigs, map them to the reference with blat, sort by start site, and build a scaffold according to the order of the start sites of the contigs.
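The filter-and-sort idea can be sketched as follows. This is an illustration under assumptions, not the script written for the class: it reads standard 21-column psl rows, keeps the hit with the most matching bases for each contig (one possible definition of "best"), and orders contigs by target start. The function names are hypothetical.

```python
def best_hits(psl_lines):
    """Keep, for each query (contig), the psl row with the most matches."""
    best = {}
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue  # skip the psl header and malformed lines
        matches = int(fields[0])   # column 1: number of matching bases
        strand = fields[8]         # column 9: strand
        q_name = fields[9]         # column 10: query (contig) name
        t_start = int(fields[15])  # column 16: start on the target
        if q_name not in best or matches > best[q_name][0]:
            best[q_name] = (matches, t_start, strand)
    return best


def scaffold_order(psl_lines):
    """Return contig names sorted by the target start of their best hit."""
    best = best_hits(psl_lines)
    return sorted(best, key=lambda name: best[name][1])


def toy_row(matches, q_name, t_start):
    """Build a fake 21-column psl row for demonstration only."""
    fields = ["0"] * 21
    fields[0], fields[8], fields[9] = str(matches), "+", q_name
    fields[13], fields[15] = "ref", str(t_start)
    return "\t".join(fields)


rows = [
    "psLayout version 3",
    toy_row(500, "contigB", 9000),
    toy_row(800, "contigA", 100),
    toy_row(20, "contigA", 5000),  # weaker secondary hit, filtered out
    toy_row(300, "contigC", 4000),
]
print(scaffold_order(rows))
```

Keeping blat's defaults and filtering afterwards, as suggested above, means the "best hit" rule lives in one small function that is easy to change.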
Find dna differences
Shows differences between contigs and a reference genome in a human-readable format
Newbler-assembly4
Adds partial3 contigs to full search as reads
Didn't seem to help much
The resulting output had fewer total bases and more contigs than originally - perhaps worse.
Newbler-assembly5
Utilized Sanger reads produced by David Bernick to join contigs
45 Sanger reads total
Sanger reads were created by running PCR on conjectured contig join sites.
Two possible methods for guessing contig join sites
Use mate-pair data from the SOLiD run as evidence to create primers
David Bernick used known CRISPR repeats. Some contigs showed the start of a repeat, while others showed the end
After adding Sanger reads, the final assembly went down to 31 contigs
In contig-length.rdb
Most contigs have ~0.16 reads/base
The reads/base metric was used because it is simpler to compute than coverage.
Some contigs had higher reads/base: ~0.32 (three contigs) and ~0.44
Could 0.32 reads/base indicate two occurrences of the contig in the genome?
Could a 0.44 reads/base contig occur three times in the genome?
All 0.16 reads/base contigs occurred a single time in the genome
Two of the three 0.32 reads/base contigs appeared twice in the genome.
The third 0.32 reads/base contig was short, and its value was within the expected deviation in reads/base for a sequencing run
The 0.44 reads/base contig occurred twice, not three times
Fewer contigs and more bases
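The reads/base reasoning from contig-length.rdb can be written out as a small calculation. This is a sketch under stated assumptions: the baseline is taken as the median reads/base over all contigs, and the copy estimate is simply the rounded ratio to that baseline. The function names are hypothetical, and the numbers are the approximate values quoted in these notes.

```python
from statistics import median


def reads_per_base(read_count, contig_length):
    """Reads/base for one contig: reads assigned divided by contig length."""
    return read_count / contig_length


def estimated_copies(density, baseline):
    """Rounded ratio of a contig's reads/base to the single-copy baseline."""
    return max(1, round(density / baseline))


# Approximate densities from the notes: mostly ~0.16, three at ~0.32, one ~0.44.
densities = [0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.32, 0.32, 0.32, 0.44]
baseline = median(densities)

for density in sorted(set(densities)):
    print(density, estimated_copies(density, baseline))
```

The estimate is only a hint. As noted above, the 0.44 reads/base contig turned out to occur twice rather than three times, and one short 0.32 contig was within normal run-to-run deviation, so candidates still need to be checked against the genome.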
Map-colorspace5
map-colorspace has many parameters.
You can list all the parameters by issuing the command “map-colorspace --help”
This command produces lots of output files (~2 per contig), which may prove problematic with assemblies that contain many contigs
Set length parameters using a histogram of reads mapped to contigs
merge_cross option merges all cross products into a single file rather than (number of contigs)^2 files
Scaffold fasta and color qualities are thrown out in this method
If F3 or R3 maps, the program aggressively tries to map the paired read.
Output
Uniquely mapped: both the F3 and R3 (forward and reverse) reads map to the same contig
Multiply mapped: reads can map to multiple contigs
Wrong range, only F3 or R3 map
Summary File
Maps F3 and R3 from the end of one contig to the beginning of another
Shows that contigs are near, not touching.
Short contigs may be skipped by paired-end reads
QUESTION: Can we make a single connected graph using this data and all possible paths? Answer: sometimes, if the data is particularly coherent, but it is not guaranteed.
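One way to check the question above on real data is to treat contigs as nodes and mate-pair links between contigs as edges, then count connected components. The sketch below is an illustration only: the contig names and links are made up, and the function name is hypothetical. A single component means every contig can be reached from every other, which is necessary (though not sufficient) for one unambiguous scaffold.

```python
from collections import defaultdict, deque


def connected_components(contigs, links):
    """Group contigs into components joined by mate-pair links.

    contigs: iterable of contig names.
    links: iterable of (contig_a, contig_b) pairs supported by paired reads.
    """
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in contigs:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:  # breadth-first search from this contig
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(graph[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Made-up example: c4 has no links, e.g. a short contig skipped by pairs.
toy_contigs = ["c1", "c2", "c3", "c4"]
toy_links = [("c1", "c2"), ("c2", "c3")]
components = connected_components(toy_contigs, toy_links)
print(len(components), components)
```

Short contigs that paired reads skip over would show up here as isolated components, matching the caveat that a single connected graph is not guaranteed.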