Assembler takes only Illumina libraries:
Ideally PCR-free, high coverage, and insert size ~450bp
Had to use FastUniq to remove duplicates:
Input was 16X coverage, somewhat low for what DISCOVAR wants as input
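The coverage figure is just total sequenced bases divided by genome size. A minimal sketch of that arithmetic follows; the read count and genome size are hypothetical numbers picked only so the result comes out to 16X, not values from these runs.

```python
def estimated_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected fold coverage = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical inputs for illustration only (not the real library stats):
# 64 million reads of 250 bp against an assumed 1 Gb genome.
example = estimated_coverage(num_reads=64_000_000,
                             read_length=250,
                             genome_size=1_000_000_000)
print(f"{example:.1f}X")
```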
Used the fraction option to limit the input to only a portion of the reads
Needed to specify threads and maximum memory for the run as well
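To make the idea of feeding in only a fraction of the reads concrete, here is a rough sketch that keeps each read pair with a fixed probability. It is an illustration of subsampling in general and is not a description of how the assembler's fraction option works internally; the function name `subsample_pairs` and the toy read names are made up.

```python
import random


def subsample_pairs(pairs, fraction, seed=0):
    """Keep each read pair with probability `fraction` (pairs stay together)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return [pair for pair in pairs if rng.random() < fraction]


# Toy read pairs; real input would come from FASTQ/BAM files.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(1000)]
half = subsample_pairs(pairs, 0.5, seed=1)
print(len(half), "of", len(pairs), "pairs kept")
```

Sampling whole pairs rather than individual reads keeps mates together, which paired-end assemblers rely on.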
The 50% UCSF run showed much better contig and scaffold N50 than the 50% original-data run, and used less memory
DISCOVAR performed much better with 2×250 reads than with 2×100 reads: more scaffolds of longer length
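Since the runs are being compared by N50, a short sketch of the usual definition may help: sort lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The example lengths below are invented.

```python
def n50(lengths):
    """Return N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Invented contig lengths for illustration
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

The same function applies to contigs or scaffolds; only the list of lengths changes.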
Want to use full data set when there is more RAM available
The 8th-longest scaffold, when nucleotide BLASTed, matched a sea hare transcript variant
metallothionein hit may be the result of a cysteine-rich scaffold
most common gene hit was 28S ribosomal RNA, which is a good sign because this gene is highly conserved across species
Want to run PRICE on the viral sequences that were found with BLAST
would create an assembly for the viral sequence that was found and determine whether it is integrated in the genome or is extranuclear
Can map contigs to scaffolds to see whether any contig has coverage that differs from the typical coverage
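One simple way to spot contigs with unusual coverage, once per-contig mean depth is available, is to compare each value to the median. The sketch below assumes the depths have already been computed elsewhere; the contig names, depths, and the 0.5x/2x thresholds are all hypothetical choices for illustration.

```python
from statistics import median


def flag_unusual_coverage(coverage_by_contig, low_ratio=0.5, high_ratio=2.0):
    """Return (typical_depth, {contig: ratio}) for contigs whose mean depth
    is below low_ratio or above high_ratio times the median depth."""
    if not coverage_by_contig:
        raise ValueError("no coverage values given")
    typical = median(coverage_by_contig.values())
    flagged = {}
    for name, depth in coverage_by_contig.items():
        ratio = depth / typical if typical > 0 else float("inf")
        if ratio < low_ratio or ratio > high_ratio:
            flagged[name] = ratio
    return typical, flagged


# Hypothetical per-contig mean depths
depths = {"contig1": 16, "contig2": 15, "contig3": 17, "contig4": 60, "contig5": 4}
typical, flagged = flag_unusual_coverage(depths)
print(typical, flagged)
```

Contigs well above the typical depth are candidates for organellar, viral, or collapsed repeat sequence; contigs well below may be contamination or poorly covered regions.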
SSPACE to do scaffolding after getting contigs
Scaffolds and contigs had been coming out as identical sequences
used 50% UCSF contigs as input, using SW041 and SW042 files
ran with old BWA 0.5, will re-run with BWA 0.7
SSPACE merged a few scaffolds, but only added more Ns
no change in scaffold N50; only the shorter contigs were affected
number of scaffolds decreased by 20-50
probably due to not enough coverage of the assembly
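To check how much of a scaffolding result is just added Ns, the scaffold count and N content can be tallied before and after the run. This is a minimal sketch with a hand-rolled FASTA reader; the two toy records are invented, and a real run would read the scaffold FASTA files from disk.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}; name is the first header word."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else f"unnamed_{len(records)}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def gap_summary(records):
    """Count scaffolds, total bases, N bases, and the fraction of Ns."""
    total = sum(len(seq) for seq in records.values())
    n_bases = sum(seq.upper().count("N") for seq in records.values())
    return {
        "scaffolds": len(records),
        "total_bases": total,
        "n_bases": n_bases,
        "n_fraction": n_bases / total if total else 0.0,
    }


# Invented toy scaffolds for illustration
toy = ">s1 example\nACGTNNNN\nACGT\n>s2\nacgtnn\n"
summary = gap_summary(parse_fasta(toy))
print(summary)
```

Comparing the summaries for the input contigs and the scaffolder output shows whether the drop in scaffold count came with real sequence or only with gap characters.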
Looked for a contig that might be mitochondrial (previous class iteration)
Took reads that mapped to the 2012 consensus sequence
HiSeq sw018 and sw019 reads so far
mito size estimate: 14kb
used DISCOVAR on the sw018 data that mapped to the 2012 sequence → coverage 60X
Want to use contigs built from read data rather than scaffolds
start with one contig that maps well to mito (use the 12kb DISCOVAR 18+19 output)
mito genome does integrate into the nuclear genome; over time the copies mutate and change sequence, which results in lots of ambiguity in contig construction
seems like the 12kb contig is the entire mitochondrial genome
2nd largest contig (3245bp) looks like it might be a missing part of the mito
look at ends of contigs and compare, try to join Ns together
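Comparing contig ends can be sketched as a search for the longest suffix of one contig that exactly matches a prefix of another, trying the second contig in both orientations. This assumes exact matches only; real contig ends often carry errors, so an aligner would be the more robust choice. The function names and toy sequences are made up for illustration.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N, either case)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def longest_end_overlap(left, right, min_overlap=20):
    """Length of the longest suffix of `left` equal to a prefix of `right`,
    or 0 if no exact overlap of at least `min_overlap` bases exists."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def join_if_overlapping(left, right, min_overlap=20):
    """Merge two contigs across an exact end overlap, trying `right` in both
    orientations; return None if neither orientation overlaps."""
    for candidate in (right, reverse_complement(right)):
        k = longest_end_overlap(left, candidate, min_overlap)
        if k:
            return left + candidate[k:]
    return None


# Toy contigs with a 6-base overlap (GGGTTT)
merged = join_if_overlapping("AAACCCGGGTTT", "GGGTTTACGT", min_overlap=3)
print(merged)
```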
sea hare mito genome is 14kb; assemblies usually don't include the hypervariable region, which is very difficult to assemble