lecture_notes:05-20-2015

Discovar Team 5 Update

Assembler takes only Illumina libraries:

Ideally PCR-free, high coverage, and insert size ~450 bp

Had to use FastUniq to remove duplicates:

Input is 16X coverage, somewhat low for what Discovar wants
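
Coverage is total sequenced bases divided by genome size. A minimal sketch of that arithmetic follows; the read count, read length, and genome size are made-up illustrative numbers, not the class's actual values.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


# Hypothetical numbers for illustration only:
# 64 million reads of 250 bp over a 1 Gbp genome.
example = coverage(64_000_000, 250, 1_000_000_000)
print(f"{example:.1f}X")  # prints 16.0X
```

For paired-end data, count each mate as a read (or double the pair count).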

Running Discovar

Used the fraction option to limit input to only a portion of the reads

Needed to specify threads and maximum memory for the run as well

50% UCSF run showed much better contig and scaffold N50 than the 50% original data run, and used less memory

Discovar performed much better with 2×250 reads vs 2×100 reads; more scaffolds of longer length

Want to use full data set when there is more RAM available
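
The runs above are compared by contig and scaffold N50. For reference, a minimal sketch of how N50 can be computed from a list of sequence lengths; the function name and the example lengths are illustrative, not taken from the Discovar output.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Hypothetical contig lengths, for illustration only.
print(n50([100, 200, 300, 400, 500]))  # prints 400
```

Because N50 is driven by the longest sequences, changes that only touch short contigs leave it unchanged.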

BLAST results

8th longest scaffold, when nucleotide BLASTed, matched a sea hare transcript variant

metallothionein hit may be result of having cysteine rich scaffold

most common gene hit was 28S ribosomal RNA, which is a good sign because this gene is conserved across species
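
One way to find the most common hit is to tally subjects from tabular BLAST output. The sketch below assumes the hits were saved in BLAST+ tabular format (`-outfmt 6`, query ID in column 1, subject ID in column 2) and that the first line for each query is its best hit; the function name and the example lines are made up.

```python
from collections import Counter


def top_subjects(tabular_lines, n=3):
    """Count the first-listed subject for each query and
    return the n most common subjects with their counts."""
    seen_queries = set()
    counts = Counter()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        if query in seen_queries:
            continue  # keep only the first hit per query
        seen_queries.add(query)
        counts[subject] += 1
    return counts.most_common(n)


# Made-up hits: query ID, subject ID, percent identity.
hits = [
    "scaffold1\t28S_rRNA\t99.0",
    "scaffold1\tother_gene\t90.0",
    "scaffold2\t28S_rRNA\t98.0",
    "scaffold3\tmetallothionein\t95.0",
]
print(top_subjects(hits, n=1))
```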

Want to run PRICE on the viral sequences that were found with BLAST

would create an assembly for the viral sequence that was found and determine if the sequence is integrated in the genome or is extranuclear

Can map contigs to scaffolds to see if any contig has a different coverage than normal coverage
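
A minimal sketch of the coverage comparison, assuming a mean coverage per contig has already been computed from the mapping. The dictionary contents, the function name, and the 2-fold threshold are illustrative assumptions.

```python
from statistics import median


def flag_unusual_coverage(coverage_by_contig, fold=2.0):
    """Return (typical coverage, {contig: ratio to typical}) for contigs
    whose coverage is at least `fold` times above or below the median."""
    typical = median(coverage_by_contig.values())
    flagged = {}
    for name, cov in coverage_by_contig.items():
        if typical > 0 and (cov >= typical * fold or cov <= typical / fold):
            flagged[name] = cov / typical
    return typical, flagged


# Made-up per-contig mean coverage values.
example = {"c1": 16, "c2": 15, "c3": 17, "c4": 60, "c5": 16}
print(flag_unusual_coverage(example))
```

Contigs well above the typical coverage are candidates for repeats or for extranuclear sequence such as mitochondrial or viral DNA.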

SSPACE

SSPACE to do scaffolding after getting contigs

Scaffolds and contigs had been coming out as identical sequences

used 50% UCSF contigs as input, using SW041 and SW042 files

ran with old BWA 0.5, will re-run with BWA 0.7

SSPACE merged a few scaffolds, but only added more Ns

no change in scaffold N50; only shorter contigs were affected, and the number of scaffolds decreased by 20-50

probably due to not enough coverage of the assembly
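
To check how much of a scaffolding change is just added gap sequence, the scaffold count, total bases, and N bases can be compared before and after. A minimal sketch on toy sequences follows; the function name and the sequences are made up, and real input would be read from the FASTA files.

```python
def gap_stats(sequences):
    """Return (number of sequences, total bases, N bases)
    for a list of sequence strings."""
    total = sum(len(s) for s in sequences)
    n_bases = sum(s.upper().count("N") for s in sequences)
    return len(sequences), total, n_bases


# Toy example: two short sequences joined by a gap of four Ns.
before = ["ACGTACGT", "GGCC", "TTAA"]
after = ["ACGTACGTNNNNGGCC", "TTAA"]
print(gap_stats(before))  # prints (3, 16, 0)
print(gap_stats(after))   # prints (2, 20, 4)
```

In the toy example the scaffold count drops by one, but the only new bases are Ns.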

mitochondrion assembly

Looked for a contig that might be mitochondrial (from a previous class iteration)

Took reads that mapped to the 2012 consensus sequence: HiSeq SW018 and SW019 reads so far

mito size estimate: 14 kb

used Discovar with SW018 data that mapped to the 2012 sequence → coverage 60X

Want to use contigs built from read data rather than scaffolds

start with one contig that maps well to mito (use 12 kb Discovar 18+19 output)

mito genome does integrate into the nuclear genome; over time the integrated copies mutate and change sequence, which results in lots of ambiguity in contig construction

seems like the 12 kb contig is the entire mitochondrial genome

2nd largest contig (3245 bp) looks like it might be missing part of the mito

look at ends of contigs and compare, try to join Ns together

sea hare mito genome is 14 kb; assemblies usually don't include the hypervariable region, which is very difficult to assemble
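
Comparing contig ends can be done by looking for the longest suffix of one contig that matches a prefix of another. The sketch below uses exact matching only (no mismatches, no reverse complement); the function name, the `min_len` default, and the sequences are illustrative assumptions.

```python
def end_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that exactly equals
    a prefix of `right`, or 0 if none is at least `min_len` long."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first; never test a zero-length overlap.
    for k in range(max_len, max(min_len, 1) - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


# Toy contigs: the last 4 bases of the first match the first 4 of the second.
print(end_overlap("AAACGTAC", "GTACTTT"))  # prints 4
```

Real contig ends usually need alignment with mismatches allowed, so an exact match like this is only a first check.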

lecture_notes/05-20-2015.txt · Last modified: 2015/05/21 20:47 by nsaremi