Sea Hare and Panda
Looking at recent successful de novo assemblies can help inform future sequencing and assembly plans for the Banana Slug.
Sea Hare
Sea hare is interesting because it is a recent de novo mollusc assembly built from 454 mate pairs.
Analyzing data from previously sequenced mollusc genomes
Broad Institute /ftp/pub/assemblies/invertebrates/aplysia (sea hare)
Files needed in this directory to analyze the data:
LibStatsOverview.out
Used an overlap-layout-consensus (OLC) assembler (ARACHNE)
13x coverage with 454 reads produced a publishable genome
Probably need a lot more coverage with the shorter-read-length Illumina reads, as illustrated below.
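A rough way to quantify this is the Lander-Waterman model: at equal coverage, a short read gives up a larger fraction of its length to the minimum detectable overlap, so expected contig length drops sharply. A minimal Python sketch; the read lengths, coverages, and 30 bp overlap threshold are illustrative assumptions, not figures from the sea hare project:

```python
import math

def expected_contig_length(read_len, coverage, min_overlap):
    """Lander-Waterman expected contig ("island") length in bp.

    read_len:    read length L
    coverage:    fold coverage c = N * L / G
    min_overlap: minimum overlap T required to join two reads
    """
    theta = min_overlap / read_len      # required overlap fraction T/L
    sigma = 1.0 - theta
    return read_len * ((math.exp(coverage * sigma) - 1) / coverage + theta)

# Illustrative parameters only; the model ignores repeats and sequencing
# error, so absolute values are optimistic -- the comparison is the point.
print(expected_contig_length(400, 13.0, 30))   # ~400 bp 454 reads at 13x
print(expected_contig_length(100, 13.0, 30))   # ~100 bp Illumina reads at 13x
print(expected_contig_length(100, 40.0, 30))   # short reads, deeper coverage
```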
Banana slug insert size estimation
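One straightforward way to estimate insert sizes is to map read pairs back to preliminary contigs and look at the distribution of template lengths. A minimal sketch assuming pysam and a hypothetical BAM file name:

```python
import statistics
import pysam

# Hypothetical BAM of banana slug read pairs mapped to preliminary contigs.
bam = pysam.AlignmentFile("pairs_vs_contigs.bam", "rb")

insert_sizes = []
for read in bam:
    # One record per pair: read 1 of each properly paired, mapped pair.
    if read.is_proper_pair and read.is_read1 and not read.is_unmapped:
        insert_sizes.append(abs(read.template_length))

print("pairs used:   ", len(insert_sizes))
print("median insert:", statistics.median(insert_sizes))
print("stdev:        ", round(statistics.pstdev(insert_sizes), 1))
```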
Trimming low-quality bases from reads may improve the assembly
The first read of a pair may have higher quality than the second read
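To check whether read 1 really is higher quality than read 2 (and where trimming should start), one can compute the mean Phred score at each cycle of each mate. A minimal sketch assuming standard 4-line, Phred+33 FASTQ records; the file names are hypothetical:

```python
def mean_quality_per_cycle(fastq_path):
    """Mean Phred quality at each read position (Phred+33 assumed)."""
    sums, counts = [], []
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 3:                      # quality line of each record
                for pos, ch in enumerate(line.rstrip()):
                    if pos == len(sums):
                        sums.append(0)
                        counts.append(0)
                    sums[pos] += ord(ch) - 33
                    counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]

# Hypothetical file names for the two mates of each pair.
q1 = mean_quality_per_cycle("reads_1.fastq")
q2 = mean_quality_per_cycle("reads_2.fastq")
for pos, (a, b) in enumerate(zip(q1, q2)):
    print(pos, round(a, 1), round(b, 1))
```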
Estimating error rates
Use quality information directly from the sequencing machine.
Estimate the error rate from mismatches after mapping.
These two measures should be correlated with each other.
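A sketch of comparing the two measures: the error rate the machine's Phred scores imply versus the error rate observed after mapping. This assumes pysam and a hypothetical BAM of mapped reads:

```python
import pysam

bam = pysam.AlignmentFile("mapped.bam", "rb")   # hypothetical file name

pred_err = pred_bases = 0.0
obs_err = obs_bases = 0.0
for read in bam:
    if read.is_unmapped or read.is_secondary or read.query_qualities is None:
        continue
    # Machine-predicted error: mean per-base error implied by Phred scores.
    for q in read.query_qualities:
        pred_err += 10.0 ** (-q / 10.0)
        pred_bases += 1
    # Mapping-observed error: edit distance from the NM tag.
    # (NM counts indels too, so this slightly overstates substitutions.)
    if read.has_tag("NM"):
        obs_err += read.get_tag("NM")
        obs_bases += read.query_alignment_length

print("machine-predicted error rate:", pred_err / pred_bases)
print("mapping-observed error rate: ", obs_err / obs_bases)
```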
Panda
Panda is interesting because it is a recent de novo assembly of a large
genome of approximately the same size as the banana slug's (~3 Gb). It
was also done with SOAPdenovo, which we were able to use to assemble
our slug data. Panda is also the only known large genome yet assembled
de novo using only Illumina/Solexa reads.
Panda Genome statistics
38.5x coverage of the panda genome yielded an N50 contig size of 1,483 bp
218 total lanes of Illumina data
Comparison against Bacterial Artificial Chromosomes (BACs).
A good computational challenge:
Subdivide the short reads into the regions they group into; then local de novo assemblies can be run on those subsets of reads. Biologically, this is what BACs accomplish.
Example, Shorty: map reads to a contig, then map out to reads in other contigs, and map back again. This collects a set of reads that likely belong together, which can then be assembled.
SOAPdenovo can be used to get initial contigs. Reads can then be mapped onto those contigs and gathered together: store the contigs in memory and stream the read data out to sub-assemblers (see the sketch below). PhD-level questions: can we make an efficient parallel assembler out of this? How do we stream through the data and partition it efficiently?
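A minimal sketch of the partitioning step under those assumptions: bucket mapped reads by the contig they (or their mate) hit, so each bucket can be streamed to a local sub-assembler. The file name and one-pass bucketing are illustrative; a real pipeline would also pull in the read sequences for each bucket:

```python
import collections
import pysam

# Hypothetical BAM: reads mapped back to initial SOAPdenovo contigs.
bam = pysam.AlignmentFile("reads_vs_contigs.bam", "rb")

buckets = collections.defaultdict(list)
for read in bam:
    if read.is_unmapped:
        continue
    # A read joins the bucket of its own contig; mate-pair links also
    # add it to the mate's contig, pulling related read sets together.
    buckets[read.reference_name].append(read.query_name)
    if read.is_paired and not read.mate_is_unmapped:
        mate_contig = read.next_reference_name
        if mate_contig != read.reference_name:
            buckets[mate_contig].append(read.query_name)

# Each bucket would be streamed out to a local de novo sub-assembly.
for contig, names in buckets.items():
    print(contig, len(names), "reads")
```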