Panda is interesting because it is a recent de novo assembly of a large
genome, approximately the same size as the banana slug's (~3 Gb). It was
also assembled with SOAPdenovo, the same assembler we were able to use on
our slug data. Panda is also the only large genome known so far to have
been assembled de novo from Illumina/Solexa reads alone.
Panda genome statistics
38.5x coverage of the panda genome yielded an N50 contig size of 1,483
218 total lanes of Illumina data
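Since the notes quote an N50 contig size, it may help to recall how that statistic is computed. A minimal sketch (the function name and example lengths are illustrative, not from the panda paper):

```python
def n50(lengths):
    # N50: the contig length L such that contigs of length >= L
    # together contain at least half of the total assembled bases.
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

print(n50([100, 200, 300, 400, 500]))  # 400
```

Half the total (1500/2 = 750) is first reached walking down from the longest contig at length 400, so that is the N50.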
Comparison with Bacterial Artificial Chromosomes (BACs).
A good computational challenge:
Subdivide the short reads into the regions they group into; then you can run local de novo assemblies on each subset of reads. Biologically, this is what BACs accomplish.
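The grouping step above can be sketched as simple read binning: given alignments of reads to regions (from any short-read mapper; the `(read_id, contig_id)` pair representation here is an assumption for illustration), collect the reads that share a region so each bin can be handed to a local assembler.

```python
from collections import defaultdict

def bin_reads(alignments):
    # alignments: iterable of (read_id, contig_id) pairs.
    # Group reads by the region they map to, so each group
    # can be fed to a local de novo assembly.
    bins = defaultdict(set)
    for read_id, contig_id in alignments:
        bins[contig_id].add(read_id)
    return bins

bins = bin_reads([("r1", "c1"), ("r2", "c1"), ("r3", "c2")])
print(sorted(bins["c1"]))  # ['r1', 'r2']
```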
Example: Shorty maps reads to a contig, follows mate pairs out to reads in other contigs, and then maps back. It collects a set of reads that likely belong together and assembles them locally.
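A rough sketch of that recruitment loop, assuming precomputed lookup tables (`read_to_contig` and `mate_of` are hypothetical names, not Shorty's actual data structures): start from the reads on a seed contig, follow mate links out, pull in the reads co-located on the mates' contigs, and repeat until nothing new is found.

```python
def recruit(seed_contig, read_to_contig, mate_of):
    # Shorty-style read recruitment (sketch, not Shorty's real code):
    # collect all reads reachable from a seed contig via mate pairs
    # and shared contigs, for a subsequent local reassembly.
    collected = {r for r, c in read_to_contig.items() if c == seed_contig}
    frontier = set(collected)
    while frontier:
        # follow mate links out of the current read set
        mates = {mate_of[r] for r in frontier if r in mate_of}
        # pull in every read on the contigs those mates landed on
        contigs = {read_to_contig[m] for m in mates if m in read_to_contig}
        nearby = {r for r, c in read_to_contig.items() if c in contigs}
        frontier = (mates | nearby) - collected
        collected |= frontier
    return collected

read_to_contig = {"r1": "c1", "r2": "c2", "r3": "c2", "r4": "c3"}
mate_of = {"r1": "r2", "r2": "r1", "r3": "r4", "r4": "r3"}
print(sorted(recruit("c1", read_to_contig, mate_of)))
# ['r1', 'r2', 'r3', 'r4']
```

In practice the recruitment would be bounded (by iteration count or insert-size constraints) so a single repetitive region does not pull in the whole read set.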
Can use SOAPdenovo to get initial contigs, then map read pieces onto those contigs and gather related reads together. Keep the contigs in memory and stream the read data out to sub-assemblers. PhD-level questions: can we make an efficient parallel assembler out of this? How do we stream through the data and partition it efficiently? How can we get efficient ways of dealing with all of this?
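The streaming idea above can be sketched in a few lines: only the contig index stays resident, the reads are seen once, and each read is routed to the queue of whichever sub-assembler owns the region it maps to. The `assign` function (mapping a read to a partition key) is a stand-in for a real aligner lookup, and the in-memory lists stand in for files or worker pipes.

```python
from collections import defaultdict

def partition_stream(read_stream, assign):
    # Single pass over the reads: route each read to the
    # sub-assembler responsible for the region it maps to.
    # `assign` is a hypothetical read -> partition-key function;
    # in a real pipeline it would consult the contig index.
    queues = defaultdict(list)  # in practice: files or worker pipes
    for read in read_stream:
        queues[assign(read)].append(read)
    return queues

# Toy example: partition reads by a fake "region" suffix.
queues = partition_stream(["r1a", "r2b", "r3a"], lambda r: r[-1])
print(queues["a"])  # ['r1a', 'r3a']
```

The open question from the notes is exactly how well this scales: a balanced partition function and bounded per-queue memory are what would make the parallel version efficient.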