Panda is interesting because it is a recent de novo assembly of a large
genome, approximately the same size as the banana slug's (~3 Gb). It was
also assembled with SOAPdenovo, the same assembler we were able to use on
our slug data. Panda is also the only large genome known so far to have
been assembled de novo from Illumina/Solexa reads alone.
Panda genome statistics
38.5x coverage of the panda genome yielded an N50 contig size of 1,483
218 total lanes of Illumina data
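Since the notes quote an N50 contig size, it may help to recall how that statistic is computed. A minimal sketch (the function name and example lengths are illustrative, not from the panda paper):

```python
def n50(lengths):
    # N50: the contig length L such that contigs of length >= L
    # together contain at least half of the total assembled bases.
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

print(n50([100, 200, 300, 400, 500]))  # 400
```

Half the total (1500/2 = 750) is first reached walking down from the longest contig at length 400, so that is the N50.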
Comparison with Bacterial Artificial Chromosomes (BACs).
A good computational challenge:
Subdivide the short reads into the regions they group into; then you can run local de novo assemblies on each subset of reads. Biologically, this is what BACs accomplish.
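The grouping step above can be sketched as simple read binning: given alignments of reads to regions (from any short-read mapper; the `(read_id, contig_id)` pair representation here is an assumption for illustration), collect the reads that share a region so each bin can be handed to a local assembler.

```python
from collections import defaultdict

def bin_reads(alignments):
    # alignments: iterable of (read_id, contig_id) pairs.
    # Group reads by the region they map to, so each group
    # can be fed to a local de novo assembly.
    bins = defaultdict(set)
    for read_id, contig_id in alignments:
        bins[contig_id].add(read_id)
    return bins

bins = bin_reads([("r1", "c1"), ("r2", "c1"), ("r3", "c2")])
print(sorted(bins["c1"]))  # ['r1', 'r2']
```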
Example: Shorty maps reads to a contig, follows mate pairs out to reads in other contigs, and then maps back. It collects a set of reads that likely belong together and assembles them locally.
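A rough sketch of that recruitment loop, assuming precomputed lookup tables (`read_to_contig` and `mate_of` are hypothetical names, not Shorty's actual data structures): start from the reads on a seed contig, follow mate links out, pull in the reads co-located on the mates' contigs, and repeat until nothing new is found.

```python
def recruit(seed_contig, read_to_contig, mate_of):
    # Shorty-style read recruitment (sketch, not Shorty's real code):
    # collect all reads reachable from a seed contig via mate pairs
    # and shared contigs, for a subsequent local reassembly.
    collected = {r for r, c in read_to_contig.items() if c == seed_contig}
    frontier = set(collected)
    while frontier:
        # follow mate links out of the current read set
        mates = {mate_of[r] for r in frontier if r in mate_of}
        # pull in every read on the contigs those mates landed on
        contigs = {read_to_contig[m] for m in mates if m in read_to_contig}
        nearby = {r for r, c in read_to_contig.items() if c in contigs}
        frontier = (mates | nearby) - collected
        collected |= frontier
    return collected

read_to_contig = {"r1": "c1", "r2": "c2", "r3": "c2", "r4": "c3"}
mate_of = {"r1": "r2", "r2": "r1", "r3": "r4", "r4": "r3"}
print(sorted(recruit("c1", read_to_contig, mate_of)))
# ['r1', 'r2', 'r3', 'r4']
```

In practice the recruitment would be bounded (by iteration count or insert-size constraints) so a single repetitive region does not pull in the whole read set.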
Can use SOAPdenovo to get initial contigs, then map read pieces onto those contigs and gather related reads together. Keep the contigs in memory and stream the read data out to sub-assemblers. PhD-level questions: can we make an efficient parallel assembler out of this? How do we stream through the data and partition it efficiently? How can we get efficient ways of dealing with all of this?
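The streaming idea above can be sketched in a few lines: only the contig index stays resident, the reads are seen once, and each read is routed to the queue of whichever sub-assembler owns the region it maps to. The `assign` function (mapping a read to a partition key) is a stand-in for a real aligner lookup, and the in-memory lists stand in for files or worker pipes.

```python
from collections import defaultdict

def partition_stream(read_stream, assign):
    # Single pass over the reads: route each read to the
    # sub-assembler responsible for the region it maps to.
    # `assign` is a hypothetical read -> partition-key function;
    # in a real pipeline it would consult the contig index.
    queues = defaultdict(list)  # in practice: files or worker pipes
    for read in read_stream:
        queues[assign(read)].append(read)
    return queues

# Toy example: partition reads by a fake "region" suffix.
queues = partition_stream(["r1a", "r2b", "r3a"], lambda r: r[-1])
print(queues["a"])  # ['r1a', 'r3a']
```

The open question from the notes is exactly how well this scales: a balanced partition function and bounded per-queue memory are what would make the parallel version efficient.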