
  • Analyzing data from previously sequenced mollusk genomes
    • Broad Institute /ftp/pub/assemblies/invertebrates/aplysia (sea hare)
    • Need two files in this directory to analyze the data
    • LibStatsOverview.out
      • Five libraries were used in the genome assembly.
        • Insert sizes of 2,000, 4,000, 4,000, 10,000, and 40,000
        • Coverage ranges between 0.04x and 60.83x
        • These coverage numbers are dubious, but the insert sizes are of interest.
    • LibStatsOverview.out
      • Read sizes seen in this file are around 600 bases.
        • Indicates the sequencing was done mostly by 454.
    • Used an overlap-layout-consensus method (ARACHNE)
  • 13x coverage with 454 produced a publishable genome
  • Probably need a lot more coverage with the shorter-read-length Illumina reads.
  • Banana slug insert size estimation
    • SOAPdenovo estimate for insert size: 135
      • i.e., the gap between the two sequencing reads of a pair
    • DNA fragment length during library prep: 250-375
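
The relationship between the two figures above can be sketched in a few lines, reading "insert size" as the inner gap between the two reads of a pair; the 100 bp read length is an assumption, not stated in the notes:

```python
# Relationship between library fragment length and the inner "insert"
# as the notes use the term: fragment = [read1][ gap ][read2].
READ_LEN = 100  # assumed read length; not stated in the notes

def inner_gap(fragment_len, read_len=READ_LEN):
    """Gap between the facing ends of the two reads of a pair."""
    return fragment_len - 2 * read_len

# Fragment lengths from library prep were 250-375 bp (from the notes):
for frag in (250, 375):
    print(frag, "bp fragment ->", inner_gap(frag), "bp gap")
# A 135 bp estimate is consistent with ~335 bp fragments: 335 - 200 = 135.
```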
  • Trimming reads may improve the assembly.
  • First read may have higher quality than the second read
  • Estimating error rates
    • Use quality information directly from the sequencing machine.
    • Estimate error after mapping.
    • These two measures should be correlated to each other.
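
Both error estimates described above can be sketched directly; the Phred conversion is the standard one, and `mapped_error` is a simplified stand-in for tallying mismatches from real alignments:

```python
# Two independent error-rate estimates, as described in the notes.

def phred_to_error(q):
    """Error probability implied by a Phred quality score Q."""
    return 10 ** (-q / 10)

def mean_quality_error(quals):
    """Expected error rate from the machine's own quality scores."""
    return sum(phred_to_error(q) for q in quals) / len(quals)

def mapped_error(mismatches, aligned_bases):
    """Empirical error rate from mismatches observed after mapping."""
    return mismatches / aligned_bases

# The two should track each other; Q20 implies a 1% error rate,
# so a mapping-based estimate far from that flags a problem.
print(phred_to_error(20), phred_to_error(30))
```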

Panda Genome statistics

  • 38.5x coverage of the panda genome yielded an N50 contig size of 1,483
    • Adding paired end distances yielded larger and larger contigs
  • 218 total lanes of Illumina data
    • 73x raw coverage
    • 56x coverage after filtering reads
    • Read lengths ranged from 35 bp to 71 bp
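
The coverage figures above follow from simple arithmetic; a minimal sketch, assuming a ~2.4 Gb panda genome (the genome size is not given in these notes):

```python
# Coverage = total sequenced bases / genome size.
GENOME_SIZE = 2.4e9  # assumed panda genome size; not stated in the notes

def coverage(total_bases, genome_size=GENOME_SIZE):
    return total_bases / genome_size

raw_bases = 73 * GENOME_SIZE   # bases implied by 73x raw coverage
discarded = 1 - 56 / 73        # fraction lost going from 73x to 56x
print(f"{raw_bases:.2e} raw bases; {discarded:.0%} removed by filtering")
```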
  • Comparing against bacterial artificial chromosomes (BACs)
    • Made 9 BACs as a quality check.
    • Each BAC is 90-99.9% covered by the short-read-assembled genome.
      • Indicates the assembly is patchy.
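
The per-BAC coverage check can be sketched as an interval-union computation over alignment hits; the intervals below are illustrative toy data, not the panda paper's:

```python
def covered_fraction(bac_len, alignments):
    """Fraction of a BAC covered by the union of alignment intervals.

    alignments: (start, end) half-open intervals of assembly hits on the BAC.
    """
    merged = []
    for start, end in sorted(alignments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend overlapping hit
        else:
            merged.append([start, end])
    return sum(e - s for s, e in merged) / bac_len

# Toy 1,000 bp BAC with two overlapping hits and one gap: 85% covered.
print(covered_fraction(1000, [(0, 400), (350, 700), (800, 950)]))
```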
  • According to the panda paper, we need reads of about 50-75 base pairs each to assemble a genome from short reads.
  • How can we better utilize paired data in a de Bruijn assembler?
    • We need to be able to subdivide the reads into smaller subsets, because shorter segments are easier to assemble.
      • Shorty uses a method similar to this.
      • Do an initial assembly to create short contigs.
      • Several reads can map to the same contig.
        • Take all paired reads that map to a short contig and group them into paired clusters.
        • Take all reads that map to the original contig.
        • Assemble the reads that fall within the same region.
        • This could be done with a mate-pair library of any insert size.
        • A danger of this method is its susceptibility to noise and randomly paired reads.
        • Another problem is the speed of this mapping step.
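
The recruitment steps above can be sketched as follows; the read IDs, the mapper output format, and the clustering granularity are all illustrative assumptions:

```python
from collections import defaultdict

def recruit_pairs(pairs, read_to_contig):
    """Group read pairs by the short contig one mate maps to.

    pairs: iterable of (read_id, mate_id) tuples.
    read_to_contig: read_id -> contig_id, as produced by a mapper.
    Returns contig_id -> set of reads recruited for local reassembly.
    """
    clusters = defaultdict(set)
    for read, mate in pairs:
        contig = read_to_contig.get(read)
        if contig is not None:
            # Keep the mapped read and pull in its mate as well.
            clusters[contig].update((read, mate))
    return clusters

clusters = recruit_pairs(
    [("r1", "r1m"), ("r2", "r2m"), ("r3", "r3m")],
    {"r1": "c1", "r2": "c1", "r3": "c2"},
)
print(dict(clusters))
# Noise caveat from the notes: chimeric or randomly paired reads land
# in the wrong cluster, so filtering is needed before local assembly.
```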
  • SOAPdenovo runs in about 7-9 hours on the banana slug data to create initial contigs, producing about 3 million contigs.
    • Use mapping software to map pieces onto contigs.
    • Parallelize assembly of many small chunks of reads.
    • For each of the 3 million contigs, build new contigs using paired reads.
    • How can you parallelize this algorithm such that it can be used for de novo assembly?
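
One answer to the parallelization question is to treat each contig's read cluster as an independent job, since the per-contig reassemblies do not share state; `assemble_cluster` below is a hypothetical placeholder for running a real local assembler:

```python
from concurrent.futures import ThreadPoolExecutor

def assemble_cluster(contig_id, reads):
    """Hypothetical local assembly step; here it just sorts read IDs,
    standing in for running a real assembler on one cluster."""
    return contig_id, sorted(reads)

def parallel_reassembly(clusters, workers=4):
    """Each contig's read cluster is independent, so the ~3 million
    per-contig jobs can run concurrently and be merged at the end."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(assemble_cluster, cid, reads)
                   for cid, reads in clusters.items()]
        return dict(f.result() for f in futures)

print(parallel_reassembly({"c1": ["r2", "r1"], "c2": ["r3"]}))
```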
  • SOAPdenovo comes with a program called “GapCloser” that may use this approach.
    • Does GapCloser implement this method? If so, how can we parallelize it?
  • Next slug to be sequenced should be photographed during dissection in order to identify the species.
lecture_notes/06-02-2010.1275516950.txt.gz · Last modified: 2010/06/02 15:15 by hyjkim