lecture_notes:06-02-2010

* Analyzing data from previously sequenced mollusk genomes
  * Broad Institute /ftp/pub/assemblies/invertebrates/aplysia (sea hare)
  * Need two files in this directory to analyze the data
    * LibStatsOverview.out
      * 5 libraries they're using in the genome assembly
      * Insert sizes of 2,000, 4,000, 4,000, 10,000 and 40,000
      * Coverage ranges between 0.04x and 60.83x
      * These coverage numbers are dubious, but the insert sizes are of interest
    * LibStatsOverview.out
      * Read sizes seen in this file are around 600 bases
      * Indicates the sequencing was done mostly by 454
  * Used an overlap-consensus method (ARACHNE)
  * 13x coverage with 454 produced a publishable genome
  * Probably need a lot more coverage with the shorter-read Illumina data
* Banana slug insert size estimation
  * SOAPdenovo estimate for insert size: 135
    * This is the size of the gap between the two sequencing reads
  * DNA fragment length during library prep: 250-375
* Trimming reads may improve the assembly
  * The first read may have higher quality than the second read
* Estimating error rates
  * Use quality information directly from the sequencing machine
  * Estimate error after mapping
  * These two measures should be correlated with each other
* Panda genome statistics
  * 38.5x coverage of the panda genome yielded an N50 contig size of 1,483
  * Adding paired-end distances yielded larger and larger contigs
  * 218 total lanes of Illumina data
    * 73x raw coverage
    * 56x coverage after filtering reads
    * Read lengths ranged from 35 bp to 71 bp
  * Comparison against bacterial artificial chromosomes (BACs)
    * Made 9 BACs as a quality check
    * Each BAC is covered 90-99.9% by the short-read-assembled genome
    * Indicates the genome is very patchy
  * According to the panda paper, we need read lengths of about 50-75 basepairs to assemble a genome from short reads
* How can we better utilize paired data in a de Bruijn assembler?
  * We need to subdivide reads into smaller subsets, because shorter segments are easier to assemble
  * Shorty uses a method similar to this:
    * Do some assembly to create short contigs
    * Several reads can map to the same contig
    * Take all paired reads that map to the short contig and map them to paired clusters
    * Take all reads that map to the original contig
    * Assemble reads that fall within the same region
    * Could be done with a mate-pair library of any size
    * One danger of this method is susceptibility to noise and randomly paired reads
    * Another problem is the speed of this mapping
  * SOAPdenovo runs in about 7-9 hours on the banana slug data to create the initial contigs, producing about 3 million contigs
  * Use mapping software to map pieces onto contigs
  * Parallelize the assembly of many small chunks of reads
    * For each of the 3 million contigs, build new contigs using paired reads
  * How can this algorithm be parallelized so that it can be used for de novo assembly?
    * SOAPdenovo comes with a program that can do this, called "GapCloser"
    * Is GapCloser that program? If it is, can we parallelize it?
* The next slug to be sequenced should be photographed during dissection in order to identify the species
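The gap between SOAPdenovo's insert-size estimate (135, the distance between the two reads) and the library-prep fragment length (250-375) comes down to what is being measured: inner distance versus outer distance. A minimal sketch of estimating both from mapped read-pair positions (the `pair_positions` input format and the fixed read length are assumptions for illustration, not SOAPdenovo's actual interface):

```python
def estimate_insert_size(pair_positions, read_len):
    """Estimate mean inner gap and mean fragment length from read pairs.

    pair_positions: list of (left_start, right_start) leftmost mapping
    coordinates of properly oriented pairs on the same contig
    (hypothetical input format).
    read_len: fixed read length in bases.
    """
    gaps, frags = [], []
    for left, right in pair_positions:
        # Outer distance: end of right read minus start of left read.
        frags.append((right + read_len) - left)
        # Inner distance: gap between the end of the left read and
        # the start of the right read (what SOAPdenovo's 135 measures).
        gaps.append(right - (left + read_len))
    n = len(pair_positions)
    return sum(gaps) / n, sum(frags) / n

# Toy example (not real slug data): two pairs, 75 bp reads.
gap, frag = estimate_insert_size([(0, 200), (10, 210)], 75)
```

With 75 bp reads, a 275 bp fragment gives a 125 bp inner gap, which is why the two numbers in the notes need not match.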
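The panda paper's N50 contig size of 1,483 means that half of all assembled bases sit in contigs of at least that length. A small self-contained sketch of the standard N50 computation (toy contig lengths, not the real panda assembly):

```python
def n50(contig_lengths):
    """N50: largest length L such that contigs of length >= L together
    cover at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0

# Toy example: 2,000 total bases; the 1,000 bp contig alone covers half.
print(n50([100, 400, 500, 1000]))  # -> 1000
```

This is why "adding paired-end distances yielded larger and larger contigs" shows up directly as a growing N50.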
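The Shorty-style subdivision above — seed contigs, pull in each read's mate, reassemble locally — becomes embarrassingly parallel once reads are bucketed per contig. A hedged sketch of that bucketing step (the `mappings` and `mates` dictionaries are assumed inputs from some mapper, not any particular tool's output format):

```python
from collections import defaultdict

def partition_reads_by_contig(mappings, mates):
    """Group each mapped read and its mate by the seed contig it maps to.

    mappings: dict read_id -> contig_id (from a mapper; hypothetical format)
    mates:    dict read_id -> mate read_id
    Returns dict contig_id -> set of read_ids to reassemble locally.
    """
    buckets = defaultdict(set)
    for read, contig in mappings.items():
        buckets[contig].add(read)
        buckets[contig].add(mates[read])  # pull in the paired read
    return buckets

# Toy example: two reads seed contig c1, one seeds c2.
buckets = partition_reads_by_contig(
    {'r1': 'c1', 'r2': 'c1', 'r3': 'c2'},
    {'r1': 'm1', 'r2': 'm2', 'r3': 'm3'},
)
```

Each bucket can then be handed to an independent assembly job, which is one way the work could be spread across the ~3 million initial contigs; whether GapCloser works this way is the open question in the notes.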

lecture_notes/06-02-2010.1275516950.txt.gz · Last modified: 2010/06/02 22:15 by hyjkim