Differences

This shows you the differences between two versions of the page.

--- computer_resources:assemblies:mitochondrion [2011/06/21 23:32]
karplus [Method] added link to determining_paired-end_insert_size
+++ — (current)
@@ Line 1: / Line 1: @@
-====== Mitochondrion ======
-The mitochondrion was assembled by Kevin Karplus in the assemblies/slug/barcode-of-life/ directory. The directory has this odd name because the first attempt was only to recover the COX1 gene, which the [[http://www.boldsystems.org/|BOLD (barcode of life database)]] project uses to characterize eukaryotes by their mitochondrial sequences.  When it became clear that the whole mitochondrial genome was well covered in the Illumina data, the project switched to reconstructing the full mitochondrial genome.
-===== Method =====
-There were many iterations and many different attempts at assembling the mitochondrial genome.  Most iterations consisted of taking a draft genome, mapping all the Illumina reads to it using bwa, selecting all reads that mapped together with their mates (even if the mates didn't map), and reassembling those reads.
-The initial attempts used just some contigs from the previous year's (2010) attempt at a whole-genome assembly using SOAPdenovo.  Later, I found that the 2011 SOAPdenovo assembly had most of the genome in one contig (ending at two repeat regions).  Both SOAPdenovo and abyss were used to assemble the reads.  Generally, abyss produced a somewhat longer longest contig, and the two assemblers agreed on the parts they had in common.
-   * Started with a search of SOAPdenovo-assembly1/k31/soapSlug.scafSeq for scaffolds that matched examples from other mollusks.
-   * Looked for 454 reads that extended or joined contigs in the scaffolds.
-   * Repeated (sometimes using more sensitive searches) until no more credible scaffolds from the SOAPdenovo-assembly1/k31/ assembly or 454 reads were found.
-   * The 454 coverage of the mitochondrion is so slight as to be nearly useless, so instead we can iterate:
-        - find all Illumina reads that map to the mitochondrial draft, using BWA
-        - assemble them using SOAPdenovo.
-   * It looks like the Illumina reads have about 228x coverage of the mitochondrion, but coverage is patchy, and it seems to be difficult to close the circle (at least with SOAPdenovo).
-   * It turns out that a lot of the hard hand work and iterated searching to assemble the mitochondrion was not necessary, as the SOAPdenovo-assembly2/all/k63/illumina-454-all_63-mers.scafSeq assembly now has a 14960-long contig (not scaffold!) which is an almost-full-length mitochondrion, roughly as good as the best I'd managed to assemble from the SOAPdenovo-assembly1/ bits.
-   * Iterating mapping reads with BWA and assembling them with SOAPdenovo made some progress, but there was a gap that just wouldn't close.
-   * Switching to abyss (version 1.2.7) for the assembly of the reads produced a much larger contig (15535 bases long after pasting a suggestion from one abyss assembly onto another).
-   * Iterating search and abyss assembly does not lengthen the large contig.  Cleaning up and calling the consensus with bwa+samtools+bcftools doesn't change things much either.  There seems to be a large variation in coverage (from 20x to 2300x, with a median of 225x), so I suspect that there is a repeat region at the beginning of the current contig that may have 10 repeats in it.
-Alternating between finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases.  Eventually, I wrote a new program (look-for-exit [needs its own page FIXME]) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers.  I was eventually able to close the circle this way and get a complete genome.  However, there is one repeat region with long repeats (about a dozen copies of a 615±1-base repeat) that I could not order: the differences between copies were too far apart to disambiguate the order with the [[bioinformatic_tools:bwa#determining_paired-end_insert_size|short fragment lengths]] of the data available.  I think I have all the variants of the repeat, but in some cases I can't even tell which first half of the repeat goes with which second half.
-At some point in the process, I rotated the genome to correspond to the closest previous mitochondrial genome: //Biomphalaria glabrata// strain M, a gastropod.
-I used bwa with samtools mpileup to make new consensus sequences, and iterated that a few times to get about as good an assembly as I can without longer fragments or PCR to determine the order of the repeats.
-===== Mitochondrial sequence =====
-The first draft sequence is available as {{mitochondrion-draft1.fasta.gz|gzipped fasta file}}.  This corresponds to /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-45/consensus-6 on campusrocks.
-===== Annotation =====
-I sent the sequence to [[http://dogma.ccbb.utexas.edu/]] for annotation, and it quickly created a very crude web interface that almost works.  Unfortunately, it seems to have missed some of the protein genes, and it provides no way to download its annotation in GenBank format.  Since there are hundreds of tRNA genes (the repeat region is full of them), I'm not tempted to try screen-scraping.  A better way to annotate the mitochondrion needs to be found.
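
The read-selection step in the Method section (keep every read that mapped to the draft, plus its mate even if the mate did not map) can be illustrated with a minimal Python sketch. It is not the script used for this assembly. It assumes the bwa output is available as SAM text and that both ends of a pair share the same read name; the function name ''select_pairs'' is hypothetical.

```python
UNMAPPED = 0x4  # SAM flag bit: this read is unmapped


def select_pairs(sam_lines):
    """Return the names of read pairs in which at least one end mapped.

    Assumes both ends of a pair carry the same name in the first SAM column,
    so keeping a name keeps the mapped read and its (possibly unmapped) mate.
    """
    keep = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if not flag & UNMAPPED:
            keep.add(name)
    return keep
```

The returned names could then be used to pull both ends of each pair out of the original read files before reassembly.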

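The coverage reasoning in the Method section (depth ranging from 20x to 2300x with a median of 225x, suggesting about 10 copies in the repeat region) can be sketched as follows. This is a rough illustration, not the original analysis. It assumes per-base depth is given as tab-separated lines of reference name, position, and depth, as ''samtools depth'' prints them; the function names are hypothetical.

```python
from statistics import median


def coverage_summary(depth_lines):
    """Return (min, median, max) depth from 'name<TAB>pos<TAB>depth' lines."""
    depths = [int(line.split("\t")[2]) for line in depth_lines if line.strip()]
    return min(depths), median(depths), max(depths)


def estimated_copies(peak_depth, typical_depth):
    """Crude copy-number estimate: depth in the repeat over single-copy depth."""
    return peak_depth / typical_depth
```

With the figures quoted above, ''estimated_copies(2300, 225)'' is a little over 10, in line with the suspicion of about 10 repeats. The estimate is only as good as the assumption that the median reflects single-copy depth, and positions with no coverage may be missing from the input.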