User Tools

Site Tools


Mitochondrial sequence

The mitochondrion was re assembled in 2015, follow the new assembly here, 2015 Mitochondrion assembly . This page is for the 2011 mitochondrial assembly.

The first draft sequence is available as draft1 gzipped fasta file. This corresponds to /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-45/consensus-6 on campusrocks. It has 23,642 bases.


The mitochondrion was assembled by Kevin Karplus in the assemblies/slug/barcode-of-life/ directory. The reason for the strange name for the directory was that at first the attempt was just to recover the COX1 gene that is used for the BOLD (barcode of life database) project to characterize eukaryotes by their mitochondrial sequences. When it became clear that the whole mitochondrial genome was well covered in the Illumina data, the project switched to trying to reconstruct the full mitochondrial genome.


There were many iterations and many different attempts at assembling the mitochondrial genome. Most of the iterations consisted of taking some draft genome, mapping all the Illumina reads to it using bwa, and selecting out all reads that mapped and their paired ends (even if the pairs didn't map), and reassembling those reads.

The initial attempts used just some contigs from the previous year's (2010) attempt at a whole-genome assembly using SOAPdenovo. Later, I found that the 2011 SOAPdenovo assembly had most of the genome in one contig (ending at two repeat regions). Both SOAPdenovo and abyss were used to try to assemble the reads. Generally, abyss got somewhat longer longest contigs, and the two assemblers agreed on the parts they had in common.

  • Started with a search of SOAPdenovo-assembly1/k31/soapSlug.scafSeq for scaffolds that matched examples from other mollusks.
  • Looked for 454 reads that extended or joined contigs in scaffold
  • Repeated (sometimes using more sensitive searches) until no more credible scaffolds from the SOAPdenovo-assembly1/k31/ assembly nor 454 reads were found.
  • The 454 coverage of the mitochondrion is so slight as to be nearly useless, so instead we can iterate:
    1. find all Illumina reads that map to the mitochondrial draft, using BWA
    2. assemble them using SOAPdenovo.
  • It looks like the Illumina reads have about 228x coverage of the mitochondrion, but coverage is patchy, and it seems to be difficult to close the circle (at least with SOAPdenovo).
  • It turns out that a lot of the hard hand work and iterated searching to assemble the mitochondrion was not necessary, as the SOAPdenovo-assembly2/all/k63/illumina-454-all_63-mers.scafSeq assembly now has a 14960-long contig (not scaffold!) which is an almost-full-length mitochondrion, roughly as good as the best I'd managed to assemble from the SOAPdenovo-assembly1/ bits.
  • Iterating mapping reads with BWA and assembling them with SOAPdenovo made some progress, but there was a gap that just wouldn't close.
  • Switching to abyss (version 1.2.7) for the assembly of the reads made a much larger contig (15535-long after pasting on a suggestion from one abyss assembly onto another).
  • Iterating search and abyss assembly does not lengthen the large contig. Cleaning up and calling the consensus with bwa+samtools+bcftools doesn't change things much either. There seems to be a large variation in coverage (from 20x to 2300x, with a median of 225x), so I suspect that there is a repeat region at the beginning of the current contig that may have 10 repeats in it.

Alternating finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases. Eventually, I wrote a new program (look-for-exit) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers. I was eventually able to close the circle this way, and get a complete genome, though there is one repeat region with long repeats (about a dozen copies of a 615±1 long repeat) that I could not order, because the differences between repeats were far enough apart that I couldn't disambiguate the order with the short fragment lengths of the data available. I think I have all the variants of repeat, but in some cases I can't even tell which first half of the repeat goes with which second half.

At some point in the process, I rotated the genome to correspond to the closest previous mitochondrial genome: Biomphalaria glabrata strain M, a gastropod.

I used bwa with mpileup to make new consensus sequences, and iterated that a few times to get about as good an assembly as I can without some longer fragments or some PCR to determine the order.

Finishing the genome

We plan to use PCR to amplify parts of the repeat region and do Sanger sequencing to confirm the sequence on those blocks. To find distinguishing features in the repeat region to design primers, the look-for-exit program was used to walk forward and backward through the repeat, looking for alternative paths that had significant read support. All the variants were recorded in README files (in assemblies/slug/barcode-of-life/map-Illumina-raw-42/ and assemblies/slug/barcode-of-life/map-Illumina-raw-45/) and look-for-exit was used to build putative single copies of repeats from each of the observed variants.

The repeat region starts at position 7037 in draft1, with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT There are 12 full (615±1 long) repeats there, then a partial one (456 long) that ends with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT at 14867.

Details of the PCR strategy are still being worked out, trying to minimize the effort and cost of the wet-lab work.

Blocks in the long repeats

Here are some blocks that appear to be in the repeats (the repeat consists of blocks A,B,C,D,E in order repeated a dozen or so times, ending with block D).

Before the repeats:


The long repeat blocks (one each of A, B, C, D, E):





































Last block (D8 until the last 6 bases)


There are several links between blocks (particularly E, A, and B) that can be made even with the short-read data, but the long stretches with few SNPs in blocks B, C, and D make disambiguating the order of the repeats difficult. I don't believe that any of the blocks here are chimeric, but some have low enough coverage that they may be just a result of sequencing error, rather than real variants in the genome.

Some of the more confident links are start→A2, A1→B1, A2→B2, A3→B2+B3+B4, A4→B4+B5+B8, A5→B7, E1+E2+E3→A3, E4→A4+A5, E5→A1, D9→end.

[Note:D6 and D7 were eliminated—they turned out to be identical to D1, but the bases had been miscounted in doing the SNP marking.]

Primer design

We need primers that are of reasonable lengths (20–30 bases) and that uniquely identify particular blocks. Ideally, there should be several differences in sequence within the primer, since having only a single mismatch could result in PCR amplifying the wrong target. We may not need a primer for every block, as Sanger sequencing can read through an entire repeat. It may be enough just to have primers for the more easily distinguished blocks, measuring the lengths between them and sequencing the ones that have unique lengths.

Another approach we are looking at is to do long-range PCR to amplify the whole repeat region (using unique primers just outside the repeat region), then drop sequencing primers into unique spots within the repeat region, Sanger sequencing backwards and forwards from each unique spot. Each spot would then yield about 2k contiguous bases of the repeat. The crucial thing is that the primers for selecting the region are unique to the mitochondrion (not in nuclear DNA!) and that the long-range PCR is capable of amplifying the whole repeat region. We selected two pairs of primers to test:

  OLIGO            start  len      tm     gc%   any    3' seq
  LEFT PRIMER         24   27   50.14   33.33  8.00  2.00 ACTCTATCGAATAGTGTATATTAGAGC
  RIGHT PRIMER     15495   25   54.61   48.00  5.00  2.00 GTTTCCTAGAGCCATATAGGTGAGC
        PRODUCT SIZE: 15472, PAIR ANY COMPL: 5.00, PAIR 3' COMPL: 2.00

  OLIGO            start  len      tm     gc%   any    3' seq
  LEFT PRIMER         20   31   52.60   32.26  8.00  2.00 ACTAACTCTATCGAATAGTGTATATTAGAGC
  RIGHT PRIMER     15494   24   53.47   45.83  5.00  2.00 TTTCCTAGAGCCATATAGGTGAGC
        PRODUCT SIZE: 15475, PAIR ANY COMPL: 5.00, PAIR 3' COMPL: 2.00

The product size is a bogus number, as the primer design was done with many variants of the repeats inserted between the sequences that come before and after the repeat regions. I'm expecting a length more like 8000 bases.

The primer design was done with Primer3 using the SantaLucia 1998 parameters for thermodynamics and salt correction. Both of these designs were aiming for a 55ºC melting temperature and a length of 25. GC content was required to be at least 30% (higher constraints resulted in no possible left primers in the upstream sequence I had included). Both primer pairs required a 3' GC clamp of 2 bases, and the second set required the melting temperatures to be within 3 degrees of each other.

It seems that there about 28 unique spots that we can drop 25-mers in, which should give us pretty thorough coverage of the repeat region, if we can extract it. I'm a little concerned that the coverage of some of the variants is very low. This may mean that there is some variation between different mitochondria, rather than a single unique mitochondrial genome. I expect to see unique short k-mers in the mitochondrion about 150–200 times, but one of the 25-mers I had picked out as distinctive for particular blocks occurred in only 7 reads, and many in 20–100 reads. Only 2 of them had counts in the range I expected. This leads me to some concern about our ability to pick out sequencing primers that identify unique positions in the extracted long-range PCR product.

One test we should make with the PCR product is to measure its length. If there are several different products amplified, with length differences around 615 bases, then we know that the DNA contains mitochondrial genomes with different numbers of repeats, and so the problem needs to be addressed as a meta-assembly problem. If only one length comes out, I expect there to be a consensus genome that we can assemble.

Blocks in the short repeats

After I had gotten all the variants of the long repeats into the draft genome, I used look-for-exit on the whole mitochondrial genome to see if there were any variants that were missing. In this way I found another group of repeats, about 150 bases upstream from the long repeats, each about 185 long.

I used look-for-exit to find the variants of these repeats, and found several possible variants. I was only able to order a few of them, as walking forward or backward from most blocks with the longest extension that kept at least one read resulted in going to the generic block (R99) from most blocks. R01 is first and R03→R18. R15, R16, and R17 back up to R15 rather than R99, which means that they are probably preceded by R15, R07, R14, or R04.

R01: .        .   .              t       a             .  .  .   .          .                      .   .    .      .     .                      ...              ..               .             
R02: .        .   .              .       .             .  .  .   .          .                      .   .    .      .     .                      ...              .a               .             
R03: .        a   .              .       .             .  .  c   .          .                      .   .    a      .     -(at)                ...              .t               t               
R04: .        .   g              .       .             .  .  .   -a        t                      .   .    .      .     .                      ...              .a               .              
R05: .        .   .              .       .             .  .  .   .          .                      .   -c  .      .     .                      ...              .a               .              
R06: .        .   .              .       a             .  .  .   .          .                      .   .    .      .     .                      a..              ..               .             
R07: .        .   .              .       a             .  .  .   .          t                      .   .    .      .     .                      ...              .a               .             
R08: .        .   .              .      -t            .  .  .   .          .                      .   .    .      .     .                      ...              g.               .              
R09: .        .   .              .       .             .  .  .   .          .                      .   .    .      .     .                      ...              g.               .             
R10: .        .   .              .      -t            .  .  .   .          .                      .   .    .      g     .                      ...              ..               .              
R11: .        .   .              .      -t           -t .  .   .          .                      .   .    .      .     .                      ...              ..               .               
R12: .        .   .              .      -t            .  .  .   .          t                      .   .    .      .     .                      ...              ..               .              
R13: .        .   .              .       .             .  t  .   .          .                      g   t    .      g     .                      ...              .a               .             
R14: .        .   g              .       .             .  .  .   -a        t                      .   .    .      .     .                      ...              .a               .              
R15: .        .   .              .       .             .  .  .   .          t                      .   .    .      .     .                      ...              .a               .             
R16: .        .   .              .       .             .  .  .   .          t                      .   .    .      .     g                      ...              ..               .             
R17: .        .   .              .       .             .  .  .   .          t                      .   .    .      .     .                      ..g              .a               .             
R18: a        .   .              .       .             .  .  .   .          .                      .   .    .      .     .                      ...              ..               .             
R19: .        .   .              .       .             .  .  .   .          .                      .   .    .      .     .                      .g.              ..               a             
R20: .        .   .              .       .             .  .  .   .          .                      .   .    .      .     .                      ...              ..               a           
R99: .        .   .              .       .             .  .  .   .          .                      .   .    .      .     .                      . .              ..               .           

Note: I fixed the R14 annotation—it is actually identical to R04.

When I put these blocks into the genome, look-for-exit found no variants needed other than a C→T SNP a little earlier than the short repeats that had 7 reads supporting it. As this was less than 1.5% of the reads for that location, it is quite likely a sequencing error rather than a true variant.

For the short repeats, I expect around 32 copies based on expected coverage of 176 and 5671 copies of ATCTTATTAGTCTTTATTA. Of course, some of the repeat variants I have may just be sequencing errors, and there may be some exact repetitions of blocks.

Primer design

The best pair of primers I've found for the short repeat region (using the SantaLucia 1998 parameters, aiming for 55 degrees C with a temp difference of <3 degrees and a 1-base GC clamp) is

  OLIGO            start  len      tm     gc%   any    3' seq
  LEFT PRIMER        114   31   55.20   35.48  6.00  2.00 CGAGTATACAATATTTAGTTGTGCTAAGTGC
  RIGHT PRIMER      4174   27   53.85   37.04  6.00  2.00 TCACAAGGAATTTAAGGATCCACAATC
    PRODUCT SIZE: 4061, PAIR ANY COMPL: 5.00, PAIR 3' COMPL: 0.00


One of the biggest tasks in assembling the mitochondrion was separating the mitochondrial reads from the full set of reads. I needed a seed to start with, which I initially found by searching the earliest assembly for a contig that matched the COX1 protein from several different mollusks using tblastn, and to cox1 genes using blastn. I got 3 contigs for this small gene in the low-quality assembly from SOAPdenovo_assembly1/k31, and had to do iterative searching to extend the set. (I could have saved about 10 iterations by starting from the SOAPdenovo-assembly2/all/k63/ assembly instead, but it had not been completed when I started the search for mitochondrial reads.

The selection process used bwa to map reads that had been run through seqprep to the mitochondrial contigs found so far, keeping any reads that mapped or whose paired read mapped. I sometimes used the quake-corrected set used for this year's assembly, and sometimes used the set that had been run through seqprep but not corrrected with quake.

In barcode-of-life/map-Illumina-raw-45/, which has a draft of the complete mitochondrial genome to map to, I mapped the reads that had not been run through quake (that is what the “raw” signifies in the directory name, though the reads had been run through seqprep, and so were not really the raw reads). I found 61,428 reads that are most likely mitochondrial.

samtools flagstat merged.bam
61428 + 0 in total (QC-passed reads + QC-failed reads)                 
0 + 0 duplicates                                                       
60932 + 0 mapped (99.19%:nan%)                                         
35560 + 0 paired in sequencing                                         
17780 + 0 read1                                                        
17780 + 0 read2                                                        
31562 + 0 properly paired (88.76%:nan%)                                
34568 + 0 with itself and mate mapped                                  
496 + 0 singletons (1.39%:nan%)                                        
0 + 0 with mate mapped to a different chr                              
0 + 0 with mate mapped to a different chr (mapQ>=5)    

I'm wondering about the 3004 pairs that map both reads but are not “properly paired”. Are these indicating a misassembly, or is this just mismapping in the repeat region?

I'm also curious about whether I could have found the mitochondrial reads more simply by extracting every read that had a high-count k-mer. Looking at the jellyfish 19-mer counts from this set, I see counts as high as 5707 (for AATAAAGACTAATAAGATA) and 12,976 19-mers with multiplicity ≥100. But in the jellyfish counts for the whole set of reads, I see 7 low-complexity kmers that occur over 10 million times:

CCCTAACCCTAACCCTAAC     10752814                                       
AACCCTAACCCTAACCCTA     10736174                                       
CCTAACCCTAACCCTAACC     10715055                                       
CTAACCCTAACCCTAACCC     10710935                                       
AGGGTTAGGGTTAGGGTTA     10654682                                       
ACCCTAACCCTAACCCTAA     10607845                                       
AAAAAAAAAAAAAAAAAAA     10486112                                       

Even if entire reads are full of CCCTAA, that is still hundreds of thousands of reads, so separating by frequency would not have been good enough.

The most frequent 19-mer in the subset occurs 6035 times in the full set (so I may be missing 272 copies in the subset), but there are almost 209,000 more common 19-mers, so selecting by frequency would have gotten me mostly low-complexity junk, not mitochondrial sequence.

After cleaning the mitochondrial reads with jellyfish and quake, in map-Illumina-raw-45/ we have

  clean_19_dir/merged.fastq has 2,860,095 bases in 26,498 reads. 
  clean_19_dir/merged_1.fastq has 1,253,271 bases in 16,885 reads.
  clean_19_dir/merged_2.fastq has 1,252,843 bases in 16,885 reads.
  Total: 5,366,209 bases in 60,268 reads. 

The estimated coverage of the genome from fitting a gamma distribution to the 19-mers is about 216x, giving a genome-size estimate of 24,844 bases, not far from the 23642 I currently have—perhaps there are a couple more repeats than I've been estimating? But the accuracy of the genome size estimates is not likely to be that precise.

In map-Illumina-raw-46/, where the constraint on insert size was increased to keep bwa from throwing out discordant pairs, and more variants of the repeats were in the draft that was mapped to,

look-for-exit merged.fastq merged_[12].fastq


  merged.fastq has 3,027,249 bases in 27,429 reads.
  merged_1.fastq has 1,360,645 bases in 17,875 reads.
  merged_2.fastq has 1,359,545 bases in 17,875 reads.
  Total: 5,747,439 bases in 63,179 reads.  

After cleaning with quake to correct 19-mers, we have

  clean_19_dir/merged.fastq has 3,044,997 bases in 28,051 reads.
  clean_19_dir/merged_1.fastq has 1,259,044 bases in 16,974 reads.            
  clean_19_dir/merged_2.fastq has 1,258,440 bases in 16,974 reads. 
  Total: 5,562,481 bases in 61,999 reads.                                     

I also counted the 31-mers in the cleaned reads, and got 3 estimates of the genome size:

  • 18.2k = 20,945-2,761 distinct 31-mers (excluding those with support<10)
  • 21.5k = 3702511/172.3 total 31-mers/coverage estimate from gamma fit
  • 21.5k = 5562481 * (89.72-30)/89.72 / 172.3 (total bases correct for estimated number of 31-mers based on average read length, divided by coverage).

The last 2 estimates are using the same data, so are not really independent estimates, just different ways of computing the same length. (The corrected coverage estimate 89.72/(89.72-30)*172.3 is about 259x.)

I have other estimates of coverage. For example, if I count just coverage in bases 500 to 5500 (well before the repeat region and far enough from the ends to avoid edge effects), I get an estimate of 274x from the pileup output, using

awk '$2<5500 && $2>=500 {sum+=$4}; END {print sum/5000.}' < merged.pileup       

Using this estimate with the 5,747,439 bases of the uncleaned reads, we get a genome length of 21k.

Looking at the statistics from samtools in map-Illumina-raw-46/, we see more sequences than in map-Illumina-raw-45/, but still a large fraction not properly paired.

samtools flagstat merged.bam
63179 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
62625 + 0 mapped (99.12%:nan%)
35750 + 0 paired in sequencing
17875 + 0 read1
17875 + 0 read2
32888 + 0 properly paired (91.99%:nan%)
34642 + 0 with itself and mate mapped
554 + 0 singletons (1.55%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

The second draft sequence (with the short repeats expanded and the repeats ordered as best I can from the short-insert reads) is available as draft2 gzipped fasta file. This corresponds to /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-47/draft on campusrocks. It has 36363 bases.

I no longer believe that this sequence is correct. I'm now fairly sure that the region that I assembled as repeats should actually have been assembled as variants from a heterogeneous population of mitochondria. We'll need long reads or PCR products to tell for sure.


I sent the sequence to for annotation, and it quickly created a very crude web interface that almost works. Unfortunately, it seems to have missed some of the protein genes and it provides no way to download its annotation in a GENBANK format. Since there are hundreds of tRNA genes (the repeat region is full of them), I'm not tempted to try screen-scraping. A better way to annotate the mitochondrion needs to be found.

The closest sequenced mitochondrial genomes at NCBI are

NC_010220.1     Biomphalaria tenagophila mitochondrion, complete genome
>gb|EF433576.1| Biomphalaria tenagophila strain Taim-RS mitochondrion, complete genome
      max     total   query   E-value max
      score   score   coverage        identity
      2679    4235    51%     0.0     71%
NC_005439.1             Biomphalaria glabrata mitochondrion, complete genome
>gb|AY380531.1| Biomphalaria glabrata strain 1742 mitochondrion, complete genome
      max     total   query   E-value max
      score   score   coverage        identity
      2562    3934    52%     0.0     69%

So this is pretty far from the previously known genomes.

It would be useful to find some spots that are highly conserved between Biomphalaria and Ariolimax, to place primers for extracting mitochondrial DNA by long-range PCR. That would make a population study of various Ariolimax species by sequencing their mitochondria quite tractable.

Protein genes

I can find 12 of the 13 standard mitochondrial protein genes using Blast to labeled mitochondria, and ORFfinder finds ORFs of about the right length for each. The missing gene (ATP8) is often missing in mollusks. There is some concern about whether the ORFs found by ORFfinder are correct, as some mollusks have RNA processing in their mitochondrial genes to insert the stop codon.

I'll annotate the ORFs as best I can (assuming no post-transcriptional mRNA processing) once the repeats have been resolved.

Ribosomal RNA genes

I found both the small subunit rRNA gene and the large subunit rRNA gene using BLAST to a few annotated mollusk mitochondria. I'm not sure how one gets precise ends for these RNA genes, without wet-lab work to get cDNA.

tRNA genes

I looked for tRNA genes with tRNAscan-SE run locally, and only found 7, but tRNAscan is known to have problems with mitochondial tRNAs. I'll work with Todd Lowe later this summer to try to improve the tRNAscan covariance models to handle them better.

The 7 easy-to-find tRNA genes are Asp-GTC, SeC-TCA, Val-TAC, Pro-TGG, Ala-TGC, Thr-TGT, Met-CAT. These are not correctly labeled for the mitochondrial genetic code: TCA is codon TGA which is Trp in code 5, though the others are ok.

I expect to find many copies of tRNA genes, as the repeat regions are in a tRNA-rich area of homologous mitochondria.

You could leave a comment if you were logged in.
assemblies/2011/mitochondrion_assembly.txt · Last modified: 2015/10/31 20:15 by karplus