A draft mitochondrion assembly was made by Kevin Karplus in 2012.
The goal in 2015 is to create a high-coverage, closed mitochondrion assembly.
You can view the mitochondrion assembly on the UCSC genome browser by following instructions posted on the wiki. It is currently showing the first Discovar de novo assembly generated using only the HiSeq SW018 reads.
File | Location |
2012 Mitochondrion assembly | /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-45/consensus-6 |
Sam file for SW018 reads from the HiSeq run that map to the mitochondrion | /campusdata/BME235/mitochondrion |
Sam file for SW019 reads from the HiSeq run that map to the mitochondrion | /campusdata/BME235/mitochondrion |
Assembly fasta files | /campusdata/BME235/mitochondrion/ |
May 2015
The first step was to map the HiSeq SW018 and SW019 reads against the 2012 draft genome. The mapped reads were then assembled using Discovar de novo, first using only the SW018 reads (assembly mitochondrion_SW018_discovar) and then using reads from both SW018 and SW019 (assembly mitochondrion_SW018-9_discovar). This second assembly produced two long contigs that appear to contain the majority of the mitochondrion genome, since their combined length is close to the expected genome size.
A second iteration of read mapping was done by mapping the SW018 and SW019 reads against the first 2015 assembly, to try to pull out any more reads that belong to the mitochondrion. The resulting assembly (mitochondrion_iteration2_SW018-9_discovar) has a smaller N50 and a smaller total length than mitochondrion_SW018-9_discovar, so the iterative mapping was not continued.
Assembly name | Bytes | Total bases | # scafs | contig N50 | scaffold N50 | total bases in 1kb+ scaffolds | total bases in 10kb+ scaffolds | coverage |
mitochondrion_SW018_discovar | 20K | 18,983 | 29 | 9,041 | 9,041 | 14,048 | 0 | 60X |
mitochondrion_SW018-9_discovar | 48K | 46,358 | 110 | 12,883 | 12,883 | 17,173 | 12,684 | 411X |
mitochondrion_iteration2_SW018-9_discovar | 48K | 45,494 | 114 | 9,030 | 9,030 | 15,548 | 0 | 436X |
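The N50 and "total bases in 1kb+ scaffolds" columns above can be recomputed from a list of contig lengths. The following is a minimal sketch, assuming the common definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size). The function names and the example lengths are hypothetical and are not taken from the actual assemblies.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def bases_in_contigs_at_least(lengths, min_length):
    """Sum the bases in contigs that are at least min_length long."""
    return sum(length for length in lengths if length >= min_length)


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    example = [9000, 3400, 1200, 400, 350, 300, 250, 200]
    print("N50:", n50(example))
    print("bases in 1kb+ contigs:", bases_in_contigs_at_least(example, 1000))
```

Comparing these numbers between iterations is what led to stopping after the second round of mapping.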
For comparison, the mitochondrion sizes of the most closely related molluscs that have mitochondrion assemblies are 14,100bp (grove snail - Cepaea nemoralis) and 14,130bp (land snail - Albinaria coerulea). However, some mollusc mitochondrial genomes are much larger: that of the sea scallop Placopecten magellanicus is reported to be 30.6-30.7kbp (David R. Smith and Marlene Snyder. Complete Mitochondrial DNA Sequence of the Scallop Placopecten magellanicus: Evidence of Transposition Leading to an Uncharacteristically Large Mitochondrial Genome. J Mol Evol (2007) 65:380–391 doi:10.1007/s00239-007-9016-x).
It looks as though the majority of the mitochondrion is in two contigs in the assembly built from both the SW018 and SW019 reads. The largest contig is 12,884bp and the second largest is 3,425bp.
When the largest of these contigs is blasted against the consensus sequence from the 2012 assembly, you can see that some of the repeat regions present in the 2012 assembly were merged in the 2015 assembly. In general, the two assemblies agree well. In the dot plot below, the 2015 assembly is on the x-axis and the 2012 assembly is on the y-axis.
Most of the other contigs generated are quite small (200-400bp) and largely look like repeats.
May 2015
The COX1 gene sequence (used for barcoding) was extracted from the contigs by blasting contigs against the nr/nt database and looking for a contig with hits to other COX1 genes (there are only two long contigs in the first iteration assembly). This contig was annotated using DOGMA and the exact COX1 gene sequence was extracted and can be seen below:
>COX1 AGAAGATTCAGTATTTATATGAAAATCATGAGGTAAAAACATCTCCCATTCACGAGAAAATGATCCTGAAACCGAAAATACAGAACTACGTTGACTAATTATAGCTTCCCATAATATTAATAAAAATAAAAGAACACCAAAAATAGATACCAAAGATCCATAAGAAGAAATTTGATTTCAAAAAAAATAAGAATCAGGATAATCAGAATAACGTCGTGGTATACCGGCTAACCCTAAAAAATGTTGAGGAAAAAAAGTTATATTAACAGCAATAAATATAATAAAAAATTGAGCTTTTGCTCATCGCTCATGAAGTGTAACACCTCTTATTAGTGGGAATCAATAAACAAATGCTGCAAAAATAGCAAATACAGCTCCTATTGATAAGACATAATGAAAATGAGCTACAACATAATAAGTATCATGAAGAACAATATCTAAGGAAGAATTTGATAACACAATACCTGTTAATCCACCTAATGTAAATAAAAAAATAAAACCTAAAACCCAATATATAGAAGCTGAAAAAGAACAATTACTTCCATATAAAGTTATAAGTCACCTAAAAATTTTAATACCCGTAGGTACAGCAATAACTATAGTAGCAGCAGTAAAATATGCTCGAGTGTCTACATCTATCCCAACAGTAAATATATGATGTGCTCACACAATAAAACCTAAAACACCAATAGAAATTATAGCATAAATTATACCTAATGTACCAAAGGGTTGTTTAATAGTAAAATTACTTAAAATATGTGAAATAATTCCAAATCCTGGTAAAATTAAAATATATACTTCAGGGTGACCAAAAAATCAAAATAAATGTTGATATAAAATTGGGTCCCCTCCACCAGCTGGATCAAAAAACCTAGTATTAAAATTACGATCTGTTAAAAGTATAGTAATGGCACCTGCTAAAACCGGAAGTGATAGTAGTAATAAAAATACTGTAATTAAAATAGATCATACAAATAAACTTACACGTTCCATTAATATCCCAGATGCACGTATATTAAAAATAGTAGTAATAAAATTAATTGCTCCTAAAATAGAAGATATACCTGCTAAATGTAATGAAAAAATAGCTAAATCAACAGAAGCCCCACCATGACCTACTGGTCCTCTTAAAGGGGGGTATACTGTTCAACCAGTACCAACACCACCTTCAATTATTGAAGAAGAAATTAATAATAAAAAAGAGGGAGGAAGTAACCAGAATCTTATATTATTTATTCGAGGAAATCTTATATCAGGAGCTCCAATTAATAACGGTACTATTCAATTACCAAATCCACCAATTATTAAAGGTATAACTATAAAAAAAATCATAACAAAAGCATGAGCAGTAATAATTACATTAAAAAAATGATCATCTATTAATACTCTAGCAGTACTTAACTCTAAACGAATTAATAAAGATAACCCTGTACCTACTATCCCACATCATACCCCAAAAATTATATACAATGTACCAATATCTTTATGGTTTGTAGAAAAAAGTCAACGCAA
The COX1 gene was compared (using blastn) to the COX1 gene from the 2012 assembly, which came from a different specimen of banana slug. The two gene sequences are 100% identical over 100% of the sequence, which is consistent with the two specimens being from the same species of banana slug.
June 2015
In an attempt to close gaps between scaffolds, the reads from the SW041, SW042, and Lucigen mate pair libraries were mapped against the assembly. This was done using bwa samse separately for the forward and reverse reads of each library, after which the resulting sam file was visualized using Tablet. Tablet allows the user to see which reads mapped to which scaffolds (and where). For each sam file/mate pair library, the following characteristics were recorded for each read that mapped to the mitochondrion assembly: 1) the name of the read, 2) which scaffold it mapped to, 3) its orientation on the scaffold. This was done for both the forward and reverse reads, and then any pairs where both reads mapped were noted. The hope was that there would be a pair where each read mapped to a different scaffold, but that was not seen.
The results are shown in the table below. For each of the SW041 and SW042 libraries, you can see the reads in the forward file that mapped to the mitochondrion scaffolds, which scaffold they mapped to, and the orientation of each read within the scaffold. You can then compare this list against the list of reads from the file of reverse reads. If a forward read is listed but there is a blank space on the reverse list (or vice versa), it means the mate did not map. None of the Lucigen mates mapped to the mitochondrion assembly, which is why the Lucigen library is not shown below.
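The bookkeeping described above (read name, scaffold, orientation, and then looking for pairs whose mates land on different scaffolds) could also be done with a script rather than by eye in Tablet. The following is a minimal sketch, not the procedure that was actually used. It relies only on the standard SAM columns (QNAME, FLAG, RNAME) and the standard flag bits 0x4 (unmapped) and 0x10 (reverse strand). The function names are hypothetical, and the assumption that mates share a name up to a `_1` / `_2` suffix is based on the read names shown on this page.

```python
import re
from collections import defaultdict

# Assumption: mates are named "<pair>_1..." and "<pair>_2...".
MATE_SUFFIX = re.compile(r"^(?P<pair>.+)_(?P<mate>[12])(?::|$)")


def mapped_reads(sam_lines):
    """Yield (read name, scaffold, orientation) for every mapped read."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, scaffold = fields[0], int(fields[1]), fields[2]
        if flag & 0x4 or scaffold == "*":
            continue  # read did not map
        orientation = "<---" if flag & 0x10 else "--->"
        yield name, scaffold, orientation


def pairs_spanning_scaffolds(forward_sam, reverse_sam):
    """Return pairs whose two mates mapped to different scaffolds."""
    hits = defaultdict(dict)
    for mate, lines in (("1", forward_sam), ("2", reverse_sam)):
        for name, scaffold, orientation in mapped_reads(lines):
            match = MATE_SUFFIX.match(name)
            pair = match.group("pair") if match else name
            hits[pair][mate] = (scaffold, orientation)
    return {
        pair: mates
        for pair, mates in hits.items()
        if len(mates) == 2 and mates["1"][0] != mates["2"][0]
    }
```

Each argument is any iterable of SAM lines, for example an open file handle for the forward-read sam file and another for the reverse-read sam file. An empty result corresponds to the outcome reported here: no pair linking two scaffolds.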
SW041
forward | scaffold | orientation | reverse | scaffold | orientation |
M00160:77:000000000-AE834:1:1109:22029:10206_1:N:0:23 | 0 | ---> | | | |
M00160:77:000000000-AE834:1:1114:9169:6584_1:N:0:23 | 0 | ---> | M00160:77:000000000-AE834:1:1114:9169:6584_2:N:0:23 | 0 | <--- |
M00160:77:000000000-AE834:1:1118:14243:1636_1:N:0:23 | 0 | ---> | | | |
M00160:77:000000000-AE834:1:2113:13837:23518_1:N:0:23 | 0 | ---> | M00160:77:000000000-AE834:1:2113:13837:23518_2:N:0:23 | 0 | <--- |
M00160:77:000000000-AE834:1:2103:25437:5669_1:N:0:23 | 0 | <--- | M00160:77:000000000-AE834:1:2103:25437:5669_2:N:0:23 | 0 | ---> |
M00160:77:000000000-AE834:1:1115:7606:16026_1:N:0:23 | 0 | ---> | | | |
M00160:77:000000000-AE834:1:1114:22230:12762_1:N:0:23 | 0 | <--- | M00160:77:000000000-AE834:1:1114:22230:12762_2:N:0:23 | 0 | ---> |
 | | | M00160:77:000000000-AE834:1:1116:27469:17030_2:N:0:23 | 180 | ---> |
 | | | M00160:77:000000000-AE834:1:2101:17342:16622_2:N:0:23 | 0 | ---> |
 | | | M00160:77:000000000-AE834:1:2107:16918:15898_2:N:0:23 | 2 | ---> |
 | | | M00160:77:000000000-AE834:1:2104:14511:13397_2:N:0:23 | 2 | ---> |
 | | | M00160:77:000000000-AE834:1:2101:20707:13721_2:N:0:23 | 150 | <--- |
SW042
forward | scaffold | orientation | reverse | scaffold | orientation |
M00160:77:000000000-AE834:1:1115:19644:3622_1:N:0:24 | 0 | <--- | M00160:77:000000000-AE834:1:1115:19644:3622_2:N:0:24 | 0 | ---> |
M00160:77:000000000-AE834:1:2114:25204:13527_1:N:0:24 | 0 | <--- | M00160:77:000000000-AE834:1:2114:25204:13527_2:N:0:24 | 0 | ---> |
M00160:77:000000000-AE834:1:1102:12109:23847_1:N:0:24 | 0 | <--- | M00160:77:000000000-AE834:1:1102:12109:23847_2:N:0:24 | 0 | ---> |
M00160:77:000000000-AE834:1:1102:15829:18820_1:N:0:24 | 0 | ---> | M00160:77:000000000-AE834:1:1102:15829:18820_2:N:0:24 | 0 | <--- |
M00160:77:000000000-AE834:1:1102:14067:11879_1:N:0:24 | 0 | ---> | M00160:77:000000000-AE834:1:1102:14067:11879_2:N:0:24 | 0 | <--- |
M00160:77:000000000-AE834:1:2105:9179:8802_1:N:0:24 | 2 | ---> | M00160:77:000000000-AE834:1:2105:9179:8802_2:N:0:24 | 2 | <--- |
M00160:77:000000000-AE834:1:2109:15500:19867_1:N:0:24 | 2 | <--- | | | |
 | | | M00160:77:000000000-AE834:1:2119:25402:21282_2:N:0:24 | 168 | ---> |
 | | | M00160:77:000000000-AE834:1:2107:8879:15431_2:N:0:24 | 166 | <--- |
This result shows that the mate pair libraries will not be useful in scaffolding together the mitochondrion assembly.
Analysis by Kevin Karplus: with about 17000 bases out of a genome size of 2.3Gbases, we expect only about one read in 135,000 to be from the mitochondrion if the mitochondrion DNA is at the same number of copies as the nuclear. I'd expect maybe 100 times that for the many mitochondrion copies—what was the average coverage of each mitochondrial contig? With 1.2M read pairs in SW41 and 660K read pairs in SW42, we'd expect about 8 read pairs in SW41 and 4 in SW42, if there was only one mitochondrion copy per genome. So the number being mapped here seems a bit small (as if there were 1.5–2 copies of the mitochondrion).
It would be useful to map all the paired-end reads to the mitochondrion contigs, but using a mapper that shows multiple mappings, not just the best or a randomly chosen good mapping. That way we should be able to see reads that join neighboring contigs, with multiple mapping for the forking. Mining the data to extract conjectures about copy numbers and what is adjacent to what can be frustrating, though, and it may be quicker to do PCR to join the contigs, as suggested below.
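The estimate of expected read pairs in the analysis above is simple proportional arithmetic, sketched below. The genome size, mitochondrion size, and library sizes are the figures quoted in the analysis. The function names are hypothetical, and the model (reads sampled uniformly, mitochondrial fraction equal to copies times mitochondrion size over nuclear genome size) is a simplifying assumption.

```python
GENOME_SIZE = 2.3e9   # nuclear genome size in bases, as quoted above
MITO_SIZE = 17_000    # approximate mitochondrion size in bases, as quoted above


def expected_mito_pairs(read_pairs, copies_per_genome=1.0):
    """Expected number of read pairs from the mitochondrion.

    Assumes reads are sampled uniformly and that the mitochondrion is
    present at copies_per_genome copies per nuclear genome copy.
    """
    return read_pairs * copies_per_genome * MITO_SIZE / GENOME_SIZE


def implied_copies(observed_pairs, read_pairs):
    """Copy number that would explain the observed count under the same model."""
    return observed_pairs / expected_mito_pairs(read_pairs)


if __name__ == "__main__":
    for library, pairs in (("SW041", 1_200_000), ("SW042", 660_000)):
        print(library, round(expected_mito_pairs(pairs), 1))
```

Dividing the number of mapped pairs actually observed by the single-copy expectation gives the implied copy number discussed in the analysis.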
June 2015
Primers were designed to amplify the regions between the two largest scaffolds. These two scaffolds likely contain the majority of the mitochondrion genome (explanation above).
There are several important considerations for primer design:
A first attempt at primer design can be seen below. Three sets were designed in case one set does not work as well as expected. These primers should be verified by somebody with experience designing primers before being ordered.
Primers for the 5' end of largest contig (reverse complements)
Primer sequence | primer length (bp) | Melting temperature (°C) |
5' GTTATTCATATTATTCGTGATGTACC 3' | 26 | 50.9 |
5' GTGTGAGGCGGATTTTC 3' | 17 | 50.8 |
5' GTTATTACAAATTTACTATCTGCAATTC 3' | 28 | 50.8 |
Primers for the 3' end of the largest contig
Primer sequence | primer length (bp) | Melting temperature (°C) |
5' GAACACCCTTATAAAGAAGCC 3' | 21 | 51.2 |
5' GGCTAATATTAGCGCTGG 3' | 18 | 50.2 |
5' GAAGTATTTGTTTCATTAATTCAGGG 3' | 26 | 51.7 |
Primers for the 5' end of the second largest contig (reverse complements)
Primer sequence | primer length (bp) | Melting temperature (°C) |
5' CACTGTATATAGTTATTATTGAAGTTTATTAAC 3' | 33 | 50.9 |
5' GGAAGAATTTAAGTGTGTAGTATTTAG 3' | 27 | 50.9 |
5' CACATATAAAAAACTTTAGTACACAATATTAG 3' | 30 | 51.3 |
Primers for the 3' end of the second largest contig
Primer sequence | primer length (bp) | Melting temperature (°C) |
5' CATAACTTCAATATCACTGATGTC 3' | 24 | 50.2 |
5' GAATTGGTGATGCTGGATC 3' | 19 | 51.1 |
5' GCGAAGACTAGGTAATGC 3' | 18 | 50 |
The sequences of the two largest contigs can be seen here, with the primer sequences highlighted in red.
July 20
Originally we thought that the mitochondrion was in two contigs - one at 12,844bp and one at 3,425bp. Blasting these two contigs against one another finds no significant similarity. The expected mitochondrion size is ~14,000bp, based on the size of other mollusc mitochondria (all are within a fairly small range around 14,000bp).
Running the PCR was challenging because there is extremely little DNA left from the banana slug specimen. Steven found some very small remnants in a leftover tube and ran the PCRs. Here are the results:
Unfortunately we were expecting a smaller product, and so the ladder used does not extend to large enough fragments. However, using a 10kb 2 log ladder we were able to estimate the size of the PCR products.
Primers used in lane 4 were:
Primers for the 5' end of largest contig (reverse complements): 5' GTGTGAGGCGGATTTTC 3'
Primers for the 3' end of the largest contig: 5' GGCTAATATTAGCGCTGG 3'
These should have amplified the following region:
Primers used in lane 7 were:
Primers for the 5' end of the second largest contig (reverse complements): 5' GGAAGAATTTAAGTGTGTAGTATTTAG 3'
Primers for the 3' end of the second largest contig: 5' GAATTGGTGATGCTGGATC 3'
and in lane 8 were:
Primers for the 5' end of the second largest contig (reverse complements): 5' CACATATAAAAAACTTTAGTACACAATATTAG 3'
Primers for the 3' end of the second largest contig: 5' GCGAAGACTAGGTAATGC 3'
These should have amplified the following region:
These results suggest that both contigs would be “circularized” by the PCR products that were amplified. This is somewhat surprising for the second, shorter contig.
July 22
There are multiple explanations for why both fragments were circularized during the PCR.
August 2015
PCR products were sent for Sanger sequencing. Something went wrong - the result was largely “N”s. I checked whether I could improve calls by looking at the chromatogram, but did not have significant success. Below is an example of what the chromatogram looked like. I will be re-sending the PCR products for sequencing.
September 2015
PCR products were sent for sequencing a second time, this time after a size selection on the strongest band. Results were essentially the same.
The mitochondrion assembly is being worked on by Natasha Dudek (natasha@dudek.org) from Team 5: Discovar de novo.