User Tools

Site Tools


assemblies:2015:mitochondrion_assembly

This is an old revision of the document!


A PCRE internal error occured. This might be caused by a faulty plugin

====Mitochondrion assembly==== A [[assemblies::2011::mitochondrion_assembly|draft mitochondrion assembly]] was made by Kevin Karplus in 2012. The goal in 2015 is to create a high coverage, closed mitochondrion assembly. You can view the mitochondrion assembly on the UCSC genome browser by following [[https://banana-slug.soe.ucsc.edu/banana_slug_genome_browser|instructions]] posted on the wiki. It is currently showing the first Discovar de novo assembly generated using only the HiSeq SW018 reads. === Files === | File | Location | | 2012 Mitochondrion assembly | /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-45/consensus-6 | | Sam file for SW018 reads from the HiSeq run that map to the mitochondrion| /campusdata/BME235/mitochondrion | | Sam file for SW019 reads from the HiSeq run that map to the mitochondrion| /campusdata/BME235/mitochondrion | | Assembly fasta files | /campusdata/BME235/mitochondrion/ | ====Methods and results==== ====Assembly==== May 2015 The first step taken was to map reads from the HiSeq SW018 and SW019 reads against the 2012 draft genome. Reads were then assembled using Discovar de novo, first using only the SW018 reads (assembly mitochondrion_SW018_discovar) and then using reads from both SW018 and SW019 (assembly mitochondrion_SW018-9_discovar). This second assembly produced two long contigs that appear to be the majority of the mitochondrion genome since the length of both summed is close to the expected genome size. A second iteration of read mapping was done using the first 2015 assembly and mapping SW018 and SW019 reads against it, to try to pull out any more reads that belong to the mitochondrion. This assembly has a smaller N50 and a smaller total length, and therefore the iterative mapping was not continued for more iterations. |Assembly name | Bytes | Total bases | # scafs | contig N50 | scaffold N50 | total bases in 1kb+ scaffolds | total bases in 10kb+ scaffolds | coverage| | mitochondrion_SW018_discovar | 20K | 18,983 | 29 | 9,041 | 9,041 | 14, 048 | 0 | 60X | | mitochondrion_SW018-9_discovar | 48K | 46,358 | 110 | 12,883 | 12,883 | 17,173 | 12,684 | 411X| |mitochondrion_iteration2_SW018-9_discovar| 48K | 45,494 | 114 | 9,030 | 9,030 | 15,548 | 0 | 436X | For comparison, the mitochondrion size of the closest related molluscs that have mitochondrion assemblies are 14,100bp (grove snail - //Cepaea nemoralis//) and 14,130bp (land snail - //Albinaria coerulea//) It looks as though the majority of the mitochondrion is in two contigs in the assembly using both the SW018 and SW019 assemblies. One of these contigs is 12,884bp and the other (which is the second largest contig) is 3,425bp. When these the largest of these contigs is blasted against the consensus sequence from the 2012 assembly you can see that some of the repeat regions present in the 2012 assembly were merged in the 2015 assembly. In general, there is a pretty good agreement between the two assemblies. In the dot plot below, the 2015 assembly is on the x-axis and the 2012 assembly is on the y-axis. {{:2012_vs_2015_assemblies.png?200|}} Most of the other contigs generated are quite small (200-400bp) and largely look like repeats. ====The COX1 gene sequence==== May 2015 The COX1 gene sequence (used for barcoding) was extracted from the contigs by blasting contigs against the nr/nt database and looking for a contig with hits to other COX1 genes (there are only two long contigs in the first iteration assembly). This contig was annotated using [[http://dogma.ccbb.utexas.edu/|DOGMA]] and the exact COX1 gene sequence was extracted and can be seen below: <code> >COX1 AGAAGATTCAGTATTTATATGAAAATCATGAGGTAAAAACATCTCCCATTCACGAGAAAATGATCCTGAAACCGAAAATACAGAACTACGTTGACTAATTATAGCTTCCCATAATATTAATAAAAATAAAAGAACACCAAAAATAGATACCAAAGATCCATAAGAAGAAATTTGATTTCAAAAAAAATAAGAATCAGGATAATCAGAATAACGTCGTGGTATACCGGCTAACCCTAAAAAATGTTGAGGAAAAAAAGTTATATTAACAGCAATAAATATAATAAAAAATTGAGCTTTTGCTCATCGCTCATGAAGTGTAACACCTCTTATTAGTGGGAATCAATAAACAAATGCTGCAAAAATAGCAAATACAGCTCCTATTGATAAGACATAATGAAAATGAGCTACAACATAATAAGTATCATGAAGAACAATATCTAAGGAAGAATTTGATAACACAATACCTGTTAATCCACCTAATGTAAATAAAAAAATAAAACCTAAAACCCAATATATAGAAGCTGAAAAAGAACAATTACTTCCATATAAAGTTATAAGTCACCTAAAAATTTTAATACCCGTAGGTACAGCAATAACTATAGTAGCAGCAGTAAAATATGCTCGAGTGTCTACATCTATCCCAACAGTAAATATATGATGTGCTCACACAATAAAACCTAAAACACCAATAGAAATTATAGCATAAATTATACCTAATGTACCAAAGGGTTGTTTAATAGTAAAATTACTTAAAATATGTGAAATAATTCCAAATCCTGGTAAAATTAAAATATATACTTCAGGGTGACCAAAAAATCAAAATAAATGTTGATATAAAATTGGGTCCCCTCCACCAGCTGGATCAAAAAACCTAGTATTAAAATTACGATCTGTTAAAAGTATAGTAATGGCACCTGCTAAAACCGGAAGTGATAGTAGTAATAAAAATACTGTAATTAAAATAGATCATACAAATAAACTTACACGTTCCATTAATATCCCAGATGCACGTATATTAAAAATAGTAGTAATAAAATTAATTGCTCCTAAAATAGAAGATATACCTGCTAAATGTAATGAAAAAATAGCTAAATCAACAGAAGCCCCACCATGACCTACTGGTCCTCTTAAAGGGGGGTATACTGTTCAACCAGTACCAACACCACCTTCAATTATTGAAGAAGAAATTAATAATAAAAAAGAGGGAGGAAGTAACCAGAATCTTATATTATTTATTCGAGGAAATCTTATATCAGGAGCTCCAATTAATAACGGTACTATTCAATTACCAAATCCACCAATTATTAAAGGTATAACTATAAAAAAAATCATAACAAAAGCATGAGCAGTAATAATTACATTAAAAAAATGATCATCTATTAATACTCTAGCAGTACTTAACTCTAAACGAATTAATAAAGATAACCCTGTACCTACTATCCCACATCATACCCCAAAAATTATATACAATGTACCAATATCTTTATGGTTTGTAGAAAAAAGTCAACGCAA </code> The COX1 gene was compared (using blastn) to the COX1 gene assembled from the 2012 assembly, which came from a different specimen of banana slug. The two gene sequences are 100% identical over 100% of the sequence, meaning that the two specimens were both from the same species of banana slug. ====Scaffolding with mate pairs to try and close gaps==== June 2015 In an attempt to close gaps between scaffolds, the reads from the SW041, SW042, and lucigen mate pair libraries were mapped against the assembly. This was done using bwa samse for each the forward and reverse reads for each library, after which the resulting sam file was visualized using Tablet. Tablet allows the user to see which reads mapped to which scaffolds (and where). For each sam file/mate pair library, the following characteristics were recorded for each read that mapped to the mitochondrion assembly: 1) the name of the read, 2) what scaffold it mapped to, 3) its orientation on the scaffold. This was done for both the forward and reverse reads, and then any pairs where both reads mapped were noted. The hope was that there would be a pair where each read mapped to a different scaffold, but that was not seen. The results are shown in the table below. For each the SW041 and SW042 libraries, you can see the reads in the forward files that mapped to the mitochondrion scaffolds, what scaffold they mapped to, and the orientation of each read within the scaffold. You can then compare this list against the list of reads from the file or reverse reads. If a forward read is listed but there is a blank space on the reverse list (or vice-versa) it means the mate did not map. None of the lucigen mates mapped to the mitochondrion assembly, which is why the lucigen library is not shown below. <code> SW041 forward scaffold orientation reverse scaffold orientation M00160:77:000000000-AE834:1:1109:22029:10206_1:N:0:23 0 ---> M00160:77:000000000-AE834:1:1114:9169:6584_1:N:0:23 0 ---> M00160:77:000000000-AE834:1:1114:9169:6584_2:N:0:23 0 <--- M00160:77:000000000-AE834:1:1118:14243:1636_1:N:0:23 0 ---> M00160:77:000000000-AE834:1:2113:13837:23518_1:N:0:23 0 ---> M00160:77:000000000-AE834:1:2113:13837:23518_2:N:0:23 0 <--- M00160:77:000000000-AE834:1:2103:25437:5669_1:N:0:23 0 <--- M00160:77:000000000-AE834:1:2103:25437:5669_2:N:0:23 0 ---> M00160:77:000000000-AE834:1:1115:7606:16026_1:N:0:23 0 ---> M00160:77:000000000-AE834:1:1114:22230:12762_1:N:0:23 0 <--- M00160:77:000000000-AE834:1:1114:22230:12762_2:N:0:23 0 ---> M00160:77:000000000-AE834:1:1116:27469:17030_2:N:0:23 180 ---> M00160:77:000000000-AE834:1:2101:17342:16622_2:N:0:23 0 ---> M00160:77:000000000-AE834:1:2107:16918:15898_2:N:0:23 2 ---> M00160:77:000000000-AE834:1:2104:14511:13397_2:N:0:23 2 ---> M00160:77:000000000-AE834:1:2101:20707:13721_2:N:0:23 150 <--- SW042 forward scaffold orientation reverse scaffold orientation M00160:77:000000000-AE834:1:1115:19644:3622_1:N:0:24 0 <--- M00160:77:000000000-AE834:1:1115:19644:3622_2:N:0:24 0 ---> M00160:77:000000000-AE834:1:2114:25204:13527_1:N:0:24 0 <--- M00160:77:000000000-AE834:1:2114:25204:13527_2:N:0:24 0 ---> M00160:77:000000000-AE834:1:1102:12109:23847_1:N:0:24 0 <--- M00160:77:000000000-AE834:1:1102:12109:23847_2:N:0:24 0 ---> M00160:77:000000000-AE834:1:1102:15829:18820_1:N:0:24 0 ---> M00160:77:000000000-AE834:1:1102:15829:18820_2:N:0:24 0 <--- M00160:77:000000000-AE834:1:1102:14067:11879_1:N:0:24 0 ---> M00160:77:000000000-AE834:1:1102:14067:11879_2:N:0:24 0 <--- M00160:77:000000000-AE834:1:2105:9179:8802_1:N:0:24 2 ---> M00160:77:000000000-AE834:1:2105:9179:8802_2:N:0:24 2 <--- M00160:77:000000000-AE834:1:2109:15500:19867_1:N:0:24 2 <--- M00160:77:000000000-AE834:1:2119:25402:21282_2:N:0:24 168 ---> M00160:77:000000000-AE834:1:2107:8879:15431_2:N:0:24 166 <--- </code> This result shows that the mate pair libraries will not be useful in scaffolding together the mitochondrion assembly. Analysis by Kevin Karplus: with about 17000 bases out of a genome size of 2.3Gbases, we expect only about one read in 135,000 to be from the mitochondrion if the mitochondrion DNA is at the same number of copies as the nuclear. I'd expect maybe 100 times that for the many mitochondrion copies—what was the average coverage of each mitochondrial contig? With 1.2M read pairs in SW41 and 660K read pairs in SW42, we'd expect about 8 read pairs in SW41 and 4 in SW42, if there was only one mitochondrion copy per genome. So the number being mapped here seems a bit small (as if there were 1.5–2 copies of the mitochondrion). It would be useful to map all the paired-end reads to the mitochondrion contigs, but using a mapper that shows multiple mappings, not just the best or a randomly chosen good mapping. That way we should be able to see reads that join neighboring contigs, with multiple mapping for the forking. Mining the data to extract conjectures about copy numbers and what is adjacent to what can be frustrating, though, and it may be quicker to do PCR to join the contigs, as suggested below. ====Primer design for PCR to close gaps==== June 2015 Primers were designed to amplify the regions between the two largest scaffolds. These two scaffolds likely contain the majority of the mitochondrion genome (explanation above). There are several important considerations for primer design: - Primers should start and end with a C or a G. - Primers should have approximately the same melting temperature (to within 1-2 degrees Celsius). Melting temperature of primers was calculated with the online [[http://www.biophp.org/minitools/melting_temperature/demo.php?primer|BioPHP melting temperature calculation tool]]. - Primers should be ~20bp. The probability of having multiple identical 20bp regions within the entire banana slug genome is very unlikely. - Primers should not amplify low complexity regions of the genome. This was a bit tricky since the ends of the two big scaffolds are AT-rich, lower complexity regions. - Primers should be near the ends of the scaffolds. This was again a bit tricky because it looks as though the ends of the two biggest scaffolds may be repeat regions, which may also explain why the assembler region was not able to join them together. In particular, it looks like the 5' end of "flattened_line_0" (the largest scaffold) and the 3' end of "flattened_line_2" (the second largest scaffold) may be separated by a repeat region (based on an ~200bp-long near-identical region they both share). - Primers need to be designed for scaffolds that are oriented in the same direction - i.e. on the same strand. By mapping the scaffolds against the 2012 mitochondrion assembly, it was apparent that they were both oriented in the same way. A first attempt at primer design can be seen below. Three sets were designed in case one set does not work as well as expected. These primers should be verified by somebody with experience designing primers before being ordered. <code> Primers for the 5' end of largest contig (reverse complements) Primer sequence primer length (bp) Melting temperature (°C) 5' GTTATTCATATTATTCGTGATGTACC 3' 26 50.9 5' GTGTGAGGCGGATTTTC 3' 17 50.8 5' GTTATTACAAATTTACTATCTGCAATTC 3' 28 50.8 Primers for the 3' end of the largest contig Primer sequence primer length (bp) Melting temperature (°C) 5' GAACACCCTTATAAAGAAGCC 3' 21 51.2 5' GGCTAATATTAGCGCTGG 3' 18 50.2 5' GAAGTATTTGTTTCATTAATTCAGGG 3' 26 51.7 Primers for the 5' end of the second largest contig (reverse complements) Primer sequence primer length (bp) Melting temperature (°C) 5' CACTGTATATAGTTATTATTGAAGTTTATTAAC 3' 33 50.9 5' GGAAGAATTTAAGTGTGTAGTATTTAG 3' 27 50.9 5' CACATATAAAAAACTTTAGTACACAATATTAG 3' 30 51.3 Primers for the 3' end of the second largest contig Primer sequence primer length (bp) Melting temperature (°C) 5' CATAACTTCAATATCACTGATGTC 3' 24 50.2 5' GAATTGGTGATGCTGGATC 3' 19 51.1 5' GCGAAGACTAGGTAATGC 3' 18 50 </code> The sequence of the two largest sequences can be seen {{:primer_design_locations.pdf|here}}, with the primer sequences highlighted in red. ====PCR results==== July 20 Originally we thought that the mitochondrion was in two contigs - one at 12,844bp and one at 3,425bp. Blasting these two contigs against one another results in no significant similarity being found. The expected mitochondrion size is ~14,000bp, based on the size of other mollusc mitochondria (all are within a fairly small range around 14,000bp) Running the PCR was challenging because there is extremely little DNA left from the banana slug specimen. Steven found some very small remnants in a leftover tube and ran the PCRs. Here are the results: {{:assemblies:2015:mitochondrion_pcr.jpg?200|}} Unfortunately we were expecting a smaller product, and so the ladder used does not extend to large enough fragments. However, using a 10kb 2 log ladder we were able to estimate the size of the PCR products. Primers used in lane 4 were: <code> Primers for the 5' end of largest contig (reverse complements) 5' GTGTGAGGCGGATTTTC 3' Primers for the 3' end of the largest contig 5' GGCTAATATTAGCGCTGG 3' </code> These should have amplified the following region: {{:assemblies:2015:longest_contig_pcr.png?200|}} Primers used in lanes 7 were: <code> Primers for the 5' end of the second largest contig (reverse complements) 5' GGAAGAATTTAAGTGTGTAGTATTTAG 3' Primers for the 3' end of the second largest contig 5' GAATTGGTGATGCTGGATC 3' </code> and in lane 8 were: <code> Primers for the 5' end of the second largest contig (reverse complements) 5' CACATATAAAAAACTTTAGTACACAATATTAG 3' Primers for the 3' end of the second largest contig 5' GCGAAGACTAGGTAATGC 3' </code> These should have amplified the following region: {{:assemblies:2015:2nd_longest_contig_pcr.png?200|}} These results suggest that both contigs would be "circularized" by the PCR products that were amplified. This is somewhat surprising for the second, shorter contig. July 22 There are multiple explanations for why both fragments were circularized during the PCR. - Some kind of technical error or PCR artifact. - The smaller contig may be a nump that, for some reason, circularized. Supporting this theory (possibly) is that if you blast the larger contig against the //Albinaria caerulea// genome, it looks like we have nearly the entire mitochondrial genome, expect for an ~800bp region (see dot plot below). This is only slightly shorter than the fragment amplified with the PCR above. Given that molluscs seem to have fairly conserved mitochondrial genome sizes of ~14,000bp, if both fragments are truly mitochondrial it means the total length must be over 16,309bp, which is unexpectedly large and may be incorrect. On the other hand, if the second largest contig is a nump, you would likely expect it to have some sequence similarity to the largest contig, but blasting them against each other finds no significant sequence similarity. Additionally, I used [[http://dogma.ccbb.utexas.edu/|DOGMA]] to do a preliminary annotation of both contigs, and it predicted that the largest carries the cox1 and cox 3 genes, whereas the second largest contig carries the cox2 and cob genes, so they appear to be complimentary. - The banana slug could possibly have a mitochondrial genome that is present as separate mini circular chromosomes, as seen in the human body louse, //Pediculus humanus// (see "The single mitochondrial chromosome typical of animals has evolved into 18 mini chromosomes in the human body louse, //Pediculus humanus//, by Shao //et al//., 2009). This could explain why both contigs circularized with the PCR that was run. However, it is probably too early to say whether this circularization is really "valid" or not. This is something that will be examined in greater detail in the future. {{:assemblies:2015:longest_vs_a_caerulea.png?200|}} The mitochondrion assembly is being worked on by Natasha Dudek (natasha@dudek.org) from Team 5: Discovar //de novo//.

Discussion

, 2015/08/27 03:49

That chromatogram looks like two different PCR products are superimposed—there seem to be always either one or two traces high. Disentangling two DNA reads is difficult in Sanger sequencing, unless you have a very good guess at what one of them is. I'd want to see the beginning of the chromatogram, though, rather than the end, as the end is often out of sync.

A better base caller would give you the appropriate ambiguity code, rather than just N, which might provide some hope of finding the sequence in the genome. If one of PCR products starts in the mitochondrial sequence where we expect, then we could subtract the (short) piece that we assembled to get a short sequence for the other piece.

You could try PeakTrace Online (http://www.nucleics.com/peaktrace-sequencing/), though I suspect it only reports N for ambiguous bases also—I don't know whether any of the Sanger basecallers properly report the correct ambiguity code for 2 bases.

, 2015/07/28 20:47

Update: mitochondrial PCR products from July 20th, 2015 have been sent to UC Berkeley for Sanger sequencing.

You could leave a comment if you were logged in.
assemblies/2015/mitochondrion_assembly.1437625242.txt.gz · Last modified: 2015/07/23 04:20 by ndudek