User Tools

Site Tools


computer_resources:assemblies:mitochondrion

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
computer_resources:assemblies:mitochondrion [2011/07/02 18:39]
karplus [Finishing the genome] added primers to be used for long-range PCR
computer_resources:assemblies:mitochondrion [2015/07/16 18:33]
ceisenhart Deleting this page
Line 1: Line 1:
-====== Mitochondrion ====== 
  
 +
 +===== Mitochondrial sequence =====
 +
 +The mitochondrion was re assembled in 2015, follow the new assembly here, [[::​mitochondrion_2015 | 2015 Mitochondrion assembly ]].  This page is for the 2012 mitochondrial assembly. ​
 +
 +The first draft sequence is available as {{mitochondrion-draft1.fasta.gz|draft1 gzipped fasta file}}. ​ This corresponds to /​campusdata/​BME235/​assemblies/​slug/​barcode-of-life/​map-Illumina-raw-45/​consensus-6 on campusrocks. ​ It has 23,642 bases.
 +
 +
 +====== Mitochondrion ======
 The mitochondrion was assembled by Kevin Karplus in the assemblies/​slug/​barcode-of-life/​ directory. The reason for the strange name for the directory was that at first the attempt was just to recover the COX1 gene that is used for the [[http://​www.boldsystems.org/​|BOLD (barcode of life database)]] project to characterize eukaryotes by their mitochondrial sequences. ​ When it became clear that the whole mitochondrial genome was well covered in the Illumina data, the project switched to trying to reconstruct the full mitochondrial genome. The mitochondrion was assembled by Kevin Karplus in the assemblies/​slug/​barcode-of-life/​ directory. The reason for the strange name for the directory was that at first the attempt was just to recover the COX1 gene that is used for the [[http://​www.boldsystems.org/​|BOLD (barcode of life database)]] project to characterize eukaryotes by their mitochondrial sequences. ​ When it became clear that the whole mitochondrial genome was well covered in the Illumina data, the project switched to trying to reconstruct the full mitochondrial genome.
  
Line 21: Line 29:
    * Iterating search and abyss assembly does not lengthen the large contig. ​ Cleaning up and calling the consensus with bwa+samtools+bcftools doesn'​t change things much either. ​ There seems to be a large variation in coverage (from 20x to 2300x, with a median of 225x), so I suspect that there is a repeat region at the beginning of the current contig that may have 10 repeats in it.    * Iterating search and abyss assembly does not lengthen the large contig. ​ Cleaning up and calling the consensus with bwa+samtools+bcftools doesn'​t change things much either. ​ There seems to be a large variation in coverage (from 20x to 2300x, with a median of 225x), so I suspect that there is a repeat region at the beginning of the current contig that may have 10 repeats in it.
  
-Alternating finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases. ​ Eventually, I wrote a new program (look-for-exit ​[needs its own page FIXME]) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers. ​ I was eventually able to close the circle this way, and get a complete genome, though there is one repeat region with long repeats (about a dozen copies of a 615±1 long repeat) that I could not order, because the differences between repeats were far enough apart that I couldn'​t disambiguate the order with the [[bioinformatic_tools:​bwa#​determining_paired-end_insert_size|short fragment lengths]] of the data available. ​ I think I have all the variants of repeat, but in some cases I can't even tell which first half of the repeat goes with which second half.+Alternating finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases. ​ Eventually, I wrote a new program ([[bioinformatic_tools:​pluck-scripts:​look-for-exit|look-for-exit]]) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers. ​ I was eventually able to close the circle this way, and get a complete genome, though there is one repeat region with long repeats (about a dozen copies of a 615±1 long repeat) that I could not order, because the differences between repeats were far enough apart that I couldn'​t disambiguate the order with the [[bioinformatic_tools:​bwa#​determining_paired-end_insert_size|short fragment lengths]] of the data available. ​ I think I have all the variants of repeat, but in some cases I can't even tell which first half of the repeat goes with which second half.
  
 At some point in the process, I rotated the genome to correspond to the closest previous mitochondrial genome: //​Biomphalaria glabrata// strain M, a gastropod. At some point in the process, I rotated the genome to correspond to the closest previous mitochondrial genome: //​Biomphalaria glabrata// strain M, a gastropod.
Line 30: Line 38:
  
 We plan to use PCR to amplify parts of the repeat region and do Sanger sequencing to confirm the sequence on those blocks. We plan to use PCR to amplify parts of the repeat region and do Sanger sequencing to confirm the sequence on those blocks.
-To find distinguishing features in the repeat region to design primers, the [[look-for-exit]] program was used to walk forward and backward through the repeat, looking for alternative paths that had significant read support. ​ All the variants were recorded in README files (in assemblies/​slug/​barcode-of-life/​map-Illumina-raw-42/ ​ and assemblies/​slug/​barcode-of-life/​map-Illumina-raw-45/​) and look-for-exit was used to build putative single copies of repeats from each of the observed variants. ​+To find distinguishing features in the repeat region to design primers, the [[bioinformatic_tools:​pluck-scripts:​look-for-exit|look-for-exit]] program was used to walk forward and backward through the repeat, looking for alternative paths that had significant read support. ​ All the variants were recorded in README files (in assemblies/​slug/​barcode-of-life/​map-Illumina-raw-42/ ​ and assemblies/​slug/​barcode-of-life/​map-Illumina-raw-45/​) and look-for-exit was used to build putative single copies of repeats from each of the observed variants.  
 + 
 +The repeat region starts at position 7037 in draft1, with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT 
 +There are 12 full (615±1 long) repeats there, then a partial one (456 long) that ends with  
 +CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT at 14867.
  
 Details of the PCR strategy are still being worked out, trying to minimize the effort and cost of the wet-lab work.  Details of the PCR strategy are still being worked out, trying to minimize the effort and cost of the wet-lab work. 
  
-=== Blocks in the repeats ===+=== Blocks in the long repeats ===
  
 Here are some blocks that appear to be in the repeats (the repeat consists of blocks A,B,C,D,E in order repeated a dozen or so times, ending with block D). Here are some blocks that appear to be in the repeats (the repeat consists of blocks A,B,C,D,E in order repeated a dozen or so times, ending with block D).
  
 Before the repeats: Before the repeats:
-''​+<​code>​
 GTTATTTAAACTAAAGGATTGTGGATCCTTAAATTCCTTGTGATTTATAACTTATTATATTTTATACTAACTCTATCGAATAGTGTATATTAGAGCAATAAAAATTTGAGTATATATTATAGAAAAGAATATATATACTAGATTTATTAA GTTATTTAAACTAAAGGATTGTGGATCCTTAAATTCCTTGTGATTTATAACTTATTATATTTTATACTAACTCTATCGAATAGTGTATATTAGAGCAATAAAAATTTGAGTATATATTATAGAAAAGAATATATATACTAGATTTATTAA
-''​ +</​code>​ 
- +The long repeat blocks (one each of A, B, C, D, E): 
-''​+<​code>​
 A1:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,,,,,,,​a,,​a,,,,,,,,​t,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,,,,,,,​ A1:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,,,,,,,​a,,​a,,,,,,,,​t,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,,,,,,,​
 A1:​CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTACTACATATTTTTATAAATTAACTATACTATAATTCTTAAAGGAAAATA A1:​CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTACTACATATTTTTATAAATTAACTATACTATAATTCTTAAAGGAAAATA
Line 61: Line 73:
 A6:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​g,,,,,,,,,,,,​t,,​a,,,,,,,,​+t,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,,,,,,,​ A6:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​g,,,,,,,,,,,,​t,,​a,,,,,,,,​+t,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,,,,,,,​
 A6:​CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAGGAAAAGAATTTTTCTACATATTTTTTATAAATTAACTATACTATAATTCTTAAAGGAAAATA A6:​CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAGGAAAAGAATTTTTCTACATATTTTTTATAAATTAACTATACTATAATTCTTAAAGGAAAATA
-''​ 
  
-''​+
 B1:,,,,,,,,​c,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,​at,,,​c,,,,​c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​ B1:,,,,,,,,​c,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,​at,,,​c,,,,​c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​
 B1:​TTTTTATGCACTATCATATGTTTTTAGGTTAAATCTTTAATAATATACTAATCTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT B1:​TTTTTATGCACTATCATATGTTTTTAGGTTAAATCTTTAATAATATACTAATCTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT
Line 87: Line 98:
 B8:,,,,,,,,​t,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,​gt,,,​c,,,,​c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​ B8:,,,,,,,,​t,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,​gt,,,​c,,,,​c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​
 B8:​TTTTTATGTACTATCATATGTTTTTAGGTTAAATCTTTAATAGTATACTAATCTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT B8:​TTTTTATGTACTATCATATGTTTTTAGGTTAAATCTTTAATAGTATACTAATCTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT
-''​ 
  
-''​+
 C1:,,,,,,,,,,,,,,,,,​-ta,,,,,,,,,,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,​gt,,​t,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,​c,,,,,,,,​ C1:,,,,,,,,,,,,,,,,,​-ta,,,,,,,,,,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,​gt,,​t,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,​c,,,,,,,,​
 C1:​CATGCTTTTAATCTATTTTAAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAATATAGAGTATTACTGTAGAAATAAATAAAATTTTGGTAGGCGATTATTT C1:​CATGCTTTTAATCTATTTTAAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAATATAGAGTATTACTGTAGAAATAAATAAAATTTTGGTAGGCGATTATTT
Line 113: Line 123:
 C8:,,,,,,,,,,,,,,,,,,,​ta,,,,,,,,,,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,​gt,,​c,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,​t,,,,,,,,​ C8:,,,,,,,,,,,,,,,,,,,​ta,,,,,,,,,,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,​gt,,​c,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,​t,,,,,,,,​
 C8:​CATGCTTTTAATCTATTTTTAAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAATATAGAGTATCACTGTAGAAATAAATAAAATTTTGGTAGGTGATTATTT C8:​CATGCTTTTAATCTATTTTTAAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAATATAGAGTATCACTGTAGAAATAAATAAAATTTTGGTAGGTGATTATTT
-''​ 
  
-''​+
 D1:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​-t,,,,,,,,,,,​tt,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​g,,,,,,,,,,,,,,,,,,,,,,​at,,,,,,,,,,,,,,​t,,,,,​ D1:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​-t,,,,,,,,,,,​tt,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​g,,,,,,,,,,,,,,,,,,,,,,​at,,,,,,,,,,,,,,​t,,,,,​
 D1:​AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTATAGTATATTTTCATAATAATATTTTTTTAATTGTTTTATTAGTTGTGAACTCCGTAAATGTGGAGTAATATAAGAATTTAACATATGTT D1:​AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTATAGTATATTTTCATAATAATATTTTTTTAATTGTTTTATTAGTTGTGAACTCCGTAAATGTGGAGTAATATAAGAATTTAACATATGTT
Line 133: Line 142:
 D8:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,​tc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,,,,,,,,,,,,,,,,,​gc,,,,,,,,,,,,,,​t,,,,,​ D8:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,​tc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,,,,,,,,,,,,,,,,,​gc,,,,,,,,,,,,,,​t,,,,,​
 D8:​AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTTATAGTATATTTCCATAATAATATTTTTTTAATTGTTTTATTAGTTATGAACTCCGTAAATGTGGAGTAGCATAAGAATTTAACATATGTT D8:​AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTTATAGTATATTTCCATAATAATATTTTTTTAATTGTTTTATTAGTTATGAACTCCGTAAATGTGGAGTAGCATAAGAATTTAACATATGTT
-''​ 
  
-''​+
 E1:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,,,,,,,,​t,,,,,,​c,,,,,,,,,,,​a,​t,,,,,,,,,​c,,,,,,,,,,,,,,,,​t,,​a,,,​g,​c,​c,​c,​c,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,​c,,,,,,,,,,,,​ E1:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,,,,,,,,​t,,,,,,​c,,,,,,,,,,,​a,​t,,,,,,,,,​c,,,,,,,,,,,,,,,,​t,,​a,,,​g,​c,​c,​c,​c,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,​c,,,,,,,,,,,,​
 E1:​GAGTATTAAATATAGAAGATAATATTTAATAAATTGTATTTTAGTATTTTAGTCTTTATTTATTTATTATATATAATCATTTAGAATTCTTATATTTAAATGACACACACTATAAATAGTATTATTAAAATTTATATTAACATATTTAAATAG E1:​GAGTATTAAATATAGAAGATAATATTTAATAAATTGTATTTTAGTATTTTAGTCTTTATTTATTTATTATATATAATCATTTAGAATTCTTATATTTAAATGACACACACTATAAATAGTATTATTAAAATTTATATTAACATATTTAAATAG
Line 159: Line 167:
 E8:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,,,,,,,,​t,,,,,,​c,,,,,,,,,,,​a,​a,,,,,,,,,​c,,,,,,,,,,,,,,,,​t,,​a,,,​g,​c,​c,​c,​c,,,,,,,,,,,,,,,,,,,,,,,,,,,,​t,​t,,,,,,,,,,,,​ E8:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,,,,,,,,​t,,,,,,​c,,,,,,,,,,,​a,​a,,,,,,,,,​c,,,,,,,,,,,,,,,,​t,,​a,,,​g,​c,​c,​c,​c,,,,,,,,,,,,,,,,,,,,,,,,,,,,​t,​t,,,,,,,,,,,,​
 E8:​GAGTATTAAATATAGAAGATAATATTTAATAAATTGTATTTTAGTATTTTAGTCTTTATTTATTTATAATATATAATCATTTAGAATTCTTATATTTAAATGACACACACTATAAATAGTATTATTAAAATTTATATTTATATATTTAAATAG E8:​GAGTATTAAATATAGAAGATAATATTTAATAAATTGTATTTTAGTATTTTAGTCTTTATTTATTTATAATATATAATCATTTAGAATTCTTATATTTAAATGACACACACTATAAATAGTATTATTAAAATTTATATTTATATATTTAAATAG
- +</​code>​
-''​+
  
 Last block (D8 until the last 6 bases) Last block (D8 until the last 6 bases)
  
-''​+<​code>​
 D9:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,​tc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,,,,,,,,,,,,,,,,,​gc,,,,,,,,,,,,,,​c,,​aa,​ D9:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​t,,,,,,,,,,,,,,​tc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,​a,,,,,,,,,,,,,,,,,,,,,,​gc,,,,,,,,,,,,,,​c,,​aa,​
 D9:​AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTTATAGTATATTTCCATAATAATATTTTTTTAATTGTTTTATTAGTTATGAACTCCGTAAATGTGGAGTAGCATAAGAATTTAACATATGTT D9:​AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTTATAGTATATTTCCATAATAATATTTTTTTAATTGTTTTATTAGTTATGAACTCCGTAAATGTGGAGTAGCATAAGAATTTAACATATGTT
     ​     ​
 AGATATAATAAAATTATAATTTATAGGCTATAAATTTAAAATATTTATTTAAGTTTTTATTAATTTAGTACTATTTATATATGCAGATATGCCTCATATATCCCA AGATATAATAAAATTATAATTTATAGGCTATAAATTTAAAATATTTATTTAAGTTTTTATTAATTTAGTACTATTTATATATGCAGATATGCCTCATATATCCCA
-''​+</​code>​
  
 There are several links between blocks (particularly E, A, and B) that can be made even with the short-read data, but the long stretches with few SNPs in blocks B, C, and D make disambiguating the order of the repeats difficult. ​ I don't believe that any of the blocks here are chimeric, but some have low enough coverage that they may be just a result of sequencing error, rather than real variants in the genome. There are several links between blocks (particularly E, A, and B) that can be made even with the short-read data, but the long stretches with few SNPs in blocks B, C, and D make disambiguating the order of the repeats difficult. ​ I don't believe that any of the blocks here are chimeric, but some have low enough coverage that they may be just a result of sequencing error, rather than real variants in the genome.
Line 177: Line 184:
 [Note:D6 and D7 were eliminated—they turned out to be identical to D1, but the bases had been miscounted in doing the SNP marking.] [Note:D6 and D7 were eliminated—they turned out to be identical to D1, but the bases had been miscounted in doing the SNP marking.]
  
-=== Primer design ​===+== Primer design ==
  
 We need primers that are of reasonable lengths (20–30 bases) and that uniquely identify particular blocks. Ideally, there should be several differences in sequence within the primer, since having only a single mismatch could result in PCR amplifying the wrong target. ​ We may not need a primer for every block, as Sanger sequencing can read through an entire repeat. ​ It may be enough just to have primers for the more easily distinguished blocks, measuring the lengths between them and sequencing the ones that have unique lengths. We need primers that are of reasonable lengths (20–30 bases) and that uniquely identify particular blocks. Ideally, there should be several differences in sequence within the primer, since having only a single mismatch could result in PCR amplifying the wrong target. ​ We may not need a primer for every block, as Sanger sequencing can read through an entire repeat. ​ It may be enough just to have primers for the more easily distinguished blocks, measuring the lengths between them and sequencing the ones that have unique lengths.
Line 204: Line 211:
  
 One test we should make with the PCR product is to measure its length. ​ If there are several different products amplified, with length differences around 615 bases, then we know that the DNA contains mitochondrial genomes with different numbers of repeats, and so the problem needs to be addressed as a meta-assembly problem. ​ If only one length comes out, I expect there to be a consensus genome that we can assemble. ​ One test we should make with the PCR product is to measure its length. ​ If there are several different products amplified, with length differences around 615 bases, then we know that the DNA contains mitochondrial genomes with different numbers of repeats, and so the problem needs to be addressed as a meta-assembly problem. ​ If only one length comes out, I expect there to be a consensus genome that we can assemble. ​
 +
 +=== Blocks in the short repeats ===
 +
 +After I had gotten all the variants of the long repeats into the draft genome, I used look-for-exit on the whole mitochondrial genome to see if there were any variants that were missing. ​ In this way I found another group of repeats, about 150 bases upstream from the long repeats, each about 185 long.
 +
 +I used look-for-exit to find the variants of these repeats, and found several possible variants. ​ I was only able to order a few of them, as walking forward or backward from most blocks with the longest extension that kept at least one read resulted in going to the generic block (R99) from most blocks. ​ R01 is first and R03->​R18. ​ R15, R16, and R17 back up to R15 rather than R99, which means that they are probably preceded by R15, R07, R14, or R04.
 +
 +<​code>​
 +R01: .        .   ​. ​             t       ​a ​            ​. ​ .  .   ​. ​         .                      .   ​. ​   .      .     ​. ​                     ...              ..               ​. ​            
 +R01: GTTATTATTGAAGTTTATTAACGTAAAGTTGTAACTATAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA  ​
 +R02: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         .                      .   ​. ​   .      .     ​. ​                     ...              .a               ​. ​            
 +R02: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA  ​
 +R03: .        a   ​. ​             .       ​. ​            ​. ​ .  c   ​. ​         .                      .   ​. ​   a      .     ​-(at) ​               ...              .t               ​t ​              
 +R03: GTTATTATTAAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTACAATAATGTTTAATACACTTATCTTATTAGTCTTTATTATACCATTTAATAATAATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATTTATAATATATTATTTTCATTATATATA ​   ​
 +R04: .        .   ​g ​             .       ​. ​            ​. ​ .  .   ​-a ​       t                      .   ​. ​   .      .     ​. ​                     ...              .a               ​. ​             ​
 +R04: GTTATTATTGAAGGTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA ​  
 +R05: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         .                      .   ​-c ​ .      .     ​. ​                     ...              .a               ​. ​             ​
 +R05: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA ​  
 +R06: .        .   ​. ​             .       ​a ​            ​. ​ .  .   ​. ​         .                      .   ​. ​   .      .     ​. ​                     a..              ..               ​. ​            
 +R06: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTATAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTATATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA  ​
 +R07: .        .   ​. ​             .       ​a ​            ​. ​ .  .   ​. ​         t                      .   ​. ​   .      .     ​. ​                     ...              .a               ​. ​            
 +R07: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTATAAAAATATCTCTTTACTATAATAATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA  ​
 +R08: .        .   ​. ​             .      -t            .  .  .   ​. ​         .                      .   ​. ​   .      .     ​. ​                     ...              g.               ​. ​             ​
 +R08: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTAGCTATAATATATTATTTCCATTATATATA ​  
 +R09: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         .                      .   ​. ​   .      .     ​. ​                     ...              g.               ​. ​            
 +R09: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTAGCTATAATATATTATTTCCATTATATATA  ​
 +R10: .        .   ​. ​             .      -t            .  .  .   ​. ​         .                      .   ​. ​   .      g     ​. ​                     ...              ..               ​. ​             ​
 +R10: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAGTATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA ​  
 +R11: .        .   ​. ​             .      -t           -t .  .   ​. ​         .                      .   ​. ​   .      .     ​. ​                     ...              ..               ​. ​              
 +R11: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTAAAAATATCTCTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA ​   ​
 +R12: .        .   ​. ​             .      -t            .  .  .   ​. ​         t                      .   ​. ​   .      .     ​. ​                     ...              ..               ​. ​             ​
 +R12: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTAAAAATATCTCTTTACTATAATAATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA ​  
 +R13: .        .   ​. ​             .       ​. ​            ​. ​ t  .   ​. ​         .                      g   ​t ​   .      g     ​. ​                     ...              .a               ​. ​            
 +R13: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTATTATAATAATGTTTAATACATTTATCTTATTAATCTTTATTGTACTATTTGATAATAGTATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA  ​
 +R14: .        .   ​g ​             .       ​. ​            ​. ​ .  .   ​-a ​       t                      .   ​. ​   .      .     ​. ​                     ...              .a               ​. ​             ​
 +R14: GTTATTATTGAAGGTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA ​  
 +R15: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         t                      .   ​. ​   .      .     ​. ​                     ...              .a               ​. ​            
 +R15: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA  ​
 +R16: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         t                      .   ​. ​   .      .     ​g ​                     ...              ..               ​. ​            
 +R16: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATGTCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA  ​
 +R17: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         t                      .   ​. ​   .      .     ​. ​                     ..g              .a               ​. ​            
 +R17: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTGTAGTATTATTACTATATATAATATATTATTTCCATTATATATA  ​
 +R18: a        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         .                      .   ​. ​   .      .     ​. ​                     ...              ..               ​. ​            
 +R18: ATTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA  ​
 +R19: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         .                      .   ​. ​   .      .     ​. ​                     .g.              ..               ​a ​            
 +R19: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGGATAGTATTATTACTATCTATAATATATTATTTACATTATATATA  ​
 +R20: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         .                      .   ​. ​   .      .     ​. ​                     ...              ..               ​a ​          
 +R20: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTACATTATATATA
 +R99: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         .                      .   ​. ​   .      .     ​. ​                     . .              ..               ​. ​          
 +R99: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA
 +</​code>​
 +
 +Note: I fixed the R14 annotation—it is actually identical to R04.
 +
 +When I put these blocks into the genome, look-for-exit found no variants needed other than a C->T SNP a little earlier than the short repeats that had 7 reads supporting it.  As this was less than 1.5% of the reads for that location, it is quite likely a sequencing error rather than a true variant.
 +
 +For the short repeats, I expect around 32 copies based on expected coverage of 176 and 5671 copies of ATCTTATTAGTCTTTATTA. Of course, some of the repeat variants I have may just be sequencing errors, and there may be some exact repetitions of blocks.
 +
 +== Primer design ==
 +
 +The best pair of primers I've found for the short repeat region (using the SantaLucia 1998 parameters, aiming for 55 degrees C with a temp difference of <3 degrees and a 1-base GC clamp) is
 +<​code>​
 +  OLIGO            start  len      tm     ​gc% ​  ​any ​   3' seq
 +  LEFT PRIMER ​       114   ​31 ​  ​55.20 ​  ​35.48 ​ 6.00  2.00 CGAGTATACAATATTTAGTTGTGCTAAGTGC
 +  RIGHT PRIMER ​     4174   ​27 ​  ​53.85 ​  ​37.04 ​ 6.00  2.00 TCACAAGGAATTTAAGGATCCACAATC
 +    PRODUCT SIZE: 4061, PAIR ANY COMPL: 5.00, PAIR 3' COMPL: 0.00
 +</​code>​
  
  
Line 243: Line 317:
 The most frequent 19-mer in the subset occurs 6035 times in the full set (so I may be missing 272 copies in the subset), but there are almost 209,000 more common 19-mers, so selecting by frequency would have gotten me mostly low-complexity junk, not mitochondrial sequence. ​     ​ The most frequent 19-mer in the subset occurs 6035 times in the full set (so I may be missing 272 copies in the subset), but there are almost 209,000 more common 19-mers, so selecting by frequency would have gotten me mostly low-complexity junk, not mitochondrial sequence. ​     ​
  
-After cleaning the mitochondrial reads with [[bioinformatic_tools:​jellyfish|jellyfish]] and [[bioinformatic_tools:​quake|quake]],​  +After cleaning the mitochondrial reads with [[bioinformatic_tools:​jellyfish|jellyfish]] and [[bioinformatic_tools:​quake|quake]], ​in map-Illumina-raw-45/​ we have 
-  clean_19_dir/​merged_1.fastq has 1253271 ​bases in 16885 reads. +    clean_19_dir/​merged.fastq has 2,​860,​095 ​bases in 26,​498 ​reads.  
-  clean_19_dir/​merged_2.fastq has 1252843 ​bases in 16885 reads. +    clean_19_dir/​merged_1.fastq has 1,​253,​271 ​bases in 16,​885 ​reads. 
-  clean_19_dir/​merged.fastq has 2860095 ​bases in 26498 reads.  +    clean_19_dir/​merged_2.fastq has 1,​252,​843 ​bases in 16,​885 ​reads. 
-  Total: 5,366,209 bases in 60,268 reads. ​+    Total: 5,366,209 bases in 60,268 reads. ​
  
 The estimated coverage of the genome from fitting a gamma distribution to the 19-mers is about 216x, giving a genome-size estimate of 24,844 bases, not far from the 23642 I currently have---perhaps there are a couple more repeats than I've been estimating? ​ But the accuracy of the genome size estimates is not likely to be that precise. ​ The estimated coverage of the genome from fitting a gamma distribution to the 19-mers is about 216x, giving a genome-size estimate of 24,844 bases, not far from the 23642 I currently have---perhaps there are a couple more repeats than I've been estimating? ​ But the accuracy of the genome size estimates is not likely to be that precise. ​
  
-===== Mitochondrial ​sequence ​=====+In map-Illumina-raw-46/,​ where the constraint on insert size was increased to keep bwa from throwing out discordant pairs, and more variants of the repeats were in the draft that was mapped to, 
 +  look-for-exit merged.fastq merged_[12].fastq 
 +counted 
 +    merged.fastq has 3,027,249 bases in 27,429 reads. 
 +    merged_1.fastq has 1,360,645 bases in 17,875 reads. 
 +    merged_2.fastq has 1,359,545 bases in 17,875 reads. 
 +    Total: 5,747,439 bases in 63,179 reads. ​  
 + 
 +After cleaning with quake to correct 19-mers, we have 
 +    clean_19_dir/​merged.fastq has 3,044,997 bases in 28,051 reads. 
 +    clean_19_dir/​merged_1.fastq has 1,259,044 bases in 16,974 reads. ​            
 +    clean_19_dir/​merged_2.fastq has 1,258,440 bases in 16,974 reads.  
 +    Total: 5,562,481 bases in 61,999 reads. ​                                     
 + 
 +I also counted the 31-mers in the cleaned reads, and got 3 estimates of the genome size: 
 +  * 18.2k  ​20,​945-2,​761 distinct 31-mers (excluding those with support<​10) 
 +  * 21.5k  ​3702511/​172.3 total 31-mers/​coverage estimate from gamma fit 
 +  * 21.5k  ​5562481 * (89.72-30)/​89.72 / 172.3 (total bases correct for estimated number of 31-mers based on average read length, divided by coverage). 
 +The last 2 estimates are using the same data, so are not really independent estimates, just different ways of computing the same length. 
 +(The corrected coverage estimate 89.72/​(89.72-30)*172.3 is about 259x.) 
 + 
 +I have other estimates of coverage. ​ For example, if I count just coverage in bases 500 to 5500 (well before the repeat region and far enough from the ends to avoid edge effects), I get an estimate of 274x from the pileup output, using 
 +  awk '​$2<​5500 && $2>=500 {sum+=$4}; END {print sum/​5000.}'​ < merged.pileup ​       
 +Using this estimate with the 5,747,439 bases of the uncleaned reads, we get a genome length of 21k. 
 + 
 +Looking at the statistics from samtools in map-Illumina-raw-46/,​ we see more sequences than in map-Illumina-raw-45/,​ but still a large fraction not properly paired. 
 +<​code>​ 
 +samtools flagstat merged.bam 
 +63179 + 0 in total (QC-passed reads + QC-failed reads) 
 +0 + 0 duplicates 
 +62625 + 0 mapped (99.12%:​nan%) 
 +35750 + 0 paired in sequencing 
 +17875 + 0 read1 
 +17875 + 0 read2 
 +32888 + 0 properly paired (91.99%:​nan%) 
 +34642 + 0 with itself and mate mapped 
 +554 + 0 singletons (1.55%:​nan%) 
 +0 + 0 with mate mapped to a different chr 
 +0 + 0 with mate mapped to a different chr (mapQ>​=5) 
 +</​code>​ 
 + 
 + 
 + 
 +The second draft sequence ​(with the short repeats expanded and the repeats ordered as best I can from the short-insert reads) is available as {{mitochondrion-draft2.fasta.gz|draft2 gzipped fasta file}}. ​ This corresponds to /​campusdata/​BME235/​assemblies/​slug/​barcode-of-life/​map-Illumina-raw-47/​draft on campusrocks. ​ It has 36363 bases.
  
-The first draft sequence is available as {{mitochondrion-draft1.fasta.gz|gzipped fasta file}}. ​ This corresponds to /​campusdata/​BME235/​assemblies/​slug/​barcode-of-life/​map-Illumina-raw-45/​consensus-6 on campusrocks. 
  
 +**I no longer believe that this sequence is correct. ​ I'm now fairly sure that the region that I assembled as repeats should actually have been assembled as variants from a heterogeneous population of mitochondria. ​ We'll need long reads or PCR products to tell for sure. **
 ===== Annotation ===== ===== Annotation =====
  
 I sent the sequence to [[http://​dogma.ccbb.utexas.edu/​]] for annotation, and it quickly created a very crude web interface that almost works. ​ Unfortunately,​ it seems to have missed some of the protein genes and it provides no way to download its annotation in a GENBANK format. ​ Since there are hundreds of tRNA genes (the repeat region is full of them), I'm not tempted to try screen-scraping. ​ A better way to annotate the mitochondrion needs to be found. I sent the sequence to [[http://​dogma.ccbb.utexas.edu/​]] for annotation, and it quickly created a very crude web interface that almost works. ​ Unfortunately,​ it seems to have missed some of the protein genes and it provides no way to download its annotation in a GENBANK format. ​ Since there are hundreds of tRNA genes (the repeat region is full of them), I'm not tempted to try screen-scraping. ​ A better way to annotate the mitochondrion needs to be found.
                                                                                                                                                                                                                    
- 
-The repeat region starts at position 7037 in draft1, with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT 
-There are 12 full (615±1 long) repeats there, then a partial one (456 long) that ends with  
-CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT at 14867. 
- 
 The closest sequenced mitochondrial genomes at NCBI are The closest sequenced mitochondrial genomes at NCBI are
  
Line 281: Line 393:
  
 It would be useful to find some spots that are highly conserved between Biomphalaria and Ariolimax, to place primers for extracting mitochondrial DNA by long-range PCR.  That would make a population study of various Ariolimax species by sequencing their mitochondria quite tractable. It would be useful to find some spots that are highly conserved between Biomphalaria and Ariolimax, to place primers for extracting mitochondrial DNA by long-range PCR.  That would make a population study of various Ariolimax species by sequencing their mitochondria quite tractable.
 +
 +==== Protein genes ====
 +
 +I can find 12 of the 13 standard mitochondrial protein genes using Blast to labeled mitochondria,​ and [[http://​www.ncbi.nlm.nih.gov/​projects/​gorf/​orfig.cgi|ORFfinder]] finds ORFs of about the right length for each.  The missing gene (ATP8) is often missing in mollusks. ​ There is some concern about whether the ORFs found by ORFfinder are correct, as some mollusks have RNA processing in their mitochondrial genes to insert the stop codon.
 +
 +I'll annotate the ORFs as best I can (assuming no post-transcriptional mRNA processing) once the repeats have been resolved.
 +
 +==== Ribosomal RNA genes ====
 +
 +I found both the small subunit rRNA gene and the large subunit rRNA gene using BLAST to a few annotated mollusk mitochondria. I'm not sure how one gets precise ends for these RNA genes, without wet-lab work to get cDNA.
 +
 +==== tRNA genes ====
 +
 +I looked for tRNA genes with [[http://​lowelab.ucsc.edu/​tRNAscan-SE/​|tRNAscan-SE]] run locally, and only found 7, but tRNAscan is known to have problems with mitochondial tRNAs. ​ I'll work with Todd Lowe later this summer to try to improve the tRNAscan covariance models to handle them better.
 +
 +The 7 easy-to-find tRNA genes are Asp-GTC, SeC-TCA, Val-TAC, Pro-TGG, Ala-TGC, Thr-TGT, Met-CAT. ​ These are not correctly labeled for the mitochondrial genetic code:  TCA is codon TGA which is Trp in code 5, though the others are ok.
 +
 +I expect to find many copies of tRNA genes, as the repeat regions are in a tRNA-rich area of homologous mitochondria.