User Tools

Site Tools


assemblies:2011:mitochondrion_assembly

This is an old revision of the document!


A PCRE internal error occured. This might be caused by a faulty plugin

====== Mitochondrial sequence ====== The mitochondrion was re assembled in 2015, follow the new assembly here, [[assemblies::2015::mitochondrion_assembly | 2015 Mitochondrion assembly ]]. This page is for the 2012 mitochondrial assembly. The first draft sequence is available as {{{{:computer_resources:assemblies:mitochondrion-draft1.fasta.gz|draft1 gzipped fasta file}}. This corresponds to /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-45/consensus-6 on campusrocks. It has 23,642 bases. ====== Mitochondrion ====== The mitochondrion was assembled by Kevin Karplus in the assemblies/slug/barcode-of-life/ directory. The reason for the strange name for the directory was that at first the attempt was just to recover the COX1 gene that is used for the [[http://www.boldsystems.org/|BOLD (barcode of life database)]] project to characterize eukaryotes by their mitochondrial sequences. When it became clear that the whole mitochondrial genome was well covered in the Illumina data, the project switched to trying to reconstruct the full mitochondrial genome. ===== Method ===== There were many iterations and many different attempts at assembling the mitochondrial genome. Most of the iterations consisted of taking some draft genome, mapping all the Illumina reads to it using bwa, and selecting out all reads that mapped and their paired ends (even if the pairs didn't map), and reassembling those reads. The initial attempts used just some contigs from the previous year's (2010) attempt at a whole-genome assembly using SOAPdenovo. Later, I found that the 2011 SOAPdenovo assembly had most of the genome in one contig (ending at two repeat regions). Both SOAPdenovo and abyss were used to try to assemble the reads. Generally, abyss got somewhat longer longest contigs, and the two assemblers agreed on the parts they had in common. * Started with a search of SOAPdenovo-assembly1/k31/soapSlug.scafSeq for scaffolds that matched examples from other mollusks. * Looked for 454 reads that extended or joined contigs in scaffold * Repeated (sometimes using more sensitive searches) until no more credible scaffolds from the SOAPdenovo-assembly1/k31/ assembly nor 454 reads were found. * The 454 coverage of the mitochondrion is so slight as to be nearly useless, so instead we can iterate: - find all Illumina reads that map to the mitochondrial draft, using BWA - assemble them using SOAPdenovo. * It looks like the Illumina reads have about 228x coverage of the mitochondrion, but coverage is patchy, and it seems to be difficult to close the circle (at least with SOAPdenovo). * It turns out that a lot of the hard hand work and iterated searching to assemble the mitochondrion was not necessary, as the SOAPdenovo-assembly2/all/k63/illumina-454-all_63-mers.scafSeq assembly now has a 14960-long contig (not scaffold!) which is an almost-full-length mitochondrion, roughly as good as the best I'd managed to assemble from the SOAPdenovo-assembly1/ bits. * Iterating mapping reads with BWA and assembling them with SOAPdenovo made some progress, but there was a gap that just wouldn't close. * Switching to abyss (version 1.2.7) for the assembly of the reads made a much larger contig (15535-long after pasting on a suggestion from one abyss assembly onto another). * Iterating search and abyss assembly does not lengthen the large contig. Cleaning up and calling the consensus with bwa+samtools+bcftools doesn't change things much either. There seems to be a large variation in coverage (from 20x to 2300x, with a median of 225x), so I suspect that there is a repeat region at the beginning of the current contig that may have 10 repeats in it. Alternating finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases. Eventually, I wrote a new program ([[bioinformatic_tools:pluck-scripts:look-for-exit|look-for-exit]]) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers. I was eventually able to close the circle this way, and get a complete genome, though there is one repeat region with long repeats (about a dozen copies of a 615±1 long repeat) that I could not order, because the differences between repeats were far enough apart that I couldn't disambiguate the order with the [[bioinformatic_tools:bwa#determining_paired-end_insert_size|short fragment lengths]] of the data available. I think I have all the variants of repeat, but in some cases I can't even tell which first half of the repeat goes with which second half. At some point in the process, I rotated the genome to correspond to the closest previous mitochondrial genome: //Biomphalaria glabrata// strain M, a gastropod. I used bwa with mpileup to make new consensus sequences, and iterated that a few times to get about as good an assembly as I can without some longer fragments or some PCR to determine the order. ==== Finishing the genome ==== We plan to use PCR to amplify parts of the repeat region and do Sanger sequencing to confirm the sequence on those blocks. To find distinguishing features in the repeat region to design primers, the [[bioinformatic_tools:pluck-scripts:look-for-exit|look-for-exit]] program was used to walk forward and backward through the repeat, looking for alternative paths that had significant read support. All the variants were recorded in README files (in assemblies/slug/barcode-of-life/map-Illumina-raw-42/ and assemblies/slug/barcode-of-life/map-Illumina-raw-45/) and look-for-exit was used to build putative single copies of repeats from each of the observed variants. The repeat region starts at position 7037 in draft1, with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT There are 12 full (615±1 long) repeats there, then a partial one (456 long) that ends with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT at 14867. Details of the PCR strategy are still being worked out, trying to minimize the effort and cost of the wet-lab work. === Blocks in the long repeats === Here are some blocks that appear to be in the repeats (the repeat consists of blocks A,B,C,D,E in order repeated a dozen or so times, ending with block D). Before the repeats: <code> GTTATTTAAACTAAAGGATTGTGGATCCTTAAATTCCTTGTGATTTATAACTTATTATATTTTATACTAACTCTATCGAATAGTGTATATTAGAGCAATAAAAATTTGAGTATATATTATAGAAAAGAATATATATACTAGATTTATTAA </code> The long repeat blocks (one each of A, B, C, D, E): <code> A1:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,a,,a,,,,,,,,t,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,, A1:CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTACTACATATTTTTATAAATTAACTATACTATAATTCTTAAAGGAAAATA A2:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,t,,a,,,,,,,,t,,,,,,,,,,,,c,,,,,,,,,,,,,,,,,,,,,,, A2:CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCTACATATTTTTATAAATTAACTACACTATAATTCTTAAAGGAAAATA A3:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,t,,a,,,,,,,,t,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,, A3:CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCTACATATTTTTATAAATTAACTATACTATAATTCTTAAAGGAAAATA A4:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,t,,a,,,,,,,,+t,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,, A4:CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCTACATATTTTTTATAAATTAACTATACTATAATTCTTAAAGGAAAATA A5:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,t,,a,,,,,,,,,+2t,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,, A5:CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCTACATATTTTTTTATAAATTAACTATACTATAATTCTTAAAGGAAAATA A6:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,t,,a,,,,,,,,+t,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,, A6:CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAGGAAAAGAATTTTTCTACATATTTTTTATAAATTAACTATACTATAATTCTTAAAGGAAAATA B1:,,,,,,,,c,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,at,,,c,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, B1:TTTTTATGCACTATCATATGTTTTTAGGTTAAATCTTTAATAATATACTAATCTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT B2:,,,,,,,,c,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,gt,,,c,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, B2:TTTTTATGCACTATCATATGTTTTTAGGTTAAATCTTTAATAGTATACTAATCTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT B3:,,,,,,,,c,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,gt,,,c,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, B3:TTTTTATGCACTATCATATGTTTTTAGGTTAAATCTTTAATAGTATACTAATTTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT B4:,,,,,,,,t,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,gt,,,t,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, B4:TTTTTATGTACTATCATATGTTTTTAGGTTAAATCTTTAATAGTATATTAATCTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT B5:,,,,,,,,t,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,ga,,,t,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, B5:TTTTTATGTACTATCATATGTTTTTAGGTTAAATCTTTAATAGAATATTAATCTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT B6:,,,,,,,,t,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,gt,,,c,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, B6:TTTTTATGTACTATCATATGTTTTTAGGTTAAATCTTTAATAGTATACTAATTTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT B7:,,,,,,,,t,,,,,,,,,,,,,-t,,,,,,,,,,,,,,,,,gt,,,t,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, B7:TTTTTATGTACTATCATATGTTTTAGGTTAAATCTTTAATAGTATATTAATCTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT B8:,,,,,,,,t,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,gt,,,c,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, B8:TTTTTATGTACTATCATATGTTTTTAGGTTAAATCTTTAATAGTATACTAATCTTTTAAATATATTTAATAAGTTACTACTTTAGAAAATTATATTTTAGCTTATTAATTATTT C1:,,,,,,,,,,,,,,,,,-ta,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,gt,,t,,,,,,,,,,,,,,,,,,,,,,,,,,a,,c,,,,,,,, C1:CATGCTTTTAATCTATTTTAAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAATATAGAGTATTACTGTAGAAATAAATAAAATTTTGGTAGGCGATTATTT C2:,,,,,,,,,,,,,,,,,,,ta,,,,,,,,,,,,,,,,,,,,,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,gc,,c,,,,,,,,,,,,,,,,,,,,,,,,,,a,,c,,,,,,,, C2:CATGCTTTTAATCTATTTTTAAAAAATTTGGTCTAACAACACCCACAATAATTTAATTTTCTTCTTAAAATTAATATAGAGCATCACTGTAGAAATAAATAAAATTTTGGTAGGCGATTATTT C3:,,,,,,,,,,,,,,,,,,,ta,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,gt,,c,,,,,,,,,,,,,,,,,,,,,,,,,,a,,c,,,,,,,, C3:CATGCTTTTAATCTATTTTTAAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAATATAGAGTATCACTGTAGAAATAAATAAAATTTTGGTAGGCGATTATTT C4:,,,,,,,,,,,,,,,,,,,ta,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,gt,,c,,,,,,,,,,,,,,,,,,,,,,,,,,g,,c,,,,,,,, C4:CATGCTTTTAATCTATTTTTAAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAATATAGAGTATCACTGTAGAAATAAATAAAATTTTGGTGGGCGATTATTT C5:,,,,,,,,,,,,,,,,,,,ta,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,gt,,c,,,,,,,,,,,,,,,,,,,,,,,,,,a,,c,,,,,,,, C5:CATGCTTTTAATCTATTTTTAAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAGTATAGAGTATCACTGTAGAAATAAATAAAATTTTGGTAGGCGATTATTT C6:,,,,,,,,,,,,,,,,,,,tt,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,at,,c,,,,,,,,,,,,,,,,,,,,,,,,,,a,,c,,,,,,,, C6:CATGCTTTTAATCTATTTTTTAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAATATAGAATATCACTGTAGAAATAAATAAAATTTTGGTAGGCGATTATTT C7:,,,,,,,,,,,,,,,,,-ta,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,gt,,t,,,,,,,,,,,,,,,,,,,,,,,,,,a,,c,,,,,,,, C7:CATGCTTTTAATCTATTTTAAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAGTATAGAGTATTACTGTAGAAATAAATAAAATTTTGGTAGGCGATTATTT C8:,,,,,,,,,,,,,,,,,,,ta,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,gt,,c,,,,,,,,,,,,,,,,,,,,,,,,,,a,,t,,,,,,,, C8:CATGCTTTTAATCTATTTTTAAAAAATTTGGTCTAACAACACCCATAATAATTTAATTTTCTTCTTAAAATTAATATAGAGTATCACTGTAGAAATAAATAAAATTTTGGTAGGTGATTATTT D1:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-t,,,,,,,,,,,tt,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,at,,,,,,,,,,,,,,t,,,,, D1:AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTATAGTATATTTTCATAATAATATTTTTTTAATTGTTTTATTAGTTGTGAACTCCGTAAATGTGGAGTAATATAAGAATTTAACATATGTT D2:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,tc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,ac,,,,,,,,,,,,,,t,,,,, D2:AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTTATAGTATATTTCCATAATAATATTTTTTTAATTGTTTTATTAGTTGTGAACTCCGTAAATGTGGAGTAACATAAGAATTTAACATATGTT D3:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,tc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,gc,,,,,,,,,,,,,,t,,,,, D3:AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTTATAGTATATTTCCATAATAATATTTTTTTAATTGTTTTATTAGTTGTGAACTCCGTAAATGTGGAGTAGCATAAGAATTTAACATATGTT D4:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,ct,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,ac,,,,,,,,,,,,,,t,,,,, D4:AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTTATAGTATATTCTCATAATAATATTTTTTTAATTGTTTTATTAGTTGTGAACTCCGTAAATGTGGAGTAACATAAGAATTTAACATATGTT D5:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,ct,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,at,,,,,,,,,,,,,,t,,,,, D5:AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTTATAGTATATTCTCATAATAATATTTTTTTAATTGTTTTATTAGTTGTGAACTCCGTAAATGTGGAGTAATATAAGAATTTAACATATGTT D8:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,tc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,gc,,,,,,,,,,,,,,t,,,,, D8:AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTTATAGTATATTTCCATAATAATATTTTTTTAATTGTTTTATTAGTTATGAACTCCGTAAATGTGGAGTAGCATAAGAATTTAACATATGTT E1:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,t,,,,,,c,,,,,,,,,,,a,t,,,,,,,,,c,,,,,,,,,,,,,,,,t,,a,,,g,c,c,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,c,,,,,,,,,,,, E1:GAGTATTAAATATAGAAGATAATATTTAATAAATTGTATTTTAGTATTTTAGTCTTTATTTATTTATTATATATAATCATTTAGAATTCTTATATTTAAATGACACACACTATAAATAGTATTATTAAAATTTATATTAACATATTTAAATAG E2:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,t,,,,,,c,,,,,,,,,,,a,t,,,,,,,,,c,,,,,,,,,,,,,,,,t,,a,,,g,c,c,t,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,c,,,,,,,,,,,, E2:GAGTATTAAATATAGAAGATAATATTTAATAAATTGTATTTTAGTATTTTAGTCTTTATTTATTTATTATATATAATCATTTAGAATTCTTATATTTAAATGACACATATTATAAATAGTATTATTAAAATTTATATTAACATATTTAAATAG E3:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-a,,,,,,,,,,,,c,,,,,,c,,,,,,,,,,,a,t,,,,,,,,,t,,,,,,,,,,,,,,,,t,,a,,,a,t,t,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,c,,,,,,,,,,,, E3:GAGTATTAAATATAGAAGATAATATTTAATAATTGTATTTTAGTACTTTAGTCTTTATTTATTTATTATATATAATTATTTAGAATTCTTATATTTAAATAATATACACTATAAATAGTATTATTAAAATTTATATTAACATATTTAAATAG E4:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,t,,,,,,c,,,,,,,,,,,a,t,,,,,,,,,c,,,,,,,,,,,,,,,,a,,a,,,g,c,c,t,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,t,,,,,,,,,,,, E4:GAGTATTAAATATAGAAGATAATATTTAATAAATTGTATTTTAGTATTTTAGTCTTTATTTATTTATTATATATAATCATTTAGAATTCTTATAATTAAATGACACATATTATAAATAGTATTATTAAAATTTATATTTATATATTTAAATAG E5:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,t,,,,,,t,,,,,,,,,,,a,t,,,,,,,,,c,,,,,,,,,,,,,,,,t,,t,,,a,c,c,t,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,t,,,,,,,,,,,, E5:GAGTATTAAATATAGAAGATAATATTTAATAAATTGTATTTTAGTATTTTAGTTTTTATTTATTTATTATATATAATCATTTAGAATTCTTATATTTTAATAACACATACTATAAATAGTATTATTAAAATTTATATTTATATATTTAAATAG E6:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,t,,,,,,c,,,,,,,,,,,a,t,,,,,,,,,c,,,,,,,,,,,,,,,,t,,a,,,g,c,c,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,t,,,,,,,,,,,, E6:GAGTATTAAATATAGAAGATAATATTTAATAAATTGTATTTTAGTATTTTAGTCTTTATTTATTTATTATATATAATCATTTAGAATTCTTATATTTAAATGACACACACTATAAATAGTATTATTAAAATTTATATTTATATATTTAAATAG E7:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-a,,,,,,,,,,,,t,,,,,,c,,,,,,,,,,,g,t,,,,,,,,,c,,,,,,,,,,,,,,,,t,,a,,,g,c,c,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,t,,,,,,,,,,,, E7:GAGTATTAAATATAGAAGATAATATTTAATAATTGTATTTTAGTATTTTAGTCTTTATTTATTTGTTATATATAATCATTTAGAATTCTTATATTTAAATGACACACACTATAAATAGTATTATTAAAATTTATATTTATATATTTAAATAG E8:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,t,,,,,,c,,,,,,,,,,,a,a,,,,,,,,,c,,,,,,,,,,,,,,,,t,,a,,,g,c,c,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,t,,,,,,,,,,,, E8:GAGTATTAAATATAGAAGATAATATTTAATAAATTGTATTTTAGTATTTTAGTCTTTATTTATTTATAATATATAATCATTTAGAATTCTTATATTTAAATGACACACACTATAAATAGTATTATTAAAATTTATATTTATATATTTAAATAG </code> Last block (D8 until the last 6 bases) <code> D9:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,tc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,gc,,,,,,,,,,,,,,c,,aa, D9:AACTCTTTAGCTTTTTTCATAATAATTTTTTTTTTATAGTATATTTCCATAATAATATTTTTTTAATTGTTTTATTAGTTATGAACTCCGTAAATGTGGAGTAGCATAAGAATTTAACATATGTT AGATATAATAAAATTATAATTTATAGGCTATAAATTTAAAATATTTATTTAAGTTTTTATTAATTTAGTACTATTTATATATGCAGATATGCCTCATATATCCCA </code> There are several links between blocks (particularly E, A, and B) that can be made even with the short-read data, but the long stretches with few SNPs in blocks B, C, and D make disambiguating the order of the repeats difficult. I don't believe that any of the blocks here are chimeric, but some have low enough coverage that they may be just a result of sequencing error, rather than real variants in the genome. Some of the more confident links are start->A2, A1->B1, A2->B2, A3->B2+B3+B4, A4->B4+B5+B8, A5->B7, E1+E2+E3->A3, E4->A4+A5, E5->A1, D9->end. [Note:D6 and D7 were eliminated—they turned out to be identical to D1, but the bases had been miscounted in doing the SNP marking.] == Primer design == We need primers that are of reasonable lengths (20–30 bases) and that uniquely identify particular blocks. Ideally, there should be several differences in sequence within the primer, since having only a single mismatch could result in PCR amplifying the wrong target. We may not need a primer for every block, as Sanger sequencing can read through an entire repeat. It may be enough just to have primers for the more easily distinguished blocks, measuring the lengths between them and sequencing the ones that have unique lengths. Another approach we are looking at is to do long-range PCR to amplify the whole repeat region (using unique primers just outside the repeat region), then drop sequencing primers into unique spots within the repeat region, Sanger sequencing backwards and forwards from each unique spot. Each spot would then yield about 2k contiguous bases of the repeat. The crucial thing is that the primers for selecting the region are unique to the mitochondrion (not in nuclear DNA!) and that the long-range PCR is capable of amplifying the whole repeat region. We selected two pairs of primers to test: <code> OLIGO start len tm gc% any 3' seq LEFT PRIMER 24 27 50.14 33.33 8.00 2.00 ACTCTATCGAATAGTGTATATTAGAGC RIGHT PRIMER 15495 25 54.61 48.00 5.00 2.00 GTTTCCTAGAGCCATATAGGTGAGC PRODUCT SIZE: 15472, PAIR ANY COMPL: 5.00, PAIR 3' COMPL: 2.00 OLIGO start len tm gc% any 3' seq LEFT PRIMER 20 31 52.60 32.26 8.00 2.00 ACTAACTCTATCGAATAGTGTATATTAGAGC RIGHT PRIMER 15494 24 53.47 45.83 5.00 2.00 TTTCCTAGAGCCATATAGGTGAGC PRODUCT SIZE: 15475, PAIR ANY COMPL: 5.00, PAIR 3' COMPL: 2.00 </code> The product size is a bogus number, as the primer design was done with many variants of the repeats inserted between the sequences that come before and after the repeat regions. I'm expecting a length more like 8000 bases. The primer design was done with [[http://frodo.wi.mit.edu/primer3/|Primer3]] using the SantaLucia 1998 parameters for thermodynamics and salt correction. Both of these designs were aiming for a 55ºC melting temperature and a length of 25. GC content was required to be at least 30% (higher constraints resulted in no possible left primers in the upstream sequence I had included). Both primer pairs required a 3' GC clamp of 2 bases, and the second set required the melting temperatures to be within 3 degrees of each other. It seems that there about 28 unique spots that we can drop 25-mers in, which should give us pretty thorough coverage of the repeat region, if we can extract it. I'm a little concerned that the coverage of some of the variants is very low. This may mean that there is some variation between different mitochondria, rather than a single unique mitochondrial genome. I expect to see unique short k-mers in the mitochondrion about 150–200 times, but one of the 25-mers I had picked out as distinctive for particular blocks occurred in only 7 reads, and many in 20–100 reads. Only 2 of them had counts in the range I expected. This leads me to some concern about our ability to pick out sequencing primers that identify unique positions in the extracted long-range PCR product. One test we should make with the PCR product is to measure its length. If there are several different products amplified, with length differences around 615 bases, then we know that the DNA contains mitochondrial genomes with different numbers of repeats, and so the problem needs to be addressed as a meta-assembly problem. If only one length comes out, I expect there to be a consensus genome that we can assemble. === Blocks in the short repeats === After I had gotten all the variants of the long repeats into the draft genome, I used look-for-exit on the whole mitochondrial genome to see if there were any variants that were missing. In this way I found another group of repeats, about 150 bases upstream from the long repeats, each about 185 long. I used look-for-exit to find the variants of these repeats, and found several possible variants. I was only able to order a few of them, as walking forward or backward from most blocks with the longest extension that kept at least one read resulted in going to the generic block (R99) from most blocks. R01 is first and R03->R18. R15, R16, and R17 back up to R15 rather than R99, which means that they are probably preceded by R15, R07, R14, or R04. <code> R01: . . . t a . . . . . . . . . . ... .. . R01: GTTATTATTGAAGTTTATTAACGTAAAGTTGTAACTATAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA R02: . . . . . . . . . . . . . . . ... .a . R02: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA R03: . a . . . . . c . . . . a . -(at) ... .t t R03: GTTATTATTAAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTACAATAATGTTTAATACACTTATCTTATTAGTCTTTATTATACCATTTAATAATAATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATTTATAATATATTATTTTCATTATATATA R04: . . g . . . . . -a t . . . . . ... .a . R04: GTTATTATTGAAGGTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA R05: . . . . . . . . . . . -c . . . ... .a . R05: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA R06: . . . . a . . . . . . . . . . a.. .. . R06: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTATAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTATATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA R07: . . . . a . . . . t . . . . . ... .a . R07: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTATAAAAATATCTCTTTACTATAATAATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA R08: . . . . -t . . . . . . . . . . ... g. . R08: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTAGCTATAATATATTATTTCCATTATATATA R09: . . . . . . . . . . . . . . . ... g. . R09: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTAGCTATAATATATTATTTCCATTATATATA R10: . . . . -t . . . . . . . . g . ... .. . R10: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAGTATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA R11: . . . . -t -t . . . . . . . . . ... .. . R11: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTAAAAATATCTCTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA R12: . . . . -t . . . . t . . . . . ... .. . R12: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTAAAAATATCTCTTTACTATAATAATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA R13: . . . . . . t . . . g t . g . ... .a . R13: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTATTATAATAATGTTTAATACATTTATCTTATTAATCTTTATTGTACTATTTGATAATAGTATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA R14: . . g . . . . . -a t . . . . . ... .a . R14: GTTATTATTGAAGGTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA R15: . . . . . . . . . t . . . . . ... .a . R15: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA R16: . . . . . . . . . t . . . . g ... .. . R16: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATGTCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA R17: . . . . . . . . . t . . . . . ..g .a . R17: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTGTAGTATTATTACTATATATAATATATTATTTCCATTATATATA R18: a . . . . . . . . . . . . . . ... .. . R18: ATTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA R19: . . . . . . . . . . . . . . . .g. .. a R19: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGGATAGTATTATTACTATCTATAATATATTATTTACATTATATATA R20: . . . . . . . . . . . . . . . ... .. a R20: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTACATTATATATA R99: . . . . . . . . . . . . . . . . . .. . R99: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA </code> Note: I fixed the R14 annotation—it is actually identical to R04. When I put these blocks into the genome, look-for-exit found no variants needed other than a C->T SNP a little earlier than the short repeats that had 7 reads supporting it. As this was less than 1.5% of the reads for that location, it is quite likely a sequencing error rather than a true variant. For the short repeats, I expect around 32 copies based on expected coverage of 176 and 5671 copies of ATCTTATTAGTCTTTATTA. Of course, some of the repeat variants I have may just be sequencing errors, and there may be some exact repetitions of blocks. == Primer design == The best pair of primers I've found for the short repeat region (using the SantaLucia 1998 parameters, aiming for 55 degrees C with a temp difference of <3 degrees and a 1-base GC clamp) is <code> OLIGO start len tm gc% any 3' seq LEFT PRIMER 114 31 55.20 35.48 6.00 2.00 CGAGTATACAATATTTAGTTGTGCTAAGTGC RIGHT PRIMER 4174 27 53.85 37.04 6.00 2.00 TCACAAGGAATTTAAGGATCCACAATC PRODUCT SIZE: 4061, PAIR ANY COMPL: 5.00, PAIR 3' COMPL: 0.00 </code> ===== Reads ===== One of the biggest tasks in assembling the mitochondrion was separating the mitochondrial reads from the full set of reads. I needed a seed to start with, which I initially found by searching the earliest assembly for a contig that matched the COX1 protein from several different mollusks using tblastn, and to cox1 genes using blastn. I got 3 contigs for this small gene in the low-quality assembly from SOAPdenovo_assembly1/k31, and had to do iterative searching to extend the set. (I could have saved about 10 iterations by starting from the SOAPdenovo-assembly2/all/k63/ assembly instead, but it had not been completed when I started the search for mitochondrial reads. The selection process used bwa to map reads that had been run through seqprep to the mitochondrial contigs found so far, keeping any reads that mapped or whose paired read mapped. I sometimes used the quake-corrected set used for this year's assembly, and sometimes used the set that had been run through seqprep but not corrrected with quake. In barcode-of-life/map-Illumina-raw-45/, which has a draft of the complete mitochondrial genome to map to, I mapped the reads that had not been run through quake (that is what the "raw" signifies in the directory name, though the reads had been run through seqprep, and so were not really the raw reads). I found 61,428 reads that are most likely mitochondrial. >samtools flagstat merged.bam 61428 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 60932 + 0 mapped (99.19%:nan%) 35560 + 0 paired in sequencing 17780 + 0 read1 17780 + 0 read2 31562 + 0 properly paired (88.76%:nan%) 34568 + 0 with itself and mate mapped 496 + 0 singletons (1.39%:nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) I'm wondering about the 3004 pairs that map both reads but are not "properly paired". Are these indicating a misassembly, or is this just mismapping in the repeat region? I'm also curious about whether I could have found the mitochondrial reads more simply by extracting every read that had a high-count k-mer. Looking at the jellyfish 19-mer counts from this set, I see counts as high as 5707 (for AATAAAGACTAATAAGATA) and 12,976 19-mers with multiplicity ≥100. But in the jellyfish counts for the whole set of reads, I see 7 low-complexity kmers that occur over 10 million times: CCCTAACCCTAACCCTAAC 10752814 AACCCTAACCCTAACCCTA 10736174 CCTAACCCTAACCCTAACC 10715055 CTAACCCTAACCCTAACCC 10710935 AGGGTTAGGGTTAGGGTTA 10654682 ACCCTAACCCTAACCCTAA 10607845 AAAAAAAAAAAAAAAAAAA 10486112 Even if entire reads are full of CCCTAA, that is still hundreds of thousands of reads, so separating by frequency would not have been good enough. The most frequent 19-mer in the subset occurs 6035 times in the full set (so I may be missing 272 copies in the subset), but there are almost 209,000 more common 19-mers, so selecting by frequency would have gotten me mostly low-complexity junk, not mitochondrial sequence. After cleaning the mitochondrial reads with [[bioinformatic_tools:jellyfish|jellyfish]] and [[bioinformatic_tools:quake|quake]], in map-Illumina-raw-45/ we have clean_19_dir/merged.fastq has 2,860,095 bases in 26,498 reads. clean_19_dir/merged_1.fastq has 1,253,271 bases in 16,885 reads. clean_19_dir/merged_2.fastq has 1,252,843 bases in 16,885 reads. Total: 5,366,209 bases in 60,268 reads. The estimated coverage of the genome from fitting a gamma distribution to the 19-mers is about 216x, giving a genome-size estimate of 24,844 bases, not far from the 23642 I currently have---perhaps there are a couple more repeats than I've been estimating? But the accuracy of the genome size estimates is not likely to be that precise. In map-Illumina-raw-46/, where the constraint on insert size was increased to keep bwa from throwing out discordant pairs, and more variants of the repeats were in the draft that was mapped to, look-for-exit merged.fastq merged_[12].fastq counted merged.fastq has 3,027,249 bases in 27,429 reads. merged_1.fastq has 1,360,645 bases in 17,875 reads. merged_2.fastq has 1,359,545 bases in 17,875 reads. Total: 5,747,439 bases in 63,179 reads. After cleaning with quake to correct 19-mers, we have clean_19_dir/merged.fastq has 3,044,997 bases in 28,051 reads. clean_19_dir/merged_1.fastq has 1,259,044 bases in 16,974 reads. clean_19_dir/merged_2.fastq has 1,258,440 bases in 16,974 reads. Total: 5,562,481 bases in 61,999 reads. I also counted the 31-mers in the cleaned reads, and got 3 estimates of the genome size: * 18.2k = 20,945-2,761 distinct 31-mers (excluding those with support<10) * 21.5k = 3702511/172.3 total 31-mers/coverage estimate from gamma fit * 21.5k = 5562481 * (89.72-30)/89.72 / 172.3 (total bases correct for estimated number of 31-mers based on average read length, divided by coverage). The last 2 estimates are using the same data, so are not really independent estimates, just different ways of computing the same length. (The corrected coverage estimate 89.72/(89.72-30)*172.3 is about 259x.) I have other estimates of coverage. For example, if I count just coverage in bases 500 to 5500 (well before the repeat region and far enough from the ends to avoid edge effects), I get an estimate of 274x from the pileup output, using awk '$2<5500 && $2>=500 {sum+=$4}; END {print sum/5000.}' < merged.pileup Using this estimate with the 5,747,439 bases of the uncleaned reads, we get a genome length of 21k. Looking at the statistics from samtools in map-Illumina-raw-46/, we see more sequences than in map-Illumina-raw-45/, but still a large fraction not properly paired. <code> samtools flagstat merged.bam 63179 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 62625 + 0 mapped (99.12%:nan%) 35750 + 0 paired in sequencing 17875 + 0 read1 17875 + 0 read2 32888 + 0 properly paired (91.99%:nan%) 34642 + 0 with itself and mate mapped 554 + 0 singletons (1.55%:nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) </code> The second draft sequence (with the short repeats expanded and the repeats ordered as best I can from the short-insert reads) is available as {{mitochondrion-draft2.fasta.gz|draft2 gzipped fasta file}}. This corresponds to /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-47/draft on campusrocks. It has 36363 bases. **I no longer believe that this sequence is correct. I'm now fairly sure that the region that I assembled as repeats should actually have been assembled as variants from a heterogeneous population of mitochondria. We'll need long reads or PCR products to tell for sure. ** ===== Annotation ===== I sent the sequence to [[http://dogma.ccbb.utexas.edu/]] for annotation, and it quickly created a very crude web interface that almost works. Unfortunately, it seems to have missed some of the protein genes and it provides no way to download its annotation in a GENBANK format. Since there are hundreds of tRNA genes (the repeat region is full of them), I'm not tempted to try screen-scraping. A better way to annotate the mitochondrion needs to be found. The closest sequenced mitochondrial genomes at NCBI are NC_010220.1 Biomphalaria tenagophila mitochondrion, complete genome >gb|EF433576.1| Biomphalaria tenagophila strain Taim-RS mitochondrion, complete genome max total query E-value max score score coverage identity 2679 4235 51% 0.0 71% NC_005439.1 Biomphalaria glabrata mitochondrion, complete genome >gb|AY380531.1| Biomphalaria glabrata strain 1742 mitochondrion, complete genome max total query E-value max score score coverage identity 2562 3934 52% 0.0 69% So this is pretty far from the previously known genomes. It would be useful to find some spots that are highly conserved between Biomphalaria and Ariolimax, to place primers for extracting mitochondrial DNA by long-range PCR. That would make a population study of various Ariolimax species by sequencing their mitochondria quite tractable. ==== Protein genes ==== I can find 12 of the 13 standard mitochondrial protein genes using Blast to labeled mitochondria, and [[http://www.ncbi.nlm.nih.gov/projects/gorf/orfig.cgi|ORFfinder]] finds ORFs of about the right length for each. The missing gene (ATP8) is often missing in mollusks. There is some concern about whether the ORFs found by ORFfinder are correct, as some mollusks have RNA processing in their mitochondrial genes to insert the stop codon. I'll annotate the ORFs as best I can (assuming no post-transcriptional mRNA processing) once the repeats have been resolved. ==== Ribosomal RNA genes ==== I found both the small subunit rRNA gene and the large subunit rRNA gene using BLAST to a few annotated mollusk mitochondria. I'm not sure how one gets precise ends for these RNA genes, without wet-lab work to get cDNA. ==== tRNA genes ==== I looked for tRNA genes with [[http://lowelab.ucsc.edu/tRNAscan-SE/|tRNAscan-SE]] run locally, and only found 7, but tRNAscan is known to have problems with mitochondial tRNAs. I'll work with Todd Lowe later this summer to try to improve the tRNAscan covariance models to handle them better. The 7 easy-to-find tRNA genes are Asp-GTC, SeC-TCA, Val-TAC, Pro-TGG, Ala-TGC, Thr-TGT, Met-CAT. These are not correctly labeled for the mitochondrial genetic code: TCA is codon TGA which is Trp in code 5, though the others are ok. I expect to find many copies of tRNA genes, as the repeat regions are in a tRNA-rich area of homologous mitochondria.

You could leave a comment if you were logged in.
assemblies/2011/mitochondrion_assembly.1437073286.txt.gz · Last modified: 2015/07/16 19:01 by ceisenhart