User Tools

Site Tools


computer_resources:assemblies:mitochondrion

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
computer_resources:assemblies:mitochondrion [2011/07/04 20:22]
karplus [Reads]
computer_resources:assemblies:mitochondrion [2015/07/16 18:33]
ceisenhart Deleting this page
Line 1: Line 1:
-====== Mitochondrion ====== 
  
 +
 +===== Mitochondrial sequence =====
 +
 +The mitochondrion was re assembled in 2015, follow the new assembly here, [[::​mitochondrion_2015 | 2015 Mitochondrion assembly ]].  This page is for the 2012 mitochondrial assembly. ​
 +
 +The first draft sequence is available as {{mitochondrion-draft1.fasta.gz|draft1 gzipped fasta file}}. ​ This corresponds to /​campusdata/​BME235/​assemblies/​slug/​barcode-of-life/​map-Illumina-raw-45/​consensus-6 on campusrocks. ​ It has 23,642 bases.
 +
 +
 +====== Mitochondrion ======
 The mitochondrion was assembled by Kevin Karplus in the assemblies/​slug/​barcode-of-life/​ directory. The reason for the strange name for the directory was that at first the attempt was just to recover the COX1 gene that is used for the [[http://​www.boldsystems.org/​|BOLD (barcode of life database)]] project to characterize eukaryotes by their mitochondrial sequences. ​ When it became clear that the whole mitochondrial genome was well covered in the Illumina data, the project switched to trying to reconstruct the full mitochondrial genome. The mitochondrion was assembled by Kevin Karplus in the assemblies/​slug/​barcode-of-life/​ directory. The reason for the strange name for the directory was that at first the attempt was just to recover the COX1 gene that is used for the [[http://​www.boldsystems.org/​|BOLD (barcode of life database)]] project to characterize eukaryotes by their mitochondrial sequences. ​ When it became clear that the whole mitochondrial genome was well covered in the Illumina data, the project switched to trying to reconstruct the full mitochondrial genome.
  
Line 21: Line 29:
    * Iterating search and abyss assembly does not lengthen the large contig. ​ Cleaning up and calling the consensus with bwa+samtools+bcftools doesn'​t change things much either. ​ There seems to be a large variation in coverage (from 20x to 2300x, with a median of 225x), so I suspect that there is a repeat region at the beginning of the current contig that may have 10 repeats in it.    * Iterating search and abyss assembly does not lengthen the large contig. ​ Cleaning up and calling the consensus with bwa+samtools+bcftools doesn'​t change things much either. ​ There seems to be a large variation in coverage (from 20x to 2300x, with a median of 225x), so I suspect that there is a repeat region at the beginning of the current contig that may have 10 repeats in it.
  
-Alternating finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases. ​ Eventually, I wrote a new program (look-for-exit ​[needs its own page FIXME]) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers. ​ I was eventually able to close the circle this way, and get a complete genome, though there is one repeat region with long repeats (about a dozen copies of a 615±1 long repeat) that I could not order, because the differences between repeats were far enough apart that I couldn'​t disambiguate the order with the [[bioinformatic_tools:​bwa#​determining_paired-end_insert_size|short fragment lengths]] of the data available. ​ I think I have all the variants of repeat, but in some cases I can't even tell which first half of the repeat goes with which second half.+Alternating finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases. ​ Eventually, I wrote a new program ([[bioinformatic_tools:​pluck-scripts:​look-for-exit|look-for-exit]]) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers. ​ I was eventually able to close the circle this way, and get a complete genome, though there is one repeat region with long repeats (about a dozen copies of a 615±1 long repeat) that I could not order, because the differences between repeats were far enough apart that I couldn'​t disambiguate the order with the [[bioinformatic_tools:​bwa#​determining_paired-end_insert_size|short fragment lengths]] of the data available. ​ I think I have all the variants of repeat, but in some cases I can't even tell which first half of the repeat goes with which second half.
  
 At some point in the process, I rotated the genome to correspond to the closest previous mitochondrial genome: //​Biomphalaria glabrata// strain M, a gastropod. At some point in the process, I rotated the genome to correspond to the closest previous mitochondrial genome: //​Biomphalaria glabrata// strain M, a gastropod.
Line 30: Line 38:
  
 We plan to use PCR to amplify parts of the repeat region and do Sanger sequencing to confirm the sequence on those blocks. We plan to use PCR to amplify parts of the repeat region and do Sanger sequencing to confirm the sequence on those blocks.
-To find distinguishing features in the repeat region to design primers, the [[look-for-exit]] program was used to walk forward and backward through the repeat, looking for alternative paths that had significant read support. ​ All the variants were recorded in README files (in assemblies/​slug/​barcode-of-life/​map-Illumina-raw-42/ ​ and assemblies/​slug/​barcode-of-life/​map-Illumina-raw-45/​) and look-for-exit was used to build putative single copies of repeats from each of the observed variants. ​+To find distinguishing features in the repeat region to design primers, the [[bioinformatic_tools:​pluck-scripts:​look-for-exit|look-for-exit]] program was used to walk forward and backward through the repeat, looking for alternative paths that had significant read support. ​ All the variants were recorded in README files (in assemblies/​slug/​barcode-of-life/​map-Illumina-raw-42/ ​ and assemblies/​slug/​barcode-of-life/​map-Illumina-raw-45/​) and look-for-exit was used to build putative single copies of repeats from each of the observed variants.  
 + 
 +The repeat region starts at position 7037 in draft1, with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT 
 +There are 12 full (615±1 long) repeats there, then a partial one (456 long) that ends with  
 +CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT at 14867.
  
 Details of the PCR strategy are still being worked out, trying to minimize the effort and cost of the wet-lab work.  Details of the PCR strategy are still being worked out, trying to minimize the effort and cost of the wet-lab work. 
Line 233: Line 245:
 R13: .        .   ​. ​             .       ​. ​            ​. ​ t  .   ​. ​         .                      g   ​t ​   .      g     ​. ​                     ...              .a               ​. ​             R13: .        .   ​. ​             .       ​. ​            ​. ​ t  .   ​. ​         .                      g   ​t ​   .      g     ​. ​                     ...              .a               ​. ​            
 R13: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTATTATAATAATGTTTAATACATTTATCTTATTAATCTTTATTGTACTATTTGATAATAGTATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA  ​ R13: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTATTATAATAATGTTTAATACATTTATCTTATTAATCTTTATTGTACTATTTGATAATAGTATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA  ​
-R14: .        .                ​. ​      ​. ​            ​. ​ .  .   ​-a ​       t                      .   ​. ​   .      .     ​. ​                     ...              .a               ​. ​             ​+R14: .        .                ​. ​      ​. ​            ​. ​ .  .   ​-a ​       t                      .   ​. ​   .      .     ​. ​                     ...              .a               ​. ​             ​
 R14: GTTATTATTGAAGGTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA ​   R14: GTTATTATTGAAGGTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATATGTTTAATATATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATATATAATATATTATTTCCATTATATATA ​  
 R15: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         t                      .   ​. ​   .      .     ​. ​                     ...              .a               ​. ​             R15: .        .   ​. ​             .       ​. ​            ​. ​ .  .   ​. ​         t                      .   ​. ​   .      .     ​. ​                     ...              .a               ​. ​            
Line 250: Line 262:
 R99: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA R99: GTTATTATTGAAGTTTATTAACGTAAAGCTGTAACTTTAAAAATATCTCTTTACTATAATAATGTTTAATACATTTATCTTATTAGTCTTTATTATACCATTTGATAATAATATATATCTAGTAGCTTACCTTTTTGCTGTATAGTATTATTACTATCTATAATATATTATTTCCATTATATATA
 </​code>​ </​code>​
 +
 +Note: I fixed the R14 annotation—it is actually identical to R04.
  
 When I put these blocks into the genome, look-for-exit found no variants needed other than a C->T SNP a little earlier than the short repeats that had 7 reads supporting it.  As this was less than 1.5% of the reads for that location, it is quite likely a sequencing error rather than a true variant. When I put these blocks into the genome, look-for-exit found no variants needed other than a C->T SNP a little earlier than the short repeats that had 7 reads supporting it.  As this was less than 1.5% of the reads for that location, it is quite likely a sequencing error rather than a true variant.
Line 304: Line 318:
  
 After cleaning the mitochondrial reads with [[bioinformatic_tools:​jellyfish|jellyfish]] and [[bioinformatic_tools:​quake|quake]],​ in map-Illumina-raw-45/​ we have After cleaning the mitochondrial reads with [[bioinformatic_tools:​jellyfish|jellyfish]] and [[bioinformatic_tools:​quake|quake]],​ in map-Illumina-raw-45/​ we have
-  ​clean_19_dir/​merged_1.fastq has 1253271 ​bases in 16885 reads. +    ​clean_19_dir/​merged.fastq has 2,​860,​095 ​bases in 26,​498 ​reads.  
-  clean_19_dir/​merged_2.fastq has 1252843 ​bases in 16885 reads. +    clean_19_dir/​merged_1.fastq has 1,​253,​271 ​bases in 16,​885 ​reads. 
-  clean_19_dir/​merged.fastq has 2860095 ​bases in 26498 reads.  +    clean_19_dir/​merged_2.fastq has 1,​252,​843 ​bases in 16,​885 ​reads. 
-  Total: 5,366,209 bases in 60,268 reads. ​+    Total: 5,366,209 bases in 60,268 reads. ​
  
 The estimated coverage of the genome from fitting a gamma distribution to the 19-mers is about 216x, giving a genome-size estimate of 24,844 bases, not far from the 23642 I currently have---perhaps there are a couple more repeats than I've been estimating? ​ But the accuracy of the genome size estimates is not likely to be that precise. ​ The estimated coverage of the genome from fitting a gamma distribution to the 19-mers is about 216x, giving a genome-size estimate of 24,844 bases, not far from the 23642 I currently have---perhaps there are a couple more repeats than I've been estimating? ​ But the accuracy of the genome size estimates is not likely to be that precise. ​
  
 In map-Illumina-raw-46/,​ where the constraint on insert size was increased to keep bwa from throwing out discordant pairs, and more variants of the repeats were in the draft that was mapped to, In map-Illumina-raw-46/,​ where the constraint on insert size was increased to keep bwa from throwing out discordant pairs, and more variants of the repeats were in the draft that was mapped to,
-  look-for-exit ​clean_19_dir/​*.fastq ​                                      ​ +  look-for-exit ​merged.fastq merged_[12].fastq 
-produced +counted 
-  ​clean_19_dir/​merged_1.fastq has 1,259,044 bases in 16,974 reads. ​            +    ​merged.fastq has 3,027,249 bases in 27,429 reads. 
-  ​clean_19_dir/​merged_2.fastq has 1,258,440 bases in 16,974 reads. ​            +    ​merged_1.fastq has 1,360,645 bases in 17,875 reads. 
-  ​clean_19_dir/​merged.fastq has 3,044,997 bases in 28,051 reads. ​              +    ​merged_2.fastq has 1,359,545 bases in 17,875 reads. 
-  Total: 5,562,481 bases in 61,999 reads. ​                                    ​+    Total: 5,747,439 bases in 63,179 reads. ​ 
  
-After cleaning ​the reads, I counted the 31-mers, and got 3 estimates of the genome size:+After cleaning ​with quake to correct 19-mers, we have 
 +    clean_19_dir/​merged.fastq has 3,044,997 bases in 28,​051 ​reads
 +    clean_19_dir/​merged_1.fastq has 1,259,044 bases in 16,974 reads. ​            
 +    clean_19_dir/​merged_2.fastq has 1,258,440 bases in 16,974 reads.  
 +    Total: 5,562,481 bases in 61,999 reads. ​                                     
 + 
 +also counted the 31-mers ​in the cleaned reads, and got 3 estimates of the genome size:
   * 18.2k  = 20,​945-2,​761 distinct 31-mers (excluding those with support<​10)   * 18.2k  = 20,​945-2,​761 distinct 31-mers (excluding those with support<​10)
   * 21.5k  = 3702511/​172.3 total 31-mers/​coverage estimate from gamma fit   * 21.5k  = 3702511/​172.3 total 31-mers/​coverage estimate from gamma fit
-  * 21.5k  = 5562481 * (89.72-30)/​89./ 172.3 (total bases correct for estimated number of 31-mers based on average read length, divided by coverage).+  * 21.5k  = 5562481 * (89.72-30)/​89.72 / 172.3 (total bases correct for estimated number of 31-mers based on average read length, divided by coverage).
 The last 2 estimates are using the same data, so are not really independent estimates, just different ways of computing the same length. The last 2 estimates are using the same data, so are not really independent estimates, just different ways of computing the same length.
 +(The corrected coverage estimate 89.72/​(89.72-30)*172.3 is about 259x.)
  
-Looking at the statistics from samtools in map-Illumina-raw-46/,​ we see more sequences, but still a large fraction not properly paired.+I have other estimates of coverage. ​ For example, if I count just coverage in bases 500 to 5500 (well before the repeat region and far enough from the ends to avoid edge effects), I get an estimate of 274x from the pileup output, using 
 +  awk '​$2<​5500 && $2>=500 {sum+=$4}; END {print sum/​5000.}'​ < merged.pileup ​       
 +Using this estimate with the 5,747,439 bases of the uncleaned reads, we get a genome length of 21k. 
 + 
 +Looking at the statistics from samtools in map-Illumina-raw-46/,​ we see more sequences ​than in map-Illumina-raw-45/​, but still a large fraction not properly paired.
 <​code>​ <​code>​
 samtools flagstat merged.bam samtools flagstat merged.bam
Line 341: Line 366:
 </​code>​ </​code>​
  
-===== Mitochondrial sequence ===== 
  
-The first draft sequence is available as {{mitochondrion-draft1.fasta.gz|gzipped fasta file}}. ​ This corresponds to /​campusdata/​BME235/​assemblies/​slug/​barcode-of-life/​map-Illumina-raw-45/​consensus-6 on campusrocks. 
  
 +The second draft sequence (with the short repeats expanded and the repeats ordered as best I can from the short-insert reads) is available as {{mitochondrion-draft2.fasta.gz|draft2 gzipped fasta file}}. ​ This corresponds to /​campusdata/​BME235/​assemblies/​slug/​barcode-of-life/​map-Illumina-raw-47/​draft on campusrocks. ​ It has 36363 bases.
 +
 +
 +**I no longer believe that this sequence is correct. ​ I'm now fairly sure that the region that I assembled as repeats should actually have been assembled as variants from a heterogeneous population of mitochondria. ​ We'll need long reads or PCR products to tell for sure. **
 ===== Annotation ===== ===== Annotation =====
  
 I sent the sequence to [[http://​dogma.ccbb.utexas.edu/​]] for annotation, and it quickly created a very crude web interface that almost works. ​ Unfortunately,​ it seems to have missed some of the protein genes and it provides no way to download its annotation in a GENBANK format. ​ Since there are hundreds of tRNA genes (the repeat region is full of them), I'm not tempted to try screen-scraping. ​ A better way to annotate the mitochondrion needs to be found. I sent the sequence to [[http://​dogma.ccbb.utexas.edu/​]] for annotation, and it quickly created a very crude web interface that almost works. ​ Unfortunately,​ it seems to have missed some of the protein genes and it provides no way to download its annotation in a GENBANK format. ​ Since there are hundreds of tRNA genes (the repeat region is full of them), I'm not tempted to try screen-scraping. ​ A better way to annotate the mitochondrion needs to be found.
                                                                                                                                                                                                                    
- 
-The repeat region starts at position 7037 in draft1, with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT 
-There are 12 full (615±1 long) repeats there, then a partial one (456 long) that ends with  
-CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT at 14867. 
- 
 The closest sequenced mitochondrial genomes at NCBI are The closest sequenced mitochondrial genomes at NCBI are
  
Line 371: Line 393:
  
 It would be useful to find some spots that are highly conserved between Biomphalaria and Ariolimax, to place primers for extracting mitochondrial DNA by long-range PCR.  That would make a population study of various Ariolimax species by sequencing their mitochondria quite tractable. It would be useful to find some spots that are highly conserved between Biomphalaria and Ariolimax, to place primers for extracting mitochondrial DNA by long-range PCR.  That would make a population study of various Ariolimax species by sequencing their mitochondria quite tractable.
 +
 +==== Protein genes ====
 +
 +I can find 12 of the 13 standard mitochondrial protein genes using Blast to labeled mitochondria,​ and [[http://​www.ncbi.nlm.nih.gov/​projects/​gorf/​orfig.cgi|ORFfinder]] finds ORFs of about the right length for each.  The missing gene (ATP8) is often missing in mollusks. ​ There is some concern about whether the ORFs found by ORFfinder are correct, as some mollusks have RNA processing in their mitochondrial genes to insert the stop codon.
 +
 +I'll annotate the ORFs as best I can (assuming no post-transcriptional mRNA processing) once the repeats have been resolved.
 +
 +==== Ribosomal RNA genes ====
 +
 +I found both the small subunit rRNA gene and the large subunit rRNA gene using BLAST to a few annotated mollusk mitochondria. I'm not sure how one gets precise ends for these RNA genes, without wet-lab work to get cDNA.
 +
 +==== tRNA genes ====
 +
 +I looked for tRNA genes with [[http://​lowelab.ucsc.edu/​tRNAscan-SE/​|tRNAscan-SE]] run locally, and only found 7, but tRNAscan is known to have problems with mitochondial tRNAs. ​ I'll work with Todd Lowe later this summer to try to improve the tRNAscan covariance models to handle them better.
 +
 +The 7 easy-to-find tRNA genes are Asp-GTC, SeC-TCA, Val-TAC, Pro-TGG, Ala-TGC, Thr-TGT, Met-CAT. ​ These are not correctly labeled for the mitochondrial genetic code:  TCA is codon TGA which is Trp in code 5, though the others are ok.
 +
 +I expect to find many copies of tRNA genes, as the repeat regions are in a tRNA-rich area of homologous mitochondria.