assemblies:2011:mitochondrion_assembly

Differences

This shows you the differences between two versions of the page.

assemblies:2011:mitochondrion_assembly [2015/07/16 19:01]
ceisenhart created
assemblies:2011:mitochondrion_assembly [2015/09/09 04:17]
karplus [Mitochondrial sequence] date correction
Line 1: Line 1:
 ====== Mitochondrial sequence ======
  
-The mitochondrion was re assembled in 2015, follow the new assembly here, [[assemblies::2015::mitochondrion_assembly | 2015 Mitochondrion assembly ]].  This page is for the 2012 mitochondrial assembly.
+The mitochondrion was re assembled in 2015, follow the new assembly here, [[assemblies::2015::mitochondrion_assembly | 2015 Mitochondrion assembly ]].  This page is for the 2011 mitochondrial assembly.
  
 The first draft sequence is available as {{{{:computer_resources:assemblies:mitochondrion-draft1.fasta.gz|draft1 gzipped fasta file}}.  This corresponds to /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-45/consensus-6 on campusrocks.  It has 23,642 bases.
Line 27: Line 27:
    * Iterating search and abyss assembly does not lengthen the large contig.  Cleaning up and calling the consensus with bwa+samtools+bcftools doesn't change things much either.  There seems to be a large variation in coverage (from 20x to 2300x, with a median of 225x), so I suspect that there is a repeat region at the beginning of the current contig that may have 10 repeats in it.
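
The repeat-count guess in the bullet above can be checked with simple arithmetic: peak coverage divided by median coverage. The sketch below is only a back-of-the-envelope illustration. It assumes that read depth scales linearly with copy number and that the 2300x peak comes from a collapsed repeat; the page verifies neither. The function name is hypothetical.

```python
def estimate_copy_number(peak_coverage: float, median_coverage: float) -> float:
    """Rough copy-number estimate for a collapsed repeat.

    Assumption (not from the page): depth is proportional to the number
    of repeat copies that were collapsed into one contig position.
    """
    if median_coverage <= 0:
        raise ValueError("median_coverage must be positive")
    return peak_coverage / median_coverage


# Coverage figures quoted in the bullet above: peak 2300x, median 225x.
estimate = estimate_copy_number(2300, 225)
print(f"{estimate:.1f}")  # prints 10.2
```

A ratio close to 10 is consistent with the "may have 10 repeats" suspicion, but it is not a measurement of the repeat count.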
  
-Alternating finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases.  Eventually, I wrote a new program ([[bioinformatic_tools:pluck-scripts:look-for-exit|look-for-exit]]) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers.  I was eventually able to close the circle this way, and get a complete genome, though there is one repeat region with long repeats (about a dozen copies of a 615±1 long repeat) that I could not order, because the differences between repeats were far enough apart that I couldn't disambiguate the order with the [[bioinformatic_tools:bwa#determining_paired-end_insert_size|short fragment lengths]] of the data available.  I think I have all the variants of repeat, but in some cases I can't even tell which first half of the repeat goes with which second half.
+Alternating finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases.  Eventually, I wrote a new program ([[archive:bioinformatic_tools:pluck-scripts:look-for-exit|look-for-exit]]) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers.  I was eventually able to close the circle this way, and get a complete genome, though there is one repeat region with long repeats (about a dozen copies of a 615±1 long repeat) that I could not order, because the differences between repeats were far enough apart that I couldn't disambiguate the order with the [[archive:bioinformatic_tools:bwa#determining_paired-end_insert_size|short fragment lengths]] of the data available.  I think I have all the variants of repeat, but in some cases I can't even tell which first half of the repeat goes with which second half.
  
 At some point in the process, I rotated the genome to correspond to the closest previous mitochondrial genome: //Biomphalaria glabrata// strain M, a gastropod.
Line 36: Line 36:
  
 We plan to use PCR to amplify parts of the repeat region and do Sanger sequencing to confirm the sequence on those blocks.
-To find distinguishing features in the repeat region to design primers, the [[bioinformatic_tools:pluck-scripts:look-for-exit|look-for-exit]] program was used to walk forward and backward through the repeat, looking for alternative paths that had significant read support.  All the variants were recorded in README files (in assemblies/slug/barcode-of-life/map-Illumina-raw-42/  and assemblies/slug/barcode-of-life/map-Illumina-raw-45/) and look-for-exit was used to build putative single copies of repeats from each of the observed variants.
+To find distinguishing features in the repeat region to design primers, the [[archive:bioinformatic_tools:pluck-scripts:look-for-exit|look-for-exit]] program was used to walk forward and backward through the repeat, looking for alternative paths that had significant read support.  All the variants were recorded in README files (in assemblies/slug/barcode-of-life/map-Illumina-raw-42/  and assemblies/slug/barcode-of-life/map-Illumina-raw-45/) and look-for-exit was used to build putative single copies of repeats from each of the observed variants.
  
 The repeat region starts at position 7037 in draft1, with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT
Line 315: Line 315:
 The most frequent 19-mer in the subset occurs 6035 times in the full set (so I may be missing 272 copies in the subset), but there are almost 209,000 more common 19-mers, so selecting by frequency would have gotten me mostly low-complexity junk, not mitochondrial sequence.
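
To make the k-mer frequency comparison above concrete, here is a minimal sketch of counting k-mers in a set of reads with Python's standard library. It is an illustration only, not the method used on this page (which relied on jellyfish). The function name and toy reads are hypothetical, and unlike a typical jellyfish run this sketch does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def count_kmers(reads, k=19):
    """Count every k-mer in an iterable of read sequences.

    Assumptions: reads are plain strings, k-mers containing N are
    skipped, and the two strands are counted separately.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


# Toy data with k=4 so the result is easy to check by hand.
toy_reads = ["ACGTACGTAC", "CGTACGTACG"]
counts = count_kmers(toy_reads, k=4)
print(counts.most_common(2))
```

Ranking by `most_common` shows why frequency alone is a poor filter: on real data the top of such a list can be dominated by low-complexity k-mers, as the paragraph above notes.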
  
-After cleaning the mitochondrial reads with [[bioinformatic_tools:jellyfish|jellyfish]] and [[bioinformatic_tools:quake|quake]], in map-Illumina-raw-45/ we have
+After cleaning the mitochondrial reads with [[archive:bioinformatic_tools:jellyfish|jellyfish]] and [[archive:bioinformatic_tools:quake|quake]], in map-Illumina-raw-45/ we have
     clean_19_dir/merged.fastq has 2,860,095 bases in 26,498 reads.
     clean_19_dir/merged_1.fastq has 1,253,271 bases in 16,885 reads.
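
Read and base totals like the ones listed above can be reproduced with a short script. The following is a sketch, assuming standard four-line FASTQ records with unwrapped sequence lines. The function name and the in-memory example are hypothetical, and this is not necessarily how the counts on this page were produced.

```python
import io


def fastq_stats(handle):
    """Return (reads, bases) for a FASTQ stream.

    Assumption: each record is exactly four lines (header, sequence,
    '+', qualities), so the sequence is the second line of each group.
    """
    reads = 0
    bases = 0
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:
            reads += 1
            bases += len(line.strip())
    return reads, bases


# Two toy records: 4 bases and 6 bases.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
reads, bases = fastq_stats(example)
print(reads, bases)  # prints 2 10
```

For a real file, pass an open handle instead, for example `with open(path) as handle: fastq_stats(handle)`, where `path` is whichever FASTQ file is being checked.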
assemblies/2011/mitochondrion_assembly.txt · Last modified: 2015/11/01 03:15 by karplus