This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
assemblies:2011:mitochondrion_assembly [2015/07/16 19:01] ceisenhart created |
assemblies:2011:mitochondrion_assembly [2015/11/01 03:15] (current) karplus [Reads] updated mitochondrion-draft2 link |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Mitochondrial sequence ====== | ====== Mitochondrial sequence ====== | ||
- | The mitochondrion was reassembled in 2015; follow the new assembly here: [[assemblies::2015::mitochondrion_assembly | 2015 Mitochondrion assembly ]]. This page is for the 2012 mitochondrial assembly. | + | The mitochondrion was reassembled in 2015; follow the new assembly here: [[assemblies::2015::mitochondrion_assembly | 2015 Mitochondrion assembly ]]. This page is for the 2011 mitochondrial assembly. |
The first draft sequence is available as {{:computer_resources:assemblies:mitochondrion-draft1.fasta.gz|draft1 gzipped fasta file}}. This corresponds to /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-45/consensus-6 on campusrocks. It has 23,642 bases. | The first draft sequence is available as {{:computer_resources:assemblies:mitochondrion-draft1.fasta.gz|draft1 gzipped fasta file}}. This corresponds to /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-45/consensus-6 on campusrocks. It has 23,642 bases. | ||
Line 27: | Line 27: | ||
* Iterating search and abyss assembly does not lengthen the large contig. Cleaning up and calling the consensus with bwa+samtools+bcftools doesn't change things much either. There seems to be a large variation in coverage (from 20x to 2300x, with a median of 225x), so I suspect that there is a repeat region at the beginning of the current contig that may have 10 repeats in it. | * Iterating search and abyss assembly does not lengthen the large contig. Cleaning up and calling the consensus with bwa+samtools+bcftools doesn't change things much either. There seems to be a large variation in coverage (from 20x to 2300x, with a median of 225x), so I suspect that there is a repeat region at the beginning of the current contig that may have 10 repeats in it. | ||
- | Alternating between finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases. Eventually, I wrote a new program ([[bioinformatic_tools:pluck-scripts:look-for-exit|look-for-exit]]) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers. I was eventually able to close the circle this way, and get a complete genome, though there is one repeat region with long repeats (about a dozen copies of a repeat 615±1 bases long) that I could not order, because the differences between repeats were far enough apart that I couldn't disambiguate the order with the [[bioinformatic_tools:bwa#determining_paired-end_insert_size|short fragment lengths]] of the data available. I think I have all the variants of the repeat, but in some cases I can't even tell which first half of the repeat goes with which second half. | + | Alternating between finding new reads and assembling them made very slow progress, because the new reads only extended the assembled region by 50–100 bases. Eventually, I wrote a new program ([[archive:bioinformatic_tools:pluck-scripts:look-for-exit|look-for-exit]]) to manually extend the contigs and find exits from repeat regions, being more aggressive in extending the contig than the automatic assemblers. I was eventually able to close the circle this way, and get a complete genome, though there is one repeat region with long repeats (about a dozen copies of a repeat 615±1 bases long) that I could not order, because the differences between repeats were far enough apart that I couldn't disambiguate the order with the [[archive:bioinformatic_tools:bwa#determining_paired-end_insert_size|short fragment lengths]] of the data available. I think I have all the variants of the repeat, but in some cases I can't even tell which first half of the repeat goes with which second half. |
At some point in the process, I rotated the genome to correspond to the closest previous mitochondrial genome: //Biomphalaria glabrata// strain M, a gastropod. | At some point in the process, I rotated the genome to correspond to the closest previous mitochondrial genome: //Biomphalaria glabrata// strain M, a gastropod. | ||
Line 36: | Line 36: | ||
We plan to use PCR to amplify parts of the repeat region and do Sanger sequencing to confirm the sequence on those blocks. | We plan to use PCR to amplify parts of the repeat region and do Sanger sequencing to confirm the sequence on those blocks. | ||
- | To find distinguishing features in the repeat region to design primers, the [[bioinformatic_tools:pluck-scripts:look-for-exit|look-for-exit]] program was used to walk forward and backward through the repeat, looking for alternative paths that had significant read support. All the variants were recorded in README files (in assemblies/slug/barcode-of-life/map-Illumina-raw-42/ and assemblies/slug/barcode-of-life/map-Illumina-raw-45/) and look-for-exit was used to build putative single copies of repeats from each of the observed variants. | + | To find distinguishing features in the repeat region to design primers, the [[archive:bioinformatic_tools:pluck-scripts:look-for-exit|look-for-exit]] program was used to walk forward and backward through the repeat, looking for alternative paths that had significant read support. All the variants were recorded in README files (in assemblies/slug/barcode-of-life/map-Illumina-raw-42/ and assemblies/slug/barcode-of-life/map-Illumina-raw-45/) and look-for-exit was used to build putative single copies of repeats from each of the observed variants. |
The repeat region starts at position 7037 in draft1, with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT | The repeat region starts at position 7037 in draft1, with CTGTAAGAGAATTATTTTAGTAATAAAATTTAATTTTAAGAAAAGAATTTTTCT | ||
Line 315: | Line 315: | ||
The most frequent 19-mer in the subset occurs 6035 times in the full set (so I may be missing 272 copies in the subset), but almost 209,000 19-mers are more common, so selecting by frequency would have gotten me mostly low-complexity junk, not mitochondrial sequence. | The most frequent 19-mer in the subset occurs 6035 times in the full set (so I may be missing 272 copies in the subset), but almost 209,000 19-mers are more common, so selecting by frequency would have gotten me mostly low-complexity junk, not mitochondrial sequence. | ||
- | After cleaning the mitochondrial reads with [[bioinformatic_tools:jellyfish|jellyfish]] and [[bioinformatic_tools:quake|quake]], in map-Illumina-raw-45/ we have | + | After cleaning the mitochondrial reads with [[archive:bioinformatic_tools:jellyfish|jellyfish]] and [[archive:bioinformatic_tools:quake|quake]], in map-Illumina-raw-45/ we have |
clean_19_dir/merged.fastq has 2,860,095 bases in 26,498 reads. | clean_19_dir/merged.fastq has 2,860,095 bases in 26,498 reads. | ||
clean_19_dir/merged_1.fastq has 1,253,271 bases in 16,885 reads. | clean_19_dir/merged_1.fastq has 1,253,271 bases in 16,885 reads. | ||
Line 366: | Line 366: | ||
- | The second draft sequence (with the short repeats expanded and the repeats ordered as best I can from the short-insert reads) is available as {{mitochondrion-draft2.fasta.gz|draft2 gzipped fasta file}}. This corresponds to /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-47/draft on campusrocks. It has 36,363 bases. | + | The second draft sequence (with the short repeats expanded and the repeats ordered as best I can from the short-insert reads) is available as {{computer_resources:assemblies:mitochondrion-draft2.fasta.gz|draft2 gzipped fasta file}}. This corresponds to /campusdata/BME235/assemblies/slug/barcode-of-life/map-Illumina-raw-47/draft on campusrocks. It has 36,363 bases. |
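
The base counts quoted for the draft sequences can be checked against a downloaded copy of the gzipped fasta files. The following is a minimal sketch in Python, not the method used on this page; the function name `count_fasta_bases` is hypothetical, and it assumes a plain gzipped FASTA file in which header lines start with `>`.

```python
import gzip


def count_fasta_bases(path):
    """Return the total number of sequence characters in a gzipped FASTA file.

    Header lines (starting with '>') and surrounding whitespace are not counted.
    The name and behavior of this helper are illustrative assumptions.
    """
    total = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip FASTA header lines
            total += len(line.strip())
    return total
```

Calling `count_fasta_bases("mitochondrion-draft2.fasta.gz")` on a local copy of the draft2 file should reproduce the base count quoted above, provided the file is the same one the page links to.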