Differences

This shows you the differences between two versions of the page.

--- archive:computer_resources:data [2011/05/16 19:45]
karplus Added discussion of incorrect DNA length for Illumina run 2
+++ archive:computer_resources:data [2015/09/06 06:32] (current)
117.28.251.165 ↷ Links adapted because of a move operation
@@ Line 16: / Line 16: @@
     * paired/ contains interleaved files with F3 followed by matching R3, but without quality data
     * csfasta lines are 26 characters long (not counting the newline). The first letter is the last base of the primer and the first color is the transition from that base to the actual data, followed by 24 colors that correspond to transitions in the genomic DNA. Translated to base space, they are 25-base reads. Often when working in colorspace, you just have the 24 colors, and ignore or do not know the phase.
-    * The separation between the starts of the R3 and F3 reads is approximately 2220 bases.  The [[bioinformatic_tools:pluck-scripts|pluck-scripts]] program map-colorspace provides a distribution of the lengths.  The separation between the R3 and F3 starts is roughly in the range 750-4700, and 500-5000 contains essentially all the good pairs.
+    * The separation between the starts of the R3 and F3 reads is approximately 2220 bases.  The [[archive:bioinformatic_tools:pluck-scripts|pluck-scripts]] program map-colorspace provides a distribution of the lengths.  The separation between the R3 and F3 starts is roughly in the range 750-4700, and 500-5000 contains essentially all the good pairs.
   * sanger/ contains Sanger reads from PCR experiments to fill gaps and verify conjectured assembly of contigs. We have fasta files and the traces as .ab1 files. Unfortunately, the only software I've found so far for getting quality data from the traces is [[https://products.appliedbiosystems.com/ab/en/US/adirect/ab?cmd=catNavigate2&catID=600583&tab=DetailInfo| Applied Biosystem's Sequence Scanner]], which is a Windows-only product. I also tried [[http://www.nucleics.com/peaktrace-sequencing/|PeakTrace]], an on-line service, but they simply rejected the traces as too noisy. Currently there are 3 fasta files:
     * PogSanger.fa
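
The csfasta layout described above (one primer base followed by colors) can be turned into base space with a short decoder. This is a minimal sketch, assuming the standard SOLiD two-base encoding in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3); the function name `decode_colorspace` is hypothetical and not part of pluck-scripts. Reads containing a missing color call (`.`) are not handled and raise `ValueError`.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3 (assumed SOLiD encoding)


def decode_colorspace(read: str) -> str:
    """Decode a csfasta read (primer base + colors) into base space.

    The primer base itself is not part of the output, so a 26-character
    csfasta line (1 letter + 25 colors) yields a 25-base read.
    """
    primer_base, colors = read[0], read[1:]
    current = BASES.index(primer_base)
    decoded = []
    for color in colors:
        # Each color encodes the transition from the previous base.
        current ^= int(color)
        decoded.append(BASES[current])
    return "".join(decoded)


if __name__ == "__main__":
    print(decode_colorspace("T0123"))
```

Because every base depends on all the colors before it, a single color error corrupts the rest of the decoded read, which is one reason to keep working in colorspace for as long as possible.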
@@ Line 25: / Line 25: @@
     * Pog.chr.v3.fa   earlier draft of circular bacterial genome
     * Pog.v4c.fa both chr and ece.  The ece draft is identical to the v3 draft, but the chr draft has a few fixes.  This one may end up being the published genome, but we still have a handful of changes that we are debating.  Compare assemblies from other tools to this assembly.  If you find discrepancies, we would like to examine them to see if we find evidence for better results.
-  * [[bioinformatic_tools:jellyfish|Jellyfish]] was run on the 454 data to get k-mer counts.  There was a peak at k-mers occurring 46 times, consistent with 46-47x coverage (rather than 60x).  (More information on the Jellyfish page.)
+  * [[archive:bioinformatic_tools:jellyfish|Jellyfish]] was run on the 454 data to get k-mer counts.  There was a peak at k-mers occurring 46 times, consistent with 46-47x coverage (rather than 60x).  (More information on the Jellyfish page.)
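
The k-mer peak mentioned above can be located from a histogram file with a few lines of Python. This is a sketch only, assuming a two-column, whitespace-separated histogram (k-mer multiplicity, number of distinct k-mers) of the kind `jellyfish histo` writes; the function names and the sample numbers are hypothetical. It skips the initial falling slope of low-multiplicity (likely erroneous) k-mers before looking for the maximum.

```python
def parse_histogram(text: str) -> dict:
    """Parse 'multiplicity count' lines into a dict, skipping malformed lines."""
    hist = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        hist[int(parts[0])] = int(parts[1])
    return hist


def coverage_peak(hist: dict) -> int:
    """Return the multiplicity with the most k-mers after the error slope."""
    counts = sorted(hist)
    start = 0
    # Walk down the initial slope until the histogram stops decreasing.
    while start + 1 < len(counts) and hist[counts[start + 1]] < hist[counts[start]]:
        start += 1
    return max(counts[start:], key=lambda c: hist[c])


if __name__ == "__main__":
    # Made-up illustrative histogram, not real data from this project.
    sample = "1 1000\n2 300\n3 50\n4 60\n45 400\n46 500\n47 480\n60 10\n"
    print(coverage_peak(parse_histogram(sample)))
```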
 ===== slug/ =====
@@ Line 34: / Line 34: @@
   * 454_run2/ contains second 454 run, one file
     * GCLL8Y406 54,283 reads 11,724,143 bases
-  * 454_run3/ contains a 454 run
+  * 454_run3/ is bogus: it contains 454_run2/ plus some other lanes that are not banana slug.
-    * Unclear which regions correspond to banana slug dna.
   * solid_run1/ will contain first SOLiD run
-  * [[lab_protocols:illumina_run1|illumina_run1/]] contains data from first illumina run
+  * [[archive:lab_protocols:illumina_run1|illumina_run1/]] contains data from first illumina run
     * This data was available for the 2010 class.
     * Lane 4 is a control lane and this data should not be used during assembly.
     * Sequence data are in 1920 files with suffix _qseq.txt totalling 128,746,146,454 bytes
-      * Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670M reads.  (Question: how long are the reads on average? FIXME)
+      * Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670M reads.
-      * Lines are tab delimited with the following fields:
+      * After running through SeqPrep, the DNA length seems to peak around 140 bases, both for merged reads and for mapping of pairs to 454 reads using BWA.
-        * Machine_name: Unique Identifier of the sequencer
+  * illumina_run_2/
-        * Run_number: Identifier of the sequencing run
-        * Lane_number: Positive Integer between 1-8 signifying the lane from which the reads originate
-        * Tile_number: Positive Integer
-        * X_coordinate: of the read cluster on the slide
-        * Y_coordinate: of the read cluster on the slide
-        * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs.
-        * Read_Number: 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read.
-        * Sequence: Base called sequence of the ring
-        * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. 'B' is the lowest quality while 'b' is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/illumina format.
-        * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes.
-  *illumina_run_2/
     * This was a paired end sequencing run.
     * These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx.  For the distribution of fragment size, please refer to {{:computer_resources:banana_slug_fragment_size_for_reads.pdf|bioanalyzer results of final libraries}}. In the file, "sample 7" corresponds to barcode 7 and "sample 8" corresponds to barcode 8.
-      * Barcode 7 has a mean fragment length of 411bp. (7_1.fastq & 7_2.fastq) based on Bioanalyzer results, but the length looks more like 250 in [[lecture_notes:05-13-2011|Insert Length Analysis]]  There are 747,017 read pairs with barcode 7.
+      * Barcode 7 has a mean fragment length of 411bp (7_1.fastq & 7_2.fastq) based on Bioanalyzer results, but the length peaks at 249 in [[archive:bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]]. There are 747,017 read pairs with barcode 7. The Bioanalyzer result includes 119 bases of adapter, so the insert length is about 292, not far from our mapping result.
-      * Barcode 8 has a mean fragment length of 372bp. (8_1.fastq & 8_2.fastq) based on the Bioanalyzer results, but the length looks like it averages less than 100 in [[lecture_notes:05-13-2011|Insert Length Analysis]].  There are 7,118,223 read pairs with barcode 8.
+      * Barcode 8 has a mean fragment length of 372bp (8_1.fastq & 8_2.fastq) based on the Bioanalyzer results, but the length looks like it averages less than 100 in [[archive:bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]] (the distribution peaks at 115, but there seems to be a truncation artifact, and the true sizes are probably smaller). There are 7,118,223 read pairs with barcode 8.  Even after removing the adapters (372-119=253), the Bioanalyzer estimate is still too big.
-      * The over-estimate of length apparently is a common problem with Illumina libraries, perhaps due to dimerization of DNA pieces from hybridization of adapters.
+    * The over-estimate of length apparently is a common problem with Illumina libraries, perhaps due to dimerization of DNA pieces from hybridization of adapters.
   * solid_run_1
     * This was a single ended run produced on a SOLiD 3 machine. There was an observed GC bias in other libraries sequenced in the same region as this run.
+    * primary.20100403223319810/ has no data, just 4 million empty reads
+    * primary.20100411112358944/ does have data: 4,042,811 50-long single-end reads.
+    * secondary.F3.20100403054816892/ seems to be quality control information from spiked-in controls.
   * solid_run_2
     * This was a single ended run produced on a SOLiD 4 machine. There was no observed GC bias in other libraries sequenced in the same region as this run.
+    * primary.20101127015521942/ has no data
+    * primary.20101202225202063/ has 26,987,171 50-base reads.
+    * secondary.F3.20101127015516841/ seems to be quality control from spiked-in samples.
+  * kmer-counts/
+    * Used for running [[archive:bioinformatic_tools:jellyfish|Jellyfish]].
+    * Contains Jellyfish k-mer tables and histogram files for 454 and Illumina runs before and after SeqPrep
+  * insert/
+    * Used for estimating distribution of template sizes for the paired-end Illumina reads.
+    * Accomplished by mapping pairs to a reference, originally the 454 reads and later the contigs from a SOAPdenovo assembly, using [[archive:bioinformatic_tools:bwa|BWA]].
+  * clean/
+    * Used for running [[archive:bioinformatic_tools:seqprep|SeqPrep]] and [[archive:bioinformatic_tools:quake|Quake]] correction on the Illumina reads.
+    * run1_seqprep_quake/ and run2_seqprep_quake/ contain the corrected reads used for assembly.
+      * *_[12]_qseq_seqprep.cor.fastq.gz contain the corrected paired reads.
+      * *_[12]_qseq_seqprep.cor_single.fastq.gz contain the corrected reads whose pair could not be corrected.
+      * *_merged_qseq_seqprep.cor.fastq.gz contain the corrected reads that were merged by SeqPrep.
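
The comparison between the Bioanalyzer fragment lengths and the mapped insert lengths for illumina_run_2/ is simple arithmetic, sketched here using only the numbers quoted in the list above (119 adapter bases; 411 and 372 for the Bioanalyzer means; 249 and 115 for the mapping peaks). The names `ADAPTER_BASES`, `LIBRARIES`, and `insert_length` are illustrative only.

```python
ADAPTER_BASES = 119  # adapter bases included in the Bioanalyzer fragment length

# Values copied from the notes for barcodes 7 and 8.
LIBRARIES = {
    "barcode 7": {"bioanalyzer_mean": 411, "mapped_peak": 249},
    "barcode 8": {"bioanalyzer_mean": 372, "mapped_peak": 115},
}


def insert_length(fragment_length: int, adapter_bases: int = ADAPTER_BASES) -> int:
    """Insert length implied by a Bioanalyzer fragment length."""
    return fragment_length - adapter_bases


if __name__ == "__main__":
    for name, lib in LIBRARIES.items():
        expected = insert_length(lib["bioanalyzer_mean"])
        gap = expected - lib["mapped_peak"]
        print(f"{name}: Bioanalyzer insert {expected}, "
              f"mapped peak {lib['mapped_peak']}, gap {gap}")
```

For barcode 7 the gap is a few dozen bases, while for barcode 8 it remains well over 100 bases, which matches the observation that adapter removal alone does not explain the barcode 8 discrepancy.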
 ==== Illumina Data Notes ====
@@ Line 90: / Line 96: @@
 #0      index number for a multiplexed sample (0 for no indexing)\\
 /1      the member of a pair, /1 or /2 (paired-end or mate-pair reads only)\\
+In the original files, lines are tab delimited with the following fields:
+        * Machine_name: Unique Identifier of the sequencer
+        * Run_number: Identifier of the sequencing run
+        * Lane_number: Positive Integer between 1-8 signifying the lane from which the reads originate
+        * Tile_number: Positive Integer
+        * X_coordinate: of the read cluster on the slide
+        * Y_coordinate: of the read cluster on the slide
+        * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs.
+        * Read_Number: 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read.
+        * Sequence: Base called sequence of the read
+        * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. 'B' is the lowest quality while 'b' is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/illumina format.
+        * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes.
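
The eleven tab-delimited fields listed above map naturally onto a small record type. The following is a minimal sketch, not the parser used for this project: the class `QseqRead`, the functions `parse_qseq_line` and `read_id`, and the sample line are all hypothetical. The identifier layout (`machine_run:lane:tile:x:y#index/read`) is an assumption based on the `#0` and `/1` conventions described in these notes. Quality strings are kept as-is, since the notes do not state the encoding offset.

```python
from dataclasses import dataclass


@dataclass
class QseqRead:
    machine_name: str
    run_number: str
    lane_number: int
    tile_number: int
    x_coordinate: str
    y_coordinate: str
    index: str
    read_number: int
    sequence: str
    quality: str
    passed_filter: bool


def parse_qseq_line(line: str) -> QseqRead:
    """Split one tab-delimited _qseq.txt line into its eleven fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    machine, run, lane, tile, x, y, index, read_no, seq, qual, flt = fields
    return QseqRead(machine, run, int(lane), int(tile), x, y, index,
                    int(read_no), seq, qual, flt == "1")


def read_id(read: QseqRead) -> str:
    """Build a read name; this layout is an assumption, not taken from the files."""
    return (f"{read.machine_name}_{read.run_number}:{read.lane_number}:"
            f"{read.tile_number}:{read.x_coordinate}:{read.y_coordinate}"
            f"#{read.index}/{read.read_number}")


if __name__ == "__main__":
    # Made-up example line for illustration only.
    example = "MACHINE\t12\t3\t45\t1000\t2000\t0\t1\tACGTA\tbbbbB\t1\n"
    parsed = parse_qseq_line(example)
    if parsed.passed_filter and parsed.lane_number != 4:  # lane 4 is the control lane
        print(read_id(parsed), parsed.sequence)
```

Keeping the filter flag as a boolean makes it easy to drop reads that failed the quality filter, or reads from the control lane, before assembly.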

Banana Slug Genomics
