Differences

This shows you the differences between two versions of the page.

--- archive:computer_resources:data [2011/04/07 21:03]
karplus updated info about k-mer counts using jellyfish
+++ archive:computer_resources:data [2011/06/08 16:38]
svohr [slug/]
@@ Line 34: / Line 34: @@
   * 454_run2/ contains second 454 run, one file
     * GCLL8Y406 54,283 reads 11,724,143 bases
-  * 454_run3/ contains a 454 run
+  * 454_run3/ is bogus: it contains 454_run2/ plus some other lanes that are not banana slug.
-    * Unclear which regions correspond to banana slug dna.
   * solid_run1/ will contain first SOLiD run
   * [[lab_protocols:illumina_run1|illumina_run1/]] contains data from first illumina run
@@ Line 41: / Line 40: @@
     * Lane 4 is a control lane and this data should not be used during assembly.
     * Sequence data are in 1920 files with suffix _qseq.txt totalling 128,746,146,454 bytes
-      * Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670Mreads.  (Question: how long are the reads on average? FIXME)
+      * Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670 million reads in total.
-      * Lines are tab delimited with the following fields:
+      * After running through SeqPrep, the DNA length seems to peak around 140 bases, both for merged reads and for the mapping of pairs to 454 reads using BWA.
-        * Machine_name: Unique Identifier of the sequencer
+  * illumina_run_2/
-        * Run_number: Identifier of the sequencing run
-        * Lane_number: Positive Integer between 1-8 signifying the lane from which the reads originate
-        * Tile_number: Positive Integer
-        * X_coordinate: of the read cluster on the slide
-        * Y_coordinate: of the read cluster on the slide
-        * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs.
-        * Read_Number: 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read.
-        * Sequence: Base called sequence of the ring
-        * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. 'B' is the lowest quality while 'b' is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/illumina format.
-        * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes.
-  *illumina_run_2/
     * This was a paired end sequencing run.
-    * These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx
+    * These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx.  For the distribution of fragment size, please refer to {{:computer_resources:banana_slug_fragment_size_for_reads.pdf|bioanalyzer results of final libraries}}. In the file, "sample 7" corresponds to barcode 7 and "sample 8" corresponds to barcode 8.
-    * Barcode 7 has a mean fragment length of 411bp. (7_1.fastq & 7_2.fastq)
+      * Barcode 7 (7_1.fastq & 7_2.fastq) has a mean fragment length of 411bp based on the Bioanalyzer results, but the length peaks at 249 in [[bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]]. There are 747,017 read pairs with barcode 7.
-    * Barcode 8 has a mean fragment length of 372bp. (8_1.fastq & 8_2.fastq)
+      * Barcode 8 (8_1.fastq & 8_2.fastq) has a mean fragment length of 372bp based on the Bioanalyzer results, but the length looks like it averages less than 100 in [[bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]]. (The distribution peaks at 115, but there seems to be a truncation artifact, and the true sizes are probably smaller.) There are 7,118,223 read pairs with barcode 8.
-    * For the distribution of fragment size, please refer to {{:computer_resources:banana_slug_fragment_size_for_reads.pdf|bioanalyzer results of final libraries}}. In the file, "sample 7" corresponds to barcode 7 and "sample 8" corresponds to barcode 8.
+    * The over-estimate of length apparently is a common problem with Illumina libraries, perhaps due to dimerization of DNA pieces from hybridization of adapters.
   * solid_run_1
     * This was a single ended run produced on a SOLiD 3 machine. There was an observed GC bias in other libraries sequenced in the same region as this run.
+    * primary.20100403223319810/ has no data, just 4 million empty reads
+    * primary.20100411112358944/ does have data: 4,042,811 50-long single-end reads.
+    * secondary.F3.20100403054816892/ seems to be quality control information from spiked-in controls.
   * solid_run_2
     * This was a single ended run produced on a SOLiD 4 machine. There was no observed GC bias in other libraries sequenced in the same region as this run.
+    * primary.20101127015521942/ has no data
+    * primary.20101202225202063/ has 26,987,171 50-base reads.
+    * secondary.F3.20101127015516841/ seems to be quality control from spiked-in samples.
+  * kmer-counts/
+    * Used for running [[bioinformatic_tools:jellyfish|Jellyfish]].
+    * Contains Jellyfish k-mer tables and histogram files for the 454 and Illumina runs, before and after SeqPrep.
+  * insert/
+    * Used for estimating distribution of template sizes for the paired-end Illumina reads.
+    * Accomplished by mapping pairs to a reference, originally the 454 reads and later the contigs from a SOAPdenovo assembly, using [[bioinformatic_tools:bwa|BWA]].
+  * clean/
+    * Used for running [[bioinformatic_tools:seqprep|SeqPrep]] and [[bioinformatic_tools:quake|Quake]] correction on the Illumina reads.
+    * run1_seqprep_quake/ and run2_seqprep_quake/ contain the corrected reads used for assembly.
+      * *_[12]_qseq_seqprep.cor.fastq.gz contain the corrected paired reads.
+      * *_[12]_qseq_seqprep.cor_single.fastq.gz contain the corrected reads whose pair could not be corrected.
+      * *_merged_qseq_seqprep.cor.fastq.gz contain the corrected reads that were merged by SeqPrep.
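
The three file-name patterns listed for clean/ can be told apart mechanically. The following is a minimal sketch, assuming the patterns are ordinary shell-style globs; the category labels and the `classify` helper are hypothetical names introduced here, not part of the original page.

```python
from fnmatch import fnmatchcase

# Glob patterns copied from the list above; the labels are illustrative only.
PATTERNS = [
    ("paired", "*_[12]_qseq_seqprep.cor.fastq.gz"),
    ("single", "*_[12]_qseq_seqprep.cor_single.fastq.gz"),
    ("merged", "*_merged_qseq_seqprep.cor.fastq.gz"),
]


def classify(filename):
    """Return the category of a corrected-read file, or None if no pattern matches."""
    for label, pattern in PATTERNS:
        # fnmatchcase is case-sensitive and supports the [12] character class.
        if fnmatchcase(filename, pattern):
            return label
    return None
```

With made-up file names, `classify("s_1_1_qseq_seqprep.cor.fastq.gz")` falls in the paired group and `classify("s_1_merged_qseq_seqprep.cor.fastq.gz")` in the merged group; the real file names in the directory may differ.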
 ==== Illumina Data Notes ====
@@ Line 90: / Line 96: @@
 #0      index number for a multiplexed sample (0 for no indexing)\\
 /1      the member of a pair, /1 or /2 (paired-end or mate-pair reads only)\\
+In the original files, lines are tab delimited with the following fields:
+        * Machine_name: Unique Identifier of the sequencer
+        * Run_number: Identifier of the sequencing run
+        * Lane_number: Positive Integer between 1-8 signifying the lane from which the reads originate
+        * Tile_number: Positive Integer
+        * X_coordinate: of the read cluster on the slide
+        * Y_coordinate: of the read cluster on the slide
+        * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs.
+        * Read_Number: 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read.
+        * Sequence: Base called sequence of the read
+        * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. 'B' is the lowest quality while 'b' is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/illumina format.
+        * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes.
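
The field list above maps directly onto a small parser. This is a minimal sketch, assuming each line carries exactly the eleven tab-delimited fields in the order listed; `QseqRead`, `parse_qseq_line`, `passes_filter`, and `read_name` are hypothetical names, and joining the machine name and run number with an underscore in the read name is an assumption, not something the page specifies.

```python
from collections import namedtuple

# One attribute per field, in the order given in the list above.
QseqRead = namedtuple("QseqRead", [
    "machine_name", "run_number", "lane_number", "tile_number",
    "x_coordinate", "y_coordinate", "index", "read_number",
    "sequence", "quality", "filter",
])


def parse_qseq_line(line):
    """Split one tab-delimited line into a QseqRead (all values kept as strings)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(QseqRead._fields):
        raise ValueError("expected %d fields, got %d"
                         % (len(QseqRead._fields), len(fields)))
    return QseqRead(*fields)


def passes_filter(read):
    """True if the Filter field is 1, i.e. the read passed the quality filter."""
    return read.filter == "1"


def read_name(read):
    """Build a name ending in #index/read_number (layout is an assumption)."""
    return "%s_%s:%s:%s:%s:%s#%s/%s" % (
        read.machine_name, read.run_number, read.lane_number,
        read.tile_number, read.x_coordinate, read.y_coordinate,
        read.index, read.read_number,
    )
```

Since Lane 4 is a control lane, a caller could skip reads with `lane_number == "4"` as well as those that fail `passes_filter` before passing data on to assembly.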
