Line 34: Line 34:
   * 454_run2/ contains second 454 run, one file   * 454_run2/ contains second 454 run, one file
     * GCLL8Y406 54,283 reads 11,724,143 bases     * GCLL8Y406 54,283 reads 11,724,143 bases
-  * 454_run3/ contains ​a 454 run +  * 454_run3/ ​is bogus: it contains ​454_run2/ plus some other lanes that are not banana slug.
-    * Unclear which regions correspond to banana slug dna.+
   * solid_run1/ will contain first SOLiD run   * solid_run1/ will contain first SOLiD run
   * [[lab_protocols:​illumina_run1|illumina_run1/​]] contains data from first illumina run   * [[lab_protocols:​illumina_run1|illumina_run1/​]] contains data from first illumina run
Line 41: Line 40:
     * Lane 4 is a control lane and this data should not be used during assembly. ​     * Lane 4 is a control lane and this data should not be used during assembly. ​
     * Sequence data are in 1920 files with suffix _qseq.txt totalling 128,​746,​146,​454 bytes     * Sequence data are in 1920 files with suffix _qseq.txt totalling 128,​746,​146,​454 bytes
-      * Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670M reads.  (Question: how long are the reads on average? FIXME) +      * Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670M reads.
-      * Lines are tab delimited with the following fields: +      * After running through SeqPrep, the DNA length seems to peak around 140 long, both for merged reads and for mapping of pairs to the 454 reads using BWA.
-        * Machine_name:​ Unique Identifier of the sequencer +  * illumina_run_2/​
-        * Run_number: Identifier of the sequencing run +
-        * Lane_number:​ Positive Integer between 1-8 signifying the lane from which the reads originate +
-        * Tile_number:​ Positive Integer +
-        * X_coordinate:​ of the read cluster on the slide +
-        * Y_coordinate:​ of the read cluster on the slide +
-        * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs. +
-        * Read_Number: 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read. +
-        * Sequence: Base called sequence of the read +
-        * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. '​B'​ is the lowest quality while '​b'​ is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/​illumina format.  +
-        * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes+
-  *illumina_run_2/​+
     * This was a paired end sequencing run.     * This was a paired end sequencing run.
-    * These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx +    * These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx.  For the distribution of fragment size, please refer to {{:​computer_resources:​banana_slug_fragment_size_for_reads.pdf|bioanalyzer results of final libraries}}. In the file, "​sample 7" corresponds to barcode 7 and "​sample 8" corresponds to barcode 8. 
-    * Barcode 7 has a mean fragment length of 411bp. (7_1.fastq & 7_2.fastq) +      * Barcode 7 has a mean fragment length of 411bp (7_1.fastq & 7_2.fastq) based on Bioanalyzer results, but the length peaks at 249 in [[bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]]. There are 747,017 read pairs with barcode 7.
-    * Barcode 8 has a mean fragment length of 372bp. (8_1.fastq & 8_2.fastq) +      * Barcode 8 has a mean fragment length of 372bp (8_1.fastq & 8_2.fastq) based on the Bioanalyzer results, but the length looks like it averages less than 100 in [[bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]] (the distribution peaks at 115, but there seems to be a truncation artifact, and the true sizes are probably smaller). There are 7,118,223 read pairs with barcode 8.
-    * For the distribution of fragment size, please refer to {{:computer_resources:banana_slug_fragment_size_for_reads.pdf|bioanalyzer results of final libraries}}. In the file, "sample 7" corresponds to barcode 7 and "sample 8" corresponds to barcode 8. +    * The over-estimate of length apparently is a common problem with Illumina libraries, perhaps due to dimerization of DNA pieces from hybridization of adapters.
   * solid_run_1   * solid_run_1
     * This was a single ended run produced on a SOLiD 3 machine. There was an observed GC bias in other libraries sequenced in the same region as this run.     * This was a single ended run produced on a SOLiD 3 machine. There was an observed GC bias in other libraries sequenced in the same region as this run.
 +    * primary.20100403223319810/​ has no data, just 4 million empty reads
 +    * primary.20100411112358944/​ does have data: 4,042,811 50-long single-end reads.
 +    * secondary.F3.20100403054816892/​ seems to be quality control information from spiked-in controls. ​
   * solid_run_2   * solid_run_2
-    * This was a single ended run produced on a SOLiD 4 machine. There was no observed GC bias in other libraries sequenced in the same region as this run.+    * This was a single ended run produced on a SOLiD 4 machine. There was no observed GC bias in other libraries sequenced in the same region as this run.  
 +    * primary.20101127015521942/​ has no data 
 +    * primary.20101202225202063/​ has 26,987,171 50-base reads. 
 +    * secondary.F3.20101127015516841/​ seems to be quality control from spiked-in samples. ​  
 +  * kmer-counts/​ 
 +    * Used for running [[bioinformatic_tools:​jellyfish|Jellyfish]].  
 +    * Contains Jellyfish k-mer tables and histogram files for 454 and Illumina runs before and after SeqPrep 
 +  * insert/ 
 +    * Used for estimating distribution of template sizes for the paired-end Illumina reads. 
 +    * Accomplished by mapping pairs to a reference, originally the 454 reads and later the contigs from a SOAPdenovo assembly, using [[bioinformatic_tools:​bwa|BWA]]. 
 +  * clean/ 
 +    * Used for running [[bioinformatic_tools:​seqprep|SeqPrep]] and [[bioinformatic_tools:​quake|Quake]] correction on the Illumina reads. 
 +    * run1_seqprep_quake/​ and run2_seqprep_quake/​ contain the corrected reads used for assembly. 
 +      * *_[12]_qseq_seqprep.cor.fastq.gz contain the corrected paired reads. 
 +      * *_[12]_qseq_seqprep.cor_single.fastq.gz contain the corrected reads whose pair could not be corrected. 
 +      * *_merged_qseq_seqprep.cor.fastq.gz contain the corrected reads that were merged by SeqPrep.
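The three file-name patterns listed for run1_seqprep_quake/ and run2_seqprep_quake/ can be told apart mechanically. The following is only a minimal sketch: the function name ''classify_corrected_file'' and the example file names are hypothetical, and only the three glob patterns come from the list above.

```python
from fnmatch import fnmatchcase

# Glob patterns copied from the directory notes; the labels are our own.
PATTERNS = [
    ("paired", "*_[12]_qseq_seqprep.cor.fastq.gz"),
    ("single", "*_[12]_qseq_seqprep.cor_single.fastq.gz"),
    ("merged", "*_merged_qseq_seqprep.cor.fastq.gz"),
]


def classify_corrected_file(filename):
    """Return 'paired', 'single', or 'merged', or None if no pattern matches."""
    for label, pattern in PATTERNS:
        # fnmatchcase is case-sensitive on every platform.
        if fnmatchcase(filename, pattern):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the patterns.
    for name in [
        "s_1_1_qseq_seqprep.cor.fastq.gz",
        "s_1_2_qseq_seqprep.cor_single.fastq.gz",
        "s_1_merged_qseq_seqprep.cor.fastq.gz",
    ]:
        print(name, classify_corrected_file(name))
```

A helper like this could be used to decide which files go to an assembler as pairs and which go in as unpaired reads, but the actual hand-off used for the assembly is not described here.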
 ==== Illumina Data Notes ==== ==== Illumina Data Notes ====
Line 90: Line 96:
 #0      index number for a multiplexed sample (0 for no indexing)\\ #0      index number for a multiplexed sample (0 for no indexing)\\
 /1      the member of a pair, /1 or /2 (paired-end or mate-pair reads only)\\ /1      the member of a pair, /1 or /2 (paired-end or mate-pair reads only)\\
 +In the original files, lines are tab-delimited with the following fields:
 +        * Machine_name:​ Unique Identifier of the sequencer
 +        * Run_number: Identifier of the sequencing run
 +        * Lane_number:​ Positive Integer between 1-8 signifying the lane from which the reads originate
 +        * Tile_number:​ Positive Integer
 +        * X_coordinate:​ of the read cluster on the slide
 +        * Y_coordinate:​ of the read cluster on the slide
 +        * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs.
 +        * Read_Number:​ 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read.
 +        * Sequence: Base called sequence of the read
 +        * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. '​B'​ is the lowest quality while '​b'​ is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/​illumina format. ​
 +        * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes.
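
The eleven fields above map directly onto a small record type. The following is a minimal sketch of a parser for one line, not the tool actually used on these data: the names ''QseqRecord'', ''parse_qseq_line'', and ''phred_scores'' are hypothetical, the sample line is made up, and the quality offset of 64 is an assumption (it is the offset under which 'B' decodes to 2 and 'b' to 34, consistent with 'B' being the lowest and 'b' the highest quality).

```python
from typing import List, NamedTuple


class QseqRecord(NamedTuple):
    machine_name: str
    run_number: str
    lane_number: int
    tile_number: int
    x_coordinate: int
    y_coordinate: int
    index: str
    read_number: int
    sequence: str
    quality: str
    passed_filter: bool


def parse_qseq_line(line: str) -> QseqRecord:
    """Split one tab-delimited line into its 11 named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 tab-delimited fields, got %d" % len(fields))
    machine, run, lane, tile, x, y, index, read_no, seq, qual, filt = fields
    return QseqRecord(machine, run, int(lane), int(tile), int(x), int(y),
                      index, int(read_no), seq, qual, filt == "1")


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Decode a quality string; offset=64 is an assumption, see the note above."""
    return [ord(ch) - offset for ch in quality]


if __name__ == "__main__":
    # Made-up line; the machine name and coordinates are placeholders.
    sample = "MACHINE\t1\t3\t12\t1045\t987\t0\t1\tACGTA\tbbbBB\t1\n"
    rec = parse_qseq_line(sample)
    print(rec.lane_number, rec.sequence, rec.passed_filter)
    print(phred_scores(rec.quality))
```

Keeping only records with ''passed_filter'' set to true would correspond to dropping the reads whose Filter field is 0.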
