archive:computer_resources:data

Differences

This shows you the differences between two versions of the page.

archive:computer_resources:data [2011/05/24 16:52]
karplus Added DNA length information for run1, updated for run2
archive:computer_resources:data [2011/06/22 22:49]
karplus [slug/] added info about adapters to Bioanalyzer lengths.
Line 34: Line 34:
   * 454_run2/ contains second 454 run, one file   * 454_run2/ contains second 454 run, one file
     * GCLL8Y406 54,283 reads 11,724,143 bases     * GCLL8Y406 54,283 reads 11,724,143 bases
-  * 454_run3/ contains ​a 454 run +  * 454_run3/ ​is bogus: it contains ​454_run2/ plus some other lanes that are not banana slug.
-    * Unclear which regions correspond to banana slug dna.+
   * solid_run1/ will contain first SOLiD run   * solid_run1/ will contain first SOLiD run
   * [[lab_protocols:​illumina_run1|illumina_run1/​]] contains data from first illumina run   * [[lab_protocols:​illumina_run1|illumina_run1/​]] contains data from first illumina run
Line 43: Line 42:
       * Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670Mreads.  ​       * Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670Mreads.  ​
       * After running through Seqprep, the DNA length seems to peak around 140 long, both for merged reads and mapping of pairs to 454 reads using BWA.       * After running through Seqprep, the DNA length seems to peak around 140 long, both for merged reads and mapping of pairs to 454 reads using BWA.
-      * Lines are tab delimited with the following fields: +  ​* illumina_run_2/​
-        * Machine_name:​ Unique Identifier of the sequencer +
-        * Run_number: Identifier of the sequencing run +
-        * Lane_number:​ Positive Integer between 1-8 signifying the lane from which the reads originate +
-        * Tile_number:​ Positive Integer +
-        * X_coordinate:​ of the read cluster on the slide +
-        * Y_coordinate:​ of the read cluster on the slide +
-        * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs. +
-        * Read_Number:​ 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read. +
-        * Sequence: Base called sequence of the ring +
-        * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. '​B'​ is the lowest quality while '​b'​ is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/​illumina format.  +
-        * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes. +
-  ​*illumina_run_2/​+
     * This was a paired end sequencing run.     * This was a paired end sequencing run.
     * These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx. ​ For the distribution of fragment size, please refer to {{:​computer_resources:​banana_slug_fragment_size_for_reads.pdf|bioanalyzer results of final libraries}}. In the file, "​sample 7" corresponds to barcode 7 and "​sample 8" corresponds to barcode 8.     * These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx. ​ For the distribution of fragment size, please refer to {{:​computer_resources:​banana_slug_fragment_size_for_reads.pdf|bioanalyzer results of final libraries}}. In the file, "​sample 7" corresponds to barcode 7 and "​sample 8" corresponds to barcode 8.
-      * Barcode 7 has a mean fragment length of 411bp. (7_1.fastq & 7_2.fastq) based on Bioanalyzer results, but the length peaks at 269 in [[bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]]  There are 747,017 read pairs with barcode 7. +      * Barcode 7 has a mean fragment length of 411bp. (7_1.fastq & 7_2.fastq) based on Bioanalyzer results, but the length peaks at 249 in [[bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]]  There are 747,017 read pairs with barcode 7. The Bioanalyzer result includes 119 bases of adapter, so the insert length is about 292 (411-119), not far from our mapping result.
-      * Barcode 8 has a mean fragment length of 372bp. (8_1.fastq & 8_2.fastq) based on the Bioanalyzer results, but the length looks like it averages less than 100 in [[bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]]. (the distribution peaks at 115, but there seems to be a truncation artifact, and the true sizes are probably smaller). There are 7,118,223 read pairs with barcode 8. +      * Barcode 8 has a mean fragment length of 372bp. (8_1.fastq & 8_2.fastq) based on the Bioanalyzer results, but the length looks like it averages less than 100 in [[bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]]. (the distribution peaks at 115, but there seems to be a truncation artifact, and the true sizes are probably smaller). There are 7,118,223 read pairs with barcode 8.  Even after subtracting the 119 bases of adapter (372-119=253), the Bioanalyzer estimate is still too big.
-      * The over-estimate of length apparently is a common problem with Illumina libraries, perhaps due to dimerization of DNA pieces from hybridization of adapters.+    * The over-estimate of length apparently is a common problem with Illumina libraries, perhaps due to dimerization of DNA pieces from hybridization of adapters.
   * solid_run_1   * solid_run_1
     * This was a single ended run produced on a SOLiD 3 machine. There was an observed GC bias in other libraries sequenced in the same region as this run.     * This was a single ended run produced on a SOLiD 3 machine. There was an observed GC bias in other libraries sequenced in the same region as this run.
 +    * primary.20100403223319810/​ has no data, just 4 million empty reads
 +    * primary.20100411112358944/​ does have data: 4,042,811 50-long single-end reads.
 +    * secondary.F3.20100403054816892/​ seems to be quality control information from spiked-in controls. ​
   * solid_run_2   * solid_run_2
-    * This was a single ended run produced on a SOLiD 4 machine. There was no observed GC bias in other libraries sequenced in the same region as this run.+    * This was a single ended run produced on a SOLiD 4 machine. There was no observed GC bias in other libraries sequenced in the same region as this run.  
 +    * primary.20101127015521942/​ has no data 
 +    * primary.20101202225202063/​ has 26,987,171 50-base reads. 
 +    * secondary.F3.20101127015516841/​ seems to be quality control from spiked-in samples. ​  
 +  * kmer-counts/​ 
 +    * Used for running [[bioinformatic_tools:​jellyfish|Jellyfish]].  
 +    * Contains Jellyfish k-mer tables and histogram files for 454 and Illumina runs before and after SeqPrep 
 +  * insert/ 
 +    * Used for estimating distribution of template sizes for the paired-end Illumina reads. 
 +    * Accomplished by mapping pairs to a reference, originally the 454 reads and later the contigs from a SOAPdenovo assembly, using [[bioinformatic_tools:​bwa|BWA]]. 
 +  * clean/ 
 +    * Used for running [[bioinformatic_tools:​seqprep|SeqPrep]] and [[bioinformatic_tools:​quake|Quake]] correction on the Illumina reads. 
 +    * run1_seqprep_quake/​ and run2_seqprep_quake/​ contain the corrected reads used for assembly. 
 +      * *_[12]_qseq_seqprep.cor.fastq.gz contain the corrected paired reads. 
 +      * *_[12]_qseq_seqprep.cor_single.fastq.gz contain the corrected reads whose pair could not be corrected. 
 +      * *_merged_qseq_seqprep.cor.fastq.gz contain the corrected reads that were merged by SeqPrep.
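
The adapter correction described for illumina_run_2/ above is simple arithmetic, and a short sketch may make the comparison with the BWA mapping peaks easier to repeat. The numbers are the ones quoted on this page; the function and variable names are made up for illustration.

```python
# Adapter length included in the Bioanalyzer fragment sizes, as stated on this page.
ADAPTER_BASES = 119


def insert_length(fragment_length, adapter_bases=ADAPTER_BASES):
    """Estimate the insert length by subtracting adapter bases from the fragment length."""
    return fragment_length - adapter_bases


# (Bioanalyzer mean fragment length, peak seen when mapping pairs with BWA),
# both taken from the illumina_run_2/ notes on this page.
libraries = {
    "barcode 7": (411, 249),
    "barcode 8": (372, 115),
}

for name, (fragment, mapped_peak) in libraries.items():
    insert = insert_length(fragment)
    print(f"{name}: Bioanalyzer insert ~{insert} bp, "
          f"mapping peak {mapped_peak} bp, "
          f"difference {insert - mapped_peak} bp")
```

For barcode 7 the two estimates differ by 43 bases; for barcode 8 they differ by 138, which is why the adapter correction alone does not explain that library.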
  
 ==== Illumina Data Notes ==== ==== Illumina Data Notes ====
Line 91: Line 96:
 #0      index number for a multiplexed sample (0 for no indexing)\\ #0      index number for a multiplexed sample (0 for no indexing)\\
 /1      the member of a pair, /1 or /2 (paired-end or mate-pair reads only)\\ /1      the member of a pair, /1 or /2 (paired-end or mate-pair reads only)\\
 +
 +In the original files, lines are tab-delimited with the following fields:
 +        * Machine_name:​ Unique Identifier of the sequencer
 +        * Run_number: Identifier of the sequencing run
 +        * Lane_number:​ Positive Integer between 1-8 signifying the lane from which the reads originate
 +        * Tile_number:​ Positive Integer
 +        * X_coordinate:​ of the read cluster on the slide
 +        * Y_coordinate:​ of the read cluster on the slide
 +        * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs.
 +        * Read_Number:​ 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read.
 +        * Sequence: Base called sequence of the read
 +        * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. '​B'​ is the lowest quality while '​b'​ is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/​illumina format. ​
 +        * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes.
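
As a minimal sketch of reading the eleven tab-delimited fields listed above, the following splits one line into named fields and converts the quality string to numeric scores. The names ''QseqRecord'', ''parse_qseq_line'' and ''quality_scores'' are hypothetical, and the offset of 64 is an assumption (it maps 'B' to 2 and 'b' to 34); check it against the actual files before relying on it.

```python
from collections import namedtuple

# Field names follow the list above; the record type itself is illustrative.
QseqRecord = namedtuple("QseqRecord", [
    "machine_name", "run_number", "lane_number", "tile_number",
    "x_coordinate", "y_coordinate", "index", "read_number",
    "sequence", "quality", "passed_filter",
])

# ASSUMPTION: quality characters are phred scores with an ASCII offset of 64.
PHRED_OFFSET = 64


def parse_qseq_line(line):
    """Split one tab-delimited line into a QseqRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, found {len(fields)}")
    (machine, run, lane, tile, x, y,
     index, read_number, sequence, quality, filter_flag) = fields
    return QseqRecord(
        machine, run, int(lane), int(tile), int(x), int(y),
        index, int(read_number), sequence, quality,
        filter_flag == "1",  # 1 means the read passed the quality filter
    )


def quality_scores(quality, offset=PHRED_OFFSET):
    """Convert a quality string to a list of integer scores."""
    return [ord(char) - offset for char in quality]


# Hypothetical example line, not taken from the real data.
example = "MACHINE1\t7\t3\t12\t100\t200\t0\t1\tACGT\tbbBB\t1\n"
record = parse_qseq_line(example)
print(record.lane_number, record.sequence, quality_scores(record.quality))
```

Keeping the index as a string leaves room for multiplexed runs where the field is not simply 0.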
archive/computer_resources/data.txt · Last modified: 2015/09/06 06:32 by 117.28.251.165