This shows you the differences between two versions of the page.
archive:computer_resources:data [2011/04/07 21:03] karplus updated info about k-mer counts using jellyfish |
archive:computer_resources:data [2011/06/08 16:38] svohr [slug/] |
||
---|---|---|---|
Line 34: | Line 34: | ||
* 454_run2/ contains second 454 run, one file | * 454_run2/ contains second 454 run, one file | ||
* GCLL8Y406 54,283 reads 11,724,143 bases | * GCLL8Y406 54,283 reads 11,724,143 bases | ||
- | * 454_run3/ contains a 454 run | + | * 454_run3/ is bogus: it contains 454_run2/ plus some other lanes that are not banana slug. |
- | * Unclear which regions correspond to banana slug dna. | + | |
* solid_run1/ will contain first SOLiD run | * solid_run1/ will contain first SOLiD run | ||
* [[lab_protocols:illumina_run1|illumina_run1/]] contains data from first illumina run | * [[lab_protocols:illumina_run1|illumina_run1/]] contains data from first illumina run | ||
Line 41: | Line 40: | ||
* Lane 4 is a control lane and this data should not be used during assembly. | * Lane 4 is a control lane and this data should not be used during assembly. | ||
* Sequence data are in 1920 files with suffix _qseq.txt totalling 128,746,146,454 bytes | * Sequence data are in 1920 files with suffix _qseq.txt totalling 128,746,146,454 bytes | ||
- | * Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670M reads. (Question: how long are the reads on average? FIXME) | + | * Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670M reads. |
- | * Lines are tab delimited with the following fields: | + | * After running through SeqPrep, the fragment length seems to peak at around 140 bases, both for merged reads and for pairs mapped to the 454 reads using BWA. |
- | * Machine_name: Unique Identifier of the sequencer | + | * illumina_run_2/ |
- | * Run_number: Identifier of the sequencing run | + | |
- | * Lane_number: Positive Integer between 1-8 signifying the lane from which the reads originate | + | |
- | * Tile_number: Positive Integer | + | |
- | * X_coordinate: of the read cluster on the slide | + | |
- | * Y_coordinate: of the read cluster on the slide | + | |
- | * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs. | + | |
- | * Read_Number: 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read. | + | |
- | * Sequence: Base called sequence of the ring | + | |
- | * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. 'B' is the lowest quality while 'b' is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/illumina format. | + | |
- | * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes. | + | |
- | *illumina_run_2/ | + | |
* This was a paired end sequencing run. | * This was a paired end sequencing run. | ||
- | * These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx | + | * These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx. For the distribution of fragment size, please refer to {{:computer_resources:banana_slug_fragment_size_for_reads.pdf|bioanalyzer results of final libraries}}. In the file, "sample 7" corresponds to barcode 7 and "sample 8" corresponds to barcode 8. |
- | * Barcode 7 has a mean fragment length of 411bp. (7_1.fastq & 7_2.fastq) | + | * Barcode 7 (7_1.fastq & 7_2.fastq) has a mean fragment length of 411bp based on the Bioanalyzer results, but the length peaks at 249 in [[bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]]. There are 747,017 read pairs with barcode 7. |
- | * Barcode 8 has a mean fragment length of 372bp. (8_1.fastq & 8_2.fastq) | + | * Barcode 8 (8_1.fastq & 8_2.fastq) has a mean fragment length of 372bp based on the Bioanalyzer results, but the length appears to average less than 100 in [[bioinformatic_tools:bwa#determining_paired-end_insert_size|Determining Paired-end Insert Size]]. The distribution peaks at 115, but there seems to be a truncation artifact, and the true sizes are probably smaller. There are 7,118,223 read pairs with barcode 8. |
- | * For the distribution of fragment size, please refer to {{:computer_resources:banana_slug_fragment_size_for_reads.pdf|bioanalyzer results of final libraries}}. In the file, "sample 7" corresponds to barcode 7 and "sample 8" corresponds to barcode 8. | + | * The over-estimate of length apparently is a common problem with Illumina libraries, perhaps due to dimerization of DNA pieces from hybridization of adapters. |
* solid_run_1 | * solid_run_1 | ||
* This was a single ended run produced on a SOLiD 3 machine. There was an observed GC bias in other libraries sequenced in the same region as this run. | * This was a single ended run produced on a SOLiD 3 machine. There was an observed GC bias in other libraries sequenced in the same region as this run. | ||
+ | * primary.20100403223319810/ has no data, just 4 million empty reads | ||
+ | * primary.20100411112358944/ does have data: 4,042,811 50-long single-end reads. | ||
+ | * secondary.F3.20100403054816892/ seems to be quality control information from spiked-in controls. | ||
* solid_run_2 | * solid_run_2 | ||
- | * This was a single ended run produced on a SOLiD 4 machine. There was no observed GC bias in other libraries sequenced in the same region as this run. | + | * This was a single ended run produced on a SOLiD 4 machine. There was no observed GC bias in other libraries sequenced in the same region as this run. |
+ | * primary.20101127015521942/ has no data | ||
+ | * primary.20101202225202063/ has 26,987,171 50-base reads. | ||
+ | * secondary.F3.20101127015516841/ seems to be quality control from spiked-in samples. | ||
+ | * kmer-counts/ | ||
+ | * Used for running [[bioinformatic_tools:jellyfish|Jellyfish]]. | ||
+ | * Contains Jellyfish k-mer tables and histogram files for 454 and Illumina runs before and after SeqPrep | ||
+ | * insert/ | ||
+ | * Used for estimating distribution of template sizes for the paired-end Illumina reads. | ||
+ | * Accomplished by mapping pairs to a reference, originally the 454 reads and later the contigs from a SOAPdenovo assembly, using [[bioinformatic_tools:bwa|BWA]]. | ||
+ | * clean/ | ||
+ | * Used for running [[bioinformatic_tools:seqprep|SeqPrep]] and [[bioinformatic_tools:quake|Quake]] correction on the Illumina reads. | ||
+ | * run1_seqprep_quake/ and run2_seqprep_quake/ contain the corrected reads used for assembly. | ||
+ | * *_[12]_qseq_seqprep.cor.fastq.gz contain the corrected paired reads. | ||
+ | * *_[12]_qseq_seqprep.cor_single.fastq.gz contain the corrected reads whose pair could not be corrected. | ||
+ | * *_merged_qseq_seqprep.cor.fastq.gz contain the corrected reads that were merged by SeqPrep. | ||
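The three filename patterns listed for run1_seqprep_quake/ and run2_seqprep_quake/ can be used directly as shell-style globs. The following is a minimal sketch, assuming the patterns are applied to bare file names; the `classify` helper, the category labels, and the example file names are hypothetical and not part of the original page.

```python
from fnmatch import fnmatchcase

# Glob patterns copied from the clean/ directory description.
# The labels ("paired", "single", "merged") are illustrative names only.
PATTERNS = [
    ("paired", "*_[12]_qseq_seqprep.cor.fastq.gz"),
    ("single", "*_[12]_qseq_seqprep.cor_single.fastq.gz"),
    ("merged", "*_merged_qseq_seqprep.cor.fastq.gz"),
]


def classify(filename):
    """Return the category of a corrected-read file, or None if no pattern matches."""
    for label, pattern in PATTERNS:
        if fnmatchcase(filename, pattern):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ["s_1_1_qseq_seqprep.cor.fastq.gz",
                 "s_1_2_qseq_seqprep.cor_single.fastq.gz",
                 "s_1_merged_qseq_seqprep.cor.fastq.gz"]:
        print(name, classify(name))
```

The patterns are mutually exclusive, so the order in which they are tried does not matter.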
==== Illumina Data Notes ==== | ==== Illumina Data Notes ==== | ||
Line 90: | Line 96: | ||
#0 index number for a multiplexed sample (0 for no indexing)\\ | #0 index number for a multiplexed sample (0 for no indexing)\\ | ||
/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)\\ | /1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)\\ | ||
+ | |||
+ | In the original files, lines are tab delimited with the following fields: | ||
+ | * Machine_name: Unique Identifier of the sequencer | ||
+ | * Run_number: Identifier of the sequencing run | ||
+ | * Lane_number: Positive integer from 1 to 8 identifying the lane from which the reads originate | ||
+ | * Tile_number: Positive Integer | ||
+ | * X_coordinate: of the read cluster on the slide | ||
+ | * Y_coordinate: of the read cluster on the slide | ||
+ | * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs. | ||
+ | * Read_Number: 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read. | ||
+ | * Sequence: Base called sequence of the read | ||
+ | * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. 'B' is the lowest quality while 'b' is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/illumina format. | ||
+ | * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes. |
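The eleven tab-delimited fields above map naturally onto a small record type. The sketch below parses one line and converts it to a FASTQ record. Several details are assumptions not stated on this page: that quality characters use an ASCII offset of 64, that `.` in the sequence marks an uncalled base, and that the read name is built as `machine_run:lane:tile:x:y#index/read`. The names `QseqRead`, `parse_qseq_line`, and `to_fastq` are hypothetical.

```python
from typing import NamedTuple


class QseqRead(NamedTuple):
    machine: str
    run: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    read_number: int
    sequence: str
    quality: str
    passed_filter: bool


def parse_qseq_line(line: str) -> QseqRead:
    """Split one tab-delimited line into its eleven fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    machine, run, lane, tile, x, y, index, read_no, seq, qual, flt = fields
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return QseqRead(machine, run, int(lane), int(tile), int(x), int(y),
                    index, int(read_no), seq, qual, flt == "1")


def to_fastq(read: QseqRead, qual_offset: int = 64) -> str:
    """Format a read as a four-line FASTQ record with offset-33 qualities.

    qual_offset=64 is an assumption about the input encoding.
    """
    name = (f"{read.machine}_{read.run}:{read.lane}:{read.tile}:"
            f"{read.x}:{read.y}#{read.index}/{read.read_number}")
    seq = read.sequence.replace(".", "N")  # assumed: '.' marks an uncalled base
    qual = "".join(chr(ord(c) - qual_offset + 33) for c in read.quality)
    return f"@{name}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    # Made-up example line, not taken from the real data.
    example = "HWUSI-EAS100R\t6\t4\t73\t941\t1973\t0\t1\tACG.T\tbbBab\t1\n"
    print(to_fastq(parse_qseq_line(example)), end="")
```

Reads with `passed_filter` set to `False` are returned rather than dropped, so the caller can decide whether to apply the quality filter.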