data/
The /campusdata/BME235/data/ directory on campusrocks contains the sequencing data from different organisms.
Pog/ contains data for //Pyrobaculum oguniense//
Expected genome size around 2.4M
454_run/ contains data from the first 454 run
sff/ contains the original data
*.TCA*.fna contains fasta-formatted reads
*.TCA*.qual contains pseudo-fasta-formatted quality information
Read-length avg:371 min:36 max:644. Expected coverage 60x.
solid_run/ contains data from the first SOLiD paired-end run
F3_reads/ contains the downstream reads (original data)
R3_reads/ contains the upstream reads (original data)
paired/ contains interleaved files with F3 followed by matching R3, but without quality data
csfasta lines are 26 characters long (not counting the newline). The first letter is the last base of the primer, and the first color is the transition from that base to the actual data; the remaining 24 colors correspond to transitions within the genomic DNA. Translated to base space, these are 25-base reads. When working in color space, you often keep just the 24 colors and ignore (or do not know) the phase.
The separation between the starts of the R3 and F3 reads is approximately 2220 bases. The pluck-scripts program map-colorspace provides a distribution of the lengths. The separation between the R3 and F3 starts is roughly in the range 750-4700, and 500-5000 contains essentially all the good pairs.
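The csfasta encoding described for solid_run/ can be sketched in a few lines of Python. This is only an illustration, assuming the usual SOLiD two-bit coding (A=0, C=1, G=2, T=3, with each color the XOR of two adjacent base codes); the function name and the choice to pad with N after a missing color call are hypothetical and are not taken from pluck-scripts.

```python
# Base <-> 2-bit code. Under this coding a SOLiD color is the XOR of the
# codes of two adjacent bases (0 = same base, e.g. AA; 3 = e.g. AT or CG).
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def decode_colorspace(read):
    """Translate one csfasta read (primer base + colors) to base space.

    The returned string excludes the primer base, so a 26-character
    csfasta line (1 letter + 25 colors) yields 25 bases.
    """
    prev = CODE[read[0]]              # last base of the primer
    out = []
    for color in read[1:]:
        if color not in "0123":       # e.g. '.' for a missing color call
            # The phase is lost from here on; pad with N (illustrative choice).
            out.append("N" * (len(read) - 1 - len(out)))
            break
        prev ^= int(color)            # next base = previous base XOR color
        out.append(BASES[prev])
    return "".join(out)


print(decode_colorspace("T0123"))
```

Because every base depends on the one before it, a single wrong color changes all downstream bases, which is one reason many tools stay in color space as long as possible.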
sanger/ contains Sanger reads from PCR experiments to fill gaps and verify the conjectured assembly of contigs. We have fasta files and the traces as .ab1 files. Unfortunately, the only software I've found so far for getting quality data from the traces is Applied Biosystems' Sequence Scanner, which is a Windows-only product. I also tried PeakTrace, an on-line service, but they simply rejected the traces as too noisy. Currently there are 3 fasta files:
PogSanger.fa
sequences-3-2-2010.fa
sequences-3-19-2010.fa
finished/ contains draft genome assemblies, the best that David Bernick and Kevin Karplus could determine from the above data.
Pog.ece.v3.fa finished draft of circular viral chromosome
Pog.chr.v3.fa earlier draft of circular archaeal genome
Pog.v4c.fa both chr and ece. The ece draft is identical to the v3 draft, but the chr draft has a few fixes. This one may end up being the published genome, but we still have a handful of changes that we are debating. Compare assemblies from other tools to this assembly. If you find discrepancies, we would like to examine them to see if we find evidence for better results.
Jellyfish was run on the 454 data to get k-mer counts. There was a peak at k-mers occurring 46 times, consistent with 46-47x coverage (rather than 60x). (More information on the Jellyfish page.)
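To illustrate how a coverage estimate falls out of k-mer counts, here is a toy Python sketch of the same idea. It is not how Jellyfish is implemented, and the function names, the value of k, and the min_count cutoff are all hypothetical choices for illustration: count every k-mer, build a histogram of how many distinct k-mers occur once, twice, and so on, then take the most populated multiplicity as the coverage peak.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return {multiplicity: number of distinct k-mers seen that many times}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def coverage_peak(histogram, min_count=2):
    """Most populated multiplicity, skipping low counts dominated by errors."""
    candidates = {m: n for m, n in histogram.items() if m >= min_count}
    return max(candidates, key=candidates.get)


# Tiny made-up example: three identical reads.
toy_reads = ["ACGTACGT"] * 3
hist = kmer_histogram(toy_reads, k=4)
print(sorted(hist.items()), coverage_peak(hist))
```

On real data the histogram usually has a large spike at multiplicity 1 from sequencing errors, which is why the sketch ignores counts below min_count when looking for the peak.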
slug/
454_run1/ contains first 454 run (3 files)
GAZ7HUX02 155,488 reads 45,092,614 bases
GAZ7HUX03 148,246 reads 41,386,658 bases
GAZ7HUX04 141,861 reads 40,429,297 bases
454_run2/ contains second 454 run, one file
454_run3/ is bogus: it contains 454_run2/ plus some other lanes that are not banana slug.
solid_run1/ will contain first SOLiD run
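As a quick check on the 454_run1/ listing, the per-file counts can be summed to get the run total and the mean read length. A minimal sketch; the numbers are copied from the listing, and the variable names are only illustrative.

```python
# (reads, bases) per file, copied from the 454_run1/ listing.
run1 = {
    "GAZ7HUX02": (155_488, 45_092_614),
    "GAZ7HUX03": (148_246, 41_386_658),
    "GAZ7HUX04": (141_861, 40_429_297),
}

total_reads = sum(reads for reads, _ in run1.values())
total_bases = sum(bases for _, bases in run1.values())
mean_length = total_bases / total_reads

print(f"{total_reads:,} reads, {total_bases:,} bases, "
      f"mean read length {mean_length:.1f}")
```

The three files together hold 445,595 reads and 126,908,569 bases, for a mean read length of about 285 bases.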
-
This data was available for the 2010 class.
Lane 4 is a control lane and this data should not be used during assembly.
Sequence data are in 1920 files with suffix _qseq.txt totalling 128,746,146,454 bytes
Each line within a datafile constitutes a read, and there are about 350k reads per file, making about 670M reads.
After running the reads through SeqPrep, the DNA fragment length seems to peak around 140 bases, both for merged reads and for mapping of pairs to 454 reads using BWA.
illumina_run_2/
This was a paired end sequencing run.
These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx. For the distribution of fragment size, please refer to the bioanalyzer results of final libraries. In the file, “sample 7” corresponds to barcode 7 and “sample 8” corresponds to barcode 8.
Barcode 7 (7_1.fastq & 7_2.fastq) has a mean fragment length of 411bp based on the Bioanalyzer results, but the length peaks at 249 in Determining Paired-end Insert Size. There are 747,017 read pairs with barcode 7. The Bioanalyzer result includes 119 bases of adapter, so the insert length is about 292 (411-119), not far from our mapping result.
Barcode 8 (8_1.fastq & 8_2.fastq) has a mean fragment length of 372bp based on the Bioanalyzer results, but the length looks like it averages less than 100 in Determining Paired-end Insert Size (the distribution peaks at 115, but there seems to be a truncation artifact, and the true sizes are probably smaller). There are 7,118,223 read pairs with barcode 8. Even after removing the adapters, the Bioanalyzer estimate (372-119=253) is still too big.
The over-estimate of length apparently is a common problem with Illumina libraries, perhaps due to dimerization of DNA pieces from hybridization of adapters.
solid_run_1
This was a single-end run produced on a SOLiD 3 machine. There was an observed GC bias in other libraries sequenced in the same region as this run.
primary.20100403223319810/ has no data, just 4 million empty reads
primary.20100411112358944/ does have data: 4,042,811 50-long single-end reads.
secondary.F3.20100403054816892/ seems to be quality control information from spiked-in controls.
solid_run_2
This was a single-end run produced on a SOLiD 4 machine. There was no observed GC bias in other libraries sequenced in the same region as this run.
primary.20101127015521942/ has no data
primary.20101202225202063/ has 26,987,171 50-base reads.
secondary.F3.20101127015516841/ seems to be quality control from spiked-in samples.
kmer-counts/
insert/
clean/
Illumina Data Notes
The fastq/ directory contains the Illumina run converted into fastq format. Note that I do not include any reads that do not pass Illumina's quality filter, nor any reads from the control lane of this experiment (lane 4). To convert the data from Illumina's .txt output to fastq, I use a C program I wrote, which is installed in our bin directory, with the source located here:
/programs/johnScripts/illuminaToFastq.c
Note that this C program preserves the quality score format of the _qseq file. Since the _qseq file uses phred quality scores rather than illumina/solexa scores, be careful to take this into account when you use these fastq files in programs that may expect otherwise.
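If a downstream program expects the more common Sanger-style fastq encoding, the quality strings can be shifted. The field list on this page describes quality characters running from 'B' to 'b', which is consistent with phred scores stored at ASCII offset 64; that offset is an assumption here, not something this page states, so check a few reads before relying on it. A minimal Python sketch with a hypothetical function name:

```python
def phred64_to_phred33(quality):
    """Re-encode a quality string from ASCII offset 64 to ASCII offset 33.

    Assumes the input really is phred-scaled at offset 64 (an assumption,
    see the note above); the phred values themselves are unchanged.
    """
    return "".join(chr(ord(char) - 64 + 33) for char in quality)


print(phred64_to_phred33("Bb"))
```

Only the character encoding changes; no rescaling between solexa and phred scores is involved.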
The C program takes four arguments: an Illumina .txt file and its pair, and the file names of the two corresponding output .fastq files. It goes through each read and its mate and checks that both pass Illumina's quality filter. If one or both do not pass, the pair is excluded. It looks like the majority of reads pass this criterion, as the output fastq files are approximately the same size as the input files.
Here is an example of what a fastq read should look like:
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
Example ID line explained:
@HWUSI-EAS100R:6:73:941:1973#0/1
HWUSI-EAS100R the unique instrument name
6 flowcell lane
73 tile number within the flowcell lane
941 'x'-coordinate of the cluster within the tile
1973 'y'-coordinate of the cluster within the tile
#0 index number for a multiplexed sample (0 for no indexing)
/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
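The ID layout explained above can be split into its parts with a short Python sketch. The class and function names are hypothetical, and the pattern only covers the instrument:lane:tile:x:y#index/member form shown here; other ID styles (such as the SRR example) are rejected with an error.

```python
import re
from typing import NamedTuple, Optional


class ReadId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    member: Optional[int]   # None when the /1 or /2 suffix is absent


_ID_PATTERN = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/\s]+)(?:/(?P<member>[12]))?$"
)


def parse_read_id(line):
    """Parse an ID line like '@HWUSI-EAS100R:6:73:941:1973#0/1'."""
    match = _ID_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not an Illumina-style read ID: {line!r}")
    member = match.group("member")
    return ReadId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        index=match.group("index"),
        member=int(member) if member else None,
    )


print(parse_read_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```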
In the original files, lines are tab delimited with the following fields:
Machine_name: Unique Identifier of the sequencer
Run_number: Identifier of the sequencing run
Lane_number: Positive integer between 1 and 8 signifying the lane from which the reads originate
Tile_number: Positive Integer
X_coordinate: of the read cluster on the slide
Y_coordinate: of the read cluster on the slide
Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs.
Read_Number: 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read.
Sequence: Base-called sequence of the read
Quality: Calibrated quality string, giving the quality of each base in the read. 'B' is the lowest quality and 'b' is the highest. More information is needed on the values between these extremes. This quality score is in phred format, not the solexa/illumina format.
Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes.
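Combining this field layout with the pair filter described for illuminaToFastq.c, the conversion can be sketched in Python. This is not the C program's source: the function names are hypothetical, the read naming simply follows the example ID line on this page, the records in the demo are made up, and quality strings are passed through unchanged.

```python
import io


def qseq_to_fastq(fields):
    """Format one split qseq line (11 tab-separated fields) as a fastq record."""
    machine, _run, lane, tile, x, y, index, read_no, seq, qual, _flt = fields
    name = f"{machine}:{lane}:{tile}:{x}:{y}#{index}/{read_no}"
    return f"@{name}\n{seq}\n+\n{qual}\n"


def convert_pairs(qseq1, qseq2, fastq1, fastq2):
    """Write only pairs in which both reads pass the filter; return the count."""
    kept = 0
    for line1, line2 in zip(qseq1, qseq2):
        fields1 = line1.rstrip("\n").split("\t")
        fields2 = line2.rstrip("\n").split("\t")
        # Field 11 (index 10) is the Filter flag: "1" means the read passed.
        if fields1[10] == "1" and fields2[10] == "1":
            fastq1.write(qseq_to_fastq(fields1))
            fastq2.write(qseq_to_fastq(fields2))
            kept += 1
    return kept


# Made-up records: the first pair passes, the second has a failing read 1.
pass_1 = "HWUSI-EAS100R\t1\t6\t73\t941\t1973\t0\t1\tACGT\tbbbb\t1\n"
pass_2 = "HWUSI-EAS100R\t1\t6\t73\t941\t1973\t0\t2\tTTGA\tbbBB\t1\n"
fail_1 = "HWUSI-EAS100R\t1\t6\t73\t950\t2001\t0\t1\tAC.T\tBBBB\t0\n"
fail_2 = "HWUSI-EAS100R\t1\t6\t73\t950\t2001\t0\t2\tGGCA\tbbbb\t1\n"

out1, out2 = io.StringIO(), io.StringIO()
kept = convert_pairs([pass_1, fail_1], [pass_2, fail_2], out1, out2)
print(kept)
print(out1.getvalue(), end="")
```

With real data the two inputs would be open file handles for a read 1 _qseq.txt file and its read 2 partner, read in lockstep so that pairs stay aligned in the two outputs.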
Discussion
We have illumina data. Please check the illumina Sequencing Run 1 for details on which lanes contain which data. The size distribution is somewhere between 100-300 bp.
It looks like we have paired end data on the illumina run. Can anyone verify this and find out what the average insert length and insert length deviation is if we do indeed have paired end data?
Most assemblers want either some proprietary format, fasta, or fasta+qual files. Do we have a conversion tool to convert the Illumina “txt” files to fasta or to fasta and qual files?
What is the insert or fragment size on the Pog solid paired reads?
SOLiD data for Banana Slug is not ready yet. Eveline from Nader's group is working on it and she says that it's going to take some time to get that data.
data/ is now organized for Banana Slug, by organism → by platform → and by run number, under /campusdata/BME235/data/slug/
Banana slug data from Illumina is available! I will transfer the files from the sequencing cluster to campusrocks ~/data/slug/illumina_run1
Oh! I should have checked it here first. I have been trying to contact the person associated with the Banana Slug data from Illumina.
Probably should have cross references from the platform names to the documentation of the sequencing methods.