data/
- Pog/ contains data for //Pyrobaculum oguniense//
- slug/
  - Illumina Data Notes

data/

The /campusdata/BME235/data/ directory on campusrocks contains the sequencing data from different organisms.

Pog/ contains data for //Pyrobaculum oguniense//

Expected genome size around 2.4M
454_run/ contains data from the first 454 run
- sff/ contains the original data
- *.TCA*.fna contains fasta-formatted reads
- *.TCA*.qual contains pseudo-fasta-formatted quality information
- Read-length avg:371 min:36 max:644. Expected coverage 60x.
solid_run/ contains data from the first SOLiD paired-end run
- F3_reads/ contains the downstream reads (original data)
- R3_reads/ contains the upstream reads (original data)
- paired/ contains interleaved files with F3 followed by matching R3, but without quality data
- csfasta lines are 26 characters long (not counting the newline). The first letter is the last base of the primer, and the first color is the transition from that base to the actual data; the remaining 24 colors correspond to transitions within the genomic DNA. Translated to base space, these are 25-base reads. When working in color space, you often have just the 24 colors and ignore, or do not know, the phase.
- The separation between the starts of the R3 and F3 reads is approximately 2220 bases. The pluck-scripts program map-colorspace provides a distribution of the lengths. The separation between the R3 and F3 starts is roughly in the range 750-4700, and 500-5000 contains essentially all the good pairs.
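The csfasta layout described above can be made concrete with a small decoder. This is a minimal sketch, assuming the standard SOLiD two-base encoding (each color is the XOR of the 2-bit codes of adjacent bases, with A=0, C=1, G=2, T=3); the function name and the example read are made up for illustration and are not part of pluck-scripts.

```python
# SOLiD color code: each color is the XOR of the 2-bit codes of adjacent bases.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def decode_colorspace(read):
    """Translate a csfasta read (primer base + colors) into base space.

    The returned string excludes the leading primer base, so a
    26-character csfasta line yields 25 bases.
    """
    prev = BASE_TO_BITS[read[0]]
    bases = []
    for color in read[1:]:
        if color not in "0123":
            # A missing color call ('.') makes the rest of the read undecodable.
            bases.append("N" * (len(read) - 1 - len(bases)))
            break
        prev ^= int(color)
        bases.append(BITS_TO_BASE[prev])
    return "".join(bases)


example = "T0123" + "0" * 21  # made-up read: 1 primer base + 25 colors
decoded = decode_colorspace(example)
print(len(example), len(decoded), decoded)
```

Because every base depends on all the colors before it, a single color error corrupts the rest of the decoded read, which is one reason to stay in color space for mapping and translate only at the end.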
sanger/ contains Sanger reads from PCR experiments to fill gaps and verify the conjectured assembly of contigs. We have fasta files and the traces as .ab1 files. Unfortunately, the only software I've found so far for getting quality data from the traces is Applied Biosystems' Sequence Scanner, which is a Windows-only product. I also tried PeakTrace, an on-line service, but they simply rejected the traces as too noisy. Currently there are 3 fasta files:
- PogSanger.fa
- sequences-3-2-2010.fa
- sequences-3-19-2010.fa
finished/ contains draft genome assemblies, the best that David Bernick and Kevin Karplus could determine from the above data.
- Pog.ece.v3.fa finished draft of circular viral chromosome
- Pog.chr.v3.fa earlier draft of circular archaeal genome
- Pog.v4c.fa both chr and ece. The ece draft is identical to the v3 draft, but the chr draft has a few fixes. This one may end up being the published genome, but we still have a handful of changes that we are debating. Compare assemblies from other tools to this assembly. If you find discrepancies, we would like to examine them to see if we find evidence for better results.
Jellyfish was run on the 454 data to get k-mer counts. There was a peak at k-mers occurring 46 times, consistent with 46-47x coverage (rather than 60x). (More information on the Jellyfish page.)
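The coverage figures above come down to simple arithmetic: coverage is the total number of sequenced bases divided by the genome size. A minimal sketch using the 2.4M genome size and 371-base average read length quoted on this page. The k-mer correction is the usual relation for error-free reads, and the value of k is a placeholder, since this page does not say which k was used.

```python
GENOME_SIZE = 2_400_000   # expected Pog genome size from this page
MEAN_READ_LEN = 371       # average 454 read length from this page


def total_bases(coverage, genome_size=GENOME_SIZE):
    """Bases of sequence implied by a given coverage."""
    return coverage * genome_size


def base_coverage_from_kmer_peak(peak, k, read_len=MEAN_READ_LEN):
    """Convert a k-mer histogram peak to base coverage.

    A read of length L contains L - k + 1 k-mers, so k-mer depth is
    lower than base coverage by the factor (L - k + 1) / L.
    """
    return peak * read_len / (read_len - k + 1)


print(total_bases(60))   # bases needed for the expected 60x
print(total_bases(46))   # bases implied by the observed k-mer peak
# k = 9 is a hypothetical value chosen only for illustration.
print(round(base_coverage_from_kmer_peak(46, k=9), 1))
```

With reads this long, the correction is small for short k, which fits the peak at 46 being read as 46-47x coverage.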

slug/

454_run1/ contains first 454 run (3 files)
- GAZ7HUX02 155,488 reads 45,092,614 bases
- GAZ7HUX03 148,246 reads 41,386,658 bases
- GAZ7HUX04 141,861 reads 40,429,297 bases
454_run2/ contains second 454 run, one file
- GCLL8Y406 54,283 reads 11,724,143 bases
454_run3/ is bogus: it contains 454_run2/ plus some other lanes that are not banana slug.
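For a quick sense of how much banana slug 454 data there is in total, the per-file counts above can be summed. A small sketch using only the numbers listed for 454_run1/ and 454_run2/ (454_run3/ is left out because it duplicates run 2 and contains non-slug lanes):

```python
# (reads, bases) per file, copied from the listing above
runs = {
    "GAZ7HUX02": (155_488, 45_092_614),
    "GAZ7HUX03": (148_246, 41_386_658),
    "GAZ7HUX04": (141_861, 40_429_297),
    "GCLL8Y406": (54_283, 11_724_143),
}

total_reads = sum(r for r, _ in runs.values())
total_bases = sum(b for _, b in runs.values())

for name, (reads, bases) in runs.items():
    print(f"{name}: mean read length {bases / reads:.1f}")
print(f"total: {total_reads:,} reads, {total_bases:,} bases")
```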
solid_run1/ will contain first SOLiD run
illumina_run1/ contains data from the first Illumina run
- This data was available for the 2010 class.
- Lane 4 is a control lane and this data should not be used during assembly.
- Sequence data are in 1920 files with suffix _qseq.txt totalling 128,746,146,454 bytes
  - Each line within a data file is a read, and there are about 350k reads per file, making about 670M reads in total.
  - After running through SeqPrep, the fragment length seems to peak around 140 bases, both for merged reads and for pairs mapped to 454 reads using BWA.
illumina_run_2/
- This was a paired end sequencing run.
- These two libraries were multiplexed with 6 other libraries within a single lane of an Illumina GAIIx. For the distribution of fragment size, please refer to bioanalyzer results of final libraries. In the file, “sample 7” corresponds to barcode 7 and “sample 8” corresponds to barcode 8.
  - Barcode 7 has a mean fragment length of 411bp (7_1.fastq & 7_2.fastq) based on the Bioanalyzer results, but the length peaks at 249 in Determining Paired-end Insert Size. There are 747,017 read pairs with barcode 7. The Bioanalyzer result includes 119 bases of adapter, so the insert length is about 292, not far from our mapping result.
  - Barcode 8 has a mean fragment length of 372bp (8_1.fastq & 8_2.fastq) based on the Bioanalyzer results, but the length looks like it averages less than 100 in Determining Paired-end Insert Size (the distribution peaks at 115, but there seems to be a truncation artifact, and the true sizes are probably smaller). There are 7,118,223 read pairs with barcode 8. Even after subtracting the adapters (372-119=253), the estimate is still too big.
- The over-estimate of length apparently is a common problem with Illumina libraries, perhaps due to dimerization of DNA pieces from hybridization of adapters.
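The adapter correction for the two barcodes is the same subtraction each time. A small sketch with the numbers from this page; the function name is made up for illustration.

```python
ADAPTER_BASES = 119  # adapter length included in the Bioanalyzer fragment size


def insert_length(fragment_len, adapter_bases=ADAPTER_BASES):
    """Insert length once the adapters are subtracted from the fragment."""
    return fragment_len - adapter_bases


# barcode -> (Bioanalyzer mean fragment length, peak seen when mapping pairs)
libraries = {7: (411, 249), 8: (372, 115)}

for barcode, (fragment_len, mapped_peak) in libraries.items():
    expected = insert_length(fragment_len)
    print(f"barcode {barcode}: expected insert {expected}, "
          f"mapped peak {mapped_peak}, difference {expected - mapped_peak}")
```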
solid_run_1
- This was a single ended run produced on a SOLiD 3 machine. There was an observed GC bias in other libraries sequenced in the same region as this run.
- primary.20100403223319810/ has no data, just 4 million empty reads
- primary.20100411112358944/ does have data: 4,042,811 50-long single-end reads.
- secondary.F3.20100403054816892/ seems to be quality control information from spiked-in controls.
solid_run_2
- This was a single ended run produced on a SOLiD 4 machine. There was no observed GC bias in other libraries sequenced in the same region as this run.
- primary.20101127015521942/ has no data
- primary.20101202225202063/ has 26,987,171 50-base reads.
- secondary.F3.20101127015516841/ seems to be quality control from spiked-in samples.
kmer-counts/
- Used for running Jellyfish.
- Contains Jellyfish k-mer tables and histogram files for 454 and Illumina runs before and after SeqPrep
insert/
- Used for estimating distribution of template sizes for the paired-end Illumina reads.
- Accomplished by mapping pairs to a reference, originally the 454 reads and later the contigs from a SOAPdenovo assembly, using BWA.
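One way to get a template-size distribution from a BWA mapping is to tally the TLEN column (column 9) of the SAM output. This is a minimal sketch, not the procedure actually used for the files in insert/. The function names, the 1000-base cutoff, and the SAM lines are made up for illustration.

```python
from collections import Counter


def template_lengths(sam_lines, max_len=1000):
    """Yield template lengths (SAM column 9, TLEN) of properly paired reads."""
    for line in sam_lines:
        if line.startswith("@"):      # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        properly_paired = flag & 0x2
        # TLEN is positive for the leftmost read of a pair; keeping only
        # positive values avoids counting every pair twice.
        if properly_paired and 0 < tlen <= max_len:
            yield tlen


def peak_length(sam_lines):
    """Return the most common template length, or None if there are none."""
    counts = Counter(template_lengths(sam_lines))
    return counts.most_common(1)[0][0] if counts else None


# Made-up SAM records for three pairs mapped to a hypothetical contig.
sample = [
    "@SQ\tSN:ctg1\tLN:5000",
    "r1\t99\tctg1\t100\t60\t4M\t=\t245\t149\tACGT\tIIII",
    "r1\t147\tctg1\t245\t60\t4M\t=\t100\t-149\tACGT\tIIII",
    "r2\t99\tctg1\t300\t60\t4M\t=\t445\t149\tACGT\tIIII",
    "r2\t147\tctg1\t445\t60\t4M\t=\t300\t-149\tACGT\tIIII",
    "r3\t99\tctg1\t700\t60\t4M\t=\t836\t140\tACGT\tIIII",
    "r3\t147\tctg1\t836\t60\t4M\t=\t700\t-140\tACGT\tIIII",
]
print(peak_length(sample))
```

When the reference is a set of short reads or contigs, pairs that span the end of a reference sequence cannot map properly, so long templates are under-counted. That may be related to the truncation artifact mentioned for barcode 8.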
clean/
- Used for running SeqPrep and Quake correction on the Illumina reads.
- run1_seqprep_quake/ and run2_seqprep_quake/ contain the corrected reads used for assembly.
  - *_[12]_qseq_seqprep.cor.fastq.gz contain the corrected paired reads.
  - *_[12]_qseq_seqprep.cor_single.fastq.gz contain the corrected reads whose pair could not be corrected.
  - *_merged_qseq_seqprep.cor.fastq.gz contain the corrected reads that were merged by SeqPrep.

Illumina Data Notes

The fastq/ directory contains the Illumina run converted into fastq format. Note that I do not include any reads that do not pass Illumina's quality filter. Additionally, I do not include any reads from the control lane of this experiment (lane 4). To convert the data from Illumina's .txt output to fastq, I use a C program I wrote, which is installed in our bin directory; the source is located here:

/programs/johnScripts/illuminaToFastq.c

Note that this C program preserves the quality score format of the _qseq file. Since the _qseq file uses Phred quality scores rather than Illumina/Solexa scores, you need to take this into account when you use these fastq files in programs that may expect otherwise.
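If a downstream program expects a different quality encoding, the strings have to be decoded and re-encoded. A minimal sketch, assuming the quality characters use an ASCII offset of 64. That offset is an inference from the 'B' to 'b' range given in the field list on this page, not something stated here, so check your own files before relying on it. The function names are made up for illustration.

```python
def phred_scores(quality, offset=64):
    """Decode a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


def to_offset33(quality, offset=64):
    """Re-encode a quality string with the Sanger-style offset of 33."""
    return "".join(chr(ord(ch) - offset + 33) for ch in quality)


print(phred_scores("Bb"))   # lowest and highest characters seen in the qseq files
print(to_offset33("Bb"))
```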

The C program takes four arguments: an Illumina .txt file and its pair, and the file names of the two corresponding output .fastq files. It goes through each read and its pair and checks that both pass Illumina's quality filter. If one or both do not pass, the pair is excluded. It looks like the majority of reads pass this criterion, as the output fastq files are approximately the same size as the input files.

Here is an example of what a fastq read should look like:

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Example ID line explained:

@HWUSI-EAS100R:6:73:941:1973#0/1
HWUSI-EAS100R the unique instrument name
6 flowcell lane
73 tile number within the flowcell lane
941 'x'-coordinate of the cluster within the tile
1973 'y'-coordinate of the cluster within the tile
#0 index number for a multiplexed sample (0 for no indexing)
/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
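The ID layout explained above is regular enough to pull apart with a regular expression. A minimal sketch; the pattern and function name are made up for illustration and cover only IDs of this form.

```python
import re

# Pattern for read IDs of the form shown above:
# @instrument:lane:tile:x:y#index/member
READ_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>-?\d+):(?P<y>-?\d+)#(?P<index>[^/]+)(?:/(?P<member>[12]))?$"
)


def parse_read_id(line):
    """Split a FASTQ ID line into named parts; return None if it doesn't match."""
    m = READ_ID.match(line.strip())
    if m is None:
        return None
    parts = m.groupdict()
    for key in ("lane", "tile", "x", "y"):
        parts[key] = int(parts[key])
    return parts


print(parse_read_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```

Comparing everything before the final /1 or /2 is a simple way to check that two fastq files are still in matching pair order.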

In the original files, lines are tab-delimited with the following fields:

Machine_name: Unique Identifier of the sequencer
Run_number: Identifier of the sequencing run
Lane_number: Positive integer from 1 to 8 signifying the lane from which the reads originate
Tile_number: Positive Integer
X_coordinate: of the read cluster on the slide
Y_coordinate: of the read cluster on the slide
Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs.
Read_Number: 1 for single reads, 1 or 2 for paired end reads, 1, 2 or 3 for multiplexed paired end reads. For multiplexed paired end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read.
Sequence: Base-called sequence of the read
Quality: Calibrated quality string, giving the quality of each base in the read. 'B' is the lowest quality and 'b' is the highest. More information is needed about the values in between. The scores are in Phred format, not the Solexa/Illumina format.
Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes.
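The conversion described earlier (keep a pair only if both reads pass the filter, leave the quality string untouched) can be sketched from the field list above. This is a Python illustration, not the illuminaToFastq.c source. The read-name layout and the handling of '.' as an uncalled base are assumptions, and the function names are made up.

```python
def qseq_to_fastq_record(line):
    """Convert one tab-delimited qseq line to (passes_filter, fastq_text)."""
    (machine, run, lane, tile, x, y, index, read_no,
     seq, qual, filt) = line.rstrip("\n").split("\t")
    # Assumption: uncalled bases appear as '.' in qseq and become 'N' in FASTQ.
    seq = seq.replace(".", "N")
    # Assumption: header layout; the C program may name reads differently.
    header = f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read_no}"
    # The quality string is copied unchanged, so its encoding is preserved.
    return filt == "1", f"@{header}\n{seq}\n+\n{qual}\n"


def convert_pair(qseq1, qseq2, fastq1, fastq2):
    """Write only pairs in which both reads pass the quality filter."""
    kept = 0
    with open(qseq1) as in1, open(qseq2) as in2, \
            open(fastq1, "w") as out1, open(fastq2, "w") as out2:
        for line1, line2 in zip(in1, in2):
            ok1, rec1 = qseq_to_fastq_record(line1)
            ok2, rec2 = qseq_to_fastq_record(line2)
            if ok1 and ok2:
                out1.write(rec1)
                out2.write(rec2)
                kept += 1
    return kept
```

Dropping both reads when either fails keeps the two output files in step, so read N of one file is always the mate of read N in the other.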
