
====== data/ ======

The /campusdata/BME235/data/ directory on campusrocks contains the sequencing data from different organisms.

===== Pog/ contains data for //Pyrobaculum oguniense// =====

  * Expected genome size is around 2.4 Mb.
  * 454_run/ contains data from the first 454 run
    * sff/ contains the original data
    * *.TCA*.fna contains fasta-formatted reads
    * *.TCA*.qual contains pseudo-fasta-formatted quality information
    * Read length avg: 371, min: 36, max: 644. Expected coverage 60x.
  * solid_run/ contains data from the first SOLiD paired-end run
    * F3_reads/ contains the downstream reads (original data)
    * R3_reads/ contains the upstream reads (original data)
    * paired/ contains interleaved files with F3 followed by matching R3, but without quality data
    * csfasta lines are 26 characters long (not counting the newline). The first letter is the last base of the primer, and the first color is the transition from that base to the actual data, followed by 24 colors that correspond to transitions in the genomic DNA. Translated to base space, these are 25-base reads. Often when working in colorspace, you have just the 24 colors and ignore (or do not know) the phase.
    * The separation between the starts of the R3 and F3 reads is approximately 2220 bases. The [[bioinformatic_tools:pluck-scripts|pluck-scripts]] program map-colorspace provides a distribution of the lengths. The separation between the R3 and F3 starts is roughly in the range 750-4700, and 500-5000 contains essentially all the good pairs.
  * sanger/ contains Sanger reads from PCR experiments to fill gaps and verify the conjectured assembly of contigs. We have fasta files and the traces as .ab1 files. Unfortunately, the only software I've found so far for getting quality data from the traces is [[https://products.appliedbiosystems.com/ab/en/US/adirect/ab?cmd=catNavigate2&catID=600583&tab=DetailInfo| Applied Biosystem's Sequence Scanner]], which is a Windows-only product. I also tried [[http://www.nucleics.com/peaktrace-sequencing/|PeakTrace]], an on-line service, but they simply rejected the traces as too noisy. Currently there are 3 fasta files:
    * PogSanger.fa
    * sequences-3-2-2010.fa
    * sequences-3-19-2010.fa
  * finished/ contains draft genome assemblies, as good as David Bernick and Kevin Karplus could determine from the above data.
    * Pog.ece.v3.fa finished draft of circular viral chromosome
    * Pog.chr.v3.fa earlier draft of circular archaeal genome
    * Pog.v4c.fa both chr and ece. The ece draft is identical to the v3 draft, but the chr draft has a few fixes. This one may end up being the published genome, but we still have a handful of changes that we are debating. Compare assemblies from other tools to this assembly. If you find discrepancies, we would like to examine them to see whether there is evidence for better results.
  * [[bioinformatic_tools:jellyfish|Jellyfish]] was run on the 454 data to get 24-mer counts. There was a peak at 24-mers occurring 46 times, consistent with 46x coverage (rather than 60x).

===== slug/ =====

  * 454_run1/ contains the first 454 run (3 files)
    * GAZ7HUX02 155,488 reads 45,092,614 bases
    * GAZ7HUX03 148,246 reads 41,386,658 bases
    * GAZ7HUX04 141,861 reads 40,429,297 bases
  * 454_run2/ contains the second 454 run, one file
    * GCLL8Y406 54,283 reads 11,724,143 bases
  * 454_run3/ contains a 454 run
    * Unclear which regions correspond to banana slug DNA.
  * solid_run1/ will contain the first SOLiD run
  * [[lab_protocols:illumina_run1|illumina_run1/]] contains data from the first Illumina run
    * This data was available for the 2010 class.
    * Lane 4 is a control lane, and its data should not be used during assembly.
    * Sequence data are in 1920 files with suffix _qseq.txt totalling 128,746,146,454 bytes.
    * Each line within a data file constitutes a read, and there are about 350k reads per file, making about 670M reads. (Question: how long are the reads on average? FIXME)
    * Lines are tab-delimited with the following fields:
      * Machine_name: Unique identifier of the sequencer
      * Run_number: Identifier of the sequencing run
      * Lane_number: Positive integer from 1 to 8 signifying the lane from which the reads originate
      * Tile_number: Positive integer
      * X_coordinate: of the read cluster on the slide
      * Y_coordinate: of the read cluster on the slide
      * Index: Identifier for multiplexed runs. This field contains 0 for non-multiplexed runs.
      * Read_Number: 1 for single reads; 1 or 2 for paired-end reads; 1, 2 or 3 for multiplexed paired-end reads. For multiplexed paired-end reads, 1 and 3 represent sequence from the library, and 2 represents the index of the read.
      * Sequence: Base-called sequence of the read
      * Quality: Calibrated quality string. This quality string measures the quality of each base in a read. 'B' is the lowest quality while 'b' is the highest quality. More information for the non-polar measures is necessary. This quality score is in phred format, not the solexa/illumina format.
      * Filter: 0 if the sequence fails the quality filter, 1 if the sequence passes.
  * [[lab_protocols:illumina_run2|illumina_run2/]]
    * 4 fastq files total.
  * solid_run_1
  * solid_run_2

==== Illumina Data Notes ====

The fastq/ directory contains the Illumina run converted into fastq format. Note that I do not include any reads that fail Illumina's quality filter. Additionally, I do not include any reads from the control lane of this experiment (lane 4).

To convert the data from Illumina's .txt output to fastq, I use a C program I wrote, which is installed in our bin directory with the source located here: /programs/johnScripts/illuminaToFastq.c

Note that this C program preserves the quality score format of the _qseq file. Since the _qseq file uses phred quality scores rather than illumina/solexa scores, be careful to account for this when you use these fastq files in programs that may expect otherwise.
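As a rough illustration of the field list and the quality note above, one line of a _qseq.txt file can be split into its eleven named fields and its quality string turned into numbers. This is only a sketch: the names ''QseqRead'', ''parse_qseq_line'' and ''decode_quality'' are made up for this example, the sample line is synthetic, and the default ASCII offset of 64 is an assumption inferred from the stated range 'B' to 'b', not something this page confirms.

```python
from typing import List, NamedTuple


class QseqRead(NamedTuple):
    """The eleven tab-delimited fields of a _qseq.txt line, in file order."""
    machine_name: str
    run_number: str
    lane_number: str
    tile_number: str
    x_coordinate: str
    y_coordinate: str
    index: str
    read_number: str
    sequence: str
    quality: str
    filter: str


def parse_qseq_line(line: str) -> QseqRead:
    """Split one qseq line into named fields (all kept as strings)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, got {len(parts)}")
    return QseqRead(*parts)


def decode_quality(quality: str, offset: int = 64) -> List[int]:
    """Convert a quality string to integer scores.

    ASSUMPTION: offset 64 is inferred from the 'B'..'b' range described
    above; pass a different offset if your files are encoded otherwise.
    """
    return [ord(ch) - offset for ch in quality]


# Synthetic example line (not real data from the run).
example = "HYPOTHETICAL-1\t1\t3\t42\t1000\t2000\t0\t1\tACGT\tbbBB\t1\n"
read = parse_qseq_line(example)
scores = decode_quality(read.quality)
passes_filter = read.filter == "1"
```

Under the assumed offset, 'B' maps to 2 and 'b' to 34. A tool that expects a different encoding would need the quality string shifted before use.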
The C program takes four arguments: an Illumina .txt file and its pair, and the file names of the two corresponding output .fastq files. It goes through each read and its pair and checks that both pass Illumina's quality filter. If one or both do not pass, the pair is excluded. It looks like the majority of reads pass this criterion, as the output fastq files are approximately the same size as the input files.

Here is an example of what a fastq read should look like:

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36\\
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC\\
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36\\
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC\\

Example ID line explained: @HWUSI-EAS100R:6:73:941:1973#0/1\\
HWUSI-EAS100R the unique instrument name\\
6 flowcell lane\\
73 tile number within the flowcell lane\\
941 'x'-coordinate of the cluster within the tile\\
1973 'y'-coordinate of the cluster within the tile\\
#0 index number for a multiplexed sample (0 for no indexing)\\
/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)\\
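The pair filtering and the ID layout described above can be sketched in a few lines of Python. This is not the source of illuminaToFastq.c, which is not shown on this page. The function names are hypothetical, and the sketch assumes that the two input files list mates on matching lines and that the fastq ID is assembled from the qseq fields in the ''instrument:lane:tile:x:y#index/member'' layout. The quality string is copied unchanged, as the notes above say the converter does.

```python
import io
from typing import List, TextIO, Tuple


def qseq_fields_to_fastq(fields: List[str]) -> str:
    """Format one parsed qseq line as a four-line fastq record."""
    machine, _run, lane, tile, x, y, index, member, seq, qual, _flt = fields
    # ASSUMPTION: the ID layout follows the example ID line explained above.
    read_id = f"{machine}:{lane}:{tile}:{x}:{y}#{index}/{member}"
    return f"@{read_id}\n{seq}\n+\n{qual}\n"


def convert_pairs(qseq1: TextIO, qseq2: TextIO,
                  fastq1: TextIO, fastq2: TextIO) -> Tuple[int, int]:
    """Write only pairs in which both mates pass the filter.

    Returns (pairs kept, pairs dropped). ASSUMPTION: line N of qseq1 and
    line N of qseq2 are mates; zip() stops at the shorter input.
    """
    kept = dropped = 0
    for line1, line2 in zip(qseq1, qseq2):
        f1 = line1.rstrip("\n").split("\t")
        f2 = line2.rstrip("\n").split("\t")
        if f1[10] == "1" and f2[10] == "1":  # Filter is the 11th field
            fastq1.write(qseq_fields_to_fastq(f1))
            fastq2.write(qseq_fields_to_fastq(f2))
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Synthetic two-pair example: the second pair is dropped because its
# first mate fails the filter.
in1 = io.StringIO("HYPO-1\t1\t3\t42\t100\t200\t0\t1\tACGT\tbbbb\t1\n"
                  "HYPO-1\t1\t3\t42\t101\t201\t0\t1\tTTTT\tBBBB\t0\n")
in2 = io.StringIO("HYPO-1\t1\t3\t42\t100\t200\t0\t2\tGGCA\tbbbb\t1\n"
                  "HYPO-1\t1\t3\t42\t101\t201\t0\t2\tCCCC\tbbbb\t1\n")
out1, out2 = io.StringIO(), io.StringIO()
counts = convert_pairs(in1, in2, out1, out2)
```

With real files, the four ''io.StringIO'' objects would be replaced by opened file handles, matching the four arguments of the converter described above.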

Discussion

, 2010/04/28 19:18

We have illumina data. Please check the illumina Sequencing Run 1 for details on which lanes contain which data. The size distribution is somewhere between 100-300 bp.

, 2010/04/25 16:53

It looks like we have paired-end data on the illumina run. Can anyone verify this and, if we do indeed have paired-end data, find out what the average insert length and insert length deviation are?

, 2010/04/20 15:55

Most assemblers want either some proprietary format, fasta, or fasta+qual files. Do we have a conversion tool to convert the Illumina “txt” files to fasta or to fasta and qual files?

, 2010/04/19 00:18

What is the insert or fragment size on the Pog solid paired reads?

, 2010/04/16 20:46

SOLiD data for Banana Slug is not ready yet. Eveline from Nader's group is working on it, and she says that it's going to take some time to get that data.

, 2010/04/16 20:39

data/ is now organized for Banana Slug, by organism → by platform → and by run number, under /campusdata/BME235/data/slug/

, 2010/04/09 23:28

banana slug data from illumina is available! I will transfer the files from the sequencing cluster to campusrocks ~/data/slug/illumina_run1

, 2010/04/16 20:43

Oh! I should have checked it here first. I have been trying to contact the person associated with the Banana Slug data from Illumina.

, 2010/04/03 19:12

Probably should have cross references from the platform names to the documentation of the sequencing methods.

archive/computer_resources/data.1302143221.txt.gz · Last modified: 2011/04/07 02:27 by hyjkim