The raw data locations are listed below. However, many of these files have been processed through Skewer/fastUniq and a variety of other programs. To see and download these 'processed' files, please view the data set page.
The shotgun library preparation protocol was provided by Steven Weber; see steven_weber_s_notes_on_lab_prep.
Discussion
Massive changes throughout. Everything is name spaced now. This is done primarily to keep things directly comparable to the UNIX file server, and so that the site map can be used to navigate the site efficiently.
Every year has its own namespace. Simple and straightforward, it should allow many more years to be added easily. The hiccup here is that the user must ensure their new page goes into the correct namespace. People seem to care very little about this, which defeats the purpose of setting it up.
Major changes all over the site. Kevin has asked me not to remove any information, which has sparked the creation of an archive namespace. Some pages/information are being moved there.
This is the data overview page, which means that everything in this directory should be on campusrocks in BME235/Spring2015Data. Therefore,
/campusdata/BME235/S15_assemblies/SOAPdenovo2/Kmergenie/SW018_S1_L007_R1_001_trimmed.fastq
is in the wrong place and should be fixed. Also, please read the page before you go off on a crazy witch hunt. This is one of the first things on this page, and yet the data was still put in the wrong place.
“Data is on campusrocks2 in /campusdata/BME235/Spring2015Data. Adapter-trimmed data from various programs/runs can be found in /campusdata/BME235/Spring2015Data/adapter_trimming (and possibly other locations–check the page for each data set).”
PreQC Clarification
I talked to Jared (the developer of preqc).
(1) If you checked the preqc report, you will probably have noticed that the de Bruijn graph statistics only show kmers up to 46. Jared said that's due to coverage issues. Preqc analyzes kmers of increasing size, but it requires a certain kmer coverage. In our case k=47 failed this criterion, so it only analyzed kmers up to 46. If you look at the HiSeq-only and MiSeq-only preqc reports, you will see that those have even less coverage, so even smaller kmer sizes were used in the de Bruijn graph statistics calculation.
(2) You might also have noticed that “Mean quality score by position” and “Fraction of bases at least Q30” are only calculated for fragments up to 100bp, but “k-mer position of first error” and “Per-position error rate” were calculated for fragments up to ~270 and 300bp, respectively. In the first case, preqc only looks at the first N reads of the file, which were HiSeq reads in our case. For the latter statistics, by contrast, it samples reads uniformly from the whole set, which means those statistics were based on HiSeq and MiSeq reads, and/or (mostly) MiSeq reads only.
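To illustrate point (2), here is a minimal Python sketch of why "first N reads" and "uniform sample" can see different read lengths when HiSeq reads come first in the file. This is not preqc's code; the function names, the toy read list, and the read lengths are assumptions made up for the example.

```python
import random
from collections import Counter
from itertools import islice


def first_n(reads, n):
    """Take the first n reads in file order (biased toward whatever comes first)."""
    return list(islice(reads, n))


def reservoir_sample(reads, n, seed=0):
    """Uniformly sample n reads from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, read in enumerate(reads):
        if i < n:
            sample.append(read)
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < n:
                sample[j] = read
    return sample


# Hypothetical toy input: 1000 short "HiSeq" reads followed by 1000 longer "MiSeq" reads.
reads = [("HiSeq", 100)] * 1000 + [("MiSeq", 300)] * 1000

head = first_n(iter(reads), 200)
uniform = reservoir_sample(iter(reads), 200)

print(Counter(platform for platform, _ in head))     # only HiSeq reads
print(Counter(platform for platform, _ in uniform))  # a mix of both platforms
```

Any per-position statistic computed from `head` stops at 100bp, while one computed from `uniform` extends to the longest MiSeq read, which matches the pattern seen in the report.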
Hi,
The two MiSeq libraries have different insert sizes. Ask Ed or Kevin for the requested insert sizes. However, the expected insert size often differs from the observed one. Some of the assemblers, such as Allpaths-lg, calculate the observed insert size after the assembly. Alternatively, you can use other tools such as hagfish:
https://github.com/mfiers/hagfish/wiki
Also, preqc should give you a better idea of the insert sizes, though estimates based on mapping to the final assembly or from assemblers should be more reliable.
Cheers, Stefan
Looks like the 'Raw fastq adapter presence analysis' at the bottom should be split apart and put into the correct directories. Let's try to post only information about merged data sets here.
I am worried that we will get in the habit of posting information that applies to a specific data set here and then it will become a massive slew of unorganized information.
Looks like whoever made the changes forgot to subscribe to the comments… Please subscribe if you're making significant changes (like adding blocks of text!). I am not entirely certain how these data should be separated, and it would have been much easier if the person who uploaded them had put them in the correct place to begin with.
@Chris: Where would you suggest the adapter presence analysis should go? This page already includes fastq analysis via kmer size vs. genomic kmer distribution; wouldn't this analysis also apply here, as it too is analyzing read content?
Edit: Made the recommended changes. After thinking about it, your suggestion makes more sense once the new HiSeq and Mate Pair data are analyzed.
Made some improvements. I haven't provided any analysis, but hopefully the data is more clearly presented.
The post error correction plot is a single image; can we embed it here rather than linking to the PDF?
Edit: Done
“histograms showing the multiplicity of each kmer value can currently be found at: /campusdata/BME235/S15_assemblies/SOAPdenovo2/Kmergenie/SW018_SW019_merged ”
That's fine, but not visible here. Please put plots on this page, since readers of the wiki will not all have access to campusrocks. The raw data doesn't fit on a wiki, but the summaries of it should be here.
The name “slugoutput.dat.pdf” is poorly chosen, as it doesn't tell me what it is the output OF, nor what sort of information I can expect there. Looking at the plot, I see “genomic k-mers vs. kmer size”, so I'm guessing that it is a KmerGenie output (is it?), but the text here says just “after running skewer on the raw data”, so maybe this is a Skewer output? Which raw data? We'll have different raw data at different times this quarter!
Has anyone written up a summary of what KmerGenie is and how it works? Linking to descriptions of the tools that create files would be a big help in understanding how the files are created and what they mean.
Everyone, please be much more informative in your wiki postings! And choose more informative file names!
Putting the plots in so that they are visible (not just in pdf files) would be useful, too, though having the PDF files available for later inclusion in reports is a good idea.
Same problems (only worse) for merged_output.dat.pdf.
I would expect to see a kmer spectrum (number of kmers vs multiplicity) before and after error correction, not just a “what is the best kmer” distribution. For that matter, it would be good to see the before and after results on the same plot, to see whether error correction shifts the curves any.
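As a rough illustration of the kmer spectrum requested above (number of distinct kmers at each multiplicity), here is a minimal Python sketch. It is a toy, not what KmerGenie or any error corrector does internally: it counts kmers in forward orientation only (real counters usually merge reverse complements), holds everything in memory, and the two read lists are invented stand-ins for "before" and "after" error correction.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct kmers seen that many times}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


# Hypothetical toy reads: the second read carries one error in its last base.
reads_before = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
reads_after = ["ACGTACGT", "ACGTACGT", "ACGTACGT"]

before = kmer_spectrum(reads_before, k=4)
after = kmer_spectrum(reads_after, k=4)

# Print both spectra side by side so a shift is easy to spot.
for multiplicity in sorted(set(before) | set(after)):
    print(multiplicity, before.get(multiplicity, 0), after.get(multiplicity, 0))
```

In the toy data the erroneous read contributes a kmer with multiplicity 1 that disappears after correction. On real data the same comparison would show whether the low-multiplicity error peak shrinks and whether the main peak shifts.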
I would like to see a very brief explanation of what the data set is (trimmed, adapter removed, etc.) as well as an equally brief explanation of how it was made (merging, filtering, etc.).
There should be comments either here or (better) on the individual pages for the libraries saying how the insert size is known (or estimated). It should also give the definition of “insert size” as different people sometimes use different meanings. The numbers here look like the “distance between adapters” definition, which is the one I favor.
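To make the two common meanings of “insert size” concrete, here is a small Python sketch contrasting the “distance between adapters” definition with the inner distance between the two reads. The function names, the 0-based half-open coordinates, and the example numbers are assumptions for illustration only. They are not measurements from any of our libraries.

```python
def fragment_length(left_start, right_end):
    """'Distance between adapters': full fragment length, reads included."""
    return right_end - left_start


def inner_distance(left_start, right_end, read_len_1, read_len_2):
    """Unsequenced gap between the two reads; negative when the reads overlap."""
    return fragment_length(left_start, right_end) - read_len_1 - read_len_2


# Hypothetical pair: a 500bp fragment sequenced with 100bp reads from each end.
print(fragment_length(0, 500))           # 500
print(inner_distance(0, 500, 100, 100))  # 300

# Hypothetical overlapping pair: a 300bp fragment with 250bp reads.
print(inner_distance(0, 300, 250, 250))  # -200
```

The two definitions differ by the sum of the read lengths, so a page that reports an insert size should say which one it means.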
Why is there no estimate of insert size for the MiSeq lane?
I assume that an estimate of insert size for the Lucigen mate pairs will come, though it may need some refinement after we have some assemblies with long enough contigs.
My understanding of the MiSeq lane (could be flawed) is that there was no insert. I was under the impression that the DNA was purified into 300 bp long fragments, which were sequenced from both sides.