data_overview:data_overview

Differences

This shows you the differences between two versions of the page.

data_overview:data_overview [2015/06/12 19:25]
chkcole
data_overview:data_overview [2015/07/16 18:50]
ceisenhart Hierarchy change
Line 1: Line 1:
-====== Spring 2015 Data ======
+====== Data ======
  
-Data is on campusrocks2 in /campusdata/BME235/Spring2015Data. Adapter-trimmed data from various programs/runs can be found in /campusdata/BME235/Spring2015Data/adapter_trimming (and possibly other locations--check the page for each data set).
+===== 2015 Data =====
  
-[[MiSeq data| MiSeq data SW019_S1_L001]] - 2x300bp reads from a single MiSeq lane
+The raw data locations are listed below. However many of these files have been processed through Skewer/fastUniq and a variety of other programs. To see and download these 'processed' files please view the data set page.
 +| Data set | Description | Location |  
 +| [[data_overview::2015::MiSeq data| MiSeq data SW019_S1_L001]] | 2x300bp reads from a single MiSeq lane |
 +| [[data_overview::​2015::​HiSeq data 2| HiSeq data SW018_S1_L007]] | 2x100bp reads from a single HiSeq lane with 597bp insert size | 
 +| [[data_overview::​2015::​HiSeq data 1| HiSeq data SW019_S2_L008]] | 2x100bp reads from a single HiSeq lane with 374bp insert size | 
 +| [[data_overview::​2015::​UCSF_BS-MK | UCSF BS-MK data]] | 2x250bp reads with 450-650bp insert size | 
 +| [[data_overview::​2015::​UCSF_BS-tag | UCSF BS-tag data]] | 2x250bp reads with 375-575bp insert size | 
 +| [[data_overview::​2015::​UCSF_SW018 | UCSF SW018 Data]] | 2x250bp reads from SW018 library | 
 +| [[data_overview::​2015::​UCSF_SW019 | UCSF SW019 Data]] | 2x250bp reads from SW019 library | 
 +| [[data_overview::​2015::​Lucigen mate-pair data]] | 2x300bp reads, expected insert size is greater than 1kb | 
 +| [[data_overview::​2015::​SW041 | SW041 mate-pair data ]] | 2x76bp reads, expected insert size is 3-4kb | 
 +| [[data_overview::​2015::​SW042 | SW042 mate-pair data ]] | 2x76bp reads, expected insert size is 5-6kb | 
 +| [[data_overview::​2015::​RNA-Seq | RNA-Seq data]] | Preliminary data generated as of 06/12/15 | 
  
-[[HiSeq data 2| HiSeq data SW018_S1_L007]] - 2x100bp reads from a single HiSeq lane with 597bp insert size 
  
-[[HiSeq data 1| HiSeq data SW019_S2_L008]] - 2x100bp reads from a single HiSeq lane with 374bp insert size
+==== Wet-lab procedures ====
- +
-[[UCSF_BS-MK | UCSF BS-MK data]] - 2x250bp reads with 450-650bp insert size +
- +
-[[UCSF_BS-tag | UCSF BS-tag data]] - 2x250bp reads with 375-575bp insert size +
- +
-[[UCSF_SW018 | UCSF SW018 Data]] - 2x250bp reads from SW018 library +
- +
-[[UCSF_SW019 | UCSF SW019 Data]] - 2x250bp reads from SW019 library +
- +
-[[Lucigen mate-pair data]] - 2x300bp reads, expected insert size is greater than 1kb  +
- +
-[[ SW041 | SW041 mate-pair data ]] - 2x76bp reads, expected insert size is 3-4kb +
- +
-[[ SW042 | SW042 mate-pair data ]] - 2x76bp reads, expected insert size is 5-6kb +
- +
-[[RNA-Seq | RNA-Seq data]] - Preliminary data generated as of 06/12/15 +
- +
-[[computer_resources:​assemblies:​mitochondrion| Mitochondrion assembly]] - Generated in 2012+
  
 The shotgun library preparation protocol used was provided by Steven Weber,
Line 33: Line 26:
  
  
 +==== Analysis of processed data==== ​
  
-==== kmergenie Output Showing Kmer Distribution ====
+[[data_overview::analysis::kmergenie | kmergenie Output Showing Kmer Distribution ]]
- +
-The following data is produced by a program called ​[[http://​kmergenie.bx.psu.edu/​|Kmergenie]] +
- +
-kmergenie is a program that looks at the multiplicity of kmers of various size in a set of reads. It uses this information to then predict the best Kmer to use for a denovo assembly using the dataset. The below information is generated by this program. +
- +
-__DATA USED__  +
-The data used in the following sections are modifications of the data sets: MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, and HiSeq data SW019_S2_L008. The inputs to kmergenie are these 6 files (note: the undetermined files are not included), however in the first run the adapters have been removed by [[http://www.biomedcentral.com/1471-2105/15/182|skewer]]. In the second run listed below the adapters have been removed by skewer AND [[http://musket.sourceforge.net/homepage.htm#latest|musket]] error correction has been used to correct these adapter-less reads. +
- +
-=== Post Adapter Removal Kmer Choice === +
- +
-This output from kmergenie ​corresponds to data that has been generated using not the raw data, but the result of running the skewer adapter removal program on the raw data: +
- +
-{{ adapterremovalchart.png?​300 ​}} +
- +
-Best k : 61mer +
- +
-Here is a pdf containing a full report of kmerenie for this run. It contains not only the graph above but also the graphs showing multiplicity for kmers of size 21, 31, 41, 51, 61, 71, 81, and 91 (other kmers are not checked). It totals 7 pages: +
-{{::​kmergenie_output_adapter_trimming_only.pdf|}} +
- +
-**Seqprep Kmergenie Results** +
- +
-Kmergenie results for adapter trimming using Seqprep on  +
- +
-/​campusdata/​BME235/​S15_assemblies/​SOAPdenovo2/​Kmergenie/​SW018_S1_L007_R1_001_trimmed.fastq +
- +
-/​campusdata/​BME235/​S15_assemblies/​SOAPdenovo2/​Kmergenie/​SW018_S1_L007_R2_001_trimmed.fastq +
- +
-/​campusdata/​BME235/​S15_assemblies/​SOAPdenovo2/​Kmergenie/​SW019_S1_L001_R1_001_trimmed.fastq +
- +
-/​campusdata/​BME235/​S15_assemblies/​SOAPdenovo2/​Kmergenie/​SW019_S1_L001_R2_001_trimmed.fastq +
- +
-/​campusdata/​BME235/​S15_assemblies/​SOAPdenovo2/​Kmergenie/​SW019_S2_L008_R1_001_trimmed.fastq +
- +
-/​campusdata/​BME235/​S15_assemblies/​SOAPdenovo2/​Kmergenie/​SW019_S2_L008_R2_001_trimmed.fastq +
- +
- +
-has an identical optimal k of 61 +
- +
-{{ :​sw018_sw019_adapter_seqprep.png?​300 |}} +
- +
-Kmergenie was also run on adapter trimmed and merged files +
- +
-/​campusdata/​BME235/​S15_assemblies/​SOAPdenovo2/​Kmergenie/​SW018_S1_L007_001_merged.fastq +
-/​campusdata/​BME235/​S15_assemblies/​SOAPdenovo2/​Kmergenie/​SW019_S1_L001_001_merged.fastq +
-/​campusdata/​BME235/​S15_assemblies/​SOAPdenovo2/​Kmergenie/​SW019_S2_L008_001_merged.fastq +
- +
- +
-These results show an optimal k of 31 on these files +
- +
-{{ :kmergenie.dat.sw018_sw019_seqprep_merged.png?​300 |}} +
- +
-This maybe a topic to discuss, if using merged reads is more promising for assembly +
- +
- +
-=== Post Error Correction ​Kmer Choice === +
- +
-This section contains the Kmergenie output after running musket (Error Correction) on the Skewer data set (the Skewer dataset is the dataset used in the Kmergenie output above): +
- +
- +
-{{ :​merged_output_final.dat.png?​300 |}} +
- +
-Best k : 61mer +
- +
-Here is a pdf containing a full report of kmerenie for this run. It contains not only the graph above but also the graphs showing multiplicity for kmers of size 21, 31, 41, 51, 61, 71, 81, and 91 (other kmers are not checked). It totals 7 pages: +
-{{::​ec_merged_data_kmergenie.pdf|}} +
- +
- +
-=====Analysis of processed data===== +
- +
-====PreqC of adapter-trimmed and PCR duplicate-removed data==== +
-The initial datasets were ran through Skewer and FastUniq to create PCR duplicate free and adaptor free files. ​ The libraries were condensed, so now the MiSeq and HiSeq 19 library are condensed. The information and location for these data are embedded in the library data pages. PreqC was ran on all these data combined.  +
- +
- +
-{{::​pooleddatapreqcresults.pdf|}} +
- +
-{{ :​alldataprocessedpreqcestdupper.png |}}Note that the PCR duplication cannot be directly compared to the unprocessed data since the unprocessed data was ran for each library. ​ By weighing the libraries based on file size a weighted PCR duplication percent for all the unprocessed files was calculated to be roughly 2.7%. This number can be directly compared to the percent duplication in these results above. It seems that the pre processing removed just under half of the total duplicates. +
- +
-{{ :​alldataprocessedpreqcestkmersize.png |}}Additionally the ideal K-mer size for these data is longer than for the unprocessed data. This graph shows the ideal K-mer size to be around 75 bases. ​ This is 15 bases longer than previous estimates. ​+
  
-====FastQC of adapter-trimmed and PCR duplicate-removed data====
+[[data_overview::analysis::preqc | PreqC of adapter-trimmed and PCR duplicate-removed data ]]
-After removing adapters and PCR duplicates, we run FastQC in two of the libraries. In general, the quality of the reads decrease in the last base-positions. Also, read 2 of the SW019 library shows problems in the per tile sequence quality. Bellow are the pdf files with the fastqc for the PCR and adapter removed libraries. The protocol we used to run fastqc is uploaded in this link: [[fastqc:​fastqc]].+
  
-{{:sw018_adaptertrimmed_dup..._r1.pdf| SW018_R1}}
+[[data_overview::analysis::fastQC | FastQC of adapter-trimmed and PCR duplicate-removed data ]]
 +===== 2011 Data =====
  
-{{:​sw018_adaptertrimmed_dup..._r2.pdf| SW018_R2}}+===== 2010 Data =====
  
-{{:​sw019_adaptertrimmed_dup..._r1.pdf| SW019_R1}} 
  
-{{:​sw019_adaptertrimmed_dup..._r2.pdf| SW019_R2}} 
  
  
data_overview/data_overview.txt · Last modified: 2015/07/28 06:29 by ceisenhart