===== Sequencing data =====

| Library | Run | Location | Notes |
| SW041 | | /campusdata/BME235/Spring2015Data/ | Mate pair library. Expected insert size is 3-4kb. |

===== Files =====

| File | Size | Reads |
| SW041.r1.trimmed.fastq | 285M | 1,987,204 |
| SW041.r2.trimmed.fastq | 285M | 1,969,081 |
| /campusdata/gchaves/SW041/SW041_adapter_trimmed_2-trimmed-pair1.fastq | 285M | 1,449,417 |
| /campusdata/gchaves/SW041/SW041_adapter_trimmed_2-trimmed-pair2.fastq | 276M | 1,449,417 |
| Matepair_trimmed/skewer_run2_SW041_1_trimmed-pair1.fastq | 257M | 1,311,142 |
| Matepair_trimmed/skewer_run2_SW041_1-trimmed-pair2.fastq | 248M | 1,311,142 |
| Matepair_dupRemoved/myskewer_41_dupRemoved_R1.fastq | 243M | 1,237,731 |
| Matepair_dupRemoved/myskewer_41_dupRemoved_R2.fastq | 234M | 1,237,731 |

Note: Duplicates, concatemers, and linkers have already been removed in the "trimmed" files.

===== FastQC analysis =====

FastQC flags several summary statistics as potentially unusual, such as the per-base sequence content and the k-mer content.

{{:sw041.r1.trimmed.fastq_fastqc_report.pdf|FastQC results for SW041.r1.trimmed.fastq}}
{{:sw041.r2.trimmed.fastq_fastqc_report.pdf|FastQC results for SW041.r2.trimmed.fastq}}

===== PreQC analysis =====

Run on SW041.r1.trimmed.fastq and SW041.r2.trimmed.fastq.

{{:preqc_report_sw041.pdf|PreQC for SW041}}

===== Nextera transposase sequences =====

These are the sequences used for trimming:

| read 1 | 5' TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG |
| read 2 | 5' GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG |

==== FastQC after trimming using Nextera transposase sequences ====

Trimming with the above-mentioned sequences does not appear to rid the data of Nextera transposase adapters, as seen in the FastQC outputs.
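A quick, FastQC-independent way to spot-check residual adapter content is to count how many reads still contain the transposase sequence verbatim. This is only an illustrative sketch (the function name is mine, and it does exact substring matching only, unlike FastQC's more tolerant detection):

```python
# Read-1 transposase sequence from the table above.
ADAPTER_R1 = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"

def adapter_fraction(fastq_path, adapter=ADAPTER_R1):
    """Fraction of reads whose sequence contains `adapter` exactly."""
    total = hits = 0
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:          # line 2 of each 4-line FASTQ record is the sequence
                total += 1
                if adapter in line:
                    hits += 1
    return hits / total if total else 0.0
```

Running this on the trimmed files before and after each skewer pass gives a single number to compare against the FastQC adapter-content plots.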
{{:sw041_adapter_trimmed_2-...ir1.pdf|FastQC for SW041 after skewer trimming using Nextera transposase sequences, read 1}}
{{:sw041_adapter_trimmed_2-...ir2.pdf|FastQC for SW041 after skewer trimming using Nextera transposase sequences, read 2}}

Files located here:
/campusdata/gchaves/SW041/SW041_adapter_trimmed_2-trimmed-pair1.fastq
/campusdata/gchaves/SW041/SW041_adapter_trimmed_2-trimmed-pair2.fastq

==== Insert size distribution ====

The distribution of insert sizes for inward-facing, outward-facing, and same-strand read pairs is shown below. Mate pairs should be outward facing.

{{:sw041_insert_size_distribution.jpg?200|}}

To generate this distribution, mate pairs were mapped to all of the SOAPdenovo "run 1" contigs using bwa. The orientation of each read pair was pulled from the resulting SAM file using a script from the Green lab.

How were these insert sizes determined? Mapping to contigs? To scaffolds? What set of contigs or scaffolds? It looks like the distribution tails off before the expected insert size - is this because the library is short, or because what it was being mapped to is short? If you are mapping to contigs, then you can't see insert sizes longer than the contigs, and those may be too short for properly viewing the library. Even with larger scaffolds, you'll still see an enrichment for short fragments in this distribution, because both ends of a short fragment are much more likely to map to the same scaffold than both ends of a long fragment. It might be good to map a lot of read pairs, but only report those whose first position is within the first 3 kb of scaffolds 10 kb or longer. That would eliminate the bias towards short fragments, and (now that we have a scaffold N50 bigger than 10 kb, 2015 May 22) there should be enough spots to map to for a decent histogram.

The SW041 mates were mapped against the SOAPdenovo scaffolds from the "attempt 1" assembly, which was the same assembly used above to map reads against contigs.
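The Green lab script that extracts orientations is not reproduced here, but conceptually a pair's orientation follows from the SAM FLAG bits (0x10 = read on reverse strand, 0x20 = mate on reverse strand) and the two mapping positions. A hypothetical sketch of that classification (the function name and interface are my assumptions, not the actual script):

```python
def classify_orientation(flag, pos, pnext):
    """Classify a mapped read pair from one mate's SAM fields.

    flag  -- SAM FLAG integer
    pos   -- 1-based leftmost position of this read (POS)
    pnext -- 1-based leftmost position of its mate (PNEXT)
    Returns "same-strand", "innie" (forward-reverse), or "outie" (reverse-forward).
    """
    self_rev = bool(flag & 0x10)   # this read mapped to the reverse strand
    mate_rev = bool(flag & 0x20)   # its mate mapped to the reverse strand
    if self_rev == mate_rev:
        return "same-strand"
    # For opposite-strand pairs, the strand of the leftmost read decides the shape:
    # leftmost forward -> innie (FR), leftmost reverse -> outie (RF).
    left_rev = self_rev if pos <= pnext else mate_rev
    return "outie" if left_rev else "innie"
```

This only applies to pairs where both mates map to the same reference sequence; the insert size itself is then the span from the leftmost start to the rightmost end.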
Mapping to the same assembly was done so that we could directly compare the results of mapping to contigs vs. scaffolds without the confounding factor of which data sets were used to create the scaffolds.

{{:sw041_vs_soap_scaffolds_attempt1.jpg?200|}}

The SW041 mates were also mapped against the SOAPdenovo scaffolds from "attempt 2", which has a longer scaffold N50.

{{:sw041_mates_vs_soap_run2_scaffolds.jpg?200|}}

There were no outies or innies from the SW041 library that mapped to this assembly, and only a few same-strand pairs that mapped. It is very surprising that no innies or outies mapped to these scaffolds at all, especially since they did map to the scaffolds from the "attempt 1" assembly. Reads were not restricted to contigs/scaffolds of a particular size because, as we discussed in class, this can artificially "force" mates to map to long contigs/scaffolds where they do not really belong.

===== Trial and error with trimming and duplicate removal =====

In an attempt to remove the Nextera transposase sequences, skewer was run again, using adapters from the following technical note: [[http://www.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf|Nextera Mate Pair Kit]]

<code>
skewer-0.1.123-linux-x86_64 \
  -x CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG \
  -y CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG \
  -m mp \
  -j CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG \
  -t 32 \
  -o ${OUTDIR} \
  /campusdata/BME235/Spring2015Data/SW041.r1.trimmed.fastq \
  /campusdata/BME235/Spring2015Data/SW041.r2.trimmed.fastq
</code>

This resulted in a greater number of highly present k-mers being removed (k-mer content section), but a comparable number of over-represented sequences overall.
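The core idea behind junction-adapter trimming is to cut each read at the first occurrence of the junction sequence, keeping the upstream portion. The following is a much-simplified, exact-match-only sketch of that idea, not how skewer is actually implemented (skewer tolerates mismatches and handles partial junction hits at read ends):

```python
# Junction sequence passed to skewer with -j in the command above.
JUNCTION = "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"

def trim_at_junction(seq, qual, junction=JUNCTION):
    """Keep the portion of the read (and its qualities) upstream of the junction."""
    i = seq.find(junction)
    if i == -1:
        return seq, qual      # junction not found: read left unchanged
    return seq[:i], qual[:i]
```

Reads in which the junction lands very early end up short and would typically be discarded by a minimum-length filter afterwards.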
FastQC results are shown below:

{{::fastqc_skewer_run2_sw041_pair1.pdf|}}
{{::fastqc_skewer_run2_sw041_pair2.pdf|}}

=== Fastuniq ===

Both sets of adapter-trimmed files were processed with FastUniq to remove duplicates. FastQC results below:

Original skewer run:
{{::fastqc_skewer_41_seqprep_duprem_r1.pdf|fastqc_skewer_41_duprem_r1}}
{{::fastqc_skewer_41_seqprep_duprem_r2.pdf|fastqc_skewer_41_duprem_r2}}

Second skewer run with the junction sequence:
{{::fastqc_myskewer_41_dupr_r1.pdf|}}
{{::fastqc_myskewer_41_dupr_r2.pdf|}}

It seems like the first run did a better job of removing the over-represented sequences, but the second run with the junction sequence did better at reducing the high k-mer frequency counts. The first run also kept more reads overall: 1,366,241 compared to 1,237,731, starting from 1,449,494 reads originally. I'm unsure which would be preferred.

(not kevin) How does the data set with both runs applied to it look? My understanding is that each filter made the data less contaminated, so why would we not put it through both filters?
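For reference, the duplicate removal step operates at the pair level: a pair is a duplicate only if both the read-1 and read-2 sequences match a pair already seen. A minimal sketch of that idea (this is the concept only, not FastUniq's actual algorithm):

```python
def dedup_pairs(pairs):
    """Keep the first occurrence of each unique (read1, read2) sequence pair.

    `pairs` is an iterable of (read1_sequence, read2_sequence) tuples.
    """
    seen = set()
    unique = []
    for r1, r2 in pairs:
        key = (r1, r2)
        if key not in seen:       # pair-level comparison: both mates must match
            seen.add(key)
            unique.append((r1, r2))
    return unique
```

Because both mates must match, this is stricter than deduplicating each file independently, which could break pairing by dropping one mate but not the other.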