===== Sequencing data =====
| Library | Run | Location | Notes | 
|  | Lucigen NxSeq Long Mate Pair Library Kit | /campusdata/BME235/Spring2015Data/ | Expected to be 2x300 bp reads with a long (>1000 bp) insert size |
===== Files =====
| File | Size| Reads| 
| R1_IJS8_mates_ICC5_SW023_S60_L001_R1_001.fastq | 43M| 217,484 |
| R2_IJS8_mates_ICC5_SW023_S60_L001_R2_001.fastq | 43M| 217,484 |
| /campusdata/BME235/Spring2015Data/Matepair_dupRemoved/lucigen_mp_dupRemoved_R1.fastq | 20M| 89,856 |
| /campusdata/BME235/Spring2015Data/Matepair_dupRemoved/lucigen_mp_dupRemoved_R2.fastq | 21M| 89,856 |
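
The read counts in the table can be re-derived by counting FASTQ records and checking that R1 and R2 agree. The following is a minimal sketch, assuming plain (uncompressed) FASTQ with exactly four lines per record; the function names are hypothetical and the demo uses tiny in-memory data rather than the real files.

```python
import io

def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming exactly 4 lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("FASTQ line count %d is not a multiple of 4" % n_lines)
    return n_lines // 4

def count_pairs(r1_handle, r2_handle):
    """Return the number of read pairs, or raise if R1 and R2 disagree."""
    n1 = count_fastq_records(r1_handle)
    n2 = count_fastq_records(r2_handle)
    if n1 != n2:
        raise ValueError("R1 has %d records but R2 has %d" % (n1, n2))
    return n1

# Tiny in-memory example (made-up reads); for real data, pass open file
# handles for the R1 and R2 FASTQ files instead.
demo_r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII\n")
demo_r2 = io.StringIO("@read1/2\nTTAA\n+\nIIII\n@read2/2\nCCGG\n+\nIIII\n")
print(count_pairs(demo_r1, demo_r2))
```

A mismatch between the R1 and R2 counts would indicate that the files are no longer properly paired.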

=== Note: designate these files as paired-end when using them for assembly ===

These data were generated with the Lucigen NxSeq Long Mate Pair Library Kit. Reads were processed as described on pages 57 and 58 of the [[https://banana-slug.soe.ucsc.edu/_media/lab_protocols:ma160-nxseq-long-mate-pair-library-kit.pdf|NxSeq manual]].


=== Summary of data processing ===

Lucigen mate pair post-sequencing filter steps:

  * User manual found: http://lucigen.com/NxSeq-Long-Mate-Pair-Library-Kit/ Pages 57-58
  * Lucigen scripts found: http://lucigen.com/NGS-Long-Read-Mate-Pair-Scripts-Sample.html
  * Need to give contact information to access the downloads
  * Software requirements: Python 2.7, the regex module (the standard-library re module cannot be substituted), and Biopython
  * Follow the pipeline outlined on pages 57-58 of user manual.
  * Run IlluminaChimera-Clean5.py
  * Run IlluminaNxSeqJunction-Split8.py

  - Stats: 1,099,799 reads processed: 1,007,381 true mate reads (91%), 92,416 non-mates/chimeras (8%), and 2 mates too short to keep after trimming
  - Final usable output: R1_IJS8_mates_ICC5_SW023_S60_L001_R1_001.fastq and R2_IJS8_mates_ICC5_SW023_S60_L001_R2_001.fastq
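
As a quick sanity check on the reported filter statistics, the counts can be summed and the percentages recomputed. This is only an arithmetic sketch using the numbers quoted above; the variable names are made up for illustration.

```python
# Counts copied from the filter statistics reported above.
processed = 1099799
true_mates = 1007381
non_mates = 92416
too_short = 2

def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# The three categories should account for every processed read.
print("categories sum to total:", true_mates + non_mates + too_short == processed)
print("true mates: %.2f%%" % percent(true_mates, processed))
print("non-mates/chimeras: %.2f%%" % percent(non_mates, processed))
```

The true-mate fraction falls between 91.5% and 92%, so the reported 91% appears to be truncated rather than rounded.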

=== FastQC analysis ===

FastQC indicates multiple technical problems with the reads, beyond the usual decrease in quality scores at the ends of reads. For example, most of these reads are only about 30 bp long, when they are supposed to be 300 bp long.

{{seqlength.png}}
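
The read-length distribution can also be tabulated directly from a FASTQ file, independently of FastQC. The sketch below assumes four-line FASTQ records; the helper names are hypothetical and the demo reads are invented.

```python
import io
from collections import Counter

def read_lengths(handle):
    """Yield the sequence length of each record in a 4-line FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            yield len(line.rstrip("\r\n"))

def length_histogram(handle):
    """Return a Counter mapping read length -> number of reads."""
    return Counter(read_lengths(handle))

# Made-up example with one 8 bp read and two 3 bp reads.
demo = io.StringIO(
    "@a\nACGTACGT\n+\nIIIIIIII\n"
    "@b\nACG\n+\nIII\n"
    "@c\nTTT\n+\nIII\n"
)
hist = length_histogram(demo)
for length in sorted(hist):
    print(length, hist[length])
```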

There are also unusual sequence duplication levels and abnormal k-mer content at the ends of reads.

{{::r1_ijs8_mates_icc5_sw023...001.fastq_fastqc_report.pdf| FastQC results Lucigen mate pair R1}}

{{:r2_ijs8_mates_icc5_sw023...001.fastq_fastqc_report.pdf| FastQC results Lucigen mate pair R2}}

=== Insert size distribution ===

The distribution of insert sizes for inward-facing, outward-facing, and same-strand read pairs is shown below. Mate pairs should be outward facing.

{{:lucigen_insert_size_distribution.jpg?200|}}

To generate this distribution, mate pairs were mapped to all the SOAPdenovo "run 1" contigs using BWA. The orientation of the reads was extracted from the resulting SAM file using a script from the Green lab.
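
The Green lab script is not reproduced here. As an illustration of the general idea only, the following sketch classifies pair orientation from the SAM FLAG, POS, and PNEXT fields and collects the absolute TLEN per class. It is a simplification: it compares leftmost mapping positions only, keeps just the first read of each pair so pairs are not double counted, and skips pairs whose mates map to different contigs. The function names and demo records are hypothetical.

```python
from collections import defaultdict

def classify_pair(flag, pos, pnext):
    """Classify a pair as 'inward', 'outward', or 'same_strand'."""
    read_rev = bool(flag & 0x10)  # this read maps to the reverse strand
    mate_rev = bool(flag & 0x20)  # its mate maps to the reverse strand
    if read_rev == mate_rev:
        return "same_strand"
    # Leftmost positions of the forward-strand and reverse-strand reads.
    fwd_pos, rev_pos = (pnext, pos) if read_rev else (pos, pnext)
    return "inward" if fwd_pos <= rev_pos else "outward"

def insert_sizes_by_orientation(sam_lines):
    """Map orientation class -> list of absolute template lengths."""
    sizes = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, pos = int(fields[1]), int(fields[3])
        rnext, pnext, tlen = fields[6], int(fields[7]), int(fields[8])
        if not (flag & 0x1 and flag & 0x40):  # paired, first read only
            continue
        if flag & (0x4 | 0x8 | 0x100 | 0x800):  # unmapped/secondary/supplementary
            continue
        if rnext != "=":  # mates on different contigs
            continue
        sizes[classify_pair(flag, pos, pnext)].append(abs(tlen))
    return sizes

# Invented SAM records: one inward, one outward, one same-strand pair.
demo_sam = [
    "@SQ\tSN:contig1\tLN:10000",
    "p1\t97\tcontig1\t100\t60\t50M\t=\t400\t350\t*\t*",
    "p2\t81\tcontig1\t100\t60\t50M\t=\t3000\t2950\t*\t*",
    "p3\t65\tcontig1\t100\t60\t50M\t=\t600\t550\t*\t*",
]
result = insert_sizes_by_orientation(demo_sam)
for orientation in sorted(result):
    print(orientation, result[orientation])
```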


==== Duplicate Removal with FastUniq ====

Removing duplicates with FastUniq reduced the number of read pairs substantially, from 217,484 to 89,856 (about 41% retained).
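
To illustrate what duplicate removal on read pairs means, here is a minimal sketch that keeps the first occurrence of each exact (R1 sequence, R2 sequence) combination. This is not FastUniq's actual implementation, only an exact-match approximation of the idea; the function name and demo data are hypothetical.

```python
def dedup_pairs(pairs):
    """Keep the first occurrence of each (R1 sequence, R2 sequence) pair.

    pairs is an iterable of (name, seq1, seq2) tuples.
    """
    seen = set()
    kept = []
    for name, seq1, seq2 in pairs:
        key = (seq1, seq2)
        if key not in seen:
            seen.add(key)
            kept.append((name, seq1, seq2))
    return kept

# Invented read pairs: pair_c duplicates pair_a in both mates, while
# pair_d shares only its R1 sequence with pair_a and so is kept.
demo_pairs = [
    ("pair_a", "ACGT", "TTGA"),
    ("pair_b", "GGCC", "AATT"),
    ("pair_c", "ACGT", "TTGA"),
    ("pair_d", "ACGT", "CCCC"),
]
unique = dedup_pairs(demo_pairs)
print([name for name, _, _ in unique])
print("fraction retained: %.2f" % (len(unique) / float(len(demo_pairs))))
```

Keying on both mates together means a pair is dropped only when both of its reads match an earlier pair.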


**FastQC**

{{::lucigen_mp_dupremoved_r1.pdf|}}

{{::lucigen_mp_dupremoved_r2.pdf|}}