Banana Slug Genomics

Sequencing data

Library	Run	Location	Notes
SW041		/campusdata/BME235/Spring2015Data/	Mate pair library. Expected insert size is 3-4kb.

Files

File	Size	Reads
SW041.r1.trimmed.fastq	285M	1,987,204
SW041.r2.trimmed.fastq	285M	1,969,081
/campusdata/gchaves/SW041/SW041_adapter_trimmed_2-trimmed-pair1.fastq	285M	1,449,417
/campusdata/gchaves/SW041/SW041_adapter_trimmed_2-trimmed-pair2.fastq	276M	1,449,417
Matepair_trimmed/skewer_run2_SW041_1_trimmed-pair1.fastq	257M	1,311,142
Matepair_trimmed/skewer_run2_SW041_1-trimmed-pair2.fastq	248M	1,311,142
Matepair_dupRemoved/myskewer_41_dupRemoved_R1.fastq	243M	1,237,731
Matepair_dupRemoved/myskewer_41_dupRemoved_R2.fastq	234M	1,237,731

Note: Duplicates, concatemers, and linkers have already been removed in the “trimmed” files.
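The read counts in the table can be re-checked by counting FASTQ records. A minimal sketch, assuming each record occupies exactly four lines (no wrapped sequence lines); the function name `count_fastq_reads` is made up for illustration:

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly four lines per record."""
    # Transparently handle gzip-compressed files as well as plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

In the table above, the two original files list different read counts (1,987,204 and 1,969,081) while every later pair of files matches, so comparing the counts for both mates may be a useful sanity check before mapping.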

FastQC analysis

FastQC flags several summary statistics as potentially unusual, including the per-base sequence content and the k-mer content.

Fastqc results for SW041.r1.trimmed.fastq

Fastqc results for SW041.r2.trimmed.fastq

PreQC analysis

Run on SW041.r1.trimmed.fastq and SW041.r2.trimmed.fastq

Preqc for SW041

Nextera transposase sequences

These are the sequences used to trim:

read 1	5' TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
read 2	5' GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG

Fastqc after trimming using Nextera transposase sequences

Trimming with the sequences above does not appear to remove the Nextera transposase adapters, as the FastQC outputs show.

Fastqc for SW041 after skewer trimming using Nextera transposase sequences, read1

Fastqc for SW041 after skewer trimming using Nextera transposase sequences, read2

Files located here:

/campusdata/gchaves/SW041/SW041_adapter_trimmed_2-trimmed-pair1.fastq

/campusdata/gchaves/SW041/SW041_adapter_trimmed_2-trimmed-pair2.fastq

Insert size distribution

The distribution of insert sizes for inward facing, outward facing, and same strand reads is shown below. Mate pairs should be outward facing.

To generate this distribution, mate pairs were mapped to all the soapdenovo “run 1” contigs using bwa. The orientation of reads was pulled from the resulting sam file using a script from the Green lab.
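The Green lab script is not reproduced on this page. As a rough illustration of how orientation can be read from a sam file, the sketch below classifies one alignment line from the standard SAM FLAG bits and positions. The function name `classify_pair` and the labels are assumptions made for this example, and `abs(TLEN)` is used as the insert size exactly as the aligner reports it.

```python
def classify_pair(sam_line):
    """Return (orientation, insert_size) for a first-in-pair SAM record, else None.

    Orientation is "innie" (-> <-), "outie" (<- ->), or "same" (same strand).
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname, pos = fields[2], int(fields[3])
    rnext, pnext, tlen = fields[6], int(fields[7]), int(fields[8])

    # Use only the first mate of each pair so every pair is counted once.
    if not flag & 0x1 or not flag & 0x40:
        return None
    # Skip unmapped reads/mates and secondary or supplementary alignments.
    if flag & (0x4 | 0x8 | 0x100 | 0x800):
        return None
    # Both mates must map to the same contig or scaffold.
    if rnext not in ("=", rname):
        return None

    read_rev = bool(flag & 0x10)
    mate_rev = bool(flag & 0x20)
    if read_rev == mate_rev:
        orientation = "same"
    else:
        # Compare leftmost positions of the forward- and reverse-strand mates.
        fwd_pos, rev_pos = (pnext, pos) if read_rev else (pos, pnext)
        orientation = "innie" if fwd_pos <= rev_pos else "outie"
    return orientation, abs(tlen)
```

Collecting the returned insert sizes per orientation would give the three histograms described above. Overlapping mates are classified by leftmost position only, which is a simplification.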

How were these insert sizes determined? Mapping to contigs? to scaffolds? What set of contigs or scaffolds? It looks like the distribution tails off before the expected insert size. Is this because the library is short, or because what it was being mapped to is short? If you are mapping to contigs, then you can't see insert size longer than the contigs, and that may be too short for properly viewing the library. Even with larger scaffolds, you'll still see an enrichment for short fragments in this distribution, because you are much more likely to have both ends of a short fragment be mappable to the same scaffold than both ends of a long fragment.

It might be good to map a lot of read pairs, but only report those whose first position is within the first 3kB of scaffolds 10kB or longer. That would eliminate the bias towards short fragments, and (now that we have a scaffold N50 bigger than 10kB, 2015 May 22) should be enough spots to map to to get a decent histogram.
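One way to read this suggestion is sketched below: take scaffold lengths from the `@SQ` header lines of the sam file, then keep a pair only if its leftmost mapped position falls in the first 3 kb of a scaffold that is at least 10 kb long. The function names and the choice of the leftmost position as the "first position" are assumptions made for illustration, not something the page specifies.

```python
def scaffold_lengths(sam_lines):
    """Build {scaffold_name: length} from the @SQ header lines of a SAM file."""
    lengths = {}
    for line in sam_lines:
        if line.startswith("@SQ"):
            tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def keep_pair(rname, pos, pnext, lengths, window=3000, min_len=10000):
    """Keep a pair whose leftmost position is within `window` of the scaffold start.

    Only scaffolds of at least `min_len` bases qualify, so fragments of up to
    roughly min_len - window bases have room to map on the same scaffold.
    """
    if lengths.get(rname, 0) < min_len:
        return False
    return min(pos, pnext) <= window
```

With the defaults, any fragment up to about 7 kb that starts in the window can place both ends on the scaffold, which covers the expected 3-4 kb insert size.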

The SW041 mates were mapped against the soap de novo scaffolds from the “attempt 1” assembly, which was the same assembly as was used above to map reads against contigs. This was done so that we could directly compare the results of mapping to contigs vs scaffolds without confounding factors of which data sets were used to create the scaffolds.

The SW041 mates were also mapped against the soap de novo scaffolds from the “attempt 2” assembly, which has a longer scaffold N50.

There were no outies or innies from the SW041 library that mapped to this assembly, and only a few same-strand pairs that mapped. It is very surprising that no innies or outies mapped to these scaffolds at all, especially since they did map to the scaffolds from the “attempt 1” assembly.

Reads were not mapped to contigs/scaffolds of a particular size because, as we discussed in class, this can artificially “force” mates to map to long contigs/scaffolds where they do not really belong.

Trial and Error with trimming and duplicate removal

In an attempt to remove the Nextera transposase sequences, skewer was run again, using the adapter sequence from the following document:

Nextera Mate Pair Kit

skewer-0.1.123-linux-x86_64 -x CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG -y CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG -m mp -j CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG -t 32 -o ${OUTDIR} /campusdata/BME235/Spring2015Data/SW041.r1.trimmed.fastq /campusdata/BME235/Spring2015Data/SW041.r2.trimmed.fastq

This run removed more of the highly abundant k-mers (see the k-mer content section of the FastQC report), but left a comparable number of over-represented sequences overall.

Fastqc results shown below:

fastqc_skewer_run2_sw041_pair1.pdf

fastqc_skewer_run2_sw041_pair2.pdf
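The sequence passed to -x, -y, and -j in the skewer command above can be compared with the two trimming sequences listed earlier on this page. The short check below, written only as an illustration, tests whether that 38-base sequence is its own reverse complement and whether its second half is the shared 3' end of the read 1 and read 2 trimming sequences. The variable names are made up for this example.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Sequences copied from this page.
JUNCTION = "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"
READ1_TRIM = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
READ2_TRIM = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"

half = len(JUNCTION) // 2
shared_end = JUNCTION[half:]  # second half of the junction sequence

is_palindrome = reverse_complement(JUNCTION) == JUNCTION
in_read1 = READ1_TRIM.endswith(shared_end)
in_read2 = READ2_TRIM.endswith(shared_end)

print(f"junction is its own reverse complement: {is_palindrome}")
print(f"second half ends read 1 trimming sequence: {in_read1}")
print(f"second half ends read 2 trimming sequence: {in_read2}")
```

All three checks hold for the sequences as written here, so the junction sequence reads the same on either strand, and both runs target the same 19-base core.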

Fastuniq

Both sets of adapter-trimmed files were processed with Fastuniq to remove duplicates.
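For readers unfamiliar with pair-level duplicate removal, the idea can be sketched as follows. This is not Fastuniq's implementation, only a minimal illustration that treats a pair as a duplicate when both mate sequences exactly match a pair already seen; the function name `dedup_pairs` is hypothetical.

```python
def dedup_pairs(pairs):
    """Keep the first occurrence of each exact (mate1_seq, mate2_seq) pair.

    `pairs` is any iterable of (seq1, seq2) tuples; input order is preserved.
    """
    seen = set()
    unique = []
    for seq1, seq2 in pairs:
        key = (seq1, seq2)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Because both mates must match, two pairs that share only one identical read are both kept.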

Fastqc results below:

original skewer run

fastqc_skewer_41_duprem_r1

fastqc_skewer_41_duprem_r2

second skewer run with junction sequence

fastqc_myskewer_41_dupr_r1.pdf

fastqc_myskewer_41_dupr_r2.pdf

It seems like the first run did a better job of removing the over-represented sequences, but the second run with the junction sequence did better at reducing the high kmer frequency numbers. The first run also kept more reads overall, 1366241 compared to 1237731, starting with 1449494 reads originally.
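To put the read counts quoted above on a common scale, the fraction retained by each run can be worked out directly. The snippet uses only the three numbers from the paragraph above; the dictionary labels are shorthand invented for this example.

```python
# Read counts as quoted in the paragraph above.
start = 1449494
kept = {
    "first skewer run": 1366241,
    "second skewer run (junction sequence)": 1237731,
}

# Fraction of the starting reads that each run retained.
retention = {name: n / start for name, n in kept.items()}

for name, frac in retention.items():
    removed = start - kept[name]
    print(f"{name}: {frac:.1%} kept, {removed:,} reads removed")
```

This works out to roughly 94% retained for the first run and roughly 85% for the second.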

I'm unsure which would be preferred.

(not kevin) How does the data set with both runs applied to it look? My understanding is that each filter made the data less contaminated, so why would we not put it through both filters?
