Banana Slug Genomics

**This is an old revision of the document!** ----


====== SW018_S1 ======

===== Sequencing data =====

| Library | Run | Location | Notes |
| SW018_S1 | HiSeq | /campusdata/BME235/Spring2015Data/ | 2x100, insert size: 597 +/- 93 |

===== Files =====

| File | Size |
| SW018_S1_L007_R1_001.fastq | 34G |
| SW018_S1_L007_R2_001.fastq | 34G |
| Undetermined_S0_L007_R1_001.fastq | 3.1G |
| Undetermined_S0_L007_R2_001.fastq | 3.1G |
| bams/SW018_S1_L007_001.bam | 21G |
| adapterAndPCRFreeFiles/SW018_adapterTrimmed_dupRemoved_150424_R1.fastq | 33G |
| adapterAndPCRFreeFiles/SW018_adapterTrimmed_dupRemoved_150424_R2.fastq | 33G |
| ErrorCorrected/SW018_seqprep_dupRemoved_ec_R1.fastq | 38G |
| ErrorCorrected/SW018_seqprep_dupRemoved_ec_R2.fastq | 38G |

===== Fastqc results =====

For SW018:

For forward reads: {{:sw018_s1_l007_r1_001.fastq.pdf|}}

For reverse reads: {{:sw018_s1_l007_r2_001.fastq.pdf|}}

==== Comments ====

Fastqc indicated several issues with the raw reads: a) the per base sequence content, b) there are overrepresented sequences, which it detects may be the Illumina single end PCR primer 1, and c) there is abnormal k-mer content at the start of many reads.

===== Preqc (SGA preprocessing) results =====

{{:pooled_hiseq_preqc_report.pdf|}}

==== Comments ====

=== Remarks from Stefan: ===

The genome size (estimated to be 2.29 Gb) is more like what we expected after Kevin's prediction in one of the last courses (2.3159 Gb).

{{:sgapreqcestgenomesize.png?300 |}}

Also the de Bruijn graph stats look reasonable, apart from the absence of different kmers, which is odd. [I will check why that is so] The graphs usually follow a trend, so if the estimates based on the two kmers plotted are indeed correct, then it looks like there will be low heterozygosity (variant branches in k-de Bruijn graph), a high repeat content (repeat branches in k-de Bruijn graph), which is expected given the genome size, and a low sequencing error rate (error branches in k-de Bruijn graph).
The "Simulated contig lengths vs k" is very low, as we expect, since this analysis didn't include long insert mate-pair libraries. This statistic will go up once MPs are included.

{{ :sgapreqcsimulatedcontig.png?300|}}

There seem to be low duplication levels. The "Mean quality score by position" and the "Fraction of bases at least Q30" show that the reads are of pretty high quality. The "k-mer position of first error" and the "Per-position error rate" show that errors are very infrequent and, again as expected, increase in frequency toward the end of the reads.

The "Estimated Fragment Size Histogram" estimates one library at around 350 bp and the other at around 450 bp fragment length. The "51-mer count distribution" estimates the current coverage at slightly more than 10x, but this statistic often underestimates the real coverage. Based on the number of reads it's probably closer to 20x. "GC Bias" indicates some bias in the data.

So, overall, it tells us that the genome is quite big (maybe around 2.3 Gb), that we definitely need more data (but remember we didn't include all data in this analysis and that there is more data coming), and that the assembly will be very tricky (mostly because of a high repeat content). If you remember my lecture, the presence of many long repeats makes de novo assembly much harder.

===== Skewer adapter removal =====

Fri Apr 17

These data with the adapters removed are located at:

/campusdata/BME235/S15_assemblies/SOAPdenovo2/adapterRemovalTask/skewer_run/SW018_S1_L007_better/SW018_S1_L007-trimmed-pair1.fastq

/campusdata/BME235/S15_assemblies/SOAPdenovo2/adapterRemovalTask/skewer_run/SW018_S1_L007_better/SW018_S1_L007-trimmed-pair2.fastq

===== Fastq to bam =====

The fastq to bam conversion was performed using the Picard toolset. Specifically, the fastqToSam.jar file was used to prepare the bam files.
[[team_5_page:fastqToSamCommands | FastqToSam commands]]

===== Raw fastq adapter presence analysis =====

This section contains notes made during a second pass at analyzing the presence of potential adapter sequences in the raw .fastq datasets.

For forward (R1) strands:

  - The SW018_s1_l007_r1_001.fastq fastqc file indicates an overrepresented sequence, most of which matches the adapter sequence of Oligo ID BO3.P7.part1.F as given in the library prep protocol (https://banana-slug.soe.ucsc.edu/_media/meyer_kircher.pdf). A substring of this sequence also overlaps with Ed Green's statement of what the adapter sequence at the end of the forward (R1) reads is (AGATCGGAAGAGCACACGTCTGAACTCCAGTC).
  - For kmer content, all biased kmers followed certain sequence patterns when spliced together in the order they appear in the fastqc plot:
    - AGATCGGAAGAGC: Resembles several adapter sequences: Oligo ID IS3_adapter.P5+P7, the beginning of BO2.P5.R, or the beginning of BO3.P7.part1.F.
    - TCTTCCGATCT: Resembles several adapter sequences: the end of Oligo ID IS1_adapter.P5, the end of IS2_adapter.P7, the end of BO1.P5.F, or the end of BO4.P7.part1.R.

For reverse (R2) strands:

  - The SW018_s1_l007_r2_001.fastq fastqc file indicates an overrepresented sequence, most of which matches the adapter sequence of Oligo ID BO2.P5.R as given in the library prep protocol (https://banana-slug.soe.ucsc.edu/_media/meyer_kircher.pdf). A substring of this sequence also overlaps with Ed Green's statement of what the adapter sequence at the end of the reverse (R2) reads is (AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTG).
  - For kmer content, all biased kmers followed certain sequence patterns when spliced together in the order they appear in the fastqc plot:
    - AGATCGGAAGAGCGT: Resembles the start of the adapter sequence of Oligo ID BO2.P5.R.
    - CTTCCGATCT: Resembles several adapter sequences: the end of Oligo ID IS1_adapter.P5, the end of IS2_adapter.P7, the end of BO1.P5.F, or the end of BO4.P7.part1.R.
    - ATCGGAAG: Resembles part of IS3_adapter.P5+P7, part of BO2.P5.R, or part of BO3.P7.part1.F.

===== SeqPrep results =====

The data files were trimmed using SeqPrep, both with and without merging. The output for the run without merging is in /campusdata/BME235/Spring2015Data/adapter_trimming/SeqPrep and the output for the run with merging is in /campusdata/BME235/Spring2015Data/merging/SeqPrep. For some reason the trimmed R1 and R2 files for the run with merging are strangely small. The adapters used for both runs were AGATCGGAAGAGCACACGTCTGAACTCCAG (-A option) and AGATCGGAAGAGCGTCGTGTAGGGAAAGAG (-B option).

===== Merged SW018 Libraries =====

All SW018 data sets that had been adapter trimmed using SeqPrep were merged with FastUniq to remove duplicates and then error corrected using Musket.
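
The k-mer comparisons in the raw fastq adapter presence analysis amount to a substring check in both orientations. The following is a minimal Python sketch of that idea, not the procedure actually used for this page. It tests each biased k-mer reported by fastqc against the two read-end adapter sequences quoted from Ed Green. The sequences are copied from this page; the function names (`reverse_complement`, `locate`, `classify_kmers`) are hypothetical and chosen only for illustration.

```python
# Read-end adapter sequences as quoted on this page (attributed to Ed Green).
R1_ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTC"  # end of forward (R1) reads
R2_ADAPTER = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTG"  # end of reverse (R2) reads
ADAPTERS = {"R1": R1_ADAPTER, "R2": R2_ADAPTER}

# Biased k-mers listed in the fastqc notes on this page.
KMERS = ["AGATCGGAAGAGC", "TCTTCCGATCT", "AGATCGGAAGAGCGT",
         "CTTCCGATCT", "ATCGGAAG"]

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate(kmer, adapter):
    """Hypothetical helper: '+' if kmer occurs in adapter as written,
    '-' if only its reverse complement occurs, None if neither does."""
    if kmer in adapter:
        return "+"
    if reverse_complement(kmer) in adapter:
        return "-"
    return None


def classify_kmers(kmers, adapters):
    """Map each k-mer to its orientation of match against every adapter."""
    return {k: {name: locate(k, seq) for name, seq in adapters.items()}
            for k in kmers}


if __name__ == "__main__":
    for kmer, hits in classify_kmers(KMERS, ADAPTERS).items():
        print(kmer, hits)
```

Under this check, TCTTCCGATCT and CTTCCGATCT match both adapters only as reverse complements of the shared AGATCGGAAG start, while AGATCGGAAGAGCGT matches only the R2 adapter. The sketch covers only the two read-end sequences, not the full oligo list in the library prep protocol.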
