Banana Slug Genomics

**This is an old revision of the document!** ----

A PCRE internal error occured. This might be caused by a faulty plugin

====Team composition==== | Name | Email| | Robert Calef | rcalef@ucsc.edu | | Chris Eisenhart | ceisenha@ucsc.edu | | Natasha Dudek | natasha@dudek.org |https://banana-slug.soe.ucsc.edu/team_5_page?do=edit#dokuwiki__top | Gepoliano Chaves | gchaves@ucsc.edu | ====Discovar de novo overview==== Discovar //de novo// is a next generation sequence assembly program. The program was developed by the Broad Institute and was released late in 2014. Discovar //de novo// is designed for 250 bp long illumina reads with the PCR duplicates and adaptor sequences removed. The following webpage contains the manual as provided by the Broad Institute (http://www.broadinstitute.org/software/discovar/blog/): [[Discovar //de novo// manual:Discovar de novo manual]]. ====Team workflow==== The raw data was received as fastq pairs. Each pair contains a forward and reverse strand. These pairs are ran through Skewer to remove adaptor sequences, then ran through fastUniq to remove PCR duplicates. Next the forward and reverse strand are merged into a single unaligned BAM file. {{:fastqtobam.png}} All unaligned BAM files are then passed into Discovar //de novo//. The output is an assembly in .fasta format and Discovar //de novo// visualization files. The .fasta file can then be re-scaffolded with a scaffolding program (see next workflow). The finished file can be represented on the UCSC genome browser. {{:bamstodiscovarwsspace.png}} Discovar //de novo// currently does not support mate pair libraries. To incorporate these data the program a program SSpace was used to re-scaffold the Discovar //de novo// fasta output. The workflow below describes how mate pair libraries are processed and factored into the assembly. The SSpace library is a small text file that points to the libraries and provided meta data statistics. Please see the 'Programs used' section for SSpace citation and further details. {{:fastqtosslib.png}} ====FastQC of adapter-trimmed and PCR duplicate-removed data==== After removing adapters and PCR duplicates, we run FastQC in two of the libraries. In general, the quality of the reads decrease in the last base-positions. Also, read 2 of the SW019 library shows problems in the per tile sequence quality. Bellow are the pdf files with the fastqc for the PCR and adapter removed libraries. The protocol we used to run fastqc is uploaded in this link: [[fastqc:fastqc]]. The path to the files is: campusdata/gchaves/fastqc_trimmed_PCR_duplicates. {{:sw018_adaptertrimmed_dup..._r1.pdf| SW018_R1}} {{:sw018_adaptertrimmed_dup..._r2.pdf| SW018_R2}} {{:sw019_adaptertrimmed_dup..._r1.pdf| SW019_R1}} {{:sw019_adaptertrimmed_dup..._r2.pdf| SW019_R2}} ====Fastq to bam==== The fastq to bam conversion was performed using the picard toolset. Specifically the fastqToSam.jar file was used to prepare the bam files. [[team_5_page:fastqToSamCommands | FastqToSam commands]] ====Discovar de novo run logs==== Discovar //de novo// was designed for very specific data. To test the validity of our data we perform different test runs. The test runs used a percentage of data from the libraries available. All the tests were run on .bam files. The log files were run on edser2 or campusrocks nodes with more than 200 GB of RAM available. The run logs are stored as .txt files. The full logs can be seen on the wiki here, | Run log | Data used| |[[team_5_page:1PerRun | 1% data ]]| (Pre Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 | |[[team_5_page:5PerRun | 5% data]]| (Pre Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 | |[[team_5_page:10PerRun | 10% data]]| (Pre Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 | |[[team_5_page:50PerRun | 50% data]]| (Post Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 | |[[team_5_page:50PerRunUCSF | 50% data UCSF]]| (Post Skewer and FastUniq) UCSF SW018 and SW019 data | The logs are very large, important statistics have been gathered and are compared below. Note that MPL1 is an acronym for mean length of first read in pair up to first error. | | 1% run | 5% run | 10 % run| 50 % run | 50% UCSF run | | Total runtime | 1.75 hours| 1.53 hours| 2.4 hours| 8.53 hours| 14.9 hours | | Peak memory use | 43.92 GB | 78.10 GB| 151.05 GB| 220.11 GB | 184.09 GB| | Bases in 1kb+ scaffolds| 75,233 | 592,685 | 1,476,875 | 101,397,871 | 1,528,625,509 | | Bases in 10kb+ scaffolds| 10,572 | 11,088 | 168,543 | 151,417 | 137,959,107 | | MPL1 | 2 | 2 | 3 | 7 | 156 | | Contig N50 | 2,622 | 2,067 | 2,563 | 1,489 | 3,979 | | Scaffold N50 | 2,622 | 2,067 | 2,563 | 1,489 | 3,979 | ====Fasta assemblies==== Our fasta assembly files are located at /campusdata/BME235/S15_assemblies/DiscovarDeNovo Each fasta assembly is in its own directory, the directory name is the assembly name. Currently there are five assemblies, The final fasta file is name a.lines.fasta. Note originally the authors used the a.fasta file (which contains the reverse complement of every contig) for statistics. Consequently statistics were falsely reported as twice the actual number until May 15th 2015 when the error was identified. The statistics have since been corrected. Raw stats were mined from the fasta files using a python script fastaStats.py. The script is available online at (https://github.com/ChrisEisenhart/binfProgs/blob/master/basicProgs/fastaStats.py). |Assembly name| Bytes | Total bases | # scafs | Av. scaf len | Longest scaf | Scaf N50 | # Scaf > 5Kb | Bases in 10kb+ scafs| |1%run| 463K | 448,486 | 2,558 | 350 | 5,385 | 2,622| | 10,572 | |5%run| 3.4 M | 3,377,064 | 18,224 | 370 | 6,637 | 2,067 | | 11,088 | |10%run| 7.5 M | 7,382,612 | 38,195 | 386 | 11,911 | 2,563 | | 168,543 | |50%run| 137 M | 137,695,736 | 273,653 | 1,006 | 19,658 | 1,489 | | 151,417 | |UCSF50%run | 1.9 G | 1,839,371,352 | 1,126,557 | 1,632| 55,757 | 3,979 | 80,721 | 137,959,107 | |FullRun1 | 2.2G | - | - | - | - | 10,634 | - | 972,798,485 | Looking at the 10% run, the majority of scaffolds generated are quite short (<1kb). {{:histogram_for_10_run.png?200|{{:histogram_of_contig_length_discovar_10_run_log_y_.png?200|}} When a similar histogram is generated for the assembly made with 50% of the SW018 and SW019 reads from UCSF, you can see that the average scaffold length is higher and there are quite a few more scaffolds that are over 10kb. {{:histogram_for_ucsf_50_run.png?200|}} The banana slug genome is estimated to be 2.1 billion bases (2,800 million), our latest run has assembled just under 2 billion bases! ====Post assembly scaffolding==== The program SSPace (documentation below) was used to scaffold the the assembly with mate pair data. The UCSF SW041 and SW042 mate pair libraries were used to generate the library.txt file. [[team_5_page:SSPaceSummaryFile | SSpace summary file ]] ====BLAST results==== The .fasta assemblies were run through BLAST. The results are below, [[team_5_page:10%Blast | 10% BLAST results ]] There seems to be a very high sequence identity with Notopygos (http://sv.wikipedia.org/wiki/Notopygos) [[ team_5_page:50%_UCSF_Blast | 50% UCSF Data BLAST results ]] ====UCSC genome browser hub==== See instructions for setting up the hub here, [[banana_slug_genome_browser |Banana slug browser ]] ====Programs used==== The program, its location, and a brief, (brief!) explanation of what the program does ===Picard=== The Picard is a set of Java-based command-line utilities for SAM and BAM file manipulation (edser2:/soe/calef/picardtools and edser2:/soe/calef/picard_jars). Webpage: http://picard.sourceforge.net/. ===Jemalloc=== Is a general purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support (edser2:/soe/calef/jemalloc). Webpage: http://www.canonware.com/jemalloc/. ===Skewer=== Skewer is an adapter trimmer for Illumina paired-end sequences (/campusdata/BME235/S15_assemblies/SOAPdenovo2/adapterRemovalTask/skewer_run). Webpage: http://sourceforge.net/projects/skewer/. ===FastUniq=== FastUniq is a fast de novo duplicate removal tool for paired short DNA sequences (/campusdata/BME235/bin). Webpage: http://sourceforge.net/projects/fastuniq/. ===GCC=== GCC is a compiler for the GNU operating system. Webpage: https://gcc.gnu.org/. ===SSpace=== A program for scaffolding fasta files, written by Marten Boetzer and Walter Pirovano. The authors are very protective over the use and citation of this program. The program can be obtained by at http://www.baseclear.com/genomics/bioinformatics/basetools/SSPACE. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W(2011), Scaffolding pre-assembled contigs using SSPACE, Bioinformatics 27(4):578-9 ====References==== blog posts, related information etc. http://blastedbio.blogspot.com/2011/10/fastq-must-die-long-live-sambam.html ====Lecture slides==== {{:team_5_report_1.pdf| First report, Wednesday April 29th 2015}}

Discussion

Jared N Copher, 2015/05/13 00:28

where is your script fastaStats.py in the BME235 directory? I tried to use your script after downloading it but it needs the fastFunctions module and I can't find it.

Christopher Eisenhart, 2015/05/16 20:40

Get the entire source through git, or you can find the fastFunctions.py file specifically. It needs to be in the same directory as the fastaStats.py file. I will put both of them in the BME235 bin so you should not have to worry about it.

Thanks for the headsup

EDIT: Both are in the BME235 bin, if you have it on your path you can run the program

fastaStats.py < input.fasta > output.stats

Good luck!

Kevin Karplus, 2015/05/09 03:49

Your BLAST hits seem to be to the 18S ribosomal subunit, which is not extremely helpful. I'd rather see blast (or megablast) hits for just the longest few contigs.

BWA mapping of the longest contigs onto the old assembly of the mitochondrion would also be useful.

Kevin Karplus, 2015/05/09 03:45

The histogram of contig lengths would be much more useful if the y-axis were on a log scale. In general, when you are interested in the tail of a distribution, then a log scale for the counts or probabilities gives you a much more informative view.

Natasha Dudek, 2015/05/09 04:12

Good point, thank you for the suggestion. The histogram plot has been adjusted accordingly.

Kevin Karplus, 2015/05/18 14:38

Although the new histograms look better, the log scaling is not done well. You should be plotting on a log scale with proper log-scale tick marks, not plotting 1+log(x) on a linear scale. Please use a plotting package that has proper log scales—whichever one you are using looks rather unprofessional.

You could leave a comment if you were logged in.

Banana Slug Genomics

User Tools

Site Tools

Discussion

Page Tools