SOAPdenovo2

Team composition

Name	Email
Charles Markello	cmarkell@ucsc.edu
Thomas Matthew	thjmatth@ucsc.edu
Nedda Saremi	nsaremi@ucsc.edu

SOAPdenovo2 Overview

Short Oligonucleotide Analysis Package de novo (SOAPdenovo) is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is designed specifically to assemble Illumina GA short reads.

SOAPdenovo2, the latest version of SOAPdenovo, has a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes.

SOAPdenovo2 paper

SOAPdenovo2 manual

Workflow

Pre-Processing

The following tools were used to process the reads before assembly with SOAPdenovo2.

Skewer

Skewer is an adapter trimming tool designed specifically for processing Illumina paired-end sequences.

Skewer

Skewer paper

Fastuniq

Fastuniq is a tool that removes duplicate read pairs before de novo assembly.

All data sets from the same library were merged and then run through Fastuniq.
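The idea of removing duplicate read pairs can be sketched as follows. This is only an illustration, not Fastuniq's actual algorithm or interface: it assumes a pair counts as a duplicate only when both mates exactly match an earlier pair, and the function name `dedupe_pairs` is hypothetical.

```python
def dedupe_pairs(pairs):
    """Keep the first occurrence of each (read1, read2) sequence pair.

    `pairs` is an iterable of (read1_seq, read2_seq) tuples. Exact-match
    comparison of both mates is an assumption made for this sketch.
    """
    seen = set()
    unique = []
    for read1, read2 in pairs:
        key = (read1, read2)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Toy input: the second pair duplicates the first; the third differs in read 2.
pairs = [("ACGT", "TTGA"), ("ACGT", "TTGA"), ("ACGT", "CCGA")]
print(dedupe_pairs(pairs))
```

Comparing both mates together matters: two pairs that share only one mate come from different fragments and are kept.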

Fastuniq paper

Musket

Musket was used for error correction on the data sets.

http://musket.sourceforge.net/homepage.htm#latest

KmerGenie

KmerGenie was used to determine optimal k-mer size for de novo assembly. The program produces abundance histograms for many values of k, and then predicts the number of distinct genomic k-mers, returning the k-mer length which maximizes this number.
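To make the term "abundance histogram" concrete, here is a minimal sketch that counts k-mers exactly in a handful of toy reads. It is not how KmerGenie works internally; the function name `abundance_histogram` and the toy reads are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import Counter


def abundance_histogram(reads, k):
    """Return {abundance: number of distinct k-mers seen that many times}."""
    kmer_counts = Counter()
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    # Count how many distinct k-mers share each abundance value.
    return dict(Counter(kmer_counts.values()))


# Toy reads, for illustration only.
reads = ["ACGTACGT", "CGTACGTA"]
for k in (3, 4):
    print(k, abundance_histogram(reads, k))
```

On real data, k-mers with very low abundance are mostly sequencing errors, which is why the number of distinct genomic k-mers has to be predicted from the histogram rather than read off directly.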

KmerGenie

KmerGenie Readme file

KmerGenie paper

Running SOAPdenovo2

Configuration File

The configuration file gives the assembler the information it requires about each library used in the assembly.

configuration file
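As a rough illustration of the layout, the sketch below renders a configuration file with one global line followed by one `[LIB]` block per library. The field names follow the example configuration in the SOAPdenovo2 manual; every value (read length, insert size, paths) is a hypothetical placeholder, not the setting used for these runs, and `render_config` is a helper invented for this example.

```python
def render_config(max_rd_len, libraries):
    """Render a SOAPdenovo2-style config: a global line, then one [LIB] block per library."""
    lines = [f"max_rd_len={max_rd_len}"]
    for lib in libraries:
        lines.append("[LIB]")
        for key, value in lib.items():
            lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical library: all values and paths below are placeholders.
example_lib = {
    "avg_ins": 400,      # average insert size of the library
    "reverse_seq": 0,    # 0 for forward-reverse (short-insert) libraries
    "asm_flags": 3,      # 3 = use reads for both contig and scaffold assembly
    "rank": 1,           # order in which libraries are used for scaffolding
    "q1": "/path/to/library_R1.fastq",
    "q2": "/path/to/library_R2.fastq",
}
print(render_config(150, [example_lib]))
```

Each additional library gets its own `[LIB]` block, usually with a higher `rank` for larger insert sizes.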

Kmer selection

One of the key advantages of SOAPdenovo2 is the ability to use a range of k-mer sizes for the de Bruijn graph step of the genome assembly. This feature is activated through two command-line options:

-K to select the low end of the k-mer range

-m to select the high end of the k-mer range (lowercase; uppercase -M sets the merge level instead)
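A minimal sketch of assembling such a command line is shown below. The flag spellings (`-K` for the smallest k, lowercase `-m` for the largest k in multi-k-mer mode, `-s`, `-o`, `-p`) and the binary name `SOAPdenovo-127mer` follow the SOAPdenovo2 usage text and should be checked against the help output of the installed version. The k values, thread count, and file names are hypothetical, and the helper only builds the argument list without running anything.

```python
import shlex


def build_soapdenovo_command(config_path, out_prefix, k_min, k_max,
                             threads=8, binary="SOAPdenovo-127mer"):
    """Build (but do not run) a SOAPdenovo2 'all' command with a k-mer range."""
    if k_min > k_max:
        raise ValueError("k_min must not exceed k_max")
    return [
        binary, "all",
        "-s", config_path,   # library configuration file
        "-K", str(k_min),    # low end of the k-mer range
        "-m", str(k_max),    # high end of the k-mer range (multi-k-mer mode)
        "-o", out_prefix,    # prefix for output files
        "-p", str(threads),  # number of CPU threads
    ]


# Hypothetical values, for illustration only.
cmd = build_soapdenovo_command("soap.config", "assembly_out", 31, 63)
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths and k values have been set for a real run.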

Our Post-Processing

The following tools were used to process and analyze the results after running SOAPdenovo2.

CEGMA

Core Eukaryotic Genes Mapping Approach

CEGMA builds a highly reliable set of gene annotations in the absence of experimental data. It defines a set of 458 core proteins that are present in a wide range of taxa. Since these proteins are highly conserved, sequence alignment methods can reliably identify their exon-intron structures in genomic sequences.

CEGMA Readme file

QUAST

QUality ASsessment Tool

QUAST performs fast and convenient quality evaluation and comparison of genome assemblies.

QUAST computes a number of well-known metrics, including contig accuracy, number of genes discovered, N50, and others.
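For reference, N50 and L50 can be computed from a list of sequence lengths as in the sketch below. It uses the standard definitions (N50 is the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the total assembly length; L50 is the number of sequences needed to get there). The function name `n50_l50` and the toy lengths are hypothetical, and this is not QUAST's own code.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy lengths: total is 30, so half is 15; 10 + 8 = 18 reaches it.
print(n50_l50([10, 8, 6, 4, 2]))  # (8, 2)
```

A higher N50 together with a lower L50 means that more of the assembly sits in fewer, longer sequences.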

QUAST 2.3 manual

Assembly results

Attempt	Libraries	File location	Contig/Scaffold/Statistics file	Stats file dump	Run log file
1	SW018_S1, SW019_S1, SW019_S2	/campusdata/BME235/S15_assemblies/SOAPdenovo2/assemblyTask/SOAPdenovo2_run1/	soapdenovo2_sparseGraph.contig, soapdenovo2_sparseGraph.scafSeq, soapdenovo2_sparseGraph.scafStatistics	run1 stats	runlog
2	SW018_S1, SW019_S1, SW019_S2, R1_IJS8_mates_ICC5_SW023_S60	/campusdata/BME235/S15_assemblies/SOAPdenovo2/assemblyTask/SOAPdenovo2_run2/	soapdenovo2_sparseGraph.contig, soapdenovo2_sparseGraph.scafSeq, soapdenovo2_sparseGraph.scafStatistics	run2 stats	runlog2
3	SW018_S1, SW019_S1, SW019_S2, R1_IJS8_mates_ICC5_SW023_S60, BS-MK, BS-tag, SW041, SW042, UCSF SW019, UCSF SW018	/campusdata/BME235/S15_assemblies/SOAPdenovo2/assemblyTask/SOAPdenovo2_run3/	soapdenovo2_allAssembly_1.contig, soapdenovo2_allAssembly_1.scafSeq, soapdenovo2_allAssembly_1.scafStatistics	run3 stats	runlog3

GapCloser

SOAPdenovo2's GapCloser tool was run on the scaffold file to fill in gaps consisting of Ns.

Resultant file located here:

/campusdata/BME235/S15_assemblies/SOAPdenovo2/assemblyTask/SOAPdenovo2_run3_all/soapdenovo2_allAssembly_1.scafSeq.gapclosed

GapCloser Download

GapCloser manual

QUAST was then used to assess how successful GapCloser was on the scaffold file.

GapCloser reduced the proportion of Ns in the scaffolds from roughly 17% to roughly 1.2%. Closing the gaps also lowered the genome size estimate given by the Size_includeN parameter.

Parameter	scaffold file	gap closed scaffold file
Size_includeN	2587459468	2322694661
Number of Scaffolds	2427310	2427310
Longest_Seq	124995	114336
# N's per 100 kbp	17265.92	1249.37
N50	12629	11000
L50	59060	60106
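The percentages quoted for the N content follow directly from the "# N's per 100 kbp" row. The short calculation below makes the conversion explicit using the values from the table; the estimated N counts are only approximations derived from those two columns, and the helper name `n_percent` is hypothetical.

```python
def n_percent(ns_per_100kbp):
    """Convert QUAST's "# N's per 100 kbp" into a percentage of bases."""
    return ns_per_100kbp / 100_000 * 100


# Values copied from the table above.
before = {"size": 2587459468, "ns_per_100kbp": 17265.92}
after = {"size": 2322694661, "ns_per_100kbp": 1249.37}

for label, row in (("scaffold file", before), ("gap closed scaffold file", after)):
    pct = n_percent(row["ns_per_100kbp"])
    estimated_ns = row["size"] * pct / 100
    print(f"{label}: {pct:.2f}% N, about {estimated_ns:,.0f} N bases")

print("Change in Size_includeN:", after["size"] - before["size"])
```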

QUAST takes the input scaffolds and breaks them at runs of Ns to create theoretical contigs.
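The splitting step can be sketched as below. The minimum run length of 10 Ns is an assumption based on the QUAST documentation for splitting scaffolds and should be checked for the version used; the function name `split_at_n_runs` is hypothetical and this is not QUAST's own code.

```python
import re


def split_at_n_runs(scaffold, min_run=10):
    """Split a scaffold sequence at runs of at least `min_run` Ns.

    Shorter runs of Ns stay inside the resulting contig. The default
    threshold of 10 is an assumption for this sketch.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    # Drop empty pieces produced by leading or trailing N runs.
    return [piece for piece in pattern.split(scaffold) if piece]


# Toy scaffold: a 12-N gap splits it, a 3-N gap does not.
scaffold = "ACGT" + "N" * 12 + "GGCC" + "N" * 3 + "TT"
print(split_at_n_runs(scaffold))
```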

GapCloser roughly tripled the length of the longest theoretical contig (the Longest_Seq parameter) and roughly quadrupled the N50. The number of theoretical contigs also decreased, since closing a gap joins the contigs on either side of it.

Parameter	theoretical contig file	theoretical gap closed contig file
Size_includeN	2140718695	2293889058
Number of Contigs	3849281	2598123
Longest_Seq	31799	94266
N50	1814	8033
L50	315423	80940

BLAST results

May 26 2015 BLASTN

Searching only NCBI Genomic Reference Sequences

Using the 6 largest scaffolds from the full assembly (Run 3)

Aplysia californica was first when sorted by E value

Aplysia californica

Aplysia californica was the first mollusk to have its genome sequenced.

Broad Institute Aplysia Sequencing Project
