RNA sequencing

RNAseq applications and main issues

This technology is a good way to obtain information about gene expression. The lecturer also stated that, in terms of the information acquired, it is comparable to protein expression analysis. RNAseq also allows investigation of the metabolic state of the tissue or cell by studying RNA-protein interactions (CLIPseq), small RNAs and alternative splicing. The main issues with RNAseq include all of those of DNA sequencing, such as the number, quality and length of the reads and PCR duplicates, plus sample and library preparation issues such as ribosomal RNA elimination, and analysis issues such as mapping across splice junctions and normalization among samples. For annotated organisms, RNAseq applies to:

For non-annotated genomes, like that of Ariolimax dolichophallus, the applications are:

The biggest difference between RNAseq and DNA sequencing is that different cells in the body have different RNA expression levels for each gene, whereas nearly every cell in the body carries the same DNA.

Mapping of RNAseq reads

When mapping reads in either RNA or DNA sequencing, a read may map to multiple places. The mapping algorithm therefore has to allow for mismatches, multiple mappings and low-quality reads, and it should not penalize a mismatch at a position where the sequencer was not confident about the base call. All of this information needs to be taken into account. The process also needs to be fast, which means it should be parallelizable. RNAseq mapping adds further requirements: mapping across splice junctions, and handling highly expressed genes (rRNA, tRNA), genes related by duplication (paralogs), pseudogenes and introns. Algorithms used for RNAseq must also account for the fact that, due to splicing, parts of a gene (introns, and any exons skipped by alternative splicing) are absent from the mature transcript, and this may affect the biological interpretation.
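The idea of not penalizing mismatches at low-confidence bases can be illustrated with a small sketch. This is a brute-force toy, not how any real mapper works: the function names, the Phred+33 encoding and the quality threshold of 20 are all assumptions made for the example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def count_confident_mismatches(read, qualities, reference, position, min_quality=20):
    """Count mismatches at `position`, ignoring bases with quality below `min_quality`."""
    window = reference[position:position + len(read)]
    if len(window) < len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = 0
    for base, quality, ref_base in zip(read, qualities, window):
        # A mismatch only counts when the sequencer was confident about the base.
        if base != ref_base and quality >= min_quality:
            mismatches += 1
    return mismatches


def candidate_positions(read, quality_string, reference, max_mismatches=1, min_quality=20):
    """Return every reference position where the read fits within the mismatch budget."""
    qualities = phred_scores(quality_string)
    hits = []
    for position in range(len(reference) - len(read) + 1):
        n = count_confident_mismatches(read, qualities, reference, position, min_quality)
        if n <= max_mismatches:
            hits.append(position)
    return hits


if __name__ == "__main__":
    reference = "AAACGTTTTACGATTT"
    # All bases confident: only the exact match is reported.
    print(candidate_positions("ACGT", "IIII", reference, max_mismatches=0))
    # Last base has low quality ('#'), so a second location becomes acceptable.
    print(candidate_positions("ACGT", "III#", reference, max_mismatches=0))
```

In the second call the read becomes a multi-mapper, which is exactly the situation a mapper has to report or resolve. Real tools use indexed data structures rather than scanning every position.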

RNAseq mappers

The algorithms used to map RNAseq reads either map to the genome, pre-split the reads (TopHat1, TopHat2), map to the transcriptome (TopHat2), or split the reads during the mapping itself (STAR).
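To make the read-splitting idea concrete, here is a toy sketch of how a read that spans a splice junction can be placed on the genome by cutting it in two and mapping each piece separately. It is not the TopHat or STAR algorithm: the function names, the exact-match requirement, and the `min_anchor` and `max_intron` parameters are assumptions chosen for illustration.

```python
def find_all(text, pattern):
    """Return the start positions of every exact occurrence of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def spliced_alignments(read, genome, min_anchor=4, max_intron=1000):
    """Try every split point and return (left_start, split, right_start) tuples.

    The left piece must match exactly, and the right piece must match exactly
    further downstream, with a gap (the putative intron) of 1..max_intron bases.
    """
    results = []
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        for left_start in find_all(genome, left):
            left_end = left_start + split
            for right_start in find_all(genome, right):
                gap = right_start - left_end
                if 0 < gap <= max_intron:
                    results.append((left_start, split, right_start))
    return results


if __name__ == "__main__":
    exon1, intron, exon2 = "ATGGCCAAGT", "GTAAGTCCCTTTCAG", "CTGAACGGTA"
    genome = exon1 + intron + exon2
    read = exon1[-5:] + exon2[:5]  # read from the spliced transcript
    print(read in genome)          # the read does not map contiguously
    print(spliced_alignments(read, genome))
```

Short anchors easily match by chance, which is one reason real mappers require a minimum anchor length and use known splice-site signals or annotation to rank candidate junctions.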

TopHat2

STAR

Other mappers

Comparing expression

FPKM

FPKM stands for “Fragments Per Kilobase of gene length per Million mapped reads”. Briefly, the idea is to normalize the number of fragments (read pairs) by the length of the gene as well as by the total number of mapped reads. Although this idea is intuitive, the literature suggests that the metric is controversial and should be abandoned (Dillies et al., 2012). In addition to FPKM, Dillies et al., 2012 discuss six other normalization methods for differential expression analysis. Properties such as the variance and the median per sample are important when choosing the method to use in expression analysis. Dillies and collaborators also state that methods such as Total Count (TC), RPKM, Upper Quartile (UQ), Median (Med) and Quantile (Q) increase the false-positive rate if used for normalization, whereas methods such as TMM and DESeq “are able to control the false-positive rate and detect differentially-expressed genes”.
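As a worked illustration of the definition, FPKM can be computed as the fragment count multiplied by 10^9 and divided by the product of gene length in bases and total mapped fragments. The function and argument names below are hypothetical and the numbers are made up.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of gene length per million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Same gene (2 kb) in two libraries of different depth.
    shallow = fpkm(fragment_count=500, gene_length_bp=2000, total_mapped_fragments=10_000_000)
    deep = fpkm(fragment_count=1000, gene_length_bp=2000, total_mapped_fragments=20_000_000)
    print(shallow, deep)
```

Both libraries give the same value because doubling the sequencing depth doubles both the gene's count and the library size. The weakness of this scheme is that the library size can be dominated by a few very highly expressed genes, which is part of the criticism summarized above.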

DESeq

One advantage this method has over FPKM is that it uses replicates per condition. It estimates the variance as a function of the average gene expression, which can be plotted against the expression level. It normalizes the counts of each gene in a sample using the median relative expression (the median, across genes, of each gene's count relative to its geometric mean over samples) and estimates the dispersion of the gene expression. The method therefore combines normalization and dispersion estimates to assess the significance of differential expression between conditions. The name DESeq refers to differential expression analysis of sequencing data; it is distributed as an R package in Bioconductor.
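The normalization step can be sketched in a few lines. The following is a minimal Python illustration of median-of-ratios size factors, not the DESeq R code itself. The function names and the toy counts are assumptions, and the dispersion estimation and significance testing are not covered.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c <= 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts, factors):
    """Divide each sample's counts by that sample's size factor."""
    return {g: [c / f for c, f in zip(cs, factors)] for g, cs in counts.items()}


if __name__ == "__main__":
    # Toy data: sample 2 was sequenced twice as deeply as sample 1.
    counts = {"g1": [10, 20], "g2": [100, 200], "g3": [50, 100], "g4": [0, 5]}
    factors = size_factors(counts)
    print(factors)
    print(normalize(counts, factors))
```

Because the factor is a median over genes, a handful of very highly expressed or strongly changing genes has little influence on it, unlike the total-count scaling used by FPKM.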

Tools for alternative splicing