lecture_notes:05-06-2015

RNA sequencing

RNAseq applications and main issues

RNAseq is a good way to obtain information about gene expression. The lecturer also stated that it is comparable to protein expression analysis in the information it provides. RNAseq also allows investigation of the metabolic state of a tissue or cell by studying RNA-protein interactions (CLIPseq), small RNAs and alternative splicing. The main issues with RNAseq include all of those of DNAseq, such as the number, quality and length of the reads and PCR duplicates, plus RNA-specific issues: sample and library preparation (for example, ribosomal RNA elimination), mapping across splice junctions, and normalization among samples. For annotated organisms, RNAseq applies to:

  • Relative gene expression
  • Alternative expression of specific gene isoforms
  • Comparison of gene and isoform expression
  • Non-annotated gene expression

For non-annotated genomes such as that of Ariolimax dolichophallus, the applications are:

  • Expression of small RNAs
  • Expression of specific alleles
  • Information about the process of RNA editing

The biggest difference between RNAseq and DNA sequencing is that different cells in the body have different RNA expression levels for each gene, whereas every cell in the body has essentially the same DNA.

Mapping of RNAseq reads

When mapping reads in either RNA or DNA sequencing, a read may map to multiple places. The mapping algorithm therefore has to handle mismatches, multiple mappings and low-quality reads, and it may choose not to penalize a mismatch when the sequencer was not confident about that specific base. All of this information needs to be taken into account. The process also needs to be fast, which in practice means it should be parallelizable. RNAseq mapping adds further requirements: mapping across splice junctions, and dealing with highly expressed genes (rRNA, tRNA), genes related by duplication (paralogs), pseudogenes and introns. Algorithms for RNAseq must also account for the fact that, because of splicing, parts of a gene (the introns) are missing from the mature transcript, which can affect the biological interpretation.
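One way to picture "not penalizing a mismatch when the sequencer was not confident" is to weight each mismatch by the Phred quality of the base. The sketch below is only an illustration under that assumption; the function names and the weighting scheme are hypothetical and are not taken from any particular mapper.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def quality_weighted_mismatches(read, ref, quals):
    """Sum mismatch penalties, scaling each one by confidence in the base call.

    A mismatch at a high-quality base costs close to 1; a mismatch at a
    low-quality base costs much less. (Illustrative scoring scheme only.)
    """
    if not (len(read) == len(ref) == len(quals)):
        raise ValueError("read, ref and quals must have the same length")
    penalty = 0.0
    for read_base, ref_base, q in zip(read, ref, quals):
        if read_base != ref_base:
            penalty += 1.0 - phred_to_error_prob(q)
    return penalty
```

For example, a single mismatch at a base with quality 10 is penalized less than the same mismatch at a base with quality 40.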

RNAseq mappers

The algorithms used to map RNAseq reads take one of several approaches: mapping to the genome, pre-splitting the reads (TopHat 1 and 2), mapping to the transcriptome (TopHat2), or splitting the reads during mapping (STAR).

TopHat2

  • TopHat is a mapper for splice junctions
  • The junctions are the canonical splice sites (GU/AG are favored)
  • Reads are split into thirds (note that your reads must be long enough that, when split into thirds, they still contain meaningful information)
  • Aligns RNAseq reads using Bowtie
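
The read-splitting idea and the preference for canonical splice sites can be sketched as follows. This is not TopHat's code: the helper names are hypothetical, the split into three segments simply follows the notes above, and the genome is treated as a plain DNA string, so the GU/AG signal appears as GT/AG.

```python
def split_into_segments(read, n_segments=3):
    """Split a read into n roughly equal segments; the last one absorbs any remainder."""
    seg_len = len(read) // n_segments
    if seg_len == 0:
        raise ValueError("read is too short to split into that many segments")
    segments = [read[i * seg_len:(i + 1) * seg_len] for i in range(n_segments - 1)]
    segments.append(read[(n_segments - 1) * seg_len:])
    return segments


def is_canonical_intron(genome, intron_start, intron_end):
    """Check whether genome[intron_start:intron_end] begins with GT and ends with AG."""
    intron = genome[intron_start:intron_end]
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")
```

In this picture, each segment would be aligned on its own (by Bowtie, in TopHat's case), and a gap between neighbouring segments would be a candidate intron to test with `is_canonical_intron`. The shorter the read, the less informative each segment becomes, which is why read length matters here.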

STAR

  • Stands for Spliced Transcripts Alignment to a Reference
  • Very fast compared to TopHat (reported to outperform other aligners by a factor of 50 in mapping speed; Dobin et al., 2013)
  • Does on-the-fly splicing
  • Runs iteratively
  • Discriminates “chimeric” mappings

Other mappers

  • BWA (Burrows-Wheeler Aligner): a package for mapping low-divergent sequences
  • MapSplice: maps RNA-seq reads aiming to discover splice junctions
  • Bowtie: aligner for short reads to genomes. Bowtie is the engine for mapping reads that TopHat uses.

Comparing expression

FPKM

FPKM stands for “Fragments Per Kilobase of transcript per Million mapped reads”. The idea is to normalize the fragment count (a fragment being a read pair) by the length of the gene as well as by the total number of mapped reads. Although this seems intuitive, the metric is controversial in the literature, and Dillies et al., 2012 argue that it should be abandoned. In addition to FPKM, Dillies et al., 2012 discuss six other normalization methods used in differential expression analysis. Properties such as the per-sample variance and median are important when choosing which metric to use in expression analysis. Dillies and collaborators also state that methods such as Total Count (TC), RPKM, UQ, Med and Q increase the false-positive discovery rate if used as the normalization method, whereas methods such as TMM and DESeq “are able to control the false-positive rate and detect differentially-expressed genes”.
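
As a worked illustration of the definition, the sketch below computes FPKM from raw fragment counts. The function name, argument names and toy numbers are made up for this example; real pipelines take the counts and the total number of mapped fragments from the aligner output.

```python
def fpkm(fragment_counts, gene_lengths_bp, total_mapped_fragments):
    """Compute FPKM per gene.

    FPKM = fragments / (gene length in kb * total mapped fragments in millions)
         = fragments * 1e9 / (gene length in bp * total mapped fragments)
    """
    if total_mapped_fragments <= 0:
        raise ValueError("total_mapped_fragments must be positive")
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total_mapped_fragments)
        for gene, count in fragment_counts.items()
    }


# Toy data: geneB is three times longer and collects three times as many fragments.
counts = {"geneA": 500, "geneB": 1500}
lengths = {"geneA": 1000, "geneB": 3000}
values = fpkm(counts, lengths, total_mapped_fragments=2_000_000)
```

In the toy data both genes end up with the same FPKM, which is the point of the length normalization: a longer gene produces more fragments at the same expression level.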

DESeq

One advantage this method has over FPKM is that it uses replicates for each condition. It estimates variance based on the average gene expression, plotting the result against the expression level. It then uses the median relative expression to normalize the counts of each gene in the sample and estimates the dispersion in gene expression. The method therefore uses dispersion and normalization to estimate the significance of differential expression between conditions. DESeq stands for Differential Expression Sequencing and is distributed as an R package that is part of Bioconductor.
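
The normalization step ("median relative expression") is commonly described as a median-of-ratios size factor. The sketch below shows that idea in plain Python; it is not the DESeq package's code, the function name and input layout are assumptions made for this example, and the dispersion estimation is left out entirely.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample by the median-of-ratios idea.

    counts: dict mapping sample name -> list of raw gene counts (same gene order).
    For each gene, a reference value is the geometric mean across samples; a
    sample's size factor is the median of its counts relative to that reference.
    Genes with a zero count in any sample are skipped.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_refs = {}
    for g in range(n_genes):
        gene_counts = [counts[s][g] for s in samples]
        if all(c > 0 for c in gene_counts):
            log_refs[g] = sum(math.log(c) for c in gene_counts) / len(gene_counts)
    if not log_refs:
        raise ValueError("no gene has non-zero counts in every sample")
    return {
        s: math.exp(median(math.log(counts[s][g]) - ref for g, ref in log_refs.items()))
        for s in samples
    }


# Toy data: sample s2 was sequenced twice as deeply as s1.
factors = size_factors({"s1": [10, 20, 30], "s2": [20, 40, 60]})
```

Dividing each sample's raw counts by its size factor puts the samples on a common scale before any test for differential expression. Taking the median makes the factor robust to a handful of very highly expressed genes.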

Tools for alternative splicing

  • MISO (Mixture of Isoforms): takes mapped reads as input and quantitates the expression of alternatively spliced genes.
  • SpliceTrap
  • CuffDiff: tests differential expression and regulation, taking into account biases in different library preparation protocols.
  • rMATS
lecture_notes/05-06-2015.txt · Last modified: 2015/05/08 18:57 by ndudek