RNA-seq Data Analysis with edgeR (2024)

Introduction

Differential gene expression (DGE) analysis is commonly used in the transcriptome-wide analysis (using RNA-seq) forstudying the changes in gene or transcripts expressions under different conditions (e.g. control vs infected).
RNA sequencing (bulk and single-cell RNA-seq) using next-generation sequencing (e.g. Illumina short-read sequencing)is a de facto method for quantifying the transcriptome-wide gene or transcript expressions and performing DGE analysis.
There are several computational tools are available for DGE analysis. In this article, I will cover edgeRfor DGE analysis.

Check DGE analysis using DESeq2

DGE analysis using edgeR

The standard workflow for DGE analysis involves the following steps

RNA-seq with a sequencing depth of 10-30 M reads per library (at least 3 biological replicates per sample)
aligning or mapping the quality-filtered sequenced reads to respective genome (e.g. HISAT2 or STAR). You can read my articleon how to map RNA-seq reads using STAR
quantifying reads that are mapped to genes or transcripts (e.g. featureCounts, RSEM, HTseq)
Actual raw integer read counts (un-normalized) are then used for DGE analysis using edgeR. edgeR prefers the rawinteger read counts, but it can also work with expected counts obtained from RSEM.

Let’s start now for analyzing the gene expression data using edgeR,

Install edgeR

Install edgeR (follow this step if you have not installed edgeR before)

# R version 4.2.0 (2022-04-22 ucrt)if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")BiocManager::install("edgeR")

Import table of gene expression counts

Import the gene expression counts (read counts) as a matrix or a dataframe,

# load library # edgeR version 3.38.0library("edgeR")# get count datasetcount_matrix <- as.matrix(read.csv("https://reneshbedre.github.io/assets/posts/gexp/df_sc.csv", row.names = "gene"))# view first two rowshead(count_matrix, 2) ctr1 ctr2 ctr3 trt1 trt2 trt3 lengthSobic.001G000200 338 324 246 291 202 168 1982Sobic.001G000400 49 21 53 16 16 11 4769# drop length columncount_matrix <- count_matrix[, -7]head(count_matrix, 2) ctr1 ctr2 ctr3 trt1 trt2 trt3Sobic.001G000200 338 324 246 291 202 168Sobic.001G000400 49 21 53 16 16 11

Create DGEList data class

DGEList data class contains the integer count and sample information. DGEList is a list-based data object which is easy tomanipulate in R.

Create a sample information for the count data,

sample_info <- c("ctr", "ctr", "ctr", "trt", "trt", "trt")

Now, create DGEList data class for count and sample information,

Filter out the genes with low counts

It is recommended to filter out the genes which have low expression counts across all samples. These genes may not haveenough evidence for differential gene expression. The removal of low count genes also reduces the number of tests andultimately multiple hypothesis testing corrections.

We will use the filterByExpr function to remove the low count genes. This function keeps the genes with a worthwhilecounts (at least 10 read counts) in a minimal number of samples. Here the minimal number of samples is 3 i.e. number of samplesin each group is 3.

keep <- filterByExpr(y = dge)dge <- dge[keep, , keep.lib.sizes=FALSE]# outputAn object of class "DGEList"$counts ctr1 ctr2 ctr3 trt1 trt2 trt3Sobic.001G000200 338 324 246 291 202 168Sobic.001G000400 49 21 53 16 16 11Sobic.001G000700 39 49 30 46 52 25Sobic.001G000800 530 530 499 499 386 264Sobic.001G001000 12 3 4 3 10 713828 more rows ...$samples group lib.size norm.factorsctr1 ctr 3343275 1ctr2 ctr 3171392 1ctr3 ctr 3278446 1trt1 trt 3130360 1trt2 trt 2709917 1trt3 trt 1755290 1

Recalculating the library sizes (keep.lib.sizes=FALSE) for each sample is recommended following the filtering step.After filtering, library sizes will be slightly changed for each sample.

Normalization and effective library sizes

edgeR normalizes the read counts for varying library sizes (sample-specific effect) by finding a scaling (normalization) factor foreach sample. The normalization is performed using the TMM (Trimmed Mean of M-values) between-sample normalization method.TMM assumes most of the genes are not differentially expressed. The effective library sizes are then calculated byusing the scaling factors.

calcNormFactors function is used for TMM normalization and calculating normalization(scaling) factors

dge <- calcNormFactors(object = dge)# outputAn object of class "DGEList"$counts ctr1 ctr2 ctr3 trt1 trt2 trt3Sobic.001G000200 338 324 246 291 202 168Sobic.001G000400 49 21 53 16 16 11Sobic.001G000700 39 49 30 46 52 25Sobic.001G000800 530 530 499 499 386 264Sobic.001G001000 12 3 4 3 10 713828 more rows ...$samples group lib.size norm.factorsctr1 ctr 3343275 1.0473246ctr2 ctr 3171392 0.9815466ctr3 ctr 3278446 1.0421847trt1 trt 3130360 0.9837185trt2 trt 2709917 0.9736776trt3 trt 1755290 0.9744892

The normalization factors are an indicator of the status of gene expression. If the normalization factor is < 1 for some samples, itindicates a small number of high count genes are abundant in that sample and vice versa. The product of all normalizationfactors is equal to 1.

Model fitting and estimating dispersions

The edgeR uses the quantile-adjusted conditional maximum likelihood (qCML) method for single-factor design expressionanalysis. qCML method uses the pseudo count for adjusting the library sizes for all samples. edgeR internally calculatespseudo count for speeding up the analysis for negative binomial (NB) dispersion estimation and exact test for pairwise comparison.

The dispersion is estimated using estimateDisp() function. It estimates both common and tagwise dispersions in onerun.

dge <- estimateDisp(y = dge)# outputAn object of class "DGEList"$counts ctr1 ctr2 ctr3 trt1 trt2 trt3Sobic.001G000200 338 324 246 291 202 168Sobic.001G000400 49 21 53 16 16 11Sobic.001G000700 39 49 30 46 52 25Sobic.001G000800 530 530 499 499 386 264Sobic.001G001000 12 3 4 3 10 713828 more rows ...$samples group lib.size norm.factorsctr1 ctr 3343275 1.0473246ctr2 ctr 3171392 0.9815466ctr3 ctr 3278446 1.0421847trt1 trt 3130360 0.9837185trt2 trt 2709917 0.9736776trt3 trt 1755290 0.9744892$common.dispersion[1] 0.06972943$trended.dispersion[1] 0.05246027 0.10467010 0.09078381 0.05013202 0.1228066513828 more elements ...$tagwise.dispersion[1] 0.03726749 0.08900702 0.05944770 0.02538477 0.2119479413828 more elements ...$AveLogCPM[1] 6.507202 3.302021 3.883089 7.283077 1.57039013828 more elements ...$trend.method[1] "locfit"$prior.df[1] 3.945552$prior.n[1] 0.9863879$span[1] 0.2950908

Testing for differential gene expression

After NB model fitting and dispersion estimation, the exact test (exact p values) is used for testing the differentialexpression of genes between two condition counts data.

The exact test is performed using the exactTest() function and is applied only on a single factor design. exactTest()function outputs a DGEExact object containing the differential expression table (containing columns for log fold change,logCPM, and two-tailed p values), comparison vector (comparing groups), and genesdata frame (gene annotation).

et <- exactTest(object = dge)An object of class "DGEExact"$table logFC logCPM PValueSobic.001G000200 -0.01932342 6.507202 0.94991425Sobic.001G000400 -1.04663708 3.302021 0.01929426Sobic.001G000700 0.47456757 3.883089 0.17954409Sobic.001G000800 -0.01486902 7.283077 0.94319628Sobic.001G001000 0.59323937 1.570390 0.4782896113828 more rows ...$comparison[1] "ctr" "trt"$genesNULL

topTags() function is useful to extract the table with adjusted p values (FDR). The output table is ordered byp values.

top_degs = topTags(object = et, n = "Inf")top_degs# outputComparison of groups: trt-ctr logFC logCPM PValue FDRSobic.006G070564 5.1400611 3.308905 4.157243e-23 5.750715e-19Sobic.003G079200 3.9750417 4.407321 4.527009e-21 3.131106e-17Sobic.007G191200 3.8315481 6.609410 1.079306e-15 4.976681e-12Sobic.010G095100 1.9839749 7.296541 1.905123e-15 6.588393e-12Sobic.004G086500 4.1189965 4.615163 3.795766e-15 1.050137e-11Sobic.004G222000 2.5899849 5.013418 1.395054e-14 3.216296e-11Sobic.010G082600 4.6446796 3.485784 1.708966e-14 3.377162e-11Sobic.004G121200 2.5401701 4.023315 2.491992e-14 4.308965e-11..

As the comparison of groups is trt-ctr, the positive log fold change represents the gene is more highlyexpressed in the trt condition as compared to the ctr condition and vice versa.

Get a summary DGE table (returns significant genes with absolute log fold change at least 1 and adjusted p value < 0.05),

summary(decideTests(object = et, lfc = 1)) trt-ctrDown 177NotSig 13367Up 289

The differential gene expression of the sugarcane RNA-seq dataset suggests that 289 and 177 genes were significantly upregulatedand downregulated respectively in response to the infected (trt) condition.

Export differential gene expression analysis table to CSV file,

write.csv(as.data.frame(top_degs), file="condition_infected_vs_control_dge.csv")

DGE Visualization

I will visualize the DGE using Volcano plot using Python,

Volcano plot,

# import packagesimport pandas as pdfrom bioinfokit import visuz# import the DGE table (condition_infected_vs_control_dge.csv)df = pd.read_csv("condition_infected_vs_control_dge.csv")# drop NA valuesdf=df.dropna()# create volcano plotvisuz.GeneExpression.volcano(df=df, lfc='logFC', pv='FDR', sign_line=True, plotlegend=True)

If you want to create a heatmap, check this article

Visualize the MA plot,

plotMD(object = et)

Enhance your skills with courses on genomics and bioinformatics

References

Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digitalgene expression data. Bioinformatics. 2010 Jan 1;26(1):139-40.
edgeR: differential analysis of sequence read count data

If you have any questions, comments or recommendations, please email me atreneshbe@gmail.com

This work is licensed under a Creative Commons Attribution 4.0 International License

Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase.The retailer will pay the commission at no additional cost to you.