Differential meta-analysis of RNA-seq data from multiple studies

Background High-throughput sequencing is now regularly used for studies of the transcriptome (RNA-seq), particularly for comparisons among experimental conditions. For the time being, a limited number of biological replicates are typically considered in such experiments, leading to low detection power for differential expression. As their cost continues to decrease, it is likely that additional follow-up studies will be conducted to re-address the same biological question. Results We demonstrate how p-value combination techniques previously used for microarray meta-analyses can be used for the differential analysis of RNA-seq data from multiple related studies. These techniques are compared to a negative binomial generalized linear model (GLM) including a fixed study effect on simulated data and real data on human melanoma cell lines. The GLM with fixed study effect performed well for low inter-study variation and small numbers of studies, but was outperformed by the meta-analysis methods for moderate to large inter-study variability and larger numbers of studies. Conclusions The p-value combination techniques illustrated here are a valuable tool to perform differential meta-analyses of RNA-seq data by appropriately accounting for biological and technical variability within studies as well as additional study-specific effects. An R package metaRNASeq is available on the CRAN (http://cran.r-project.org/web/packages/metaRNASeq).


Introduction
High-throughput sequencing (HTS) data, such as RNA-sequencing (RNA-seq) data, are increasingly used to conduct differential analyses, in which gene-by-gene statistical tests are performed in order to identify genes whose expression levels show systematic covariation with a particular condition, such as a treatment or phenotype of interest. Due to their large cost, however, only few biological replicates are often considered in each experiment leading to a low detection power of differentially expressed genes. For this reason, analyzing data arising from several experiments studying the same question can be a useful way to increase detection power for the identification of differentially expressed genes.
The metaRNASeq package implements two p-value combination techniques (inverse normal and Fisher methods); see [4] for additional details. There are two fundamental assumptions behind the use of these p-value combination procedures: first, that p-values have been obtained the same way for each experiment (i.e., using the same model and test); and second, that they follow a uniform distribution under the null hypothesis. In this vignette, we illustrate these p-value combination techniques after obtaining p-values for differential expression in each individual experiment using the DESeq2 Bioconductor package [1]. Count data are simulated using the sim.function provided in the metaRNASeq package; see section 2 for additional detail.

Simulation study
To begin, we load the necessary packages and simulation parameters: These simulation parameters include the following information: • param: Matrix of dimension (26408 × 3) containing mean expression in each of two conditions (here, labeled "condition 1" and "condition 2") and a logical vector indicating the presence or absence of differential expression for each of 26,408 genes • dispFuncs: List of length 2, where each list is a vector containing the two estimated coefficients (α 0 and α 1 ) for the gamma-family generalized linear model (GLM) fit by DESeq (version 1.8.3) describing the mean-dispersion relationship for each of the two real datasets considered in [4]. These regressions represent the typical relationship between mean expression values µ and dispersions α in each dataset, where the coefficients α 0 and α 1 are found to parameterize the fit as α = α 0 +α 1 /µ.
These parameters were calculated on real data sets from two human melanoma cell lines [5], corresponding to two different studies performed for the same cell line comparison, with two biological replicates per cell line in the first and three per cell line in the second. These data are presented in greater detail in [5] and [2], and are freely available in the Supplementary Materials of the latter.
Once parameters are loaded, we simulate data. We use the set.seed function to obtain reproducible results.

Individual analyses of the two simulated data sets
Before performing a combination of p-values from each study, it is necessary to perform a differential analysis of the individual studies (using the same method). In the following example, we make use of the DESeq2 package to obtain p-values for differential analyses of each study independently; however, we note that other differential analysis methods (e.g., edgeR or baySeq) could be used prior to the meta analysis.

Differential analysis of each individual study with DESeq2
Inputs to DEseq2 methods can be extracted with extractfromsim for each individual study whose name appears in the column names of matsim, see the following example for study1 and study2.
> if (requireNamespace("DESeq2", quietly = TRUE)) { + dds1 <-DESeq2::DESeqDataSetFromMatrix(countData = simstudy1$study, + colData = simstudy1$pheno,design =~condition) + res1 <-DESeq2::results(DESeq2::DESeq(dds1)) + dds2 <-DESeq2::DESeqDataSetFromMatrix(countData = simstudy2$study, + colData = simstudy2$pheno,design =~condition) + res2 <-DESeq2::results(DESeq2::DESeq(dds2)) + } We recommand to store both p-value and Fold Change results in lists in order to perform meta-analysis and keep track of the potential conflicts (see section 5) > if (exists("res1") && exists("res2")) Differentially expressed genes in each individual study can also be marked in a matrix DE: DE returns a matrix with 1 for genes identified as differentially expressed and 0 otherwise (one column per study) Since the proposed p-value combination techniques rely on the assumption that pvalues follow a uniform distribution under the null hypothesis, it is necesary to check that the histograms of raw-pvalues reflect that assumption: > par(mfrow = c(1,2)) > hist(rawpval[ [1]], breaks=100, col="grey", main="Study 1", xlab="Raw p-values") > hist(rawpval[ [2]], breaks=100, col="grey", main="Study 2", xlab="Raw p-values") The peak near 0 corresponds to differentially expressed genes, no other peak should appear. Sometimes another peak may appear due to genes with very low values of expression which often lead to an enrichment of p-values close to 1 as they take on discrete values. As such genes are unlikely to display evidence for differential expression, it is recommended to perform an independent filtering. The application of such a filter typically removes those genes contributing to a peak of p-values close to 1, leading to a distribution of p-values under the null hypothesis more closely following a uniform distribution. As the proposed p-value combination techniques rely on this assumption, it is sometimes necessary to independently filter genes with very low read counts.
In this example the results function of DESeq2 performs an automatic independent filtering. If a row is filtered by independent filtering, then only the adjusted p-value will be set to NA, and the graphic of raw p-values does not change. In order to have a distribution of raw p-values under the null hypothesis following a uniform distribution, we must manually set the corresponding raw p-values to NA.

Use of p-value combination techniques
The code in this section may be used independently from the previous section if p-values from each study have been obtained using the same differential analysis test between the different studies. Vectors of p-values must have the same length; rawpval is a list (or data.frame) containing the vectors of raw p-values obtained from the individual differential analyses of each study. The p-value combination using the Fisher method may be performed with the fishercomb function, and the subsequent p-values obtained from the meta-analysis may be examined (Figure 3, left): Figure 2: Histograms of raw p-values for each of the individual differential analyses performed using the independent filtering from DESeq2 package.

Treatment of conflicts in differential expression
As pointed out in [4], it is not possible to directly avoid conflicts between over-and under-expressed genes in separate studies that appear in differential meta-analyses of RNA-seq data. We thus advise checking that individual studies identify differential expression in the same direction (i.e., if in one study, a gene is identified as differentially over-expressed in condition 1 as compared to condition 2, it should not be identified as under-expressed in condition 1 as compared to condition 2 in a second study). Genes displaying contradictory differential expression in separate studies should be removed from the list of genes identified as differentially expressed via meta-analysis. We build a matrix signsFC gathering all signs of fold changes from individual studies.
> signsFC <-mapply(FC, FUN=function(x) sign(x)) > sumsigns <-apply(signsFC,1,sum) > commonsgnFC <-ifelse(abs(sumsigns)==dim(signsFC) [2], sign(sumsigns),0) The vector commonsgnFC will return a value of 1 if the gene has a positive log 2 fold change in all studies, -1 if the gene has a negative log 2 fold change in all studies, and 0 if contradictory log 2 fold changes are observed across studies (i.e., positive in one and negative in the other). By examining the elements of commonsgnFC, it is thus possible to identify genes displaying contradictory differential expression among studies. In this example, the p-value combination technique with Fisher's method gives 18 (1.44%) new genes and 0 (0%), are sidetracked. The inverse normal combination technique gives 22 (1.81%) new genes and 35 (2.85%), are sidetracked To compare visually the number of differentially expressed genes in individual studies or in meta-analysis, it is also possible to draw a Venn diagram, for example with the VennDiagram package. /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0