 Research article
 Open Access
 Published:
Metaanalysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline
BMC Bioinformatics volume 14, Article number: 368 (2013)
Abstract
Background
As highthroughput genomic technologies become accurate and affordable, an increasing number of data sets have been accumulated in the public domain and genomic information integration and metaanalysis have become routine in biomedical research. In this paper, we focus on microarray metaanalysis, where multiple microarray studies with relevant biological hypotheses are combined in order to improve candidate marker detection. Many methods have been developed and applied in the literature, but their performance and properties have only been minimally investigated. There is currently no clear conclusion or guideline as to the proper choice of a metaanalysis method given an application; the decision essentially requires both statistical and biological considerations.
Results
We performed 12 microarray metaanalysis methods for combining multiple simulated expression profiles, and such methods can be categorized for different hypothesis setting purposes: (1) HS_{ A }: DE genes with nonzero effect sizes in all studies, (2) HS_{ B }: DE genes with nonzero effect sizes in one or more studies and (3) HS_{ r }: DE gene with nonzero effect in "majority" of studies. We then performed a comprehensive comparative analysis through six largescale real applications using four quantitative statistical evaluation criteria: detection capability, biological association, stability and robustness. We elucidated hypothesis settings behind the methods and further apply multidimensional scaling (MDS) and an entropy measure to characterize the metaanalysis methods and data structure, respectively.
Conclusions
The aggregated results from the simulation study categorized the 12 methods into three hypothesis settings (HS_{ A }, HS_{ B }, and HS_{ r }). Evaluation in real data and results from MDS and entropy analyses provided an insightful and practical guideline to the choice of the most suitable method in a given application. All source files for simulation and real data are available on the author’s publication website.
Background
Microarray technology has been widely used to identify differential expressed (DE) genes in biomedical research in the past decade. Many transcriptomic microarray studies have been generated and made available in public domains such as the Gene Expression Omnibus (GEO) from NCBI (http://www.ncbi.nlm.nih.gov/geo/) and ArrayExpress from EBI (http://www.ebi.ac.uk/arrayexpress/). From the databases, one can easily obtain multiple studies of a relevant biological or disease hypothesis. Since a single study often has small sample size and limited statistical power, combining information across multiple studies is an intuitive way to increase sensitivity. Ramasamy, et al. proposed a sevenstep practical guidelines for conducting microarray metaanalysis [1]: "(i) identify suitable microarray studies; (ii) extract the data from studies; (iii) prepare the individual datasets; (iv) annotate the individual datasets; (v) resolve the manytomany relationship between probes and genes; (vi) combine the studyspecific estimates; (vii) analyze, present, and interpret results". In the first step although theoretically metaanalysis increases the statistical power to detect DE genes, the performance can be deteriorated if problematic or heterogeneous studies are combined. In many applications, the data inclusion/exclusion criteria are based on adhoc expert opinions, a naïve sample size threshold or selection of platforms without an objective quality control procedure. Kang et al. proposed six quantitative quality control measures (MetaQC) for decision of study inclusion [2]. Step (ii)(v) are related to data preprocessing. Finally, Step (vi) and (vii) involve the selection of metaanalysis method and interpretation of the result and are the foci of this paper.
Many microarray metaanalysis methods have been developed and applied in the literature. According to a recent review paper by Tseng et al. [3], popular methods mainly combine three different types of statistics: combine pvalues, combine effect sizes and combine ranks. In this paper, we include 12 popular as well as stateoftheart methods in the evaluation and comparison. Six methods (Fisher, Stouffer, adaptively weighted Fisher, minimum pvalue, maximum pvalue and rth ordered pvalue) belonged to the pvalue combination category, two methods (fixed effects model and random effects model) belonged to the effect size combination category and four methods (RankProd, RankSum, product of ranks and sum of ranks) belonged to the rank combination category. Details of these methods and citations will be provided in the Method section. Despite the availability of many methods, pros and cons of these methods and a comprehensive evaluation remain largely missing in the literature. To our knowledge, Hong and Breitling [4], Campain and Yang [5] are the only two comparative studies that have systematically compared multiple metaanalysis methods. The number of methods compared (three and five methods, respectively) and the number of real examples examined (two and three examples respectively with each example covering 25 microarray studies) were, however, limited. The conclusions of the two papers were suggestive with limited insights to guide practitioners. In addition, as we will discuss in the Method section, different metaanalysis methods have different underlying hypothesis setting targets. As a result, the selection of an adequate (or optimal) metaanalysis method depends heavily on the data structure and the hypothesis setting to achieve the underlying biological goal.
In this paper, we compare 12 popular microarray metaanalysis methods using simulation and six real applications to benchmark their performance by four statistical criteria (detection capability, biological association, stability and robustness). Using simulation, we will characterize the strength of each method under three different hypothesis settings (i.e. detect DE genes in "all studies", "majority of studies" or "one or more studies"; see Method section for more details). We will compare the similarity and grouping of the metaanalysis methods based on their DE gene detection results (by using a similarity measure and multidimension scaling plot) and use an entropy measure to characterize the data structure to determine which hypothesis setting may be more adequate in a given application. Finally, we give a guideline to help practitioners select the best metaanalysis method under the choice of hypothesis setting in their applications.
Methods
Real data sets
Six example data sets for microarray metaanalysis were collected for evaluations in this paper. Each example contained 48 microarray studies. Five of the six examples were of the commonly seen twogroup comparison and the last breast cancer example contained relapsefree survival outcome. We applied the MetaQC package [2] to assess quality of the studies for metaanalysis and determined the final inclusion/exclusion criteria. The principal component analysis (PCA) biplots and the six QC measures are summarized in Additional file 1: Figure S1, Tables S2 and S3. Details of the data sets are available in Additional file 1: Table S1.
Underlying hypothesis settings
Following the classical convention of Brinbaum [6] and Li and Tseng [7] (see also Tseng et al. [3]), metaanalysis methods can be classified into two complementary hypothesis settings. In the first hypothesis setting (denoted as HS_{ A }), the goal is to detect DE genes that have nonzero effect sizes in all studies:
where θ_{ k } is the effect size of study k. The second hypothesis setting (denoted as HS_{ B }), however, aims to detect a DE gene if it has nonzero effect size in "one or more" studies:
In most applications, HS_{ A } is more appropriate to detect conserved and consistent candidate markers across all studies. However, different degrees of heterogeneity can exist in the studies and HS_{ B } can be useful to detect studyspecific markers (e.g. studies from different tissues are combined and tissue specific markers are expected and of interest). Since HS_{ A } is often too conservative when many studies are combined, Song and Tseng (2012) proposed a more practical and robust hypothesis setting (namely HS_{ r }) that targets on DE genes with nonzero effect sizes in "majority" of studies, where majority of studies is defined as, for example, more than 50% of combined studies (i.e. r ≥ 0.5⋅K). The robust hypothesis setting considered was:
A major contribution of this paper is to characterize metaanalysis methods suitable for different hypothesis settings (HS_{ A }, HS_{ B } and HS_{ r }) using simulation and real applications and to compare their performance with four benchmarks to provide a practical guideline.
Microarray metaanalysis data preprocessing
Assume that we have K microarray studies to combine. For study k (1 ≤ k ≤ K), denote by x_{ gsk } the gene expression intensity of gene g (1 ≤ g ≤ G) and sample s (1 ≤ s ≤ S_{ k }; S_{ k } the number of samples in study k), and y_{ sk } the disease/outcome variable of sample s. The disease/outcome variable can be of binary, multiclass, continuous or censored data, representing the disease state, severity or prognosis outcome (e.g. tumor versus normal or recurrence survival time). The goal of microarray metaanalysis is to combine information of K studies to detect differentially expressed (DE) genes associated with the disease/outcome variable. Such DE genes serve as candidate markers for disease classification, diagnosis or prognosis prediction and help understand the genetic mechanisms underlying a disease. In this paper, before metaanalysis we first applied penalized tstatistic to each individual study to generate pvalues or DE ranks [8] for a binary outcome. In contrast to traditional tstatistic, penalized tstatistic adds a fudge parameter s_{0} to stabilize the denominator $\left(T=\left(\overline{X}\overline{Y}\right)/(\widehat{s}+{s}_{0}\right)$; $\overline{X}$ and $\overline{Y}$ are means of case and control groups) and to avoid a large tstatistic due to small estimated variance $\widehat{s}$. The pvalues were calculated using the null distributions derived from conventional nonparametric permutation analysis by randomly permuting the case and control labels for 10,000 times [9]. For censored outcome variables, Cox proportion hazard model and logrank test were used [10]. Metaanalysis methods (described in the next subsection) were then used to combine information across studies and generate metaanalyzed pvalues. To account for multiple comparison, Benjamini and Hochberg procedure was used to control false discovery rate (FDR) [11]. All methods were implemented using the "MetaDE" package in R [12]. Data sets and all programming codes are available at http://www.biostat.pitt.edu/bioinfo/publication.htm.
Microarray metaanalysis methods
According to a recent review paper [3], microarray metaanalysis methods can be categorized into three types: combine pvalues, combine effect sizes and combine ranks. Below, we briefly describe 12 methods that were selected for comparison.
Combine pvalues
Fisher The Fisher’s method [13] sums up the logtransformed pvalues obtained from individual studies. The combined Fisher’s statistic ${\chi}_{\text{Fisher}}^{2}=2{\sum}_{i=1}^{k}\text{log}\left({P}_{i}\right)$ follows a χ^{2} distribution with 2 k degrees of freedom under the null hypothesis (assuming null pvalues are un;iformly distributed). Note that we perform permutation analysis instead of such parametric evaluation for Fisher and other methods in this paper. Smaller pvalues contribute larger scores to the Fisher’s statistic.
Stouffer Stouffer’s method [14] sums the inverse normal transformed pvalues. Stouffer’s statistics ${T}_{\text{Stouffer}}={\sum}_{i=1}^{k}{z}_{i}/\sqrt{k}({z}_{i}{\mathrm{\Phi}}^{1}\left({p}_{i}\right),$ where Φ is standard normal c.c.f) follows a standard normal distribution under the null hypothesis. Similar to Fisher’s method, smaller pvalues contribute more to the Stouffer’s score, but in a smaller magnitude.
Adaptively weighted (AW) Fisher The AW Fisher’s method [7] assigns different weights to each individual study ${T}_{\text{AW}}={\sum}_{k=1}^{K}{w}_{k}\cdot log\left({P}_{i}\right),{w}_{k}=0\phantom{\rule{0.24em}{0ex}}\text{or}\phantom{\rule{0.12em}{0ex}}1$ and it searches through all possible weights to find the best adaptive weight with the smallest derived pvalue. One significant advantage of this method is its ability to indicate which studies contribute to the evidence aggregation and elucidates heterogeneity in the metaanalysis. Details can be referred to the Additional file 1.
Minimum p value (minP) The minP method takes the minimum pvalue among the K studies as the test statistic [15]. It follows a beta distribution with degrees of freedom α = 1 and β = k under the null hypothesis. This method detects a DE gene whenever a small pvalue exists in any one of the K studies.
Maximum p value (maxP) The maxP method takes maximum pvalue as the test statistic [16]. It follows a beta distribution with degrees of freedom α = K and β = 1 under the null hypothesis. This method targets on DE genes that have small pvalues in "all" studies.
rth ordered p value (rOP) The rOP method takes the rth order statistic among sorted pvalues of K combined studies. Under the null hypothesis, the statistic follows a beta distribution with degrees of freedom α = r and β = K  r + 1. The minP and maxP methods are special cases of rOP. In Song and Tseng [17], rOP is considered a robust form of maxP (where r is set as greater than 0.5∙K) to identify candidate markers differentially expressed in "majority" of studies.
Combine effect size
Fixed effects model (FEM) FEM combines the effect size across K studies by assuming a simple linear model with an underlying true effect size plus a random error in each study.
Random effects model (REM) REM [18] extends FEM by allowing random effects for the interstudy heterogeneity in the model. Detailed formulation and inference of FEM and REM are available in the Additional file 1.
Combine rank statistics
RankProd (RP) and RankSum (RS) RankProd and RankSum are based on the common biological belief that if a gene is repeatedly at the top of the lists ordered by up or downregulation fold change in replicate experiments, the gene is more likely a DE gene [19]. Detailed formulation and algorithms are available in the Additional file 1.
Product of ranks (PR) and Sum of ranks (SR) These two methods apply a naïve product or sum of the DE evidence ranks across studies [20]. Suppose R_{ gk } represents the rank of pvalue of gene g among all genes in study k. The test statistics of PR and SR methods are calculated as $P{R}_{g}={\prod}_{k=1}^{K}{R}_{\mathit{\text{gk}}}$ and $S{R}_{g}={\sum}_{k=1}^{K}{R}_{\mathit{\text{gk}}},$ respectively. Pvalues of the test statistics can be calculated analytically or obtained from a permutation analysis. Note that the ranks taken from the smallest to largest (the choice in the method) are more sensitive than ranking from largest to smallest in the PR method, while it makes no difference to SR.
Characterization of metaanalysis methods
MDS plots to characterize the methods
The multidimensional scaling (MDS) plot is a useful visualization tool for exploring highdimensional data in a lowdimensional space [21]. In the evaluation of 12 metaanalysis methods, we calculated the adjusted DE similarity measure for every pair of methods to quantify the similarity of their DE analysis results in a given example. A dissimilarity measure is then defined as one minus the adjusted DE similarity measure and the dissimilarity measure is used to generate an MDS plot of the 12 methods. In the MDS plot, methods that are clustered in a neighborhood indicate that they produce similar DE analysis results.
Entropy measure to characterize data sets
As indicated in the Section of "Underlying hypothesis settings", selection of the most suitable metaanalysis method(s) largely depends on their underlying hypothesis setting (HS_{ A }, HS_{ B } and HS_{ r }). The selection of a hypothesis setting for a given application should be based on the experimental design, biological knowledge and the associated analytical objectives. There are, however, occasions that little prior knowledge or preference is available and an objective characterization of the data structure is desired in a given application. For this purpose, we developed a datadriven entropy measure to characterize whether a given metaanalysis data set contains more HS_{ A }type markers or HS_{ B }type markers [22]. The algorithm is described below:

1.
Apply Fisher’s metaanalysis method to combine pvalues across studies to identify the top H candidate markers. Here we used H = 1,000, H represents the rough number of DE genes (in our belief) that are contained in the data.

2.
For each selected marker, the standardized minus pvalue score for gene g in the kth study is defined as ${l}_{\mathit{\text{gk}}}=\text{log}\left({p}_{\mathit{\text{gk}}}\right)/{\sum}_{k=1}^{K}\text{log}\left({p}_{\mathit{\text{gk}}}\right).$ Note that 0 ≤ l _{ gk } ≤ 1, large l _{ gk } corresponds to more significant pvalue p _{ gk }, and ${\sum}_{k=1}^{K}{l}_{\mathit{\text{gk}}}=1.$

3.
The entropy of gene g is defined as ${e}_{g}={\sum}_{k=1}^{K}{l}_{\mathit{\text{gk}}}\phantom{\rule{0.12em}{0ex}}\text{log}\left({l}_{\mathit{\text{gk}}}\right)$. Boxplots of entropies of the top H genes are generated for each metaanalysis application (Figure 1(b)).
Intuitively, a high entropy value indicates that the gene has small pvalues in all or most studies and is of HS_{ A } or HS_{ r }type. Conversely, genes with small entropy have small pvalues in one or only few studies where HS_{ B }type methods are more adequate. When calculating l_{ gk } in step 2, we capped log(p_{ gk }) at 10 to avoid contributions of closetozero pvalues that can generate nearinfinite scores. The entropy boxplot helps determine an appropriate metaanalysis hypothesis setting if no preset biological objective exists.
Evaluation criteria
For objective quantitative evaluation, we developed the following four statistical criteria to benchmark performance of the methods.
Detection capability
The first criterion considers the number of DE genes detected by each metaanalysis method under the same preset FDR threshold (e.g. FDR = 1%). Although detecting more DE genes does not guarantee better "statistical power", this criterion has served as a surrogate of statistical power in previous comparative studies [23]. Since we do not know the underlying true DE genes, we refer to this evaluation as "detection capability" in this paper. An implicit assumption underlying this criterion is that the statistical procedure to detect DE genes in each study and the FDR control in the metaanalysis are accurate (or roughly accurate). To account for data variability in the evaluation, we bootstrapped (i.e. sampled with replacement to obtain the same number of samples in each bootstrapped dataset) the samples in each study for B = 50 times and show the plots of ean with standard error bars. In the bootstrapping, the entire sample is either selected or not so the gene dependence structure is maintained. Denote by r_{ meb } the rank of detection capability performance (the smaller the better) of method m (1 ≤ m ≤ 12) in example e (1 ≤ e ≤ 6) and in the b^{th} (1 ≤ b ≤ 12) bootstrap simulation. The mean standardized rank (MSR) for method m and example e is calculated as ${\mathit{\text{MSR}}}_{\mathit{\text{me}}}={\sum}_{b=1}^{B}\left({r}_{\mathit{\text{meb}}}/\#\text{of}\phantom{\rule{0.25em}{0ex}}\text{methods}\phantom{\rule{0.25em}{0ex}}\text{compared}\right)/B$ and the aggregated standardized rank (ASR) is calculated as ${\mathit{\text{ASR}}}_{m}={\sum}_{e=1}^{6}{{\mathit{\text{MSR}}}_{m}}_{e}/6,$ representing the overall performance of method m across all six examples. Additional file 1: Table S4 shows the MSR and ASR of all 12 methods and Figure 2 (in the Result section) shows plot of mean with standard error bars for each method ordered by ASR. We note that MSR and ASR are both standardized between 0 and 1. The standardization in MSR is necessary because in the breast cancer survival example we cannot apply FEM, REM, RankSum and RankProd as they are developed only for a two group comparison.
Biological association
The second criterion requires that a good metaanalysis method should detect a DE gene list that has better association with predefined "gold standard" pathways related to the targeted disease. Such a "gold standard" pathway set should be obtained from biological knowledge for a given disease or biological mechanism under investigation. However, since most disease or biological mechanisms are not wellstudied, obtaining such "gold standard" pathways is either difficult or questionable. To facilitate this evaluation without bias, we develop a computational and datadriven approach to determine a set of surrogate diseaserelated pathways out of a large collection of pathways by combining pathway enrichment analysis results from each single study. Specifically, we first collected 2,287 pathways (gene sets) from MSigDB (http://www.broadinstitute.org/gsea/msigdb/): 1,454 pathways from "GO", 186 pathways from "KEGG", 217 pathways from "BIOCARTA" and 430 pathways from "REACTOME", respectively. We filtered out pathways with less than 5 genes or more than 200 genes and 2,113 pathways were left for the analysis. DE analysis was performed in each single study separately and pathway enrichment analysis was performed for all the 2,113 pathways by the KolmogorovSmirnov (KS) association test. Denote by p_{ uk } the resulting pathway enrichment pvalue from KS test for pathway u (1 ≤ u ≤ 2,113) and study k (1 ≤ k ≤ K). For a given study k, enrichment ranks over pathways were calculated as r_{ uk } = rank_{ u }(p_{ uk }). A ranksum score for a given pathway u was then derived as ${S}_{u}={\sum}_{k=1}^{K}{r}_{\mathit{\text{uk}}}.$ Intuitively, pathways with small ranksum scores indicate that they are likely associated with the disease outcome by aggregated evidence of the K individual study analyses. We choose the top D pathways that had the smallest ranksum scores as the surrogate diseaserelated pathways and used these to proceed with the biological association evaluation of metaanalysis methods in the following.
Given the selected surrogate pathways D, the following procedure was used to evaluate performance of the 12 metaanalysis methods for a given example e (1 ≤ e ≤ 6). For each metaanalysis method m (1 ≤ m ≤ M = 12), the DE analysis result was associated with pathway u and the resulting enrichment pvalue by KStest was denoted by ${\tilde{P}}_{\mathit{\text{med}}}\phantom{\rule{0.5em}{0ex}}\left(1\le d\le \leftD\right\right).$ The rank of ${\tilde{P}}_{\mathit{\text{med}}}$ for method m among 12 methods was denoted by ${v}_{\mathit{\text{med}}}={\text{rank}}_{m}\phantom{\rule{0.12em}{0ex}}\left({\tilde{P}}_{\mathit{\text{med}}}\right).$ Similar to the detection capability evaluation, we calculated the mean standardized rank (MSR) for method m and example e as ${\text{MSR}}_{\mathit{\text{me}}}={\sum}_{d=1}^{D}\left({v}_{\mathit{\text{med}}}/\#\text{of the methods compared}\right)/D$ and the aggregated standardized rank (ASR) as ${\text{ASR}}_{m}={\sum}_{e=1}^{6}{\text{MSR}}_{\mathit{\text{me}}}/6,$ representing the overall performance of method m. To select the parameter D for surrogate diseaserelated pathways, Additional file 1: Figure S4 shows the trend of MSR_{ me } (on the yaxis) versus D (on the xaxis) as D increases. The result indicated that the performance evaluation using different D only minimally impacted the conclusion when D > 30. We choose D = 100 throughout this paper.
Note that we used KS test, instead of the popular Fisher’s exact test because each single study detected variable number of DE genes under a given FDR cutoff and the Fisher’s exact test is usually not powerful unless a few hundred DE genes are detected. On the other hand, the KS test does not require an arbitrary pvalue cutoff to determine the DE gene list for enrichment analysis.
Stability
The third criterion examines whether a metaanalysis method generates stable DE analysis result. To achieve this goal, we randomly split samples into half in each study (so that cases and controls are as equally split as possible). The first half of each study was taken to perform the first metaanalysis and generate a DE analysis result. Similarly, the second half of each study was taken to perform a second metaanalysis. The generated DE analysis results from two separate metaanalyses were compared by the adjusted DE similarity measure (to be described in the next section). The procedure is repeated for B = 50 times. Denote by S_{ meb } the adjusted DE similarity measure of method m of the b^{th} simulation in example e. Similar to the first two criteria, MSR and ASR were calculated based on S_{ meb } to evaluate the methods.
Robustness
The final criterion investigates the robustness of a metaanalysis method when an outlying irrelevant study is mistakenly added to the metaanalysis. For each of the six real examples, we randomly picked one irrelevant study from the other five examples, added it to the specific example for metaanalysis and evaluated the change from the original metaanalysis. The adjusted DE similarity measure was calculated between the original metaanalysis and the new metaanalysis with an added outlier. A high adjusted DE similarity measure shows better robustness against inclusion of the outlying study. This procedure was repeated until all irrelevant studies were used. The MSR and ASR are then calculated based on the adjusted DE similarity measures to evaluate the methods.
Similarity measure between two ordered DE gene lists
To compare results of two DE detection methods (from single study analysis or metaanalysis), a commonly used method in the literature is to take the DE genes under certain pvalue or FDR threshold, plot the Venn diagram and compute the ratio of overlap. This method, however, greatly depends on the selection of FDR threshold and is unstable. Another approach is to take the generated DE ordered gene lists from two methods and compute the nonparametric Spearman rank correlation [24]. This method avoids the arbitrary FDR cutoff but gives, say, the top 100 important DE genes and the bottom 100 nonDE genes equal contribution. To circumvent this pitfall, Li et al. proposed a parametric reproducibility measure for ChIPseq data in the ENCODE project [25]. Yang et al. introduced an OrderedList measure to quantify similarity of two ordered DE gene lists [26]. For simplicity, we extended the OrderedList measure into a standardized similarity score for the evaluation purpose in this paper. Specifically, suppose G_{ 1 } and G_{ 2 } are two ordered DE gene lists (e.g. ordered by pvalues) and small ranks represent more significant DE genes. We denote by O_{ n }(G_{1}, G_{2}) the number of overlapped genes in the top n genes of G_{ 1 } and G_{ 2 }. As a result, 0 ≤ O_{ n }(G_{1}, G_{2}) ≤ n and a large O_{ n }(G_{1}, G_{2}) value indicates high similarity of the two ordered lists in the top n genes. A weighted average similarity score is calculated as $S\left({G}_{1},{G}_{2}\right)={\sum}_{n=1}^{G}{e}^{\mathit{\text{an}}}\xb7\phantom{\rule{0.5em}{0ex}}{O}_{n}\left({G}_{1},{G}_{2}\right),$ where G is the total number of matched genes and the power α controls the magnitude of weights emphasized on the top ranked genes. When α is large, top ranked genes are weighted higher in the similarity measure. The expected value (under the null hypothesis that the two gene rankings are randomly generated) and maximum value of S can be easily calculated: ${E}_{\mathit{\text{null}}}\left(S\left({G}_{1},{G}_{2}\right)\right)={\sum}_{n=1}^{G}{e}^{\alpha n}\xb7{n}^{2}/G$ and $\mathit{\text{max}}\phantom{\rule{0.5em}{0ex}}\left(S\left({G}_{1},{G}_{2}\right)\right)={\sum}_{n=1}^{G}{e}^{\mathit{an}}\xb7n.$ We apply an idea similar to adjusted Rand index [27] used to measure similarity of two clustering results and define the adjusted DE similarity measure as
This measure ranges between 1 to 1 and gives an expected value of 0 if two ordered gene lists are obtained by random chance. Yang et al. proposed a resamplingbased and ROC methods to estimate the best selection of α. Since the number of DE genes in our examples are generally high, we choose a relatively small α = 0.001 throughout this paper. We have tested different α and found that the results were similar (Additional file 1: Figure S7).
Results
Simulation setting
We conducted simulation studies to evaluate and characterize the 12 metaanalysis methods for detecting biomarkers in the underlying hypothesis settings of HS_{ A }, HS_{ B } or HS_{ r }. The simulation algorithm is described below:

1.
We simulated 800 genes with 40 gene clusters (20 genes in each cluster) and other 1,200 genes do not belong to any cluster. The cluster indexes C _{ g } for gene g (1 ≤ g ≤ 2, 000) were randomly sampled, such that ∑ I{C _{ g } = 0} = 1, 200 and ∑ I{C _{ g } = c} = 20, 1 ≤ c ≤ 40.

2.
For genes in cluster c (1 ≤ c ≤ 40) and in study k (1 ≤ k ≤ 5), we sampled ${\mathrm{\sum}}_{\mathit{\text{ck}}}^{\mathit{\text{'}}}~{W}^{1}\left(\mathrm{\Psi},60\right),$ where Ψ = 0.5I _{20 × 20} + 0.5J _{20 × 20}, W ^{ 1} denotes the inverse Wishart distribution, I is the identity matrix and J is the matrix with all elements equal 1. We then standardized ${\mathrm{\Sigma}}_{\mathit{\text{ck}}}^{\mathit{\text{'}}}$ into Σ_{ ck } where the diagonal elements are all 1’s.

3.
20 genes in cluster c was denoted by the index of g _{c1}, …, g _{c20}, i.e. ${C}_{{g}_{\mathit{\text{cj}}}}=c,\text{where}\phantom{\rule{0.25em}{0ex}}1\le c\le 40\phantom{\rule{0.25em}{0ex}}\text{and}\phantom{\rule{0.37em}{0ex}}1\le j\le 20.$ We sampled gene expression levels of genes in cluster c for sample n as ${\left({X}_{{g}_{c1}\mathit{\text{nk}}}^{\mathit{\text{'}}},\dots ,{X}_{{g}_{c20}\mathit{\text{nk}}}^{\mathit{\text{'}}}\right)}^{T}~\mathit{\text{MVN}}\left(0,{\mathrm{\sum}}_{\mathit{\text{ck}}}\right)$ where 1 ≤ n ≤ 100 and 1 ≤ k ≤ 5, and sample expression level for the gene $g~N\left(0,{\sigma}_{k}^{2}\right)$ which is not in any cluster for sample n, where 1 ≤ n ≤ 100, 1 ≤ k ≤ 5 and ${\sigma}_{k}^{2}$ was uniformly distributed from [0.8, 1.2], which indicates different variance for study k.

4.
For the first 1,000 genes (1 ≤ g ≤ 1, 000), k _{ g } (the number of studies that are differentially expressed for gene g) was generated by sampling k _{ g } = 1, 2, 3, 4 and 5, respectively. For the next 1,000 genes (1, 001 ≤ g ≤ 2, 000), k _{ g } = 0 represents nonDE genes in all five studies.

5.
To simulate expression intensities for cases, we randomly sampled δ _{ gk } ∈ {0, 1}, such that ∑ _{ k } δ _{ gk } = k _{ g }. If δ _{ gk } = 1, gene g in study k was a DE gene, otherwise it was a nonDE gene. When δ _{ gk } = 1, we sampled expression intensities μ _{ gk } from a uniform distribution in the range of [0.5, 3], which means we considered the concordance effect (upregulated) among all simulated studies. Hence, the expression for control samples are ${X}_{\mathit{\text{gnk}}}={X}_{\mathit{\text{gnk}}}^{\mathit{\text{'}}},$ and case samples are ${Y}_{\mathit{\text{gnk}}}={X}_{g\left(n+50\right)k}^{\text{'}}+{\mu}_{\mathit{\text{gk}}}\xb7{\delta}_{\mathit{\text{gk}}},$ for 1 ≤ g ≤ 2, 000, 1 ≤ n ≤ 50 and 1 ≤ k ≤ 5.
In the simulation study, we had 1,000 nonDE genes in all five studies (k_{ g } = 0), and 1,000 genes were differentially expressed in 1 ~ 5 studies (k_{ g } = 1, 2, 3, 4, 5). On average, we had roughly the same number (~200) of genes in each group of k_{ g } = 1, 2, 3, 4, 5. See Additional file 1: Figure S2 for the heatmap of a simulated example (red colour represents upregulated genes). We applied the 12 metaanalysis method under FDR control at 5%. With the knowledge of true k_{ g }, we were able to derive the sensitivity and specificity for HS_{ A } and HS_{ B }, respectively. In HS_{ A }, genes with k_{ g } = 5 were the underlying true positives and genes with k_{ g } = 0 ~ 4 were the underlying true negatives; in HS_{ B }, gene with k_{ g } = 1 ~ 5 were the underlying true positives and genes with k_{ g } = 0 were the true negatives. By adjusting the decision cutoff, the receiver operating characteristic (ROC) curves and the resulting area under the curve (AUC) were used to evaluate the performance. We simulated 50 data sets and reported the means and standard errors of the AUC values. AUC values range between 0 and 1. AUC = 50% represents a random guess and AUC = 1 reaches the perfect prediction. The above simulation scheme only considered the concordance effect sizes (i.e. all with upregulation when a gene is DE in a study) among five simulated studies. In many applications, some genes may have pvalue statistical significance in the metaanalysis but the effect sizes are discordant (i.e. a gene is upregulation in one study but downregulation in another study). To investigate that effect, we performed a second simulation that considers random discordant cases. In step 5, the μ_{ gk } became a mixture of two uniform distributions: π_{ gk } Unif ⋅[3, 0.5]+ (1  π_{ gk })⋅ Unif[0.5, 3], where π_{ gk } is the probability of gene g (1 ≤ g ≤ 2, 000) in study k(1 ≤ k ≤ 5) to have a discordant effect size (downregulated). We set π_{ gk } = 0.2 for the discordant simulation setting.
Simulation results to characterize the methods
The simulation study provided the underlying truth to characterize the metaanalysis methods according to their strengths and weaknesses for detecting DE genes of different hypothesis settings. The performances of 12 methods were evaluated by receiver operating characteristic (ROC) curves, which is a visualization tool that illustrates the sensitivity and specificity tradeoff, and the resulting area under the ROC curve (AUC) under two different hypothesis settings of HS_{ A } and HS_{ B }. Table 1 shows the detected number of DE genes under nominal FDR at 5%, the true FDR and AUC values under HS_{ A } and HS_{ B } for all 12 methods. The values were averaged over 50 simulations and the standard errors are shown in the parentheses.
Figure 3 shows the histogram of the true number of DE studies (i.e. k_{ g }) among the detected DE genes under FDR = 5% for each method. It is clearly seen that minP, Fisher, AW, Stouffer and FEM detected HS_{ B }type DE genes and had high AUC values under HS_{ B } criterion (0.980.99), compared to lower AUC values under HS_{ A } criterion (0.790.9). For these methods, the true FDR for HS_{ A } generally lost control (0.41 0.44). On the other hand, maxP, rOP and REM had high AUC under HS_{ A } criterion (0.960.99) (true FDR = 0.0680.117) compared to HS_{ B } (0.750.92). maxP detected mostly HS_{ A }type of markers and rOP and REM detected mostly HS_{ r }type DE genes. PR and SR detected mostly HS_{ A }type DE genes but they surprisingly had very high AUC under both HS_{ A } and HS_{ B } criteria. The RankProd method detected DE genes between HS_{ r } and HS_{ B } types and had a good AUC value under HS_{ B }. The RankSum detected HS_{ B }type DE genes but had poor AUC values (0.5) for both HS_{ A } and HS_{ B }. Table 1 includes our concluding characterization of the targeted hypothesis settings for each metaanalysis method (see also Additional file 1: Figure S5 of the ROC curve and AUC of HS_{ A }type and HS_{ B }type in 12 metaanalysis methods). Additional file 1: Figure S3 shows the result for the second discordant simulation setting. The numbers of studies with opposite effect size are represented by different colours in histogram plot (green: all studies with concordance effect size; blue: one study has opposite effect size with the remaining; red: two studies have opposite effect size with the remaining). In summary, almost all metaanalysis methods could not avoid inclusion of genes with opposite effect sizes. Particularly, methods utilizing pvalues from twosided tests (e.g. Fisher, AW, minP, maxP and rOP) could not distinguish direction of effect sizes. Stouffer was the only method that accommodated the effect size direction in its ztransformation formulation but its ability to avoid DE genes with discordant effect sizes seemed still limited. Owen (2009) proposed a onesided correction procedure for Fisher’s method to avoid detection of discordant effect sizes in metaanalysis [28]. The null distribution of the new statistic, however, became difficult to derive. The approach can potentially be extended to other methods and more future research will be needed for this issue.
Results of the four evaluation criteria
Detection capability
Figure 2 shows the number of DE genes identified by each of the 12 metaanalysis methods (FDR = 10% for MDD and breast cancer due to their weak signals and FDR = 1% for all the others). Each plot shows mean with standard error bars for 50 bootstrapped data sets. Additional file 1: Table S4 shows the MSR and ASR for each method in the six examples. The methods in Figure 2 are ordered according to their ASR values. The top six methods with the strongest detection capability were those that detected HS_{ B }type DE genes from the conclusion of Table 1: Fisher, AW, Stouffer, minP, FEM and RankSum. The order of performance of these six methods was pretty consistent across all six examples. The next four methods were rOP, RankProd, maxP and REM and they targeted on either HS_{ r } or HS_{ A }. PR and SR had the weakest detection capability, which was consistent with the simulation result in Table 1.
Biological association
Figure 4 shows plots of mean with standard error bars from the pathway association pvalues (minus logtransformed) of the top 100 surrogate diseaserelated pathways for the 12 methods. Additional file 1: Table S5 shows the corresponding MSR and ASR. We found that Stouffer, Fisher and AW had the best performance among the 12 methods. Surprisingly we found that although PR and SR had low detection capability in simulation and real data, they consistently had relatively high biological association results. This may be due to the better DE gene ordering results these two methods provide, as was also shown by the high AUC values under both hypothesis settings in the simulation.
Stability
Figure 5 shows the plots of mean with standard error bars of stability calculated by adjusted DE similarity measure. Additional file 1: Table S6 contains the corresponding MSR and ASR. In summary, RankProd and RankSum methods were the most stable metaanalysis methods probably because these two nonparametric approaches take into account all possible fold change calculations between cases and controls. They do not need any distributional assumptions, which provided stability even when sample sizes were small [29]. The maximum pvalue method consistently had the lowest stability in all data sets, which is somewhat expected. For a given candidate marker with a small maximum pvalue, the chance that at least one study has significantly inflated pvalues is high when sample size is reduced by half. The stability measures in the breast cancer example were generally lower than other examples. This is mainly due to the weak signals for survival outcome association, which might be improved if larger sample size is available.
Robustness
Figure 6 shows the plots of mean with standard error bars of robustness calculated by adjusted DE similarity measure between the original metaanalysis and the new metaanalysis with an added outlier. Additional file 1: Table S7 shows the corresponding MSR and ASR values. In general, methods suitable for HS_{ B } (minP, AW, Fisher and Stouffer) have better robustness than methods for HS_{ A } or HS_{ r } (e.g. maxP and rOP). The trend is consistent in the prostate cancer, brain cancer and IPF examples but is more variable in the weaksignal MDD and breast cancer examples. RankSum was surprisingly the most sensitive method to outliers, while RankProd performs not bad.
Characterization of methods by MDS plots
We applied the adjusted DE similarity measure to quantify the similarity of the DE gene orders from any two metaanalysis methods. The resulting dissimilarity measure (i.e. one minus adjusted similarity measure) was used to construct the multidimensional scaling (MDS) plot, showing the similarity/dissimilarity structure between the 12 methods in a twodimensional space. When two methods were close to each other, they generated similar DE gene ordering. The patterns of MDS plots from six examples generated quite consistent results (Additional file 1: Figure S6). Figure 1(a) shows an aggregated MDS plot where the input dissimilarity matrix is averaged from the six examples. We clearly observed that Fisher, AW, Stouffer, minP, PR and SR were consistently clustered together in all six individual and the aggregated MDS plot (labeled in red). This is not surprising given that these methods all sum transformed pvalue evidence across studies (except for minP). Two methods to combine effect sizes and two methods to combine ranks (FEM, REM, RankProd and RankSum labeled in blue) are consistently clustered together. Finally, the maxP and rOP methods seem to form a third loose cluster (labeled in green).
Characterization of data sets by entropy measure
From the simulation study, selection of a most suitable metaanalysis method depends on the hypothesis setting behind the methods. The choice of a hypothesis setting mostly depends on the biological purpose of the analysis; that is, whether one aims to detect candidate markers differentially expressed in "all" (HS_{ A }), "most" (HS_{ r }) or "one or more" (HS_{ B }) studies. However, when no biological prior information or preference exists, the entropy measure can be objectively used to determine the choice of hypothesis setting. The analysis identifies the top 1,000 genes from Fisher’s metaanalysis method and the genespecific entropy of each gene is calculated. When the entropy is small, the pvalues are small in only one or very few studies. Conversely, when the entropy is large, most or all of the studies have small pvalues. Figure 1(b) shows the boxplots of entropy of the top 1,000 candidate genes identified by Fisher’s method in the six data sets. The result shows that prostate cancer comparing primary and metastatic tumor samples had the smallest entropy values, which indicated high heterogeneity across the three studies and that HS_{ B } should be considered in the metaanalysis. On the other hand, MDD had the highest entropy values. Although the signals of each MDD study were very weak, they were rather consistent across studies and application of HS_{ A } or HS_{ r } was adequate. For the other examples, we suggest using the robust HS_{ r } unless other prior biological purpose is indicated.
Conclusions and discussions
An application guideline for practitioners
From the simulation study, the 12 metaanalysis methods were categorized into three hypothesis settings (HS_{ A }, HS_{ B } and HS_{ r }), showing their strengths for detecting different types of DE genes in the metaanalysis (Figure 3 and the second column of Table 2). For example, maxP is categorized to HS_{ A } since it tends to detect only genes that are differentially expressed in all studies. From the results using four evaluation criteria, we summarized the rank of ASR values (i.e. the order used in Figures 2 and 6) and calculated the rank sum of each method in Table 2. The methods were then sorted first by the hypothesis setting categories and then by the rank sum. The clusters of methods from the MDS plot were also displayed. For methods in the HS_{ A } category, we surprisingly see that the maxP method performed among the worst in all four evaluation criteria and should be avoided. PR was a better choice in this hypothesis setting although it provides a rather weak detection capability. For HS_{ B }, Fisher, AW and Stouffer performed very well in general. Among these three methods, we note that AW has an additional advantage to provide an adaptive weight index that indicates the subset of studies contributing to the metaanalysis and characterizes the heterogeneity (e.g. adaptive weight (1,0,…) indicates that the marker is DE in study 1 but not in study 2, etc.). As a result, we recommend AW over Fisher and Stouffer in the HS_{ B } category. For HS_{ r }, the result was less conclusive. REM provided better stability and robustness but sacrificed detection capability and biological association. On the other hand, rOP obtained better detection capability and biological association but was neither stable nor robust. In general, since detection capability and biological association are of more importance in the metaanalysis and rOP has the advantage to link the choice of r in HS_{ r } with the rOP method (e.g. when r = 0.7∙K, we identify genes that are DE in more than 70% of studies), we recommend rOP over REM.
Below, we provide a general guideline for a practitioner when applying microarray metaanalysis. Data sets of a relevant biological or disease hypothesis are firstly identified, preprocessed and annotated according to Step (i)  (v) in Ramasamy et al. Proper quality assessment should be performed to exclude studies with problematic quality (e.g. with the aid of MetaQC as we did in the six examples). Based on the experimental design and biological objectives of collected data, one should determine whether the metaanalysis aims to identify biomarkers differentially expressed in all studies (HS_{ A }), in one or more studies (HS_{ B }) or in majority of studies (HS_{ r }). In general, if higher heterogeneity is expected from, say, heterogeneous experimental protocol, cohort or tissues, HS_{ B } should be considered. For example, if the combined studies come from different tissues (e.g. the first study uses peripheral blood, the second study uses muscle tissue and so on), tissuespecific markers may be expected and HS_{ B } should be applied. On the contrary, if the collected studies are relatively homogeneous (e.g. use the same array platform or from the same lab), HS_{ r } is generally recommended, as it provides robustness and detects consistent signals across the majority of studies. In the situation that no prior knowledge is available to choose a desired hypothesis setting or if the researcher is interested in a datadriven decision, the entropy measure in Figure 1(b) can be applied and the resulting boxplot can be compared to the six examples in this paper to guide the decision. Once the hypothesis setting is determined, the choice of a metaanalysis method can be selected from the discussion above and Table 2.
Conclusions
In this paper, we performed a comprehensive comparative study to evaluate 12 microarray metaanalysis methods using simulation and six real examples with four evaluation criteria. We clarified three hypothesis settings that were implicitly assumed behind the methods. The evaluation results produced a practical guideline to inform biologists the best choice of method(s) in real applications.
With the reduced cost of highthroughput experiments, data from microarray, new sequencing techniques and mass spectrometry accumulate rapidly in the public domain. Integration of multiple data sets has become a routine approach to increase statistical power, reduce false positives and provide more robust and validated conclusions. The evaluation in this paper focuses on microarray metaanalysis but the principles and messages apply to other types of genomic metaanalysis (e.g. GWAS, methylation, miRNA and eQTL). When the nextgeneration sequencing technology becomes more affordable in the future, sequencing data will become more prevalent as well and similar metaanalysis techniques will apply. For these different types of genomic metaanalysis, similar comprehensive evaluation could be performed and application guidelines should be established as well.
References
 1.
Ramasamy A, Mondry A, Holmes CC, Altman DG: Key issues in conducting a metaanalysis of gene expression microarray datasets. PLoS Med. 2008, 5 (9): e18410.1371/journal.pmed.0050184.
 2.
Kang DD, Sibille E, Kaminski N, Tseng GC: MetaQC: objective quality control and inclusion/exclusion criteria for genomic metaanalysis. Nucleic Acids Res. 2012, 40 (2): e1510.1093/nar/gkr1071.
 3.
Tseng GC, Ghosh D, Feingold E: Comprehensive literature review and statistical considerations for microarray metaanalysis. Nucleic Acids Res. 2012, 40 (9): 37853799. 10.1093/nar/gkr1265.
 4.
Hong F, Breitling R: A comparison of metaanalysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics. 2008, 24 (3): 374382. 10.1093/bioinformatics/btm620.
 5.
Campain A, Yang YH: Comparison study of microarray metaanalysis methods. BMC Bioinforma. 2010, 11: 40810.1186/1471210511408.
 6.
Birnbaum A: Combining independent tests of significance. J Am Stat Assoc. 1954, 49: 559574.
 7.
Li J, Tseng GC: An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies. Ann Appl Stat. 2011, 5: 9941019. 10.1214/10AOAS393.
 8.
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98 (9): 51165121. 10.1073/pnas.091062498.
 9.
Pesarin F, Salmaso L: Permutation tests for complex data: theory, applications and software. 2010, Ames, IA 50010: Wiley.com
 10.
Cox DR: Regression models and lifetables. J R Stat Soc Ser B. 1972, 34 (2): 187220.
 11.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995, 57 (1): 289300.
 12.
Wang X, Kang DD, Shen K, Song C, Lu S, Chang LC, Liao SG, Huo Z, Tang S, Ding Y, et al: An R package suite for microarray metaanalysis in quality control, differentially expressed gene analysis and pathway enrichment detection. Bioinformatics. 2012, 28 (19): 25342536. 10.1093/bioinformatics/bts485.
 13.
Fisher RA: Statistical methods for research workers. 1925, Edinburgh: Genesis Publishing, Oliver and Boyd
 14.
Stouffer SA: A study of attitudes. Sci Am. 1949, 180 (5): 1115. 10.1038/scientificamerican054911.
 15.
Tippett LHC: The Methods of Statistics. An introduction mainly for workers in the biological sciences. The Methods of Statistics An Introduction mainly for Workers in the Biological Sciences. 1931, London: Williams &Norgate Ltd
 16.
Wilkinson B: A statistical consideration in psychological research. Psychol Bull. 1951, 48 (3): 156158.
 17.
Song C, Tseng GC: Order statistics for robust genomic metaanalysis. Ann Appl Stat. 2012, Accepted
 18.
Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation. Bioinformatics. 2003, 19 (Suppl 1): i84i90. 10.1093/bioinformatics/btg1010.
 19.
Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J: RankProd: a bioconductor package for detecting differentially expressed genes in metaanalysis. Bioinformatics. 2006, 22 (22): 28252827. 10.1093/bioinformatics/btl476.
 20.
Dreyfuss JM, Johnson MD, Park PJ: Metaanalysis of glioblastoma multiforme versus anaplastic astrocytoma identifies robust gene markers. Mol Cancer. 2009, 8: 7110.1186/14764598871.
 21.
Borg I: Modern multidimensional scaling: theory and applications. 2005, New York, NY 10013: Springer
 22.
Martin NF, England JW: Mathematical theory of entropy, vol. 12. 2011, Cambridge CB2 8BS United Kingdom: Cambridge University Press
 23.
Wu W, Dave N, Tseng GC, Richards T, Xing EP, Kaminski N: Comparison of normalization methods for CodeLink Bioarray data. BMC Bioinforma. 2005, 6: 30910.1186/147121056309.
 24.
Spearman C: The proof and measurement of association between two things. Amer J Psychol. 1904, 15: 72101. 10.2307/1412159.
 25.
Li Q, Brown JB, Huang H, Bickel PJ: Measuring reproducibility of highthroughput experiments. Ann Appl Stat. 2011, 5 (3): 17521779. 10.1214/11AOAS466.
 26.
Yang X, Bentink S, Scheid S, Spang R: Similarities of ordered gene lists. J Bioinform Comput Biol. 2006, 4 (3): 693708. 10.1142/S0219720006002120.
 27.
Hubert L, Arabie P: Comparing partitions. J Classification. 1985, 2: 193218. 10.1007/BF01908075.
 28.
Owen AB: Karl Pearson's metaanalysis revisited. Ann Statistics. 2009, 37 (6B): 38673892. 10.1214/09AOS697.
 29.
Breitling R, Herzyk P: Rankbased methods as a nonparametric alternative of the Tstatistic for the analysis of biological microarray data. J Bioinform Comput Biol. 2005, 3 (5): 11711189. 10.1142/S0219720005001442.
Acknowledgements
This work was supported by the National Institutes of Health [R21MH094862].
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
GCT supervised the whole project. LCC developed all statistical analysis and HML developed partial statistical analysis. GCT and LCC drafted the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Chang, L., Lin, H., Sibille, E. et al. Metaanalysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline. BMC Bioinformatics 14, 368 (2013). https://doi.org/10.1186/1471210514368
Received:
Accepted:
Published:
Keywords
 Differentially Express
 Differentially Express Gene
 Adaptively Weighted
 Hypothesis Setting
 False Discovery Rate Threshold