Meta-analysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline
- Lun-Ching Chang^{1},
- Hui-Min Lin^{1},
- Etienne Sibille^{2} and
- George C Tseng^{1, 3}
https://doi.org/10.1186/1471-2105-14-368
© Chang et al.; licensee BioMed Central Ltd. 2013
Received: 17 August 2013
Accepted: 5 December 2013
Published: 21 December 2013
Abstract
Background
As high-throughput genomic technologies become accurate and affordable, an increasing number of data sets have been accumulated in the public domain and genomic information integration and meta-analysis have become routine in biomedical research. In this paper, we focus on microarray meta-analysis, where multiple microarray studies with relevant biological hypotheses are combined in order to improve candidate marker detection. Many methods have been developed and applied in the literature, but their performance and properties have only been minimally investigated. There is currently no clear conclusion or guideline as to the proper choice of a meta-analysis method given an application; the decision essentially requires both statistical and biological considerations.
Results
We applied 12 microarray meta-analysis methods to combine multiple simulated expression profiles. The methods can be categorized by the hypothesis setting they target: (1) HS_{ A }: DE genes with non-zero effect sizes in all studies, (2) HS_{ B }: DE genes with non-zero effect sizes in one or more studies and (3) HS_{ r }: DE genes with non-zero effect sizes in the majority of studies. We then performed a comprehensive comparative analysis through six large-scale real applications using four quantitative statistical evaluation criteria: detection capability, biological association, stability and robustness. We elucidated the hypothesis settings behind the methods and further applied multi-dimensional scaling (MDS) and an entropy measure to characterize the meta-analysis methods and the data structure, respectively.
Conclusions
The aggregated results from the simulation study categorized the 12 methods into three hypothesis settings (HS_{ A }, HS_{ B }, and HS_{ r }). Evaluation in real data and results from MDS and entropy analyses provided an insightful and practical guideline to the choice of the most suitable method in a given application. All source files for simulation and real data are available on the author’s publication website.
Background
Microarray technology has been widely used to identify differentially expressed (DE) genes in biomedical research in the past decade. Many transcriptomic microarray studies have been generated and made available in public domains such as the Gene Expression Omnibus (GEO) from NCBI (http://www.ncbi.nlm.nih.gov/geo/) and ArrayExpress from EBI (http://www.ebi.ac.uk/arrayexpress/). From these databases, one can easily obtain multiple studies relevant to a biological or disease hypothesis. Since a single study often has small sample size and limited statistical power, combining information across multiple studies is an intuitive way to increase sensitivity. Ramasamy et al. proposed a seven-step practical guideline for conducting microarray meta-analysis [1]: "(i) identify suitable microarray studies; (ii) extract the data from studies; (iii) prepare the individual datasets; (iv) annotate the individual datasets; (v) resolve the many-to-many relationship between probes and genes; (vi) combine the study-specific estimates; (vii) analyze, present, and interpret results". Regarding the first step, although meta-analysis theoretically increases the statistical power to detect DE genes, performance can deteriorate if problematic or heterogeneous studies are combined. In many applications, the data inclusion/exclusion criteria are based on ad-hoc expert opinions, a naïve sample size threshold or selection of platforms without an objective quality control procedure. Kang et al. proposed six quantitative quality control measures (MetaQC) to guide study inclusion decisions [2]. Steps (ii)-(v) concern data preprocessing. Finally, Steps (vi) and (vii) involve the selection of a meta-analysis method and interpretation of the results, and are the foci of this paper.
Many microarray meta-analysis methods have been developed and applied in the literature. According to a recent review paper by Tseng et al. [3], popular methods mainly combine three different types of statistics: p-values, effect sizes and ranks. In this paper, we include 12 popular as well as state-of-the-art methods in the evaluation and comparison. Six methods (Fisher, Stouffer, adaptively weighted Fisher, minimum p-value, maximum p-value and rth ordered p-value) belong to the p-value combination category, two methods (fixed effects model and random effects model) belong to the effect size combination category and four methods (RankProd, RankSum, product of ranks and sum of ranks) belong to the rank combination category. Details of these methods and citations are provided in the Method section. Despite the availability of many methods, a comprehensive evaluation of their pros and cons remains largely missing in the literature. To our knowledge, Hong and Breitling [4] and Campain and Yang [5] are the only two studies that have systematically compared multiple meta-analysis methods. The number of methods compared (three and five, respectively) and the number of real examples examined (two and three, respectively, each covering 2-5 microarray studies) were, however, limited. The conclusions of the two papers were suggestive but offered limited insight to guide practitioners. In addition, as we will discuss in the Method section, different meta-analysis methods target different underlying hypothesis settings. As a result, the selection of an adequate (or optimal) meta-analysis method depends heavily on the data structure and on the hypothesis setting that matches the underlying biological goal.
In this paper, we compare 12 popular microarray meta-analysis methods using simulation and six real applications to benchmark their performance by four statistical criteria (detection capability, biological association, stability and robustness). Using simulation, we characterize the strength of each method under three different hypothesis settings (i.e. detecting DE genes in "all studies", the "majority of studies" or "one or more studies"; see the Method section for details). We compare the similarity and grouping of the meta-analysis methods based on their DE gene detection results (using a similarity measure and a multi-dimensional scaling plot) and use an entropy measure to characterize the data structure and determine which hypothesis setting may be more adequate in a given application. Finally, we give a guideline to help practitioners select the best meta-analysis method under the chosen hypothesis setting in their applications.
Methods
Real data sets
Six example data sets for microarray meta-analysis were collected for evaluations in this paper. Each example contained 4-8 microarray studies. Five of the six examples were of the commonly seen two-group comparison and the last breast cancer example contained relapse-free survival outcome. We applied the MetaQC package [2] to assess quality of the studies for meta-analysis and determined the final inclusion/exclusion criteria. The principal component analysis (PCA) bi-plots and the six QC measures are summarized in Additional file 1: Figure S1, Tables S2 and S3. Details of the data sets are available in Additional file 1: Table S1.
Underlying hypothesis settings
A major contribution of this paper is to characterize meta-analysis methods suitable for different hypothesis settings (HS_{ A }, HS_{ B } and HS_{ r }) using simulation and real applications and to compare their performance with four benchmarks to provide a practical guideline.
Microarray meta-analysis data pre-processing
Assume that we have K microarray studies to combine. For study k (1 ≤ k ≤ K), denote by x_{ gsk } the gene expression intensity of gene g (1 ≤ g ≤ G) and sample s (1 ≤ s ≤ S_{ k }, where S_{ k } is the number of samples in study k), and by y_{ sk } the disease/outcome variable of sample s. The disease/outcome variable can be binary, multi-class, continuous or censored, representing the disease state, severity or prognosis outcome (e.g. tumor versus normal, or recurrence survival time). The goal of microarray meta-analysis is to combine information across the K studies to detect differentially expressed (DE) genes associated with the disease/outcome variable. Such DE genes serve as candidate markers for disease classification, diagnosis or prognosis prediction and help elucidate the genetic mechanisms underlying a disease. In this paper, before meta-analysis we first applied the penalized t-statistic to each individual study to generate p-values or DE ranks [8] for a binary outcome. In contrast to the traditional t-statistic, the penalized t-statistic adds a fudge parameter s_{0} to stabilize the denominator ($T=\left(\overline{X}-\overline{Y}\right)/\left(\widehat{s}+{s}_{0}\right)$, where $\overline{X}$ and $\overline{Y}$ are the means of the case and control groups) and to avoid a large t-statistic caused by a small estimated standard deviation $\widehat{s}$. The p-values were calculated from null distributions derived by conventional non-parametric permutation analysis, randomly permuting the case and control labels 10,000 times [9]. For censored outcome variables, the Cox proportional hazards model and log-rank test were used [10]. Meta-analysis methods (described in the next subsection) were then used to combine information across studies and generate meta-analyzed p-values. To account for multiple comparisons, the Benjamini-Hochberg procedure was used to control the false discovery rate (FDR) [11]. All methods were implemented using the "MetaDE" package in R [12].
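The single-study statistic above can be sketched in a few lines. This is an illustrative Python sketch rather than the authors' R/MetaDE implementation, and the fixed fudge parameter `s0 = 0.1` is an assumption for demonstration (in practice s_{0} is tuned from the data):

```python
import numpy as np

def penalized_t(case, control, s0=0.1):
    """Penalized t-statistic: the fudge parameter s0 stabilizes the denominator."""
    diff = case.mean() - control.mean()
    n1, n2 = len(case), len(control)
    # pooled standard error of the mean difference
    se = np.sqrt(((n1 - 1) * case.var(ddof=1) + (n2 - 1) * control.var(ddof=1))
                 / (n1 + n2 - 2) * (1 / n1 + 1 / n2))
    return diff / (se + s0)

def permutation_pvalue(case, control, n_perm=10000, s0=0.1, rng=None):
    """Two-sided p-value from randomly permuting case/control labels."""
    rng = np.random.default_rng(rng)
    obs = abs(penalized_t(case, control, s0))
    pooled = np.concatenate([case, control])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        t = penalized_t(perm[:len(case)], perm[len(case):], s0)
        count += abs(t) >= obs
    return (count + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

The add-one correction in the last line is one common convention for permutation p-values; the paper does not specify which variant was used.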
Data sets and all programming codes are available at http://www.biostat.pitt.edu/bioinfo/publication.htm.
Microarray meta-analysis methods
According to a recent review paper [3], microarray meta-analysis methods can be categorized into three types: combine p-values, combine effect sizes and combine ranks. Below, we briefly describe 12 methods that were selected for comparison.
Combine p-values
Fisher The Fisher’s method [13] sums the log-transformed p-values obtained from the individual studies. The combined Fisher’s statistic ${\chi}_{\text{Fisher}}^{2}=-2{\sum}_{k=1}^{K}\text{log}\left({p}_{k}\right)$ follows a χ^{2} distribution with 2K degrees of freedom under the null hypothesis (assuming null p-values are uniformly distributed). Note that we perform permutation analysis instead of such parametric evaluation for Fisher and the other methods in this paper. Smaller p-values contribute larger scores to the Fisher’s statistic.
Stouffer Stouffer’s method [14] sums the inverse normal transformed p-values. Stouffer’s statistic ${T}_{\text{Stouffer}}={\sum}_{k=1}^{K}{z}_{k}/\sqrt{K}$, where ${z}_{k}={\mathrm{\Phi}}^{-1}\left(1-{p}_{k}\right)$ and Φ is the standard normal c.d.f., follows a standard normal distribution under the null hypothesis. Similar to Fisher’s method, smaller p-values contribute more to the Stouffer’s score, but with a smaller magnitude.
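As a concrete illustration, both statistics can be computed directly from their closed-form null distributions. Note that the paper itself evaluates these statistics by permutation rather than the parametric nulls used in this Python sketch:

```python
import numpy as np
from scipy import stats

def fisher_combine(pvals):
    """Fisher: -2 * sum(log p) ~ chi-square with 2K df under the null."""
    pvals = np.asarray(pvals)
    stat = -2 * np.sum(np.log(pvals))
    return stat, stats.chi2.sf(stat, df=2 * len(pvals))

def stouffer_combine(pvals):
    """Stouffer: sum(z_k)/sqrt(K), z_k = Phi^{-1}(1 - p_k), ~ N(0,1) under null."""
    z = stats.norm.isf(np.asarray(pvals))  # inverse survival function = Phi^{-1}(1 - p)
    stat = z.sum() / np.sqrt(len(z))
    return stat, stats.norm.sf(stat)
```

For example, three moderately small p-values such as 0.01, 0.02 and 0.03 yield a combined p-value well below each individual one under either method.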
Adaptively weighted (AW) Fisher The AW Fisher’s method [7] assigns a binary weight to each individual study, ${T}_{\text{AW}}=-{\sum}_{k=1}^{K}{w}_{k}\cdot \text{log}\left({p}_{k}\right)$ with ${w}_{k}=0\phantom{\rule{0.24em}{0ex}}\text{or}\phantom{\rule{0.12em}{0ex}}1$, and searches through all possible weights to find the adaptive weight with the smallest derived p-value. A significant advantage of this method is its ability to indicate which studies contribute to the evidence aggregation, which elucidates heterogeneity in the meta-analysis. Details are provided in Additional file 1.
Minimum p -value (minP) The minP method takes the minimum p-value among the K studies as the test statistic [15]. It follows a beta distribution with parameters α = 1 and β = K under the null hypothesis. This method detects a DE gene whenever a small p-value exists in any one of the K studies.
Maximum p -value (maxP) The maxP method takes the maximum p-value as the test statistic [16]. It follows a beta distribution with parameters α = K and β = 1 under the null hypothesis. This method targets DE genes that have small p-values in "all" studies.
r-th ordered p -value (rOP) The rOP method takes the r-th order statistic among the sorted p-values of the K combined studies. Under the null hypothesis, the statistic follows a beta distribution with parameters α = r and β = K - r + 1. The minP and maxP methods are special cases of rOP (r = 1 and r = K, respectively). In Song and Tseng [17], rOP is considered a robust form of maxP (with r set greater than 0.5∙K) to identify candidate markers differentially expressed in the "majority" of studies.
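The beta null distributions above give a direct way to compute rOP p-values, with minP and maxP falling out as the r = 1 and r = K special cases. An illustrative Python sketch (again using the parametric null rather than permutation):

```python
import numpy as np
from scipy import stats

def rop_pvalue(pvals, r):
    """r-th ordered p-value: the r-th smallest of K null-uniform p-values
    follows Beta(r, K - r + 1); r = 1 gives minP, r = K gives maxP."""
    pvals = np.sort(np.asarray(pvals))
    K = len(pvals)
    return stats.beta.cdf(pvals[r - 1], r, K - r + 1)

def minp(pvals):
    return rop_pvalue(pvals, 1)

def maxp(pvals):
    return rop_pvalue(pvals, len(pvals))
```

For K = 3 and p-values (0.001, 0.2, 0.9), minP gives 1 - 0.999^3 (small: one strong study suffices), while maxP gives 0.9^3 = 0.729 (not significant: the weakest study dominates).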
Combine effect size
Fixed effects model (FEM) FEM combines the effect size across K studies by assuming a simple linear model with an underlying true effect size plus a random error in each study.
Random effects model (REM) REM [18] extends FEM by adding a random effect to model the inter-study heterogeneity. Detailed formulation and inference of FEM and REM are available in Additional file 1.
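A common formulation uses inverse-variance weighting for FEM and the DerSimonian-Laird estimator of the between-study variance τ² for REM. The authors' exact formulation is in Additional file 1, so treat this Python sketch as a generic illustration rather than their implementation:

```python
import numpy as np

def fixed_effects(effects, variances):
    """FEM: inverse-variance weighted average of study effect sizes."""
    effects, variances = np.asarray(effects, float), np.asarray(variances, float)
    w = 1.0 / variances
    mu = np.sum(w * effects) / np.sum(w)
    return mu, 1.0 / np.sum(w)  # combined effect and its variance

def random_effects(effects, variances):
    """REM via the DerSimonian-Laird estimate of between-study variance tau^2."""
    effects, variances = np.asarray(effects, float), np.asarray(variances, float)
    w = 1.0 / variances
    mu_f = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - mu_f) ** 2)      # Cochran's Q heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    w_star = 1.0 / (variances + tau2)          # weights inflated by heterogeneity
    return np.sum(w_star * effects) / np.sum(w_star), 1.0 / np.sum(w_star)
```

When the study effects are homogeneous, τ² is estimated as 0 and REM reduces exactly to FEM; heterogeneous effects inflate τ² and widen the combined variance.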
Combine rank statistics
RankProd (RP) and RankSum (RS) RankProd and RankSum are based on the common biological belief that if a gene is repeatedly at the top of the lists ordered by up- or down-regulation fold change in replicate experiments, the gene is more likely a DE gene [19]. Detailed formulation and algorithms are available in the Additional file 1.
Product of ranks (PR) and Sum of ranks (SR) These two methods apply a naïve product or sum of the DE evidence ranks across studies [20]. Suppose R_{ gk } represents the rank of the p-value of gene g among all genes in study k. The test statistics of the PR and SR methods are calculated as $P{R}_{g}={\prod}_{k=1}^{K}{R}_{\mathit{\text{gk}}}$ and $S{R}_{g}={\sum}_{k=1}^{K}{R}_{\mathit{\text{gk}}},$ respectively. P-values of the test statistics can be calculated analytically or obtained from a permutation analysis. Note that in the PR method, ranking p-values from smallest to largest (the choice in this method) is more sensitive than ranking from largest to smallest, while the direction makes no difference to SR.
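The two statistics can be sketched directly from a genes-by-studies matrix of p-values; a minimal Python illustration (ranking smallest p-value as rank 1, as in the method):

```python
import numpy as np
from scipy import stats

def rank_stats(pmat):
    """pmat: genes x studies matrix of p-values. Within each study, rank
    p-values from smallest (rank 1) to largest, then combine across studies
    by product (PR) or sum (SR); small statistics indicate DE evidence."""
    ranks = np.apply_along_axis(stats.rankdata, 0, np.asarray(pmat))
    return ranks.prod(axis=1), ranks.sum(axis=1)
```

For a toy 3-gene, 2-study matrix, a gene ranked first in both studies gets PR = 1 and SR = 2, the smallest possible values.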
Characterization of meta-analysis methods
MDS plots to characterize the methods
The multi-dimensional scaling (MDS) plot is a useful visualization tool for exploring high-dimensional data in a low-dimensional space [21]. In the evaluation of 12 meta-analysis methods, we calculated the adjusted DE similarity measure for every pair of methods to quantify the similarity of their DE analysis results in a given example. A dissimilarity measure is then defined as one minus the adjusted DE similarity measure and the dissimilarity measure is used to generate an MDS plot of the 12 methods. In the MDS plot, methods that are clustered in a neighborhood indicate that they produce similar DE analysis results.
Entropy measure to characterize data sets
- 1.
Apply Fisher’s meta-analysis method to combine p-values across studies and identify the top H candidate markers. Here we used H = 1,000, where H represents our rough belief of the number of DE genes contained in the data.
- 2.
For each selected marker, the standardized minus-log p-value score for gene g in the k-th study is defined as ${l}_{\mathit{\text{gk}}}=-\text{log}\left({p}_{\mathit{\text{gk}}}\right)/\left(-{\sum}_{{k}^{\prime}=1}^{K}\text{log}\left({p}_{g{k}^{\prime}}\right)\right).$ Note that 0 ≤ l _{ gk } ≤ 1, a large l _{ gk } corresponds to a more significant p-value p _{ gk }, and ${\sum}_{k=1}^{K}{l}_{\mathit{\text{gk}}}=1.$
- 3.
The entropy of gene g is defined as ${e}_{g}=-{\sum}_{k=1}^{K}{l}_{\mathit{\text{gk}}}\phantom{\rule{0.12em}{0ex}}\text{log}\left({l}_{\mathit{\text{gk}}}\right)$. Box-plots of entropies of the top H genes are generated for each meta-analysis application (Figure 1(b)).
Intuitively, a high entropy value indicates that the gene has small p-values in all or most studies and is of HS_{ A } or HS_{ r }-type. Conversely, genes with small entropy have small p-values in one or only few studies where HS_{ B }-type methods are more adequate. When calculating l_{ gk } in step 2, we capped -log(p_{ gk }) at 10 to avoid contributions of close-to-zero p-values that can generate near-infinite scores. The entropy box-plot helps determine an appropriate meta-analysis hypothesis setting if no pre-set biological objective exists.
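The three steps above amount to a short computation per gene. A Python sketch, applying the same cap of 10 on -log(p) described in the text:

```python
import numpy as np

def gene_entropy(pvals, cap=10.0):
    """Entropy of a gene's standardized -log p-value profile across K studies.
    -log(p) is capped (here at 10) so near-zero p-values do not dominate."""
    score = np.minimum(-np.log(np.asarray(pvals)), cap)
    l = score / score.sum()          # l_gk: standardized scores summing to 1
    l = l[l > 0]                     # treat 0 * log(0) as 0
    return float(-np.sum(l * np.log(l)))
```

A gene with equally small p-values in all K studies reaches the maximal entropy log(K), while a gene driven by a single study has entropy near 0, matching the interpretation above.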
Evaluation criteria
For objective quantitative evaluation, we developed the following four statistical criteria to benchmark performance of the methods.
Detection capability
Biological association
The second criterion requires that a good meta-analysis method detect a DE gene list that is well associated with pre-defined "gold standard" pathways related to the targeted disease. Such a "gold standard" pathway set should ideally be obtained from biological knowledge of the disease or biological mechanism under investigation. However, since most diseases and biological mechanisms are not well studied, obtaining such "gold standard" pathways is either difficult or questionable. To facilitate this evaluation without bias, we developed a computational, data-driven approach to determine a set of surrogate disease-related pathways out of a large collection of pathways by combining pathway enrichment analysis results from each single study. Specifically, we first collected 2,287 pathways (gene sets) from MSigDB (http://www.broadinstitute.org/gsea/msigdb/): 1,454 pathways from "GO", 186 from "KEGG", 217 from "BIOCARTA" and 430 from "REACTOME". We filtered out pathways with fewer than 5 genes or more than 200 genes, leaving 2,113 pathways for the analysis. DE analysis was performed in each single study separately and pathway enrichment analysis was performed for all 2,113 pathways by the Kolmogorov-Smirnov (KS) association test. Denote by p_{ uk } the resulting pathway enrichment p-value from the KS test for pathway u (1 ≤ u ≤ 2,113) and study k (1 ≤ k ≤ K). For a given study k, enrichment ranks over pathways were calculated as r_{ uk } = rank_{ u }(p_{ uk }). A rank-sum score for a given pathway u was then derived as ${S}_{u}={\sum}_{k=1}^{K}{r}_{\mathit{\text{uk}}}.$ Intuitively, a pathway with a small rank-sum score is likely associated with the disease outcome by the aggregated evidence of the K individual study analyses.
We chose the top |D| pathways with the smallest rank-sum scores as the surrogate disease-related pathways and used them to proceed with the biological association evaluation of the meta-analysis methods as follows.
Given the selected surrogate pathways D, the following procedure was used to evaluate the performance of the 12 meta-analysis methods for a given example e (1 ≤ e ≤ 6). For each meta-analysis method m (1 ≤ m ≤ M = 12), the DE analysis result was associated with each surrogate pathway d and the resulting enrichment p-value by the KS test was denoted by ${\tilde{P}}_{\mathit{\text{med}}}\phantom{\rule{0.5em}{0ex}}\left(1\le d\le \left|D\right|\right).$ The rank of ${\tilde{P}}_{\mathit{\text{med}}}$ for method m among the 12 methods was denoted by ${v}_{\mathit{\text{med}}}={\text{rank}}_{m}\phantom{\rule{0.12em}{0ex}}\left({\tilde{P}}_{\mathit{\text{med}}}\right).$ Similar to the detection capability evaluation, we calculated the mean standardized rank (MSR) for method m and example e as ${\text{MSR}}_{\mathit{\text{me}}}={\sum}_{d=1}^{\left|D\right|}\left({v}_{\mathit{\text{med}}}/M\right)/\left|D\right|$ and the aggregated standardized rank (ASR) as ${\text{ASR}}_{m}={\sum}_{e=1}^{6}{\text{MSR}}_{\mathit{\text{me}}}/6,$ representing the overall performance of method m. To select the parameter |D| for the surrogate disease-related pathways, Additional file 1: Figure S4 shows the trend of MSR_{ me } (on the y-axis) versus |D| (on the x-axis) as |D| increases. The result indicated that the performance evaluation was only minimally impacted by the choice of |D| when |D| > 30. We chose |D| = 100 throughout this paper.
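The MSR computation for one example can be sketched as follows; the toy input matrix here is hypothetical, with methods in rows and surrogate pathways in columns:

```python
import numpy as np
from scipy import stats

def msr_per_example(pmat):
    """pmat: methods x pathways matrix of enrichment p-values for one example.
    Rank each pathway's p-values across methods (v_med), standardize by the
    number of methods M, and average over the |D| pathways to get MSR_me."""
    pmat = np.asarray(pmat)
    ranks = np.apply_along_axis(stats.rankdata, 0, pmat)  # rank across methods
    return (ranks / pmat.shape[0]).mean(axis=1)

# ASR_m is then simply the mean of MSR_me over the six examples.
```

A method that consistently produces the smallest enrichment p-values gets the smallest MSR (here 1/M), i.e. a lower MSR/ASR indicates stronger biological association.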
Note that we used the KS test, instead of the popular Fisher’s exact test, because each single study detected a variable number of DE genes under a given FDR cutoff and the Fisher’s exact test is usually not powerful unless a few hundred DE genes are detected. In addition, the KS test does not require an arbitrary p-value cutoff to determine the DE gene list for enrichment analysis.
Stability
The third criterion examines whether a meta-analysis method generates a stable DE analysis result. To achieve this goal, we randomly split the samples of each study into halves (so that cases and controls are as equally split as possible). The first half of each study was used to perform a first meta-analysis and generate a DE analysis result; similarly, the second half of each study was used to perform a second meta-analysis. The DE analysis results from the two separate meta-analyses were compared by the adjusted DE similarity measure (described in the next section). The procedure was repeated B = 50 times. Denote by S_{ meb } the adjusted DE similarity measure of method m in the b^{th} split of example e. Similar to the first two criteria, MSR and ASR were calculated based on S_{ meb } to evaluate the methods.
Robustness
The final criterion investigates the robustness of a meta-analysis method when an outlying irrelevant study is mistakenly added to the meta-analysis. For each of the six real examples, we randomly picked one irrelevant study from the other five examples, added it to the example for meta-analysis and evaluated the change from the original meta-analysis. The adjusted DE similarity measure was calculated between the original meta-analysis and the new meta-analysis with the added outlier. A high adjusted DE similarity measure indicates better robustness against inclusion of the outlying study. This procedure was repeated until all irrelevant studies had been used. The MSR and ASR were then calculated based on the adjusted DE similarity measures to evaluate the methods.
Similarity measure between two ordered DE gene lists
This measure ranges between -1 and 1 and has an expected value of 0 if the two ordered gene lists are generated by random chance. Yang et al. proposed resampling-based and ROC methods to estimate the best choice of α. Since the number of DE genes in our examples is generally high, we chose a relatively small α = 0.001 throughout this paper. We tested different values of α and found that the results were similar (Additional file 1: Figure S7).
Results
Simulation setting
- 1.
We simulated 800 genes in 40 gene clusters (20 genes in each cluster); the other 1,200 genes do not belong to any cluster. The cluster indexes C _{ g } for gene g (1 ≤ g ≤ 2, 000) were randomly sampled such that ∑ I{C _{ g } = 0} = 1, 200 and ∑ I{C _{ g } = c} = 20 for 1 ≤ c ≤ 40.
- 2.
For genes in cluster c (1 ≤ c ≤ 40) and in study k (1 ≤ k ≤ 5), we sampled ${\mathrm{\Sigma}}_{\mathit{\text{ck}}}^{\prime}\sim {W}^{-1}\left(\mathrm{\Psi},60\right),$ where Ψ = 0.5I _{20 × 20} + 0.5J _{20 × 20}, W ^{- 1} denotes the inverse Wishart distribution, I is the identity matrix and J is the matrix with all elements equal to 1. We then standardized ${\mathrm{\Sigma}}_{\mathit{\text{ck}}}^{\prime}$ into Σ_{ ck } so that the diagonal elements are all 1’s.
- 3.
The 20 genes in cluster c were denoted by the indexes g _{c1}, …, g _{c20}, i.e. ${C}_{{g}_{\mathit{\text{cj}}}}=c,\text{where}\phantom{\rule{0.25em}{0ex}}1\le c\le 40\phantom{\rule{0.25em}{0ex}}\text{and}\phantom{\rule{0.37em}{0ex}}1\le j\le 20.$ We sampled the expression levels of the genes in cluster c for sample n as ${\left({X}_{{g}_{c1}\mathit{\text{nk}}}^{\prime},\dots ,{X}_{{g}_{c20}\mathit{\text{nk}}}^{\prime}\right)}^{T}\sim \mathit{\text{MVN}}\left(0,{\mathrm{\Sigma}}_{\mathit{\text{ck}}}\right)$, where 1 ≤ n ≤ 100 and 1 ≤ k ≤ 5. For a gene g not in any cluster, we sampled its expression level for sample n as ${X}_{\mathit{\text{gnk}}}^{\prime}\sim N\left(0,{\sigma}_{k}^{2}\right)$, where 1 ≤ n ≤ 100, 1 ≤ k ≤ 5 and ${\sigma}_{k}^{2}$ was uniformly distributed on [0.8, 1.2], allowing a different variance for each study k.
- 4.
For the first 1,000 genes (1 ≤ g ≤ 1, 000), k _{ g } (the number of studies in which gene g is differentially expressed) was generated by sampling k _{ g } uniformly from {1, 2, 3, 4, 5}. For the remaining 1,000 genes (1, 001 ≤ g ≤ 2, 000), k _{ g } = 0, representing non-DE genes in all five studies.
- 5.
To simulate expression intensities for cases, we randomly sampled δ _{ gk } ∈ {0, 1} such that ∑ _{ k } δ _{ gk } = k _{ g }. If δ _{ gk } = 1, gene g in study k was a DE gene; otherwise it was a non-DE gene. When δ _{ gk } = 1, we sampled the effect size μ _{ gk } from a uniform distribution on [0.5, 3], i.e. we considered concordant (up-regulated) effects among all simulated studies. Hence, the expression for control samples is ${X}_{\mathit{\text{gnk}}}={X}_{\mathit{\text{gnk}}}^{\prime},$ and for case samples ${Y}_{\mathit{\text{gnk}}}={X}_{g\left(n+50\right)k}^{\prime}+{\mu}_{\mathit{\text{gk}}}\cdot {\delta}_{\mathit{\text{gk}}},$ for 1 ≤ g ≤ 2, 000, 1 ≤ n ≤ 50 and 1 ≤ k ≤ 5.
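The DE structure of steps 4-5 can be sketched as follows. This simplified Python illustration omits the cluster covariance of steps 1-3 (all genes are drawn independently) and is not the authors' simulation code:

```python
import numpy as np

def simulate_de_structure(G=2000, K=5, n=50, seed=0):
    """Steps 4-5 (simplified): the first G/2 genes are DE in k_g of the K
    studies (k_g uniform on 1..K), the rest are null in all studies; the
    cluster covariance of steps 1-3 is omitted for brevity."""
    rng = np.random.default_rng(seed)
    kg = np.zeros(G, dtype=int)
    kg[: G // 2] = rng.integers(1, K + 1, size=G // 2)
    # delta_gk in {0,1} with row sums equal to k_g
    delta = np.zeros((G, K), dtype=int)
    for g in range(G):
        de_studies = rng.choice(K, size=kg[g], replace=False)
        delta[g, de_studies] = 1
    # concordant up-regulated effect sizes mu_gk ~ Unif[0.5, 3] where delta = 1
    mu = rng.uniform(0.5, 3.0, size=(G, K)) * delta
    controls = rng.normal(0.0, 1.0, size=(G, K, n))
    cases = rng.normal(0.0, 1.0, size=(G, K, n)) + mu[:, :, None]
    return kg, delta, cases, controls
```

For the discordant setting of the second simulation, the `mu` line would instead draw each non-zero effect from Unif[-3, -0.5] with probability π_gk = 0.2 and from Unif[0.5, 3] otherwise.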
In the simulation study, we had 1,000 genes that were non-DE in all five studies (k_{ g } = 0) and 1,000 genes that were differentially expressed in 1-5 studies (k_{ g } = 1, 2, 3, 4, 5). On average, we had roughly the same number (~200) of genes in each group of k_{ g } = 1, 2, 3, 4, 5. See Additional file 1: Figure S2 for the heatmap of a simulated example (red colour represents up-regulated genes). We applied the 12 meta-analysis methods under FDR control at 5%. With knowledge of the true k_{ g }, we derived the sensitivity and specificity for HS_{ A } and HS_{ B }, respectively. In HS_{ A }, genes with k_{ g } = 5 were the underlying true positives and genes with k_{ g } = 0 ~ 4 were the underlying true negatives; in HS_{ B }, genes with k_{ g } = 1 ~ 5 were the underlying true positives and genes with k_{ g } = 0 were the true negatives. By adjusting the decision cut-off, receiver operating characteristic (ROC) curves and the resulting area under the curve (AUC) were used to evaluate performance. We simulated 50 data sets and reported the means and standard errors of the AUC values. AUC values range between 0 and 1: AUC = 0.5 represents a random guess and AUC = 1 represents perfect prediction. The above simulation scheme only considered concordant effect sizes (i.e. up-regulation whenever a gene is DE in a study) among the five simulated studies. In many applications, some genes may reach p-value significance in the meta-analysis while the effect sizes are discordant (i.e. a gene is up-regulated in one study but down-regulated in another). To investigate this effect, we performed a second simulation that considers random discordant cases. In step 5, μ_{ gk } became a mixture of two uniform distributions: ${\pi}_{\mathit{\text{gk}}}\cdot \text{Unif}\left[-3,-0.5\right]+\left(1-{\pi}_{\mathit{\text{gk}}}\right)\cdot \text{Unif}\left[0.5,3\right]$, where π_{ gk } is the probability that gene g (1 ≤ g ≤ 2, 000) in study k (1 ≤ k ≤ 5) has a discordant (down-regulated) effect size. We set π_{ gk } = 0.2 for the discordant simulation setting.
Simulation results to characterize the methods
The detected number of DE genes (at FDR = 5%), the true FDR, AUC values under HS _{ A } and HS _{ B } and the concluding characterization of targeted hypothesis setting of each method
 | maxP | rOP | minP | Fisher | AW | Stouffer |
---|---|---|---|---|---|---|
Detected # | 321 | 522 | 1005 | 1000 | 1000 | 974 |
(se) | (2.2) | (2.35) | (0.85) | (1.06) | (1.05) | (1.5) |
True FDR (HS_{ A }) | .068 | .018 | .447 | .444 | .444 | .43 |
(se) | (.0008) | (.0012) | (.0006) | (.0007) | (.0008) | (.0009) |
True FDR (HS_{ B }) | .007 | .011 | .016 | .017 | .016 | .022 |
(se) | (.0005) | (.0004) | (.0006) | (.0006) | (.0007) | (.0006) |
AUC (HS_{ A }) | .996 | .964 | .8 | .82 | .79 | .89 |
(se) | (.0003) | (.0014) | (.0005) | (.0005) | (.0005) | (.0006) |
AUC (HS_{ B }) | .75 | .833 | .99 | .99 | .99 | .99 |
(se) | (.0013) | (.01) | (.0001) | (.0001) | (.0001) | (.0005) |
Characterization | HS _{ A } | HS _{ r } | HS _{ B } | HS _{ B } | HS _{ B } | HS _{ B } |
 | PR | SR | FEM | REM | RankProd | RankSum |
---|---|---|---|---|---|---|
Detected # | 136 | 186 | 948 | 411 | 391 | 105 |
(se) | (2.51) | (2.3) | (1.75) | (2.86) | (3.31) | (1.514) |
True FDR (HS_{ A }) | .008 | .01 | .415 | .117 | .13 | .389 |
(se) | (.0003) | (.0004) | (.0009) | (.0015) | (.0014) | (.0008) |
True FDR (HS_{ B }) | 0 | 0 | .022 | .007 | 0 | 0 |
(se) | (0) | (0) | (.0007) | (.0004) | (0) | (0) |
AUC (HS_{ A }) | .986 | .99 | .917 | .99 | .916 | .504 |
(se) | (.0003) | (.0002) | (.0009) | (.0002) | (.0011) | (.0046) |
AUC (HS_{ B }) | .981 | .95 | .984 | .92 | .934 | .496 |
(se) | (.0004) | (.0008) | (.0004) | (.0011) | (.0012) | (.0025) |
Characterization | HS _{ A } | HS _{ A } | HS _{ B } | HS _{ r } | HS _{ B } | HS _{ B } |
Results of the four evaluation criteria
Detection capability
Figure 2 shows the number of DE genes identified by each of the 12 meta-analysis methods (FDR = 10% for MDD and breast cancer due to their weak signals and FDR = 1% for all the others). Each plot shows the mean with standard error bars over 50 bootstrapped data sets. Additional file 1: Table S4 shows the MSR and ASR for each method in the six examples. The methods in Figure 2 are ordered according to their ASR values. The top six methods with the strongest detection capability were those that detected HS_{ B }-type DE genes according to the conclusion of Table 1: Fisher, AW, Stouffer, minP, FEM and RankSum. The performance order of these six methods was fairly consistent across all six examples. The next four methods were rOP, RankProd, maxP and REM, which targeted either HS_{ r } or HS_{ A }. PR and SR had the weakest detection capability, consistent with the simulation result in Table 1.
Biological association
Stability
Robustness
Characterization of methods by MDS plots
We applied the adjusted DE similarity measure to quantify the similarity of the DE gene orderings from any two meta-analysis methods. The resulting dissimilarity measure (i.e. one minus the adjusted similarity measure) was used to construct the multidimensional scaling (MDS) plot, showing the similarity/dissimilarity structure of the 12 methods in a two-dimensional space. When two methods are close to each other, they generated similar DE gene orderings. The MDS plots from the six examples showed quite consistent patterns (Additional file 1: Figure S6). Figure 1(a) shows an aggregated MDS plot in which the input dissimilarity matrix is averaged over the six examples. We clearly observed that Fisher, AW, Stouffer, minP, PR and SR were consistently clustered together in all six individual plots and the aggregated MDS plot (labeled in red). This is not surprising given that these methods all sum transformed p-value evidence across studies (except for minP). Two methods that combine effect sizes and two methods that combine ranks (FEM, REM, RankProd and RankSum, labeled in blue) were consistently clustered together. Finally, the maxP and rOP methods formed a third, looser cluster (labeled in green).
Characterization of data sets by entropy measure
From the simulation study, the selection of the most suitable meta-analysis method depends on the hypothesis setting behind the methods. The choice of hypothesis setting mostly depends on the biological purpose of the analysis; that is, whether one aims to detect candidate markers differentially expressed in "all" (HS_{ A }), "most" (HS_{ r }) or "one or more" (HS_{ B }) studies. However, when no biological prior information or preference exists, the entropy measure can be used to determine the choice of hypothesis setting objectively. The analysis identifies the top 1,000 genes from Fisher’s meta-analysis method and calculates the gene-specific entropy of each gene. When the entropy is small, the p-values are small in only one or very few studies; conversely, when the entropy is large, most or all of the studies have small p-values. Figure 1(b) shows box-plots of the entropies of the top 1,000 candidate genes identified by Fisher’s method in the six data sets. The result shows that the prostate cancer example comparing primary and metastatic tumor samples had the smallest entropy values, indicating high heterogeneity across the three studies and that HS_{ B } should be considered in the meta-analysis. On the other hand, MDD had the highest entropy values: although the signals of each MDD study were very weak, they were rather consistent across studies, so HS_{ A } or HS_{ r } was adequate. For the other examples, we suggest using the robust HS_{ r } unless another prior biological purpose is indicated.
Conclusions and discussions
An application guideline for practitioners
Table 2 Ranks of method performance in the four evaluation criteria

| Method | Targeted HS | Detection capability | Biological association | Stability | Robustness | Rank sum | MDS^{*1} |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PR | HS_{ A } | 12 | 4 | 4 | 6 | 26 | 1 |
| SR | HS_{ A } | 11 | 6 | 9 | 7 | 33 | 1 |
| maxP | HS_{ A } | 9 | 10 | 12 | 11 | 42 | 2 |
| rOP | HS_{ r } | 7 | 5 | 10 | 10 | 32 | 2 |
| REM | HS_{ r } | 10 | 11 | 5 | 8 | 34 | 3 |
| Fisher | HS_{ B } | 1 | 2 | 3 | 3 | 9 | 1 |
| AW | HS_{ B } | 2 | 3 | 6 | 2 | 13 | 1 |
| Stouffer | HS_{ B } | 3 | 1 | 8 | 4 | 16 | 1 |
| minP | HS_{ B } | 4 | 7 | 7 | 1 | 19 | 1 |
| RankProd | HS_{ B } | 8 | 8 | 1 | 5 | 22 | 3 |
| RankSum | HS_{ B } | 6 | 12 | 2 | 12 | 32 | 3 |
| FEM | HS_{ B } | 5 | 9 | 11 | 9 | 34 | 3 |
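The p-value combination methods ranked above differ mainly in how they aggregate a gene's K study-level p-values. A self-contained sketch of four of the underlying statistics in their standard textbook forms (not code from the paper):

```python
import numpy as np
from statistics import NormalDist  # stdlib normal quantile, Python 3.8+

def fisher_stat(p):
    """Fisher: sum of -2*log(p); large values are significant
    (chi-square with 2K degrees of freedom under the null)."""
    return float(-2 * np.log(p).sum())

def stouffer_stat(p):
    """Stouffer: sum of probit-transformed p-values, standardized by sqrt(K)."""
    z = [NormalDist().inv_cdf(1 - pi) for pi in p]
    return float(sum(z) / np.sqrt(len(p)))

def minp_stat(p):
    """minP: smallest p-value; targets HS_B (signal in >= 1 study)."""
    return float(min(p))

def maxp_stat(p):
    """maxP: largest p-value; targets HS_A (signal in all studies)."""
    return float(max(p))

# Hypothetical p-values for one gene across K = 3 studies.
p = np.array([0.001, 0.02, 0.3])
```

All four take identical input, which is why their relative behavior under the three hypothesis settings, rather than implementation, drives the rankings in Table 2.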
Below, we provide a general guideline for practitioners applying microarray meta-analysis. Data sets relevant to the biological or disease hypothesis are first identified, preprocessed and annotated according to Steps (i)-(v) in Ramasamy et al. Proper quality assessment should then be performed to exclude studies of problematic quality (e.g. with the aid of MetaQC, as we did in the six examples). Based on the experimental design and biological objectives of the collected data, one should determine whether the meta-analysis aims to identify biomarkers differentially expressed in all studies (HS_{ A }), in one or more studies (HS_{ B }) or in the majority of studies (HS_{ r }). In general, if higher heterogeneity is expected from, say, heterogeneous experimental protocols, cohorts or tissues, HS_{ B } should be considered. For example, if the combined studies come from different tissues (e.g. the first study uses peripheral blood, the second uses muscle tissue, and so on), tissue-specific markers may be expected and HS_{ B } should be applied. Conversely, if the collected studies are relatively homogeneous (e.g. they use the same array platform or come from the same lab), HS_{ r } is generally recommended, as it provides robustness and detects signals consistent across the majority of studies. When no prior knowledge is available to choose a hypothesis setting, or when the researcher prefers a data-driven decision, the entropy measure in Figure 1(b) can be applied and the resulting box-plot compared to the six examples in this paper to guide the decision. Once the hypothesis setting is determined, a meta-analysis method can be selected based on the discussion above and Table 2.
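The decision flow above can be summarized as a small helper. This is only a heuristic encoding of the guideline; the entropy thresholds are illustrative placeholders, not values from the paper:

```python
def suggest_hypothesis_setting(heterogeneous_sources, mean_entropy=None,
                               low=0.6, high=0.9):
    """Map the guideline to a hypothesis setting.
    heterogeneous_sources: True if studies differ in tissue/platform/cohort.
    mean_entropy: optional average gene-specific entropy of top candidates.
    low/high: illustrative thresholds (assumptions, not from the paper)."""
    if heterogeneous_sources:
        return "HS_B"              # expect study-specific markers
    if mean_entropy is None:
        return "HS_r"              # robust default for homogeneous studies
    if mean_entropy < low:
        return "HS_B"              # signal concentrated in few studies
    if mean_entropy > high:
        return "HS_A"              # consistent signal across all studies
    return "HS_r"
```

In practice the data-driven branch would compare the entropy box-plot against the six examples in this paper rather than fixed cutoffs.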
Conclusions
In this paper, we performed a comprehensive comparative study evaluating 12 microarray meta-analysis methods using simulation and six real examples under four evaluation criteria. We clarified the three hypothesis settings implicitly assumed behind the methods. The evaluation results yield a practical guideline informing biologists of the best choice of method(s) in real applications.
With the reduced cost of high-throughput experiments, data from microarrays, new sequencing techniques and mass spectrometry are accumulating rapidly in the public domain. Integration of multiple data sets has become a routine approach to increase statistical power, reduce false positives and provide more robust and validated conclusions. The evaluation in this paper focuses on microarray meta-analysis, but the principles and messages apply to other types of genomic meta-analysis (e.g. GWAS, methylation, miRNA and eQTL). As next-generation sequencing technology becomes more affordable, sequencing data will become more prevalent and similar meta-analysis techniques will apply. For these different types of genomic meta-analysis, similarly comprehensive evaluations could be performed and application guidelines established.
Declarations
Acknowledgements
This work was supported by the National Institutes of Health [R21MH094862].
References
- Ramasamy A, Mondry A, Holmes CC, Altman DG: Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Med. 2008, 5(9): e184. doi:10.1371/journal.pmed.0050184
- Kang DD, Sibille E, Kaminski N, Tseng GC: MetaQC: objective quality control and inclusion/exclusion criteria for genomic meta-analysis. Nucleic Acids Res. 2012, 40(2): e15. doi:10.1093/nar/gkr1071
- Tseng GC, Ghosh D, Feingold E: Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic Acids Res. 2012, 40(9): 3785-3799. doi:10.1093/nar/gkr1265
- Hong F, Breitling R: A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics. 2008, 24(3): 374-382. doi:10.1093/bioinformatics/btm620
- Campain A, Yang YH: Comparison study of microarray meta-analysis methods. BMC Bioinformatics. 2010, 11: 408. doi:10.1186/1471-2105-11-408
- Birnbaum A: Combining independent tests of significance. J Am Stat Assoc. 1954, 49: 559-574.
- Li J, Tseng GC: An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies. Ann Appl Stat. 2011, 5: 994-1019. doi:10.1214/10-AOAS393
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98(9): 5116-5121. doi:10.1073/pnas.091062498
- Pesarin F, Salmaso L: Permutation tests for complex data: theory, applications and software. 2010, Wiley.
- Cox DR: Regression models and life-tables. J R Stat Soc Ser B. 1972, 34(2): 187-220.
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995, 57(1): 289-300.
- Wang X, Kang DD, Shen K, Song C, Lu S, Chang LC, Liao SG, Huo Z, Tang S, Ding Y, et al: An R package suite for microarray meta-analysis in quality control, differentially expressed gene analysis and pathway enrichment detection. Bioinformatics. 2012, 28(19): 2534-2536. doi:10.1093/bioinformatics/bts485
- Fisher RA: Statistical methods for research workers. 1925, Edinburgh: Oliver and Boyd.
- Stouffer SA: A study of attitudes. Sci Am. 1949, 180(5): 11-15. doi:10.1038/scientificamerican0549-11
- Tippett LHC: The methods of statistics: an introduction mainly for workers in the biological sciences. 1931, London: Williams & Norgate.
- Wilkinson B: A statistical consideration in psychological research. Psychol Bull. 1951, 48(3): 156-158.
- Song C, Tseng GC: Order statistics for robust genomic meta-analysis. Ann Appl Stat. 2012, accepted.
- Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation. Bioinformatics. 2003, 19(Suppl 1): i84-i90. doi:10.1093/bioinformatics/btg1010
- Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J: RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics. 2006, 22(22): 2825-2827. doi:10.1093/bioinformatics/btl476
- Dreyfuss JM, Johnson MD, Park PJ: Meta-analysis of glioblastoma multiforme versus anaplastic astrocytoma identifies robust gene markers. Mol Cancer. 2009, 8: 71. doi:10.1186/1476-4598-8-71
- Borg I: Modern multidimensional scaling: theory and applications. 2005, New York: Springer.
- Martin NF, England JW: Mathematical theory of entropy, vol. 12. 2011, Cambridge: Cambridge University Press.
- Wu W, Dave N, Tseng GC, Richards T, Xing EP, Kaminski N: Comparison of normalization methods for CodeLink Bioarray data. BMC Bioinformatics. 2005, 6: 309. doi:10.1186/1471-2105-6-309
- Spearman C: The proof and measurement of association between two things. Am J Psychol. 1904, 15: 72-101. doi:10.2307/1412159
- Li Q, Brown JB, Huang H, Bickel PJ: Measuring reproducibility of high-throughput experiments. Ann Appl Stat. 2011, 5(3): 1752-1779. doi:10.1214/11-AOAS466
- Yang X, Bentink S, Scheid S, Spang R: Similarities of ordered gene lists. J Bioinform Comput Biol. 2006, 4(3): 693-708. doi:10.1142/S0219720006002120
- Hubert L, Arabie P: Comparing partitions. J Classification. 1985, 2: 193-218. doi:10.1007/BF01908075
- Owen AB: Karl Pearson's meta-analysis revisited. Ann Stat. 2009, 37(6B): 3867-3892. doi:10.1214/09-AOS697
- Breitling R, Herzyk P: Rank-based methods as a non-parametric alternative of the T-statistic for the analysis of biological microarray data. J Bioinform Comput Biol. 2005, 3(5): 1171-1189. doi:10.1142/S0219720005001442
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.