Meta-analysis based on weighted ordered P-values for genomic data with heterogeneity
© Li and Ghosh; licensee BioMed Central Ltd. 2014
Received: 17 April 2014
Accepted: 11 June 2014
Published: 28 June 2014
Skip to main content
© Li and Ghosh; licensee BioMed Central Ltd. 2014
Received: 17 April 2014
Accepted: 11 June 2014
Published: 28 June 2014
Meta-analysis has become increasingly popular in recent years, especially in genomic data analysis, due to the fast growth of available data and studies that target the same questions. Many methods have been developed, including classical ones such as Fisher’s combined probability test and Stouffer’s Z-test. However, not all meta-analyses have the same goal in mind. Some aim at combining information to find signals in at least one of the studies, while others hope to find more consistent signals across the studies. While many classical meta-analysis methods are developed with the former goal in mind, the latter goal has much more practicality for genomic data analysis.
In this paper, we propose a class of meta-analysis methods based on summaries of weighted ordered p-values (WOP) that aim at detecting significance in a majority of studies. We consider weighted versions of classical procedures such as Fisher’s method and Stouffer’s method where the weight for each p-value is based on its order among the studies. In particular, we consider weights based on the binomial distribution, where the median of the p-values are weighted highest and the outlying p-values are down-weighted. We investigate the properties of our methods and demonstrate their strengths through simulations studies, comparing to existing procedures. In addition, we illustrate application of the proposed methodology by several meta-analysis of gene expression data.
Our proposed weighted ordered p-value (WOP) methods displayed better performance compared to existing methods for testing the hypothesis that there is signal in the majority of studies. They also appeared to be much more robust in applications compared to the rth ordered p-value (rOP) method (Song and Tseng, Ann. Appl. Stat. 2014, 8(2):777–800). With the flexibility of incorporating different p-value combination methods and different weighting schemes, the weighted ordered p-values (WOP) methods have great potential in detecting consistent signal in meta-analysis with heterogeneity.
Meta-analysis has long been used to integrate data and/or results from multiple studies targeting the same questions. It is commonly used in many areas of statistical applications such as clinical studies and psychology experiments. In recent years, meta-analysis has been frequently adopted in genomic data analysis, due to the fast development of high-throughput technology and the vast amounts of data available in public databases.
Many meta-analysis methods have been developed throughout the years. Roughly speaking, there are two main approaches to meta-analysis methods Song and Tseng . The first directly combines the p-values from the studies, while the second attempts to model the data or the effect sizes from the combined studies. The former includes methods such as the Fisher’s combined probability test  and the Stouffer’s Z-test , as well as weighted variations of these classical tests. The latter includes a variety of fixed effects and random effects models, such as GeneMeta . Each approach has its advantages and disadvantages. The p-value combining methods are relatively flexible in that they require minimal information and assumptions from the studies. In this paper we will mainly focus on methods that directly combine the p-values.
Most of the traditional meta-analysis methods (e.g. Fisher’s and Stouffer’s methods) aim at testing the alternative hypothesis that at least one of the studies is non-null. While this aligns with earlier goals of meta-analyses to gain power in detecting signals by combining multiple studies, it is frequently not the case with genomic data. In meta-analysis of genomic studies for example, the goal is often to identify genes that are differentially expressed in a consistent pattern across multiple studies. The extreme of this would be to test for the alternative that the null can be rejected for all the studies. A solution to this extreme alternative dates back to the maxP method by Wilkinson . However, the maxP method is often considered too conservative. A recent approach by Phillips and Ghosh  improves the power of testing for this disjunction of nulls when the rejection of all p-values associated with a gene is required. In practical meta-analyses, the goal of rejecting all studies may be considered too extreme. Ideally, we would want to target at detecting consistent signals across studies while avoiding being overly exclusive. This issue has gained attention in recent years, and a number of authors have tried to address this problem. Benjamini and Heller  discussed a framework for testing partial conjunction hypotheses, where they test for the alternative that at least u out of n null hypotheses are false against the partial conjunction null that no more than n−u+1 of the null hypotheses are true. Song and Tseng  proposed the rth ordered p-value (rOP) method that aims at testing the alternative hypothesis that there is signal in at least a given percentage of studies. Other methods exist that address this problem from different approaches, such as RankProd by Hong et al.  that looks for consistently highly ranked genes, and a weighted approach by Li and Ghosh  that weights genes by its expression consistency across studies.
We consider the problem of detecting signals in the majority of studies. Our approach adopts one aspect of the rOP method Song and Tseng  in that we also consider ordered p-values. But instead of using a single rth ordered p-value as the statistic (as rOP does), we combine all or a subset of the ordered p-values while weighting them based on their order (weighted ordered p-values, WOP). P-values closer to the median are highly weighted and the smallest/largest p-values are down-weighted. The idea is that among the collection of p-values, the median p-values are likely to be a better reflection of the behavior of the majority of studies than the smallest or largest p-values. Olkin and Saner  discussed a trimmed Fisher’s procedure that leaves out a number of the smallest and/or largest p-values from the calculation of Fisher’s statistic to remove the effect of possible aberrant extremes. In our consideration, we still keep the smallest/largest p-values because they do carry certain information, but we down-weight them because they may be relatively less relevant when considering the “majority” of studies. To reflect the up-weighting of the medians and down-weighting of the extremes, we calculated our weights based on the binomial distribution. To summarize the weighted ordered p-values, we mainly considered Fisher’s statistic and Stouffer’s Z-test. We also explored some other statistics, such as generalized Fisher’s test by Lancaster . In general, other summary statistics can be used under this framework as well.
While many weighted variations of Fisher’s statistic and Stouffer’s statistic have been developed throughout the years, most of them distribute weights according to the sample sizes and/or effect sizes of the studies, or other similar considerations (e.g. Mosteller and Bush , Won et al. , Makambi ). Li and Tseng  proposed an interesting adaptively weighted statistic, where the weights are used to maximize the significance of the summary statistic. However, to the best of our knowledge, none of the weighting schemes are based on the ordered p-values, which is what makes our method unique. Xie et al.  discussed a meta-analysis approach using confidence distributions, where they incorporated the use of medians and kernel functions, but their approach is under a completely different framework from ours.
By incorporating more than one order statistic into our summary statistic, our method can be considered an expansion of the rth ordered p-value (rOP) method Song and Tseng . In general, both the rOP method and the original summary statistic we use (e.g. Fisher’s or Stouffer’s method) are special cases under the WOP framework - one having all the weight on a single ordered p-value and the other having evenly distributed weights. However, our WOP methods have respective advantages over both the traditional summary statistics and the rOP method. Compared to the traditional summary statistics, the WOP methods better focus on detecting signal in a majority of studies. On the other hand, the WOP methods appear to be more robust compared to the rOP method. These observations are based upon results from simulation studies as well as data applications.
When r=1, H S 1 is the classical setting of testing for non-zero effect size in at least one study against the conjunction of nulls. H S 1 is the hypothesis setting that Fisher’s method, Stouffer’s Z-test and many other traditional methods test for. When r=K, H S K tests for the alternative that all the studies have non-zero effect size. For instance, the maxP method  tests for H S K . When 1<r<K,H S r provides a compromise between the two aforementioned hypothesis settings, and tests for at least a pre-specified number of non-zero effects. For a given r, the rOP (rth ordered p-value) method Song and Tseng  is used to test for H S r .
In this paper, we test for non-zero effect sizes in a majority of studies against the null that the effects sizes are zero in all studies. Thus we are testing against the conjunction of nulls while trying to focus on a certain subset of the non-null space. In general, the hypothesis for our meta-analysis approach can be considered to fall under the H S r setting. Song and Tseng  suggested a few data-driven methods for selecting r. To prevent any potential issues of post hoc choices of r, we choose to fix r before any analysis is conducted. While it is hard to specify what the term “majority” exactly means, as a general rule, we choose to test the hypothesis setting H S m , where m=⌈K/2⌉, ⌈x⌉ being the smallest integer no less than x. Essentially we are targeting the alternative that at least half of the studies have non-zero effect sizes. Under our weighted ordered p-values (WOP) framework, we are able to develop methods for other hypothesis settings with r ranging from m+1 to K. But in general we hope to provide a simple to use method without having to put too much effort in selecting a particular r, and therefore we will focus mostly on testing H S m . We will later show through data applications that using our WOP methods for testing H S m provides more robust results compared to using the rOP method with different choices of r.
As mentioned by previous authors, many traditional p-value combination methods can be expressed in the general form of (for example, see Zaykin ). For instance, Stouffer’s Z-test takes H(·) to be the inverse normal function, while Fisher’s method has H(p i )=−2 log(p i ). The difference between T ′ and T, albeit subtle in notation, is the essence of the WOP framework. In the WOP framework, the weight w i is associated with the ith ordered p-value p (i). In other words, the ranking of a p-value in relation to the p-values from the other studies determines its weighting. In traditional weighted p-value combining methods, the weight assigned to a p-value is associated with the characteristics of that particular study, be it the sample size, the effect size, or other features.
Allowing the weights to depend on the ordering of the p-values opens up a whole new area of considerations when combining multiple p-values. We can consider giving more weights to the p-values that are closer to the center of the distribution of the p-values since they might hold more credibility, and at the same time down-weight the outlying p-values. For instance, if the entirety of weights is placed on the rth ordered p-value, the statistic reduces to the rOP method. However, using the WOP framework, if we highly weight the rth ordered p-value but still distribute some weight to the other p-values, the method can be viewed as a more robust version of the rOP method for testing H S r .
In this paper, we shall develop a few specific methods under the WOP framework. We consider weights based on the binomial distribution, which will be described in more detail in the next section. As for the p-value combining methods, we shall focus on Fisher’s method, with H F (p (i))=−2 log(p (i)), and Stouffer’s method, with H Z (p (i))=ϕ −1(1−p (i)), ϕ(·) being the standard normal distribution function.
In this section, we discuss two possible weighting schemes for the WOP framework. We mainly consider testing the alternative hypothesis that the effect sizes are non-zero in the majority of studies, which is the hypothesis setting H S m . We will also briefly discuss extending the weighting schemes to testing other hypothesis settings H S r , for m<r≤K.
Inspired by the rOP method, which uses the rth ordered p-value for testing the hypothesis setting H S r , we consider placing the highest weight on the median p-values for testing H S m . This makes intuitive sense, since if a consensus does exist among the studies, we have reason to believe that the behavior of the majority of studies should be best captured by the p-values that are closer to the center of the distribution. Since we do not insist on non-zero effect sizes for every single study, we consider down-weighting the largest p-values among the studies. On the other hand, p-value combining methods such as Fisher’s method are known to be very sensitive to single extremely small p-values, thus the smallest p-values should also be down-weighted, to avoid a small number of extremely small p-values biasing the results of the majority of studies. In summary, we would like our weighting scheme w i , as a function of i, to reflect a unimodal shape, with the highest weights being w m (when K is odd) or w m−1 and w m (when K is even), and such that w i decreases as i goes to 1 or K.
We will discuss more on the effects of these two different weighting schemes through simulation studies in later sections.
Under the null hypothesis, the p k ’s are assumed to follow a uniform (0,1) distribution. For Stouffer’s Z-test, H Z (p i )=ϕ −1(1−p i ) follows a standard normal distribution. Therefore the traditional weighted Z-test in the form of still follows a normal distribution. For Fisher’s method, H F (p i )=−2 log(p i ) has a chi-square distribution with two degrees of freedom. Therefore the distribution of the traditional weighted Fisher’s test is essentially weighted sums of exponential distributions. The distribution of weighted sums of exponential variables is not as straightforward, though many authors have researched on both the exact and approximations of this distribution, a summary of which can be found in Olkin and Saner .
When we consider weighted ordered p-values in the form of , however, the problem becomes much more complicated. Even for Stouffer’s method, the distribution of the sum of weighted ordered normal variables is not readily available. As for Fisher’s method, Olkin and Saner  studied the distribution of the trimmed Fisher’s statistic, which is for the special case of w i =0 for i=1,⋯,s 1 and i=s 2,⋯,K. They transformed the distribution of the sum of ordered chi-squared variables back to a weighted sum of exponential variables without order. But as discussed in their paper, the exact distribution of weighted sums of exponential variables are generally impractical for practitioners, and we would therefore have to use approximation methods.
Considering the complexity of the exact distributions of weighted sums of ordered variables, as well as the fact that the uniformness of the original p-values is not always guaranteed in practice, we recommend two methods for obtaining the p-values for the WOP statistics: (1) permutation analysis, in the case that original data from all the studies are available; and (2) comparing to the numerical distribution, in the case that only the p-values for each gene and each study are known.
We first explain the steps of obtaining the WOP p-values through permutation analysis:
Let denote the WOP statistic for gene g, where p g(i) is the ith ordered p-value of p g1,⋯,p g K .
Permute group labels in each study B times, and recalculate the p-values for the permuted data , for 1≤k≤K, 1≤g≤G, 1≤b≤B.
Calculate the WOP statistics for the permuted p-values for 1≤g≤G, 1≤b≤B.
Once the p-values for the WOP statistics for each gene are obtained, we may apply the Benjamini-Hochberg (BH) method  on the ’s (1≤g≤G) to account for multiple testing across the genes and control the false discovery rate (FDR).
If the original data are not available, we can simulate the distribution of the WOP statistics numerically, by simulating U(0,1) random variables, the distribution of p-values under the null distribution. The WOP statistics calculated from the data can then be compared to the numerical distribution to obtain the WOP p-values. We simulated numerical distributions of the WOP statistics for testing H S m based on the Fisher’s and Stouffer’s combination methods with binomial and half-binomial weighting schemes respectively, for study numbers ranging from 4 to 23. We conducted simulation studies to compare the WOP p-values obtained either though comparing with the numerical distribution or by performing permutation analysis. Results show that the two methods provide perfectly correlated p-values and that the number of rejections obtained from the two methods after applying the Benjamini-Hochberg adjustment are very similar (data not shown). Therefore both methods are reliable choices for obtaining the WOP p-values in practice. The numerical distribution provides an option for when the original data are not available and is also more time-efficient. The permutation analysis can be used if the uniformness of the original p-values are questionable but that the original data is available.
Previously we used two-sided alternatives as an example when setting up the hypothesis. The hypothesis setup H S r can be similarly developed for one-sided alternatives. In fact, the interpretation of the meta-analysis results is easier for one-sided tests, since we do not need to worry about the concordance of the directions of the effect sizes as we do for two-sided tests. Since the WOP methods directly combine the p-values, the direction of the effect sizes are not taken into account for two-sided tests. Thus a significant result from the WOP meta-analysis of two-sided tests indicate that there are at least r studies with non-zero effect size, but without any implications about the concordance or discordance of the directions of the effects. This may not be an issue in the case that the direction of effect is not of great importance. However, in genomic data analysis, it is often desirable to distinguish between up- and down-regulated genes, and a result stating that the gene is differentially expressed across many studies but with possible opposite directions of expression change may be confusing. In these cases, it might be problematic to directly apply the WOP methods to the two-sided p-values. On the other hand, since both up- and down-regulated genes may be of interest at the same time, we cannot pre-specify one particular one-sided test for all genes. For such scenarios we recommend using the test of Pearson  in combination with the WOP methods. To do so, for each gene we need to conduct two WOP meta-analyses on one-sided p-values: one on the left-tailed p-values for all studies, and the other on the right-tailed p-values for all studies. Let and be the WOP meta-analysis p-values for the left-tailed and right-tailed tests respectively. We shall then adopt the idea of Pearson’s test and define , where the superscript “C” stands for “concordant”. As discussed in Owen , the equation for obtaining provides a conservative p-value for Pearson’s test. By adopting the Pearson’s test, the results are more interpretable. A significant result now indicates that the gene is consistently up- or down-regulated in at least r studies.
We conducted a simulation study to compare the performances of the WOP methods with the original Fisher’s and Stouffer’s method, as well as with the rOP method Song and Tseng . We shall also demonstrate the differences between the binomial and half-binomial weighting schemes through the simulation.
We simulate the setting of a meta-analysis of differential expression studies, with 2000 genes and 7 studies. Out of the 2000 genes, 1650 genes are assumed to be not differentially expressed in any study, while 50 genes are assumed to be differentially expressed in 1,2,⋯,7 studies respectively. The sample sizes for the treatment and control groups are randomly generated for each study, varying from 5 to 20. Gene expression values are randomly generated from normal distributions. Control samples are generated from a N(0,1) distribution, as well as treatment samples that are not differentially expressed. Treatment samples that are differentially expressed are generated from a N(1,1) distribution. Two-sample T-tests are used to obtain the p-values p g k for each gene and each study. Our WOP methods aim at testing the hypothesis that the gene is differentially expressed in the majority of studies. In this case, m=4, corresponding to the hypothesis setting H S 4. The rOP statistic Song and Tseng  for testing H S 4 is the 4th ordered p-value. Note that the original Fisher’s and Stouffer’s method are supposed to test for H S 1. We used permutation analysis to obtain the WOP p-values for binomial and half-binomial weighted Fisher’s and Stouffer’s statistic. P-values for the rOP method were also computed by permutation analysis as recommended in Song and Tseng . P-values for the original Fisher’s and Stouffer’s method are computed directly via their respective distributions. To obtain a list of significant genes, the Benjamini-Hochberg procedure is applied to the p-values with the FDR controlled at the 0.05 level. Results are averaged over 100 replications.
In summary, the binomial weighted WOP methods show improvement over the original Fisher’s or Stouffer’s method for testing differential expression in a majority of studies, with lower rejection rates for genes that are differentially expressed in a small number of studies, and just as high power for genes that are differentially expressed in almost all studies. On the other hand, the half-binomial weighted WOP methods are more robust versions of the rOP method, having similar properties to the rOP method, but even lower rejections rates for categories 1 to 3 and higher power for categories 6 and 7. In practice, the binomial weighting scheme is recommended if the user wishes to have a larger pool of significant genes, and when the control of false positives is relatively less important. For better false positive control, the half-binomial weighting scheme is recommended. We note that because of our hypothesis setup, false positives are not the same as type I error. A type I error would be rejecting a gene that is not differentially expressed in any studies. On the other hand, a false positive would be rejecting a gene that is differentially expressed in less than r studies.
As an application of the proposed methodology, we conduct meta-analysis on a set of microarray data studies from four stem cell papers: Chin et al. , Guenther et al. , Newman and Cooper  and Chin et al. . We wish to find probesets that are differentially expressed between human induced pluripotent stem (hiPS) cells and human embryonic stem (hES) cells in the majority of studies. Some of the studies contain other samples such as human fibroblasts, but we only used samples from hiPS cells and hES cells. We included studies that had at least two samples for each group (hiPS and hES), giving us a total of 9 studies. All the studies used the Affymetrix Human Genome U133 Plus 2.0 Array platform, which contains 54675 probesets. We directly used the data preprocessed by the original contributors and did not perform any additional normalization, except for taking the log for data that were not already on the log2 scale. We performed a two sample T-test between the hiPS cell samples and the hES cell samples to obtain the original p-values for differential expression for each probeset and each study. The hypothesis setting for the meta-analysis is H S 5, i.e. we aim at testing the alternative that the probeset is differentially expressed in at least 5 out of the 9 studies. We applied the proposed WOP methods, in particular the binomial and half-binomial weighted Fisher’s and Stouffer’s statistic to the p-values. The p-values for the WOP statistics are obtained by comparing the statistics to the corresponding numerical distributions. We also applied the original Fisher’s and Stouffer’s method, and the rOP method (in this case the 5th ordered p-value). The p-values for the rOP method are obtained via its theoretical distribution. To adjust for multiple testing, we applied the Benjamini-Hochberg (1995) procedure afterwards, controlling the false discovery rate at the 0.05 level.
Number of significant probesets for the meta-analysis of stem cell studies by different methods
WOP: Binomial weighted
WOP: Half-binomial weighted
To compare the performances of the WOP methods and the rOP method Song and Tseng  in real data application, we applied our WOP methods to the three microarray meta-analysis applications in Song and Tseng . The first application consists of comparisons of two subtypes of brain tumors - anaplastic astrocytoma (AA) and glioblastoma multiforme (GBM), from 7 studies. The second application combines 9 studies comparing post-mortem brain tissues between MDD patients and control samples. In the third application, 16 diabetes microarray studies consisting of different organisms and tissues were combined. See Song and Tseng  for more details on the contexts of these three meta-analysis applications.
To ensure that the results are directly comparable, for each meta-analysis we directly used the two-sided p-values for each gene and each individual study calculated in Song and Tseng . In Song and Tseng , permutation analysis is used for the brain cancer studies and the MDD studies, while theoretical distributions are used to obtain results for the diabetes studies. We follow Song and Tseng  and also directly use the two-sided p-values for the permuted datasets provided by Song and Tseng  for the brain cancer studies and the MDD studies. See Song and Tseng  for more details on the preprocessing of the data and the calculation of the original p-values.
Number of significant genes for the three meta-analyses by different methods
For comparison, to see how the results would differ for different choices of r for the WOP methods, we applied the WOP method to the MDD studies again - this time testing H S 7. We used the half-binomial versions of the three weighting schemes for m<r≤K that were discussed earlier: w h b1, w h b2 and w h b3. We used Fisher’s summary statistic in the WOP method. The numbers of genes found by using the three weighting schemes are 760, 770 and 774 respectively. The three weighting schemes are fairly consistent with each other, since for each weighting scheme more than 90% of the genes found were also found by the other two weighting schemes. Comparing to the number of genes found by the half-binomial weighted Fisher’s method for testing H S 5, which is 930, notice that the corresponding WOP methods for testing H S 7 yield smaller numbers, conforming to our expectations. As discussed earlier, the ideal result would be that the genes detected for H S 7 be a subset of the genes detected for H S 5. In reality, over 70% of the genes detected by WOP methods for H S 7 were also detected for H S 5 (the percentage being 72.6%, 71.4% and 74.2% for the three weighting schemes respectively). Recall that for the rOP method, only 43.6% of genes detected for r=7 were also detected for r=5. Even though the WOP methods are not perfect, we can still see the great improvement in robustness compared to the rOP method.
In summary, our observations confirm that the results of the rOP method are indeed heavily dependent on the choice of r. Whereas our WOP methods show much higher robustness. In particular, the WOP methods for testing H S m has shown superior robustness by being able to cover most of the genes detected by the rOP methods using different r. Since in practice it is not often clear which particular H S r should be tested, we believe the WOP methods for testing H S m is a better choice when the goal is to detect signal in the majority of studies.
Meta-analysis is a useful tool in integrating data from different sources to test a particular hypothesis. While this paper mainly discussed the application of meta-analysis on microarray differential expression studies, other areas of genomic studies have increasingly relied on the use of meta-analysis, such as genome-wide association studies (GWAS). Some seminal studies in this area include Scott et al.  and Willer et al. . Meta-analysis is also frequently used in clinical studies, psychological studies and statistical applications in other social sciences. More and more meta-analyses nowadays aim at detecting consistent findings across a number of studies. While most of the traditional meta-analysis methods test for significance in at least one of the studies, it is important to develop new meta-analysis methods that focus on testing for significance in the majority of studies.
The weighted ordered p-value (WOP) method provides such a framework. It is unique in its use of weights that are based on the order of the p-values. The rOP method Song and Tseng , which is also based on ordered p-values, can be considered a very special case under the WOP framework, where all the weight is placed on one single ordered p-value. The WOP methods do not require pre-specification of r and is less sensitive to the choice of its value. The half-binomial weighted WOP methods have been shown to be more robust and have better receiver operating characteristics compared to the corresponding rOP method.
As pointed out by a reviewer, one of our previously published meta-analysis methods also considers the issue of heterogeneity and utilizes weights. However, these two methods are quite different in concept. The method in  applies weights to the genes, essentially to re-rank the genes by adding in information about heterogeneity. On the other hand, the WOP method applies weights to the multiple p-values for each gene. Although, conceptually, we can first apply the WOP method to each gene and then use the method in  on top of that to re-rank the genes.
One advantage of the WOP framework is its flexibility. The framework allows for different weighting schemes and summary statistics to be used. Even though this paper mainly focused on two particular weighting schemes based on the binomial distribution and two summary statistics (Fisher’s and Stouffer’s statistics), in general, other summary statistics and weighting schemes can be used. Future research can be done to try to optimize the weighting scheme to suit specific meta-analysis purposes.
We acknowledge the support of the NSF grant ABI-1262538.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.