 Methodology Article
 Open Access
 Published:
Unsupervised gene set testing based on random matrix theory
BMC Bioinformatics volume 17, Article number: 442 (2016)
Abstract
Background
Gene set testing, or pathway analysis, is a bioinformatics technique that performs statistical testing on biologically meaningful sets of genomic variables. Although originally developed for supervised analyses, i.e., to test the association between gene sets and an outcome variable, gene set testing also has important unsupervised applications, e.g., pvalue weighting. For unsupervised testing, however, few effective gene set testing methods are available with support especially poor for several biologically relevant use cases.
Results
In this paper, we describe two new unsupervised gene set testing methods based on random matrix theory, the Marc̆enkoPastur Distribution Test (MPDT) and the TracyWidom Test (TWT), that support both selfcontained and competitive null hypotheses. For the selfcontained case, we contrast our proposed tests with the classic multivariate test based on a modified likelihood ratio criterion. For the competitive case, we compare the new tests against a competitive version of the classic test and our recently developed Spectral Gene Set Enrichment (SGSE) method. Evaluation of the TWT and MPDT methods is based on both simulation studies and a weighted pvalue analysis of two real gene expression data sets using gene sets drawn from MSigDB collections.
Conclusions
The MPDT and TWT methods are novel and effective tools for unsupervised gene set analysis with superior statistical performance relative to existing techniques and the ability to generate biologically important results on real genomic data sets.
Background
Gene set testing, or pathway analysis, is a powerful and extensively employed approach for analyzing the output from large scale assaying techniques for nucleic acids and nucleic acid products, such as microarrays and highthroughput sequencing [1, 2]. By focusing the analysis on the association between a smaller number of functional gene sets and a specific clinical outcome, gene set testing can substantially improve statistical power, biological interpretation and replication relative to an analysis based on individual genomic variables [1, 3–5]. Given these advantages, researchers have invested significant effort in the last 10 to 15 years creating large gene set collections [6–8] and developing effective gene set testing methods [4, 9–12].
Gene set testing was originally developed for use in a supervised context, i.e., to quantify the association between a set of genomic variables, or genes, and a clinical outcome or phenotype. Typically, this is carried out via a twostage process in which the association is first measured between each gene in the set and the phenotype, often using a simple linear regression model. The test statistics for all genes in the set are then combined into a gene set test statistic and significance is computed relative to the appropriate H _{0} and H _{ A }. Based on the form of H _{0} and H _{ A }, gene set testing methods are generally grouped into two main categories: selfcontained tests and competitive tests [5, 13]. For selfcontained tests, the null asserts that none of the gene set members have an association with the outcome, and for competitive tests, the null asserts that the members of the gene set are no more associated with the outcome than genes not in the set. In general, tests based on a competitive null hypothesis are viewed as more biologically relevant, and thus are far more commonly used, than those based on a selfcontained null [5].
Although most applications of gene set testing involve a supervised model, a number of important unsupervised use cases exist, where unsupervised implies that testing is performed in the absence of an outcome variable. Important unsupervised gene set testing use cases include caseonly analyzes and pvalue weighting [14], which is explained in greater detail in the next paragraph. In prior work, we addressed the lack of effective unsupervised gene set testing techniques by developing the Spectral Gene Set Enrichment (SGSE) method [15] (see the SGSE paper for a detailed review of existing unsupervised approaches). The SGSE method computes unsupervised enrichment for each gene set via the association between gene set members and the principal components (PCs) of a genomic data set using the Principal Component Gene Set Enrichment (PCGSE) method [16] while taking into account the statistical significance of the eigenvalue associated with each PC according to a TracyWidom test [17].
One unsupervised use case of particular importance to the SGSE method, and the work detailed in this paper, is pvalue weighting [14]. Pvalue weighting aims to reduce the burden on statistical power incurred by multiple hypothesis correction (MHC) by weighting the pvalues computed for each hypothesis using weights that reflect the likelihood that the alternative hypothesis (H _{ A }) is true. As long as the weights are independent of the test statistics under the null hypothesis (H _{0}), MHC methods will correctly maintain either the familywise error rate (FWER) or false discovery rate (FDR). When pvalue weighting is used with FDR methods, e.g., the Benjamini and Hochberg (BH) [18] method, the technique is referred to as weighted FDR control (wFDR). In the special case where the weights are binary, this approach is called screeningtesting [19] and has the effect of selecting and testing just a subset of the original family of hypotheses.
Although widely applied for geneenvironment interaction detection [20–23], pvalue weighting can have a significant impact on gene set testing power given the significant growth in the size of common gene set collections, e.g., even the very selective Molecular Signatures Database (MSigDB) [8] now includes over 10,000 sets. For such large gene set collections, MHC can lower statistical power so substantially that it becomes impossible to identify true associations for many genomic data sets [24]. The link between pvalue weighting and unsupervised gene set testing is based on the fact that an effective way to ensure the independence between datadriven weights and standard gene set test statistics under H _{0} is to ignore the outcome variable when computing the weight, i.e., base the weight on an unsupervised gene set test. In previously published research building on the SGSE method, we developed a screeningtesting framework for gene set testing, called Spectral Gene Set Filtering (SGSF) [24], that computes binary weights using the pvalues generated by the SGSE method.
Although it can be proven that weights based on an unsupervised gene set test (like the SGSE method) will be independent of common gene set test statistics under H _{0} (see proof in the Supplemental Material for [24]), this independence only ensures type I error control. To actually improve statistical power, the weight must also be associated with the test statistic under H _{ A }. For gene set testing, producing a datadriven weight therefore requires the use of an unsupervised test that can effectively identify the gene sets truly associated with the outcome of interest. As illustrated in the SGSF paper [24], statistics based on deviation from an identity population covariance structure represent a useful class of gene set weights that boost gene sets according to the empirical correlation among member genes. Although it is possible some biologically important gene sets exhibit little intergene correlation, it has been shown that groups of highly correlated genes are often associated with biological processes that play an important role in the measured experiment [11]. For example, the genes belonging to biological pathways have well characterized interactions and, if the pathway is active in a given experiment context, can be expected to exhibit correlated expression. This property is in fact used as the basis for computationally generating many gene sets, e.g., the MSigDB C4 cancer models [8, 25] were created via clustering of gene expression data.
While the output of the SGSE method is effective at identifying enriched gene sets for many use cases, there are several biologically important scenarios represented by nonidentity covariance structures where the SGSE method will fail to select the gene sets truly associated with the output. For example, if different subgroups of a gene set are associated with different PCs or if all gene set members are associated with the same PC but the direction of association varies for different subgroups, the SGSE method, as well as unsupervised methods based on clustering of the genomic variables [26–28], will perform poorly. Such scenarios are biologically important and reflect many large gene sets associated with processes or pathways comprised by multiple distinct groups of genes where the action of the subgroups is either uncorrelated or counterbalancing (these use cases are represented by the multiblock and anticorrelated multiblock covariance structures, as detailed in Section “Simple covariance structure examples”). Another important limitation of the SGSE method is the fact that it only supports a competitive H _{0} and therefore cannot be used in cases where a selfcontained test is of greater biological interest (see Sections “Selfcontained gene set testing” and “Competitive gene set testing” below for detailed definitions of selfcontained and competitive gene set tests).
To address the shortcomings of existing unsupervised tests and to support both selfcontained and competitive tests across a wider range of biologically relevant data models, we have developed two novel unsupervised gene set tests, the Marc̆enkoPastur Distribution Test (MPDT) and the TracyWidom Test (TWT) that are based on the covariance structure of the measured genomic variables. Although a number of existing gene set testing methods are also based on covariance structure analysis (e.g., GSCA by Choi and Kendziorski [29], GSNCA by Rahmatallah et al. [30], and GSASDR by Hsueh and Tsai [31]), with the exception of the SGSE method [15] and earlier clusterbased approaches [26–28], these methods are all supervised and identify interesting gene sets according to difference in the covariance structure between levels of a phenotype. The supervised nature of these tests means they cannot be used to support pvalue weighting, caseonly analyses or other unsupervised use cases.
Both the MPDT and TWT methods are based on random matrix theory (RMT) findings regarding the distribution of the eigenvalues of matrices with a white Wishart distribution [32, 33], i.e., the distribution of the sample covariance matrix for multivariate normal data with an identity population covariance matrix. As detailed in Section “Random matrix theory (RMT) benefits” below, these tests were based on RMT to provide better support for highdimensional data, nonnormal data and data based on small sample sizes. For MPDT, the test is based on the Marc̆enkoPastur quartercircle law characterizing the limiting empirical distribution of all of the eigenvalues of a white Wishart matrix. For TWT, the test is based on the TracyWidom law of order 1 characterizing the limiting distribution of the largest eigenvalue of a white Wishart matrix. Versions of the MPDT and the TWT are detailed in Section “Methods” for both selfcontained and competitive null hypotheses. For the selfcontained case, we contrast our proposed tests with the classic multivariate test based on a modified likelihood ratio criterion (MLRT) and, for the competitive case, we compare the new tests against a competitive version of the classic test and our SGSE method (also based on RMT).
As we demonstrate through simulation studies, the MPDT and TWT methods provide superior performance relative to the MLRT and SGSE methods on several biologically important covariance structures (e.g., the multiblock and anticorrelated multiblock structures detailed in Table 1). The practical utility of the TWT and MPDT methods is illustrated through a weighted FDR analysis of leukemia [34] and p53 [4] gene expression data sets relative to MSigDB gene set collections [8]. The remainder of this paper is organized as follows: Section “Methods” specifies the statistical properties of the MPDT and TWT methods, models for important use cases, simulation study design and approach for real data analysis, Section “Results” contains the results of the simulation studies and real data analyses and Section “Discussion” provides a discussion. Additional file 1 contains additional results for the real data analysis.
Methods
Data assumptions
It is assumed that measurements have been made for p genomic variables under n independent experimental conditions, e.g., expression levels of p mRNA molecules within tissue samples from n subjects. This data will be modeled as a sample of n independent observations from a pdimensional random vector x with mean μ and covariance Σ. Although the unsupervised gene set tests developed in this paper are robust to deviations from normality (see Section “Random matrix theory (RMT) benefits” below for details), it will be assumed that x can be well approximated by a multivariate normal distribution after appropriate transformation, i.e., \(\mathbf {x} \sim \mathcal {N}(\boldsymbol {\mu }, \boldsymbol {\Sigma })\). This data can be held in an n×p matrix X whose elements x _{ i.j } represent the measured value of genomic variable j under condition i. Let C represent the meancentered version of X.
It is assumed that the p genomic variables have been annotated to a collection of f biologicallybased sets of genomic variables or gene sets, e.g., Gene Ontology (GO) terms [6]. These annotations can be held in a f×p binary annotation matrix A whose rows represent the f biologicallybased sets and whose cells a _{ i,j } hold indicator variables whose value depends on whether an annotation exists between gene set i and genomic variable j.
For a given gene set, i, the p variables can be partitioned into two sets, p _{ g } and p _{ c }, according to the indicator variables in row i of matrix A with p _{ g } containing all variables that are members of the gene set, p _{ c } containing the complement of the gene set and p _{∗} representing the set of all p variables. Let the number of variables in subset p _{ g } be represented by g and the number in p _{ c } be represented by c=p−g. If the variables are reordered such that the variables in the gene set, i.e., p _{ g }, are listed before the variables not in the gene set, i.e., p _{ c }, the population, Σ, and sample, S, covariance matrices can be partitioned as:
This same partitioning can be applied to the population and sample correlation matrices.
Selfcontained gene set testing
For a selfcontained gene set test, only the genomic variables that are members of the gene set may be used to compute the test statistic. Informally, the null hypothesis for an unsupervised and selfcontained test asserts that none of the genomic variables in the gene set, i.e., p _{ g }, are enriched relative to the genomic data matrix X more than would be expected at random. Importantly, in an unsupervised context this measurement of gene set enrichment is defined with respect to the distribution of the random vector x and not with respect to the association between the members of x and some other covariate.
Given the partitioning of the population and sample covariance matrices specified in (1), one possible formulation of an unsupervised and selfcontained gene set test measures enrichment as departure from a null distribution of \(\mathcal {N}(\boldsymbol {\mu }_{g}, \mathbf {I})\) for the p _{ g } variables in the gene set. This can be formally defined using the following null and alternative hypotheses:
Under this null hypothesis, \(n\mathbf {S}_{p_{g},p_{g}}\) has a white Wishart distribution, \(\mathcal {W}(n1, \mathbf {I})\), i.e., a Wishart distribution where the population covariance matrix is equal to the identity matrix.
In Sections “Classic modified likelihood ratio test (MLRT)” – “Selfcontained Marc̆enkoPastur Distribution Test (MPDT)” below, three different tests of these hypotheses are described, each based on the null distribution of a test statistic T _{self} that is a function of X and p _{ g }. The test detailed in Section “Classic modified likelihood ratio test (MLRT)” is the classic test of Σ=I from multivariate statistics. Sections “Selfcontained TracyWidom Test (TWT)” and “Selfcontained Marc̆enkoPastur Distribution Test (MPDT)” describe selfcontained versions of two new unsupervised gene set tests, the Marc̆enkoPastur Distribution Test (MPDT) and the TracyWidom Test (TWT). Both the MPDT and the TWT are based on random matrix theory (RMT) findings regarding the distribution of the eigenvalues of matrices with a white Wishart distribution [32, 33].
It is important to note that, for practical applications, rejection of the null hypothesis (2) at a given α may be of little interest to researchers given the limited biological information provided by selfcontained tests and the general sensitivity of such tests to small deviations from the null [5]. As a consequence, gene set testing is almost always performed against a competitive null hypothesis. The selfcontained tests described in Sections “Classic modified likelihood ratio test (MLRT)” through “Selfcontained Marc̆enkoPastur Distribution Test (MPDT)” were therefore developed primarily to provide statistics for use in competitive tests, as detailed in Section “Competitive gene set testing” below.
Classic modified likelihood ratio test (MLRT)
The classic test of selfcontained hypotheses (2) for multivariate normal data is based on a modified likelihood ratio criterion [35]. Specifically, this criterion leads to the following selfcontained test statistic:
Under the asymptotic regime in which n→∞ and g is fixed, this statistic has a χ ^{2} distribution with g(g+1)/2 degrees of freedom.
Selfcontained TracyWidom Test (TWT)
The TWT is based on the TracyWidom law of order 1 characterizing the limiting distribution of the largest eigenvalue of a white Wishart matrix. Since \(n\mathbf {S}_{p_{g},p_{g}}\) has a white Wishart distribution, \(\mathcal {W}(n1, \mathbf {I})\) under the null hypothesis (2), the TracyWidom result can be used as the basis for a selfcontained gene set test. Specifically, the limiting distribution of a centered and scaled version of the largest eigenvalue of \(n\mathbf {S}_{p_{g},p_{g}}\), \(\hat {\lambda }_{1}\), under H _{0} (2) is given by the TracyWidom law of order 1 [17]:
where the scaling and centering terms are given by \(\mu (g, n) = (\sqrt {n1} + \sqrt {g})^{2}\) and \(\sigma (g, n) = (\sqrt {n1} + \sqrt {g}) \left (1/(\sqrt {n1}) + 1/(\sqrt {g}) \right)^{1/3}\).
The selfcontained version of the TWT therefore defines T _{self} as the scaled and centered version of \(\hat {\lambda }_{1}\), as specified in (4), and tests for deviation from the TracyWidom law of order 1 distribution expected under H _{0} (2).
One disadvantage of the TWT is that it considers only the principal eigenvalue of \(n\mathbf {S}_{p_{g},p_{g}}\) and will therefore ignore the significance of eigenvalues \(\hat {\lambda }_{i}, i \geq 2\).
Selfcontained Marc̆enkoPastur Distribution Test (MPDT)
The MPDT is based on the Marc̆enkoPastur quartercircle law characterizing the limiting empirical distribution of all of the eigenvalues of a white Wishart matrix [32, 33]. If it is assumed that \(n\mathbf {S}_{p_{g},p_{g}}\) has rank g, the empirical distribution function of the eigenvalues, \(\hat {\lambda }_{i}, i=1,\ldots,g\), of \(n\mathbf {S}_{p_{g},p_{g}}\) under H _{0} (2) is given by:
and the Marc̆enkoPastur quartercircle law holds that \({\lim }_{n,g \to \infty, g/(n1) \to \gamma \in (0, \infty)} F_{\hat {\lambda }_{i}}(x) = G(x)\) where the density of G(x) is given by \(g(x) = 1/(2\pi \gamma x) \sqrt {(b_{+}  x)(xb_{}}, b_{\pm } = (1 \pm \sqrt {\gamma })^{2}\). The selfcontained version of the MPDT leverages this result by comparing the empirical distribution function \(F_{\hat {\lambda }_{i}}(x)\) (6) against the expected Marc̆enkoPastur distribution for γ=g/(n−1) using a twosided, onesample KolmogorovSmirnov test. This is based on the null distribution of a statistic, D, defined as the maximum difference between the two distribution functions:
This test has the benefit, relative to the TWT, of accounting for all of the eigenvalues of \(n\mathbf {S}_{p_{g},p_{g}}\). A limitation of the MPDT is that the \(\hat {\lambda }_{i}\) are dependent, which results in an incorrectly large degreesoffreedom for the KolmogorvSmirnov test and an inflated type I error rate unless the dependence structure can be accurately modeled [36]. Fortunately, this inflated type I error rate is only an issue for a strictly selfcontained test. When the T _{self} from the MPDT is used in a competitive test (as described in Section “Competitive gene set testing” below), the incorrect degrees of freedom no longer poses a problem. This test is therefore primarily useful as a means of ranking the f gene sets (according to the magnitude of D (7)).
Competitive gene set testing
For competitive and unsupervised gene set tests, two primary forms of null hypothesis are possible. The first type of competitive null asserts that a given gene set is no more associated with X than are the other gene sets defined in the annotation matrix A. The second type of competitive null asserts that the members of a given gene set are no more associated with X than would be expected for a random set of genomic variables of the same size. Although both forms of null hypothesis are valid and address important biological questions, we focus solely on that later form of competitive test in this paper. Not only does this form of competitive null provide results for a given gene set that are invariant to the size and composition of A but it can easily be used as the basis for a comparative analysis of multiple gene sets, e.g., rank the gene sets defined in A according to the pvalues from competitive testing.
For this paper, then, the null hypothesis for an unsupervised competitive gene set test informally asserts that the genomic variables in the gene set, i.e., p _{ g }, are no more enriched relative to the variance structure of the matrix X than would be expected for a set of g genomic variables drawn at random from among all p variables, i.e., \(p_{g \in p_{*}}\) or p _{ g∗}. The onesided alternative hypothesis of interest informally asserts that the genomic variables in p _{ g } are more enriched relative to the variance structure of the matrix X than would be expected for a random set of variables p _{ g∗}. These competitive hypotheses can be formally defined in terms of the cumulative distribution functions of the eigenvalues of the population covariance matrices \(\boldsymbol {\Sigma }_{p_{g},p_{g}}\) and \(\boldsymbol {\Sigma }_{p_{g*},p_{g*}}\). Specifically, under the competitive H _{0} the eigenvalues of \(\boldsymbol {\Sigma }_{p_{g},p_{g}}\) and \(\boldsymbol {\Sigma }_{p_{g*},p_{g*}}\) have identical cumulative distributions and, under the corresponding H _{ A }, the cumulative distribution of the eigenvalues of \(\boldsymbol {\Sigma }_{p_{g},p_{g}}\) is pointwise smaller than the cumulative distribution of the eigenvalues of \(\boldsymbol {\Sigma }_{p_{g*},p_{g*}}\), i.e., distribution of the eigenvalues of \(\boldsymbol {\Sigma }_{p_{g},p_{g}}\) is shifted towards larger values relative to the eigenvalue distribution for \(\boldsymbol {\Sigma }_{p_{g*},p_{g*}}\). Mathematically, these competitive hypotheses can be stated as follows:
where \(F_{\lambda _{i}(\boldsymbol {\Sigma }_{p_{g}, p_{g}})}(x)\) and \(F_{\lambda _{i}(\boldsymbol {\Sigma }_{p_{g*}, p_{g*}})}(x)\) represent the eigenvalue cumulative distribution functions, as defined by (6) above, of the population covariance matrices \(\boldsymbol {\Sigma }_{p_{g}, p_{g}}\) and \(\boldsymbol {\Sigma }_{p_{g*}, p_{g*}}\). It is important to note that H _{0} (8) does not assert that either \(\boldsymbol {\Sigma }_{p_{g}, p_{g}} = \mathbf {I}\) or that \(\boldsymbol {\Sigma }_{p_{g*},p_{g*}} = \mathbf {I}\). Instead, the H _{0} (8) asserts that \(\boldsymbol {\Sigma }_{p_{g}, p_{g}}\) and \(\boldsymbol {\Sigma }_{p_{g}*, p_{g}*}\) have equivalent eigenvalue distributions. In other words, the H _{0} (8) can hold even if both \(\boldsymbol {\Sigma }_{p_{g}, p_{g}}\) and \(\boldsymbol {\Sigma }_{p_{g}*, p_{g}*}\) deviate significantly from the identity matrix as long as the deviation is equivalent when characterized by the eigenvalues of the matrices. This competitive H _{0} is quite distinct from the corresponding selfcontained H _{0} (2) that asserts an identity population covariance matrix.
To test competitive hypotheses (8), we define competitive versions of the MLRT, TWT and MPDT tests that use the selfcontained statistic T _{self} computed for both p _{ g } and p _{ g∗}, where T _{self} is defined by either (3), (5) or (7). For these competitive tests, statistical significance relative to the competitive hypotheses (8) is computed using the following permutation testing procedure for each gene set defined in A:

From among all \(p \choose g\) possible combinations of g variables from the set of all p variables, select B random combinations.

For each combination, p _{ b },b=1,…,B, compute T _{self}(X,p _{ b }) according to (3), (5) or (7).

Use the permutation distribution of T _{self}(X,p _{ b }) to compute the pvalue for hypotheses (8):
$$ \begin{aligned} &Pr(\mathbf{T}_{\text{self}}(\mathbf{X}, p_{g*}) > \mathbf{T}_{\text{self}}(\mathbf{X}, p_{g})  H_{0})\\ &= \frac{\sum_{b=1}^{B} 1(\mathbf{T}_{\text{self}}(\mathbf{X}, p_{b}) > \mathbf{T}_{\text{self}}(\mathbf{X}, p_{g}))}{B} \end{aligned} $$(9)
For standard gene set testing, i.e., testing in a supervised context, the use of such a genelevel permutation distribution, i.e., a distribution generated by permuting the assignment of genes to gene sets, is problematic and can lead to an inflated type I error rate [11]. Specifically, because the permutation distribution breaks the correlation structure of the gene set, the permutation distribution of a supervised gene set test statistic (i.e., a statistic that reflects the association of set members with an outcome variable) will have a much smaller variance than the true null distribution of the gene set test statistic given the correlation present among gene set members. As an example, suppose the supervised gene set statistic is the mean of the t statistics capturing association between genomic variables in the set and a binary outcome. If the members of a gene set are correlated, then the true null distribution of this gene set statistic has expectation 0 but a variance that is much larger than the variance of the statistic computed under the permutation null distribution. In an unsupervised context, however, breaking the correlation between gene set members through the permutation test does not pose a problem since the analysis is focused on the variance structure of the gene set members and not on their association with an outcome variable. In this case, the competitive null asserts a correlation structure among gene set members that matches what is generated through permutation, i.e., under H _{0} (8) the correlation structure for a gene set matches the correlation structure for a random sample of the same size.
Random matrix theory (RMT) benefits
The development of the RMTbased MPDT and TWT methods was motivated by three key factors (which we validate and quantify through the simulation studies described in Section “Evaluation design”):

1.
Asymptotic regime: The classic test holds under the standard asymptotic regime in which n→∞ while p is fixed. The RMTbased tests, on the other hand, hold under an asymptotic regime where both n and p→∞ while the ratio p/n→γ∈(0,∞). Importantly, RMT asymptotics can be expected to provide a more accurate approximation of genomic data than standard asymptotics [37].

2.
Deviation from normality: Although both the classic test and the RMTbased tests are derived under the assumption of multivariate normality for x, the RMTbased tests, due to the universality properties of the RMT distributions, can be expected to be more robust to deviations from normality for the elements of x [38]. Such distributional robustness is especially important in the context of gene set testing of genetic variation data, e.g., genomewide association data capturing single nucleotide polymorphisms (SNPs).

3.
Small sample size: Although both the classic test and the RMTbased tests are asymptotic approximations, the RMTbased tests can be expected to perform better for very small samples sizes, e.g., Johnstone found that the distribution of the largest eigenvalue of a white Wishart matrix was well approximated by the TracyWidom law of order 1 for data sets as small as p=5 and n=20 [17].
Simple covariance structure examples
The behavior of the selfcontained and competitive tests outlined in Sections “Selfcontained gene set testing” and “Competitive gene set testing” above can be illustrated using simple, but biologically meaningful, use cases based on different population covariance matrix structures. These covariance structures, along with expected results for the selfcontained and competitive tests, are shown in Table 1 using the partitioning from Section “Data assumptions”. For simplicity, it is assumed that there is only one gene set containing the first p/2 genomic variables. Examples where the two tests give different answers are in bold. These covariance structures are referenced in Sections “Simulation design to assess type I error control”, “Simulation design to assess statistical power” and “Discussion” to characterize the simulation designs and discuss the relative performance of the evaluated methods.
Evaluation design
Benchmark unsupervised and competitive gene set tests
For comparative evaluation of the competitive versions of the TWT and MPDT methods outlined in Sections “Selfcontained gene set testing” and “Competitive gene set testing”, two benchmark unsupervised gene set tests were used: the classic modified likelihood ratio test (MLRT) and the Spectral Gene Set Enrichment (SGSE) [15] method. For the MLRT test, competitive testing used the classic modified likelihood ratio test statistic detailed in Section “Classic modified likelihood ratio test (MLRT)Classic modified likelihood ratiotest (MLRT)” with the competitive permutation procedure outlined in Section “Competitive gene set testing”. For all simulation and real data analyses detailed in this paper, the SGSE method was executed using all PCs with nonzero eigenvalues, genelevel test statistics set to the Fishertransformed Pearson correlation coefficients between each gene and each PC, statistical association between each PC and each gene set computed using a correlationadjusted twosample ttest between the genelevel test statistics for gene set members and nongene set members and overall unsupervised gene set association computed using the weighted Zmethod on the PClevel pvalues with weights set to the PC variance multiplied by the lowertailed pvalue from a TracyWidom test on that PC eigenvalue. See the original SGSE paper [15] for more information on the operation and configuration of the method.
Simulation design to assess type I error control
To assess type I error control for the competitive versions of the MPDT and TWT methods, as detailed in Section “Competitive gene set testing”, and the two benchmark competitive methods listed in Section “Benchmark unsupervised and competitive gene set testsBenchmarkunsupervised and competitive gene set tests”, data were simulated according to eight null simulation designs as outlined in Table 2 (seven for multivariate normal data and one for multivariate binomial data). In Table 2, the covariance structure column refers to one of the models listed in Table 1, p represents the total number of simulated genes, n represents the number of independent samples in each data set, g represents the size of each disjoint gene set, σ ^{2} represents the variance for all variables and ρ represents the pairwise covariance between all genes (i.e., all gene pairs have the same covariance irrespective of gene set membership). The multivariate binomial distribution was included to mimic single nucleotide polymorphism (SNP) data specified using additive coding. For all designs, 1,000 data sets were simulated and tested for unsupervised enrichment against a gene set annotation matrix A that defined p/g disjoint gene sets each containing g genes. For the competitive versions of the MLRT, MPDT and TWT methods, the number of permutations B was set to 500. For SGSE, all g principal components were used and other method parameters were specified as detailed in Section “Benchmark unsupervised and competitive gene set testsBenchmark unsupervised and competitive gene set tests”.
Simulation design to assess statistical power
To assess statistical power for the four evaluated methods, data was simulated according to ten different simulation designs outlined in Table 3 (nine designs for multivariate normal data and one for multivariate binomial data). The motivation and implications of each simulation model are discussed in more detail in Sections “Unsupervised gene set testing for the single block modelUnsupervised geneset testing for the single block model” – “Unsupervised gene set testing for nonnormal dataUnsupervisedgene set testing for the single block model” below. With the exception of the σ ^{2} and ρ columns, the columns in Table 3 have the same interpretation as the corresponding columns in Table 2, as detailed in Section “Simulation design to assess type I error controlSimulationdesign to assess type I error control” above. For Table 3, σ ^{2} represents the variance of members of the nonnull gene sets (a variance of 1 was used for all genes in null sets) and ρ represents the pairwise covariance between the members of nonnull gene sets. All models listed in Table 3 included just a single nonnull gene set with the exception of model MVN9, for which all 10 disjoint gene sets were nonnull. For model MVN7, the nonnull gene set was divided into 5 disjoint subblocks of size 2 with a covariance of 0.2 between the two members of each subblock and 0 covariance between members of different subblocks. For model MVN8, the nonnull gene set was divided into two 5 member subblocks with pairwise covariance between subblock members of 0.1 and pairwise covariance between members of different subblocks set to 0.1. Similar to the simulation procedure for assessing type I error control outlined in Section “Simulation design to assess type I error control”, 1,000 data sets were simulated for each of the ten designs and tested for unsupervised enrichment against a gene set annotation matrix A that defined one truly enriched gene set containing the first 10 variables. Configuration of the MLRT, MPDT, TWT and SGSE methods mirrored the settings specified for the type I error control simulations.
Real data analysis design
To demonstrate the practical utility of the proposed methods, gene set testing was performed using a weighted FDR approach [14] on two classic gene expression data sets relative to v5.0 of the MSigDB gene set collections [8]. Specifically, the following MSigDB collections were tested: c1.all (positional), c2.cpg (curated: chemical and genetic perturbations), c2.cp (curated: canonical pathways), c3.mir (motif: microRNA targets), c3.tft (motif: transcription factor targets), c4.cgn (computational: cancer gene neighborhoods), c4.cm (computational: cancer modules), c5.bp (GO: biological process), c5.cc (GO: cellular component), c5.mf (GO: molecular function), c6.all (oncogenic signatures), c7.all (immunologic signatures). Prior to analysis, each MSigDB collection was filtered to remove gene sets with less than 5 or more than 200 members.
For the evaluations detailed in this paper, the proposed unsupervised gene set testing methods (TWT and MPDT) and the benchmark methods (SGSE and MLRT) were used to generate weights that were applied to the pvalues generated via supervised gene set testing using the CAMERA method [11]. The weighted CAMERA pvalues were then subjected to a wFDR analysis using the Benjamini and Hochberg (BH) [18] method. See Section “Configuration of CAMERA method for real data analysis” below for further details on the CAMERA method and the configuration settings used for these analyses. As detailed in Genovese et al. [14], the BH method provides valid FDR control when applied to a set of weighted pvalues as long as the weights have an average value of 1 and are independent of the pvalues under H _{0}. To meet these requirements, the weights were based on the log of the pvalues generated by the evaluated competitive unsupervised gene set testing methods. If u _{ i } represents the pvalue from the unsupervised test for gene set i (out of a total of f gene set tests), the weights, w _{ i }, were calculated as \(w_{i} = log(u_{i})/(1/f \sum _{j=1}^{f} log(u_{j}))\), which ensures that \(\sum _{i=1}^{f} w_{i} = f\). If s _{ i } represents the pvalue from the CAMERA supervised test for gene set i, weighted pvalues were then computed as \(s^{*}_{i} = w_{i} s_{i}\) with the wFDR qvalues computed using the standard BH method applied to \(s^{*}_{i}\).
For TWT, MPDT and MLRT, the number of permutations B for competitive testing was set to 5000. For SGSE, all PCs associated with nonzero eigenvalues were used and other method parameters were specified as detailed in Section “Benchmark unsupervised and competitive gene set testsBenchmark unsupervised and competitive geneset tests”. The two classic gene expression data sets analyzed were the Armstrong et al. [34] leukemia gene expression data set and the p53 gene expression data set used in the 2005 GSEA paper [4]. These data sets were selected because of their easy accessibility and extensive use in the gene set testing literature [4, 9], factors that will allow other researchers to more easily replicate and interpret the results outlined in this paper. Similarly, v5.0 of the MSigDB collections were chosen for analysis due to their accessibility and high quality annotations.
Configuration of CAMERA method for real data analysis
CAMERA [11] is a twostage, competitive gene set testing method that adjusts for the correlation between gene set members. CAMERA performs gene set testing using the following approach:

1.
Model the relationship between the genomic variables x _{ i },i=1,…,p and phenotype y using a series of p univariate linear models of the form x _{ i }∼β _{0}+β _{1} y+ε. If multiple phenotype variables exist, a contrast of model coefficients must also be specified.

2.
Compute genelevel test statistics, z _{ i },i=1,…,p, from each of the p univariate models. The tstatistic associated with \(\hat {\beta _{1}}\) is a typical choice. CAMERA uses a normalized tstatistic.

3.
Use the genelevel test statistics to generate gene set test statistics, S _{ j }, for each of the gene sets in the target collection. The mean difference test statistic, which follows a tdistribution under H _{0}, is a common choice: \(S_{j} = (\bar {z}_{j}  \bar {z}_{j^{c}})/(\sigma _{p} \sqrt {\frac {1}{m_{j}}  \frac {1}{pm_{j}}})\), where m _{ j } is the number of genomic variables in set j, \(\bar {z}_{j}\) is the mean of the z _{ i } for members of gene set j, \(\bar {z}_{j^{c}}\) is the mean of the z _{ i } for genes not in set j and σ _{ p } is the pooled standard deviation of the z _{ i }. CAMERA uses a correlationadjusted version of the mean difference statistic.

4.
Determine the statistical significance of the genelevel test statistics under null hypothesis that the z _{ i } for genomic variables in the gene set are identically distributed to the z _{ i } for genomic variables not in the gene set. CAMERA determines statistical significance using a twosample ttest on the correlationadjusted mean difference statistic. Many other twostage competitive gene set testing methods use permutation of y to calculate a pvalue.
For enrichment of the MSigDB gene set collections relative to the leukemia and p53 data, CAMERA was executed with default settings and genewise test statistics (z _{ i } above) calculated via the linear regression of the gene expression value on a data set specific phenotype. For the leukemia data, the phenotype was the acute myeloid leukemia (AML) versus acute lymphoblastic leukemia (ALL) status while for the p53 data the phenotype was the p53 mutated status. For both data sets, false discovery rate (FDR) values were computed using the BH method [18] for both unweighted and weighted pvalues.
Results
Simulation results
Type I error control results
Results from the type I error simulation studies detailed in Section “Simulation design to assess type I error control” are shown in Table 4. As listed in Table 4, type I error control was excellent for all evaluated methods on all eight null simulation designs.
Power results
Results from the power simulation studies detailed in Section “Simulation design to assess statistical power” are shown in Table 5. Although no single competitive method was superior for all models, the TWT method had the best overall performance with the largest average power for six of the ten simulations and close to the best power for the remaining four models. While the TWT method was the most powerful in the majority of the use cases, each of the tested methods had the best power for at least one of the tested models. In all cases, the average relative power of the four methods was consistent with the expected behavior of the methods for the simulated covariance structures. Sections “Unsupervised gene set testing for the single block model” thru “Unsupervised gene set testing for nonnormal data” below contain more detailed discussions of the results for each of these ten models.
Real data results
As outlined in Section “Real data analysis design”, the proposed and benchmark methods were evaluated via a wFDR analysis of leukemia [34] and p53 [4] gene expression data sets relative to 12 of the MSigDB v5.0 collections. For these analyses, the results of the unsupervised test were used to weight the pvalues generated by supervised gene set testing. Table 6 contains the results for the C7 collection (immunologic signatures) relative to the leukemia data and Table 7 contains the results for the C6 collection (oncogenic signatures) relative to the p53 data. The first column in each table contains the gene set name with the number of genes in the set in parentheses. The second column lists the direction of enrichment, the third column ("GSE pvalue) contains the enrichment significance as computed via the supervised CAMERA method and the forth column (“Unfiltered qvalue”) holds the false discovery rate qvalue based on the unweighted supervised GSE pvalues. Columns five through eight display the qvalues from a wFDR analysis using each of the evaluated unsupervised gene set testing methods to compute the pvalue weight.
The Additional file 1 contains similar results for all of the collections for both data sets. The Additional file 1 also contains the Spearman rank correlation values between the unsupervised gene set testing pvalues computed by the CAMERA method and various unsupervised statistics (mean intergene correlation and the pvalues from the MLRT, SGSE, TWT and MPDT methods). As seen in Tables 6 and 7 and the Additional file 1, no single method was dominant for all MSigDB collections on both data sets, however, the TWT method and SGSE method tended to provide the best overall results. The rank correlation values shown in Tables S1 and S14 of the Additional file 1 also demonstrate the general association between gene set biological activity (as represented by the CAMERA gene set test pvalues) and the sample covariance structure of the gene set members. Section “Performance on gene expression data and MSigDB collections” below contains a more detailed discussion of the real data analysis results.
Discussion
Gene set testing, or pathway analysis, is an important tool for analyzing and interpreting highdimensional genomic data sets [1, 2]. Compared to approaches that use a separate test for each variable, gene set testing offers greater statistical power, superior interpretation and improved replication. Although most gene set testing methods can only be used in a supervised context, i.e., assessing the association of gene set members with an outcome variable, a number of important unsupervised use cases exist. To address the lack of effective unsupervised gene set testing methods, we recently developed the SGSE method [15] and later demonstrated the effective use of this method for screeningtesting [19] in the SGSF approach [24]. Although the SGSE method was shown to be superior to existing unsupervised techniques and was able to significantly improve gene set testing power when used in the SGSF screeningtesting approach, the method only supports testing against a competitive null hypothesis and is not able to effectively identify biologically relevant gene sets for a number of important use cases. To remedy the limitations of the SGSE method and other available unsupervised techniques, we developed two new unsupervised gene set tests, the Marc̆enkoPastur Distribution Test (MPDT) and the TracyWidom Test (TWT). Both the MPDT and the TWT support selfcontained and competitive null hypotheses and both are based on random matrix theory (RMT) findings regarding the distribution of the eigenvalues of matrices with a white Wishart distribution [32, 33]. As outlined in Section “Random matrix theory (RMT) benefits”, the RMT basis of both methods conveys several general benefits: improved performance on highdimensional data, robustness to departures from normality and superior performance on small sample sizes.
Table 1 lists a set of biologically relevant population covariance matrix structures that illustrate the relative benefits of the proposed MDPT and TWT methods and existing methods such as SGSE and MLRT. These structures also formed the basis for the type I error control and power simulation designs detailed in Sections “Simulation design to assess type I error control” and “Simulation design to assess statistical power”. As shown in the type I error control results in Section “Type I error control results” and Table 4, all methods had excellent type I error control on each of the evaluated models. In contrast, the empirical power realized by each of four methods, as illustrated in Table 5, diverged significantly across each of the simulated models with each method delivering the best power for at least one model. Overall, the TWT method had the best performance, with the top power for six of the 10 cases and close to the best power for the remaining four cases. Sections “Unsupervised gene set testing for the single block model” through “Performance on gene expression data and MSigDB collections” below provide a more detailed discussion of these models and the simulation and real data analysis results.
Unsupervised gene set testing for the single block model
For the single block model, represented by power simulation models MVN1 thru MVN6 detailed in Section “Simulation design to assess statistical power”, all members of the gene set have a positive pairwise correlation with almost no correlation with genes not in the set. Such a covariance structure is quite common for genomic data, e.g., microarray gene expression, and gene sets containing coregulated genes [11] or gene sets based on the result of gene clustering [25]. For each of the first four simulations models (MVN1 thru MVN4), such a single block covariance structure was used in which all variances were set to 1 and all covariances were set to 0 except for the covariances between true gene set members, which were set to an equal and positive value. For these four models, the best power was generated by either the TWT method or the SGSE method with the classic MLRT method performing well on the two larger sample size models (MVN1 and MVN4) and the MPDT method generating average power substantially below the other methods. It is known that such a block covariance matrix structure will tend to produce a first principal component (PC) for the entire data set whose associated eigenvector has large weights of the same sign for all block members [39]. Because the SGSE method is based on the association between gene set members and the PCs of the entire data set, with weights based on the associated eigenvalue significance, it is able to perform well when the first PC effectively captures the true gene set signal. When the sample covariance matrix is computed using just the gene set members for the single block model, the first principal component can likewise be expected to have large weights of the same sign for all gene set members and a correspondingly large eigenvalue. Importantly, the largest eigenvalue of the sample covariance matrix for just the gene set members in this case can be expected to be stochastically larger than the principal eigenvalue of a sample covariance matrix for a random group of genes of the same size since, for the single block model, all covariances not between true gene set members are 0. This expected difference between the largest eigenvalues of random and nonrandom partitioned sample covariance matrices enabled the competitive TWT method to also perform well for the single block use case. The moderately higher average power achieved by the SGSE method relative the TWT method for lower sample sizes (MVN2 and MVN3) may be due to the fact that the SGSE method is parametric whereas the TWT method is based on a permutation distribution. The fact that only the largest eigenvalue will likely represent the nonnull gene set explains the poor performance of the MPDT method, which is based on the bulk eigenvalue distribution. The satisfactory performance of the MLRT method for the single block models was likely due to the fact that the MVN distribution of the simulated data aligned with the distributional assumptions of the MLRT test.
For single block models MVN5 and MVN6, the best power was provided by either the TWT method or the MPDT method. The covariance structure used for these simulations specified an elevated variance for true gene set members with all covariances set to 0. Such a covariance structure will tend to generate one PC for each variable that has an elevated variance with the loading for that variable dominating the other loadings and the eigenvalue associated with the PC similar in magnitude to the variance of the variable. The signal for the true gene set for these models was therefore spread across multiple eigenvalues of the sample covariance matrix. Because the MPDT method is based on all of eigenvalues of the partitioned sample covariance matrix, its good relative power was therefore expected for these models. Since no single PC was closely associated with the true gene set, the poor performance of the SGSE method was also expected. Because the TWT method is based on just the magnitude of the largest eigenvalue of the partitioned sample covariance matrix, it was able perform well even though the associated PC tended to represent just one gene set member. The poor power of the MLRT method in these cases was likely due to the fact that all population covariances were 0.
Unsupervised gene set testing for the multiblock model
A multiblock covariance structure was used for model MVN7 in which the portion of the population covariance matrix associated with the true gene set was divided into 5 disjoint 2 × 2 blocks with covariance 0.2. Multiblock covariance structures are also quite common for genomic data and biologicallybased gene sets, especially large gene sets that represent processes with multiple modes of activity. Similar to the single block case, such a multiblock structure will tend to produce one PC per block [39] and will therefore distribute the signal for the true gene set across multiple eigenvalues/PCs of the sample covariance matrix. For MVN7, the MLRT method had the best average power with the TWT and MPDT methods providing comparable performance and the SGSE method returning the worst average power. As detailed for models MVN5 and MVN6 above, the fact that multiple eigenvalues represent the true gene set is consistent with the good relative power of the MPDT and TWT methods and poor relative power of the SGSE method. The superior relative performance of the MLRT method for MVN7, and the fact that the MLRT method generated much larger average power for MVN7 than for MVN5 or MVN6, is likely due to the presence of nonzero population covariance values in the MVN7 model.
The anticorrelated multiblock covariance structure used for MVN8 was a variation of the multiblock model used for MVN7. In this case, the portion of the covariance matrix associated with the true gene set was split into two blocks with correlations between members of the same block set to 0.1. The key difference between the multiblock structure and the anticorrelated multiblock structure used for MVN8 is that the covariance between members of different blocks was set to 0.1 rather than 0. This type of structure also corresponds to a biologically realistic class of gene sets, e.g., a metabolic pathway that has several distinct modes of action represented by different subsets of associated genes. The covariance structure in MVN8 leads to two PCs representing the variable group members with opposite sign loadings for the members of each block. For MVN8, the TWT method provided the best power with the MLRT method a close second and significantly lower power for the MPDT and SGSE methods. The opposite sign loadings pose a particular problem for the SGSE method which effectively compares the mean PC loading for members of the gene set against the mean PC loading for nongene set members. The TWT method, on the other hand, is based just on the magnitude of the largest eigenvalue so has power comparable to the single block models. The results for the MPDT and MLRT methods on MVN8 are explained by reasoning similar to that outlined for the single block models above.
Unsupervised gene set testing for the repeated block model
Similar to the single block and multiblock models, the repeated block model, represented by MVN8, will tend to produce one PC per each block that has large and equal signed loadings for all block members. Such a model can be expected when there are multiple independent gene sets associated with a specific data set. Although every gene set in this case has a population covariance matrix with equal and nonzero covariance values, the gene sets are significant relative to H _{0} (8) in the main manuscript since a random partition of the sample covariance matrix of the same size as the gene sets will tend to include zero covariance values as well due to the zero covariance between members of different gene sets For model MVN9, the TWT method had substantially higher power than all other methods. In this case, the population covariance matrix was divided into one block per disjoint gene set. For the SGSE method, the poor power is explained by the fact that each gene set will only be associated with one PC yet multiple PCs will have significant eigenvalues so, when the association measures are combined for all PCs, the gene set will not appear enriched relative to a competitive null hypothesis. The poor power of the MPDT and MLRT methods relative to the TWT method is likely due to the fact that MPDT and MLRT consider all of the eigenvalues of the partitioned sample covariance matrix whereas the TWT method is based on just the largest eigenvalue.
Unsupervised gene set testing for nonnormal data
For many types of genomic data, such as the genotypic data collected by genomewide association studies, the measured values of genomic variables are not normally distributed. For the case of single nucleotide polymorphisms (SNPs) specified using additive coding, a data model similar to Binom1 can be expected. For this model, the TWT method had the best average power followed closely by the SGSE method. The superior performance of the TWT and SGSE methods relative to the MPDT and MLRT methods follows from the universality properties of the TracyWidom distribution, i.e., the TracyWidom distribution of the scaled and centered largest eigenvalue is known to be robust to departures from normality for x [38].
Performance on gene expression data and MSigDB collections
To assess the practical utility of the TWT and the MPDT methods, a wFDR analysis (detailed in Section “Real data analysis design”) was performed on two real gene expression data sets relative to MSigDB gene set collections. As detailed in Section “Real data results” above and the Additional file 1, the results on the p53 and leukemia data sets mirrored the results on the simulation examples. Specifically, the relative ranking of the four methods varied considerably across the MSigDB collections and two data sets with the TWT and SGSE methods delivering the best overall performance. As seen in Table 6, the eight most significantly enriched C7 gene sets relative to AML vs. ALL status represent the differential expression of different types of white blood cells (primarily lymphoid vs. myeloid cells) and are therefore biologically consistent with the phenotype. Although these gene sets have enrichment pvalues that are significant prior to MHC, after controlling the FDR for the family of all 1910 analyzed C7 gene sets, none have significant qvalues. When a wFDR approach is taken using weights based on the pvalues from the TWT test, three of the top eight gene sets had significant qvalues (these are marked in bold in Table 6 with a * prefixing the gene set name). When weights were based on either the MLRT, SGSE or MPDT method, no significant qvalues were generated.
For the p53 gene expression data and C6 collection, the impact of weighting was less pronounced. As seen in Table 7, the two most significantly enriched C6 gene sets relative to p53 mutated status represent either upregulated or downregulated genes in the NCI60 panel of cell lines with mutated p53. In this case, both gene sets had significant qvalues without any weighting and when pvalues were weighted using any of the unsupervised gene set tests. However, the use of the TWT and MPDT methods to generate weights also produced marginally significant qvalues of less than 0.3 for several other biologically plausible gene sets for the p53 data. Similar to the leukemia results, these qvalues are marked in bold with * prefixing the gene set names.
Tables S1 and S14 in the Additional file 1 show the overall association between the supervised gene set test pvalues generated by the CAMERA method and either the mean intergene correlation among gene set members (as estimated by CAMERA) or the unsupervised pvalues generated by the MLRT, SGSE, TWT and MPDT methods. These tables demonstrate the general association between the departure of gene set members from an identity covariance structure and biological activity as represented by the supervised gene set pvalues. As seen in these tables, the pvalues generated by the TWT and SGSE methods have the largest correlation with the supervised pvalues for both data sets across the different MSigDB gene set collections. In contrast, the mean intergene correlation estimated by the CAMERA method, while still associated with the supervised pvalues, is a comparatively poor predictor of gene set biological activity.
Conclusions
The TWT and MPDT methods represent important methodological advances for unsupervised gene set testing. These new methods support both selfcontained and competitive null hypotheses and provide performance superior to existing approaches, such as the SGSE and MLRT methods, on a set of biologically important data structures. The TWT method provides good power across most expected models and is clearly the best choice for nonnormal data (e.g., model Binom1), an anticorrelated multiblock structure (e.g., model MVN7) or a repeated block structure (e.g., model MVN8). If a single block structure can be expected with standardized variance (i.e., all members of the gene set have a positive pairwise correlation with almost no correlation with genes not in the set and variance of 1) and the number of samples is small relative to the size of the gene set (i.e., n/p≤5), as represented by models MVN2 and MVN3, then the SGSE method is the best choice. If variance of each gene in the set is large relative to genes not in the set (as represented by model MVN6), then the MPDT method can provide the best results. For multiblock data (e.g., model MVN7) or data following a single block structure with a large correlation between gene set members (e.g., MVN4), the best results are generated by the classic MLRT method.
Important directions for future research include the assessment of a broader range of biologically relevant covariance structures, the exploration of other classes of nonnormal data, and the use of the TWT and MPDT methods to make novel biological findings via pvalue weighting.
Abbreviations
 ALL:

Acute lymphoblastic leukemia
 AML:

Acute myeloid leukemia
 CAMERA:

Correlation adjusted mean rank gene set test
 FDR:

False discovery rate
 GO:

Gene ontology
 MHC:

Multiple hypothesis correction
 MLRT:

Modified likelihood ratio criterion test
 MPDT:

Marc̆enkoPastur distribution test
 MSigDB:

Molecular signatures database
 MVN:

Multivariate normal
 PC:

Principal component
 PCGSE:

Principal component gene set enrichment
 RMT:

Random matrix theory
 SGSE:

Spectral gene set enrichment
 SNP:

Single nucleotide polymorphism
 TWT:

TracyWidom Test
 wFDR:

Weighted false discovery rate
References
Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012; 8(2):1002375. doi:10.1371/journal.pcbi.1002375.
Hung JH, Yang TH, Hu Z, Weng Z, Delisi C. Gene set enrichment analysis: performance evaluation and usage guidelines. Brief Bioinform. 2012; 13(3):281–91. doi:10.1093/bib/bbr049.
Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006; 7(1):55–65. doi:10.1038/nrg1749.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledgebased approach for interpreting genomewide expression profiles. Proc Nat Acad Sci U S A. 2005; 102(43):15545–15550. doi:10.1073/pnas.0506580102.
Goeman JJ, Buehlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007; 23(8):980–7. doi:10.1093/bioinformatics/btm05.
Gene Ontology Consortium. The gene ontology in 2010: extensions and refinements. Nucleic Acids Res. 2010; 38(Database issue):331–5. doi:10.1093/nar/gkp1018.
Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000; 28(1):27–30. doi:10.1093/nar/28.1.27.
Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. Molecular signatures database (msigdb) 3.0. Bioinformatics. 2011; 27(12):1739–40. doi:10.1093/bioinformatics/btr260.
Efron B, Tibshirani R. On testing the significance of sets of genes. Ann Appl Stat. 2007; 1(1):107–29. doi:10.1214/07AOAS101.
Barry WT, Nobel AB, Wright FA. A statistical framework for testing functional categories in microarray data. Ann Appl Stat. 2008; 2:286–315.
Wu D, Smyth GK. Camera: a competitive gene set test accounting for intergene correlation. Nucleic Acids Res. 2012; 40(17):133. doi:10.1093/nar/gks461.
Zhou YH, Barry WT, Wright FA. Empirical pathway analysis, without permutation. Biostatistics. 2013; 14(3):573–85. doi:10.1093/biostatistics/kxt004.
Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ. Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci U S A. 2005; 102(38):13544–9. doi:10.1073/pnas.0506577102.
Genovese CR, Roeder K, Wasserman L. False discovery control with pvalue weighting. Biometrika. 2006; 93(3):509–24. doi:10.1093/biomet/93.3.509.
Frost HR, Li Z, Moore JH. Spectral gene set enrichment (SGSE). BMC Bioinformatics. 2015; 16:70. doi:10.1186/s1285901504907.
Frost HR, Li Z, Moore JH. Principal component gene set enrichment (PCGSE). BioData Min. 2015; 8:25. doi:10.1186/s130400150059z.
Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. Ann Stat. 2001; 29(2):295–327.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. Series B (Statistical Methodology). 1995; 57(1):289–300.
Bourgon R, Gentleman R, Huber W. Independent filtering increases detection power for highthroughput experiments. Proc Natl Acad Sci U S A. 2010; 107(21):9546–51. doi:10.1073/pnas.0914005107.
Murcray CE, Lewinger JP, Conti DV, Thomas DC, Gauderman WJ. Sample size requirements to detect geneenvironment interactions in genomewide association studies. Genet Epidemiol. 2011; 35(3):201–10. doi:10.1002/gepi.20569.
Dai JY, Kooperberg C, Leblanc M, Prentice RL. Twostage testing procedures with independent filtering for genomewide geneenvironment interaction. Biometrika. 2012; 99(4):929–44. doi:10.1093/biomet/ass044.
Hsu L, Jiao S, Dai JY, Hutter C, Peters U, Kooperberg C. Powerful cocktail methods for detecting genomewide geneenvironment interaction. Genet Epidemiol. 2012; 36(3):183–94. doi:10.1002/gepi.21610.
Frost HR, Andrew AS, Karagas MR, Moore JH. A screeningtesting approach for detecting geneenvironment interactions using sequential penalized and unpenalized multiple logistic regression. Pac Symp Biocomput. 2015; 20:183–94.
Frost HR, Li Z, Asselbergs FW, Moore JH. An independent filter for gene set testing based on spectral enrichment. Comput Biol Bioinformatics, IEEE/ACM Trans. 2015; PP(99):1–1. doi:10.1109/TCBB.2015.2415815.
Segal E, Friedman N, Koller D, Regev A. A module map showing conditional activity of expression modules in cancer. Nat Genet. 2004; 36(10):1090–8. doi:10.1038/ng1434.
Robinson MD, Grigull J, Mohammad N, Hughes TR. Funspec: a webbased cluster interpreter for yeast. BMC Bioinformatics. 2002; 3:35.
Toronen P. Selection of informative clusters from hierarchical cluster tree with gene classes. BMC Bioinformatics. 2004; 5:32. doi:10.1186/14712105532.
Freudenberg JM, Joshi VK, Hu Z, Medvedovic M. Clean: Clustering enrichment analysis. BMC Bioinformatics. 2009; 10:234. doi:10.1186/1471210510234.
Choi Y, Kendziorski C. Statistical methods for gene set coexpression analysis. Bioinformatics. 2009; 25(21):2780–6. doi:10.1093/bioinformatics/btp502.
Rahmatallah Y, EmmertStreib F, Glazko G. Gene sets net correlations analysis (gsnca): a multivariate differential coexpression test for gene sets. Bioinformatics. 2014; 30(3):360–8. doi:10.1093/bioinformatics/btt687.
Hsueh HM, Tsai CA. Gene set analysis using sufficient dimension reduction. BMC Bioinformatics. 2016; 17:74. doi:10.1186/s1285901609286.
Mehta ML, Random Matrices, Vol. 142. Pure and applied mathematics, 3rd ed. Amsterdam: Academic Press; 2004.
Johnstone IM. Approximate null distribution of the largest root in multivariate analysis. Ann Appl Stat. 2009; 3(4):1616–33. doi:10.1214/08AOAS220.
Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ. Mll translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet. 2002; 30(1):41–7. doi:10.1038/ng765.
Anderson TW. An Introduction to Multivariate Statistical Analysis, 3rd ed. Hoboken: WileyInterscience; 2003.
Chicheportiche R, Bouchaud JP. Goodnessoffit tests with dependent observations. J Stat Mech: Theory Experiment. 2011; 2011(09):09003.
Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLOS Genet. 2006; 2(12):190. doi:http://dx.doi.org/10.1371/journal.pgen.0020190.
Soshnikov A. A note on universality of the distribution of the largest eigenvalues in certain sample covariance matrices. J Stat Phys. 2002; 108:1033–56.
Jolliffe IT. Principal Component Analysis. Springer Series in Statistics. New York: Springer; 2002.
Funding
National Institutes of Health grants K01LM012426, P20GM103534, P30CA023108, U19CA148127 and U01CA196386.
Availability of data and materials
The MSigDB v5.0 gene sets can be downloaded from http://www.broadinstitute.org/gsea/msigdb/collections.jsp. The p53 and leukemia gene expression data sets can be downloaded from http://www.broadinstitute.org/gsea/datasets.jsp. An R implementation of the TWT and MPDT methods and a Sweave document containing a simple example are available at http://www.dartmouth.edu/~hrfrost/UnsupGST.
Authors’ contributions
HRF developed the TWT and MPDT methods, implemented the associated gene set testing algorithms, designed the simulation study, performed the real data analysis and drafted the manuscript. CIA participated in the development of the methodology, assisted with the design of the simulation study and real data analysis and helped draft the manuscript. Both HRF and CIA have read and approve of the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable. Although the results contained in this manuscript were generated through the analysis of data collected from human subjects, only previously collected, publicly available and deidentified data sources were be used. Consequently, the proposed research was except from Federal regulations according to category 4 (45 CFR 46.101.b.4) of the Common Rule for the Protection of Human Subjects.
Author information
Authors and Affiliations
Corresponding author
Additional file
Additional file 1
Supplementary results for leukemia and p53 gene expression examples. (168 KB PDF)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Frost, H.R., Amos, C.I. Unsupervised gene set testing based on random matrix theory. BMC Bioinformatics 17, 442 (2016). https://doi.org/10.1186/s1285901612998
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901612998
Keywords
 Gene set testing
 Pathway analysis
 Random matrix theory
 TracyWidom
 Marc̆enkoPastur