 Methodology article
 Open access
 Published:
Geneset distance analysis (GSDA): a powerful tool for geneset association analysis
BMC Bioinformatics volumeÂ 22, ArticleÂ number:Â 207 (2021)
Abstract
Background
Identifying sets of related genes (gene sets) that are empirically associated with a treatment or phenotype often yields valuable biological insights. Several methods effectively identify gene sets in which individual genes have simple monotonic relationships with categorical, quantitative, or censored eventtime variables. Some distancebased methods, such as distance correlations, may detect complex nonmonotone associations of a geneset with a quantitative variable that elude other methods. However, the distance correlations have yet to be generalized to associate genesets with categorical and censored eventtime endpoints. Also, there is a need to determine which genes empirically drive the significance of an association of a gene set with an endpoint.
Results
We develop geneset distance analysis (GSDA) by generalizing distance correlations to evaluate the association of a gene set with categorical and censored eventtime variables. We also develop a backward elimination procedure to identify a subset of genes that empirically drive significant associations. In simulation studies, GSDA more effectively identified complex nonmonotone geneset associations than did six other published methods. In the analysis of a pediatric acute myeloid leukemia (AML) data set, GSDA was the only method to discover that eventfree survival (EFS) was associated with the 56gene AML pathway geneset, narrow that result down to 5 genes, and confirm the association of those 5 genes with EFS in a separate validation cohort. These results indicate that GSDA effectively identifies and characterizes complex nonmonotonic geneset associations that are missed by other methods.
Conclusion
GSDA is a powerful and flexible method to detect geneset association with categorical, quantitative, or censored eventtime variables, especially to detect complex nonmonotonic geneset associations. Available at https://CRAN.Rproject.org/package=GSDA.
Background
Biomedical researchers frequently seek to determine which gene pathways or ontologies are affected by a treatment or involved in biological processes that influence a particular phenotype or clinical outcome. Genes act cooperatively in pathways to influence biological processes and subsequently clinical outcomes. There are several public resources that annotate genes to specific pathways and other biological processes or functions. Several analysis methods that combine gene annotations with statistical analysis results to evaluate the association of sets of genes with specific biological annotations with a treatment or outcome have been used to make many scientific discoveries. MaciejewskiÂ [1] has reviewed several of those methods.
Virtaneva et al.Â [2] first used the significance and function of expression (SAFE) method in a study of acute myeloid leukemia (AML) and then Barry et al.Â [3, 4] more fully described SAFE as an analysis method. First, for each gene, SAFE computes a statistic measuring differential expression across two groups. SAFE then ranks individual genes according to that statistic and computes a Wilcoxon ranksum statistic to compare the ranks of geneset genes to those of other genes. Finally, the statistical significance (p value) of the individual differential expression statistics and the geneset statistic is determined by repeating the analysis for a series of data sets in which the assignment of group labels to expression profiles has been permuted. SAFE has also been generalized to evaluate associations of gene sets with other phenotypes or endpoints, including quantitative variables and censored eventtime variables, such as survival times in oncology studies.
Mootha et al.Â [5] first used the geneset enrichment analysis (GSEA) method in a study of diabetes, and then Subramanian et al.Â [6] more fully described GSEA as an analysis method. The GSEA framework is very similar to that of SAFE, except that GSEA uses an enrichment statistic in place of the Wilcoxon ranksum statistic in SAFE. GSEA has been generalized to evaluate associations with several types of endpoints and been widely used with much success in the biomedical research literature. Efron and TibshiraniÂ [7] developed the geneset association (GSA) method by modifying the form of the GSEA enrichment statistic. SAFE, GSEA, and GSA are all methods that compare genelevel association statistics of genes annotated to a gene set with those of other genes and permute the assignment of genomic data to treatment or endpoint data to determine the statistical significance of that comparison.
Goeman et al.Â [8] proposed the global test (GT) as a different type of geneset association analysis method. For a given gene set, GT models the contributions of individual member genes as random effects and tests whether the variability of those random effects equals zero. In this way, GT builds upon standard theory for random effects in generalized linear models and can be used to evaluate association of genesets with many different treatments, phenotypes, or endpoints. Goeman and BĂĽhlmannÂ [9] also note that SAFE, GSEA, and GSA are competitive testing procedures that compare the individual geneassociation statistics of genes annotated to the gene set with those of other genes, whereas GT is a selfcontained test that is a function solely of the geneset genes and the treatment, phenotype, or endpoint. By building upon generalized linear models, GT is a very powerful method for detection of general linear associations, but its power to detect other forms of association is not well understood.
Irizarry et al.Â [10] and Ackermann and StrimmerÂ [11] suggest that the total of test statistics (TOTS) or total of squared test statistics for individual genes be used as a geneset association statistic and its significance be determined by permutation. This approach is a selfcontained method like GT but uses permutation like SAFE, GSEA, and GSA. The TOTS framework can be used in conjunction with general linear modeling or proportional hazards modeling and thus can be used to evaluate the association of a gene set with many different variable types, including categorical, quantitative, and censored event times.
Nettleton et al.Â [12] proposed the multiresponse permutation procedure (MRPP) to evaluate the association of a geneset with a set of treatments or a categorical endpoint or phenotype. MRPP measures the association of a geneset with a categorical treatment or endpoint variable by computing the sum of distances between each pair of subjects belonging to the same group. The distance is computed using the data for genes belonging to the gene set. A lesser value of this distancebased association statistic indicates the subjects belonging to the same group have very similar profiles for genes in the gene set and thus indicates a stronger association. The statistical significance (p value) of the distancebased association statistic is determined by permutation of the assignment of gene profile data to the categorical labels.
More recently, Cao et al.Â [13] developed projection onto orthogonal statistical tests (POST) as a general method to evaluate the association of a gene set. POST first computes an orthogonal decomposition (principal components) of the geneset data, selects a set of components that characterize most of the variation of the original data, and computes a test statistic that evaluates the association of each of those components with the treatment or endpoint variable of interest. Next, a geneset association statistic is computed as a weighted sum of the componentsâ€™ squared association statistics. A bootstrap procedure is then used to compute parameter estimates for a weighted chisquare distribution approximation that is used to compute the p value.
VĂ¤remo et al.Â [14] developed the platform for integrative analysis of omic (PIANO) package that implements eleven geneset analysis methods that operate on genelevel statistics or p values in one computational framework. PIANO can be used to identify genesets that have nondirectional, mixed directional, or distinct directional associations with the endpoint or phenotype of interest. Additionally, PIANO allows users to compute consensus results for multiple methods.
Each of the methods described above has its own unique set of strengths and limitations. Among the methods mentioned above, GT is the only method that does not rely on computationally intense permutation or resampling procedures to determine statistical significance. MRPP is the only method that has good power to detect some complex associations, such as genesets that show equal mean expression for all genes but differential correlation among genes across categorical groups. Methods that can detect complex associations without relying on permutation or bootstrapping would have great practical value for biological research applications.
Zhu et al.Â [15] review and develop statistical theory and methods to use various distance correlations to measure linear, monotonic, and nonmonotonic associations between two quantitative data matrices. Here, we extend the framework of Zhu et al.Â [15] to develop geneset distance analysis (GSDA) as a method to evaluate the association of a geneset with a categorical, quantitative, or censored eventtime variable by adapting distance correlations to those settings. Below, we describe the development of GSDA, conceptually compare it with other methods, and evaluate its performance in simulation studies and an analysis of a publicly available pediatric acute myeloid leukemia (AML) data set.
Methods
The distance correlation framework
Zhu et al.Â [15] review and develop a distance correlation ttest framework to statistically evaluate the association between two quantitative data matrices. They show that distance correlations can detect nonmonotonic associations with small to moderate sample sizes. These properties indicate that a generalized distance correlation ttest may be a statistically robust framework for many practical applications involving geneset association testing. Below, we briefly describe this distance correlation ttest framework and then describe how we adapt it to be applicable in other settings.
First, we introduce some general notation to describe the distance correlation framework of Zhu et al.Â [15]. Let \({\varvec{X}}\) be a \(n \times m\) matrix of the numeric data values of \(g=1,\dots ,m\) variables for each of \(i=1,\dots ,n\) individuals such that entry \(x_{ig}\) has the value of numeric variable g for individual i. Let \(d_x(x_i,x_j)\) be a metric of the distance between any two individuals i and j. Let \({\varvec{X}}^\star\) be the \(n \times n\) matrix of distances between each pair of individuals with entries \(x_{ij}^\star = d_x(x_i,x_j)\). Similarly, let \({\varvec{Y}}\) be a \(n \times k\) matrix of the numeric data values of a different set of \(v=1,\dots ,k\) variables for each of the same set of \(i=1,\dots ,n\) individuals with entry \(y_{iv}\) representing the data value of variable v for subject i. Also, let \(d_y(y_i,y_j)\) measure the distance between any two individuals i and j and let \({\varvec{Y}}^\star\) be the \(n \times n\) matrix of the distances \(y^\star _{ij}=d_y(y_i,y_j)\) for all i,Â j pairs of individuals.
Next, for any \(n \times\)n distance matrix \({\varvec{A}}^\star\) with \(n > 2\), the entries \({\tilde{a}}_{ij}\) of the Ucentered distance matrix \(\tilde{{\varvec{A}}}\) are computed as
where \(a^\star _{i\cdot } = \sum _{j=1}^n a_{ij}^\star\) is the sum over row i and \(a^\star _{\cdot j} = \sum _{i=1}^n a^\star _{ij}\) is the sum over column j of the distance matrix \({\varvec{A}}^\star\). Now, for any pair of \(n \times n\) Ucentered distance matrices \(\tilde{{\varvec{A}}}\) and \(\tilde{{\varvec{B}}}\) with \(n>3\), define the inner product as
With these notations and definitions, the distance correlation between the data matrices \({\varvec{X}}\) and \({\varvec{Y}}\) is defined and computed as
where \(\tilde{{\varvec{X}}}\) and \(\tilde{{\varvec{Y}}}\) are the Ucentered distance matrices for the data matrices \({\varvec{X}}\) and \({\varvec{Y}}\). Zhu et al.Â [15] elegantly show that the distance correlation tstatistic
with \(n(n3)/21\) degrees of freedom is a very powerful and wellcontrolled ttest of the null hypothesis that the two numeric data matrices have no association (i.e., the mutual information is zero) when Euclidean distance is used for \(d_x(x_i,x_j)\) and \(d_y(y_i,y_j)\).
The distance correlation ttest of Zhu et al.Â [15] in Eq.Â (4) is technically elegant. Also, its statistical power and Type I error control are rigorously shown by thorough mathematical proofs. It is a statistically rigorous test that can be widely used to evaluate the association of two numeric data matrices (such as gene expression and methylation) in practice. Below, we provide a general overview of GSDA and then propose specific adaptations of the distance correlation ttest to make it useful for evaluating the association of a geneset numeric data matrix \({\varvec{X}}\) with a quantitative variable, a categorical variable, and a censored eventtime variable. In each of these three distinct settings, we compute a Ucentered data matrix \(\tilde{{\varvec{Y}}}\) for the variable of interest in a settingspecific manner and then substitute it into Eqs.Â (3) and (4) to obtain a correlation and tstatistic to describe and test the association. We call this family of methods geneset distance analysis (GSDA) because it uses a distancetesting framework to evaluate the association of a geneset with a treatment, phenotype, or outcome variable of interest.
Overview of geneset distance analysis
FigureÂ 1 provides a general overview of geneset distance analysis (GSDA). The initial inputs are a list of genes belonging to the gene set, gene expression matrix, and the endpoint, phenotype, or treatment data which is to be associated with the expression matrix (Fig.Â 1a). The expression data and phenotype/endpoint/treatment data should be collected for the same set of subjects. GSDA then computes a phenotype distance matrix (Fig.Â 1b) and an expression distance matrix for the genes in the geneset (Fig.Â 1c). Each entry of the distance matrix gives the distance between a pair of subjects in terms of the variable for which the distance was computed. Corresponding entries of the distance matrix are then paired and the correlation of these paired distance matrix entires is evaluated with a distance correlation statistic (Fig.Â 1d). The statistical significance is computed using the equations of â€śThe distance correlation frameworkâ€ť section above or by permutation as described in â€śVerifying significant results with permutationâ€ť section below. Finally, the result may be visualized with a heatmap (Fig.Â 1e).
Associating a geneset with one quantitative variable
Evaluating the association of the geneset matrix \({\varvec{X}}\) with one quantitative variable \({\varvec{Y}}\) is a special case of the distance correlation framework described in â€śThe distance correlation frameworkâ€ť section above. The quantitative variable can be represented as a data matrix \({\varvec{Y}}\) with exactly \(v=1\) one column of data. For this setting, GSDA uses Euclidean distance to performs all calculations exactly as described above to obtain \(r_d\) of Eq.Â (3) and \(t_d\) of Eq.Â (4).
Associating a geneset with one categorical variable
To evaluate the association of a geneset data matrix \({\varvec{X}}\) with one categorical variable \({\varvec{Y}}\) (one column with the category label for each subject), GSDA first computes each (i,Â j) entry of initial distance matrix \({\varvec{Y}}^\star\) for Y as
where \(y_i\) and \(y_j\) are the values of the categorical variable for individuals i and j, respectively, and \(\mathrm{I}(\cdot )\) is the indicator function that equals 1 if the enclosed statement is true and equals 0 if the enclosed statement is false. In other words, for the categorical variable Y, the distance between each pair of individuals in the same category is zero and the distance between any pair of individuals in different categories is 1. The initial distance matrix \({\varvec{Y}}^\star\) is then Ucentered according to Eq.Â 1 and the resulting Ucentered distance matrix \(\tilde{{\varvec{Y}}}\) is substituted into Eqs.Â (3) and (4) to obtain the distance correlation \(r_d\) and tstatistic \(t_d\).
Associating a geneset with one eventtime endpoint
Similarly, GSDA evaluates the association of a numeric geneset matrix \({\varvec{X}}\) with one censored eventtime endpoint by computing an initial distance matrix \({\varvec{Y}}^\star\), then Ucentering it according to Eq.Â (1) to obtain the Ucentered distance matrix \(\tilde{{\varvec{Y}}}\) and finally substituting the Ucentered distance matrix into Eqs.Â (3) and (4) to compute the distance correlation \(r_d\) and \(t_d\). For each individual \(i=1,\dots ,n\), the censored eventtime data is represented as a pair (\(o_i\),\(s_i\)) with the observation time and event status of the individual. For each individual that has experienced the event of interest, the event status \(s_i=1\) and the observation time \(o_i\) is the time elapsed from baseline until the event occurred. For each individual that has not yet experienced the event of interest, the event status \(s_i=0\) and the observation time \(o_i\) is the time elapsed from baseline until the most recent determination of that individualâ€™s event status. Also, let \(l=1,\dots ,\acute{s}\) index the unique observation times \(u_1< u_2< \cdots < u_{\acute{s}}\) for the events. Given this information, GSDA computes the initial distance matrix \({\varvec{Y}}^\star\) for the censored eventtime variable as
For each pair of individuals i and j that have both not yet experienced an event (\(s_i=s_j=0\)), this distance metric is zero. For each pair of individuals such that individual i experiences an event (\(s_i=1\)) prior to the observation time of individual j (\(o_i < o_j\)), the distance metric is the number of unique event times that occur between \(o_i\) and \(o_j\). This initial distance matrix \({\varvec{Y}}^\star\) is then substituted into Eq.Â 1 to obtain the Ucentered distance matrix \(\tilde{{\varvec{Y}}}\). Finally, the Ucentered distance matrix \(\tilde{{\varvec{Y}}}\) is substituted into Eqs.Â (3) and (4) to obtain the distance correlation \(r_d\) and its tstatistic \(t_d\).
The distance metric of Eq.Â (6) is computed by categorizing subjects in each risk set as being eventfree at or having an event prior to each unique event time. Similar techniques in defining and averaging over risk sets are used in classical survival analysis methods such as the logrank test [16]. Also, this distance metric is similar to the differences of censoradjusted ranks in the rankbased survival correlation method of Jung et al.Â [17] and is closely related to the Cindex [18].
Verifying significant results with permutation
The distance correlation ttest framework is very useful to quickly identify nonsignificant results and eliminate them from further consideration. The derivation of the ttest relies on asymptotic approximations of the null distribution for evaluating the association of two data matrices. The ttest approximation is accurate for most of the p value range in each of our simulation studies described in â€śSimulation studiesâ€ť section below. Still, like other asymptotic approximations, the ttest approximation may not give accurate probabilities for the extreme tails. Thus, the GSDA package includes a permutation module to evaluate the reliability of significant distance correlation ttest results. The permutation procedure is very fast because it operates on the the Ucentered distance matrices \({\tilde{X}}\) and \({\tilde{Y}}\). Thus, only Eqs.Â (3) and (4) are computed in the permutation iterations. These equations only involve simple arithmetic operations and thus are completed very quickly. We use this permutation procedure in the illustrative examples and the example application below. For applications that involve evaluating the association of many genesets, we recommend using the ttest of Eq.Â (4) to rapidly compute initial p values for all genesets and then use permutation to ensure that the smallest p values do not overstate statistical significance. The ttest p value may also be used to break ties among gene sets that have the same permutation p value.
Data transformations and distance metrics
In any distancebased analysis framework, the choice of data transformations and distance metric(s) are important considerations. GSDA uses Euclidean distance for the numeric geneset data matrix \({\varvec{X}}\). In some settings, it may be desirable to transform the data so that each gene has similar variance. This may be accomplished by zscore transformation (subtracting the mean and then dividing by the standard deviation) or commesuration (Nettleton et al.Â [12]) in which data values are centered and then divided by the sum of distances between all pairs of points. These varianceequalizing transformations will ensure that all genes contribute equally to distance calculations; this may be advisable in some applications and not advisable in others. For example, the zscore transformation would equalize the contributions of lowly and highly expressed genes to the distance calculations. If highly expressed genes are more biologically influential in the system under study, then varianceequalizing transformations will obscure this effect in the statistical analysis. If highly and lowly expressed genes are both biologically important, then varianceequalizing transformations may be beneficial.
Also, power and logarithmic transformations of \({\varvec{X}}\) will also profoundly impact GSDA; these transformations may be advisable in some applications but not in others. For a quantitative \({\varvec{Y}}\), GSDA also uses Euclidean distance and thus the same principles apply. For a categorical \({\varvec{Y}}\), GSDA uses the categorical distance of Eq.Â 5, which is impacted by combining groups. For a censored survival time variable \({\varvec{Y}}\), GSDA uses the rankbased metric that is not affected by monotone transformations of the observation times \(o_i\). In this work, Euclidean distance is used for numerical variables and the distance metrics for categorical and censored eventtime variables are defined above. Future research should explore the how the statistical performance of GSDA is affected by incorporating various other distance metrics (such as cosine distance) and transformations into its calculations.
Multipletesting adjustments
The sections above describe how GSDA evaluates the association of one geneset with one quantitative, one categorical, or one censored eventtime variable. The ttest of Eq.Â 4 computes a p value for the association of the geneset with the variable of interest. Frequently in practice one wishes to evaluate the association of each of many genesets with the variable of interest. In these settings, GSDA may be applied as described above to evaluate the association of each geneset with the variable of interest. This will yield many p values and require a multiple testing adjustment. In most settings, it will be most reasonable to use these p values to estimate or control the false discovery rate (FDR) developed by Benjamini and HochbergÂ [19]. Several FDR methods are available for this purpose, including those developed by Benjamini and HochbergÂ [19], StoreyÂ [20], and Pounds and ChengÂ [21]. PoundsÂ [22], Cheng and PoundsÂ [23], and BenjaminiÂ [24] have reviewed many of these methods and provided guidance on how to select the best FDR method for particular applications. We recommend users consider those works to choose the best FDR method for their particular applications.
Comparison with other methods
Several methods have been proposed and used to evaluate the association of a geneset with a variable of interest, including significance and function of expression (SAFEÂ [2,3,4]), geneset enrichment analysis (GSEAÂ [5, 6]), geneset analysis (GSAÂ [7]), the global test (GTÂ [8, 9]), the total of test statistics (TOTSÂ [10, 11]), the multiresponse permutation procedure (MRPPÂ [12]), and projection onto orthogonal statistical tests (POSTÂ [13]) as briefly described in the introduction. As shown in Table 1 and described in greater detail below, these methods can be characterized in terms of various properties that have been described in the literature, including the ability to detect linear, monotone, and nonmonotone associations; the ability to evaluate associations with quantitative, categorical, or eventtime endpoints; being a selfcontained or competitive testing procedure; reliance on statistical model fitting; and use of resampling methods (permutation or bootstrap) to determine statistical significance. GSDA is unique in that it is the only selfcontained method that does not use resampling, does not rely on model fitting, can detect many different forms of association, and can evaluate associations with categorical, quantitative, and censored eventtime variables. The advantages and limitations of this unique combination of properties are discussed in greater detail below.
GSDA is a selfcontained procedure. For each geneset, the p value is a function of the endpoint and only the genes in the geneset. For competitive geneset testing methods, the p value of a geneset is a function of the endpoint and all genes. Competitive procedures compare the associations of geneset genes with the endpoint to the associations of other genes with the endpoint. Competitive testing procedures seek to find genesets for which member genes are more strongly associated with the endpoint of interest than are other genes. Goeman and BĂĽhlmannÂ [9] discuss the advantages and limitations of competitive and selftesting procedures in depth. Briefly, competitive testing procedures can sometimes be difficult to interpret because their results are function of all genes, not just geneset genes. Also, competitive testing procedures can have less statistical power than selfcontained procedures. In many settings, the improved statistical power of selfcontained procedures can be advantageous. However, in some settings, selfcontained procedures can be â€śoverpoweredâ€ť in the sense that so many genesets are identified as significant that it doesnâ€™t really help the investigator to identify a few genesets to more thoroughly evaluate in future research.
GSDA does not require resampling methods such as permutation or bootstraping to compute p values. It is wellknown that resampling methods dramatically increase the computational time and burden of an analysis because these methods require repeating an analysis procedure for each of many data sets generated by resampling. Bootstrapping and permutation are useful techniques to accurately evaluate significance for some analysis procedures for settings that have no known mathematical formula to accurately compute statistical significance. GSDA uses the formula derived by Zhu et al.Â [15] and therefore does not need to use resampling to compute p values. This dramatically reduces the computational burden of GSDA in practice. The ttest can eliminate nonsignificant genesets from consideration very quickly. The ttest approximation to the null distribution may not be accurate in the extreme tails in all applications. Therefore, the GSDA R package includes a module to perform a rapid permutation test to evaluate the accuracy of the most significant p values. We used permutation procedure in the example application below and recommend using it to followup significant distance correlation ttest p values.
GSDA also does not rely on fitting statistical models. This can be an advantage in some settings and a limitation in others. If the question of scientific interest requires evaluating the association of the geneset with the variable of interest after adjusting for some other factor, then some methods that fit statistical models may be able to more rigorously and easily perform an adjusted analysis by simply including an adjustment term in the model. GT, POST and SAFE can incorporate covariate adjustments in their modeling frameworks. However, if the question does not require adjustment for another factor, then the statistical model fitting can introduce additional computational complexity that can be cumbersome. Fitting many statistical models requires numerically optimizing a likelihood or other criterion. In some data sets, the criterion may not have an optimum; for example, a monotone likelihood can occur when fitting Cox or logistic regression models to some data sets as described by Heinze and ShemperÂ [25, 26]. The nonexistance of an optimum leads to nonconvergence of numerical optimization routines and can cause an analysis script to crash. In our experience in research of rare diseases with small sample sizes and relatively good outcomes, this occurs fairly often when fitting many statistical models with many different variables. The correlation and tstatistics of GSDA are computed by a series of simple arithmetic operations (addition, subtraction, multiplication, and division). Thus, nonconvergence of modelfitting is not a concern. Division by zero is the only mathematically problematic situation that can arise with GSDA. These situations can be identified and avoided prospectively by flagging any genesets for which \((\tilde{{\varvec{X}}} \cdot \tilde{{\varvec{X}}}) = 0\) or \((\tilde{{\varvec{Y}}} \cdot \tilde{{\varvec{Y}}}) = 0\) and lead to division by zero in Eq.Â (3). These conditions can only occur in the setting that all distances are zero, which would not be a case of scientific interest in most settings anyway.
Identifying the empirical drivers of an association
It can be difficult to provide a meaningful biological interpretation of a significant association of a geneset with a treatment or outcome variable because the significant association may exist only for a subset of the geneset. The ttest can be rapidly computed so it is feasible to incorporate it into a backwards elimination procedure to identify a subset of genes that are the empirical drivers of a significant association result. The procedure first computes the ttest p value for the geneset with each gene excluded and identifies which gene to eliminate to yield the smallest p value. This procedure is then repeated until only one variable remains. This procedure gives a series of subsets of the original geneset and the GSDA p value for each subset. The subset with the best GSDA p value may then be considered the empirical drivers of the significant association result and evaluated more carefully in followup laboratory research.
When interpreting the results of this exploratory followup analysis, it is important to recall that the classical interpretation of the p value for the selected subset may not hold because the hypothesis was not prespecified. In this followup analysis, the p value of the best subset should be viewed simply as a subset selection criterion, not a rigorous Type I error control metric. Nevertheless, the procedure can help identify the most important genes within a geneset. When strict Type I error control is necessary, one may embed this backwards elimination procedure within a permutation testing framework.
Illustrative examples
The primary statistical advantage of GSDA is its ability to detect complex forms of association that are not detectable by most other methods. GSDA can detect nonmonotone relationships of one or more genes with a categorical treatment or trait, a quantitative trait, or a censored eventtime endpoint. FigureÂ 2 gives a few simulated examples of complex associations that many other methods have difficulty detecting as statistically significant. Each of these illustrative examples is described in greater detail below.
FigureÂ 2aâ€“c illustrate a setting in which two genes have differential correlation (but equal mean expression) across two categories. Hu et al.Â [27] described differential correlation of several pairs of genes between two subtypes of acute lymphoblastic leukemia. Nettleton et al.Â [12] also described a differential correlation of a geneset with knockout of myostatin in a comparison of myostatinknockout and wild type mice. FigureÂ 2a clearly shows that expression of the two genes are positively correlated in the category represented by gold dots and are negatively correlated in the category represented by cyan dots. FigureÂ 2b shows that neither gene has differential median expression across the two categories according to the Wilcoxon rank sum test. FigureÂ 2c shows that GSDA finds that this geneset of two genes is very significantly associated with the category (GSDA ttest \(p = 1.41 \times 10^{33}\), permutation p \(< 10^{6}\)). Five major clades of the dendrogram are numerically indexed in Fig.Â 2c and the mean position of these five clades is indicated in Fig.Â 2c. Four of the five clades correspond to groups that have low or high expression for each of the two genes (\(2 \times 2 = 4\) and the fifth clade corresponds to a group with intermediate expression for both genes. The distance correlation test identifies this association.
FigureÂ 2dâ€“f illustrate a setting in which both genes have an association with a numeric variable Y that is more complex than what is represented by simple linear models. FigureÂ 2d shows a Vshaped scatterplot for the numeric variable Y and the expression of gene 1. FigureÂ 2e shows a scatterplot of the numeric variable Y versus the expression of gene 2 in which the points fall along two parallel lines with negative slope. From Fig.Â 2e, f, it is apparent that the association of Y and gene 1 depends on the expression of gene 2. Y and gene 1 are positively correlated when gene 2 is highly expressed and they are negatively correlated when gene 2 is lowly expressed. This is an example of a liquid association described by HoÂ [28] which they discovered among several triplets of genes with liquid associations in a cellcycle experiment. In our illustrative example, Spearmanâ€™s test does not give a significant result for gene 1 (Fig.Â 2d, \(p = 0.36\)) but gets a significant result for gene 2 (Fig.Â 2e, \(p = 0.00052\)). FigureÂ 2f shows that GSDA gives a very significant result for this twogene geneset (\(p = 5.28 \times 10^{41}\), permutation \(p < 10^{6}\)). The color bar for the numeric phenotype Y in Fig.Â 2f shows a clear association with the clades of the dendrogram based on gene expression distance.
FigureÂ 2gâ€“i illustrate an association between two genes and a censored survival time endpoint that is more complex than what is represented by simple Cox regression models. FigureÂ 2g shows a scatterplot of the expression of genes 1 and 2 and indicates the survival status by the plotting character (x = dead; o = alive). It is clear that these genes are negatively correlated in those patients who survived longer and positively correlated in those patients who died relatively quickly. This is another example of a liquid association. FigureÂ 2h plots the expression of gene 1 versus the survival time and indicates survival status by the plotting character. A singlepredictor Cox regression model does not find a significant association for gene 1 (\(p = 0.67\), Fig.Â 2h) or gene 2 (\(p = 0.83\), data not shown). GSDA finds this complex association to be statistically significant (Figure 2I, ttest \(p = 6.1 \times 10^{13}\), permutation \(p = 0.000111\)). Again, the color bar for survival outcome shows differences across the clades of the dendrogram computed by gene expression distance.
These illustrative examples show some complex forms of associations that violate the assumptions of the classical statistical methods such as the ttest, linear regression, Spearman correlation, logistic regression, Cox regression, etc that are utilized by widely used geneset association testing methods such as SAFE, GT, and GSEA. These classical statistical methods are very powerful for detecting associations that are linear or monotone on some scale. The examples above illustrate some complex associations that have Xshaped, Vshaped, or other patterns that are not accurately modeled by a monotonic relationship. These complex associations may occur when there are unrecognized latent subgroups in the analysis. This may be the case for some complex human diseases, particularly human malignancies, in which unrecognized molecular subgroups may be present and impact the association. These nonmonotonic relationships are still apparent in distance correlations and may be detected by GSDA.
Results
Simulation studies
We performed a series of simulation studies to evaluate the performance of the proposed GSDA method, GSEA, GSA, SAFE, GT, TOTS, and POST in simple settings involving a categorical, numeric, and survival outcome (SC, SN, and SS), complex settings involving categorical, numeric, and survival outcome (CC, CN, and CS). We also evaluated each of these settings with two different collections of gene sets. For gene set collection A, there were 100 genes (10 associated, 90 null) assigned into 60 gene sets with 8â€“10 genes each; 10 gene sets included at least one gene associated with the outcome and the other 50 gene sets had no gene associated with the outcome. For gene set collection B, there were 1000 genes (10 associated, 990 null) assigned into 100 gene sets with 10â€“100 genes each; 20 gene sets had at least one gene associated with the outcome and 80 gene sets had no gene associated with the outcome. We evaluated sample sizes of 10, 25, 50, and 100 subjects in each of two groups for categorical associations. For numeric and survival endpoints, we evaluated sample sizes of 10, 25, 50, and 100 subjects total. We evaluated the level and power of each method at the \(p=0.05\) threshold.
Data were generated so that genes had no association, simple associations, or complex associations with the endpoints. For genes with no association, expression values were independently and identically distributed (iid) standard normal values. A latent numeric variable was used in the generation of data for endpoints and associated genes. Endpoint data were generated as simple functions of a latent vector of iid standard normal values and some additional random variation. For genes with a simple association with endpoints, expression values were generated by multiplying the latent variable by a coefficient and then adding iid standard normal errors. For genes with a complex association with endpoints, a latent binary vector was generated and used to reverse the signs of the coefficients for multiplication by the numeric latent variable. In the statistical literature, latent variables are unobserved variables that are used to more effectively model or simulate complex association settings. The reversal of signs by a latent binary vector in these simulations mimics the differential correlations described in â€śIllustrative examplesâ€ť section and shown in Fig.Â 2a. These association patterns are similar to the complex differential expression of nucleotidyltransferase activity genes between myostatin knockout and wildtype mice described by Nettleton et al.Â [12] and the liquid associations among multiple genetriplets that Ho et al.Â [28] discovered in cell cycle data. Briefly, the expression correlation for a pair of genes depends on the expression of a third gene. For example, the expressions of genes A and B ar positively correlated when gene C is underexpressed and the expression of genes A and B are negatively correlated when gene C is overexpressed. The Additional file 1: simulationstructure.pdf provides a detailed narrative description of simulation with illustrative schema figures.
FigureÂ 3 shows each methodâ€™s average level over all null gene sets and average power over all associated gene sets in each simulation setting. All methods maintained Type I error controlless than 6% except that POST failed to do so in a few small sample size settings. For each setting, the method with the greatest power among those methods with average empirical level less than 6% was designated as the best performer. GT was the best performer in 23 of 24 simple association settings and one of the 24 simple association settings. GSEA was the best performer in five of 24 complex association settings; those five settings had small sample sizes (10 or 25). TOTS was the best performer in the simple survival association setting with100 genes and 60 geneset and a sample size 100. In that setting, POST, GT and GSDA had only slightly less power. GSDA was the best performer in 18 of 24 complex association settings; those settings had sample sizes of 10 to 100. In 10 of these 18 settings, the power of GSDA exceeded that of all other methods by 20% or more. In complex survival association settings with \(n=100\) subjects, the power of GSDA exceeded that of all other methods by at least 39%. Also, the power of GSDA was similar to that of GT in most of the simple association settings. Complete simulation results are provided in the Additional file 3: simulationboxplots.pdf and Additional file 4: simulationresults.xlsx.
We also used the simulation results to more fully understand the statistical properties of the GSDA distance correlation ttest p values by examining empirical distribution function (EDF) plots of those p values. The Additional file 2: simulationEDFplots.pdf includes p value EDF plots for each simulation. In all cases, the GSDA p values of associated genesets show a much greater concentration near zero than do those of null genesets. This observation is consistent with GSDA maintaining level and increasing power with sample size in the results reported above. Additionally, consistent with the theoretical derivations of Zhu et al.Â [15], the GSDA p values for null gene sets are approximately uniform over most of the p value range from 0 to 1. The GSDA p values for null gene sets tend to be slightly stochastically greater than uniform for \(p> 0.04\) and slightly stochastically less than uniform for \(p< 0.04\). This suggests that the GSDA distance correlation ttest of Eq.Â (4) may tend to overstate statistical significance when it gives \(p< 0.04\). Therefore, we recommend using the permutation procedure of â€śVerifying significant results with permutationâ€ť section to evaluate the accuracy of significant distance correlation ttest results as we have done in the illustrative examples of â€ś2.11â€ť section above and in the example application immediately below.
Additionally, we measured the computing times for each method in our simulation (Additional file 5: computespeeds.pdf). GT was one of the two fastest methods in all 48 simulation settings. GSDA was one of the three fastest methods in all 32 settings involving a numerical or categorical endpoint. GSDA was noticeably slower in evaluating survival endpoints. Computing the survival distnace in Eq.Â (6) can be time consuming because it must perform \(O(n^3)\) comparisons (compare data for all \(n(n1)\) pairs at each of n timepoints). Nevertheless, it is still computationally feasible to perform GSDA in practice; the superior statistical perfomance of GSDA is well worth the additional computing time.
A pediatric leukemia application
We applied these seven analysis methods to the pediatric acute myeloid leukemia (AML) data set that is publicly available from the TARGET project website (https://targetdata.nci.nih.gov/Public/AML/; accessed March 8, 2018). We also applied all eleven methods of the R package piano [14] to the data set. We evaluated the association of genelevel mRNAseq expression data with presence/absence of chloroma at diagnosis (a binary variable), log white blood cell count (logWBC) at diagnosis (a quantitative variable), and eventfree survival (a censored event time variable) for the 123 subjects with all of these data available. We used these 18 methods to evaluate the association of the expression of 56 genes annotated to the KEGG AML pathway (https://www.genome.jp/keggbin/show_pathway?hsa05221) with the presence or absence of chloroma at diagnosis (a binary variable), the log of the diagnostic white blood cell count (a numeric variable), and eventfree survival of patients (a censored event time variable). We also used the procedure of â€śVerifying significant results with permutationâ€ť section to compute GSDA p values based on one million permutations.
Table 2 provides the results. GSDA and 8 other methods found that the KEGG AML pathway was significantly associated with chloroma (\(p < 0.04\)); the other 10 methods did not find strong evidence of an association (\(p > 0.11\) ). The AML pathway genes were not significantly associated with WBC according to GSA (\(p = 0.688\)) or the PAGE method of PIANO (\(p=0.096\)), but was significantly associated according to GSDA (\(p < 10^{6}\)) and all other methods (\(p \le 0.03\)). GSDA found that the KEGG AML pathway had a marginally significant association with EFS (ttest \(p = 0.06\), permutation \(p = 0.050755\)); all other methods obtained \(p > 0.09\) for this association. GSDA obtaining the smallest p value for EFS is consistent with the simulation result showing that GSDA has unrivaled power to detect a complex association of a geneset with a censored survival time variable with sample size of 100 (Fig.Â 3).
FigureÂ 4 graphically illustrates the results of GSDA and the backward elimination procedure of â€śIdentifying the empirical drivers of an associationâ€ť section. FigureÂ 4a provides a heatmap of the expression of the AML pathway genes and a color scales for each of three endpoints. The backward elimination procedure identified RUNX1T1, MAPK3, PIK3CG, TCF7L1, GRB2, and MTOR as the empirical drivers of the association with chloroma (Fig.Â 4b). Hierarchical clustering of individuals by the expression of these six genes with Wardâ€™s criteria [29] defines one cluster of 52 patients with only one of the 17 chloroma cases (Fig.Â 4c). The backward elimination procedure identified 25 of the 56 genes as empirical drivers of the association with logWBC (Fig.Â 4d). Hierarchical clustering of individuals by the expression of these 25 genes defined two subgroups with strong differential logWBC values (Fig.Â 4). By backward elimination, AKT3, MAPK3, PIK3CG, PML, and STAT5A were identified as empirical drivers of the association with EFS (Fig.Â 4f). Again, hierarchical clustering of subjects by these genes defined subgroups with differing EFS (Fig.Â 4g). We verified this result by using GSDA to test the association of these five genes with EFS in the entirely separate AML02 clinical trial cohort of 168 patients [30, 31]. In this independent validation cohort, GSDA found this set of five genes to be significantly associated with EFS (ttest \(p = 0.013839\), permutation \(p = 0.024695\)).
In this example, only GSDA obtained \(p < 0.09\) for the association with EFS in the TARGET AML cohort. There are several reasons the other methods did not find suggestive evidence of this association. Each of the other methods rely on Cox proportional hazards regression to model and evaluate the significance of the association. Fitting a singlepredictor Cox regression to each gene in the geneset yields a set of p values with a mean of 0.45. Pounds and ChengÂ [21] have shown that twice the average p value is a reasonable estimator of the proportion of tests with a true null hypothesis. Thus, an average Cox p value of 0.45 indicates that only 10% of the 56 genes are associated with EFS. This estimate is also consistent with the backwards elimination procedure identifying 5 genes as empirical drivers of the association with EFS. Additionally, we used the method of Grambsch and TherneauÂ [32] to evaluate the validity of the proportional hazard assumption for each gene. The average p value was 0.41, indicating that 18% of genes in the gene set violate the assumption. The other methods did not find the association because a substantial proportion of genes violated the modeling assumption and only a small proportion were truly associated with EFS.
Discussion
Many scientific discoveries have been made by identifying gene sets that associate with a treatment or an endpoint of interest. According to Google Scholar, the geneset analysis methods evaluated in our simulation studies and example analysis have been cumulatively cited tens of thousands of times. Several methods have been successfully used for this purpose in the literature. Nevertheless, it is important to more fully understand the strengths and limitations of these methods in terms of recognizing biological settings for which each method has good or poor statistical performance. Our work has confirmed that many of the widely used methods perform well in certain settings, particularly in identifying gene sets for which most genes have a simple association with the endpoint or treatment of interest. Many times, these discoveries are scientifically meaningful; thus, the widely used methods can often reveal biological insights necessary to advance our understanding and treatment of several diseases.
However, as is the case for many statistical problems, each method has a niche of settings for which it performs well and for which it performs poorly. Nettleton et al.Â [12] observed that some complex forms of association that are essentially undetectable by some of the most widely used methods. HoÂ [28] also observed complex associations that are difficult to detect with methods that assume monotonic relationships between pairs of variables. NettletonÂ et al.Â [12] proposed the MRPP to evaluate the association of a gene set with a categorical endpoint or treatment variable. In addition to detecting differential average expression across groups, the MRPP can detect differential correlation across groups. However, the published MRPP did not evaluate the association of a gene set with a quantitative or censored survival time variable. In many disciplines, such as oncology, censored survival time variables are of greatest scientific interest.
ZhuÂ et al.Â [15] developed a distance correlation ttest that can detect complex associations between two numeric data matrices. The method is appealing in that it can detect complex associations without resorting to computationally burdensome resampling procedures to determine statistical significance. However, in its published form, it is limited to evaluating associations between two numeric data matrices. We extended this framework to develop the GSDA method that can also evaluate associations of a numeric data matrix with categorical variables and censored eventtime variables. GSDA was the best performer in 16 of 24 complex association settings and exceeded the power of all other methods by 20 percentage points in 11 of those settings. These results indicate that in many settings GSDA has much greater power than other methods to detect complex associations with a geneset. In this way, GSDA can effectively make discoveries that complement the discoveries made by other geneset testing methods.
We also developed methods to followup on the most statistically significant and/or biologically interesting results of a GSDA analysis. We developed a rapid permutation testing procedure for GSDA to confirm the statistical significance of distance correlation ttest. We used this procedure to confirm the accuracy of the significant distance correlation ttest in our example application and recommend using it in practice to followup on the most statistically significant distance correlation ttest results. We also developed a backward elimination procedure (â€śIdentifying the empirical drivers of an associationâ€ť section) to provide more focused biological insights by identifying the subset of genes in a gene set with the strongest distance correlation with the endpoint of scientific interest. This backward elimination procedure was very useful in the example application. It narrowed down the 56 genes in the AML gene set to a subset of 25 genes for association with chloroma, 6 genes for association with log WBC, and 5 genes for association with EFS. We were able to confirm the EFS association of the 5 genes in a separate clinical trial cohort.
Overall, the simulation and example application results have shown that GSDA is a useful tool to complement existing geneset association testing methods. GSDA can identify complex associations that are characterized by nonmonotonic relationships among pairs of variables or that violate other statistical modeling assumptions. These types of associations occur in practice and are difficult to detect with other methods. GSDA can rapidly complete the analysis of all genes and provide the rigor of a permutation test for the most significant results. Also, GSDA can provide a more focused biological insights by identifying the subset of genes in a gene set that most strongly associated with an endpoint of interest.
Future research should explore several intriguing open questions related to the use and improvement of GSDA. The use and development of GSDA may be substantially advanced by developing new distance measures and guidelines for their use in practice and considering different algorithms to find strongly associated subsets of gene sets that associate with the outcome or endpoint of scientific interest.
Conclusions
We developed the geneset distance analysis method and showed that it detects associations of genesets with phenotypes or treatments that are not easily identified by other methods. These results indicate that GSDA should compliment the use of other methods in data analysis practice to ensure that biologically meaningful associations are discovered that may otherwise be missed. Furthermore, our work suggests that future work should develop distnacebased methodologies for other problems and applications in the analysis of omics data.
Availability of data and materials
The AML data set used in the example application is from publicly available TARGET project with reformatting and available as rdata at https://github.com/xueyuancao/GSDA/tree/master/data; The source code is available at https://github.com/xueyuancao/GSDA/tree/master/R.
Abbreviations
 GSDA:

Geneset distance analysis
 AML:

Acute myeloid leukemia
 EFS:

Eventfree survival
 GT:

Global test
 TOTS:

Total of test statistics
 SAFE:

The significance and function of expression
 GSEA:

Geneset enrichment analysis
 GSA:

Geneset association
 MRPP:

Multiresponse permutation procedure
 POST:

Projection onto orthogonal statistical tests
 FDR:

False discovery rate
 EDF:

Empirical distribution function
References
Maciejewski H. Gene set analysis methods: statistical models and methodological difference. Brief Bioinform. 2013;15:504â€“18.
Virtaneva K, Wright FA, Tanner SM, et al. Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. Proc Natl Acad Sci (USA). 2001;98:1124â€“9.
Barry WT, Nobel AB, Wright FA. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics. 2005;21:1943â€“9.
Barry WT, Nobel AB, Wright FA. A statistical framework for testing functional categories in microarray data. Ann Appl Stat. 2008;2:286â€“315.
Mootha VK, Lindgren CM, Eriksson KF, et al. Pgc1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34:267â€“73.
Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci (USA). 2005;102:15545â€“50.
Efron B, Tibshirani R. On testing the significance of sets of genes. Ann Appl Stat. 2007;1:107â€“29.
Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20:93â€“9.
Goeman JJ, BĂĽhlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23:980â€“7.
Irizarry RA, Wang C, Zhou Y, Speed TP. Gene set enrichment analysis made simple. Stat Methods Med Res. 2009;18:565â€“75.
Ackermann M, Strimmer K. A general modular framework for gene set enrichment analysis. BMC Bioinform. 2009;10(47):1â€“20.
Nettleton D, Recknor J, Reecy JM. Identification of differentially expressed gene categories in microarray studies using nonparametric multivariate analysis. Bioinformatics. 2008;24:192â€“201.
Cao X, George EO, Wang M, Armstrong DB, Cheng C, Raimondi S, et al. POST: a framework for setbased association analysis in highdimensional data. Methods. 2018;145:76â€“81.
VĂ¤remo J, Nielsen I, Nookaew I. Enriching the gene set analysis of genomewide data by incorporating directionality of gene expression and combining statistical hypotheses and methods. Nucleic Acids Res. 2013;41:4378â€“91.
Zhu C, Yao S, Zhang X, Shao X. Distancebased and RKHSbased dependence metrics in high dimension. Ann Stat. 2020;48:3366â€“94.
Hosmer DW, Lemeshow S, May S. Applied survival analysis: regression modeling of timetoevent data. 2nd ed. Hoboken: Wiley; 2008.
Jung SH, Owzar K, George SL. A mutiple testing procedure to associate gene expression levels with survival. Stat Med. 2005;24:3077â€“88.
Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15:361â€“87.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc (Ser B). 1995;57:289â€“300.
Storey JD. A direct approach to false discovery rates. J R Stat Soc (Ser B). 2002;64:479â€“98.
Pounds SB, Cheng C. Robust estimation of the false discovery rate. Bioinformatics. 2006;22:1979â€“87.
Pounds SB. Estimation and control of multiple testing error rates for microarray studies. Brief Bioinform. 2006;7:25â€“36.
Cheng C, Pounds S. False discovery rate paradigms for statistical analyses of microarray gene expression data. Bioinformation. 2007;1:436â€“46.
Benjamini Y. Discovering the false discovery rate. J R Stat Soc (Ser B). 2010;72:405â€“16.
Heinze G, Schemper M. A solution to the problem of monotone likelihood in cox regression. Biometrics. 2001;57(1):114â€“9.
Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat Med. 1996;21:2409â€“19.
Hu R, Qiu X, Glazko G, Klebanov L, Yakovlev A. Detecting intergene correlation changes in microarray analysis: a new approach to gene selection. BMC Bioinform. 2009;10(20):1â€“9.
Ho YY, Parmigiani G, Louis TA, Cope LM. Modeling liquid association. Biometrics. 2011;67:133â€“41.
Ward JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58:236â€“44.
Rubnitz JE, Inaba H, Dahl G, et al. Minimal residual diseasedirected therapy for childhood acute myeloid leukemia: results of the AML02 multicentre trial. Lancent Oncol. 2010;11:543â€“52.
Lamba JK, Cao X, Raimondi SC, Rafiee R, Downing JR, Lei S, et al. Integrated epigenetic and genetic analysis identifies markers of prognostic significance in pediatric acute myeloid leukemia. Oncotarget. 2018;9:26711â€“23.
Grambsch P, Therneau T. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika. 1994;81:515â€“26.
Funding
This work has been supported by ALSAC. The funders had no role in preparing, reviewing, or approving this manuscript.
Author information
Authors and Affiliations
Contributions
SP developed and implemented the mathematical concept; XC performed the simulation and example application; SP and XC wrote this manuscript. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1.
This supplementary file describes the detailed settings for simulation studyincluding gene set collection definition, association coefficients, simple/complex association and three types ofresponse variables.
Additional file 2.
This supplementary file provides the empirical distribution function (EDF) plots ofGSDA simulation p values for each scenario.
Additional file 3.
This supplementary file provides box plots of power and type I error rates at 5%nominal level for each scenario.
Additional file 4.
This supplementary file contains one spreadsheet with data describing geneeffect sizes and geneset memberships for each simulation setting and another spreadsheet with simulation resultsfor each geneset by each method in each setting.
Additional file 5.
This supplementary file contains bar plots of the compute speeds (number ofcompleted analyses per minute) for each method in each of the 48 scenarios.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Cao, X., Pounds, S. Geneset distance analysis (GSDA): a powerful tool for geneset association analysis. BMC Bioinformatics 22, 207 (2021). https://doi.org/10.1186/s1285902104110x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902104110x