Evaluation of tree-based statistical learning methods for constructing genetic risk scores

Background: Genetic risk scores (GRS) summarize genetic features such as single nucleotide polymorphisms (SNPs) in a single statistic with respect to a given trait. So far, GRS have typically been built using generalized linear models or regularized extensions thereof. However, these linear methods are usually unable to incorporate gene-gene interactions or non-linear SNP-response relationships. Tree-based statistical learning methods such as random forests and logic regression may be an alternative to such regularized-regression-based approaches and are investigated in this article. Moreover, we consider modifications of random forests and logic regression for the construction of GRS.

Results: In an extensive simulation study and an application to a real data set from a German cohort study, we show that both tree-based approaches can outperform elastic net when constructing GRS for binary traits. In particular, a modification of logic regression called logic bagging yielded comparatively high predictive power as measured by the area under the curve and the statistical power. Even when considering no epistatic interaction effects but only marginal genetic effects, the regularized regression method led in most cases to inferior results.

Conclusions: When constructing GRS, we recommend taking random forests and logic bagging into account, in particular if possibly unknown epistasis between SNPs can be assumed to be present. To develop the best possible prediction models, extensive joint hyperparameter optimizations should be conducted.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04634-w.

In this supplementary file, additional information about the GRS construction methods as well as additional results of the simulation study and the real data application are presented. In Figure S1, model fitting and GRS prediction times are depicted. In Section 2, the hyperparameters considered for constructing the GRS models are described. In Section 3, we present the workflows for tuning and fitting each considered statistical learning procedure for constructing GRS. Means and asymptotic 95% confidence intervals of the AUCs corresponding to the figures in the main text are depicted in Figures S2, S9, and S16. Concrete estimates following statistical inference can be found in Figures S3, S4, S10, and S11, and in Table S1. Results for the classical classification metrics accuracy, sensitivity, and specificity are depicted in Figures S5–S7, S12–S14, and S17–S19. Training data AUCs are illustrated in Figures S8, S15, S20, and S24. AUC comparisons when employing the binary {0, 1} SNP coding for each method are depicted in Figures S21–S23. Table S2 lists the median p-values of the final adjusted models for the GRS, the environmental factor, and their interaction term. Results of the sensitivity analysis excluding smokers from the SALIA data set can be found in Figure S25. In Figure S26, an exemplary GRS distribution is depicted which explains the observed sensitivities in the simulation study.

Model fitting and GRS prediction time
Here, we present the model fitting and GRS prediction times in the third simulation scenario. We report the times for single model constructions and evaluations within the hyperparameter optimization process, since several different hyperparameter settings, which can affect the runtime, are employed there.

Figure S1: Model fitting and GRS prediction time for random forests, random forests VIM, logic regression, logic bagging, and elastic net for the hyperparameter configuration in the third simulation scenario incorporating continuous input variables.

Hyperparameter descriptions
Here, we briefly describe the hyperparameters of each considered statistical learning procedure that were tuned in our analyses. Table 4 in the main text depicts the corresponding hyperparameter settings.

Random forests & random forests VIM
The parameter mtry determines the number of randomly chosen input variables considered at each split in each tree. The parameter min.node.size specifies the minimum number of observations a tree node must contain for it to be split further. Thus, min.node.size acts as a stopping criterion that prematurely terminates the splitting of a tree branch. num.trees determines the total number of trees to be grown in the random forest. A sufficiently high number should be chosen such that the performance no longer increases substantially.
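As an illustration, these three hyperparameters have rough counterparts in scikit-learn's RandomForestClassifier: mtry corresponds to max_features, min.node.size approximately to min_samples_leaf, and num.trees to n_estimators. The following is a minimal sketch using hypothetical synthetic 0/1/2-coded SNP data, not the actual study setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# hypothetical data: 200 individuals, 50 SNPs in additive 0/1/2 coding
X = rng.integers(0, 3, size=(200, 50))
y = rng.integers(0, 2, size=200)  # binary trait

# mtry -> max_features, min.node.size -> min_samples_leaf (approximately),
# num.trees -> n_estimators
rf = RandomForestClassifier(n_estimators=500,      # large enough to stabilize
                            max_features=7,        # ~ sqrt(50) candidate SNPs per split
                            min_samples_leaf=5,    # stop splitting small nodes
                            random_state=0)
rf.fit(X, y)
grs = rf.predict_proba(X)[:, 1]  # predicted case probability used as a GRS
```

Note that the parameterizations differ slightly between implementations (e.g., ranger's min.node.size refers to terminal node size), so tuned values do not transfer one-to-one.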

Logic regression & logic bagging
For logic regression and logic bagging, ntrees and nleaves determine the model complexity. ntrees is the maximum number of trees to be included in the model and nleaves is the maximum number of leaves distributed over all trees. For conventional logic regression, simulated annealing is employed as the search algorithm which has to be tuned as well. For the number of simulated annealing iterations, analogously to the number of trees in random forests, a sufficiently high number should be chosen. The cooling schedule, which includes a start temperature and an end temperature, is manually tuned such that at the beginning of the search, almost all states are accepted, and at the end of the search, almost no states are accepted.
For logic bagging, the number of bagging iterations has to be set to a sufficiently high number, similar to num.trees and the number of simulated annealing iterations.
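To make the role of the cooling schedule concrete, the following is a generic simulated annealing sketch with geometric cooling between a start and an end temperature; it is an illustration of the search principle, not the actual implementation used in the logic regression software. All function and parameter names here are hypothetical:

```python
import math
import random

random.seed(1)

def anneal(score, propose, state, n_iter=10000, t_start=10.0, t_end=1e-3):
    """Minimize `score` via simulated annealing with geometric cooling."""
    cool = (t_end / t_start) ** (1.0 / n_iter)  # per-iteration cooling factor
    cur, cur_score = state, score(state)
    best, best_score = cur, cur_score
    t = t_start
    for _ in range(n_iter):
        cand = propose(cur)
        delta = score(cand) - cur_score
        # better states are always accepted; worse states with prob exp(-delta/t),
        # which is close to 1 at t_start and close to 0 at t_end
        if delta <= 0 or random.random() < math.exp(-delta / t):
            cur, cur_score = cand, cur_score + delta
            if cur_score < best_score:
                best, best_score = cur, cur_score
        t *= cool
    return best, best_score

# toy usage: find the integer minimizing (x - 3)^2
best, val = anneal(lambda x: (x - 3) ** 2,
                   lambda x: x + random.choice([-1, 1]),
                   state=0)
```

With t_start chosen high, early iterations accept nearly all proposed states (broad exploration); as t approaches t_end, hardly any worsening moves are accepted and the search settles into a local optimum, mirroring the manual tuning rule described above.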

Elastic net
For fitting elastic net models, the parameter α controls the balance between the lasso and the ridge regularization. The parameter λ determines the strength of the regularization.

Figure S2: Mean AUC and asymptotic 95% confidence intervals for random forests, random forests VIM, logic regression, logic bagging, elastic net, and the true underlying model in the first simulation scenario considering marginal effective SNPs evaluated on the test data.
Figure S8: Mean AUC for random forests, random forests VIM, logic regression, logic bagging, elastic net, and the true underlying model in the first simulation scenario considering marginal effective SNPs evaluated on the training data itself.
Figure S12: Mean accuracy for random forests, random forests VIM, logic regression, logic bagging, elastic net, and the true underlying model in the second simulation scenario incorporating interactions of SNPs evaluated on the test data.
Figure S13: Mean sensitivity for random forests, random forests VIM, logic regression, logic bagging, elastic net, and the true underlying model in the second simulation scenario incorporating interactions of SNPs evaluated on the test data.
Figure S14: Mean specificity for random forests, random forests VIM, logic regression, logic bagging, elastic net, and the true underlying model in the second simulation scenario incorporating interactions of SNPs evaluated on the test data.
Figure S15: Mean AUC for random forests, random forests VIM, logic regression, logic bagging, elastic net, and the true underlying model in the second simulation scenario incorporating interactions of SNPs evaluated on the training data itself.
Figure S16: Mean AUC and asymptotic 95% confidence intervals for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the third simulation scenario incorporating continuous input variables evaluated on the test data.
Figure S17: Mean accuracy for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the third simulation scenario incorporating continuous input variables evaluated on the test data.
Figure S18: Mean sensitivity for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the third simulation scenario incorporating continuous input variables evaluated on the test data.
Figure S19: Mean specificity for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the third simulation scenario incorporating continuous input variables evaluated on the test data.
Figure S20: Mean AUC for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the third simulation scenario incorporating continuous input variables evaluated on the training data itself.
Figure S23: Mean AUC for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the third simulation scenario incorporating continuous input variables evaluated on the test data. Here, the binary {0, 1} SNP coding was used for each method.
Figure S24: AUC for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the application to data from the SALIA study evaluated on the training data itself. Results for the final age-adjusted models with different air pollution indicators.
Figure S25: AUC for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the application to data from the SALIA study evaluated on the test data. Results for the final age-adjusted models with different air pollution indicators. Current and former smokers were excluded from the base data set as part of a sensitivity analysis.
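The elastic net hyperparameters α and λ described above can be illustrated with scikit-learn's LogisticRegression, where l1_ratio plays the role of α and C acts as an inverse regularization strength (roughly 1/λ; the exact scaling differs from glmnet-style implementations). A minimal sketch on hypothetical synthetic SNP data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50)).astype(float)  # 0/1/2 SNP coding
y = rng.integers(0, 2, size=200)                      # binary trait

alpha, lam = 0.5, 0.1  # mixing parameter and regularization strength
# l1_ratio = alpha blends lasso (1.0) and ridge (0.0); C ~ 1/lambda
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=alpha, C=1.0 / lam, max_iter=5000)
enet.fit(X, y)
grs = enet.predict_proba(X)[:, 1]  # predicted case probability as a GRS
```

In a real analysis both α and λ would be tuned jointly, e.g., over a grid with cross-validation, rather than fixed as in this sketch.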

Distribution of the GRS
In the main effects simulation scenario and in the gene-gene interaction effect simulation scenario, the classification sensitivity is relatively low in some settings. This phenomenon can be explained by the need to dichotomize the GRS into predicted cases and controls for estimating the sensitivity, together with the discrete structure of the space of input variables/SNPs. To illustrate this, we present an exemplary GRS distribution occurring in the simulation study.
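The effect of dichotomizing a discrete-valued GRS can be sketched as follows: because only a few distinct score values occur, the sensitivity jumps in coarse steps as the classification threshold moves between them. The score and status values below are hypothetical toy numbers, not data from the study:

```python
import numpy as np

# hypothetical discrete GRS: with few SNPs, only a handful of score values occur
grs   = np.array([0.2, 0.2, 0.4, 0.4, 0.4, 0.6, 0.6, 0.8])
cases = np.array([0,   0,   0,   1,   1,   1,   0,   1])  # true case/control status

for thr in (0.3, 0.5, 0.7):  # candidate dichotomization thresholds
    pred = (grs >= thr).astype(int)
    tp = np.sum((pred == 1) & (cases == 1))
    sens = tp / cases.sum()  # sensitivity = TP / (TP + FN)
    print(f"threshold {thr}: sensitivity {sens:.2f}")
```

Moving the threshold across a single score value reclassifies several observations at once, so no intermediate sensitivity values are attainable, which is the mechanism behind the low sensitivities observed in some settings.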