Empowering individual trait prediction using interactions for precision medicine

Background One component of precision medicine is to construct prediction models with their predicitve ability as high as possible, e.g. to enable individual risk prediction. In genetic epidemiology, complex diseases like coronary artery disease, rheumatoid arthritis, and type 2 diabetes, have a polygenic basis and a common assumption is that biological and genetic features affect the outcome under consideration via interactions. In the case of omics data, the use of standard approaches such as generalized linear models may be suboptimal and machine learning methods are appealing to make individual predictions. However, most of these algorithms focus mostly on main or marginal effects of the single features in a dataset. On the other hand, the detection of interacting features is an active area of research in the realm of genetic epidemiology. One big class of algorithms to detect interacting features is based on the multifactor dimensionality reduction (MDR). Here, we further develop the model-based MDR (MB-MDR), a powerful extension of the original MDR algorithm, to enable interaction empowered individual prediction. Results Using a comprehensive simulation study we show that our new algorithm (median AUC: 0.66) can use information hidden in interactions and outperforms two other state-of-the-art algorithms, namely the Random Forest (median AUC: 0.54) and Elastic Net (median AUC: 0.50), if interactions are present in a scenario of two pairs of two features having small effects. The performance of these algorithms is comparable if no interactions are present. Further, we show that our new algorithm is applicable to real data by comparing the performance of the three algorithms on a dataset of rheumatoid arthritis cases and healthy controls. As our new algorithm is not only applicable to biological/genetic data but to all datasets with discrete features, it may have practical implications in other research fields where interactions between features have to be considered as well, and we made our method available as an R package (https://github.com/imbs-hl/MBMDRClassifieR). Conclusions The explicit use of interactions between features can improve the prediction performance and thus should be included in further attempts to move precision medicine forward.


Background
The concept of precision medicine can be viewed as a continuous process of data preprocessing/data mining (track 1), construction of diagnostic/prognostic models (track 2) and prediction of treatment response/disease progression (track 3) [1]. Whereas track 1 focuses on the identification of important observed and latent variables, tracks 2 and 3 require models with highly accurate predictions about disease status, prognosis or progression of a disease of a single individual [2][3][4][5][6]. Explained with (generalized) linear models as an example, based on the estimation of inference on regression coefficients tracks 2 and 3 aim at constructing models with their predictive ability as high as possible, measured in terms of some performance, e.g. the area under the receiver operating characteristic curve (AUC). In genetic epidemiology, simple Mendelian diseases, such as cystic fibrosis and hereditary breast and ovarian cancer, allow for relatively straightforward predictions. However, more complex diseases like coronary artery disease, rheumatoid arthritis, and type 2 diabetes, involve complex molecular mechanisms and thus have a polygenic basis [7]. It is a common assumption that these biological/genetic features also are acting via interactions either with each other [8][9][10][11] or with environmental features [12]. To make it even more complicated, features may affect the outcome under consideration only via interactions. Thus, the interacting features do not have an effect on their own. An example of such a constellation is the effect of a variant in the MDR1 gene together with exposure to pesticides on Parkinson's disease [13].
The use of standard approaches such as generalized linear models is suboptimal in these cases because of the algorithm instabilities when modeling many variables and their interactions or requirements of large sample sizes [14]. Thus, regularized generalized linear models [15,16] or machine learning methods, e.g. Random Forest [17], are appealing to make individual predictions based on many variables. They differ in the details, but most of them share one important property: they focus mostly on main or marginal effects of the single features in a dataset. For example, in Random Forest, at each node, the single best feature and its best split point are selected [18]. This may lead to ignoring features without any or only small main effects, although Wright et al. [19] have shown that using enough single trees can compensate this issue. Likewise, regularized regression models are usually specified using main effect terms only, and interaction terms have to be included explicitly as new features [20]. These common limitations may limit the prediction performance of models based on currently used algorithms if features have an effect on the outcome only via interactions.
On the other hand, there has been much research on the detection of interacting features in the realm of genetic epidemiology [21]. One big class of algorithms to detect interacting features are the multifactor dimensionality reduction (MDR)-based algorithms based on the original idea by Ritchie et al. [22]. The basic idea of all MDR-based algorithms is to reduce the dimensionality of simultaneously analyzed features by pooling combinations of feature levels (cells) in high risk (H) and low risk (L) groups, resulting in a single best combination (MDR model) of d features. The original MDR algorithm has several drawbacks and limitations, thus a large number of modifications and extensions were proposed in recent years. A comprehensive review of the original MDR algorithm and its modifications and extensions is given by Gola et al. [23]. However, these algorithms aim at identifying the best classification model based on single and interacting features to identify interacting features but do not allow for individual predictions.
In this work we show how to extend the model-based MDR (MB-MDR), a powerful MDR-based algorithm to detect interacting features first described by Calle et al. [24], to enable interaction empowered individual prediction.

Statistical interactions
In statistics, two variables are said to be statistically independent if the influence of them on the target size can be modeled as a function with two separate parameters for the two variables. If an additional parameter is needed to describe the effect of both variables together on the target variable, this is referred to as interaction between the two variables in statistics. The presence of a statistical interaction depends on the type of model used for modeling. In the following, we will explain this in the light of a classification task in genetic epidemiology, i.e. the prediction of a dichotomous outcome Y based on genetic markers, here given as single nucleotide polymorphisms (SNPs) with two alleles A 1 and A 2 .

Note:
Example of a penetrance table for a multilocus L = (L 1 , L 2 ) with two diallelic SNPs L 1 = A and L 2 = B.

Definition via logistic regression
Cordell [9] defines the interaction of two SNPs A and B with alleles A 1 , A 2 and B 1 , B 2 on the basis of a saturated linear model, given by Here, l is a link function and, for a dichotomous outcome Y which follows a Bernoulli distribution with parameter π, l is the logarithm of the odds of π = P (Y = 1): are the main effects , and ι A1A2B1B2 , ι A1A2B2B2 , ι A2A2B1B2 , ι A2A2B2B2 are the interaction effects. The SNPs A and B are said to be interacting if at least one of the interaction effects is not equal to zero, i.e. there is a deviation from additivity of the main effects.
To simplify the model, one can assume a recessive, dominant or additive genetic model for the SNPs. In this case, the regression model simplifies to for a recessive genetic model, for a dominant genetic model and for an additive model for SNP A. The definitions for SNP B are equivalent.

Definition via penetrances
The definition of statistical interaction is not limited to the deviation from the additivity of the main effects in the logistic model. Another definition is based on the penetrance, which corresponds to the modeling of Example of a penetrance table for a multilocus L = (L 1 , L 2 ) with two diallelic SNPs L 1 = A and L 2 = B without any marginal effects. the probability π = P (Y = 1) by a linear model. The penetrance function is defined as i.e. the probability of Y = 1 given a specific genotype If multiple SNPs L j , j = 1, . . . , d, d ∈ N + as multilocus L = (L 1 , . . . , L d ) are considered, the definition of the penetrance function can easily be extended to Here, g j is one of three possible genotypes of SNP L j and g = g 1 , . . . , g d is one specific combination of genotypes. Similar to the definition in the regression framework, an interaction between loci in terms of penetrance may be expressed as deviation from the additivity in the penetrance changes per allele. As an example, assume two SNPs A and B with alleles A 1 , A 2 and B 1 , B 2 . Further, assume that at SNP A the penetracne increases from f A (0) = f A (1) = 0.62 by 0.11 to f A (2) = 0.73 for genotype A 2 A 2 compared to the genotypes A 1 A 1 and A 1 A 2 . At the same time, the penetrance is increased from f B (0) = f B (1) = 0.62 for the genotypes B 1 B 1 and B 1 B 2 by 0.  Table 1). Thus, there is a deviation of −0.05 from additivity and therefore the SNPs A and B could be considered as interacting.
If an SNP L j falls outside the consideration of the multilocus L, then the penetrances of the resulting multilocus genotypes are . . , g d , then we say that L j has no marginal effect. An example of this concept is given in Table 2 for a multilocus L with two diallelic SNPs. The respective genotype probabilities are given in parentheses. If SNP A is ignored, the penetrance on each genotype of SNP B is 0.25. Conversely, the penetrances of SNP A are also 0.25 if SNP B is ignored. Thus, both SNPs have no marginal effect. In this example, it is even the case that the marginal penetrances are identical with respect to both SNPs. If these loci were analyzed independently, no effect would be identifiable.
Using penetrances two measures can be calculated, which give information about the influence of the considered multilocus L on an outcome Y . For this purpose Urbanowicz et al. [25] calculate the population prevalence K as K = g∈G P (g) f L g . Table 3: Log-odds of penetrances in Table 2 for two diallelic SNPs A and B.
Genotype at SNP B This expresses the probability with which a phenotype can be observed in a population. Based on K, the heritability can also be calculated for L as a measure of the explained variance of Y by L. The larger h 2 ∈ [0, 1], the better Y can be explained only by considering L.

Connection between logistic regression and penetrance
In order to support the description of the simulation setting, we give a short explanation on how to convert the effects on a logistic regression scale to penetrances and vice versa.
The logarithmic odds at the multilocus L can be calculated by given effect parameters ι. In this notation, G k stands for all k tuples of the genotype combinations of g, and 0 ≤ i ≤ d is the number of genotypes in g different from the homozygous genotype with the most common allele at SNP L j , j = 1, . . . , d. In particular, for k = 1, the ι g 1 are the corresponding main effects β gj , g j ∈ G 1 .
With the inverse of the logit function logit −1 (z) = expit (z) := exp(z) 1+exp(z) the penetrance for each genotype combination can be calculated from the effect parameters of the logistic model.
Conversely, from a given penetrance f L g for all genotype combinations g, the effect parameters ι g of the corresponding logistic model can be calculated by Here,f L g = logit f L g and f 0 is the penetrance of the genotype combination of the homozygous genotypes with the most common allele at the respective SNP.
It should be noted at this point that an interaction effect at one level does not automatically imply an interaction effect at the other level. This is called a scale effect [26] and discussed in detail by Gola et al. [27]. As an example, Table 3 shows the combinations of penetrances transformed into effect parameters of the logistic regression model from Table 2. It can be seen that there is no deviation from additivity at the level of logistic regression. The logarithmic odds increase for the genotype A 2 A 2 at SNP A by β A = 0.5 and for the genotype B 2 B 2 at SNP B by β B = 1. When combining both genotypes, the logarithmic odds increase by β A + β B + ι = 0.5 + 1 + 0 = 1.5.
This effect may also lead to the effect that a pure interaction of SNPs without any main or marginal effects of the single SNPs on one level being an impure interaction of SNPs with main or marginal effects on the other level and thus easier to detect by appropriate methods.

Model-based multifactor dimensionality reduction (MB-MDR)
The MB-MDR is an extension of the MDR such that the assignment of the cell labels, i.e. the combinations of feature levels, is based on an appropriate statistical test and that each possible combination of features, i.e. MDR model, is ranked by a test statistic. Suppose each sample i, i = 1, . . . , n is characterized by q discrete features x i = (x 1 , . . . , x q ), with each feature x j , j = 1, . . . , q having l j levels. The observed outcome of each sample is denoted by y i and can be of arbitrary scale. The core algorithm of the MB-MDR consists of five steps:  This core algorithm is repeated for all r = 1, . . . , q d possible combinations of d out of q features and possibly for several values of d, constructing MDR models f d,r . Finally, the MDR models can be sorted by their respective test statistic and using a permutation-based strategy, p values can be assigned to each MDR model. Several improvements and extensions of this basic algorithm allow to analyze different outcomes, such as dichotomous [28], continuous [29] and survival [30] traits, or to adjust for covariates and lower order effects of the features of an MDR model [31]. A fast C++ implementation of the MB-MDR is available [32] and used in this work.

Extension of MB-MDR to individual prediction
We extended the MB-MDR algorithm to not only detect interactions between features but to allow individual predictions based on the MDR models. It is important to note that each MDR model is a prediction model in itself using d features and that each cell of an MDR model includes the predicted outcome for the respective feature levels combination. Thus, after the construction of the MDR models and selection of the s best MDR models, the prediction for a new sample is the aggregation of the characteristics of the cells the sample falls into. In our framework, instead of calculating p values of the MDR models, s is determined by cross-validation during training.

Simulation study
A simulation study was performed to compare our proposed algorithm with two state-of-the-art prediction algorithms, the Random Forest [17] and the Elastic Net [34], a generalization of the LASSO [16] and ridge regression [15], for classification tasks. As implementations we utilized the R (version 3.3.1) [35] packages ranger (version 0.8.1.300) [36] and glmnet (version 2.0-5) [20]. We considered eight scenarios with different numbers of SNPs or combinations of SNPs as effect feature components:

{TRUE, FALSE}
For the benchmarking regarding the AUC of the three algorithms, we used the mlr framework (version 2.12) [37]. Each dataset D was split into datasets D 1 and D 2 of the same size. Tuning was performed with 5-fold cross-validation on D 1 using the R package mlrMBO (version 1.1.0) [38] for 100 iterations with ranger (ntrees: 500, mtry: square root of the number of tuning hyperparameters) as the surrogate learner. The hyperparameter spaces considered for tuning are shown in Table 4 together with their respective descriptions. After tuning, a prediction model with the tuned parameters was built on D 1 and the prediction performance was calculated on D 2 for each replicate.

Application to real data
We also compared the performance of the algorithms on the data by the North American Rheumatoid Arthritis Consortium (NARAC) dataset comprised of 1194 cases with rheumatoid arthritis and 868 controls, genotyped at 545,080 SNPs, which is described in detail by [39]. Previously, [41] identified some putatively interacting loci in the HLA region on chromosome 6 in this dataset. We removed SNPs and samples with high missing rates (> 0.02 and > 0.1 respectively) and selected all SNPs with MAF > 0.1 on chromosome 6 after LD pruning (window size: 10 6 SNPs, step size: 1 SNP, r 2 threshold: 0.75). As in the benchmarking on the simulated datasets, we used mlr and mlrMBO with the same settings as before in nested cross-validation with 10-fold outer cross-validation.
The underlying R code is available on request.

Benchmark
In scenarios 1 and 2 with only main effects simulated, all algorithms achieve similar performances (see Figures  1 and 2 Simulating only one interaction effect as in scenario 3, MBMDRC models have the highest median prediction performance for all sample sizes as shown in Figure 3. In this scenario, the ranger and glmnet models do not achieve a median performance greater than 0.55, if the interacting SNPs have the same MAF. Interestingly, models can improve their median performance at the cost of increased variability, if the SNPs have different MAFs. This effect is most evident for ranger models, but also observable for the other two algorithms. For scenarios with both main and interaction effects, MBMDRC models dominate the other two algorithms for sample sizes greater than 200 (see Figure 4 for scenario 7). However, ranger models can reach similar or even better performances if the interacting SNPs have different MAFs. For example, for MAFs 0.1 and 0.4, heritability greater or equal 0.1 and sample size greater than 1000, ranger models achieve a better performance than the MBMDRC models in the median, although the variability is slightly increased. The glmnet models do not use the interaction information, thus their performance is just based on the available main effects and the maximum median performances remain at about the same level between 0.68 and 0.80 as in scenario 1.
In scenario 8, the interaction of three SNPs was simulated, wich is an interaction of higher order than considered by the MB-MDR. Still, MBMDRC models achieve at least a similar performance as the glmnet and ranger models (see Figure 5). Even though the MBMDRC models should be based mostly on the three additional single SNPs with marginal effects, the median performance for sample size 2000 is slightly better than those of ranger and glmnet. For sample size 10000, ranger can achieve similar median performance but with higher variability. The glmnet models are limited by the information based on the three main effects and their performance is comparable to those of scenario 7.
The further scenarios 4, 5, and 6 confirm the relationships described so far and the corresponding Figures 7,  8, and 9 as well as detailed result tables also for the already described scenarios (Tables 8, 9, 10, 11, 12, 13, 14, 15) can be found in Supplementary Information.

Selected hyperparameters
Inspection of the hyperparameters selected by tuning for the MBMDRC models shows that they correspond to the respective scenarios, i.e. the scenario settings determine the optimal hyperparameter settings. That, in turn, enables to select hyperparameter settings to include certain characteristics in prediction.

Note:
For hyperparameters alpha and min_cell_size the median and the first and third quartile in parantheses are given.

Note:
For hyperparameters alpha and min_cell_size the median and the first and third quartile in parantheses are given.
For scenarios 1, 3, and 7, Tables 5, 6, and 7 summarize how often specific combinations of the hyperparameters order_range, order und adjustment were selected in tuning. These hyperparameters depend on each other, i.e. interaction effects of two SNPs without any marginal effects can enter the prediction model only if order is set to 2. Simultaneously, the hyperparameters order_range and adjustment influence whether additional marginal effects can be detected by the MB-MDR algorithm and thus used for prediction. Since setting adjustment to CODOMINANT will adjust for possible marginal effects, order_range should be set to TRUE, to allow for marginal effects. On the other hand, if order is set to 1, both adjustment and order_range become meaningless.
In scenario 1, order is set to 2 in 72.87% of the replicates, although no interactions are simulated in this scenario. However, if the hyperparameters order_range and order are set to FALSE and 2, respectively, the hyperparameter adjustment is set to NONE more often (29.48%) than to CODOMINANT (3.65%). Thus, the MB-MDR algorithms also accounts for marginal effects. This is not the case, if order_range and order are set to TRUE and 2, respectively. In this setting, adjustment is almost equally frequent set to NONE (21.91%) and CODOMINANT (17.83%). The hyperparameter alpha seems to be dependent on the hyperparameter order.
If order is set to 1 the median of alpha between 0.22 and 0.27 whereas the median when order set to 2 ranges higher between 0.43 and 0.50. However, the range of the quartiles is rather big, thus there seems to be no optimal value for alpha in general for this scenario. Same is true for the hyperparameter min_cell_size with median between 18 and 27 and wide quartile ranges.
In scenario 3, order is set to 2 in 92.48% of the replicates, as expected, since in this scenario, only interaction effects are simulated. Although there are no marginal effects of single SNPs simulated in this scenarios,

Note:
For hyperparameters alpha and min_cell_size the median and the first and third quartile in parantheses are given.
there is a slight tendency to set adjustment to CODOMINANT, whereas the selection frequencies of the values TRUE (45.84%) and FALSE (54.16%) for the hyperparameter order_range are almost equal. Again, the hyperparameters alpha and min_cell_size depend on order and no optimal values can be determined because of the wide range of the quartiles.
In scenario 7 with simulated marginal and interaction effects, one combination of hyperparameters stands out. In this scenario the hyperparameters order, order_range, and adjustment are set to 2, TRUE, and CODOMINANT in 59.14% of the replicates, much more frequent than any other combination of hyperparameter values. Using these hyperparameter settings, marginal and interaction effects can clearly be differentiated and used for prediction. Again, no clear optimal values for the hyperparameters alpha and min_cell_size can be determined.

Real data
Application of the three algorithms to the rheumatoid arthritis dataset yields no relevant differences regarding their median performance. Here, the glmnet models have a median AUC of 0.86, the ranger models of 0.85 and the MBMDRC models of 0.83, comparable to the simulation results of scenario 2, MAF=0.4, h 2 = 0.1 and sample size between 1000 and 2000. The corresponding box plot is shown in Figure 6.
Comparison of the runtimes shows that the ranger implementation is the fastest of all three. Specifically, the mean runtime on Intel® Xeon® E5-2680 CPUs at 2.70GHz of one outer cross-validation fold is 1.5 times faster for ranger (378582.6 seconds) than for MBMDRC (603436.6 seconds), and 1.3 times faster for glmnet (465262.4 seconds) than for MBMDRC.

Discussion
In this work, we extended a known algorithm to detect interactions to a classification algorithm that has a performance comparable to two popular classifications algorithms if no interactions are present but which clearly outperforms these if interactions are present. We have shown this by a comprehensive simulation study and by application to real data. Specifically, our simulation study revealed that our new classification algorithm can use information hidden in interactions more efficiently than the Random Forest approach, i.e. smaller sample sizes are required to achieve similar performance. The Elastic Net, at least in available implementations, does not consider interactions at all, thus is inappropriate if the outcome is influenced by interacting features. In our application to the real dataset on RA, the performance of our algorithm was not relevantly different from that of the competitors, indicating that even though [41] claimed to have identified putative interactions on chromosome 6 and MDR models of two SNPs entered the MBMDRC models, this did not improve the classification performance.
One drawback of our new method is the exponential increase in runtime with increasing the number of features in a dataset. Whereas its runtime is not much slower than that of the Elastic Net approach, because both depend mostly on the number of features, Random Forest is clearly the fastest one, mainly dependent on the number of samples. This makes the application of our new method to datasets on a genome-wide scale still challenging at the moment.
As a clear strength, we have shown in our unbiased benchmark on simulated data that taking interactions into account can improve classification performance. As our method is not only applicable to biological/genetic data but to all datasets with discrete features, it may have practical implications in other applications, and we made our method available as an R package [33].
In addition, our observation that the Random Forest algorithm can make use of interacting features up to a certain degree fits well with Wright et al. [19], who conclude that Random Forest is able to capture interactions but not to detect them. In this regard, our method offers a clear advantage in that is not only able to capture interactions but the MDR models as the basic building block also allow insight into the underlying structure and dependencies among the features.

Conclusions
We conclude that the explicit use of interactions between features can improve the prediction performance and thus should be included in further attempts to move precision medicine forward.

Acknowledgments
This work is based on data that was gathered with the support of grants from the National Institutes of Health (NO1-AR-2-2263 and RO1-AR-44422), and the National Arthritis Foundation. We would like to thank Drs. Christopher I. Amos and Jean W. MacCluer, and Vanessa Olmo for the permission to use the dataset on rheumatoid arthritis.

Funding
This work was funded by the German Research Foundation (DFG, grant #KO2240/-1 to IRK).

Availability of data and materials
All data generated during this study and the underlying R code are available on request. The dataset on rheumatoid arthritis used during the current study is owned by the North American Rheumatoid Arthritis Consortium and is available from the corresponding author of [39] on reasonable request.

Authors' contributions
DG designed and implemented the new algorithm, ran the simulations, analysed the NARAC dataset on rheumatoid arthritis, and drafted the manuscript. IRK acquired the NARAC dataset and was involved in revising the manuscript critically. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Note:
Performance of the algorithms MBMDRC, RANGER, and GLMNET measured as AUC over 50 replicates in scenario 7. The median of the AUC and the 25