Detection of correlated versus interacting features in simulated data
To investigate how well the rule networks from Ciruvis could detect feature interactions, we first tested them on simulated data. The data contained both features correlated to the decision and pairs of interacting features predictive of the decision. The levels of correlation and pairwise predictability were determined by two parameters that defined a maximum level for the most predictive feature or pair in the dataset. The maximum level of correlation, X, and of interaction, Y, were each varied between 0 and 1. Then, for each dataset, the number of correctly classified objects was counted (Additional file 4: Figure S2). As expected, there were usually more correctly classified objects when the features were more predictive (as measured by higher X and/or Y). Surprisingly, a higher level of interaction increased, or at least retained, the classification quality, whereas a higher correlation sometimes decreased it. Specifically, the quality decreased when the pairwise predictability was high and the correlation increased above 0.20-0.30. This was most evident when the interaction level was 1.00, where the average number of correctly classified objects decreased from 998–999 out of 1000 for X < 0.25 to a local minimum of 828 at X = 0.45.
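The simulation scheme can be sketched as follows. This is a minimal illustration under our own assumptions (function and variable names are hypothetical), not the exact generator used in the study:

```python
import random

def simulate(n=1000, x=0.15, y=0.55, seed=0):
    """Toy data generator: a feature C correlated to the decision at level x,
    and an interacting pair (R, S) that is jointly predictive at level y.
    Opposite values of R and S encode decision 0; equal values encode 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        dec = rng.randint(0, 1)
        # correlated feature: copies the decision for a fraction x of objects
        c = dec if rng.random() < x else rng.randint(0, 1)
        # interacting pair: individually random, jointly informative
        # for a fraction y of the objects
        r = rng.randint(0, 1)
        if rng.random() < y:
            s = r if dec == 1 else 1 - r
        else:
            s = rng.randint(0, 1)
        rows.append({"C": c, "R": r, "S": s, "DEC": dec})
    return rows
```

Note that R and S are each marginally uninformative; only their combination carries signal, which is exactly the kind of structure the rule networks are meant to expose.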
This suggests that the rule generation algorithm was biased towards finding rules containing features correlated to the decision. When such correlated features were absent, combinatorial rules of higher quality were more likely to be found. This masking effect became one of the focuses of our study.
Next, we investigated the behavior of the rule networks for different datasets (Figure 1). Since both the features and the decision were binary, only the networks for outcome “0” are presented. Based on the data generation algorithm, opposite values of the R and S variables were expected to predict the “0” decision, e.g., IF R = 0 AND S = 1 THEN DEC = 0, whereas equal values predicted the “1” decision. The aim was to observe how weak interactions could be and still be detected, and to learn about their properties; for instance, whether they would be masked by features strongly correlated to the decision.
Using X = 0.00 and Y = 0.10, we could identify visible connections between pairs with interaction levels of 10%, 8%, and 5% (Figure 1A). The connections between “R4_10” and “S4_10” were the two strongest in the figure, demonstrating that very weak interactions may be detected in the Ciruvis networks even in the presence of very noisy data. This particular example also illustrates that the rules from a classifier may be informative even when the quality of classification is essentially no better than “random guessing”.
In the following runs we processed datasets with a small background correlation, X = 0.15 (Figure 1B-H). With Y = 0.10, the pair with a 10% chance of interaction was barely visible and not among the highest scored connections in the figure (Figure 1B). As Y was increased, the two (or three) highest scored pairs became progressively more visible (Figure 1C-E), and when Y was set to 0.55 or higher the three most interacting pairs were by far the strongest connections (Figure 1F-G), with the exception of Y = 1.00, at which the third pair (R2 + S2) was masked by the more predictive pairs (Figure 1H).
Similarly, when the best interaction was 100% predictive (Y = 1.00) and the correlation was higher (X = 0.35 or X = 0.45, respectively), the strongest interacting pair was highly visible and the second pair did have a visible connection, but one on the same level as some of the noise (Figure 1I-J). Although it is useful to know that stronger rules may mask weaker ones, masking caused by perfect correlation would normally not be expected in a real dataset.
When the dataset had both a high level of correlation and interactions, the connections for the two strongest interacting pairs were visible, but were not the strongest connections (Figure 1K-L). However, the true interactions appeared as connections from conditions that otherwise had few and weak connections, whereas artifact connections, caused by combinations of correlated features, originated from conditions with many strong connections.
In all of the generated rule networks, at most three of the four pairs with non-zero interaction appeared. A likely explanation is that the stronger interactions masked the weaker ones, similarly to how strong correlations did.
Removal of correlated features decreased the masking of weak interactions
In the previous section we showed that when features correlated to the decision were roughly as strong as, or stronger than, the interacting pairs, the latter were masked by the former. Consequently, rules containing the interacting pairs were rarely found or barely visible in the rule networks. To investigate whether removing correlated features from the data would benefit the detection of the pairs, we used the data from Figure 1B (in which the pairs are heavily masked) and removed the correlated features C4 and C3 (15% and 11% correlation, respectively). The pair with the highest interaction (R4 + S4, with interaction frequency 10%) subsequently became relatively stronger (Figure 2A-B). For instance, the connection score between “S4 = 1” and “R4 = 0”, which is 0.7% of the total score in Figure 2A, increased to 1.8% in Figure 2B, becoming the strongest connection in the figure. The increase for the combination “S4 = 0” and “R4 = 1” was smaller but still notable, from 0.6% to 1.1%. In addition, in Figure 2B the “R3 = 0” and “S3 = 1” pair could be identified (increased from 0.4% to 0.7%), although the connection was still weak. When the last two correlated features (C2 and C1, with 8% and 4% correlation, respectively) were removed as well, the strengths of the first and second pairs increased sharply (to 4.3% and 1.4%, respectively) (Figure 2C).
Comparison to other methods using real data
In order to compare the interaction detection to other methods, and to apply the methodology to real data, we used the California Housing [22] dataset downloaded from [23]. This dataset was chosen as it had previously been the subject of interaction detection [24].
California Housing describes housing values based on 1990 census data in California. The decision is the median value of a block group (medianHouseValue) and there are 8 features. We discretized the decision into three groups: one group of houses valued ≥500 000, encoded as ‘2’; the remaining houses were split at their median into the intervals 0–173 600 and 173 601–499 999 (encoded as ‘0’ and ‘1’, respectively). We used the features longitude, latitude, housingMedianAge, totalRooms, population, and medianIncome, previously selected by [24], to build a rule-based model using ROSETTA. The numeric features were discretized using EqualFrequencyBinning with 4 intervals. The model accuracy was estimated using 10-fold cross validation.
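Equal-frequency binning places the cut-points so that each interval receives roughly the same number of objects. A minimal stand-alone sketch of the idea (not ROSETTA's implementation):

```python
def equal_frequency_bins(values, n_bins=4):
    """Assign each value a bin index 0..n_bins-1 such that the bins hold
    roughly equal numbers of objects (ties can unbalance the bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # integer rank determines the bin; cap at the last bin
        bins[i] = min(rank * n_bins // len(values), n_bins - 1)
    return bins

# eight values, four bins -> two values per bin
print(equal_frequency_bins([5, 1, 9, 3, 7, 2, 8, 4]))  # [2, 0, 3, 1, 2, 0, 3, 1]
```

In contrast to equal-width binning, the interval boundaries adapt to the data distribution, which is why the resulting intervals (such as population = [1167, 1726)) are unevenly sized.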
The medianIncome feature was highly correlated to the decision (r = 0.61; Additional file 5: Figure S3), and when the rule-based model was built to include it, it dominated the strongest connections (Additional file 6: Figure S4). An alternative model was therefore built excluding medianIncome, which reduced the accuracy of the model from 72.4% to 66.5%, as important information was excluded, but made the identification of interacting pairs easier. Inspecting the rule networks (Figure 3), we identified the ten strongest connections for each outcome (Additional file 7: Table S2). For instance, for medianHouseValue = 0, three of these described combinations of conditions with specific values for latitude and longitude, three combinations with population and totalRooms, two with population and longitude, and two with totalRooms and longitude. For each of these specific combinations of features, we tested whether it had a significant interaction effect (see Additional file 3: Supplementary methods for details). Additionally, we computed the expected accuracy (Additional file 7: Table S2) by first estimating the effect of each condition separately and then combining these effects under a multiplicative model (see [25] for a mathematical derivation). The interaction effects could then be assessed by comparing the observed and the expected accuracies.
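As an illustration of this comparison, under one simple multiplicative model (our sketch; the exact derivation used in the study is given in [25]) each condition scales the class prior by an independent lift factor, so the expected accuracy of a conjunction is the product of the single-condition accuracies divided by the prior:

```python
def expected_accuracy(cond_accuracies, prior):
    """Expected accuracy of a rule combining several conditions, assuming
    each condition multiplies the class prior by an independent lift a/prior."""
    expected = prior
    for a in cond_accuracies:
        expected *= a / prior  # independent multiplicative lifts
    return min(expected, 1.0)  # an accuracy cannot exceed 1

# two conditions, each 60% accurate alone, for a class with prior 0.5
print(round(expected_accuracy([0.6, 0.6], 0.5), 3))  # 0.72
```

An observed accuracy well above this expectation, as for the population/totalRooms conjunction described below, suggests an interaction effect rather than two independent effects.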
Out of the ten strongest connections for medianHouseValue = 0, three described significant interactions. For instance, “population = [1167, 1726) AND totalRooms = [1448, 2127)” had an accuracy of 67.8% compared to an expected 51.7%. This increase in accuracy is due to a specific interaction between the population in the area and the total number of rooms. Presumably, the number of rooms per capita is what determines the house prices.
In conclusion, an interaction between population and totalRooms was described by several connections. Additionally, a specific combination of latitude and longitude described an interaction predictive of low house prices, and a combination of high housingMedianAge and high totalRooms described an interaction predictive of very high house prices. Two of these pairs were reported as interacting by [24], whereas the third is novel. The interaction between latitude and longitude was very strong in the previous study, and it indeed appeared in several of the strongest connections. However, only one specific combination of conditions showed a significant interaction effect. This is most likely because these two features were strongly correlated (r = -0.92; Additional file 5: Figure S3), so the assumption of independent effects underestimated their interaction.
Applications to leukemia and lymphoma
Finally, we applied Ciruvis to biological data describing leukemia [26] and lymphoma [27]. The leukemia set contained gene expression for 7129 genes from 38 patients divided into two outcomes: acute lymphoblastic leukemia (ALL; n = 27) and acute myeloid leukemia (AML; n = 11). The lymphoma set contained 4026 genes from 62 patients divided into three outcomes: diffuse large B-cell lymphoma (DLCL or D; n = 42), follicular lymphoma (FL or F; n = 9), and chronic lymphocytic leukemia (CLL or C; n = 11). The probe names were changed into gene names when possible and otherwise kept as in the source data. A single quote was used to distinguish between multiple probes matching the same gene. Since most genes had their expression discretized into two intervals by ROSETTA (see Additional file 1: Supplementary methods for details on the discretization), the intervals were renamed “low” and “high”, with the addition of “medium” where applicable. See Additional file 8: Table S3 and Additional file 9: Table S4 for details on gene names and values.
First, we used Monte Carlo feature selection [28] to rank the genes by significance. After correcting for multiple testing, there were 701 significant (p < 0.05) genes for leukemia and 512 for lymphoma. Details about the feature selection are described in Additional file 3: Supplementary methods. A principal component analysis (PCA) verified that, using the 30 most significant features, the outcomes were separable by the first two principal components (Figure 4). Missing values were replaced by the gene average during the PCA. Performing a disease association analysis using WebGestalt [29], we could confirm that the top ten disease associations of the selected genes contained annotations related to lymphoma and leukemia. For example, the leukemia data were enriched for genes related to Lymphoid Leukemia (LYN, CCND3, TCF3, CD33, and MYB; adjP = 0.024), and the lymphoma data for Acute Myeloid Leukemia (CALR, SUMO, and MYB; adjP = 0.18) and Acute Erythroblastic Leukemia (PCBP2 and MYB; adjP = 0.18). The p-values were calculated by WebGestalt using the hypergeometric distribution and adjusted with Bonferroni correction.
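This kind of enrichment test can be sketched as a one-sided hypergeometric tail probability with a Bonferroni adjustment; the gene counts below are made-up placeholders, not the study's actual numbers:

```python
from math import comb

def hypergeom_p(n_hits, n_selected, n_category, n_total):
    """P(X >= n_hits) when n_selected genes are drawn from n_total genes,
    of which n_category carry the disease annotation."""
    tail = 0.0
    for k in range(n_hits, min(n_selected, n_category) + 1):
        tail += (comb(n_category, k)
                 * comb(n_total - n_category, n_selected - k)
                 / comb(n_total, n_selected))
    return tail

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(p * n_tests, 1.0)

# e.g. 5 of 30 selected genes fall in a 40-gene category among 7129 genes,
# adjusted for 10 tested categories (illustrative numbers only)
adj_p = bonferroni(hypergeom_p(5, 30, 40, 7129), n_tests=10)
```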
Next, we used ROSETTA to train a rule-based classifier based on the selected features. The accuracy of the classifier, estimated by leave-one-out cross validation, was 100% for both datasets. Details on the classification are described in Additional file 3: Supplementary methods.
Since each rule set in the leave-one-out cross validation was trained on all objects except one, the rule sets are expected to be very similar to rules trained on the whole data. Therefore, instead of training another classifier on the whole data, we merged all the rules from the cross validation iterations. Duplicates were removed, and the rules were filtered so that a rule whose conditions were a superset of another rule's was removed if it had lower significance (hypergeometric distribution); for details on the p-value calculations, see [30]. The motivation behind this filtering strategy is that shorter rules are preferred if they are at least as significant as their longer counterparts.
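The filtering step can be sketched as follows (a simplified stand-in for the actual procedure; the rule representation and p-values are illustrative):

```python
def filter_rules(rules):
    """Deduplicate rules, then drop any rule whose condition set is a strict
    superset of another rule's unless it is more significant (lower p-value).
    rules: iterable of (conditions, p_value) pairs."""
    # keep the best p-value per unique condition set
    best = {}
    for conds, p in rules:
        key = frozenset(conds)
        if key not in best or p < best[key]:
            best[key] = p
    # a longer rule survives only if strictly more significant than
    # every shorter rule whose conditions it contains
    return [
        (conds, p) for conds, p in best.items()
        if not any(other < conds and best[other] <= p for other in best)
    ]

rules = [
    ({"SPTAN1=high"}, 1e-3),
    ({"SPTAN1=high", "CFP=low"}, 5e-3),  # superset, less significant: removed
    ({"PTX3=low", "CFP=low"}, 5e-4),     # no shorter counterpart: kept
]
```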
The filtered set of rules was submitted to Ciruvis using default parameters. The interactive rule networks are available online at [31].
The rule network for leukemia is shown in Figure 5. The difference in the overall topology of the networks for ALL and AML may partly be explained by the different numbers of rules for each outcome (48 for ALL and 254 for AML). Direct comparison between the networks was therefore difficult, since the same edge width would correspond to a different number of rules. Instead, we studied the strongest connections in each network. For this dataset both networks were quite simple, with all connections supported by only one high-quality rule. For ALL, the highest scoring connections were based on any pair of the following conditions: SPTAN1 = high, PTX3 = low, and CFP = low; the conditions SPTAN1 = high and CFP = low were also the most frequent in other rules. Had the set of patients been larger, such noiseless relationships would likely have been harder to identify, and Ciruvis might have helped us identify the most important pairs out of more complicated rules. The AML network showed the same property, with a large number of connections based on only one rule containing a pair of conditions. Most likely, more combinations were found in this network because no single condition constituted a high-quality rule by itself, which forced the generation of longer rules.
Similar behavior was observed in some of the rule networks for the lymphoma data (Figure 6). For CLL, many connections were based on only one rule. The strongest connection (between MIF = low and GPX1 = high) was based on four rules. This combination corresponded to a rule with 73% accuracy, compared to an expected accuracy of 51% assuming independent and multiplicative effects, which indicated that an interaction could be present. The second strongest connection, between NT5C2 = low and GPX1 = high, showed an accuracy of 84% compared to an expected 55%. A three-way interaction between NT5C2 = low, MIF = low, and GPX1 = high could then be hypothesized and tested, giving an accuracy of 92% compared to an expected 83%.
The connections for the next outcome, DLCL, were each supported by only one rule of high quality. Apparently, adding more conditions did not yield a significant increase in rule quality. Notably, there are groups of conditions in the network that are interchangeable in certain rules. For instance, CXCL9 = high may be combined with any one of PRKCB = low, PRKCB’ = low, MXI1 = low, HDGF = high, TUBB = high, or GENE669X = high to produce a rule for DLCL supported by all 42 patients in that group and with 100% accuracy. If GPX1 = high is instead combined with any of these six conditions, the second highest scoring connections are obtained, with rules that are almost as good: supported by 41 patients and with 100% accuracy.
For the FL outcome, a hypothesized three-way interaction between GENE1625X = low, MIF = low, and NT5C2 = low had to be rejected, as the combined accuracy was lower than predicted. Pairs of these conditions separated FL + CLL from DLCL, and together with any of several other conditions they defined three-way interactions.