Block Forests: random forests for blocks of clinical and omics covariate data

Background: In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to capture complex dependency patterns between the outcome and the covariates. Against this background we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome we compared the prediction performance of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available.

Results: We identified one variant, termed “block forest”, that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the multi-omics case. The degree of improvement over random survival forest varied strongly across data sets. Moreover, considering all clinical covariates mandatorily improved the performance. This result should, however, be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application.

Conclusions: The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type.

Electronic supplementary material: The online version of this article (10.1186/s12859-019-2942-y) contains supplementary material, which is available to authorized users.

C Multi-omics data: Analysis of the influence of data set characteristics on the performance of BlockForest relative to that of RSF

Except for one data set, the differences between the mean C index values obtained using RSF and those obtained using BlockForest were not larger than 0.02, with BlockForest outperforming RSF for the majority of data sets. The sizes of these differences nevertheless varied quite strongly across data sets. This suggests that certain data set specific factors determine whether or not a considerable improvement can be expected from using BlockForest as opposed to RSF. It would be valuable to know these factors and the forms of their influence on the performance differences between BlockForest and RSF. This would make it possible to discern situations in which BlockForest can be expected to perform considerably better than RSF from situations in which there is not much gain in prediction performance from using BlockForest or in which RSF might even be preferable.
Therefore, in this section we present an analysis relating the values of several data set specific factors to the differences between the mean C index values obtained for BlockForest and those obtained for RSF. These differences are referred to as diffC in the following. The investigated data set specific factors are:
• Sample size: n
BlockForest involves M tuning parameters, which makes this algorithm more complex than standard RSF. We assumed that the optimization of these tuning parameters is more stable for larger data sets, which could have the effect that, compared to RSF, the gain in prediction performance of BlockForest increases with sample size.
• Degree of dominance of the most important block: oneblockimp
We hypothesized that the more the predictions are dominated by one of the blocks, the less improvement there will be from using BlockForest (or the other variants) in place of RSF.
If almost all information relevant for prediction is contained in only one of the blocks, it is not necessary to exploit interactions between blocks or, more generally, to let the other blocks participate in the prediction process. Instead, in this situation it is better to consider almost exclusively covariates from the relevant block. This is, however, already accomplished by the standard RSF algorithm, because the covariate with the best value of the split criterion among the mtry randomly sampled covariates will in general stem from the relevant block.
By contrast, any upweighting of the other blocks performed by the variants will in such situations not be beneficial and can even be harmful. For example, the randomization of the block choice performed by BlockForest and RandomBlock is counter-productive if there is only one relevant block, because in such situations it would be best simply to always use that block. A notable exception to the tendency of the standard RSF algorithm to select mostly covariates from the (single) relevant block is the case in which this block involves only a few covariates (which happens, e.g., for the clinical block). In this situation, the covariates from the relevant block will be selected too infrequently, because the great majority of the mtry sampled covariates will stem from the other blocks. As a consequence, a covariate from a non-informative block will frequently divide the samples in the current node better than the best sampled covariate from the informative block simply by chance.
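The undersampling of small blocks described above can be made concrete with a short simulation. The function name, block sizes and mtry value below are illustrative assumptions, not values taken from the study:

```python
import random

def prob_block_represented(block_sizes, block, mtry, n_rep=20000, seed=1):
    """Monte Carlo estimate of the probability that at least one covariate
    from the given block is among the mtry covariates sampled at a split."""
    rng = random.Random(seed)
    p = sum(block_sizes)
    start = sum(block_sizes[:block])
    members = set(range(start, start + block_sizes[block]))
    hits = sum(
        1 for _ in range(n_rep)
        if members.intersection(rng.sample(range(p), mtry))
    )
    return hits / n_rep

# Illustration: a clinical block of 10 covariates next to an omics block
# of 9990 covariates, with mtry = 100 covariates sampled per split.
```

With these illustrative numbers, the small clinical block is represented at only roughly one in ten splits, whereas the large omics block is represented at every split, which is the imbalance the text describes.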
We measured the degree of dominance of the most important block through the maximum of the b_m values, m = 1, …, M, associated with the RandomBlock variant. This metric is denoted as oneblockimp in the following.
• Strength of the biological signal: signal
It would be interesting to know whether the level of biological signal contained in the covariate data has a notable effect on the gain in prediction performance to be expected from using BlockForest as opposed to RSF. For example, if a particularly strong gain in prediction performance could be expected in situations in which the biological signal is strong, BlockForest would be particularly recommended over RSF in situations in which a decent level of prediction performance is already attainable using conventional prediction methods that do not take the block structure into account. If, by contrast, a considerable improvement through using BlockForest can be expected for weaker biological signals in particular, BlockForest can be employed effectively in situations in which conventional prediction methods do not deliver good results.
We measured the degree of biological signal present in a data set by the average of the mean C index value obtained using BlockForest and the mean C index value obtained using RSF.
This metric is referred to as signal in the following. Note that since the target metric in our analysis is the difference between the mean C index value obtained using BlockForest and the mean C index value obtained using RSF, the plot of the values of this difference diffC against the values of signal corresponds to a Bland-Altman plot.
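The three data set specific quantities can be computed directly from the per-data-set results. The following sketch uses our own (hypothetical) function and argument names:

```python
def dataset_metrics(c_blockforest, c_rsf, b_values):
    """Per-data-set quantities used in this analysis:
    diffC       -- target metric: difference of the mean C index values
    signal      -- strength of the biological signal: their average
    oneblockimp -- dominance of the most important block: maximum of the
                   block selection probabilities b_m from RandomBlock
    """
    diffC = c_blockforest - c_rsf
    signal = (c_blockforest + c_rsf) / 2.0
    oneblockimp = max(b_values)
    return diffC, signal, oneblockimp
```

Plotting diffC against signal then yields the Bland-Altman plot mentioned above, since diffC is the difference and signal the average of the same two values.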
In Figure S4, the values of diffC are plotted against the values of each of the quantities described above.
There seems to be a general trend that the improvements obtained by using BlockForest become larger for larger data sets. All data sets for which we observe a (small) deterioration from using BlockForest in place of RSF are small to medium sized. However, there are also small data sets for which BlockForest performed notably better than RSF. For example, for the smallest data set the improvement of BlockForest over RSF was the strongest among all data sets.
The plot of the values of diffC against those of oneblockimp suggests the following: if no block dominates the others, a comparably strong improvement might be obtained through the use of BlockForest, but the mere fact that no block dominates is not a sufficient condition for such an improvement.
The Bland-Altman plot of the values of diffC against those of signal resembles a funnel: for smaller values of signal the values of diffC vary more strongly. That is, for weaker signals there were more often stronger improvements from using BlockForest instead of RSF, but also more often merely weak improvements and even (slight) impairments.

The variable selection probabilities v_m optimized using VarProb are for most data sets considerably larger for the clinical block than for most or all of the omics blocks (Figures S5 and S6). However, for many data sets there are also omics blocks with high optimized values of v_m.
This demonstrates that, depending on the data set considered, it can also be effective to sample covariates from certain high-dimensional blocks with the same or an even higher probability than covariates from the low-dimensional clinical block.
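The idea of block-specific selection probabilities can be sketched as follows. As a deliberate simplification, each covariate of block m is included as a split candidate independently with probability v[m]; the exact VarProb sampling scheme is described in the main paper, and all names below are ours:

```python
import random

def sample_split_candidates(block_sizes, v, seed=1):
    """Return indices of candidate covariates for one split, where each
    covariate of block m is included with probability v[m] (sketch)."""
    rng = random.Random(seed)
    selected, idx = [], 0
    for size, v_m in zip(block_sizes, v):
        for _ in range(size):
            if rng.random() < v_m:
                selected.append(idx)
            idx += 1
    return selected
```

For example, with v = [0.5, 0.001] for a clinical block of 10 covariates and an omics block of 10,000 covariates, one would expect about 5 clinical and 10 omics candidates per split, putting the small clinical block on a comparable footing.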
The optimized weights w_m associated with SplitWeights are shown in Figures S7 and S8. The values of the block selection probabilities b_m optimized using RandomBlock (Figures S11 and S12) tend to be relatively stable across the cross-validation iterations. As written in Section 'RandomBlock: Random block selection' of the main paper, the optimized block selection probabilities can give indications of the relative importances of the different blocks for prediction.
We obtained the following mean block selection probabilities across the data sets (sorted from highest to lowest): 0.43 (mutation), 0.29 (RNA), 0.12 (clinical), 0.11 (CNV), 0.07 (miRNA). Thus, the mutation block and the RNA block seem to be by far the most important blocks. Note, however, that the b_m values depend strongly on the specific set of blocks available in the data sets.
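The way optimized b_m values act at a split can be pictured as a weighted block choice. The sketch below, using the mean probabilities quoted above, is illustrative only and not the actual implementation:

```python
import random

def choose_block(b, rng=None):
    """Draw one block index with probability proportional to b[m]."""
    rng = rng or random.Random()
    return rng.choices(range(len(b)), weights=b, k=1)[0]

# Mean optimized block selection probabilities reported above:
blocks = ["mutation", "RNA", "clinical", "CNV", "miRNA"]
b = [0.43, 0.29, 0.12, 0.11, 0.07]
# Under these values the mutation block is chosen at almost every
# second split, the miRNA block only rarely.
```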
As noted in Section 'RandomBlock: Random block selection' of the main paper, for individual data sets, small optimized b_m values must be interpreted with great care, because important blocks can be attributed small optimized b_m values. The latter can occur for blocks that share much predictive information with another block that contains (slightly) more predictive information.
In such cases, the latter block will be attributed a high b_m value, whereas the former block that contains (slightly) less predictive information will be attributed a small b_m value even though it contains much predictive information. This is more efficient than attributing high b_m values to both blocks, because if two blocks with strongly overlapping predictive information had high selection probabilities, the information considered across different splits would be more similar.
By contrast, if the predictive information contained in two blocks is only mildly overlapping, the b_m values attributed to the two blocks will not strongly correlate and will be similar if the levels of predictive information contained in the two blocks are similar.
We averaged the optimized b_m values per data set for each block and investigated the correlations of these averaged block selection probabilities between the blocks. The strongest negative correlation we observed (r ≈ −0.90) was that between the mutation block and the RNA block. In Section 'RandomBlock: Random block selection' of the main paper we described the mechanism by which, if two informative blocks feature strongly overlapping predictive information, one of these blocks will be attributed a large b_m value and the other one a small b_m value. The fact that for most data sets either the b_m value of the mutation block was very large and that of the RNA block very small, or vice versa, suggests that the predictive information contained in these two blocks is both strong and strongly overlapping. The alternative explanation, namely that for each of these data sets either the mutation block or the RNA block is important and the respective other one is unimportant, is unrealistic, because each of these data sets features patients of a different cancer type and both mutation data and RNA data are known to be predictive of cancer [1,2]. That the RNA block is informative across cancer types will also become evident in the results obtained for the setting with only the clinical block and the RNA block, where the optimized b_m values obtained for the RNA block were high for the vast majority of data sets (see Section G of Additional file 1). The correlation of the (averaged) b_m values between the clinical block and the mutation block was −0.43, while it was 0.10 between the clinical block and the RNA block. The fact that the latter correlation is small and, more importantly, non-negative suggests that the information overlap between the clinical block and the RNA block is weak, which in turn suggests a high additional predictive value of the RNA block over the clinical block.
By contrast, the fact that the correlation between the clinical block and the mutation block is negative suggests a stronger information overlap between these two blocks. Thus, the additional predictive value of the mutation block over the clinical block might in general be smaller than that of the RNA block over the clinical block.
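The reported correlations are plain Pearson correlations of the per-data-set averaged b_m values. For reference, a self-contained implementation with illustrative numbers (not the actual values from the study):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# A mirror-image pattern ("either mutation or RNA gets the large b_m")
# over four hypothetical data sets yields a strongly negative correlation:
mutation_b = [0.8, 0.1, 0.7, 0.2]
rna_b = [0.1, 0.8, 0.2, 0.7]
```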
The optimized weight values associated with BlockForest (Figures S13 and S14) are quite similar to those associated with BlockVarSel. However, with BlockForest there are more often data sets for which several blocks have very large optimized weights. This can be explained by the fact that when each block is considered for each split, as in BlockVarSel, the weights of the more informative blocks (as well as those of the weak blocks) need to be sufficiently distinct. This inhibits non-optimal blocks from being used too often for splitting, but at the same time prevents large weights from being attributed to several of the more informative blocks. The block randomization of BlockForest, by contrast, has the effect that the more informative blocks are frequently not considered at the same time. As a consequence, these blocks do not compete strongly in the split selection, which is why it is not that important that the weights attributed to these blocks differ strongly. This allows for greater differences between the weights of the more informative blocks on the one hand and those of the weak blocks on the other, ensuring that the weak blocks are not used too often for splitting.

For the setting with only the clinical block and the RNA block, we additionally investigated the optimized block selection probability of the clinical block. The latter is denoted as b_clin in the following. We expected that the larger the value of b_clin, the stronger the improvement of BlockForest over RSF would be.
We assumed this to be the case because large values of the optimized selection probability of the clinical block indicate that much predictive information is contained in the clinical covariates, making it particularly effective to exploit the predictive information contained in these covariates.
In Figure S17, the values of diffC are plotted against the values of each of the three quantities n, b_clin, and signal.
In comparison to the multi-omics case, the relation between the sample size n and the values of diffC is not clearly positive. While the greatest diffC values were obtained for very small data sets, the three data sets for which RSF performed better than BlockForest were also small to medium sized. The fact that we do not observe a clearly positive association between n and diffC here might be explained by the fact that when including only the clinical block and the RNA block, only two tuning parameters have to be optimized instead of five (or sometimes four) in the multi-omics case. The optimized tuning parameter values can be expected to be more precise when there are only two blocks than when there are five (or four), in particular for small sample sizes. This is because, in general, the smaller the number of parameters in a model, the more precise the estimates of these parameters will be. For the same reason, a common rule of thumb in linear regression is that at least 10 to 15 observations are required per included covariate, because a coefficient estimate has to be obtained for each covariate in the model.
The relation between signal and diffC seems to be weakly negative overall, meaning that the improvement in prediction performance from using BlockForest instead of RSF tends to be stronger for less strongly predictive covariates. Nevertheless, for the three data sets for which RSF performed better than BlockForest, the signal is rather weak.

H Clinical covariates plus mutation measurements: Performances of the ten considered methods

Figure S28 shows the results in the same form as in Figures 1 and 3 in the main paper. Note that this analysis was performed using 19 instead of 20 data sets, because the data set READ does not feature mutation data. For all ten prediction methods, the mean data set specific C index values are smaller than in the case of using the RNA block. We also performed a paired Student's t-test for each of the ten prediction methods to test for significant differences between the data set specific C index values obtained using the RNA block and those obtained using the mutation block. After adjustment for multiple testing with the Bonferroni-Holm method, the differences in performance were significant at the 5% level for four of the ten methods. A further notable difference to the case of using the RNA block is that priority-Lasso and IPF-LASSO performed much better relative to the other methods. BlockForest had the lowest mean rank (3.68) with respect to the data set specific C index values (it did not, however, feature the lowest median rank, as can be seen from Figure S28), followed by priority-Lasso with mean rank 3.74 and IPF-LASSO and RandomBlock with mean ranks 4.21 and 5.26, respectively. The ranks of BlockForest featured the lowest variance among these four methods. Again we used t-tests to test for superiority of the variants over RSF, adjusting for multiple testing using Bonferroni-Holm.
Here, only BlockForest was significantly superior to RSF (adjusted p-value: 0.003). While the mean ranks of BlockForest and priority-Lasso are very similar, Figure S29 reveals that these two methods performed quite differently for many of the data sets.

… (SplitWeights), 11.00 (priority-Lasso). The mean rank of RandomBlock_clin is substantially smaller than that of the other methods. Excluding the three data sets for which only four of the five blocks were available did not change these results notably (see Figure S31). For seven of the twenty data sets the improvement of the best performing method RandomBlock_clin over RSF was greater than 0.05 in terms of absolute difference (see Figure S32), while for BlockForest this was the case for only four data sets (see Figure ?? in the main paper).
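The Bonferroni-Holm adjustment used for all of these multiple-testing corrections is a standard step-down procedure; a minimal sketch (ours, not code from the paper):

```python
def holm_adjust(pvalues):
    """Bonferroni-Holm step-down adjusted p-values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # multiply the k-th smallest raw p-value by (m - k + 1),
        # then enforce monotonicity of the adjusted values
        adj = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted
```

For example, holm_adjust([0.005, 0.01, 0.03, 0.04]) yields approximately [0.02, 0.03, 0.06, 0.06]; the largest raw p-value is pulled up to preserve monotonicity.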
Also in the case of using only the clinical block and the RNA block, including the clinical covariates mandatorily improved the prediction results for each variant (see Figure S33). However, the improvement of RandomBlock_clin over RandomBlock was less strong than in the multi-omics case. We obtained the following mean ranks (from best to worst): … While BlockForest_clin thus had the lowest mean rank, it does not set itself apart as strongly from the second best method (BlockForest) as RandomBlock_clin did in the multi-omics case. While both BlockForest_clin and BlockForest outperformed RSF by more than 0.05 in terms of absolute difference for the same five data sets, the improvements obtained for these data sets are (slightly) stronger in the case of BlockForest_clin (see Figure S34).
For the analysis in which we used the clinical block and the mutation block, the variants that include the clinical covariates mandatorily did not all perform better than the corresponding original variants (see Figure S35). However, we do observe considerable improvements here in some of the cases. The other blocks appear to contain less concise information, making the estimation more variable. Note that priority-Lasso is among the best methods here; IPF-LASSO also performs relatively well. Figure S36 reveals that BlockForest_clin, the method that performed best according to the mean ranks, was better than RSF by more than 0.05 in terms of absolute difference for 10 of the 19 data sets (the mutation block was missing for one data set).

[Figure caption fragment] … BlockForest relative to that of RSF. Differences between the mean C index values obtained using BlockForest_clin / BlockForest and those obtained using RSF, ordered by the difference between the values obtained for BlockForest_clin and RSF.