Due to their role as post-transcriptional regulators of around 30% of the human genome and their involvement in cancer development and progression [17, 18, 20, 50], miRNAs are becoming increasingly important for our understanding of the mechanisms leading to cancer. Because miRNAs are shorter than mRNAs, they are more stable and in general more resistant to degradation. Consequently, miRNA expression is measurable even in serum and paraffin-embedded samples, where mRNA expression is hardly detectable.
Several studies have combined gene and miRNA expression data [52, 53] or gene expression data with miRNA target predictions to infer new miRNA regulation activities. In addition, several tools have been developed to integrate such data [55, 56]. In most cases, the analysis relies mainly on correlations between mRNA and miRNA expression profiles obtained from matched samples and on target prediction scores.
While there are several approaches that integrate mRNA and miRNA data to discover novel regulatory relations between miRNAs and mRNAs, there is still a lack of prediction methods that combine both kinds of data in one common prediction model. A central problem with such high-dimensional data is the tendency to overfit. When several omics data sets are integrated, the number of features increases, which makes feature selection even more important.
In this article we introduce a method capable of fusing mRNA and miRNA expression data in one model to predict a clinical endpoint. Likelihood boosting was used to fit the risk prediction models because of its performance and its ability to implicitly select features during training. The correlations between miRNAs and mRNAs and target prediction information were used to model the relations between miRNAs and mRNAs. These two sources of information were combined at the p-value level using the method of Stouffer. From the combined p-values a bipartite graph was constructed that covers the relations between the two types of features.
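The combination step described above can be illustrated with a minimal sketch. It assumes one-sided p-values, equal weights by default, and a hypothetical significance cutoff `alpha` for keeping an edge; the function names, the input layout, and the cutoff are illustrative assumptions and are not taken from the implementation used in this study.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def stouffer_combine(p_values, weights=None):
    """Combine one-sided p-values with Stouffer's Z-score method."""
    if weights is None:
        weights = [1.0] * len(p_values)
    # Clip to avoid infinite z-scores at p = 0 or p = 1 (illustrative choice).
    eps = 1e-12
    clipped = [min(max(p, eps), 1.0 - eps) for p in p_values]
    # Convert each p-value to a z-score: small p gives a large positive z.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in clipped]
    # Weighted sum of z-scores, rescaled to unit variance.
    z_combined = sum(w * z for w, z in zip(weights, z_scores))
    z_combined /= sqrt(sum(w * w for w in weights))
    return 1.0 - _STD_NORMAL.cdf(z_combined)


def build_bipartite_edges(pair_p_values, alpha=0.05):
    """Keep miRNA-mRNA pairs whose combined p-value is below alpha.

    pair_p_values maps (mirna, mrna) to a tuple
    (correlation p-value, target prediction p-value).
    The cutoff alpha is a hypothetical parameter for this sketch.
    """
    edges = {}
    for (mirna, mrna), (p_corr, p_target) in pair_p_values.items():
        p_combined = stouffer_combine([p_corr, p_target])
        if p_combined < alpha:
            edges[(mirna, mrna)] = p_combined
    return edges
```

In this sketch, a pair with two moderately small p-values can receive a combined p-value smaller than either input, which is the property that makes a p-value level combination attractive for merging correlation and target prediction evidence.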
The integration of this graph into boosting improves the models in terms of prediction error. Here the clinical endpoint was biochemical relapse in prostate cancer, using a combined miRNA/mRNA data set of 98 patients. The comparison of the IPECs showed a significant reduction of the prediction error compared with boosting on the single data sets or on the combined data set without the bipartite graph. We used the .632 bootstrap estimator of the prediction error because of its simplicity. Other estimators such as the .632+ estimator are often used for prediction error estimation in survival models [15, 41, 58]. The .632+ estimator might be less biased but is computationally more expensive. First tests with the .632+ estimator led to comparable results.
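For orientation, the .632 bootstrap estimator weights the apparent error (model evaluated on its own training data) by 0.368 and the mean out-of-bag error by 0.632. The following sketch applies this pointwise to prediction error curves and integrates the result with the trapezoidal rule. The function names and the assumption that all curves share one time grid are illustrative and do not describe the software used here.

```python
def estimate_632(apparent_error, oob_errors):
    """Return the .632 bootstrap estimate for a single error value."""
    mean_oob = sum(oob_errors) / len(oob_errors)
    return 0.368 * apparent_error + 0.632 * mean_oob


def estimate_632_curve(apparent_curve, oob_curves):
    """Apply the .632 estimator pointwise to prediction error curves.

    apparent_curve: errors on a fixed time grid for the full-data model.
    oob_curves: one out-of-bag error curve per bootstrap run, same grid.
    """
    combined = []
    for t_index, apparent in enumerate(apparent_curve):
        oob_at_t = [curve[t_index] for curve in oob_curves]
        combined.append(estimate_632(apparent, oob_at_t))
    return combined


def integrate_curve(times, errors):
    """Integrate an error curve over time with the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (errors[i] + errors[i - 1]) * (times[i] - times[i - 1])
    return area
```

The .632+ variant additionally adjusts the weights by an estimate of the relative overfitting rate, which is why it is more expensive to compute; that adjustment is not shown in this sketch.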
With the graph, feature selection became more stable with respect to how often a specific feature was picked in the 500 bootstrap runs. By transferring the weights in the graph from mRNAs to miRNAs, the latter were favored. However, it is important to note that miRNA expression data alone failed to predict the relapse as accurately as the combined data with the graph. This may be because one miRNA can have several targets, so that dysregulation of a miRNA can affect multiple molecular pathways with no direct connection to the outcome. Therefore, the genes as effectors seem to be a mandatory source of information. Among the top 10 features picked using the graph there are some miRNAs found to play a role in prostate cancer, e.g. hsa-miR-128. However, most of the miRNAs have not been associated with prostate cancer before. It is therefore important to note that it is not straightforward to derive functional implications for single biomarkers from a panel found by a prediction model. The strength of our method is to find miRNA-gene combinations with high predictive power. To investigate whether the selected genes show differences in functional annotations, we also performed a GO enrichment test for the top 100 genes of CoxBoost with and without the graph (data not shown). The two sets showed different enriched GO terms. However, no clear pattern of cancer-related processes emerged.
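The stability of feature selection mentioned above can be summarized as the fraction of bootstrap runs in which each feature is selected. The following sketch shows one simple way to compute and rank such frequencies; the function names and the input format (one list of selected feature names per run) are assumptions made for illustration.

```python
from collections import Counter


def selection_frequencies(selected_per_run):
    """Return the fraction of runs in which each feature was selected."""
    n_runs = len(selected_per_run)
    counts = Counter()
    for selected in selected_per_run:
        # Count each feature at most once per bootstrap run.
        counts.update(set(selected))
    return {feature: count / n_runs for feature, count in counts.items()}


def top_features(frequencies, k=10):
    """Return the k most frequently selected features, ties broken by name."""
    ranked = sorted(frequencies, key=lambda f: (-frequencies[f], f))
    return ranked[:k]
```

A more stable selection would show up as a larger share of features with frequencies close to 1 and fewer features picked in only a handful of runs.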
To assess how our method performed in comparison with other methods suited for time-to-event data, Lasso and RSF were tested on the same data set using the same bootstrap samples. In both cases CoxBoost with the bipartite graph showed a significantly lower prediction error. RSF performed better than Lasso, which in turn performed worse than CoxBoost without the graph on this data set. The runtime of RSF and Lasso was considerably longer than the runtime of CoxBoost with the graph on our test system. In this study we used the standard implementations of Lasso and RSF as a reference. As far as we know, there are no established ways to combine Lasso or RSF with a graph to guide the feature selection. It might be interesting to see whether such methods would improve the prediction error as well. Other ways of fusing miRNA and mRNA expression data into one model, e.g. bundling or kernel-based methods, have also not been considered. Such methods offer a very flexible way of combining different prediction models and might also lead to improvements in terms of prediction error.
To minimize the possibility of overfitting, one CoxBoost model was trained with correlations calculated only on the training data of each bootstrap sample. The resulting prediction error is higher than for the models with correlations calculated once on the whole data set, but it is still significantly lower than for CoxBoost without the graph. Further, we showed that the prediction could be improved using the target prediction information from MicroCosm. To test the influence of the target prediction database, we also incorporated the target predictions from TargetScan. This, however, resulted in a higher prediction error, which can possibly be explained by the lower coverage of TargetScan. Of the 723 miRNAs in the data set, only 170 (about 24%) could be found in TargetScan with a p-value. In comparison, the MicroCosm predictions contained p-values for 698 of the 723 miRNAs (about 97%).
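The idea of computing the miRNA-mRNA correlations only on the training part of each bootstrap sample, so that no information from the out-of-bag patients enters the graph, can be sketched as follows. The helper names and the use of a plain Pearson correlation are assumptions for illustration; the correlation measure and resampling code of the actual analysis may differ.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention for constant input in this sketch
    return cov / sqrt(var_x * var_y)


def bootstrap_split(n_samples, rng):
    """Draw in-bag indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, out_of_bag


def in_bag_correlation(mirna_values, mrna_values, in_bag):
    """Correlate one miRNA with one mRNA using in-bag patients only."""
    x = [mirna_values[i] for i in in_bag]
    y = [mrna_values[i] for i in in_bag]
    return pearson(x, y)
```

In such a setup the graph would be rebuilt from `in_bag_correlation` values in every bootstrap run, and the prediction error would be evaluated on the out-of-bag patients only.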
While miRNA and mRNA expression data obtained from microarray experiments were used in this study, the method is independent of the underlying experimental setup. Next-generation sequencing data could, after the necessary preprocessing steps, be used in a similar manner. We presented the fusion of both data sets with respect to a prognostic time-to-event endpoint. However, binary endpoints such as diagnostic questions or treatment response prediction can be tackled in a similar fashion. This would lead to classification problems, for which boosting was originally designed and for which powerful approaches have been formulated. In our setting we would replace the CoxBoost algorithm with GAMBoost.