We have presented a novel workflow for generating interpretable multi-omic models. We illustrated the approach in an IBD case study. We analyzed the impact of configuration parameters, and compared CANTARE models to those generated by random forests and elastic net penalized regression. CANTARE models are competitive with random forests and elastic net, and more likely to include features from more omes. In this section, we discuss the biomedical implications of the best-performing model, provide additional context to the VOLARE-CANTARE workflow, and elaborate on the comparisons to random forests and elastic net. We do not claim that our methodology is inherently better than these alternatives. Rather, we claim that it offers a novel approach for generating interpretable multi-omic models. We discuss limitations and strengths of our approach and conclude with selected thoughts about future work.
Biomedical interpretation of Eubacterium hallii model (Model E)
To demonstrate that CANTARE models are parsimonious and interpretable, supporting further investigation, we offer the following discussion of Model E, the best performing in terms of AUC. The 13 predictors include 2 gut microbes, 5 microbial enzymes, 4 metabolites, age, and calprotectin. This model includes predictors that are protective of IBD (OR < 1) or increase the risk for IBD (OR > 1). In the following, all odds ratios are expressed in terms of changes from the 25th percentile to the 75th percentile (IQR). The 2 gut microbes, Eubacterium hallii (OR = 0.19) and Oscillibacter unclassified (OR = 0.64), are well described in the literature, as abundant members of a healthy gut microbiome that are often found in decreased abundance in IBD [38,39,40]. These identified microbes may be targets for modifying the IBD microbiome composition towards a healthier gut through diet . Additionally the small molecules ascorbate (Vitamin C) (OR = 0.14), pyridoxamine (Vitamin B6) (OR = 0.16), and choline (OR = 0.15), which are found in various fruits, vegetables, and grains have been implicated in decreasing inflammation in IBD, and therefore are potential dietary modifiers that can be used towards nutritional strategies to prevent and treat IBD [41,42,43].
Shikimate kinase (OR = 0.10) operates within the shikimate pathway found in plants and bacteria and contributes to assembly of the basic building blocks for the range of aromatic metabolites and aromatic amino acids . Metabolites that are derived from aromatic compounds provide ultraviolet protection, electron transport, and signaling molecules, and they serve as antibacterial agents which are beneficial to gut health . In the diet, the presence of glyphosate from genetically modified agricultural environments disables Shikimate kinase and may result in the imbalances of the gut bacteria in IBD .
In clinical practice, fecal calprotectin is a highly sensitive biomarker that is routinely used as a marker of endoscopically active IBD . In the model, higher levels of fecal calprotectin are expectedly associated with higher risk of IBD (OR = 19.83), but interestingly the enzyme methionine adenosyltransferase (MAT) has a higher predictive value (OR = 38.24). MAT genes encode enzymes responsible for the biosynthesis of S-adenosylmethionine, the principal biological methyl donor and precursor of polyamines and glutathione . There is increasing evidence suggesting that MATs play significant roles in the development of cancers . IBD patients have chronic inflammation which is an underlying risk factor for colon cancer, and mouse models demonstrate that tumor necrosis factor α (TNF-α), a target of IBD treatments, plays a critical role in development of inflammation-induced colon cancer [48,49,50]. MAT may be a novel biomarker with higher sensitivity than calprotectin that has not yet been studied in IBD patients. Two enzyme predictors of IBD, aminoacyl tRNA hydrolase (OR = 8.06) and phosphofructokinase (OR = 5.69), are observed in the literature related nonspecifically to pharmacotherapy and microbial metabolism, which may be related to IBD treatment and commonly associated microbes . Although the model indicates the enzyme H + transporting two sector ATPase (OR = 0.01) is a strong predictor for protection, there is no clear role for it in IBD.
IBD is a complex disease with heterogenous clinical outcomes. The CANTARE models include features associated with reduced and increased odds of IBD. As such, these models may provide valuable insights for identifying non-invasive clinical biomarkers for therapeutic intervention to IBD patients. These features may also aid in characterizing IBD pathogenesis, etiology, and diagnosis. Though our discussions of biological relevance are speculative, the identified predictors provide the foundation for testable hypotheses in future studies.
CANTARE workflow and parameter considerations
At a more general level, this IBD case study demonstrates that CANTARE supports multi-omic hypothesis generation; identifies quantitative, scorable relationships among a handful of analytes and an outcome of interest, with analytes drawn from multiple omes; delivers individual-level visualization of these relationships, for both the pairwise cross-omic relationships and multi-variable predictive models; and provides flexibility in both the building of the Vnet and the specification of the predictive models by leveraging existing regression machinery. For example, the pairwise cross-omic regressions of the Vnet used linear regression with an interaction term for IBD group, while the CANTARE model used logistic regression to predict IBD. Our primary goal in formulating the pairwise cross-omic regressions is to generate straightforward relationships between the analytes that can be easily visualized and vetted in units of measure that are familiar to biologists. In some cases, it might be appropriate to incorporate other covariates into these models, and present the pairwise regressions as partial correlation plots of residuals. Such residuals would represent a transformation of the underlying data, upstream of the VOLARE workflow.
The Vnet provides a multi-omic scaffold for the CANTARE models. In this case study, it includes the same number of edges for each of the cross-omic comparisons (enzymes to microbes, microbes to metabolites, and metabolites to enzymes). All but one of the 20 neighborhoods, from which CANTARE models were built, spanned all the three omes. While it is possible that any particular Vnet contains only two-node subnetworks with no additional connecting edges, we have not yet observed such a network. Biologically, highly connected networks, informally called “hairballs,” seem more common than sparse networks . In addition, non-neighborhood variables of interest can be included in the CANTARE models. In the IBD case study, we included age and calprotectin. Other predictors of known relevance, whether omic or not, could also be included. In a fine-tuning step, predictors from disjoint neighborhoods (e.g. S and E, Fig. 1d) could be combined.
Turning to algorithmic details, we examined the impact of CANTARE configuration parameters—both the configuration of the Vnet and the order of the network neighborhood for predictive models. The predictive models built from a Vnet that accounted for IBD with an interaction term performed better than did those built from a Vnet that did not account for IBD. We reason that the better performing models were seeded with underlying information about IBD group. That said, the predictive models built from the correlation Vnet might provide different insights, since the underlying relationships represent strong correlations between analytes, independent of IBD. As expected, increasing the order (and thus number of nodes) of the network neighborhood introduces a tradeoff between model performance and number of predictors. In general, the larger networks have better performance at the cost of a larger less-parsimonious model. Neighborhoods of order 3 yielded fewer unique models than neighborhoods of order 2 (11 versus 16, with AUC > 0.8), due to overlapping neighborhoods. If the edge cut for the Vnet were more permissive (e.g. 50 per assay pair instead of 35 per assay pair), it is likely that there would be more unique models with increasing neighborhood order. Also, in a few cases, the model estimation failed to converge. Our case study included 153 participants. As the number of samples increases, more predictors can be supported. For example, Harrell suggests 10 to 15 events per simple predictor in logistic regression models . Thus, study size may influence the appropriate neighborhood order.
Comparison to random forests and penalized regressions
We also compared CANTARE models to penalized regressions and random forests, considering overall performance (as measured by AUC), sample-level predictions, and the predictors themselves. The AUC values of CANTARE models were comparable to those of random forests and penalized regressions, whether the forests or regressions were generated with the universe of multi-omic data or the data underlying the Vnet. The dynamic range of the predicted probabilities of the CANTARE models was greater than that of random forests and penalized regressions. The CANTARE models and the V penalized regression models included predictors from at least two omes, with model E and two others (hubs of Prevotella_copri and Veillonella_parvula) having predictors from all three omes.
The most important variables in all six random forests included only metabolites. Metabolites account for 58% of the analytes in the U forests and 50% of the analytes in the V forests. These proportions are not so large as to make metabolite-only forests probable. Thus, there may be aspects of the distribution of metabolites that make them conducive to selection by random forests. Microbes are included in only the CANTARE models and V regressions. The zero-heavy distribution of many species may make them less likely to be important predictors for random forests and penalized regressions. In the case of penalized regression, the absolute value of each regression coefficient contributes to the overall constraint. Thus, a predictor that provides additional model accuracy for only a handful of samples may be suboptimal. For example, Odoribacter splanchnicus, which is a predictor in model A, has only 21 non-zero values (11 control, 10 IBD). Including this predictor in a penalized regression would impact the model for these 21 samples only.
Both random forests and penalized regressions have random components that result in slightly different important variables or predictors on repeated runs. The random components in forests include the selection of samples for building each tree, and the subset of variables that are considered at each split of the tree. The random component of penalized regression is in the cross-validated estimation of lambda, a constraint on the model coefficients. The glmnet package offers a function for automatically selecting lambda as the largest value that is within one standard error of the value which minimizes error. While these random components aim to reduce overfitting, analysis of any single model can only be considered representative. CANTARE, in contrast, is deterministic, although susceptible to overfitting.
There are limitations to this work. First, we provide only one case study. However, we examined the effect of configuration parameters and compared a variety of CANTARE models to random forests and penalized regression, considering overall performance, predicted probabilities, and included predictors. These analyses place CANTARE in context with other algorithms. In other work , we applied the method to continuous cardio-metabolic outcomes associated with a short-term weight loss intervention. Second, the current implementation is a collection of R scripts accessible to analysts who have intermediate R skills and are comfortable formulating regression models and dissecting the result objects. While this may be a barrier to potential users, it does allow flexibility in specification of the pairwise regressions and the predictive models. Third, we cannot claim that our models are optimal. There may well be other multi-omic parsimonious models with superior performance. Fourth, as regression models, the CANTARE models are subject to the general constraints of linear regressions, such as linearity with log odds or continuous outcomes, normal distribution of the errors, and little to no multicollinearity between predictors . Fifth, while the CANTARE models provide a consolidated quantitative framework for multiple omic predictors, they do not provide a mechanistic framework suggesting a sequence of linked events (e.g. Bacteroides fragilis interacts with regulatory T cells which in turn produce IL-10 ).
The CANTARE workflow offers a number of strengths. First, the CANTARE models are parsimonious and interpretable. They identify a handful of predictors that collectively support conversation, further investigation in the literature, and follow-up experiments. Second, this handful of predictors is multi-omic by design. The Vnet consists of edges that encapsulate relationships across omes, with a similar number of edges for each pair of omes. The CANTARE models are seeded with subsets of this Vnet. Third, the relationships between the predictors and the outcome are quantitative and directional. We can identify which predictors have the largest effect sizes, and whether they are advantageous or disadvantageous to the outcomes. Fourth, the workflow is customizable and allows the analyst to leverage a variety of regression models such as linear regression, logistic regression, and mixed effects models. Fifth, because they are regression models, the CANTARE models can be scored by well-established techniques such as AUC or mean-squared error. These scoring techniques support the evaluation of workflow configuration. Sixth, the workflow is supported by visual representations of key steps. Both the pairwise regression plots and the cumulative fit plots of the CANTARE models illustrate person-level patterns. Seventh, given a particular set of tuning decisions, the results are deterministic.
The analysis presented herein suggests several areas of future work. First, different types of regressions can be used in both the VOLARE and CANTARE components of the method. For example, quadratic or cubic terms could be added to the pairwise regressions to identify non-linear relationships. Significant terms could then be incorporated into predictive models. Second, an effort to better understand why certain methods select predictors from certain omes is warranted. This future work could consider the impact of group balance within data sets, and might include an analysis of the interactions and functional forms of relationships between predictors selected by random forests. Such work could influence both upstream data transformations specific to particular omes, and strategies for making algorithms more ome aware. Third, interactive visualization of multi-omic models could aid in their interpretation. For example, combining cumulative fit plots (Figs. 2, 3) with model-specific data distributions similar to Fig. 8 with synchronized brushing would allow an analyst to select outliers for one predictor and identify both the cumulative fit paths of the associated samples and the location of the selected samples within the distributions of other predictors. This would likely facilitate the identification of subgroups within the context of the predictive model.