A knowledge graph approach to predict and interpret disease-causing gene interactions

Background
Understanding the impact of gene interactions on disease phenotypes is increasingly recognised as a crucial aspect of genetic disease research. This trend is reflected by the growing amount of clinical research on oligogenic diseases, where disease manifestations are influenced by combinations of variants on a few specific genes. Although statistical machine-learning methods have been developed to identify relevant genetic variant or gene combinations associated with oligogenic diseases, they rely on abstract features and black-box models, posing challenges to interpretability for medical experts and impeding their ability to comprehend and validate predictions. In this work, we present a novel, interpretable predictive approach based on a knowledge graph that not only provides accurate predictions of disease-causing gene interactions but also offers explanations for these results.

Results
We introduce BOCK, a knowledge graph constructed to explore disease-causing genetic interactions, integrating curated information on oligogenic diseases from clinical cases with relevant biomedical networks and ontologies. Using this graph, we developed a novel predictive framework based on heterogeneous paths connecting gene pairs. This method trains an interpretable decision set model that not only accurately predicts pathogenic gene interactions, but also unveils the patterns associated with these diseases. A unique aspect of our approach is its ability to offer, along with each positive prediction, explanations in the form of subgraphs, revealing the specific entities and relationships that led to each pathogenic prediction.

Conclusion
Our method, built with interpretability in mind, leverages heterogeneous path information in knowledge graphs to predict pathogenic gene interactions and generate meaningful explanations.
This not only broadens our understanding of the molecular mechanisms underlying oligogenic diseases, but also presents a novel application of knowledge graphs in creating more transparent and insightful predictors for genetic research. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05451-5.

Table A1. Source databases used to construct BOCK. Bioinformatics databases used for this study, with their versions. The specific information extracted from each database is indicated, as well as the corresponding KG component it is used for (see Supplementary Table A2).

Table A2. KG component integration into BOCK. Data integration of the BOCK knowledge graph by component. Each component describes a different biological level or view and has been integrated into the knowledge graph from a source network or ontology database (see Supplementary Table A1) as KG node types and edge types. For most components, an edge score is computed and used as an edge property. For some components, the integration process required a filtering stage to limit the noise and the size of the integrated network. More details about the computation of scores and the filtering steps are provided in the Methods section "BOCK data integration and resources".

Table A3. Gene contributions from BOCK sources. We calculated #Genes as the total number of genes present in each source network. #Exclusive genes was obtained by counting the genes present exclusively in that specific network resource.
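The #Exclusive genes column described above is a per-source set difference. A minimal sketch of the computation, using illustrative source names and toy gene sets rather than actual BOCK data:

```python
def exclusive_gene_counts(source_genes):
    """For each source network, count genes found in no other source.

    source_genes: dict mapping source name -> set of gene identifiers.
    """
    counts = {}
    for name, genes in source_genes.items():
        # Union of the gene sets of every *other* source
        others = set().union(*(g for n, g in source_genes.items() if n != name))
        counts[name] = len(genes - others)
    return counts

# Illustrative toy input (not real BOCK source networks)
sources = {
    "PPI": {"BRCA1", "TP53", "EGFR"},
    "Pathways": {"TP53", "KRAS"},
    "Coexpression": {"EGFR", "MYC"},
}
print(exclusive_gene_counts(sources))  # {'PPI': 1, 'Pathways': 1, 'Coexpression': 1}
```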

Appendix B Disease-causing gene pair details
This appendix provides detailed information on the selection and analysis of disease-causing gene pairs. It covers the assignment of oligogenic evidence levels to each variant combination (Supplementary Table B1), the selection process of pathogenic gene pairs based on these levels (Supplementary Table B2), and specifics of the independent test set of 15 selected pathogenic gene pairs (Supplementary Table B3).

Table B3. Independent test set details. Details on the selected held-out test set of 15 pathogenic gene pairs. This test set was selected by an automatic procedure designed to favour diverse, confident and recently published cases: first, all disease-causing gene pairs were ranked by the publication date of their first associated article; then, going from the most recent to the oldest, a gene pair was chosen if its evidence level was at least Moderate (Supplementary Table B1) and if none of its genes overlapped with those of previously selected pairs.
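The selection procedure above can be sketched as a simple greedy pass over date-ranked pairs. Function and field names here are illustrative assumptions, not the authors' code:

```python
def select_test_set(pairs, min_level_ok):
    """Greedily select a diverse held-out test set of pathogenic gene pairs.

    pairs: list of dicts with keys 'genes' (tuple of two gene names),
           'pub_date' (ISO date of the first associated article) and
           'evidence' (oligogenic evidence level).
    min_level_ok: predicate, True if the evidence level is at least Moderate.
    """
    selected, used_genes = [], set()
    # Rank by publication date, most recent first, to favour recent cases
    for pair in sorted(pairs, key=lambda p: p["pub_date"], reverse=True):
        g1, g2 = pair["genes"]
        # Keep only confident pairs whose genes do not overlap earlier picks
        if min_level_ok(pair["evidence"]) and g1 not in used_genes and g2 not in used_genes:
            selected.append(pair)
            used_genes.update((g1, g2))
    return selected
```

For example, with four candidate pairs, a pair sharing a gene with a more recent selection, or one with insufficient evidence, is skipped.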

Appendix C ARBOCK framework parameter analysis
We present here in more detail the parameters used in the ARBOCK framework (Table C1) and how their choice was empirically determined. In particular, we analyse the effect of the alpha, minsup ratio and max rule length parameters on the model's performance, measured by the Area Under the Receiver Operating Characteristic curve (AUROC), and on explanation complexity, measured mainly by the average number of paths present in the explanation subgraph for True Positive (TP) instances. Note that for this benchmark, AUROC was chosen as the performance metric because it meaningfully evaluates and compares models irrespective of class imbalance and threshold settings, without being artificially inflated by models that merely predict the majority class (e.g. sets of parameters leading to a model with only the default rule), which matters in our highly imbalanced setting. Figure C2 presents a comprehensive analysis of the best-performing models (considering all other parameters) according to the alpha parameter of the decision set model. At α=0.5, performance is maximised, and higher values lead to greater explanation complexity, measured as the number of matching rules and the explanation path count. This model corresponds to path cutoff=3; minsup ratio=0.2; max rule length=3.
Figure C3 shows the combined effect of minsup ratio and max rule length, for fixed α=0.5 and path cutoff=3, on performance and explanation path count. A higher max rule length leads to similar performance but at the cost of increased explanation complexity. A lower minsup ratio leads to better performance and fewer paths in explanations.
Fixing path cutoff=2, as presented in Figures C4 and C5, leads to a maximum performance of 0.67 across all evaluated parameter sets, but yields simpler explanations in terms of path count.

Table C1. Framework parameters summary. This table presents the parameters defined in the ARBOCK approach. For each parameter, a description, the relevant step in the process, the algorithm employed in that step, the range of potential values, and the specific value chosen for this study are provided.
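Two ingredients of the benchmark above are the rank-based AUROC metric and the rule of keeping all parameter settings within 0.01 of the top performer. A minimal, self-contained sketch of both (the per-setting scores below are illustrative placeholders, not the study's results):

```python
def auroc(labels, scores):
    """Rank-based AUROC: probability that a random positive is scored
    above a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative mean cross-validation AUROCs per parameter setting
mean_aurocs = {
    ("path cutoff=3", "minsup ratio=0.2", "max rule length=3"): 0.80,
    ("path cutoff=3", "minsup ratio=0.4", "max rule length=3"): 0.795,
    ("path cutoff=2", "minsup ratio=0.2", "max rule length=2"): 0.67,
}
best = max(mean_aurocs.values())
# Keep every setting within 0.01 of the highest performer, as in the
# model-selection rule used for the figures in this appendix
top_models = [p for p, a in mean_aurocs.items() if a >= best - 0.01]
```

Because AUROC only depends on the ranking of scores, a model that predicts the majority class everywhere gains nothing from class imbalance, which is why it is used here.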

Fig. C2. Influence of tuning the decision model's alpha parameter. We evaluate decision set models over a spectrum of the decision model's alpha (α) values (0.1-0.9) and variable parameters: Path Cutoff (path cutoff) {2,3}, Minimum Support (minsup ratio) {0.2, 0.4, 0.6, 0.8} and Max Rule Length (max rule length) {2,3,4}, highlighting the performance and characteristics of the optimal models for each alpha value. This evaluation is performed on the model excluding Phenotype, which is prone to study bias. Optimal models for each alpha are selected based on mean AUROC from a stratified cross-validation, keeping all models within 0.01 of the highest performer to ensure comparable accuracy levels. (A) Mean AUROC of top-performing models evaluated via stratified cross-validation on the test sets. (B) Measures of explanation complexity for the associated models, calculated from predicted true positive (TP) gene pairs on the test sets: Rule Count measures the mean number of rules per match, and Path Count measures the mean number of paths across all rules, on average per match.
(C) Parameter distribution for the corresponding top-performing models.

Availability of evaluation
Yes: GitHub: github.com/oligogenic/bockrule mining

Table E1. DOME recommendation table, consisting of essential information to assess the machine learning approach [13]. These criteria correspond to the decision set models presented in this study.

For each gene pair, the number of phenotype terms (HPO) linked to the first and second gene is shown, as well as the Jaccard similarity index between the HPO terms of the two genes (Phenotype Similarity). The predicted probabilities are shown for the three benchmarked models: "DiGePred" [15] (a gene pair pathogenicity predictor with features at the gene and relationship level, including strong phenotype-based features), "DS incl. Pheno" (the decision set trained with inclusion of paths traversing nodes of type "Phenotype") and "DS excl. Pheno" (the decision set trained without phenotype-based paths). Green and red colors represent the predicted classification labels, respectively "disease-causing" and "neutral", based on the optimal classification thresholds of these models (DiGePred=0.496; DS incl. Pheno=0.929; DS excl. Pheno=0.788). Rows are ordered by Phenotype Similarity. The influence of phenotype annotations over gene pairs can be observed for the "DiGePred"
and the "DS incl. Phenotype" models, both reliant on phenotype-related features.

Each rule is written in its abbreviated form (see Table 1), with its conditions separated by &. Indices for node types (e.g. G1) are used in unification conditions (e.g. G1=G2) to constrain entities to be the same across different metapaths. The numerical value associated with each metapath (e.g. ≥ 0.31) sets the path reliability threshold, which conditions the minimum path reliability score of all underlying paths.
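A schematic illustration of how such a rule could be represented and matched against the paths of a gene pair. The data structures, metapath names ("MP1", "MP2") and entity names are hypothetical placeholders, not the ARBOCK implementation:

```python
def rule_matches(rule, paths_by_metapath):
    """Check whether a conjunction of metapath conditions holds.

    rule: list of (metapath, reliability_threshold, unified) triples; when
          `unified` is True, all such conditions must be satisfiable by the
          same intermediate entity (the G1=G2-style constraint).
    paths_by_metapath: dict metapath -> list of (intermediate_entity, score).
    """
    bindings = None  # entities still admissible for the unified variable
    for metapath, threshold, unified in rule:
        # Keep only paths whose reliability score meets the threshold
        entities = {e for e, score in paths_by_metapath.get(metapath, [])
                    if score >= threshold}
        if not entities:
            return False
        if unified:
            bindings = entities if bindings is None else bindings & entities
            if not bindings:
                return False  # unification constraint cannot be satisfied
    return True

# Hypothetical rule: two metapath conditions with reliability thresholds,
# both unified on the same intermediate entity
rule = [("MP1", 0.31, True), ("MP2", 0.2, True)]
paths = {
    "MP1": [("GENE_X", 0.40)],
    "MP2": [("GENE_X", 0.50), ("GENE_Y", 0.10)],
}
```

Here the rule matches: GENE_X satisfies both conditions above their thresholds, while GENE_Y is filtered out by the 0.2 threshold on MP2.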

Fig. C3. Influence of tuning the minimum support and max rule length parameters. Heatmaps showing the influence of the minsup ratio and max rule length parameters on performance and explanation complexity for decision models trained with paths excluding Phenotype and with fixed parameters α = 0.5 and path cutoff = 3. (A) Effect on model performance, evaluated with mean AUROC on a stratified cross-validation. (B) Effect on model explanation complexity, evaluated with the Path Count measure, i.e. the mean number of paths across all rules, on average per True Positive match.
Fig. C4. Influence of tuning the decision model's alpha parameter with path cutoff=2. We evaluate decision set models with a fixed Path Cutoff = 2, over a spectrum of the decision model's alpha (α) values (0.1-0.9) and variable parameters: Minimum Support (minsup ratio) {0.2, 0.4, 0.6, 0.8} and Max Rule Length (max rule length) {2,3,4}, highlighting the performance and characteristics of the optimal models for each alpha value. This evaluation is performed on the model excluding Phenotype, which is prone to study bias. Optimal models for each alpha are selected based on mean AUROC from a stratified cross-validation, keeping all models within 0.01 of the highest performer to ensure comparable accuracy levels. (A) Mean AUROC of top-performing models evaluated via stratified cross-validation on the test sets. (B) Measures of explanation complexity for the associated models, calculated from predicted true positive (TP) gene pairs on the test sets: Rule Count measures the mean number of rules per match, and Path Count measures the mean number of paths across all rules, on average per match. (C) Parameter distribution for the corresponding top-performing models.

Fig. G2. Comparison of predictions on held-out pathogenic gene pairs with different phenotype annotations. Comparison of three predictive models: "DiGePred" [15] (a gene pair pathogenicity predictor with features at the gene and relationship level, including strong phenotype-based features), "BOCK-DS" (the decision set trained without phenotype-based paths) and "BOCK-DS incl. Phenotype" (the decision set trained with inclusion of paths traversing nodes of type "Phenotype"). The 15 held-out disease-causing gene combinations are evaluated on the three models, and the number of gene pairs is shown for each predicted label (D=Disease-causing; N=Neutral). The overlaid stripes indicate that both genes of a combination are annotated with at least one Human Phenotype Ontology (HPO) term.

Fig. H1. Influence of BOCK metaedge ablation on performance. Considering a decision set model, its predictive performance is estimated by the mean Area Under the Receiver Operating Characteristic curve (AUROC, red) and the mean Area Under the Precision-Recall Curve (AUPRC, orange). These metrics are evaluated on a 10-fold stratified cross-validation. The 'Baseline' bar (bold) represents the distribution of these metrics for a model excluding the Phenotype relationship (GaP metaedge). Relative to that baseline, all other bars represent models where one additional metaedge (type of edge, as summarised in Table 1) has been removed from BOCK. Error bars indicate the standard deviation of the metric over all evaluated test folds.
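The ablation protocol behind Figure H1 can be sketched generically: drop one metaedge type at a time, re-evaluate, and compare against the baseline. The metaedge names and the evaluation callable below are placeholders, not the BOCK schema or the actual training pipeline:

```python
def ablation_study(metaedges, evaluate):
    """Remove one metaedge type at a time and measure the performance drop.

    metaedges: list of edge-type names present in the graph.
    evaluate:  callable taking the set of retained metaedges and returning
               a performance score (e.g. mean AUROC over a 10-fold
               stratified cross-validation).
    """
    baseline = evaluate(set(metaedges))
    impact = {}
    for edge in metaedges:
        retained = set(metaedges) - {edge}
        # Positive impact means the removed metaedge was helping the model
        impact[edge] = baseline - evaluate(retained)
    return baseline, impact

# Toy evaluation: score is just the number of retained metaedges
baseline, impact = ablation_study(["E1", "E2", "E3"], lambda kept: len(kept))
```

In the actual study, `evaluate` would retrain the decision set on the reduced graph and return the cross-validated AUROC or AUPRC.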

Table G3. Independent test set predicted probabilities in relation with phenotype. Table presenting the 15 pathogenic gene pairs held out as the independent test set.
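The Phenotype Similarity used to order this table is a Jaccard index over the HPO term sets of the two genes. A minimal sketch, using arbitrary placeholder HPO identifiers rather than real annotations:

```python
def phenotype_similarity(hpo_terms_gene1, hpo_terms_gene2):
    """Jaccard similarity between the HPO term sets of two genes."""
    a, b = set(hpo_terms_gene1), set(hpo_terms_gene2)
    if not a and not b:
        return 0.0  # convention assumed here when neither gene is annotated
    return len(a & b) / len(a | b)

# Placeholder HPO identifiers, not real gene annotations
g1_terms = {"HP:0000001", "HP:0000002", "HP:0000003"}
g2_terms = {"HP:0000002", "HP:0000003", "HP:0000004"}
print(phenotype_similarity(g1_terms, g2_terms))  # 0.5
```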