Enabling personalised disease diagnosis by combining a patient’s time-specific gene expression profile with a biomedical knowledge base

Background: Recent developments in the domain of biomedical knowledge bases (KBs) open up new ways to exploit biomedical knowledge that is available in the form of KBs. Significant work has been done in the direction of biomedical KB creation and KB completion, specifically, those having gene-disease associations and other related entities. However, the use of such biomedical KBs in combination with patients’ temporal clinical data still largely remains unexplored, but has the potential to immensely benefit medical diagnostic decision support systems.
Results: We propose two new algorithms, LOADDx and SCADDx, to combine a patient’s gene expression data with gene-disease association and other related information available in the form of a KB, to assist personalized disease diagnosis. We have tested both of the algorithms on two KBs and on four real-world gene expression datasets of respiratory viral infection caused by Influenza-like viruses of 19 subtypes. We also compare the performance of the proposed algorithms with that of five existing state-of-the-art machine learning algorithms (k-NN, Random Forest, XGBoost, Linear SVM, and SVM with RBF Kernel) using two validation approaches: LOOCV and a single internal validation set. Both SCADDx and LOADDx outperform the existing algorithms when evaluated with both validation approaches. SCADDx is able to detect infections with up to 100% accuracy in the cases of Datasets 2 and 3. Overall, SCADDx and LOADDx are able to detect an infection within 72 h of infection with 91.38% and 92.66% average accuracy respectively considering all four datasets, whereas XGBoost, which performed best among the existing machine learning algorithms, can detect the infection with only 86.43% accuracy on average.
Conclusions: We demonstrate how our novel idea of using the most and least differentially expressed genes in combination with a KB can enable identification of the diseases that a patient is most likely to have at a particular time, from a KB with thousands of diseases. Moreover, the proposed algorithms can provide a short ranked list of the most likely diseases for each patient along with their most affected genes, and other entities linked with them in the KB, which can support health care professionals in their decision-making.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-024-05674-0.


Background
Due to advances in the field of genomics in the past two decades, the focus of medical science has been shifting from disease-centric to person-centric diagnostic and therapeutic methods [1,2]. The development of microarray techniques and new advances in RNA sequencing have improved our ability to explore the underlying molecular mechanisms associated with complex diseases [3]. Gene expression profiles are being used to identify disease-specific genome-wide changes in genes, which can help in the identification of differentially expressed genes (DEGs): these are genes whose expression levels significantly differ between the healthy state and the diseased state [4,5]. The motivation behind the identification of DEGs is to understand the molecular processes involved in the progression of a disease. These DEGs can be used as important biomarkers for patient classification [3,6], disease diagnosis [7], and drug target identification [8].
A knowledge base is an extensive collection of structured or unstructured data that represent facts about the world [9,10]. It is a dataset with some formal semantics that may contain different kinds of knowledge, for example, facts, rules, axioms, statements, definitions, and primitives [11,12]. Although some researchers have used the terms 'knowledge base' and 'knowledge graph' interchangeably, e.g. [13–16], the use of 'graph' generally implies that it has some specific features. The fundamental factor that sets knowledge graphs apart from knowledge bases lies in their emphasis on the interconnectedness of entities, reasoning capabilities, and graph structure [12,14,17]. While all knowledge graphs can be considered knowledge bases, not all knowledge bases meet the criteria to be labeled as knowledge graphs.
To offer personalized diagnostic recommendations using gene expression profiles, it is important to obtain knowledge relevant to individual patient data. Huser et al. use the term 'knowledge bases' to describe resources that include information about the interpretation and implications of specific genomic findings [18]. They further mention that knowledge bases typically contain aggregated knowledge and no patient-level data [18].
In our case, the patient's data that we analyse is their gene expression profile, while the knowledge base can encompass additional information related to genes, diseases, and associations between them. This knowledge can be accessed through knowledge bases such as CTD [19], DisGeNet [20], Gene Ontology [21] and Disease Ontology [22,23].
Researchers are exploring new ways to use the knowledge represented by biomedical KBs to solve complex problems in the biomedical domain [24–27]. Bonner et al. [28] provide an overview of existing biomedical KBs. Biomedical KBs such as DisGeNet [20], Hetionet [29], BioKG [30], Bio2RDF [31,32] and UniProt [33] provide prior biomedical knowledge which can be combined with patients' clinical data for better model building in the health care domain.
Using DEGs, it is possible to measure changes in individual patients at the molecular level and identify the relevant biological processes triggered by those DEGs. Thus, DEGs can play an important role in disease diagnosis. However, a DEG can be involved in multiple biological processes [34,35], and so be related to multiple different diseases in a KB. This makes it more challenging to perform personalised disease diagnosis based on DEGs, when there are thousands of diseases in the KB and the objective is to identify the most probable disease for a patient.
The existing biomedical KBs that we explored for the experiments in this paper do not provide any quantitative association strength information. This leads to an implicit assumption that all associations are of equal significance or strength. For example, in biomedical KBs such as DisGeNet,¹ if there is an association between a gene and a disease, the KB does not represent the quantitative strength of that association. As a result, all genes are simply represented as being linked to all associated diseases, but in reality, only a small group of genes are strongly associated with a particular disease, while many other genes are weakly associated with it. This can limit the usefulness of such KBs for identifying which diseases are most likely, given observed gene expression data.
Moreover, KBs are known to be generally quite incomplete. For example, more than 60% of the people in DBpedia and Freebase are missing their birthplaces [13,36,37]. Similarly, biomedical KBs also suffer from the problem of missing links. No existing KB has information on all possible diseases and related entities. For example, there are more than 10,000 rare diseases [38] and most of the biomedical KBs have between 2000 and 9000 diseases [39]. Missing links can be added based on the literature, as we will describe in "CTD knowledge base" section. Missing links can also be identified using KB embedding approaches [37,40,41]; however, curated KB links are considered more reliable. KB embedding approaches such as TransE were found to perform poorly in biomedical link prediction (13.88% Hits@10) [42]. Challahan et al. [43] also noted that KB embedding and NLP based biomedical KBs are generally very noisy and should be used with caution. Therefore, we have curated some missing links for the KB that we use in this work; see "CTD knowledge base" section. We have also performed experiments with the publicly available DisGeNet KB, without adding or changing any of its links.
The long-established Cyc KB [44] also illustrates the challenges of KB incompleteness. Cyc KB is still reported to have gaps, despite being estimated to have accumulated over 900 person-years of work in its development [16]. It may require many more person-years to refine such existing KBs to incorporate quantitative link strength manually. Therefore, in this paper, we propose an alternative approach that can make use of existing KBs and assist with disease diagnosis tasks, when we do not know the quantitative link strength values between genes and diseases in the KB.
Our overall goal is to predict the disease that a patient is most likely to have at a particular time, by evaluating changes in their gene expression levels, with the help of a KB that represents thousands of diseases and their links to associated genes and other entities. Bharadhwaj et al. [45] worked on combining gene expression data with biomedical KBs; however, their approach is not suitable for longitudinal gene expression datasets, where subjects' samples collected at different time-points play an important role. The novelty and advantage of our proposed approach are that it is suitable for longitudinal gene expression datasets and that it considers the time aspect for personalised disease diagnosis.
Our specific contributions are as follows: (1) We demonstrate how a patient's Least Differentially Expressed Genes (LDEGs) along with Most Differentially Expressed Genes (MDEGs) can help in disease diagnosis in the presence of a KB. To the best of our knowledge, LDEGs have not previously been used for disease diagnosis in combination with KBs. (2) We show how KBs that do not include quantitative link strength information can be used to infer the strength of links in a patient-specific manner, using the patient's gene expression profile. (3) We propose two new algorithms to combine patients' time-series gene expression data with a KB. Both of the algorithms can assist with personalised disease diagnosis and can produce a short personalised ranked list of most likely diseases for each patient.
The rest of the paper is structured as follows. In "Description of existing ML algorithms" section, we briefly describe the existing machine learning (ML) algorithms. "Gene expression datasets" section describes the real-world gene expression datasets that we will use in this work. "Knowledge base" section describes the KBs used to perform experiments. In "Proposed algorithms" section, we explain our proposed algorithms. "Experimental design" section describes the experimental design. In "Results" section, we discuss and compare results in detail. Finally, we conclude in "Conclusions and future work" section.

Description of existing ML algorithms
In "Results" section, we will compare the performance of the proposed algorithms with existing ML algorithms. These ML algorithms are described here. k-Nearest Neighbour (k-NN) is an instance-based learning algorithm [46]. k-NN stores the training cases, and when presented with a new query case, it finds the set of k instances that have the lowest distance, according to some metric; these are termed the nearest neighbours. Then, the query case is assigned a class label based on the majority class of the k nearest neighbours [47]. We have used the Euclidean distance metric for our experiments. The optimum value of k is searched over the range of k = 1 to n with a step size of 2 (odd values: 1, 3, ..., n), where n represents the number of samples in the training set. We chose odd values for k to avoid ties.
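The k-NN procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names `odd_k_values` and `knn_predict` and the toy data in the usage note are assumptions introduced here.

```python
import math
from collections import Counter


def odd_k_values(n):
    """Candidate k values 1, 3, 5, ... up to n (the training-set size)."""
    return list(range(1, n + 1, 2))


def knn_predict(train_x, train_y, query, k):
    """Label `query` by majority vote among its k nearest training cases."""
    # Euclidean distance from the query to every stored training case.
    dists = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    dists.sort(key=lambda pair: pair[0])
    # Majority class among the k closest cases.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

For example, with three training cases labelled "RVI" clustered near (5, 5) and two labelled "healthy" near the origin, a query at (5.2, 5.1) with k = 3 would be labelled "RVI". With two classes, restricting k to odd values means the vote cannot be tied.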
Random Forests is an ensemble machine learning method. It is considered an efficient algorithm for the classification of gene expression data [48]. The Random Forest algorithm constructs an ensemble of many classification trees [49,50]. Each classification tree is created by selecting a bootstrap sample from the whole training dataset, and a random subset of attributes, with size denoted $n_a$, is selected at each split. The optimum value of $n_a$ is searched over the range of 10 to $x$ with a step size of 10, where $x$ represents the square root of the total number of attributes (in this case, the total number of genes). The number of trees in the ensemble is denoted as $n_t$. We have used $n_t = 100$.
Support Vector Machine (SVM) works on the principle of finding the maximum-margin separating hyperplane. Assume that we have a training set of instance-label pairs $(x_i, y_i)$, $\forall i \in \{1, 2, \ldots, l\}$, where $x_i \in \mathbb{R}^n$ and $y \in \{1, -1\}^l$; then the SVM [51–53] can be formulated and solved by the following optimization problem: $$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{1}$$ Here $w$ is normal to the hyperplane, $b$ is the bias term, $\phi$ is a function that maps the data into a higher dimensional space, the parameter $C > 0$ is the penalty parameter of the error term [53] and $\xi_i$, $\forall i \in \{1, 2, \ldots, l\}$, are positive slack variables [51]. Furthermore, $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is called the kernel function [53]. The technique known as the kernel trick [54] can be used to translate the linear SVM algorithm into a kernelized version. After projecting data into a higher dimensional space, the SVM finds a maximal margin linear classifier, $f(x) = \mathrm{sign}(w^T \phi(x) + b)$, whose parameters are obtained by solving Eq. (1). There are four basic kernels that are frequently used: linear, polynomial, sigmoid, and RBF. We produced results using both Linear SVM and SVM with RBF kernel.
For Linear SVM (linear kernel: $K(x_i, x_j) = x_i^T x_j$), we searched for the best value of the parameter $C$ over the range $2^{-5}$ to $2^{15}$ in multiples of 4. The RBF kernel is $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ with $\gamma = \frac{1}{2\sigma^2} > 0$.
XGBoost (eXtreme Gradient Boosting) [55] is an ensemble learning algorithm that has been found to be an effective method for a wide range of machine learning tasks, including classification, regression, and ranking. XGBoost builds a set of decision trees iteratively, using a gradient boosting approach to minimize a user-specified loss function.
The key idea behind XGBoost is to iteratively add decision trees to the ensemble, with each new tree trained to correct the residual errors of the previous trees. In other words, XGBoost fits the model by adding new trees to the ensemble that improve the overall prediction accuracy, while penalizing trees that are too complex or overfit the data. We used the R implementation of the XGBoost library with its default gradient boosting tree model, called GBTree.² The optimal values for XGBoost parameters were determined across the following ranges: eta (learning rate) from 0.1 to 1 with a step size of 0.1, max_depth (maximum depth of a tree) from 2 to 6, and nround (number of rounds in the gradient boosting process) from 10 to 100 with a step size of 10.
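To make the search ranges quoted in this section concrete, the following Python sketch enumerates them. It only builds the candidate grids; it does not train any model, and the variable and function names are assumptions introduced for illustration. "In multiples of 4" is read here as each candidate value of C being four times the previous one, i.e. an exponent step of 2.

```python
import itertools
import math

# Linear SVM: C from 2^-5 to 2^15, each value 4x the previous (exponent step of 2).
svm_c_grid = [2.0 ** e for e in range(-5, 16, 2)]

# XGBoost: eta 0.1..1.0 (step 0.1), max_depth 2..6, nround 10..100 (step 10).
xgb_grid = [
    {"eta": i / 10, "max_depth": d, "nround": r}
    for i, d, r in itertools.product(range(1, 11), range(2, 7), range(10, 101, 10))
]


def rf_subset_sizes(n_genes):
    """Candidate n_a values for Random Forest: 10, 20, ... up to sqrt(n_genes)."""
    return list(range(10, math.isqrt(n_genes) + 1, 10))
```

Under this reading there are 11 candidate values of C and 500 XGBoost parameter combinations. For a hypothetical dataset with 11,622 genes, the Random Forest candidates for $n_a$ would be 10, 20, ..., 100.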

Gene expression datasets
We have conducted experiments using four real-world gene expression datasets related to Respiratory Viral Infection (RVI). Dataset 1 is collected from 7 RVI Challenge studies, and is openly available on Gene Expression Omnibus (GEO).³ This dataset consists of 151 human volunteers who were healthy when they enrolled for the study. After enrolment, all subjects were inoculated with one of four viruses (H1N1, H3N2, HRV, RSV). Their blood samples were taken at pre-defined time-points, including before inoculation, thus delivering gene expression profiles from non-infected individuals as well as from infected ones [56]. Out of 151 subjects, 47 subjects' samples failed quality control checks, so we excluded them from the study. For more information, see [57,58]. Dataset 2 contains gene expression profiles of 133 adults whose samples are taken in three different seasons: Autumn, Winter and Spring. Baseline samples are taken at the time of enrolment of volunteers [59]. For each volunteer, samples are taken at up to seven time-points before, during, and after the occurrence of illness (influenza and other acute respiratory viral infections). This dataset is also accessible on GEO.⁴ Dataset 3, also on GEO,⁵ is collected from an influenza challenge trial in which 21 volunteers participated. Their samples are collected at baseline (healthy) and 4 different time-points after intranasal administration of wild-type A/California/2009 H1N1 virus [60]. Out of 21 subjects, 15 got infected and reported symptoms of illness. Three more subjects had some detectable amount of live virus shedding [60]; however, their mapping to subject IDs is not available. Therefore, we performed experiments with the data of the 15 subjects for whom reliable information is available.
Dataset 4 is also collected from an influenza trial, and contains the gene expression profiles of 22 subjects. All subjects were healthy at the time of enrollment and were aged 18-45 years [61]. All 22 subjects were inoculated with A/Wisconsin/67/2005 H3N2 influenza virus at a dose of 1 ml in a quarantine facility [61]. Gene expression data from peripheral blood was taken immediately before the viral inoculation and at 12, 24, and 48 h post-inoculation [61]. Dataset 4 is also accessible on GEO.⁶

Knowledge base
We performed experiments using two KBs: DisGeNet KB [20] and CTD KB [19]. The following subsections provide a description of these KBs.

DisGeNet knowledge base
The DisGeNet KB [20] is a publicly available collection of genes, diseases, and variants associated with human diseases. For the sake of simplicity and to meet the requirements of this work, we extracted a subset of the DisGeNet KB from the provided portal⁷ using the R package mentioned on the portal website. The extracted DisGeNet KB has 7 types of entities and 6 types of relations. The full RDF schema of DisGeNet KB is available on the DisGeNet website.⁸ The 7 types of entities that our experimental DisGeNet KB includes are gene, disease, disease type, disease class, disease semantic type, protein class, and UniProt ID. A UniProt ID is linked with a gene, representing a specific protein encoded by that gene. UniProt IDs provide information about the gene that encodes a particular protein, including its gene symbol and chromosomal location, as well as the function and interactions of the protein through the UniProt KB [62]. We included UniProt IDs in our experimental DisGeNet KB so that it becomes easier for researchers to further investigate these relationships if they want to do so.
³ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73072
⁴ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68310
⁵ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90732
⁶ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61754

CTD knowledge base
CTD KB is described as a digital ecosystem that establishes connections between toxicological data pertaining to genes, diseases, chemicals, and phenotypes [19,63]. CTD KB⁹ has been extensively used in projects where the association between biomedical entities plays an important role [64,65]. There are 11,622 genes which are common between the CTD KB and the experimental gene expression datasets, so we use only these. After preprocessing, the CTD KB has a total of 14,138,823 links between 11,622 genes and 6430 diseases.
We found that the CTD KB does not have RVI disease links, so we added curated RVI links to it. To do this, we referred to five journal papers [56,59,66–68] to find information about which genes are associated with RVI. All relevant genes as identified in the journal papers were already present in the CTD KB, so we added 220 links from them to the new RVI disease in the KB. See Fig. S1 in Additional file 1, which plots the disease in-degree of the KB, which we define as the number of genes linked to each disease. The CTD KB we are using for our experiments also has 7 types of entities and 6 types of relations, because for those genes and diseases that are common between CTD and DisGeNet, we have added other entity types and relations in the CTD KB from the DisGeNet KB.

Proposed algorithms
Our approach to personalised disease diagnosis is inspired by the approach of recommender systems, where the goal is to provide a short ranked list of recommended items to a person, from a set of thousands of items, based on the person's past preferences or profile. Here, we aim to provide a short ranked list of most likely diseases from the thousands in the KB, based on the person's gene expression profile.
For that, we have developed two algorithms, LOADDx (Log-Odds based Assistant for Disease Diagnosis (Dx)) and SCADDx (SCore-based Assistant for Disease Diagnosis (Dx)). Both of the algorithms share the same basic novel idea of up-weighting disease scores based on P MDEGs, and down-weighting disease scores based on Q LDEGs, where the P MDEGs (Most Differentially Expressed Genes) are the top P ranked genes whose expression levels show a large difference between the healthy state (control) and the diseased state (target). Conversely, the Q LDEGs (Least Differentially Expressed Genes) are the bottom Q ranked genes whose expression levels show little or no difference between the two states. In effect, genes are sorted in descending order of their differential expression, and we select the top P and bottom Q. The idea is that, for a given person at a particular time t, if a significant number of MDEGs are associated with a particular disease in the KB, this provides evidence supporting that the person may have that disease, so the disease is up-weighted based on the identified MDEGs. Conversely, if a significant number of LDEGs are associated with a particular disease, this provides evidence against the person having that disease, so the disease is down-weighted. Then, the disease with the highest weight should be the most likely disease that the person may have at that time. Figure 1 illustrates the basic idea.
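The selection of the top P and bottom Q genes can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `select_mdegs_ldegs` and the dictionary-based data layout are assumptions, and so is the choice to rank genes by the absolute value of the change in expression, so that "little or no difference" corresponds to the bottom of the list.

```python
def select_mdegs_ldegs(reference, target, p, q):
    """Return (MDEGs, LDEGs) as lists of gene names.

    reference, target: dicts mapping gene name -> expression level at the
    healthy (control) time-point and at the diagnosis (target) time-point.
    """
    # Change in expression per gene: target minus reference.
    delta = {g: target[g] - reference[g] for g in reference if g in target}
    # Assumption: rank by absolute change, largest first.
    ranked = sorted(delta, key=lambda g: abs(delta[g]), reverse=True)
    mdegs = ranked[:p]
    # Guard q == 0, because ranked[-0:] would return the whole list.
    ldegs = ranked[-q:] if q > 0 else []
    return mdegs, ldegs
```

With four genes whose changes are 3.0, 0.1, -2.0 and 0.0, choosing P = Q = 2 puts the first and third genes among the MDEGs and the other two among the LDEGs.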
In order to test our hypothesis that the magnitude of most/least differentially expressed genes may be a useful signal in relating gene expression to disease diagnosis, we propose two algorithms: LOADDx does not use the magnitudes of MDEGs/LDEGs, whereas SCADDx does. Then, by testing whether SCADDx outperforms LOADDx, we will gain insight into whether this magnitude information is important.

LOADDx algorithm
The LOADDx algorithm finds the changes in all genes' expression levels ($G$) by subtracting a subject's gene expression data at time $t_1$ (healthy state) from their gene expression data at time $t_D$, the time at which disease diagnosis has been requested (infected state or when infection is suspected). It selects the P MDEGs and Q LDEGs from the sorted list of all DEGs. Then, for each disease in the KB, it finds the number of common genes $CP$ between those associated with disease $D_i$ and the P MDEGs from the gene expression data. It calculates the log-odds ($LP$) of disease $D_i$ from the P MDEGs and $CP$ genes as follows: $$LP = \ln\left(\frac{CP + 1}{P + 1 - CP}\right) \tag{2}$$ Similarly, it finds the number of common genes $CQ$ between those associated with disease $D_i$ and the Q LDEGs in the gene expression data. It calculates the log-odds ($LQ$) of disease $D_i$ from the Q LDEGs and $CQ$ genes as follows: $$LQ = \ln\left(\frac{CQ + 1}{Q + 1 - CQ}\right) \tag{3}$$ Then, it calculates the weight $W_i$ for each disease $D_i$ from $LP$ and $LQ$ using Eq. (4), which up-weights the disease according to $LP$ and down-weights it according to $LQ$. Finally, it ranks all the diseases in the KB in descending order based on their calculated weights/scores and extracts the top m diseases with the set of other entities (E) linked to those diseases in the KB.
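The LOADDx steps can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the log-odds are taken in the add-one smoothed form ln((C + 1)/(T + 1 - C)) for both gene sets, and the weight is assumed to be the difference LP - LQ, since the formula for the weight is not reproduced in this text. The names `log_odds` and `loaddx_rank`, and the representation of the KB as a dictionary from disease name to a set of associated genes, are hypothetical.

```python
import math


def log_odds(common, total):
    """Add-one smoothed log-odds ln((common + 1) / (total + 1 - common))."""
    return math.log((common + 1) / (total + 1 - common))


def loaddx_rank(kb, mdegs, ldegs, m):
    """Rank diseases in `kb` (disease -> set of genes) and return the top m.

    Returns a list of (disease, weight) pairs, highest weight first.
    """
    p, q = len(mdegs), len(ldegs)
    mdeg_set, ldeg_set = set(mdegs), set(ldegs)
    weights = {}
    for disease, genes in kb.items():
        cp = len(genes & mdeg_set)  # genes shared with the P MDEGs
        cq = len(genes & ldeg_set)  # genes shared with the Q LDEGs
        # Assumption: weight = LP - LQ (MDEG evidence up, LDEG evidence down).
        weights[disease] = log_odds(cp, p) - log_odds(cq, q)
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:m]
```

In this sketch, a disease linked only to MDEGs receives a positive weight, a disease linked only to LDEGs receives a negative weight, and a disease with equal proportional overlap with both sets scores zero. The add-one smoothing keeps the logarithm finite when the overlap is empty or complete.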

SCADDx algorithm
The SCADDx algorithm operates on the same basic idea as that of the LOADDx algorithm. The most fundamental difference between them is that, to calculate the disease score, SCADDx makes use of the magnitudes of the P MDEGs and Q LDEGs, whereas LOADDx does not.
SCADDx selects P MDEGs and Q LDEGs from the list of all DEGs. Then, for each disease in the KB, it finds the common genes $CP$ and $CQ$ between those associated with disease $D_i$ and the P MDEGs and Q LDEGs respectively in the gene expression data. It calculates the weight $W_i$ for each disease $D_i$ using Eq. (5), where $x$ represents the total number of genes in $CP$, $y$ represents the total number of genes in $CQ$ and $G$ denotes the change in the magnitude of gene expression. Finally, it ranks all the diseases in the KB in descending order based on their calculated weights and extracts the top m diseases with the set of other entities (E) linked to those diseases in the KB.
To compute the probabilities of the top-ranked m diseases from their weight scores, we use the softmax function, $f(w_i) = e^{w_i} / \sum_{j=1}^{m} e^{w_j}$, where $w_i$ is the weight of the $i$-th disease, $m$ is the number of diseases, $i = 1, \ldots, m$, and $f(w_i)$ represents the probability [69].
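The softmax conversion from weights to probabilities can be sketched as follows. The function name `softmax` and the max-subtraction step are choices made for this illustration; subtracting the largest weight before exponentiating does not change the result mathematically but avoids numerical overflow for large weights.

```python
import math


def softmax(weights):
    """Convert disease weights w_1..w_m into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    shift = max(weights)
    exps = [math.exp(w - shift) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]
```

The ordering of the diseases is preserved: the disease with the highest weight also has the highest probability among the m top-ranked diseases.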

Experimental design
We design our experiments so as to enable personalized disease diagnosis at an early stage of infection. We perform experiments by combining a KB with the patients' gene expression data collected at an early time-point. Here, an early time-point means within 3 days (72 h) of exposure to a virus. For each subject, we consider gene expression data collected at two time-points. The first time-point is called the reference sample and the second time-point is called the target sample. The reference sample is collected at time $t_1$ = 0 h, before the patient has the disease. Target samples are collected at time $t_D$ = day 2 or day 3, based on the availability of data, at which time subjects might or might not be exhibiting signs of infection. For all subjects in all datasets, reference samples are available at time $t_1$ = 0 h; however, target samples are not available at the same time for all subjects. For Dataset 1, we have target samples available between $t_D$ = 60 and 72 h. For Dataset 2, we have target samples available at $t_D$ = day 2. For Dataset 3 and Dataset 4, we have target samples available at $t_D$ = 72 h and $t_D$ = 48 h respectively.
We test our proposed algorithms on four real-world gene expression datasets of RVI disease that are described in "Gene expression datasets" section. The datasets have a true class label indicating whether a subject has a respiratory viral infection or not (see Table 1). We use the true class label (actual disease) and predicted class label (predicted disease) to compute the accuracy of disease predictions. The formula to compute accuracy is: Accuracy = (TP + TN)/(TP + TN + FP + FN), where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative [70]. For each patient, the predicted class label is obtained by comparing the actual disease with the top n predicted diseases, for values of n = 1, 2, 3, 4, 5 or 10. If there is a match found between the predicted top n diseases and the actual disease, then this is assigned as the predicted class label. For example, if the actual disease is RVI, and the algorithm includes RVI in its top n predicted diseases, the predicted class label is set to RVI; otherwise it is set to Not RVI. Influenza is a respiratory viral infection that belongs to the class of respiratory tract infections or diseases. As a result, our proposed algorithms have the ability to identify respiratory viral infections or diseases if the KB contains any of these terms: Influenza, Respiratory Viral Infection, or Respiratory Tract Disease.
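The top-n labelling rule and the accuracy formula can be sketched as below. This is a hedged illustration only: the function names `predicted_label` and `accuracy`, the label strings, and the use of a set of accepted KB terms are assumptions introduced here, not details taken from the experiments.

```python
def predicted_label(ranked_diseases, n, target_terms):
    """Return 'RVI' if any of the top n predicted diseases is an accepted term."""
    top_n = ranked_diseases[:n]
    return "RVI" if any(d in target_terms for d in top_n) else "Not RVI"


def accuracy(actual, predicted, positive="RVI"):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    # Every subject is one of TP, TN, FP or FN, so the denominator is len(pairs).
    return (tp + tn) / len(pairs)
```

For instance, with accepted terms {"Influenza", "Respiratory Viral Infection", "Respiratory Tract Disease"}, a subject whose ranked list places Influenza second would be labelled RVI for n = 2 but Not RVI for n = 1.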
Because of the small size of the datasets, it would not be practical to use k-fold CV, even for small values of k such as 5. Therefore, we employed two alternative validation approaches to compare the performance of SCADDx and LOADDx with existing ML algorithms. The first is the single internal validation set approach (see Tables 5, 6) and the second is the Leave-One-Out Cross-Validation (LOOCV) approach (see Table 7). For a fair comparison, the existing ML models are also trained on the same time-points for which the proposed algorithms are trained.
For the single internal validation set approach, all the datasets are divided into training, validation, and test sets in the ratio 50:25:25. We used random stratified sampling while splitting the datasets. The model parameters are selected based on performance on the validation set. The power of the t-test increases as we increase the number of test sets; therefore, we divided the test sets of Dataset 1 and Dataset 2 further into two parts each (Testsets 1a, 1b, 2a, and 2b). Thus, we have 6 test sets in total, as shown in the tables in the "Results" section. We cannot divide the test sets of Dataset 3 and Dataset 4 further as they are small. LOOCV is considered an efficient way to evaluate performance when the number of samples is very small [56,71]. Therefore, we also performed evaluations using the LOOCV approach. To conduct LOOCV, the data of each subject is held out one at a time as a test case, while the data of the other subjects are used for training. LOOCV ensures there is no risk of a lucky split, since each subject's data serves in turn as the held-out set, with all other data points acting as the training set. This process is repeated for each data point, and the results are then averaged to evaluate the model's performance [71]. In our study, LOOCV was chosen to address the challenges posed by a limited number of samples. For LOOCV, both Dataset 1 and Dataset 2 are split into two equal parts, creating independent datasets that we refer to as Datasets 1a, 1b, 2a and 2b. Each split consists of 50% of the data from its respective dataset. Comparative analysis with detailed results is presented in the next section.
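The LOOCV procedure described above can be sketched in Python as follows. This is a minimal sketch, not the evaluation code used in the study: `loocv_splits`, `loocv_accuracy` and the `fit_predict` callback are hypothetical names, and the callback stands in for whichever model is being evaluated.

```python
def loocv_splits(subject_ids):
    """Yield (train_ids, test_id), holding out each subject exactly once."""
    for i, test_id in enumerate(subject_ids):
        train_ids = subject_ids[:i] + subject_ids[i + 1:]
        yield train_ids, test_id


def loocv_accuracy(subject_ids, labels, fit_predict):
    """Average correctness over all held-out subjects.

    labels: dict mapping subject id -> true class label.
    fit_predict(train_ids, test_id) -> predicted label for the held-out
    subject (a hypothetical callback supplied by the caller).
    """
    hits = [
        fit_predict(train_ids, test_id) == labels[test_id]
        for train_ids, test_id in loocv_splits(subject_ids)
    ]
    return sum(hits) / len(hits)
```

With N subjects this produces exactly N train/test splits, so every subject contributes one test prediction and no subject is ever in the training set for its own prediction.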

Results
In this section, we present the results of LOADDx and SCADDx using different parameter settings and a comparative analysis using different values of n. We also analyse the performance of both algorithms on respiratory viral infections generally, as well as their performance on specific virus types/subtypes in the datasets that can cause respiratory viral infections. Finally, we also compare the performance of the proposed algorithms with existing ML algorithms.

Comparison of LOADDx with SCADDx
For each subject, both SCADDx and LOADDx produce a ranked list of the top n predicted diseases with their weights and probabilities (see Table 1 and Tables S1-S12 in Additional file 1). These tables present the results obtained using the single internal validation set approach, as explained in "Experimental design" section. For hyperparameter optimization of SCADDx and LOADDx, we conducted a grid search over P and Q in the range of 25 to 300 with a step size of 25. Table 1 shows the top 5 diseases predicted by the SCADDx algorithm for the first 5 subjects in Testset 1a. Please see Additional file 1 for full results on all the datasets using both algorithms. Table 1 also shows, for each subject, the top 5 most affected genes, and the changes in expression values of these genes. The subjects with the largest changes in gene expression have the highest disease scores in SCADDx (see Table 1, Subject 3). These can indicate severe cases of infection, so such subjects should be handled carefully. From Table 1, it can be seen that our SCADDx algorithm can produce a short personalised ranked list of most likely diseases for each patient, which can help health care professionals in their decision-making.
To compare the performance of the proposed algorithms, we compute the accuracy of each algorithm based on whether the correct disease is in the top n predicted diseases. Table 2 presents a comparative analysis between LOADDx and SCADDx at different values of n, keeping the parameter values fixed (P = 200, Q = 200). SCADDx achieves a median accuracy of 89.52% (average accuracy = 88.81%), whereas LOADDx achieves a median accuracy of 86.19% (average accuracy = 86.42%), when n = 10 in both cases considering all datasets (see Table 2). SCADDx performs as well as or better than LOADDx on all four datasets for all values of n. In almost all cases, accuracy scores are higher for higher values of n. This is to be expected, since it is more likely that the correct answer would be among the top 3 ranked diseases than the top 2, for example. However, as we increase n, there is an increased risk of false positives. We can see this in the SCADDx result for Testset 1b, when n is increased from 5 to 10 (see Table 2). Therefore, we suggest that the number of predicted diseases (n) should be kept low (from 1 to 10). In any case, a short list of predicted diseases would be more useful for the user.
Table 3 presents a comparative analysis between LOADDx and SCADDx at different values of n, when performing a grid search over P and Q in the range of 25 to 300 with a step size of 25. The reported results were obtained by employing the single internal validation set approach, as explained in the "Experimental design" section. The median accuracy of SCADDx is 92.86% (average accuracy = 91.21%), whereas for LOADDx the median accuracy is 86.19% (average accuracy = 87.70%) when n = 10 (see Table 3). Again, SCADDx performs better than LOADDx. The results of the grid search suggest that in most cases, both algorithms achieved their best accuracies when Q > P (see Table 3). When Q > P, the algorithms down-weight those diseases that are linked to LDEGs (Q) in the KB, since having a larger number of LDEGs associated with a disease provides stronger evidence against the person having that disease. By down-weighting the diseases that a person is less likely to have, better accuracy is achieved.
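The grid search over P and Q (25 to 300 in steps of 25, giving 12 values each) could be organised as in the sketch below. The `evaluate` callable stands in for running SCADDx or LOADDx on the validation set; `toy_evaluate` is a purely hypothetical surrogate, with its peak placed arbitrarily at P = 100 and Q = 175, and it is not derived from the reported results.

```python
from itertools import product
from typing import Callable, Tuple

GRID = range(25, 301, 25)  # 25, 50, ..., 300, as stated in the text


def grid_search(evaluate: Callable[[int, int], float]) -> Tuple[int, int, float]:
    """Return the (P, Q, accuracy) triple with the highest accuracy on the grid."""
    best = None
    for p, q in product(GRID, GRID):
        acc = evaluate(p, q)
        if best is None or acc > best[2]:
            best = (p, q, acc)
    return best


def toy_evaluate(p: int, q: int) -> float:
    """Hypothetical stand-in for validation accuracy; peaks at P = 100, Q = 175."""
    return 1.0 - (abs(p - 100) + abs(q - 175)) / 1000.0


n_combinations = len(GRID) ** 2
best_p, best_q, best_acc = grid_search(toy_evaluate)
```

With 12 candidate values for each parameter, the search evaluates 144 (P, Q) pairs per algorithm and dataset.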
There are many virus types/subtypes that can cause RVI disease. Therefore, we have performed what we term a virus-wise performance analysis, to analyse how well the proposed algorithms work on different viruses. The datasets have an entry named virus that provides the information about the type/subtype of virus that caused each subject's infection (see Table 4). For example, Dataset 1 contains 4 specific viruses, as shown in Table 4. Please refer to Additional file 1 (see Table S13) for information on all virus types/subtypes covered in all the datasets. We categorized all subjects into 7 general virus groups: H1N1, H3N2, HRV, RSV, Influenza A, Other viruses, and Infected but no virus subtype detected (see Table 4). Table 4 presents the virus-wise performance analysis on all the testsets. These results were obtained by using the single internal validation set approach, as explained in the "Experimental design" section. Table 4 shows that SCADDx is able to achieve 100% accuracy in the case of Influenza A virus for all the testsets for all values of n. SCADDx is also able to achieve 100% accuracy in the case of H3N2 and HRV viruses on Testset 1a, H1N1 and HRV viruses on Testset 1b, all virus types on Testset 2a, Influenza A and other virus types on Testset 2b, and H1N1 virus on Dataset 3 for all values of n. This illustrates that the strengths of relationships between KB entities play an important role, and shows that the novel idea on which SCADDx is based has the potential to achieve up to 100% accuracy (see Tables 4, 5). Based on the virus-wise comparative analysis, it can be observed that SCADDx again performs better than LOADDx. The main reason why SCADDx tends to outperform LOADDx is that it makes use of the magnitudes of MDEGs/LDEGs, whereas LOADDx does not. We conclude that this magnitude information is important, and that SCADDx is successfully able to exploit this information to infer gene-disease link strengths in a patient-specific fashion, and incorporate this information in disease probability estimates.
Both algorithms perform well in the case of the H1N1, H3N2 and HRV viruses. However, they do not perform so well in the case of the RSV virus (see Table 4). We examined the datasets more deeply to understand why, and concluded that the KB does not have sufficient information about genes associated with the RSV virus. It has information about only 30 genes that are related to the RSV virus, which is a case of KB incompleteness. This could be addressed in future work by conducting further KB curation to add more links.
Figure 2 provides a visualisation of results for a single subject and a single dataset (first subject of Dataset 2). In Fig. 2A, we plot the most and least differentially expressed genes (MDEGs and LDEGs) of a subject. In Fig. 2B-D, we plot the subject's MDEGs and LDEGs associated with the disease in the KB, with the disease rank predicted by SCADDx. For example, as can be seen in Fig. 2B, there are many genes which are highly expressed and linked with RVI disease (the top-ranked disease), whereas the lower-ranked diseases have relatively fewer highly-expressed genes linked with them in the KB. As would be expected, the larger the number of MDEGs associated with a disease, the higher the chances are of having that disease. The trend in Fig. 2B-D shows that as we move from the Rank 1 disease to the Rank 100 disease, the number of associated MDEGs drops significantly. Also, the larger the number of LDEGs associated with a disease, the lower the chances are of having that disease. Figure S2 in Additional file 1 shows that as we move from the Rank 1 disease to the Rank 6000 disease, the number of associated LDEGs increases significantly. These trends provide evidence in support of the disease rank predicted by SCADDx.
Based on these observations, we conclude that there are three main contributing factors that influence which disease will get a high rank. Firstly, a significantly large number of MDEGs should be associated with the disease. Secondly, a low number of LDEGs should be associated with it. Thirdly, among the associated MDEGs, the change in gene expression should be larger in comparison to the MDEGs associated with other diseases. If a very large number of a patient's LDEGs is associated with a disease in the KB, then that disease should never get a high rank.
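To make the three factors concrete, the sketch below uses a toy scoring rule that rewards linked MDEGs by the magnitude of their expression change and penalises each linked LDEG. It is not the SCADDx or LOADDx formula, which is defined elsewhere in the paper; the function `toy_disease_score`, the `ldeg_penalty` parameter, the gene names, and the KB links are all hypothetical.

```python
from typing import Dict, Set


def toy_disease_score(
    kb_genes: Set[str],
    mdeg_change: Dict[str, float],
    ldegs: Set[str],
    ldeg_penalty: float = 1.0,
) -> float:
    """Toy score: reward linked MDEGs by magnitude, penalise linked LDEGs by count."""
    support = sum(abs(change) for gene, change in mdeg_change.items() if gene in kb_genes)
    against = ldeg_penalty * len(ldegs & kb_genes)
    return support - against


# Hypothetical patient profile and KB links (gene names are placeholders).
mdeg_change = {"G1": 3.0, "G2": 2.5, "G3": 2.0}
ldegs = {"G7", "G8", "G9"}
kb = {
    "Disease A": {"G1", "G2", "G3"},  # many MDEGs, no LDEGs
    "Disease B": {"G3", "G7"},        # one MDEG, one LDEG
    "Disease C": {"G7", "G8", "G9"},  # only LDEGs
}

scores = {d: toy_disease_score(genes, mdeg_change, ldegs) for d, genes in kb.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under this toy rule, the disease linked to many strongly changed MDEGs ranks first and the disease linked only to LDEGs ranks last, in line with the qualitative behaviour described above.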

Comparison with existing ML algorithms
We also compared LOADDx and SCADDx with a number of existing machine learning algorithms (see Tables 5, 6, 7). The machine learning algorithms applied are k-NN, Random Forest, XGBoost, Linear SVM, and SVM with RBF kernel.
To compare the performance with existing machine learning algorithms, we applied two validation approaches. The first approach is the single internal validation set approach (see Tables 5, 6) and the second is the LOOCV approach (see Table 7), as explained in the "Experimental design" section. The aim of applying existing machine learning algorithms is to determine a baseline performance that can be obtained on these datasets. The performance of LOADDx and SCADDx can then be assessed through comparison.
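As a reminder of how leave-one-out cross-validation (LOOCV) proceeds, the following minimal sketch holds out each subject once and predicts it from the remaining subjects. The 1-nearest-neighbour predictor and the two-gene toy profiles are assumptions chosen for brevity; they do not reproduce the baselines or data of the study.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def nearest_neighbour(train: List[Sample], x: Sequence[float]) -> str:
    """Predict the label of the closest training sample (squared Euclidean distance)."""
    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(train, key=lambda s: dist(s[0], x))[1]


def loocv_accuracy(
    data: List[Sample],
    predict: Callable[[List[Sample], Sequence[float]], str],
) -> float:
    """Hold out each sample once, fit on the rest, and average the hits."""
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict(train, x) == y
    return hits / len(data)


# Hypothetical two-gene expression changes; labels are illustrative only.
toy_data = [
    ((2.9, 3.1), "infected"),
    ((3.2, 2.8), "infected"),
    ((3.0, 3.0), "infected"),
    ((0.1, -0.2), "healthy"),
    ((-0.1, 0.2), "healthy"),
    ((0.0, 0.1), "healthy"),
]
loocv_acc = loocv_accuracy(toy_data, nearest_neighbour)
```

Because every subject serves as the test case exactly once, LOOCV uses all available data for evaluation, which suits small gene expression cohorts.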
Table 5 shows results obtained using the single internal validation set approach. This table represents the results of SCADDx and LOADDx using both CTD and DisGeNet KBs, with the value of n set to 10. For the optimal parameter selection of both SCADDx and LOADDx, we conducted a grid search over P and Q in the range of 25 to 300 with a step size of 25. Refer to the "Description of existing ML algorithms" section for the criteria used in selecting hyperparameters for the existing machine learning algorithms. Single internal validation set results show that SCADDx and LOADDx are able to detect the infection with up to 100% accuracy in the case of Dataset 2 and Dataset 3 (see Table 5). Overall, SCADDx and LOADDx are able to detect the infection within 72 h of infection with an average accuracy of 91.21% and 87.70% using the CTD KB, and 91.38% and 92.66% using the DisGeNet KB, respectively, considering all four datasets. In contrast, Random Forest and XGBoost, which performed best among the existing machine learning algorithms, can detect the infection with only an average accuracy of 86.43%.
Fig. 2 A visualisation of results of SCADDx using the CTD KB for a single subject and a single dataset (first subject of Dataset 2). A shows the change in gene expression value of all the P MDEGs and Q LDEGs of subject 1. B shows only those genes (MDEGs and LDEGs) of subject 1 which are associated with the disease in the KB that has been assigned rank 1 by SCADDx. C and D show only those genes (MDEGs and LDEGs) of subject 1 which are associated with the diseases in the KB that have been assigned ranks 50 and 100 respectively by SCADDx
We also performed a paired t-test using the test set accuracy of all four datasets. The t-test results in Table 5 show that SCADDx performs significantly better than three of the existing ML algorithms: k-NN, Linear SVM, and SVM with RBF kernel. LOADDx performs significantly better than two: k-NN and SVM with RBF kernel. On average, SCADDx and LOADDx match or outperform the existing algorithms.
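The paired t-test compares two algorithms on the same datasets by testing whether the mean of the per-dataset accuracy differences departs from zero. The sketch below computes the t statistic by hand with the standard library. The accuracy values are hypothetical and are not the figures from Table 5, and the helper name `paired_t_statistic` is an assumption of this example.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return the paired t statistic and degrees of freedom for two matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-dataset accuracies (%) for two algorithms on four datasets.
method_a = [86.0, 97.0, 98.0, 82.0]
method_b = [85.0, 95.0, 95.0, 80.0]

t_stat, dof = paired_t_statistic(method_a, method_b)

# Two-sided 5% critical value of Student's t with 3 degrees of freedom (table value).
T_CRIT_DF3 = 3.182
significant = abs(t_stat) > T_CRIT_DF3
```

With only four datasets there are three degrees of freedom, so the differences must be consistent across datasets for the test to reach significance.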
Table 6 presents SCADDx and LOADDx results at n = 1. It is of course more challenging to correctly detect the single most likely disease (n = 1) than for it to be in a list of the 10 most likely candidates (n = 10). However, SCADDx can still achieve an average accuracy of 88.99% using the CTD KB and 91.38% using the DisGeNet KB within 72 h of infection. Table 6 presents the t-test results, which indicate that SCADDx with both KBs and LOADDx with the DisGeNet KB outperformed k-NN significantly. Overall, these findings suggest that SCADDx and LOADDx are reliable tools for detecting infection, even in challenging circumstances.
Table 7 presents the mean accuracy of various algorithms obtained through LOOCV. The results indicate that SCADDx and LOADDx with the DisGeNet KB consistently outperformed the existing machine learning algorithms on all the datasets. To determine the overall performance of the algorithms, we computed the average accuracy across all the datasets (see Table 7). According to the paired t-test results, SCADDx and LOADDx using both KBs achieved significantly higher accuracy than all the existing algorithms, with a p value < 0.01 (see Table 7).
Based on the results shown in Tables 5, 6 and 7, it can be concluded that overall SCADDx with both KBs performed very well on all the datasets. This shows that the use of magnitudes of MDEGs/LDEGs in combination with a KB can help in gaining better results. In the case of Dataset 3 and Dataset 4, LOADDx also performed similarly to SCADDx, but it did not perform so well on the other datasets. This is due to the fact that LOADDx does not utilize the magnitudes of MDEGs/LDEGs. The reason why the machine learning models could not perform so well is that they do not exploit a KB for prediction; they only use the gene expression data. This suggests that the use of a KB can help in better disease prediction.
Table 6
Comparing the performance of SCADDx and LOADDx with existing machine learning algorithms using the single internal validation set approach (n = 1 for SCADDx and LOADDx). Results in bold denote that they are statistically significant based on the performed t-test. A single asterisk denotes p value < 0.05

We conducted Gene Set Enrichment Analysis (GSEA) [72] by selecting the most important 16 genes (listed in Table 8) across all four gene expression datasets used in this study. We identified these genes by taking the intersection of the top MDEGs (P) across all four datasets used in disease prediction with SCADDx. To perform GSEA, we used the Multi-Ontology Enrichment Tool (MOET; see footnote 10), a web-based enrichment analysis tool that supports multiple ontologies for multiple species, including humans. The results of GSEA, presented in Table 8, show that the 16 genes are strongly associated with Human Influenza. Moreover, the top 10 terms produced by GSEA that are associated with these 16 genes are closely related to RVI. These findings suggest that these 16 genes can serve as important biomarkers and play a crucial role in precision medicine.
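The gene selection step for the enrichment analysis, taking the intersection of the top MDEGs across all datasets, amounts to a set intersection. The sketch below illustrates this with placeholder gene identifiers; the function name `common_top_genes` and the toy lists are hypothetical and do not correspond to the 16 genes in Table 8.

```python
from typing import Dict, List, Set


def common_top_genes(top_mdegs: Dict[str, List[str]]) -> Set[str]:
    """Genes that appear in the top-MDEG list of every dataset."""
    gene_sets = [set(genes) for genes in top_mdegs.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Hypothetical top-MDEG lists for four datasets (placeholder gene identifiers).
top_mdegs_by_dataset = {
    "Dataset 1": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "Dataset 2": ["GENE_A", "GENE_C", "GENE_E"],
    "Dataset 3": ["GENE_C", "GENE_A", "GENE_F"],
    "Dataset 4": ["GENE_A", "GENE_G", "GENE_C"],
}

shared_genes = common_top_genes(top_mdegs_by_dataset)
```

Only genes ranked among the top MDEGs in every dataset survive the intersection, which is why the resulting list is short.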

Conclusions and future work
In this paper, we have proposed two new algorithms, LOADDx and SCADDx, to combine patients' gene expression data with a KB. LOADDx and SCADDx can produce a short personalised ranked list of the most likely diseases with other entities linked with them in the KB for each patient at a requested time-point. We have discovered how a patient's Least Differentially Expressed Genes (LDEGs) along with Most Differentially Expressed Genes (MDEGs) can help in disease diagnosis in the presence of a KB. We identified the potential of LDEGs in such settings and used them for disease diagnosis in combination with a KB. We showed how KBs that do not include link strength information can be used to infer the strength of links in a patient-specific manner, using the patient's gene expression profile. We evaluated both SCADDx and LOADDx using two KBs and four real-world gene expression datasets of respiratory viral infections caused by 19 subtypes of Influenza-like viruses. Additionally, we compared the performance of these algorithms with five existing machine learning algorithms. Our results showed that both SCADDx and LOADDx consistently outperformed the existing machine learning algorithms, as demonstrated by both validation approaches, namely LOOCV and the single internal validation set approach. SCADDx and LOADDx can predict the diseases that a person is most likely to have, at an early stage, with high accuracy, by combining their gene expression data with a KB. We have also provided a visualisation of results that can show the MDEGs and LDEGs associated with the disease in the KB for each subject. Moreover, for each patient, the proposed algorithms can show the changes in gene expression values of the most affected genes together with the computed disease scores and can produce a ranked personalized list of the most likely diseases along with other entities linked with them in the KB, which can support health care professionals in their decision-making.
In future, we intend to perform experiments on subjects who are suffering from multiple diseases. We will also explore how the incorporation of more contextual links in the KB can improve the accuracy of disease diagnosis.

Fig. 1
The novel idea based on which we designed both the proposed algorithms. MDEGs and LDEGs are the abbreviations for Most Differentially Expressed Genes and Least Differentially Expressed Genes respectively

Table 1
Sample of results for the first 5 subjects of Testset 1a using SCADDx on the CTD KB, showing the top 5 diseases for each subject with the 5 most affected genes of the subject. Parameter values: P = 100, Q = 175, m = 5, time t_D ≃ 60 hours

Table 2
Comparison between SCADDx and LOADDx using CTD KB. Parameter values: P = Q = 200 genes for all the four datasets

Table 3
Comparison between SCADDx and LOADDx using CTD KB considering best parameter values (P & Q) for all the four datasets

Table 4
Virus-wise comparison between SCADDx and LOADDx using CTD KB

Table 4 (continued)
+ denotes the number of infected subjects and S denotes the number of total subjects in the testset in that virus category

Footnote 10: https://rgd.mcw.edu/rgdweb/enrichment/start.html

Table 5
Comparing the performance of SCADDx and LOADDx with existing machine learning algorithms using the single internal validation set approach ( n = 10

Table 7
Comparing the performance of LOADDx and SCADDx with the performance of existing machine learning algorithms using the LOOCV approach (n = 10 for SCADDx and LOADDx). Results in bold denote that they are statistically significant based on the performed t-test. A single asterisk denotes p value < 0.05 and a double asterisk denotes p value < 0.01

Table 8
Results of Gene Set Enrichment Analysis performed over the most important 16 genes that are common across all four gene expression datasets used in this study