Predicting potential microbe-disease associations based on autoencoder and graph convolution network
BMC Bioinformatics volume 24, Article number: 476 (2023)
Abstract
A growing body of research has demonstrated the intricate correlation between the human microbiome and human well-being. Microbes can affect the efficacy and toxicity of drugs through various pathways, and can influence the occurrence and metastasis of tumors. In clinical practice, it is crucial to elucidate the associations between microbes and diseases. Although traditional biological experiments identify these associations accurately, they are time-consuming, expensive, and sensitive to experimental conditions, so screening potential microbe-disease associations through extensive experiments is challenging. Computational methods can address these problems, but previous computational methods make limited use of node features and leave room for improvement in prediction accuracy. To address this issue, we propose the DAEGCNDF model for predicting potential associations between microbes and diseases. Our model calculates four similarity features for each microbe and disease. These features are fused to obtain a comprehensive feature matrix representing microbes and diseases. Our model first uses a graph convolutional network module to extract low-rank features with graph information of microbes and diseases, and then uses a deep sparse AutoEncoder to extract high-rank features of microbe-disease pairs, after which the low-rank and high-rank features are concatenated to improve the utilization of node features. Finally, Deep Forest is used to predict potential microbe-disease associations. The experimental results show that combining low-rank and high-rank features helps to improve model performance and that Deep Forest has better classification performance than the baseline model.
Introduction
Microbial communities are collections of microorganisms that live together in the same environment and share a common living space. They are a structural and functional unit that is widely present in ecosystems and can be found on and within all large organisms [1]. Research over the past few decades has shown that microbial communities play a crucial role in human health. During the long process of evolution, microbes form an interdependent and mutually restrictive relationship with the host through individual adaptation and natural selection, while their microenvironment and the immune system remain in a state of dynamic equilibrium [2]. When this dynamic balance is disrupted, the host’s transcription, translation, and DNA repair mechanisms may be affected, which can in turn affect human health. In addition, microbial communities can play a key role in regulating the efficacy and toxicity of anticancer drugs by regulating host immunity and microbial enzyme degradation mechanisms [3]. For example, changes in the structure of the healthy oral microbiome, that is, changes in its taxonomic composition and relative abundance, can lead to dental caries and periodontal disease [4]. Lelouvier et al. [5] revealed the relationship between changes in the blood microbiome of obese patients and liver fibrosis through qualitative and quantitative analysis of blood bacterial DNA. It has been proven that Helicobacter pylori is associated with a variety of gastrointestinal diseases and was classified as a Group 1 carcinogen by the World Health Organization in 2017 [6,7,8,9].
In addition, some microorganisms are considered to be beneficial to human health. One example is Streptococcus thermophilus, which is widely used in the food industry. Among adults undergoing antibiotic treatment, the proportion who suffer from antibiotic-associated diarrhea is lower in those who consume yogurt containing Streptococcus thermophilus than in the control group [10]. Bifidobacterium is distributed in both the human oral cavity and vagina, and is abundant in the human digestive tract. Like Streptococcus thermophilus, it is considered beneficial to human health and is widely used in the food and pharmaceutical industries. It is commonly used in the routine treatment of ulcerative colitis and has been proven to help alleviate the disease [11].
As the above research shows, microbial communities can have a crucial impact on human health through a variety of mechanisms. Identifying potential microbial-disease associations is therefore of great significance for clinical treatment, human health care, drug development, and understanding the relationship between microbes and the human body. Further discovery of such associations not only helps us to better understand the conditions and mechanisms of interaction between microbes and the human body, but also helps to clarify the occurrence and progression of microbe-related diseases, and provides new medical solutions for precision treatment, new drug development, and postoperative intervention. However, the number of proven microbial-disease associations is still far from meeting the demand, so it is imperative to accelerate the identification of potential associations. Because computational models are efficient, inexpensive, and able to predict potential associations on a large scale, models for predicting potential microbial-disease associations have been developed and widely applied. These models can be categorized into four types based on their prediction strategies: matrix decomposition-based methods, label propagation-based methods, path-based methods, and machine learning-based methods.
Although many models for predicting potential microbial-disease associations are based on random walk methods, Qiu et al. [12] have shown that many commonly used random walk methods essentially perform implicit matrix decomposition. We therefore discuss random walk-based methods together with matrix decomposition-based methods. Matrix decomposition methods represent the target matrix as the result of matrix operations on two or more matrices. Shen et al. [13] proposed CMFHMDA, the first microbe-disease association prediction model based on matrix decomposition. CMFHMDA takes the microbe-disease association matrix, the microbe Gaussian similarity kernel, and the disease Gaussian similarity kernel as inputs and then predicts potential microbe-disease associations. Later, Zou et al. [14] proposed the BiRWHMDA model based on bi-random walk, which constructs a microbial similarity network and a disease similarity network from the microbial-disease association matrix, connects these two networks into a heterogeneous microbial-disease association network, and performs a bi-random walk on this heterogeneous network to make predictions. Similar models include BiRWMP [15], NMFMDA [16], MSLINE [17], and MVFA [18]. The main disadvantage of matrix decomposition-based methods is that model performance suffers greatly when the matrix is sparse.
The Label Propagation Algorithm (LPA) is a graph-based semi-supervised learning method. Its basic idea is to propagate labels through the data according to pre-given rules. The algorithm was proposed by Zhu et al. [19] in 2002 and has since been widely used in relation prediction models. For example, Yin et al. [20] and Gao et al. [21] proposed the MDAMSFLP model and the MKLLP model, respectively, both of which use label propagation to predict potential microbial-disease associations. Zhao et al. [22] proposed a model called PLPMDA, which is based on an improved label propagation algorithm called “Precompletion-based Label Propagation” to predict potential microbial-drug associations. Similar models include MDLPHMDA [23] and NBLPIHMDA [24]. LPA is simple and efficient, but its results can vary between iterations and its accuracy is low.
The basic idea of path-based methods is to predict potential relationships by calculating path scores between microbe nodes and disease nodes in a heterogeneous network composed of microbes and diseases. Chen et al. [25] proposed KATZHMDA, the first path-based model for predicting microbial-disease associations. This model first calculates the Gaussian interaction profile kernel similarity for microbes and diseases separately, then calculates the KATZ [26] measure and makes predictions. The authors believe that the Gaussian interaction profile kernel similarity and the KATZ measure play a crucial role in the performance of KATZHMDA. Inspired by KATZHMDA, Li et al. [27] proposed the BWNMHMDA model, which replaces the KATZ measure with a bidirectional recommendation measure and makes predictions on the resulting bidirectional weighted network. Later, considering the advantages of the KATZ measure and the sparsity of the microbial-disease association matrix, Li et al. [28] proposed the KATZBNRA model, based on the Bipartite Network Recommendation Algorithm and the KATZ measure, to predict potential microbial-disease associations. Other path-based models include PBHMDA [29], WMGHMDA [30], and MDPH_HMDA [25]. Methods of this type are limited in their ability to extract high-order structural information from nodes and are also constrained by the definition and selection of paths.
Machine learning methods (including deep learning methods) have been widely applied in association prediction in recent years, such as microbe-disease association prediction, microbe-drug association prediction, miRNA-disease association prediction, and recommendation systems. For example, in the prediction of microbe-drug associations, Long et al. [31] utilized GCN (Graph Convolutional Networks) and Conditional Random Field (CRF) to establish a model named GCNMDA for predicting human microbe-drug associations. Subsequently, they proposed the EGATMDA [32] model based on the hierarchical attention mechanism, which demonstrated superior performance in predicting human microbe-drug associations compared to GCNMDA. Sample imbalance is a major issue faced by these types of methods.
In the field of microbe-disease association prediction, Peng et al. [33] proposed ABHMDA. Considering the low proportion of positive samples, they used the k-means algorithm to cluster negative samples into 23 categories, randomly selected the same number of negative samples from each category, and combined the selected samples into the negative set used for model training. ABHMDA also weights multiple weak classifiers to form a strong classifier that predicts potential microbe-disease associations. Wang et al. [34] proposed the DSAE_RF model based on a deep sparse autoencoder neural network and random forest. DSAE_RF uses the deep sparse autoencoder to extract features of microbe-disease pairs, and then uses the extracted features as inputs to a random forest model to predict potential microbe-disease associations. Inspired by ABHMDA, Wang et al. compared the impact of two negative sampling strategies, k-means-based sampling and simple random sampling, on model performance. The results show that negative sampling through the k-means algorithm can effectively screen reliable negative samples and thereby improve model performance. In addition, graph neural networks have also been well applied in relation prediction. For example, Liu et al. [35] proposed a model based on a multi-component Graph Attention Network (GAT [36]) for microbe-disease association prediction. This model consists of three parts: a decomposer and a combiner based on the attention mechanism, and a predictor based on a fully connected network. Similarly, Li et al. [37] proposed a model named GATMDA based on GAT for predicting miRNA-disease associations. Wang et al. [38] used Principal Component Analysis (PCA) to extract node features, and then used these features as inputs to a two-layer Relational Graph Convolutional Network (RGCN [39]) to predict potential microbe-disease associations. Jiang et al. [40] proposed a model named KGNMDA, which builds a knowledge graph of microorganisms and diseases, uses a graph neural network to learn their representations, and applies a scoring function to predict microbe-disease associations. Models such as MDAGCAN [41], GCNMA [42], and MLAGCNMDA [43] also use graph neural network methods.
Although the methods above have achieved some success in inferring potential microbial-disease associations, they also have drawbacks. For example, models based on graph neural networks can extract node feature information and topological information well, but to prevent “over-smoothing” the number of layers in such models is usually only 2–3, which means that the information obtained by the model consists of low-order features of the nodes. Models based on other neural networks can be made much deeper, but they cannot handle graph-structured data well. Based on this consideration, we propose the DAEGCNDF model. Our model uses a Deep Sparse AutoEncoder neural network (DAE) to extract deep features of microbial-disease pairs and a GCN model to extract low-order features of microbial-disease pairs, then concatenates the deep features with the low-order features and uses Deep Forest for microbial-disease association prediction. The DAE, a model formed by combining stacked and sparse autoencoders and proposed by Lee et al. [44] in 2020, has been widely applied in feature learning and dimension reduction. The Deep Forest (DF) model was proposed by Zhou et al. [45] in 2018. This deep model is an extension of the decision tree model; it has few hyperparameters, determines model complexity in a data-driven way, and does not rely on gradient backpropagation. Experiments show that this model has excellent robustness and performance.
The procedure can be divided into five steps. First, we separately calculate the four similarities of microbes and diseases and fuse them. In the second step, the fused similarity matrix is used as the initial input of the GCN module of the model to extract the low-order feature matrix of microbes and diseases. In the third step, a low-order feature vector of microbe-disease pairs is constructed from the extracted low-order feature matrix. In the fourth step, an initial feature vector of microbe-disease pairs is constructed from the fused similarity matrix, and this initial feature vector is input into the DAE module of the model to extract a high-order feature vector of microbe-disease pairs. In the fifth step, the low-order and high-order feature vectors of microbe-disease pairs are concatenated and used for latent microbe-disease association prediction with Deep Forest. Our experimental results show that the model achieves an average AUC of 0.9700 and an average AUPR of 0.9690 in 10-fold cross-validation, which demonstrates its predictive performance. To further evaluate the model, we also conducted ablation experiments, comparisons of various negative sample selection methods, performance comparisons with other methods, comparisons of various classifiers, and two case studies. The experimental results further verify the performance of DAEGCNDF. In summary, our results help to further understand the relationship between microbes and diseases, can assist in disease diagnosis, treatment and prognosis, and can play a supporting role for traditional biological and medical experiments.
Overall, our research has the following main contributions:

1. We use a deep sparse AutoEncoder neural network to extract high-order feature vectors of microbe-disease pairs.

2. We use GCN to extract low-rank feature matrices of microbes and diseases, and construct low-rank feature vectors of microbe-disease pairs.

3. We concatenate the high-rank feature vectors and low-rank feature vectors of microbe-disease pairs and use Deep Forest for latent microbe-disease association prediction. The experimental results demonstrate the effectiveness of our model.
Materials and methods
Human microbe-disease associations database
Currently, there are three microbial-disease association datasets, namely HMDAD [46], Disbiome [47], and Peryton [48]. Following Wang et al. [34], the data used in this paper are obtained by merging HMDAD, Disbiome, and Peryton. The basic information of the three datasets and of the integrated dataset used in this paper is shown in Tables 1 and 2, respectively. In this paper, the degree refers to the degree of a node in the microbe-disease association network, that is, the number of edges incident to that node. After merging the three datasets, we removed duplicate and irrelevant items. As a result, we obtained 1177 microbes, 134 diseases, and 4499 microbe-disease associations, and the microbe-disease association network was represented as a bipartite graph. An adjacency matrix \(\textbf{Y} \in R^{N_m \times N_d}\) was used to represent the microbe-disease associations. In the matrix \(\textbf{Y}\), the rows represent the \(N_m = 1177\) microbes and the columns represent the \(N_d = 134\) diseases. If a microbe \({m}_{i}\) (\(1 \le i \le N_m\)) is associated with a disease \({d}_{j}\) (\({1 \le j \le N_d}\)), then \(\textbf{Y}_{ij}=1\); otherwise \(\textbf{Y}_{ij}=0\). An entry with \(\textbf{Y}_{ij}=1\) is treated as a positive sample, and any other entry as a negative sample. In this way, we obtained 4499 positive samples from the integrated dataset (MDAID).
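As a minimal illustration of this representation, the following Python sketch builds a small association matrix from index pairs and derives the node degrees. The function name `build_association_matrix` and the toy data are hypothetical; they are not taken from the datasets described above.

```python
import numpy as np

def build_association_matrix(pairs, n_microbes, n_diseases):
    """Build the binary matrix Y (microbes x diseases) from (i, j) index pairs."""
    Y = np.zeros((n_microbes, n_diseases), dtype=int)
    for i, j in pairs:
        Y[i, j] = 1  # known association, i.e. a positive sample
    return Y

# Toy data (hypothetical): 4 microbes, 3 diseases, 5 known associations.
toy_pairs = [(0, 0), (0, 2), (1, 1), (2, 1), (3, 2)]
Y_toy = build_association_matrix(toy_pairs, n_microbes=4, n_diseases=3)

microbe_degree = Y_toy.sum(axis=1)  # diseases linked to each microbe
disease_degree = Y_toy.sum(axis=0)  # microbes linked to each disease
```

Every zero entry of the matrix is a candidate negative sample, which is why negatives greatly outnumber positives in practice.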
Diseases similarity
In this study, we employ four distinct methods to calculate disease similarity: semantic similarity, Gaussian Interaction Profile kernel similarity(GIP), cosine similarity, and sigmoid kernel function similarity.
Diseases semantic similarity
The calculation of disease similarity is very important for downstream tasks. Xuan et al. [49] proposed a method for calculating similarity based on disease ontology information. The disease similarity calculated by this method is called disease semantic similarity, and it has since been widely used in various studies. Disease ontology information can be obtained from the Human Disease Ontology (DO) [50] ( http://www.diseaseontology.org) or the Medical Subject Headings (MeSH) database ( https://www.ncbi.nlm.nih.gov/), and each disease in these two databases can be represented as a Directed Acyclic Graph (DAG). Our calculation of disease semantic similarity is based on the DAG, and the specific steps are as follows. First, let \(DAG({d}_{i}) = ({d}_{i},T({d}_{i}),E({d}_{i}))\) represent the directed acyclic graph of disease \({d}_{i}\), which comprises disease \({d}_{i}\), its ancestor nodes \(T({d}_{i})\), and the set of edges \(E({d}_{i})\) that connect the ancestor nodes to node \({d}_{i}\). The semantic contribution value of disease \({d}_{k}\) to \({d}_{i}\) can then be calculated by using the equation:
In this context, \(d_{k^{'}}\) denotes a child node of \(d_{k}\), and FC is the semantic decay factor. Following Xuan et al. [49], we set \(FC=0.5\). The semantic contribution of disease \(d_{i}\) to itself is defined as 1. It follows from Eq. (1) that the semantic contribution decreases as the distance from disease \(d_{k}\) to disease \(d_{i}\) increases, and increases as this distance decreases. The final semantic value of disease \(d_{i}\) can be calculated by using the formula:
The underlying assumption is that two diseases are more similar the larger the part of their DAGs that they share. Based on this premise, the disease semantic similarity between diseases \(d_{i}\) and \(d_{j}\) can be determined by employing the equation:
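The three steps above can be sketched in Python as follows. The sketch assumes the standard DAG-based formulation associated with [49]: the contribution of an ancestor is the maximum, over its children in the DAG, of FC times the child's contribution; the semantic value is the sum of all contributions; and the similarity is the summed contributions of shared nodes divided by the sum of the two semantic values. The function names and the toy ontology are hypothetical.

```python
def semantic_contributions(disease, parents, fc=0.5):
    """Contribution of `disease` and of each of its ancestors to `disease`."""
    contrib = {disease: 1.0}  # a disease contributes 1 to itself
    stack = [disease]
    while stack:
        child = stack.pop()
        for parent in parents.get(child, []):
            value = fc * contrib[child]
            # Keep the largest contribution reachable through any child.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib

def semantic_similarity(d_i, d_j, parents, fc=0.5):
    """Shared contributions divided by the sum of both semantic values."""
    c_i = semantic_contributions(d_i, parents, fc)
    c_j = semantic_contributions(d_j, parents, fc)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))

# Toy ontology (hypothetical): each key lists its direct parents.
toy_parents = {"C": ["A"], "A": ["root"], "B": ["root"]}
sim_ab = semantic_similarity("A", "B", toy_parents)  # siblings share only the root
sim_ca = semantic_similarity("C", "A", toy_parents)  # child and parent share more
```

In this toy ontology the parent-child pair scores higher than the sibling pair, which matches the intuition that diseases sharing a larger part of their DAGs are more similar.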
Gaussian interaction profile kernel similarity for diseases
Because of its strong performance, GIP has been used in many studies as a complementary measure of similarity for microbes and diseases. Specifically, the Gaussian interaction profile kernel similarity for any two diseases, denoted as \(d_{i}\) and \(d_{j}\), can be determined by using the equation:
In this context, the binary vector \(\textbf{DB}(d_{i})\) is the ith column of the matrix \(\textbf{Y}\), which encodes the associations between disease \(d_{i}\) and all microbes. The term \(N_{d}=134\) indicates the number of diseases. The value of \(\alpha _{d}\) was set to 1, as suggested by Chen et al. [51].
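A minimal NumPy sketch of this kernel is given below. It assumes the usual GIP definition, in which the similarity is the exponential of the negative squared Euclidean distance between two profiles, scaled by a bandwidth equal to \(\alpha\) divided by the mean squared norm of all profiles. With \(\textbf{Y}\) laid out as microbes by diseases, disease profiles are columns and microbe profiles are rows. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def gip_kernel_similarity(profiles, alpha=1.0):
    """GIP similarity between all pairs of rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    gamma = alpha / sq_norms.mean()  # bandwidth from the mean squared norm
    # Squared Euclidean distance between every pair of profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)

# Toy association matrix (hypothetical): 4 microbes x 3 diseases.
Y_gip = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1]])
GD = gip_kernel_similarity(Y_gip.T)  # diseases: columns of Y
GM = gip_kernel_similarity(Y_gip)    # microbes: rows of Y
```

Two nodes with identical interaction profiles receive a similarity of 1, and the similarity decays as the profiles diverge.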
Cosine similarity for diseases
Cosine similarity is used to evaluate the similarity between two vectors by calculating the cosine of the angle between them. It has been widely applied in various research fields and has demonstrated excellent performance [46, 52]. Therefore, this paper also uses cosine similarity to calculate the similarity between diseases. In particular, the cosine similarity between any two diseases, \(d_{i}\) and \(d_{j}\), can be determined by employing the subsequent equation:
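As a hedged sketch of this computation, the snippet below evaluates the cosine similarity between all pairs of row vectors. It assumes that each disease is represented by its binary association profile, and it assigns a similarity of 0 to any pair involving an all-zero profile; both choices are assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(profiles):
    """Cosine of the angle between every pair of rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    norms = np.linalg.norm(profiles, axis=1)
    denom = np.outer(norms, norms)
    dots = profiles @ profiles.T
    # Assumption: pairs involving a zero vector get similarity 0.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Toy profiles (hypothetical); the last one has no known associations.
S_cos = cosine_similarity_matrix([[1, 0, 1],
                                  [1, 0, 0],
                                  [0, 0, 0]])
```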
Sigmoid kernel function similarity for diseases
Studies have demonstrated that the sigmoid kernel function falls under the category of global kernel functions, thereby enabling the effective extraction of global characteristics from samples. The similarity measure derived from the sigmoid kernel function has found application in the research conducted by Han et al. [53] and Wang et al. [34]. Inspired by their work, this paper also employs the sigmoid kernel function similarity measure to ascertain the similarity between diseases and microbes. For any given pair of diseases, \(d_{i}\) and \(d_{j}\), their similarity based on the sigmoid kernel function can be computed as follows:
Microbes similarity
This section presents four distinct computational techniques for determining microbe similarity, namely functional similarity, Gaussian interaction profile kernel similarity, cosine similarity, and sigmoid kernel function similarity.
Microbes functional similarity
The computation of microbial functional similarity rests on the premise that microbes with similar functions are more likely to be linked to similar diseases. Following the method of Liu et al. [54], we assume that any two microbes \(m_{i}\) and \(m_{j}\) are associated with disease groups \(D_{i}=\{d_{ik} \mid 1 \le k \le p\}\) and \(D_{j}=\{d_{jl} \mid 1 \le l \le q\}\) respectively, and the similarity of \(d_{ik}\) with disease group \(D_{j}\) can be calculated by the following formula:
Here \(\textbf{DS}(d_{ik}, d_{jl})\) is the semantic similarity between diseases \(d_{ik}\) and \(d_{jl}\), that is, the element of the disease semantic similarity matrix \(\textbf{DS}\) in the row of \(d_{ik}\) and the column of \(d_{jl}\). Subsequently, the functional similarity between microbes \(m_{i}\) and \(m_{j}\) can be determined as:
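The two formulas can be sketched as follows, assuming the commonly used best-match-average form: the similarity of a disease to a disease group is its maximum semantic similarity to any member of the group, and the functional similarity of two microbes is the sum of these best matches in both directions divided by \(p+q\). The function name, the handling of empty groups, and the toy matrix are hypothetical.

```python
import numpy as np

def functional_similarity(diseases_i, diseases_j, DS):
    """Best-match-average similarity of two microbes' disease groups."""
    if not diseases_i or not diseases_j:
        return 0.0  # assumption: no known diseases means similarity 0
    block = DS[np.ix_(diseases_i, diseases_j)]
    best_for_i = block.max(axis=1)  # best match in D_j for each disease of D_i
    best_for_j = block.max(axis=0)  # best match in D_i for each disease of D_j
    total = best_for_i.sum() + best_for_j.sum()
    return float(total / (len(diseases_i) + len(diseases_j)))

# Toy disease semantic similarity matrix (hypothetical values).
DS_toy = np.array([[1.0, 0.5, 0.2],
                   [0.5, 1.0, 0.4],
                   [0.2, 0.4, 1.0]])
fs_example = functional_similarity([0, 1], [2], DS_toy)
```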
Gaussian interaction profile kernel similarity for microbes
In a manner akin to the previously described method for calculating disease similarities, the GIP similarity between two microbes, denoted as \(m_{i}\) and \(m_{j}\), can be determined as follows:
Here, the binary vector \(\textbf{MB}(m_{i})\) is the ith row of matrix \(\textbf{Y}\), which encodes the associations between microbe \(m_{i}\) and all diseases. Similarly, the value of \(\alpha _{m}\) is set to 1.
Cosine similarity for microbes
In a manner akin to the computation of cosine similarity between two diseases, the cosine similarity between two microbes can be ascertained utilizing the subsequent equation:
Sigmoid kernel function similarity for microbes
Similarly, the sigmoid kernel function similarity between microbes can be computed in the following equation:
Multi-source features fusion for microbes and diseases
The fusion of multi-source features has been proven by many studies to be beneficial in improving model performance. Therefore, we fuse the four disease features and four microbial features above. The fusion operations are performed using Eqs. (14) and (15) respectively to obtain the fused disease and microbial features.
Negative sample selection method
In this study, negative samples far outnumber positive samples. Balancing positive and negative samples and selecting high-quality negative samples for model training can improve model performance, thereby enhancing the efficiency and effectiveness of the model in predicting potential microbe-disease associations. Peng et al. [33] and Wang et al. [34] used the K-Means algorithm, with the parameter k set to 23, to cluster negative samples into 23 classes. They then randomly selected an equal number of samples from each cluster as negative samples and combined them with all positive samples to form the training set. Their experiments showed that selecting negative samples through the K-Means algorithm can improve the model’s AUC and AUPR by about 2\(\%\). Inspired by their work, we used four clustering algorithms for negative sample selection: K-Means, Gaussian mixture, Spectral co-clustering, and Spectral biclustering, and we evaluated these four negative sampling methods. As in the aforementioned research, we retained all positive samples. For experiments on the MDAID dataset we selected 4508 negative samples, and for the HMDAD dataset we selected 450 negative samples.
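The equal-per-cluster selection step can be sketched as below. The sketch assumes that cluster labels for the candidate negative samples have already been produced by one of the clustering algorithms mentioned above; the function name, the seed handling, and the treatment of small clusters are hypothetical. As an arithmetic observation, 4508 = 23 × 196, which would be consistent with drawing 196 samples from each of 23 clusters, although the text does not state this.

```python
import numpy as np

def sample_negatives_by_cluster(cluster_labels, n_total, seed=0):
    """Pick about n_total negative-sample indices, spread evenly over clusters."""
    rng = np.random.default_rng(seed)
    cluster_labels = np.asarray(cluster_labels)
    clusters = np.unique(cluster_labels)
    per_cluster = n_total // len(clusters)
    chosen = []
    for c in clusters:
        members = np.flatnonzero(cluster_labels == c)
        k = min(per_cluster, len(members))  # small clusters give all they have
        chosen.extend(rng.choice(members, size=k, replace=False))
    return np.array(sorted(chosen), dtype=int)

# Toy labels (hypothetical): 30 candidate negatives in 3 clusters of 10.
toy_labels = np.repeat(np.arange(3), 10)
neg_idx = sample_negatives_by_cluster(toy_labels, n_total=6, seed=1)
```

Sampling within clusters rather than uniformly over all candidates keeps the selected negatives spread across the feature space, which is the motivation given in [33] and [34].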
Model framework
Deep AutoEncoder models have good representational efficiency and can extract rich data features. The work of Wang et al. [34] also shows that classification based on features extracted by a deep AutoEncoder model is superior to the baseline model. However, the work of Wang et al. [34] did not fully utilize the information provided by the graph structure. We note that Peng et al. [55] proposed a GCN network based on bipartite graphs to predict potential carcinogenic genes, and their work shows that this network can extract the low-order information provided by the graph structure well. In addition, the Deep Forest model proposed by Zhou et al. [45] outperforms traditional machine learning methods on multiple datasets. Inspired by these works, we designed an effective computational framework, DAEGCNDF, for predicting potential microbial-disease associations. The flowchart of the DAEGCNDF model is shown in Fig. 1, which can be divided into five parts: (1) Similarity calculation (Fig. 1A), (2) Similarity fusion (Fig. 1B), (3) Extraction of low-order features (Fig. 1C), (4) Extraction of high-order features (Fig. 1D), (5) Feature fusion and prediction using the deep forest model (Fig. 1E).
The work of Wang et al. [34] suggests that utilizing the multiple similarities between microbes and diseases can enhance model performance. As shown in Fig. 1A, B, we calculated four types of similarities for both microbes and diseases, and integrated these similarities. To extract the information provided by the graph structure and avoid over-smoothing, as shown in Fig. 1C, we used a two-layer GCN module to extract the low-rank features of the nodes. To compensate for the inability of the GCN module to extract higher-rank information, as shown in Fig. 1D, we introduced a four-layer AutoEncoder model to extract the high-rank features of the nodes. Finally, we concatenated the low-rank features and high-rank features, and used the deep forest model for prediction.
GCN module
A graph convolutional model can learn hidden-layer representations of nodes from the features of neighboring nodes and the local graph structure. This model requires the adjacency matrix of the graph and the feature matrix of the nodes as initial inputs. Following Peng et al. [55], the specific process of the GCN module is as follows: First, matrices \(\textbf{FuM}\) and \(\textbf{FuD}\) are used as the initial features of microbes and diseases. To make the dimensions of these two initial features consistent, we use Eq. (16) for dimension reduction. Then, we use Eq. (17) to aggregate neighborhood features. Finally, we use Eq. (18) for local graph structure learning.
Here \(\textbf{W}^{(0)}_{M} \in R^{1177 \times h_{1}}, \textbf{W}^{(0)}_{D} \in R^{134 \times h_{1}}, \textbf{W}^{(1)}_{1} \in R^{h_{1} \times h_{2}}, \textbf{W}^{(1)}_{2} \in R^{h_{1} \times h_{2}}\) are learnable weight matrices, while \(b_{M}, b_{D}, b_{1}\) are learnable bias vectors with a dimension of \(h_{1}\). \(\textbf{D}_{1}\) and \(\textbf{D}_{2}\) are diagonal matrices whose diagonal entries are \(\sum _{j}\textbf{Y}_{ij}+1\) and \(\sum _{i}\textbf{Y}_{ij}+1\), respectively, and \(\tilde{\textbf{P}}=\textbf{D}^{\frac{1}{2}}_{1} \textbf{Y} \textbf{D}^{\frac{1}{2}}_{2}\). \(\odot\) represents element-wise multiplication.
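The degree-based scaling of \(\textbf{Y}\) can be sketched as follows. The sketch assumes the symmetric normalisation customary for GCNs, in which \(\textbf{Y}\) is multiplied on both sides by the inverse square roots of the degree matrices; the sign of the exponent is an assumption made here, and the function name and toy matrix are hypothetical.

```python
import numpy as np

def normalize_bipartite_adjacency(Y):
    """Scale Y by the inverse square roots of the (degree + 1) matrices."""
    Y = np.asarray(Y, dtype=float)
    d1 = Y.sum(axis=1) + 1.0  # diagonal of D_1: microbe degree + 1
    d2 = Y.sum(axis=0) + 1.0  # diagonal of D_2: disease degree + 1
    return Y / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]

# Toy association matrix (hypothetical): 2 microbes x 3 diseases.
Y_small = np.array([[1, 0, 1],
                    [0, 1, 0]])
P_tilde = normalize_bipartite_adjacency(Y_small)
```

Adding 1 to each degree keeps the scaling well defined for nodes without any known association, and the normalisation prevents high-degree nodes from dominating the aggregated neighborhood features.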
After calculating according to the formula above, as shown in Eq. (19), by adding the aggregated neighborhood features and the learned local graph structure information and activating them with an activation function, we can obtain the low-rank features of nodes with neighbor node features and local graph structure information. It should be noted that Eqs. (17) and (18) constitute the first layer of the GCN module. We can summarize the process above into the following formula:
Where N(M) and N(D) respectively represent the set of neighbors for microbes and diseases in the network. \(\sigma\) represents the ReLU activation function.
Like a general GCN, our GCN module can also stack multiple graph convolution layers. Let l denote the number of graph convolution layers, and let \(\textbf{LM}^{(l)}\) and \(\textbf{LD}^{(l)}\) respectively represent the final microbial features and disease features learned by the GCN model from the microbe-disease network, that is, the low-rank features of microbes and diseases. Formally, a GCN model with \(l \ge 2\) layers can be represented by the following Eq. (20). In this paper, the number of layers in our GCN module is 2, that is, \(l=2,\textbf{LM}=\textbf{LM}^{(l)},\textbf{LD}=\textbf{LD}^{(l)}\).
As shown in Eq. (21), the association matrix \(\textbf{Y}\) of microbes and diseases is reconstructed by using the inner product of the low-rank features of microbes and diseases output by the GCN model. Here, \(\sigma\) represents the sigmoid activation function. In addition, we use Eq. (22) as the loss function for the reconstruction of the microbe-disease association matrix.
Here E represents the edge set of the microbe-disease network, and n is the number of edges. Neg refers to the set of negative samples, which is of size n and obtained by negative sampling, while \(\hat{y}_{ij}\) represents the corresponding entry of the reconstructed adjacency matrix \(\hat{\textbf{Y}}\).
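A hedged sketch of the reconstruction step is shown below. The inner-product decoder with a sigmoid follows the description of Eq. (21); the loss is written as a binary cross-entropy averaged over the n edges and the n sampled negatives, which is an assumed form of Eq. (22) rather than a confirmed one. All function names and the toy inputs are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct_associations(LM, LD):
    """Inner-product decoder: sigmoid of microbe features times disease features."""
    return sigmoid(LM @ LD.T)

def reconstruction_loss(Y_hat, edges, negatives, eps=1e-12):
    """Assumed form: binary cross-entropy over edges and sampled negatives."""
    pos = np.array([Y_hat[i, j] for i, j in edges])
    neg = np.array([Y_hat[i, j] for i, j in negatives])
    total = np.log(pos + eps).sum() + np.log(1.0 - neg + eps).sum()
    return -total / (len(pos) + len(neg))

# Toy features (hypothetical): 2 microbes and 4 diseases with 3-dimensional features.
LM_toy = np.zeros((2, 3))
LD_toy = np.zeros((4, 3))
Y_hat_toy = reconstruct_associations(LM_toy, LD_toy)
loss_toy = reconstruction_loss(Y_hat_toy, edges=[(0, 0)], negatives=[(1, 1)])
```

With all-zero features every reconstructed entry equals 0.5, so the loss equals log 2, the value expected from an uninformative predictor.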
Deep autoencoder module
Deep AutoEncoder is an unsupervised learning model that can efficiently learn the latent information of sample data. This model typically consists of an encoder and a decoder. The aim of the deep AutoEncoder is to reconstruct the input, thereby enabling the neural network to learn the most informative latent features of the input data, making it widely used in feature extraction.
For any disease \(d_{i}\), we take the ith row \(\textbf{FuD}_{i}\) of matrix \(\textbf{FuD}\) as its initial feature vector; similarly, for any microbe \(m_{j}\), we take the jth row \(\textbf{FuM}_{j}\) of matrix \(\textbf{FuM}\) as its initial feature vector. We concatenate \(\textbf{FuD}_{i}\) and \(\textbf{FuM}_{j}\) to obtain the feature vector of the disease-microbe pair \(d_im_j\), whose dimension is 1311 (134 + 1177). We use a deep AutoEncoder to extract the effective features of disease-microbe pairs. Specifically, the encoder and decoder of the model can be represented by Eqs. (23) and (24) respectively.
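The construction of the pair feature vector can be illustrated with the short sketch below. Identity matrices stand in for the fused similarity matrices purely as placeholders; the function name is hypothetical.

```python
import numpy as np

def pair_feature(FuD, FuM, i, j):
    """Concatenate row i of FuD (disease) with row j of FuM (microbe)."""
    return np.concatenate([FuD[i], FuM[j]])

n_diseases, n_microbes = 134, 1177
FuD_demo = np.eye(n_diseases)  # placeholder for the fused disease similarity
FuM_demo = np.eye(n_microbes)  # placeholder for the fused microbe similarity
x_pair = pair_feature(FuD_demo, FuM_demo, i=3, j=10)
```

The resulting vector has 134 + 1177 = 1311 entries, which matches the input dimension of the AutoEncoder stated above.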
Here, \(k \ge 1\) and \(t\ge 1\) index the layers of the encoder and decoder, respectively. Following the study of Wang et al. [34], we use 4 layers for each. \(\sigma ^{(k)}_{e}\) and \(\sigma ^{(t)}_{d}\) represent the activation functions of the encoder and decoder, respectively; in this paper, both are set to the sigmoid function. \(\textbf{W}^{(k)}_{e}\), \(b^{(k)}_{e}\) and \(\textbf{W}^{(t)}_{d}\), \(b^{(t)}_{d}\) are the learnable parameters of the encoder and decoder. In addition, \(z^{(0)}\) is the initial input data x, and \(x^{(0)}=z^{(4)}\).
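As an illustration of the encoder and decoder described above, the following sketch applies four sigmoid layers in each direction. It assumes the usual layer form (an affine map followed by the activation), since Eqs. (23) and (24) are not reproduced here. The layer widths follow the 1311-dimensional input and 144-dimensional bottleneck reported in the parameter settings, with the decoder mirroring the encoder; the weight initialisation and helper names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_layers(dims, rng):
    """One (W, b) pair per consecutive pair of layer widths."""
    return [(rng.normal(scale=0.01, size=(d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward(h, layers):
    """Apply h <- sigmoid(h W + b) layer by layer (assumed layer form)."""
    for W, b in layers:
        h = sigmoid(h @ W + b)
    return h

rng = np.random.default_rng(0)
enc_dims = [1311, 1152, 576, 288, 144]   # four encoder layers
dec_dims = enc_dims[::-1]                # four decoder layers, mirrored
encoder = init_layers(enc_dims, rng)
decoder = init_layers(dec_dims, rng)

x = rng.random((2, 1311))     # two toy disease-microbe pair vectors, z^(0) = x
z = forward(x, encoder)       # z^(4): bottleneck representation
x_rec = forward(z, decoder)   # reconstruction of the input
```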
As shown in Eq. (25), the model’s loss is composed of mean squared error and KL divergence, where \(\theta\) is the weight coefficient.
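A minimal sketch of such a loss is given below. It assumes the standard sparse-autoencoder penalty, in which the KL divergence compares a target activation level with the mean activation of each bottleneck unit; the exact form of Eq. (25) is not reproduced in this text. The target sparsity `rho` and the value of `theta` are hypothetical placeholders, not settings reported by the authors.

```python
import numpy as np

def kl_sparsity(z, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || rho_hat), rho_hat = mean activation."""
    rho_hat = np.clip(z.mean(axis=0), eps, 1 - eps)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparse_ae_loss(x, x_rec, z, theta=1e-3, rho=0.05):
    """Mean squared reconstruction error plus theta times the KL penalty."""
    mse = np.mean((x - x_rec) ** 2)
    return mse + theta * kl_sparsity(z, rho)

x = np.array([[0.0, 1.0], [1.0, 0.0]])
x_rec = np.array([[0.1, 0.9], [0.8, 0.2]])
z = np.array([[0.05, 0.5], [0.05, 0.5]])   # toy bottleneck activations
loss = sparse_ae_loss(x, x_rec, z)
```

The penalty vanishes when every unit's mean activation equals the target and grows as activations drift away from it, which is what pushes the bottleneck towards sparse codes.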
Ultimately, the \(z^{(4)}\) obtained by the model is treated as the high-rank feature vector of the disease-microbe pair.
Prediction of microbe-disease associations by deep forest model
Deep Forest is a decision-tree ensemble method proposed by Zhou et al. in 2018 [45]. This method first preprocesses the input features using multi-granularity scanning, then feeds the resulting feature vectors into a cascade forest for training, and uses cross-validation to generate each cascade level, which helps to avoid overfitting. As shown in Fig. 1E, we take the ith row \(\textbf{LD}_{i}\) of the low-rank disease feature matrix \(\textbf{LD}\) and the jth row \(\textbf{LM}_{j}\) of the low-rank microbe feature matrix \(\textbf{LM}\), both extracted by the GCN module, as the low-rank feature vectors of disease \(d_i\) and microbe \(m_j\), respectively. Concatenating \(\textbf{LD}_{i}\) and \(\textbf{LM}_{j}\) gives the low-rank feature vector of the disease-microbe pair \(d_im_j\). We then concatenate the high-rank feature vector and the low-rank feature vector to obtain the final feature vector of the disease-microbe pair. Finally, we input this final feature vector into the Deep Forest model to predict latent microbe-disease associations.
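The feature assembly described above can be sketched as follows. The widths used here (128 per node from the GCN output layer and 144 from the AutoEncoder bottleneck) are taken from the parameter settings reported in the Results, so the resulting 400-dimensional pair vector is a derived figure rather than one stated by the authors. The function name and the dictionary `H` holding the high-rank vectors are hypothetical.

```python
import numpy as np

def pair_feature(LD, LM, H, i, j):
    """Final vector for pair (d_i, m_j): high-rank part, then low-rank part."""
    low_rank = np.concatenate([LD[i], LM[j]])      # disease row, then microbe row
    return np.concatenate([H[(i, j)], low_rank])

rng = np.random.default_rng(0)
LD = rng.random((3, 128))           # toy low-rank disease features
LM = rng.random((5, 128))           # toy low-rank microbe features
H = {(0, 2): rng.random(144)}       # toy high-rank feature of pair (d_0, m_2)
v = pair_feature(LD, LM, H, 0, 2)
```

Each such vector, together with its 0/1 label, would form one training sample for the classifier.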
Results
Parameter details and model evaluation
We implemented our model using PyTorch and PyG; both the GCN module and the deep AutoEncoder module use Adam as the optimizer. For the GCN module, we set the number of network layers to 2, with the dimensions of the hidden layer and output layer set to 256 and 128, respectively. We used a dropout rate of 0.5, and set the number of training iterations and the learning rate to 1000 and 0.001, respectively. For the deep AutoEncoder module, as previously mentioned, we set the number of layers for both the encoder and decoder to 4, with the layer dimensions being 1311, 1152, 576, 288, 144, 288, 576, 1152, and 1311, respectively (see Fig. 1E). The number of training iterations and the initial learning rate were set to 150 and 0.01, respectively, with ReduceLROnPlateau used to adjust the learning rate automatically. For the Deep Forest model, we set 'n_estimators' and 'criterion' to 17 and 'entropy', respectively.
In this study, we conducted experiments using 10-fold cross-validation and evaluated the model using a variety of metrics, namely AUC, AUPR, Recall, Precision (Pre), Accuracy (Acc), and F1-score. Considering that MDAID is a large dataset, to further demonstrate the performance of our model, we also conducted experiments on the HMDAD dataset. As indicated in Table 3, our model achieved good performance on both datasets.
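For reference, the threshold-based metrics listed above follow directly from the confusion counts. The sketch below uses the standard definitions with a toy set of labels; the function name is hypothetical, and AUC and AUPR are omitted because they require ranked prediction scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, accuracy and F1 from binary labels (1 = association)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"Pre": precision, "Recall": recall, "Acc": accuracy, "F1": f1}

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In 10-fold cross-validation these values would be computed on each held-out fold and then averaged.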
Comparison of methods for selecting negative samples
In the microbe-disease association matrix \(\textbf{Y}\), a value of “1” indicates a known microbe-disease association, that is, a positive sample, whereas a value of “0” marks a pair that is either truly unassociated or simply not yet verified. The zero entries may therefore contain false negatives, which makes the selection of reliable negative samples important during model training. Wang et al. [34] and Peng et al. [33] employed K-Means clustering to group negative samples into 23 categories and then randomly selected 196 negative samples from each category, resulting in a total of 4508 negative training samples. The advantage of this approach is that every type of data feature contributes negative samples to training, which avoids biased learning. In this study, we compare five methods for selecting negative samples: random sampling, K-Means clustering sampling, Gaussian mixture clustering sampling, spectral co-clustering sampling, and spectral biclustering sampling.
As shown in Table 4, sampling negative samples by clustering methods can effectively improve model performance. Among them, K-Means clustering sampling gives the largest gain, improving model performance by about 4\(\%\) compared to random sampling. Gaussian mixture clustering sampling yields almost the same improvement.
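The per-cluster sampling step can be sketched as below. The code assumes that cluster labels for the candidate negative pairs have already been produced by some clustering routine (for example a K-Means implementation); only the stratified draw is shown. The function name, the seed, and the toy cluster sizes are hypothetical.

```python
import random
from collections import defaultdict

def sample_negatives_by_cluster(labels, per_cluster, seed=0):
    """Draw up to per_cluster indices at random from every cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cluster in enumerate(labels):
        by_cluster[cluster].append(idx)
    chosen = []
    for cluster in sorted(by_cluster):
        members = by_cluster[cluster]
        k = min(per_cluster, len(members))   # small clusters contribute all members
        chosen.extend(rng.sample(members, k))
    return chosen

# Toy labels: 23 clusters with 300 candidate negatives each.
labels = [i % 23 for i in range(23 * 300)]
negatives = sample_negatives_by_cluster(labels, per_cluster=196)
```

With 23 clusters and 196 draws per cluster this returns 23 × 196 = 4508 indices, the negative-set size quoted above.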
Ablation experiments
To evaluate the impact of low-rank and high-rank features on the predictive performance of the model, we divided the features of the disease-microbe pairs into three groups: LRF, HRF, and LHRF. Group LRF represents predictions made using only low-rank features, Group HRF represents predictions made using only high-rank features, and Group LHRF represents predictions made after concatenating low-rank and high-rank features.
From Table 5, we can see that the low-rank features of disease-microbe pairs contribute more to model performance than the high-rank features. This may be because the GCN module effectively aggregates the features of diseases and microbes through neighboring nodes. Furthermore, when low-rank and high-rank features are combined, the model outperforms predictions made with either feature type alone.
Comparison of different classifiers
To evaluate the contribution of Deep Forest (DF) to predictive performance, we selected nine benchmark models: a three-layer MLP neural network commonly used as a benchmark model, and eight traditional machine learning models, namely Logistic Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT), AdaBoost Classifier (ABC), Gradient Boosting Classifier (GBC), K-Nearest Neighbors (KNN), and Random Forest (RF). The prediction results are shown in Table 6.
As can be seen from the results in Table 6, the Deep Forest classifier outperforms the other nine benchmark classifiers across all evaluation metrics. Furthermore, these results indicate that while Random Forest outperforms other traditional machine learning models, Deep Forest, as an improved model of Random Forest, demonstrates superior performance. Therefore, our choice of Deep Forest as the final classifier is both reasonable and reliable.
Comparison with other methods
To further evaluate the performance of our model, we selected six of the latest microbe-disease association prediction methods for comparison with our model, based on the dataset in this paper and 10-fold cross-validation. The names of the models and the experimental results are shown in Table 7.
From the experimental results in Table 7, it is evident that our model, DAEGCNDF, outperforms the benchmark models in terms of AUC and AUPR values. Specifically, our model achieved an AUC value of \(97.00\%\) and an AUPR value of \(96.90\%\), which are approximately \(2.22\%\) and \(2.59\%\) higher than the second-place model, respectively. We attribute the performance of our DAEGCNDF model to four main reasons. Firstly, the GCN module effectively captures low-rank features from the bipartite graph of microbes and diseases. Secondly, the DAE module extracts complex high-rank features from disease-microbe pairs and, in doing so, removes noise present in the initial features. Thirdly, combining low-rank and high-rank features gives a better representation of disease-microbe pairs and consequently improves classifier performance. Lastly, the cascade structure of Deep Forest makes effective use of the input features for prediction.
Case studies
To evaluate the performance of DAEGCNDF further, we conducted two types of case studies: predicting potential microbe-disease associations based on known information, and predicting new microbe-disease associations when no information is known. In the first type of case study, all known microbe-disease associations were used for training. Predictions were then made for all unknown associations of a given disease and ranked by prediction score. Finally, the top ten microbes with the highest scores were validated against the literature. In the second type of case study, the disease under study was treated as a completely new disease: its associations with microbes were removed before model training, so that the model had no information about this disease during training. As in the first type of case study, we ranked the scores of all microbes for the disease and validated the top 10 microbes against the relevant literature. The second type of case study assesses the model's ability to predict microbial associations for new diseases when no prior disease-microbe information is available. This reflects how well our model can guide actual experiments.
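The ranking step shared by both case-study settings can be sketched as follows. Given prediction scores for one disease, the sketch drops microbes whose association is already known and returns the highest-scoring remainder. The function name and the toy microbe identifiers are hypothetical; in the second setting the set of known associations would simply be empty.

```python
def top_k_candidates(scores, known, k=10):
    """Rank microbes without a known association by descending score."""
    candidates = [(m, s) for m, s in scores.items() if m not in known]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [m for m, _ in candidates[:k]]

# Toy scores for one disease; "microbe_c" is already a known association.
scores = {"microbe_a": 0.90, "microbe_b": 0.80,
          "microbe_c": 0.95, "microbe_d": 0.10}
first_setting = top_k_candidates(scores, known={"microbe_c"}, k=2)
second_setting = top_k_candidates(scores, known=set(), k=2)
```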
Colorectal cancer is a common malignant tumor in the gastrointestinal tract, with early symptoms often not obvious [59]. As a result, about 20\(\%\) of newly diagnosed colorectal cancer patients have already experienced cancer cell metastasis [60]. Early diagnosis of colorectal cancer is of great significance for the treatment of the disease and improving the survival time of patients [61]. Although the cause of its onset is not yet fully understood, more and more evidence suggests that gut microbes have an impact on the occurrence, progression, metastasis, treatment, and prognosis of colorectal cancer. For example, Gao et al. [62] found that Lactococcus and Fusobacterium are relatively enriched in colorectal cancer tissues. Wang et al. [63] found that Salmonella enterica is involved in the progression of colorectal cancer. Therefore, further study of the relationship between colorectal cancer and microbes will deepen our understanding of its pathogenesis and is of great significance for its early screening, auxiliary diagnosis, and assistance. In view of this, we chose colorectal cancer for the two types of case studies above. As can be seen from Table 8, in the first type of case study, 8 out of the top 10 microbes predicted to be associated with colorectal cancer were confirmed by literature. In addition, in the second type of case study (see Table 9), all of the top 10 microbes predicted to be associated with colorectal cancer were confirmed by literature.
Autoimmune hepatitis is a chronic progressive inflammatory disease of the liver mediated by autoimmune reactions, which can manifest in acute or chronic forms [64, 65]. In severe cases, it can rapidly progress to cirrhosis and liver failure, threatening life [66]. The disease occurs worldwide, with an incidence rate exceeding forty-two per hundred thousand in certain ethnic groups [67]. The disease requires timely and long-term treatment, and untimely or improper treatment can greatly affect the patient’s 10-year survival rate [68]. A large body of research has confirmed that autoimmune hepatitis is related to changes in the composition of the gut microbiota. For example, Liwinski et al. [69] found that Bifidobacterium affects the remission of autoimmune hepatitis. Wei et al. [70] found that Veillonella not only has a strong correlation with autoimmune hepatitis but also affects the progression of hepatitis. Lou et al. [71] found that a combination of Bacteroides, Lachnospiraceae, Veillonella, Roseburia, and Ruminococcaceae can distinguish autoimmune hepatitis patients from healthy controls, suggesting that certain microbes or their combinations can serve as markers for autoimmune hepatitis. Therefore, it is practically significant to choose autoimmune hepatitis as a case study. Tables 10 and 11 show that, of the top 10 microbes predicted to be potentially associated with autoimmune hepatitis, 8 have been validated by literature. Furthermore, among the top 10 microbes predicted to form new associations with autoimmune hepatitis, 5 have been substantiated by literature.
Across the four experiments in these case studies (two diseases, each under both settings), our model performs well. This suggests that the model can provide practical guidance. Consequently, its predictions can be used to improve the efficiency of traditional biomedical experiments and reduce their duration.
Discussion and conclusion
The human body is a vast ecosystem teeming with microbes, many of which play a pivotal role in our health and the onset, progression, and treatment of diseases. As such, understanding the intricate relationships between these microbes and diseases is crucial for disease prevention, clinical practice, and biomedical research. Traditional biomedical experiments in this field often face hurdles due to their lengthy duration, high costs, and strict requirements for experimental conditions. While computational methods offer a way to circumvent these challenges to some degree, they are not without their own limitations. These include the inadequate extraction and utilization of data features, less-than-optimal methods for selecting reliable negative samples, and a lack of precision in model predictions.
In this study, we introduce DAEGCNDF, a novel computational model designed to predict associations between microbes and diseases. Our approach involves calculating four distinct types of similarity for both microbes and diseases, which are then fused to generate a comprehensive set of initial features. We employ GCN to extract low-rank features of diseases and microbes, while the DAE module is used to distill high-rank features of disease-microbe pairs. In the process of selecting negative samples for training, we compared five different sampling methods to ensure the selection of reliable negative samples. Our findings indicate that K-Means clustering sampling and Gaussian mixture clustering sampling enhance model performance by approximately 4\(\%\). In the final step, we concatenate the low-rank and high-rank features of disease-microbe pairs and use a deep forest to predict potential microbe-disease associations. Through ablation experiments, classifier selection experiments, and case studies, our computational framework demonstrates significant potential in identifying potential microbe-disease associations.
From the experimental results, the performance of our model is superior to that of the baseline models, and we believe there are four main reasons. First, the GCN variant module suitable for bipartite graphs can effectively extract the low-rank information of nodes. Second, the DAE module can effectively extract the high-rank features of the microbe-disease pair. Third, unlike the traditional random selection of negative samples, we used K-Means clustering for negative sampling. Fourth, the deep forest classifier outperforms traditional machine learning methods.
Nonetheless, our model does have certain limitations that warrant further refinement in the future. This includes the need to devise superior methods for selecting reliable negative samples and to delve into the mathematical principles that underpin the differences in these methods. Moreover, the interplay between drugs, ncRNA, microbes, and diseases presents an opportunity for extracting novel features of microbes and diseases. This is an area that is yet to be fully explored. Our future work will concentrate on these two pivotal aspects.
Availability of data and materials
The datasets and corresponding codes are available at https://github.com/cuntjx/microbe.
References
Finlay BJ, Clarke KJ. Ubiquitous dispersal of microbial species. Nature. 1999;400(6747):828–828.
Zhou YD, Liang FX, Tian HR, Luo D, Wang YY, Yang SR. Mechanisms of gut microbiotaimmunehost interaction on glucose regulation in type 2 diabetes. Front Microbiol. 2023;14:1121695.
Jiayuan H, Wenting L, Wanying K, Yulong H, Ruifu Y, Xiangyu M, Wenjing Z. Effects of microbiota on anticancer drugs: current knowledge and potential applications. EBioMedicine. 2022;83:19.
Tanner ACR, Kressirer CA, Rothmiller S, Johansson I, Chalmers NI. The caries microbiome: implications for reversing dysbiosis. Adv Dent Res. 2018;29(1):78–85.
Lelouvier B, Servant F, Païssé S, Brunet AC, Benyahya S, Serino M, Valle C, Ortiz MR, Puig J, Courtney M, et al. Changes in blood microbiota profiles associated with liver fibrosis in obese patients: a pilot analysis. Hepatology. 2016;64(6):2015–27.
Hatakeyama M, Higashi H. Helicobacter pylori caga: a new paradigm for bacterial carcinogenesis. Cancer Sci. 2005;96(12):835–43.
Dumrese C, Slomianka L, Ziegler U, Choi SS, Kalia A, Fulurija A, Wei L, Berg DE, Benghezal M, Marshall B, et al. The secreted helicobacter cysteinerich protein a causes adherence of human monocytes and differentiation into a macrophagelike phenotype. FEBS Lett. 2009;583(10):1637–43.
Sajib S, Zahra FT, Lionakis MS, German NA, Mikelis CM. Mechanisms of angiogenesis in microberegulated inflammatory and neoplastic conditions. Angiogenesis. 2018;21:1–14.
LairdFick HS, Saini S, Hillard JR. Gastric adenocarcinoma: the role of helicobacter pylori in pathogenesis and prevention efforts. Postgrad Med J. 2016;92(1090):471–7.
Beniwal RS, Arena VC, Thomas L, Narla S, Imperiale TF, Chaudhry RA, Ahmad UA. A randomized trial of yogurt for prevention of antibioticassociated diarrhea. Dig Dis Sci. 2003;48:2077–82.
Ghouri Yezaz A, Richards David M, Rahimi Erik F, Krill Joseph T, Jelinek Katherine A, DuPont AW. Systematic review of randomized controlled trials of probiotics, prebiotics, and synbiotics in inflammatory bowel disease. Clin Exp Gastroenterol. 2014;8:473–87.
Qiu J, Dong Y, Ma H, Li J, Wang K, Tang J. Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In: Proceedings of the eleventh ACM international conference on web search and data mining. 2018;459–67.
Shen Z, Jiang Z, Bao W. Cmfhmda: collaborative matrix factorization for human microbedisease association prediction. In: Intelligent computing theories and application: 13th international conference, ICIC 2017, Liverpool, UK, August 7–10, 2017, Proceedings, Part II 13. Springer; 2017. pp. 261–269.
Zou S, Zhang J, Zhang Z. A novel approach for predicting microbedisease associations by birandom walk on the heterogeneous network. PLoS ONE. 2017;12(9): e0184394.
Shen X, Zhu H, Jiang X, Hu X, Yang J. A novel approach based on birandom walk to predict microbedisease associations. In: Intelligent computing methodologies: 14th international conference, ICIC 2018, Wuhan, China, August 15–18, 2018, proceedings, Part III 14. Springer; 2018. p. 746–752.
Liu Y, Wang SL, Zhang JF. Prediction of microbedisease associations by graph regularized nonnegative matrix factorization. J Comput Biol. 2018;25(12):1385–94.
Wang Y, Lei X, Cheng L, Pan Y. Predicting microbedisease association based on multiple similarities and line algorithm. IEEE/ACM Trans Comput Biol Bioinf. 2021;19(4):2399–408.
Peng W, Liu M, Dai W, Chen T, Fu Y, Pan Y. Multiview feature aggregation for predicting microbedisease association. IEEE/ACM Trans Comput Biol Bioinform. 2021.
Zhu X, Ghahramani Z. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University; 2002.
Yin MM, Gao YL, Shang J, Zheng CH, Liu JX. Multisimilarity fusionbased label propagation for predicting microbes potentially associated with diseases. Futur Gener Comput Syst. 2022;134:247–55.
Gao YL, Yin MM, Liu JX, Shang J, Zheng CH. Mkllp: predicting diseaseassociated microbes with multiplesimilarity kernel learningbased label propagation. In: International symposium on bioinformatics research and applications. Springer; 2021. pp. 3–10.
Zhao H, Duan G, Yang B, Li S, Wang J. Predicting of microbedrug associations via a precompletionbased label propagation algorithm. In: 2022 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE; 2022. p. 686–691.
Jia Q, Zhao Y, Yin J. Identification and analysis of human microbedisease associations by matrix decomposition and label propagation. Front Microbiol. 2019;10:291.
Wang L, Wang Y, Li H, Feng X, Yuan D, Yang J. A bidirectional label propagation based computational model for potential microbedisease association prediction. Front Microbiol. 2019;10:684.
Chen X, Huang YA, You ZH, Yan GY, Wang XS. A novel approach based on katz measure to predict associations of human microbiota with noninfectious diseases. Bioinformatics. 2017;33(5):733–9.
Katz L. A new status index derived from sociometric analysis. Psychometrika. 1953;18(1):39–43.
Li H, Wang Y, Jiang J, Zhao H, Feng X, Zhao B, Wang L. A novel human microbedisease association prediction method based on the bidirectional weighted network. Front Microbiol. 2019;10:676.
Li S, Xie M, Liu X. A novel approach based on bipartite network recommendation and katz model to predict potential microdisease associations. Front Genet. 2019;10:1147.
Huang ZA, Chen X, Zhu Z, Liu H, Yan GY, You ZH, Wen Z. Pbhmda: pathbased human microbedisease association prediction. Front Microbiol. 2017;8:233.
Long Y, Luo J. Wmghmda: a novel weighted metagraphbased model for predicting human microbedisease association on heterogeneous information network. BMC Bioinform. 2019;20:1–18.
Long Y, Min W, Kwoh CK, Luo J, Li X. Predicting human microbedrug associations via graph convolutional network with conditional random field. Bioinformatics. 2020;36(19):4918–27.
Long Y, Min W, Liu Y, Kwoh CK, Luo J, Li X. Ensembling graph attention networks for human microbedrug association prediction. Bioinformatics. 2020;36(Supplement2):i779–86.
Peng LH, Yin J, Zhou L, Liu MX, Zhao Y. Human microbedisease association prediction based on adaptive boosting. Front Microbiol. 2018;9:2440.
Wang L, Wang Y, Xuan C, Zhang B, Hanwen W, Gao J. Predicting potential microbedisease associations based on multisource features and deep learning. Brief Bioinform. 2023;24(4):bbad255.
Liu D, Liu J, Luo Y, He Q, Deng L. Mgatmda: predicting microbedisease associations via multicomponent graph attention network. IEEE/ACM Trans Comput Biol Bioinf. 2021;19(6):3578–85.
Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y. Graph attention networks 2017. arXiv:1710.10903.
Li G, Fang T, Zhang Y, Liang C, Xiao Q, Luo J. Predicting mirnadisease associations based on graph attention network with multisource information. BMC Bioinform. 2022;23(1):244.
Wang Y, Lei X, Pan Y. Microbedisease association prediction using RGCN through microbedrugdisease network. IEEE/ACM Trans Comput Biol Bioinform. 2023.
Schlichtkrull M, Kipf TN, Bloem P, Van Den Berg R, Titov I, Welling M. Modeling relational data with graph convolutional networks. In: The semantic web: 15th international conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, proceedings 15. Springer; 2018. p. 593–607.
Jiang C, Tang M, Jin S, Huang W, Liu X. Kgnmda: a knowledge graph neural network method for predicting microbedisease associations. IEEE/ACM Trans Comput Biol Bioinf. 2022;20(2):1147–55.
Shi K, Li L, Wang Z, Chen H, Chen Z, Fang S. Identifying microbedisease association based on graph convolutional attention network: case study of liver cirrhosis and epilepsy. Front Neurosci. 2023;16:1124315.
Wang L, Yang X, Kuang L, Zhang Z, Zeng B, Chen Z. Graph convolutional neural network with multilayer attention mechanism for predicting potential microbedisease associations. Curr Bioinform. 2023;18(6):497–508.
Shi K, Li L, Yu J, Zhang Y, Xie X. Predicting microbedisease associations via multiple layer graph convolutional network and attention mechanism. In: Proceedings of the 2022 11th international conference on bioinformatics and biomedical science, 2022. p. 59–65.
Lee J, Pak J, Lee M. Network intrusion detection system using feature extraction based on deep sparse autoencoder. In: 2020 International conference on information and communication technology convergence (ICTC). IEEE; 2020. p. 1282–1287.
Zhou ZH, Feng J. Deep forest. Natl Sci Rev. 2019;6(1):74–86.
Wei Ma L, Zhang PZ, Huang C, Li J, Geng B, Yang J, Kong W, Zhou X, Cui Q. An analysis of human microbedisease associations. Brief Bioinform. 2017;18(1):85–97.
Janssens Y, Nielandt J, Bronselaer A, Debunne N, Verbeke F, Wynendaele E, Van Immerseel F, Vandewynckel YP, De Tré G, De Spiegeleer B. Disbiome database: linking the microbiome to disease. BMC Microbiol. 2018;18(1):1–6.
Skoufos G, Kardaras FS, Alexiou A, Kavakiotis I, Lambropoulou A, Kotsira V, Tastsoglou S, Hatzigeorgiou AG. Peryton: a manual collection of experimentally supported microbedisease associations. Nucleic Acids Res. 2021;49(D1):D1328–33.
Xuan P, Han K, Guo M, Guo Y, Li J, Ding J, Liu Y, Dai Q, Li J, Teng Z, et al. Prediction of micrornas associated with human diseases based on weighted k most similar neighbors. PLoS ONE. 2013;8(8): e70204.
Schriml LM, Mitraka E, Munro J, Tauber B, Schor M, Nickle L, Felix V, Jeng L, Bearer C, Lichenstein R, et al. Human disease ontology 2018 update: classification, content and workflow expansion. Nucleic Acids Res. 2019;47(D1):D955–62.
Chen X, Yan CC, Zhang X, You ZH, Deng L, Liu Y, Zhang Y, Dai Q. Wbsmda: within and between score for mirnadisease association prediction. Sci Rep. 2016;6(1):21106.
Chuanyan W, Gao R, Zhang D, Han S, Zhang Y. Prwhmda: human microbedisease association prediction by random walk on the heterogeneous network with pso. Int J Biol Sci. 2018;14(8):849.
Jiang HJ, You ZH, Huang YA. Predicting drugdisease associations via sigmoid kernelbased convolutional neural networks. J Transl Med. 2019;17(1):1–11.
Liu JX, Yin MM, Gao YL, Shang J, Zheng CH. Msflrr: multisimilarity information fusion through lowrank representation to predict diseaseassociated microbes. IEEE/ACM Trans Comput Biol Bioinf. 2022;20(1):534–43.
Peng W, Wu R, Dai W, Ning Y, Fu X, Liu L, Liu L. Mirnagene network embedding for predicting cancer driver genes. Brief Funct Genom. 2023;23:elac059.
Luo J, Long Y. Ntshmda: prediction of human microbedisease association based on random walk by integrating network topological similarity. IEEE/ACM Trans Comput Biol Bioinf. 2018;17(4):1341–51.
Bao W, Jiang Z, Huang DS. Novel human microbedisease association prediction using network consistency projection. BMC Bioinform. 2017;18:173–81.
Wang F, Huang ZA, Chen X, Zhu Z, Wen Z, Zhao J, Yan GY. Lrlshmda: Laplacian regularized least squares for human microbedisease association prediction. Sci Rep. 2017;7(1):7601.
Jemal A, Bray F, Center MM, Ferlay J, Ward E, Forman D. Global cancer statistics. CA Cancer J Clin. 2011;61(2):69–90.
Biller LH, Schrag D. Diagnosis and treatment of metastatic colorectal cancer: a review. JAMA. 2021;325(7):669–85.
Torre Lindsey A, Bray Freddie SRL, Ferlay J, LortetTieulent J, Jemal A. Global cancer statistics, 2012. CA Cancer J Clin. 2015;65(2):87–108.
Gao Z, Guo B, Gao R, Zhu Q, Qin H. Microbiota disbiosis is associated with colorectal cancer. Front Microbiol. 2015;6:20.
Wang Z, Vogelstein B, Kinzler KW. Phosphorylation of \(\beta\)catenin at s33, s37, or t41 can occur in the absence of phosphorylation at t45 in colon cancer cells. Can Res. 2003;63(17):5234–5.
Krawitt EL. Autoimmune hepatitis. N Engl J Med. 2006;354(1):54–66.
MieliVergani G, Vergani D, Czaja AJ, Manns MP, Krawitt EL, Vierling JM, Lohse AW, MontanoLoza AJ. Autoimmune hepatitis. Nat Rev Dis Primers. 2018;4(1):1–21.
Heneghan MA, Yeoman AD, Verma S, Smith AD, Longhi MS. Autoimmune hepatitis. Lancet. 2013;382(9902):1433–44.
Hurlburt KJ, McMahon BJ, Deubner H, HsuTrawinski B, Williams JL, Kowdley KV. Prevalence of autoimmune liver disease in alaska natives. Am J Gastroenterol. 2002;97(9):2402–7.
Soloway RD, Summerskill WHJ, Baggenstoss AH, Geall MG, Gitnick GL, Elveback LR, Schoenfield LJ. Clinical, biochemical, and histological remission of severe chronic active liver disease: a controlled study of treatments and early prognosis. Gastroenterology. 1972;63(5):820–33.
Liwinski T, Casar C, Ruehlemann MC, Bang C, Sebode M, Hohenester S, Denk G, Lieb W, Lohse AW, Franke A, et al. A diseasespecific decline of the relative abundance of bifidobacterium in patients with autoimmune hepatitis. Aliment Pharmacol Therap. 2020;51(12):1417–28.
Wei Y, Yanmei Li LI, Yan CS, Miao Q, Wang Q, Xiao X, Lian M, Li B, Chen Y, et al. Alterations of gut microbiome in autoimmune hepatitis. Gut. 2020;69(3):569–77.
Lou J, Jiang Y, Rao B, Li A, Ding S, Yan H, Zhou H, Liu Z, Shi Q, Cui G, et al. Fecal microbiomes distinguish patients with autoimmune hepatitis from healthy individuals. Front Cell Infect Microbiol. 2020;10:342.
Funding
The authors wish to thank the editors and reviewers. This research was supported in part by the Young and Middle-aged Teachers Research Basic Ability Improvement Project of Guangxi Universities (No. 2022KY0608) and by Macau Science and Technology Development Funds Grant No. 0056/2020/AFJ from the Macau Special Administrative Region of the People’s Republic of China.
Contributions
SHL conceived of the presented idea, carried out the experiments, analyzed the result, and wrote the manuscript. LL and SLL helped shape the research, analysis, and manuscript. RM, YFZ, CJY and DO analyzed the result and revised the manuscript. YL conceived the project and revised the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Lu, S., Liang, Y., Li, L. et al. Predicting potential microbedisease associations based on autoencoder and graph convolution network. BMC Bioinformatics 24, 476 (2023). https://doi.org/10.1186/s12859023056117