Skip to main content

GSAMDA: a computational model for predicting potential microbe–drug associations based on graph attention network and sparse autoencoder

Abstract

Background

Clinical studies show that microorganisms are closely related to human health, and the discovery of potential associations between microbes and drugs will facilitate drug research and development. However, at present, few computational methods for predicting microbe–drug associations have been proposed.

Results

In this work, we proposed a novel computational model named GSAMDA based on the graph attention network and sparse autoencoder to infer latent microbe–drug associations. In GSAMDA, we first built a heterogeneous network through integrating known microbe–drug associations, microbe similarities and drug similarities. And then, we adopted a GAT-based autoencoder and a sparse autoencoder module respectively to learn topological representations and attribute representations for nodes in the newly constructed heterogeneous network. Finally, based on these two kinds of node representations, we constructed two kinds of feature matrices for microbes and drugs separately, and then, utilized them to calculate possible association scores for microbe–drug pairs.

Conclusion

A novel computational model is proposed for predicting potential microbe–drug associations based on graph attention network and sparse autoencoder. Compared with other five state-of-the-art competitive methods, the experimental results illustrated that our model can achieve better performance. Moreover, case studies on two categories of representative drugs and microbes further demonstrated the effectiveness of our model as well.

Peer Review reports

Background

Microorganisms, including bacteria, viruses, archaea, fungi and protozoa, are dynamic, diverse and complex genetic reservoirs that exist in interactive flux, colonize human cells, and play significant roles in human beings [1]. The microbial function is to protect the pathogens, improve and enhance metabolism and immunity capability [2]. For example, microbes can resist the invasion of opportunistic pathogens [3], promote the synthesis of sugar metabolism and synthesis the necessary vitamins to boost T-cell responses [4]. Maintaining the homeostasis of internal environment of organisms is inseparable from the regulation of microorganisms [5]. Unusual growth or decline of microorganisms will influence human health and cause diseases, such as obesity [6], inflammatory bowel disease [7], and even cancer [8]. For instance, pathogens, including bacteria and viruses, may cause infectious diseases such as the COVID-19 [9]. Also, while using drugs to treat microbe-caused diseases, the microbiome may affect the physiological action of drugs in turn. Several studies have shown that not only microbial metabolism can significantly affect the clinical response to drugs, but also the administration of drugs can similarly affect the microbiome [1, 10, 11]. Hence, uncovering potential associations between microbes and drugs will be helpful for the development of drugs and the treatment of human diseases. Due to the high cost and time-consuming of clinical and biological experiments, it is obvious that effective computational approaches for predicting possible microbe–drug associations will be useful complements of traditional web-lab experiments.

Recently, researchers have published multiple databases such as MDAD [12] and aBiofilm [13], which include a large number of experimentally validated microbe–drug associations. And based on these databases, a series of calculation methods have been proposed to detect latent microbe–drug associations and achieved a certain degree of effects. For instance, Zhu et al. proposed a method called HMDAKATZ by adopting KATZ metric to infer potential microbe–drug associations [14]. Long et al. designed a calculation framework named HNERMDA for possible microbe–drug association prediction through combining metapath2vec with bipartite network recommendation [15]. Furthermore, a computational model called LRLSMDA was proposed in reference [16] for identifying microbe–drug associations based on the Laplacian Regularized Least Square algorithm. Literature [17] introduced a calculation scheme named GCNMDA based on the Graph Convolutional Network (GCN) and Conditional Random Field (CRF) to discover associations between microbes and drugs. In reference [18], a method called EGATMDA was designed based on the framework of graph attention networks to predict possible microbe–drug associations. Additionally, Deng et al. conceived a calculation model named Graph2MDA through applying a variational graph autoencoder to infer microbe–drug associations [19].

Most of the above methods took multiple node features into account and fed them into the same model for prediction. Hence, considering that different node features can be learned by different models may have better performance, we classified node features as topological features and attribute features and learn the representations of these two features through graph attention network(GAT) and sparse autoencoder(SAE) respectively. GAT can propagate the information from local neighbors to learn effective representations and has been widely and successfully used in the field of association prediction such as Long et al. [18], Liu et al. [20]. SAE can extract relatively sparse and useful features by introducing a sparse penalty term on autoencoder [21].

In this paper, we introduced a novel calculation method called GSAMDA based on the graph attention network (GAT) and the sparse autoencoder (SAE) to predict potential microbe–drug associations. In GSAMDA, a heterogeneous network would be constructed first based on the Gaussian interaction profile (GIP) kernel similarity and Hamming interaction profile (HIP) similarity for microbes and drugs. And then, for each node in the heterogeneous network, a unique topological representation would be learned by adopting a GAT-based autoencoder. Simultaneously, based on multiple features of microbes and drugs, we would further apply SAE to learn a unique attribute representation for each node in the heterogeneous network as well. Thereafter, through combining these two types of node representations with multiple features of microbes and drugs, such as drug structure similarity, microbe functional similarity, drug–disease associations and microbe–disease associations, a unique feature matrix would be built for each node in the heterogeneous network, which would be utilized to obtain predicted scores for possible microbe–drug associations. Finally, in order to verify the prediction performance of GSAMDA, we performed case studies and intensive comparison experiments based on two well-known public databases, and results demonstrated that GSAMDA outperformed five state-of-the-art competitive methods, which means that GSAMDA not only can achieve satisfactory predictive performance, but also may be a kind of useful tool for potential microbe–drug association prediction in the future.

Materials and methods

Data sources

In this manuscript, we first downloaded known microbe–drug associations from two public databases such as MDAD (http://www.chengroup.cumt.edu.cn/MDAD/) and aBiofilm (http://bioinfo.imtech.res.in/manojk/abiofilm/) separately. As a result, we obtained 2470 clinically or experimentally verified microbe–drug associations between 1373 drugs and 173 microbes from the MDAD, while 2884 known microbe–drug associations between 1720 drugs and 140 microbes from the aBiofilm. And then, we further collected known drug–disease associations and known microbe–disease associations from the dataset proposed by Wang et al. [22] as well. During the experiment, only diseases associating with at least one drug and one microbe in the MDAD or aBiofilm, and associations related with these diseases, would be kept. Hence, we finally obtained 109 different diseases, 232 different drugs, 1121 different drug–disease associations and 402 different microbe–disease associations from the MDAD, and 72 different diseases, 103 different drugs, 435 different drug–disease associations and 254 different microbe–disease associations from the aBiofilm. The detailed numbers of these aforementioned data were shown in the following Table 1.

Table 1 The detailed numbers of microbes, drugs, diseases and related associations in the MDAD and aBiofilm

Methods

As shown in Fig. 1, GSAMDA mainly consists of five parts:

Fig. 1
figure 1

The overall architecture of GSAMDA

Step 1. Constructing the heterogeneous network HN by adopting integrated microbe similarities and drug similarities;

Step 2. Learning topological representations for nodes in HN based on the GAT;

Step 3. Learning attribute representations for nodes in HN based on the SAE;

Step 4. Constructing feature matrices for nodes in HN through combining their topological representations and attribute representations with multiple original attributes of them;

Step 5. Computing possible association scores for microbe–drug pairs based on their feature matrices.

Constructing the heterogeneous network HN

In this section, based on newly downloaded drugs, microbes and known microbe–drug associations, we would build the heterogeneous network HN as follows.

Firstly, we defined A\({R}^{{n}_{r}\times {n}_{m}}\) as an adjacency matrix, where nr and nm denote the numbers of newly downloaded drugs and microbes separately. In A, for any given drug ri and microbe mj, if there is a known association between them, then there is Aij = 1, otherwise there is Aij = 0.

Secondly, let A(ri) and A(mj) denote the i-th row and the j-th column of A respectively, then for any two given drugs ri and rj, we would estimate the GIP kernel similarity \({S}_{r}^{GIP}({r}_{i},{r}_{j})\in {R}^{{n}_{r}\times {n}_{r}}\) between them as follows:

$${S}_{r}^{GIP}({r}_{i},{r}_{j})=exp(-{\gamma }_{r}{||A({r}_{i})-A({r}_{j})||}^{2})$$
(1)
$${\gamma }_{r}=1/\left(\frac{1}{{n}_{r}}\sum_{i=1}^{{n}_{r}}{||A({r}_{i})||}^{2}\right)$$
(2)

Similarly, for any two given microbes mi and mj, we would evaluate the GIP kernel similarity \({S}_{m}^{GIP}({m}_{i},{m}_{j})\in {R}^{{n}_{m}\times {n}_{m}}\) between them as follows:

$${S}_{m}^{GIP}({m}_{i},{m}_{j})=exp(-{\gamma }_{m}{||A({m}_{i})-A({m}_{j})||}^{2})$$
(3)
$${\gamma }_{m}=1/\left(\frac{1}{{n}_{m}}\sum_{i=1}^{{n}_{m}}{||A({m}_{i})||}^{2}\right)$$
(4)

Here, ||·|| is the Frobenius norm.

Thirdly, inspired by the work proposed by Xu et al. [23], we further adopted the HIP similarity to measure the similarities between drugs or microbes based on the assumption that two nodes will have lower similarity when their interaction profiles are more different. To be specific, for any two given drugs ri and rj, the HIP similarity \({S}_{r}^{HIP}({r}_{i},{r}_{j})\in {R}^{{n}_{r}\times {n}_{r}}\) between them would be computed as follows:

$${S}_{r}^{HIP}({r}_{i},{r}_{j})=1-\frac{|A({r}_{i})!=A({r}_{j})|}{|A({r}_{i})|}$$
(5)

where \(|A({r}_{i})!=A({r}_{j})|\) denotes the number of different elements between the profiles A \(({r}_{i})\) and \(A({r}_{j})\), and \(|A({r}_{i})|\) represents the number of elements in \(A({r}_{i})\). Similarly, for any two given microbes mi and mj, the HIP similarity \({S}_{m}^{HIP}({m}_{i},{m}_{j})\in {R}^{{n}_{m}\times {n}_{m}}\) between them could be estimated as follows:

$${S}_{m}^{HIP}({m}_{i},{m}_{j})=1-\frac{|A({m}_{i})!=A({m}_{j})|}{|A({m}_{i})|}$$
(6)

Finally, considering that the values in both the matrices \({S}_{r}^{GIP}\) and \({S}_{r}^{HIP}\) range from 0 to 1, we could combine these two matrices into a new matrix \({S}_{r}\in {R}^{{n}_{r}\times {n}_{r}}\) as follows:

$${S}_{r}=({S}_{r}^{GIP}+{S}_{r}^{HIP})/2$$
(7)

Similarly, a novel matrix \({S}_{m}\in {R}^{{n}_{m}\times {n}_{m}}\) could be obtained by integrating \({S}_{m}^{GIP}\) and \({S}_{m}^{HIP}\) as follows:

$${S}_{m}=({S}_{m}^{GIP}+{S}_{m}^{HIP})/2$$
(8)

Thereafter, a matrix \(N\in {R}^{{{(n}_{r}+n}_{m})\times {{(n}_{r}+n}_{m})}\) could be constructed through combining \({S}_{r}\) and \({S}_{m}\) with the adjacency matrix A as follows:

$$N=\left[\begin{array}{cc}{S}_{r}& A\\ {A}^{T}& {S}_{m}\end{array}\right]$$
(9)

Here, \({A}^{T}\) is the transposed matrix of A.

Obviously, based on above matrix N, we can easily design a heterogeneous network HN consisting of \({{n}_{r}+n}_{m}\) different nodes, in which, there is an edge between any two nodes i and j, if and only if there is \(N(i,j)\ne 0\).

Learning topological representations for nodes in HN

The graph attention network (GAT) is an extension of the graph convolution network, it can overcome some shortcomings of graph convolution by using the masked self-attentional layers, which allows implicitly different weights to be assigned to different nodes in an adjacent set of nodes [24]. In this section, we would construct a GAT and take the network HN as its input to learn topological representations for nodes in N according to the following steps:

Step1 (Encoder): For any given node i in HN, let \({N}_{i}\) denote the set of neighboring nodes of i in N, then, for any node \(j\in {N}_{i}\), the GAT would first calculate the attention score \({\alpha }_{ij}\) between i and j according to the following formulae:

$${\alpha }_{ij}=softmax\left({e}_{ij}\right)=\frac{\mathrm{exp}({e}_{ij})}{{\sum }_{k\in {N}_{i}}\mathrm{exp}({e}_{ik})}$$
(10)
$$softmax(x)=\frac{1}{1+{e}^{-x}}$$
(11)
$${e}_{ij}=LeakyRelu(\alpha [W{h}_{i}||W{h}_{j}])$$
(12)
$$LeakyRelu(x)=\left\{\begin{array}{c}x x>0\\ \mu x otherwise\end{array}\right.$$
(13)

Here, \(\alpha\) represents the computational operation of self-attention, W is the matrix of trainable weights, hi denotes the feature representation of the node i (i.e., the i-th row of N), \(\mu\) is the hypermeter and || denotes the concatenation operation.

Subsequently, the GAT would multiply the attention score \({\alpha }_{ij}\) with the feature representation \({h}_{j}\) of each node in \({N}_{i}\) and sum all these products up as follows:

$${h}_{i}=\sigma \left(\sum_{j\in {N}_{i}}{\alpha }_{ij}W{h}_{j}\right)$$
(14)

Here, \(\sigma\) denotes the activation function.

After above Encoder step, obviously, we could obtain a matrix \(Z=\left[\begin{array}{c}{Z}^{r}\\ {Z}^{m}\end{array}\right]\in {R}^{({n}_{r}+{n}_{m})*l}\), where \({Z}^{r}\) and \({Z}^{m}\) represent the low-dimensional topological representation of drug nodes and microbe nodes in HN respectively.

Step2(Decoder) Based on the matrix \(Z\), it was easy to see that we could take its inner product as a decoder:

$$ZZ=sigmoid(Z\bullet {Z}^{T})$$
(15)
$$sigmoid(x)=\frac{1}{1+{e}^{-x}}$$
(16)

Step3(Optimization) Considering that the decoded result ZZ should be close to the original inputted matrix N, we adopted the MSE loss function to compute the mean of the sum of squares of the differences between ZZ and N as follows:

$${L}_{MSE}=\frac{1}{{n}_{r}+{n}_{m}}\sum_{k=1}^{{n}_{r}+{n}_{m}}{||ZZ(k)-N(k)||}^{2}$$
(17)

where ZZ(k) and N(k) denote the k-th row of ZZ and N respectively.

Thereafter, based on the Eq. (16), we would adopt the Adam optimizer [25] to optimize the results of topological representations for nodes in HN.

Learning attribute representations for nodes in HN

In this section, in order to effectively capture local and global topological intrinsic characteristics of nodes in HN, we further implemented an improved random walk with restart (RWR) on \({S}_{r}\), where the RWR was defined as follows [26]:

$${q}_{i}^{l+1}=\varphi M{q}_{i}^{l}+(1-\varphi ){\varepsilon }_{i}$$
(18)

Here, \(\mathrm{\varphi }\) is the restart probability and set as 0.1, M denotes the transition probability matrix, and \({\varepsilon }_{i}\)\({R}^{1\times m}\) is the initial probability vector of the node i, which is defined as follows:

$${\varepsilon }_{ij}=\left\{\begin{array}{c}1 if i=j\\ 0 otherwise\end{array}\right.$$
(19)

Based on above RWR process, it was easy to know that we could obtain a novel matrix \({S}_{r}^{mm}\).

In addition, let \({n}_{d}\) denote the number of newly downloaded diseases, similar to construction of the adjacency matrix A, we could obtain an adjacency matrix D\({R}^{{n}_{r}\times {n}_{d}}\) based on these newly-downloaded known drug–disease associations as well. And then, for any two given drug nodes i and j in HN, we could calculate the cosine similarity \({S}_{r}^{dis}(i,j)\) between them as follows:

$${S}_{r}^{dis}(i,j)=cos(D(i),D(j))=\frac{D(i)\cdot D(j)}{||D(i)||\times ||D(j)||}$$
(20)

Here, D(i) denotes the i-th row of D.

Moreover, in a similar way, we could further calculate the drug structural similarity matrix \({S}_{r}^{che}\) based on the dataset downloaded from the SIMCOMP2 [27], which measured the drug similarity based on the drug chemical structure information.

Hence, through integrating all these matrices A, \({S}_{r}^{mm}\), \({S}_{r}^{dis}\) and \({S}_{r}^{che}\), it is easy to see that we could obtain a novel drug attribute matrix \({A}^{r}\) as follows:

$${A}^{r}=[A;{S}_{r}^{che};{S}_{r}^{mm};{S}_{r}^{dis}]$$
(21)

Similarly, after applying the improved RWR on \({S}_{m}\), we could obtain a new matrix \({S}_{m}^{rr}\).

And in addition, through adopting the method proposed by Kamneva [28], which calculated the functional similarity between microbes based on a microbial protein–protein functional association network, we could obtain a new matrix \({S}_{m}^{f}\) as well. Here, in the microbial protein–protein functional association network, the nodes represent any gene family encoded by the genome and the edges denote genetic neighbor scores based on STRING database. The functional similarity between microbes was calculated as the ratio of the link scores connecting the two microbes to the sum of all the link scores of the two microbial gene families.

Moreover, similar to the construction of \({S}_{r}^{dis}\), based on the dataset of newly downloaded known microbe–disease associations, for any two given microbe nodes i and j in HN, we could calculate the cosine similarity \({S}_{m}^{dis}(i,j)\) between them in a similar way as well.

Hence, through integrating all these matrices \({A}^{T}\), \({S}_{m}^{f}\), \({S}_{m}^{rr}\) and \({S}_{m}^{dis}\), it is obvious that we could obtain a novel microbe attribute matrix \({A}^{m}\) as follows:

$${A}^{m}=[{A}^{T};{S}_{m}^{f};{S}_{m}^{rr};{S}_{m}^{dis}]$$
(22)

Thereafter, after taking above two kinds of attribute matrices \({A}^{r}\) and \({A}^{m}\) as input of the SAE respectively, we could learn a unique attribute representation for each node in HN as well, where the structure of SAE was shown in the following Fig. 2.

Fig. 2
figure 2

The overall architecture of GSAMDA

From observing above Fig. 2, it is easy to see that the SAE consists of the following steps:

Step1(Encoder) Let h and x represent the hidden layer and the input layer of the SAE respectively, the encoder could be formulated as follows:

$${h}_{W,b}=\sigma ({W}_{encoder}x(i)+{b}_{encoder})$$
(23)

Step2(Decoder) The decoder adopted the same structure as the encoder, which was defined as follows:

$${y}_{W,b}=\sigma ({W}_{decoder}h+{b}_{decoder})$$
(24)

where W is the weight matrix between two layers and b is the bias term.

Moreover, in order to ensure the sparsity of the hidden layer, we would add a penalty term in the SAE as follows:

$${P}_{penalty}=\sum_{t=1}^{{S}_{2}}KL(\rho ||\widehat{{\rho }_{t}})$$
(25)

where \({S}_{2}\) is the number of neurons in the hidden layer, \(\widehat{{\rho }_{t}}\) represents the average activity of hidden neuron t, \(KL(\rho ||\widehat{{\rho }_{t}})\) is the relative entropy between two Bernoulli random variables with mean \(\rho\) and mean \(\widehat{{\rho }_{t}}\) and is defined as follows:

$$KL(\rho ||\widehat{{\rho }_{t}})=\rho log\frac{\rho }{\widehat{{\rho }_{t}}}+(1-\rho )log\frac{1-\rho }{1-\widehat{{\rho }_{t}}}$$
(26)

Hence, after inputting the drug attribute matrix \({A}^{r}\) and the microbe attribute matrix \({A}^{m}\) into the SAE, we could obtain the output matrices \({A}^{rr}\) and \({A}^{mm}\) respectively.

Step3(Optimization) In the SAE, we adopted the MSE loss function and the Adam optimizer for optimization as well. During optimization, the sparse penalty term would be added to the loss function as follows (Taking the drug attribute matrix as an example):

$${L}_{sparse}=\frac{1}{{n}_{r}}\sum_{k=1}^{{n}_{r}}{||{A}^{rr}(k)-{A}^{r}(k)||}^{2}+\beta {P}_{penalty}$$
(27)

Here, \(\beta\) is the weight of the sparse penalty and will be set to 0.1. \({A}^{rr}(k)\) and \({A}^{r}(k\)) represent the k-th row of \({A}^{rr}\) and \({A}^{r}\) respectively.

After training the SAE, we could adopt the trained SAE to learn and obtain the low dimensional drug attribute representation matrix \(\widetilde{{A}^{r}}\in {R}^{{n}_{r}*k}\) and microbe attribute representation matrix \(\widetilde{{A}^{m}}\in {R}^{{n}_{m}*k}\) simultaneously.

Constructing feature matrices for microbes and drugs

Based on above drug topological representation matrix \({Z}^{r}\), drug attribute representation matrix \(\widetilde{{A}^{r}}\), drug structural similarity matrix \({S}_{r}^{che}\), drug cosine similarity matrix \({S}_{r}^{dis}\), drug similarity matrix \({S}_{r}^{mm}\) and the original adjacency matrix A, inspired by Xuan et al. [29], we could construct a novel drug feature matrix \({F}_{r}\) as follows:

$${F}_{r}=[{Z}^{r};\widetilde{{A}^{r}};{S}_{r}^{che};A;{S}_{r}^{dis};A;{S}_{r}^{mm};A]$$
(28)

Similarly, based on above microbe topological representation matrix \({Z}^{m}\), microbe attribute representation matrix \(\widetilde{{A}^{m}}\), microbe functional similarity matrix \({S}_{m}^{f}\), microbe cosine similarity matrix \({S}_{m}^{dis}\), microbe similarity matrix \({S}_{m}^{rr}\) and the original adjacency matrix \({A}^{T}\), we can construct a novel microbe feature matrix \({F}_{m}\) as follows:

$${F}_{m}=[{Z}^{m};\widetilde{{A}^{m}};{A}^{T};{S}_{m}^{fun};{A}^{T}{;S}_{m}^{dis};{A}^{T};{S}_{m}^{rr}]$$
(29)

Computing predicted scores for microbe–drug pairs

The multiplication of two vectors is an effective means of simulating the interaction, which emphasizes the commonality of the interaction and dilutes the difference information of the interaction. Hence, for any given drug \({r}_{i}\) and microbe \({m}_{j}\), we could obtain the predicted score between them by calculating the inner product of their feature representations as follows:

$$S({r}_{i},{m}_{j})=Sigmoid({F}_{r}({r}_{i})\bullet {{F}_{m}({m}_{j})}^{T})$$
(30)

Here, \({F}_{r}({r}_{i})\) denotes the i-th row of \({F}_{r}\) and \({F}_{m}({m}_{j})\) denotes the j-th row of \({F}_{m}\).

Results

In this section, we first compared GSAMDA with five state-of-the-art competitive predictive methods based on databases MDAD and aBiofilm separately. And then, we conducted the hyperparameter sensitivity analysis to decide the best parameters. Finally, we implemented case studies on two selected drugs and two selected microbes to further demonstrate the performance of GSAMDA.

Comparison with state-of-the-art methods

As predicting microbe–drug associations is a new problem, there are few computational methods and codes available, therefore, we would compare our method GSAMDA with some representative methods for link prediction problems in this section. Among them, HMDAKATZ [14] is a KATZ-based method proposed for microbe–drug associations prediction. LAGCN [30] is a graph convolutional network with attention mechanism based method designed to infer potential drug–disease associations. NTSHMDA [31] is a model based on random walk with restart for microbe–disease associations prediction. HMDA-Pred [32] integrated multiple data types and adopted the Network Consistency Projection (NCP) technique to detect latent microbe–disease associations. BPNNHMDA [33] designed a novel neural network to infer microbe–disease associations.

During experiments, we settled with the original parameters for all these competitive methods and ran them on the MDAD and aBiofilm datasets respectively for a fair comparison. In addition, we adopted the framework of fivefold cross validation (CV) in Cai et al. [34] to evaluate these methods, in which, we randomly selected 20% of known associations and 20% of unknown associations as the testing set, and the remaining 80% of known associations and unknown associations as the training set. We run the fivefold CV for 10 times and the AUROCs, AUPR and the best Accuracy of all compared methods were shown in Table 2. The best ROC curves and PR curves of these six competitive methods based on datasets MDAD and aBiofilm were drawn in Figs. 3 and 4, respectively.

Table 2 The AUCs, AUPRs and Accuracy of compared methods based on datasets MDAD and aBiofilm under fivefold CV
Fig. 3
figure 3

a ROC curves of six competitive methods based on the MDAD dataset. b ROC curves of six competitive methods based on the aBiofilm dataset

Fig. 4
figure 4

a PR curves of six competitive methods based on the MDAD dataset. b PR curves of six competitive methods based on the aBiofilm dataset

The indicators including true positive rate(TPR), false positive rate(FPR), precision and recall related to ROC curve and PR curve were calculated as follows:

$$TPR=\frac{TP}{TP+FN}$$
(31)
$$FPR=\frac{FP}{TN+FP}$$
(32)
$$Precision=\frac{TP}{TN+FP}$$
(33)
$$Recall=\frac{TP}{TP+FN}$$
(34)

In addition, the accuracy is defined as below:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(35)

Here, TP and TN represent the number of positive and negative samples predicted correctly, respectively. FN and FP separately denote the number of positive and negative samples that are incorrectly identified.

As shown in Table 2, it is obvious that GSAMDA can achieve the highest AUC values of 0.9496 ± 0.0005 and 0.9308 ± 0.0120 respectively based on both MDAD and aBiofilm, while HMDAKATZ can achieve the second highest AUC values of 0.8712 ± 0.0010 and 0.8993 ± 0.0021separately based on both MDAD and aBiofilm, which are 8.19% and 4.39% lower than that of GSAMDA respectively. Meanwhile, GSAMDA also obtained the highest AUPR values of 0.4436 ± 0.0007 and 0.4510 ± 0.0051 respectively based on both MDAD and aBiofilm. Moreover, the best accuracy of GSAMDA performs better than most compared methods.

Hyperparameter sensitivity analysis

Considering that there are several hyperparameters in our model GSAMDA including the dimension of node topological representation l, the dimension of node attribute representation k, and the learning rate lr1 and lr2 in GAE and SAE separately, therefore, in this section, we would perform a fivefold CV on the MDAD dataset for 10 times and observe the average AUC value to tune these parameter values.

First, we tested the dimension l and k in the range of {32, 64, 128, 256}, and illustrated the experimental results in Fig. 5a and b, respectively, from which, we found that the dimension has a subtle impact on the performance of GSAMDA. As shown in Fig. 5a and b, when l was set to 128 and k was set to 32, the performance would be the best. Next, through experimental results, we found that these two parameters for learning rate were important for the performance of GSAMDA, too high or too low of their values would both cause performance degradation of GSAMDA. In experiments, we selected lr1 and lr2 in the range of {0.0001, 0.001, 0.005, 0.01, 0.1}, and showed the results in Fig. 5c and d separately, from which, it is easy to see that GSAMDA can obtain the highest AUC values while both lr1 and lr2 are set to 0.01.

Fig. 5
figure 5

Analysis of the impact of hyperparameters on performance of GSAMDA. The subfigures from (a) to (d) show the AUC values of the related values of the dimension of node topological representation and node attribute representation, the learning rate of GAE and SAE, respectively

Case study

To further validate the performance of GSAMDA, in this section, we would select two popular drugs, Ciprofloxacin and Moxifloxacin, and two microbes, Human immunodeficiency virus type 1 and Mycobacterium tuberculosis, for case studies. During experiments, we selected the top 20 microbes or drugs predicted by GSAMDA based on MDAD for investigation, and then verified that whether these top 20 predicted microbes or drugs have been reported by PubMed literatures.

Ciprofloxacin is a fluorinated quinolone antibiotic with high activity against a wide spectrum of gram-positive and gram-negative bacteria, including methicillin-resistant Staphylococcus aureus, Enterobacteriaceae, and Pseudomonas aeruginosa [35]. For example, Mycobacterium avium is highly susceptible to Ciprofloxacin [36]. And it is validated that Ciprofloxacin is an active agent against Candida albicans [37]. Besides, the Moxifloxacin [38] is a fluoroquinolone antibiotic, which can treat the social acquired pneumonia caused by Staphylococcus aureus, influenza bacillus, pneumococcus, acute attack of chronic bronchitis, acute sinusitis and so on. Gislason et al. revealed a two-component system that sensitized Burkholderia cenocepacia to moxifloxacin after depletion of the response regulator [39]. Tahoun et al. found that Listeria monocytogenes' antimicrobial susceptibility was most frequently observed for moxifloxacin [40]. Chon et al. demonstrated that most isolates of Clostridium perfringens were susceptible to moxifloxacin [41]. As shown in Tables 3 and 4, among these top 20 predicted ciprofloxacin-associated and moxifloxacin-associated microbes, we found 19 and 17 microbes having been reported by PubMed literatures, which means that the prediction performance of GSAMDA can reach up to 95% and 85%, and demonstrates as well that GSAMDA can achieve satisfactory performance.

Table 3 The top 20 predicted Ciprofloxacin-associated microbes
Table 4 The top 20 predicted Moxifloxacin-associated microbes

With regards to microbes, mycobacterium tuberculosis is a species of gram-positive, aerobic bacteria that is the etiological agent of tuberculosis which is the leading cause of death from bacterial infections [42]. The bacteria can infect various organs in the human body, causing pulmonary tuberculosis the most common. Moreover, human immunodeficiency virus type 1(HIV-1) is a member of the lentivirus (‘slow-acting’) genus of the family Retroviridae [43]. It is the cause of the Acquired Immunodeficiency Syndrome (AIDS) which is a deadly infectious disease. The top 20 predicted mycobacterium tuberculosis-associated and human immunodeficiency virus type 1-associated drugs are shown in Tables 5 and 6, respectively, from which, we can see that there are 16 and 17 out of top 20 predicted drugs having been validated by PubMed literatures, which further demonstrates the predictive power of GSAMDA as well.

Table 5 The top 20 predicted Mycobacterium tuberculosis-associated drugs
Table 6 The top 20 predicted Human immunodeficiency virus type 1-associated drugs

Discussion and conclusion

Increasing researches have shown that microbes are closely related to human health. Predicting microbe–drug associations can promote microbe-derived therapy and drug discovery. However, traditional wet experiments are time-consuming and expensive and few predictive computational models for microbe–drug associations have been studied. An effective predictive computational model will be a great help for microbe–drug associations discovery.

In this paper, we designed a novel calculation model called GSAMDA based on both GAT and SAE for possible microbe–drug association prediction. In GSAMDA, we first constructed a heterogeneous network based on known microbe–drug associations. And then, the GAT- and SAE-based modules were established to learn the topological representations and the attribute representations of microbe and drug nodes in the heterogeneous network respectively. Finally, through combining the node topological representations and attribute representations with multiple original node features of nodes in the heterogeneous network, the microbe feature matrix and drug feature matrix would be constructed to infer potential microbe–drug associations. Experimental results of both comparison with five state-of-the-art competitive prediction methods and case studies demonstrated the superior performance of GSAMDA and its great potential for drug discovery.

It should be noted that some limitations still exist in GSAMDA. Firstly, the microbe–drug association matrix is sparse and it will affect the performance of the model to some extent. Moreover, not all microbes(drugs) have diseases associated with them, and there are some defects in using microbe(drug)–disease association as attribute feature. Finally, we can incorporate more biological information, such as microbe-microbe interactions and drug–drug interactions, to improve the performance of the model.

Availability of data and materials

The data and code can be found online at: https://github.com/tyqGitHub/TYQ.

References

  1. Huttenhower C, Gevers D, Knight R, et al. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486(7402):207–14.

    Article  CAS  Google Scholar 

  2. Ventura M, O’Flaherty S, Claesson MJ, et al. Genome-scale analyses of health-promoting bacteria: probiogenomics. Nat Rev Microbiol. 2009;7(1):61–71.

    Article  CAS  PubMed  Google Scholar 

  3. Sommer F, Bäckhed F. The gut microbiota-masters of host development and physiology. Nat Rev Microbiol. 2013;11(4):227–38.

    Article  CAS  PubMed  Google Scholar 

  4. Kau AL, Ahern PP, Griffin NW, et al. Human nutrition, the gut microbiome and the immune system. Nature. 2011;474(7351):327–36.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. ElRakaiby M, Dutilh BE, Rizkallah MR, et al. Pharmacomicrobiomics: the impact of human microbiome variations on systems pharmacology and personalized therapeutics. OMICS. 2014;18(7):402–14.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Ley RE, Turnbaugh PJ, Klein S, et al. Human gut microbes associated with obesity. Nature. 2006;444(7122):1022–3.

    Article  CAS  PubMed  Google Scholar 

  7. Durack J, Lynch SV. The gut microbiome: relationships with disease and opportunities for therapy. J Exp Med. 2018;216(1):20–40.

    Article  PubMed  Google Scholar 

  8. Schwabe RF, Jobin C. The microbiome and cancer. Nat Rev Cancer. 2013;13(11):800–12.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Xiang Y-T, Li W, Zhang Q, et al. Timely research papers about COVID-19 in China. Lancet. 2020;395(10225):684–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. McCoubrey LE, Gaisford S, Orlu M, et al. Predicting drug–microbiome interactions with machine learning. Biotechnol Adv. 2022;54: 107797.

    Article  CAS  PubMed  Google Scholar 

  11. Zimmermann M, Zimmermann-Kogadeeva M, Wegmann R, et al. Mapping human microbiome drug metabolism by gut bacteria and their genes. Nature. 2019;570(7762):462–7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Sun Y-Z, Zhang D-H, et al. MDAD: a special resource for microbe–drug associations. Front Cell Infect Microbiol. 2018;8:424.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Rajput A, Thakur A, Sharma S, et al. aBiofilm: a resource of anti-biofilm agents and their potential implications in targeting antibiotic drug resistance. Nucleic Acids Res. 2018;46(D1):D894–900.

    Article  CAS  PubMed  Google Scholar 

  14. Zhu L, Duan G, Yan C, et al. Prediction of microbe–drug associations based on KATZ measure. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2019

  15. Long Y, Luo J. Association mining to identify microbe drug interactions based on heterogeneous network embedding representation. IEEE J Biomed Health Inform. 2021;25(1):266–75.

    Article  PubMed  Google Scholar 

  16. Zhu L, Wang J, Li G, et al. Predicting microbe–drug association based on similarity and semi-supervised learning. Am J Biochem Biotechnol. 2021;17(1):50–8.

    Article  Google Scholar 

  17. Long Y, Wu M, Kwoh CK, et al. Predicting human microbe–drug associations via graph convolutional network with conditional random field. Bioinformatics. 2020;36(19):4918–27.

    Article  CAS  PubMed  Google Scholar 

  18. Long Y, Wu M, Liu Y, et al. Ensembling graph attention networks for human microbe–drug association prediction. Bioinformatics. 2020;36(Supplement 2):i779–86.

    Article  CAS  PubMed  Google Scholar 

  19. Deng L, Huang Y, Liu X, et al. Graph2MDA: a multi-modal variational graph embedding model for predicting microbe–drug associations. Bioinformatics. 2022;38(4):1118–25.

    Article  CAS  Google Scholar 

  20. Dayun L, Junyi L, Yi L, et al. MGATMDA: predicting microbe–disease associations via multi-component graph attention network. IEEE/ACM Trans Comput Biol Bioinform. 2021. https://doi.org/10.1109/TCBB.2021.3116318.

    Article  PubMed  Google Scholar 

  21. Jiang HJ, Huang YA, You ZH. SAEROF: an ensemble approach for large-scale drug–disease association prediction by incorporating rotation forest and sparse autoencoder deep neural network. Sci Rep. 2020;10(1):1–11.

    CAS  Google Scholar 

  22. Wang L, Tan Y, Yang X, et al. Review on predicting pairwise relationships between human microbes, drugs and diseases: from biological data to computational models. Briefings Bioinform. 2022;23(3):bbac080.

    Article  Google Scholar 

  23. Xu D, Xu H, Zhang Y, et al. MDAKRLS: predicting human microbe–disease association based on Kronecker regularized least squares and similarities. J Transl Med. 2021;19(1):1–12.

    Article  CAS  Google Scholar 

  24. Veličković P, Cucurull G, Casanova A, et al. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

  25. Kingma D, Ba J. Adam: a method for stochastic optimization. Comput Sci. 2014;10(22):1–15.

    Google Scholar 

  26. Köhler S, Bauer S, Horn D, et al. Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet. 2008;82(4):949–58.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Hattori M, Tanaka N, Kanehisa M, et al. SIMCOMP/SUBCOMP: chemical structure search servers for network analyses. Nucleic Acids Res. 2010;38(Suppl2):W652–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Kamneva OK. Genome composition and phylogeny of microbes predict their co-occurrence in the environment. PLoS Comput Biol. 2017;13(2): e1005366.

    Article  PubMed  PubMed Central  Google Scholar 

  29. Xuan P, Gao L, Sheng N, et al. Graph convolutional autoencoder and fully-connected autoencoder with attention mechanism based method for predicting drug–disease associations. IEEE J Biomed Health Inform. 2020;25(5):1793–804.

    Article  Google Scholar 

  30. Yu Z, Huang F, Zhao X, et al. Predicting drug–disease associations through layer attention graph convolutional network. Briefings Bioinform. 2020;22(4):bbaa243.

    Article  Google Scholar 

  31. Luo J, Long Y. NTSHMDA: prediction of human microbe–disease association based on random walk by integrating network topological similarity. IEEE/ACM Trans Comput Biol Bioinf. 2020;17(4):1341–51.

    CAS  Google Scholar 

  32. Fan Y, Chen M, Zhu Q, et al. Inferring disease-associated microbes based on multi-data integration and network consistency projection. Front Bioeng Biotechnol. 2020;8:831.

    Article  PubMed  PubMed Central  Google Scholar 

  33. Li H, Wang Y, Zhang Z, Tan Y, Chen Z, Wang X, Pei T, Wang L. BPNNHMDA: identifying microbe-disease associations based on a novel back propagation neural network model. IEEE/ACM Trans Comput Biol Bioinform. 2021;18(6):2502–13.

    Article  PubMed  Google Scholar 

  34. Cai L, Lu C, Xu J, et al. Drug repositioning based on the heterogeneous information fusion graph convolutional network. Brief Bioinform. 2021;22(6):bbab319.

    Article  PubMed  Google Scholar 

  35. Terp DK, Rybak MJ. Ciprofloxacin. Drug Intell Clin Pharm. 1988;35(4):373–447.

    Google Scholar 

  36. Cho EH, Huh HJ, Song DJ, et al. Differences in drug susceptibility pattern between Mycobacterium avium and Mycobacterium intracellulare isolated in respiratory specimens. J Infect Chemother Off Jo Japan Soc Chemother. 2017;24(4):315–8.

    Article  Google Scholar 

  37. Hacioglu M, Haciosmanoglu E, Birteksoz-Tan AS, et al. Effects of ceragenins and conventional antimicrobials on Candida albicans and Staphylococcus aureus mono and multispecies biofilms. Diagn Microbiol Infect Dis. 2019;95(3): 114863.

    Article  CAS  PubMed  Google Scholar 

  38. Barman Balfour JA, et al. Moxifloxacin. Drugs. 1999;59(1):115–39.

    Article  Google Scholar 

  39. Gislason AS, Choy M, et al. Competitive growth enhances conditional growth mutant sensitivity to antibiotics and exposes a two-component system as an emerging antibacterial target in Burkholderia cenocepacia. Antimicrob Agents Chemother. 2017;61(1):00790.

    Article  Google Scholar 

  40. Tahoun A, Elez R, Abdelfatah EN, et al. Listeria monocytogenes in raw milk, milking equipment and dairy workers: molecular characterization and antimicrobial resistance patterns. J Glob Antimicrob Resist. 2017;10:264–70.

    Article  PubMed  Google Scholar 

  41. Chon J-W, Seo K-H, Bae D, et al. Prevalence, toxin gene profile, antibiotic resistance, and molecular characterization of Clostridium perfringens from diarrheic and non-diarrheic dogs in Korea. JVS. 2018;19(3):368–74.

    Google Scholar 

  42. Koch A, Mizrahi V. Mycobacterium tuberculosis. Trends Microbiol. 2018;26(6):555–6.

    Article  CAS  PubMed  Google Scholar 

  43. Spector SA. Human immunodeficiency virus type-1. Ref Module Biomed Sci. 2014;11(28):1–12.

    Google Scholar 

Download references

Acknowledgements

Not applicable.

Funding

The National Natural Science Foundation of China (No.62272064, No.61873221) and the Key project of 321 Changsha Science and technology Plan (No. KQ2203001).

Author information

Authors and Affiliations

Authors

Contributions

YQT and LW designed the model and conducted the experiments, YQT and XYW wrote this paper. LW, JZ, LAK, BZ and ZZ provide suggestions and revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Lei Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Tan, Y., Zou, J., Kuang, L. et al. GSAMDA: a computational model for predicting potential microbe–drug associations based on graph attention network and sparse autoencoder. BMC Bioinformatics 23, 492 (2022). https://doi.org/10.1186/s12859-022-05053-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-022-05053-7

Keywords