VGAEDTI: drug-target interaction prediction based on variational inference and graph autoencoder
BMC Bioinformatics volume 24, Article number: 278 (2023)
Abstract
Motivation
Accurate identification of Drug-Target Interactions (DTIs) plays a crucial role in many stages of drug development and drug repurposing. However, two problems remain. (i) Traditional methods do not use multi-source data and do not consider the complex relationships between data sources. (ii) It is unclear how to better mine the hidden features of the drug and target spaces from high-dimensional data while improving the accuracy and robustness of the model.
Results
To solve the above problems, a novel prediction model named VGAEDTI is proposed in this paper. We constructed a heterogeneous network with multiple sources of information using multiple types of drug and target data. In order to obtain deeper features of drugs and targets, we use two different autoencoders. One is a variational graph autoencoder (VGAE), which is used to infer feature representations from the drug and target spaces. The other is a graph autoencoder (GAE), which propagates labels between known DTIs. Experimental results on two public datasets show that the prediction accuracy of VGAEDTI is better than that of six DTI prediction methods. These results indicate that the model can predict new DTIs and provide an effective tool for accelerating drug development and repurposing.
Introduction
The therapeutic effect of a drug on a disease arises from its action on a target protein and its effect on that protein's expression [1]. Therefore, the accurate identification of DTIs is of significance for understanding how drugs treat disease. Recent studies have estimated that the average cost of developing a new drug is around 40 million dollars and the cost of approving a drug for marketing is around 873 million dollars, and it usually takes more than a decade for a new drug to go from development to clinical use. Owing to side effects, less than 10% of new drugs have been approved for clinical use [2, 3]. To increase the number of drug approvals and reduce the cost of drug research and development, drug repurposing, namely the use of currently approved drugs to treat new diseases, has attracted more and more attention from the pharmaceutical industry [4]. For example, Gleevec, originally used to treat leukaemia, was redirected to treat gastrointestinal stromal tumours [5, 5], but the side effects of Gleevec in humans are substantial. By making full use of drug, target and disease information, identifying DTIs plays a crucial role in drug discovery, reducing the time and cost required for drug development and repurposing.
Traditional calculation methods [6] mainly include ligand-based methods [7] and molecular docking-based methods [8, 9]. For ligand-based methods, the prediction accuracy is often poor because few ligands are known to bind to the target proteins. Molecular docking-based methods are limited when the 3D structure of the target protein cannot be obtained. To address the limitations of traditional methods, researchers have proposed machine learning methods for predicting DTIs, which are mainly divided into two categories: (1) feature-based methods [10, 11] and (2) graph-based methods [13, 14]. Feature-based methods transform DTI prediction into a binary classification problem and use machine learning methods such as Support Vector Machine (SVM) as classifiers [15]. For example, autoencoder-based approaches predict DTIs by maintaining consistency in pharmacochemical properties and functions. Sun et al. used an autoencoder to predict DTIs in the space of drugs and targets [16]. Zhao et al. [17] predicted drug-disease associations using graph representation learning on a heterogeneous network. Graph-based methods describe complex interactions between different entities, assuming that interconnected nodes tend to have more associations [18, 19]. In graph-based methods, the similarity between drugs and targets is calculated based on local or global topological information in heterogeneous graphs constructed from association information [20]. Shang et al. [21] constructed a multi-view network embedding for DTI prediction based on the preservation of consistent and complementary information. Most of the methods currently in use, such as residual neural networks and multi-scale autoencoders, learn the features of drugs and targets [22, 23], but they are shallow learning methods, which cannot fully extract the deep and complex associations between drugs and targets.
In recent years, some deep learning algorithms have used heterogeneous networks to integrate information related to multiple drugs, diseases and targets for DTI prediction. Compared with homogeneous networks, heterogeneous networks cover multiple entities and the complex interaction relationships between different types of entities [24]. For example, DTINet is a method that focuses on learning low-dimensional vector representations of drugs and targets [18], which can accurately represent the topological information of every node in the heterogeneous network. However, network-based methods focus on building various heterogeneous networks [25] but ignore the inherent features of the different types of entities, which makes it difficult to extract the critical feature information between nodes.
In this paper, we propose a new prediction model named VGAEDTI (Fig. 1), which combines multi-source data in a collaborative training approach to extract features of drugs and targets. We use two algorithms, one for feature inference and one for label propagation. The label propagation process alone may not fully utilize the low-dimensional representation learned from high-dimensional features, so feature inference and label propagation are integrated under the variational inference algorithm of the Graph Markov Neural Network (GMNN) [26]. Specifically, the feature inference network in VGAEDTI is a VGAE [27], which learns representations from the feature matrices of drugs and targets, respectively. The label propagation network in our model is a GAE [28], which estimates the score of an unknown drug-target pair from known drug-target pairs. These two autoencoders alternately learn features and propagate labels, and are trained using a variational EM algorithm [29]. The framework minimizes the difference between the representations learned separately by the two autoencoders. To improve the performance of DTI prediction, we use a Random Forest module as the classifier [30], which takes the feature information of the drugs and targets obtained above as input to predict DTIs.
The major contributions of this research are as follows:

1. The VGAEDTI model uses multi-source drug information and target similarity to build a heterogeneous network, learning their embeddings through known association relationships and unknown associations.

2. The in-depth features of drugs and targets are learned through collaborative training with VGAE and GAE in the VGAEDTI model.
Materials
The two datasets we use were downloaded from several public databases: DrugBank, UniProt and MalaCard. DrugBank contains information on the molecular structure of drugs, target proteins, etc. UniProt is a protein-related database with a large amount of protein information. MalaCard is a human disease database that collects information on symptoms and related drug data. We downloaded the chemical structure information of drugs and the target information of all chemical drugs from DrugBank. Protein sequence information was obtained from UniProt, and drug indications were obtained from the MalaCard database. Together, the two datasets involve 3508 targets, 2015 drugs and 9702 diseases, and contain 207,540 known drug-disease associations, 8947 known DTIs and some other types of data. The two datasets are summarized in Table 1.
Methods
Drug and target similarity calculation
The \(n\) drugs in the dataset are denoted by \(R=\{{r}_{1}, {r}_{2}, {r}_{3}, \ldots , {r}_{n}\}\). The SMILES structures of the drug molecules are transformed into extended connectivity fingerprints (ECFPs) using RDKit, and the vector representing the structure of drug \({r}_{i}\) is denoted by \({F}_{i}^{r}\) (Fig. 2). Cosine similarity was used to calculate the similarity between drugs as follows,
\[{S}_{r}\left(i,j\right)=\frac{{F}_{i}^{r}\cdot {F}_{j}^{r}}{\Vert {F}_{i}^{r}\Vert \, \Vert {F}_{j}^{r}\Vert } \quad (1)\]
where \({F}_{i}^{r}\) and \({F}_{j}^{r}\) in formula (1) represent the ECFPs of drug \({r}_{i}\) and drug \({r}_{j}\), respectively. The more similar two drugs are, the closer the value of \({S}_{r}\left(i,j\right)\) is to 1, and a drug similarity matrix \({S}_{r}\in {\mathbb{R}}^{n\times n}\) is obtained. Similarities based on drug side effects and protein domains were calculated in the same way and fused into the drug similarity matrix and protein similarity matrix, respectively.
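As a minimal sketch of the cosine similarity described above, the following Python computes a pairwise similarity matrix from binary fingerprint vectors. The function names and the toy 6-bit fingerprints are hypothetical; real ECFPs would be generated with RDKit and are much longer, and returning 0 for an all-zero fingerprint is a convention assumed here rather than something stated in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length fingerprint vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for all-zero fingerprints
    return dot / (norm_u * norm_v)


def similarity_matrix(fingerprints):
    """Build the n x n matrix S with S[i][j] = cos(F_i, F_j)."""
    n = len(fingerprints)
    return [
        [cosine_similarity(fingerprints[i], fingerprints[j]) for j in range(n)]
        for i in range(n)
    ]


# Toy 6-bit fingerprints (hypothetical, not real ECFPs).
toy_fps = [
    [1, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
]
S_r = similarity_matrix(toy_fps)
```

Drugs 0 and 1 share two of their set bits and therefore score well above drug pair (0, 2), which shares none and scores 0.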
The \(m\) targets in the dataset are denoted by \(P=\{{p}_{1}, {p}_{2}, {p}_{3}, \ldots , {p}_{m}\}\). The similarity between target protein sequence \({p}_{i}\) and target protein sequence \({p}_{j}\) can be calculated by the Smith-Waterman algorithm [31], and then normalized as follows,
\[{S}_{p}\left(i,j\right)=\frac{sw(i,j)-\mathrm{min}({sw}_{i})}{\mathrm{max}({sw}_{i})-\mathrm{min}({sw}_{i})} \quad (2)\]
where \(sw(i,j)\) in formula (2) represents the similarity score calculated by the Smith-Waterman algorithm for two target protein sequences, and \(\mathrm{max}({sw}_{i})\) and \(\mathrm{min}({sw}_{i})\) represent the highest and lowest scores between protein sequence \(i\) and the other protein sequences, respectively. The target similarity matrix \({S}_{p}\in {\mathbb{R}}^{m\times m}\) is then obtained by the normalization of Eq. (2).
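The normalization step can be sketched as a row-wise min-max rescaling of a raw Smith-Waterman score matrix. This is an illustration under assumptions: the function name and toy scores are hypothetical, the minimum and maximum are taken over the whole row including the self-alignment score, and a constant row is mapped to zeros.

```python
def normalize_sw_scores(sw):
    """Row-wise min-max normalization of a raw alignment score matrix."""
    normalized = []
    for row in sw:
        lo, hi = min(row), max(row)
        span = hi - lo
        # A constant row carries no ranking information; map it to zeros.
        normalized.append([(s - lo) / span if span > 0 else 0.0 for s in row])
    return normalized


# Toy raw Smith-Waterman scores for three proteins (hypothetical values).
toy_sw = [
    [100.0, 40.0, 20.0],
    [40.0, 80.0, 10.0],
    [20.0, 10.0, 60.0],
]
S_p = normalize_sw_scores(toy_sw)
```

Because each row is rescaled with its own minimum and maximum, the normalized matrix is generally no longer symmetric even when the raw scores are.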
Construction of heterogeneous networks of drugs, targets and diseases
In order to better extract the internal connections between drug and target nodes, and to perform deep learning on the common topological representation of drug and target nodes, a heterogeneous network \({H}_{pr}\) containing drug, target and disease subnetworks is constructed, which integrates these internal connections with the target similarity matrix \({S}_{p}\) and the drug similarity matrix \({S}_{r}\). The heterogeneous network contains three kinds of nodes \(N={N}_{r}\cup {N}_{p}\cup {N}_{d}\) and five kinds of edges \(E={E}_{dr}\cup {E}_{rr}\cup {E}_{pr}\cup {E}_{pp}\cup {E}_{dp}\). If there is a known association between a drug and a target, there is a solid edge between them; if not, the edge is dashed.
The adjacency matrix of a heterogeneous network of drugs, targets, and diseases is represented as follows,
where \({S}_{r}\) is the drug similarity matrix, \({S}_{p}\) is the target similarity matrix, \({A}_{pr}\) is the drug-target association matrix, \({A}_{dr}\) is the disease-drug association matrix and \({A}_{dp}\) is the disease-target association matrix.
Integrate drug and targets spatial information based on VGAE and GAE
VGAE and GAE serve as feature extractors for the drug space and the target space. These two autoencoders extract the latent feature information from the two spaces through feature inference and label propagation, respectively. For a drug or target node, its associations and similarities can be regarded as the feature attributes of the node, so we take \({H}_{pr}\) as the feature matrix \(X\) of the drug and target nodes. The input to the VGAE and GAE is \(X\). Each layer of the VGAE and GAE is a graph convolutional layer. The formula for the first graph convolutional layer is as follows,
Taking the target space as an example, \(\widetilde{A}\) is the association adjacency matrix with self-loops, \(\widetilde{A}={A}_{pr}+{A}_{dp}+I\), \(\widetilde{D}\) is the diagonal degree matrix of the association adjacency matrix \({A}_{pr}+{A}_{dp}\), \(\sigma\) is the nonlinear activation function, \({X}_{p}\) is the target feature matrix used as the initial input, \(l\) denotes the layer index, and \({W}^{(l)}\) denotes the weight matrix of layer \(l\) in the network. The same is true for the drug space.
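The graph convolutional layer described above can be sketched in plain Python. This is not the authors' implementation: it assumes the widely used symmetric normalization \(\widetilde{D}^{-1/2}\widetilde{A}\widetilde{D}^{-1/2}\) with degrees computed from the adjacency matrix after adding self-loops, ReLU as the activation, and a small square symmetric adjacency matrix. All names and toy values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    deg = [sum(row) for row in a_tilde]  # >= 1 thanks to the self-loops
    return [
        [a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(A_hat @ X @ W)."""
    a_hat = normalize_adjacency(adj)
    pre_activation = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in pre_activation]


# Toy path graph 0 - 1 - 2 with one-hot node features (hypothetical values).
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_w = [[1.0, -1.0], [0.5, 0.5], [0.0, 1.0]]
H = gcn_layer(toy_adj, toy_x, toy_w)
```

Each output row mixes a node's own features with those of its neighbours, weighted by the normalized adjacency, before the learned projection and the nonlinearity are applied.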
The decoding process of the VGAE is as follows,
We use the VGAE to extract the spatial information of the input target feature matrix \({X}_{p}\), and obtain the representation \({Z}_{p}\) by the reparameterization technique as follows,
\[{Z}_{p}=\mu +\sigma \odot \epsilon\]
where \(\mu\) represents the mean produced by the VGAE, \(\sigma\) represents the standard deviation, and the random variable \(\epsilon \sim N\left(0,1\right)\) is sampled from a standard Gaussian distribution.
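The reparameterization step can be illustrated with a short sketch that draws the noise separately from the learned mean and standard deviation. The function name, the use of Python's `random.Random` as the noise source, and the toy values are assumptions made for illustration.

```python
import random


def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps element-wise, with eps ~ N(0, 1)."""
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_row, sigma_row)]
        for mu_row, sigma_row in zip(mu, sigma)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# With zero standard deviation the sample collapses onto the mean.
z_deterministic = reparameterize([[1.0, -2.0]], [[0.0, 0.0]], rng)

# With many draws the sample mean approaches mu.
z_samples = reparameterize([[2.0] * 1000], [[0.5] * 1000], rng)
sample_mean = sum(z_samples[0]) / len(z_samples[0])
```

Separating the noise from \(\mu\) and \(\sigma\) is what allows gradients to flow through the sampling step during training.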
For the target space, the loss function of the VGAE is the sum of the reconstruction error \({L}_{VG}\) and the KL divergence \({L}_{KL}\) as follows,
\[{L}_{pVGAE}={L}_{VG}+{L}_{KL}\]
If the features follow a Gaussian distribution, the reconstruction error is the mean square error; when the features follow a Bernoulli distribution, the reconstruction error is the cross-entropy loss as follows,
where \({X}_{p}\) is the feature matrix of the input target space. The KL divergence loss \({L}_{KL}\) can be calculated by the following equation,
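For orientation, the sketch below uses the standard closed-form KL divergence between a diagonal Gaussian \(N(\mu, {\sigma}^{2})\) and the standard normal prior, which is the form commonly used in variational graph autoencoders. It is an assumption that the paper uses exactly this form; any averaging or scaling factor is omitted, and the function name is hypothetical.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over all matrix entries."""
    total = 0.0
    for mu_row, sigma_row in zip(mu, sigma):
        for m, s in zip(mu_row, sigma_row):
            # Closed form per entry: -0.5 * (1 + log(s^2) - m^2 - s^2)
            total += -0.5 * (1.0 + math.log(s * s) - m * m - s * s)
    return total


kl_zero = kl_to_standard_normal([[0.0, 0.0]], [[1.0, 1.0]])  # posterior equals prior
kl_shifted = kl_to_standard_normal([[1.0]], [[1.0]])         # mean shifted by one
```

The term is zero only when the approximate posterior matches the prior and grows as the mean moves away from 0 or the standard deviation moves away from 1.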
For the target space, the reconstruction error \({L}_{pGAE}\) of the GAE is as follows,
where \({A}_{pr}\) represents the input drug-target association matrix and \({A}_{pr}^{\prime}\) is the reconstructed drug-target matrix. The same is true for the drug space.
We propose the VGAEDTI model, a representation learning framework that integrates the feature inference network and the label propagation network and is trained with a variational EM algorithm under integrated variational inference. VGAEDTI alternates the following two steps until convergence.
E-step (Feature inference): The VGAE is used for feature inference.
M-step (Label propagation): The GAE is used for label propagation.
Variational EM algorithm
Taking the target space as an example, the variational EM algorithm alternately minimizes the losses of the VGAE and the GAE, training the two autoencoders in turn until convergence, as follows,
where \({Z}_{p}\) represents the output of the VGAE, \({Z}_{p}^{\prime}\) represents the output of the GAE, and the mean square error is used to construct the loss. The same is true for the drug space.
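The mean square error between the two autoencoder outputs can be sketched as follows. The function name and the toy embeddings are hypothetical, and the paper may apply a different reduction (for example a sum instead of a mean) or additional terms.

```python
def mse(a, b):
    """Mean squared error between two equally shaped matrices (lists of rows)."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


# Toy embeddings from the two autoencoders (hypothetical values).
z_vgae = [[0.2, 0.4], [0.6, 0.8]]
z_gae = [[0.2, 0.0], [0.6, 1.0]]
consistency_loss = mse(z_vgae, z_gae)
```

Driving this quantity towards zero pulls the representation inferred from features and the representation obtained by label propagation towards each other.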
Collaborative training integrates information from drug space and target space
In this paper, the VGAE and GAE are co-trained, and the co-training loss is obtained by learning from the drug and target spaces respectively as follows,
In the above equation, \({Y}_{p}\) and \({Y}_{r}\) represent the protein and drug feature matrices obtained through training, \({X}_{p}\) and \({X}_{r}\) are the initial input feature matrices, and the mean square error is used to construct the loss.
The total optimized loss \({L}_{TVGAE}\) of the VGAE trained in target and drug space is as follows,
Here \(\alpha , \beta \in (0,1)\) are weight parameters that balance the information obtained from the drug and target spaces, \({L}_{pVGAE}\) is the loss of the target space under the VGAE and \({L}_{rVGAE}\) is the loss of the drug space under the VGAE.
The total optimized loss \({L}_{TGAE}\) of the GAE trained in target and drug space is as follows,
Prediction of DTI by random forest module
In this paper, to obtain better score prediction and to avoid the negative impact of feature dimensionality and feature importance on the prediction of drugs and targets, a Random Forest classifier [32] is used. Random Forest is an ensemble of decision trees and belongs to the Bagging family of ensemble learning methods [33]. By introducing randomness (in both the samples and the features), it can handle high-dimensional data without dimensionality reduction or feature selection, and it can assess the importance of each feature and the interactions between different features. For unbalanced datasets it can balance the error, and accuracy can still be maintained if a large part of the features is missing. The model has strong robustness and generalization ability, so it has been widely used in the field of bioinformatics. In our work, the learning steps of Random Forest are as follows,
where \({Y}_{r}\) represents the feature information in the drug space and \({Y}_{p}\) represents the feature information in the target space. These two sets of features are input into the Random Forest.

1. The first step is to sample the data. The samples in the training set are drawn with replacement, and the dataset is sampled \(N\) times to train \(N\) Classification and Regression Tree (CART) decision trees.

2. Then, the Gini coefficient is used to choose the optimal splitting variable, and each decision tree is constructed by splitting nodes on feature attributes.

3. Obtain \(N\) decision trees by repeating the previous steps \(N\) times, and predict drug-target associations according to the results of the decision trees.
The Gini coefficient is as follows,
where \(Y\) is the sample set, \({U}_{i}\) is the proportion of the \(i\)th class in \(Y\), \({Y}^{V}\) is the subset of \(Y\) in which feature \(f\) takes the value \(V\), and \(f\) is taken from the feature attribute set. We take the low-dimensional feature representations \({Y}_{p}\) and \({Y}_{r}\) obtained through autoencoder training as input. In the training stage, pairs of drugs and targets form the training set, which is fed into the Random Forest to obtain the final DTI score matrix.
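The Gini criterion used to pick a split can be sketched with the standard CART definitions: the impurity of a label set is one minus the sum of squared class proportions, and a candidate feature is scored by the size-weighted impurity of the subsets it induces. The function names and toy labels are hypothetical, and features are treated as categorical for simplicity.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum_i U_i^2, with U_i the proportion of class i."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_index(feature_values, labels):
    """Size-weighted Gini impurity of the subsets induced by a feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / n * gini(g) for g in groups.values())


toy_labels = [1, 1, 0, 0]
perfect_split = gini_index(["a", "a", "b", "b"], toy_labels)  # separates classes
useless_split = gini_index(["a", "b", "a", "b"], toy_labels)  # mixes classes
```

A tree grows by choosing, at each node, the feature with the lowest weighted impurity, so the first toy feature would be preferred over the second.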
Experiment and discussion
Comparison with other methods
To evaluate the performance of our proposed VGAEDTI model for predicting DTIs, we use five-fold cross-validation. The dataset we use contained 1307 drugs and was randomly divided into five groups of the same size; each group served as the test set in turn, and the remaining four groups formed the training set. All known drug-target associations were positive samples, and the remaining unknown drug-target associations were negative samples, so the negative data contained all unknown or nonexistent DTIs. It can be seen from Table 1 that the datasets are imbalanced. The VGAEDTI model was trained on these data. To better demonstrate the superiority of our model, we also use Luo et al.'s dataset for training and testing, and we compare our VGAEDTI model with the following methods,
GRMF: DTIs prediction using graph regularized matrix factorization [34].
DTINet: A network integration method for predicting DTIs and computing drug repurposing from heterogeneous information [18].
MolTrans: Transformer of molecular interactions for DTIs prediction [35].
NGDTP: Graph convolution autoencoder and Generative adversarial network approach for predicting DTIs [36].
DeepDTNet: Identify targets between known drugs by deep learning from heterogeneous networks [37].
AEFS: An autoencoder-based approach to predict DTIs by maintaining consistency in pharmacochemical properties and functions [16].
HNM: Drug repositioning by integrating target information through a heterogeneous network model [40].
Our VGAEDTI model is trained for 500 epochs with a learning rate of 0.1, a weight decay rate of \(1\times {10}^{-8}\), a hidden layer size of 256 and an initial weight of 0.5 for the drug and protein spaces, and the Adam optimizer is used for optimization.
We adopted a five-fold cross-validation method for training, and the following are some evaluation indicators:
\[TPR=\frac{TP}{TP+FN}, \quad FPR=\frac{FP}{FP+TN}\]
In the above formula, TN is the number of true negatives, FN is the number of false negatives, FP is the number of false positives, TP is the number of true positives, FPR is the false positive rate, and TPR is the true positive rate. TPR and FPR are used to draw the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPR) are important indicators of the performance and stability of binary classification models.
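These indicators can be computed directly from labels and predicted scores. The sketch below uses the standard definitions of TPR and FPR at a fixed threshold and computes AUROC as the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counted as one half), which equals the area under the ROC curve. Function names, the threshold and the toy data are hypothetical.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (TP, FP, TN, FN) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def tpr_fpr(y_true, scores, threshold):
    """TPR = TP / (TP + FN) and FPR = FP / (FP + TN) at one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    return tp / (tp + fn), fp / (fp + tn)


def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_y = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
tpr, fpr = tpr_fpr(toy_y, toy_scores, threshold=0.5)
area = auroc(toy_y, toy_scores)
```

The pairwise form is quadratic in the number of samples, so for large DTI matrices a rank-based or library implementation would be preferable.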
Comparison of experimental results
To demonstrate that our method can extract deep drug-target information from high-dimensional feature information, and to keep the experiment fair, we used the same data processing methods and the same input data for all models. The scores of the other models were derived from AEFS [16]. We compared our model with six other methods as follows,
Table 2 shows the comparison of AUROC and AUPR scores between our VGAEDTI model and the other six methods. The performance of our model is superior to that of the other methods. On the first dataset, the VGAEDTI model had the best performance (AUROC = 0.9847, AUPR = 0.8247). Compared with the GRMF method, the AUROC of our method was 0.13 higher and the AUPR was 0.61 higher. Compared with AEFS, the AUROC was 0.02 higher and the AUPR was 0.5 higher. The AUPR of our method was 1% higher than that of HNM. On the second dataset, the VGAEDTI model also performed well (AUROC = 0.9484, AUPR = 0.7302). Compared with the MolTrans method, the AUROC of our method was 0.07 higher and the AUPR was 0.42 higher. The AUROC of our method was 0.2 lower than that of NGDTP. Compared with AEFS, the AUROC was slightly higher and the AUPR was 0.31 higher. The AUPR of our method is about 13% lower than that of HNM, which may be because HNM integrates omics data, giving it a better AUPR than our method. Our model performs well on these indicators because several of the compared methods are shallow models, which are poor at extracting feature attributes from the network structure, whereas our model uses two autoencoders trained alternately to better extract features from the drug and protein spaces. The results of the six methods in Table 2 are derived from Sun et al. [16].
To better evaluate the performance of the model, we also use the recall rate among the top k DTI candidates (k = 5%, 10%, 20%, 30%). The recall rate reflects whether the model ranks true DTIs near the top of its predictions. We selected the average recall rate of these methods to compare them with our method, as shown in Fig. 3.
In the first dataset, the average recall of our model in the top 5%, 10% and 20% is better than that of the six methods, while in the top 30% our method is slightly lower than AEFS. In the second dataset, our model outperformed all compared DTI methods in the top 5%, 10%, 20% and 30%, reflecting our model's strong performance in identifying drug-target associations.
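One way to compute recall among the top-ranked candidates is sketched below. It is an assumption that recall is taken over a single global ranking of all candidate pairs; the paper reports averages, and its exact averaging scheme is not specified here. The function name and toy data are hypothetical.

```python
def recall_at_top_fraction(y_true, scores, fraction):
    """Fraction of all positives found among the top `fraction` of ranked candidates."""
    n = len(scores)
    k = max(1, int(round(fraction * n)))  # number of candidates inspected
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(y_true[i] for i in ranked[:k])
    total_positives = sum(y_true)
    return hits / total_positives if total_positives else 0.0


# Ten candidate pairs sorted by score; three are true interactions (toy data).
toy_y = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
toy_scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.50, 0.45, 0.30, 0.20, 0.10]
recall_20 = recall_at_top_fraction(toy_y, toy_scores, 0.2)
recall_30 = recall_at_top_fraction(toy_y, toy_scores, 0.3)
```

On imbalanced DTI data this measure is informative because it focuses on how many true interactions reach the short list that would actually be tested experimentally.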
Case study
The performance of a model is evaluated mainly on accuracy and practicality. We trained the VGAEDTI model using known DTI datasets to predict new drug-target associations, and recorded the 15 drug-target pairs with the highest predicted interaction scores. To verify the prediction scores, we checked them against the UniProt and DrugBank source databases, which contain a large amount of drug-target association information, so that the predictions are supported by real data.
The drug-target associations in Table 3 were confirmed in both the UniProt and DrugBank databases. At the same time, we found that the drugs DB00007 (Leuprolide) and DB00014 (Goserelin) in Table 3 have effects on prostate disease [38], and that drug DB00007 is associated with target protein P30968 (Gonadotropin-releasing hormone receptor). Drug DB00007 and target protein P22888 (Lutropin-choriogonadotropic hormone receptor) ranked high in the scores of our model, suggesting a previously unknown association. If this association can be confirmed, it could have important implications for the discovery of new treatments for diseases. P30968 and DB00007, for example, form an interacting drug-target pair that illustrates the interaction between proteins and drug molecules. Pharmaceutical chemists need to understand the role of targets in the human body or in pathogens during the course of disease, so as to design drugs that regulate the physiological functions of those targets and thereby treat the disease. A drug may have multiple potential targets in the body at the same time. When a drug acts on its intended target, the effect is called on-target; when it acts on other targets, it is called off-target. In general, a disease may be associated with multiple targets, and a target may be associated with multiple diseases. Identifying and selecting the key targets is therefore very important for drug design. Our VGAEDTI model can screen a large number of unknown but related drug targets in advance, reducing blind testing of drug targets by researchers, saving the cost of unnecessary biological experiments, shortening drug development time and accelerating drug research and development.
To further validate this novel interaction, we performed computational docking with the docking program AutoDock to infer the possible binding modes of the newly predicted DTI. Docking results showed that Gentamicin can dock to the structure of 2M0P. More specifically, Ibrutinib binds to 2MOP by forming hydrogen bonds with residues LEU23, PBU22, and ASN305. We used PyMOL to render the docking result and color the hydrogen bonds, as shown in Fig. 4.
Ablation experiment
The VGAEDTI model combines drug space and target space information, integrating the two through co-training to improve its ability to extract important feature information. The co-training scheme has an important influence on the performance of the VGAEDTI model. Therefore, we set up a set of ablation experiments to test its effectiveness.
The AUROC and AUPR of the VGAEDTI model with and without co-training on the two datasets are shown in Fig. 5. Except for this setting, all other parameters are kept the same to ensure a fair comparison. In dataset 1, the AUROC was 0.98 with co-training and 0.90 without co-training, while the AUPR was 0.82 and 0.63, respectively. In dataset 2, the AUROC was 0.94 with co-training and 0.89 without co-training, while the AUPR was 0.73 and 0.72, respectively. On both datasets, the prediction performance of the model with co-training is higher than that of the model without co-training. Therefore, the experimental results show that the VGAEDTI model can extract the feature information of the drug space and target space to predict DTIs accurately, and that co-training is essential.
In VGAEDTI, we use Random Forest to calculate drug-target association scores from the embedding features of drugs and targets. To confirm that Random Forest gives better score prediction, we performed the ablation experiments shown in Fig. 6. We compared several classifiers, namely a fully connected layer, SVM, KNN and Random Forest, on the two datasets. In the first dataset, Random Forest (AUPR = 0.98, AUROC = 0.93) performed better than SVM (AUPR = 0.94, AUROC = 0.91) and the other classifiers. In the second dataset, the AUROC of Random Forest is 2% higher than that of SVM, and it remains superior to the other classifiers. These results show that the Random Forest classification module improves the accuracy of the scores produced by the VGAEDTI model.
Weight parameter selection of drug space and target space
By integrating the feature information of the drug space and target space, learned by two autoencoders trained alternately with a variational EM algorithm, the VGAEDTI model obtains more accurate feature information and can therefore better predict drug-target associations. To select suitable weight parameters that balance the two spaces and control the contribution of each space's feature information to the prediction performance of the model, we tested different weights on the two datasets.
It can be seen from Fig. 7 that, when our VGAEDTI model integrates the spatial feature information of drugs and proteins, a weight of 0.5 gives AUROC = 0.98 and AUPR = 0.82 on dataset 1, which is the best prediction performance for that dataset. In dataset 2, a weight of 0.1 gives AUROC = 0.94 and AUPR = 0.73, which is the best prediction performance for dataset 2. The experiments show that the best weights for integrating drug and target space feature information differ between datasets, and that some properties, such as the sparsity of the datasets, affect the model's training.
Conclusion
Accurately identifying DTIs is one of the most important steps in drug repurposing and new drug development. In this study, we propose a novel model, VGAEDTI, to predict DTIs. First, the VGAEDTI model calculates the similarity of multi-source drug information, target information and disease information, and then constructs a heterogeneous network from the known association information and feature information among them, so as to better extract the potentially complex relationships among drugs, targets and diseases. The network is then input to the VGAE and GAE for feature extraction. The VGAE infers the feature representation from the drug and target spaces respectively, while the GAE propagates labels between the known drug-target associations, and the variational EM algorithm is used for alternating training until convergence. In addition, the co-training strategy is used to capture the feature information of the drug space and target space, which enhances the ability of VGAEDTI to capture efficient low-dimensional representations from high-dimensional features, thereby improving the robustness and accuracy of predicting unknown DTIs. In this way, the obtained drug and target feature information is more accurate and comprehensive. To obtain better score prediction and avoid the negative effects of feature dimensionality and feature importance on predicting drugs and targets, we use a Random Forest classifier, which can judge the importance of features and the interactions between different features. For imbalanced datasets, it can balance the error. If a large part of the features is missing, the accuracy can still be maintained, and the model has strong robustness and generalization ability. To evaluate the performance of the proposed VGAEDTI model for predicting DTIs, we used five-fold cross-validation to compare it with six methods on two different datasets. It achieved better results in some aspects on both datasets, which also shows that our model has strong generalization ability. In general, our model VGAEDTI can be used as an effective and accurate tool for predicting DTIs.
Future and prospects
Although the VGAEDTI model currently performs well, it also has some potential drawbacks in extracting information from heterogeneous networks. As pointed out by Zhao et al. [39], existing computational models can only use low-level biological information at the level of individual drugs, diseases and targets and their associations. This suggests new ideas for our next work: in the future, not only multi-source information but also high-order meta-path information of heterogeneous networks should be integrated to improve the prediction performance and generalization performance of the model.
Availability of data and materials
All instructions and codes for our experiments are available at https://github.com/FengYinFei/VGAEDTI.
References
Chen X, Liu MX, Yan GY. Drugtarget interaction prediction by random walk on the heterogeneous network. Mol BioSyst. 2012;8(7):1970â€“8. https://doi.org/10.1039/c2mb00002d.
Whitebread S, Hamon J, Bojanic D, Urban L. Keynote review: in vitro safety pharmacology profiling: an essential tool for successful drug developmentâ€”sciencedirect. Drug Discov Today. 2005;10(21):1421â€“33. https://doi.org/10.1016/S13596446(05)036329.
Masataka T, Masaaki K, Yosuke N, Susumu G, Yoshihiro Y. Drug target prediction using adverse event report systems: a pharmacogenomic approach. Bioinformatics. 2012. https://doi.org/10.1093/bioinformatics/bts413.
Ashburn TT, Thor KB. Drug repositioning: identifying and developing new uses for existing drugs. Nat Rev Drug Discov. 2004;3(8):673–83. https://doi.org/10.1038/nrd1468.
Frantz S. Drug discovery: playing dirty. Nature. 2005;437(7061):942–3. https://doi.org/10.1038/437942a.
McLean SR, Gana-Weisz M, Hartzoulakis B, Frow R, Whelan J, Selwood D, Boshoff C. Imatinib binding and cKIT inhibition is abrogated by the cKIT kinase domain I missense mutation val654ala. Mol Cancer Ther. 2005;4(12):2008–15. https://doi.org/10.1158/15357163.MCT050070.
Yamanishi Y, Kotera M, Kanehisa M, Goto S. Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics. 2010;26(12):i246–54. https://doi.org/10.1093/bioinformatics/btq176.
Keiser MJ (2009) Relating protein pharmacology by ligand chemistry. (Doctoral dissertation, University of California, San Francisco). https://doi.org/10.1038/nbt1284.
Honglin L, Zhenting G, Ling K, Hailei Z, Kun Y, Kunqian Y, et al. Tarfisdock: a web server for identifying drug targets with docking approach. Nucleic Acids Res. 2006;34:219–24. https://doi.org/10.1093/nar/gkl114.
Fauman EB, Rai BK, Huang ES. Structure-based druggability assessment–identifying suitable targets for small molecule therapeutics. Curr Opin Chem Biol. 2011;15(4):463–8. https://doi.org/10.1016/j.cbpa.2011.05.020.
Mei JP, Kwoh CK, Yang P, Li XL, Zheng J. Drug–target interaction prediction by learning from local information and neighbors. Bioinformatics. 2012. https://doi.org/10.1093/bioinformatics/bts670.
Shi H, Liu S, Chen J, Li X, Ma Q, Yu B. Predicting drug-target interactions using lasso with random forest based on evolutionary information and chemical structure. Genomics. 2018. https://doi.org/10.1016/j.ygeno.2018.12.007.
Peng J, Wang Y, Guan J, Li J, Han R, Hao J, et al. An end-to-end heterogeneous graph representation learning-based framework for drug–target interaction prediction. Brief Bioinform. 2021. https://doi.org/10.1093/bib/bbaa430.
Ingoo L, Hojung N. Identification of drug-target interaction by a random walk with restart method on an interactome network. BMC Bioinformatics. 2018;19(S8):208. https://doi.org/10.1186/s128590182199x.
Chang CC, Lin CJ. Libsvm: a library for support vector machines. ACM Trans Intell Syst Technol. 2007. https://doi.org/10.1145/1961189.1961199.
Sun C, Cao Y, Wei JM, Liu J. Autoencoder-based drug-target interaction prediction by preserving the consistency of chemical properties and functions of drugs. Bioinformatics. 2021. https://doi.org/10.1093/bioinformatics/btab384.
BoWei Z, Lun H, ZhuHong Y, Lei W, XiaoRui S. Hingrl: predicting drug–disease associations with graph representation learning on heterogeneous information networks. Brief Bioinform. 2022. https://doi.org/10.1093/bib/bbab515.
Luo Y, Zhao X, Zhou J, Yang J, Zhang Y, Kuang W, et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Res Comput Mol Biol. 2017. https://doi.org/10.1038/s41467017006808.
Yan XY, Zhang SW, He CR. Prediction of drug-target interaction by integrating diverse heterogeneous information source with multiple kernel learning and clustering methods. Comput Biol Chem. 2019. https://doi.org/10.1016/j.compbiolchem.2018.11.028.
Chen X, Liu MX, Yan GY. Drug–target interaction prediction by random walk on the heterogeneous network. Mol BioSyst. 2012;8(7):1970–8. https://doi.org/10.1039/c2mb00002d.
Shang Y, Ye X, Yasunori F, Yu L, Tetsuya S. Multiview network embedding for drug-target interactions prediction by consistent and complementary information preserving. Brief Bioinform. 2022. https://doi.org/10.1093/bib/bbac059.
Yu S, Wang M, Pang S, Song L, Qiao S. Intelligent fault diagnosis and visual interpretability of rotating machinery based on residual neural network. Measurement. 2022. https://doi.org/10.1016/j.measurement.2022.111228.
Yu S, Wang M, Pang S, Song L, Zhai X, Zhao Y. TDMSAE: A transferable decoupling multiscale autoencoder for mechanical fault diagnosis. Mech Syst Signal Process. 2023. https://doi.org/10.1016/j.ymssp.2022.109789.
Liu Y, Wu M, Miao C, Zhao P, Li XL. Neighborhood regularized logistic matrix factorization for drug-target interaction prediction. PLoS Comput Biol. 2016;12(2):e1004760. https://doi.org/10.1371/journal.pcbi.1004760.
Zhao X, Zhao X, Yin M. Heterogeneous graph attention network based on meta-paths for lncrna–disease association prediction. Brief Bioinform. 2021. https://doi.org/10.1093/bib/bbab407.
Niu M, Zou Q, Wang C. Gmnn2cd: identification of circrna–disease associations based on variational inference and graph markov neural networks. Bioinformatics. 2022. https://doi.org/10.1093/bioinformatics/btac079.
Kipf TN, Welling M (2016) Variational graph autoencoders. https://doi.org/10.48550/arXiv.1611.07308.
Pan S, Hu R, Long G, Jing J, Zhang C (2018) Adversarially regularized graph autoencoder for graph embedding. https://doi.org/10.48550/arXiv.1802.04407.
Chang C, Oh J, Min E, Long Q (2019) Knowledge-Guided Biclustering via Sparse Variational EM Algorithm. 2019 IEEE International Conference on Big Knowledge (ICBK) (vol. 2019, pp. 25–32). 10th IEEE Int Conf Big Knowl (2019). https://doi.org/10.1109/icbk.2019.00012.
Chu Y, Chandra KA, Wang X, Wang W, Zhang Y, Shan X, et al. Dticdf: a cascade deep forest model towards the prediction of drug-target interactions based on hybrid features. Brief Bioinform. 2019. https://doi.org/10.1093/bib/bbz152.
Pearson WR. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics. 1991;11(3):635–50. https://doi.org/10.1016/08887543(91)90071L.
Scornet E, Biau G. A random forest guided tour. Test Off J Spanish Soc Stat Oper Res. 2016. https://doi.org/10.48550/arXiv.1511.05741.
Breiman L. Bagging predictors. Mach Learn. 1996. https://doi.org/10.1023/A%3A1018054314350.
Zhang J, Xie M. NNDSVDGRMF: a graph dual regularization matrix factorization method using nonnegative initialization for predicting drug-target interactions. IEEE Access. 2022;10:91235–44. https://doi.org/10.1109/ACCESS.2022.3199667.
Huang K, Xiao C, Glass L, Sun J. Moltrans: molecular interaction transformer for drug target interaction prediction. Bioinformatics. 2020. https://doi.org/10.1093/bioinformatics/btaa880.
Sun C, Xuan P, Zhang T, Ye Y. Graph convolutional autoencoder and generative adversarial network-based method for predicting drug-target interactions. IEEE/ACM Trans Comput Biol Bioinform. 2020. https://doi.org/10.1109/tcbb.2020.2999084.
Zeng X, Zhu S, Lu W, Liu Z, Huang J, Zhou Y, et al. Target identification among known drugs by deep learning from heterogeneous networks. Chem Sci. 2020. https://doi.org/10.1039/c9sc04336e.
Rajput A, Thakur A, Mukhopadhyay A, Kamboj S, Kumar M. Prediction of repurposed drugs for coronaviruses using artificial intelligence and machine learning. Comput Struct Biotechnol J. 2021. https://doi.org/10.1016/j.csbj.2021.05.037.
Zhao BW, Wang L, Hu PW, et al. Fusing higher and lower-order biological information for drug repositioning via graph representation learning. IEEE Trans Emerg Topics Comput. 2023. https://doi.org/10.1109/TETC.2023.3239949.
Wang W, Yang S, Zhang X, Li J. Drug repositioning by integrating target information through a heterogeneous network model. Bioinformatics. 2014;30(20):2923â€“30.
Acknowledgements
We thank Yuanyuan Zhang, Mengjie Wu, Zengqian Deng, Shudong Wang, and others for their efforts.
Funding
This work was partially supported by the National Natural Science Foundation of China [Nos. 61902430, 61873281].
Author information
Contributions
YF performed the experiments, analyzed the data, and wrote the paper. YZ provided ideas for the article and reviewed the manuscript. MW and ZD provided the source of the data. SW discussed the feasibility of the article. All authors have approved the final version of the article.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhang, Y., Feng, Y., Wu, M. et al. VGAEDTI: drug-target interaction prediction based on variational inference and graph autoencoder. BMC Bioinformatics 24, 278 (2023). https://doi.org/10.1186/s1285902305387w
DOI: https://doi.org/10.1186/s1285902305387w