Problem formulation
In our study, DTI prediction is formulated as a transductive-learning binary link-prediction task (i.e., discovering novel DTIs within the DTI space consisting of the fixed drugs and targets in the given dataset; that is, the involved entities do not need to be extended) based on a heterogeneous network, which is divided into a bipartite DTI network as well as drug and target homogeneous networks. More specifically, let \({G}_{b}=(\mathrm{D},\mathrm{T},\mathrm{E})\) be a bipartite DTI network, where \(\mathrm{D}=\{{d}^{1}, {d}^{2},\dots ,{d}^{m}\}\) (m is the number of drugs in the dataset) and \(\mathrm{T}=\{{t}^{1}, {t}^{2},\dots , {t}^{n}\}\) (n is the number of targets in the dataset) denote the sets of drug and target protein nodes, respectively. \(\mathrm{E}\subset \mathrm{D}\times \mathrm{T}\) defines the known edges (interactions) between drugs and targets, and every known edge is assigned a weight of 1. Meanwhile, the homogeneous drug and target networks are defined as an \(m\times m\) matrix (\({G}_{d}\)) and an \(n\times n\) matrix (\({G}_{t}\)), respectively, in which every element indicates the degree of similarity between two drugs or two targets: the higher the value of an element, the more similar the two corresponding entities. In addition, an \(m\times n\) matrix (Y) stores the binary DTI predictions: \({y}^{ij}=1\) indicates that the \({d}^{i}{t}^{j}\) pair is predicted to have a potential interaction, and \({y}^{ij}=0\) otherwise.
Furthermore, it is precisely because of the definition of our prediction task (i.e., the involved nodes are fixed) that a transductive-learning-like method can be utilized. Another motivation is that directly setting the weight of unknown interactions to 0 may not produce satisfactory performance on datasets with a highly imbalanced ratio between known and unknown samples (e.g., DTI datasets) [21]. Transductive learning allows a method to observe all the data beforehand, including the training and test sets, and to potentially exploit structural information in their distribution [22] (so that it can better use the additional information of unknown samples when known interactions are sparse). Compared with inductive learning, which learns a general inference for a task from the information in a dataset, transductive learning is less ambitious and finds a specific solution that is optimal only for the current dataset (i.e., in our case, achieving the best performance for the fixed drugs and targets in the dataset) [23, 24]; the transductive setup has already been mentioned by some DTI prediction approaches [25].
Workflow
Figure 1 presents the four main steps of the proposed method in our study:

1.
Obtaining drug and target embeddings: a bipartite DTI network is established based on every known \({d}^{i}{t}^{j}\) pair, and the BiNE algorithm is then performed on the bipartite network to capture the prior high-order similarity information of the explicit and implicit transition relationships of all entities in the dataset.

2.
Selection and fusion of homogeneous networks: a heuristic algorithm is applied to screen and integrate multiple drug and target homogeneous networks.

3.
Path-based information integration: in this step, path-based heterogeneous information is added as auxiliary information to generate the embedding of every \({d}^{i}{t}^{j}\) pair.

4.
Novel DTI predictions: an RF classifier is trained on the integrated embedding representations to predict unknown DTIs.
Learning bipartite DTI embedding
The challenge of learning a bipartite network embedding is how to simultaneously learn the explicit bipartite relationships between different types of vertices (e.g., DTIs) and the implicit transition relationships between vertices of the same type (e.g., drugs and targets). BiNE addresses this challenge with a three-part joint optimization framework that assigns each type of relationship a dedicated objective function and an adjustable weight, which produces better vertex embeddings. Specifically, the first part of the framework models the explicit relationships. To preserve the information of observed edges between two different types of nodes (\({u}_{i}\) and \({v}_{j}\)), the KL divergence is chosen to measure the difference between the joint probability \(P(i,j)\) of vertices \({u}_{i}\) and \({v}_{j}\) and the joint probability \(\widehat{P}(i,j)\) of their embedding vectors (\(\overrightarrow{{u}_{i}}\) and \(\overrightarrow{{v}_{j}}\)). The objective function, which aims at minimizing the difference between \(P(i,j)\) and \(\widehat{P}(i,j)\), can be defined as follows:
$$\mathrm{minimize }{O}^{1}=KL(P\|\widehat{P})=\sum_{{e}_{ij}\in E}P(i,j)\mathrm{log}\left(\frac{P(i,j)}{\widehat{P}(i,j)}\right)$$
(1)
To explicitly model the unobserved but transitive links (implicit transition relationships) between nodes of the same type (i.e., directly modeling that similar drugs/targets could interact with similar targets/drugs in our case study), BiNE first utilizes an idea named Co-HITS [26] to generate two homogeneous networks (matrices) that contain the 2nd-order proximity between nodes of the same type, and the nodes having at least one weight greater than 0 in the generated matrices are selected. Then truncated random walks, which are designed to better capture the frequency distribution of nodes, are performed on these two homogeneous networks consisting of the selected nodes, respectively, to convert the networks into two corpora of vertex sequences. More specifically, during our DTI prediction process, two different types of homogeneous networks are generated. The first type is obtained in the second step of the workflow shown in Fig. 1; it contains the chemical and physical similarity information of drugs and targets and is more widely used by other DTI prediction methods [27]. The second type is calculated by Co-HITS, as mentioned above, to model the implicit transition relationships; it has the same size as the first type (i.e., drug homogeneous networks: \(m\times m\) matrices, target homogeneous networks: \(n\times n\) matrices), and every element (weight) denotes the implicit transition probability between two drugs/targets. That is, given an \(m\times n\) bipartite DTI matrix \({G}_{b}\), the drug homogeneous network can be represented by the \(m\times m\) matrix \({G}_{b}{G}_{b}^{T}\), and the target homogeneous network is defined as \({G}_{b}^{T}{G}_{b}\), an \(n\times n\) matrix. In our task, taking the drug homogeneous matrix as an example (Fig. 2), a higher value of the entry \({w}_{ij}^{d}\) in this matrix can be interpreted as \({drug}_{i}\) and \({drug}_{j}\) sharing more similar targets, and the same principle applies to the target homogeneous matrix. Such a characteristic is in line with the well-known “guilt-by-association” assumption [2]. Thus, the second type of homogeneous network can carry more interactive information between drugs and between targets, which helps improve the accuracy of DTI prediction.
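As a minimal sketch of the second type of homogeneous network, the 2nd-order proximity matrices can be computed directly from the bipartite matrix; the toy \({G}_{b}\) below is hypothetical data for illustration, and the simple row normalization at the end is an assumption standing in for the Co-HITS weighting:

```python
import numpy as np

# Toy bipartite DTI matrix G_b (m = 3 drugs x n = 4 targets); 1 = known interaction.
G_b = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=float)

# 2nd-order proximity between same-type nodes:
drug_homog = G_b @ G_b.T    # m x m; entry (i, j) counts targets shared by drug_i and drug_j
target_homog = G_b.T @ G_b  # n x n; entry (i, j) counts drugs shared by target_i and target_j

# Co-HITS turns such products into transition probabilities; a plain
# row normalization is assumed here purely for illustration.
row_sums = drug_homog.sum(axis=1, keepdims=True)
drug_trans = drug_homog / np.where(row_sums > 0, row_sums, 1.0)
```

For instance, drug 0 and drug 2 share only target 2, so `drug_homog[0, 2]` is 1, matching the "guilt-by-association" reading of these weights.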
Next, based on the corpora created by the truncated random walks, the Skip-gram model [28] is used to learn the embeddings of the two types of vertices in the bipartite network (e.g., drug and target embeddings), which makes the embeddings capture more high-order proximity information. Essentially, the purpose of the Skip-gram model is to assign similar embeddings to vertices that co-occur more frequently in the same context of a sequence in the corpora. Intuitively, if the vertices in a corpus sequence are more similar to each other, they are more likely to co-occur in the same context, so they can be allocated more similar embeddings. Thus, we further add a relatively high restart probability (e.g., 0.7) to every step of the truncated random walks. Taking the embedding process of drug nodes as an example, consider a truncated random walk starting at a certain drug node. When the next node is to be randomly selected from the set of other drug nodes connected to the current drug (a connection exists if the corresponding value between the two nodes in the drug homogeneous network is non-zero), a number between 0 and 1 is randomly drawn; if this number is less than the restart probability, the next node becomes the starting node instead. In this way, the drug nodes selected in the current corpus sequence stay closer to the starting node, which can yield higher-quality embeddings for DTI prediction.
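The restart mechanism described above can be sketched as follows; the function name, the adjacency-dict input, and the walk length are illustrative assumptions rather than the authors' implementation:

```python
import random

def truncated_walk_with_restart(start, adjacency, max_len=10, restart_p=0.7, rng=None):
    """One truncated random walk on a homogeneous network (a sketch).

    At each step, with probability `restart_p` the walk jumps back to the
    starting node; otherwise it moves to a random neighbour. `adjacency`
    maps a node to the list of nodes it has a non-zero weight with
    (assumption: precomputed from the homogeneous matrix).
    """
    rng = rng or random.Random(0)
    walk = [start]
    current = start
    while len(walk) < max_len:
        neighbours = adjacency.get(current, [])
        if not neighbours:
            break  # isolated node: the walk cannot continue
        if rng.random() < restart_p:
            current = start  # restart: bias the sequence toward the start node
        else:
            current = rng.choice(neighbours)
        walk.append(current)
    return walk
```

Repeating such walks from every selected node yields the corpus of vertex sequences on which Skip-gram is trained.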
Therefore, in order to learn the implicit transition relationships, two objective functions, expressed in (2)–(3), are needed to maximize the conditional probability for high-order proximities on the two corpora, respectively, where \(S\) denotes a vertex sequence containing only \({u}_{i}\) nodes or only \({v}_{j}\) nodes, \({D}^{U}\) and \({D}^{V}\) correspond to the two generated corpora, and \({C}_{S}({u}_{i})\) and \({C}_{S}({v}_{j})\) represent the context vertices of \({u}_{i}\) and \({v}_{j}\) in the sequence \(S\), respectively; the context vertices are the \(\mathrm{ws}\) vertices before and after \({u}_{i}\) or \({v}_{j}\) in a sequence \(S\). In addition, \(P({u}_{c}|{u}_{i})\) refers to how likely \({u}_{c}\) is to be found in the contexts of \({u}_{i}\), and the analogous meaning applies to \(P({v}_{c}|{v}_{j})\).
$$\mathrm{maximize }{O}^{2}=\prod_{{u}_{i}\in S\wedge S\in {D}^{U}}\prod_{{u}_{c}\in {C}_{S}({u}_{i})}P({u}_{c}|{u}_{i})$$
(2)
$$\mathrm{maximize }{O}^{3}=\prod_{{v}_{j}\in S\wedge S\in {D}^{V}}\prod_{{v}_{c}\in {C}_{S}({v}_{j})}P({v}_{c}|{v}_{j})$$
(3)
Finally, the three objective functions mentioned above can be integrated into a joint framework to capture the explicit and implicit transition relationships simultaneously. The framework is optimized by the Stochastic Gradient Ascent (SGA) algorithm and can be presented as Eq. (4), where \(\mathrm{\alpha }\), \(\beta\) and \(\gamma\) are adjustable weights that control the relations among the three components.
$$\mathrm{maximize }L=\mathrm{\alpha }\mathrm{log}{O}^{2}+\beta \mathrm{log}{O}^{3}-\gamma {O}^{1}$$
(4)
When optimizing Eq. (4) using SGA, in order to save calculation time, negative sampling [29], which approximates the costly denominator of the softmax function by sampling several negative instances, is adopted to learn the embedding vectors. As a result, the whole optimization process in one gradient step is as follows:
First, the \(-\gamma {O}^{1}\) part is maximized to update the embeddings \(\overrightarrow{{u}_{i}}\) and \(\overrightarrow{{v}_{j}}\) as in Eqs. (5)–(6):
$$\overrightarrow{{u}_{i}}=\overrightarrow{{u}_{i}}+\lambda \{\gamma {w}_{ij}[1-\sigma ({\overrightarrow{{u}_{i}}}^{T}\overrightarrow{{v}_{j}})]\bullet \overrightarrow{{v}_{j}}\}$$
(5)
$$\overrightarrow{{v}_{j}}=\overrightarrow{{v}_{j}}+\lambda \{\gamma {w}_{ij}[1-\sigma ({\overrightarrow{{u}_{i}}}^{T}\overrightarrow{{v}_{j}})]\bullet \overrightarrow{{u}_{i}}\}$$
(6)
where \(\lambda\) is the learning rate and \({w}_{ij}\) is the weight of the edge between \({u}_{i}\) and \({v}_{j}\) (in our study the weight is 1 if there is an edge between \({u}_{i}\) and \({v}_{j}\)). Then, the \(\mathrm{\alpha }\mathrm{log}{O}^{2}\) and \(\beta \mathrm{log}{O}^{3}\) parts are maximized separately to further update the embedding vectors as follows:
$$\overrightarrow{{u}_{i}}=\overrightarrow{{u}_{i}}+\lambda \{\sum_{z\in \{{u}_{c}\}\cup {N}_{S}^{ns}({u}_{i})}\alpha [I\left(z,{u}_{i}\right)-\sigma ({\overrightarrow{{u}_{i}}}^{T}\overrightarrow{{\theta }_{z}})]\bullet \overrightarrow{{\theta }_{z}}\}$$
(7)
$$\overrightarrow{{v}_{j}}=\overrightarrow{{v}_{j}}+\lambda \{\sum_{z\in \{{v}_{c}\}\cup {N}_{S}^{ns}({v}_{j})}\beta [I\left(z,{v}_{j}\right)-\sigma ({\overrightarrow{{v}_{j}}}^{T}\overrightarrow{{\vartheta }_{z}})]\bullet \overrightarrow{{\vartheta }_{z}}\}$$
(8)
where \({u}_{c}\) and \({v}_{c}\) are the context vertices of \({u}_{i}\) and \({v}_{j}\), respectively, and \({N}_{S}^{ns}({u}_{i})\) denotes the \(\mathrm{ns}\) negative samples of \({u}_{i}\) in the sequence \(S\in {D}^{U}\); the analogous meaning applies to \({N}_{S}^{ns}({v}_{j})\). \(I\left(z,{u}_{i}\right)\) and \(I\left(z,{v}_{j}\right)\) are indicator functions determining whether vertex \(z\) is a context vertex of \({u}_{i}\) and \({v}_{j}\), respectively (yes: 1, no: 0). Besides, \(\sigma\) is the sigmoid function \(1/(1+{e}^{-x})\), and \(\overrightarrow{{\theta }_{z}}\) and \(\overrightarrow{{\vartheta }_{z}}\) are the embeddings of the context vertices of \({u}_{i}\) and \({v}_{j}\), respectively.
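One gradient step of this joint optimization can be sketched in standalone form; the function names and the simultaneous-update convention in `explicit_update` are assumptions made for illustration:

```python
import numpy as np

def sigmoid(x):
    # The sigmoid function 1 / (1 + e^{-x}).
    return 1.0 / (1.0 + np.exp(-x))

def explicit_update(u, v, w_ij=1.0, gamma=1.0, lam=0.01):
    """Update an observed edge (u_i, v_j), cf. Eqs. (5)-(6).

    Both embeddings are updated from the same gradient coefficient
    (an assumption; a sequential in-place update is also possible).
    """
    g = gamma * w_ij * (1.0 - sigmoid(u @ v))
    return u + lam * g * v, v + lam * g * u

def implicit_update(u_i, thetas, indicators, alpha=1.0, lam=0.01):
    """Update u_i against its context vertex and negative samples, cf. Eq. (7).

    `thetas` holds the embeddings of the true context vertex plus the ns
    negative samples, and `indicators[k]` is I(z, u_i) for the k-th of them;
    the Eq. (8) update for v_j is symmetric.
    """
    grad = np.zeros_like(u_i)
    for theta, ind in zip(thetas, indicators):
        grad += alpha * (ind - sigmoid(u_i @ theta)) * theta
    return u_i + lam * grad
```

Each explicit step pulls the embeddings of an observed drug-target edge together, so the inner product (and hence the modeled probability) of the pair increases.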
Furthermore, BiNE is an embedding method that cannot learn well for fully isolated nodes that the truncated random walk cannot reach. However, under our transductive-learning setup, we reckon that the use of BiNE can be understood from another perspective. More specifically, many methods adopt multiple drug and target similarities (as a part of the input features used to generate homogeneous networks), which are precalculated over all nodes in the dataset based on certain properties of drugs and targets. By analogy, we can treat BiNE as a similarity generator that takes the drug and target Co-HITS matrices (which are calculated from the whole bipartite DTI network) as input to precalculate another type of drug and target similarity. In this case, the form of this similarity is the embedding score, and the property on which it is based is high-order proximity; and every node in the whole bipartite DTI network in the used datasets has at least one edge, so the truncated random walk can produce every node's (high-order proximity) similarity in advance (i.e., there are actually no isolated nodes in the process of high-order similarity production).
Composite homogeneous network generation
For the second step of our workflow, we choose a heuristic method to screen and combine different homogeneous networks (in matrix form) that contain different drug–drug and target–target similarity information [27]. This method can acquire an informative and robust composite homogeneous network by removing redundant information and integrating the retained features. Specifically, we first calculate the entropy of each homogeneous matrix to determine how much information these matrices contain. Second, we delete the homogeneous matrices whose entropy is higher than \({c}_{1}\mathrm{log}\left(k\right)\), where \({c}_{1}\) is a threshold controlling the information each matrix contains (subjectively set to 0.7) and \(\mathrm{log}(k)\) represents the highest entropy among all matrices.
Next, each matrix is flattened and the Euclidean distance (\(d\)) between homogeneous matrices is calculated; then, starting from the matrix with the lowest entropy, other matrices whose similarity index \({E}_{s}\) (shown in Eq. (9)) with the current matrix is higher than \({c}_{2}\) (subjectively set to 0.6) are further removed, and the process is repeated until all matrices are removed or retained. Finally, the similarity network fusion (SNF) [30] algorithm is adopted to nonlinearly fuse the remaining matrices into a composite matrix that carries the necessary information from the different similarity measures.
$${E}_{s}=\frac{1}{1+d}$$
(9)
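The screening procedure can be sketched as follows; the normalization used for the entropy computation and the function interface are assumptions, and the final SNF fusion of the retained matrices is not shown:

```python
import numpy as np

def screen_matrices(mats, c1=0.7, c2=0.6):
    """Heuristic screening of homogeneous similarity matrices (a sketch).

    Each matrix is flattened and normalised into a probability distribution
    for the Shannon-entropy computation (an assumption); the highest entropy
    observed plays the role of log(k) in the text.
    """
    def entropy(m):
        p = m.flatten() / m.sum()
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    ents = [entropy(m) for m in mats]
    max_ent = max(ents)  # log(k): the highest entropy among all matrices
    # Step 1: drop matrices whose entropy exceeds c1 * log(k).
    kept = [(e, m) for e, m in zip(ents, mats) if e <= c1 * max_ent]
    kept.sort(key=lambda t: t[0])  # start from the lowest-entropy matrix

    retained = []
    for _, m in kept:
        # Step 2: E_s = 1 / (1 + d); drop matrices too similar to one kept.
        too_similar = any(
            1.0 / (1.0 + np.linalg.norm(m - r)) > c2 for r in retained
        )
        if not too_similar:
            retained.append(m)
    return retained
```

A duplicate matrix has distance 0 and thus \({E}_{s}=1>{c}_{2}\), so it is removed, while a maximum-entropy (uniform) matrix is already dropped by the first filter.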
As a result, a drug composite matrix and a target composite matrix are obtained from the multiple drug and target homogeneous matrices, respectively. These two matrices, like the other matrices mentioned in this section, belong to the first type of homogeneous network described in the “Learning bipartite DTI embedding” section, whose sizes are \(m\times m\) (drugs) and \(n\times n\) (targets), respectively.
Generating new embedding vectors of drugtarget pairs
To tackle the problem that some recent embedding-based methods cannot add pathway information about drug–target interactions into the embeddings of drug–target pairs (e.g., they simply concatenate the generated drug and target embeddings as the final embeddings of drug–target pairs), we provide a method that draws on path-based information (about similar drugs interacting with the same targets and similar targets sharing the same drugs) to acquire new embeddings of every drug–target pair (i.e., a reconstruction of the DTI relations (network) included in the whole dataset). The intuition behind this idea is that, although the separate drug and target embeddings produced by embedding algorithms carry certain DTI (high-order proximity) information through the learning process, the characterization of DTIs they contain is still insufficient for DTI prediction before the heterogeneous information (e.g., path-based knowledge) is added. The main calculation steps are explained in Fig. 3.
Specifically, taking the embedding generation process of a \({d}^{i}{t}^{j}\) pair as an example, we first obtain the \({d}^{i}\) and \({t}^{j}\) embeddings (\(\overrightarrow{{d}^{i}}\) and \(\overrightarrow{{t}^{j}}\)) produced by BiNE, the bipartite DTI matrix \({G}_{b}\), and the fused drug and target homogeneous matrices mentioned in the “Composite homogeneous network generation” section. Second, we acquire the five nearest drugs of \({d}^{i}\) according to the weights in the drug homogeneous matrix: we find the row corresponding to \({d}^{i}\) in the drug homogeneous matrix, sort the values in that row from largest to smallest, and select the drugs corresponding to the five largest values. In the same way, the five targets with the highest similarity to \({t}^{j}\) can be found.
Third, we multiply the embedding vector of \({d}^{i}\) by the corresponding weights (i.e., similarities) of the five selected nearest drugs in the drug homogeneous matrix, respectively, and then sum the five obtained products to acquire a new feature \({d}^{sim\_i}\); the same rule is applied to the embedding vector of \({t}^{j}\) to acquire a new feature \({t}^{sim\_j}\) (Eqs. (10)–(11)).
$${d}^{sim\_i}=\sum_{{d}^{z}\in {D}^{near}}{w}_{d}^{z}\overrightarrow{{d}^{i}}$$
(10)
$${t}^{sim\_j}=\sum_{{t}^{z}\in {T}^{near}}{w}_{t}^{z}\overrightarrow{{t}^{j}}$$
(11)
where \({D}^{near}\) and \({T}^{near}\) denote the sets of the selected nearest drugs of \({d}^{i}\) and the nearest targets of \({t}^{j}\), respectively, \({w}_{d}^{z}\) is the weight between \({d}^{z}\) and \({d}^{i}\) in the drug homogeneous matrix, and the analogous meaning applies to \({w}_{t}^{z}\). The main purpose of this step is to integrate the drug–drug and target–target homogeneous matrices (similarity information) into the embedding vectors of \({d}^{i}\) and \({t}^{j}\), respectively. In the fourth step, we multiply the embedding vector of \({t}^{j}\) by the weights in \({G}_{b}\) between the five selected nearest drugs and \({t}^{j}\), respectively, and then sum the five resulting products to acquire a new feature \({d}^{path\_i}\). At the same time, we multiply the embedding vector of \({d}^{i}\) by the weights in \({G}_{b}\) between the five selected nearest targets and \({d}^{i}\), respectively, and then sum the obtained products to create a new feature \({t}^{path\_j}\) (Eqs. (12)–(13)).
$${d}^{path\_i}=\sum_{{d}^{z}\in {D}^{near}}{w}_{{t}^{j}}^{z}\overrightarrow{{t}^{j}}$$
(12)
$${t}^{path\_j}=\sum_{{t}^{z}\in {T}^{near}}{w}_{{d}^{i}}^{z}\overrightarrow{{d}^{i}}$$
(13)
where \({w}_{{t}^{j}}^{z}\) and \({w}_{{d}^{i}}^{z}\) represent the weight between \({d}^{z}\) and \({t}^{j}\) in \({G}_{b}\) and the weight between \({t}^{z}\) and \({d}^{i}\) in \({G}_{b}\), respectively. In this step, we can model the interactive pathway information about the known interactions between \({t}^{j}\) and the drugs most similar to \({d}^{i}\), as well as the known interactions between \({d}^{i}\) and the targets most similar to \({t}^{j}\). In the fifth step, a new embedding vector \({d}^{part\_i}\) is calculated by summing the vectors \({d}^{sim\_i}\) and \({d}^{path\_i}\), and the embedding vector \({t}^{part\_j}\) is formed in a similar way (Eqs. (14)–(15)).
$${d}^{part\_i}={d}^{sim\_i}+{d}^{path\_i}$$
(14)
$${t}^{part\_j}={t}^{sim\_j}+{t}^{path\_j}$$
(15)
Finally, \({d}^{part\_i}\) and \({t}^{part\_j}\) can be concatenated to obtain an embedding of the \({d}^{i}{t}^{j}\) pair, which effectively integrates characteristics from the bipartite DTI network as well as the drug and target homogeneous networks. In addition, this calculation process is conducted after the cross-validation (CV) split.
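Steps two to five above can be condensed into one function; the interface, the use of `argsort` for neighbour selection, and the handling of self-similarity (the node itself may appear among its nearest neighbours) are illustrative assumptions:

```python
import numpy as np

def pair_embedding(i, j, d_emb, t_emb, G_b, drug_sim, target_sim, k=5):
    """Embedding of the (drug_i, target_j) pair following Eqs. (10)-(15).

    `d_emb`/`t_emb` are the BiNE embeddings (m x dim and n x dim), `G_b` is
    the m x n bipartite DTI matrix, and `drug_sim`/`target_sim` are the
    composite homogeneous matrices. A sketch with hypothetical names.
    """
    # k nearest drugs of drug_i and k nearest targets of target_j
    # (self-similarity is not excluded here, for simplicity).
    near_d = np.argsort(drug_sim[i])[::-1][:k]
    near_t = np.argsort(target_sim[j])[::-1][:k]

    # Eqs. (10)-(11): similarity-weighted features.
    d_sim = sum(drug_sim[i, z] * d_emb[i] for z in near_d)
    t_sim = sum(target_sim[j, z] * t_emb[j] for z in near_t)

    # Eqs. (12)-(13): path-based features from known interactions in G_b.
    d_path = sum(G_b[z, j] * t_emb[j] for z in near_d)
    t_path = sum(G_b[i, z] * d_emb[i] for z in near_t)

    # Eqs. (14)-(15) and the final concatenation.
    return np.concatenate([d_sim + d_path, t_sim + t_path])
```

The resulting vector has twice the dimension of a single node embedding and serves as the per-pair input to the downstream classifier.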
RFbased drugtarget interaction predictor
After acquiring the embeddings of all drug–target pairs in the dataset, the RF classifier [31] can be used to predict DTIs. RF has been shown to perform well on high-dimensional features and to resist overfitting when training data are insufficient. More importantly, it can handle the sample-class-imbalance problem efficiently. We implement the RF classifier using the scikit-learn [32] tool, with the embeddings of drug–target pairs as the input. The probability of whether each drug–target pair has a potential interaction is then predicted.
In addition, we tune the parameters of the RF classifier to better learn the complex integrated embeddings. The number of estimators is set to 100, the criterion for measuring the quality of a split is the Gini impurity, and we make the class weights of the model inversely proportional to the occurrence frequencies of the positive (known DTIs) and negative (unknown DTIs) classes based on the input labels, to further overcome the considerable imbalance between the number of known and unknown DTIs.
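A minimal sketch of this predictor configuration, assuming scikit-learn's `RandomForestClassifier` and random placeholder pair embeddings; `class_weight="balanced"` realizes the inverse-frequency weighting described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder pair embeddings and labels (1 = known DTI, 0 = unknown);
# real inputs would come from the embedding-generation step.
X = np.random.default_rng(0).random((200, 16))
y = np.array([1] * 20 + [0] * 180)

clf = RandomForestClassifier(
    n_estimators=100,         # number of estimators from the text
    criterion="gini",         # Gini impurity as the split-quality criterion
    class_weight="balanced",  # weights inversely proportional to class frequency
    random_state=0,
)
clf.fit(X, y)

# Predicted probability that each drug-target pair has a potential interaction.
scores = clf.predict_proba(X)[:, 1]
```

In practice these probabilities are computed for the held-out pairs of each CV fold and ranked to propose novel DTIs.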