After constructing the DTI matrix Z, the drug similarity matrix \(S^d\) and the target similarity matrix \(S^t\), DTI prediction is transformed into a non-negative factorization of the DTI matrix with graph dual regularization terms and a prior knowledge consistency constraint. The graph dual regularization terms integrate the information from the drug similarity matrix and the target similarity matrix, so that the intrinsic geometrical structures of the related manifolds are taken into account. Finally, an alternating direction algorithm is used to solve the matrix factorization. Furthermore, we prove the convergence of the algorithm.
Non-negative matrix factorization
In order to obtain low-dimensional feature representations of drugs and targets in the drug–target interaction space, factorization of the DTI matrix is widely adopted. The general form of the matrix factorization is as follows:
$$\begin{aligned} Z\approx X Y^T, \end{aligned}$$
(1)
where X and Y are the latent feature matrices of drugs and targets, respectively, \(X \in {R^{n \times k}}, Y \in {R^{m \times k}}\), and k is the dimension of the latent feature vectors of drugs and targets (\(k \ll \min (m,n)\)).
To improve interpretability, non-negativity constraints on X and Y are usually added. The optimization model of non-negative matrix factorization (NMF) is as follows:
$$\begin{aligned} \min ~&\left\| {Z - X{Y^T}} \right\| _F^2\\ s.t. ~&X \ge 0, Y \ge 0. \end{aligned}$$
(2)
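For reference, model (2) is commonly solved with the multiplicative update rules of Lee and Seung, which keep both factors non-negative by construction. The following NumPy sketch is purely illustrative (the random initialization, iteration count and `nmf` function name are our assumptions, not part of the method proposed in this paper):

```python
import numpy as np

def nmf(Z, k, n_iter=200, eps=1e-12):
    """Approximate a non-negative n x m matrix Z as X @ Y.T with X, Y >= 0."""
    n, m = Z.shape
    rng = np.random.default_rng(0)
    X = rng.random((n, k))  # latent drug features, n x k
    Y = rng.random((m, k))  # latent target features, m x k
    for _ in range(n_iter):
        # Multiplicative updates: non-negativity is preserved automatically.
        X *= (Z @ Y) / (X @ (Y.T @ Y) + eps)
        Y *= (Z.T @ X) / (Y @ (X.T @ X) + eps)
    return X, Y
```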
Graph regularized non-negative matrix factorization
NMF aims to approximate the DTI matrix well by finding two low-rank matrices, but it fails to consider the geometric information in the original data. To integrate the geometric information, Cai et al. [50] proposed graph regularized non-negative matrix factorization (GNMF), which introduces a graph regularization term. The cost function of GNMF is as follows:
$$\begin{aligned} \mathcal {O}_{gr}=\frac{1}{2}\left\| {Z - X{Y^T}} \right\| _F^2&+{\lambda }{\text {Tr}}({Y^T}(D-W) Y)\\&s.t. ~ X \ge 0, Y \ge 0, \end{aligned}$$
(3)
where \({\text {Tr}}\) is the trace of a matrix, \(\lambda\) is the regularization parameter, W is the weight matrix representing a neighbor graph of the data points, and D is a diagonal matrix such that \({D_{ii}} = \sum \nolimits _l {{W_{il}}}\). The matrix \(D-W\) is called the graph Laplacian and is denoted by \(\mathcal {L}\) in the following. Furthermore, considering the geometric structures of both the data manifold and the feature manifold, Shang et al. [43] proposed graph dual regularization non-negative matrix factorization (GDNMF), whose cost function is:
$$\begin{aligned} \begin{aligned} \mathcal {O}_{gd}&=\frac{1}{2}\left\| {Z - X{Y^T}} \right\| _F^2 +{\lambda _y}{\text {Tr}}({Y^T} \mathcal {L}_y Y)\\&+{\lambda _x}{\text {Tr}}({X^T} \mathcal {L}_x X)\\&s.t. ~ X \ge 0, Y \ge 0. \end{aligned} \end{aligned}$$
(4)
From the similarity matrices \(S^d\) and \(S^t\), we can obtain the geometric information of drugs and targets. First, we construct two p-nearest neighbor graphs \(N^d\) and \(N^t\) for drugs and targets, respectively.
For two drugs \(d_i\) and \(d_j\), the weight of the edge between vertices i and j in the p-nearest neighbor graph \(N^d\) is defined as follows.
$$\begin{aligned} {N_{ij}^d} = \left\{ \begin{array}{l} 1, \quad j \in {\mathcal {N}_p}(i) \text { and } i \in {\mathcal {N}_p}(j) \\ 0, \quad j \notin {\mathcal {N}_p}(i) \text { and } i \notin {\mathcal {N}_p}(j) \\ 0.5, \quad \text {otherwise,} \\ \end{array} \right. \end{aligned}$$
(5)
where \(\mathcal {N}_p(i)\) and \(\mathcal {N}_p(j)\) are the sets of p most similar drugs of drugs \(d_i\) and \(d_j\) according to \(S^d\), respectively. \(N^d\) is used to make the drug similarity matrix \(S^d\) sparse as follows.
$$\begin{aligned} {\hat{S}}_{ij}^d = {N_{ij}^d}S_{ij}^d, \forall i, j. \end{aligned}$$
(6)
\({\hat{S}}^d\) is used as the weight matrix representing the drug neighbor graph. The graph Laplacian of \({\hat{S}}^d\) is \({\mathcal {L}_d} = {D^d} - {{\hat{S}}^d}\), where \(D^d\) is a diagonal degree matrix with \(D_{ii}^d = \sum \limits _r {{\hat{S}}_{ir}^d}\).
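As an illustration, the construction in (5)–(6) and the corresponding graph Laplacian can be sketched in NumPy as follows for the drug side (the function names are ours; we assume the input is a symmetric similarity matrix):

```python
import numpy as np

def sparsify_similarity(S, p):
    """Weight matrix S_hat of the p-nearest neighbor graph, Eqs. (5)-(6)."""
    n = S.shape[0]
    order = np.argsort(-S, axis=1)  # indices sorted by decreasing similarity
    # N_p(i): the p most similar points to i, excluding i itself.
    neighbors = [set(map(int, order[i][order[i] != i][:p])) for i in range(n)]
    N = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            in_i, in_j = j in neighbors[i], i in neighbors[j]
            if in_i and in_j:
                N[i, j] = 1.0    # mutual neighbors
            elif in_i or in_j:
                N[i, j] = 0.5    # one-sided neighbors
    return N * S                 # Eq. (6): element-wise product

def graph_laplacian(S_hat):
    """L = D - S_hat, with D the diagonal degree matrix."""
    return np.diag(S_hat.sum(axis=1)) - S_hat
```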
The same processing is applied to the target similarity matrix \(S^t\) to obtain \({\hat{S}}^t\), the weight matrix representing the target neighbor graph, as follows.
$$\begin{aligned} {\hat{S}}_{ij}^t = {N_{ij}^t}S_{ij}^t, \forall i, j. \end{aligned}$$
(7)
The graph Laplacian of \({\hat{S}}^t\) is \({\mathcal {L}_t} = {D^t} - {{\hat{S}}^t}\), where \(D^t\) is a diagonal degree matrix with \(D_{jj}^t = \sum \limits _q {{\hat{S}}_{jq}^t}\).
Since the normalized graph Laplacian usually performs better in many practical applications, we adopt the following normalized forms of \(\mathcal {L}_d\) and \(\mathcal {L}_t\).
$$\begin{aligned} {\widetilde{{\mathcal {L}}}_d} = {\left( {{D^d}} \right) ^{ - 1/2}}{{{\mathcal {L}}}_d}{\left( {{D^d}} \right) ^{ - 1/2}}, \end{aligned}$$
(8)
$$\begin{aligned} {\widetilde{{\mathcal {L}}}_t} = {\left( {{D^t}} \right) ^{ - 1/2}}{{{\mathcal {L}}}_t}{\left( {{D^t}} \right) ^{ - 1/2}}. \end{aligned}$$
(9)
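Concretely, since \(D^d\) and \(D^t\) are diagonal, the normalization in (8)–(9) amounts to scaling entry (i, j) of the Laplacian by \(d_i^{-1/2} d_j^{-1/2}\); a minimal sketch (the guard for isolated vertices is our assumption):

```python
import numpy as np

def normalized_laplacian(S_hat, eps=1e-12):
    """D^{-1/2} (D - S_hat) D^{-1/2} as in Eqs. (8)-(9)."""
    d = S_hat.sum(axis=1)                           # vertex degrees
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, eps))  # guard isolated vertices
    L = np.diag(d) - S_hat
    return d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
```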
The GDNMF model for the DTI prediction problem is then formulated as follows.
$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _{(X,Y)}&\frac{1}{2}\left\| {Z - X{Y^T}} \right\| _F^2 +{\lambda _d}{\text {Tr}}({X^T}\widetilde{\mathcal {L}}_d X)\\&+ {\lambda _t}{\text {Tr}}({Y^T}\widetilde{\mathcal {L}}_t Y)\\ s.t. ~&X \ge 0, Y \ge 0, \end{aligned} \end{aligned}$$
(10)
where \(\lambda _d\) and \(\lambda _t\) are regularization parameters.
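For completeness, the objective in (10) is straightforward to evaluate, which is useful for monitoring convergence; a sketch (the function name and argument conventions are ours):

```python
import numpy as np

def gdnmf_objective(Z, X, Y, L_d, L_t, lam_d, lam_t):
    """Value of objective (10) for the current factors X, Y."""
    fit = 0.5 * np.linalg.norm(Z - X @ Y.T, "fro") ** 2
    reg_d = lam_d * np.trace(X.T @ L_d @ X)  # drug graph regularization
    reg_t = lam_t * np.trace(Y.T @ L_t @ Y)  # target graph regularization
    return fit + reg_d + reg_t
```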
Graph dual regularized non-negative matrix factorization with prior knowledge constraint for DTI prediction
To ensure that the matrix decomposition result is consistent with the prior knowledge of known DTIs, we introduce a prior knowledge consistency constraint into GDNMF and formulate the DTI prediction problem as the following optimization problem (abbreviated as GRMFC).
$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _{(X,Y)}&\frac{1}{2}\left\| {Z - X{Y^T}} \right\| _F^2 +{\lambda _d}{\text {Tr}}({X^T}\widetilde{\mathcal {L}}_d X)\\&+ {\lambda _t}{\text {Tr}}({Y^T}\widetilde{\mathcal {L}}_t Y)\\ s.t. ~&X \ge 0, Y \ge 0, \\&{\mathrm{\mathscr {P}}_\Omega }(Z - XY^T) = 0,\\ \end{aligned} \end{aligned}$$
(11)
where \(\Omega\) indexes the known drug–target interactions, i.e. the elements whose values are 1 in Z. \({\mathrm{\mathscr {P}}_\Omega }(S)\) returns a copy of S in which the elements whose indices are not in \(\Omega\) are set to zero, and is defined as follows.
$$\begin{aligned} {{\mathscr {P}_\Omega }(S)_{ij}} = \left\{ \begin{array}{l} {S_{ij}},(i,j) \in \Omega , \\ 0,(i,j) \notin \Omega . \\ \end{array} \right. \end{aligned}$$
Since the elements whose values are 1 in Z represent the known validated drug–target interactions, we introduce the prior knowledge consistency constraint \({\mathrm{\mathscr {P}}_\Omega }(Z - XY^T) = 0\) to ensure that the factorization \(XY^T\) preserves the known DTIs and does not lose the prior knowledge.
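In code, \(\mathscr {P}_\Omega\) is simply an element-wise mask; a sketch (the boolean-mask representation of \(\Omega\) is our choice):

```python
import numpy as np

def project_omega(S, omega_mask):
    """P_Omega(S): keep entries on Omega, zero out the rest."""
    return np.where(omega_mask, S, 0.0)

# With omega_mask = (Z == 1), the constraint P_Omega(Z - X @ Y.T) = 0 holds
# exactly when X @ Y.T reproduces every known interaction.
```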
GRMFC is a non-convex optimization problem, and it is difficult to obtain its exact solution. Inspired by the iterative algorithm in [41], we use an adapted alternating direction algorithm to obtain a locally optimal solution. The alternating direction algorithm (ADA) is an iterative algorithm that alternately updates the variables. In order to use ADA to solve GRMFC efficiently, we introduce auxiliary matrices M, U and V, and transform (11) into the following equivalent form.
$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _{(U,V,X,Y,M)}&\frac{1}{2}\left\| {M - X{Y^T}} \right\| _F^2 +{\lambda _d}{\text {Tr}}({X^T}\widetilde{\mathcal {L}}_d X)\\&+ {\lambda _t}{\text {Tr}}({Y^T}\widetilde{\mathcal {L}}_t Y)\\ s.t. ~&X = U, {Y^T} = V, \\&U \ge 0, V \ge 0, \\&{\mathrm{\mathscr {P}}_\Omega }(Z - M) = 0,\\ \end{aligned} \end{aligned}$$
(12)
where \(U \in {R^{n \times k}}\), \(V \in {R^{k \times m}}\). The auxiliary matrix M is regarded as the predicted drug–target interaction matrix.
The augmented Lagrangian of (12) is:
$$\begin{aligned} &{\mathscr {L}}(X,Y,M,U,V,\Lambda ,\Pi )\\&= \frac{1}{2}\left\| {M - X{Y^T}} \right\| _F^2+ {\lambda _d}{\text {Tr}}({X^T}{\widetilde{\mathcal {L}}}_d X) \\&+{\lambda _t}{\text {Tr}}({Y^T}{\widetilde{\mathcal {L}}}_t Y)+\Lambda \bullet (X - U)\\&+ \Pi \bullet (Y^T - V)+ \frac{\alpha }{2}\left\| {X - U} \right\| _F^2\\&+\frac{\beta }{2}\left\| {Y^T - V} \right\| _F^2, \end{aligned}$$
(13)
where \(\Lambda\) and \(\Pi\) are Lagrangian multipliers, \(\Lambda \in {R^{n \times k}}\), \(\Pi \in {R^{k \times m}}\), \(\bullet\) denotes the matrix inner product \(A \bullet B = \sum \nolimits _{i,j} A_{ij}B_{ij}\), and \(\alpha , \beta > 0\) are penalty parameters.
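Under these conventions, the augmented Lagrangian (13) can be evaluated as follows (a sketch for monitoring purposes; the function name and argument order are ours):

```python
import numpy as np

def augmented_lagrangian(X, Y, M, U, V, Lam, Pi,
                         L_d, L_t, lam_d, lam_t, alpha, beta):
    """Value of the augmented Lagrangian (13)."""
    fit = 0.5 * np.linalg.norm(M - X @ Y.T, "fro") ** 2
    reg = lam_d * np.trace(X.T @ L_d @ X) + lam_t * np.trace(Y.T @ L_t @ Y)
    lin = np.sum(Lam * (X - U)) + np.sum(Pi * (Y.T - V))  # the A . B terms
    pen = (0.5 * alpha * np.linalg.norm(X - U, "fro") ** 2
           + 0.5 * beta * np.linalg.norm(Y.T - V, "fro") ** 2)
    return fit + reg + lin + pen
```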
The alternating direction algorithm successively updates the matrices X, Y, M, U and V, one at a time, such that \(\mathscr {L}\) attains its minimum with respect to the updated matrix while the other matrices are held at their most recent values. The updating rules are as follows.
$$\begin{aligned} X_{i + 1}&= \mathop {\arg \min }\limits _{X} {\mathscr {L}}(X,Y_i,M_i,U_i,V_i,\Lambda _i,\Pi _i), \\ Y_{i + 1}&= \mathop {\arg \min }\limits _{Y} {\mathscr {L}}(X_{i + 1},Y,M_i,U_i,V_i,\Lambda _i,\Pi _i), \\ M_{i + 1}&= \mathop {\arg \min }\limits _{{\mathscr {P}}_{\Omega }(Z - M) = 0} {\mathscr {L}}(X_{i + 1},Y_{i + 1},M,U_i,V_i,\Lambda _i,\Pi _i), \\ U_{i + 1}&= \mathop {\arg \min }\limits _{U \ge 0} {\mathscr {L}}(X_{i + 1},Y_{i + 1},M_{i + 1},U,V_i,\Lambda _i,\Pi _i), \\ V_{i + 1}&= \mathop {\arg \min }\limits _{V \ge 0} {\mathscr {L}}(X_{i + 1},Y_{i + 1},M_{i + 1},U_{i + 1},V,\Lambda _i,\Pi _i), \\ \Lambda _{i + 1}&= \Lambda _i + \gamma \alpha (X_{i + 1} - U_{i + 1}), \\ \Pi _{i + 1}&= \Pi _i + \gamma \beta (Y_{i + 1}^T - V_{i + 1}). \end{aligned}$$
In closed form, the updating rules are as follows.
$$\begin{aligned} &{X_{i + 1}} = ({M_i}{Y_i} - {\lambda _d}{\widetilde{{\mathcal {L}}}_d}X_i + \alpha {U_i}- {\Lambda _i}) \\&\quad \quad {(Y_i^T{Y_i} + \alpha I)^{ - 1}},\\&{Y_{i + 1}} = ({M_{i}^T}{X_{i+1}} - {\lambda _t}{\widetilde{{\mathcal {L}}}_t}Y_i + \beta {V_i^T} - {\Pi _i^T})\\&\quad \quad {(X_{i+1}^T{X_{i+1}} + \beta I)^{ - 1}},\\&{M_{i + 1}} = X_{i + 1}Y_{i + 1}^T+{\mathscr {P}}_{\Omega }(Z-X_{i + 1}Y_{i + 1}^T),\\&{U_{i + 1}} ={\mathscr {P}}_{+}(X_{i + 1}+{\Lambda _i}/ \alpha ),\\&{V_{i + 1}} = {\mathscr {P}}_{+}(Y_{i + 1}^T+{\Pi _i}/ \beta ),\\&{\Lambda _{i + 1}} = {\Lambda _i} + \gamma \alpha ({X_{i + 1}} - {U_{i + 1}}), \\&{\Pi _{i + 1}} = {\Pi _i} + \gamma \beta ({Y_{i + 1}^T} - {V_{i + 1}}), \end{aligned}$$
(14)
where \((\mathscr {P}_{+}(S))_{ij}=\max \{S_{ij}, 0\}\), and \(\gamma\) is a step length parameter, which is set to 1.618 in the following experiments according to [41].
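A NumPy sketch of one pass of the closed-form updates (14), under the dimension conventions above (\(V, \Pi \in R^{k \times m}\)); this is an illustration of the update rules, not the authors' released implementation:

```python
import numpy as np

def ada_grmfc_step(Z, omega_mask, X, Y, M, U, V, Lam, Pi,
                   L_d, L_t, lam_d, lam_t, alpha, beta, gamma=1.618):
    """One iteration of (14); omega_mask is True on known interactions."""
    I = np.eye(X.shape[1])
    X = (M @ Y - lam_d * (L_d @ X) + alpha * U - Lam) \
        @ np.linalg.inv(Y.T @ Y + alpha * I)
    Y = (M.T @ X - lam_t * (L_t @ Y) + beta * V.T - Pi.T) \
        @ np.linalg.inv(X.T @ X + beta * I)
    R = X @ Y.T                                   # current reconstruction
    M = R + np.where(omega_mask, Z - R, 0.0)      # keep known DTIs fixed
    U = np.maximum(X + Lam / alpha, 0.0)          # projection P_+
    V = np.maximum(Y.T + Pi / beta, 0.0)
    Lam = Lam + gamma * alpha * (X - U)
    Pi = Pi + gamma * beta * (Y.T - V)
    return X, Y, M, U, V, Lam, Pi
```

Iterating this step until the change in M falls below a tolerance realizes the stopping rule described next.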
The iteration terminates when the change of M between consecutive iterations is smaller than a given threshold. The pseudocode of the algorithm (ADA-GRMFC) is shown in Algorithm 1. The proof of convergence of ADA-GRMFC is given in Additional file 1: Appendix.