MOKPE: drug–target interaction prediction via manifold optimization based kernel preserving embedding

Background: In many applications of bioinformatics, data stem from distinct heterogeneous sources. One of the well-known examples is the identification of drug–target interactions (DTIs), which is of significant importance in drug discovery. In this paper, we propose a novel framework, manifold optimization based kernel preserving embedding (MOKPE), to efficiently solve the problem of modeling heterogeneous data. Our model projects heterogeneous drug and target data into a unified embedding space by preserving drug–target interactions and drug–drug, target–target similarities simultaneously.

Results: We performed ten replications of ten-fold cross-validation on four different drug–target interaction network data sets to predict DTIs for previously unseen drugs. In terms of classification evaluation metrics, MOKPE performed better than or comparably to previous state-of-the-art similarity-based methods. We also evaluated MOKPE on predicting unknown DTIs of a given network. Our implementation of the proposed algorithm in R, together with the scripts that replicate the reported experiments, is publicly available at https://github.com/ocbinatli/mokpe.


S1 Manifold Optimization based Kernel Preserving Embedding
Our data come from two different domains, namely, $\mathbb{D}$ and $\mathbb{T}$, and we are given two sets of objects $\mathcal{D} = \{d_i \in \mathbb{D}\}_{i=1}^{N_d}$ and $\mathcal{T} = \{t_i \in \mathbb{T}\}_{i=1}^{N_t}$, corresponding to drugs and targets, respectively. In practical applications, these objects usually have vectorial representations (i.e., $\mathbb{D}$ and $\mathbb{T}$ are Euclidean spaces); however, the domains in our work contain non-vectorial structures, such as protein sequence strings for targets. To have a general notation covering both vectorial and non-vectorial data, we assume that the drug–target interactions and the drug–drug and target–target similarities are provided by three scoring functions:
i. $s_c^{i,j}: \mathbb{D} \times \mathbb{T} \to \mathbb{R}$ gives the drug–target interaction score between $d_i$ and $t_j$,
ii. $s_d^{i,j}: \mathbb{D} \times \mathbb{D} \to \mathbb{R}$ gives the drug–drug similarity score between drugs $d_i$ and $d_j$,
iii. $s_t^{i,j}: \mathbb{T} \times \mathbb{T} \to \mathbb{R}$ gives the target–target similarity score between targets $t_i$ and $t_j$.
Hence, we can use the within-domain similarity functions to represent the objects in a vectorial form, which is known as the empirical kernel map [1, Chapter 2]:
$$\mathbf{d}_i = [s_d^{i,1}\;\; s_d^{i,2}\;\; \cdots\;\; s_d^{i,N_d}]^\top \qquad \mathbf{t}_j = [s_t^{j,1}\;\; s_t^{j,2}\;\; \cdots\;\; s_t^{j,N_t}]^\top.$$
This technique allows us to introduce non-linearity into the embedding part [2, Chapter 2] even for models where the objects have corresponding vectorial representations. We also introduce three index sets, namely, $\mathcal{I}_c = \{(i,j): s_c^{i,j} \text{ is known}\}$, $\mathcal{I}_d = \{(i,j): s_d^{i,j} \text{ is known}\}$, and $\mathcal{I}_t = \{(i,j): s_t^{i,j} \text{ is known}\}$, to represent the known information coming from these scoring functions.
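To make the data layout concrete, the following minimal R sketch builds toy similarity and interaction matrices and the empirical kernel map representations; the object names (S_c, S_d, S_t, D_mat, T_mat) and the use of NA to mark unknown scores are illustrative conventions of this sketch, not part of the released implementation.

```r
# Hypothetical toy setup with N_d drugs and N_t targets.
N_d <- 4; N_t <- 3
set.seed(1)
S_d <- matrix(runif(N_d^2), N_d, N_d); S_d <- (S_d + t(S_d)) / 2  # drug-drug similarity scores
S_t <- matrix(runif(N_t^2), N_t, N_t); S_t <- (S_t + t(S_t)) / 2  # target-target similarity scores
S_c <- matrix(rbinom(N_d * N_t, 1, 0.3), N_d, N_t)                # drug-target interaction scores
S_c[sample(length(S_c), 2)] <- NA                                 # scores outside the index set I_c

# Empirical kernel map: drug d_i is represented by its similarity profile,
# i.e., the i-th row of S_d; targets are handled analogously.
D_mat <- S_d
T_mat <- S_t

# Index set I_c as (i, j) pairs for which s_c^{i,j} is known.
I_c <- which(!is.na(S_c), arr.ind = TRUE)
```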
We map objects from the two domains into a unified embedding space: the objects in $\mathcal{D}$ and $\mathcal{T}$ are converted into $R$-dimensional vectors of a Euclidean space, namely,
$$d_i \to \mathbf{e}_{d,i} \in \mathbb{R}^R \qquad t_j \to \mathbf{e}_{t,j} \in \mathbb{R}^R.$$
However, to be able to do out-of-sample embedding, we assume linear projections from the input domains to the embedding domain and optimize the model with respect to the projection matrices. The embedding coordinates are formulated as
$$\mathbf{e}_{d,i} = \mathbf{Q}_d \mathbf{d}_i \qquad \mathbf{e}_{t,j} = \mathbf{Q}_t \mathbf{t}_j,$$
where $\mathbf{Q}_d \in \mathbb{R}^{R \times D_d}$ and $\mathbf{Q}_t \in \mathbb{R}^{R \times D_t}$ are the projection matrices, and we assume that both types of objects have vectorial representations (i.e., $\mathbb{D} = \mathbb{R}^{D_d}$ and $\mathbb{T} = \mathbb{R}^{D_t}$, where the empirical kernel map sets $D_d$ and $D_t$ to $N_d$ and $N_t$, respectively). We want to approximate the scoring functions $s_c^{i,j}$, $s_d^{i,j}$, and $s_t^{i,j}$ by three kernel functions,
$$k_c^{i,j} = k_c(\mathbf{e}_{d,i}, \mathbf{e}_{t,j}) \qquad k_d^{i,j} = k_d(\mathbf{e}_{d,i}, \mathbf{e}_{d,j}) \qquad k_t^{i,j} = k_t(\mathbf{e}_{t,i}, \mathbf{e}_{t,j}).$$
These three kernel functions in the embedding space have to be differentiable with respect to the projected coordinates so that the gradients required for the optimization step can be calculated. We use the Gaussian kernel (also called the radial basis function kernel) on the linearly projected coordinates to capture the local neighborhood information coming from the cross-domain interactions and within-domain similarities. The kernel functions in the embedding space can be written as
$$k_c^{i,j} = \exp(V_c^{i,j}) \qquad k_d^{i,j} = \exp(V_d^{i,j}) \qquad k_t^{i,j} = \exp(V_t^{i,j}),$$
where $\sigma_e \in \mathbb{R}_{++}$ is the kernel width, and the auxiliary variables
$$V_c^{i,j} = -\frac{\|\mathbf{e}_{d,i} - \mathbf{e}_{t,j}\|_2^2}{2\sigma_e^2} \qquad V_d^{i,j} = -\frac{\|\mathbf{e}_{d,i} - \mathbf{e}_{d,j}\|_2^2}{2\sigma_e^2} \qquad V_t^{i,j} = -\frac{\|\mathbf{e}_{t,i} - \mathbf{e}_{t,j}\|_2^2}{2\sigma_e^2}$$
are used for notational simplicity.
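Continuing the toy setup above, a minimal R sketch of the linear projections and the Gaussian kernels in the embedding space (all names are again illustrative):

```r
# Project inputs into the R-dimensional embedding space; rows are objects.
embed_coords <- function(Q, X) X %*% t(Q)

# Gaussian kernel between two sets of embedding coordinates:
# k^{i,j} = exp(V^{i,j}) with V^{i,j} = -||e_i - e_j||^2 / (2 * sigma_e^2).
gaussian_kernel <- function(E1, E2, sigma_e) {
  sq_dist <- outer(rowSums(E1^2), rowSums(E2^2), "+") - 2 * E1 %*% t(E2)
  exp(-sq_dist / (2 * sigma_e^2))
}

R_dim <- 2
Q_d <- matrix(rnorm(R_dim * N_d), R_dim, N_d)  # R x D_d, with D_d = N_d here
Q_t <- matrix(rnorm(R_dim * N_t), R_dim, N_t)  # R x D_t, with D_t = N_t here
E_d <- embed_coords(Q_d, D_mat)                # N_d x R drug coordinates
E_t <- embed_coords(Q_t, T_mat)                # N_t x R target coordinates
K_c <- gaussian_kernel(E_d, E_t, sigma_e = 1)  # approximates S_c
```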
We propose to preserve the interaction and similarity scores together using a compound loss function:
$$L = \frac{\lambda_c}{|\mathcal{I}_c|}\sum_{(i,j)\in\mathcal{I}_c}\bigl(s_c^{i,j}-k_c^{i,j}\bigr)^2 + \frac{\lambda_d}{|\mathcal{I}_d|}\sum_{(i,j)\in\mathcal{I}_d}\bigl(s_d^{i,j}-k_d^{i,j}\bigr)^2 + \frac{\lambda_t}{|\mathcal{I}_t|}\sum_{(i,j)\in\mathcal{I}_t}\bigl(s_t^{i,j}-k_t^{i,j}\bigr)^2,$$
where $|\cdot|$ gives the cardinality of the input set. We have separate mean squared error terms and separate regularization hyperparameters, namely, $\lambda_c \in \mathbb{R}_+$, $\lambda_d \in \mathbb{R}_+$, and $\lambda_t \in \mathbb{R}_+$, to adjust their weights. The corresponding optimization problem is
$$\begin{aligned}\underset{\mathbf{E}_d,\,\mathbf{E}_t,\,\sigma_e}{\text{minimize}}\quad & L \\ \text{subject to}\quad & \mathbf{E}_d\mathbf{E}_d^\top = \mathbf{I}_R, \quad \mathbf{E}_t\mathbf{E}_t^\top = \mathbf{I}_R, \quad \sigma_e \in \mathbb{R}_{++},\end{aligned}$$
where $\mathbf{E}_d = [\mathbf{e}_{d,1}\ \cdots\ \mathbf{e}_{d,N_d}]$ and $\mathbf{E}_t = [\mathbf{e}_{t,1}\ \cdots\ \mathbf{e}_{t,N_t}]$ collect the embedding coordinates, and we assume orthonormality of the embedding dimensions in each domain separately.
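Under the same illustrative conventions, the compound loss can be sketched in a few lines of R; NA entries are excluded so that each mean squared error term runs over its own index set:

```r
# Mean squared error over the known entries of a score matrix.
mse_known <- function(S, K) mean((S - K)[!is.na(S)]^2)

compound_loss <- function(Q_d, Q_t, sigma_e,
                          lambda = c(c = 1, d = 0.1, t = 0.1)) {
  E_d <- embed_coords(Q_d, D_mat)
  E_t <- embed_coords(Q_t, T_mat)
  lambda["c"] * mse_known(S_c, gaussian_kernel(E_d, E_t, sigma_e)) +
    lambda["d"] * mse_known(S_d, gaussian_kernel(E_d, E_d, sigma_e)) +
    lambda["t"] * mse_known(S_t, gaussian_kernel(E_t, E_t, sigma_e))
}
```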
With this assumption, we prevent the scaling ambiguity caused by heterogeneous objects, and we are able to extract uncorrelated information in each dimension of the embedding space. The objective function of the optimization problem is non-convex due to the non-linearity introduced by the Gaussian kernels, and global optimization is not possible with known methods. Therefore, we employ a quasi-Newton method based framework to find a locally optimal solution. In our method, we need to satisfy the orthonormality constraints on the embedding space and the non-negativity constraint on the kernel width. We can write the gradients of $L$ with respect to the projection matrices as
$$\frac{\partial L}{\partial \mathbf{Q}_d} = -\frac{2\lambda_c}{|\mathcal{I}_c|}\sum_{(i,j)\in\mathcal{I}_c}\bigl(s_c^{i,j}-k_c^{i,j}\bigr)\,k_c^{i,j}\,\frac{\partial V_c^{i,j}}{\partial \mathbf{Q}_d} - \frac{2\lambda_d}{|\mathcal{I}_d|}\sum_{(i,j)\in\mathcal{I}_d}\bigl(s_d^{i,j}-k_d^{i,j}\bigr)\,k_d^{i,j}\,\frac{\partial V_d^{i,j}}{\partial \mathbf{Q}_d}$$
$$\frac{\partial L}{\partial \mathbf{Q}_t} = -\frac{2\lambda_c}{|\mathcal{I}_c|}\sum_{(i,j)\in\mathcal{I}_c}\bigl(s_c^{i,j}-k_c^{i,j}\bigr)\,k_c^{i,j}\,\frac{\partial V_c^{i,j}}{\partial \mathbf{Q}_t} - \frac{2\lambda_t}{|\mathcal{I}_t|}\sum_{(i,j)\in\mathcal{I}_t}\bigl(s_t^{i,j}-k_t^{i,j}\bigr)\,k_t^{i,j}\,\frac{\partial V_t^{i,j}}{\partial \mathbf{Q}_t},$$
where the gradients of the auxiliary variables are found as
$$\frac{\partial V_c^{i,j}}{\partial \mathbf{Q}_d} = -\frac{(\mathbf{e}_{d,i}-\mathbf{e}_{t,j})\,\mathbf{d}_i^\top}{\sigma_e^2} \qquad \frac{\partial V_d^{i,j}}{\partial \mathbf{Q}_d} = -\frac{(\mathbf{e}_{d,i}-\mathbf{e}_{d,j})(\mathbf{d}_i-\mathbf{d}_j)^\top}{\sigma_e^2}$$
$$\frac{\partial V_c^{i,j}}{\partial \mathbf{Q}_t} = \frac{(\mathbf{e}_{d,i}-\mathbf{e}_{t,j})\,\mathbf{t}_j^\top}{\sigma_e^2} \qquad \frac{\partial V_t^{i,j}}{\partial \mathbf{Q}_t} = -\frac{(\mathbf{e}_{t,i}-\mathbf{e}_{t,j})(\mathbf{t}_i-\mathbf{t}_j)^\top}{\sigma_e^2}.$$
Because of the orthonormality constraints, the embedding coordinates of each domain are defined on a Stiefel manifold (i.e., $\mathcal{S}(R, N) = \{\mathbf{E} \in \mathbb{R}^{R \times N} : \mathbf{E}\mathbf{E}^\top = \mathbf{I}_R\}$), which is a Riemannian manifold [3]. To satisfy these orthonormality constraints, we use a method proven efficient for optimization on Stiefel manifolds to update the embedding coordinates [4].
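Since each coordinate matrix lives on a Stiefel manifold, its update can be delegated to a Riemannian solver such as LRBFGS from the ManifoldOptim package used in our experiments. The sketch below is schematic and follows the package documentation; loss_given_coords, E_t_fixed, sigma_e_fixed, and E_d_init are hypothetical names for the fixed quantities of the $\mathbf{E}_d$ sub-problem:

```r
library(ManifoldOptim)

# Objective for the E_d sub-problem; ManifoldOptim optimizes over a flattened
# N_d x R matrix with orthonormal columns, i.e., E_d^T.
f_Ed <- function(x) {
  E_d <- t(matrix(x, N_d, R_dim))
  loss_given_coords(E_d, E_t_fixed, sigma_e_fixed)  # hypothetical wrapper around L
}

mod  <- Module("ManifoldOptim_module", PACKAGE = "ManifoldOptim")
prob <- new(mod$RProblem, f_Ed)       # numerical gradient is used if none is supplied
mani <- get.stiefel.defn(N_d, R_dim)  # {X in R^{N_d x R} : X^T X = I_R}
res  <- manifold.optim(prob, mani, method = "LRBFGS",
                       mani.params = get.manifold.params(),
                       solver.params = get.solver.params(Tolerance = 1e-3),
                       x0 = as.numeric(t(E_d_init)))
E_d_new <- t(matrix(res$xopt, N_d, R_dim))  # back to the R x N_d convention
```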
When learning the kernel width, we apply the change of variable technique and work on the logarithmic scale to satisfy the non-negativity constraint. We define a new variable for the logarithm of the kernel width ($\eta_e = \log \sigma_e$) and run an unconstrained optimization method on this variable. The gradient of $L$ with respect to $\eta_e$ is
$$\frac{\partial L}{\partial \eta_e} = -\frac{2\lambda_c}{|\mathcal{I}_c|}\sum_{(i,j)\in\mathcal{I}_c}\bigl(s_c^{i,j}-k_c^{i,j}\bigr)\,k_c^{i,j}\,\frac{\partial V_c^{i,j}}{\partial \eta_e} - \frac{2\lambda_d}{|\mathcal{I}_d|}\sum_{(i,j)\in\mathcal{I}_d}\bigl(s_d^{i,j}-k_d^{i,j}\bigr)\,k_d^{i,j}\,\frac{\partial V_d^{i,j}}{\partial \eta_e} - \frac{2\lambda_t}{|\mathcal{I}_t|}\sum_{(i,j)\in\mathcal{I}_t}\bigl(s_t^{i,j}-k_t^{i,j}\bigr)\,k_t^{i,j}\,\frac{\partial V_t^{i,j}}{\partial \eta_e},$$
where the gradients of the auxiliary variables are found as
$$\frac{\partial V_c^{i,j}}{\partial \eta_e} = -2V_c^{i,j} \qquad \frac{\partial V_d^{i,j}}{\partial \eta_e} = -2V_d^{i,j} \qquad \frac{\partial V_t^{i,j}}{\partial \eta_e} = -2V_t^{i,j}.$$
Our complete algorithm is an alternating optimization scheme consisting of three main steps: i. update $\mathbf{E}_d$ given $\mathbf{E}_t$ and $\sigma_e$, ii. update $\mathbf{E}_t$ given $\mathbf{E}_d$ and $\sigma_e$, and iii. update $\sigma_e$ given $\mathbf{E}_d$ and $\mathbf{E}_t$.
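For the kernel width step, the change of variable amounts to running an unconstrained optimizer on $\eta_e$; below is a minimal sketch with base R's optim as an illustrative stand-in for the quasi-Newton update, reusing the compound_loss sketch from above:

```r
# Optimizing over eta_e = log(sigma_e) guarantees sigma_e > 0.
sigma_e_init <- 1
eta_opt <- optim(par = log(sigma_e_init),
                 fn  = function(eta_e) compound_loss(Q_d, Q_t, sigma_e = exp(eta_e)),
                 method = "BFGS")
sigma_e <- exp(eta_opt$par)
```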
The optimization procedure sequentially updates the decision variables until the stopping criteria are satisfied. Algorithm 1 summarizes the training procedure of our proposed framework.

Algorithm 1 Manifold Optimization based Kernel Preserving Embedding
Input: the scoring functions $s_c^{i,j}$, $s_d^{i,j}$, and $s_t^{i,j}$; the index sets $\mathcal{I}_c$, $\mathcal{I}_d$, and $\mathcal{I}_t$; subspace dimensionality $R$; regularization parameters $\lambda_c$, $\lambda_d$, $\lambda_t$; relative tolerance parameters $\epsilon_d$, $\epsilon_t$
Output: the embedding coordinates $\mathbf{E}_d$ and $\mathbf{E}_t$, and the kernel width $\sigma_e$
1: initialize $\mathbf{E}_d$, $\mathbf{E}_t$, and $\sigma_e$
2: repeat
3:   update $\mathbf{E}_d$ on the Stiefel manifold given $\mathbf{E}_t$ and $\sigma_e$
4:   update $\mathbf{E}_t$ on the Stiefel manifold given $\mathbf{E}_d$ and $\sigma_e$
5:   update $\sigma_e$ on the logarithmic scale given $\mathbf{E}_d$ and $\mathbf{E}_t$
6: until stopping criterion is met
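A compact R sketch of this alternating scheme follows; update_drug_coords, update_target_coords, and update_kernel_width are hypothetical wrappers around the Stiefel and log-scale updates sketched earlier, and the relative-improvement test mirrors the tolerance parameters $\epsilon_d$ and $\epsilon_t$:

```r
# Illustrative skeleton of Algorithm 1 (not the released implementation).
L_old <- loss_given_coords(E_d, E_t, sigma_e)
repeat {
  E_d     <- update_drug_coords(E_t, sigma_e)    # step i: Stiefel update of E_d
  E_t     <- update_target_coords(E_d, sigma_e)  # step ii: Stiefel update of E_t
  sigma_e <- update_kernel_width(E_d, E_t)       # step iii: log-scale update of sigma_e
  L_new   <- loss_given_coords(E_d, E_t, sigma_e)
  if ((L_old - L_new) / abs(L_old) < 1e-3) break # relative-improvement stopping criterion
  L_old <- L_new
}
```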

S2 Comparison of MOKPE against the Steepest Descent Method
We compare our algorithm against MKPE [5], a steepest descent based method with Armijo-type line search, and demonstrate the performance of out-of-sample embedding in predicting interactions for unseen drugs. For the NR, GPCR, and IC data sets, we apply ten replications of ten-fold cross-validation over drugs to obtain robust results. Since the Enzyme data set is much larger than the others, achieving convergence within a reasonable time frame may not be possible for either method, especially for larger subspace dimensionalities (i.e., 20 and 25); therefore, it is not included in the comparison. We port the MATLAB implementation of MKPE provided by [5] to R and use it with its default parameters. The code is publicly available at https://github.com/ocbinatli/mkpe. We obtain the results of both methods by training them with subspace dimensionality parameter values taken from {5, 10, 15, 20, 25}. We use the manifold optimization library ManifoldOptim (version 1.0.1) [6] to employ the LRBFGS algorithm in the R programming language (version 4.0.2) [7]. We set the regularization parameters ($\lambda_c$, $\lambda_d$, $\lambda_t$) to (1, 0.1, 0.1). For MOKPE, we set the relative improvement tolerance parameters ($\epsilon_d$, $\epsilon_t$) to ($10^{-3}$, $10^{-3}$) as stopping criteria. We use the default values for all other input parameters. We train both methods for an equal amount of time, increasing linearly with the subspace dimensionality; Table S1 shows the maximum allowed training time per fold in seconds. All experiments are run on the same computer cluster family.

Figure S1 gives the average AUROC (area under the receiver operating characteristic curve) values for MKPE and MOKPE. When the subspace dimensionality is smaller than twenty-five, we see that MOKPE achieves significantly higher average AUROC values on all the data sets. Figure S2 shows the objective function value over time for both methods on the training sets; each smooth curve is a LOESS (locally estimated scatterplot smoothing) fit. MKPE starts better but is more likely to get stuck in a poor local optimum later on. Also, the iterations of MKPE are less costly time-wise, which provides an advantage over MOKPE in the starting phase.

Figure S2: Objective function values vs. processing time (in seconds) on the training sets for MKPE and MOKPE with changing subspace dimensionality on the NR, GPCR, and IC data sets.
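For completeness, a minimal R sketch of the drug-wise cross-validation protocol described above; the exact splitting code is part of the linked repository, and this version is illustrative only:

```r
# Ten replications of ten-fold cross-validation over drugs.
set.seed(2022)
for (replication in 1:10) {
  fold_id <- sample(rep(1:10, length.out = N_d))  # random fold assignment per drug
  for (fold in 1:10) {
    test_drugs  <- which(fold_id == fold)
    train_drugs <- which(fold_id != fold)
    # Train on train_drugs, embed test_drugs out-of-sample,
    # then score predictions against S_c[test_drugs, ] (e.g., AUROC).
  }
}
```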