L2,1-GRMF: an improved graph regularized matrix factorization method to predict drug-target interactions

Background Predicting drug-target interactions is time-consuming and expensive. It is important to present the accuracy of the calculation method. There are many algorithms to predict global interactions, some of which use drug-target networks for prediction (ie, a bipartite graph of bound drug pairs and targets known to interact). Although these algorithms can predict some drug-target interactions to some extent, there is little effect for some new drugs or targets that have no known interaction. Results Since the datasets are usually located at or near low-dimensional nonlinear manifolds, we propose an improved GRMF (graph regularized matrix factorization) method to learn these flow patterns in combination with the previous matrix-decomposition method. In addition, we use one of the pre-processing steps previously proposed to improve the accuracy of the prediction. Conclusions Cross-validation is used to evaluate our method, and simulation experiments are used to predict new interactions. In most cases, our method is superior to other methods. Finally, some examples of new drugs and new targets are predicted by performing simulation experiments. And the improved GRMF method can better predict the remaining drug-target interactions.


Background
With advances in drug discovery technologies, the existing methods can identify drug targets to some extent. But drug development is a high-cost, inefficient problem [1]. For drug developers, there has been a great deal of interest in the repositioning of drugs. This repositioning has some potential to reduce risk time and cost [2]. A crucial element for the repositioning of medicines is online biological databases such as KEGG [3], DrugBank [4], STITCH [5] and ChEMBL [6], which store a large number of current drug-target interactions. It is worth noting that there are still many interactions that have not been found [7]. Therefore, the advances of drug-target prediction technology is accelerated, and more and more prediction methods are proposed [8]. These computations, which reasonably predict new and unexplored interactions, have greatly facilitated the drug discovery process, making the process more credible.
problems. The main methods are Bayesian matrix factorization, KBMF2K [14] and collaborative matrix factorization method, CMF [15]. A high-dimensional drug-target interaction matrix is decomposed into a plurality of low-dimensional matrices, and these matrices have characteristics of the original matrices, which is the principle of these methods. However, in theories, the above methods of matrix factorization still have some room for improvement. [16].
Using chemogenomic approaches to predict drug-target interactions is an effective method. The reason is that the first two methods have their own drawbacks. If a docking simulation is used, the three-dimensional structure of a target protein must be available. Furthermore, for ligand-based methods, if there are few or no target proteins known, this would be a problem that cannot be ignored. [9]. The advantage of using chemical genomics is that the information from the drugs and targets is used simultaneously for prediction [17]. New interactions are inferred by calculating the similarity of the chemical structures between drugs and the similarity of the genomic sequences between the targets. In this paper, the drug similarity and the target similarity are based on the construction methods in previous studies, which are based on the characteristics of the drug and the characteristics of the target. Its advantage is that we are better able to compare it with other methods, which is universal. However, if the same construction method of the drug similarity and the target similarity is used, this may affect the final results.
Two separate models are used to train drug target pairs, one based on the drug side and the other based on the target side. Thus, the final results are solved by predicting these two aspects. In this paper, to avoid over-fitting and sparing the target, the L 2,1 -norm is added in our method, which can eliminate some unattached target pairs [18]. Ten-fold cross-validation is used to evaluate the performance of our method.
We present the experimental results in Results. In Datasets, we conducted a case study. And we summarize this paper in Cross-validation experiments. In Interaction prediction under CVd, we clearly introduced the methods, including specific iteration formulas and algorithms.

Datasets
Four datasets are used to experiment: the nuclear receptor (NR), the G protein-coupled receptor (GPCR), the ion channel (IC) and the enzyme (E). The size of these four datasets is different. Nuclear receptors are one of the most abundant transcriptional regulators in metazoans. NR includes some steroid hormones, vitamin D and quinone. In recent years, nuclear receptors have received widespread attention. For example, they are closely related to the development of diseases such as diabetes and fatty liver. Among them, PPAR-g agonist thiazolidinedione rosiglitazone can effectively improve insulin sensitivity in diabetic patients. GPCRs are one of the target enzymes that are important proteins in cell signaling and have so far been found as therapeutic drugs. The total number of targets is about 500, and GPCR targets account for the vast majority of receptors therein. In recent years, indications for targeting GPCR drugs are expanding from traditional areas such as allergies, hypertension, anesthesia and schizophrenia to new areas such as obesity. An ion channel is a pore-forming protein that traverses the channel by allowing an ion of a particular type to rely on an electrochemical gradient. ICs are small pores in the cell membrane that allow ions to enter and exit the cell. Therefore, most of them have become the targets of some mainstream drugs. Enzymes are macromolecular biocatalysts. Some common drugs use enzymes as targets, and some effects on enzymes such as inhibition, induction, activation or reactivation are exerted. In addition, drugs like this are mostly enzyme inhibitors. According to statistics, half of the top 20 drugs in the world are enzyme inhibitors. It is worth noting that some drugs are enzymes themselves, such as pepsin and trypsin.
Each dataset contains three matrices, Y, S d and S t . Matrix Y represents the drug-target interactions. It is worth noting that this matrix is an adjacency matrix. If it is known that the drug d i is related to the target t j , Y ij is 1, otherwise Y ij is 0. The matrix S d represents the chemical pairing structural similarity [19] and the matrix S t represents the genome sequence similarity of the target pair [20]. Table 1 lists the specific information for the four datasets. More information about the datasets are published in https://github.com/cuizhensdws/L21-GRMF.

Cross-validation experiments
We compare the existing matrix decomposition methods CMF (Collaborative matrix factorization), GRMF (Graph regularized matrix factorization), WGRMF (Weighted graph regularized matrix factorization) and our proposed method and compare WKNKN preprocessing on these methods. We use cross-validation experiments on these methods. In this paper, we use a ten-fold cross-validation (CV). The original dataset Y is divided into ten subsets, each of which is tested once and the rest as a training set. The cross-validation is repeated five times, one subset is selected each time as a test set, and the average cross-validation recognition accuracy rate of five times is taken as a result.
To verify the effect of the prediction, we use the evaluation index which has been widely used before, the AUPR (Area under the Precision-Recall curve) [21]. There is also an evaluation scale called AUC (Area under the receiver operating characteristic curve). We can use this method when forecasting. In our experiments, ten AUPR values are calculated for each ten-fold cross-validation, an average is obtained and we repeat five times, so we take the average of the five AUPRs as the final result [22]. In general, the AUPR value is less than the AUC value. The AUPR value is above 0.3, so the experimental results are reasonable.
We test two aspects [23], one is CVd which is based on the drug-interaction profiles and the other is CVt, which is based on the target-interaction profiles. CVd is used to test the ability to predict new drugs, CVt is used to test the ability to predict new targets. In addition, we perform a convergence analysis of each method using the NR and GPCR datasets as examples, and each method is subjected to 100 iterations. When the number of iterations is about 20, our method achieves convergence. It is worth noting that we have different tolerances for errors, considering the size and type of the datasets. Generally speaking, as long as the error is within a reasonable range, this is acceptable. Figures 1 and 2 show the convergence of different methods on the NR and GPCR datasets, respectively. Table 2 lists the experimental results at CVd. And Standard deviations are given in parentheses. Under the NR dataset, the L 2,1 -GRMF (L 2,1 -norm Graph regularized matrix factorization) method is superior to the GRMF method and is almost the same as the GRMF method after adding the WKNKN. Importantly, our improved method L 2,1 -GRMF, with the addition of WKNKN, has seen significant improvements. Moreover, after adding the weight matrix to L 2,1 -GRMF and using WKNKN, the accuracy of prediction is also improved. Figure 3 shows the PR curves on the CVd side of each method on the NR dataset.

Interaction prediction under CVd
However, on the GPCR dataset, we run our method and find that it is not outperform the previous method, and initially estimate that there is a problem with the dataset itself. Figure 4 shows the PR curves on the CVd side of each method on the GPCR dataset. We observe that using the weight matrix when performing CVd experiments is higher than the AUPR value obtained without using the weight matrix. In addition, the L 2,1 -WGRMF (Weighted L 2,1 -norm graph regularized matrix factorization) method using WKNKN is superior to any other method in the IC dataset, slightly better than the WGRMF method using WKNKN. Figure 5 shows the PR curves on the CVd side of each method on the IC dataset. In the E dataset, the best method is L 2,1 -WGRMF but the AUPR score drops instead after applying WKNKN. In other words, in the E dataset, the preprocessing step will actually have a negative effect on the forecast result. Figure 6 shows the PR curves on the CVd side of each method on the E dataset. In general, not all methods use WKNKN to improve AUPR scores, which have a positive effect on most datasets and negative effects on some datasets. In practice, the negative impact of the WKNKN method is unavoidable on some datasets. One important reason is that the WKNKN Fig. 1 Comparison of convergence about three methods on the NR dataset method assigns an inaccurate value to the 0 element of the matrix Y on the E dataset. When we add the L 2,1 -GRMF method to make more accurate predictions, these inaccurate values will reduce the prediction accuracy.

Interaction prediction under CVt
We can see in Table 3 that under most datasets, the AUPR value of CVt is generally higher than the AUPR value of CVd. This shows that hiding the interactions of the target can still get a better prediction result. But hiding the drug interactions and the prediction result will be greatly reduced. And standard deviations are given in parentheses. It is worth noting that in most datasets, the CMF method has lower AUPR values than any other method, and its AUPR value is far less than our method, especially in the NR dataset.

Discussion
Among the NR, GPCR and IC datasets, the superior methods are the L 2,1 -GRMF method using the preprocessing steps, and our improved method has some improvement on all three datasets. Figures 7,8,9 and 10 show the PR curves on the CVt side of each method on the NR, GPCR, IC and E datasets, respectively. On the E dataset, it is still the best GRMF method. We can also see that some instances are ignored after using the weight matrix, whereas the GRMF method does not use the weight matrix W. Therefore, based on the previous conclusions, the information of the target is more important than the information of the drug. Therefore, using the GRMF method, the AUPR value is higher than the AUPR value using WGRMF.
On most datasets, the L 2,1 -norm does play a key role in predicting the results. The L 2,1 -norm can provide a sparse solution for the final result. Compared with the CMF method, the L 2,1 -norm also promotes the final convergence. Therefore, the overall performance of the L 2,1 -GRMF method and L 2,1 -WGRMF is superior to other methods.

Case study
In this section, we conduct a simulation experiment. First, we erase some of the known drug targets in the original dataset. That is, those elements that are originally 1 in the original matrix become 0. This process is performed randomly by the computer. In the second step, we perform the experiment. We examine the results of the experiment and see if the erased condition is successfully predicted. The experimental procedure we implement is that in the NR dataset, ten drugs with the interaction of the target estrogen receptor alpha (KEGG ID: hsa2099) are removed. This target is the main cause of breast cancer. After the experiment is done, we count the experimental results. We predict five of the hidden interactions. At the same time, we also predict a portion of new drugs and take the most reliable top five new drugs stated in Table 4. Among them, the sixth drug Testosterone is the drug with the highest correlation with this target.
In IC dataset, for the drug Diazoxide (KEGG ID: D00294), a blood pressure lowering drug. We also use a similar approach. Before using the L 2,1 -GRMF method, we eliminate twenty of them in the matrix Y. Because the GPCR dataset is larger than the NR dataset and there are many targets associate with this drug, we have removed twenty interactions here. After conducting simulation experiments, we successfully predicted twelve known targets and eight new targets. We then list the top twenty targets in Table 5. The first 12 are known targets and the remaining part is our prediction of a new target.
For these two cases, the similarity of the estrogen receptor alpha to its nearest neighbor target is less than 0.02 in the matrix S t . In the matrix S d , the similarity of Diazoxide to its nearest neighbor is 0.3, which is also Fig. 3 PR curves for different methods are plotted together, providing a visual comparison between their prediction performances. The PR curves on the CVd side of each method on the NR dataset. a WKNKN is not used, the PR curves for each method. b WKNKN is used, the PR curves for each method Fig. 4 PR curves for different methods are plotted together, providing a visual comparison between their prediction performances. The PR curves on the CVd side of each method on the GPCR dataset. a WKNKN is not used, the PR curves for each method. b WKNKN is used, the PR curves for each method quite low. Therefore, we are more difficult to make predictions. Thus, this shows that our proposed L 2,1 -GRMF method is excellent and reliable results can be obtained when predicting some challenging drugs and targets. Of course, there are still some limitations to the two methods proposed. If we add a weight matrix, the time required for the experiment will multiply. Compared with other methods, our time complexity is relatively high. In addition, the method does not predict new drugs and new targets without any interaction.

Conclusions
In this paper, we propose two improved matrix decomposition methods, L 2,1 -GRMF and L 2,1 -WGRMF. Both methods are used to predict drug-target interactions. We use cross-validation to calculate AUPR values and predict on the drug side (CVd) and the target side (CVt), respectively. We compare them with the most advanced matrix factorization methods currently available. In most cases, our improved methods can provide the best results, which means that the predictive performance is improved with the use of the L 2,1 -norm.
WKNKN preprocessing steps are used to help the experimental results. In addition, it can also be used as an independent method to predict the interactions of drug-target. Considering that the dimensions of the data are relatively small, so the drug-target interactions contained in each dataset are also limited. And our approach applies to these datasets.
In the future, we expect more and more known interaction of drug targets will be found, providing more valuable datasets for our prediction. We will explore more effective prediction methods to solve drug-target interaction problems. For example, we can use matrix factorization of hyper-graph method to improve the reliability of predictive interactions.

CMF
Co-matrix factorization is an effective method to predict the interactions of drug-target [15]. The objective function of CMF method is where W represents a weight matrix, W ij = 1 when Y ij is known, W ij = 0 otherwise. Obviously, the last two items of the objective function are regularization terms. We use L to represent the objection function in Eq. (1), a i represents the i-th vector of A, and b j represents the j-th vector of B. Two update rules are used to solve ∂L/∂a = 0 and ∂L/∂b = 0. Finally, the two update rules are executed using least square until convergence: In summary, after the potential feature matrices A and B are updated, the predicted score matrix can be obtained by multiplying A and B. This predicted score matrix can be used to predict new drug-target interactions by comparing with the original drug-target interactions matrix Y.

GRMF
In the GRMF method, the benefits of regularization items is that it can avoid over-fitting [20]. The objective function of GRMF is as follows: Then, matrix A and B are initialized. The SVD (singular value decomposition) method is used to decompose matrix Y ∈ R n × m into U ∈ R n × k , S k ∈ R k × k , and V ∈ R n × k . In matrix Y, the largest possible k . Next, the least square method is used to update A and B. This objective function in Eq. (4) can be replaced by L. These two update rules are used to solve ∂L/∂a = 0 and ∂L/∂b = 0. Finally, the two update rules are executed by using least square until convergence.

WGRMF
Like CMF, the weight matrix W in WGRMF is the same as W in CMF. Behind the weight matrix, either to prevent unknown interactions, the purpose is to help find the latent feature matrix A and B. The objective function of WGRMF method is as follows This objective function in Eq. (5) can be replaced by L, where a i represents the i-th vector of A, and b j represents the j-th vector of B. These two update rules are used to solve ∂L/∂a = 0 and ∂L/∂b = 0. Finally, the two update rules are executed by using least square until convergence. However, it is worth noting that the update rules here are not the same as the update rules in GRMF. In GRMF, the rules are matrix updates, but in WGRMF the rules are row updates. Fig. 8 PR curves for different methods are plotted together, providing a visual comparison between their prediction performances. The PR curves on the CVt side of each method on the GPCR dataset. a WKNKN is not used, the PR curves for each method. b WKNKN is used, the PR curves for each method Fig. 9 PR curves for different methods are plotted together, providing a visual comparison between their prediction performances. The PR curves on the CVt side of each method on the IC dataset. a WKNKN is not used, the PR curves for each method. b WKNKN is used, the PR curves for each method

Our proposed methods
Here, our improved approach is used to solve the prediction of drug-target interactions problem. WKNKN (weighted K nearest known neighbors) [20] as a preprocessing step is used to solve unknown missing value problems. Two methods are proposed, Graph Regularization Matrix factorization based on L 2,1 -norm, and a variant called L 2,1 -WGRMF, both of which are used to predict drug-target interactions. Figure 11 shows a flow chart of the proposed method. L 2,1 -GRMF Sparsification of the drug similarity matrix and target similarity matrix Graph regularization terms are used to fully consider the internal structure of the similarity matrix S d and S t . In addition, the graph regularization terms can keep the internal structure of the matrices unchanged. We derive a p-nearest neighbor graph from each drug and target similarity matrix [24] S d and S t in this work. Therefore, given a drug similarity matrix S d , a p-nearest neighbor graph [25] N can be generated as where N is used to sparsify the matrix S d , which can be written as This result is for a sparse drug similarity matrix. Similarly, the target similarity matrix S t can be obtained in the same way. We use the Euclidean distance to calculate the nearest neighbor. In general, Euclidean distance will give better results because it represents the true distance.
Graph regularization helps to facilitate the study the manifold from learning drugs and target spaces. In the original space, there are points that are close to each other, and when the manifold learning is performed, the points are also close to each other in learning.

Low-rank approximation
The idea of low rank approximation (LRA) is applied to GRMF [26]. It decomposes the target matrix Y into two low-rank latent feature matrices A and B, i.e., Y ≈ AB T [27]. And the objective function of GRMF can be written as the following optimization problem: where ‖⋅‖ F is Frobenius norm. In addition, the number of potential features of A and B is represented by k.

Regularization
In general, the Tikhonov and graph regularization terms can be used to avoid over-fitting and enhance generalization capability. Here is the objective function of L 2,1 -GRMF: where λ l , λ d and λ t are positive parameters, a i is the i-th rows of A, and b j is the j-th rows of B, n is the number of drugs, and m is the number of targets. The first term is an approximate model of the matrix Y. The second term is the Tikhonov regularization. Its main purpose is to minimize the norms of A, B. The third term is the L 2,1 -norm applied on B to increase the target matrix sparsity and discard unwanted target pairs. Considering that we are more concerned with certain drugs, we use the L 2,1 -norm to sparse the potential feature matrix of the target, so that we can better predict new drugs. However, while the L 2,1 -norm is added to A, some of the more important drugs may be lost. The last two terms are graph regularization of drugs and targets, respectively. Moreover, the drug-target model can be rewritten as: where Tr(⋅) is the trace of the matrix, Please refer to [28] for more details on rewriting graph regularization. We know that the known normalized Laplacian is better than unknown, so we replace L d The function can be written as: We use the minimization of the objective function to predict the outcome of the interactions, but this could lead to unsatisfactory results. Because there are many zeros that have not been found. Therefore, we use WKNKN pre-processing method to solve this problem.

Initialization of A and B
For the input matrix Y, SVD (Singular Value Decomposition) method is used to obtain the initial value of matrix A and matrix B: Among them, S k is a diagonal matrix and contains the k largest singular values. In matrix Y, the number of singular values is k max = min(n, m). According to the SVD method, k max is the maximum possible number.

Optimization algorithm
In this paper, we can update A and B by using the least square method. Let the partial derivative of A be equal to 0, the partial derivative of B be equal to 0, the objective function in Eq. (11) can be replaced by L, that is, ∂L/∂A = 0 and ∂L/∂B = 0. The two update rules are executed by using least square until convergence. When we perform the L 2,1 -GRMF method, λ l , λ d and λ t are determined by the cross-validation on the training set to the optimal parameter values. We use grid search, λ l ∈ {2 −2 , 2 −1 , 2 0 , 2 1 }. Then we choose the optimal parameters from this set. Derivation process is as follows: where D is a diagonal matrix with the i-th diagonal element as d ii = 1/2‖(B) i ‖ 2 . The specific algorithm of L 2,1 -GRMF is as follows: L 2,1 -WGRMF A variant of L 2,1 -GRMF, called L 2,1 -WGRMF, is obtained here by adding a weight matrix W to the L 2,1 -GRMF. The advantage is that it helps to determine the latent feature matrices A and B of the drug-target matrix Y. So, we write the objective function that contains W as follows: Let objective function be set to F such that ∂F/∂a i = 0 and ∂F/∂b j = 0. The update rules are used to obtain A and B until convergence