MCCMF: collaborative matrix factorization based on matrix completion for predicting miRNA-disease associations

Background: MicroRNAs (miRNAs) are non-coding RNAs with regulatory functions. Many studies have shown that miRNAs are closely associated with human diseases. Among the methods for exploring the relationship between miRNAs and diseases, traditional methods are time-consuming and their accuracy needs to be improved. In view of the shortcomings of previous models, a method called collaborative matrix factorization based on matrix completion (MCCMF) is proposed to predict unknown miRNA-disease associations. Results: The complete matrix of the miRNA and the disease is obtained by matrix completion. Moreover, the Gaussian Interaction Profile kernel is added to the miRNA functional similarity matrix and the disease semantic similarity matrix. Then the Weight K Nearest Known Neighbors method is used to pretreat the association matrix, so that the model is closer to reality. Finally, the collaborative matrix factorization method is applied to obtain the prediction results. MCCMF obtains a satisfactory result in fivefold cross validation, with an AUC of 0.9569 (0.0005). Conclusions: The AUC value of MCCMF is higher than that of other advanced methods in the fivefold cross validation experiment. In order to comprehensively evaluate the performance of MCCMF, accuracy, precision, recall and f-measure are also reported. The final experimental results demonstrate that MCCMF outperforms the other methods in predicting miRNA-disease associations. In the end, the effectiveness and practicability of MCCMF are further verified by case studies of three specific diseases.

prevalence of these genes were revealed in recent years. To date, 38,589 miRNAs have been found in animals, plants and viruses [3]. At the same time, miRNAs were discovered to play an important role in cell proliferation [4], differentiation [5], senescence [6], apoptosis [7], and so on. A study indicated that more than one third of human genes are regulated by miRNAs [8]. Obviously, miRNA disorder could have severe impacts on humans.
Evidence shows that an increasing number of miRNAs are closely associated with diseases [9]. Since the first discovery of miR15 and miR16 deficiency in B cell chronic lymphocytic leukemia (B-CLL) [10], new miRNA-disease associations have been reported frequently. For example, the expression of miR-25 and miR-223 is significantly higher in patients with esophageal squamous cell carcinoma than in healthy people, while the expression of miR-375 is significantly lower [11]. Studies show that miR-26a may be a regulatory factor that inhibits the progression and metastasis of c-Myc/EZH2 double high advanced hepatocellular carcinoma (HCC) [12]. In addition, miR-340 has been suggested as a biomarker for cancer metastasis and prognosis [13]. At present, the research on miRNAs and diseases is becoming more extensive. Researchers have also developed a number of databases to store miRNA and disease data, such as dbDEMC [14], HMDD v3.0 [15] and miR2Disease [16]. Unfortunately, the known association data are incomplete. Moreover, traditional methods to identify new miRNA-disease associations are time-consuming and laborious.
With the improvement of information technology and the development of a large number of miRNA data sets, many effective methods for predicting miRNA-disease associations have been proposed [17]. Based on the hypothesis that functionally similar miRNAs may be associated with diseases with similar phenotypes [18], Jiang et al. [19] first constructed a genetic data network, and then prioritized disease-related miRNAs to predict miRNA-disease associations. However, due to the limited association information, this method is not very effective. Li et al. developed a computational framework that measures the association between a cancer and a miRNA based on the functional consistency score (FCS) of miRNA target genes and cancer-related genes. This method has a significant advantage in identifying cancer-related miRNAs [20]. Based on heterogeneous omics data, Xiao et al. [21] identified potential miRNA-disease associations using Graph Regularized Non-negative Matrix Factorization (GRNMF). However, the prediction results of the GRNMF method may not be optimal in some cases. Chen et al. [22] proposed a computational model of Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction (MDHGI) to discover new miRNA-disease associations. The model applies matrix decomposition before constructing the heterogeneous network, thus improving the prediction accuracy. The protein-driven inference of miRNA-disease associations (miRPD) was proposed by Mørk et al. [23] and can infer miRNA-protein-disease associations. At the same time, they provide scoring schemes that can create association sets of high and medium confidence.
Three new miRNA-disease association prediction methods based on global network similarity measures were developed by Chen et al. [24], namely MBSI (microRNA-based similarity inference), PBSI (phenotype-based similarity inference) and NetCBI (network-consistency-based inference). NetCBI is especially suitable for predicting target diseases, but it relies heavily on the network similarity measure. Similarly, Gao et al. [25] put forward a method based on Double Network Sparse Graph Regularized Matrix Factorization (DNSGRMF), and added the $L_{2,1}$-norm and the Gaussian interaction profile (GIP) kernel to improve the prediction ability. In addition, considering the nearest neighbor information of the miRNA and the disease, Gao et al. [26] introduced a method of Nearest Profile-based Collaborative Matrix Factorization (NPCMF) to predict miRNA-disease associations. One of the most obvious disadvantages of NPCMF is that it introduces too much nearest profile (NP) information, which adds extra noise and may reduce the prediction accuracy. In order to protect the known associations, the Logistic Weighted Profile-based Collaborative Matrix Factorization (LWPCMF) method was proposed by Yin et al. [27], which effectively predicts miRNA-disease associations. The prediction performance of this method is promising. Chen et al. [28] constructed a model based on Canonical Correlation Analysis (CCA), which can reveal the possible molecular causes of miRNA-disease associations. However, it is difficult to compare the performance of this method directly with that of other methods.
In recent years, machine learning-based miRNA-disease association prediction methods have also become popular. A support vector machine (SVM) classifier was developed by Xu et al. [29] to extract features from the miRNA-disease network and miRNA expression levels. Yet, the construction of the miRNA target-dysregulated network (MTDN) is complex, so only direct miRNA target regulation can be predicted. Chen et al. [30] used random walk to prioritize disease-related miRNAs to predict potential human miRNA-disease associations. Like the method of Jiang et al., their approach is also affected by the limited number of known disease-miRNA associations. A model of Restricted Boltzmann machine for multiple types of miRNA-disease association prediction (RBMMMDA) was established by Chen et al. [31]. Chen et al. [32] constructed a computational model called Laplacian Regularized Sparse Subspace Learning for MiRNA-Disease Association prediction (LRSSLMDA). The model has stronger dimensionality reduction capability and can be easily extended to higher dimensional data sets. A new Induction Matrix Completion model for MiRNA-Disease Association prediction (IMCMDA) was constructed by Chen et al. [33]. Because it is a semi-supervised model, only positive samples and unlabeled samples are needed, which greatly reduces the difficulty of modeling. Soon after, Chen et al. [34] proposed a new Bipartite Network Projection model for MiRNA-Disease Association prediction (BNPMDA). Compared with previous models, the prediction accuracy of BNPMDA is improved. A new miRNA-disease association prediction algorithm based on the decision tree was proposed by Chen et al. [35]. This method constructs a computing framework for ensemble learning and dimension reduction. By training and integrating multiple base classifiers, it reduces prediction bias and improves prediction performance.
Ding et al. [36] used an improved calculation method based on inductive matrix completion to predict miRNA-disease associations (IIMCMP). Experiments show that IIMCMP can achieve powerful and reliable performance. Li et al. [37] developed a method of neural inductive matrix completion with graph convolutional network (NIMCGCN) for the prediction of miRNA-disease associations. To test the predictive power of NIMCGCN in the absence of any known miRNAs, they studied breast cancer and reported an accuracy of 100%. The above methods have made great contributions to predicting miRNA-disease associations.
Given the shortcomings of the above methods, a novel method for predicting miRNA-disease associations with Collaborative Matrix Factorization based on Matrix Completion (MCCMF) is proposed in this paper. Firstly, the human miRNA-disease association matrix, the miRNA functional similarity matrix and the disease semantic similarity matrix are obtained from HMDD v2.0, but the obtained matrices are sparse. Therefore, the matrix completion method is used to complete them. The matrix completion algorithm is mainly developed on the basis of the Augmented Lagrange Multiplier method (ALM) [38], the Alternating Direction Method (ADM) [39] and the Singular Value Threshold (SVT) operation [40]. Secondly, we integrate the completed matrices with the GIP kernel similarity matrices of the disease and the miRNA. At the same time, the miRNA-disease association matrix is preprocessed by the Weight K Nearest Known Neighbors (WKNKN) method to solve the problem of unknown missing values [41]. Finally, collaborative matrix factorization is used to predict associations between miRNAs and diseases. In the experiment, fivefold cross validation is performed on MCCMF, and the results show that our method is superior to the other four methods. In addition, we focus on the cases of Gastrointestinal Neoplasms, Retinoblastoma and Hepatoblastoma. Our method not only successfully recovers the known miRNA-disease associations, but also finds many unknown associations. To sum up, MCCMF can avoid the inherent noise of the data set, with high speed and high prediction accuracy.

Performance evaluation
In this section, the AUC value, accuracy, precision, recall and f-measure are used to evaluate the performance of the MCCMF method. Initially, we implement fivefold cross validation to objectively evaluate the predictive power of our method. The existing miRNA-disease associations are randomly divided into five groups, among which four groups are used as the training set and the remaining one as the test set. In addition, to demonstrate the predictive capability of our method under more difficult conditions, miRNA-disease associations are randomly deleted (i.e. the Cross Validation pairs' mode) before the cross validation is performed [42]. Fivefold cross validation is repeated 10 times to prevent the grouping from causing bias, and the average of the 10 runs is used as the final evaluation result.
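The grouping step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `five_fold_splits`, the seeds and the toy index pairs are all assumptions introduced here.

```python
import random

def five_fold_splits(known_pairs, n_folds=5, seed=0):
    """Yield (train, test) lists of known (miRNA, disease) index pairs."""
    pairs = list(known_pairs)
    random.Random(seed).shuffle(pairs)            # random grouping of known associations
    folds = [pairs[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        # four groups form the training set, the remaining one is the test set
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, folds[i]

# Toy known associations (hypothetical indices, not real data).
known = [(m, d) for m in range(5) for d in range(4) if (m + d) % 2 == 0]

# Repeat the whole procedure with different seeds, mirroring the 10 repetitions.
repeats = [list(five_fold_splits(known, seed=s)) for s in range(10)]
```

In each fold, the test pairs would be set to 0 in the association matrix before training, and the scores from the 10 repetitions would then be averaged.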
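The ROC curve is drawn to represent the prediction performance intuitively, and the AUC value is calculated to evaluate MCCMF quantitatively. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), which can be expressed as:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

where TP is the number of samples that are actually positive and are also predicted to be positive, and TN is the number of samples that are actually negative and are also predicted to be negative. FN and FP are the numbers of samples for which the prediction is inconsistent with the actual label: FN counts positive samples predicted to be negative, and FP counts negative samples predicted to be positive.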
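In order to make the performance evaluation more comprehensive, we also use other evaluation indicators, including the accuracy, precision, recall and f-measure. Their calculation formulas are defined as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$$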
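As a worked example of these indicators, the sketch below computes them for a handful of toy scores. The labels, scores and the 0.5 threshold are illustrative assumptions. The AUC is computed through its rank interpretation (the probability that a positive sample is scored above a negative one), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_score, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, y_score):
        pred = s >= threshold
        if t and pred:
            tp += 1
        elif t and not pred:
            fn += 1
        elif not t and pred:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def indicators(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure

def auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, y_score) if t]
    neg = [s for t, s in zip(y_true, y_score) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted scores (not real data).
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
acc, prec, rec, f1 = indicators(*confusion_counts(y_true, y_score, 0.5))
auc_value = auc(y_true, y_score)
```

For these toy values, one negative (score 0.7) is above the threshold, so precision is 3/4 while recall is 1, and 8 of the 9 positive-negative pairs are ordered correctly.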

Comparison with other methods
The AUC value is generally between 0 and 1. The higher the AUC value is, the better the prediction result will be. MCCMF finally obtains an AUC value of 0.9569 in the fivefold cross validation. MCCMF is compared with four advanced methods, namely WBNPMD [43], RLSMDA [44], GRNMF [21] and CMF [45], and the comparison demonstrates the superior performance of our method. The ROC curves are drawn in Fig. 1, and the comparison results are listed in Table 1. The results of the other methods in Table 1 are obtained directly from the literature.
In Table 1, the highest value is highlighted in italic, with the standard deviation in parentheses. In the fivefold cross validation experiment, WBNPMD, RLSMDA, GRNMF, CMF and MCCMF obtain AUCs of 0.9173, 0.8389, 0.869, 0.8697 and 0.9569, respectively. Therefore, our method is superior to the other four methods.
WBNPMD, which has the highest AUC value among the four compared methods, is selected for further comparison with MCCMF, and accuracy, precision, recall and f-measure are presented as a bar graph in Fig. 2. On these indicators, MCCMF also performs better than WBNPMD.

Case studies
In the end, we carry out a simulation experiment to analyze specific diseases. First of all, the disease we want to explore is selected and the predicted scores of all miRNAs for this disease are ranked. Then, based on the ranked scores, the miRNAs with a high degree of association with the disease are identified. Next, by comparing with the original miRNA-disease association matrix, we determine whether each association with a high predicted score is already known. Finally, the unknown associations are verified by searching existing data sets. Here, we choose three diseases, Gastrointestinal Neoplasms, Retinoblastoma and Hepatoblastoma, for analysis. In addition, three popular data sets, dbDEMC [14], HMDD v3.0 [15] and miRCancer [46], are used for validation. These data sets store miRNA-disease associations that have been experimentally confirmed by researchers over the years.
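The ranking procedure for one disease can be sketched as below. The function name `rank_for_disease`, the placeholder miRNA identifiers and the scores are hypothetical and only illustrate the steps; they are not taken from the paper's tables.

```python
def rank_for_disease(scores, known_flags, mirna_ids, top_n):
    """Rank miRNAs for one disease by predicted score and flag known associations."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(mirna_ids[i], scores[i], bool(known_flags[i])) for i in order[:top_n]]

# Hypothetical predicted scores for one disease (one column of the prediction matrix)
# and the corresponding column of the original association matrix.
scores = [0.12, 0.91, 0.47, 0.78]
known_flags = [0, 1, 0, 0]
mirna_ids = ["miRNA-a", "miRNA-b", "miRNA-c", "miRNA-d"]

top = rank_for_disease(scores, known_flags, mirna_ids, top_n=3)
# Candidates that are not in the original matrix still need database validation.
new_candidates = [name for name, _, is_known in top if not is_known]
```

The entries in `new_candidates` correspond to the predictions that would then be looked up in dbDEMC, HMDD v3.0 and miRCancer.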
Gastrointestinal Neoplasms is a very common gastrointestinal disease with a high incidence. However, there are no obvious symptoms in the early growth stage of the neoplasms, which makes the disease very dangerous. We successfully predict 31 known associations and 9 new associations, 7 of which are confirmed by HMDD v3.0 and miRCancer. For example, Tazawa et al. [47] discovered the potential role of oncogenic miR-21 in Gastrointestinal Neoplasms. Other confirmed miRNAs have been reported in the relevant data sets, and they are not listed here. There are still two unconfirmed ones that need further research. Table 2 describes the simulation results, where known associations are shown in italic, confirmed new predictions are labeled with the corresponding database, and unconfirmed ones are shown as "unconfirmed". The predicted scores in Table 2 are ranked according to the strength of the association between the miRNA and the disease. There is a threshold to determine whether the prediction is accurate. Compared with known information and other databases, the prediction results of our method are generally accurate. Although two predictions remain unconfirmed, they could provide some insights for researchers.
Retinoblastoma is a malignant tumor that occurs in children under 3 years old, and has a familial predisposition. There are 38 known associations between the disease and miRNAs in the known association data set, and 37 of them are successfully predicted by our method. At the same time, 23 new associations are predicted, seven of which are confirmed and the others are unconfirmed. Montoya et al. [48] found that the expression of miR-31 in Retinoblastoma is significantly reduced, which promotes the development of targeted therapy for Retinoblastoma. Table 3 shows the detailed results. The predictions in Table 3 are sorted in the same way as in Table 2.
Hepatoblastoma is the most common intra-abdominal malignant tumor after neuroblastoma and nephroblastoma in childhood. In the existing miRNA-disease association data set, there are 8 known miRNA-disease associations, and all of them have been predicted. Besides, we predicted 12 new associations, seven of which are confirmed and 5 are not. We also find literature confirming that miR-143 is a factor affecting Hepatoblastoma. The study of Zhang et al. [49] showed that blocking miR-143 could significantly inhibit local liver metastasis. Hepatoblastoma prediction results are shown in Table 4. The predictions in Table 4 are sorted in the same way as in Tables 2 and 3.
As can be seen from the simulation results above, most known miRNAs are successfully predicted, while a number of previously unknown associations are found in the HMDD v3.0, miRCancer and dbDEMC data sets. Although a few have not been confirmed, they can be used as a reference for researchers. In addition, we used Cytoscape software to map the prediction network of these three diseases (Fig. 3). In the network, the ellipses represent miRNAs, and the remaining shapes represent diseases. Associations are drawn as line segments with arrows, and some miRNAs are shared between diseases. The color depth of each ellipse is set according to the predicted score: the darker the ellipse, the stronger the association between the miRNA and the disease.

Discussion
The above experimental results indicate that our method outperforms the compared state-of-the-art methods. The excellent prediction performance of MCCMF can be attributed to several significant factors. Firstly, the data is preprocessed by the Weight K Nearest Known Neighbors method and the matrix completion method to improve the prediction accuracy. Secondly, a collaborative matrix factorization model is applied to predicting miRNA-disease associations, which is a promising one among many collaborative filtering technologies. In bioinformatics, matrix factorization contributes to identifying hidden links among genes. However, the performance of our method needs to be further improved. For instance, there may be a better way to integrate the data than simply adding them together. In the future, we will improve the technology to use the latest version of the data set, such as HMDD v3.0.

Fig. 3 The association network between disease and miRNA

Conclusions
In this paper, a collaborative matrix factorization method based on matrix completion (MCCMF) is developed for predicting miRNA-disease associations. Considering that the similarity matrices of miRNAs and diseases are sparse and incomplete, we use the matrix completion method to complete them. Then the completed matrices are integrated with the GIP kernel similarity to enrich the data information and reduce the influence of noise. In addition, WKNKN is introduced to pretreat the existing association matrix of miRNAs and diseases, so that our method is better suited to practical problems. Finally, the idea of CMF is adopted to construct the objective function and obtain the predicted results. The AUC value (0.9569) of MCCMF is higher than that of other advanced methods in the fivefold cross validation experiment. In order to comprehensively evaluate the performance of MCCMF, accuracy, precision, recall and f-measure are also measured, and the results are 0.992, 0.779, 0.918 and 0.830, respectively. Compared with the other four methods, our method has the best performance. The analysis of Gastrointestinal Neoplasms, Retinoblastoma and Hepatoblastoma further verified the effectiveness of MCCMF. Since most associations are unknown in reality, MCCMF can also be used for prediction in this situation.

Methods
We develop a novel method for predicting miRNA-disease associations with MCCMF. MCCMF is divided into four main steps: Firstly, we use the matrix completion algorithm to complete the miRNA similarity matrix and the disease similarity matrix to generate a new completion similarity matrix. Secondly, the new completion similarity matrix is integrated with existing miRNA and disease similarity information. Thirdly, the WKNKN is used to convert the binary values of the miRNA-disease association matrix into the interaction likelihood values [41]. Finally, the Collaborative Matrix Factorization is used to predict the association of miRNA-disease. Figure 4 shows the complete process for MCCMF.
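The third step, WKNKN preprocessing, is not detailed in this text. The sketch below follows the general WKNKN idea of estimating each profile from its K most similar neighbors that have at least one known association, with weights decaying by a factor `eta`. The values `K=2` and `eta=0.7`, the exclusion of a row from its own neighbor list, and all toy matrices are assumptions made for illustration, not settings taken from the paper.

```python
import numpy as np

def _estimate_rows(Y, S, K, eta):
    """Estimate each row of Y from its K nearest rows that have known associations."""
    known = np.flatnonzero(Y.sum(axis=1) > 0)
    Y_est = np.zeros_like(Y, dtype=float)
    for i in range(Y.shape[0]):
        cand = known[known != i]                       # known neighbors, excluding i
        nearest = cand[np.argsort(-S[i, cand])][:K]    # most similar first
        sims = S[i, nearest]
        if nearest.size == 0 or sims.sum() <= 0:
            continue
        weights = eta ** np.arange(nearest.size) * sims  # decay with neighbor rank
        Y_est[i] = weights @ Y[nearest] / sims.sum()
    return Y_est

def wknkn(Y, S_row, S_col, K=2, eta=0.7):
    """Turn a binary association matrix into interaction likelihood values."""
    Y = np.asarray(Y, dtype=float)
    Y_m = _estimate_rows(Y, S_row, K, eta)             # miRNA-side estimate
    Y_d = _estimate_rows(Y.T, S_col, K, eta).T         # disease-side estimate
    return np.maximum(Y, (Y_m + Y_d) / 2)              # known 1s are kept

# Toy data: 4 miRNAs (rows) x 3 diseases (columns); similarities are made up.
Y = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0], [0, 0, 1]])
S_row = np.array([[1, .8, .5, .1], [.8, 1, .6, .2], [.5, .6, 1, .3], [.1, .2, .3, 1]])
S_col = np.array([[1, .7, .2], [.7, 1, .4], [.2, .4, 1]])
PMD = wknkn(Y, S_row, S_col)
```

In this toy case the third miRNA has no known association, yet its row in `PMD` receives non-zero likelihoods borrowed from its most similar neighbors.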

Human miRNA-disease associations
The initial miRNA-disease association data is downloaded from HMDD v2.0 [50]. HMDD v2.0 is a database of experimentally supported human miRNA-disease associations, and it stores 5430 experimentally verified miRNA-disease associations between 495 miRNAs and 383 diseases. In this paper, the adjacency matrix $MD$ is used to represent the miRNA-disease association network. The adjacency matrix $MD$ is a sparse matrix composed of 0 and 1. If $MD(m_i, d_j)$ is 1, disease $d_j$ is associated with miRNA $m_i$; otherwise, the entry is 0 and the pair is treated as unassociated.
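To make the sparsity concrete, the short sketch below computes the fraction of non-zero entries from the counts quoted above and shows how such an adjacency matrix can be built from index pairs. The helper `build_adjacency` and the toy pairs are illustrative assumptions.

```python
n_mirnas, n_diseases, n_known = 495, 383, 5430

# Fraction of entries of MD that equal 1.
density = n_known / (n_mirnas * n_diseases)

def build_adjacency(pairs, n_rows, n_cols):
    """Build a 0/1 matrix with rows as miRNAs and columns as diseases."""
    MD = [[0] * n_cols for _ in range(n_rows)]
    for i, j in pairs:
        MD[i][j] = 1
    return MD

# Toy example with hypothetical (miRNA index, disease index) pairs.
toy_MD = build_adjacency([(0, 1), (2, 0)], n_rows=3, n_cols=2)
```

With these counts, fewer than 3% of the 189,585 entries are 1, which is why the later preprocessing steps matter.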

MiRNA function similarity
According to the hypothesis that functionally similar miRNAs are more likely to be associated with phenotypically similar diseases, a method for calculating the functional similarity of miRNAs (MISIM) was proposed by Wang et al. [51]. Firstly, we need to define the semantic similarity between one disease and one group of diseases. The calculation formula is as follows:

$$S(d, D) = \max_{1 \le i \le k} S(d, d_i)$$

Here $d$ represents one disease and $D = \{d_1, d_2, \ldots, d_k\}$ represents one disease group. That is, the similarity of $d$ and $D$, $S(d, D)$, is defined as the maximum similarity between $d$ and the diseases in $D$.
The functional similarity of two miRNAs is defined as

$$MISIM(M_1, M_2) = \frac{\sum_{1 \le i \le m} S(d_{1i}, D_2) + \sum_{1 \le j \le n} S(d_{2j}, D_1)}{m + n}$$

where $M_1$ and $M_2$ represent the miRNAs related to the disease groups $D_1$ and $D_2$, respectively. $D_1$ contains $m$ diseases $d_{1i}$, and $D_2$ contains $n$ diseases $d_{2j}$.
In this paper, we download the miRNA functional similarity from https://www.cuilab.cn/files/images/cuilab/misim.zip. The matrix $MF$ is used to represent the functional similarity network of the miRNAs, in which the element $MF(i, j)$ represents the similarity between miRNA $m_i$ and miRNA $m_j$. The self-similarity of each miRNA is 1, so the diagonal elements of the matrix $MF$ are 1.
Due to incomplete miRNA data supported by the experiment, the similarity values calculated by MISIM may be biased. Some subsequent treatment of the matrix may be improved [52].
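The MISIM calculation described in this section can be illustrated with a small sketch: each disease of one miRNA is matched to its most similar disease in the other miRNA's group, and the best-match scores are averaged. The disease labels and similarity values below are made up for illustration, and the function names are assumptions.

```python
def disease_to_group(d, group, ss):
    """S(d, D): the maximum similarity between disease d and any disease in group D."""
    return max(ss[d][g] for g in group)

def misim(D1, D2, ss):
    """Functional similarity of two miRNAs from their associated disease groups."""
    total = (sum(disease_to_group(d, D2, ss) for d in D1)
             + sum(disease_to_group(d, D1, ss) for d in D2))
    return total / (len(D1) + len(D2))

# Hypothetical symmetric disease semantic similarities.
ss = {
    "a": {"a": 1.0, "b": 0.5, "c": 0.2},
    "b": {"a": 0.5, "b": 1.0, "c": 0.4},
    "c": {"a": 0.2, "b": 0.4, "c": 1.0},
}
D1 = ["a", "b"]   # diseases associated with the first miRNA
D2 = ["b", "c"]   # diseases associated with the second miRNA
value = misim(D1, D2, ss)
```

Here the best matches are 0.5 and 1.0 for the first group and 1.0 and 0.4 for the second, so the similarity is 2.9 / 4 = 0.725. A miRNA compared with itself gets 1, which matches the unit diagonal of $MF$.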

Disease semantic similarity
The relationship between different diseases is obtained from the MeSH database (https://www.ncbi.nlm.nih.gov/). Based on the previous literature [51], we represent the disease $D$ as a Directed Acyclic Graph, $DAG(D) = (D, T(D), E(D))$, where $T(D)$ is the set containing node $D$ and its ancestor nodes, and $E(D)$ is the set of edges leading from the ancestor nodes to node $D$. For an ancestor node $t$ in $DAG(A)$, its contribution to the semantic value of disease $A$ is computed as follows:

$$D1_A(t) = \begin{cases} 1, & t = A \\ \max\{\Delta \cdot D1_A(t') \mid t' \in \text{children of } t\}, & t \neq A \end{cases}$$

In the above formula, $\Delta$ is a semantic contribution factor. Based on the method of Wang et al., the value of $\Delta$ is set to 0.5. For the disease $A$, its own contribution to its semantic value is 1, while the contribution of an ancestor node $t$ decreases as its distance from $A$ in the graph increases.
Based on the contribution of ancestor diseases and disease $A$ itself, the semantic value of disease $A$ can be expressed as follows:

$$DV1(A) = \sum_{t \in T(A)} D1_A(t)$$

According to the hypothesis that the larger the part of their DAGs two diseases share, the higher their similarity, the semantic similarity between disease $A$ and disease $B$ is calculated as:

$$SS1(A, B) = \frac{\sum_{t \in T(A) \cap T(B)} \left(D1_A(t) + D1_B(t)\right)}{DV1(A) + DV1(B)}$$

However, the above model has a shortcoming: with the fixed semantic contribution factor $\Delta$, all diseases in the same layer make the same semantic contribution. Obviously, different diseases appear in different numbers of DAGs, and a disease that appears in many DAGs should contribute less than one that appears in few. To improve the above model, we combine the method of Xuan et al. [53] to define the semantic similarity calculation method. In this method, the contribution of ancestor node $t$ in $DAG(A)$ to the semantic value of disease $A$ is as follows:

$$D2_A(t) = -\log\left(\frac{\text{the number of DAGs including } t}{\text{the number of diseases}}\right)$$

The semantic value of disease $A$, and the semantic similarity between the disease $A$ and the disease $B$ are calculated as:

$$DV2(A) = \sum_{t \in T(A)} D2_A(t)$$

$$SS2(A, B) = \frac{\sum_{t \in T(A) \cap T(B)} \left(D2_A(t) + D2_B(t)\right)}{DV2(A) + DV2(B)}$$

Finally, in order to calculate the semantic similarity more comprehensively and rationally, we combine the two models to get Eq. (15):

$$DS(A, B) = \frac{SS1(A, B) + SS2(A, B)}{2} \tag{15}$$
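The two semantic models can be sketched on a tiny hand-made DAG. The node names, the `parents` dictionary layout and the function names are assumptions for illustration. The final averaging of the two models is also an assumption about how they are combined, since the combining formula is not spelled out in the surrounding prose.

```python
import math
from collections import deque

def contributions_model1(A, parents, delta=0.5):
    """Layer-based contribution of every node in T(A) (A plus its ancestors).

    With 0 < delta < 1 the maximum over paths is delta ** (shortest distance),
    so a breadth-first walk from A towards the root is sufficient.
    """
    contrib = {A: 1.0}
    queue = deque([A])
    while queue:
        node = queue.popleft()
        for p in parents.get(node, ()):
            if p not in contrib:
                contrib[p] = delta * contrib[node]
                queue.append(p)
    return contrib

def contributions_model2(A, parents):
    """Frequency-based contribution: -log(share of disease DAGs containing t)."""
    n = len(parents)
    dag_count = {}
    for d in parents:
        for t in contributions_model1(d, parents):
            dag_count[t] = dag_count.get(t, 0) + 1
    return {t: -math.log(dag_count[t] / n) for t in contributions_model1(A, parents)}

def similarity(ca, cb):
    """Shared contributions divided by the sum of both semantic values."""
    shared = set(ca) & set(cb)
    denom = sum(ca.values()) + sum(cb.values())
    return sum(ca[t] + cb[t] for t in shared) / denom if denom else 0.0

# Toy DAG: R is the root, X and Y are children of R, Z is a child of X.
parents = {"R": [], "X": ["R"], "Y": ["R"], "Z": ["X"]}
ss1 = similarity(contributions_model1("Z", parents), contributions_model1("X", parents))
ss2 = similarity(contributions_model2("Z", parents), contributions_model2("X", parents))
ds = (ss1 + ss2) / 2   # assumed combination: simple average of the two models
```

In this toy graph the root appears in every DAG, so its frequency-based contribution is 0, whereas the layer-based model still credits it with 0.5 or 0.25 depending on its distance.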

Gaussian interaction profile kernel similarity for diseases and miRNAs
On the basis of the hypothesis that functionally similar miRNAs may be associated with similar diseases, and vice versa, the known miRNA-disease association network is used to construct the GIP kernel similarity for diseases and miRNAs [54]. GIP kernel similarity enriches the similarity measure with the topological information of the known associations. The interaction profile of miRNA $m(i)$ is represented by the binary vector $M(i)$ of the i-th column of the adjacency matrix $MD$. Similarly, the binary vector $D(i)$ of the i-th row of the adjacency matrix $MD$ denotes the interaction profile of disease $d(i)$. Hence, we can define the GIP kernel similarity for miRNAs and diseases as follows:

$$KM(m_i, m_j) = \exp\left(-\gamma_m \|M(i) - M(j)\|^2\right)$$

$$KD(d_i, d_j) = \exp\left(-\gamma_d \|D(i) - D(j)\|^2\right)$$

Here, $\gamma_m$ and $\gamma_d$ are parameters that control the kernel bandwidth and are obtained by the following formulas:

$$\gamma_m = \delta_m \Big/ \left(\frac{1}{nm} \sum_{i=1}^{nm} \|M(i)\|^2\right)$$

$$\gamma_d = \delta_d \Big/ \left(\frac{1}{nd} \sum_{i=1}^{nd} \|D(i)\|^2\right)$$

where $\delta_m$ and $\delta_d$ are also bandwidth parameters and they are set to 1 according to the previous study [55]. $nm$ and $nd$ denote the numbers of miRNAs and diseases, respectively.
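A compact sketch of the GIP kernel is given below. It takes one interaction profile per row, so which axis of the association matrix is passed in depends on how the matrix is stored; the toy matrix here is assumed to have one row per miRNA and one column per disease. The function name `gip_kernel` and the toy values are illustrative.

```python
import numpy as np

def gip_kernel(profiles, delta=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    P = np.asarray(profiles, dtype=float)
    sq_norms = (P ** 2).sum(axis=1)
    gamma = delta / sq_norms.mean()            # bandwidth normalised by mean squared norm
    # Squared Euclidean distances between all pairs of profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * P @ P.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Toy association matrix: 4 miRNAs (rows) x 3 diseases (columns).
MD = np.array([[1, 0, 1],
               [1, 0, 0],
               [0, 1, 0],
               [0, 1, 1]])
KM = gip_kernel(MD)      # miRNA kernel from miRNA profiles
KD = gip_kernel(MD.T)    # disease kernel from disease profiles
```

Two entities with identical profiles get similarity 1, and the similarity decays with the number of associations in which they differ.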

Matrix completion
The miRNA functional similarity matrix and disease semantic similarity matrix calculated by the above operations are still sparse and incomplete, and there are some redundant associations (i.e. inherent noise). So we use the matrix completion method to solve the problem [56]. Suppose the incomplete matrix is $D$, which can be represented as the sum of a linear combination of itself, $DR$, and the noise matrix $N$. The formula is as follows:

$$D = DR + N$$

where $DR$ is a low-rank matrix, and specifically, it is a more refined or informative similarity matrix after removing noise from the existing similarity matrix.
In order to make $R$ low-rank, a nuclear norm on $R$ is added. At the same time, the $L_{2,1}$-norm of the error term $N$ is used to make the noise matrix $N$ more sparse. When the final low-rank matrix $DR^*$ and sparse matrix $N^*$ are calculated, $DR^*$ or $D - N^*$ is used to describe a completed matrix. Therefore, the convex optimization problem can be defined as follows:

$$\min_{R, N} \|R\|_* + \omega \|N\|_{2,1} \quad \text{s.t.} \quad D = DR + N \tag{21}$$

Here, $\|\cdot\|_*$ represents the nuclear norm, $\omega \in (0, 1)$ is the positive weighting parameter and $\|\cdot\|_{2,1}$ is the noise regularization term.
When solving optimization problems under equality constraints, the ALM method is more effective [38]. Therefore, according to ALM, Eq. (21) can be rewritten with an auxiliary variable $X$ as:

$$\min_{X, R, N} \|X\|_* + \omega \|N\|_{2,1} \quad \text{s.t.} \quad D = DR + N, \; R = X \tag{22}$$

Then Eq. (22) is converted into an unconstrained problem, which is the Lagrange function. The formula is as follows:

$$L = \|X\|_* + \omega \|N\|_{2,1} + \mathrm{tr}\left(Y_1^T (D - DR - N)\right) + \mathrm{tr}\left(Y_2^T (R - X)\right) + \frac{\beta}{2}\left(\|D - DR - N\|_F^2 + \|R - X\|_F^2\right) \tag{23}$$

where $\beta > 0$ is the penalty parameter, and $\beta$ is updated by $\beta = \min(\rho\beta, \beta_{\max})$. $Y_1$ and $Y_2$ are the Lagrange multipliers.
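The ADM method is used to solve Eq. (23) [39]. ADM is a simple method for solving decomposable convex optimization problems, and it is especially suitable for large-scale problems. The update iterations for ADM are as follows:

$$X_{k+1} = \arg\min_X L(X, R_k, N_k, Y_1, Y_2, \beta)$$

$$R_{k+1} = \arg\min_R L(X_{k+1}, R, N_k, Y_1, Y_2, \beta)$$

$$N_{k+1} = \arg\min_N L(X_{k+1}, R_{k+1}, N, Y_1, Y_2, \beta)$$

Based on the singular value shrinkage operator [40], $X_{k+1}$ and $N_{k+1}$ are represented as follows:

$$X_{k+1} = \Theta_{1/\beta}\left(R_k + \frac{Y_2}{\beta}\right), \qquad N_{k+1} = \Omega_{\omega/\beta}\left(D - DR_{k+1} + \frac{Y_1}{\beta}\right)$$

where $\Theta$ denotes singular value thresholding and $\Omega$ denotes the column-wise shrinkage associated with the $L_{2,1}$-norm. The minimization over $R$ is a least squares problem, and its normal equation is as follows:

$$R_{k+1} = \left(I + D^T D\right)^{-1}\left(D^T D - D^T N_k + X_{k+1} + \frac{D^T Y_1 - Y_2}{\beta}\right)$$

where $I$ is the identity matrix.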
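The alternating scheme of this section, together with the multiplier updates and stopping rule described in the following paragraph, can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the parameter values (`omega`, the initial `beta`, `rho`, `beta_max`, `max_iter`), the zero initialisation and the toy input are all choices made here, and the block order X, R, N follows the usual low-rank representation solver.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: shrink the singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def l21_shrink(M, tau):
    """Column-wise shrinkage, the proximal operator of the L2,1-norm."""
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return M * scale

def complete(D, omega=0.1, beta=1e-2, rho=1.1, beta_max=1e10, eps=1e-8, max_iter=500):
    """Approximately solve min ||R||_* + omega*||N||_{2,1} s.t. D = D R + N."""
    D = np.asarray(D, dtype=float)
    n = D.shape[1]
    R = np.zeros((n, n)); X = np.zeros((n, n)); N = np.zeros_like(D)
    Y1 = np.zeros_like(D); Y2 = np.zeros((n, n))
    DtD = D.T @ D
    inv = np.linalg.inv(np.eye(n) + DtD)       # reused in every R update
    for _ in range(max_iter):
        X = svt(R + Y2 / beta, 1.0 / beta)
        R = inv @ (DtD - D.T @ N + X + (D.T @ Y1 - Y2) / beta)
        N = l21_shrink(D - D @ R + Y1 / beta, omega / beta)
        r1 = D - D @ R - N                     # residual of D = DR + N
        r2 = R - X                             # residual of R = X
        Y1 = Y1 + beta * r1
        Y2 = Y2 + beta * r2
        beta = min(rho * beta, beta_max)
        if max(np.abs(r1).max(), np.abs(r2).max()) < eps:
            break
    return D @ R, R, N

# Toy similarity matrix: low-rank structure scaled to [0, 1] (not real data).
rng = np.random.default_rng(0)
B = rng.random((6, 2))
D_toy = B @ B.T
D_toy = D_toy / D_toy.max()
completed, R_star, N_star = complete(D_toy)
```

The matrix `completed` plays the role of the refined similarity matrix, and `N_star` collects the part of the input that is treated as noise.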
Then $X$, $R$ and $N$ are updated alternately, together with the Lagrange multipliers $Y_1$ and $Y_2$, which are obtained by the following formulas:

$$Y_1 = Y_1 + \beta(D - DR - N),$$

$$Y_2 = Y_2 + \beta(R - X). \tag{25}$$

Finally, we obtain the final low-rank matrix $R^*$ and sparse matrix $N^*$ once the convergence conditions $\|D - DR - N\|_\infty < \varepsilon$ and $\|R - X\|_\infty < \varepsilon$ are satisfied. Here, $\varepsilon$ is a very small number (set as $1 \times 10^{-8}$ in this paper). As mentioned above, the refined matrix $R^*$ and noise matrix $N^*$ can be used to describe a completed matrix in the form of $D \times R^*$ or $D - N^*$. The specific process of matrix completion is shown in Fig. 5.
Based on the above matrix completion method, the disease semantic similarity matrix $DS$ and miRNA functional similarity matrix $MF$ are used as input matrices to replace matrix $D$, so that we can obtain two refined similarity matrices $CD$ and $CM$, respectively.

Wu et al. BMC Bioinformatics (2020) 21:454

MCCMF for MiRNA-disease association prediction
The CMF method proposed by Shen et al. [45] can effectively predict the potential interactions between miRNAs and diseases. In this study, the idea of the CMF method is used to predict miRNA-disease associations. The specific steps of CMF are as follows. Firstly, the input miRNA-disease association matrix $PMD$ is decomposed into two low-rank matrices $A$ and $B$ by using the singular value decomposition:

$$[U, S, V] = \mathrm{SVD}(PMD, k), \qquad A = U S^{1/2}, \qquad B = V S^{1/2}$$
where $U$ and $V$ are unitary matrices, and $S$ is a non-negative real diagonal matrix with $k$ singular values on its diagonal.
where $I_k$ is the $k \times k$ identity matrix. Finally, we update $A$ and $B$ iteratively until they converge to obtain the final $A$ and $B$. The prediction matrix for miRNA-disease associations is then obtained as $AB^T$. The detailed process of MCCMF can be seen in Fig. 8.

Fig. 8 The flowchart of CMF method. PMD is the pre-processed matrix of miRNA-disease association matrix
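The objective function and update rules are not reproduced in this text. The sketch below therefore assumes the commonly used CMF objective, $\|PMD - AB^T\|_F^2 + \lambda_l(\|A\|_F^2 + \|B\|_F^2) + \lambda_m\|SM - AA^T\|_F^2 + \lambda_d\|SD - BB^T\|_F^2$, and its alternating least squares updates. The symbols `SM` and `SD` (the integrated miRNA and disease similarity matrices), the regularisation weights, `k`, the iteration count and the toy matrices are all assumptions for illustration.

```python
import numpy as np

def cmf(PMD, SM, SD, k=2, lam_l=0.5, lam_m=0.1, lam_d=0.1, n_iter=50):
    """Collaborative matrix factorization with SVD initialisation (assumed form)."""
    # Initialise A and B from the top-k singular triplets of PMD.
    U, s, Vt = np.linalg.svd(PMD, full_matrices=False)
    root = np.sqrt(s[:k])
    A = U[:, :k] * root
    B = Vt[:k].T * root
    I_k = np.eye(k)
    for _ in range(n_iter):
        # Alternating closed-form updates of A and B.
        A = (PMD @ B + lam_m * SM @ A) @ np.linalg.inv(
            B.T @ B + lam_l * I_k + lam_m * A.T @ A)
        B = (PMD.T @ A + lam_d * SD @ B) @ np.linalg.inv(
            A.T @ A + lam_l * I_k + lam_d * B.T @ B)
    return A @ B.T, A, B

# Toy inputs (not real data): 4 miRNAs, 3 diseases.
PMD_toy = np.array([[1.0, 0.4, 0.0],
                    [1.0, 1.0, 0.1],
                    [0.4, 0.3, 0.0],
                    [0.0, 0.1, 1.0]])
SM_toy = np.array([[1, .8, .5, .1], [.8, 1, .6, .2], [.5, .6, 1, .3], [.1, .2, .3, 1]])
SD_toy = np.array([[1, .7, .2], [.7, 1, .4], [.2, .4, 1]])
F, A_fin, B_fin = cmf(PMD_toy, SM_toy, SD_toy)
```

Each column of `F` holds the predicted scores of all miRNAs for one disease, which is the input to the ranking used in the case studies. In practice a convergence test on `A` and `B` would replace the fixed iteration count used here.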