- Methodology article
- Open Access

# MCCMF: collaborative matrix factorization based on matrix completion for predicting miRNA-disease associations

*BMC Bioinformatics*
**volume 21**, Article number: 454 (2020)

## Abstract

### Background

MicroRNAs (miRNAs) are non-coding RNAs with regulatory functions. Many studies have shown that miRNAs are closely associated with human diseases. Traditional methods for exploring the relationship between miRNAs and diseases are time-consuming, and their accuracy needs to be improved. To address the shortcomings of previous models, a method called collaborative matrix factorization based on matrix completion (MCCMF) is proposed to predict unknown miRNA-disease associations.

### Results

The completed matrices of miRNAs and diseases are obtained by matrix completion. Moreover, the Gaussian Interaction Profile kernel is added to the miRNA functional similarity matrix and the disease semantic similarity matrix. Then the Weighted K Nearest Known Neighbors method is used to preprocess the association matrix, so that the model is closer to reality. Finally, the collaborative matrix factorization method is applied to obtain the prediction results. MCCMF obtains a satisfactory result in the fivefold cross-validation, with an AUC of 0.9569 (0.0005).

### Conclusions

The AUC value of MCCMF is higher than those of other advanced methods in the fivefold cross-validation experiment. To evaluate the performance of MCCMF comprehensively, accuracy, precision, recall and f-measure are also reported. The final experimental results demonstrate that MCCMF outperforms other methods in predicting miRNA-disease associations. Finally, the effectiveness and practicability of MCCMF are further verified by case studies of three specific diseases.

## Background

MicroRNAs (miRNAs) are a class of non-coding single-stranded RNA molecules, usually 18–24 nucleotides long. Rather than encoding proteins, miRNAs participate in post-transcriptional regulation of gene expression in eukaryotes and viruses [1]. Although the first miRNA, lin-4, was discovered in 1993 [2], the diversity and prevalence of these genes have been revealed only in recent years. To date, 38,589 miRNAs have been found in animals, plants and viruses [3]. miRNAs have also been found to play an important role in cell proliferation [4], differentiation [5], senescence [6] and apoptosis [7], among other processes. One study indicated that more than one third of human genes are regulated by miRNAs [8]. miRNA dysregulation could therefore have severe impacts on human health.

Evidence shows that an increasing number of miRNAs are closely associated with diseases [9]. Since the first discovery of miR-15 and miR-16 deficiency in B cell chronic lymphocytic leukemia (B-CLL) [10], new miRNA-disease associations have been reported frequently. For example, the expression of miR-25 and miR-223 is significantly higher in patients with esophageal squamous cell carcinoma than in healthy individuals, while the expression of miR-375 is significantly lower [11]. Studies show that miR-26a may be a regulatory factor that inhibits the progression and metastasis of c-Myc/EZH2 double-high advanced hepatocellular carcinoma (HCC) [12]. In addition, miR-340 has been suggested as a biomarker for cancer metastasis and prognosis [13]. Research on miRNAs and diseases is becoming more extensive, and researchers have developed a number of databases to store miRNA and disease data, such as dbDEMC [14], HMDD v3.0 [15] and miR2Disease [16]. Unfortunately, the known association data are not complete. Moreover, traditional methods to identify new miRNA-disease associations are time-consuming and laborious.

With the improvement of information technology and the growth of miRNA data sets, many effective methods for predicting miRNA-disease associations have been proposed [17]. Based on the hypothesis that functionally similar miRNAs tend to be associated with phenotypically similar diseases [18], Jiang et al. [19] first constructed a genetic data network and then prioritized disease-related miRNAs to predict miRNA-disease associations. However, owing to the limited association information, this method is not very effective. Li et al. developed a computational framework that measures the association between a cancer and a miRNA based on the functional consistency score (FCS) of miRNA target genes and cancer-related genes. This method has a significant advantage in identifying cancer-related miRNAs [20]. Xiao et al. [21] identified potential miRNA-disease associations from heterogeneous omics data using Graph Regularized Non-negative Matrix Factorization (GRNMF). However, the prediction results of GRNMF may not be optimal in some cases. Chen et al. [22] proposed a computational model of Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction (MDHGI) to discover new miRNA-disease associations. The model applies matrix decomposition before constructing the heterogeneous network, thus improving the prediction accuracy. Mørk et al. [23] proposed the protein-driven inference of miRNA-disease associations (miRPD), which can infer miRNA-protein-disease associations. They also provide scoring schemes that can create association sets of high and medium confidence. Chen et al. [24] developed three miRNA-disease association prediction methods based on global network similarity measures, namely MBSI (microRNA-based similarity inference), PBSI (phenotype-based similarity inference) and NetCBI (network-consistency-based inference). NetCBI is especially suitable for predicting target diseases, but it relies heavily on the network similarity measure. Similarly, Gao et al. [25] put forward a method based on Double Network Sparse Graph Regularized Matrix Factorization (DNSGRMF), which adds the \(L_{2,1}\)-norm and the Gaussian interaction profile (GIP) kernel to improve the prediction ability. In addition, considering the nearest neighbor information of miRNAs and diseases, Gao et al. [26] introduced Nearest Profile-based Collaborative Matrix Factorization (NPCMF) to predict miRNA-disease associations. The most obvious disadvantage of NPCMF is that it introduces too much nearest profile (NP) information, which adds extra noise and may reduce the prediction accuracy. To protect the known associations, Yin et al. [27] proposed Logistic Weighted Profile-based Collaborative Matrix Factorization (LWPCMF), which effectively predicts miRNA-disease associations; its prediction performance is promising. Chen et al. [28] constructed a model based on Canonical Correlation Analysis (CCA), which can reveal possible molecular causes of miRNA-disease associations. However, direct performance comparison is difficult with this method.

In recent years, machine learning-based methods for miRNA-disease association prediction have also become popular. Xu et al. [29] developed a support vector machine (SVM) classifier that uses features extracted from the miRNA-disease network and miRNA expression levels. Yet the construction of the miRNA target-dysregulated network (MTDN) is complex, so only direct miRNA target regulation can be predicted. Chen et al. [30] used random walk to prioritize disease-related miRNAs and predict potential human miRNA-disease associations. Like the method of Jiang et al., their approach is limited by the small number of known disease-miRNA associations. Chen et al. [31] established a Restricted Boltzmann Machine model for multiple types of miRNA-disease association prediction (RBMMMDA). Chen et al. [32] constructed a computational model called Laplacian Regularized Sparse Subspace Learning for MiRNA-Disease Association prediction (LRSSLMDA). The model has stronger dimensionality reduction capability and can be easily extended to higher-dimensional data sets. Chen et al. [33] constructed an Inductive Matrix Completion model for MiRNA-Disease Association prediction (IMCMDA). Because it is a semi-supervised model, only positive samples and unlabeled samples are needed, which greatly reduces the difficulty of modeling. Soon after, Chen et al. [34] proposed a bipartite network projection model for miRNA-disease association prediction (BNPMDA). Compared with previous models, the prediction accuracy of BNPMDA is improved. Chen et al. [35] proposed a miRNA-disease association prediction algorithm based on decision trees. This method constructs a computational framework that combines ensemble learning and dimensionality reduction. By training and integrating multiple base classifiers, it reduces prediction bias and improves prediction performance. Ding et al. [36] used an improved inductive matrix completion method (IIMCMP) to predict miRNA-disease associations. Experiments show that IIMCMP achieves strong and reliable performance. Li et al. [37] developed neural inductive matrix completion with graph convolutional network (NIMCGCN) for miRNA-disease association prediction. To test the predictive power of NIMCGCN in the absence of any known associated miRNAs, they studied breast cancer and reported 100% accuracy. The above methods have made great contributions to predicting miRNA-disease associations.

Given the shortcomings of the above methods, a novel method for predicting miRNA-disease associations, Collaborative Matrix Factorization based on Matrix Completion (MCCMF), is proposed in this paper. Firstly, the human miRNA-disease association matrix, the miRNA functional similarity matrix and the disease semantic similarity matrix are obtained from HMDD v2.0, but the obtained matrices are sparse. Therefore, the matrix completion method is used to complete them. The matrix completion algorithm is built mainly on the Augmented Lagrange Multiplier method (ALM) [38], the Alternating Direction Method (ADM) [39] and the Singular Value Thresholding (SVT) operation [40]. Secondly, we integrate the completed matrices with the GIP kernel similarity matrices of diseases and miRNAs. At the same time, the miRNA-disease association matrix is preprocessed by the Weighted K Nearest Known Neighbors (WKNKN) method to address the missing values of unknown associations [41]. Finally, collaborative matrix factorization is used to predict associations between miRNAs and diseases. In the experiment, fivefold cross validation is performed on MCCMF, and the results show that our method is superior to the other four methods. In addition, we present case studies of Gastrointestinal Neoplasms, Retinoblastoma and Hepatoblastoma. Our method not only recovers known miRNA-disease associations, but also finds many previously unknown associations. In summary, MCCMF reduces the influence of the inherent noise of the data set while achieving high speed and high prediction accuracy.

## Results

### Performance evaluation

In this section, the AUC value, accuracy, precision, recall and f-measure are used to evaluate the performance of MCCMF. We implement fivefold cross validation to objectively evaluate the predictive power of our method. The known miRNA-disease associations are randomly divided into five groups, of which four are used as the training set and the remaining one as the test set. In addition, to demonstrate the predictive capability of our method under harder conditions, miRNA-disease associations are randomly deleted (i.e. the Cross Validation pairs mode) before the cross validation is performed, which increases the difficulty of prediction [42]. Fivefold cross validation is repeated 10 times to prevent bias caused by the grouping, and the average of the 10 runs is used as the final evaluation result.
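The splitting procedure described above can be sketched in a few lines. This is a minimal illustration, assuming the association matrix is a NumPy array of 0s and 1s; the function name `five_fold_splits` and the toy matrix are hypothetical, and the actual MCCMF experiments may differ in detail.

```python
import numpy as np

def five_fold_splits(MD, n_folds=5, seed=0):
    """Yield (training matrix, held-out pairs) for each fold.

    Hypothetical helper: known associations (entries equal to 1) are shuffled,
    split into n_folds groups, and each group in turn is hidden from training.
    """
    rng = np.random.default_rng(seed)
    known = np.argwhere(MD == 1)             # (miRNA, disease) index pairs
    rng.shuffle(known)                       # shuffle the pairs (rows) in place
    for held_out in np.array_split(known, n_folds):
        train = MD.copy()
        train[held_out[:, 0], held_out[:, 1]] = 0   # delete the test associations
        yield train, held_out

# Toy 4 x 5 association matrix (illustrative values only).
MD_toy = np.array([[1, 0, 1, 0, 1],
                   [0, 1, 0, 1, 0],
                   [1, 1, 0, 0, 1],
                   [0, 0, 1, 1, 0]])
splits = list(five_fold_splits(MD_toy))
```

Calling the generator with ten different seeds and averaging the resulting scores would mirror the 10 repetitions mentioned above.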

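The ROC curve is drawn to represent the prediction performance intuitively, and the AUC value is calculated to evaluate MCCMF quantitatively. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), which can be expressed as:

\[TPR = \frac{TP}{TP + FN}\]

\[FPR = \frac{FP}{FP + TN}\]

where \(TP\) is the number of samples that are actually positive and are also predicted to be positive, and \(TN\) is the number of samples that are actually negative and are also predicted to be negative. \(FP\) and \(FN\) are the numbers of samples for which the prediction is inconsistent with the actual label: \(FP\) counts negative samples predicted to be positive, and \(FN\) counts positive samples predicted to be negative.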

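In order to make the performance evaluation more comprehensive, we also use other evaluation indicators, including the accuracy, precision, recall and f-measure. In terms of \(TP\), \(TN\), \(FP\) and \(FN\), they are defined as follows:

\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]

\[Precision = \frac{TP}{TP + FP}\]

\[Recall = \frac{TP}{TP + FN}\]

\[F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}\]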
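As a quick numerical check of these indicators, the following minimal sketch computes TPR, FPR, accuracy, precision, recall and f-measure from the four confusion counts using their standard definitions. The function name and the example counts are illustrative assumptions, not values from the experiments.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics; returns a dict of floats."""
    recall = tp / (tp + fn)                  # identical to the true positive rate
    fpr = fp / (fp + tn)                     # false positive rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"tpr": recall, "fpr": fpr, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f_measure": f_measure}

# Illustrative counts only, not taken from the experiments.
metrics = evaluation_metrics(tp=8, fp=2, tn=85, fn=5)
```

Because unknown pairs vastly outnumber known associations in this kind of data, accuracy is dominated by true negatives, which is one reason precision, recall and f-measure are reported alongside it.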

### Comparison with other methods

The AUC value is generally between 0 and 1. The higher the AUC value is, the better the prediction result will be. MCCMF obtains an AUC value of 0.9569 in the fivefold cross validation. MCCMF is compared with four advanced methods, namely WBNPMD [43], RLSMDA [44], GRNMF [21] and CMF [45], and the comparison demonstrates the superior performance of our method. The ROC curves are drawn in Fig. 1, and the comparison results are listed in Table 1. The results of other methods in Table 1 are obtained directly from the literature.

In Table 1, the highest value is highlighted in italic, with the standard deviation in parentheses. In the fivefold cross validation experiment, WBNPMD, RLSMDA, GRNMF, CMF and MCCMF obtain AUCs of 0.9173, 0.8389, 0.869, 0.8697 and 0.9569, respectively. Therefore, our method achieves the highest AUC among the five methods.

WBNPMD, which has the highest AUC value among the four compared methods, is selected for further comparison with MCCMF, and the accuracy, precision, recall and f-measure of the two methods are presented as a bar graph in Fig. 2. MCCMF is also better than WBNPMD on these measures.

### Case studies

Finally, we carry out case studies to analyze specific diseases. First, the disease of interest is selected and the predicted scores for that disease are ranked. Then, based on the ranking, the miRNAs most strongly associated with the disease are identified. By comparing with the original miRNA-disease association matrix, we determine whether each high-scoring association is already known. Finally, the unknown associations are verified by searching existing data sets. Here, we choose three diseases, Gastrointestinal Neoplasms, Retinoblastoma and Hepatoblastoma, for analysis. Three popular data sets, dbDEMC [14], HMDD v3.0 [15] and miRCancer [46], are used for validation. These data sets store miRNA-disease associations that have been experimentally confirmed over the years.

Gastrointestinal Neoplasms are very common gastrointestinal diseases with a high incidence. However, they show no obvious symptoms in the early growth stage, which makes them very dangerous. We successfully predict 31 known associations and 9 new associations, 7 of which are confirmed by HMDD v3.0 and miRCancer. For example, Tazawa et al. [47] discovered the potential role of oncogenic miR-21 in Gastrointestinal Neoplasms. The other confirmed miRNAs have been reported in the relevant data sets and are not listed here. Two associations remain unconfirmed and need further research. Table 2 describes the results, where known associations are shown in italic, confirmed new predictions are labeled with the corresponding database, and unconfirmed ones are marked as “unconfirmed”. The predicted scores in Table 2 are ranked according to the strength of the association between the miRNA and the disease. A threshold is used to determine whether a prediction is accurate. Compared with the known information and other databases, the prediction results of our method are generally accurate. Although two remain unconfirmed, they could provide some insights for researchers.

Retinoblastoma is a malignant tumor that occurs in children under 3 years old and has a familial predisposition. There are 38 known associations between this disease and miRNAs in the known association data set, and 37 of them are successfully predicted. At the same time, 23 new associations are predicted, seven of which are confirmed while the others remain unconfirmed. Montoya et al. [48] found that the expression of miR-31 in Retinoblastoma is significantly reduced, a finding that promotes the development of targeted therapy for Retinoblastoma. Table 3 shows the detailed results, sorted in the same way as in Table 2.

Hepatoblastoma is the most common intra-abdominal malignant tumor in childhood after neuroblastoma and nephroblastoma. In the existing miRNA-disease association data set, there are 8 known miRNA-disease associations for Hepatoblastoma, and all of them are successfully predicted. In addition, we predict 12 new associations, 7 of which are confirmed and 5 are not. We also found literature confirming that miR-143 is a factor affecting Hepatoblastoma: the study of Zhang et al. [49] showed that blocking miR-143 could significantly inhibit local liver metastasis. The Hepatoblastoma prediction results are shown in Table 4, sorted in the same way as in Tables 2 and 3.

As can be seen from the results above, most known associations are successfully predicted, and a small number of previously unknown associations are confirmed in the HMDD v3.0, miRCancer and dbDEMC data sets. Although a few have not been confirmed, they can serve as a reference for researchers. In addition, we used Cytoscape to map the prediction network of these three diseases (Fig. 3). In the network, ellipses represent miRNAs, and the remaining shapes represent diseases. Associations are drawn as line segments with arrows, and some miRNAs are shared between diseases. The color depth of each ellipse is set according to the predicted score: the darker the ellipse, the stronger the association between the miRNA and the disease.

## Discussion

The above experimental results indicate that our method outperforms the advanced methods it is compared with. The excellent prediction performance of MCCMF can be attributed to several factors. Firstly, the data are preprocessed by the Weighted K Nearest Known Neighbors method and the matrix completion method to improve the prediction accuracy. Secondly, a collaborative matrix factorization model, a promising collaborative filtering technique, is applied to predict miRNA-disease associations. In bioinformatics, matrix factorization helps to identify hidden links among genes. However, the performance of our method can be further improved. For instance, there may be better ways to integrate the similarity data than simply adding them together. In the future, we will improve the method and use the latest version of the data set, such as HMDD v3.0.

## Conclusions

In this paper, a collaborative matrix factorization method based on matrix completion (MCCMF) is developed for predicting miRNA-disease associations. Because the similarity matrices of miRNAs and diseases are sparse and incomplete, we use the matrix completion method to complete them. The completed matrices are then integrated with the GIP kernel similarity to enrich the data and reduce the influence of noise. In addition, WKNKN is introduced to preprocess the existing association matrix of miRNAs and diseases, which makes our method suitable for practical problems. Finally, the idea of CMF is adopted to construct the objective function and obtain the predicted results. The AUC value (0.9569) of MCCMF is higher than those of other advanced methods in the fivefold cross validation experiment. To evaluate the performance of MCCMF comprehensively, accuracy, precision, recall and f-measure are also measured, and the results are 0.992, 0.779, 0.918 and 0.830, respectively. Compared with the other four methods, our method has the best performance. The analysis of Gastrointestinal Neoplasms, Retinoblastoma and Hepatoblastoma further verifies the effectiveness of MCCMF. Since most associations are unknown in reality, MCCMF can also be used for prediction in this situation.

## Methods

We develop a novel method for predicting miRNA-disease associations with MCCMF. MCCMF is divided into four main steps: Firstly, we use the matrix completion algorithm to complete the miRNA similarity matrix and the disease similarity matrix to generate a new completion similarity matrix. Secondly, the new completion similarity matrix is integrated with existing miRNA and disease similarity information. Thirdly, the WKNKN is used to convert the binary values of the miRNA-disease association matrix into the interaction likelihood values [41]. Finally, the Collaborative Matrix Factorization is used to predict the association of miRNA-disease. Figure 4 shows the complete process for MCCMF.

### Human miRNA-disease associations

The initial miRNA-disease association data are downloaded from HMDD v2.0 [50]. HMDD v2.0 is a data set of experimentally supported human miRNA-disease associations, and it stores 5430 experimentally verified miRNA-disease associations between 495 miRNAs and 383 diseases. In this paper, the adjacency matrix \({\mathbf{MD}}\) is used to represent the miRNA-disease association network. \({\mathbf{MD}}\) is a sparse matrix composed of 0 and 1. If \({\mathbf{MD}}\left( {m_{i} ,d_{j} } \right)\) is 1, disease \(d_{j}\) is associated with miRNA \(m_{i}\); otherwise, no association between them is known.

### MiRNA function similarity

According to the hypothesis that functionally similar miRNAs are more likely to be associated with phenotypically similar diseases, Wang et al. [51] proposed a method for calculating the functional similarity of miRNAs (MISIM). Firstly, we need to define the semantic similarity between one disease and one group of diseases. The calculation formula is as follows:

Here \(d\) represents one disease and \({\mathbf{D}}\) represents one disease group. The similarity of \(d\) and \({\mathbf{D}}\), \(S(d,{\mathbf{D}})\), is defined as the maximum similarity between \(d\) and any disease in \({\mathbf{D}}\).

Functional similarity of the two miRNAs is defined as

where \(M_{1}\) and \(M_{2}\) represent the related miRNAs of \({\mathbf{D}}_{1}\) and \({\mathbf{D}}_{2}\), respectively. \({\mathbf{D}}_{1}\) contains \(m\) diseases, and \({\mathbf{D}}_{2}\) contains \(n\) diseases.
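In the commonly used form of MISIM, the two definitions read \(S(d,{\mathbf{D}}) = \max_{d_{i} \in {\mathbf{D}}} S(d,d_{i})\) and \(MISIM(M_{1} ,M_{2} ) = \left( \sum_{i = 1}^{m} S(d_{1i} ,{\mathbf{D}}_{2} ) + \sum_{j = 1}^{n} S(d_{2j} ,{\mathbf{D}}_{1} ) \right)/(m + n)\). The sketch below follows that form as an illustration; the function names, the index-based disease groups and the toy similarity matrix `DS_toy` are assumptions made for the example.

```python
import numpy as np

def disease_to_group(d, group, DS):
    """S(d, D): largest semantic similarity between disease d and any disease in group D."""
    return max(DS[d, g] for g in group)

def misim(D1, D2, DS):
    """Functional similarity of two miRNAs given their associated disease groups."""
    s12 = sum(disease_to_group(d, D2, DS) for d in D1)   # diseases of miRNA 1 vs group 2
    s21 = sum(disease_to_group(d, D1, DS) for d in D2)   # diseases of miRNA 2 vs group 1
    return (s12 + s21) / (len(D1) + len(D2))

# Toy disease semantic similarity matrix for three diseases (illustrative values).
DS_toy = np.array([[1.0, 0.5, 0.2],
                   [0.5, 1.0, 0.4],
                   [0.2, 0.4, 1.0]])
similarity = misim([0, 1], [1, 2], DS_toy)
```

With this definition, a miRNA compared with itself scores 1 whenever the disease similarity matrix has a unit diagonal, which is consistent with the unit diagonal of \({\mathbf{MF}}\).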

In this paper, we download the miRNA functional similarity from https://www.cuilab.cn/files/images/cuilab/misim.zip. The matrix \({\mathbf{MF}}\) is used to represent the functional similarity network of miRNAs, in which the element \({\mathbf{MF}}(i,j)\) represents the similarity between miRNA \(m_{i}\) and miRNA \(m_{j}\). The self-similarity of each miRNA is 1, so the diagonal elements of the matrix \({\mathbf{MF}}\) are 1.

Because the experimentally supported miRNA data are incomplete, the similarity values calculated by MISIM may be biased. Subsequent processing of the matrix can mitigate this [52].

### Disease semantic similarity

The relationship between different diseases is obtained from the MeSH database (https://www.ncbi.nlm.nih.gov/). Based on the previous literature [51], we represent the disease \(D\) as a Directed Acyclic Graph, \(DAG(D) = (D,T(D),E(D))\), where \(T(D)\) is the set containing node \(D\) and its ancestor nodes, and \(E(D)\) is the set of edges from the ancestor nodes to node \(D\). For an ancestor node \(t\) in \(DAG(A)\), its contribution to the semantic value of disease \(A\) is computed as follows:

In the above formula, \(\Delta\) is a semantic contribution factor. Following the method of Wang et al., the value of \(\Delta\) is set to 0.5. The contribution of disease \(A\) to itself is 1, while the contribution of an ancestor node \(t\) decreases as the number of layers between \(t\) and \(A\) increases.

Based on the contribution of ancestor diseases and disease \(A\) itself, the semantic value of disease \(A\) can be expressed as follows:

According to the hypothesis that the larger the part of their \(DAGs\) two diseases share, the higher their similarity is, the semantic similarity between disease \(A\) and disease \(B\) is calculated as:
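In the model of Wang et al. [51], these quantities are usually written as \(D1_{A}(t) = 1\) for \(t = A\) and \(D1_{A}(t) = \max \{ \Delta \cdot D1_{A}(t^{\prime}) \mid t^{\prime} \in \text{children of } t \}\) otherwise, with the semantic value being the sum of \(D1_{A}(t)\) over \(T(A)\) and the similarity being the shared contributions divided by the sum of both semantic values. The following sketch implements that reading on a hand-made DAG; the symbol \(D1\), the function names and the `parents` dictionary are illustrative assumptions, not data from MeSH.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Return {t: contribution of t to `disease`} for the disease and all its ancestors."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            if value > contrib.get(parent, 0.0):   # keep the largest value over all paths
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(a, b, parents, delta=0.5):
    """Shared ancestor contributions divided by the sum of both semantic values."""
    ca = semantic_contributions(a, parents, delta)
    cb = semantic_contributions(b, parents, delta)
    shared = set(ca) & set(cb)
    return sum(ca[t] + cb[t] for t in shared) / (sum(ca.values()) + sum(cb.values()))

# Hypothetical DAG: diseases A and B share the parent P, whose parent is the root.
parents = {"A": ["P"], "B": ["P"], "P": ["root"]}
sim_ab = semantic_similarity("A", "B", parents)
```

In this toy DAG each disease has semantic value 1 + 0.5 + 0.25 = 1.75, and the shared ancestors contribute 1.5 in total, so the similarity is 1.5 / 3.5.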

However, the above model has a shortcoming: with a fixed \(\Delta\), diseases in the same layer make the same semantic contribution. The incidence of different diseases obviously differs, and a disease with high incidence should contribute less than one with low incidence. To improve the above model, we adopt the method of Xuan et al. [53] to define a second semantic similarity calculation. In this method, the contribution of ancestor node \(t\) in \(DAG(A)\) to the semantic value of disease \(A\) is as follows:

The semantic value of disease \(A\), and the semantic similarity between the disease \(A\) and the disease \(B\) are calculated as:

Finally, to make the semantic similarity more comprehensive and reasonable, we combine the two models to get Eq. (15).
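For orientation, the second model and the combination step are commonly written as below in work that follows Xuan et al. [53]. This is a sketch under assumptions: the symbols \(D2\), \(DV2\), \(DS1\), \(DS2\) and \(DS\) are labels chosen here, and the equal-weight average in the last line is the usual choice rather than a form confirmed by the text above.

```latex
\begin{aligned}
D2_{A}(t) &= -\log\left(\frac{\text{number of DAGs including } t}{\text{number of diseases}}\right), \\
DV2(A) &= \sum_{t \in T(A)} D2_{A}(t), \\
DS2(A,B) &= \frac{\sum_{t \in T(A) \cap T(B)} \left(D2_{A}(t) + D2_{B}(t)\right)}{DV2(A) + DV2(B)}, \\
DS(A,B) &= \frac{DS1(A,B) + DS2(A,B)}{2}.
\end{aligned}
```

Under this form, a term that appears in few disease DAGs receives a large contribution, which matches the requirement that rarer, more specific diseases should count more than common ones.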

### Gaussian interaction profile kernel similarity for diseases and miRNAs

On the basis of the hypothesis that functionally similar miRNAs may be associated with similar diseases, and vice versa, the known miRNA-disease association network is used to construct the GIP kernel similarity for diseases and miRNAs [54]. GIP kernel similarity supplements the other similarity measures with topological information from the known associations. The interaction profile of miRNA \(m(i)\) is represented by the binary vector \(M(i)\) of the *i*-th column of the adjacency matrix \({\mathbf{MD}}\). Similarly, the binary vector \(D(i)\) of the *i*-th row of the adjacency matrix \({\mathbf{MD}}\) denotes the interaction profile of disease \(d(i)\). Hence, we can define the GIP kernel similarity for miRNAs and diseases as follows:

Here, \(\gamma_{m}\) and \(\gamma_{d}\) are parameters to control the kernel bandwidth and obtained by the following formulas:

where \(\delta_{m}\) and \(\delta_{d}\) are also bandwidth parameters, both set to 1 according to the previous study [55], and \(nm\) and \(nd\) denote the numbers of miRNAs and diseases, respectively.
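In its usual form the kernel is \(\exp ( - \gamma ||P(i) - P(j)||^{2} )\) for two interaction profiles \(P(i)\) and \(P(j)\), with \(\gamma\) equal to \(\delta\) divided by the mean squared norm of all profiles. The sketch below assumes that form; the function name `gip_kernel` and the toy profiles are illustrative, and each row of the input is taken to be one interaction profile so that the same function serves miRNAs and diseases.

```python
import numpy as np

def gip_kernel(profiles, delta=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    sq_norms = (profiles ** 2).sum(axis=1)
    gamma = delta / sq_norms.mean()          # bandwidth normalized by the mean squared norm
    # Squared Euclidean distances between all pairs of profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Toy binary interaction profiles, one per row (illustrative values only).
profiles_toy = np.array([[1.0, 0.0, 1.0],
                         [1.0, 0.0, 1.0],
                         [0.0, 1.0, 0.0]])
K_toy = gip_kernel(profiles_toy)
```

Passing the association matrix in one orientation gives the kernel for one entity type, and passing its transpose gives the kernel for the other.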

### Matrix completion

The miRNA functional similarity matrix and the disease semantic similarity matrix calculated by the above operations are still sparse and incomplete, and they contain some redundant associations (i.e. inherent noise). We therefore use the matrix completion method to solve the problem [56]. Suppose the incomplete matrix is \({\mathbf{D}}\), which can be represented as a linear combination of \({\mathbf{DR}}\) and the noise matrix \({\mathbf{N}}\). The formula is as follows:

where \({\mathbf{DR}}\) is a low-rank matrix, and specifically, it is a more refined or informative similarity matrix after removing noise from the existing similarity matrix.

In order to make \({\mathbf{R}}\) low-rank, a nuclear norm penalty on \({\mathbf{R}}\) is added. At the same time, the \(L_{2,1}\)-norm of the error term \({\mathbf{N}}\) is used to make the noise matrix \({\mathbf{N}}\) sparse. When the final low-rank matrix \({\mathbf{DR}}^{*}\) and sparse matrix \({\mathbf{N}}^{*}\) are calculated, \({\mathbf{DR}}^{*}\) or \({\mathbf{D}} - {\mathbf{N}}^{*}\) is used to describe the completed matrix. Therefore, the convex optimization problem can be defined as follows:

Here, \(|| \cdot ||_{*}\) represents the nuclear norm, \(\omega \in (0,1)\) is the positive weighting parameter and \(|| \cdot ||_{2,1}\) is the noise regularization term.

When solving optimization problems under equality constraints, the ALM method is effective [38]. Therefore, following ALM, an auxiliary variable \({\mathbf{X}}\) is introduced and Eq. (21) can be rewritten as:

Eq. (22) is then converted into an unconstrained problem through the augmented Lagrangian function. The formula is as follows:

where \(\beta > 0\) is the penalty parameter, which is updated by \(\beta = \min (\rho \beta ,\max_{\beta } )\) with growth factor \(\rho\) and upper bound \(\max_{\beta }\), and \(Y_{1}\) and \(Y_{2}\) are the Lagrange multipliers.
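The formulation described in this subsection matches the standard low-rank representation problem. Written out under that assumption, the decomposition, the optimization problem of Eq. (21), its rewritten form in Eq. (22) and the Lagrange function of Eq. (23) would read as follows. The sign convention of the multiplier terms is one common choice and is assumed here.

```latex
\begin{aligned}
&\mathbf{D} = \mathbf{DR} + \mathbf{N}, \\
&\min_{\mathbf{R},\mathbf{N}} \; \|\mathbf{R}\|_{*} + \omega \|\mathbf{N}\|_{2,1}
  \quad \text{s.t.} \quad \mathbf{D} = \mathbf{DR} + \mathbf{N}, \\
&\min_{\mathbf{X},\mathbf{R},\mathbf{N}} \; \|\mathbf{X}\|_{*} + \omega \|\mathbf{N}\|_{2,1}
  \quad \text{s.t.} \quad \mathbf{D} = \mathbf{DR} + \mathbf{N}, \; \mathbf{R} = \mathbf{X}, \\
&L = \|\mathbf{X}\|_{*} + \omega \|\mathbf{N}\|_{2,1}
  + \mathrm{tr}\left(Y_{1}^{T}(\mathbf{D} - \mathbf{DR} - \mathbf{N})\right)
  + \mathrm{tr}\left(Y_{2}^{T}(\mathbf{R} - \mathbf{X})\right) \\
&\qquad + \frac{\beta}{2}\left(\|\mathbf{D} - \mathbf{DR} - \mathbf{N}\|_{F}^{2}
  + \|\mathbf{R} - \mathbf{X}\|_{F}^{2}\right).
\end{aligned}
```

Splitting \({\mathbf{R}}\) from its copy \({\mathbf{X}}\) is what makes each subproblem tractable: the nuclear norm acts only on \({\mathbf{X}}\), the \(L_{2,1}\)-norm only on \({\mathbf{N}}\), and \({\mathbf{R}}\) appears only in quadratic and linear terms.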

The ADM method is used to solve Eq. (23) [39]. ADM is a simple method for decomposable convex optimization problems and is especially useful for large-scale problems. The update iterations for ADM are as follows:

Based on the singular value shrinkage operator [40], \({\mathbf{X}}^{k + 1}\) and \({\mathbf{N}}^{k + 1}\) are represented as follows:

The minimization with respect to \({\mathbf{R}}\) is a least squares problem, and its normal equation is as follows:

where \({\mathbf{I}}\) is the identity matrix.

Then \({\mathbf{X}}\), \({\mathbf{R}}\) and \({\mathbf{N}}\) are updated by changing the Lagrange multipliers \(Y_{1}\) and \(Y_{2}\). Moreover, \(Y_{1}\) and \(Y_{2}\) can be obtained by the following formulas:

The iterations are repeated until the convergence conditions \(||{\mathbf{D}} - {\mathbf{DR}} - {\mathbf{N}}||_{\infty } < \varepsilon\) and \(||{\mathbf{R}} - {\mathbf{X}}||_{\infty } < \varepsilon\) are satisfied, which yields the final low-rank matrix \({\mathbf{R}}^{*}\) and sparse matrix \({\mathbf{N}}^{*}\). Here, \(\varepsilon\) is a very small tolerance (set to \(1 \times 10^{ - 8}\) in this paper). As mentioned above, the refined matrix \({\mathbf{R}}^{*}\) and noise matrix \({\mathbf{N}}^{*}\) can be used to describe a completed matrix in the form of \({\mathbf{D}} \times {\mathbf{R}}^{*}\) or \({\mathbf{D}} - {\mathbf{N}}^{*}\). The specific process of matrix completion is shown in Fig. 5.

Based on the above matrix completion method, the disease semantic similarity matrix \({\mathbf{DS}}\) and miRNA functional similarity matrix \({\mathbf{MF}}\) are used as input matrices to replace matrix \({\mathbf{D}}\), so that we can obtain two refined similarity matrices \({\mathbf{CD}}\) and \({\mathbf{CM}}\), respectively.

The algorithm of Matrix completion is summarized in Algorithm 1.
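The iteration can be sketched as an inexact ALM loop in the style of low-rank representation solvers. This is an illustration under assumptions rather than a reproduction of Algorithm 1: the update order, the starting value of \(\beta\), the values of \(\rho\), \(\omega\) and the upper bound, the iteration cap and the function name `complete_matrix` are all choices made here.

```python
import numpy as np

def complete_matrix(D, omega=0.1, rho=1.1, beta=1e-4, beta_max=1e10,
                    eps=1e-8, max_iter=1000):
    """Solve min ||X||_* + omega*||N||_{2,1} s.t. D = D R + N, R = X (assumed form)."""
    n = D.shape[1]
    R, X, Y2 = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
    N, Y1 = np.zeros(D.shape), np.zeros(D.shape)
    inv = np.linalg.inv(np.eye(n) + D.T @ D)    # constant factor of the normal equation
    for _ in range(max_iter):
        # X-step: singular value thresholding of R + Y2/beta with threshold 1/beta.
        U, s, Vt = np.linalg.svd(R + Y2 / beta, full_matrices=False)
        X = (U * np.maximum(s - 1.0 / beta, 0.0)) @ Vt
        # R-step: closed-form least squares solution.
        R = inv @ (D.T @ (D - N) + X + (D.T @ Y1 - Y2) / beta)
        # N-step: column-wise shrinkage, the proximal operator of the L2,1-norm.
        Q = D - D @ R + Y1 / beta
        norms = np.linalg.norm(Q, axis=0)
        N = Q * (np.maximum(norms - omega / beta, 0.0) / np.maximum(norms, 1e-12))
        # Multiplier and penalty updates, then the convergence test.
        res1 = D - D @ R - N
        res2 = R - X
        Y1 = Y1 + beta * res1
        Y2 = Y2 + beta * res2
        beta = min(rho * beta, beta_max)
        if max(np.abs(res1).max(), np.abs(res2).max()) < eps:
            break
    return R, N

# Toy low-rank similarity-like matrix (illustrative values only).
rng = np.random.default_rng(0)
F = rng.random((12, 3))
D_toy = F @ F.T
R_toy, N_toy = complete_matrix(D_toy)
completed = D_toy @ R_toy                       # completed matrix in the form D x R*
```

The inverse of \({\mathbf{I}} + {\mathbf{D}}^{T} {\mathbf{D}}\) is computed once outside the loop because it does not change between iterations, so each iteration is dominated by a single SVD.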

### Similarity information integrations

Subsequent work is to integrate the completed matrix with existing similarity matrices. Since similarity information integrations of diseases and miRNAs are similar, Fig. 6 only shows the process for integration of miRNA similarity.

The specific integration formulas are as follows:

### WKNKN

WKNKN can be thought of as a voting or integration method: some potential classifiers (nearest neighbors) are aggregated by a (weighted) majority vote, the results of which are used for prediction [41].

In this paper, \({\mathbf{MD}}\) denotes the miRNA-disease association matrix, which only represents the associations between miRNAs and diseases that have been verified experimentally at the current stage. We simply stipulate that if a miRNA is associated with a disease, \({\mathbf{MD}}\left( {m_{i} ,d_{j} } \right)\) is set to 1. However, there are still many undiscovered miRNAs and diseases, and whether they act as a bridge between existing miRNAs and diseases is unknown. Existing miRNAs may be associated with existing diseases through these unknown miRNAs, so treating every zero entry of \({\mathbf{MD}}\) as a confirmed non-association is inappropriate.

Therefore, by estimating these unknown conditions through the correlation of its known neighbors, the WKNKN method preprocesses the matrix \({\mathbf{MD}}\) to get the pre-processed matrix of \({\mathbf{MD}}\) (\({\mathbf{PMD}}\)). If \({\mathbf{MD}}\left( {m_{i} ,d_{j} } \right) = 0\), WKNKN will give \({\mathbf{MD}}\left( {m_{i} ,d_{j} } \right)\) a value from 0 to 1 according to the corresponding similar information of miRNAs and diseases. The specific process of WKNKN is shown in Fig. 7.
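The weighted vote can be sketched as follows, loosely following the published WKNKN procedure [41]. The neighborhood size `K`, the decay factor `eta`, the function names and the toy matrices are assumptions for illustration, and for simplicity every other row or column is treated as a candidate neighbor.

```python
import numpy as np

def _neighbor_vote(Y, S, K, eta):
    """Estimate each row of Y from its K most similar rows, with decaying weights."""
    out = np.zeros(Y.shape, dtype=float)
    for i in range(Y.shape[0]):
        order = np.argsort(-S[i])            # most similar first
        order = order[order != i][:K]        # drop the item itself, keep K neighbors
        weights = eta ** np.arange(len(order)) * S[i, order]
        norm = S[i, order].sum()
        if norm > 0:
            out[i] = weights @ Y[order] / norm
    return out

def wknkn(MD, SM, SD, K=5, eta=0.7):
    """Replace zeros in MD (miRNAs x diseases) with neighbor-based likelihood values."""
    from_mirnas = _neighbor_vote(MD, SM, K, eta)          # vote among similar miRNAs
    from_diseases = _neighbor_vote(MD.T, SD, K, eta).T    # vote among similar diseases
    averaged = (from_mirnas + from_diseases) / 2
    return np.maximum(MD, averaged)          # known associations keep the value 1

# Toy data: 3 miRNAs, 2 diseases (illustrative values only).
MD_toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
SM_toy = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
SD_toy = np.array([[1.0, 0.3], [0.3, 1.0]])
PMD_toy = wknkn(MD_toy, SM_toy, SD_toy, K=2, eta=0.5)
```

Taking the element-wise maximum with the original matrix means that only zero entries are modified, so the experimentally verified associations are preserved.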

### MCCMF for MiRNA-disease association prediction

The CMF method proposed by Shen et al. [45] can effectively predict potential interactions between miRNAs and diseases. In this study, the idea of CMF is used to predict miRNA-disease associations. The specific steps are as follows: firstly, the input miRNA-disease association matrix \({\mathbf{PMD}}\) is decomposed into two low-rank matrices \({\mathbf{A}}\) and \({\mathbf{B}}\) by using singular value decomposition.

where \({\mathbf{U}}\) and \({\mathbf{V}}\) is the unitary matrix. \({\mathbf{S}}\) is a negative real diagonal matrix, and there are k singular values on the diagonal.
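A minimal sketch of this initialization is given below. It assumes the common CMF choice \({\mathbf{A}} = {\mathbf{U}}{\mathbf{S}}^{1/2}\) and \({\mathbf{B}} = {\mathbf{V}}{\mathbf{S}}^{1/2}\) after truncation to the k largest singular values; the exact formula used by the authors is not preserved in this text, and the function name `svd_init` is hypothetical.

```python
import numpy as np

def svd_init(PMD, k):
    """Initialise A and B from a rank-k truncated SVD so that A @ B.T approximates PMD."""
    # np.linalg.svd returns singular values in descending order and V transposed.
    U, s, Vt = np.linalg.svd(np.asarray(PMD, dtype=float), full_matrices=False)
    root = np.sqrt(s[:k])        # split each singular value evenly between A and B
    A = U[:, :k] * root          # shape (n_mirna, k)
    B = Vt[:k].T * root          # shape (n_disease, k)
    return A, B
```

With this choice, \({\mathbf{A}}{\mathbf{B}}^{T}\) is the best rank-k approximation of \({\mathbf{PMD}}\) in the Frobenius norm, which gives the iterative updates a reasonable starting point.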

Secondly, we write the objection function of MCCMF according to the idea of CMF, as follows:

Here, \(|| \cdot ||_{F}\) is the Frobenius norm, and the similarity terms ensure that the feature vectors of similar miRNAs and of similar diseases are similar. \(\lambda_{l}\), \(\lambda_{m}\) and \(\lambda_{d}\) are positive parameters, which are determined by the fivefold cross validation, and \(\lambda_{l} \in \left\{ {2^{ - 2} ,2^{ - 1} ,2^{0} ,2^{1} } \right\}\), \(\lambda_{m} /\lambda_{d} \in \left\{ {2^{ - 3} ,2^{ - 2} ,2^{ - 1} ,2^{0} ,2^{1} ,2^{2} ,2^{3} ,2^{4} ,2^{5} } \right\}\).

Thirdly, we use \(L\) to denote the objective in Eq. (33), and derive two alternating update rules by setting \(\partial L/\partial {\mathbf{A}} = 0\) and \(\partial L/\partial {\mathbf{B}} = 0\).
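For orientation, the objective in its standard CMF form can be written as below. This is a sketch under stated assumptions rather than a reproduction of the authors' equation: \({\mathbf{S}}_{m}\) and \({\mathbf{S}}_{d}\) are assumed names for the integrated miRNA and disease similarity matrices, and \({\mathbf{PMD}}\) is assumed to be the factorization target.

```latex
\min_{\mathbf{A},\mathbf{B}} \;
  \left\| \mathbf{PMD} - \mathbf{A}\mathbf{B}^{T} \right\|_{F}^{2}
  + \lambda_{l} \left( \left\| \mathbf{A} \right\|_{F}^{2} + \left\| \mathbf{B} \right\|_{F}^{2} \right)
  + \lambda_{m} \left\| \mathbf{S}_{m} - \mathbf{A}\mathbf{A}^{T} \right\|_{F}^{2}
  + \lambda_{d} \left\| \mathbf{S}_{d} - \mathbf{B}\mathbf{B}^{T} \right\|_{F}^{2}
```

The first term fits the association matrix, the \(\lambda_{l}\) term penalizes large factors, and the last two terms pull the inner products of the latent vectors toward the miRNA and disease similarities.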

where \({\mathbf{I}}_{k}\) is the \(k \times k\) identity matrix.
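The update rules as they are usually written for CMF are sketched below. This assumes that \(L\) consists of the reconstruction error of \({\mathbf{PMD}}\), a norm penalty on \({\mathbf{A}}\) and \({\mathbf{B}}\) weighted by \(\lambda_{l}\), and similarity-fitting terms weighted by \(\lambda_{m}\) and \(\lambda_{d}\); \({\mathbf{S}}_{m}\) and \({\mathbf{S}}_{d}\) are assumed names for the miRNA and disease similarity matrices.

```latex
\begin{aligned}
\mathbf{A} &= \left( \mathbf{PMD}\,\mathbf{B} + \lambda_{m} \mathbf{S}_{m} \mathbf{A} \right)
              \left( \mathbf{B}^{T}\mathbf{B} + \lambda_{l} \mathbf{I}_{k} + \lambda_{m} \mathbf{A}^{T}\mathbf{A} \right)^{-1} \\
\mathbf{B} &= \left( \mathbf{PMD}^{T}\mathbf{A} + \lambda_{d} \mathbf{S}_{d} \mathbf{B} \right)
              \left( \mathbf{A}^{T}\mathbf{A} + \lambda_{l} \mathbf{I}_{k} + \lambda_{d} \mathbf{B}^{T}\mathbf{B} \right)^{-1}
\end{aligned}
```

Each rule has the factor being updated on both sides, so it is applied as a fixed-point iteration with the previous value on the right-hand side. Exact differentiation of the similarity terms produces an additional factor of 2 on \(\lambda_{m}\) and \(\lambda_{d}\); the form above corresponds to rescaling those two parameters.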

Finally, we update \({\mathbf{A}}\) and \({\mathbf{B}}\) iteratively until they converge. The prediction matrix for miRNA-disease associations is then obtained as \({\mathbf{A}}{\mathbf{B}}^{T}\). The detailed process of MCCMF is shown in Fig. 8.
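The iteration can be sketched as follows. This is an illustrative implementation of the standard CMF alternating updates, not the authors' code: the function name `cmf`, the stopping tolerance `tol`, the iteration cap `max_iter`, and the names `S_m` and `S_d` for the similarity matrices are assumptions, as is the SVD-based initialization with \({\mathbf{U}}{\mathbf{S}}^{1/2}\) and \({\mathbf{V}}{\mathbf{S}}^{1/2}\).

```python
import numpy as np

def cmf(PMD, S_m, S_d, k, lam_l=1.0, lam_m=1.0, lam_d=1.0, max_iter=100, tol=1e-6):
    """Alternating CMF updates; returns latent factors A (miRNAs) and B (diseases)."""
    PMD = np.asarray(PMD, dtype=float)
    # Initialise from a rank-k truncated SVD (assumed: A = U S^(1/2), B = V S^(1/2)).
    U, s, Vt = np.linalg.svd(PMD, full_matrices=False)
    root = np.sqrt(s[:k])
    A, B = U[:, :k] * root, Vt[:k].T * root
    I_k = np.eye(k)
    for _ in range(max_iter):
        A_old, B_old = A, B
        # A = (PMD B + lam_m S_m A) (B^T B + lam_l I_k + lam_m A^T A)^(-1)
        G = B.T @ B + lam_l * I_k + lam_m * (A.T @ A)
        R = PMD @ B + lam_m * (S_m @ A)
        A = np.linalg.solve(G, R.T).T   # G is symmetric, so this equals R @ inv(G)
        # B = (PMD^T A + lam_d S_d B) (A^T A + lam_l I_k + lam_d B^T B)^(-1)
        G = A.T @ A + lam_l * I_k + lam_d * (B.T @ B)
        R = PMD.T @ A + lam_d * (S_d @ B)
        B = np.linalg.solve(G, R.T).T
        # Stop once both factors have essentially stopped changing.
        if np.linalg.norm(A - A_old) + np.linalg.norm(B - B_old) < tol:
            break
    return A, B
```

A prediction matrix is then obtained with `scores = A @ B.T`, and unverified miRNA-disease pairs can be ranked by their scores. Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse; the system matrix is positive definite whenever `lam_l` is positive.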

The algorithm of CMF is summarized in Algorithm 2.

## Availability of data and materials

The datasets that support the findings of this study are available at https://github.com/cuizhensdws.

## References

- 1.
Alshalalfa M, Alhajj R. Using context-specific effect of miRNAs to identify functional associations between miRNAs and gene signatures. BMC Bioinform. 2013;14(12):S1.

- 2.
Lee RC, Feinbaum RL, Ambros V. The *C. elegans* heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 1993;75(5):843–54.

- 3.
Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 2013;42(D1):D68–73.

- 4.
Cheng AM, Byrom MW, Shelton J, Ford LP. Antisense inhibition of human miRNAs and indications for an involvement of miRNA in cell growth and apoptosis. Nucleic Acids Res. 2005;33(4):1290–7.

- 5.
Miska EA. How microRNAs control cell division, differentiation and death. Curr Opin Genet Dev. 2005;15(5):563–8.

- 6.
Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell. 2009;136(2):215–33.

- 7.
Xu P, Guo M, Hay BA. MicroRNAs and the regulation of cell death. Trends Genet. 2004;20(12):617–24.

- 8.
Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005;120(1):15–20.

- 9.
Lu M, Zhang Q, Deng M, Miao J, Guo Y, Gao W, Cui Q. An analysis of human microRNA and disease associations. PLoS ONE. 2008;3(10):e3420.

- 10.
Calin GA, Dumitru CD, Shimizu M, Bichi R, Zupo S, Noch E, Aldler H, Rattan S, Keating M, Rai K, et al. Frequent deletions and down-regulation of micro-RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proc Natl Acad Sci USA. 2002;99(24):15524–9.

- 11.
Wu C, Li M, Hu C, Duan H. Clinical significance of serum miR-223, miR-25 and miR-375 in patients with esophageal squamous cell carcinoma. Mol Biol Rep. 2014;41(3):1257–66.

- 12.
Zhang X, Zhang X, Wang T, Wang L, Zhijun T, Wei W, Yan B, Zhao J, Wu K, Yang A-G, et al. MicroRNA-26a is a key regulon that inhibits progression and metastasis of c-Myc/EZH2 double high advanced hepatocellular carcinoma. Cancer Lett. 2018;426:98–108.

- 13.
Wu Z, Wu Q, Wang C, Wang X, Huang J, Zhao J, Mao S, Zhang G, Xu X, Zhang N. miR-340 inhibition of breast cancer cell migration and invasion through targeting of oncoprotein c-Met. Cancer. 2011;117(13):2842–52.

- 14.
Yang Z, Ren F, Liu C, He S, Sun G, Gao Q, Yao L, Zhang Y, Miao R, Cao Y, et al. DbDEMC: a database of differentially expressed miRNAs in human cancers. BMC Genomics. 2010;11(Suppl 4):S5.

- 15.
Huang Z, Shi J, Gao Y, Cui C, Zhang S, Li J, Zhou Y, Cui Q. HMDD v3.0: a database for experimentally supported human microRNA-disease associations. Nucleic Acids Res. 2019;47(D1):D1013–7.

- 16.
Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, Li M, Wang G, Liu Y. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res. 2008;37:D98-104.

- 17.
Chen X, Xie D, Zhao Q, You Z-H. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2019;20(2):515–39.

- 18.
Zou Q, Li J, Song L, Zeng X, Wang G. Similarity computation strategies in the microRNA-disease network: a survey. Brief Funct Genomics. 2015;15(18):55–64.

- 19.
Jiang Q, Hao Y, Wang G, Juan L, Zhang T, Teng M, Liu Y, Wang Y. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst Biol. 2010;4(1):S2.

- 20.
Li X, Wang Q, Zheng Y, Lv S, Ning S, Sun J, Huang T, Zheng Q, Ren H, Xu J, et al. Prioritizing human cancer microRNAs based on genes’ functional consistency between microRNA and cancer. Nucleic Acids Res. 2011;39(22):e153–e153.

- 21.
Xiao Q, Luo J, Liang C, Cai J, Ding P. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics. 2017;34(2):239–48.

- 22.
Chen X, Yin J, Qu J, Huang L. MDHGI: matrix decomposition and heterogeneous graph inference for miRNA-disease association prediction. PLoS Comput Biol. 2018;14(8):e1006418.

- 23.
Mørk S, Pletscher-Frankild S, Palleja A, Gorodkin J, Jensen L. Protein-driven inference of miRNA-disease associations. Bioinformatics (Oxford, England). 2013;30(3):392–7.

- 24.
Chen H, Zhang Z. Similarity-based methods for potential human microRNA-disease association prediction. BMC Med Genomics. 2013a;6:12.

- 25.
Gao M-M, Cui Z, Gao Y-L, Liu J-X, Zheng C-H. Dual-network sparse graph regularized matrix factorization for predicting miRNA-disease associations. Mol Omics. 2019;15(2):130–7.

- 26.
Gao Y-L, Cui Z, Liu J-X, Wang J, Zheng C-H. NPCMF: nearest profile-based collaborative matrix factorization method for predicting miRNA-disease associations. BMC Bioinform. 2019;20(1):353.

- 27.
Yin M-M, Cui Z, Gao M-M, Liu J-X, Gao Y-L. LWPCMF: logistic weighted profile-based collaborative matrix factorization for predicting MiRNA-disease associations. IEEE/ACM Trans Comput Biol Bioinform. 2019. https://doi.org/10.1109/TCBB.2019.2937774.

- 28.
Chen H, Zhang Z, Feng D. Prediction and interpretation of miRNA-disease associations based on miRNA target genes using canonical correlation analysis. BMC Bioinform. 2019;20(1):404.

- 29.
Xu J, Li C-X, Lv J-Y, Li Y-S, Xiao Y, Shao T-T, Huo X, Li X, Zou Y, Han Q-L, et al. Prioritizing candidate disease miRNAs by topological features in the miRNA target-dysregulated network: case study of prostate cancer. Mol Cancer Ther. 2011;10(10):1857.

- 30.
Chen H, Zhang Z. Prediction of associations between OMIM diseases and microRNAs by random walk on OMIM disease similarity network. Sci World J. 2013b;2013:204658.

- 31.
Chen X, Yan CC, Zhang X, Li Z, Deng L, Zhang Y, Dai Q. RBMMMDA: predicting multiple types of disease-microRNA associations. Sci Rep. 2015;5:13877.

- 32.
Chen X, Huang L. LRSSLMDA: Laplacian regularized sparse subspace learning for MiRNA-disease association prediction. PLoS Comput Biol. 2017;13(12):e1005912.

- 33.
Chen X, Wang L, Qu J, Guan N-N, Li J-Q. Predicting miRNA-disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.

- 34.
Chen X, Xie D, Wang L, Zhao Q, You Z-H, Liu H. BNPMDA: bipartite network projection for MiRNA-disease association prediction. Bioinformatics. 2018;34(18):3178–86.

- 35.
Chen X, Zhu C-C, Yin J. Ensemble of decision tree reveals potential miRNA-disease associations. PLoS Comput Biol. 2019;15(7):e1007209.

- 36.
Ding X, Xia JF, Wang YT, Wang J, Zheng CH. Improved inductive matrix completion method for predicting microRNA-disease associations. In: Huang DS, Jo KH, Huang ZK, editors. Intelligent computing theories and application. ICIC 2019. Lecture notes in computer science. Cham: Springer; 2019. vol. 11644, p. 247–255. https://doi.org/10.1007/978-3-030-26969-2_23.

- 37.
Li J, Zhang S, Liu T, Ning C, Zhang Z, Zhou W. Neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction. Bioinformatics. 2020;36(8):2538–46.

- 38.
Lin Z, Chen M, Ma Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. 2010. arXiv preprint arXiv:1009.5055.

- 39.
Yang J-F, Yuan X-M. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Math Comput. 2013;82(281):301–29.

- 40.
Cai J-F, Candès EJ, Shen Z. A Singular value thresholding algorithm for matrix completion. SIAM J Optim. 2010;20:1956–82.

- 41.
Ezzat A, Zhao P, Wu M, Li X, Kwoh C. Drug-target interaction prediction with graph regularized matrix factorization. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(3):646–56.

- 42.
Ezzat A, Wu M, Li X-L, Kwoh C-K. Computational prediction of drug–target interactions using chemogenomic approaches: an empirical survey. Brief Bioinform. 2018;20(4):1337–57.

- 43.
Xie G, Fan Z, Sun Y, Wu C, Ma L. WBNPMD: weighted bipartite network projection for microRNA-disease association prediction. J Transl Med. 2019;17:322.

- 44.
Chen X, Yan G-Y. Semi-supervised learning for potential human microRNA-disease associations inference. Sci Rep. 2014;4:5501.

- 45.
Shen Z, Zhang Y-H, Han K, Nandi A, Honig B, Huang D-S. miRNA-disease association prediction with collaborative matrix factorization. Complexity. 2017;2017:1–9.

- 46.
Xie B, Ding Q, Han H, Wu D. MiRCancer: A microRNA-cancer association database constructed by text mining on literature. Bioinformatics (Oxford, England). 2013;29:638–44.

- 47.
Tazawa H, Kagawa S, Fujiwara T. MicroRNAs as potential target gene in cancer gene therapy of gastrointestinal tumors. Expert Opin Biol Ther. 2011;11:145–55.

- 48.
Montoya V, Fan H, Bryar P, Weinstein J, Mets M, Feng G, Martin J, Martin A, Jiang H, Laurie N. Novel miRNA-31 and miRNA-200a-mediated regulation of retinoblastoma proliferation. PLoS ONE. 2015;10:e0138366.

- 49.
Zhang X, Liu S, Hu T, Liu S, He Y, Sun S. Up-regulated microRNA-143 transcribed by nuclear factor kappa B enhances hepatocarcinoma metastasis by repressing fibronectin expression. Hepatology (Baltimore, MD). 2009;50:490–9.

- 50.
Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, Cui Q. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2013;42:D1070–4.

- 51.
Wang D, Wang J, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics (Oxford, England). 2010;26:1644–50.

- 52.
Chen H, Guo R, Li G, Zhang W, Zhang Z. Comparative analysis of similarity measurements in miRNAs with applications to miRNA-disease association predictions. BMC Bioinform. 2020;21(1):176.

- 53.
Xuan P, Han K, Guo M, Guo Y, Li J, Ding J, Liu Y, Dai Q, Li J, Teng Z, et al. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS ONE. 2013;8:e70204.

- 54.
van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics. 2011;27(21):3036–43.

- 55.
Chen X, Yan G-Y. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.

- 56.
Sheng-Peng Y, Liang C, Xiao Q, Li GH, Ding P, Luo JW. MCLPMDA: a novel method for miRNA-disease association prediction based on matrix completion and label propagation. J Cell Mol Med. 2018;23:1215–27.

## Acknowledgements

Not applicable.

## Funding

Publication costs are funded by the National Science Foundation of China under Grant Nos. 61872220 and 61702299.

## Author information

### Contributions

TRW and MMY jointly contributed to the design of the study. TRW designed and implemented the MCCMF method, performed the experiments, and drafted the manuscript. XZK participated in the design of the study and performed the statistical analysis. YLG contributed to the data analysis. CNJ and JXL gave computational advice for the project and participated in designing evaluation criteria. All authors read and approved the final manuscript.

## Ethics declarations

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

## About this article

### Cite this article

Wu, T., Yin, M., Jiao, C. *et al.* MCCMF: collaborative matrix factorization based on matrix completion for predicting miRNA-disease associations.
*BMC Bioinformatics* **21**, 454 (2020). https://doi.org/10.1186/s12859-020-03799-6

### Keywords

- MiRNA-disease association prediction
- Matrix completion
- Weight K Nearest Known Neighbors
- Matrix factorization