Bipartite graph-based collaborative matrix factorization method for predicting miRNA-disease associations

Background With the rapid development of various advanced biotechnologies, researchers in related fields have realized that microRNAs (miRNAs) play critical roles in many serious human diseases. However, experimental identification of new miRNA–disease associations (MDAs) is expensive and time-consuming. Practitioners have shown growing interest in methods for predicting potential MDAs. In recent years, an increasing number of computational methods for predicting novel MDAs have been developed, making a huge contribution to the research of human diseases and saving considerable time. In this paper, we proposed an efficient computational method, named bipartite graph-based collaborative matrix factorization (BGCMF), which is highly advantageous for predicting novel MDAs. Results By combining two improved recommendation methods, a new model for predicting MDAs is generated. Based on the idea that some new miRNAs and diseases do not have any associations, we adopt the bipartite graph based on the collaborative matrix factorization method to complete the prediction. The BGCMF achieves a desirable result, with AUC of up to 0.9514 ± (0.0007) in the five-fold cross-validation experiments. Conclusions Five-fold cross-validation is used to evaluate the capabilities of our method. Simulation experiments are implemented to predict new MDAs. More importantly, the AUC value of our method is higher than those of some state-of-the-art methods. Finally, many associations between new miRNAs and new diseases are successfully predicted by performing simulation experiments, indicating that BGCMF is a useful method to predict more potential miRNAs with roles in various diseases.

lin-4, was discovered by Victor Ambros et al. [2]. After seven years, biological researchers discovered the second miRNA, let-7 [3]. As miRNAs are increasingly identified as playing crucial roles, researchers have begun to focus more attention on identifying miRNAs.
Studies have found that miRNAs are crucial components in cells and can play roles in many important biological processes, including haematopoiesis, cell proliferation, development, differentiation, apoptosis, cell ageing, viral infection, embryonic development and organ formation [4][5][6][7]. Mutated and disordered miRNAs will lose the ability to control their target genes, leading to the development of various complex human diseases, such as cardiovascular diseases, nervous system diseases, tumours, metabolic diseases, and autoimmune diseases [8,9]. As an example, a miR-133b defect is easily observed in the midbrain of patients with Parkinson's disease; miR-133b is thought to have a regulatory effect on the maturation and function of midbrain dopamine neurons [10]. In addition, Gao et al. found that the expression of miR-155 in the serum of lung cancer patients was much higher than that in normal samples by experimental PCR [11]. Furthermore, Takamizawa et al. have proved that the homology of let-7 is significantly reduced in the process of lung cancer [12]. However, discovering meaningful associations between miRNAs and diseases is a time-consuming process. Therefore, it is urgent to develop fast and efficient computational methods for predicting miRNA-disease associations.
In the last decade, a large number of methods and models have been proposed to identify potential relationships between miRNAs and diseases [13,14]. These methods and models have mainly focused on solving the above problem by machine learning, network mining, combinatorial optimization, and related approaches [15][16][17]. For example, Jiang et al. used a support vector machine to extract data on positive samples from negative samples. This method extracted features from miRNA target data and phenotypic similarity data and achieved favourable results [18]. Chen et al. applied the random walk algorithm with a restart, which is also a classic network-based prediction model, to miRNA-disease association (MDA) prediction [19]. In 2013, Qabaja et al. proposed a protein-protein interaction network based on the lasso regression model. They first used the lasso regression model to identify miRNAs associated with markers of diseases and then integrated biological networks and multisource data to define the gene signatures of miRNAs and diseases. Finally, this method achieved good predictive performance [20]. Xuan et al. proposed a prediction method based on the K-nearest neighbour algorithm. This method constructs a similarity network by integrating the miRNA-disease phenotype similarity network, the family information of miRNAs, and the relationships between diseases and miRNAs identified by biological experiments. The disadvantage of this method is that it cannot be applied to the association prediction of diseases without any known related miRNAs [21].
In 2014, Chen et al. proposed a semi-supervised algorithm named regularized least squares (RLSMDA) to predict potential disease-miRNA associations. The advantage of this method is that it does not require negative MDAs information and can be applied to the prediction of isolated diseases. In 2017, Chen et al. proposed a predictive model for the associations between miRNAs and diseases based on Laplacian regularized sparse subspace learning (LRSSLMDA) [22]. They used Laplacian regularization to keep local information and then used the L 1 norm to select important miRNA/disease features, further improving the precision of the algorithm. Chen et al. proposed a computational method, ensemble learning and link prediction for miRNA-disease association (ELLPMDA), which combines both machine learning methods and similarity-based algorithms. This method is based on a globally similar measurement method for diseases without any associated miRNAs [23]. Algorithms such as neural networks are also used to predict miRNA-disease associations. In 2017, Fu et al. proposed a deep integration model (DeepMDA), which uses a stacktype autoencoder to extract high-level features from similar information and then predicts disease-miRNA associations through a three-layer neural network [24]. In addition, matrix factorization is also used to predict the association between miR-NAs and diseases. In 2019, Gao et al. proposed a computational model, dual-network sparse graph regularized matrix factorization (DNSGRMF), for predicting miRNAdisease associations by integrating the miRNA functional similarity matrix, the disease semantic similarity matrix and Gaussian kernel similarities with the addition of the L 2,1 norm. They used collaborative matrix factorization to predict miRNA-disease associations [25]. Later, a more efficient miRNA-disease associations prediction model, nearest profile-based collaborative matrix factorization (NPCMF), was proposed by Gao et al., which integrates Gaussian kernel similarity and the nearest profile, taking the nearest neighbour information into account. Finally, DNSGRMF and NPCMF achieved excellent predictive accuracy based on fivefold cross-validation [26].
Although there are many advanced methods to predict MDAs, they still have some shortcomings. For instance, several methods trigger bias to miRNAs (diseases). Moreover, the small number of known relationships cannot be utilized to predict new miRNAs and diseases. More importantly, in some methods, the nearest neighbour information of the miRNA and the disease is not considered. To address the limitations of previous methods, a computational method of bipartite graphs based on collaborative matrix factorization (BGCMF) is proposed. The specific contributions of our method include the following two aspects: In our method, the miRNA similarity matrix and disease similarity matrix are constructed by combining Gaussian interaction kernel similarity, miRNA functional similarity, and disease semantic similarity. This could help to infer potential miRNA-disease associations. It is worth noting that the bipartite graph algorithm (BG) is introduced to our method for maximum consideration of neighbouring information for miRNAs and diseases. Then, it calculates a weighted average of similarities between miR-NAs and diseases to eliminate the bias on prediction.
In addition, there are quite a few missing associations in the original matrix Y , and Weight K Nearest Known Neighbours (WKNKN) [27] is implemented as a pre-treatment step to minimize the error. Moreover, five-fold cross-validation is exploited in our method to evaluate our experimental results. We also introduce simulation experiments to further evaluate the performance of our method. Overall, the results demonstrate that our BGCMF method is superior to other existing advanced methods.

Human miRNA-disease associations association dataset
In this study, the gold standard human miRNA-disease association dataset was downloaded from the Human microRNA Disease Database (HMDD) v2.0 [28]. HMDD v2.0 includes 495 miRNAs, 383 diseases and 5430 experimentally verified miRNA-disease associations. In this paper, the dataset includes three matrices: the adjacency matrix Y , the miRNA functional similarity matrix S m , and the disease semantic similarity matrix S d . The information for the dataset is listed in Table 1.
We use an adjacency matrix Y ∈ R n×m to describe the associations between miRNAs and diseases that have been validated, where n represents the number of miRNAs and m represents the number of diseases. When M(i) and D j are associated, Y M(i), D j is set to 1; otherwise, Y M(i), D j is set to 0. The following is the expression of the matrix Y:

Performance evaluation
In this study, we implement fivefold cross-validation to evaluate the prediction performance of each method. The principle of fivefold cross-validation is to randomly divide the known miRNA-disease associations into five subsets, one of which is used as a test set, and the rest are used as a training set. Then, five models are trained by cycling five times, and the average of the five evaluation results is calculated as the final score of the model. Finally, fivefold cross-validation was performed 100 times, and the final score was taken as the average. It is worth noting that in our BGCMF method, WKNKN is used as a preprocessing procedure to evaluate unknown MDAs. At the same time, the nearest neighbour information is applied to our method, and it has the advantage of taking into account the nearest neighbour information and improving the accuracy of the prediction.
To verify the effect of the prediction, the area under the curve (AUC) value was applied in this study, which is widely used in previous studies. Therefore, a receiver operating characteristic (ROC) curve was obtained. In this curve, the x-axis is the false-positive rate (FPR, specificity), and the y-axis is the true positive rate (TPR, sensitivity). The definitions for calculating specificity and sensitivity are as follows: (2) where TP represents the number of positive samples, FP represents the number of falsepositive samples, TN represents the number of negative samples and FN represents the number of false-negative samples. An AUC value between 0.5 and 1 is considered feasible, with AUC = 1 representing the best predictive performance and AUC = 0.5 representing stochastic prediction.

Comparison with other methods
Based on experimentally confirmed associations between diseases and miRNAs, fivefold cross-validation is implemented in this paper to evaluate the predictive accuracy of BGCMF. We compared our method with other advanced methods, such as HDMP [21], CMF [29], ELLPMDA [23], DNSGRMF [25] and NPCMF [26]. The experimental results are listed in Table 2. More intuitively, the ROC curves are shown in Fig. 1  From the above statistical results, our method is 11.72% higher than the lowest value of HDMP. The BGCMF is 2.1% and 0.85% higher than the values of DNSGRMF and NPCMF, respectively. Therefore, we can conclude from the experimental results that the BGCMF has excellent predictive performance.

Sensitivity analysis from WKNKN
If a miRNA or a disease is known, it must have one or more associations. However, there are many missing unknown associations in the interaction matrix Y . WKNKN pre-processing is used to estimate the interaction possibilities to minimize the error. There are two parameters K and p in WKNKN, where K represents the number of known nearest neighbours and p represents the decay term for the neighbour. The value of p is between 0 and 1. As shown in Fig. 2, when K = 7 and p = 0.6 , the AUC value tends to be stable.

Case study
The previous sections verify that our proposed method has outstanding predictive performance. Colon, prostate, and kidney are selected in the case study to further illustrate the superior performance of our BGCMF. The known miRNA-disease associations in the standard dataset are used as a training set to predict potential disease-associated miRNAs. Our specific process first uses the BGCMF method to predict these three diseases, and the choice of parameters is as described above. Then, the predicted score matrix is compared to the original miRNA-disease association matrix. The associations of predicted scores with changes are filtered and compared. Finally, we validated whether the predicted new miRNA-disease associations exist in the updated dbDEMC [30], miR2Disease [31] and the HMDD v3.2 [32].

Fig. 2 Sensitivity analysis from WKNKN
Colon neoplasms, also known as bowel cancer, are one of the three most common cancers, accounting for 10% of all cancer cases. Due to the low detection rate of colon tumours in the early stages, it creates a huge threat to people's lives. New biomarkers may help to improve the early detection of colon tumours. Recent studies have found that miRNA dysregulation can be used as a marker for colon tumour diagnosis in colon neoplasm cells. For example, miR-145 and miR-126 can inhibit the growth of colon neoplasm cells, and an increasing number of miRNAs associated with colon neoplasms have been found to be of great significance for improving the early detection of colon neoplasms. Here, the first type of case is colon neoplasms. In the dataset used in our experiments, there are 5 existing associations between miRNAs and colon neoplasms. After the simulation experiment is performed, the top 30 miRNAs of colon neoplasms are extracted, and all existing associations are successfully predicted. At the same time, 25 novel MDAs are predicted. Among these 25 new miRNAs, all of the miRNAs are validated by dbDEMC, miR2Disease and HMDD v3.2. More importantly, fourteen of them are confirmed by the above three databases. For example, in 2007, Shi et al. found that the target gene of miR-145 is insulin receptor substrate-1 and can inhibit the growth of colon cancer cells [33]. In 2013, Wan et al. identified that patients with colon cancer with high expression of miR-199a-3p had a lower survival rate [34]. Table 3 lists the simulation results of colon tumours, and the known associations are shown in bold. I, II, III represent dbDEMC, miR2Disease and the HMDD v3.2.
The next case is prostate neoplasms, which are the third most common cause of male cancer-related death. In our simulation experiments, we also select the top 30 miRNAs with the highest correlation scores, and seven known miRNAs associated with prostate neoplasms are successfully predicted. Among the 23 newly predicted miRNAs associated with prostate neoplasms, miR143, miR21, and miR126 are the highest ranked miRNAs, as confirmed by three databases at the same time. Only miR-200b is not confirmed in either dbDEMCs or miR2Disease associated with prostate neoplasms, but it is confirmed by HMDD v3.2. Although a large number of miRNAs have been discovered,  Table 4 lists the details of the experiment and the existing associations. The last case is kidney neoplasms. Kidney neoplasms, also known as kidney cancer, are cancers that originate in kidney cells and include several different types of tumours. Kidney neoplasms account for 3% of adult malignancies [35]. According to previous studies, a large number of miRNAs have been examined for kidney tumours. For example, circulating levels of miR-15b in patients with advanced kidney cancer are significantly reduced [36]. In addition, overexpression of miR-210 leads to the amplification of renal cancer cell centrosomes [37]. In this case, there are 9 miRNAs that have associations with kidney neoplasms. Nine known miRNAs are successfully predicted in our results. Simultaneously, 35 new miRNAs are predicted. Among the 35 new miRNAs, 29 miRNAs connected with kidney neoplasms have discovered experimental proof from three databases. For example, studies have found that miR-17 is carcinogenic and overexpressed in renal cell carcinoma. Although the predicted novel miRNAs, including miR205, miR125b, miR-7, miR-221, miR-31, and miR-92a, are unconfirmed by miR2Disease or dbDEMC, these miRNAs are closely associated with kidney neoplasms. Table 5 lists the simulation results of kidney neoplasms, and the known associations are shown in bold. To show the simulation experiment of BGCMF more intuitively, Cytoscape software was used to map the three predicted disease-miRNA association networks. As shown in Fig. 3, large ellipses indicate the three diseases, and small ellipses indicate the predicted miRNAs.

Discussion
MiRNAs are involved in many physiological processes, such as organismal development, cell differentiation and proliferation, apoptosis, hormone secretion, and lipid metabolism. miRNAs are closely related to the occurrence and development of tumours, metabolic diseases, stress diseases, and cardiovascular diseases. With the development of miRNA bioinformatics, direction prediction and other advances in biological science   Fig. 3 Visualized miRNA-disease associations association network of case study and technology, a large number of miRNAs have been discovered and verified. However, validating the associations between miRNAs and diseases through biological experiments is time-consuming and expensive. Therefore, it is absolutely necessary to develop new and effective computational models to predict potential associations between miR-NAs and diseases. In this paper, an efficient and useful method to predict potential MDAs is developed. Our method can be divided into three parts. The entire calculation process is described in detail in Fig. 4. The first step in this method is to process the data for subsequent prediction. Then, we use the CMF algorithm and BG algorithm to make predictions based on the processed data separately. Finally, we combine the prediction results of the two algorithms to obtain the final prediction matrix. The BGCMF achieves an overall result that is better than the two results given by single models.

Conclusions
The success of our method can be mainly attributed to several factors. First, we combined Gaussian interaction profile similarity with miRNA functional similarity and disease semantic similarity to obtain accurate information about miRNA pairs and disease pairs. In addition, WKNKN is an essential pretreatment process. It is worth noting that our largest contribution is combining the bipartite graph algorithm with the collaborative matrix Fig. 4 Flow chart of BGCMF in novel MDAs prediction by integrating miRNA function similarity, disease semantic similarity, and known miRNA-disease associations. Then combine the prediction results of the BG algorithm and CMF algorithm to obtain the final prediction matrix factorization model. This allows for maximum consideration of neighbouring information for miRNAs and diseases, preventing the network similarity of miRNAs and diseases from being affected. Finally, both fivefold cross-validation results and three kinds of case studies on colon neoplasms, prostate neoplasms, and kidney neoplasms demonstrated the reliable prediction performance of BGCMF.
In the future, an increasing number of useful methods will be applied to predict potential MDAs. We will continue to study this aspect of research. At the same time, more meaningful datasets are being published in online bio-databases. Therefore, our next work will focus on developing effective methods to predict novel miRNA-disease associations and to evaluate the effectiveness of the method on diverse datasets.

Method
This novel method is named bipartite graph-based collaborative matrix factorization (BGCMF). The method is divided into two major steps. First, the Gaussian interaction profile kernel (GIP) and nearest neighbour profile (NP) are introduced in our method to process the original miRNA matrix and the disease matrix to obtain their network information. At the same time, WKNKN is used to handle the original interaction matrix Y to minimize the error. Second, the BG algorithm is implemented to obtain prediction matrix Y 1 and collaborative matrix factorization (CMF) to obtain the prediction matrix Y 2 , respectively. Finally, the prediction matrix Y predict is obtained by combining our two improved models. The flowchart of BGBMF is shown in Fig. 4.

MiRNA functional similarity
With the hypothesis that functionally similar miRNAs tend to be associated with phenotypically similar diseases, a computing method of miRNA functional similarity was presented by Wang et al. [10]. The functional similarity score matrix can be downloaded from http:// www. cuilab. cn/ files/ images/ cuilab/ misim. zip. Here, the obtained functional similarity for miRNA is denoted by S m ∈ R n×n , and the value of entity S M(i), M j measures the closeness between miRNA M(i) and M j .

Disease semantic similarity
A directed acyclic graph (DAG) is proposed to describe the relationships among various diseases. In addition, the disease D can be described by DAG(D) = (D, T (D), E(D)) . T (D) is the node set and represents both its ancestor nodes and D itself. E(D) is used to represent all direct edges between child nodes and parent nodes. The semantic similarity value of disease D is as follows: where represents the semantic contribution factor and D1 D (d) is the contribution of disease d . For each disease d , its contribution to itself is 1, and the contribution of its child node decreases with increasing distance. Obviously, when the two diseases have a larger shared part in their DAGs , they will obtain a greater similarity score. SV (d i ) and SV d j represent the semantic similarity values of d i and d j , respectively. Thus, the semantic similarity score of the two diseases d i and d j can be calculated as follows:

Gaussian Interaction Profile Kernel for miRNAs and diseases
According to the previous work [38], the method is based on the idea that it relies on the topological structure of known miRNA-disease associations in a network to compute the similarity of diseases and miRNAs [26]. Here are two miRNAs m i and m j and two diseases d i and d j . The network similarity between them can be calculated with the following formulas: where γ is an adjustable parameter that can control the bandwidth of the kernel. where α is an adjustable parameter range in [0, 1], and K m represents the miRNA integrated similarity matrix, which is a linear combination of the Gaussian interaction profile kernel similarity for miRNA GIP miRNA and the miRNA functional matrix S m . Similar to K m , K d represents the disease integrated similarity matrix, which is a linear combination of the Gaussian interaction profile kernel similarity for disease GIP disease and the disease semantic matrix S d . When α is equal to 0.5, BGCMF achieves the highest AUC value. The sensitivity analysis of α is shown in Fig. 5.

Bipartite graph method
Based on the assumption that miRNAs that are similar will interact with similar diseases, the interaction profile for a new miRNA candidate could be inferred from the known interactions of their neighbours. MiRNAs with large similarities to new potential miRNAs are said to be their neighbours. Therefore, we introduce the nearest profile (NP) to our method [39]. Below are the formulas for calculating a new miRNA m i and a new disease d i .
where m nearest and d nearest are the miRNAs most similar to m i and the diseases most similar to d i , respectively. N m (m i ) and N d (d i ) are the association profiles of the miRNAs and diseases, respectively. The NP process in this method can be divided into four steps. First, remove the self-similarity of miRNA matrices K m and K d . Next, obtain the nearest neighbour for each miRNA and disease. Then, ignore all miRNA similarities and disease similarities. Finally, the miRNA nearest neighbour matrix N m and disease nearest neighbour matrix N d can be obtained.

Weighted profile
The weighted profile (WP) is proposed as a simple predictive model in [39]. The idea of the weighted profile is to perform a similarity-weighted average of all other miRNAs or diseases to obtain the prediction matrix. For instance, the WP for a new miRNA m i and a new disease are computed as: where N m and N d are the nearest neighbour matrices we construct for miRNA and disease. Y m j and Y d j are association matrices of miRNA m j and disease d j , respectively. First, the BG algorithm is used to obtain the neighbour information about miRNAs and diseases, and then predictions from both miRNA and disease sides are averaged to obtain the final prediction matrix:

BGCMF for MiRNA-disease associations association prediction
The traditional collaborative matrix factorization (CMF) method is effective in predicting the underlying interactions between miRNAs and diseases [29]. The objective function of CMF method is defined as: where l , d , and t are non-parameters and �·� 2 F represents the Frobenius norm. In this formula, the first item is used to find the low-rank matrices A and B of the reconstructed Y . The second item is the Tikhonov regularization term. The last two items are regularization terms that demand potential feature vectors of similar miRNAs/diseases to be similar and potential feature vectors of dissimilar miRNAs/diseases to be dissimilar. However, traditional CMF does not take into account the network relationship between the miRNA and the disease, which will reduce the accuracy of predicting MDAs. Therefore, we introduce the Gaussian kernel similarity K m of miRNA and the K d of disease into CMF [40]. The objective function can be rewritten as: where �·� 2 F is the Frobenius norm. l , d and t represent the positive parameters. In this study, the setting of the three parameters is done by cross-validation. The grid search is adopted to select the optimal parameters among these values: l ∈ 2 −2 , 2 −1 , 2 0 , 2 1 , 2 2 , d / t ∈ 2 −6 , 2 −5 , 2 −4 , 2 −3 , 2 −2 , 2 −1 , 2 0 , 2 1 , 2 2 . The association matrix Y is decomposed into two low-rank matrices A and B , where Y ≈ AB T . Tikhonov regularization is adopted to minimize the norms of both A and B . The roles of the third and fourth terms are to minimize the squared error S m ≈ AA T and S d ≈ BB T , respectively.

Initialization of A and B
In the CMF method, the first step is to initialize the adjacency matrix Y . We use singular value decomposition (SVD) to decompose the input matrix Y ∈ R n×m into U n×k , S k×k and V k×m . Then, matrix A and matrix B are obtained by the following formula: where S is a diagonal matrix and k represents the maximum number of singular values.