NEMPD: A Network Embedding-Based miRNA-Protein-Disease Network Method for the miRNA-Disease Association Prediction

Background: As an important class of non-coding RNA discovered in recent years, microRNAs (miRNAs) play an important role in a series of life processes and are closely associated with a variety of human diseases. Hence, the identification of potential miRNA-disease associations can make great contributions to the research and treatment of human diseases. However, to our knowledge, many of the existing state-of-the-art computational methods use only the known associations between miRNAs and diseases to predict potential ones, without considering the interactions or associations of miRNAs and diseases with other types of molecules. Results: In this paper, a network embedding-based method built on a tripartite miRNA-protein-disease network (NEMPD) is proposed for the prediction of miRNA-disease associations. Firstly, a tripartite miRNA-protein-disease network is created by integrating known miRNA-protein and protein-disease associations, and the network representation method Learning Graph Representations with Global Structural Information (GraRep) is used to obtain the behavior information (associations with proteins in the network) of miRNAs and diseases. Secondly, the behavior information of miRNAs and diseases is combined with their attribute information (disease semantic similarity and miRNA sequence information) to represent miRNA-disease pairs. Thirdly, the prediction model is established based on the known miRNA-disease pairs and the Random Forest algorithm. Under five-fold cross-validation, the prediction accuracy, sensitivity, and AUC of NEMPD are 85.41%, 80.96%, and 91.58%, respectively. Furthermore, the performance of NEMPD was also validated by case studies. Among the top 50 predicted disease-related miRNAs, 48 (colon neoplasms), 47 (breast neoplasms), and 47 (lung neoplasms) were confirmed by two other databases.
Conclusions: NEMPD performs well in predicting potential associations between miRNAs and diseases and has great potential in the field of miRNA-disease association prediction.


Background
MicroRNAs (miRNAs) are a kind of endogenous non-coding RNA with a length of ~22 nt, which regulate the expression of target genes by binding to target mRNAs through complementary sequence pairing [1]. Because miRNA sequences are very short and are only expressed in specific tissues or cells at specific stages, miRNAs were long overlooked and were often called the dark matter of life [2]. In 1993, Lee et al. [3] identified the first miRNA gene, lin-4, in Caenorhabditis elegans. Since then, numerous studies have shown that miRNAs play an important role in life processes, including cell metabolism, proliferation, apoptosis, and development [4][5][6][7][8]. Besides, miRNAs are also involved in the occurrence and development of many human diseases, such as prostatic neoplasms and breast neoplasms [9][10][11]. Therefore, identifying potential miRNA-disease associations is crucial in the research and treatment of human diseases. Traditional experimental methods identify miRNA-disease associations with high accuracy, but they are limited by their small scale, long duration, and high cost. Hence, using computational methods to predict potential associations has attracted more and more researchers.
In the past few years, many computational methods have been developed to predict miRNA-disease associations. For example, Chen et al. [12] developed a model named RBMMMDA, which utilizes the restricted Boltzmann machine to predict multi-type associations between miRNAs and diseases. This method can not only discover new potential associations between miRNAs and diseases but also indicate the corresponding association types. Chen et al. [13] proposed a novel method based on heterogeneous graph inference (HGIMDA). This approach takes advantage of miRNA functional similarity, disease semantic similarity, Gaussian interaction profile kernel similarity, and known miRNA-disease associations. It overcomes the limitations of traditional methods and can be used for new miRNAs and diseases without any known associations. You et al. [14] constructed a heterogeneous graph and applied the depth-first search algorithm to it (PBMDA). Compared with previous models, this method has better reliability and accuracy. Chen et al. [15] proposed a method based on within and between scores, named WBSMDA, which can be used for diseases without any known related miRNAs. Wang et al. [16] proposed a logistic model tree method (LMTRDA) that combines miRNA sequence information, miRNA functional similarity, and disease semantic similarity. Li et al. [17] designed a novel method (MCMDA) that predicts potential miRNA-disease associations by updating the adjacency matrix of known associations. Zheng et al. [18] developed a machine learning-based prediction model that combines Gaussian interaction profile kernel similarity, disease semantic similarity, miRNA functional similarity, and miRNA sequence information. It uses an auto-encoder neural network (AE) for feature extraction and a random forest for training. Zheng et al. [19] developed a novel model based on the distance sequence similarity method (DBMDA). This method utilizes the regional distance to calculate the global similarity and is implemented through a chaotic game representation algorithm based on miRNA sequences, which provides a new idea for the field of miRNA-disease prediction.
At present, most existing state-of-the-art algorithms use only the known miRNA-disease associations for potential miRNA-disease association prediction. However, diseases are mainly caused by the disturbance of a complex of multiple interacting biomolecules, rather than the abnormality of a single biomolecule. In addition, the functionally dependent molecular components in human cells form a complex biological network, in which proteins are an important part of human tissues and cells. Protein-miRNA associations and protein-disease associations have been confirmed by many previous experiments [20][21][22]. Therefore, we propose a novel method to predict miRNA-disease associations based on the miRNA-protein-disease network and the GraRep network embedding method (NEMPD). More specifically, we firstly constructed and comprehensively analyzed a tripartite miRNA-protein-disease network by integrating the miRNA-protein and protein-disease associations (see Fig. 1). Secondly, a network representation method is used to obtain the embedding representation of nodes from the network while preserving the network properties. In recent years, network embedding methods such as LINE [23] and DeepWalk [24] have been applied to several bioinformatics problems and have performed well. In this article, we choose the GraRep [25] method to learn the associations with proteins (behavior information) of miRNAs and diseases. Thirdly, the behavior information of miRNAs and diseases is combined with their own attribute information (disease semantic similarity and miRNA sequence information) to represent the 16427 known miRNA-disease pairs downloaded from the HMDD [26] database. Finally, the Random Forest classifier is trained on the resulting miRNA-disease pair features. The pipeline of NEMPD is shown in Fig. 2. Under five-fold cross-validation, the average AUC and AUPR of NEMPD are 0.9158 and 0.9233, respectively. Furthermore, we measured the performance of NEMPD with different feature combinations and classifiers. Besides, in order to further test the performance of NEMPD, we conducted case studies of three major human diseases. All the results demonstrate that NEMPD performs well and can be used as a reliable model in the field of miRNA-disease association prediction.

Results And Discussion

The five-fold cross-validation performance of NEMPD
To evaluate the prediction performance of NEMPD, we adopted the 5-fold cross-validation method in our experiment. Specifically, we firstly divide the data set into five parts, each with the same ratio of positive to negative samples. Each time we select 4 parts as the training sample and the remaining 1 part as the test sample, and we repeat the experiment 5 times so that every part serves as the test sample once. We selected six parameters as evaluation indicators: accuracy (Acc.), precision (Prec.), Matthews correlation coefficient (MCC), specificity (Spec.), sensitivity (Sen.) and area under the ROC curve (AUC). Table 1 shows the results of each fold in detail. The results demonstrate the good performance of NEMPD in the prediction of potential miRNA-disease associations. The ROC (Receiver Operating Characteristic) curve is often used to evaluate a binary classifier and remains informative when the classes are imbalanced. The abscissa of the ROC curve is the FPR (false positive rate), the proportion of all negative cases that are incorrectly predicted to be positive. The ordinate of the ROC curve is the TPR (true positive rate), the proportion of all positive cases that are correctly predicted to be positive. The AUC is defined as the area under the ROC curve, with values generally ranging from 0.5 to 1. AUC is commonly used as an evaluation indicator because the ROC curves of two classifiers often cannot be ranked by inspection alone, whereas the AUC summarizes each curve as a single number: the larger the AUC value, the better the performance of the classifier. The PR (Precision-Recall) curve is another tool for evaluating the classification ability of machine learning algorithms on a given data set. Moreover, when dealing with highly imbalanced data sets, the PR curve can display more information and reveal more problems. The AUPR is defined as the area under the PR curve. As with AUC, the larger the AUPR value, the better the performance of the classifier. The ROC and PR curves of NEMPD under 5-fold cross-validation are shown in Fig. 3 and Fig. 4, respectively. As can be seen from the figures, the mean AUC and AUPR of NEMPD are 0.9158 and 0.9233, respectively. These results demonstrate that NEMPD performs well in the prediction of potential miRNA-disease associations.
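The evaluation protocol described above can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the function names, the round-robin fold assignment (without shuffling), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration. The AUC is computed through its rank interpretation, i.e. the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one.

```python
import math

def stratified_folds(labels, k=5):
    """Split sample indices into k folds with (nearly) equal class ratios.

    Samples of each class are dealt round-robin; no shuffling is done here.
    """
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

def binary_metrics(y_true, y_pred):
    """Acc., Prec., Sen., Spec. and MCC from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / len(pairs),
        "prec": tp / (tp + fp) if tp + fp else 0.0,
        "sen": tp / (tp + fn) if tp + fn else 0.0,    # TPR
        "spec": tn / (tn + fp) if tn + fp else 0.0,   # 1 - FPR
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical scores, not results from the paper).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
folds = stratified_folds([1] * 10 + [0] * 10, k=5)
```

In this toy case each fold receives two positive and two negative samples, which is the equal-ratio property the cross-validation description asks for.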

Comparison with Different Feature Combinations
In order to verify the validity of the proposed feature representation, we examined the influence of different feature combinations on the results of NEMPD. In detail, combination 1 is composed only of the attribute information of miRNAs and diseases, combination 2 is composed only of the behavior information of miRNAs and diseases, and combination 3 is composed of both attribute and behavior information. These three feature combinations were each used as training features of the random forest classifier and evaluated under 5-fold cross-validation. The detailed results are shown in Table 2, and the ROC and PR curves are shown in Fig. 5. The experimental results show that NEMPD achieves better prediction performance when combination 3 is used as the final training feature vector.

Comparison with Different Classifier models
To verify the performance of the random forest classifier in NEMPD, we further compared it with three other classifier models (KNN, Naive Bayes and Decision Tree). It is worth noting that all four classifiers use the same data set and the default parameters for training and prediction, to keep the comparison fair. We again use the six parameters (accuracy (Acc.), precision (Prec.), Matthews correlation coefficient (MCC), specificity (Spec.), sensitivity (Sen.) and area under the ROC curve (AUC)) as evaluation indicators of the different classifiers. The detailed results are listed in Table 3, and Fig. 6 shows the ROC and PR curves of the different classifiers. The results of the comparison experiment indicate that the random forest classifier is more suitable for NEMPD. Although it is not as good as KNN and Naive Bayes in sensitivity, random forest performs better in accuracy and AUC, which better reflect the overall classification ability of a model.

Case studies
To further verify NEMPD's ability to discover potential miRNA-disease associations, we selected three common and complex human cancers (colon neoplasms, breast neoplasms, and lung neoplasms) to conduct case studies, which are a common validation experiment for miRNA-disease association prediction methods. For each cancer, we selected the top 50 predicted miRNAs and checked them against two other databases, dbDEMC [27] and miR2Disease [28].
Colon neoplasms are currently the third most common gastrointestinal disease in the world [29,30]. Some miRNA-colon neoplasms associations have been verified by previous experiments, such as those involving miR-17, miR-92a, miR-31, miR-155, and miR-21 [31]. These studies have demonstrated that miRNAs are crucial for the prediction of colon neoplasms and can be used as important biomarkers for them. Therefore, the prediction of miRNA-colon neoplasms associations is very important for the treatment and diagnosis of colon neoplasms. In this work, we sorted the final prediction results of NEMPD according to the prediction score. As a result, 48 of the top 50 miRNAs were verified to be associated with colon neoplasms through the miR2Disease and dbDEMC databases (see Table 4). For example, hsa-miR-20a-5p has been experimentally confirmed to be associated with colon neoplasms [32]. That study drew its conclusions through statistical analysis of population-based colorectal cancer studies conducted in Utah and the Kaiser Permanente Medical Care Project (PMID: 26963002).
Breast neoplasms are another common malignant tumor that mainly occurs in women. In the United States, there are about 180,000 new breast neoplasm patients each year, and about 40,000 die from breast neoplasms. In recent years, the incidence of breast neoplasms in China has also been rising, and breast neoplasms have become the second leading cause of cancer death after lung neoplasms. As small RNA molecules, miRNAs can suppress breast neoplasms by inhibiting their target mRNAs. Besides, miRNA-breast neoplasms associations have been verified in many previous studies. For example, miR-21 has been found to be overexpressed in breast neoplasms [33], while miR-429 and miR-200c are down-regulated [34]. Similarly, we sorted the final prediction results according to the prediction score. As a result, 47 of the top 50 miRNAs were verified to be associated with breast neoplasms through the miR2Disease and dbDEMC databases (see Table 5). For example, hsa-miR-93-5p has been experimentally proved to be related to breast neoplasms [35] (PMID: 24865188).
Lung neoplasms are a common tumor disease worldwide and one of the leading causes of cancer death. Their morbidity and mortality rates are among the fastest growing, and they are among the tumors most threatening to the health and life of the population. In recent years, the incidence and mortality of lung cancer in many countries have increased significantly. In addition, many previous studies have confirmed that miRNAs are crucial in the early treatment and diagnosis of lung neoplasms. For example, Yanaihara et al. [36] found through microarray analysis that the expression of 17 miRNAs in lung cancer cells differs from that in normal cells. Mascaux et al. [37] found that the expression profile of miRNAs also changes during the entire process of lung cancer. Similarly, we sorted the final prediction results of NEMPD according to the prediction score. As a result, 47 of the top 50 miRNAs were verified to be related to lung neoplasms by the dbDEMC and miR2Disease databases (see Table 6).

Table 4 The top 50 miRNAs associated with colon neoplasms predicted by NEMPD. The top 1-25 associated miRNAs are shown in the first column. The top 26-50 associated miRNAs are shown in the third column.

Table 5 The top 50 miRNAs associated with breast neoplasms predicted by NEMPD. The top 1-25 associated miRNAs are shown in the first column. The top 26-50 associated miRNAs are shown in the third column.

Table 6 The top 50 miRNAs associated with lung neoplasms predicted by NEMPD. The top 1-25 associated miRNAs are shown in the first column. The top 26-50 associated miRNAs are shown in the third column.

Conclusion
Predicting potential miRNA-disease associations with computational models addresses the long duration and high cost of traditional experimental methods and provides data support for experimental research. In this article, we proposed a novel computational model (NEMPD) that constructs a tripartite miRNA-protein-disease network from known miRNA-protein and protein-disease associations and uses the GraRep network embedding method to obtain the network behavior information (associations with proteins) of miRNAs and diseases. The attribute and behavior information are then combined into the final node feature vectors. Finally, the resulting features of known miRNA-disease pairs are used for training and prediction by the random forest classifier. NEMPD obtained average AUC and AUPR values of 0.9158 and 0.9233 under 5-fold cross-validation. Moreover, in case studies of colon neoplasms, breast neoplasms, and lung neoplasms, 48, 47, and 47 of the top 50 predicted miRNAs were confirmed, respectively. The experimental results indicate that NEMPD can effectively predict potential miRNA-disease associations and can also be extended to association prediction for other small biological molecules.

Construct the miRNA-protein-disease association network
The miRNA-protein-disease association network is constructed by combining known miRNA-protein and protein-disease associations. More specifically, the miRNA-protein and protein-disease associations are obtained from the miRTarBase [38] and DisGeNET [39] databases, respectively. After that, we unified the identifiers and removed unrelated items. In total, 4944 miRNA-protein associations and 25087 protein-disease associations were acquired (see Table 7). In addition, we classified the three types of nodes in the network and counted each type separately, obtaining 271 miRNA nodes, 1147 protein nodes and 693 disease nodes (see Table 8).

Table 7 The associations in the miRNA-protein-disease network.
Table 8 The nodes in the miRNA-protein-disease network

Node      Amount
miRNA     271
Disease   693
Protein   1147
Total     2111
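As an illustration of the construction step, the sketch below builds an undirected, unweighted adjacency matrix for a tripartite network from two edge lists. It is a minimal example under stated assumptions: the function name, the node ordering (miRNAs, then proteins, then diseases), and the placeholder identifiers such as `miR-a` and `prot-1` are hypothetical and do not come from miRTarBase or DisGeNET.

```python
def build_tripartite_network(mirna_protein, protein_disease):
    """Return (nodes, adjacency) for an undirected, unweighted tripartite graph.

    Nodes are (type, name) tuples ordered as miRNAs, proteins, diseases.
    miRNAs and diseases are linked only indirectly, through shared proteins.
    """
    mirnas = sorted({m for m, _ in mirna_protein})
    proteins = sorted({p for _, p in mirna_protein}
                      | {p for p, _ in protein_disease})
    diseases = sorted({d for _, d in protein_disease})
    nodes = ([("miRNA", m) for m in mirnas]
             + [("protein", p) for p in proteins]
             + [("disease", d) for d in diseases])
    index = {node: i for i, node in enumerate(nodes)}

    size = len(nodes)
    adjacency = [[0] * size for _ in range(size)]
    edges = ([(("miRNA", m), ("protein", p)) for m, p in mirna_protein]
             + [(("protein", p), ("disease", d)) for p, d in protein_disease])
    for a, b in edges:
        i, j = index[a], index[b]
        adjacency[i][j] = adjacency[j][i] = 1  # symmetric: undirected edge
    return nodes, adjacency

# Placeholder associations (hypothetical identifiers, for illustration only).
mirna_protein = [("miR-a", "prot-1"), ("miR-a", "prot-2"), ("miR-b", "prot-2")]
protein_disease = [("prot-1", "dis-x"), ("prot-2", "dis-x"), ("prot-2", "dis-y")]
nodes, adjacency = build_tripartite_network(mirna_protein, protein_disease)
```

The resulting matrix plays the role of the adjacency matrix S that a network embedding method takes as input.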

Numerical miRNA sequence information
In this work, numerical miRNA sequence information derived from the miRBase [40] database was used as the attribute information of miRNAs. For simplicity, we chose the 3-mer method to encode each miRNA sequence into a 64-dimensional (4 × 4 × 4) vector, where each dimension is the occurrence rate of the corresponding 3-mer (e.g. UGA, AGC, CUA) in the miRNA sequence.
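The 3-mer encoding can be sketched as follows. This is an illustrative implementation rather than the authors' code: the function name, the lexicographic ordering of the 64 3-mers, and the choice to normalize counts by the number of sliding windows (so the vector sums to 1) are assumptions, since the text only states that each dimension is an occurrence rate.

```python
from itertools import product

# All 64 possible 3-mers over the RNA alphabet, in a fixed (assumed) order.
KMERS = ["".join(p) for p in product("ACGU", repeat=3)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def three_mer_vector(sequence):
    """Encode an RNA sequence as a 64-dimensional 3-mer frequency vector."""
    sequence = sequence.upper().replace("T", "U")  # accept DNA-style input
    counts = [0] * len(KMERS)
    total = 0
    for start in range(len(sequence) - 2):
        kmer = sequence[start:start + 3]
        if kmer in KMER_INDEX:          # skip windows with ambiguous bases
            counts[KMER_INDEX[kmer]] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(KMERS)

# Example with a 22-nt sequence used purely as sample input.
vector = three_mer_vector("UGAGGUAGUAGGUUGUAUAGUU")
```

A 22-nt sequence yields 20 overlapping windows, so each non-zero entry is a multiple of 1/20.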

Disease semantic similarity
Disease semantic similarity has been widely used in the identification of disease-related miRNAs, and its effectiveness has been demonstrated in a large number of previous studies [41][42][43][44][45]. Therefore, we use disease semantic similarity to represent the attribute information of diseases and calculate it based on directed acyclic graphs (DAGs) built from MeSH descriptors. For example, disease C can be described as DAG(C) = (D(C), E(C)), where D(C) consists of disease C itself and its ancestors, and E(C) consists of all edges from parent nodes to child nodes. Figure 7 shows the DAG of lung neoplasms. In traditional calculation models [41], disease terms in the same layer contribute the same semantic value to a disease. In fact, it is inaccurate to assign the same contribution value to two disease terms in the same layer, because they appear with different frequencies across the DAGs. In this article, we calculate the contribution of a disease d to the semantic value of disease C based on the assumption that more specific diseases should contribute more to the semantic value of disease C. The contribution of a disease d to DAG(C) is defined by Eq. (1). The semantic value of disease C is then obtained by adding the contributions of all ancestor diseases and of disease C itself (Eq. (2)). Finally, the semantic similarity between disease A and disease B is obtained by adding together the contributions of the disease terms shared by the two disease DAGs (Eq. (3)).
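To make the idea concrete, the sketch below implements one formulation that is consistent with the description above but is an assumption rather than a quotation of Eqs. (1)-(3): the contribution of a term d is taken as -log(n_d / N), where n_d is the number of disease DAGs containing d and N is the number of diseases; the semantic value of C is the sum of contributions over D(C); and the similarity of A and B is the sum of the contributions of shared terms (counted once for each disease) divided by the sum of the two semantic values. The function names and the small hierarchy are illustrative only and are not an extract of MeSH.

```python
import math

def ancestors_with_self(term, parents):
    """Return D(term): the term itself plus all of its ancestors in the DAG."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def semantic_similarity(parents, diseases):
    """Pairwise disease semantic similarity under the assumed formulation."""
    dags = {c: ancestors_with_self(c, parents) for c in diseases}
    # n_d: in how many disease DAGs does each term appear?
    frequency = {}
    for terms in dags.values():
        for term in terms:
            frequency[term] = frequency.get(term, 0) + 1
    # Rarer (more specific) terms receive a larger contribution.
    contribution = {t: -math.log(f / len(diseases)) for t, f in frequency.items()}
    value = {c: sum(contribution[t] for t in dags[c]) for c in diseases}

    similarity = {}
    for a in diseases:
        for b in diseases:
            shared = dags[a] & dags[b]
            denom = value[a] + value[b]
            # A term present in every DAG contributes 0, so denom can be 0.
            similarity[(a, b)] = (2 * sum(contribution[t] for t in shared) / denom
                                  if denom else 0.0)
    return similarity

# Simplified, illustrative hierarchy (child -> list of parents).
parents = {
    "neoplasms": [],
    "lung neoplasms": ["neoplasms"],
    "breast neoplasms": ["neoplasms"],
    "small cell lung carcinoma": ["lung neoplasms"],
}
sim = semantic_similarity(parents, list(parents))
```

In this toy hierarchy, lung neoplasms and small cell lung carcinoma share a specific ancestor and obtain a similarity of 0.5, whereas lung and breast neoplasms share only the root term, which appears in every DAG and therefore contributes nothing.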

GraRep network embedding model
In many practical problems, information is organized as graphs, so it is important to learn useful information from graphs. One strategy for learning graph representations is to represent each node of the graph by a low-dimensional vector that contains meaningful semantic, relational, and structural information. GraRep [25] is one such network embedding model for learning vector representations of the nodes of weighted graphs. It represents each node in the graph by a low-dimensional vector and integrates the global structural information of the graph into the learning process. By operating on different global transition matrices defined on the graph, GraRep can directly obtain the k-order relational information between nodes without a slow and complicated sampling process. Besides, different loss functions are used to capture the different k-order local relational information, and matrix factorization is used to optimize each model. The global representation of each vertex is then constructed by combining the representations obtained from the different models, and this learned global representation can be used as a feature for further processing. More specifically, the basic steps of the algorithm are as follows:
Step 1. Get the k-step transition probability matrix $A^k$, where k = 1, 2, ..., K.
Given the graph G, the one-step transition probability matrix $A = D^{-1}S$ is the product of the inverse of the degree matrix D and the adjacency matrix S (for weighted graphs, S is a real matrix; for unweighted graphs, S is a binary matrix). The k-step transition probability matrix $A^k$ is the k-th power of A.
Step 2. Get each k-step representation.
Get the k-step log probability matrix $X^k$, subtract $\log(\beta)$ from each entry, and replace the negative entries with 0. Then construct the representation matrix $W^k$ from $X^k$ by matrix factorization; its rows are the k-step representations of the nodes in the graph.
Step 3. Concatenate all k-step representations.
All the k-step representations are concatenated to form a global representation, which can be used as features in other tasks. Table 9 describes the whole algorithm in detail.

Table 9 The GraRep Overall Algorithm

Input
  Adjacency matrix S on graph
  Maximum transition step K
  Log shifted factor β
  Dimension of representation vector d

Get k-step transition probability matrix $A^k$
  Compute $A = D^{-1}S$
  Calculate $A^1, A^2, \ldots, A^K$, respectively

Output
  Matrix of the graph representation W
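The three steps can be sketched with NumPy as below. This is a compact illustration and not the implementation used for NEMPD. Several details are assumptions drawn from the general description of GraRep rather than from this text: each column of $A^k$ is normalized by its column sum before taking the logarithm, the default shift factor is β = 1/N, and the factorization is a truncated SVD with $W^k = U_d \Sigma_d^{1/2}$. The function name and the toy path graph are hypothetical.

```python
import numpy as np

def grarep(S, K=3, d=2, beta=None):
    """Sketch of GraRep: concatenate d-dimensional k-step representations."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    if beta is None:
        beta = 1.0 / n                      # assumed default log-shift factor

    degree = S.sum(axis=1)
    degree[degree == 0] = 1.0               # guard against isolated nodes
    A = S / degree[:, None]                 # Step 1: A = D^{-1} S

    Ak = np.eye(n)
    representations = []
    for _ in range(K):
        Ak = Ak @ A                         # k-step transition matrix A^k
        column_sum = Ak.sum(axis=0)
        column_sum[column_sum == 0] = 1.0
        with np.errstate(divide="ignore"):  # log(0) -> -inf, clipped below
            X = np.log(Ak / column_sum) - np.log(beta)
        X = np.maximum(X, 0.0)              # Step 2: negative entries -> 0
        U, sigma, _ = np.linalg.svd(X, full_matrices=False)
        representations.append(U[:, :d] * np.sqrt(sigma[:d]))
    return np.hstack(representations)       # Step 3: concatenate over k

# Toy unweighted path graph on four nodes: 0 - 1 - 2 - 3.
S = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
W = grarep(S, K=3, d=2)
```

Each row of `W` is the global representation of one node, with K·d entries; for a network of the size reported in Table 8, a sparse or truncated SVD would likely be preferable to the dense decomposition used in this sketch.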

Node Representation
In order to improve the accuracy of the training results, we combine attribute information with the network behavior information of miRNAs and diseases to form the final features of the known miRNA-disease training pairs. The network behavior information of miRNA and disease nodes is extracted from the miRNA-protein-disease network with the GraRep network embedding method. The sequence features and the semantic similarity information are used as the attribute features of miRNAs and diseases, respectively. Finally, each known miRNA-disease training pair is transformed into a 128-dimensional feature vector for training and prediction with a random forest classifier.
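A pair representation of this kind can be assembled by concatenating the four per-node feature blocks, as in the sketch below. The concatenation order, the function name, and the tiny 2-dimensional toy vectors are assumptions for illustration; the text does not specify how the 128 dimensions are distributed among the blocks.

```python
def pair_feature(mirna, disease, mirna_attr, mirna_behavior,
                 disease_attr, disease_behavior):
    """Concatenate attribute and behavior features of one miRNA-disease pair."""
    return (list(mirna_attr[mirna]) + list(mirna_behavior[mirna])
            + list(disease_attr[disease]) + list(disease_behavior[disease]))

# Toy lookup tables with hypothetical identifiers and 2-dimensional features.
mirna_attr = {"miR-a": [0.1, 0.9]}        # e.g. sequence-derived features
mirna_behavior = {"miR-a": [0.3, 0.2]}    # e.g. network embedding
disease_attr = {"dis-x": [0.7, 0.0]}      # e.g. semantic similarity features
disease_behavior = {"dis-x": [0.5, 0.4]}  # e.g. network embedding

known_pairs = [("miR-a", "dis-x")]
features = [pair_feature(m, d, mirna_attr, mirna_behavior,
                         disease_attr, disease_behavior)
            for m, d in known_pairs]
```

The rows of `features`, together with their positive or negative labels, would form the training matrix handed to the classifier.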

Abbreviations
GraRep: Learning Graph Representations with Global Structural Information; AUC: the area under the Receiver Operating Characteristic curve; AUPR: the area under the Precision-Recall curve; DAGs: directed acyclic graphs; DSS: disease semantic similarity; HMDD: human microRNA disease database

Ethics approval and consent to participate
Not applicable

Consent for publication
Not applicable

Availability of data and materials
The datasets used in this article are available at https://github.com/jiboya123/NEMPD

Figure 1 The miRNA-protein-disease network
Figure 3 The 5-fold cross validation ROC curves and AUC of NEMPD
Figure 4 The 5-fold cross validation PR curves and AUPR of NEMPD
Figure 5 The ROC and PR curves of NEMPD with different feature combinations
Figure 6 The ROC and PR curves of NEMPD with different classifiers
Figure 7 The DAGs of lung neoplasms.