 Research article
 Open Access
Combined embedding model for MiRNA-disease association prediction
BMC Bioinformatics volume 22, Article number: 161 (2021)
Abstract
Background
Cumulative evidence from biological experiments has confirmed that miRNAs play significant roles in the diagnosis and treatment of complex diseases. However, traditional medical experiments are time-consuming and costly, so they fail to find unconfirmed miRNA-disease interactions. Thus, discovering potential miRNA-disease associations will contribute to reducing the pathogenesis of diseases and benefit disease therapy. Although existing methods based on different computational algorithms perform favorably in searching for potential miRNA-disease interactions, further work is still needed to improve experimental results.
Results
We present a novel combined embedding model to predict miRNA-disease associations (CEMDA) in this article. The combined embedding information of miRNA and disease is composed of pair embedding and node embedding. Compared with previous heterogeneous network methods, which are merely node-centric and simply compute the similarity of miRNA and disease, our method fuses pair embedding to pay more attention to capturing the features behind the relative information; this models the fine-grained pairwise relationship better than the previous case in which each node has only a single embedding. First, we construct the heterogeneous network from supported miRNA-disease pairs, disease semantic similarity and miRNA functional similarity. Given this heterogeneous network, we find all the associated context paths of each confirmed miRNA and disease. Meta-paths are linked by nodes and then input to the gated recurrent unit (GRU) to directly learn more accurate similarity measures between miRNA and disease. Here, the multi-head attention mechanism is used to weight the hidden state of each meta-path, and the similarity information transmission mechanism in a meta-path of miRNA and disease is obtained through multiple network layers. Second, the pair embedding of miRNA and disease is fed to the multi-layer perceptron (MLP), which focuses on the more important segments of the pairwise relationship. Finally, we combine meta-path based node embedding and pair embedding with the cost function to learn and predict miRNA-disease associations. The source code and data sets that verify the results of our research are available at https://github.com/liubailong/CEMDA.
Conclusions
The AUCs of CEMDA in leave-one-out cross validation and five-fold cross validation are 93.16% and 92.03%, respectively, which indicates that CEMDA achieves superior performance compared with other methods. Case studies on lung cancers, breast cancers, prostate cancers and pancreatic cancers show that 48, 50, 50 and 50 of the top 50 predicted miRNAs, respectively, are confirmed in HMDD V2.0. This further demonstrates the feasibility and effectiveness of our method.
Background
Microribonucleic acids (miRNAs), small non-coding RNA molecules of about 21–22 nucleotides, have an important effect at the post-transcriptional level and on cell processes [1]. Experiments have confirmed that miRNAs are involved in the diagnosis and medical treatment of heart conditions [2], cardiovascular diseases, malignancies, mental disorders and diabetes. For instance, medical experiments show that miR-33 controls cholesterol homeostasis [3]. Hence, it is essential for medical scholars to identify the miRNAs that are related to diseases. Many medical technologies, e.g., microarrays and PCR, have been used to explore miRNA-disease associations [4]. However, traditional medical experiments are limited by high cost and long duration. Therefore, many researchers are devoted to devising computational methods to find unidentified miRNA-disease interactions, so as to compensate for the drawbacks [5, 6] of traditional experimental methods.
Many innovative computational approaches have recently been developed to discover miRNA-disease interactions. These methods can be roughly classified into two categories: similarity-based methods and machine learning-based methods. Similarity-based methods rest on the presumption that miRNAs with similar functions are closely associated with similar diseases. For instance, Jiang et al. proposed the first method that combines disease phenotype information with miRNA information to predict miRNA-disease interactions [7]. Nevertheless, this approach had some shortcomings: using the number of overlapping target genes of two miRNAs as the criterion for the miRNA functional similarity score proved inadequate because it ignored indirect neighbors. Xuan et al. scored unlabeled miRNAs according to functional similarity, miRNA clusters and miRNA families. However, the miRNA similarity network they used limited their experimental performance [8]. Chen et al. applied the random walk algorithm to the prediction of miRNA-disease associations [9]. However, this method was limited in how it constructed miRNA functional similarity networks, which made it unable to make predictions for new diseases without confirmed related miRNAs. Then, Chen et al. integrated within-scores and between-scores to rank unverified miRNA-disease associations [10]. Besides, without using any known miRNA-disease associations, Zhao et al. constructed a miRNA-lncRNA-disease network (DCSMDA), which integrated miRNA-lncRNA associations and lncRNA-disease associations to indirectly predict miRNA-disease interactions [11].
In summary, the core of similarity-based methods is to construct a network model and to measure, in different ways, the similarity between nodes in the network in order to predict miRNA-disease interactions. Most of these methods are limited by the quality of the constructed network model and by the incomplete relationships between nodes.
Besides methods based on similarity measures, exploring potential miRNA-disease interactions with machine learning algorithms is another significant line of research in this field. Unlike similarity-based methods, which directly calculate the similarity between nodes in the network, machine learning-based studies are committed to extracting inherent features and devising effective classification algorithms to find miRNA-disease associations. For example, Jiang et al. randomly drew negative samples from the unverified miRNA-disease pairs and applied an SVM as the prediction classifier [12]. In contrast, Chen et al. designed a semi-supervised classification method that requires no negative samples [13]. To address data insufficiency and data noise, Liang et al. devised an objective function based on the L1-norm [14]. Chen et al. chose discriminative features according to occurrence frequency [15]. Further, Zhao et al. combined multiple weak classifiers with boosting to strengthen classification [16]. In addition, matrix decomposition [17, 18] and collaborative filtering [19] are both useful in revealing miRNA-disease relations. For instance, Mao et al. devised a method based on genomic data fusion (MDBPMF), which employed the Bayesian Probabilistic Matrix Factorization model to fuse data from multiple sources. Their method provided a good approximation to the matrix and was shown to generalize by assessing its performance on unseen data [20]. There have also been substantial efforts to predict miRNA-disease associations motivated by promising developments in autoencoders [21], node embedding [22], deep learning and structural deep network embedding (SDNE) [23].
Although current approaches perform favorably in predicting unconfirmed miRNA-disease interactions, further work is still needed to improve experimental performance. On the one hand, many papers have shown that previous node-centric methods simply compute the similarity by applying a similarity metric, such as the inner product or Euclidean distance [24], ignoring hidden relative information between two nodes. On the other hand, some methods are largely limited in obtaining intrinsic information and discriminative features from miRNA-disease associations. Moreover, some methods are not suitable for new diseases without confirmed miRNAs.
Node-centric methods fall short of considering the hidden relative information between two nodes. Hence, we introduce the concept of a “pair”, which we believe can better capture the hidden relative features between two nodes. To obtain efficient relative features between two nodes, it is necessary to transform the features of both nodes simultaneously, which we call “pair embedding”. For instance, Fig. 1 shows a visualization of embeddings of miRNAs and diseases, where each miRNA is assigned a single embedding. The names of most diseases contain keywords related to body organs, which can serve as a feature representing their disease type. We assume that the miR-21 cluster is related to multiple disease types, such as pancreatic cancers [25] and breast cancers, whereas the miR-17 cluster [26], regarded as an oncogene, is solely overexpressed in lung cancers. Since every miRNA has a single embedding, it has to be embedded at a single best point among all the various disease types. Thus, lung cancers are predicted to be associated with miR-17 rather than miR-21. However, miR-21 has in fact been confirmed to be related to lung cancers in clinical trials [27]. On the other hand, as shown in Fig. 2, if we embed each miRNA-disease pair such that each pair independently captures its associated features, the (“Target disease”, miR-21) pair may be associated more closely with the valid pairs related to “lung cancers” than the (“Target disease”, miR-17) pair is. To sum up, pair embedding can capture the hidden features behind the pairwise relationship more precisely than node embedding.
Meta-paths are links formed by a series of nodes, which can be used to preserve associations between nodes and to explore the structural information in heterogeneous networks. Shi et al. offered an algorithm to reveal relationships by performing random walks [28]. They used miRNA-target associations and disease-gene interactions to identify potential miRNA-disease associations. However, the model strongly depended on the previous nodes to predict the next node in the network [29]; it ignored the fact that each node makes a different contribution to the meta-path, and it could not be optimized step by step. Unlike Shi’s work, we develop a novel Combined Embedding model for MiRNA and Disease Association prediction (CEMDA) to learn the similarity features of miRNAs and diseases. We believe that pair embedding can better capture the features between two nodes, and the MLP enables us to model the fine-grained pairwise relationship in a confirmed miRNA-disease pair. We construct the heterogeneous network from the identified miRNA-disease pairs, disease semantic similarity and miRNA functional similarity. From this heterogeneous network, we find all the associated context paths of each confirmed miRNA and disease. Then, the associated context paths are linked by nodes, and we employ meta-path based node embedding to obtain the features that contribute most to the meta-paths during model training. The parameters are optimized through iterative training to improve prediction. To emphasize the associated meta-paths, the multi-head attention mechanism is applied to weight the hidden state of each sequence and to compensate for the dependency loss of the meta-paths in model training. In this way, the similarity information transmission mechanism in a meta-path of miRNA and disease is obtained through multiple network layers.
Finally, we combine the pair embedding and the node embedding, which predicts the fine-grained relationship in the heterogeneous network better than a single embedding. At the same time, CEMDA is suitable for new diseases with unknown miRNA information. Through the combination of the miRNA-disease pair embedding and the meta-path based node embedding, our method outperforms other state-of-the-art methods. The results of global LOOCV and five-fold cross validation show that CEMDA achieves AUCs of 93.16% and 92.03%, respectively. Furthermore, three kinds of case studies on breast cancers, lung cancers, pancreatic cancers, prostate cancers and colorectal cancers illustrate that our approach obtains remarkable performance.
Results
First, we present the experimental methods and evaluation criteria. Second, we analyze the results of CEMDA in comparison with five classical methods. Finally, we carry out three kinds of case studies to verify the experimental performance of our approach.
Experimental approaches and evaluation criteria
We collected 5430 experimentally identified miRNA-disease interactions from HMDD V2.0 [30] as the dataset for prediction. We apply global LOOCV and five-fold cross validation strategies in the experiments. In global LOOCV, each verified miRNA-disease pair is used in turn as the testing sample, and the other pairs are used as the training samples. In five-fold cross validation, the miRNA-disease associations are randomly divided into five equal-size groups; four groups are used as the training set and the remaining one as the testing set. We repeat five-fold cross validation 50 times to reduce randomness and then average the results. All meta-paths whose length is less than 4 are extracted, because we find that overly long meta-paths contribute little to performance while greatly increasing the required computing resources.
We use the area under the curve (AUC) as the standard to evaluate the performance of the compared approaches.
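The five-fold protocol described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `five_fold_splits`, the seed handling and the toy pair list are introduced here for the example.

```python
import random

def five_fold_splits(pairs, n_folds=5, seed=0):
    """Yield (train, test) lists for one round of k-fold cross validation.

    `pairs` is a list of known (miRNA, disease) index pairs. The pairs are
    shuffled once and dealt into `n_folds` groups whose sizes differ by at
    most one; each group serves as the test set exactly once.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train, test

# Toy example with 10 hypothetical known associations (not HMDD data).
toy_pairs = [(i, i % 3) for i in range(10)]
splits = list(five_fold_splits(toy_pairs, seed=42))
```

Calling the function with 50 different seeds and averaging the resulting AUCs would correspond to the 50 repetitions mentioned in the text. Setting `n_folds=len(pairs)` leaves a single pair out per round, which mirrors the leave-one-out setting.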
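As a hedged illustration of the evaluation criterion, the sketch below computes an AUC from labels and prediction scores. It assumes that the curve in question is the ROC curve, which is the usual choice in this literature; the function name `auc_score` is hypothetical.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    The AUC equals the probability that a randomly chosen positive sample
    is scored higher than a randomly chosen negative one, with ties
    counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic in the number of samples, which is fine for a small example; a sort-based ranking would be preferable for the full candidate set.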
Comparisons with state-of-the-art methods
To verify our experimental results, we compared CEMDA with ICFMDA [19], IMCMDA [18], WBSMDA [10], RLSMDA [13] and KATZBNRA [31]. The comparisons with these five state-of-the-art approaches in global LOOCV and five-fold cross validation are displayed in Figs. 3 and 4, respectively. Besides, we compare with DCSMDA [11] in global LOOCV and with MDBPMF [20] in five-fold cross validation. Since each of these two methods has only one result, they are not shown in the figures. As depicted in Fig. 3, CEMDA has the highest AUC of 93.16% in global LOOCV, revealing remarkable performance compared with the other approaches. The AUCs of ICFMDA, IMCMDA, WBSMDA, RLSMDA, KATZBNRA and DCSMDA are 90.67%, 83.87%, 88.95%, 87.47%, 90.98% and 81.55%, respectively. In addition, Fig. 4 shows that CEMDA also achieves the best prediction performance in the five-fold cross validation experiments. The AUCs of CEMDA, ICFMDA, IMCMDA, WBSMDA, RLSMDA, KATZBNRA and MDBPMF are 92.03%, 90.45%, 81.09%, 80.05%, 83.39%, 89.72% and 87.55%, respectively. Therefore, the results demonstrate that CEMDA is reliable in discovering unverified miRNA-disease associations.
Comparisons of CEMDA with pair embedding and without pair embedding
We compared CEMDA with and without pair embedding under global LOOCV and five-fold cross validation. The results, depicted in Figs. 5 and 6, demonstrate that pair embedding improves performance under both strategies, which means that pair embedding plays an important role in CEMDA. First, pair embedding helps model the fine-grained pairwise relationship better than the previous case in which each node has only a single embedding. Second, pair embedding provides incentives to the associated nodes in the meta-path. The feature information of the miRNA-disease pair is obtained by the multi-layer perceptron to enhance the similarity information transmission.
Comparisons of CEMDA with different meta-path lengths
The meta-path length is a critical parameter for information extraction in CEMDA. Different parameter values result in different information scales. We compare the experimental performance for different meta-path lengths under global LOOCV and five-fold cross validation; Figures 7 and 8 illustrate the experimental results. We find that performance initially improves as the meta-path length increases. More related nodes are contained when the meta-path is longer, which brings richer information and more abundant features from the meta-paths to model training. In other words, the method can integrate more long-term dependencies between nodes. However, Figs. 7 and 8 also show that when the meta-path length increases further, the performance of CEMDA falls distinctly. The longer a meta-path is, the more the information in its segments is repeated, so it contributes less to the performance. After many trials, we chose 3 as the maximum meta-path length in our method.
Influence of projection dimensions
We compared the influence of several projection dimensions \(Z\) in Formula (11) on the results of CEMDA under global LOOCV and five-fold cross-validation. Figure 9 shows the AUC values of CEMDA for the different projection dimensions \(Z\) under both strategies. In Formula (11), we used five different projection dimensions: 32, 64, 128, 256 and 512. The AUC displays a slight upward trend as the projection dimension increases. However, with a projection dimension of 512, performance diminished slightly during training because of the huge amount of computation and data noise. Hence, we finally selected a projection dimension of 256.
Case studies
Three kinds of case studies are carried out to further validate the predicted miRNA-disease interactions. In the first case study, we used lung cancers and breast cancers, with HMDD V2.0 as the data set, to discover the associated unverified miRNAs. We then compare the candidate miRNAs found with two public databases, dbDEMC [32] and PhenomiR [33], to validate their accuracy.
Lung cancers have been reported to be overwhelmingly deadly diseases that cause a large number of deaths worldwide [34]. Biomedical research finds that a person whose lung cancer is discovered as early as possible may have a high survival rate. Medical experiments have shown that miRNAs have a large effect on the diagnosis and cure of lung cancers [35]. In Table 1, the first column contains the top 1–25 candidates and the second column lists the top 26–50. Among them, 48 of the top 50 candidates are shown to be related to lung cancers by biological experimental results recorded in the two public databases; only 2 miRNAs are unconfirmed. For instance, hsa-mir-421, ranked 1st in Table 1, has been shown to promote proliferation in non-small cell cancers [36]. Hence, the performance of our prediction model offers a novel perspective for researchers.
Breast cancers are widespread neoplasms with high mortality in women around the world. Deaths from breast neoplasms are projected to reach three million in the future [37]. Evidence that miR-142-3p is related to breast cancers has been validated in biological experiments. We used CEMDA to predict the related miRNAs for breast cancers and list the top 50 related miRNAs in Table 2. All of the top 50 miRNAs are supported by the above-mentioned databases. Hsa-mir-140, which ranks 1st, has been validated to promote the spread of breast neoplasm cells [38]. Hence, these findings illustrate that CEMDA offers strong evidence for breast neoplasm predictions.
Then, in the second case study, we verify whether this approach is suitable for new diseases without confirmed related miRNAs from biological experiments. We first selected prostate cancers because they are the most common cancers in men worldwide. It is reported that over one hundred thousand men died from prostate diseases in a foreign country in 2018 [39]. First, we set all miRNA-disease associations involving prostate cancers in HMDD V2.0 to zero and then ran CEMDA to predict the related miRNAs for prostate cancers. The results, shown in Additional file 1: Table S1, indicate that all of the top 50 miRNAs were verified by dbDEMC and PhenomiR. Second, to assess further new diseases, we carried out the same study on pancreatic cancers. The results for pancreatic cancers are given in Additional file 1: Table S2. All of the top 50 predicted miRNAs were also included in HMDD, dbDEMC and PhenomiR. Therefore, this case indicates that CEMDA is suitable for new diseases without confirmed related miRNAs.
Finally, we carried out the third case study to determine whether CEMDA, trained with data from an older version of HMDD, could predict miRNA-disease pairs newly added in a newer version of HMDD. We use HMDD 3.0 [40], dbDEMC and PhenomiR to verify the outcomes. The findings of the case study on colorectal cancers are given in Additional file 1: Table S3. All of the top 50 miRNAs are supported by HMDD 3.0, dbDEMC and PhenomiR.
In view of the outcomes of the three case studies, we conclude that our approach is effective in predicting unverified miRNA-disease interactions.
Discussion
Compared with five classical approaches under global LOOCV and five-fold cross validation, the experimental results indicate that CEMDA has better prediction performance. Moreover, three kinds of case studies on five diseases also support our approach’s results. First, we extract all meta-path instances of each confirmed miRNA-disease pair in the miRNA-disease heterogeneous network to capture the complicated associations in miRNA-disease interactions. Meta-paths are linked by nodes and then input to the GRU to learn more accurate similarity measures between miRNA and disease. Considering that different nodes make different contributions to a meta-path, the multi-head attention mechanism is used to weight the hidden state of each meta-path, and the similarity information transmission mechanism in a meta-path of miRNA and disease is obtained through multiple network layers. Second, the MLP is used to obtain the relative information in each confirmed miRNA-disease pair. By applying pair embedding, which captures the features behind the pairwise relationships, we can obtain the fine-grained associations. Finally, meta-path based node embedding and pair embedding are combined to integrate node and edge information from meta-path instances. In conclusion, CEMDA achieves excellent prediction performance by modeling the fine-grained pairwise relationship and considering the contributions of different nodes in the miRNA-disease heterogeneous network.
Methods
The framework for predicting miRNA-disease associations with CEMDA is presented in Fig. 9. First, several similarity measures are used to compute the miRNA integrated similarity and the disease integrated similarity. Second, we build the heterogeneous network from experimentally verified miRNA-disease associations, the miRNA integrated similarity and the disease integrated similarity. Third, we develop a novel Combined Embedding model to extract associated information and predict unidentified miRNA-disease associations. The model is composed of three parts: pair embedding of miRNA-disease, meta-path based node embedding, and prediction of miRNA-disease associations with the combined embedding. Pair embedding employs the MLP to pay more attention to important segments of the pairwise relationship. The initial representations of miRNAs and diseases, which have different dimensions, are projected into the same vector space. The associated context paths are serialized based on nodes, and then the GRU is used to learn the node features that contribute most to the meta-paths. The multi-head attention mechanism is used to weight the hidden state of each sequence, and the information of the entire meta-path is obtained through multiple network layers. We define the loss function to obtain the final representations of miRNAs and diseases by combining pair embedding and meta-path based node embedding.
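To make the sequence-encoding step more concrete, the following NumPy sketch runs a standard single-layer GRU over the projected node features of one meta-path instance and then pools the hidden states with softmax attention weights, one set per head. It is a sketch under assumptions rather than the model itself: the parameter shapes, the learned per-head query vectors in `queries`, the averaging of heads, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_states(X, params):
    """Run a single-layer GRU over the rows of X and return all hidden states."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    h = np.zeros(Uz.shape[0])
    states = []
    for x in X:
        z = sigmoid(Wz @ x + Uz @ h + bz)              # update gate
        r = sigmoid(Wr @ x + Ur @ h + br)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
        h = (1.0 - z) * h + z * h_tilde
        states.append(h)
    return np.stack(states)

def multi_head_pool(H, queries):
    """Weight the hidden states H (T x Z) with one softmax per head, then average the heads."""
    scores = queries @ H.T / np.sqrt(H.shape[1])       # shape (heads, T)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights @ H).mean(axis=0), weights

# Toy run: a meta-path with 4 nodes (3 edges), each node already projected to Z = 8 dims.
rng = np.random.default_rng(0)
Z, T, heads = 8, 4, 2
mat = lambda: rng.normal(scale=0.1, size=(Z, Z))
params = (mat(), mat(), np.zeros(Z), mat(), mat(), np.zeros(Z), mat(), mat(), np.zeros(Z))
X = rng.normal(size=(T, Z))
H = gru_states(X, params)
path_vec, weights = multi_head_pool(H, rng.normal(size=(heads, Z)))
```

Each row of `weights` sums to one, so `path_vec` is a convex combination of hidden states; nodes with larger weights contribute more to the representation of the meta-path.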
Structure of MiRNA and disease heterogeneous network
MiRNA and disease association network structure
HMDD V2.0 is a widely used database of experimentally supported miRNA-disease interactions. In this article, we use the adjacency matrix \(A\in {R}^{m\times n}\) to express the supported miRNA-disease associations, where \(m\) and \(n\) stand for the number of miRNAs and diseases, respectively. The element \({A}_{ij}\) equals 1 if miRNA \({r}_{i}\) is associated with disease \({d}_{j}\); otherwise, \({A}_{ij}\) equals 0. We use the HMDD V2.0 dataset to construct the matrix. The dataset contains 5430 associations between 495 miRNAs and 383 diseases, so \(m=495\) and \(n=383\). Overall, the adjacency matrix \(A\) is used to construct the miRNA-disease association network.
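A minimal sketch of building such an adjacency matrix from a list of known pairs is shown below. The function name `build_adjacency` and the toy indices are assumptions for illustration; loading the actual HMDD V2.0 file is not shown.

```python
import numpy as np

def build_adjacency(pairs, n_mirnas, n_diseases):
    """Return the binary matrix A with A[i, j] = 1 for each known (i, j) pair."""
    A = np.zeros((n_mirnas, n_diseases), dtype=np.int8)
    for i, j in pairs:
        A[i, j] = 1
    return A

# Toy example: 3 miRNAs, 2 diseases, 3 known associations (0-based indices).
toy_A = build_adjacency([(0, 0), (1, 0), (2, 1)], n_mirnas=3, n_diseases=2)
```

With the dataset described in the text, the call would use `n_mirnas=495` and `n_diseases=383`, and the resulting matrix would contain 5430 ones.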
Disease integrated similarity network construction
To make the experimental model more accurate and reliable, we followed Wang et al.’s work [41] and used Medical Subject Headings (MeSH) [42] to calculate the semantic similarity of diseases. We calculate the disease integrated similarity network \(SD\) by aggregating the disease semantic similarity \(SS\) and the disease Gaussian interaction profile kernel similarity \(GD\) as follows:
where \(GD\left({d}_{i},{d}_{j}\right)\) represents disease Gaussian interaction profile kernel similarity.
We assume that the more ancestor subject headings two diseases share, the more similar they are semantically. In Formula (1) above, \(SS\left({d}_{i},{d}_{j}\right)\) represents the combined semantic similarity of diseases \({d}_{i}\) and \({d}_{j}\). For the first disease semantic similarity method, we use the MeSH-based disease semantic similarity defined by Wang et al. Any disease \(D\) can be represented by a Directed Acyclic Graph \((DAG\left(D\right))\), which contains the set of ancestor disease nodes and the edges pointing from each parent node to its child node. They define the contribution of disease \(d\) in \(DAG(D)\) as follows:
where \(\Delta\) is the semantic contribution attenuation factor (0 < ∆ < 1). Following Xuan et al.’s study [8], this article sets \(\Delta\) to 0.5. Then, the semantic value of disease \(D\) is the sum of the semantic contribution values of \(D\) and all its ancestor nodes as follows:
where \(T(D)\) means all ancestor nodes of disease \(D\) including itself in the \(DAG\) graph.
Eventually, they calculate the first disease semantic similarity between disease \({d}_{i}\) and disease \({d}_{j}\) as follows:
Xuan et al. [8] defined the second method to provide the semantic value of disease \(D\). Supposing that some special diseases may have higher contributions to disease \(D\), they have another definition of the semantic contribution of disease \(d\) as follows:
Then, the semantic similarity \(SS2\left({d}_{i},{d}_{j}\right)\) between \({d}_{i}\) and \({d}_{j}\) is calculated as the percentage of the contribution of themselves and their common ancestor nodes as follows:
Eventually, the first disease semantic similarity (written \(SS1\) here) and the second disease semantic similarity \(SS2\) are arithmetically averaged to give the disease semantic similarity \(SS\left({d}_{i},{d}_{j}\right)\) as follows:
\[SS\left({d}_{i},{d}_{j}\right)=\frac{SS1\left({d}_{i},{d}_{j}\right)+SS2\left({d}_{i},{d}_{j}\right)}{2}\]
Finally, according to Formula (1), we calculate the disease integrated similarity network \(SD\left({d}_{i},{d}_{j}\right)\).
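The first semantic similarity measure can be illustrated with a short sketch. It assumes the commonly used formulation from Wang et al.'s work, in which a disease contributes 1 to its own DAG, an ancestor \(t\) contributes \(\Delta\) times the largest contribution among its children in the DAG, the semantic value is the sum of all contributions, and the similarity is the summed contributions of shared ancestors divided by the sum of the two semantic values. The function names, the `parents` dictionary format and the toy DAG are hypothetical, and the second (frequency-based) measure is not covered.

```python
def semantic_contributions(parents, disease, delta=0.5):
    """Return {term: contribution} for `disease` and all of its ancestors.

    `parents` maps each term to the list of its direct parents in the DAG.
    Contributions are propagated upwards level by level; a term keeps the
    largest value reached over all paths from the disease.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        nxt = []
        for node in frontier:
            for p in parents.get(node, []):
                value = delta * contrib[node]
                if value > contrib.get(p, 0.0):
                    contrib[p] = value
                    nxt.append(p)
        frontier = nxt
    return contrib

def semantic_value(contrib):
    """Sum of the contributions of a disease and all of its ancestors."""
    return sum(contrib.values())

def wang_similarity(parents, d_i, d_j, delta=0.5):
    """Shared-ancestor contributions divided by the two semantic values."""
    c_i = semantic_contributions(parents, d_i, delta)
    c_j = semantic_contributions(parents, d_j, delta)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (semantic_value(c_i) + semantic_value(c_j))

# Toy DAG (hypothetical terms): A and B are both children of "root".
toy_parents = {"A": ["root"], "B": ["root"], "root": []}
sim_ab = wang_similarity(toy_parents, "A", "B")
```

In the toy DAG each disease has semantic value 1.5 and the only shared term is the root with contribution 0.5 on each side, so the similarity is 1.0 / 3.0.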
MiRNA integrated similarity network structure
According to Wang et al.’s study, miRNAs with similar functions are often associated with diseases with similar semantics [42]. We calculated miRNA similarity by merging miRNA functional similarity \(FS\) and Gaussian interaction profile kernel similarity \(GM\) as follows:
where \(FS\left({r}_{i},{r}_{j}\right)\) (\(i, j\in \left[{1,495}\right]\)) represents the miRNA functional similarity between \({r}_{i}\) and \({r}_{j}\), and \(GM\left({r}_{i},{r}_{j}\right)\) represents the Gaussian interaction profile kernel similarity of miRNAs \({r}_{i}\) and \({r}_{j}\). Following Wang’s work, the miRNA functional similarity \(FS\left({r}_{i},{r}_{j}\right)\) is downloaded from their study.
Besides, Zhao et al. calculated the Gaussian interaction profile kernel similarity between miRNA \({r}_{i}\) and miRNA \({r}_{j}\) as follows [16]:
where \(IV({r}_{i})\) and \(IV\left({r}_{j}\right)\) are the \(i\)th and \(j\)th rows of matrix \(A\), respectively. Parameter \({\alpha }_{r}\) controls the kernel bandwidth as follows:
where initial kernel bandwidth parameter \({\alpha }_{r0}\) is set to 1.
Finally, we obtain the miRNA integrated similarity network \(SM\) according to Formula (8).
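The Gaussian interaction profile kernel can be sketched in a few lines. The code assumes the widely used form \(GM\left({r}_{i},{r}_{j}\right)=\exp \left(-{\alpha }_{r}{\Vert IV\left({r}_{i}\right)-IV\left({r}_{j}\right)\Vert }^{2}\right)\), with \({\alpha }_{r}\) equal to \({\alpha }_{r0}\) divided by the mean squared norm of the rows of \(A\); whether this matches the cited definition in every detail is an assumption, and the function name is hypothetical.

```python
import numpy as np

def gaussian_profile_kernel(A, alpha0=1.0):
    """Gaussian interaction profile kernel between the rows of A."""
    A = np.asarray(A, dtype=float)
    sq_norms = (A ** 2).sum(axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("A contains no associations")
    alpha = alpha0 / mean_sq_norm                    # normalized bandwidth
    # Squared Euclidean distances between all pairs of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A @ A.T)
    sq_dist = np.maximum(sq_dist, 0.0)               # guard against round-off
    return np.exp(-alpha * sq_dist)

# Toy association matrix: 3 miRNAs x 3 diseases (not HMDD data).
toy = np.array([[1, 0, 1],
                [1, 0, 1],
                [0, 1, 0]])
GM_toy = gaussian_profile_kernel(toy)
```

Passing `A.T` instead of `A` yields the corresponding kernel between diseases, since disease interaction profiles are the columns of \(A\).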
To sum up, we combine the miRNA-disease association network, the miRNA integrated similarity network and the disease integrated similarity network to construct the miRNA-disease heterogeneous network. We define the miRNA-disease heterogeneous network as an undirected graph G = (V, E), including miRNAs (\(M\)) and diseases (\(D\)). V is composed of miRNA and disease nodes. E is an edge set containing three edge types: \(M\to D\) or \(D\to M\) indicates that a miRNA is associated with a disease, \(M\to M\) indicates that two miRNA nodes are similar, and \(D\to D\) indicates that two disease nodes are similar.
Meta-path instance extraction from the miRNA-disease heterogeneous network
There are one or multiple paths between a miRNA and a related disease in the miRNA-disease heterogeneous network. Meta-paths are the indirect and composite connections between a miRNA and a disease, which help to capture the information and complicated structure in miRNA-disease associations. A confirmed miRNA-disease association can have several different meta-path instances. For convenience, we explain meta-path instances below.
First, we define a meta-path \(P\) of L-Length as a sequence of the form \(m\to {N}_{1}\to \cdots {N}_{i}\to \cdots d\), where \(m\) and \(d\) come from a verified miRNA-disease pair in HMDD V2.0 and \({N}_{i}\in \left\{M,D\right\}\). Different types of meta-path can help explain the reason why two nodes are closely related to each other, because the paths from one node to another can be of multiple types, which give the paths different semantics. For example, a meta-path of type \(D\to D\to {\rm{M}}\) indicates that if a disease is associated with a miRNA, then another disease that is similar to this disease is potentially associated with the miRNA. A meta-path of type \(D\to M\to {\rm{M}}\) indicates that if a miRNA is associated with a disease, then another miRNA that is similar to this miRNA is potentially associated with the disease. There are different meta-path instances of L-Length between the identified \(m\) and \(d\), as shown in Fig. 10. For example, the confirmed pair \({m}_{2}\) and \({d}_{2}\) has instances of different lengths: the meta-path instance \({P}_{7}={m}_{2}\to {m}_{2}\to {d}_{3}\to {d}_{2}\) is 3-Length and \({P}_{2}={m}_{4}\to {d}_{1}\to {d}_{4}\) is 2-Length.
Finally, all meta-path instances of the confirmed miRNA-disease pairs in the network are extracted.
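The extraction step can be pictured as a bounded depth-first search. The sketch below enumerates simple paths from a miRNA node to a disease node with at most `max_len` edges. It rests on assumptions not stated in the text: the graph is given as a neighbor dictionary, a path ends the first time it reaches the target, nodes are not revisited, and node names start with "m" or "d" so that the instance type can be read from the first letter.

```python
def metapath_instances(neighbors, source, target, max_len=3):
    """Enumerate simple paths from `source` to `target` with at most `max_len` edges."""
    results = []

    def extend(path):
        node = path[-1]
        if node == target:
            results.append(tuple(path))
            return
        if len(path) - 1 == max_len:      # edge budget used up
            return
        for nxt in neighbors.get(node, ()):
            if nxt not in path:           # keep the path simple
                path.append(nxt)
                extend(path)
                path.pop()

    extend([source])
    return results

def instance_type(path):
    """Map an instance such as ('m1', 'd1', 'd2') to its type 'M->D->D'."""
    return "->".join(node[0].upper() for node in path)

# Toy heterogeneous network (hypothetical): association and similarity edges mixed.
toy_neighbors = {
    "m1": ["m2", "d1"],
    "m2": ["m1", "d1", "d2"],
    "d1": ["m1", "m2", "d2"],
    "d2": ["m2", "d1"],
}
instances = metapath_instances(toy_neighbors, "m1", "d2", max_len=3)
```

If the source and target are directly connected, the 1-Length path is returned as well; whether the direct association edge should be hidden during training is a design decision that this sketch leaves open.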
Pair embedding of MiRNA-disease
Linear transformations of MiRNAs and diseases
We take the \(i\)th row in the miRNA similarity matrix \(SM\) as the initial features of the \(i\)th miRNA. In the same way, we regard the jth row in the disease similarity matrix \(SD\) as the feature of the \(j\)th disease. Then, the initial features of miRNAs and diseases projected into the same vector with linear transformations because of the difference of dimensions.
We project the feature of a miRNA \(r\) into the \(Z\)dimensional space as follows:
Similarly, the initial feature of disease d is projected into the \(Z\)dimensional space as follows:
where \({{\varvec{h}}}_{r}\) and \({{\varvec{h}}}_{d}\) are the projected features of miRNA \(r\) and disease \(d\), respectively, and \({{\varvec{x}}}_{r}\) and \({{\varvec{x}}}_{d}\) are their initial features. \({{\varvec{W}}}^{R}\in {\mathbb{R}}^{Z\times m}\) is a linear transformation matrix that projects the 495-dimensional miRNA feature into the \(Z\)-dimensional space, and \({{\varvec{W}}}^{D}\in {\mathbb{R}}^{Z\times n}\) is a linear transformation matrix that projects the 383-dimensional disease feature into the \(Z\)-dimensional space.
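A minimal sketch of the two projections, assuming a bias-free matrix-vector product, may make the shapes concrete. The dimensions 495 and 383 follow the text; the value `Z = 64`, the random initialisation, and the helper names `matvec` and `init_matrix` are illustrative assumptions, and the random vectors only stand in for rows of \(SM\) and \(SD\).

```python
import random

def matvec(W, x):
    """Multiply a list-of-rows matrix W by a vector x."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def init_matrix(rows, cols, rng, scale=0.1):
    """Random matrix with entries in [-scale, scale] (illustrative initialisation)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
m, n = 495, 383          # feature lengths stated in the text
Z = 64                   # assumption: the shared embedding size is not given here

W_R = init_matrix(Z, m, rng)   # Z x m projection for miRNAs
W_D = init_matrix(Z, n, rng)   # Z x n projection for diseases

x_r = [rng.random() for _ in range(m)]   # stand-in for one row of SM
x_d = [rng.random() for _ in range(n)]   # stand-in for one row of SD

h_r = matvec(W_R, x_r)
h_d = matvec(W_D, x_d)
print(len(h_r), len(h_d))
```

After the projection both vectors have length `Z`, so they can be compared, multiplied element-wise, or placed in the same metapath sequence.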
In Fig. 9, the nodes with shadow are the transformed representation of the initial miRNA and disease.
MLP encoder of miRNA-disease interactions
Given a miRNA embedding \({{\varvec{h}}}_{r} \in {\mathbb{R}}^{Z}\) and a disease embedding \({{\varvec{h}}}_{d}\in {\mathbb{R}}^{Z}\), we combine them as \(Com\left({{\varvec{h}}}_{{\varvec{r}}},{{\varvec{h}}}_{d}\right)\in {\mathbb{R}}^{4Z}\) and use an \(m\)-layer multilayer perceptron (MLP) to embed the miRNA-disease interaction (\({{\varvec{h}}}_{r},{{\varvec{h}}}_{d}\)) into a \(Z\)-dimensional vector. The pair embedder is denoted \({\varvec{g}}\left(r,d\right)\). First, the miRNA embedding and the disease embedding are combined to form the initial input of the MLP.
where \(\circ\) denotes element-wise vector multiplication, \(ReLU\)(\(x\)) denotes \(max\left(0,x\right)\), and \({\varvec{g}}\left({\varvec{r}},{\varvec{d}}\right)\in {\mathbb{R}}^{X}\). We apply dropout to the hidden layers and take the output of the last MLP layer as the pair embedding. We implement g(·) as a 2-layer MLP in which each layer has 100 hidden units.
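The pair embedder can be sketched as follows. The text states only that the combined input lies in \({\mathbb{R}}^{4Z}\) and involves an element-wise product, so the concatenation used in `combine` (both embeddings, their product, and their difference) is an assumption and may differ from the authors' definition. Applying ReLU after the last layer and omitting dropout are also simplifications of ours; `Z = 16` and all function names are hypothetical.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def affine(W, b, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * a for w, a in zip(row, x)) + bias for row, bias in zip(W, b)]

def combine(h_r, h_d):
    # ASSUMPTION: the four blocks are h_r, h_d, h_r * h_d and h_r - h_d.
    # The source only fixes the output size (4Z) and the use of an element-wise product.
    prod = [a * b for a, b in zip(h_r, h_d)]
    diff = [a - b for a, b in zip(h_r, h_d)]
    return h_r + h_d + prod + diff

def init_layer(n_out, n_in, rng):
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def pair_embedding(h_r, h_d, layers):
    """Feed the combined vector through the MLP; the last output is the pair embedding."""
    x = combine(h_r, h_d)
    for W, b in layers:
        x = relu(affine(W, b, x))   # dropout omitted for brevity
    return x

rng = random.Random(0)
Z, hidden = 16, 100                 # 100 hidden units per layer, as stated in the text
layers = [init_layer(hidden, 4 * Z, rng), init_layer(hidden, hidden, rng)]
h_r = [rng.uniform(-1, 1) for _ in range(Z)]
h_d = [rng.uniform(-1, 1) for _ in range(Z)]
g = pair_embedding(h_r, h_d, layers)
print(len(combine(h_r, h_d)), len(g))
```

Under this reading the pair embedding has as many entries as the last hidden layer.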
Validity of pair embedding
Recall that one limitation of node embedding is that it inadvertently makes a miRNA and a disease similar to each other if they frequently appear together within metapaths, whether or not the miRNA is associated with the disease. We therefore introduce a pair validity classifier \(\pi\): \({\mathbb{R}}^{X}\to {\mathbb{R}}\) to discriminate whether a miRNA-disease pair is valid, which is trained with the binary cross-entropy loss as follows:
\(\pi (\cdot )\) is a 2-layer MLP with ReLU activation.
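The binary cross-entropy objective for the validity classifier can be sketched as below. The scalar scores stand in for the outputs of \(\pi\) on pair embeddings and are invented for illustration; how invalid (negative) pairs are sampled is not described in this section, so the labels here are simply given. The function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def binary_cross_entropy(scores, labels, eps=1e-12):
    """Mean BCE between sigmoid(score) and labels in {0, 1}."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = min(max(sigmoid(s), eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)

# Hypothetical classifier outputs for two valid pairs and two invalid pairs.
scores = [2.3, 0.7, -1.5, -0.2]
labels = [1, 1, 0, 0]
print(round(binary_cross_entropy(scores, labels), 4))
```

The loss is small when valid pairs receive large positive scores and invalid pairs receive large negative scores.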
Metapath-based node embedding
Multi-head attention embedding of metapath
Metapaths are formed by a series of linked nodes and can be used to preserve the important structural information in heterogeneous networks. For a metapath instance \(p\) connecting the confirmed miRNA \(r\) with disease \(d\), the measurable features of the connection are implied in the sequence of \(p\). The sequence of \(p\) is represented as \(\{{{\varvec{X}}}_{1},{{\varvec{X}}}_{2},\cdots ,{{\varvec{X}}}_{n-1},{{\varvec{X}}}_{n}\}\), where \({{\varvec{X}}}_{1}={{\varvec{h}}}_{r}\) and \({{\varvec{X}}}_{n}={{\varvec{h}}}_{d}\). Because different nodes contribute differently to a metapath, we use a GRU, which is suited to sequential data, to learn which nodes are important to the sequence. The GRU generates a \(Z\)-dimensional vector for \(p\). It calculates the hidden state \({{\varvec{h}}}_{t}\) with \({{\varvec{h}}}_{t-1}\) and \({{\varvec{X}}}_{t}\) as input, \(t\in [1,n]\), as follows.
where \({\varvec{\sigma}}\) is a sigmoid function, and \({{\varvec{W}}}_{{\varvec{z}}{\varvec{x}}}{\in {\mathbb{R}}}^{X\times Z}\), \({{\varvec{W}}}_{{\varvec{r}}{\varvec{x}}}{\in {\mathbb{R}}}^{X\times Z}\), \({{\varvec{W}}}_{{\varvec{h}}{\varvec{x}}}{\in {\mathbb{R}}}^{X\times Z}\), \({{\varvec{W}}}_{{\varvec{z}}{\varvec{h}}}\in {\mathbb{R}}^{X\times X}\), \({{\varvec{W}}}_{{\varvec{r}}{\varvec{h}}}\in {\mathbb{R}}^{X\times X}\), \({{\varvec{W}}}_{{\varvec{h}}{\varvec{h}}}\in {\mathbb{R}}^{X\times X}\), \({{\varvec{b}}}_{{\varvec{z}}}\in {\mathbb{R}}^{X},{{\varvec{b}}}_{{\varvec{r}}}\in {\mathbb{R}}^{X},{{\varvec{b}}}_{{\varvec{h}}}\in {\mathbb{R}}^{X}\).
We apply dropout to the hidden-state update vector \({g}_{t}\) as follows:
where \({\varvec{d}}(\cdot )\) is the dropout function defined as follows:
where \(q\) is the dropout rate and \({\varvec{m}}{\varvec{a}}{\varvec{s}}{\varvec{k}}\) is a vector sampled from the Bernoulli distribution with success probability \(1-q\).
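A sketch of one way to realise this recurrence is given below. It uses the standard GRU gate equations with the weight names and shapes from the text (`W_zx` of size X by Z, `W_zh` of size X by X, and so on), and applies dropout only to the update vector \(g_t\). The exact gate convention, the form of the dropout function outside training (scaling by \(1-q\)), the zero initial state, and all function names are assumptions rather than the authors' definitions.

```python
import math
import random

def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def dropout(v, q, rng, training):
    """Assumed form: Bernoulli(1 - q) mask in training, scaling by (1 - q) otherwise."""
    if training:
        return [a if rng.random() < 1.0 - q else 0.0 for a in v]
    return [(1.0 - q) * a for a in v]

def gru_step(x_t, h_prev, p, q, rng, training):
    z = sigmoid(add(matvec(p["W_zx"], x_t), matvec(p["W_zh"], h_prev), p["b_z"]))
    r = sigmoid(add(matvec(p["W_rx"], x_t), matvec(p["W_rh"], h_prev), p["b_r"]))
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    pre = add(matvec(p["W_hx"], x_t), matvec(p["W_hh"], gated), p["b_h"])
    g = dropout([math.tanh(a) for a in pre], q, rng, training)  # dropout on g_t only
    return [(1.0 - zi) * hi + zi * gi for zi, hi, gi in zip(z, h_prev, g)]

def init_params(Z, X, rng):
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for gate in "zrh":
        p["W_%sx" % gate] = mat(X, Z)    # input weights
        p["W_%sh" % gate] = mat(X, X)    # recurrent weights
        p["b_%s" % gate] = [0.0] * X
    return p

def encode_metapath(sequence, p, X, q, rng, training):
    """Run the GRU over X_1..X_n and return every hidden state."""
    h = [0.0] * X
    states = []
    for x_t in sequence:
        h = gru_step(x_t, h, p, q, rng, training)
        states.append(h)
    return states

rng = random.Random(1)
Z, X = 8, 8                               # illustrative sizes
params = init_params(Z, X, rng)
path = [[rng.uniform(-1, 1) for _ in range(Z)] for _ in range(4)]   # a 3-length instance
states = encode_metapath(path, params, X, q=0.5, rng=rng, training=True)
print(len(states), len(states[0]))
```

Masking only the update vector leaves the carried state \(h_{t-1}\) untouched, so information from earlier nodes in the metapath is not erased by dropout.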
We obtain an embedding matrix \({\varvec{h}}\in {\mathbb{R}}^{n\times Z}\) after GRU training on metapath instance \(p\). A \(Z\)-dimensional vector is extracted by aggregating \({\varvec{h}}\) with attentive pooling. The contribution of each node in the metapath instance is measured as follows:
where \({\varvec{M}}\in {\mathbb{R}}^{Z}\) is a trained attention parameter vector, \(i\in \left[1,n\right],j\in [1,n]\).
The extracted vector is formed by a weighted sum of the vectors from the matrix \({\varvec{h}}\) as follows:
To stabilize the learning of the attention parameters, we extend the attention mechanism to multi-head attention: we conduct attention \(K\) times independently and average the outputs as follows:
where ΣΣ indicates concatenation, and \({\varvec{\alpha }}_{{\varvec{i}}}^{{\varvec{k}}}\) are the normalized attention coefficients of the \(k\)th attention head.
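Attentive pooling with averaged heads can be sketched as follows. The score of each node is taken here as the dot product between the attention vector and the node's hidden state, followed by a softmax over the nodes of the instance; this scoring function is an assumption, since the original formula is not reproduced in this section. The function names and the toy inputs are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(H, M):
    """Pool the rows of H (n x Z) into one Z-vector using attention vector M."""
    scores = [sum(m * a for m, a in zip(M, h)) for h in H]   # assumed score: M . h_i
    alpha = softmax(scores)
    pooled = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(len(M))]
    return pooled, alpha

def multihead_pool(H, heads):
    """Run K independent attention heads and average their pooled outputs."""
    outputs = [attentive_pool(H, M)[0] for M in heads]
    K = len(heads)
    return [sum(out[k] for out in outputs) / K for k in range(len(outputs[0]))]

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # hidden states of a 3-node toy sequence
heads = [[0.5, -0.5], [-0.5, 0.5]]             # K = 2 hypothetical attention vectors
pooled, alpha = attentive_pool(H, heads[0])
print([round(a, 3) for a in alpha])
print([round(v, 3) for v in multihead_pool(H, heads)])
```

Averaging keeps the pooled vector at length `Z` regardless of `K`, whereas concatenating the heads would give a vector of length `K * Z`.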
Attention-aware fusion of multiple metapath instances to represent miRNA-disease associations
The metapath instances connecting the confirmed miRNA \(r\) and disease \(d\) may have different lengths. Metapath instances with the same length, which we call a metapath type, still contribute differently to the connection between \({r}_{i}\) and \({d}_{j}\) because the nodes in their sequences differ. For example, \({m}_{2}\to {m}_{4}\to {d}_{3}\to {d}_{4}\) and \(m\to {m}_{4}\to {d}_{1}\to {d}_{4}\), listed in Fig. 10, have the same length, but the related information carried by the two metapath instances is not the same. To merge the global information of different metapath instances with the same length into a representation of the connection between \(r\) and \(d\), we use an attention mechanism.
where \({{{\varvec{a}}{\varvec{t}}{\varvec{t}}}_{{\varvec{p}}}\in {\mathbb{R}}}^{Z}\) is the attention parameter for metapath instance \(p\), \({{\varvec{e}}}^{p}\) indicates the contribution of metapath instance \(p\) to the connection between \({r}_{i}\) and \({d}_{j}\), and \({({{\varvec{e}}}^{{^{\prime}}})}^{{\varvec{p}}}\) is \({{\varvec{e}}}^{p}\) normalized with the softmax function over all metapath instances of metapath type \(P\). For all \(p\in P\), the comprehensive representation of the connection between \({r}_{i}\) and \({d}_{j}\) is obtained as the weighted sum of all metapath instances, as shown in Formula (29).
Attention-aware fusion of multiple metapaths to represent miRNA-disease associations
We denote the metapath types as \({P}_{i}\), \(i\in \left[1,N\right]\), and the feature of the confirmed miRNA \({r}_{i}\) and disease \({d}_{j}\) association under metapath type \({P}_{i}\) as \({{\varvec{h}}}^{{P}_{i}}{\in {\mathbb{R}}}^{Z}\). Since metapaths of different types and lengths contribute differently, an attention mechanism is employed to obtain the ultimate representation.
where \({{\varvec{a}}{\varvec{t}}{\varvec{t}}}_{{P}_{i}}{\in {\mathbb{R}}}^{Z}\) is the attention parameter for the metapath type (path length) \({P}_{i}\), \({{\varvec{w}}}^{{P}_{i}}\) indicates the contribution of metapath type \({P}_{i}\) to the connection, and \({{\varvec{w}}{^{\prime}}}^{{P}_{i}}\) is \({{\varvec{w}}}^{{P}_{i}}\) normalized with the softmax function over all metapath types. Thus, \({{\varvec{h}}}_{r,d}^{p}{\in {\mathbb{R}}}^{Z}\) represents all metapaths with path-length attention.
Finally, the representations of miRNA \(r\) and disease \(d\) interactions, carrying the significant information of the metapaths, are modeled by the above-mentioned mechanisms.
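The two fusion levels described in the last two subsections can be sketched together: instance vectors are first fused within each metapath type, and the resulting type vectors are then fused into one representation of the pair. The sketch assumes a single shared attention vector per level and a dot-product score followed by a softmax; the text may intend separate parameters per instance or per type. The type labels, vectors, and function names are hypothetical.

```python
import math
from collections import defaultdict

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors, att):
    """Softmax-weighted sum of vectors, scored by the dot product with att (assumed)."""
    weights = softmax([sum(a * v for a, v in zip(att, vec)) for vec in vectors])
    size = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(size)]

def fuse_metapaths(instances, att_instance, att_type):
    """instances: list of (metapath_type, Z-vector) for one miRNA-disease pair."""
    by_type = defaultdict(list)
    for type_name, vec in instances:
        by_type[type_name].append(vec)
    # Level 1: fuse the instances that share a metapath type.
    type_vectors = [attend(vecs, att_instance) for _, vecs in sorted(by_type.items())]
    # Level 2: fuse the per-type vectors into the final representation.
    return attend(type_vectors, att_type)

instances = [
    ("2-length", [0.0, 4.0]),
    ("3-length", [1.0, 0.0]),
    ("3-length", [3.0, 0.0]),
]
fused = fuse_metapaths(instances, att_instance=[0.0, 0.0], att_type=[0.0, 0.0])
print(fused)
```

With all-zero attention vectors every weight is uniform, so the result is the mean of the per-type means; trained attention vectors would shift weight toward the more informative instances and types.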
Predicting MiRNA-disease associations with combined embedding
Finally, we obtain the ultimate representation of the miRNA and disease, \({{\varvec{h}}}_{u}^{P}\), which includes the total information of miRNA-disease associations. The parameters \({{\varvec{W}}}^{R}{,{\varvec{W}}}^{D}{,{\varvec{a}}{\varvec{t}}{\varvec{t}}}_{p}\) and \({{\varvec{a}}{\varvec{t}}{\varvec{t}}}_{pi}\) are trained to make the learned features as accurate as possible. The primary purpose of training our model is to make the distance between two related nodes in the miRNA-disease heterogeneous network as small as possible. Meanwhile, we want the pair embedding and the metapath-based node embedding to be similar. Hence, we predict miRNA-disease associations with the combined embedding.
We obtain the cross-entropy loss for the metapath-based node embedding as follows:
where \(\mathcal{P}\) is the set of positive pairs with the supported relationships. The parameters can be learned by minimizing the following loss function. We combine the above two loss functions to obtain the ultimate loss function as follows:
\({Loss}_{Reg}\) is the regularization term that prevents overfitting. We analyzed the AUC for values of \(\lambda\) from 0 to 1 at intervals of 0.1. CEMDA achieved the best result when \(\lambda\) was set to 0.5. Thus, we set \(\lambda\) to 0.5.
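The selection of \(\lambda\) can be sketched as a grid search scored by AUC. The exact way the two losses are combined is not reproduced in this section, so the convex combination in `combined_loss` is an assumption. The `evaluate` argument is a hypothetical callable that would train the model for a given \(\lambda\) and return its AUC; the lambda used in the example is a dummy stand-in, not a real experiment.

```python
def combined_loss(loss_pair, loss_node, loss_reg, lam):
    # ASSUMPTION: a convex combination of the two losses plus the regulariser.
    return lam * loss_pair + (1.0 - lam) * loss_node + loss_reg

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def lambda_grid():
    """Values 0.0, 0.1, ..., 1.0 as described in the text."""
    return [round(0.1 * k, 1) for k in range(11)]

def select_lambda(evaluate):
    """Return the grid value with the highest score from the supplied callable."""
    return max(lambda_grid(), key=evaluate)

print(lambda_grid())
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
# Dummy evaluation curve peaking at 0.5, used only to exercise select_lambda.
print(select_lambda(lambda lam: 1.0 - (lam - 0.5) ** 2))
```

In practice `evaluate` would wrap a full training and cross-validation run, so each grid point is costly and the 0.1 step keeps the search to eleven runs.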
Availability of data and materials
The datasets that support the findings of this study are available in https://github.com/liubailong/CEMDA. A web service for CEMDA is available at http://132.232.17.50:8080/CEMDA.jsp
Abbreviations
CEMDA: Combined embedding model to predict MiRNA-disease associations
GRU: Gate recurrent unit
MLP: Multilayer perceptron
LOOCV: Global leave-one-out cross validation
miRNAs: Microribonucleic acids
MeSH: Medical subject headings
Acknowledgements
We thank the editor and the anonymous reviewers for their comments and suggestions.
Funding
This work was supported in part by “The DoubleFirstRate Special Fund for Construction of China University of Mining and Technology, No. 2018ZZCX14.” The funder had no role in study design, data collection and preparation of the manuscript.
Author information
Contributions
BL and LZ conceived the prediction method, implemented the experiments, conducted the experimental result analysis, and wrote the paper. XZ and ZL1 gathered data and performed experiments. XZ and ZL2 revised the paper. All authors have read and approved the final paper.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no potential conflicts of interest with respect to the research, authorship, and publication of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1. Supplementary tables for case studies.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Liu, B., Zhu, X., Zhang, L. et al. Combined embedding model for MiRNAdisease association prediction. BMC Bioinformatics 22, 161 (2021). https://doi.org/10.1186/s1285902104092w
DOI: https://doi.org/10.1186/s1285902104092w
Keywords
 MiRNA and disease interactions
 Metapath
 Pair embedding
 Node embedding
 Combined embedding