Computational method using heterogeneous graph convolutional network model combined with reinforcement layer for MiRNA–disease association prediction

Abstract

Background

A large body of evidence from biological experiments has confirmed that miRNAs play an important role in the progression and development of various complex human diseases. However, traditional experimental methods are expensive and time-consuming. Developing more accurate and efficient computational methods for predicting potential miRNA–disease associations therefore remains a challenging task.

Results

In this study, we developed a computational model that combines a heterogeneous graph convolutional network with an enhanced layer for miRNA–disease association prediction (HGCNELMDA). The major improvements of our method are that random walk with restart is used to optimize the original node features, and that a reinforcement layer added to the hidden layer of the graph convolutional network retains the similarity information between nodes in the feature space. In addition, the proposed approach recalculates the influence of neighborhood nodes on target nodes by introducing an attention mechanism. HGCNELMDA achieved an AUC of 93.47% in global leave-one-out cross-validation (LOOCV) and an average AUC of 93.01% in fivefold cross-validation. We also compared HGCNELMDA with state‑of‑the‑art methods. The comparative results indicate that HGCNELMDA is very promising and may provide a cost‑effective alternative for miRNA–disease association prediction. Moreover, we applied HGCNELMDA to three case studies to predict potential miRNAs related to lung cancer, prostate cancer, and pancreatic cancer. The results showed that 48, 50, and 50 of the top 50 predicted miRNAs, respectively, were supported by experimental evidence. Therefore, HGCNELMDA is a reliable method for predicting disease-related miRNAs.

Conclusions

HGCNELMDA achieved AUCs of 93.47% and 93.01% in LOOCV and fivefold cross-validation, respectively, which is higher than the other typical methods compared. Three case studies on lung cancer, prostate cancer, and pancreatic cancer were carried out. Among the top 50 predicted candidate miRNAs, 48, 50, and 50, respectively, were verified in the biological database HMDD v2.0. This further confirms the feasibility and effectiveness of our method. To facilitate future research on disease-related miRNAs, we developed a freely available web server, HGCNELMDA, at http://124.221.62.44:8080/HGCNELMDA.jsp.

Background

miRNA is a kind of non-coding RNA, approximately 20–24 nucleotides in length, that has regulatory properties and is highly conserved in evolution. Researchers studying miRNA [1] have found that it plays a vital role in biological processes such as cell growth, proliferation, metabolism, differentiation and apoptosis. Moreover, abnormal miRNA expression has been shown to be closely related to diseases such as chronic lymphocytic leukemia, tumors, gastric cancer and cardiomyopathy. Identifying associations between miRNAs and diseases has therefore become a critical step in biological research [2]. However, traditional biological experiments are time-consuming, costly and to some extent undirected, all of which slows research. Many researchers are therefore devoted to designing computational methods that discover associations between miRNAs and diseases, to make up for the shortcomings of traditional experimental approaches [3].

Researchers have established a series of effective computational models for miRNA–disease association prediction, which can be roughly divided into two categories: similarity measurement-based and machine learning-based. Similarity measurement-based methods [4] predict miRNA–disease associations by measuring the degree of similarity between nodes with different statistical methods. Machine learning-based methods train models on extracted features and then predict miRNA–disease associations with the trained models. The two categories have different theoretical bases and innovations, and both have made outstanding contributions to the field. For example, Jiang et al. [] determined the functional correlation of two miRNAs by calculating the number of common neighbors and the shortest path length between them. Jiang et al. were the first to combine disease phenotype information with miRNA function information to predict miRNA–disease associations [5], contributing significantly to later research. Subsequently, for each disease to be predicted, they designed a hypergeometric distribution-based scoring system [6] to score all of the miRNAs associated with it. However, this method has limitations: because only the direct neighbors of a miRNA were considered when scoring miRNA functional similarity, the prediction performance was limited. To increase the accuracy of miRNA–disease association prediction, Xuan et al. [7] proposed a weighted k-nearest neighbor method (HDMP). They suggested that members of the same miRNA family may be involved in diseases with related phenotypes. According to the association state of the nearest neighbors [8], members of the same miRNA family or miRNA cluster are given more weight, which improves the prediction performance of the model to some extent. However, it is difficult to manually select the optimal parameter k for the members of each miRNA family and miRNA cluster [9], and this method cannot make predictions for new diseases that have no known miRNA associations. Chen et al. developed the computational framework RWRMDA, which performs a random walk on the miRNA network to predict novel disease-related miRNAs. They first assigned initial probability values on a pre-constructed miRNA functional similarity network (MFSN) and then ran the random walk algorithm. In summary, this model integrated miRNA functional similarity and known miRNA–disease associations to infer novel disease-related miRNAs. Pasquier et al. [10] formed a high-dimensional matrix from miRNA–disease association, miRNA–target association, miRNA–word association, miRNA family association and miRNA neighbor association data. By decomposing this matrix with singular value decomposition, they obtained miRNA vectors and disease vectors [11], and took the cosine distance between a miRNA vector and a disease vector as the degree of association between the two nodes. However, because miRNA–target data contain false positives and false negatives, the model's prediction performance is affected to a certain extent. In WBSMDA [12], the authors integrated a comprehensive similarity score between miRNAs and diseases based on the Gaussian interaction profile kernel. WBSMDA is applicable to new miRNAs without known disease associations and to diseases without known miRNA associations, thereby overcoming a limitation of previous prediction models.

In addition to similarity-based approaches, machine learning algorithms are also an essential way to explore potential miRNA–disease associations. Unlike methods that directly calculate the similarity between nodes in the network, machine learning approaches [13] extract inherent features and design practical classification algorithms to find miRNA–disease associations. In an early machine learning-based method, Jiang et al. [14] first extracted feature vectors from disease similarity and miRNA functional similarity. Because negative instances are missing from the actual data set, they then randomly selected 270 samples from the unknown miRNA–disease pairs as negative data [15]. Finally, they chose a support vector machine (SVM) as the classifier [16]. However, randomly selecting negative samples affects the model's accuracy. Chen et al. [17] took a different approach and constructed a semi-supervised classifier based on regularized least squares. Although this model does not require negative samples and can estimate the likelihood of unknown associations, it also has limitations: the fused miRNA-based and disease-based prediction results depend strongly on parameters [18], and the optimal parameters are difficult to choose. Chen et al. [19] proposed the DRMDA method, which uses stacked autoencoders for feature extraction to obtain low-dimensional, high-resolution feature vectors and then uses an SVM to score candidate miRNAs. This method eliminated much of the noise in the unprocessed similarity data and achieved good performance. Graph neural networks have attracted extensive attention from researchers due to their high precision. Li et al. presented the MCMDA model, which exploited known miRNA–disease associations to build a binary adjacency matrix and implemented a singular value thresholding (SVT) algorithm to extract miRNA–disease associations [20]. Pasquier et al. [21] assumed that information attached to miRNAs and diseases can be revealed by distributional semantics. Their approach represented distributional information on miRNAs and diseases in a high-dimensional vector space and defined associations between miRNAs and diseases in terms of vector similarity. Chen et al. proposed an ensemble of decision tree-based model for miRNA–disease association prediction (EDTMDA), which applied principal component analysis (PCA) for dimensionality reduction and ensemble learning to predict disease-related miRNAs [22]. Ha et al. [23] focused on inferring miRNA–disease associations by exploiting distance metric learning on a miRNA–disease bipartite graph constructed from the known miRNA–disease associations.

Biological information networks such as disease and miRNA networks have complex topological structures, so they are well suited to graph modelling [24]. For graph data, graph convolutional networks (GCNs) perform well on tasks such as classification. Researchers have therefore been trying to apply GCNs to heterogeneous networks to predict miRNA–disease associations [25]. For example, Li et al. [26] extracted node features from the protein–protein interaction network using the Node2VEC algorithm and fed them into a graph convolutional network. Finally, each node was embedded by the graph convolutional layer, and the miRNA–disease association was obtained by multiplying the miRNA–gene adjacency matrix by the disease–gene adjacency matrix [27]. This method provides a new perspective on miRNA–disease association prediction. Li et al. [28] then proposed the FCGCNMDA method based on a fully connected graph. They aggregated node features with two graph convolution layers on the miRNA functional similarity network and the disease semantic similarity network to make end-to-end predictions [29]. However, the GCN model treats all neighbors equally, and the similarity information of nodes cannot be retained when learning node embeddings. Li et al. [30] presented the MCMDA model, which exploited known miRNA–disease associations to build a binary adjacency matrix and implemented a singular value thresholding (SVT) algorithm to extract miRNA–disease associations. However, the difficulty of choosing the best parameters of the algorithm restricted further improvement in prediction accuracy. Ha et al. [31] focused on inferring miRNA–disease associations by exploiting distance metric learning on a miRNA–disease bipartite graph constructed from the known miRNA–disease associations.

Although existing methods perform well in predicting miRNA–disease associations, some aspects can still be improved. On the one hand, some methods [32] inevitably introduce data noise during feature extraction, which affects prediction performance. On the other hand, some graph convolution methods [33] fail to retain the similarity information of nodes, so similar nodes do not obtain similar feature representations in the feature space and the spatial node features of the topology graph are not enhanced [34]. This paper proposes HGCNELMDA (heterogeneous graph convolutional network model with enhanced layer to predict miRNA–disease associations), a heterogeneous graph convolutional network model with a reinforcement layer that extracts node features at the graph level. To reduce the data noise of the similarity matrix calculation, random walk with restart is used to obtain the original features of nodes from the similarity matrices. Graph convolution aggregates node information according to edge information and represents new node features. Previous graph convolution models treat all neighbors equally and are therefore unable to retain node similarity information when learning node embeddings. The enhancement layer added to the GCN (graph convolutional network) hidden layer strengthens the similar representation of similar nodes (miRNAs or diseases) in the feature space and enhances the feature vector aggregation of similar nodes, so that similarity information between nodes is retained. First, we constructed a miRNA–disease heterogeneous network based on experimentally verified miRNA–disease associations, disease semantic similarity and miRNA functional similarity. Second, to reduce the data noise in the original feature vectors of miRNA and disease nodes and to better capture the structural relationships between different types of nodes in the heterogeneous graph, random walk with restart is used to extract node features from the similarities. Third, the miRNA–disease heterogeneous graph and the miRNA–disease feature matrix are passed through graph convolution to aggregate the information of neighbor nodes at each layer, and an attention-based reinforcement layer is added to the hidden layer. The attention mechanism in the reinforcement layer merges the more important topological neighborhood nodes, and miRNA and disease node features are extracted from the spatial topological structure of the heterogeneous graph to predict associations. HGCNELMDA achieved AUCs of 93.47% and 93.01% in LOOCV (leave-one-out cross-validation) and fivefold cross-validation, respectively, and performed better than the other typical methods compared. Three case studies on lung cancer, prostate cancer and pancreatic cancer were carried out. Among the top 50 predicted candidate miRNAs, 48, 50, and 50, respectively, were verified in the biological database HMDD V2.0. These results further confirm the feasibility and effectiveness of our method.

Results

First, we present the experimental methods and evaluation indexes. The performance of HGCNELMDA is then compared with existing approaches. Finally, we use HGCNELMDA in three case studies on prostate, lung and pancreatic tumors to assess the accuracy of the predicted associations.

Experimental approaches and evaluation criteria

We collected 5430 known miRNA–disease associations from HMDD V2.0 as the experimental data set. Based on these experimentally verified associations, we implemented global LOOCV and fivefold CV to evaluate the predictive accuracy of HGCNELMDA. In LOOCV, every confirmed association was regarded as the test sample in turn, while the remaining associations were treated as training samples. There are two types of LOOCV, global and local. Global LOOCV considers all diseases at the same time, while local LOOCV only takes into account the miRNAs for a given disease of interest. Candidate samples included all miRNA–disease pairs that had not been verified by experimental studies. After running HGCNELMDA, every miRNA–disease pair obtains an association score; a higher score means a higher likelihood that the pair is associated. In global LOOCV, we compared the score of the test sample with the scores of all the candidate samples. We then drew the receiver operating characteristic (ROC) curve by plotting the true positive rate (TPR, sensitivity) against the false positive rate (FPR, 1-specificity) at different thresholds. Sensitivity denotes the percentage of test samples ranked above the given threshold, while specificity represents the percentage of negative miRNA–disease pairs ranked below the threshold. The area under the ROC curve (AUC) was then calculated to quantify the prediction ability of HGCNELMDA. An AUC of 1 indicates perfect prediction performance, and an AUC of 0.5 indicates random prediction performance.
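
The ranking-based evaluation described above can be illustrated with a minimal sketch. It relies on the standard fact that the AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (ties counted as one half). The function name `roc_auc` and the scores are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)

# Hypothetical association scores: three held-out known associations
# (test samples) and four unverified candidate pairs.
test_scores = [0.9, 0.8, 0.4]
candidate_scores = [0.7, 0.3, 0.2, 0.1]
auc = roc_auc(test_scores, candidate_scores)
print(auc)
```

In this toy example 11 of the 12 test/candidate pairs are ordered correctly, so the AUC is 11/12.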

Fivefold cross-validation was also implemented to further estimate the prediction accuracy of the HGCNELMDA model. The known associations were randomly divided into five equal groups, and each group was treated as the test samples in turn, with its associations removed from the training data. Every test sample was then scored and ranked against the candidate miRNA–disease pairs. We repeated this procedure 50 times to obtain a more accurate average AUC value.
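
As a rough sketch of the splitting step, the known associations (entries equal to 1 in the association matrix) can be shuffled, divided into five groups, and zeroed out of the training matrix one group at a time. The helper `fivefold_splits`, the seed and the toy matrix are assumptions made for illustration only.

```python
import numpy as np

def fivefold_splits(A, seed=0, k=5):
    """Yield (A_train, test_pairs); each fold's known associations are set to 0."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(A == 1)            # (row, col) index of every known pair
    order = rng.permutation(len(known))    # random order of the known pairs
    for fold in np.array_split(order, k):
        test_pairs = known[fold]
        A_train = A.copy()
        A_train[test_pairs[:, 0], test_pairs[:, 1]] = 0
        yield A_train, test_pairs

# Toy association matrix: 4 miRNAs x 3 diseases with 10 known associations.
A = np.zeros((4, 3), dtype=int)
A[[0, 0, 1, 2, 2, 3, 3, 1, 0, 2], [0, 1, 1, 0, 2, 1, 2, 2, 2, 1]] = 1
splits = list(fivefold_splits(A))
for A_train, test_pairs in splits:
    print(len(test_pairs), int(A_train.sum()))
```

Repeating the procedure with different seeds and averaging the resulting AUC values would correspond to the repeated runs described above.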

Comparison with other methods

To verify the accuracy of our method, HGCNELMDA was compared under fivefold cross-validation with six existing methods, namely FCGCNMDA [35], CNMDA [36], EDTMDA [37], MCMDA [20], IMIPMF [38] and RKNNMDA [39]. As shown in Table 1, the AUCs of FCGCNMDA, CNMDA, EDTMDA, MCMDA, IMIPMF and RKNNMDA were 92.85%, 85.33%, 91.92%, 86.47%, 89.32% and 82.21%, respectively. HGCNELMDA had the highest AUC under fivefold cross-validation, 93.01%; its AUPR was 85.37% and its ACC was 84.36%. These results indicate that HGCNELMDA is reliable for miRNA–disease association prediction. As for global LOOCV, MLMD achieved a reliable AUC value of 0.8786, which was also superior to that in FCGCNMDA (0.8964), MCMDA (0.8629), IMIPMF (0.8857), and EDTMDA (0.8878), as shown in Fig. 1. As shown in Fig. 2, our model showed superior performance (AUC value 0.8634) compared to FCGCNMDA (0.8596), MCMDA (0.8561), IMIPMF (0.8547), and EDTMDA (0.8512) in the framework of local LOOCV.

Table 1 Comparison of HGCNELMDA and other models for fivefold cross-validation

Fig. 1 Comparison of HGCNELMDA and other models for global LOOCV

Fig. 2 Comparison of HGCNELMDA and other models for local LOOCV

Comparison of results with or without reinforcement layer

Figures 3 and 4 show the performance of HGCNELMDA with and without the reinforcement layer under leave-one-out cross-validation and fivefold cross-validation, respectively. In the ablation experiment, the reinforcement layer was removed and replaced by an ordinary GCN hidden layer. The AUC with the reinforcement layer was higher than that without it. The reason is that the reinforcement layer makes similar miRNA (or disease) nodes similar in the feature space, and its attention mechanism focuses the aggregation on important similar neighbor nodes, so the similarity information of nodes is retained.

Fig. 3 Comparison of leave-one-out cross-validation with or without reinforcement layer

Fig. 4 Comparison of fivefold cross-validation with or without reinforcement layer

Comparison of results with or without random walk with restart

Figures 5 and 6 show the results of HGCNELMDA with and without RWR-based node feature extraction under leave-one-out and fivefold cross-validation, respectively. In the experiments without RWR, a row or column of \(SM\) and \(SD\) is used directly as the feature vector of a node. As shown in the figures, using RWR to obtain the initial node features gives better results, because at each step RWR can either move to an adjacent node or return to the initial node, which reduces the influence of data noise on node feature extraction.

Fig. 5 Comparison of leave-one-out cross-validation with or without RWR

Fig. 6 Comparison of fivefold cross-validation with or without RWR

Comparison of parameter sensitivities

The layer node embedding dimension \(h\) is the node embedding parameter of the GCN hidden layer, and different values affect the experimental results. As shown in Fig. 7, \(h\) was set to each value in [32, 64, 128, 256, 512]. Under both leave-one-out cross-validation and fivefold cross-validation, the AUC value shows an upward trend as the node embedding dimension \(h\) increases, and the performance of HGCNELMDA is highest when \(h\) is 256.

Fig. 7 Comparison of different embedding dimensions

Case studies

HGCNELMDA was applied to three case studies on prostate cancer, lung cancer and pancreatic cancer to assess the accuracy of the predicted associations. We compared the predicted candidate miRNAs against two public databases, dbDEMC and PhenomiR, to verify them.

In the first case study, prostate tumors were selected to test whether our approach is suitable for new diseases without any known associated miRNAs. Prostate tumors were chosen because prostate cancer is the most common cancer in males worldwide; in 2018, more than 100,000 males died of prostate cancer in Europe alone [40]. This case study first set all miRNA–disease associations related to prostate neoplasms in HMDD v2.0 to zero. Then, HGCNELMDA was run to identify the miRNAs associated with prostate neoplasms. Table 2 lists the top 50 candidate miRNAs predicted by HGCNELMDA for prostate tumors. All 50 were verified by the dbDEMC and PhenomiR databases.

Table 2 Top 50 miRNAs associated with prostate tumors

The second-ranked hsa-miR-96b has been found to regulate apoptosis of prostate cancer cells by inhibiting the FoxO1 transcription factor, which supports the predictive ability of HGCNELMDA for new diseases without any known linked miRNAs. In the second case study, we examined HGCNELMDA on lung neoplasms, a common human cancer with many experimentally verified related miRNAs. We used the experimentally verified miRNA–disease associations from the HMDD v2.0 database as the initial training set, but removed all associations involving lung neoplasms, so that lung neoplasms could be regarded as a disease without any known related miRNAs. Lung tumors are devastating and fatal, causing many deaths in both males and females worldwide [41]. The five-year survival rate of lung tumors is low, so early diagnosis is critical to save patients' lives [43]. Lung tumors, for which miRNAs have become a promising tool in diagnosis and treatment, were therefore selected for this case. HGCNELMDA was used to predict candidate miRNAs associated with lung tumors, and the validation of the top 50 is listed in Table 3. The two databases confirmed 49 miRNAs, and only one miRNA was not verified. In addition, ectopic expression of miR-494-3p in A549 lung cancer cells promoted the tumor-initiating population and enhanced the motility of cancer cells and the expression of stem cell-related genes, suggesting that HGCNELMDA can help the diagnosis and treatment of lung tumors. The HGCNELMDA method has good accuracy in predicting prostate tumor-associated miRNAs.

Table 3 Top 50 miRNAs associated with lung tumors

For the third case study, we chose pancreatic tumors as the new disease. The column of the known miRNA–disease association matrix corresponding to pancreatic tumors was set to zero, so that no miRNA is associated with it and it is treated as a new disease [43]. HGCNELMDA was used to predict candidate miRNAs associated with pancreatic tumors, and the top 50 are listed in Table 4. All 50 were validated by the dbDEMC and PhenomiR databases. Studies have shown that increased serum miR-193b is a potential new biomarker for pancreatic neuroendocrine tumors (PNEN). The results indicate that HGCNELMDA is useful for predicting miRNAs for new diseases.

Table 4 Top 50 miRNAs associated with pancreatic tumors

The results of the three case studies show that our method is effective at predicting unvalidated miRNA–disease associations.

Discussion

Compared with five classic methods under global LOOCV and fivefold cross-validation, the experimental results show that HGCNELMDA has better predictive performance, and three case studies also support the results of our method. First, we constructed a miRNA–disease heterogeneous network based on experimentally verified miRNA–disease associations, disease semantic similarity and miRNA functional similarity. Second, we used random walk with restart to extract node features from the similarities, aiming to reduce the data noise in the original feature vectors of miRNA and disease nodes and to better capture the structural relationships between different types of nodes in the heterogeneous graph. Third, so that similar nodes (miRNAs or diseases) in the miRNA–disease heterogeneous graph have similar representations in the feature space, a reinforcement layer was added to the GCN hidden layer, enhancing the feature vector aggregation of similar nodes and preserving similarity information between nodes. The attention mechanism introduced in the reinforcement layer integrates the more important topological neighborhood nodes, and miRNA and disease node features were extracted from the spatial topology of the heterogeneous graph to predict associations. In summary, the HGCNELMDA method makes full use of the complex structure and semantic information of the miRNA–disease heterogeneous network to achieve good predictions.

Conclusion

This paper describes a heterogeneous graph convolutional network model with an enhancement layer (HGCNELMDA) for predicting miRNA–disease associations. First, the feature vectors of miRNA and disease nodes were obtained from the miRNA and disease similarity networks by random walk with restart. Second, the miRNA–disease heterogeneous graph was input into the GCN, and a reinforcement layer was added to the hidden layer of the GCN so that similar nodes have similar feature representations in the feature space. The attention mechanism was used to update the influence of important adjacent nodes on the target node in the reinforcement layer. Third, the association matrix between miRNAs and diseases was reconstructed by a bilinear decoder, and the cross-entropy loss function was used to train the model. Finally, the performance of HGCNELMDA was evaluated by four sets of experiments: leave-one-out and fivefold cross-validation comparisons with other methods, an ablation test, a parameter sensitivity test and three disease case studies. The results indicate that HGCNELMDA predicts miRNA–disease associations well.

Methods

To reduce the data noise in the extracted original features, give similar nodes similar feature representations in the feature space, and enhance the spatial node feature aggregation of the topology graph, this paper constructs a heterogeneous graph convolutional network model with a reinforcement layer to predict miRNA–disease associations. The model framework is shown in Fig. 8.

Fig. 8 HGCNELMDA model

(1) Step 1: Build the miRNA–disease heterogeneous graph according to literature [44]. The integrated disease semantic similarity network \(SD\), the known miRNA–disease association matrix \(A\) and the integrated miRNA functional similarity network \(SM\) are combined into a miRNA–disease heterogeneous graph \(A_{H}\), as shown in Formula (1):

$${A}_{H}=\left[\begin{array}{cc}SM& A\\ {A}^{T}& SD\end{array}\right]$$
(1)

where \({A}_{H}\in {\mathbb{R}}^{(m+n)\times (m+n)}\), and \(m\) and \(n\) are the numbers of miRNAs and diseases, respectively.
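
A small sketch of Formula (1), assuming that \(SM\), \(SD\) and \(A\) are available as dense NumPy arrays; the function name and the toy values are hypothetical.

```python
import numpy as np

def build_heterogeneous_adjacency(SM, SD, A):
    """Assemble A_H = [[SM, A], [A^T, SD]] as in Formula (1)."""
    m, n = A.shape
    assert SM.shape == (m, m) and SD.shape == (n, n)
    return np.block([[SM, A], [A.T, SD]])

# Toy example with m = 2 miRNAs and n = 3 diseases.
SM = np.array([[1.0, 0.3], [0.3, 1.0]])           # miRNA functional similarity
SD = np.eye(3)                                    # disease semantic similarity
A = np.array([[1, 0, 0], [0, 1, 1]], dtype=float) # known associations
A_H = build_heterogeneous_adjacency(SM, SD, A)
print(A_H.shape)
```

When \(SM\) and \(SD\) are symmetric, the resulting \(A_{H}\) is symmetric as well, because the off-diagonal blocks are \(A\) and \(A^{T}\).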

(2) Step 2: Node feature extraction based on random walk with restart. To reduce the influence of data noise on the original features of nodes, random walk with restart is used to represent the original features of nodes.

(3) Step 3: Node embedding based on GCN. The information of neighbor nodes of each layer is aggregated through GCN to form an embedding of miRNA or disease node features.

(4) Step 4: Reinforcement layer based on attention mechanism. Since previous GCN models treat all neighbor nodes equally, this paper adds an attention-based reinforcement layer to the GCN hidden layer \(H\).

(5) Step 5: Use the decoder to reconstruct the association matrix between miRNAs and diseases. The node feature embedding matrix is obtained from the reinforcement layer \(H\). The feature matrix of miRNAs is \({H}_{r}\in {\mathbb{R}}^{m\times h}\) and the feature matrix of diseases is \({H}_{d}\in {\mathbb{R}}^{n\times h}\), where \(h\) is the layer embedding dimension. Since the sigmoid function is often used as the activation function for binary classification, it is used to reconstruct the miRNA–disease association matrix \(Y\), as shown in Eq. (2):

$$Y=\mathrm{sigmoid}\left({H}_{r}{W}^{{\prime}}{H}_{d}^{T}\right)$$
(2)

where the element in row \(i\) and column \(j\) of the matrix \(Y\) is the association prediction score \({{y}_{ij}}^{{\prime}}\) between miRNA \({r}_{i}\) and disease \({d}_{j}\), and \({W}^{{\prime}}\in {\mathbb{R}}^{h\times h}\) is a trainable matrix.
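
The decoder in Eq. (2) can be sketched as follows. The trainable matrix is assumed here to have shape \(h\times h\), which is what makes the product of an \(m\times h\) matrix, the trainable matrix and an \(h\times n\) matrix well defined; the random embeddings stand in for the output of the reinforcement layer and are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bilinear_decoder(H_r, H_d, W):
    """Eq. (2): Y = sigmoid(H_r W H_d^T), giving one score per miRNA-disease pair."""
    return sigmoid(H_r @ W @ H_d.T)

rng = np.random.default_rng(0)
m, n, h = 4, 3, 8                              # toy sizes (hypothetical)
H_r = rng.normal(size=(m, h))                  # miRNA embeddings
H_d = rng.normal(size=(n, h))                  # disease embeddings
W_prime = 0.1 * rng.normal(size=(h, h))        # trainable matrix, random here
Y = bilinear_decoder(H_r, H_d, W_prime)
print(Y.shape)
```

Each entry of `Y` lies between 0 and 1 and can be read as the predicted association score for the corresponding miRNA–disease pair.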

(6) Step 6: To make the predicted results as close as possible to the actual results, cross-entropy is used as the loss function for end-to-end back-propagation, as shown in Formula (3):

$${\mathcal{L}}_{cross}=-\sum_{(i,j)\in y\cup {y}^{-}}\left({y}_{ij}\log {{y}_{ij}}^{{\prime}}+\left(1-{y}_{ij}\right)\log \left(1-{{y}_{ij}}^{{\prime}}\right)\right)$$
(3)

where \(y\) represents the set of positive samples (known miRNA–disease associations) and \({y}^{-}\) represents the set of negative samples with unknown relationships.
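
A minimal numerical sketch of the loss in Formula (3), summed over a set of labeled pairs. The clipping constant `eps` is an assumption added only to avoid taking the logarithm of zero; it is not part of the formula, and the labels and scores are made up for illustration.

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Formula (3): summed binary cross-entropy over positive and negative samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # numerical safeguard (assumption)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Two positive and two negative samples with hypothetical predicted scores.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
loss = cross_entropy_loss(y_true, y_pred)
print(loss)
```

For these values the loss reduces to \(-\log(0.9\cdot 0.9\cdot 0.8\cdot 0.7)\), since each term contributes the log of the probability assigned to the true label.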

Feature extraction based on random walk with restart

The M2GMDA and CEMDA methods use each row or column of the \(SM\) (or \(SD\)) similarity matrix as the feature vector of a miRNA (or disease). According to literature [45], limitations of the similarity calculation method may introduce data noise when original node features are extracted directly. Therefore, to optimize the original feature vectors of miRNA and disease nodes and to better capture the structural relationships between different types of nodes in the heterogeneous graph, this paper uses random walk with restart (RWR) to extract node features from the similarities. A random walk with restart starts from a node, and at each step either moves to a randomly selected adjacent node or returns to the starting node. Assume that there are \(n\) nodes and that the walk starts with the initial probability vector \(e\); then the probability of being at any node \(i\) after the next move is \({r}^{0}\left[i\right]\), as shown in Formula (4):

$${r}^{0}\left[i\right]=W\left[i\right]\cdot e$$
(4)

Here, \(W\left[i\right]\) represents the \(i\)-th row of the transition probability matrix \(W\), that is, the transition probabilities from all nodes to node \(i\). In the following move, the probability of being at node \(i\) is shown in formula (5):

$${r}^{1}\left[i\right]=W\left[i\right]\cdot {r}^{0}$$
(5)

When restarts are taken into account, the walker may also return to the starting node \(i\) at every iteration, and this is controlled by the probability \(c\). Once the iteration has stabilized, \({r}_{i}\) is the vector of probability scores for the walk started at node \(i\), which is used as the similarity feature vector of node \(i\), as shown in formula (6):

$${r}_{i}^{k+1}=cW{r}_{i}^{k}+\left(1-c\right){e}_{i}$$
(6)

Here, \(c\in \left(0, 1\right)\) represents the restart probability, \(W\in {\mathbb{R}}^{n\times n}\) is the transition matrix whose entry \(W\left[i,j\right]\) is the probability of moving from \(i\) to \(j\), and \({e}_{i}\in {\mathbb{R}}^{n\times 1}\) is the initial probability vector of the \(i\)-th node: its entry \({e}_{ij}\) is 1 if \(i\) is equal to \(j\) and 0 otherwise. This paper replaces \(W\) with \(SM\) or \(SD\), respectively, obtains the probability distribution matrix of the nodes (miRNAs or diseases) by random walk with restart, and normalizes it to give the miRNA feature matrix \(RM\in {\mathbb{R}}^{m\times m}\) and the disease feature matrix \(RD\in {\mathbb{R}}^{n\times n}\). Random walk with restart yields the similarity between any two nodes and better captures the global structure of the graph. From \(RM\) and \(RD\), the miRNA–disease feature matrix \({A}_{F}{\in {\mathbb{R}}}^{(m+n)\times (m+n)}\) is obtained, as shown in formula (7):

$${A}_{F}=\left[\begin{array}{cc}0& RM\\ RD& 0\end{array}\right]$$
(7)
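The iteration of formula (6) and the block layout of formula (7) can be sketched as follows. Several points are assumptions rather than details given in the text: the similarity matrix is column-normalized before it is used as \(W\), convergence is checked with a hypothetical tolerance `tol`, and the toy matrices `SM` and `SD` are made up. The sketch follows formula (6) literally, so the restart term \({e}_{i}\) is weighted by \(1-c\).

```python
import numpy as np

def rwr_features(S, c=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart on a similarity matrix S.

    Iterates R <- c * W @ R + (1 - c) * I for all start nodes at once,
    where column i of R is r_i. Returns R transposed, so that row i
    holds the feature vector of node i.
    """
    S = np.asarray(S, dtype=float)
    col_sums = S.sum(axis=0, keepdims=True)
    W = S / np.where(col_sums == 0, 1.0, col_sums)  # assumed normalization
    E = np.eye(S.shape[0])                          # column i is e_i
    R = E.copy()
    for _ in range(max_iter):
        R_next = c * W @ R + (1 - c) * E
        converged = np.abs(R_next - R).max() < tol
        R = R_next
        if converged:
            break
    return R.T

def build_feature_matrix(RM, RD):
    """Block matrix of formula (7): [[0, RM], [RD, 0]]."""
    m, n = RM.shape[0], RD.shape[0]
    top = np.hstack([np.zeros((m, n)), RM])
    bottom = np.hstack([RD, np.zeros((n, m))])
    return np.vstack([top, bottom])

# Hypothetical similarity matrices for 2 miRNAs and 3 diseases.
SM = np.array([[1.0, 0.5],
               [0.5, 1.0]])
SD = np.array([[1.0, 0.2, 0.0],
               [0.2, 1.0, 0.4],
               [0.0, 0.4, 1.0]])
RM = rwr_features(SM)
RD = rwr_features(SD)
A_F = build_feature_matrix(RM, RD)
print(A_F.shape)
```

With a column-normalized \(W\), every \({r}_{i}\) remains a probability distribution throughout the iteration, which is why each row of `RM` and `RD` sums to 1.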

GCN-based node cutting

Graph convolution aggregates node information along edges to produce new node features. Graph features can be extracted in either the spatial domain or the spectral domain. According to Ref. [44], the spatial method extracts features directly from the neighbor nodes connected to a vertex, whereas the spectral method realizes the convolution operation on the graph with the help of graph theory, studying the properties of the graph through the eigenvalues and eigenvectors of its Laplacian matrix. Laplacian matrices are symmetric, so they admit the eigendecomposition that GCN relies on. A common choice is the symmetric normalized Laplacian, whose purpose is to have each node transfer the same total amount of information outward: the more edges a node has, the less information is transmitted along each edge. The definition of the symmetric normalized Laplacian matrix is shown in Eq. (8):

$$\widehat{L}={D}^{-\frac{1}{2}}\cdot L\cdot {D}^{-\frac{1}{2}}$$
(8)

Here, \(D\) represents the degree matrix of the vertices, which is a diagonal matrix. The elements of the normalized matrix \(\widehat{L}\) are defined as shown in formula (9), where \(diag({v}_{i})\) denotes the degree of vertex \({v}_{i}\):

$${\widehat{L}}_{ij}=\left\{\begin{array}{ll}1&\quad i=j \text{ and } diag({v}_{i})\ne 0\\ -\frac{1}{\sqrt{diag\left({v}_{i}\right)diag\left({v}_{j}\right)}} &\quad i\ne j \text{ and } {v}_{i} \text{ is adjacent to } {v}_{j} \\ 0& \quad \text{otherwise}\end{array}\right.$$
(9)

From the miRNA–disease heterogeneous graph \({A}_{H}\), the normalized adjacency matrix is constructed as shown in formula (10):

$${\widehat{A}}_{H}={D}^{-\frac{1}{2}}{A}_{H}{D}^{-\frac{1}{2}}$$
(10)
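The symmetric normalization of formula (10) can be sketched as below. The three-node graph is a hypothetical toy example, and treating zero-degree nodes as all-zero rows is an assumption made to avoid division by zero. For an unweighted graph without self-loops, \(I-\widehat{A}\) reproduces the entries listed in formula (9).

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} A D^{-1/2}; nodes with degree 0 keep all-zero rows."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]

# Hypothetical toy graph: node 0 is linked to nodes 1 and 2.
A_H = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]])
A_hat = normalize_adjacency(A_H)

# Normalized Laplacian: 1 on the diagonal, -1/sqrt(d_i d_j) on edges.
L_hat = np.eye(3) - A_hat
print(A_hat)
print(L_hat)
```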

Ref. [45] indicates that the Laplacian matrix and the Fourier transform are the two theoretical foundations of GCN. The graph Fourier transform expresses an arbitrary vector \(f\) defined on the graph as a linear combination of the eigenvectors of the Laplacian matrix, as shown in formula (11):

$$f=\widehat{f}\left(1\right){u}_{1}+\widehat{f}\left(2\right){u}_{2}+\dots +\widehat{f}\left(n\right){u}_{n}$$
(11)

Here, \(({u}_{1},{u}_{2},\dots ,{u}_{n})\) is an orthogonal basis formed by \(n\) linearly independent vectors. The relationship between the Fourier transform and the Laplacian matrix is that the eigenvectors of the Laplacian matrix form the basis of the Fourier transform. This leads to the graph convolutional network layer shown in formula (12):

$$f(X,A)=ReLU(\widehat{A}XW)$$
(12)

Here, \(X\) represents the feature matrix of the nodes, \(\widehat{A}\) represents the normalized adjacency matrix, and \(W\) is the weight matrix from the input layer to the hidden layer, which acts like a fully connected layer that combines the node features.

Based on the miRNA–disease heterogeneous graph \({A}_{H}\) and the miRNA–disease feature matrix \({A}_{F}\), the initial embeddings of miRNA and disease nodes are formed through GCN. At each layer, GCN aggregates the information of directly connected neighbor nodes through the graph and passes the result to the next layer as input, as shown in formula (13):

$$F=f({A}_{F},{A}_{H})=ReLU({ \widehat{A}}_{H}{A}_{F}{W}^{(0)})$$
(13)

Here, \({W}^{(0)}{\in {\mathbb{R}}}^{(m+n)\times h}\) and \(h\) is the layer embedding dimension.
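A single GCN layer as in formula (13) can be sketched as follows. The association block `assoc`, the random feature matrix standing in for \({A}_{F}\), and the random weights are all hypothetical; in particular, building \({A}_{H}\) as a purely bipartite block matrix is an assumption made only to keep the example small.

```python
import numpy as np

def gcn_layer(A_hat, X, W):
    """One graph convolution: ReLU(A_hat X W)."""
    return np.maximum(A_hat @ X @ W, 0.0)

rng = np.random.default_rng(0)
m, n, h = 3, 2, 4          # hypothetical: 3 miRNAs, 2 diseases, h = 4
N = m + n

# Hypothetical known associations (rows: miRNAs, columns: diseases).
assoc = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 1.0]])
A_H = np.block([[np.zeros((m, m)), assoc],
                [assoc.T, np.zeros((n, n))]])

# Symmetric normalization D^{-1/2} A_H D^{-1/2} (all degrees are > 0 here).
d = 1.0 / np.sqrt(A_H.sum(axis=1))
A_hat = d[:, None] * A_H * d[None, :]

A_F = rng.random((N, N))       # stand-in for the RWR feature matrix
W0 = rng.normal(size=(N, h))   # trainable weights W^(0), random here

F = gcn_layer(A_hat, A_F, W0)
print(F.shape)  # (m + n, h): one h-dimensional embedding per node
```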

Reinforcement layer based on attention mechanism

To make similar miRNA (or disease) nodes close in the feature space, this paper adds an attention-based reinforcement layer \(H\) to the hidden layer of the GCN. The reinforcement layer \(H\) is initialized as \(F\), and an attention mechanism is introduced to take all neighbor nodes into account. The attention mechanism measures the influence of the feature vectors of adjacent nodes in the reinforcement layer on the feature vector of each node. \({a}_{ij}\) represents the attention coefficient between nodes \(i\) and \(j\), as shown in Formulas (14), (15) and (16):

$${e}_{ij}=ReLU\left(W{h}_{i},W{h}_{j}\right)$$
(14)
$${a}_{ij}=\frac{\mathrm{exp}({e}_{ij})}{{\sum }_{x\in {\mathcal{N}}_{i}}\mathrm{exp}({e}_{ix})}$$
(15)
$${H}_{i}={\sum }_{j\in {\mathcal{N}}_{i}}{a}_{ij}{h}_{j}$$
(16)

where \({\mathcal{N}}_{i}\) is the set of neighborhood nodes of node \(i\). \(ReLU\) is the activation function and \(W{\in {\mathbb{R}}}^{(m+n)\times X}\) is a trainable matrix.
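The attention step can be sketched as follows, with several assumptions flagged. Formula (14) does not state how \(W{h}_{i}\) and \(W{h}_{j}\) are combined, so the sketch uses their dot product passed through ReLU as a stand-in scoring function; the projection shape `(h, p)`, the ring-shaped neighborhood, and the random inputs are likewise hypothetical. The aggregation sums the neighbor features \({h}_{j}\) weighted by \({a}_{ij}\), which is the reading of formula (16) assumed here.

```python
import numpy as np

def attention_reinforce(F, adj, W):
    """Attention-weighted neighbor aggregation.

    F   : (N, h) node features from the GCN layer
    adj : (N, N) matrix, nonzero where j is a neighbor of i
    W   : (h, p) projection (hypothetical shape)
    Returns the reinforced features H and the coefficients alpha.
    """
    Z = F @ W
    E = np.maximum(Z @ Z.T, 0.0)              # assumed score: ReLU(<W h_i, W h_j>)
    E_masked = np.where(adj > 0, E, -np.inf)  # keep only neighbors
    row_max = E_masked.max(axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    expE = np.exp(E_masked - row_max)         # exp(-inf) = 0 for non-neighbors
    denom = expE.sum(axis=1, keepdims=True)
    alpha = np.divide(expE, denom, out=np.zeros_like(expE), where=denom > 0)
    H = alpha @ F                             # sum over neighbors of a_ij * h_j
    return H, alpha

rng = np.random.default_rng(0)
F = rng.random((4, 3))                        # 4 nodes, 3 features each
adj = np.array([[0, 1, 0, 1],                 # hypothetical ring graph
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
W = rng.normal(size=(3, 2))
H, alpha = attention_reinforce(F, adj, W)
print(alpha.sum(axis=1))                      # each row of coefficients sums to 1
```

Subtracting the row maximum before exponentiating does not change the softmax of formula (15); it only keeps the exponentials numerically stable.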

Next, define the loss function \({\mathcal{L}}_{H}\) of the reinforcement layer. The loss encourages the feature vector \({H}_{i}\) of each node on the reinforcement layer \(H\) to focus on the feature vectors \({H}_{j}\) of its important similar neighbor nodes, so that the feature vector of node \(i\) can be better iterated and updated. It is defined as shown in Eq. (17):

$${\mathcal{L}}_{H}=\sum_{i=1}^{m+n}{\sum }_{j\in {\mathcal{N}}_{i}}{a}_{ij}{\Vert {H}_{i}-{H}_{j}\Vert }^{2}$$
(17)

Here, \(m\) and \(n\) represent the numbers of miRNAs and diseases, respectively.
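The reinforcement-layer loss of Eq. (17) can be sketched as below. It assumes that the attention matrix `alpha` is zero for every pair that is not a neighbor pair, so that summing over all \((i, j)\) is the same as summing over \(j\in {\mathcal{N}}_{i}\); the two-node example is made up for illustration.

```python
import numpy as np

def reinforcement_loss(H, alpha):
    """Sum over i, j of alpha[i, j] * ||H[i] - H[j]||^2.

    alpha is assumed to be zero outside each node's neighborhood.
    """
    H = np.asarray(H, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    diff = H[:, None, :] - H[None, :, :]   # all pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)   # squared Euclidean distances
    return np.sum(alpha * sq_dist)

# Two mutually neighboring nodes at distance 5 from each other.
H_toy = np.array([[0.0, 0.0],
                  [3.0, 4.0]])
alpha_toy = np.array([[0.0, 1.0],
                      [1.0, 0.0]])
loss_H = reinforcement_loss(H_toy, alpha_toy)
print(loss_H)  # 25 + 25
```

The loss shrinks as strongly attended neighbors move closer together in the feature space, which matches the stated goal of the reinforcement layer.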

Availability of data and materials

The datasets that support the findings of this study are available at https://github.com/liubailong/HGCNELMDA.

Abbreviations

HGCNELMDA: Heterogeneous graph convolutional network model with enhanced layer to predict miRNA–disease associations

GCN: Graph convolutional network

RWR: Random walk with restart

LOOCV: Global leave-one-out cross validation

miRNAs: Micro ribonucleic acids

AUC: Area under the curve

References

  1. Huang HY, Lin YCD, Li J, Huang KY, Shrestha S, Hong HC, et al. miRTarBase 2020: updates to the experimentally validated microRNA–target interaction database. Nucleic Acids Res. 2020;2020:145–8.

  2. Leland H. Genetics: from gene to genomes. New York: McGraw-Hill Higher Education; 2021.

  3. Cantile M, Di BM, Tracey DBM, et al. Functional interaction among lncRNA HOTAIR and microRNAs in cancer and other human diseases. Cancers. 2021;13(3):570.

  4. Shefa U, Jung JY. Comparative study of microarray and experimental data on Schwann cells in peripheral nerve degeneration and regeneration: big data analysis. Neural Regen Res. 2019;14(6):1099.

  5. Zhang H, Liang Y, Han SY, Peng C, Li Y. Long noncoding RNA and protein interactions: from experimental results to computational models based on network methods. Int J Mol Sci. 2019;20(6):1284.

  6. Blanca OQ. Extracellular microRNAs as intercellular mediators and noninvasive biomarkers of cancer. Cancers. 2020;12(11):3455.

  7. Wang X, Chai YB, Li H, et al. Link prediction in heterogeneous information networks: an improved deep graph convolution approach. Decis Support Syst. 2021;141:113448.

  8. Chen M, Liao B, Li ZJ. Global similarity method based on a two-tier random walk for the prediction of microRNA–disease association. Sci Rep. 2018;8(1):1–16.

  9. Zhang W, Li ZS, Guo WZ, et al. A fast linear neighborhood similarity-based network link inference method to predict microRNA–disease associations. IEEE/ACM Trans Comput Biol Bioinform. 2021;18:1.

  10. Zhao HC, Kuang LN, Wang L, et al. Prediction of microRNA–disease associations based on distance correlation set. BMC Bioinform. 2018;19:141. https://doi.org/10.1186/s12859-018-2146-x.

  11. Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.

  12. Chen X, Wang CC, Yin J, You ZH. Novel human miRNA–disease association inference based on random forest. Mol Ther Nucleic Acids. 2018;13:568–79.

  13. Jiang YT, Liu BT, Yu LH, Yan CG, Bian HJ. Predict miRNA–disease association with collaborative filtering. Neuroinformatics. 2018;16(3–4):363–72.

  14. Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, Li M, Wang G, Liu Y. MiR2Disease: A manually curated database for microRNA deregulation in human disease. Nucl Acids Res. 2009;37:D98–104.

  15. Chen ZH, Wang XK, Gao P, Liu HJ, Song BS. Predicting disease related microRNA based on similarity and topology. Cells. 2019;8(11):1405.

  16. Zeng XX, Wang W, Deng GS, Bing JX, Zou Q. Prediction of potential disease-associated microRNAs by using neural networks. Mol Ther Nucleic Acids. 2019;16:566–75.

  17. Gong YC, Niu YQ, Zhang W, Li XH. A network embedding-based multiple information integration method for the miRNA–disease association prediction. BMC Bioinform. 2019;20(1):468.

  18. Zhang C, Chao H, Lu Y, et al. Camel: content-aware and meta-path augmented metric learning for author identification. WWW. 2018.

  19. Wang Y, Zheng FS, Wang ZB, Lu JB, Zhang HY. Circular RNA circ-SLC7A6 acts as a tumor suppressor in non-small cell lung cancer through abundantly sponging miR-21. Cell Cycle. 2020;19(17):2235–46.

  20. Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, Cui Q. HMDD V2.0: A database for experimentally supported human microRNA and disease associations. Nucl Acids Res. 2014;42(D1):D1070–4.

  21. Zhao Y, Chen X, Yin J. Adaptive boosting-based computational model for predicting potential miRNA–disease associations. Bioinformatics. 2019;35(22):4730–8.

  22. Chen X, Zhu CC, Yin J. Ensemble of decision tree reveals potential miRNA–disease associations. PLoS Comput Biol. 2019;15(7):e1007209. https://doi.org/10.1371/journal.pcbi.1007209.

  23. Ha J, Park C, Park C, Park S. Improved prediction of miRNA-disease associations based on matrix completion with network regularization. Cells. 2020;9(4):881.

  24. Zhang XJ, Li YL, Qi PF, Ma ZL. Biology of MiR-17-92 cluster and its progress in lung cancer. Int J Med Sci. 2018;15(13):1443–8.

  25. Fu X, Zhang J, Meng Z, et al. MAGNN: meta path aggregated graph neural network for heterogeneous graph embedding. In: The web conference 2020; p. 2331–2341.

  26. Song XY, Liu T, Qiu ZY, et al. Prediction of lncRNA–disease associations from heterogeneous information network based on deepwalk embedding model. In: Intelligent computing methodologies 2020; p. 291–300.

  27. Minh NT, Wu Y-H. Integrating meta-path similarity with user preference for top-N recommendation. In: International conference on technologies and applications of artificial intelligence (TAAI). 2019; p. 1–6.

  28. Li J-Q, Rong Z-H, Chen X, Yan G-Y, You Z-H. MCMDA: Matrix completion for MiRNA-disease association prediction. Oncotarget. 2017;8(13):21187–99.

  29. Li SR, Xie MZ, Liu XQ. A novel approach based on bipartite network recommendation and KATZ model to predict potential micro-disease associations. Front Genet. 2019;10:1147.

  30. Yue X, Wang Z, Huang J, et al. Graph embedding on biomedical networks: methods, applications and evaluations. Bioinformatics. 2020;36(4):1241–51.

  31. Chen X, Sun LG, Zhao Y. NCMCMDA: miRNA–disease association prediction through neighborhood constraint matrix completion. Brief Bioinform. 2020;22(1):485–96.

  32. Liang C, Yu SP, Luo JW. Adaptive multi-view multi-label learning for identifying disease-associated candidate miRNAs. PLoS Comput Biol. 2019;15(4):e1006931.

  33. Chen X, Sun LG, Zhao Y. NCMCMDA: miRNA–disease association prediction through neighborhood constraint matrix completion. Brief Bioinform. 2020;22:485–96.

  34. Li YX, Cui XM, Li YD, Zhang TT, Li SY. Upregulated expression of miR-421 is associated with poor prognosis in non-small-cell lung cancer. Cancer Manag Res. 2018;10:2627–33.

  35. Li ZW, Li JS, Nie R, et al. A graph auto-encoder model for miRNA–disease associations prediction. Brief Bioinform. 2020;1:1–13.

  36. Ding Y, Tian LP, Lei X, et al. Variational graph auto-encoders for miRNA–disease association prediction [J/OL]. Methods. 2020. https://doi.org/10.1016/j.ymeth.2020.08.004.

  37. Li JS, Li ZW, Nie R, et al. FCGCNMDA: predicting miRNA–disease associations by applying fully connected graph convolutional networks. Mol Genet Genomics. 2020;295(5):1197–209.

  38. Ha J, Park C, Park C, Park S. IMIPMF: inferring miRNA–disease interactions using probabilistic matrix factorization. J Biomed Inform. 2020;102:103358. https://doi.org/10.1016/j.jbi.2019.103358.

  39. Xiao WD, Zhong YC, Wu LL, Yang DX, Ye SQ, Zhang M. Prognostic value of microRNAs in lung cancer: a systematic review and meta-analysis. Mol Clin Oncol. 2019;10(1):67–77.

  40. Mansoori B, Mohammadi A, Ghasabi M, Shirjang S, Dehghan R, Montazeri V, et al. MiR-142-3p as tumor suppressor miRNA in the regulation of tumorigenicity, invasion and migration of human breast cancer by targeting Bach-1 expression. J Cell Physiol. 2019;234(6):9816–25.

  41. He YJ, Deng F, Zhao SJ, Zhong SL, Zhao JH, Wang DD, et al. Analysis of miRNA–mRNA network reveals miR-140-5p as a suppressor of breast cancer glycolysis via targeting GLUT1. Epigenomics. 2019;11(9):1021–36.

  42. Voss G, Haflidadóttir BS, Järemo H, Persson M, Ivkovic CT, Wikström P, Ceder Y. Regulation of cell–cell adhesion in prostate cancer cells by microRNA-96 through upregulation of E-Cadherin and EpCAM. Carcinogenesis. 2019;41(7):865–74.

  43. Huang Z, Shi JC, Gao YX, Cui CM, Zhang S, Li JW, et al. HMDD v3.0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 2018;47(D1):D1013–7.

  44. Zhang L, Liu BL, Li ZW, Zhu XY, Liang ZZ, An JY. Predicting miRNA–disease associations by multiple meta-paths fusion graph embedding model. BMC Bioinform. 2020;21:470.

  45. Pasquier C, Gardès J. Prediction of miRNA–disease associations with a vector space model. Sci Rep. 2016;6:27036.

Acknowledgements

We thank the editor and the anonymous reviewers for their comments and suggestions.

Funding

This work was supported by the Fundamental Research Funds for the Central Universities, No. 2019XKQYMS88. The funder had no role in study design, data collection and preparation of the manuscript.

Author information

Contributions

HD and AJY conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the manuscript; ZL and LBL designed, performed and analyzed experiments and wrote the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to JiYong An or Lei Zhang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no potential conflicts of interest with respect to the research, authorship, and publication of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Huang, D., An, J., Zhang, L. et al. Computational method using heterogeneous graph convolutional network model combined with reinforcement layer for MiRNA–disease association prediction. BMC Bioinformatics 23, 299 (2022). https://doi.org/10.1186/s12859-022-04843-3

  • DOI: https://doi.org/10.1186/s12859-022-04843-3

Keywords