
ANMDA: anti-noise based computational model for predicting potential miRNA-disease associations

Abstract

Background

A growing body of research has shown that microRNAs (miRNAs) regulate the function of their target genes and are closely related to various diseases. Developing computational methods to uncover more potential miRNA-disease associations can provide clues for further functional research.

Results

Inspired by the work of predecessors, we observe that the noise hidden in the data can affect prediction performance, and we therefore propose an anti-noise algorithm (ANMDA) to predict potential miRNA-disease associations. First, we calculate miRNA and disease similarities to construct features and obtain positive samples from the Human MicroRNA Disease Database version 2.0 (HMDD v2.0). Then, we apply k-means to the undetected miRNA-disease associations and sample negative examples equally from the k clusters. Further, we construct several data subsets by sampling with replacement and feed them into the light gradient boosting machine (LightGBM) method. Finally, a voting scheme is applied to predict potential miRNA-disease relationships. As a result, ANMDA achieves an area under the receiver operating characteristic curve (AUROC) of 0.9373 ± 0.0005 in five-fold cross-validation, which is superior to several published methods. In addition, we analyze the predicted miRNA-disease associations with high probability and compare them with the data in HMDD v3.0 in the case study. The results show that ANMDA is a novel and practical algorithm for inferring potential miRNA-disease associations.

Conclusion

The results indicate that the noise hidden in the data has an obvious impact on predicting potential miRNA-disease associations. We believe ANMDA can achieve even better results on this task as more methods for handling data noise are incorporated.

Background

MicroRNAs (miRNAs) are a class of endogenous small single-stranded non-coding RNAs (ncRNAs) that can specifically bind to the 3'-UTR (3'-untranslated region) of their target mRNAs [1]. Research shows that miRNAs are involved in many cellular activities, including cell proliferation, apoptosis, and stem cell differentiation [2, 3]. It is reported that 48,860 distinct mature miRNA sequences have been found across 271 organisms, of which 2654 come from humans [4].

MiRNA-related malfunctions are associated with various human diseases, including tumors, neurodegeneration, and diabetic cardiomyopathy [5,6,7]. Therefore, uncovering miRNA-disease associations can provide valuable clues for disease diagnosis at an early stage [8]. Based on the hypothesis that miRNAs with similar functions tend to be related to similar diseases [9], much effort has been devoted over the past years to developing computational methods for miRNA-disease association prediction [10].

In general, there are four main types of methods proposed to predict potential miRNA-disease associations.

One type of method is score function-based algorithms. Jiang et al. [11] integrated the miRNA functional interaction network and the disease similarity network and then implemented a scoring method to predict the associations. Chen et al. [12] calculated within-scores and between-scores for miRNA-disease association probabilities (WBSMDA) by integrating miRNA functional similarity, disease semantic similarity, and Gaussian kernel similarity. One challenge for these methods is to utilize more effective features and to design a reasonable score function.

Another type is network-based algorithms. Shi et al. [13] connected miRNAs and diseases through the gene functional network and applied the random walk algorithm for the final prediction. You et al. [14] constructed a heterogeneous graph with weighted matrices and designed a path-based algorithm for prediction (PBMDA). Qu et al. [15] built a reliable heterogeneous network and used KATZ to predict miRNA-disease associations (KATZMDA). One challenge for these methods is to integrate different data sources into reliable networks and to analyze network function.

The third type of method is mainly based on machine learning algorithms. Chen et al. [16] proposed a ranking-based k-nearest neighbor method for miRNA-disease association prediction (RKNNMDA), which searched miRNA and disease k-nearest neighbors and re-ranked them with a support vector machine (SVM). Ha et al. [17] utilized a matrix factorization method to predict miRNA-disease associations (PMAMCA). Zhu et al. [18] used biased heat conduction (BHCMDA) to pay more attention to unpopular nodes and improve the final results. Recently, ensemble learning methods have been designed for this problem with great success. For instance, Zhao et al. [19] adopted the adaptive boosting algorithm for prediction (ABMDA): by adapting the weighting coefficients of residual samples, the algorithm re-learned the residual samples and obtained better results. Zhou et al. [20] combined gradient boosting decision trees with logistic regression (GBDT-LR) to predict potential pairs. Yao et al. [21] used a random forest to select 100 important features and predicted miRNA-disease associations based on them (IRFMDA-100). Peng et al. [22] tackled this association inference with ensemble learning and kernel ridge regression (EKRRMDA). However, the training cost of ensemble learning methods is often high.

The last type comprises deep learning-based methods. As convolutional neural networks (CNNs) can effectively capture latent information between features, Peng et al. [23] used autoencoders for dimensionality reduction and then applied a CNN to predict miRNA-disease associations (MDACNN). To extract dense, high-dimensional representations of diseases and miRNAs, Ji et al. [24] used a deep autoencoder framework (AEMDA). Further, to utilize the information of all miRNA-disease pairs during pre-training, Chen et al. [25] adopted a deep-belief network (DBNMDA) to predict the associations. Li et al. [26] applied fully connected graph convolutional networks to rank the potential pairs, combining graph-related techniques and CNNs (FCGCNMDA). However, deep learning may be better suited to larger datasets.

Although much progress has been made in this field, the noise hidden in the data remains a largely unaddressed problem. As some researchers [19,20,21, 23, 25, 26] regard undetected miRNA-disease pairs as negative samples and randomly choose some of them to feed into their algorithms, the algorithms may be influenced by unreliable negative samples.

This paper proposes a novel anti-noise algorithm to predict potential miRNA-disease associations (ANMDA). In this method, we first analyze the interference of the noise, then use the k-means algorithm to pick negative samples, subsample for noise smoothing, and finally apply the Light Gradient Boosting Machine (LightGBM) to tackle the problem.

The main contributions are listed as follows: (1) We focus on the noise hidden in the data from a new perspective. (2) We subsample the data to smooth the noise and reduce its influence. (3) We apply an effective algorithm (LightGBM) to further deal with the noise. The results demonstrate that ANMDA can outperform several published methods.

Results

Experiment design

To validate the performance of ANMDA, we design different experiments to demonstrate the effect of subsampling for noise smoothing and the superiority of LightGBM. In our study, all experiments are implemented using five-fold cross-validation repeated 100 times, and the evaluation metrics are the same as in other works, including the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPR), precision, recall, and F1-score.
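The evaluation protocol above can be sketched with scikit-learn. This is an illustrative helper, not the paper's code; the `evaluate` name and the use of `StratifiedKFold` are our own assumptions about how repeated five-fold cross-validation is run.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(model, X, y, n_repeats=100, n_splits=5, seed=0):
    """Repeated stratified k-fold CV; returns (mean, std) of AUROC and AUPR."""
    aurocs, auprs = [], []
    for r in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in skf.split(X, y):
            model.fit(X[train_idx], y[train_idx])
            scores = model.predict_proba(X[test_idx])[:, 1]
            aurocs.append(roc_auc_score(y[test_idx], scores))
            auprs.append(average_precision_score(y[test_idx], scores))
    return (np.mean(aurocs), np.std(aurocs)), (np.mean(auprs), np.std(auprs))
```

Reporting the standard deviation alongside the mean, as the paper does, falls out of this loop for free.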

Performance evaluation on ANMDA

We evaluate the performance of ANMDA and compare it with 6 other published methods: WBSMDA, BHCMDA, EKRRMDA, MDACNN, FCGCNMDA, and DBNMDA. The main characteristics of each method are shown in Table 1. WBSMDA is a classic method; BHCMDA and EKRRMDA are recently published machine learning methods, of which EKRRMDA is an ensemble learning method and thus more directly comparable to ANMDA. Furthermore, the deep learning-based models MDACNN, FCGCNMDA, and DBNMDA are also included.

Table 1 The main ideas of ANMDA and 6 published methods

The AUROCs of ANMDA and the other 6 published methods are shown in Fig. 1. As we can see, ANMDA achieves the best performance among these methods. What's more, the standard deviation of ANMDA is 0.0005, which means that ANMDA is more stable than other methods such as WBSMDA (0.0009) and DBNMDA (0.0026).

Fig. 1
figure1

The AUROCs of ANMDA and other 6 published methods

To further show the performance of ANMDA, we re-implement ABMDA, GBDT-LR, and IRFMDA-100 and compare them with ANMDA, because they use similar feature and data construction and all belong to ensemble learning algorithms. To design a fair and convincing experiment, we test these methods on the same data. The results are shown in Fig. 2. The ROC and precision-recall curves show that ANMDA outperforms ABMDA, GBDT-LR, and IRFMDA-100, achieving a higher AUROC and AUPR and a lower standard deviation. Table 2 shows the performance of the different methods over 100 runs of five-fold cross-validation.

Fig. 2
figure2

The performance of ANMDA, ABMDA, GBDT-LR and IRFMDA-100 tested on the same data

Table 2 The performance of ANMDA, ABMDA, GBDT-LR and IRFMDA-100 in 100 times five-fold cross validation

Effect of subsampling for noise smoothing

To evaluate the influence of subsampling for noise smoothing, we compare the results of using subsampling for noise smoothing or not. The results are shown in Fig. 3.

Fig. 3
figure3

The ROC and PR curves of different algorithms with and without subsampling for noise smoothing

Noisy_KNN and Noisy_MLP denote applying k-nearest neighbor (kNN) and multilayer perceptron (MLP) directly to the data, respectively. Smooth_Noisy_KNN and Smooth_Noisy_MLP denote applying kNN and MLP with subsampling for noise smoothing, respectively.

The results demonstrate that the performance of both algorithms improves after subsampling for noise smoothing: averaged over kNN and MLP, AUROC increases by 2.35% and AUPR by 3.75%.

The superiority of LightGBM in noise resistance

To reveal the noise resistance ability of each algorithm, we compare the performance of the methods (LightGBM, kNN, and MLP) on the dataset. The results are shown in Fig. 4.

Fig. 4
figure4

The ROC and PR curves of kNN, MLP, and LightGBM without subsampling for noise smoothing

Noisy_KNN, Noisy_MLP, and Noisy_LGB denote applying the kNN, MLP, and LightGBM methods directly to the data, respectively. It can be seen that LightGBM performs better than the other two algorithms, reflecting that LightGBM is adept at dealing with the noise in the data.

Case study

Further, we use ANMDA to predict undetected miRNA-disease pairs that are not recorded in the Human MicroRNA Disease Database version 2.0 (HMDD v2.0). Then, we verify the results in HMDD v3.0 which records more newly-discovered miRNA-disease associations. The results of the top 200 miRNA-disease associations predicted by ANMDA are shown in the Additional file 1.

Two kinds of case studies are carried out to prove the prediction ability of ANMDA. In the first part, we sort all of the undetected pairs and then verify the top 50 associations predicted by ANMDA against HMDD v3.0. The results are shown in the Additional file 2: Table 1. In the second part, we apply ANMDA to prostate neoplasm, gastric neoplasm, colorectal carcinoma, melanoma, and hepatocellular carcinoma. For each disease, the top 10 predicted miRNA-disease associations are selected based on the probabilities. The results are shown in the Additional file 2: Table 2.

In conclusion, the case studies indicate that ANMDA can predict potential miRNA-disease associations with high accuracy.

Discussion

In this work, we systematically analyze the noise hidden in the data and propose a novel and practical algorithm, ANMDA, to tackle it. The main reasons it works can be listed as follows: (1) By subsampling for noise smoothing, we extract several subsets from the data. In this way, the noise is distributed across the subsets, which reduces the interference caused by noise aggregation when the algorithm judges positive samples. Averaging the prediction results of the subsets further decreases the influence of the noise. (2) The residual is mainly caused by the noise hidden in the data, and LightGBM, which is based on GBDT, fits the residual in each iteration and thereby improves the final prediction.

However, ANMDA also has some limitations. First, the high computational cost of the training process is an important problem. For instance, it takes about 300 min to finish five-fold cross-validation 100 times on an Intel Xeon E3-1231 CPU with 1.5 GB of memory usage. In addition, the current sampling method for discovering reliable negative samples is a common one, so there is still room for improvement.

Conclusion

This paper proposes a novel method (ANMDA) to predict potential miRNA-disease associations. The experiment results confirm that ANMDA can achieve better results than other published methods. In the case study, several miRNA-disease associations predicted by ANMDA are supported by HMDD v3.0. Therefore, ANMDA is effective and can provide a reference for researchers. In the follow-up work, we plan to use feature selection to accelerate the training process and try to find reliable negative samples. Further, some biological experiments can also be conducted to verify the prediction results of ANMDA.

Methods

The framework of ANMDA is shown in Fig. 5.

Fig. 5
figure5

The framework of ANMDA contains three steps: constructing features; constructing data (constructing positive samples based on HMDD v2.0 and using k-means on undetected pairs to select negative samples); and applying the algorithm to predict the associations

First, the features are constructed from the miRNA functional similarity, disease semantic similarity, and Gaussian kernel functions. Second, we visualize the noise to reveal its effect on the data. Based on HMDD v2.0, we construct positive samples and use k-means on the undetected pairs to select negative samples. Then, we subsample the data to smooth the noise. Finally, each subset is fed to LightGBM, and a voting rule decides the final prediction.

MiRNA-disease associations

HMDD records experimentally supported human miRNA-disease associations. The current version of HMDD is 3.0. As most researchers [12,13,14,15,16,17,18,19,20,21,22, 25, 26] choose HMDD v2.0 to test their methods, we also use it to validate ANMDA. Finally, we obtained 5430 experimentally verified associations, covering 495 miRNAs and 383 diseases [27].

Feature construction

We construct the features by integrating miRNA functional similarity, disease semantic similarity, and using Gaussian kernel functions, which is similar to several other methods [14, 16, 18,19,20,21,22, 24,25,26].

Disease semantic similarity

Based on the idea that "functionally similar miRNAs may be associated with similar diseases, and vice versa" [28], we calculate the semantic similarity of two diseases according to how much they share in common [29].

First, according to the MeSH (Medical Subject Headings) tree structure, the relationships between diseases can be displayed as a layered directed acyclic graph (DAG). Each vertex consists of the tree numbers and the heading of one disease. The directed edges in the DAG represent the relationships between diseases. Diseases with a more general heading (like neoplasm) sit at upper layers of the DAG and are called ancestor nodes. Vertices at lower layers, called children nodes, correspond to diseases with more specific definitions. Given a disease di, its DAG is defined as follows:

$$DAG\left( {d_{i} } \right) = \left( {d_{i} ,P(d_{i} ),\;S(d_{i} )} \right)$$
(1)

where P(di) represents the set of vertices in the DAG and S(di) represents the set of edges in the DAG.

Therefore, the similarity based on the semantic value between two diseases can be measured according to their positions in the DAG. The more information two diseases share in common, the more similar they are. To be specific, the semantic similarity between disease di and disease dj can be calculated as follows:

$$SS\left( {d_{i} ,d_{j} } \right) = \frac{{\sum\nolimits_{{d \in P\left( {d_{i} } \right) \cap P(d_{j} )}} {\left( {D_{{d_{i} }} (d) + D_{{d_{j} }} (d)} \right)} }}{{V(d_{i} ) + V(d_{j} )}}$$
(2)

Here, Ddi(d) is defined as the semantic value that disease d contributes to disease di, d ranges over the vertices shared by the DAGs of di and dj, and V(di) represents the total semantic value of disease di.

To calculate Ddi(d), we assume that diseases at different layers of the DAG contribute differently to the semantic value of disease di [38]. We therefore introduce a semantic contribution factor Δ: the contribution of di to itself is defined as 1, and diseases located at upper nodes of the DAG contribute less to the semantic value of di. The contribution of disease d to the semantic value of disease di is calculated by the formula:

$$D_{{d_{i} }} (d) = \left\{ {\begin{array}{*{20}l} {1,} & {d = d_{i} } \\ {\max \left\{ {\Delta \times D_{{d_{i} }} (d^{\prime } )\mid d^{\prime } \in {\text{children of }}d} \right\},} & {d \ne d_{i} } \\ \end{array} } \right.$$
(3)

In addition, to avoid the problem that two kinds of diseases having different occurrences in the DAG are calculated as the same semantic value for being at the same layer, a new way is used to define the contribution of disease d to the semantic value of disease di:

$$D_{{d_{i} }} (d) = - \log \frac{{N_{d} }}{N}$$
(4)

In this formula, Nd is the number of DAGs containing disease d, and N is the total number of diseases. Based on the contribution of each disease d in the DAG to disease di, the semantic value V(di) can be calculated by the formula:

$$V(d_{i} ) = \sum\limits_{{d \in P(d_{i} )}} {D_{{d_{i} }} (d)}$$
(5)

As shown in Eqs. (3) and (4), there are two ways to calculate Ddi(d). Thus, two semantic similarities (SS1 and SS2) are calculated according to Eq. (2). Here, the final semantic similarity is calculated as follows:

$$SS\left( {d_{i} ,d_{j} } \right) = \frac{{SS_{1} \left( {d_{i} ,d_{j} } \right) + SS_{2} \left( {d_{i} ,d_{j} } \right)}}{2}$$
(6)
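Equations (2), (3), and (5) can be sketched in a few lines, assuming the MeSH hierarchy is given as a child-to-parents mapping. The function names, the toy hierarchy in the usage note, and the value Δ = 0.5 (commonly used in prior work but not stated in this paper) are our own assumptions.

```python
DELTA = 0.5  # semantic contribution factor Δ (assumed value)

def semantic_values(disease, parents):
    """D_{d_i}(d) for d_i itself and every ancestor d (Eq. 3).
    `parents` maps each disease to the list of its direct parents in MeSH."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        nxt = []
        for node in frontier:
            for p in parents.get(node, []):
                value = DELTA * contrib[node]
                if value > contrib.get(p, 0.0):  # max over children, Eq. (3)
                    contrib[p] = value
                    nxt.append(p)
        frontier = nxt
    return contrib

def semantic_similarity(di, dj, parents):
    """SS1(di, dj) per Eq. (2); denominators are V(di) + V(dj) from Eq. (5)."""
    Di, Dj = semantic_values(di, parents), semantic_values(dj, parents)
    shared = set(Di) & set(Dj)
    num = sum(Di[d] + Dj[d] for d in shared)
    return num / (sum(Di.values()) + sum(Dj.values()))
```

For two sibling diseases that share only a common parent, each contributes Δ from the shared node, so the similarity is (Δ + Δ) / ((1 + Δ) + (1 + Δ)) = 1/3 when Δ = 0.5.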

miRNA functional similarity

Researchers combine disease phenotype similarity, semantic similarity, and the miRNA-disease network to calculate miRNA functional similarity [30, 31].

For two miRNAs mi and mj: (1) according to the miRNA-disease network, we set MDi = {md1, md2, …, mdni} as the set of all diseases associated with mi, and MDj = {md1, md2, …, mdnj} as the set of all diseases associated with mj; (2) we calculate the semantic value of each disease in MDi and MDj; (3) finally, the functional similarity of mi and mj is calculated as follows:

$$FSM\left( {m_{i} ,m_{j} } \right) = \frac{{\sum\nolimits_{{1 \le p \le n_{j} }} {S\left( {md_{p} ,MD_{i} } \right) + } \sum\nolimits_{{1 \le q \le n_{i} }} {S\left( {md_{q} ,MD_{j} } \right)} }}{{n_{i} + n_{j} }}$$
(7)

Here, ni and nj are the numbers of diseases associated with mi and mj, respectively, and S(md, MD) is the maximum semantic similarity between disease md and any disease in the other set MD.
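Equation (7) translates directly into code. As a sketch, we pass the disease semantic similarities as a nested dictionary `ss[a][b]`; the function name and data layout are illustrative, not the paper's implementation.

```python
def functional_similarity(MDi, MDj, ss):
    """FSM(mi, mj) per Eq. (7). MDi/MDj are the disease sets of the two
    miRNAs; ss[a][b] is the (symmetric) semantic similarity of diseases a, b."""
    if not MDi or not MDj:
        return 0.0
    # S(md_p, MD_i): best match of each disease of mj against mi's diseases
    s_i = sum(max(ss[d][e] for e in MDi) for d in MDj)
    # S(md_q, MD_j): best match of each disease of mi against mj's diseases
    s_j = sum(max(ss[d][e] for e in MDj) for d in MDi)
    return (s_i + s_j) / (len(MDi) + len(MDj))
```

Two miRNAs annotated with exactly the same diseases thus get a functional similarity of 1.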

Disease and miRNA similarity

As mentioned above, the Gaussian interaction kernel function is used for computing the disease and miRNA similarity [32].

In the miRNA-disease association network, the binary interaction profile vector IP(xi) represents the interaction information of disease or miRNA. Therefore, the Gaussian interaction profile kernel similarity for diseases or miRNAs is defined as follows:

$$GS_{x} \left( {x_{i} ,x_{j} } \right) = \exp \left( { - \gamma _{x} \left\| {IP(x_{i} ) - IP(x_{j} )} \right\|^{2} } \right)$$
(8)

In the formula, x can represent disease d or miRNA m, IP(xi) is the interaction information of disease di or miRNA mi. IP(xj) is the interaction information of disease dj or miRNA mj.

γx is a parameter controlling the kernel bandwidth; it is obtained by normalizing an original bandwidth γ′x by the average number of associated miRNAs (diseases) per disease (miRNA). The specific formula is as follows:

$$\gamma _{x} = \gamma _{x}^{\prime } /\left( {\frac{1}{{n_{x} }}\sum\limits_{{i = 1}}^{{n_{x} }} {\left\| {IP(x_{i} )} \right\|} ^{2} } \right)$$
(9)

Here, we set γ′x to 1 based on the previous study [33] to allow a fair comparison.
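Equations (8) and (9) reduce to a few lines of NumPy over the binary adjacency matrix. The helper name `gaussian_profile_kernel` and the axis convention (rows = miRNAs, columns = diseases) are our own assumptions for this sketch.

```python
import numpy as np

def gaussian_profile_kernel(A, axis=0, gamma_prime=1.0):
    """Gaussian interaction profile kernel similarity (Eqs. 8-9).
    A: binary miRNA-disease adjacency matrix.
    axis=0 -> similarity of row entities (miRNAs, profiles are rows);
    axis=1 -> similarity of column entities (diseases, profiles are columns)."""
    IP = A if axis == 0 else A.T                     # each row is IP(x_i)
    sq = np.sum(IP ** 2, axis=1)                     # ||IP(x_i)||^2
    gamma = gamma_prime / np.mean(sq)                # Eq. (9)
    d2 = sq[:, None] + sq[None, :] - 2 * IP @ IP.T   # pairwise squared distances
    return np.exp(-gamma * d2)                       # Eq. (8)
```

Identical interaction profiles give similarity 1, and the similarity decays smoothly as the profiles diverge.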

Integrated similarity for diseases and miRNAs

To deal with the problem that some diseases have no semantic similarity or miRNAs have no functional similarity, here we propose a reasonable method: if SS(di, dj) (the semantic similarity of disease di and dj) exists, the similarity of these two diseases will finally be

$$\frac{{GS_{d} \left( {d_{i} ,d_{j} } \right) + SS\left( {d_{i} ,d_{j} } \right)}}{2}$$
(10)

the average of Gaussian interaction profile kernel similarity and semantic similarity; otherwise, it will be only GSd(di, dj) (Gaussian interaction profile kernel similarity). In the same way, if FSM(mi, mj) (the functional similarity of miRNA mi and mj) exists, the similarity of these two miRNAs will finally be

$$\frac{{GS_{m} \left( {m_{i} ,m_{j} } \right) + FSM\left( {m_{i} ,m_{j} } \right)}}{2}$$
(11)

the average of Gaussian interaction profile kernel similarity and functional similarity; otherwise, it will be only GSm(mi, mj) (Gaussian interaction profile kernel similarity).
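The integration rule of Eqs. (10) and (11) is a masked average. A minimal sketch, assuming the similarities are given as matrices plus a boolean mask marking where semantic/functional similarity exists:

```python
import numpy as np

def integrate(gs, sim, has_sim):
    """Integrated similarity (Eqs. 10-11): where `has_sim` is True, average
    the Gaussian kernel similarity `gs` with the semantic/functional
    similarity `sim`; elsewhere keep the Gaussian kernel similarity alone."""
    return np.where(has_sim, (gs + sim) / 2.0, gs)
```

The same function serves both the disease side (gs = GSd, sim = SS) and the miRNA side (gs = GSm, sim = FSM).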

Noise visualization

From HMDD v2.0, we download 5430 miRNA-disease associations as positive samples. According to the research in AEMDA [24], there are 12,034 known pairs in HMDD v3.0. Therefore, if we choose negative samples randomly from the undetected pairs, we estimate that the data will contain about 3.59% noise.
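The 3.59% estimate follows from the counts already given: the undetected pairs are all miRNA-disease combinations minus the v2.0 positives, and the hidden noise is the pairs later confirmed in v3.0.

```python
n_mirnas, n_diseases = 495, 383
known_v2, known_v3 = 5430, 12034

undetected = n_mirnas * n_diseases - known_v2  # candidate negative pairs
false_negatives = known_v3 - known_v2          # pairs later confirmed in HMDD v3.0
noise_rate = false_negatives / undetected
print(f"{noise_rate:.2%}")                     # prints 3.59%
```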

To illustrate the impact of the noise, we design the experiment as follows:

1. First, we extract 200 positive samples and 200 negative samples as noise-free data from the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset [34].

2. Then, we deliberately flip the labels of 7 positive samples in the noise-free data to negative to simulate the noise hidden in data, forming the noisy data. The simulation process is shown in Fig. 6. The red dots represent the noise hidden in the data; the blue and black dots represent positive and negative samples, respectively. It can be seen that the decision boundaries differ between the two situations because of the noise.

3. Further, we maintain 200 positive and 200 negative samples in the noisy data to keep the experiment rigorous.

4. Finally, we run the logistic regression algorithm on both the noise-free and the noisy data to demonstrate the interference caused by the noise. The results are listed in Table 3.
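The steps above can be sketched with scikit-learn. This is an approximation of the experiment, not the paper's code: the random seed is arbitrary, the labels are simply flipped rather than rebalanced back to exactly 200/200, and cross-validated AUROC stands in for the paper's reported metrics.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

# Step 1: 200 positive and 200 negative samples form the noise-free data
pos = rng.choice(np.where(y == 1)[0], 200, replace=False)
neg = rng.choice(np.where(y == 0)[0], 200, replace=False)
idx = np.concatenate([pos, neg])
X_sub, y_clean = X[idx], y[idx]

# Step 2: flip the labels of 7 positive samples to simulate hidden noise
y_noisy = y_clean.copy()
flip = rng.choice(np.where(y_clean == 1)[0], 7, replace=False)
y_noisy[flip] = 0

# Step 4: logistic regression on noise-free vs noisy labels
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
auroc_clean = cross_val_score(clf, X_sub, y_clean, cv=5, scoring="roc_auc").mean()
auroc_noisy = cross_val_score(clf, X_sub, y_noisy, cv=5, scoring="roc_auc").mean()
```

On this dataset the noisy labels yield a visibly lower cross-validated AUROC, mirroring the interference reported in Table 3.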

Fig. 6
figure6

The interference of the noise hiding in the dataset

Table 3 The performance of the logistic regression algorithm on noisy and noise-free data

These experiments indicate that the noise hidden in the data affects the final results of miRNA-disease association prediction to a certain extent. Specifically, the noise lies close to the positive samples, which interferes with algorithms when judging positive samples.

Method for negative samples selection

Inspired by ABMDA [19], we use the k-means algorithm [35] to select negative samples. The specific process is as follows: we cluster all undetected miRNA-disease pairs into 23 clusters by k-means. Similar pairs fall into the same cluster, so the noise is concentrated within clusters and more easily distinguished. Then, we extract equal numbers of samples from each cluster as negative samples, so that the noise can be reduced to some extent.
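A minimal sketch of this sampling step, assuming the undetected pairs are already encoded as feature vectors; the function name and the per-cluster rounding are our own choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_negatives(X_undetected, n_negatives, n_clusters=23, seed=0):
    """Cluster undetected pairs with k-means and draw an (approximately)
    equal number of negative samples from each cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_undetected)
    rng = np.random.default_rng(seed)
    per_cluster = n_negatives // n_clusters
    chosen = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        take = min(per_cluster, len(members))  # guard against tiny clusters
        chosen.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(chosen)
```

The returned indices point back into `X_undetected`, so the selected rows can be concatenated with the positive samples to build the training data.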

Anti-noise computational model for miRNA-disease associations prediction

To further resist the noise, we propose a subsampling method for noise smoothing motivated by Ho [36]. In detail, we construct several subsets by sampling with replacement from the original data.

Then, we feed each subset to LightGBM [37], an ensemble algorithm based on GBDT [38]. In each learning iteration, the base model of LightGBM learns the residual from the previous iteration, which improves performance. What's more, LightGBM utilizes two significant techniques: Gradient-based One-Side Sampling (GOSS) for data samples and Exclusive Feature Bundling (EFB) for features. Specifically, GOSS retains the examples with large gradients and randomly picks examples with small gradients, which reduces the training cost. EFB bundles many mutually exclusive features into fewer dense features, which further avoids unnecessary computation on zero feature values.

The final result is the average of each subset's prediction. The detailed steps of ANMDA are shown in Fig. 7.
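The subsample-train-vote loop can be sketched as follows. The paper trains `lightgbm.LGBMClassifier`; here scikit-learn's `GradientBoostingClassifier` is substituted so the sketch runs without the lightgbm package, and the function names and subset count are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
# from lightgbm import LGBMClassifier  # the paper's learner; a drop-in replacement

def train_anmda_ensemble(X, y, n_subsets=10, seed=0):
    """Train one boosted-tree model per subset sampled with replacement."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_subsets):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap subset
        model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
        models.append(model.fit(X[idx], y[idx]))
    return models

def vote(models, X):
    """Soft voting: average the subset models' predicted probabilities."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

Averaging the probabilities rather than hard class labels keeps a continuous score, which is what the AUROC/AUPR evaluation and the case-study ranking require.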

Fig. 7
figure7

The pseudocode of ANMDA

Availability of data and materials

The data and materials are available from https://github.com/BioInfoLeo/ANMDA

Abbreviations

miRNA:

MicroRNA

ANMDA:

Anti-noise algorithm for predicting miRNA-disease associations

LightGBM:

Light gradient boosting machine

HMDD:

Human microRNA disease database

ncRNA:

Non-coding RNA

ROC:

Receiver operating characteristic

PR:

Precision-recall

AUROC:

Area under the receiver operating characteristic curve

AUPR:

Area under the precision-recall curve

kNN:

k-Nearest neighbor

MLP:

Multilayer perceptron

DAG:

Directed acyclic graph

GOSS:

Gradient-based one-side sampling

EFB:

Exclusive feature bundling

References

  1. 1.

    Stark A, Brennecke J, Bushati N. Animal microRNAs confer robustness to gene expression and have a significant impact on 3’UTR evolution. Cell. 2005;123(6):1133–46.

    CAS  Article  Google Scholar 

  2. 2.

    Hayashita Y, Osada H, Tatematsu Y. A polycistronic microRNA cluster, miR-17-92, is overexpressed in human lung cancers and enhances cell proliferation. Cancer Res. 2005;65(21):9628–32.

    CAS  Article  Google Scholar 

  3. 3.

    Hatfield SD, Shcherbata HR, Fischer KA. Stem cell division is regulated by the microRNA pathway. Nature. 2005;435(7044):974–8.

    CAS  Article  Google Scholar 

  4. 4.

    Kozomara A, Birgaoanu M, Griffiths-Jones S. miRBase: from microRNA sequences to function. Nucleic Acids Res. 2019;47:D155–62.

    CAS  Article  Google Scholar 

  5. 5.

    Toxopeus E, Lynam-Lennon N, Biermann K. Tumor microRNA-126 controls cell viability and associates with poor survival in patients with esophageal adenocarcinoma. Exp Biol Med. 2019;244(14):1210–9.

    CAS  Article  Google Scholar 

  6. 6.

    Sharma S, Lu HC. microRNAs in neurodegeneration: current findings and potential impacts. J Alzheimers Dis Parkinsonism. 2018;8(1):420.

    Article  Google Scholar 

  7. 7.

    Pofi R, Giannetta E, Galea N, Francone M, Campolo F, Barbagallo F, et al. Diabetic cardiomiopathy progression is triggered by miR122–5p and involves extracellular matrix: a 5-year prospective study. JACC. Cardiovascular Imaging. 2020.

  8. 8.

    Li L, Masica D, Ishida M. Human bile contains microRNA-laden extracellular vesicles that can be used for cholangiocarcinoma diagnosis. Hepatology. 2014;60(3):896–907.

    CAS  Article  Google Scholar 

  9. 9.

    Perez-Iratxeta C, Wjst M, Bork P. G2D: a tool for mining genes associated with disease. BMC Genet. 2005;6:45.

    Article  Google Scholar 

  10. 10.

    Chen X, Xie D, Zhao Q. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2019;20:515–39.

    CAS  Article  Google Scholar 

  11. 11.

    Jiang Q, Hao Y, Wang G. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst Biol. 2010;4:S2.

    Article  Google Scholar 

  12. 12.

    Chen X, Yan CC, Zhang X. WBSMDA: within and between score for MiRNA-disease association prediction. Sci Rep. 2016;6:21106.

    CAS  Article  Google Scholar 

  13. 13.

    Shi H, Xu J, Zhang G. Walking the interactome to identify human miRNA-disease associations through the functional link between miRNA targets and disease genes. BMC Syst Biol. 2013;7:101.

    Article  Google Scholar 

  14. 14.

    You Z, Huang ZA, Zhu ZX. PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput. Biol. 2017;13(3):e1005455.

  15. 15.

    Qu Y, Zhang HX, Liang C. KATZMDA: prediction of miRNA-disease associations based on KATZ model. IEEE Access. 2018;6:3943–50.

    Article  Google Scholar 

  16. 16.

    Chen X, Wu QF, Yan GY. RKNNMDA: ranking-based KNN for miRNA-disease association prediction. RNA Biol. 2017;14(7):952–62.

    Article  Google Scholar 

  17. 17.

    Ha J, Park C, Park S. PMAMCA: prediction of microRNA-disease association utilizing a matrix completion approach. BMC Syst Biol. 2019;13:33.

    Article  Google Scholar 

  18. 18.

    Zhu X, Wang X, Zhao H. BHCMDA: A new biased heat conduction based method for potential MiRNA-Disease association prediction. Front Genet. 2020;11:384.

    CAS  Article  Google Scholar 

  19. 19.

    Zhao Y, Chen X, Yin J. Adaptive boosting-based computational model for predicting potential miRNA-disease associations. Bioinformatics. 2019;35(22):4730–8.

    CAS  Article  Google Scholar 

  20. 20.

    Zhou S, Wang SL, Wu Q. Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression. Comput Biol Chem. 2020;85:107200.

  21. 21.

    Yao DJ, Zhan XJ, Kwoh CK. An improved random forest-based computational model for predicting novel miRNA-disease associations. BMC Bioinform. 2019;20:624.

    CAS  Article  Google Scholar 

  22. 22.

    Peng LH, Zhou LQ, Chen X. A computational study of potential miRNA-disease association inference based on ensemble learning and kernel ridge regression. Front Bioeng Biotechnol. 2020;8:40.

    Article  Google Scholar 

  23. 23.

    Peng JJ, Hui WW, Li QQ. A learning-based framework for miRNA-disease association identification using neural networks. Bioinformatics. 2019;35(21):4364–71.

    Article  Google Scholar 

  24. 24.

    Ji C, Gao Z, Ma X, Wu Q, Ni J, Zheng C. AEMDA: Inferring miRNA-disease associations based on deep autoencoder. Bioinformatics. 2020; 29:btaa670.

  25. 25.

    Chen X, Li TH, Zhao Y. Deep-belief network for predicting potential miRNA-disease associations. Brief Bioinform. 2020:bbaa186.

  26. 26.

    Li J, Li Z, Nie R. FCGCNMDA: predicting miRNA-disease associations by applying fully connected graph convolutional networks. Mol Genet Genomics. 2020;295(5):1197–209.

    CAS  Article  Google Scholar 

  27. 27.

    Li Y, Qiu C, Tu J. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2013;42(D1): D1070–4.

  28. 28.

    Hsu JB, Chiu CM, Hsu SD. miRTar: an integrated system for identifying miRNA-target interactions in human. BMC Bioinformatics. 2011;12:300.

    CAS  Article  Google Scholar 

  29. 29.

    Resnik P. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI’95. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1995, pp. 448–453.

  30.

    Wang D, Wang J, Lu M. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50.


  31.

    Xuan P, Han K, Guo M. Correction: prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS ONE. 2013;8(9).

  32.

    Van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics. 2011;27(21):3036–43.


  33.

    Chen X, Yan GY. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.


  34.

    The UCI ML Breast Cancer Wisconsin (Diagnostic) dataset. https://goo.gl/U2Uwz2

  35.

    Hartigan JA, Wong MA. A K-means clustering algorithm. J Roy Stat Soc: Ser C (Appl Stat). 1979;28(1):100–8.


  36.

    Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell. 1998;20(8):832–44.


  37.

    Ke G, Meng Q, Finley T. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.


  38.

    Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–232.



Acknowledgements

We would like to thank anonymous reviewers for their comments and suggestions.

Funding

This work was partially supported by grants from the National Key R&D Program of China (2019YFA0110802 and 2019YFA0802800) and by the Fundamental Research Funds for the Central Universities. The funding bodies played no role in the design of the study; in the collection, analysis, and interpretation of data; or in writing the manuscript.

Author information

Contributions

XJC, XYH, and ZRJ designed the experiments and analyzed the data. XJC and XYH performed the experiments. XJC, XYH, and ZRJ wrote the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zhen-Ran Jiang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. The top 200 miRNA-disease associations predicted by ANMDA.

Additional file 2. The case studies of ANMDA.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Chen, XJ., Hua, XY. & Jiang, ZR. ANMDA: anti-noise based computational model for predicting potential miRNA-disease associations. BMC Bioinformatics 22, 358 (2021). https://doi.org/10.1186/s12859-021-04266-6

Keywords

  • miRNA-disease association
  • k-means
  • Noise smoothing
  • Light gradient boosting machine