
Predicting potential miRNA-disease associations based on more reliable negative sample selection

Abstract

Background

A growing number of biomedical studies have shown that the dysfunction of miRNAs is closely related to many human diseases. Identifying disease-associated miRNAs contributes to the understanding of the pathological mechanisms of diseases. Supervised learning-based computational methods have continuously been developed for miRNA-disease association prediction. These approaches require negative samples, i.e., experimentally validated uncorrelated miRNA-disease pairs, but such samples are not available owing to a lack of biomedical research interest. Existing methods mainly choose negative samples randomly from the unlabelled ones. Therefore, selecting more reliable negative samples is of great importance for these methods to achieve satisfactory prediction results.

Results

In this study, we propose a computational method termed KR-NSSM, which integrates two semi-supervised algorithms to select more reliable negative samples for miRNA-disease association prediction. Our method uses a refined K-means algorithm for preliminary screening of likely negative and positive miRNA-disease samples. A Rocchio classification-based method is then applied for further screening to obtain more reliable negative and positive samples. We perform ablation tests on KR-NSSM and find that combining the two selection procedures yields more reliable negative samples for miRNA-disease association prediction. Comprehensive experiments based on fivefold cross-validation demonstrate improvements in prediction accuracy on six classic classifiers and five known miRNA-disease association prediction models when using negative samples chosen by our method rather than by previous negative sample selection strategies. Moreover, 469 of the 1123 positive miRNA-disease associations selected by our method are confirmed by existing databases.

Conclusions

Our experiments show that KR-NSSM can screen out more reliable negative samples from the unlabelled ones, which greatly improves the performance of supervised machine learning methods in miRNA-disease association predictions. We expect that KR-NSSM would be a useful tool in negative sample selection in biomedical research.


Background

As one category of endogenous non-coding RNAs with about 20–24 nucleotides in length, miRNAs have been widely discovered in plants, viruses and human beings [1]. miRNAs function as regulators of gene expression by binding to the 3ʹ-untranslated region (UTR) of their target mRNAs, which would cause translational repression or transcript degradation [2]. Existing studies have revealed that miRNAs are implicated in many crucial processes [3, 4], such as cell proliferation, apoptosis, development, differentiation and metabolism. Therefore, the dysregulation of miRNAs would result in a large number of diseases [5]. Currently, miRNAs have been recognized as important biomarkers for disease diagnosis, and detection of disease-related miRNAs can contribute to the pathological studies of diseases.

As traditional biological experiments are time-consuming and costly, computational methods to determine potential associations between miRNAs and diseases are emerging as efficient complementary tools. These methods are mainly based on the assumption that miRNAs with similar functions tend to be associated with similar diseases [6, 7]. For example, Chen et al. [8] analysed the effects of similarity measurements on miRNA-disease association prediction and presented a semi-supervised inference method, NetCBI, to prioritize associations between miRNAs and human diseases by combining OMIM phenotype similarity information and miRNA functional similarity information. Han et al. [9] proposed a novel method, DismiPred, to predict disease-related miRNA candidates by incorporating functional similarity and association information. Xuan et al. [10] developed a computational model, MIDP, which predicts disease-related miRNAs by random walk on a miRNA-disease bilayer network built from the similarities between nodes. Chen et al. [11] proposed a prediction model, WBSMDA, which combines within- and between-scores for potential miRNA-disease association inference. Chen et al. [12] developed a computational method, HAMDA, to uncover novel miRNA-disease associations by integrating network structure, node attribution and information propagation on a bipartite miRNA-disease network. You et al. [13] proposed a prediction model, PBMDA, to infer potential miRNA-disease associations by adopting a depth-first search algorithm on a miRNA-disease heterogeneous graph. Chen et al. [14] presented an inductive matrix completion model, IMCMDA, to complete missing miRNA-disease associations based on known miRNA-disease associations, integrated miRNA similarities and integrated disease similarities. Chen et al. [15] proposed a novel computational model, BNPMDA, for miRNA-disease association prediction based on bipartite network projection [16]. Xuan et al. [17] developed a method, DMAPred, which applies non-negative matrix factorization for potential miRNA-disease association inference. DMAPred projects miRNAs and diseases into low-dimensional spaces to yield feature representations, and the likelihood that a miRNA is associated with a disease is calculated according to these projections. Chen et al. [18] proposed a recommendation-based computational framework, MDVSI, to predict miRNA-disease associations by incorporating miRNA topological similarity and functional similarity. Zhang et al. [19] developed a computational model, MSFSP, to predict disease-related miRNAs by similarity fusion and space projection. Wang et al. [20] developed an unbalanced random walk algorithm, MGDF, on genome-wide similarity networks to predict miRNA–disease associations. These similarity-based approaches have achieved encouraging miRNA-disease association prediction performance, but there is still room for improvement.

Meanwhile, inspired by the successful application of machine learning methods in the fields of web search, content filtering and e-commerce, many researchers have applied machine learning techniques to infer miRNA-disease associations. For example, Chen et al. [21] formulated miRNA-disease association prediction as a classification problem and developed a decision tree-based method for association prediction. Feature vectors derived from existing associations, including similarity measurements, were used to train a regression tree under a gradient boosting framework to determine whether a miRNA-disease association existed. Chen et al. [22] proposed a random forest-based model to infer miRNA-disease associations, in which the feature vectors representing miRNA-disease samples were defined by integrated similarities, and their dimensions were further reduced to build an effective classifier. Zhao et al. [23] developed an adaptive boosting approach, ABMDA, for predicting potential associations between diseases and miRNAs. ABMDA improved learning accuracy by integrating weak classifiers constructed on decision trees. Peng et al. [24] proposed a learning framework, MDA-CNN, for miRNA-disease association identification. An auto-encoder was applied in their model to extract essential features and a convolutional neural network was used for prediction. Ji et al. [25] presented a network embedding-based method to predict miRNA-disease associations, in which the embedding representations of miRNAs and diseases were learned from a heterogeneous information network and a Random Forest (RF) classifier was used to predict potential miRNA-disease associations. Liu et al. [26] developed a computational framework, SMALF, to infer possible miRNA-disease associations. SMALF utilized a stacked autoencoder to learn latent features, and XGBoost was used to make predictions for the unlabelled miRNA-disease associations. Tang et al. [27] presented a graph convolutional network-based method, MMGCN, with multi-view multichannel attention to predict potential miRNA–disease associations. Liu et al. [28] proposed a computational method, DFELMDA, to predict miRNA-disease associations, in which two deep autoencoders were applied for low-dimensional feature representation and the prediction scores of unlabelled miRNA-disease associations were obtained by a deep random forest. Wang et al. [29] proposed a graph attention network-based framework, MKGAT, and used dual Laplacian regularized least squares to predict potential miRNA-disease associations. With recent advances in machine learning, especially in deep learning, these methods have achieved increasingly accurate results in miRNA-disease association prediction.

Supervised machine learning methods need both positive and negative samples to predict reliable miRNA-disease associations. However, the required negative samples are not available owing to a lack of research interest in the life sciences. Previous studies used two strategies to address this problem. The first is to randomly select negative samples from the unlabelled associations [22, 26, 30]. The second is to divide the unlabelled miRNA-disease samples into K parts using the K-means algorithm and then randomly select negative samples from the K clusters [23, 31]. Because the unlabelled set still contains undiscovered positive samples, both selection strategies introduce noise and result in less reliable prediction performance.
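The first baseline strategy can be illustrated with a minimal sketch. The function name `sample_random_negatives`, the index-pair representation and the toy sizes are assumptions made for illustration; they do not come from the cited methods.

```python
import random

def sample_random_negatives(n_mirnas, n_diseases, positives, n_samples, seed=0):
    """Draw n_samples unlabelled (miRNA, disease) index pairs uniformly at random."""
    positive_set = set(positives)
    # Every pair that is not a known association is treated as unlabelled.
    unlabelled = [(m, d)
                  for m in range(n_mirnas)
                  for d in range(n_diseases)
                  if (m, d) not in positive_set]
    rng = random.Random(seed)
    # Sampling without replacement, so no pair is drawn twice.
    return rng.sample(unlabelled, n_samples)

# Toy example: 3 miRNAs, 3 diseases, 3 known associations.
positives = [(0, 0), (1, 2), (2, 1)]
negatives = sample_random_negatives(3, 3, positives, n_samples=3)
```

Nothing in this procedure prevents a true but not yet discovered association from being drawn as a "negative", which is the source of the noise described above.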

In this study, we propose a novel method named KR-NSSM to select more reliable negative samples for miRNA-disease association inference. Specifically, KR-NSSM first combines similarity measurements of miRNAs and diseases to generate feature vectors for miRNA-disease pairs. It then applies SS-Kmeans [32] to obtain likely negative and positive samples from the unlabelled ones. Finally, Rocchio classification [33] is used to obtain more reliable negative and positive samples for inference. Comprehensive experiments based on fivefold cross-validation show that using negative samples obtained by KR-NSSM could significantly improve prediction accuracy compared with using those selected by existing negative sample selection strategies. Moreover, we obtain 1123 reliable positive samples by using KR-NSSM, among which 469 have been confirmed by existing databases.

Results

Evaluation metric

The benchmark dataset (see “Methods”) contains 5430 experimentally confirmed miRNA-disease associations, which are considered as positive samples in this study. We select negative samples from the unlabelled ones using not only our method KR-NSSM, but also existing strategies, such as random selection or K-means. We test the effects of the negative samples selected by the different strategies on the final predictions. We apply fivefold cross-validation to systematically analyse prediction performance, in which the samples are randomly divided into five equal parts. In each round, one part is used as the test set and the other four parts as the training set. We prioritize the inferred miRNA-disease associations according to the final prediction scores. The true positive rate (TPR) and false positive rate (FPR) are calculated by varying the threshold. We further calculate AUC, AUPR, Precision, Recall, F1-score and Accuracy as evaluation metrics for performance assessment and comparison.
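The evaluation protocol can be sketched as follows. This is an illustrative outline only: the helper names `fivefold_indices` and `threshold_metrics`, the fixed seed and the toy labels and scores are assumptions, not the authors' evaluation code.

```python
import random

def fivefold_indices(n_samples, seed=0):
    """Yield (train, test) index lists for five roughly equal random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::5] for i in range(5)]
    for k in range(5):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def threshold_metrics(labels, scores, threshold):
    """Compute TPR, FPR, precision, F1 and accuracy at one score threshold."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    tpr = tp / (tp + fn) if tp + fn else 0.0        # recall
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"tpr": tpr, "fpr": fpr, "precision": precision,
            "f1": f1, "accuracy": accuracy}

# Toy scores: three positives and three negatives, threshold 0.5.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.1]
m = threshold_metrics(labels, scores, threshold=0.5)
```

Sweeping the threshold over all observed scores and plotting TPR against FPR gives the ROC curve; plotting precision against recall gives the PR curve.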

Ablation test in KR-NSSM

In our method KR-NSSM, we combine SS-Kmeans and Rocchio classification for negative sample selection. To test whether this combination helps infer miRNA-disease associations, we design three ablation experiments. The first uses only SS-Kmeans for screening. The second uses only Rocchio classification for screening. The third integrates the two strategies for screening. We use logistic regression (LR) as the benchmark classifier and conduct fivefold cross-validation to test the prediction performance of each setting. The experiments are based on a balanced data set of positive and negative samples. The results are shown in Table 1. Table 1 shows that using negative samples from KR-NSSM achieves the best prediction performance, which indicates that the negative samples obtained by KR-NSSM are the most reliable.

Table 1 The ablation experimental results based on fivefold cross-validations

Performance evaluation on classic classifiers

To further evaluate the performance of our method KR-NSSM, we use six different classification algorithms for miRNA-disease association prediction: LightGBM, support vector machine (SVM), Random Forest (RF), logistic regression (LR), XGBoost and multilayer perceptron (MLP). LightGBM is a computational framework based on gradient boosting decision trees (GBDT). We set the number of decision trees in LightGBM to 1000, the maximum number of leaf nodes to 100 and the learning rate to 0.05, and leave the remaining parameters at their default values. SVM is a classical binary classification model, which has achieved good results in many classification problems. We use the RBF kernel and leave the remaining SVM parameters at their default values. In random forest, we set the number of decision trees to 50 and leave the remaining parameters at their default values. In XGBoost, we set the number of trees to 1000 and the learning rate to 0.1, and leave the remaining parameters at their default values. For MLP, we use two hidden layers with 30 and 20 neurons, respectively, and update the weights using a quasi-Newton method.

Since 5430 experimentally verified miRNA-disease associations are taken as positive samples in our study (see “Methods”), we use KR-NSSM to select 5430 negative samples to generate a balanced data set. In the control group, we randomly choose 5430 negative samples from the unknown associations. We conduct fivefold cross-validation for association prediction and plot the ROC and PR curves in Figs. 1 and 2, respectively. Table 2 lists the prediction performance. Table 2 shows that better performance is achieved when using the negative samples selected by KR-NSSM, which indicates that these negative samples are more reliable.

Fig. 1

ROC curves of different classifiers based on fivefold cross-validations and different strategies of negative sample selection

Fig. 2

PR curves of different classifiers based on fivefold cross-validations and different strategies of negative sample selection

Table 2 Performance comparison based on six classical classifiers and fivefold cross-validations

Performance evaluation on existing miRNA-disease association prediction models

We choose five existing supervised methods developed for miRNA-disease association prediction (RFMDA [22], IRFMDA [30], ABMDA [23], GBDT-LR [31] and SMALF [26]) for performance evaluation. Note that RFMDA, IRFMDA and SMALF randomly select negative samples from the unlabelled associations, while ABMDA and GBDT-LR select negative samples by random sampling after K-means clustering of the unlabelled associations. We replace these negative sample selection strategies with KR-NSSM and evaluate the prediction performance based on fivefold cross-validation experiments. The results are summarised in Table 3, which suggests that using negative samples obtained by KR-NSSM can significantly improve prediction performance. This further demonstrates the reliability of negative sample selection by KR-NSSM.

Table 3 Performance comparison of existing prediction methods based on fivefold cross-validations

Identification of positive miRNA-disease associations

Besides negative sample selection, KR-NSSM can produce positive miRNA-disease associations (see “Methods”). After applying KR-NSSM to the benchmark dataset, we obtain a reliable positive set that contains 1123 potential miRNA-disease associations. For validation, we choose two established databases, HMDD V3.2 [34] and dbDEMC [35], which store miRNA-disease association entries obtained by text mining of the literature. We find that 469 of the 1123 associations are supported by these databases. Note that the remaining unconfirmed associations may still exist in reality, as the investigation of the roles of miRNAs in diseases is not complete. We provide the 1123 positive associations as an additional file (see Additional file 1) for further studies.

Conclusions

For supervised machine learning methods that predict miRNA-disease associations, a core challenge is that experimentally supported uncorrelated miRNA-disease pairs, which would serve as negative samples, are not available. In this study, we propose a negative sample screening model, KR-NSSM, to address this problem. Our method consists of two steps: a refined K-means for preliminary screening and a Rocchio classification-based procedure for further screening. In contrast to the original K-means and Rocchio algorithms, we take the experimentally confirmed miRNA-disease association pairs in HMDD V2.0 as positive samples for more accurate classification. The ablation test on KR-NSSM shows that integrating the two procedures increases prediction accuracy.

Experimental results from six classic classifiers and five well-known prediction models based on fivefold cross-validation indicate that using the negative samples obtained by KR-NSSM can significantly improve the accuracy of miRNA-disease association prediction. This is because we integrate two semi-supervised algorithms in KR-NSSM, so that more reliable negative samples can be selected. Meanwhile, KR-NSSM can also screen a certain number of reliable positive samples based on the same principle. Some of the selected positive samples are verified by existing databases. The experiments show the effectiveness of our method. Many other association prediction tasks in bioinformatics, such as drug-target [36], drug-disease [37] and lncRNA-disease [38] prediction, also lack available negative samples, so supervised methods for these tasks likewise need reliable negative samples to be selected. We believe that KR-NSSM can be widely applied in these fields for negative sample selection.

Methods

Benchmark dataset

The benchmark dataset used in our study is downloaded from reference [26], in which known miRNA-disease associations are obtained from HMDD V2.0 [39]. These miRNA-disease associations are considered as positive samples. The miRNA functional similarity scores computed in reference [40] are taken as miRNA-miRNA similarities. Disease-disease similarities are calculated according to their semantic values based on the MeSH database (http://www.ncbi.nlm.nih.gov/). We finally obtain 5430 miRNA-disease associations involving 495 miRNAs and 383 diseases.

Method overview

Construction of feature vectors

We construct the feature vectors to represent miRNA-disease associations as follows. First, we obtain a 383-dimensional vector consisting of 383 disease similarity scores to represent each disease, and a 495-dimensional vector consisting of 495 miRNA similarity scores to represent each miRNA. Then, we represent each sample by an 878-dimensional feature vector consisting of the 495 miRNA similarity scores followed by the 383 disease similarity scores, as in Eq. (1):

$$F_{miRNA - disease} = \left( {f_{1} ,f_{2} ,...,f_{495} ,f_{496} ,...,f_{878} } \right)$$
(1)

where \(\left( f_{1}, f_{2}, \ldots, f_{495} \right)\) represents the 495 miRNA similarity scores, and \(\left( f_{496}, \ldots, f_{878} \right)\) denotes the 383 disease similarity scores. In this study, we regard the experimentally validated miRNA-disease associations as positive samples and the unknown miRNA-disease associations as unlabelled samples. Correspondingly, P and U denote the positive sample set and the unlabelled sample set.
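The construction in Eq. (1) amounts to concatenating one row of the miRNA similarity matrix with one row of the disease similarity matrix. The sketch below uses toy matrices and a hypothetical helper `build_feature_vector`; in the study itself the rows would have 495 and 383 entries, giving 878 features per pair.

```python
def build_feature_vector(mirna_sim, disease_sim, mirna_idx, disease_idx):
    """Concatenate one miRNA similarity row with one disease similarity row."""
    # miRNA similarity scores come first, disease similarity scores second.
    return list(mirna_sim[mirna_idx]) + list(disease_sim[disease_idx])

# Toy similarity matrices: 3 miRNAs and 2 diseases (illustrative values only).
mirna_sim = [[1.0, 0.4, 0.2],
             [0.4, 1.0, 0.7],
             [0.2, 0.7, 1.0]]
disease_sim = [[1.0, 0.3],
               [0.3, 1.0]]

# Feature vector for the pair (miRNA 1, disease 0): 3 + 2 = 5 dimensions.
feature = build_feature_vector(mirna_sim, disease_sim, mirna_idx=1, disease_idx=0)
```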

KR-NSSM

Inspired by previous research [32, 33, 41], we propose a negative sample screening model, KR-NSSM. The workflow of KR-NSSM is shown in Fig. 3. We integrate two algorithms, SS-Kmeans and Rocchio classification, to construct the core framework of KR-NSSM. SS-Kmeans is applied for preliminary screening of the unlabelled samples, and Rocchio classification is then used for further screening of the results of SS-Kmeans.

Fig. 3

The workflow of our method KR-NSSM

SS-Kmeans

In the first part of KR-NSSM, we use an improved K-means algorithm, SS-Kmeans [32], for screening. Unlike the traditional unsupervised K-means algorithm, SS-Kmeans uses the information of both labelled and unlabelled samples. We first generate the centroids of the positive sample set P and the unlabelled sample set U. The centroid of the positive samples, \(c_{1}\), is generated from all the feature vectors of P and is calculated by Eq. (2):

$$c_{1} = \frac{{\sum\limits_{i = 1}^{m} {p_{i} } }}{m}$$
(2)

where \(m\) is the number of positive samples, and \(p_{i}\) represents the ith positive sample. Similarly, the sample set U is used to generate \(c_{2}\), which serves as the initial centroid of the negative samples and is calculated as follows:

$$c_{2} = \frac{{\sum\limits_{i = 1}^{n} {u_{i} } }}{n}$$
(3)

where \(u_{i}\) represents the ith unlabelled sample and \(n\) is the number of unlabelled samples. We then compare the cosine similarity between each unlabelled sample \(u_{i}\) and each centroid \(c_{k}\) as follows:

$$x_{i} = \arg \max_{k} \mathrm{cosine}(u_{i} ,c_{k} )$$
(4)

where k (= 1 or 2) indexes \(c_{1}\) or \(c_{2}\), respectively. According to the value of the cosine similarity, the unlabelled samples are classified into the likely positive sample set 1 (LP1) and the likely negative sample set 1 (LN1).
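The initial split described by Eqs. (2) to (4) can be sketched in a few lines. The function names and the two-dimensional toy vectors are assumptions for illustration, and ties are sent to LN1 here, a choice the text does not specify.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors (Eqs. 2 and 3)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def initial_split(P, U):
    """Assign each unlabelled sample to the more similar centroid (Eq. 4)."""
    c1, c2 = centroid(P), centroid(U)
    lp1, ln1 = [], []
    for u in U:
        # Assumption: a tie goes to the likely negative set.
        (lp1 if cosine(u, c1) > cosine(u, c2) else ln1).append(u)
    return lp1, ln1

# Toy data: positives point along the first axis.
P = [[1.0, 0.1], [0.9, 0.2]]
U = [[0.95, 0.1], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
LP1, LN1 = initial_split(P, U)
```

In this toy case only the first unlabelled vector is closer in direction to the positive centroid, so it alone lands in LP1.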

Next, LP1 and LN1 are used to obtain new centroids, denoted \(l_{1}\) and \(l_{2}\), respectively. The new centroids are calculated in the same way as in Eq. (2) and Eq. (3). We use \(l_{1}\) and \(l_{2}\) for further classification, applying the Euclidean distance to measure similarity as follows:

$$x_{i} = \arg \min_{k} ||u_{i} - l_{k} ||^{2}$$
(5)

We repeat these steps until the centroids are stable. Eventually, we obtain the likely positive sample set (LP1) and the likely negative sample set (LN1) from SS-Kmeans.
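The refinement loop around Eq. (5) might look like the sketch below. It is a simplified illustration rather than the SS-Kmeans implementation of [32]: the names `refine`, `sq_dist` and `max_iter` are hypothetical, stability is detected through unchanged assignments (which implies unchanged centroids), and the loop stops early if one set would become empty.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance, as used in Eq. (5)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def refine(U, lp, ln, max_iter=100):
    """Reassign unlabelled samples to the nearest centroid until stable."""
    for _ in range(max_iter):
        l1, l2 = centroid(lp), centroid(ln)
        new_lp = [u for u in U if sq_dist(u, l1) < sq_dist(u, l2)]
        new_ln = [u for u in U if not sq_dist(u, l1) < sq_dist(u, l2)]
        if not new_lp or not new_ln:
            break  # assumption: keep the previous split rather than empty a set
        if new_lp == lp and new_ln == ln:
            break  # assignments unchanged, so the centroids are stable
        lp, ln = new_lp, new_ln
    return lp, ln

# Toy data with a deliberately poor starting split.
U = [[0.95, 0.1], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
start_lp = [[0.95, 0.1], [0.2, 0.9]]
start_ln = [[0.1, 1.0], [0.0, 0.8]]
LP1, LN1 = refine(U, start_lp, start_ln)
```

The misplaced vector `[0.2, 0.9]` moves to the likely negative set after one pass, and the second pass confirms that nothing changes.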

Rocchio classification

In the second part of KR-NSSM, we use Rocchio classification [33] to further screen the preliminary results of SS-Kmeans. The core purpose of Rocchio classification is to generate two prototype vectors that represent positive sample set and negative sample set. More specifically, Rocchio classification can be subdivided into rocchio1 and rocchio2.

In the first step of Rocchio classification (rocchio1), the experimentally confirmed miRNA-disease associations are used as the positive sample set P, and LN1 (the likely negative samples obtained from SS-Kmeans) is used as the negative sample set U. The prototype vectors \(\vec{c}^{ + }\) and \(\vec{c}^{ - }\) are calculated by Eq. (6) and Eq. (7), respectively.

$$\vec{c}^{ + } = \alpha \frac{1}{|P|}\sum\limits_{{\vec{d} \in P}} {\frac{{\vec{d}}}{{||\vec{d}||}}} - \beta \frac{1}{|U|}\sum\limits_{{\vec{d} \in U}} {\frac{{\vec{d}}}{{||\vec{d}||}}}$$
(6)
$$\vec{c}^{ - } = \alpha \frac{1}{|U|}\sum\limits_{{\vec{d} \in U}} {\frac{{\vec{d}}}{{||\vec{d}||}}} - \beta \frac{1}{|P|}\sum\limits_{{\vec{d} \in P}} {\frac{{\vec{d}}}{{||\vec{d}||}}}$$
(7)

where |P| and |U| are the numbers of samples in the corresponding sets, and \(||\vec{d}||\) is the Euclidean (L2) norm of \(\vec{d}\). \(\alpha\) and \(\beta\) adjust the relative influence of positive samples and negative samples, and we set them to 16 and 4, respectively.

Then, the samples in LN1 are classified according to their cosine similarity to the prototype vectors. If the similarity between a sample and the positive prototype vector is less than that between the sample and the negative prototype vector, the sample is classified as a reliable negative sample; otherwise, it is classified as a reliable positive sample. The reliable negative samples form the reliable negative sample set 2 (LN2).
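A minimal sketch of this first Rocchio step, following Eqs. (6) and (7) with \(\alpha = 16\) and \(\beta = 4\), is given below. The function names and the toy vectors are assumptions chosen for illustration and are not taken from the authors' code.

```python
import math

def unit_mean(vectors):
    """Mean of the L2-normalised vectors (the inner sums of Eqs. 6 and 7)."""
    units = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        units.append([x / norm for x in v] if norm else list(v))
    return [sum(column) / len(units) for column in zip(*units)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rocchio1(P, LN1, alpha=16.0, beta=4.0):
    """Split LN1 into reliable negatives (LN2) and the remaining samples."""
    mean_p, mean_n = unit_mean(P), unit_mean(LN1)
    c_pos = [alpha * p - beta * n for p, n in zip(mean_p, mean_n)]  # Eq. (6)
    c_neg = [alpha * n - beta * p for p, n in zip(mean_p, mean_n)]  # Eq. (7)
    ln2, rest = [], []
    for d in LN1:
        (ln2 if cosine(d, c_pos) < cosine(d, c_neg) else rest).append(d)
    return ln2, rest

# Toy data: the last candidate points in nearly the same direction as P.
P = [[1.0, 0.0], [0.8, 0.6]]
LN1 = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.1]]
LN2, reliable_positive = rocchio1(P, LN1)
```

Here the vector `[1.0, 0.1]` is more similar to the positive prototype, so it is removed from the negative candidates.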

However, rocchio1 may still produce classification errors [33]. To address this problem, we use rocchio2. In rocchio2, the K-means algorithm is used to divide LN2 into multiple subsets, i.e., \(N_{1} ,N_{2} ,N_{3} ,...,N_{k}\). Each subset is paired with P to form a data set, and the prototype vectors of each pair are calculated by Eq. (8) and Eq. (9).

$$\vec{n}_{j} = \alpha \frac{1}{{|N_{j} |}}\sum\limits_{{\vec{d} \in N_{j} }} {\frac{{\vec{d}}}{{||\vec{d}||}}} - \beta \frac{1}{|P|}\sum\limits_{{\vec{d} \in P}} {\frac{{\vec{d}}}{{||\vec{d}||}}}$$
(8)
$$\vec{p}_{j} = \alpha \frac{1}{|P|}\sum\limits_{{\vec{d} \in P}} {\frac{{\vec{d}}}{{||\vec{d}||}}} - \beta \frac{1}{{|N_{j} |}}\sum\limits_{{\vec{d} \in N_{j} }} {\frac{{\vec{d}}}{{||\vec{d}||}}}$$
(9)

where \(\vec{n}_{j}\) and \(\vec{p}_{j}\) represent the jth pair of prototype vectors. In this study, we use K-means to divide LN2 into 3 subsets. For each sample in LN2, we calculate the cosine similarity between it and each pair of prototype vectors. If the similarity between the sample and the negative prototype vector \(\vec{n}_{j}\) is greater than that between the sample and the positive prototype vector \(\vec{p}_{j}\), we consider it a reliable negative sample.
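The second Rocchio step can be sketched as follows. Two points are assumptions rather than statements from the text: the clusters are passed in ready-made (any K-means implementation could produce them), and a sample is kept when its best similarity to any negative prototype exceeds its best similarity to any positive prototype, which is one possible reading of the decision rule across the prototype pairs.

```python
import math

def unit_mean(vectors):
    """Mean of the L2-normalised vectors."""
    units = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        units.append([x / norm for x in v] if norm else list(v))
    return [sum(column) / len(units) for column in zip(*units)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prototype_pairs(P, clusters, alpha=16.0, beta=4.0):
    """Build one (n_j, p_j) prototype pair per cluster, as in Eqs. (8) and (9)."""
    mean_p = unit_mean(P)
    pairs = []
    for cluster in clusters:
        mean_n = unit_mean(cluster)
        n_j = [alpha * n - beta * p for p, n in zip(mean_p, mean_n)]  # Eq. (8)
        p_j = [alpha * p - beta * n for p, n in zip(mean_p, mean_n)]  # Eq. (9)
        pairs.append((n_j, p_j))
    return pairs

def rocchio2(P, clusters):
    """Keep samples whose best negative prototype beats the best positive one."""
    pairs = prototype_pairs(P, clusters)
    reliable = []
    for cluster in clusters:
        for d in cluster:
            best_neg = max(cosine(d, n_j) for n_j, _ in pairs)
            best_pos = max(cosine(d, p_j) for _, p_j in pairs)
            if best_neg > best_pos:  # assumed combination rule across pairs
                reliable.append(d)
    return reliable

# Toy data: LN2 already divided into 3 hypothetical clusters.
P = [[1.0, 0.0], [0.8, 0.6]]
clusters = [[[0.0, 1.0], [0.1, 0.9]], [[0.6, 0.8]], [[1.0, 0.1]]]
reliable_negatives = rocchio2(P, clusters)
```

In the toy example the singleton cluster `[1.0, 0.1]` lies closer to a positive prototype than to any negative prototype, so it is dropped from the final negative set.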

Availability of data and materials

The datasets and source codes used in this study are available from the corresponding author on reasonable request.

Abbreviations

UTR:

Untranslated region

TPR:

True positive rate

FPR:

False positive rate

LR:

Logistic regression

SVM:

Support vector machine

RF:

Random forest

MLP:

Multilayer perceptron

GBDT:

Gradient boosting decision trees

References

  1. Kozomara A, Birgaoanu M, Griffiths-Jones S. miRBase: from microRNA sequences to function. Nucleic Acids Res. 2019;47(D1):D155–62.

    Article  CAS  Google Scholar 

  2. Ambros V. The functions of animal microRNAs. Nature. 2004;431(7006):350–5.

    Article  CAS  Google Scholar 

  3. Garzon R, Fabbri M, Cimmino A, Calin GA, Croce CM. MicroRNA expression and function in cancer. Trends Mol Med. 2006;12(12):580–7.

    Article  CAS  Google Scholar 

  4. Kloosterman WP, Plasterk RH. The diverse functions of microRNAs in animal development and disease. Dev Cell. 2006;11(4):441–50.

    Article  CAS  Google Scholar 

  5. Esteller M. Non-coding RNAs in human disease. Nat Rev Genet. 2011;12(12):861–74.

    Article  CAS  Google Scholar 

  6. Lu M, Zhang Q, Deng M, Miao J, Guo Y, Gao W, Cui Q. An analysis of human microRNA and disease associations. PLoS ONE. 2008;3(10): e3420.

    Article  Google Scholar 

  7. Chen H, Guo R, Li G, Zhang W, Zhang Z. Comparative analysis of similarity measurements in miRNAs with applications to miRNA-disease association predictions. BMC Bioinform. 2020;21(1):176.

    Article  CAS  Google Scholar 

  8. Chen H, Zhang Z. Similarity-based methods for potential human microRNA-disease association prediction. BMC Med Genomics. 2013;6(1):1–9.

    Article  CAS  Google Scholar 

  9. Han K, Xuan P, Ding J, Zhao Z, Hui L, Zhong Y. Prediction of disease-related microRNAs by incorporating functional similarity and common association information. Genet Mol Res. 2014;13(1):2009–19.

    Article  CAS  Google Scholar 

  10. Xuan P, Han K, Guo Y, Li J, Li X, Zhong Y, Zhang Z, Ding J. Prediction of potential disease-associated microRNAs based on random walk. Bioinformatics. 2015;31(11):1805–15.

    Article  CAS  Google Scholar 

  11. Chen X, Yan CC, Zhang X, You ZH, Deng L, Liu Y, Zhang Y, Dai Q. WBSMDA: within and between score for MiRNA-disease association prediction. Sci Rep. 2016;6:21106.

    Article  CAS  Google Scholar 

  12. Chen X, Niu YW, Wang GH, Yan GY. HAMDA: hybrid approach for MiRNA-disease association prediction. J Biomed Inform. 2017;76:50–8.

    Article  CAS  Google Scholar 

  13. You ZH, Huang ZA, Zhu Z, Yan GY, Li ZW, Wen Z, Chen X. PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput Biol. 2017;13(3): e1005455.

    Article  Google Scholar 

  14. Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNA-disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.

    CAS  PubMed  Google Scholar 

  15. Chen X, Xie D, Wang L, Zhao Q, You ZH, Liu H. BNPMDA: Bipartite network projection for MiRNA-disease association prediction. Bioinformatics. 2018;34(18):3178–86.

    Article  CAS  Google Scholar 

  16. Zhou T, Ren J, Medo M, Zhang Y-C. Bipartite network projection and personal recommendation. Phys Rev E. 2007;76(4): 046115.

    Article  Google Scholar 

  17. Xuan P, Zhang Y, Zhang T, Li L, Zhao L. Predicting miRNA-disease associations by incorporating projections in low-dimensional space and local topological information. Genes (Basel). 2019;10(9):685.

    Article  CAS  Google Scholar 

  18. Chen Q, Zhe Z, Lan W, Zhang R, Wang Z, Luo C. Chen Y-PP: Identifying miRNA-disease association based on integrating miRNA topological similarity and functional similarity. Quant Biol. 2019;7(3):202–9.

    Article  CAS  Google Scholar 

  19. Zhang Y, Chen M, Cheng X, Wei H. MSFSP: a novel miRNA-disease association prediction model by federating multiple-similarities fusion and space projection. Front Genet. 2020;11:389.

    Article  CAS  Google Scholar 

  20. Wang C, Sun K, Wang J, Guo M. Data fusion-based algorithm for predicting miRNA–disease associations. Comput Biol Chem. 2020;88: 107357.

    Article  CAS  Google Scholar 

  21. Chen X, Huang L, Xie D, Zhao Q. EGBMMDA: extreme gradient boosting machine for MiRNA-disease association prediction. Cell Death Dis. 2018;9(1):3.

    Article  Google Scholar 

  22. Chen X, Wang CC, Yin J, You ZH. Novel human miRNA-disease association inference based on random forest. Mol Ther Nucleic Acids. 2018;13:568–79.

    Article  CAS  Google Scholar 

  23. Zhao Y, Chen X, Yin J. Adaptive boosting-based computational model for predicting potential miRNA-disease associations. Bioinformatics. 2019;35(22):4730–8.

    Article  CAS  Google Scholar 

  24. Peng J, Hui W, Li Q, Chen B, Hao J, Jiang Q, Shang X, Wei Z. A learning-based framework for miRNA-disease association identification using neural networks. Bioinformatics. 2019;35(21):4364–71.

    Article  Google Scholar 

  25. Ji BY, You ZH, Cheng L, Zhou JR, Alghazzawi D, Li LP. Predicting miRNA-disease association from heterogeneous information network with GraRep embedding model. Sci Rep. 2020;10(1):6658.

    Article  CAS  Google Scholar 

  26. Liu D, Huang Y, Nie W, Zhang J, Deng L. SMALF: miRNA-disease associations prediction based on stacked autoencoder and XGBoost. BMC Bioinform. 2021;22(1):219.

    Article  CAS  Google Scholar 

  27. Tang X, Luo J, Shen C, Lai Z. Multi-view multichannel attention graph convolutional network for miRNA-disease association prediction. Brief Bioinform. 2021;22(6):bbab174.

    Article  Google Scholar 

  28. Liu W, Lin H, Huang L, Peng L, Tang T, Zhao Q, Yang L. Identification of miRNA-disease associations via deep forest ensemble learning based on autoencoder. Brief Bioinform. 2022;23(3):bbac104.

    Article  Google Scholar 

  29. Wang W, Chen H. Predicting miRNA-disease associations based on graph attention networks and dual Laplacian regularized least squares. Br Brief Bioinform. 2022;23(5):bbac292.

    Article  Google Scholar 

  30. Yao D, Zhan X, Kwoh CK. An improved random forest-based computational model for predicting novel miRNA-disease associations. BMC Bioinform. 2019;20(1):624.

    Article  CAS  Google Scholar 

  31. Zhou S, Wang S, Wu Q, Azim R, Li W. Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression. Comput Biol Chem. 2020;85: 107200.

    Article  CAS  Google Scholar 

  32. Yoder J, Priebe CE. Semi-supervised k-means++. J Stat Comput Simul. 2017;87(13):2597–608.

    Article  Google Scholar 

  33. Li X, Liu B. Learning to classify texts using positive and unlabeled data. In: IJCAI: 2003. Citeseer: 587–592.

  34. Huang Z, Shi J, Gao Y, Cui C, Zhang S, Li J, Zhou Y, Cui Q. HMDD v3.0: a database for experimentally supported human microRNA-disease associations. Nucleic Acids Res. 2019;47(D1):D1013–7.

  35. Yang Z, Wu L, Wang A, Tang W, Zhao Y, Zhao H, Teschendorff AE. dbDEMC 2.0: updated database of differentially expressed miRNAs in human cancers. Nucleic Acids Res. 2017;45(D1):D812–8.

  36. Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics. 2008;24(13):i232–40.

  37. Gottlieb A, Stein GY, Ruppin E, Sharan R. PREDICT: a method for inferring novel drug indications with application to personalized medicine. Mol Syst Biol. 2011;7:496.

  38. Zhu R, Wang Y, Liu JX, Dai LY. IPCARF: improving lncRNA-disease association prediction using incremental principal component analysis feature selection and a random forest classifier. BMC Bioinform. 2021;22(1):175.

  39. Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, Cui Q. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2014;42(Database issue):D1070–4.

  40. Wang D, Wang J, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50.

  41. Wu Y, Zhu D, Wang X, Zhang S. An ensemble learning framework for potential miRNA-disease association prediction with positive-unlabeled data. Comput Biol Chem. 2021;95:107566.

Acknowledgements

Not applicable.

Funding

This work was supported by the National Natural Science Foundation of China under grant numbers 61862026 and 62062063.

Author information

Contributions

H.C. conceived and designed this study. R.G. and W.W. implemented the experiments. R.G., H.C. and W.W. analysed the results. R.G., H.C., G.W. and F.L. wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hailin Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. The 1123 more reliable miRNA-disease associations selected by KR-NSSM.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Guo, R., Chen, H., Wang, W. et al. Predicting potential miRNA-disease associations based on more reliable negative sample selection. BMC Bioinformatics 23, 432 (2022). https://doi.org/10.1186/s12859-022-04978-3

  • DOI: https://doi.org/10.1186/s12859-022-04978-3

Keywords