 Research
 Open Access
SMALF: miRNA-disease associations prediction based on stacked autoencoder and XGBoost
BMC Bioinformatics volume 22, Article number: 219 (2021)
Abstract
Background
Identifying miRNA-disease associations helps us understand disease mechanisms at the molecular level. However, identification through biological experiments is usually undirected, time-consuming, and small-scale. Hence, developing computational methods to predict unknown miRNA-disease associations is becoming increasingly important.
Results
In this work, we develop a computational framework called SMALF to predict unknown miRNA-disease associations. SMALF first utilizes a stacked autoencoder to learn miRNA latent features and disease latent features from the original miRNA-disease association matrix. Then, SMALF obtains the feature vector representing each miRNA-disease pair by integrating miRNA functional similarity, miRNA latent features, disease semantic similarity, and disease latent features. Finally, XGBoost is utilized to predict unknown miRNA-disease associations. We implement cross-validation experiments. Compared with other state-of-the-art methods, SMALF achieved the best AUC value. We also conduct three case studies, on hepatocellular carcinoma, colon cancer, and breast cancer. The results show that 10, 10, and 9 out of the top ten predicted miRNAs are verified in MNDR v3.0 or miRCancer, respectively.
Conclusion
The comprehensive experimental results demonstrate that SMALF is effective in identifying unknown miRNA-disease associations.
Background
Human cells contain a variety of non-coding RNAs. MicroRNAs (miRNAs) are a class of short non-coding RNAs, about 20–25 nucleotides in length, which play an essential role in various biological processes of living organisms [1]. In 1993, the first miRNA, lin-4, was discovered in C. elegans [2]. However, this discovery did not catch researchers’ attention at that time, and miRNAs used to be seen as the “Dark Matter”. Now, a substantial number of miRNAs have been found in animals, plants, viruses, and humans. Mounting evidence has shown that miRNAs participate in cell proliferation, cell division, cell death, cell differentiation, hematopoiesis, and neural development [3].
Besides, miRNAs have been identified to regulate gene expression post-transcriptionally by affecting the translation of mRNA [4], which means the dysregulation of miRNAs may be associated with various diseases by affecting gene expression. Studies have validated that miRNAs are closely related to diseases [5, 6]. For example, chronic lymphocytic leukemia (CLL) has been linked to miR-15 and miR-16, which control the anti-apoptotic B-cell lymphoma protein BCL2 in B cells [7]. Iorio et al. proposed that the abnormal expression of miR-21, miR-125b, miR-145, and miR-155 is involved in human breast cancer [8]. Kozaki et al. observed that oral squamous cell carcinoma (OSCC) is associated with miR-34b, miR-137, miR-193a, and miR-203, which were silenced by aberrant DNA methylation [9]. Glioblastoma multiforme (GBM) pathogenesis has been shown to be associated with the deregulation of miR-21 [10]. Also, the decreased expression of APP and BACE1 regulated by miR-9, miR-29a, and miR-29b-1 may increase the occurrence of Alzheimer’s disease [11]. Based on the research above, predicting miRNA-disease associations is clearly a valuable research field. It provides a better understanding of the pathogenesis of diseases and contributes to the prevention and diagnosis of illnesses.
In earlier studies, researchers were devoted to identifying miRNA-disease associations using conventional biological experiments, which are expensive, time-consuming, laborious, and prone to failure. Nevertheless, those studies have accumulated a large amount of biological data. Therefore, establishing an effective computational model with high accuracy to predict the associations between miRNAs and diseases is essential. Nowadays, machine learning, deep learning, and methods that combine them are widely applied in proposed computational models, mainly relying on the assumption that miRNAs with similar functions tend to be related to similar diseases [12]. For example, Chen et al. [13] built a random walk-based computational model named RWRMDA to reveal miRNA-disease associations. Xuan et al. [14] presented a network-based model named MIDP, which considered the prior information and the structure of different categories of network nodes, effectively diminished the negative effect of noisy data, and performed better than Chen’s RWRMDA [13]. Chen et al. improved their original work to create a new model, GRMDA [15], using graph regression synchronously on the miRNA, disease, and association graphs, combined with Partial Least-Squares to reduce noise. Jiang et al. [16] proposed ICFMDA to uncover unknown relationships between miRNAs and diseases by using similarity matrices to adjust the weights of the miRNA-disease bipartite network and implementing a collaborative filtering algorithm to recommend miRNAs and diseases to each other. You et al. [17] put forward PBMDA, which uses miRNA and disease similarities as subgraphs to construct a heterogeneous graph and applies a depth-first search algorithm to traverse the graph’s paths to find possible connections between miRNAs and diseases.
The above approaches are generally graph-based. They can effectively mine potential, deep-seated, unknown relationships from existing miRNA-disease associations, and the graph representation makes the connections between miRNAs and diseases easier to understand. However, graph-based methods are easily biased towards miRNAs or diseases that have many known associations. For diseases with few known associations, it is difficult to obtain accurate miRNA candidates because sparse links limit information propagation. Meanwhile, with the rise of machine learning and deep learning, more and more such algorithms are utilized for miRNA-disease prediction. Yao et al. [18] used random forest for feature selection, selected the top 100 features, and used random forest regression to score the connections between miRNAs and diseases. Zheng et al. [19] proposed a machine learning-based model named MLMDA, which adopted a deep autoencoder neural network to extract features and a random forest classifier to infer miRNA-disease interactions. Zhao et al. [20] utilized k-means clustering in data processing to balance the positive and negative samples and presented ABMDA, a boosting algorithm that iteratively combines weak classifiers (decision trees) to improve classification accuracy for potential miRNA-disease interactions. Wang et al. [21] first integrated miRNA sequence information with miRNA and disease similarity to extract features, and then applied the logistic model tree to classify the relationship between miRNAs and diseases, with an AUC value of 90.54%. Zhou et al. [22] constructed a novel model, GBDT-LR, using GBDT to extract latent features efficiently and logistic regression to score disease-miRNA interactions.
Zhang et al. [23] obtained two spliced matrices from the similarity matrices and the association matrix of diseases and miRNAs, and then adopted two variational autoencoders to predict unknown miRNA-disease interactions. Xuan et al. [24] proposed CNNMDA, which uses a CNN to train on the local and global features acquired from two embedding layers, learned from the associations between miRNAs and diseases, to reveal miRNA-disease relationships. Chen et al. [25] presented LRSSLMDA, a model that extends easily to higher-dimensional datasets and uses Laplacian regularization and the L1-norm to optimize the objective function and find possible connections between diseases and miRNAs. Fu et al. [26] implemented DeepMDA, which uses a stacked autoencoder to extract features and applies a 3-layer neural network to identify connections between miRNAs and diseases. Li et al. [27] presented MCMDA, which uses the SVT algorithm to complete the matrix and obtain an updated miRNA-disease association matrix for prediction. Zhao et al. [28] put forward the Spy and Super Cluster strategy to uncover interactions between diseases and miRNAs based on established miRNA-disease associations. Furthermore, Luo et al. [29] put forward KPLMS, which combines miRNAs and diseases into a whole space through the Kronecker product and uses regularized least squares to predict miRNA-disease interactions. Also, Gong et al. [30] presented a novel model that uses random forest to train on features obtained from the miRNA-disease association matrix and the disease description graph.
We can regard miRNA-disease association prediction as a miRNA-disease recommendation system. Complex latent factors are hidden in the miRNA-disease association matrix, and uncovering them can help predict miRNA-disease associations accurately. Hence, we present a novel approach to extract latent features from the original miRNA-disease association matrix. In this work, we develop a computational framework called SMALF that utilizes a stacked autoencoder and XGBoost to infer unknown miRNA-disease associations by integrating latent features and similarities. A stacked autoencoder is an unsupervised learning model that can extract latent features from the input information [31]. XGBoost is a representative boosting algorithm, which can effectively enhance classification by integrating many weak classifiers to generate a robust classifier [32]. In SMALF, we first use stacked autoencoders to extract miRNA latent features and disease latent features from the original miRNA-disease association matrix. Next, we concatenate the latent features and similarities to obtain feature vectors. Finally, we adopt the XGBoost model to complete the classification prediction. To evaluate the performance of SMALF, we perform cross-validation experiments. The AUC of SMALF reached 0.9503, which is higher than that of the other models compared. In addition, of the top 10 miRNAs predicted for hepatocellular carcinoma, colon cancer, and breast cancer, 10, 10, and 9, respectively, were verified in other databases. All in all, SMALF can effectively predict miRNA-disease associations.
Results and discussion
The performance of SMALF based on five-fold cross-validation
In this section, to validate the ability of SMALF to infer unknown miRNA-disease associations, we adopt five-fold cross-validation in our experiment. The dataset is randomly divided into five subsets; four subsets are used for training and one for testing. This process is repeated until every subset has served as the test set. In classification problems, the ROC curve is an important tool for evaluating model performance. The horizontal coordinate of the ROC curve is the false positive rate (FPR), and the vertical coordinate is the true positive rate (TPR). FPR and TPR are given by the following formulas:
\[ FPR = \frac{FP}{FP + TN}, \qquad TPR = \frac{TP}{TP + FN} \]
where TP and TN are the numbers of miRNA-disease association pairs and non-association pairs which are correctly identified, respectively; FP is the number of non-association pairs incorrectly identified as associations, and FN is the number of association pairs incorrectly identified as non-associations. This paper selects the AUC value as the main evaluation index. The AUC value is the area under the ROC curve, and its value is between 0 and 1. We can regard AUC as the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample. Generally, the better a model separates positive from negative samples, the higher its AUC.
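The rates and the ranking interpretation of AUC described above can be illustrated with a minimal sketch. The labels and scores below are invented toy values, and the helper names `roc_rates` and `auc_by_ranking` are hypothetical; they are not taken from the SMALF code.

```python
def roc_rates(labels, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)


def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = association pair, 0 = non-association pair.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
fpr, tpr = roc_rates(labels, scores, threshold=0.5)
auc = auc_by_ranking(labels, scores)
```

In this toy case one positive (score 0.4) is ranked below one negative (score 0.7), so 8 of the 9 positive-negative pairs are ordered correctly.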
Figure 1 shows the performance of SMALF based on five-fold cross-validation. As we can see from Fig. 1, the AUCs of SMALF on the five folds are 0.9534, 0.9529, 0.9496, 0.9437, and 0.9521, respectively. The average AUC value is 0.9503. The results indicate that SMALF has good performance in inferring unknown miRNA-disease associations.
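The fold construction and the averaging of the per-fold AUCs can be sketched as follows. This is an illustrative sketch only: the shuffling seed and the helper name `k_fold_indices` are assumptions, and the sample count 10848 is the sum of the 5430 positive and 5418 negative samples reported for the dataset.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 10848 = 5430 positive + 5418 negative samples.
splits = list(k_fold_indices(10848, k=5))

# Averaging the five per-fold AUCs reported for Fig. 1.
fold_aucs = [0.9534, 0.9529, 0.9496, 0.9437, 0.9521]
mean_auc = sum(fold_aucs) / len(fold_aucs)
```

Each sample appears in exactly one test fold, and the mean of the five reported AUCs rounds to the 0.9503 quoted in the text.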
Analysis of the dimensionality of latent features
In SMALF, we use stacked autoencoders to obtain latent features from the original miRNA-disease association matrix. If the dimensionality of the latent feature is too low, the model cannot fully learn the association between miRNA and disease. If the dimensionality of the latent feature is too high, the risk of overfitting increases. In this section, to study the impact of the dimensionality of the latent feature on the model, we set the dimensionality to 8, 16, 32, 64, and 128 for experimental comparison.
The experimental results are shown in Table 1. From Table 1, we can see that the model achieves the optimal AUC value when the dimensionality of latent feature is 64. Therefore, in this study, we set the dimensionality of latent feature to 64.
Analysis of the effects of feature vectors
How feature vectors are constructed to represent each miRNA-disease pair plays an essential role in inferring unknown miRNA-disease associations. In SMALF, we combine similarity data and latent features to represent each miRNA-disease pair. To verify whether our combined strategy helps infer unknown miRNA-disease associations, we designed three sets of experiments. The first set of experiments used only similarity data, directly integrating miRNA functional similarity and disease semantic similarity. The second set used only latent features, directly integrating the latent features of miRNA and disease. The third set used both similarity data and latent features, which is the same as SMALF.
The results are shown in Table 2 and Fig. 2. The AUCs of the models using only similarity data, only latent features, and both combined are 0.9161, 0.9467, and 0.9503, respectively. In summary, combining similarity data and latent features performs better than using either alone in inferring potential miRNA-disease associations.
Comparison with different classifiers
SMALF performs well on HMDD v2.0 using the XGBoost classifier. In this section, we select several typical classifiers (Adaboost, Random Forest, SVM) for experimental comparison. Adaboost obtains a robust classifier by integrating multiple weak classifiers and achieves good performance in many fields. Random forest integrates multiple decision trees, and its final output is determined by voting among these trees. SVM is a classic two-class classification model, which realizes classification by maximizing the margin between the two classes, and has achieved excellent results on many classification problems. In the Adaboost algorithm, we choose the classification decision tree as the weak classifier, with a maximum tree depth of 10 and a minimum samples split of 5. The remaining parameters keep their default values. In the RF algorithm, we set the maximum tree depth to 10 and max features to 100, with the remaining parameters at their defaults. In the SVM algorithm, we use the RBF kernel and set C to 50. In the XGBoost algorithm, we set the number of trees to 1000 and the learning rate to 0.1, with the remaining parameters at their defaults.
Table 3 and Fig. 3 show the performance of these classifiers. From Fig. 3, we can see that the AUCs of the Adaboost, Random Forest, SVM, and XGBoost classifiers are 0.9334, 0.9191, 0.9357, and 0.9503, respectively. The experimental results show that XGBoost achieves a higher AUC value than the other three classifiers. When miRNA functional similarity and disease semantic similarity are calculated, the similarity data contain missing values due to the lack of biological data. Compared with other classifiers, the XGBoost algorithm handles missing values more simply and effectively. In general, the XGBoost classifier is more suitable than the other classifiers for SMALF.
Comparisons with the state-of-the-art methods
To further assess the predictive ability of SMALF, we compare it with seven other computational methods (GBDT-LR [22], LMTRDA [21], ABMDA [20], RFMDA [33], ICFMDA [16], GRMDA [15], MCMDA [27]). GBDT-LR first integrates disease similarity and miRNA similarity to represent each miRNA-disease pair. Then, it applies GBDT to extract new features. Finally, the LR model is employed to predict miRNA-disease associations. LMTRDA integrates miRNA sequence similarity, miRNA functional similarity, and disease semantic similarity. The authors use the skip-gram algorithm to calculate miRNA sequence similarity. Finally, LMTRDA utilizes logistic model trees to predict miRNA-disease associations. ABMDA utilizes a boosting algorithm that integrates many decision trees to mine miRNA-disease associations. To calculate the similarity of miRNAs and diseases accurately, RFMDA fuses various information and uses random forest to predict miRNA-disease associations. ICFMDA implements a collaborative filtering algorithm to recommend miRNAs and diseases to each other. GRMDA uses graph regression synchronously on the miRNA, disease, and association graphs to infer miRNA-disease associations. MCMDA predicts miRNA-disease associations by using the SVT algorithm to obtain an updated miRNA-disease association matrix.
Table 4 and Fig. 4 show experimental results for SMALF and the other seven computational methods. SMALF achieves the highest AUC value, which is 2.29% higher than the second-best model (GBDT-LR). We attribute this result to the use of not only similarity data but also latent features.
Discussion
To investigate the performance of SMALF in inferring unknown miRNA-disease interactions in practical applications, we selected three common diseases (hepatocellular carcinoma, colon cancer, and breast cancer) for case studies. For each disease, we excluded all miRNAs already known to be associated with it. Then we utilized SMALF to score the remaining miRNAs and took the top 10 candidate miRNAs for this disease. Finally, we verified them by searching MNDR v3.0 [34] and miRCancer [35].
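The case-study procedure amounts to filtering out known miRNAs and ranking the rest by predicted score. A minimal sketch follows, assuming the scores are available as a dictionary; the miRNA names and score values are invented placeholders, and `top_candidates` is a hypothetical helper rather than part of the released code.

```python
def top_candidates(scores, known, k=10):
    """Rank miRNAs not yet associated with the disease by predicted score."""
    candidates = {m: s for m, s in scores.items() if m not in known}
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:k]


# Hypothetical scores for one disease (values invented for illustration).
scores = {"miR-A": 0.91, "miR-B": 0.15, "miR-C": 0.78, "miR-D": 0.66}
known = {"miR-A"}  # already associated with the disease, so excluded
top2 = top_candidates(scores, known, k=2)
```

Here `top2` holds the two highest-scoring miRNAs among those without a known association; in the case studies the same ranking would be cut at k = 10.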
The first disease we studied is hepatocellular carcinoma. Hepatocellular carcinoma is a type of primary liver cancer that has a high mortality rate [36]. It remains one of the most common and aggressive human malignancies worldwide [37, 38]. For hepatocellular carcinoma, we remove the 214 miRNAs (hsa-let-7a, hsa-mir-101, hsa-mir-103a, etc.) associated with it. The remaining 281 candidate miRNAs are sent to SMALF for prediction. The results are shown in Table 5. All of the top ten miRNA candidates for hepatocellular carcinoma are confirmed in MNDR v3.0 or miRCancer.
The second disease we studied is colon cancer. Colon cancer has a high incidence in people aged 40 to 50 [39]. It has no symptoms in its early stages, so the diagnosis is easy to miss. For colon cancer, we remove the 4 miRNAs (hsa-mir-106a, hsa-mir-145, hsa-mir-126, hsa-mir-17) associated with it. The remaining 491 candidate miRNAs are sent to SMALF for prediction. The results are shown in Table 6. All of the top ten miRNA candidates for colon cancer are verified in MNDR v3.0 or miRCancer.
The third disease we studied is breast cancer. The number of people who have breast cancer has been increasing since the 1970s, and it has become a common cancer affecting women’s physical and mental health [40]. We remove the 202 miRNAs (hsa-mir-1245a, hsa-mir-1245b, hsa-mir-1258, etc.) associated with breast cancer, leaving 293 candidate miRNAs. The results are shown in Table 7. Nine of the top ten miRNA candidates for breast cancer are confirmed in MNDR v3.0 or miRCancer. It is worth noting that hsa-mir-487b has not yet been validated by biological experiments; our prediction suggests that it may be associated with breast cancer.
Conclusion
Discovering unknown miRNA-disease associations is vital for understanding the pathogenesis of diseases at the molecular level. However, approaches based on biological experiments remain very limited in uncovering unknown miRNA-disease associations. Thus, it is increasingly important to use computational methods to predict them. We developed SMALF, a computational method that combines similarity data and latent features. SMALF first extracts miRNA and disease latent features from the original miRNA-disease association matrix using stacked autoencoders. Then, it integrates miRNA functional similarities, disease semantic similarities, miRNA latent features, and disease latent features to generate the feature vector representing each miRNA-disease pair. Finally, SMALF obtains the prediction result by employing the XGBoost algorithm. We performed five-fold cross-validation experiments, in which SMALF achieved an AUC value of 0.9503, higher than the seven other computational methods compared. Besides, the case studies also indicated that SMALF could infer unknown miRNA-disease interactions effectively. However, our work still has room for improvement. Due to the lack of negative samples, we select unknown miRNA-disease associations as negative samples. These may include false negatives, which may affect the experimental results. Therefore, finding reliable negative samples will help further improve the performance of the model.
Methods
Problem description
Researchers use many biological experiments to confirm miRNA-disease associations. Mining the potential connections between human diseases and biomolecules could effectively boost the prevention, diagnosis, and treatment of human diseases. Our goal is to mine the potential relationships between miRNAs and diseases efficiently and accurately. Most existing studies are based on the miRNA-disease associations provided by HMDD v2.0 [41]. To extract latent features from existing miRNA-disease associations, the known associations are represented by an adjacency matrix Y. The research task of this paper is to discover the unobserved potential associations (the 0 entries of matrix Y) in the known miRNA-disease association matrix.
Human miRNAdisease association
To express the relationship between miRNAs and diseases, the adjacency matrix Y of miRNA-disease associations is constructed. If miRNA m(i) and disease d(j) have a known association, the value of Y(i,j) is set to 1, otherwise to 0. Note that a 0 entry in this matrix does not indicate that there is no relation between the miRNA and the disease. It only indicates that a potential link has not yet been discovered. To obtain sound experimental results, it is necessary to select both positive and negative samples of miRNA-disease associations. During the experiment, we used the same miRNA-disease associations as Zhou et al. [22], comprising 5430 positive samples and 5418 negative samples. The statistical information of the dataset is shown in Table 8.
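The construction of Y can be sketched in a few lines. The identifiers below are toy placeholders, and the helper names are hypothetical; the real matrix has 495 miRNA rows and 383 disease columns.

```python
def build_association_matrix(mirnas, diseases, known_pairs):
    """Return Y as a list of rows: Y[i][j] = 1 if (mirnas[i], diseases[j]) is known."""
    row = {m: i for i, m in enumerate(mirnas)}
    col = {d: j for j, d in enumerate(diseases)}
    Y = [[0] * len(diseases) for _ in mirnas]
    for m, d in known_pairs:
        Y[row[m]][col[d]] = 1
    return Y


def unobserved_pairs(Y):
    """Positions holding 0: candidate associations, not confirmed negatives."""
    return [(i, j) for i, r in enumerate(Y) for j, v in enumerate(r) if v == 0]


# Toy identifiers, invented for illustration only.
mirnas = ["m1", "m2", "m3"]
diseases = ["d1", "d2"]
Y = build_association_matrix(mirnas, diseases, [("m1", "d1"), ("m3", "d2")])
candidates = unobserved_pairs(Y)
```

The entries returned by `unobserved_pairs` are the positions the prediction task scores; negative training samples would be drawn from this same pool, which is why some of them may be false negatives.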
MiRNA functional similarity
According to previous research, functionally similar miRNAs are more likely to be associated with phenotypically similar diseases. The miRNA functional similarity score can be computed as described in [42]. We construct a matrix FS in which FS(m(i), m(j)) records the functional similarity between miRNAs m(i) and m(j).
Disease semantic similarity
Inspired by previous studies, we use the MeSH database (http://www.ncbi.nlm.nih.gov/), which is widely used to obtain disease-related data, to construct directed acyclic graphs (DAGs). For a given disease D, DAG(D) = (D, T(D), E(D)), where T(D) is the node set composed of D and all of its ancestor nodes, and E(D) is the set of edges directly connecting parent nodes to child nodes. Following Xuan et al. [43], the semantic contribution of a disease d to D can be defined as:
\[ D_D(d) = \begin{cases} 1, & d = D \\ \max \{ \triangle \cdot D_D(d') \mid d' \in \text{children of } d \}, & d \ne D \end{cases} \]
where \(\triangle\) is the semantic contribution attenuation factor. Xuan et al. set \(\triangle\) to 0.5. The contribution of disease D to itself is 1, and the contribution of other diseases to D decreases with distance. From the above formula, the semantic value of D is:
\[ DV(D) = \sum_{d \in T(D)} D_D(d) \]
If two diseases share more of their DAGs, they obtain a higher semantic similarity value. Therefore, the semantic similarity score SS between two diseases \(d_i\) and \(d_j\) is:
\[ SS(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left( D_{d_i}(t) + D_{d_j}(t) \right)}{DV(d_i) + DV(d_j)} \]
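A small worked sketch may help make the DAG-based similarity concrete. It assumes the commonly used formulation in which an ancestor's contribution is the attenuation factor (0.5) raised to its shortest distance from the disease, and the similarity is the summed contributions of shared ancestors divided by the sum of both semantic values. The term names and the `parents` mapping are invented; real data would come from MeSH.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of `disease` and each of its ancestors to `disease`.

    `parents` maps each term to the list of its direct parent terms.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            value = delta * contrib[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(a, b, parents, delta=0.5):
    """Shared-ancestor contributions divided by the two semantic values."""
    ca = semantic_contributions(a, parents, delta)
    cb = semantic_contributions(b, parents, delta)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))


# Toy hierarchy (hypothetical terms): both cancers descend from "neoplasm".
parents = {
    "liver cancer": ["digestive neoplasm"],
    "colon cancer": ["digestive neoplasm"],
    "digestive neoplasm": ["neoplasm"],
}
sim = semantic_similarity("liver cancer", "colon cancer", parents)
```

In this toy hierarchy each disease has semantic value 1 + 0.5 + 0.25 = 1.75, the shared ancestors contribute 1.5 in total, and the similarity is 1.5 / 3.5 = 3/7. A disease compared with itself scores 1.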
Stacked autoencoders for latent features of miRNAs and diseases
In the adjacency matrix Y constructed from human miRNA-disease associations, the 5430 known miRNA-disease associations account for only 2.8% of all disease-miRNA pairs. To better represent these sparse raw data, the stacked autoencoder extracts the latent relationships contained in the high-dimensional, sparse original feature vectors of miRNAs and diseases.
Autoencoder (AE) is an unsupervised learning method. Its purpose is to learn, from unlabeled input data, a compressed, dimensionality-reduced feature representation through training. The autoencoder is an artificial neural network composed of two subnetworks: an encoder and a decoder [44]. In this article, a stacked autoencoder is used to extract latent miRNA-disease associations. The stacked autoencoder is a cascade of multiple autoencoders; that is, it contains multiple hidden layers that extract information from the original features layer by layer. The stacked autoencoder trains its AE layers sequentially: after the first AE is trained, the output of its encoder is used as the input of the second AE, and so on, until a more representative, low-dimensional latent feature is obtained.
SMALF model
In this section, we will detail the SMALF model construction process, and show the overall process in Fig. 5.
Step 1: Matrix decomposition
Taking the original matrix Y as input, each row of Y is the original feature of a miRNA, and each column is the original feature of a disease. In the original feature vectors of miRNA m(i) and disease d(j), an entry of 1 indicates a known association, and an entry of 0 indicates an unobserved association. The miRNA-disease association matrix Y is decomposed into M and \(D^T\),
where \(M,D^T \in Y^{m*n}\) are real matrices. In our research, \(M_i\) and \(D_j^T\) are respectively regarded as the original feature vectors of m(i) and d(j).
Step 2: Extract latent features by stacked autoencoders
In our autoencoder, the encoder H1 accepts the original feature m of a miRNA in M as input, and the encoder H2 accepts the original feature d of a disease in \(D^T\). The ith training sample of H1 is defined as \(x_i=m\), and the jth training sample of H2 as \(x_j=d\). Each encoder maps its input to a low-dimensional code z. The formula is as follows:
\[ h_i^{(l)} = f_e\left( W^l h_i^{(l-1)} + b^l \right), \qquad z_i = h_i^{(L)} \]
where \(l \in \{1,\ldots,L\}\) and we set L to 2, which means that two hidden layers are used; \(h_i^{(l)}\) is the lth hidden layer, \(h_i^{(0)}\) represents the input \(x_i\), \(W^l\) is the weight matrix and \(b^l\) is the bias of the lth layer, and \(f_e (.)\) is the activation function.
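The layer-by-layer encoding can be sketched as a plain forward pass. This is a minimal illustration, not the trained model: the sigmoid activation, the toy layer sizes (4 to 3 to 2 instead of the real input size down to 64), and the weight values are all assumptions made for the example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(x, W, b, activation):
    """One layer: activation(W x + b), with W given as a list of rows."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bias)
            for row, bias in zip(W, b)]


def encode(x, layers, activation=sigmoid):
    """Apply h^(l) = f_e(W^l h^(l-1) + b^l) for l = 1..L and return z = h^(L)."""
    h = x
    for W, b in layers:
        h = dense(h, W, b, activation)
    return h


# Toy sizes: a 4-dimensional association row compressed to 3, then to 2 (L = 2).
x = [1, 0, 0, 1]
layers = [
    ([[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.1, 0.2], [-0.3, 0.1, 0.2, 0.1]],
     [0.0, 0.0, 0.0]),
    ([[0.2, -0.1, 0.4], [0.3, 0.3, -0.2]], [0.1, -0.1]),
]
z = encode(x, layers)
```

The returned `z` plays the role of the latent feature; in practice the weights would be learned by minimizing the reconstruction loss rather than fixed by hand.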
The purpose of the decoder is to reconstruct the input \(x_i\) as much as possible from the latent features \(z_i\) output by the encoder. Its definition formula is as follows:
where \(f_d (.)\) and \(g_d (.)\) represent the activation function and the hyperbolic tangent function, respectively.
Finally, the loss function is the sum of the reconstruction errors of all samples, and its expression is as follows:
\[ Loss = \sum_{i} \left( \lVert x_i - \hat{x}_i \rVert^2 + \lambda \lVert J_h(x_i) \rVert_F^2 \right) \]
where \(\hat{x}_i\) denotes the decoder's reconstruction of \(x_i\), the first term is the squared reconstruction error, the second term is the norm of the Jacobian \(J_h (x_i)\), and \(\lambda\) is a hyperparameter. The stacked autoencoder iteratively updates the parameters of each node of the network through backpropagation to minimize the loss. This step is also called fine-tuning. After continued fine-tuning, the loss is minimized and the autoencoder reaches its optimal solution. At this point, the latent feature z gives the low-dimensional, dense feature vectors \(M_i\) and \(D_j^T\), compressed from the sparse miRNA and disease features.
Step 3: Combining latent features and similarity features
So far, we have obtained the 64-dimensional miRNA and disease latent feature vectors \(M_i\) and \(D_j^T\) extracted by the stacked autoencoders. They are concatenated with the 495-dimensional miRNA functional similarity feature \(FS_i\) and the 383-dimensional disease semantic similarity feature \(SS_j\), respectively, yielding a 559-dimensional miRNA feature and a 447-dimensional disease feature.
We then concatenate the two vectors to get a new vector \(Vec_{new}\) for model prediction.
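The dimension bookkeeping of this step can be checked with a short sketch. The ordering of the parts inside each vector is an assumption (the text only states which features are joined), and the zero-filled lists are placeholders for real features.

```python
def build_pair_vector(mirna_latent, fs_row, disease_latent, ss_row):
    """Concatenate latent and similarity features for one miRNA-disease pair."""
    mirna_vec = list(mirna_latent) + list(fs_row)        # 64 + 495 = 559
    disease_vec = list(disease_latent) + list(ss_row)    # 64 + 383 = 447
    return mirna_vec + disease_vec                       # 559 + 447 = 1006


# Placeholder zeros stand in for real features; only the sizes matter here.
vec = build_pair_vector([0.0] * 64, [0.0] * 495, [0.0] * 64, [0.0] * 383)
```

Under these sizes each miRNA-disease pair is represented by a 1006-dimensional vector, which is what the classifier in the next step receives.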
Step 4: Predict new feature vectors by XGBoost
XGBoost combines the weak classifiers it contains into an accurate classifier through gradient boosting iterations [45]. In this paper, we use the XGBoost model to classify the concatenated miRNA-disease features: it takes \(Vec_{new}\) as input and obtains its best gradient-boosted regression trees through training. The XGBoost model contains K decision trees, \(f_k\) represents the kth decision tree, and the feature vector \(Vec_{new\_i}\) is regarded as the input \(x_i\). The prediction result is given by the following formula:
\[ \hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \]
where \(\hat{y}_i^{(t)}\) is the prediction of the first t trees, and the final prediction is \(\hat{y}_i^{(K)}\). To minimize the objective function, the XGBoost algorithm adds a new function to the model in each iteration, and uses the function \(\Omega (f_t )\) to control the complexity of the tth tree:
\[ \Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \]
where T is the number of leaf nodes, \(w_j\) is the score of each leaf node, and \(\gamma\) and \(\lambda\) are the hyperparameters that control the weight of the complexity term; overfitting can be prevented by adjusting these two hyperparameters. Furthermore, XGBoost uses a second-order Taylor expansion to optimize the objective function. The objective function of the tth iteration is as follows:
\[ Obj^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left( y_i, \hat{y}_i^{(t-1)} \right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \]
where l(.) is the mean square error loss evaluated at iteration t-1, and \(g_i\) and \(h_i\) are its first-order and second-order derivatives with respect to \(\hat{y}_i^{(t-1)}\). Because \(f_t(x_i)\) is finally assigned to a leaf of the tree, its value can be represented by the leaf weight \(w_j\). Dropping the constant term gives:
\[ Obj^{(t)} \simeq \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T \]
where \(I_j\) represents the sample set contained in leaf j. Iterative training with the above formula fits the new miRNA-disease features and yields the optimal prediction model. Finally, we traverse all the data in the test set, input each fused feature vector into the trained SMALF model, and obtain a prediction score for each potential miRNA-disease pair.
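As a numeric sketch of the leaf-wise view, the snippet below evaluates the complexity term gamma * T + 0.5 * lambda * sum of squared leaf weights, and the leaf weight that minimizes the quadratic per-leaf objective, which is the standard XGBoost result w* = -G / (H + lambda) with G and H the summed first and second derivatives in the leaf. The squared-error loss is scaled by 0.5 here purely for convenience, and the labels, predictions, and hyperparameter values are invented.

```python
def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def optimal_leaf_weight(grads, hessians, lam):
    """Minimiser of G*w + 0.5*(H + lambda)*w^2 for one leaf: w* = -G / (H + lambda)."""
    return -sum(grads) / (sum(hessians) + lam)


def squared_error_grad_hess(y_true, y_pred):
    """For l = 0.5 * (y_pred - y_true)^2: g = y_pred - y_true, h = 1."""
    return y_pred - y_true, 1.0


# Three samples that fall in the same leaf; the previous prediction is 0.5 for each.
labels = [1.0, 1.0, 0.0]
pairs = [squared_error_grad_hess(y, 0.5) for y in labels]
grads = [g for g, _ in pairs]
hess = [h for _, h in pairs]
w_star = optimal_leaf_weight(grads, hess, lam=1.0)
omega = tree_complexity([w_star], gamma=0.1, lam=1.0)
```

With two positives and one negative in the leaf, G = -0.5 and H = 3, so the leaf weight is 0.5 / 4 = 0.125: the tree nudges the prediction upward, and a larger lambda would shrink that step.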
Availability of data and materials
The data and code used in the current study are available at: https://github.com/dayunliu/SMALF.
Abbreviations
XGBoost: eXtreme Gradient Boosting
ROC: Receiver operating characteristic
TPR: True positive rate
FPR: False positive rate
AUC: Area under ROC curve
Adaboost: Adaptive boosting
SVM: Support vector machine
RF: Random Forest
GBDT: Gradient Boosting Decision Tree
DAG: Directed acyclic graph
References
Ambros V. microRNAs: tiny regulators with great potential. Cell. 2001;107(7):823–6.
Lee RC, Feinbaum RL, Ambros V. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 1993;75(5):843–54.
Ambros V. The functions of animal microRNAs. Nature. 2004;431(7006):350–5.
Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004;116(2):281–97.
Erson A, Petty E. MicroRNAs in development and disease. Clin Genet. 2008;74(4):296–306.
Lynam-Lennon N, Maher SG, Reynolds JV. The roles of microRNA in cancer and apoptosis. Biol Rev. 2009;84(1):55–71.
Calin GA, Dumitru CD, Shimizu M, Bichi R, Zupo S, Noch E, Aldler H, Rattan S, Keating M, Rai K, et al. Frequent deletions and down-regulation of microRNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proc Natl Acad Sci. 2002;99(24):15524–9.
Iorio MV, Ferracin M, Liu CG, Veronese A, Spizzo R, Sabbioni S, Magri E, Pedriali M, Fabbri M, Campiglio M, et al. MicroRNA gene expression deregulation in human breast cancer. Can Res. 2005;65(16):7065–70.
Kozaki KI, Imoto I, Mogi S, Omura K, Inazawa J. Exploration of tumor-suppressive microRNAs silenced by DNA hypermethylation in oral cancer. Can Res. 2008;68(7):2094–105.
Masoudi MS, Mehrabian E, Mirzaei H. MiR-21: a key player in glioblastoma pathogenesis. J Cell Biochem. 2018;119(2):1285–90.
Hébert SS, Horré K, Nicolaï L, Papadopoulou AS, Mandemakers W, Silahtaroglu AN, Kauppinen S, Delacourte A, De Strooper B. Loss of microRNA cluster miR-29a/b-1 in sporadic Alzheimer's disease correlates with increased BACE1/β-secretase expression. Proc Natl Acad Sci. 2008;105(17):6415–20.
Chen X, Xie D, Zhao Q, You ZH. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2019;20(2):515–39.
Chen X, Liu MX, Yan GY. RWRMDA: predicting novel human microRNA-disease associations. Mol BioSyst. 2012;8(10):2792–8.
Xuan P, Han K, Guo Y, Li J, Li X, Zhong Y, Zhang Z, Ding J. Prediction of potential disease-associated microRNAs based on random walk. Bioinformatics. 2015;31(11):1805–15.
Chen X, Yang JR, Guan NN, Li JQ. GRMDA: graph regression for miRNA-disease association prediction. Front Physiol. 2018;9:92.
Jiang Y, Liu B, Yu L, Yan C, Bian H. Predict miRNA-disease association with collaborative filtering. Neuroinformatics. 2018;16(3–4):363–72.
You ZH, Huang ZA, Zhu Z, Yan GY, Li ZW, Wen Z, Chen X. PBMDA: a novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput Biol. 2017;13(3):1005455.
Yao D, Zhan X, Kwoh CK. An improved random forest-based computational model for predicting novel miRNA-disease associations. BMC Bioinform. 2019;20(1):624.
Zheng K, You ZH, Wang L, Zhou Y, Li LP, Li ZW. MLMDA: a machine learning approach to predict and validate microRNA-disease associations by integrating of heterogenous information sources. J Transl Med. 2019;17(1):260.
Zhao Y, Chen X, Yin J. Adaptive boosting-based computational model for predicting potential miRNA-disease associations. Bioinformatics. 2019;35(22):4730–8.
Wang L, You ZH, Chen X, Li YM, Dong YN, Li LP, Zheng K. LMTRDA: using logistic model tree to predict miRNA-disease associations by fusing multi-source information of sequences and similarities. PLoS Comput Biol. 2019;15(3):1006865.
Zhou S, Wang S, Wu Q, Azim R, Li W. Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression. Comput Biol Chem. 2020;85:107200.
Zhang L, Chen X, Yin J. Prediction of potential miRNA-disease associations through a novel unsupervised deep learning framework with variational autoencoder. Cells. 2019;8(9):1040.
Xuan P, Sun H, Wang X, Zhang T, Pan S. Inferring the disease-associated miRNAs based on network representation learning and convolutional neural networks. Int J Mol Sci. 2019;20(15):3648.
Chen X, Huang L. LRSSLMDA: Laplacian regularized sparse subspace learning for miRNA-disease association prediction. PLoS Comput Biol. 2017;13(12):1005912.
Fu L, Peng Q. A deep ensemble model to predict miRNA-disease association. Sci Rep. 2017;7(1):1–13.
Li JQ, Rong ZH, Chen X, Yan GY, You ZH. MCMDA: matrix completion for miRNA-disease association prediction. Oncotarget. 2017;8(13):21187.
Zhao Q, Xie D, Liu H, Wang F, Yan GY, Chen X. SSCMDA: spy and super cluster strategy for miRNA-disease association prediction. Oncotarget. 2018;9(2):1826.
Luo J, Xiao Q, Liang C, Ding P. Predicting microRNA-disease associations using Kronecker regularized least squares based on heterogeneous omics data. IEEE Access. 2017;5:2503–13.
Gong Y, Niu Y, Zhang W, Li X. A network embedding-based multiple information integration method for the miRNA-disease association prediction. BMC Bioinform. 2019;20(1):468.
Shin HC, Orton MR, Collins DJ, Doran SJ, Leach MO. Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data. IEEE Trans Pattern Anal Mach Intell. 2012;35(8):1930–43.
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016; 785–94.
Xing C, ChunChun W, Jun Y, ZhuHong Y. Novel human miRNA-disease association inference based on random forest. Mol Ther Nucleic Acids. 2018.
Ning L, Cui T, Zheng B, Wang N, Luo J, Yang B, Du M, Cheng J, Dou Y, Wang D. MNDR v3.0: mammal ncRNA–disease repository with increased coverage and annotation. Nucleic Acids Res. 2020.
Xie B, Ding Q, Han H, Wu D. miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics. 2013.
Ikura Y. Transitions of histopathologic criteria for diagnosis of nonalcoholic fatty liver disease during the last three decades. World J Hepatol. 2014.
Xin WW, Hussain SP, Huo TI, Wu CG, Harris CC. Molecular pathogenesis of human hepatocellular carcinoma. Toxicology. 2002;181(1–3):43–7.
Parkin DM, Bray MF, Ferlay MJ, Pisani P. Global cancer statistics, 2002. CA Cancer J Clin. 2005;55(2):74.
Favoriti P, Carbone G, Greco M, Pirozzi F, Pirozzi REM, Corcione F. Worldwide burden of colorectal cancer: a review. Updat Surg. 2016;68(1):7–11.
Jemal A, Bray F, Center MM, Ferlay J, Forman D. Global cancer statistics. CA Cancer J Clin. 2011;6(2):169–90.
Yang L, Qiu C, Jian T, Geng B, Yang J, Jiang T, Cui Q. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2014;(D1):1070.
Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50.
Xuan P, Han K, Guo M, Guo Y, Huang Y. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS ONE. 2013;8(8):70204.
Ji C, Gao Z, Ma X, Wu Q, Zheng C. AEMDA: inferring miRNA-disease associations based on deep autoencoder. Bioinformatics. 2020.
Zhang Y, Chen J, Wang Y, Wang D, Cong W, Lai BS, Zhao Y, Sendiña-Nadal I. Multilayer network analysis of miRNA and protein expression profiles in breast cancer patients. PLoS ONE. 2019;14(4).
Acknowledgements
We would like to thank the Experimental Center of School of Computer Science and Engineering of Central South University, for providing computing resources.
Funding
This work was supported by the National Natural Science Foundation of China under grant No. 61972422. Publication costs are funded by the National Natural Science Foundation of China under grant No. 61972422.
Author information
Contributions
LD and DYL conceived the prediction method. DYL, YBH and WJN wrote the paper. LD and DYL developed the computer programs. YBH, WJN and JXZ analyzed the results and revised the paper. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Liu, D., Huang, Y., Nie, W. et al. SMALF: miRNA-disease associations prediction based on stacked autoencoder and XGBoost. BMC Bioinformatics 22, 219 (2021). https://doi.org/10.1186/s12859-021-04135-2
DOI: https://doi.org/10.1186/s12859-021-04135-2
Keywords
 miRNA-disease associations
 Stacked autoencoder
 Latent feature
 XGBoost