SMALF: miRNA-disease associations prediction based on stacked autoencoder and XGBoost

Background Identifying miRNA-disease associations helps us understand disease mechanisms at the molecular level. However, identification based on biological experiments is usually blind, time-consuming, and small-scale. Hence, developing computational methods to predict unknown miRNA-disease associations is becoming increasingly important. Results In this work, we develop a computational framework called SMALF to predict unknown miRNA-disease associations. SMALF first utilizes a stacked autoencoder to learn miRNA latent features and disease latent features from the original miRNA-disease association matrix. Then, SMALF obtains the feature vector representing each miRNA-disease pair by integrating miRNA functional similarity, miRNA latent features, disease semantic similarity, and disease latent features. Finally, XGBoost is utilized to predict unknown miRNA-disease associations. We carry out cross-validation experiments. Compared with other state-of-the-art methods, SMALF achieved the best AUC value. We also conduct three case studies on hepatocellular carcinoma, colon cancer, and breast cancer. The results show that 10, 10, and 9 of the top ten predicted miRNAs, respectively, are verified in MNDR v3.0 or miRCancer. Conclusion The comprehensive experimental results demonstrate that SMALF is effective in identifying unknown miRNA-disease associations.

With the rise of machine learning and deep learning, more and more of these algorithms are being applied to miRNA-disease prediction. Yao et al. [18] used random forest for feature selection, kept the top 100 features, and then used random forest regression to score the association between miRNA and disease. Zheng et al. [19] proposed a machine learning-based model named MLMDA, which adopted a deep autoencoder neural network to extract features and a random forest classifier to infer miRNA-disease interactions. Zhao et al. [20] used k-means clustering during data processing to balance the positive and negative samples and presented ABMDA, a boosting algorithm that iterates over weak classifiers (decision trees) to improve classification accuracy when predicting potential miRNA-disease interactions. Wang et al. [21] first integrated miRNA sequence information with miRNA and disease similarity to extract features, and then applied a logistic model tree to classify the relationship between miRNA and disease, reaching an AUC of 90.54%. Zhou et al. [22] constructed a model named GBDT-LR, which uses GBDT to extract latent features efficiently and logistic regression to score disease-miRNA interactions. Zhang et al. [23] obtained two spliced matrices from the similarity matrices and the association matrix of diseases and miRNAs, and then adopted two variational autoencoders to predict unknown miRNA-disease interactions. Xuan et al. [24] proposed CNNMDA, a CNN-based model that learns local and global features from two embedding layers derived from the miRNA and disease associations, respectively, to reveal the relationship between miRNA and disease. Chen et al. [25] presented LRSSLMDA, a model that extends easily to higher-dimensional datasets and uses Laplacian regularization and the L1-norm in its objective function to find possible associations between disease and miRNA. Fu et al. [26] implemented DeepMDA, which uses a stacked autoencoder to extract features and applies a 3-layer neural network to identify associations between miRNA and disease. Li et al. [27] presented MCMDA, which uses the SVT algorithm to complete the matrix and obtain an updated miRNA-disease association matrix for prediction. Zhao et al. [28] put forward the Spy and Super Cluster strategy to uncover interactions between disease and miRNA based on established miRNA-disease associations. Furthermore, Luo et al. [29] put forward KPLMS, which combines the miRNA and disease spaces into a single space through the Kronecker product and uses regularized least squares to predict miRNA-disease interactions. Also, Gong et al. [30] presented a model for miRNA-disease association prediction that uses random forest to train on features obtained from the miRNA-disease association matrix and the disease description graph.
We can regard miRNA-disease association prediction as a miRNA-disease recommendation system. Complex latent factors are hidden in the miRNA-disease association matrix, and uncovering them can help predict miRNA-disease associations accurately. Hence, we present a novel approach to extract latent features from the original miRNA-disease association matrix. In this work, we develop a computational framework called SMALF that utilizes a stacked autoencoder and XGBoost to infer unknown miRNA-disease associations by integrating latent features and similarities. A stacked autoencoder is an unsupervised learning model that can extract latent features from the input information [31]. XGBoost is a representative boosting algorithm, which can effectively improve classification by integrating many weak classifiers into a robust classifier [32]. In SMALF, we first use stacked autoencoders to extract miRNA latent features and disease latent features from the original miRNA-disease association matrix. Next, we concatenate the latent features and similarities to obtain feature vectors. Finally, we adopt the XGBoost model to complete the classification prediction. To evaluate the performance of SMALF, we perform cross-validation experiments. The AUC of SMALF reached 0.9503, which is higher than that of the other models compared. In addition, 10, 10, and 9 of the top 10 miRNAs predicted for hepatocellular carcinoma, colon cancer, and breast cancer, respectively, were verified in other databases. All in all, SMALF can effectively predict miRNA-disease associations.

Results and discussion
The performance of SMALF based on five-fold cross-validation
In this section, to validate the ability of SMALF to infer unknown miRNA-disease associations, we adopt five-fold cross-validation in our experiment. The dataset is randomly divided into five subsets; four subsets are selected for training and one subset for testing. This process is repeated until every subset has been used as the test set. In classification problems, the ROC curve is an important tool for evaluating model performance. The horizontal coordinate of the ROC curve is the false positive rate (FPR), and the vertical coordinate is the true positive rate (TPR). FPR and TPR are given by the following formulas:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

where TP and TN are the numbers of miRNA-disease association pairs and non-association pairs that are correctly identified, respectively; FP is the number of non-association pairs incorrectly identified as associations, and FN is the number of association pairs incorrectly identified as non-associations. This paper selects the AUC value as the main evaluation index. The AUC value is the area under the ROC curve, and its value is between 0 and 1. We can regard AUC as the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample. Generally, a model with good performance has a high AUC. Figure 1 shows the performance of SMALF based on five-fold cross-validation. As we can see from Fig. 1, the AUCs of SMALF on the five folds are 0.9534, 0.9529, 0.9496, 0.9437, and 0.9521, respectively. The average AUC value is 0.9503. The results indicate that SMALF performs well in inferring unknown miRNA-disease associations.
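The evaluation protocol above can be illustrated with a minimal, dependency-free sketch. It computes one ROC point (TPR, FPR) at a threshold, the AUC through its ranking interpretation, and a random five-way split. The function names and the toy labels and scores are illustrative assumptions, not the authors' code; only the five fold AUCs are taken from the text.

```python
import random

def tpr_fpr(labels, scores, threshold):
    """One ROC point: predict 'association' when score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct ranking."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def five_fold_indices(n_samples, seed=0):
    """Shuffle sample indices and deal them into five disjoint test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[k::5] for k in range(5)]

# Toy example (made-up labels and scores).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
point = tpr_fpr(labels, scores, threshold=0.45)

# Fold AUCs reported in the text and their mean.
fold_aucs = [0.9534, 0.9529, 0.9496, 0.9437, 0.9521]
mean_auc = sum(fold_aucs) / len(fold_aucs)
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; a sort-based implementation would be preferable on the full dataset.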

Analysis of the dimensionality of the latent feature
In SMALF, we use stacked autoencoders to obtain latent features from the original miRNA-disease association matrix. If the dimensionality of the latent feature is too small, the model cannot fully learn the association between miRNA and disease. If the dimensionality is too large, the risk of overfitting increases. In this section, to study the impact of the latent feature dimensionality on the model, we set it to 8, 16, 32, 64, and 128 for experimental comparison.
The experimental results are shown in Table 1. From Table 1, we can see that the model achieves the optimal AUC value when the dimensionality of latent feature is 64. Therefore, in this study, we set the dimensionality of latent feature to 64.

Analysis of the effects of feature vectors
How the feature vector representing each miRNA-disease pair is constructed plays an essential role in inferring unknown miRNA-disease associations. In SMALF, we combine similarity data and latent features to represent each miRNA-disease pair. To verify whether this combined strategy helps infer unknown miRNA-disease associations, we designed three sets of experiments. The first set used only similarity data, directly integrating miRNA functional similarity and disease semantic similarity. The second set used only latent features, directly integrating the latent features of miRNA and disease. The third set used both similarity data and latent features, which is the same as SMALF. The results are shown in Table 2 and Fig. 2. The AUCs of the models using only similarity data, only latent features, and both are 0.9161, 0.9467, and 0.9503, respectively. In summary, combining similarity data and latent features performs better than using either alone in inferring potential miRNA-disease associations.

Comparison with different classifiers
SMALF performs well on HMDD v2.0 with the XGBoost classifier. In this section, we select several typical classifiers (AdaBoost, random forest, SVM) for experimental comparison. AdaBoost obtains a robust classifier by integrating multiple weak classifiers and achieves good performance in many fields. Random forest integrates many decision trees, and its final output is determined by voting among these trees. SVM is a classic binary classification model, which classifies by maximizing the margin between the two classes, and it has achieved excellent results on many classification problems. In the AdaBoost algorithm, we choose the decision tree classifier as the weak classifier, with a maximum tree depth of 10 and a minimum samples split of 5. The remaining parameters take their default values. In the random forest algorithm, we set the maximum tree depth to 10 and max features to 100. The remaining parameters take their default values. In the SVM algorithm, we use the RBF kernel and set C to 50. In the XGBoost algorithm, we set the number of trees to 1000 and the learning rate to 0.1. The remaining parameters take their default values. Table 3 and Fig. 3 show the performance of these classifiers. From Fig. 3, we can see that the AUCs of the AdaBoost, random forest, SVM, and XGBoost classifiers are 0.9334, 0.9191, 0.9357, and 0.9503, respectively. The experimental results show that XGBoost achieves a higher AUC value than the other three classifiers. When miRNA functional similarity and disease semantic similarity are calculated, missing values arise in the similarity data because of the lack of biological data. Compared with the other classifiers, the XGBoost algorithm handles missing values more simply and effectively. In general, the XGBoost classifier is more suitable than the other classifiers for SMALF.
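For reproducibility, the settings and AUCs reported above can be collected in one place. The sketch below is only a transcription of the values in the text; the keyword names follow scikit-learn and xgboost conventions, which is an assumption because the text does not name the libraries used, and `base_tree` is a hypothetical key for the settings of the AdaBoost weak learner.

```python
# Hyperparameters reported in the text, keyed by classifier.
# Keyword names mirror scikit-learn / xgboost conventions (assumption).
CLASSIFIER_SETTINGS = {
    # "base_tree" is a hypothetical key holding the weak learner's settings.
    "AdaBoost": {"base_tree": {"max_depth": 10, "min_samples_split": 5}},
    "RandomForest": {"max_depth": 10, "max_features": 100},
    "SVM": {"kernel": "rbf", "C": 50},
    "XGBoost": {"n_estimators": 1000, "learning_rate": 0.1},
}

# Five-fold cross-validation AUCs reported in the text.
REPORTED_AUC = {
    "AdaBoost": 0.9334,
    "RandomForest": 0.9191,
    "SVM": 0.9357,
    "XGBoost": 0.9503,
}

# Best classifier and its margin over the runner-up.
best = max(REPORTED_AUC, key=REPORTED_AUC.get)
runner_up_auc = sorted(REPORTED_AUC.values())[-2]
margin = REPORTED_AUC[best] - runner_up_auc
```

Read this way, XGBoost leads the runner-up (SVM) by about 0.015 AUC.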

Comparisons with the state-of-the-art methods
To further assess the predictive ability of SMALF, we compare it with seven other computational methods (GBDT-LR [22], LMTRDA [21], ABMDA [20], RFMDA [33], ICFMDA [16], GRMDA [15], MCMDA [27]). GBDT-LR first integrates disease similarity and miRNA similarity to represent each miRNA-disease pair. Then, it applies GBDT to extract new features. Finally, the LR model is employed to predict miRNA-disease associations. LMTRDA integrates miRNA sequence similarity, miRNA functional similarity, and disease semantic similarity. Its authors use the skip-gram algorithm to calculate miRNA sequence similarity. Finally, LMTRDA utilizes logistic model trees to predict miRNA-disease associations. ABMDA utilizes a boosting algorithm that integrates many decision trees to mine miRNA-disease associations.
To calculate the similarity of miRNAs and diseases accurately, RFMDA fuses various sources of information and uses random forest to predict miRNA-disease associations. ICFMDA implements a collaborative filtering algorithm to recommend miRNAs and diseases to each other. GRMDA applies graph regression simultaneously to the miRNA, disease, and association graphs to infer miRNA-disease associations. MCMDA predicts miRNA-disease associations by using the SVT algorithm to obtain an updated miRNA-disease association matrix. Table 4 and Fig. 4 show the experimental results for SMALF and the other seven computational methods. SMALF achieves the highest AUC value, which is 2.29% higher than that of the second-best model (GBDT-LR). SMALF achieves these results because it uses not only similarity data but also latent features.

Discussion
To investigate how well SMALF infers unknown miRNA-disease interactions in practical applications, we selected three common diseases (hepatocellular carcinoma, colon cancer, and breast cancer) for case studies. For each disease, we removed all miRNAs known to be associated with it. Then we used SMALF to score the remaining miRNAs and took the top 10 candidate miRNAs for the disease. Finally, we verified them by searching MNDR v3.0 [34] and miRCancer [35]. The first disease we studied was hepatocellular carcinoma, a type of primary liver cancer with a high mortality rate [36]. Hepatocellular carcinoma remains one of the most common and aggressive human malignancies worldwide [37,38]. For hepatocellular carcinoma, we removed the 214 miRNAs (hsa-let-7a, hsa-mir-101, hsa-mir-103a, etc.) associated with it. The remaining 281 candidate miRNAs were sent to SMALF for prediction. The results are shown in Table 5. All of the top ten miRNA candidates for hepatocellular carcinoma are confirmed in MNDR v3.0 or miRCancer.
The second disease we studied was colon cancer. Colon cancer has a high incidence in people aged 40 to 50 [39]. It has no symptoms in its early stages, so the diagnosis is easy to miss. For colon cancer, we removed the 4 miRNAs associated with it. The results are shown in Table 6. All of the top ten miRNA candidates for colon cancer are verified in MNDR v3.0 or miRCancer. The third disease we studied was breast cancer. The number of people with breast cancer has been increasing since the 1970s, and it has become a common cancer affecting women's physical and mental health [40]. We removed the 202 miRNAs (hsa-mir-1245a, hsa-mir-1245b, hsa-mir-1258, etc.) associated with breast cancer, leaving 293 candidate miRNAs. The results are shown in Table 7. Nine of the top ten miRNA candidates for breast cancer are confirmed in MNDR v3.0 or miRCancer. Notably, hsa-mir-487b has not yet been validated by biological experiments; our prediction suggests that it may be associated with breast cancer.
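The case-study protocol described above (drop the miRNAs already known for a disease, rank the rest by predicted score, keep the top candidates, and check them against external databases) can be sketched as follows. This is an illustrative outline only: the function names, miRNA identifiers, scores, and database contents are all made up, and the real scores would come from the trained SMALF model.

```python
def top_candidates(scores, known_mirnas, k=10):
    """Rank miRNAs not yet linked to the disease by predicted score."""
    ranked = sorted(
        (m for m in scores if m not in known_mirnas),
        key=scores.get,
        reverse=True,
    )
    return ranked[:k]

def count_verified(candidates, *databases):
    """Count candidates that appear in at least one reference database."""
    return sum(any(m in db for db in databases) for m in candidates)

# Toy predicted scores for one disease (hypothetical names and values).
scores = {"mir-a": 0.91, "mir-b": 0.15, "mir-c": 0.78, "mir-d": 0.66}
known = {"mir-a"}            # already associated, so excluded from ranking
top2 = top_candidates(scores, known, k=2)

# Two stand-in "databases" playing the role of MNDR v3.0 and miRCancer.
verified = count_verified(top2, {"mir-c"}, {"mir-x"})
```

In the toy run, the known miRNA is excluded even though it has the highest score, and one of the two remaining top candidates is confirmed by a database.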

Conclusion
Discovering unknown miRNA-disease associations is vital for understanding the pathogenesis of diseases at the molecular level. However, approaches based on biological experiments are still very limited in uncovering unknown miRNA-disease associations. Thus, it is increasingly important to use computational methods to predict them. We developed SMALF, a computational method that combines similarity data and latent features. SMALF first extracts the latent features of miRNAs and diseases from the original miRNA-disease association matrix by utilizing stacked autoencoders. Then, it integrates miRNA functional similarities, disease semantic similarities, miRNA latent features, and disease latent features to generate the feature vector representing each miRNA-disease pair. Finally, SMALF obtains the prediction result by employing the XGBoost algorithm. We performed five-fold cross-validation experiments. SMALF achieved an AUC value of 0.9503, which is higher than that of the other computational methods compared. In addition, the case studies indicated that SMALF can infer unknown miRNA-disease interactions effectively. However, our work still has room for improvement. Because of the lack of negative samples, we select unknown miRNA-disease associations as negative samples. These negative samples may contain false negatives, which may affect the experimental results. Therefore, finding reliable negative samples will help further improve the performance of the model.

Problem description
Researchers have used many biological experiments to confirm miRNA-disease associations. Uncovering the potential connections between human diseases and biomolecules could effectively advance the prevention, diagnosis, and treatment of human diseases. Our goal is to mine the potential relationships between miRNAs and diseases efficiently and accurately. Most existing studies are based on the miRNA-disease associations provided by HMDD v2.0 [41]. To extract latent features from existing miRNA-disease associations, the known associations are encoded in an adjacency matrix Y. The research task of this paper is to discover the unobserved potential associations in the known miRNA-disease association matrix (the 0 entries of Y).

Human miRNA-disease association
To express the relationship between miRNA and disease, we construct the adjacency matrix Y of miRNA-disease interactions. If miRNA m(i) and disease d(j) have a known association, the entry Y(i,j) is set to 1; otherwise it is set to 0. Note that a 0 entry does not indicate that there is no relation between the miRNA and the disease. It only indicates that a potential link has not yet been discovered. To obtain reliable experimental results, both positive and negative samples of miRNA-disease associations must be selected. In our experiments, we used the same miRNA-disease associations as Zhou et al. [22], comprising 5430 positive samples and 5418 negative samples. The statistical information of the dataset is shown in Table 8.
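As a small illustration of this construction, the sketch below builds a binary adjacency matrix from a list of known (miRNA index, disease index) pairs and measures how densely it is filled. The helper names and the toy pairs are assumptions for illustration; the final line uses the 5430 known associations together with the 495 miRNAs and 383 diseases of the dataset.

```python
def build_adjacency(n_mirnas, n_diseases, known_pairs):
    """Y[i][j] = 1 for a known association, 0 for an unobserved one."""
    Y = [[0] * n_diseases for _ in range(n_mirnas)]
    for i, j in known_pairs:
        Y[i][j] = 1
    return Y

def density(Y):
    """Fraction of entries equal to 1."""
    n_entries = sum(len(row) for row in Y)
    return sum(map(sum, Y)) / n_entries

# Toy matrix: 3 miRNAs, 4 diseases, 2 known associations (made-up indices).
Y = build_adjacency(3, 4, [(0, 1), (2, 3)])

# Density implied by the dataset sizes: 5430 known pairs out of 495 x 383.
dataset_density = 5430 / (495 * 383)
```

With these sizes fewer than 3% of the entries are 1, which is why the 0 entries are treated as unobserved rather than as confirmed negatives.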

MiRNA functional similarity
Previous research has shown that functionally similar miRNAs are more likely to be associated with phenotypically similar diseases. The miRNA functional similarity score can be computed as described in [42]. We construct a matrix FS in which the entry FS(m(i), m(j)) gives the functional similarity between miRNAs m(i) and m(j).

Disease semantic similarity
Inspired by previous studies, we use the MeSH database (http://www.ncbi.nlm.nih.gov/), which is widely used to obtain disease-related data, to construct a directed acyclic graph (DAG) for each disease. For a given disease D, DAG(D) = (D, T(D), E(D)), where T(D) is the node set composed of D and all of its ancestor nodes, and E(D) is the set of edges that directly connect parent nodes to child nodes. Following Xuan et al. [43], the contribution of a disease d to the semantic value of D can be defined as:

$$D_D(d) = \begin{cases} 1, & d = D \\ \max\{\Delta \cdot D_D(d') \mid d' \in \text{children of } d\}, & d \neq D \end{cases}$$

where Δ is the semantic contribution attenuation factor. Xuan et al. set Δ to 0.5. The contribution of disease D to itself is 1, and the contribution of other diseases to D decreases as their distance from D increases. From the above formula, if two diseases share a larger part of their DAGs, they obtain a higher semantic similarity value. Therefore, the semantic similarity score SS between two diseases d(i) and d(j) is:

$$SS(d(i), d(j)) = \frac{\sum_{t \in T(d(i)) \cap T(d(j))} \left( D_{d(i)}(t) + D_{d(j)}(t) \right)}{\sum_{t \in T(d(i))} D_{d(i)}(t) + \sum_{t \in T(d(j))} D_{d(j)}(t)}$$
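A toy example may make the DAG-based similarity more concrete. The sketch below assumes the standard formulation: a disease contributes 1 to itself, each ancestor receives the largest contribution among its children multiplied by the attenuation factor (0.5), and the similarity of two diseases is the summed contribution of their shared DAG nodes divided by the sum of their total semantic values. The three-level DAG and the node names are hypothetical and are not taken from MeSH.

```python
def semantic_values(disease, parents, delta=0.5):
    """Contribution of `disease` and each of its ancestors to its semantics.

    `parents` maps a node to the list of its direct parents in the DAG.
    """
    values = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            contribution = delta * values[node]
            # Keep the maximum over all paths that reach this ancestor.
            if contribution > values.get(parent, 0.0):
                values[parent] = contribution
                frontier.append(parent)
    return values

def semantic_similarity(a, b, parents, delta=0.5):
    """Shared contributions divided by the two total semantic values."""
    va = semantic_values(a, parents, delta)
    vb = semantic_values(b, parents, delta)
    shared = va.keys() & vb.keys()
    numerator = sum(va[t] + vb[t] for t in shared)
    return numerator / (sum(va.values()) + sum(vb.values()))

# Hypothetical DAG: diseases A and B share parent P, whose parent is R.
parents = {"A": ["P"], "B": ["P"], "P": ["R"]}
values_a = semantic_values("A", parents)
ss_ab = semantic_similarity("A", "B", parents)
ss_aa = semantic_similarity("A", "A", parents)
```

Here A and B each have a total semantic value of 1.75 and share the nodes P and R, giving a similarity of 1.5 / 3.5, while any disease has similarity 1 with itself.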

Stacked autoencoders for latent features of miRNAs and diseases
In the adjacency matrix Y constructed from human miRNA-disease associations, the 5430 known miRNA-disease associations account for only 2.8% of all miRNA-disease pairs. To better represent these sparse raw data, the stacked autoencoder extracts the latent relationships contained in the high-dimensional and sparse original feature vectors of miRNAs and diseases. The autoencoder (AE) is an unsupervised learning method. Its purpose is to learn, from unlabeled input data, a compressed, reduced-dimension feature representation of the data. The autoencoder is an artificial neural network composed of two sub-networks: an encoder and a decoder [44]. In this article, a stacked autoencoder is used to extract the latent features of miRNAs and diseases. The stacked autoencoder is a cascade of multiple autoencoders; that is, it contains multiple hidden layers that extract information from the original features layer by layer. The stacked autoencoder trains its AE layers sequentially. After the first AE has been trained, the output of its encoder is used as the input of the second AE, and so on, until a more representative, low-dimensional latent feature is obtained.

SMALF model
In this section, we will detail the SMALF model construction process, and show the overall process in Fig. 5.

Step 1: Matrix decomposition
The original matrix Y is taken as the input: each row of Y is the original feature of a miRNA, and each column is the original feature of a disease. In the original feature vectors of m(i) and d(j), an entry marked with 1 indicates that there is a known correlation, and an entry marked with 0 indicates that the correlation is unobserved. We decompose the miRNA-disease association matrix Y into M and $D^T$, where M and $D^T$ are real matrices. In our research, $M_i$ and $D^T_j$ are regarded as the original feature vectors of m(i) and d(j), respectively.
Fig. 5 (caption fragment): Step 3, integrating miRNA functional similarity, miRNA latent feature, disease semantic similarity, and disease latent feature generates the feature vector representing miRNA-disease. Step 4, the XGBoost algorithm is employed to predict the miRNA-disease associations.
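In code, this decomposition amounts to reading rows and columns of Y. The following minimal sketch uses a made-up 2 x 3 matrix and hypothetical helper names; with the dataset described in this paper (495 miRNAs, 383 diseases), each miRNA vector would have 383 entries and each disease vector 495.

```python
def mirna_feature(Y, i):
    """Original feature of miRNA i: the i-th row of Y."""
    return list(Y[i])

def disease_feature(Y, j):
    """Original feature of disease j: the j-th column of Y."""
    return [row[j] for row in Y]

# Toy association matrix: 2 miRNAs (rows) x 3 diseases (columns).
Y = [[1, 0, 0],
     [0, 1, 1]]

m1 = mirna_feature(Y, 1)      # associations of miRNA 1 with every disease
d2 = disease_feature(Y, 2)    # associations of disease 2 with every miRNA
```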

Step 2: Extracting latent features with stacked autoencoders
In our autoencoder, the encoder H1 takes the original feature vector m of a miRNA in M as input, and the encoder H2 takes the original feature vector d of a disease in $D^T$ as input. The i-th training sample of H1 is $x_i = m$, and the j-th training sample of H2 is $x_j = d$. Each encoder maps its input to a low-dimensional code z. The formula is as follows:

$$h_i^{(l)} = f_e\left(W_l h_i^{(l-1)} + b_l\right), \quad l = 1, \ldots, L, \qquad z_i = h_i^{(L)}$$

where we set L to 2, which means that two hidden layers are used; $h_i^{(l)}$ is the l-th hidden layer, $h_i^{(0)}$ represents the input $x_i$, $W_l$ is the weight matrix and $b_l$ is the bias of the l-th layer, and $f_e(\cdot)$ is the activation function of the encoder. The weights and biases are adjusted through training.
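The purpose of the decoder is to reconstruct the input $x_i$ as closely as possible from the latent feature $z_i$ output by the encoder. Its definition is as follows:

$$\hat{h}_i^{(l)} = f_d\left(W'_l \hat{h}_i^{(l-1)} + b'_l\right), \quad l = 1, \ldots, L-1, \qquad \hat{x}_i = g_d\left(W'_L \hat{h}_i^{(L-1)} + b'_L\right)$$

where $\hat{h}_i^{(0)} = z_i$, $W'_l$ and $b'_l$ are the weight matrix and bias of the l-th decoder layer, $\hat{x}_i$ is the reconstruction of $x_i$, and $f_d(\cdot)$ and $g_d(\cdot)$ represent the activation function and the hyperbolic tangent function, respectively.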
Finally, the loss function is the sum of the reconstruction errors of all samples, and its expression is as follows:

$$\mathcal{L} = \sum_{i} \left( \lVert x_i - \hat{x}_i \rVert^2 + \lambda \lVert J_h(x_i) \rVert_F^2 \right)$$

where $\hat{x}_i$ is the reconstruction of $x_i$. The first term is the squared reconstruction error, the second term is the norm of the Jacobian $J_h(x_i)$ of the hidden representation with respect to the input, and $\lambda$ is a hyperparameter. The stacked autoencoder iteratively updates the parameters of each node of the network to minimize the loss. It is trained through backpropagation, and this step is also called fine-tuning. After continued fine-tuning, the loss reaches its minimum and the autoencoder reaches its optimal solution. At this point, the latent features z are the low-dimensional, dense feature vectors $M_i$ and $D^T_j$ that we need, compressed from the sparse miRNA and disease features.
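To make the data flow of Step 2 concrete, the following sketch runs an untrained two-layer encoder and a mirrored decoder on one sparse miRNA vector and measures the squared reconstruction error. It is a forward pass only and not the authors' implementation: the hidden width of 128, the sigmoid activation, and the random initialisation are assumptions, and both training by backpropagation and the Jacobian penalty term are omitted. Plain Python lists are used to keep the example dependency-free.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dense(x, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def encode(x, layers):
    """Stacked encoder: each layer feeds the next; the last output is z."""
    h = x
    for weights, biases in layers:
        h = dense(h, weights, biases, sigmoid)
    return h

def decode(z, layers):
    """Mirrored decoder with a tanh output layer."""
    h = z
    for weights, biases in layers[:-1]:
        h = dense(h, weights, biases, sigmoid)
    weights, biases = layers[-1]
    return dense(h, weights, biases, math.tanh)

def reconstruction_error(x, x_hat):
    """Squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

rng = random.Random(0)
n_diseases, hidden, latent = 383, 128, 64   # hidden width is an assumption
encoder = [init_layer(n_diseases, hidden, rng), init_layer(hidden, latent, rng)]
decoder = [init_layer(latent, hidden, rng), init_layer(hidden, n_diseases, rng)]

# A sparse miRNA row of Y with three known associations (made-up positions).
x = [1.0 if j in {3, 57, 210} else 0.0 for j in range(n_diseases)]
z = encode(x, encoder)
x_hat = decode(z, decoder)
error = reconstruction_error(x, x_hat)
```

Training would adjust the encoder and decoder weights to reduce `error` over all samples; after training, `z` plays the role of the 64-dimensional latent feature.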

Step 3: Combining latent features and similarity features
So far, we have obtained the 64-dimensional miRNA and disease latent feature vectors $M_i$ and $D^T_j$ extracted by the stacked autoencoders. We concatenate them with the 495-dimensional miRNA functional similarity feature $FS_i$ and the 383-dimensional disease semantic similarity feature $SS_j$, respectively, which yields a 559-dimensional miRNA feature and a 447-dimensional disease feature. We then concatenate these two vectors to obtain a new 1006-dimensional vector, Vec new, for model prediction.
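The dimensions in this step can be checked with a short sketch. The function name and the placeholder values are hypothetical, and the ordering inside each half (latent feature first, similarity second) is an assumption, since the text does not fix it.

```python
def pair_feature(mirna_latent, mirna_sim, disease_latent, disease_sim):
    """Concatenate latent and similarity features of one miRNA-disease pair."""
    mirna_vec = list(mirna_latent) + list(mirna_sim)        # 64 + 495 = 559
    disease_vec = list(disease_latent) + list(disease_sim)  # 64 + 383 = 447
    return mirna_vec + disease_vec                          # 559 + 447 = 1006

# Placeholder vectors with the dimensions given in the text.
M_i = [0.1] * 64      # miRNA latent feature
FS_i = [0.2] * 495    # row of the miRNA functional similarity matrix
D_j = [0.3] * 64      # disease latent feature
SS_j = [0.4] * 383    # row of the disease semantic similarity matrix

vec_new = pair_feature(M_i, FS_i, D_j, SS_j)
```

Each miRNA-disease sample passed to the classifier is therefore a single vector of length 1006.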

Step 4: Predicting new feature vectors with XGBoost
XGBoost builds a strong classifier from the weak classifiers it contains through gradient boosting iterations [45]. In this paper, we use the XGBoost model to classify the concatenated miRNA-disease feature vectors in the new dataset: the model takes the concatenated vector Vec new as input and learns its gradient-boosted regression trees through training. The XGBoost model contains K decision trees, $f_k$ represents the k-th decision tree, and the feature vector Vec new_i is regarded as the input $x_i$. The prediction result is given by the following formula:

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i), \quad t = 1, \ldots, K$$

where $\hat{y}_i^{(t)}$ is the prediction obtained from the first t trees. To minimize the objective function, the XGBoost algorithm adds a new function to the existing model in each iteration and uses the function $\Omega(f_t)$ to control the complexity of the t-th tree:

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

where T is the number of leaf nodes, $w_j$ is the score of the j-th leaf node, and $\gamma$ and $\lambda$ are the hyperparameters that control the weight of the complexity term; overfitting can be prevented by adjusting these two hyperparameters. Furthermore, XGBoost uses a second-order Taylor expansion to optimize the objective function. Writing $g_i$ and $h_i$ for the first- and second-order derivatives of the loss $l$ with respect to $\hat{y}_i^{(t-1)}$, and N for the number of training samples, the objective function of the t-th iteration is as follows:

$$Obj^{(t)} \simeq \sum_{i=1}^{N} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
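A small numeric sketch can show how the second-order objective scores one candidate tree. It assumes the logistic loss on raw scores, for which the first derivative is p - y and the second derivative is p(1 - p) with p = sigmoid(raw score); the loss choice, the function names, and the toy tree (two samples, two leaves) are assumptions for illustration. The constant term that depends only on the previous iteration is left out, and the complexity penalty is taken as gamma times the number of leaves plus half of lambda times the sum of squared leaf scores.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logistic_grad_hess(y, raw_pred):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = sigmoid(raw_pred)
    return p - y, p * (1.0 - p)

def tree_complexity(leaf_weights, gamma, lam):
    """gamma * (number of leaves) + 0.5 * lam * (sum of squared leaf scores)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

def second_order_objective(labels, raw_preds, leaf_of, leaf_weights,
                           gamma=0.0, lam=1.0):
    """Taylor-approximated objective for one candidate tree.

    leaf_of[i] is the leaf that sample i falls into; the constant term
    from the previous iteration is omitted.
    """
    total = 0.0
    for y, pred, leaf in zip(labels, raw_preds, leaf_of):
        g, h = logistic_grad_hess(y, pred)
        w = leaf_weights[leaf]
        total += g * w + 0.5 * h * w * w
    return total + tree_complexity(leaf_weights, gamma, lam)

# Two samples, previous raw scores of 0, and a two-leaf candidate tree.
labels = [1, 0]
raw_preds = [0.0, 0.0]
leaf_of = [0, 1]
leaf_weights = [1.0, -1.0]
penalty = tree_complexity(leaf_weights, gamma=1.0, lam=1.0)
objective = second_order_objective(labels, raw_preds, leaf_of, leaf_weights,
                                   gamma=1.0, lam=1.0)
```

In this toy case the data term is -0.75 and the penalty is 3, so a larger gamma or lambda makes the same tree less attractive, which is how the two hyperparameters restrain overfitting.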