ReCGBM: a gradient boosting-based method for predicting human dicer cleavage sites

Background Human dicer is an enzyme that cleaves pre-miRNAs into miRNAs. Several models have been developed to predict human dicer cleavage sites, including PHDCleav and LBSizeCleav. Given an input sequence, these models can predict whether the sequence contains a cleavage site. However, these models only consider each sequence independently and lack interpretability. Therefore, it is necessary to develop an accurate and explainable predictor, which employs relations between different sequences, to enhance the understanding of the mechanism by which human dicer cleaves pre-miRNA. Results In this study, we develop an accurate and explainable predictor for human dicer cleavage site – ReCGBM. We design relational features and class features as inputs to a lightGBM model. Computational experiments show that ReCGBM achieves the best performance compared to the existing methods. Further, we find that features in close proximity to the center of pre-miRNA are more important and make a significant contribution to the performance improvement of the developed method. Conclusions The results of this study show that ReCGBM is an interpretable and accurate predictor. Besides, the analyses of feature importance show that it might be of particular interest to consider more informative features close to the center of the pre-miRNA in future predictors.

Several studies [2][3][4] show that miRNAs are related to different types of cancers such as breast, lung, and thyroid cancers. Understanding how Dicer specifically selects cleavage sites may help us interpret the effects of mutations in miRNA coding genes [5]. Therefore, it is of great interest to investigate how Dicer selects cleavage sites from the 3p-arm and the 5p-arm of a pre-miRNA.
Recently, machine learning-based approaches such as support vector machine [6][7][8][9], support vector regression [10][11][12], deep neural networks [13,14] and conditional random fields [15] have been widely used for cleavage site predictions. However, these methods mainly aim at predicting protein cleavage sites. To predict human dicer cleavage sites, different feature encoding schemes and feature extraction methods are needed.
There are some existing studies on human dicer cleavage sites. Ahmed et al. [16] developed an SVM-based model (PHDCleav) of Dicer cleavage site prediction. The inputs to this model are extracted from pre-miRNA nucleotide sequences with loop/bulge structures, and the output is whether an input pattern is a correct dicer cleavage site. They demonstrated this method outperformed other approaches such as Random Forest, CART, and Naïve Bayes by computational experiments. Bao et al. [17] combined the loop/bulge size with pre-miRNA nucleotide sequences as inputs and proposed another SVM-based prediction model (LBSizeCleav). However, there are some shortcomings with these methods. First, they only considered each sequence with its loop/bulge independently. Second, their models are not explainable.
To address the above issues, we propose an explainable predictor based on the gradient boosting machine [18]-ReCGBM. Our main contributions include: (i) extract relational features to combine each sequence and its complementary strand; (ii) design class features through affinity propagation, and (iii) identify some rules from the feature importance of ReCGBM. We summarize the design and evaluation process of ReCGBM in Fig. 1a.

Data preparation
In this study, we extracted the cleavage pattern and non-cleavage pattern from each pre-miRNA sequence. First, we collected 956 validated pre-miRNA sequences from miRBase (Version 22.1) [19]. To obtain the structural information for each pre-miRNA sequence, we employed quickfold [20] and RNAFold from ViennaRNA [21] to generate RNA secondary structure. We chose these two RNA secondary structure prediction methods because both are powerful tools for RNA secondary structure prediction. Besides, previous methods like PHDCleav and LBSizeCleav all employed these tools to predict RNA secondary structures. Then, we generated a cleavage pattern for each pre-miRNA sequence. Each cleavage pattern consisted of a 14 nt long sequence with the cleavage site located at the center. Finally, we extracted each non-cleavage pattern that was a 14 nt long sequence with the center 6 nt away from the corresponding cleavage site. Figure 2 illustrates how to obtain cleavage pattern and non-cleavage pattern. In this figure, the cleavage pattern of 5p-arm is the sequence between the two red bars in the structure of the pre-miRNA sequence, which is 'UAU AGU UUU AGG GU' . The non-cleavage pattern of 5p-arm is ' AGG UUG UAU AGU UU' according to our selection rules.
To incorporate structural information of each pattern, we first obtained the structure of each pre-miRNA through quickfold or RNAFold. Then, we extracted the complementary strands for each cleavage pattern and non-cleavage pattern from the structural information of each pre-miRNA. Notice that if a nucleotide in a pattern did not have a complementary nucleotide according to the structural information, we would define it as a 'loop/bulge' . To combine 'loop/bulge' in our data, we represented a 'loop/bulge' as 'O' and considered it as a complementary nucleotide of the corresponding nucleotide in a pattern. In Fig. 2, the complementary strand of the cleavage pattern of 5p-arm is given by the complementary nucleotides in the green boxes and loops/bulges ('O'). The complementary strand of the non-cleavage pattern of 5p-arm is given in the same way.
An example of data preprocessing is shown in Fig. 2. Since structural information was generated through quickfold or RNAFold, four datasets were obtained after a b Fig. 1 Flowchart of the data preprocessing, feature extraction, model training and evaluation of the developed ReCGBM approach. a Shows the design and evaluation process of ReCGBM. b Describes how to generate relational features and class features the preprocessing: 3p-arm with quickfold structure, 5p-arm with quickfold structure, 3p-arm with RNAFold structure and 5p-arm with RNAFold structure.

Relational features
In previous studies [16,17], each pattern and its complementary strand were encoded by one-hot encoding separately. However, nucleotides in a pattern may also form a base pair with nucleotides in the corresponding complementary strand.
To better encode the relation between each pattern and its complementary strand, we considered relational features. For example, given a pattern 'UAU AGU UUU AGG GU' and its complementary strand ' AUA UCA AOOOCCCO' , the relational features can be obtained according to (1) shown in Fig. 1b. An advantage of relational features is that it offers the important base-pairing information between each pattern and its complementary strand.

Class features
Previous methods [16,17] considered each input independently to make predictions. However, similar inputs may lead to the same prediction outputs. In this study, we made an assumption that similar inputs will lead to the same prediction results and accordingly designed the class feature, which assigned similar inputs to the same class.
To obtain class features, we first defined the pairwise similarities between different inputs based on the edit distance (Fig. 3). Then, an unsupervised learning methodaffinity propagation [22] was used to cluster the inputs to obtain the class features.

Edit distance
In order to measure the pairwise similarities between different sequences, we used the edit distance [23]. Given two strings A and B, suppose that the lengths of A and B are |A| and |B|, respectively. The edit distance between A and B is given by D edit (|A|, |B|) as follows where i represents the first i characters of A and j represents the first j characters of B, respectively, where i, j ≥ 1.
1 a i =b j is an indicator function where Since our inputs are 14 nt RNA sequences, the similarity between two sequences can be calculated by the edit distance. For example, given two sequences 'UAU AGU UUU AGG GU' and 'UAU GGU UUA GAG UU' , the edit distance between them can be calculated through a distance matrix (Fig. 3). As Fig. 3 shows, the edit distance between these two sequences is 4. In this study, each input consisted of a pattern and its complementary strand. Therefore, the similarity between the two inputs E and F can be defined as follows: where E 2 and F 2 are the complementary strands of E 1 and F 1 respectively.

Fig. 3 An example of calculating the edit distance by matrix
Given a dataset that includes n samples, we define a n × n similarity matrix S where the entry in the ith row and jth column s(i, j) = −d i,j . Notice that d i,j denotes the similarity D similar (i, j) between the ith training sample and the jth training sample.

Affinity propagation
Affinity Propagation [22] is a clustering algorithm based on message passing. It identifies classes of similar inputs. Given n data points x 1 , . . . , x n , the algorithm works as follows: • Iteratively execute the follow steps: 1 Responsibility updates: 2 Availability updates: until A + R remain unchanged over a number of steps, or after some predefined numbers of steps. For each point x i , the data point x k that maximizes a(i, r) + r(i, k) gives us the class information of x i .
We chose affinity propagation to generate class features as it only requires a few hyperparameters. More importantly, affinity propagation does not need to choose the number of classes. To employ affinity propagation, we used the edit distance to generate similarity matrices. Given a training set and a test set, we first calculated the similarity matrix of the training set. Then we applied affinity propagation to the similarity matrix. The affinity propagation will assign each sample in the training set a cluster label, which is our class features. Besides, the number of clusters and the center of each cluster were also obtained from the results of affinity propagation (Fig. 4). We then measured the edit distance between each sample in the test set and these cluster centers. Finally, we assigned each test sample the same cluster label as the cluster center with the minimum edit distance. The whole procedure is given by (2) in Fig. 1b.

LightGBM
Gradient boosting machine [18] is a machine learning algorithm that uses a group of weak prediction models (often decision trees) to make predictions. In this study, we utilized a gradient boosting machine-based framework -lightGBM.
LightGBM [24] is an efficient implementation of gradient boosting machine. It has been widely used in the field of bioinformatics and compuational biology since it has the following advantages: • High speed and low memory cost: LightGBM uses a histogram-based algorithm [25][26][27]. Such algorithm can assign continuous feature values into discrete bins, thereby leading to high training speed and low memory cost. • High accuracy: Traditional decision tree-based learning algorithms generate trees level-wise. However, lightGBM generates trees leaf-wise. This strategy usually causes lower loss than level-wise algorithms. • Support categorical features: One-hot encoding is an efficient encoding scheme for categorical features. However, for tree-based learning algorithms, one-hot features tend to generate very unbalanced trees, which may prevent the prediction model from achieving good accuracy. Instead of one-hot encoding, lightGBM allows users to input categorical features directly to train the model, which may lead to more balanced trees and more accurate results.
We built a lightGBM-based model -termed ReCGBM with relational features and class features as inputs. The outputs were cleavage sites or non-cleavage sites.

Evaluation metrics
To assess the performance of our prediction model, we used several different metrics including sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews correlation coefficient (MCC):

Predictive performance
The main goal of this paper is to develop an accurate prediction model for the Dicer cleavage sites. In this section, we show the predictive performance of ReCGBM and compare our results with other existing models. We built prediction models for the 5p-arm dataset and 3p-arm dataset with secondary structures predicted by quickfold and RNAFold respectively.
To ensure the effectiveness of our model, we trained 10 models for each dataset, where we only considered the cases in which the affinity propagation converged. For each model, we randomly divided our preprocessed dataset into two subsets. The first subset that included 800 cleavage patterns and 800 non-cleavage patterns was used as the training set. The other subset was used as the independent test set, which included 156 cleavage patterns and 156 non-cleavage patterns. We computed the average sensitivity (Sn), specificity (Sp), accuracy (Acc), and MCC of the 10 models for each dataset.
We compared the predictive performance of ReCGBM with the existing methods, PHDCleav and LBSizeCleav. Since the performance of LBSizeCleav highly depends on the variable k, which represents the effect of length of loops/bulges on the kernel computation, we trained LBSizeCleav with k = 1, 2, 3, 4, 5 as previously described [17]. All models were trained on the same 10 training sets and evaluated on the same 10 test sets for each dataset.
The predictive performance of different models is illustrated in Fig. 5 and Additional file 1: Table S1 for both the 5p-arm dataset and the 3p-arm dataset with secondary structures predicted by quickfold. For the 5p-arm dataset with secondary structures predicted by quickfold, ReCGBM achieved a sensitivity of 0.863, a specificity of 0.846, an accuracy of 0.854 and an MCC of 0.709, respectively, which outperformed all other predictors in terms of three out of the four evaluation metrics. For the 3p-arm dataset with secondary structures predicted by quickfold, ReCGBM achieved the best sensitivity, specificity, accuracy and MCC of 0.883, 0.899, 0.891 and 0.783, respectively. Figure 6 and Additional file 1: Table S2 show the average specificity, sensitivity, accuracy, and MCC of models on both the 5p-arm dataset and the 3p-arm dataset with secondary structures predicted by RNAFold. For the 5p-arm dataset with secondary structures predicted by RNAFold, ReCGBM achieved the best specificity (0.862), accuracy (0.873) and MCC (0.747). In contrast, LBSizeCleav (k=3) achieved the best sensitivity (0.888). For the 3p-arm dataset with secondary structures predicted by RNAFold, ReCGBM achieved the best specificity (0.892), while PHDCleav achieved the best sensitivity (0.904), accuracy (0.894) and MCC (0.789).
Overall, ReCGBM outperformed the other models on three of the four datasets, highlighting the effectiveness of this model for predicting human dicer cleavage sites.
To further investigate whether different RNA secondary structures affect our prediction accuracy, we also generated relational features and class features using the RNAstructure package [29,30]. For ∼ 70 nt RNA structure, RNAstructure predicts secondary structures of the pre-miRNA-size RNAs with the accuracy near 100% .We trained 5 models for each dataset, where we only considered the cases in which the affinity propagation converged. For each model, we randomly divided our preprocessed dataset into two subsets. The first subset that included 800 cleavage patterns and 800 non-cleavage patterns was used as the training set. The other subset was used as the independent test set, which included 156 cleavage patterns and 156 non-cleavage patterns. The results are shown in Additional file 1: Table S3. The performance of ReCGBM with RNAstructure is close to the performance of ReCGBM with RNAFold.

Affinity propagation
The goal of this experiment is to explore the relationship between class features and cleavage/non-cleavage sites. To address this, we applied affinity propagation to each training set of 3p-arm and 5p-arm with secondary structures predicted by quickfold and RNAFold. Figure 4 shows the average cluster results of 10 training sets of 3p-arm and 5p-arm with secondary structures predicted by quickfold and RNAFold.
To further investigate the relationship between the class features and the cleavage/ non-cleavage sites, we defined ratio i (cleavage) for each class i in Fig. 4 as follows: where N i (cleavage) and N i (non-cleavage) represent the number of cleavage patterns and the number of non-cleavage patterns in class i, respectively. If ratio i (cleavage) is close to 0, the samples in class i are almost non-cleavage patterns. On the other hand, if ratio i (cleavage) is close to 1, the samples in class i are almost cleavage patterns. Classes with very high ratio(cleavage) and classes with very low ratio(cleavage) are desirable as these classes reflect that the class features have the potential to distinguish cleavage sites from non-cleavage sites.
We  Figure 7a-d show the relationships between the average number of classes and each label for the 5p-arm dataset(qf ), 3p-arm dataset(qf ), 5p-arm dataset(rf ) and 3p-arm dataset(rf ) in Fig. 4, respectively. The commonality in these four figures is that the numbers of classes in label 1 and label 5 were much higher than others. Figure 7e-h describe the relationships between the average number of samples and each label for four datasets. It is obvious that the numbers of samples in label 1 and label 5 were much higher than others.
Thus, the results of affinity propagation show that the majority of classes are overwhelmed by either only cleavage sites or non-cleavage sites, which indicates that the clusters based on edit distance may improve the prediction of the cleavage site.

Sequence logo representations
To explore the difference between cleavage sites and non-cleavage sites at the RNA sequence level, we draw the sequence logo representations (Fig. 8) of the 5p-arm cleavage sites, 5p-arm non-cleavage sites, 3p-arm cleavage sites and 3p-arm non-cleavage sites by WebLogo 3 [31].
As can be seen from Fig. 8, the 5p-arm cleavage sites and 5p-arm non-cleavage sites show different preferences for the neighboring nucleotides. For cleavage sites, sequence Results of affinity propagation. a-d plot the relationships between the average number of classes and different labels for the 5p-arm dataset (qf ), 3p-arm dataset (qf ), 5p-arm dataset (rf ) and 3p-arm dataset (rf ), respectively where qf represents secondary structures predicted by quickfold server and rf denotes secondary structures predicted by RNAFold. e-h show the relationships between the average number of samples and different labels for the 5p-arm dataset (qf ), 3p-arm dataset (qf ), 5p-arm dataset (rf ) and 3p-arm dataset (rf ), respectively motifs associated with over-represented nucleotides at the positions 3-11 (Fig. 8a) can be easily observed, while for non-cleavage sites, no specific nucleotides were found to be over-represented at the positions 7 and 8 (Fig. 8b). For the 3p-arm cleavage sites (Fig. 8c), nucleotides at two specific positions 7 and 8 showed most distinctive preferences, compared with the 3p-arm non-cleavage sites (Fig. 8d). There also exist subtle differences in other positions such as the positions 5, 9, 10, and 11 between the 3p-arm cleavage sites and non-cleavage sites. Altogether, the nucleotide preferences shown in Fig. 8 represent patterns important for distinguishing the dicer cleavage sites from non-cleavage sites.

Feature importance
Gradient boosting machine typically uses decision trees as the base learners. An advantage of the decision tree is that it is an explainable model. Therefore, it is also possible to interpret a gradient boosting machine as an ensemble of decision trees. Fortunately, lightGBM has a built-in module that provides a score to describe the usefulness of each feature for a trained model. Here we will discuss the feature importance of ReCGBM. Since we trained 10 models for each dataset, we list the average feature importance of the 10 trained models for each dataset.
Let p 1 , . . . , p 14 denote the 14 pairs of the relational features (Fig. 1b1). Figures 9 and 10 show the results of feature importance for different datasets. Figure 9 gives the feature importance of the 5p-arm datasets with the secondary structures predicted by quickfold and RNAFold, respectively. It can be seen that p 8 , . . . , p 14 were more important than p 1 , . . . , p 7 . Additional file 1: Figure S1a and S1b shows the relationships between p 8 , . . . , p 14 and p 1 , . . . , p 7 on 5p-arm dataset directly. a b Fig. 9 Results of feature importance on the 5p-arm datasets. The secondary structures of a and b are predicted by quickfold and RNAFold, respectively Figure 10 shows the feature importance of the 3p-arm datasets with the secondary structures predicted by quickfold and RNAFold, respectively. For the 3p-arm datasets, p 1 , . . . , p 7 were more important than p 8 , . . . , p 14 . Additional file 1: Figure S1c and S1d shows the relationships between p 8 , . . . , p 14 and p 1 , . . . , p 7 on the 3p-arm dataset directly.

Discussion
Although ReCGBM showed a better performance on three of the four datasets, PHDCleav achieved the best performance on the 3p-arm dataset with the secondary structures predicted by RNAFold. There are several potential reasons. First, the performance of ReCGBM highly depends on the predicted secondary structures. To obtain relational features and class features, the secondary structure information is necessary. However, the secondary structure generated by quickfold or RNAFold may include several structures. In ReCGBM, only one structure can be included in our input features, which might affect the prediction performance. a b Fig. 10 Results of feature importance on the 3p-arm datasets. The secondary structures of a and b are predicted by quickfold and RNAFold, respectively