LBSizeCleav: improved support vector machine (SVM)-based prediction of Dicer cleavage sites using loop/bulge length

Background Dicer is necessary for the process of mature microRNA (miRNA) formation because the Dicer enzyme cleaves pre-miRNA correctly to generate miRNA with correct seed regions. Nonetheless, the mechanism underlying the selection of a Dicer cleavage site is still not fully understood. To date, several studies have been conducted to solve this problem, for example, a recent discovery indicates that the loop/bulge structure plays a central role in the selection of Dicer cleavage sites. In accordance with this breakthrough, a support vector machine (SVM)-based method called PHDCleav was developed to predict Dicer cleavage sites which outperforms other methods based on random forest and naive Bayes. PHDCleav, however, tests only whether a position in the shift window belongs to a loop/bulge structure. Result In this paper, we used the length of loop/bulge structures (in addition to their presence or absence) to develop an improved method, LBSizeCleav, for predicting Dicer cleavage sites. To evaluate our method, we used 810 empirically validated sequences of human pre-miRNAs and performed fivefold cross-validation. In both 5p and 3p arms of pre-miRNAs, LBSizeCleav showed greater prediction accuracy than PHDCleav did. This result suggests that the length of loop/bulge structures is useful for prediction of Dicer cleavage sites. Conclusion We developed a novel algorithm for feature space mapping based on the length of a loop/bulge for predicting Dicer cleavage sites. The better performance of our method indicates the usefulness of the length of loop/bulge structures for such predictions.


Background
MicroRNAs (miRNAs) are a type of small RNAs with the length ∼22 nt, which perform the function of suppressing gene expression at the post-transcriptional level [1,2]. Usually in vivo, a gene of a miRNA is transcribed to produce a long, primary miRNA (pri-miRNA) transcript, which is then processed into a ∼65-nt-long hairpin structure via cleavage by the Drosha (DGCR8) enzyme. Then, the resulting pre-miRNA is cleaved by another enzyme (termed Dicer) to generate a mature miRNA, which is ∼22 nt long [3]. Finally, the generated miRNA can be combined with an Argonaute protein to form the protein-miRNA site [9]. There are also studies showing that the loop/bulge structure also determines the accuracy of cleavage activity [10,11]. MacRae et al. reported that the 3p-terminal nucleotide of single-stranded RNA can affect Dicer binding [12]. In addition, Jin and Lee found that a single nucleotide polymorphism may be associated with miRNA regulation [13]. All these studies revealed that secondary structures of both the Dicer enzyme and cleavage substrates are essential for cleavage site determination.
With a better understanding of the features of selection of a Dicer cleavage site, researchers may be able to elucidate the mechanism of action of enzymes in the RNA III family as well as the processes of RNA inference. Thus, it is imperative to explore the factors affecting the accuracy of Dicer cleavage to gain better insights into the mechanism of Dicer cleavage. Recently, a support vector machine (SVM)-based method (PHDCleav) was developed to predict selection of Dicer cleavage sites [14]. They proposed feature space mappings from pre-miRNA nucleotide sequences on the basis of existence of a predicted loop/bulge structure. SVM is a state-of-the-art machine learning technology [15] that has been applied to various areas of pattern recognition in many biological fields such as protein classification [16][17][18], prediction of RNA secondary structure [19,20], and drug-nondrug classification [21,22].
In this paper, we made use of the length of loop/bulge structures and proposed a novel algorithm of feature space mapping called LBSizeCleav. To evaluate our method, we used 810 empirically valid sequences of pre-miRNAs from miRBase and performed fivefold cross-validation. In the 5p arm of pre-miRNAs, the proposed method attained higher accuracy (87.4%), whereas the best prediction result of PHDCleav corresponded to the accuracy of 84.0% (an extended binary pattern, a window of 14-nt size). In addition, in the 3p arm, the average prediction accuracy of our method reached 83.0%, whereas PHD-Cleav achieved up to 79.1% prediction accuracy. These results suggest that our method LBSizeCleav outperforms binary patterns of PHDCleav in predicting the position of Dicer cleavage sites. The better performance may in turn serve as the evidence that the features utilized by these two methods are necessary for Dicer cleavage selection.

Methods
In this section, we provide a brief description of feature space mapping algorithms of PHDCleav using sequences and secondary structures and propose a novel algorithm for feature space mapping, LBSizeCleav, based on the length of a loop/bulge structure.

Feature space mapping procedures of PHDCleav
Given a pre-miRNA sequence, a site between two successive nucleotides is mapped to a binary vector. In PHDCleav, a window is generated for each input sequence where for the positive pattern the center of the window is exactly located at the cleavage site of 5p (3p) arm and for the negative pattern the center of the window is located 6 nt away from the cleavage site of 5p(3p) arm. Since this is based on the assumption that a cleavage site can shift slightly (1-2 nt in biological experiments) but the chance is rare that Dicer cuts in the middle of mature miRNA, 6 nt could be changed under the principle that the center of the negative pattern is far enough from the real cleavage site. PHDCleav has shown that there is little affect to the accuracy of prediction even with the shifting of negative windows among the whole sequence of pre-miRNA.
A nucleotide in a window having the site at the center is converted to a four-dimensional vector as [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], and [0, 0, 0, 1], for A, U, C, and G, respectively (see Table 1). Let w denote the size of the window, where w is a positive even number. Then, a 4wdimensional vector is generated for the site.
There are many loops/bulges in the secondary structure of pre-miRNA where one arm contains extra nucleotides without counterparts in the other arm [23]. A recent study indicated that these loops/bulges play an important role in the selection of a Dicer cleavage site [10]. This observation Table 1 Binary patterns for nucleotides, A, U, C, G, and a loop/bulge structure, denoted by L, in PHDCleav [14] and LBSizeCleav with k ones based on sequences and predicted secondary structures suggests that the loop/bulge structure may be a feature that is useful for prediction of a Dicer cleavage site. The extended binary pattern of PHDCleav was developed on the basis of this assumption. After obtaining the secondary structure from a given sequence by some prediction methods, in the extended binary pattern of PHDCleav, a nucleotide is converted to a five-dimensional vector as [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0], and [0, 0, 0, 0, 1], for A, U, C, G, and L, respectively, where L indicates that the corresponding nucleotide is predicted to be in a loop/bulge structure. Just as the nucleotides in the window, its complementary nucleotides are also converted to a feature vector. After that, the dimensionality of the vector is 10w.

Feature space mapping of LBSizeCleav
It is reasonable to consider not only the position but also the length of loop/bulge structures. Therefore, we propose novel feature space mapping (LBSizeCleav) by introducing the length of a loop/bulge structure into the algorithm.
The binary pattern of LBSizeCleav is an extension of that of PHDCleav. Let M be the maximal length of loops and bulges of all the pre-miRNAs in a dataset, and suppose L l indicates that the corresponding nucleotide is in a loop/bulge structure of length l. Here we introduce a new parameter named k into LBSizeCleav, which is a positive integer representing the effect of length of loops and bulges to the kernel computation. Then, we designate a nucleotide without any loop/bulge structure for k as a  Table 1). A nucleotide in a loop/bulge structure of length l is represented as [ 0, . . . , 0, 1, . . . , 1, 0, . . . ], where k ones appear from the (4 + l)-th element to the (k + 3 + l)-th element. Thus, for window size w, a 2w(M + k + 3)-dimensional vector is generated.
Let x 1 and x 2 be binary patterns of L l 1 and L l 2 , respectively. If we use the inner product for kernel computation, then the inner product between the binary patterns is x 1 These values assume the maximum when l 1 = l 2 and decrease according to the difference |l 1 − l 2 | and k, while k gets larger, the value changes of kernel function is more sensitive to the size of |l 1 − l 2 |, in this way by controlling the value k we could control the sensitivity of our method to length of loops and bulges. Since PHDCleav used radial basis function (RBF), we also selected RBF as our kernel function. Figure 1 illustrates the feature space mapping of LBSize-Cleav for the pre-miRNA of the miRBase ID hsa-miR-200c with a predicted secondary structure, where nucleotides in the region removed by Dicer are shown as lowercase letters. CD-5p and CD-3p denote cleavage sites in 5p and 3p arms, respectively. Sequences in the red rectangles denote sequences used to generate feature vectors representing 5p and 3p arms, which are selected by the principle that the cleavage site is located at the center of the sequence. Here, we generate the feature vector of LBSizeCleav at k = 3 and w = 6 for the site CD-5p and for the site 6 nt away from CD-5p. The nucleotides in the window in the 5p arm are UGGgug, and loop/bulge structures are detected at two positions. As a result, L 1 GGgL 1 g is converted to  Table 1.

Results
We retrieved 810 empirically validated sequences of pre-miRNAs from miRBase (version 21) [24], where cleavage sites CD-5p and CD-3p are both defined for each pre-miRNA. The pre-miRNAs were selected under the principle that both the cleavage sites of CD-5p and CD-3p are experimentally validated. (i.e. only precursors with cleavage sites at both CD-5p and CD-3p are selected, we made this choice to let our dataset be generated the in same way as dataset of PHDCleav) Table 2 Results on average specificity, sensitivity, accuracy, and MCC for both 5p and 3p arms by five-fold cross-validation using PHDCleav and LBSizeCleav (k = 1, · · · , 5) with window sizes 8, 10, 12, 14 based on sequences and secondary structures predicted by quikfold server All the pre-miRNAs are selected from human pre-miRNAs.
We used these cleavage sites as positive examples using windows of size 8,10,12,14 nt, where each window was selected so that a cleavage site is located at the center of the window, and we generated negative examples on the same sequence so that both centers of the positive and negative examples were 6 nt away from each other, as in the previous study [14]. This approach is based on the assumption that for most pre-miRNAs, the Dicer cleavage site is seldom selected at the center of the hairpin structure. In PHDCleav, two secondary structure predictors, quikfold [25] and RNAFold from ViennaRNA [26] were used, hence, we used both the RNAFold from ViennaRNA. and the quikfold server (version 3.0, http://mfold.rna. albany.edu/?q=DINAMelt/Quickfold) for prediction of RNA secondary structures. The results were given in Tables 2 and 3. Because in PHDCleav, the accuracy of prediction by nucleotide composition was worse than that by binary patterns, we compared our method with the binary patterns of PHDCleav. We used the libSVM 3.18 package [27] with the RBF kernel to utilize SVM because the RBF kernel was used in PHDCleav.
The performance of prediction methods was assessed by means of sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC), defined as follows: where TP, TN, FP, FN denote the number of true positive, true negative, false positive, and false negative results, respectively. We performed fivefold cross-validation, and used the average sensitivity, specificity, accuracy, and MCC. We examined size w of a window from 8 to 14 and the number k of ones in LBSizeCleav from 1 to 5 for 5p and 3p arms of pre-miRNAs. Table 2 shows the results of PHD-Cleav and LBSizeCleav (k = 1, · · · , 5) based on sequences and secondary structures predicted by the quikfold server. In terms of prediction performance in the 5p arm of pre-miRNA, the best result of PHDCleav corresponded to the accuracy of 84.0%, whereas LBSizeCleav at k = 5 achieved the accuracy of 87.4%. In addition, the values Table 4 Variances of specificity, sensitivity, accuracy, and MCC for both 5p and 3p arms by five-fold cross-validation using PHDCleav and LBSizeCleav (k = 1, · · · , 5) with window sizes 8, 10, 12, 14 based on sequences and secondary structures predicted by quikfold server  of prediction accuracy of LBSizeCleav at w = 12, 14 were higher than those of PHDCleav. As for prediction performance in the 3p arm of pre-miRNA, the best result of PHDCleav corresponded to the accuracy of 79.1%, whereas LBSizeCleav achieved the accuracy of 83.0%. Table 3 shows the results of PHDCleav and LBSizeCleav (k = 1, · · · , 5) based on sequences and secondary structures predicted by the RNAFold. In terms of prediction performance in the 5p arm of pre-miRNA, the best result of PHDCleav corresponded to the accuracy of 81.3%,  whereas LBSizeCleav at k = 1 achieved the accuracy of 82.2%. As for prediction performance in the 3p arm of pre-miRNA, the best result of PHDCleav corresponded to the accuracy of 82.9%, whereas LBSizeCleav achieved the accuracy of 85.1%.
To better evaluate the performance we also calculated the variance of each prediction result in Table 4. Figures 2  and 3 show the results of LBSizeCleav and PHDCleav on receiver-operator characteristic (ROC) curves at window size w = 14 in 5p and 3p arms. Judging by the performance evaluation, our newly developed method outperformed the binary patterns of PHDCleav; this finding was suggestive of efficiency of the feature representing the length of loop/bulge structures.
Since in our results, LBSizeCleav with parameters of w = 14, k = 5 outperforms the others, we selected these parameters as our parameters for prediction model. For an input sequence, we created a shift window of size 14 nt shifting from the 5p arm to the 3p arm. For each shift window we performed an SVM regression analysis using our model. Here we randomly selected 2 precursors from the dataset and showed the score of the extended binary pattern of PHDCleav and LBSizeCleav with k = 5. From the result we could see that although both tools have predicted the cleavage site correctly, LBSizeCleav predicted more true negatives than extended binary pattern of PHD-Cleav, which indicates a better performance in identifying negative patterns of LBSizeCleav (see in Fig. 4).
We also compared the performance of our tools with another state-of-art method, a recent published paper introduced an easy way named SGL (Simple Geometric Locater) to calculate the cleavage site of miRNA which  outperforms other methods. We generated a benchmark to compare our method as well as PHDCleav with SGL of prediction in CD-5p, which result is shown in Fig. 5. In this benchmark we selected the threshold (0.0) of LBSize-Cleav as well as PHDCleav and calculated the EAEs(End Absolute Error, the absolute error of the predicted minus the true position for a specific duplex end) from the true cleavage site and compared it with the SGL method. From the result we could see that although at high EAEs PHDCleav outperforms LBSizeCleav, LBSizeCleav outperforms both PHDCleav and SGL at EAE 1, which indicates that LBSizeCleav predicted less false positives than PHDCleav.

Discussion
There were several pre-miRNAs, such as pre-mir221, pre-mir138-1, and pre-mir-15a, that were identified by LBSizeCleav but were not identified by PHDCleav in the prediction results from the 5p arm of a pre-miRNA with a shift window of 14 nt (see Fig. 6). By comparing these three pre-miRNAs, we found that all of them contain a part of loop/bulge structures that is more than 1 nt long in their mature parts. This result indicates that the length of a loop/bulge structure is an important determinant of a cleavage site. Careful analysis revealed that pre-mir221 and pre-mir138-1 contain their loop/bulge structures in their bulge parts, whereas pre-mir-15a has its loop/bulge structure in its loop part, proving that both loop and bulge structures can affect the cleavage site selection. To further evaluate the effect of the length of a loop/bulge structure in affecting the cleavage site selection we calculated the number of pre-miRNAs which LBSizeCleav identified successfully while PHDCleav failed to identify and vice versa (see Table 5). From the result we could see that for the positive patterns our method performs almost the same as PHDCleav, but for negative patterns our method shows a significant improvement in accuracy. This result indicates that our method has a better resolution in identifying negative patterns.

Conclusions
In this study, we proposed a novel method-LBSizeCleav-for prediction of Dicer cleavage sites. By integrating information on the length of a loop/bulge structure of a pre-miRNA (as predicted by the quikfold server), we developed novel feature space mapping. We performed fivefold cross-validation for validated pre-miRNA sequences from miRBase. In both 5p and 3p arms, the proposed method showed better performance than did the binary patterns of PHDCleav. This study shows a new way of feature evaluation; moreover, the better performance of our method points to the effectiveness of analysis of loop/bulge length at detecting Dicer cleavage sites.