Dataset
Constructing positive and negative sample sets is essential to training a machine learning classifier. It is natural to take the already known miRNAs as the positive samples. The difficulty is deciding on the best negative samples for training the classifiers. Since the number of miRNAs in a given genome is unknown [19], it is not appropriate to randomly extract stem-loop sequences from the genomes. To achieve high specificity in the prediction of new candidate miRNAs, the negative examples should be highly similar to the miRNAs themselves. We collected negative samples in two ways: (1) using samples from the mRNA 3'-untranslated region (3'-UTR), since it has been shown that UTRdb contains no predicted hsa or aga miRNA sequences [23, 34]; (2) using non-miRNA ncRNAs recognized so far, taken from Rfam 9.1 [35] and other datasets. The resulting dataset covers two representative species, Homo sapiens (hsa) and Anopheles gambiae (aga), both of which have been well studied in previous research [23, 24, 36]. Construction of the dataset, including both training and testing samples, involves several steps. Figure 2 illustrates the process; each step is described as follows.
(1) 692 hsa and 52 aga pre-miRNA sequences in miRBase 12.0 were chosen to serve as the positive set.
(2) 9225 hsa and 92 aga 3'-UTR sequences in 3'-UTRdb (release 22.0) whose lengths range from 70 nt to 150 nt were chosen to form one part of the negative set.
(3) For hsa, an ncRNA dataset had already been collected by Griffiths-Jones [37] and was recently used in [32]; however, no comparable sncRNA dataset is currently available for aga. We therefore selected all 256 aga ncRNA sequences in Rfam 9.1, from which 68 sequences that were redundant or longer than 150 nt were removed. The remaining sequences form another part of the negative set and are listed in Additional file 2.
(4) 14 hsa and 14 aga new hairpin sequences in miRBase 13.0 were used to evaluate our miRenSVM system.
(5) Sequences with a similarity score higher than 0.90 were removed from the training set and the testing set, respectively, using the CD-HIT program [38]. The 27 selected testing sequences are summarized in Table S3 of the supplementary file.
(6) 21 sequences from 3'-UTRdb whose secondary structure could not be predicted by RNAfold [39] or UNAfold [40] were removed. Finally, we constructed a training set with 697 true pre-miRNA sequences and 5428 pseudo pre-miRNA sequences, and a testing set with 27 brand-new real pre-miRNA sequences. After secondary structure prediction, nearly 84% of the 5428 pseudo miRNA precursors were found to have secondary structures with multiple loops.
(7) The 27 animal genomes (not including hsa and aga) in miRBase 13.0 contain 5238 pre-miRNA sequences. We collected these sequences and used them to further evaluate the proposed miRenSVM approach.
Feature selection
The extraction of an appropriate set of features with which to train a classifier is one of the most challenging issues in machine learning-based classifier development. In our study, both hairpin secondary-structure and multi-loop structure features were considered. Concretely, we characterized a miRNA precursor by 65 local and global features that capture its sequence, secondary structure and thermodynamic properties. These features fall into the three groups described below.
32 triplet elements
Sequence and structure properties are characterized by the triplet structure-sequence elements proposed in [25]. In the predicted secondary structure, each nucleotide is in one of only two states, paired or unpaired, indicated by a bracket ('(' or ')') or a dot ('.'), respectively. In this work we do not distinguish between the two bracket orientations and use '(' for both; GU wobble pairs are allowed. For any 3 adjacent nucleotides, there are 8 possible structure compositions: '(((', '((.', '(..', '(.(', '.((', '.(.', '..(' and '...'. Considering the middle nucleotide among the 3, there are 32 (8×4) possible structure-sequence combinations, denoted as "U(((", "A((.", etc.
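To make the encoding concrete, the following sketch (our illustration, not the original implementation) converts a sequence and its predicted dot-bracket structure into the normalized 32-dimensional triplet vector; ')' is first mapped to '(' as described above.

```python
from itertools import product

# All 32 middle-nucleotide/structure-triplet combinations,
# e.g. "A(((", "U((.", ... ; ')' is mapped to '(' beforehand.
STRUCTS = ["".join(p) for p in product("(.", repeat=3)]   # 8 triplets
KEYS = [n + s for n in "ACGU" for s in STRUCTS]           # 32 features

def triplet_features(seq, dot_bracket):
    """Return the normalized 32-element triplet vector.

    seq         -- RNA sequence, e.g. "GGACU..."
    dot_bracket -- RNAfold-style structure of the same length
    """
    assert len(seq) == len(dot_bracket)
    struct = dot_bracket.replace(")", "(")   # do not distinguish '(' and ')'
    counts = dict.fromkeys(KEYS, 0)
    # Slide a 3-nt window; each feature is keyed by the middle nucleotide.
    for i in range(1, len(seq) - 1):
        key = seq[i] + struct[i - 1:i + 2]
        if key in counts:                    # skip non-ACGU characters
            counts[key] += 1
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in KEYS]
```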
15 base pair features
Several secondary-structure features have already been introduced by existing pre-miRNA classification methods [27, 32]. In this paper, we included 11 such features (G/C ratio, %C+G, dP, Avg_BP_Stem, Diversity, |A-U|/L, |G-C|/L, |G-U|/L, (A-U)/n_stems, (G-C)/n_stems, (G-U)/n_stems) in miRenSVM. Furthermore, to identify real miRNA precursors with multi-loop structures, we used four new features related to the number of loops in the predicted secondary structure. They are:
♦ dP/n_loops, where n_loops is the number of loops in the secondary structure.
♦ %(A-U)/n_loops, %(G-C)/n_loops, %(G-U)/n_loops, where %(X-Y) is the ratio of X-Y base pairs in the secondary structure.
These features were extracted using the RNAfold program in the Vienna RNA package (version 1.8.3) [39] with default parameters.
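As an illustration of how such loop-normalized features might be computed, the sketch below folds a sequence with RNAfold, recovers the base pairs, and derives the four features. Counting n_loops as the number of hairpin loops and taking dP as base pairs per nucleotide are our assumptions, not definitions from the paper.

```python
import re
import subprocess

def rnafold(seq):
    """Fold `seq` with RNAfold (default parameters) and return
    (dot_bracket, mfe). Assumes RNAfold is on the PATH; --noPS
    suppresses the PostScript plot (flag name per Vienna 2.x)."""
    out = subprocess.run(["RNAfold", "--noPS"], input=seq,
                         capture_output=True, text=True).stdout
    line = out.splitlines()[1]
    structure = line.split(" ", 1)[0]
    mfe = float(re.search(r"\((\s*-?\d+\.?\d*)\)\s*$", line).group(1))
    return structure, mfe

def loop_features(seq):
    """dP/n_loops and %(X-Y)/n_loops for one sequence."""
    structure, _ = rnafold(seq)
    # Recover base pairs by matching brackets with a stack.
    stack, pairs = [], []
    for i, c in enumerate(structure):
        if c == "(":
            stack.append(i)
        elif c == ")":
            pairs.append((stack.pop(), i))
    # Hairpin loops: maximal dot runs directly enclosed by a pair.
    n_loops = len(re.findall(r"\(\.+\)", structure)) or 1
    kinds = {"AU": 0, "CG": 0, "GU": 0}   # order-insensitive pair types
    for i, j in pairs:
        key = "".join(sorted(seq[i] + seq[j]))
        if key in kinds:
            kinds[key] += 1
    n_bp = len(pairs) or 1
    dP = len(pairs) / len(seq)            # base pairs per nucleotide
    return {"dP/n_loops": dP / n_loops,
            "%(A-U)/n_loops": kinds["AU"] / n_bp / n_loops,
            "%(G-C)/n_loops": kinds["CG"] / n_bp / n_loops,
            "%(G-U)/n_loops": kinds["GU"] / n_bp / n_loops}
```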
18 thermodynamic features
It has been shown that secondary structure alone is not sufficient for effective miRNA prediction [30]. Since miRNA precursors usually have lower minimum free energy (MFE) than other small ncRNAs and random short sequences, MFE-related features (dG, MFEI1, MFEI2, MFEI3, MFEI4, Freq) were introduced. Another 8 global thermodynamic features (NEFE, Diff, dH, dS, Tm, dH/L, dS/L, Tm/L) and 4 statistically significant features (p-value_MFE, p-value_EFE, z-score_MFE, z-score_EFE) were chosen from previous research [23, 24, 36]. To evaluate the statistical features related to MFE and ensemble free energy (EFE), 300 random sequences were generated for each original sequence with Sean Eddy's squid program [30]. dH, dS, Tm, dH/L, dS/L and Tm/L were calculated with UNAfold 3.7. More details on all 65 features are provided in Additional file 1.
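The statistical features can be illustrated as follows: the MFE of each sequence is compared against an empirical null distribution obtained from shuffled variants. The sketch below uses a simple mononucleotide shuffle in place of the squid program, and `mfe_fn` stands for any folding routine (e.g. a wrapper around RNAfold as sketched earlier); both are illustrative choices.

```python
import random
import statistics

def mfe_statistics(seq, mfe_fn, n=300, seed=0):
    """z-score and empirical p-value of a sequence's MFE against
    `n` shuffled versions of the same sequence.

    mfe_fn -- callable returning the MFE of a sequence.
    """
    rng = random.Random(seed)
    mfe = mfe_fn(seq)
    null = []
    for _ in range(n):
        shuffled = list(seq)
        rng.shuffle(shuffled)                 # mononucleotide shuffle
        null.append(mfe_fn("".join(shuffled)))
    mu = statistics.mean(null)
    sigma = statistics.stdev(null) or 1e-9    # guard against zero spread
    z_score = (mfe - mu) / sigma
    # Fraction of shuffles at least as stable (as low in energy).
    p_value = sum(e <= mfe for e in null) / n
    return z_score, p_value
```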
We used the F-score to measure the discriminatory power of each feature above. The F-score is a simple technique that measures the discrimination between two sets of real numbers. Given training vectors $x_k$, $k = 1, \ldots, m$, with $n_+$ positive and $n_-$ negative instances, the F-score of the $i$th feature is defined as:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_+ - 1} \sum_{k=1}^{n_+} \left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_- - 1} \sum_{k=1}^{n_-} \left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

where $\bar{x}_i$, $\bar{x}_i^{(+)}$ and $\bar{x}_i^{(-)}$ are the average values of the $i$th feature over the whole, positive and negative data sets, respectively; $x_{k,i}^{(+)}$ is the $i$th feature of the $k$th positive instance, and $x_{k,i}^{(-)}$ is the $i$th feature of the $k$th negative instance. Larger F-scores indicate better discrimination [41]. All 65 local and global candidate features were ranked by F-score to determine which features would be used in the final model.
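This definition translates directly into a vectorized computation; a minimal sketch, assuming the positive and negative feature matrices are stored as NumPy arrays:

```python
import numpy as np

def f_scores(X_pos, X_neg):
    """F-score of every feature column (larger = more discriminative).

    X_pos -- array of shape (n_plus, d) holding positive instances
    X_neg -- array of shape (n_minus, d) holding negative instances
    """
    X_all = np.vstack([X_pos, X_neg])
    mean_all, mean_pos, mean_neg = (X.mean(axis=0)
                                    for X in (X_all, X_pos, X_neg))
    numer = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # var(ddof=1) is exactly 1/(n-1) * sum of squared deviations.
    denom = X_pos.var(axis=0, ddof=1) + X_neg.var(axis=0, ddof=1)
    return numer / denom

# Rank all 65 candidate features, most discriminative first:
# ranking = np.argsort(f_scores(X_pos, X_neg))[::-1]
```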
The miRenSVM approach
Support vector machine
At the core of miRenSVM is the support vector machine (SVM), a supervised classification technique derived from statistical learning theory and the structural risk minimization principle [42]. A support vector machine constructs a hyperplane or a set of hyperplanes in a high-dimensional space, which can be used for classification, regression and other tasks. SVMs have been adopted extensively as effective discriminative machine learning tools for the miRNA prediction problem [25, 27, 43]. Model selection for SVMs involves choosing a kernel function and the parameter values that yield optimal classification performance on a given dataset [44]. In our study, we used the radial basis function (RBF) kernel because of its reliability in finding good classification solutions in most situations. The SVM algorithm was implemented with the C++ libsvm package (version 2.89) [41], and the training process of miRenSVM follows the guidelines described in [45].
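For illustration, the following sketch performs the kind of RBF-kernel training with an exponential grid search over C and gamma that the libsvm guidelines [45] recommend; it is written against scikit-learn's libsvm-backed SVC rather than the C++ interface used in the paper, and the grid bounds and scoring metric are our choices.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_rbf_svm(X, y):
    """Grid-search C and gamma for an RBF-kernel SVM, following the
    usual libsvm practice of scaling features and searching an
    exponential grid."""
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {
        "svc__C":     [2 ** e for e in range(-5, 16, 2)],
        "svc__gamma": [2 ** e for e in range(-15, 4, 2)],
    }
    search = GridSearchCV(pipeline, grid, scoring="roc_auc", cv=3)
    search.fit(X, y)
    return search.best_estimator_
```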
SVM classifiers ensemble
One major factor that influences the performance of a machine learning system is class imbalance, that is, the examples of some classes heavily outnumbering those of the others [46]. Training a classifier on an imbalanced dataset results in poor classification performance, especially on the rare classes [47]. To be useful in real-world applications, a classifier should have good and balanced performance over all classes.
For miRNA gene detection, the imbalance issue is widely recognized [32]. Existing machine learning based methods either employ random under-sampling to choose a portion of representative examples or simply ignore the problem. It has already been shown that both random over-sampling and random under-sampling have drawbacks: the former adds no new information while incurring a large computation cost, and the latter discards information and thus leads to poor performance. A challenge therefore remains: for a given dataset, how should an appropriate sampling proportion be selected?
In this work, the training dataset contains 697 positive (real pre-miRNA) samples and 5428 negative (pseudo pre-miRNA) samples, a negative-to-positive ratio of 7.79:1. To address the drawbacks of over-sampling and under-sampling, we employed an SVM ensemble scheme that generates training subsets with the desired class distribution without removing any training sample or increasing the training time. An ensemble SVM classifier has several advantages over ordinary classifiers. First, it exploits the information of the entire dataset, whereas random under-sampling uses only part of it; at the same time, it consumes less computation than random over-sampling. Second, an ensemble SVM classifier is able to overcome some drawbacks of a single classifier: with multiple sub-classifiers, miRenSVM is more robust and is more likely to learn parameters close to the global optimum [42].
Figure 3 shows the ensemble scheme of miRenSVM. Here, we used the same strategies as in [48] to sample the training dataset. First, the negative examples are split into k partitions, where k is chosen between 1 and the ratio of the majority class size to the minority class size. Second, individual training subsets are generated by combining the positive set with each partition of the negative samples. Third, an SVM is trained independently on every subset, and finally all constituent SVMs are combined by some strategy to obtain the ensemble classifier. Majority vote is a widely used method for combining the results of several SVM sub-classifiers. In this paper, in addition to majority vote, we also used another technique called mean distance. Unlike majority vote, in the mean-distance scheme each sample is tested only once, by a single SVM sub-classifier. While training the classifiers, we computed the center vector of each training subset. To classify an unlabeled sample, the distance between the sample and each center vector is calculated, and the sample is labeled by the SVM sub-classifier whose center vector is nearest to the sample under test.
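A minimal sketch of this partition-train-combine procedure with the mean-distance rule is given below; the class name, the use of Euclidean distance, and the `base_factory` helper are our illustrative choices.

```python
import numpy as np

class MeanDistanceEnsemble:
    """Partition the negatives, train one SVM per subset, and route
    each test sample to the sub-classifier whose training-set center
    is nearest (the mean-distance scheme described above)."""

    def __init__(self, base_factory, k):
        self.base_factory = base_factory   # e.g. train_rbf_svm above
        self.k = k

    def fit(self, X_pos, X_neg):
        self.models, self.centers = [], []
        for X_neg_part in np.array_split(X_neg, self.k):
            X = np.vstack([X_pos, X_neg_part])
            y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_neg_part))]
            self.models.append(self.base_factory(X, y))
            self.centers.append(X.mean(axis=0))   # center of this subset
        self.centers = np.asarray(self.centers)
        return self

    def predict(self, X):
        # Distance from each sample to every training-subset center.
        d = np.linalg.norm(X[:, None, :] - self.centers[None], axis=2)
        nearest = d.argmin(axis=1)
        return np.array([self.models[m].predict(x[None])[0]
                         for m, x in zip(nearest, X)])
```

With the negative-to-positive ratio of 7.79:1 reported above, choosing k = 8, for instance, would pair roughly 679 negatives with the 697 positives in each subset, giving near-balanced training sets.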
Performance evaluation method and metrics
Outer 3-fold cross validation
We used the libsvm 2.89 package to build the miRenSVM classification system. The complete training dataset is randomly divided into three equally sized partitions, each with the same ratio of positive to negative samples. Any two partitions are then merged and used to train an SVM classifier, and the resulting model is tested on the third partition. This procedure is repeated three times with different combinations of training (two partitions) and testing (the remaining partition) data in an outer 3-fold cross-validation style, and the classification result is obtained by averaging the results of the three tests.
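A sketch of this procedure, assuming scikit-learn's StratifiedKFold for the class-ratio-preserving three-way split (the paper used libsvm directly):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def outer_cv(X, y, train_fn, score_fn, seed=0):
    """Outer 3-fold cross validation with class-ratio-preserving splits;
    returns the mean of `score_fn` over the three held-out partitions."""
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = train_fn(X[train_idx], y[train_idx])
        scores.append(score_fn(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores)
```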
A straightforward way to evaluate the performance of a classifier is through the confusion matrix, from which a number of widely used metrics can be computed, such as accuracy (Acc). However, Acc alone cannot measure the performance of a classifier precisely when class imbalance is present, as it does not reveal the true classification performance on the rare classes [47, 49]. Therefore, in addition to Acc, we also used sensitivity (SE) and specificity (SP) to evaluate classifier performance. In order to exploit both positive and negative samples as much as possible, we further used their geometric mean (Gm). In fact, we pay more attention to Gm than to the other three metrics, as Gm is an aggregated performance measure. These performance metrics are defined as follows:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{SE} = \frac{TP}{TP + FN}, \quad \mathrm{SP} = \frac{TN}{TN + FP}, \quad G_m = \sqrt{\mathrm{SE} \times \mathrm{SP}}$$

where TP, FP, TN and FN are the numbers of true positive predictions, false positive predictions, true negative predictions and false negative predictions, respectively.
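These definitions translate directly into code; a minimal helper:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Acc, SE, SP and their geometric mean Gm from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)        # sensitivity: fraction of positives recovered
    sp = tn / (tn + fp)        # specificity: fraction of negatives recovered
    gm = math.sqrt(se * sp)    # balanced, imbalance-aware aggregate
    return acc, se, sp, gm
```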