HuntMi: an efficient and taxon-specific approach in pre-miRNA identification
© Gudyś et al.; licensee BioMed Central Ltd. 2013
Received: 2 July 2012
Accepted: 21 February 2013
Published: 5 March 2013
Machine learning techniques are known to be a powerful way of distinguishing microRNA hairpins from pseudo hairpins and have been applied in a number of recognised miRNA search tools. However, many current methods based on machine learning suffer from some drawbacks, including not addressing the class imbalance problem properly. This may lead to overlearning the majority class and/or incorrect assessment of classification performance. Moreover, those tools are effective for a narrow range of species, usually the model ones. This study aims at improving the performance of the miRNA classification procedure, extending its usability and reducing computational time.
We present HuntMi, a stand-alone machine learning miRNA classification tool. We developed a novel method of dealing with the class imbalance problem called ROC-select, which is based on thresholding score function produced by traditional classifiers. We also introduced new features to the data representation. Several classification algorithms in combination with ROC-select were tested and random forest was selected for the best balance between sensitivity and specificity. Reliable assessment of classification performance is guaranteed by using large, strongly imbalanced, and taxon-specific datasets in 10-fold cross-validation procedure. As a result, HuntMi achieves a considerably better performance than any other miRNA classification tool and can be applied in miRNA search experiments in a wide range of species.
Our results indicate that HuntMi represents an effective and flexible tool for identification of new microRNAs in animals, plants and viruses. ROC-select strategy proves to be superior to other methods of dealing with class imbalance problem and can possibly be used in other machine learning classification tasks. The HuntMi software as well as datasets used in the research are freely available at http://lemur.amu.edu.pl/share/HuntMi/.
Keywords: MicroRNA, Random forest, Imbalanced learning, Genome analysis
MicroRNAs (miRNAs) are RNAs of ∼21 bases that post-transcriptionally control multiple biological processes, such as development, hematopoiesis, apoptosis and cell proliferation. Mature miRNAs are derived from longer precursors called pre-miRNAs that fold into hairpin structures containing one or more mature miRNAs in one or both arms. Their biogenesis is highly regulated at both transcriptional and post-transcriptional levels, and dysregulation of miRNAs is linked to various human diseases, including cancer.
Identification of miRNAs is a challenging task that allows us to better understand post-transcriptional regulation of gene expression. In the last ten years a number of experimental and computational approaches have been proposed to deal with the problem. However, experimental approaches, including direct cloning and Northern blot, are usually able to detect only abundant miRNAs. MicroRNAs that are expressed at very low levels or in a tissue- or stage-specific manner often remain undetected. These problems are partially addressed by deep-sequencing techniques, which nevertheless require extensive computational analyses to distinguish miRNAs from other non-coding RNAs or products of RNA degradation.
Computational approaches in miRNA search can be homology-based, take advantage of machine learning methods, or use both of these. Homology-based approaches rely on conservation of sequences, secondary structures or miRNA target sites (e.g. RNAmicro, MIRcheck). As a result, these methods are not suitable for detection of lineage- or species-specific miRNAs and miRNAs that evolve rapidly. Moreover, they are strongly limited by the current data and performance of available computational methods, including alignment algorithms. Another problem is that there are as many as ∼11 million sequences that can fold into miRNA-like hairpins in the human genome, some of which originate from functional, non-miRNA loci. It is therefore no surprise that a large number of hairpins that are conserved between species could be mistakenly classified as miRNAs. Nevertheless, homology search has been successfully applied in many miRNA gene predictions, in both animals and plants [10, 11].
In some approaches, e.g. PalGrade or miRDeep, experimental and computational procedures are combined. However, as mentioned above, experimental methods cannot easily detect low-expression or tissue-specific miRNAs and/or they have to meet computational challenges, as in the case of deep sequencing technology. miRDeep, for instance, aligns deep sequencing reads to the genome and selects the regions that can form a hairpin structure. Then, using a probabilistic model, the hairpins are scored based on the compatibility of the position and frequency of sequenced reads with the secondary structure of the pre-miRNA. This method achieves high specificity at the cost of relatively low sensitivity.
Machine learning methods are amongst the most popular ways of miRNA identification nowadays. They share the same overall strategy. First, the features of primary sequence and secondary structure are extracted from known miRNAs (positive set) and non-miRNA sequences (negative set). Then, the features are used to construct a model which serves to classify candidate sequences as real pre-miRNAs or pseudo pre-miRNAs. Several machine learning methods have been applied in the field of miRNA identification. These include hidden Markov models (HMM), random forest and the naïve Bayes classifier. Support vector machine (SVM), however, seems to be the most popular framework nowadays and has been used in a number of well recognised tools. For instance, Triplet-SVM classifies real human pre-miRNAs and pseudo pre-miRNAs using 32 structure- and sequence-derived features that refer to the dot-bracket representation of the secondary structure, i.e. it considers the frequencies of triplets, such as "A(((" and "U.(.", consisting of the secondary structure of three adjacent nucleotides and the nucleotide in the middle. miPred distinguishes human pre-miRNAs from pseudo hairpins represented by twenty-nine folding features, using an SVM-based approach. The features were evaluated with the F scores F1 and F2 on the class-conditional distributions to assess their discriminative power. Strongly correlated attributes were rejected. microPred presents nineteen new features along with the twenty-nine taken from miPred. After feature selection, twenty-one attributes were used to train the classifier. The improved feature selection approach and addressing the class imbalance problem resulted in high sensitivity and specificity of the method.
However, the existing machine learning approaches suffer from some drawbacks. First of all, they often make structural assumptions concerning stem length, loop size and number as well as the minimum free energy (MFE). Secondly, most existing miRNA classifiers work well on data from model species and closely related ones; the classifiers trained on human data best fit the miRNA identification problem in human and other primates but perform unsatisfactorily when applied to, for example, invertebrates. Finally, the imbalance between the positive and negative classes is usually not addressed properly, although this is a crucial issue, as the number of microRNAs throughout a genome is much lower than the number of non-microRNAs (e.g. ∼1 400 miRNAs vs. ∼11 million pseudo hairpins in H. sapiens). The resulting difference in misclassification costs of the positive and negative classes requires special techniques of learning from imbalanced data as well as proper assessment metrics. Moreover, in order to accurately judge classifier performance in real-life applications, the problem of imbalance should be reflected in the testing datasets.
In this study we addressed all these issues. We made no preliminary assumptions about miRNA structure and carefully took into account class imbalance problem. We implemented a procedure of thresholding score function produced by traditional classifiers and called it ROC-select. This strategy turned out to be superior to other imbalance-suited techniques in miRNA classification. From all classifiers for which ROC-select procedure was applied we chose random forest as it yields the best balance between sensitivity and specificity. Regarding the data representation, we introduced seven new features and show that they further improve the classification performance. In the experiments we considered large and strongly imbalanced up-to-date sets of positive and negative examples, paying much attention to the data quality. The tests were performed using stratified 10-fold cross-validation (CV) giving reliable estimates of classification performance. Finally, we show that the method outperforms the existing miRNA classification tools, including microPred, without compromising the computational time.
Our miRNA classification method is freely available as a framework called HuntMi. HuntMi comes with trained models for animals, plants, viruses and separately for H. sapiens and A. thaliana. As a result, the tool can be used in miRNA classification experiments in a wide range of species. The user can use built-in models in the experiments or train new models using custom datasets prior to classification.
In order to create positive sets, we retrieved all pre-miRNAs from miRBase release 17 and filtered out the sequences lacking experimental confirmation. By using evidence-supported miRNAs only, we minimize the chance of introducing false positives into the set. The sequences were divided into five groups: H. sapiens, A. thaliana, animals, plants, and viruses.
Negative sets were extracted from genomes and mRNAs of ten animal and seven plant species as well as twenty-nine viruses (Additional file 1: Table S1). Additional sets were prepared for H. sapiens and A. thaliana. Start positions were randomly selected, whereas end positions were calculated so that the sequence length distribution in the resulting negative dataset is the same as in the corresponding positive one. With this approach, the classifier achieves better performance when applied in real-life experiments, where miRNA candidates tend to have lengths similar to those of known miRNAs. Finally, in order to remove known miRNAs together with similar sequences that possibly represent unknown homologs of miRNAs, we ran a BLASTN search against miRBase hairpins and filtered out sequences that produced an E-value of $10^{-2}$ or lower. 96.17% of negative sequences prepared in this way possess structural features of real pre-microRNAs, including a minimum free energy below -0.05 (normalised to the sequence length) and a number of pairings in the stem above 0.15 (also normalised to the length). At the same time these criteria are met by 97.61% of hairpins stored in miRBase.
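The length-matched sampling described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `sample_negatives`, its parameters and the use of a single source string are assumptions made for illustration, and the subsequent BLASTN and structural filtering steps are deliberately left out.

```python
import random

def sample_negatives(source, positive_lengths, n_samples, seed=0):
    """Draw n_samples subsequences whose lengths follow the positive-set lengths.

    Hypothetical helper: 'source' stands for one genomic or mRNA sequence,
    'positive_lengths' for the lengths of the known pre-miRNAs.
    """
    rng = random.Random(seed)
    # Keep only lengths that fit into the source sequence.
    usable = [length for length in positive_lengths if length <= len(source)]
    if not usable:
        raise ValueError("source sequence is shorter than every positive example")
    negatives = []
    for _ in range(n_samples):
        # Sampling from the empirical lengths reproduces their distribution.
        length = rng.choice(usable)
        # The start is random; the end position follows from the sampled length.
        start = rng.randrange(0, len(source) - length + 1)
        negatives.append(source[start:start + length])
    return negatives

# Toy usage with a synthetic "genome".
toy_genome = "ACGT" * 100
toy_negatives = sample_negatives(toy_genome, [60, 80, 100], n_samples=50)
```

Drawing lengths from the empirical list, rather than fitting a parametric distribution, is one simple way to make the negative length distribution follow the positive one.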
The twenty-one features selected by the microPred authors were used as a base representation in the experiments. Thus, we employed microPred scripts for extracting the necessary attributes. In the case of the microPred dataset we took precalculated features from the webpage to make our results comparable with the existing research (some of the features are calculated using randomly generated sequences).
Besides the twenty-one microPred features, we calculated seven additional sequence- and structure-related attributes. First, we considered the frequencies of secondary structure triplets composed of three adjacent nucleotides and the middle nucleotide. We chose four of them that were shown to have the highest information gain: "A(((", "U(((", "G(((", and "C(((", referred to as tri_A, tri_U, tri_G, and tri_C, respectively. The remaining features are: the maximal length of the amino acid string without stop codons found in three reading frames: orf; the cumulative size of internal loops found in the secondary structure: loops; the percentage of low complexity regions detected in the sequence using Dustmasker: dm (all Dustmasker settings were left at their defaults except for the score threshold for subwindows, which was changed from 20 to 15).
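As an illustration of the triplet and orf attributes, the following Python sketch computes them from an RNA sequence and its dot-bracket structure. It is a sketch under stated assumptions rather than the HuntMi implementation: the function names are hypothetical, ")" is treated like "(" as in Triplet-SVM, triplet counts are normalised by the number of interior positions, and the sequence is assumed to use the RNA alphabet (U instead of T).

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def triplet_features(seq, structure):
    """Frequencies of 'A(((', 'U(((', 'G(((' and 'C(((' over interior positions."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    # Assumption: only paired vs. unpaired matters, so ')' is mapped to '('.
    paired = structure.replace(")", "(")
    counts = {"A": 0, "U": 0, "G": 0, "C": 0}
    n_triplets = max(len(seq) - 2, 0)
    for i in range(1, len(seq) - 1):
        # Structure of three adjacent nucleotides plus the middle nucleotide.
        if paired[i - 1:i + 2] == "(((" and seq[i] in counts:
            counts[seq[i]] += 1
    return {
        "tri_" + nt: (counts[nt] / n_triplets if n_triplets else 0.0)
        for nt in "AUGC"
    }

def orf_feature(seq):
    """Longest run of non-stop codons (in codons) over the three reading frames."""
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0  # a stop codon ends the current amino acid string
            else:
                run += 1
                best = max(best, run)
    return best

toy_triplets = triplet_features("GGGAAACCC", "(((...)))")
toy_orf = orf_feature("AUGGCCUAAGGG")
```

In the toy hairpin only the middle G of the 5' arm and the middle C of the 3' arm sit inside a fully paired triplet, so tri_G and tri_C are both 1/7 while tri_A and tri_U are 0.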
Extensive research on imbalanced data classification has shown that standard machine learning techniques often overlearn the majority class, sacrificing minority examples. Therefore, special approaches for imbalanced problems have been developed. They can be divided into sampling methods, cost-sensitive learning, kernel methods, active learning and others. The microPred authors carried out an exhaustive study of how several of the above classification strategies perform in a microRNA prediction task. They used a standard support vector machine as a base classifier and combined it with random over/under-sampling, SMOTE (which is also a representative of sampling methods) and a multi-classifier system. They additionally tested cost-sensitive SVM modifications like zSVM and DEC (different error costs), finding SMOTE to be the best strategy. In that research, the geometric mean ($G_m$) of classification sensitivity (SE) and specificity (SP) was used as an assessment metric. $G_m$ is common in imbalanced learning problems, including miRNA identification, as it takes into account unequal misclassification costs. Therefore, we also decided to use $G_m$ in the HuntMi study.
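A short Python sketch shows why the geometric mean of sensitivity and specificity suits imbalanced data. The function name and the confusion-matrix counts are illustrative assumptions, not values from the study.

```python
import math

def geometric_mean(tp, fn, tn, fp):
    """Return (G_m, SE, SP) from confusion-matrix counts.

    Assumes both classes are present, i.e. tp + fn > 0 and tn + fp > 0.
    """
    se = tp / (tp + fn)  # sensitivity: fraction of positives recovered
    sp = tn / (tn + fp)  # specificity: fraction of negatives rejected
    return math.sqrt(se * sp), se, sp

# A classifier that labels everything as negative on a 1:100 set
# is about 99% accurate, yet its geometric mean is zero.
gm_trivial, _, _ = geometric_mean(tp=0, fn=100, tn=10000, fp=0)

# A balanced classifier with SE = SP = 0.9 scores G_m = 0.9.
gm_balanced, _, _ = geometric_mean(tp=90, fn=10, tn=9000, fp=1000)
```

Because the two rates are multiplied, a high score on the majority class cannot compensate for a collapse on the minority class.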
Our approach to microRNA prediction relies on the fact that classification with unequal costs is equivalent to thresholding conditional class probabilities at arbitrary quantiles. Many classifiers provide a continuous score function s(x) describing the degree of membership of an instance x in a particular class. Ideally, such a function perfectly estimates the class conditional probability P(c|x), in which case it is called a well-calibrated score function. In reality, classifiers produce scores which are often not calibrated, and so many algorithms for calibrating them have been developed. In addition, many meta learning techniques like bagging or classifier ensembles can be employed to produce a score function on the basis of class labels alone. As long as the scoring function ranks instances properly, that is s(x)<s(y)⇔P(c|x)<P(c|y), one can successfully use s(x) directly to classify instances with unequal costs.
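The equivalence between unequal costs and probability thresholding can be made explicit with a short derivation. The cost symbols $c_{FP}$ and $c_{FN}$ and the abbreviation $p$ are notation introduced here for illustration; correct decisions are assumed to cost nothing.

```latex
% Let p = P(c \mid x) be the probability that x belongs to the positive class c.
% Expected cost of each decision:
%   predict positive: (1 - p)\, c_{FP}
%   predict negative: p\, c_{FN}
\[
  (1 - p)\, c_{FP} \le p\, c_{FN}
  \quad\Longleftrightarrow\quad
  p \ge \frac{c_{FP}}{c_{FP} + c_{FN}}
\]
% Hence predicting "positive" minimises the expected cost exactly when
% P(c \mid x) reaches the threshold c_{FP} / (c_{FP} + c_{FN}).
```

If s(x) preserves the ordering of P(c|x), every such probability threshold corresponds to some threshold on s(x), which is why an uncalibrated but correctly ranking score can be thresholded directly.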
Our method combines the idea of thresholding the classifier score function with receiver operating characteristics (ROC). For each threshold value T applied to the s(x) function, a point in ROC space can be generated. Varying T from −∞ to +∞ produces the entire ROC curve. One can select the point on it with the highest value of the evaluation metric ($G_m$ in this case) and read off the corresponding T value. In real applications ROC curves are generated by simply sorting the elements of a dataset by s(x) values and updating the true positive (TP) and false positive (FP) statistics for consecutive points. In order to prevent the threshold selection procedure from overfitting to the training data, a separate set should be used for constructing the ROC curve. Hence, an internal cross-validation with k1 folds is employed for this purpose. As we are not interested in variance, ROC curves are averaged in a straightforward way: instances from all tuning folds together with their assigned s(x) values are gathered in a single set to which the ROC generation procedure is applied. The threshold leading to the highest value of the evaluation metric is stored and used for classification of unknown instances. The threshold selection procedure described above will be referred to as ROC-select.
In this research we apply ROC-select only to classifiers that directly provide a scoring function; no meta learning techniques were examined. These classifiers are naïve Bayes, multilayer perceptron, support vector machine and random forest. We used the radial basis function as the SVM kernel as it is known to produce the best classification results in a wide range of applications. In order to compare the proposed strategy with other methods, we additionally tested the SMOTE filter combined with SVM, as it gave the best results in the microPred experiments, and a novel method of asymmetric partial least squares classification (APLSC), which turned out to be superior to other strategies on several strongly imbalanced datasets.
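The threshold selection step can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the Weka plug-in itself: the function name `roc_select` is hypothetical, labels are assumed to be 1 (miRNA) and 0 (pseudo hairpin), and the scores are assumed to have been pooled from the held-out folds of the internal cross-validation.

```python
import math

def roc_select(scores, labels):
    """Return (threshold, best_gm); instances scoring >= threshold are positive."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    # Sort by decreasing score; lowering the threshold admits instances one by one.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_gm, best_threshold = 0.0, math.inf  # +inf: nothing is positive, G_m = 0
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Evaluate a ROC point only after all instances tied at this score.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        gm = math.sqrt((tp / n_pos) * (1 - fp / n_neg))
        if gm > best_gm:
            best_gm, best_threshold = gm, score
    return best_threshold, best_gm

pooled_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
pooled_labels = [1, 1, 0, 1, 0, 0, 0, 0]
threshold, gm = roc_select(pooled_scores, pooled_labels)
```

In this toy set the default threshold of 0.5 is not examined at all; the sweep settles on 0.6, where all three positives are recovered at the price of one false positive.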
Parameter selection and complexity analysis
In many studies, including those on microRNA prediction, classifier parameters are selected in order to obtain the best possible results for a particular domain. Hence, we decided to place a parameter tuning phase in our pipeline as a step preceding threshold selection. Parameter selection is also done with an internal cross-validation, with the number of folds equal to k2, and is straightforward. First, a search space is defined by specifying a number of discrete values for each parameter to be tuned. Then, a full cross-validation procedure is performed for each point in that space. The combination of parameter values leading to the highest average evaluation metric ($G_m$) is stored and used in threshold selection and, finally, for classification of unknown instances.
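The tuning phase can be sketched as a plain grid search. The code below is a toy illustration only: `tune_parameters`, `train_fn` and `predict_fn` are hypothetical names, the folds are interleaved rather than stratified for brevity, and the one-parameter cut-off "classifier" merely stands in for a real learner.

```python
import itertools
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (0 if a class is absent)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.0
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return math.sqrt((tp / n_pos) * (tn / n_neg))

def tune_parameters(X, y, grid, train_fn, predict_fn, k=5):
    """Return (best_params, best_mean_gm) over all points of the grid."""
    folds = [list(range(i, len(X), k)) for i in range(k)]  # interleaved folds
    names = sorted(grid)
    best_params, best_gm = None, -1.0
    # One full k-fold cross-validation per point of the parameter space.
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_gms = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(X)) if i not in held_out]
            model = train_fn([X[i] for i in train_idx],
                             [y[i] for i in train_idx], **params)
            preds = predict_fn(model, [X[i] for i in test_idx])
            fold_gms.append(g_mean([y[i] for i in test_idx], preds))
        mean_gm = sum(fold_gms) / len(fold_gms)
        if mean_gm > best_gm:
            best_params, best_gm = params, mean_gm
    return best_params, best_gm

# Toy stand-in learner: the "model" is just the cut-off value itself.
def train_cut(X, y, cut):
    return cut

def predict_cut(model, X):
    return [1 if x >= model else 0 for x in X]

toy_X = list(range(20))
toy_y = [0] * 10 + [1] * 10
best, best_gm = tune_parameters(toy_X, toy_y, {"cut": [5, 10, 15]},
                                train_cut, predict_cut, k=5)
```

The number of model fits equals the number of grid points times k, which is why the size of the search space dominates the running time of this phase.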
Let us denote the number of points in the parameter space to be examined as λ. In addition, let L(n) and T(n) indicate the time complexities of the training and testing procedures for a given classifier with respect to the dataset size n. ROC-select and parameter tuning are performed in $O(k_1(L(n(k_1-1)/k_1)+T(n/k_1)))$ and $O(\lambda k_2(L(n(k_2-1)/k_2)+T(n/k_2)))$ time, respectively. As $(k-1)/k<1$, the entire procedure is bounded by the expression $O((k_1+\lambda k_2)(L(n)+T(n)))$.
All classification experiments were carried out using stratified 10-fold CV, hence the class distributions of the testing samples are the same as for the entire datasets. Taking into account the strong imbalance of the examined sets, the obtained results approximate well the expected performance of a classifier in practical applications. Additionally, 10-fold CV has been shown to be the best method of model evaluation in terms of bias and variance.
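The effect of stratification can be seen in a small Python sketch. The helper `stratified_folds` is a hypothetical illustration, not the Weka routine used in the experiments; it deals the shuffled members of each class round-robin across the folds.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign instance indices to k folds so that each fold mirrors the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Round-robin dealing keeps per-class counts within one of each other.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# 30 positives among 1000 instances: every fold receives 3 positives and 97 negatives.
toy_labels = [1] * 30 + [0] * 970
toy_folds = stratified_folds(toy_labels, k=10)
```

With a plain random split, a fold of 100 instances could by chance contain no positives at all, which would make sensitivity undefined for that fold.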
The detailed configuration of examined classifiers together with parameter values tested in a tuning phase are listed below (number of points in a parameter space for tuning phase given in parentheses). Parameters not mentioned here remained default.
naïve Bayes: kernel estimation turned on,
multilayer perceptron: validation set size V=20%, validation threshold E=50, learning rate η=0.1,0.2,…,0.5, momentum μ=0.1,0.2,…,0.5 (λ=25),
SVM: feature normalization turned on, cost $C=10^{-2},10^{-1},\ldots,10^{2}$, exponent in radial basis kernel $\gamma=2^{-2},2^{-1},\ldots,2^{2}$ (λ=25),
random forest: number of trees i=10,21,…,219 (λ=20),
APLSC: number of dimensions d=5,10,15,20 (λ=4).
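The sizes of the search spaces listed above can be checked by enumerating the grids. In this Python sketch the variable names are illustrative, and the step of 11 trees in the random forest grid is an inference from the stated endpoints (10 and 219) and λ=20 rather than something the text states outright.

```python
import itertools

# Multilayer perceptron: learning rate x momentum, each 0.1, 0.2, ..., 0.5.
rates = [0.1, 0.2, 0.3, 0.4, 0.5]
mlp_grid = list(itertools.product(rates, rates))

# SVM: cost C = 10^-2 ... 10^2 and gamma = 2^-2 ... 2^2.
svm_grid = list(itertools.product([10.0 ** e for e in range(-2, 3)],
                                  [2.0 ** e for e in range(-2, 3)]))

# Random forest: number of trees 10, 21, ..., 219 (assumed step of 11).
rf_grid = list(range(10, 220, 11))

# APLSC: number of dimensions.
aplsc_grid = [5, 10, 15, 20]

grid_sizes = {
    "multilayer perceptron": len(mlp_grid),
    "SVM": len(svm_grid),
    "random forest": len(rf_grid),
    "APLSC": len(aplsc_grid),
}
```

The resulting sizes (25, 25, 20 and 4) agree with the λ values given in the list.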
Preliminary experiments on the naïve Bayes classifier confirmed that kernel estimation improves classification results, so this feature was turned on. The validation threshold parameter in a multilayer perceptron indicates how many times in a row the validation set error can increase before training is terminated. Early tests showed that introducing validation with this stop condition does not influence classification results but significantly reduces training time, therefore we decided to use it in our research. The SMOTE filter was configured to balance positive and negative sets perfectly. SVM parameters in the SMOTE + SVM combination were tuned with a wider range of values, that is $C=10^{-2},10^{-1},\ldots,10^{3}$ and $\gamma=2^{-2},2^{-1},\ldots,2^{4}$ (λ=42). The authors of microPred used a more exhaustive scanning strategy; however, it is inapplicable to larger problems because of the computational overhead. Hence, we limited the search space to cover the parameter values selected most commonly in preliminary experiments. The geometric mean ($G_m$) was chosen as the evaluation metric to be maximised. The numbers of folds, k1 and k2, were set to 10 and 5, respectively. We decided to use 5-fold CV in the parameter tuning because it allowed us to reduce analysis times almost by half with respect to 10-fold CV (parameter tuning dominates the other stages in terms of computation time), while yielding only slightly inferior results. This approach follows microPred, which also used 5-fold CV for parameter tuning.
The ROC-select strategy described in the paper was prepared as a plug-in to the Weka package, which had been chosen as the basic environment for all classification experiments. It provided us with implementations of naïve Bayes, multilayer perceptron, random forest and the SMOTE filter. The Weka interface for LibSVM was used for support vector machine experiments. The original APLSC code written in MATLAB was wrapped in a Java class and also attached to Weka as a plug-in.
Results and discussion
Table: Relative gains in classification results
Table: Detailed classification results
Applying the ROC-select procedure to traditional classifiers (variant IV) balances their sensitivity and specificity, significantly improving $G_m$ values (except for naïve Bayes, for which the gains are moderate). The best results were on average obtained for random forest, which beats SMOTE + SVM and APLSC on all datasets. However, multilayer perceptron and SVM also outperformed the imbalance-suited methods in the majority of cases. The conclusion is twofold: (1) the score function returned by the examined classifiers properly ranks instances with respect to the conditional class probability, (2) the ROC-select procedure successfully applies this knowledge to solve the imbalanced classification problem.
Another interesting observation comes from the comparison of the imbalance-suited strategies, that is SMOTE + SVM and APLSC. Our experiments confirm previous findings that APLSC is superior to SMOTE. This is especially visible in large and highly imbalanced sets like human or plant. We explain this by the fact that SMOTE is able to produce only a limited number of informative examples. Above some threshold value, synthetically generated instances introduce only noise. An important observation is that APLSC seems to be the only classifier which is biased towards the minority class (sensitivity is always higher than specificity), which may be a useful feature in some applications.
If one analyses absolute results for particular datasets, it becomes clear that animal sets (human and animal) are harder to classify than plant sets (arabidopsis and plant), even though they are more balanced. This is probably caused by the fact that plant miRNAs are better separated from non-miRNAs in the attribute space, hence they are easier to distinguish. The worst absolute results in terms of $G_m$ were observed for the microPred dataset. We explain this by the low quality of this set (miRBase 12 was known to contain some false positives removed in later releases) and the lack of experimental evidence-based filtering.
One should remember that training times are influenced not only by the classification method itself, but also by the number of points in the parameter space to be analysed in the tuning stage. In the case of the naïve Bayes classifier no parameters were tuned, thus it was the fastest classifier in the comparison (training times from seconds to minutes). For the other classifiers undergoing the ROC-select procedure, 20-25 points were evaluated. For smaller sets, training times obtained by multilayer perceptron, random forest and SVM were similar (tens of minutes). For larger sets support vector machines scaled worse than their competitors (a few dozen hours vs. hours). In the case of the SMOTE + SVM strategy, 42 points were checked (except for the animal set, in which only 25 points were examined). It is important to keep in mind that the original microPred included a more exhaustive, and thus more time-consuming, parameter tuning strategy. Limiting the search space did not, however, prevent SMOTE + SVM from being the slowest strategy in our experiments. In the case of the plant and animal datasets a single training run took more than ten days, which makes the microPred strategy inapplicable to larger problems. In contrast, the APLSC classifier (4 points in the parameter space) was very fast.
Eventually, we decided to use random forest combined with ROC-select as a basic strategy in HuntMi package due to its superior classification results and reasonable computation time.
Feature selection results
Comparison with other tools
Comparison with other tools: animal species
Several studies on improving microPred have been carried out. They exploited techniques such as sample selection or genetic algorithm-based feature selection [37, 38], resulting in very high values of $G_m$ (up to 99). All these methods were, however, evaluated on balanced subsets of the microPred dataset, and some of them suffered from important methodological flaws, such as the lack of a random split of data into training and testing sets and, more importantly, the inclusion of training sequences in the testing set. Therefore, the reported results do not accurately estimate the performance of the presented strategies in real miRNA identification problems. In addition, these methods are not available as ready-to-use packages.
Another strategy, MiRenSVM, employed SVM ensembles for miRNA classification. It was tested on a moderately imbalanced dataset (697 human miRNAs, 5 428 pseudo hairpins) with 3-fold CV, resulting in $G_m$=94.76. This value is very similar to the one obtained by HuntMi on the microPred dataset, which consisted of the same positive examples and 50% more negatives. MiRenSVM was also tested on a set of 5 238 animal miRNAs, successfully identifying 92.84% of them. As no negative sequences were included, the specificity of the method is unknown. In our experiments, HuntMi was examined on a set consisting of 7 053 animal miRNAs and 218 154 pseudo hairpins. It outperformed MiRenSVM, giving a sensitivity of 94.92% and a specificity of 96.60%. As MiRenSVM is not available as a tool, we were not able to compare its performance with HuntMi on miRNAs introduced in the latest builds of miRBase.
Comparison with other tools: plant species
Based on obtained results, all the plant species examined by HuntMi can be divided into two groups. In the first group (A. thaliana, C. melo, G. max, M. domestica, N. tabacum, P. trichocarpa, S. bicolor) the classification sensitivity varied from 88.41% to 99.51% and is clearly superior to the performance of PlantMiRNAPred. The second group (H. vulgare, M. truncatula and O. sativa) was characterised by much lower sensitivity (35.56% to 72.67%). Two of the latter species belong to monocotyledons, which could suggest that our tool is inefficient when analysing sequences from this plant group. However, we obtained satisfactory sensitivity for S. bicolor (94.64%). This encouraged us to look closer at microRNAs from low-sensitivity group and we discovered that a large fraction of miRNAs in these species do not meet commonly recognised criteria for annotation of plant miRNAs e.g. in the case of osa-MIR5489, osa-MIR5484, hvu-MIR6177, hvu-MIR6182, mtr-MIR5741d and some other miRNAs the mature microRNA lies outside the stem part of the hairpin. Additionally, most of new miRNAs were discovered using deep sequencing approach only, where it is sometimes only one or several reads that support the miRNA (e.g. osa-MIR5527). This data is insufficient to confirm that the miRNA is precisely excised from the stem. Similarly to HuntMi, PlantMiRNAPred produces unsatisfactory results when applied to H. vulgare or O. sativa miRNAs (sensitivities of 56% and 61%).
To sum up, in the majority of cases HuntMi was able to obtain better results than its competitors even though it was evaluated on larger and more imbalanced datasets. Experiments on animal and plant miRNAs introduced in releases 18-19 of miRBase confirmed that HuntMi outperforms other tools like microPred and PlantMiRNAPred. There are methods reporting higher $G_m$ values than HuntMi. However, they were all tested on balanced datasets, often with important methodological flaws, which prevents a proper judgement of their performance in real-life tasks. Moreover, none of these methods is available as a ready-to-use package.
In this study we present a new machine learning-based miRNA identification package called HuntMi. It exploits ROC-select, a special strategy of thresholding the score function output by classifiers, combined with random forest, which we find to produce the best classification results. The twenty-one features employed by the microPred software together with seven new attributes are used as the data representation. The method was tested on large and strongly imbalanced datasets using a stratified 10-fold cross-validation procedure. Classification performance was further verified on miRNAs newly introduced in the latest builds of miRBase. As a result, HuntMi clearly outperforms state-of-the-art miRNA hairpin classification tools like microPred and PlantMiRNAPred without compromising the training time.
HuntMi comes with $G_m$-optimised models for H. sapiens, A. thaliana, animals, plants and viruses. It is also possible to train a model on any dataset and subsequently use it in classification analysis. This feature may be useful if one is interested in predicting miRNAs in a particular species or in applying an optimization criterion other than $G_m$ in the ROC-select procedure. Therefore, HuntMi offers the highest flexibility of all existing microRNA classification packages.
APLSC: Asymmetric partial least squares classification
HMM: Hidden Markov model
MFE: Minimum free energy
ROC: Receiver operating characteristic
SMOTE: Synthetic minority over-sampling technique
SVM: Support vector machine.
This work was supported by the European Social Fund grant UDA-POKL.04.01.01-00-106/09 to AG; National Science Centre grant 2011/01/N/NZ2/01653 to MWS; National Science Centre grant 2011/01/B/ST6/06868 to AG, MWS, IM; National Science Centre grant DEC-2011/01/D/ST6/07007 to MS; Faculty of Biology at AMU grant PBWB-08/2011 to MWS. We wish to thank Adam Adamarek for proofreading the manuscript.
- Laganá A, Forte S, Giudice A, Arena MR, Puglisi PL, Giugno R, Pulvirenti A, Shasha D, Ferro A: MiRó: a MiRNA knowledge base. Database (Oxford) 2009. 10.1093/database/bap008Google Scholar
- Cai X, Hagedorn CH, Cullen BR: Human MicroRNAs are processed from capped, polyadenylated transcripts that can also function as MRNAs. RNA 2004, 10: 1957-1966. 10.1261/rna.7135204PubMed CentralView ArticlePubMedGoogle Scholar
- Davis-Dusenbery BN, Hata A: Mechanisms of control of MicroRNA Biogenesis. J Biochem 2010, 148: 381-392.PubMed CentralPubMedGoogle Scholar
- Brabletz S, Bajdak K, Meidhof S, Burk U, Niedermann G, Firat E, Wellner U, Dimmler A, Faller G, Schubert J, Brabletz T: The ZEB1/miR-200 feedback loop controls notch signalling in cancer Cells. EMBO J 2011, 30: 770-782. 10.1038/emboj.2010.349PubMed CentralView ArticlePubMedGoogle Scholar
- Friedländer MR, Chen W, Adamidi C, Maaskola J, Einspanier R, Knespel S, Rajewsky N: Discovering MicroRNAs from deep sequencing data using MiRDeep. Nat Biotechnol 2008, 26: 407-415. 10.1038/nbt1394View ArticlePubMedGoogle Scholar
- Hertel J, Stadler PF: Hairpins in a haystack: recognizing MicroRNA precursors in comparative genomics data. Bioinformatics 2006, 22: 197-202. 10.1093/bioinformatics/btl257View ArticleGoogle Scholar
- Jones-Rhoades MW, Bartel DP: Computational identification of plant MicroRNAs and their targets, including a stress-induced MiRNA. Mol Cell 2004, 14: 787-799. 10.1016/j.molcel.2004.05.027View ArticlePubMedGoogle Scholar
- Ng KL, Mishra SK: De Novo SVM Classification of precursor MicroRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics 2007, 23: 1321-1330. 10.1093/bioinformatics/btm026View ArticlePubMedGoogle Scholar
- Bentwich I: Prediction and validation of MicroRNAs and their targets. FEBS Lett 2005, 579: 5904-5910. 10.1016/j.febslet.2005.09.040View ArticlePubMedGoogle Scholar
- Mhuantong W, Wichadakul D: MicroPC (microPC): a comprehensive resource for predicting and comparing plant MicroRNAs. BMC Genomics 2009, 10: 366. 10.1186/1471-2164-10-366PubMed CentralView ArticlePubMedGoogle Scholar
- Szczesniak M, Deorowicz S, Gapski J, Kaczynski L, Makalowska I: MiRNEST database: an integrative approach in microRNA search and annotation. Nucleic Acids Res 2012, 40(Database issue): D198-D204.
- Doran J, Strauss WM: Bio-informatic trends for the determination of miRNA-target interactions in mammals. DNA Cell Biol 2007, 26: 353-360. 10.1089/dna.2006.0546
- Kadri S, Hinman V, Benos PV: HHMMiR: efficient de novo prediction of microRNAs using hierarchical hidden Markov models. BMC Bioinformatics 2009, 10(Suppl 1): S35. 10.1186/1471-2105-10-S1-S35
- Jiang P, Wu H, Wang W, Ma W, Sun X, Lu Z: MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res 2007, 35: W339-W344. 10.1093/nar/gkm368
- Yousef M, Nebozhyn M, Shatkay H, Kanterakis S, Showe LC, Showe MK: Combining multi-species genomic data for microRNA identification using a Naïve Bayes classifier. Bioinformatics 2006, 22: 1325-1334. 10.1093/bioinformatics/btl094
- Xue C, Li F, He T, Liu GP, Li Y, Zhang X: Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics 2005, 6: 310. 10.1186/1471-2105-6-310
- Batuwita R, Palade V: MicroPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 2009, 25: 989-995. 10.1093/bioinformatics/btp107
- Kozomara A, Griffiths-Jones S: miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res 2011, 39(Database issue): D152-D157.
- Xuan P, Guo M, Liu X, Huang Y, Li W, Huang Y: PlantMiRNAPred: efficient classification of real and pseudo plant pre-miRNAs. Bioinformatics 2011, 27: 1368-1376. 10.1093/bioinformatics/btr153
- Chawla NV, Japkowicz N, Kotcz A: Editorial: special issue on learning from imbalanced data sets. SIGKDD Expl 2004, 6: 1-6.
- He H, Garcia EA: Learning from imbalanced data. IEEE Trans Know and Data Eng 2009, 21: 1263-1284.
- Mease D, Wyner AJ, Buja A: Boosted classification trees and class probability/quantile estimation. J Mach Learn Res 2007, 8: 409-439.
- Zadrozny B, Elkan C: Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of KDD 2002. New York: ACM; 2002:694-699.
- Domingos P: MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of KDD 1999. New York: ACM; 1999:155-164.
- Fawcett T: An introduction to ROC analysis. Pattern Recogn Lett 2006, 27: 861-874. 10.1016/j.patrec.2005.10.010
- Duda RO, Hart PE: Pattern Classification and Scene Analysis. New York: Wiley; 1973.
- Rosenblatt F: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington: Spartan Books; 1962.
- Boser BE, Guyon IM, Vapnik VN: A training algorithm for optimal margin classifiers. In Proceedings of COLT 1992. ACM Press; 1992:144-152.
- Breiman L: Random forests. Mach Learn 2001, 45: 5-32. 10.1023/A:1010933404324
- Keerthi S, Lin CJ: Asymptotic behaviours of support vector machines with Gaussian kernel. Neural Comput 2003, 15: 1667-1689. 10.1162/089976603321891855
- Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP: SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res 2002, 16: 321-357.
- Qu HN, Li GZ, Xu WS: An asymmetric classifier based on partial least squares. Pattern Recogn 2010, 43: 3448-3457. 10.1016/j.patcog.2010.05.002
- Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of IJCAI 1995, Vol. 2. San Mateo: Morgan Kaufmann; 1995:1137-1143.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Expl 2009, 11: 10-18. 10.1145/1656274.1656278
- Demsar J: Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 2006, 7: 1-30.
- Han K: Effective sample selection for classification of pre-miRNAs. Genet Mol Res 2011, 10: 506-518. 10.4238/vol10-1gmr1054
- Wang Y, Chen X, Jiang W, Li L, Li W, Yang L, Liao M, Lian B, Lv Y, Wang S, Wang S, Li X: Predicting human microRNA precursors based on an optimized feature subset generated by GA-SVM. Genomics 2011, 98: 73-78. 10.1016/j.ygeno.2011.04.011
- Xuan P, Guo M, Wang J, Wang CY, Liu XY, Liu Y: Genetic algorithm-based efficient feature selection for classification of pre-miRNAs. Genet Mol Res 2011, 10: 588-603. 10.4238/vol10-2gmr969
- Ding J, Zhou S, Guan J: MiRenSVM: towards better prediction of microRNA precursors using an ensemble SVM classifier with multi-loop features. BMC Bioinformatics 2010, 11(Suppl 11): S35.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.