Volume 17 Supplement 18
Mirnacle: machine learning with SMOTE and random forest for improving selectivity in pre-miRNA ab initio prediction
© The Author(s) 2016
Published: 15 December 2016
MicroRNAs (miRNAs) are key gene expression regulators in plants and animals. Therefore, miRNAs are involved in several biological processes, making the study of these molecules one of the most relevant topics of molecular biology nowadays. However, characterizing miRNAs in vivo is still a complex task. As a consequence, in silico methods have been developed to predict miRNA loci. A common ab initio strategy to find miRNAs in genomic data is to search for sequences that can fold into the typical hairpin structure of miRNA precursors (pre-miRNAs). The current ab initio approaches, however, have selectivity issues, i.e., a high number of false positives is reported, which can lead to laborious and costly attempts to provide biological validation. This study presents an extension of the ab initio method miRNAFold, with the aim of improving selectivity through machine learning techniques, namely, random forest combined with the SMOTE procedure, which copes with imbalanced datasets.
By comparing our method, termed Mirnacle, with other important approaches in the literature, we demonstrate that Mirnacle substantially improves selectivity without compromising sensitivity. For the three datasets used in our experiments, our method achieved at least 97% sensitivity and delivered a two-fold, 20-fold, and 6-fold increase in selectivity, respectively, compared with the best results of current computational tools.
Extending miRNAFold with machine learning techniques significantly increases selectivity in pre-miRNA ab initio prediction, which contributes to advanced studies on miRNAs by reducing the need for biological validation. Hopefully, new research, such as studies of severe diseases caused by miRNA malfunction, will benefit from the proposed computational tool.
Keywords: Pre-miRNA ab initio prediction, Random forest, SMOTE, microRNA, Machine learning, Data mining
MicroRNAs (miRNAs) currently constitute a major research topic in molecular biology. These molecules are responsible for post-transcriptional regulation of gene expression, and are involved in several biological processes such as embryonic differentiation, skeletal muscle development, several cardiovascular disorders, expansion of skin stem cells, hematopoiesis, control of proliferation and death of cells, insulin secretion, adipogenesis and obesity, and diseases such as cancer, in addition to playing relevant physiological roles in animals and plants.
In miRNA biogenesis, the corresponding gene is transcribed into a primary miRNA (pri-miRNA) that undergoes cleavage by the Drosha complex, resulting in a hairpin-shaped miRNA precursor (pre-miRNA) of approximately 70 nt. Next, the precursor is exported to the cytoplasm by the protein Exportin5 and cleaved into the mature miRNA by the enzyme Dicer.
Identifying miRNA loci in vivo is an expensive and complex task. As a result, computational tools have been developed to perform in silico predictions. Comparative approaches, i.e., methods that use conservation of functional genomic elements of phylogenetically-close species, are a natural means for carrying out predictions, because many miRNAs are well conserved among eukaryotes. Additionally, methods based on next-generation sequencing (NGS) data are of increasing interest, as they deliver good sensitivity and provide the possibility of analyzing expression profiles. Another important computational solution that complements the approaches mentioned above, and which is a good alternative for identifying species-specific miRNAs, is the so-called ab initio approach.
According to Tempel and Tahi, ab initio methods can be classified into three categories: 1) methods whose input is a putative pre-miRNA sequence and that classify the given candidate as true or false; 2) methods whose input comprises a genomic sequence as well as some other information that helps to predict pre-miRNAs in the given sequence; 3) methods whose input is a genomic sequence and that use no external information to predict pre-miRNAs in the given sequence. Tempel and Tahi call the third category completely ab initio methods.
Ab initio methods are very useful when no closely-related species with identified miRNAs are available. Even when such information is accessible, ab initio approaches are important for predicting pre-miRNAs that are specific to the studied genome. Furthermore, NGS still involves considerable cost and laboratory infrastructure that are not available to a significant part of research groups worldwide. Our work belongs to the third category of ab initio methods. Some important approaches in this category have been previously proposed.
VMir was conceived to predict pre-miRNAs in viruses. Initially, this method processes the input sequence through a sliding window to produce subsequences with length close to the expected length of a pre-miRNA. Then, in order to build a putative secondary structure for each subsequence, VMir uses the software RNAfold. Finally, VMir calculates a score for each pre-miRNA candidate based on selected secondary structure attributes. Only those pre-miRNA candidates with scores above an informed threshold value are given as output.
CID-miRNA applies a stochastic context-free grammar built from secondary structures of known human pre-miRNAs, along with a J48 decision tree, to predict pre-miRNAs in a given DNA sequence. To learn the characteristics of negative cases in its model, CID-miRNA uses human ribosomal RNA.
Virgo predicts pre-miRNAs using a variant of the learning algorithm support vector machines (SVM) called SVM light [18, 19]. For training the classifier, validated human miRNA precursors are used as positive instances, while negative instances are artificially produced using coding regions of genes. Virgo extracts several subsequences from the input sequence, and computationally folds them with the software RNAfold. A filter is then applied to select the secondary structures that match known characteristics of pre-miRNAs. Finally, several attributes are extracted from the selected candidates so that the SVM model can assign them a final classification.
The method miRPara also applies SVM for performing classification and provides the prediction of mature miRNAs in addition to their precursors. Initially, miRPara breaks the input sequence into 500-nt subsequences with a 200-nt overlap. Next, the subsequences are given as input to the software UNAFold for extracting hairpins. Many of the obtained hairpins are then discarded by the application of a filter, and the remaining ones are further analyzed with UNAFold. Finally, several attributes related to physical characteristics of miRNAs/pre-miRNAs are extracted from the final set of hairpins for predicting the most likely candidates. To construct the training sets, positive examples were extracted from validated miRNAs/pre-miRNAs, and negative examples were artificially produced from validated pri-miRNAs with the start position of their miRNAs randomly shifted within the sequences.
The method miRNAFold is based on a statistical approach defined from a previous analysis the authors made of pre-miRNAs contained in the miRBase database [14, 22]. Similarly to other methods, miRNAFold analyzes fixed-length sequence windows, searching for putative pre-miRNAs. For each window, miRNAFold constructs the most likely secondary structure without the use of third-party computational tools, in order to speed up the prediction procedure. This is accomplished by the construction of a base-pairing matrix for each window, which is analyzed in three stages. For each stage, several characteristics of parts of putative hairpins are observed according to the statistical study made previously. The final hairpins that match a certain percentage of these characteristics are given as likely pre-miRNA candidates.
The method of Titov and Vorozheykin (TVM) proposes a context-structural Markov model and a hidden Markov model for ab initio prediction of pre-miRNAs and miRNAs, respectively. The authors used 1,872 human pre-miRNAs from the miRBase database as positive instances, and 1,872 random sequences from the pseudo-precursors constructed by Xue and colleagues as negative instances. TVM uses the software GArna to produce secondary structures from the subsequences extracted from the input sequence. Then, several patterns of the primary and secondary structures are evaluated with the proposed models.
Even though the methods described above have provided important contributions, they deliver very low selectivity (also known as precision), i.e., a high number of false positives (FPs) is produced, which means more work at the laboratory bench. In particular, the statistical study made for the method miRNAFold identified secondary structure characteristics of positive instances only, i.e., the method was based on pre-miRNAs alone. The authors did not analyze negative instances in their work. As a consequence, although miRNAFold reaches excellent sensitivity, it produces a large number of FPs.
In this work, we propose an extension of miRNAFold, which includes negative examples and applies machine learning (ML) techniques to improve selectivity. We compared our approach with miRNAFold and all other methods mentioned above. Experiments with three datasets demonstrate that we preserved very high sensitivity, while substantially increasing selectivity. Our method, termed Mirnacle, provided an improvement of two-fold, 20-fold, and 6-fold in selectivity, for the respective datasets, compared with the best results of the other approaches.
Our aim is to predict new pre-miRNAs from a DNA sequence without using any additional information (completely ab initio approach). For each subsequence extracted from the input sequence, Mirnacle executes three stages to try to build the most likely hairpin structure. Each stage classifies parts of the hairpin that is incrementally built. The steps are similar to the miRNAFold approach. One major difference is that we consider negative examples in addition to positive instances (previously identified pre-miRNAs). Another important distinction is the use of ML techniques to perform a more sophisticated classification, where each stage has its own model, with the aim of minimizing the number of false positives.
Ab initio prediction
To build the matrix for a given sequence s[0..n−1], column i represents the character s[i], row i represents the character s[n−1−i], and entry (i, j) is a positive integer if the corresponding bases are complementary, or zero otherwise. The positive numbers indicate the extension of the paired region. Algorithm 1 describes how the base pairing matrix is constructed. Lines 4-15 initialize the first column and the first row, while lines 16-24 fill the other entries.
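As an illustration only (this is not the authors' Algorithm 1), the matrix described above could be built along the following lines in Python. The complementarity rule, including the G-U wobble pair, and the function name are assumptions. For simplicity, the sketch fills the full square matrix, and each positive entry stores the length of the run of consecutive pairs that ends at that entry on its diagonal.

```python
# Pairs assumed to be complementary; including the G-U wobble is an assumption.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def base_pairing_matrix(s):
    """Return an n x n matrix M in which column j stands for s[j] and
    row i stands for s[n-1-i]. M[i][j] is 0 when the two bases cannot pair;
    otherwise it is the length of the paired run ending at (i, j)."""
    s = s.upper().replace("T", "U")  # treat DNA input as RNA
    n = len(s)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if (s[n - 1 - i], s[j]) in PAIRS:
                # Entries in the first row or column start a new run;
                # all other entries extend the run on the same diagonal.
                prev = m[i - 1][j - 1] if i > 0 and j > 0 else 0
                m[i][j] = prev + 1
    return m
```

For the toy sequence GGGAAACCC, the three G-C pairs of the stem appear as the values 1, 2, 3 along the main diagonal, followed by a zero where A faces A.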
Our main contribution here is the application of ML techniques with the objective of minimizing false positives. It contrasts with the verification of a list of criteria performed in miRNAFold. It is important to notice that each stage has its own ML model. For example, there is a specific model to apply to all exact stems of a minimum predefined length found in the first stage (Fig. 1c). Only those instances considered as positives, i.e., for which the model assigns a probability value greater than or equal to a predefined threshold, are given as input to the second stage. The non-exact stems produced from the positively classified exact stems, in turn, are classified with another ML model (Fig. 1e), and only the instances regarded as positives pass to the next phase. In the last stage, a third ML model is used to classify (Fig. 1g) the resultant hairpins to report the final predictions.
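The staged filtering described above can be sketched as a generic cascade. This is a minimal sketch under stated assumptions: the scoring callables stand in for the three trained models, the expansion callables stand in for the stem and hairpin extension steps, and all names are hypothetical rather than part of Mirnacle's code.

```python
def cascade(candidates, score_fns, thresholds, expand_fns):
    """Filter candidates through successive stages.

    score_fns[k](c) returns the positive-class probability of candidate c
    at stage k; expand_fns[k](c) returns the stage k+1 candidates grown
    from c. All callables are hypothetical placeholders."""
    current = list(candidates)
    for k, (score, threshold) in enumerate(zip(score_fns, thresholds)):
        # Keep only the instances classified as positive at this stage.
        kept = [c for c in current if score(c) >= threshold]
        if k == len(score_fns) - 1:
            return kept  # survivors of the last stage are the predictions
        # Grow every surviving candidate into next-stage candidates.
        current = [nxt for c in kept for nxt in expand_fns[k](c)]
    return current
```

With three scoring functions and two expansion functions, the call mirrors the flow from exact stems to non-exact stems to hairpins, with one probability threshold per stage.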
The three-stage procedure described above is repeated for each sliding window subsequence. At the end of each analysis, the sliding window is moved 10 nt downstream, as proposed by Tempel and Tahi. This process continues until the end of the given DNA sequence.
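A minimal sketch of this window enumeration is shown below. The 10-nt step comes from the text and the window size of 150 is the value reported for the experiments; the function name and the handling of the sequence tail (one extra window aligned with the end) are assumptions.

```python
def sliding_windows(sequence, window_size=150, step=10):
    """Yield (start, subsequence) pairs for a window moved `step` nt
    downstream at a time."""
    n = len(sequence)
    if n <= window_size:
        yield 0, sequence  # a short input is analyzed as a single window
        return
    last = n - window_size
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)  # assumed: one final window covers the tail
    for start in starts:
        yield start, sequence[start:start + window_size]
```

Each yielded subsequence would then go through the three-stage procedure independently of the others.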
To validate the proposed method, three datasets used in the experiments of miRNAFold were also used in our work to facilitate the comparisons. One of the datasets is an artificially constructed sequence obtained from 100 human pre-miRNAs interposed between human mRNAs, leading to a sequence of 30,500 nt. The other two datasets are real genomic data containing clusters of pre-miRNAs. The first one is a sequence extracted from the positive strand of the human chromosome 19, starting at position 54,169,933 and ending at position 54,485,651, containing 50 pre-miRNAs. The second sequence was obtained from the positive strand of the mouse chromosome 2, starting at position 10,388,290 and ending at position 10,439,906, comprising 71 pre-miRNAs. These three datasets are referred to throughout the text as HAD (human artificial data), HSD (Homo sapiens data), and MMD (Mus musculus data), respectively.
The sequences from zebrafish and sea squirt chromosomes used in the miRNAFold work were not included in our experiments because we could not match the stretches cited by the authors to the data available in GenBank. Furthermore, to our knowledge, there were no validated pre-miRNAs in the miRBase database for these two species at the time we performed our experiments. Details about the datasets can be found in the work of Tempel and Tahi.
To build the learning models, three independent training sets (TSs) were constructed: TSHA, TSHS, and TSMM, one for each experiment with the three test sets: HAD, HSD, and MMD, respectively. Positive examples were obtained from experimentally validated pre-miRNA sequences present in miRBase release 21. For a fair model evaluation, all pre-miRNAs contained in the test sets were eliminated from TSHA, TSHS, and TSMM. Therefore, the positive instances in TSHA were all validated human pre-miRNAs in miRBase minus the pre-miRNAs present in HAD. Similarly, TSHS’s positive instances were obtained after subtracting the pre-miRNAs in HSD from the set of validated human pre-miRNAs. Finally, the positive instances of TSMM were the validated mouse pre-miRNAs in miRBase minus the pre-miRNAs in MMD. As a consequence, TSHA, TSHS, and TSMM contain 235, 295, and 364 positive examples, respectively.
Negative examples, in turn, comprised other types of non-coding RNAs along with pseudo hairpins. Gene sequences of snRNA (small nuclear RNA), snoRNA (small nucleolar RNA), tRNA, and miscRNA (miscellaneous RNA) were extracted from GenBank and the NRDR repository, totaling 2,480 sequences from the H. sapiens genome and 3,298 sequences from the M. musculus genome. Moreover, 1,872 pseudo pre-miRNAs from the work of Xue et al. were considered to compose the H. sapiens TSs. Therefore, TSHA, TSHS, and TSMM contain 4,352, 4,352, and 3,298 negative examples, respectively.
Features for exact stem classification in the first stage: Size, deltaG, percentage of each type of nucleotide, maximum number of consecutive nucleotides of each type, percentage of GU pairing, percentage of each possible dinucleotide, percentage of the difference between G+A and C+U, and percentage of the difference between G and C;
Features for non-exact stem classification in the second stage: The same features of stage 1 plus percentage of base pairing, number of exact stems, average size of palindromes, number of internal symmetric loops, average size of internal symmetric loops, and maximum size of internal symmetric loops;
Features for hairpin classification in the third stage: The same features of stage 2 plus average size of exact stems, size of the terminal loop, percentage of GC pairing, adjusted MFE, percentage of non-exact stems covering the hairpin, size of the biggest difference between the two sides of bulge, size of the biggest bulge to the left of the terminal loop, size of the biggest bulge to the right of the terminal loop, difference between the number of left and right bulges, number of consecutive bulges, number of consecutive bulges on the same side, MFE1, MFE2, and triplets.
The values of MFE and deltaG are obtained from the computational tools RNAfold and RNAcofold, respectively, which are part of the Vienna RNA package.
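To make the purely sequence-based features of the first stage more concrete, the following sketch computes a subset of them: size, percentage of each nucleotide, maximum run of each nucleotide, and percentage of each dinucleotide. The feature names, the use of overlapping dinucleotides, and the computation over a single RNA string are assumptions; the thermodynamic features (deltaG, MFE) are left out because they depend on the Vienna RNA tools.

```python
from itertools import groupby, product

NUCLEOTIDES = "ACGU"


def sequence_features(seq):
    """Compute a few composition features for a non-empty RNA/DNA string."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper().replace("T", "U")  # treat DNA input as RNA
    n = len(seq)
    features = {"size": n}
    for base in NUCLEOTIDES:
        features[f"pct_{base}"] = 100.0 * seq.count(base) / n
        # Lengths of all runs made of this base only.
        runs = [len(list(g)) for b, g in groupby(seq) if b == base]
        features[f"max_run_{base}"] = max(runs, default=0)
    # Overlapping dinucleotides (an assumption about how they are counted).
    pairs = [seq[i:i + 2] for i in range(n - 1)]
    for a, b in product(NUCLEOTIDES, repeat=2):
        features[f"pct_{a + b}"] = 100.0 * pairs.count(a + b) / max(len(pairs), 1)
    return features
```

The resulting dictionary could be turned into one row of a training table, with the remaining structural and thermodynamic features appended by other routines.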
Learning with imbalanced datasets
As can be seen in the description of our data, the TSs are highly imbalanced. Namely, the number of negative instances is much greater than the number of positive instances. This may be a problem because the resulting model could be biased toward the dominant class and classify positive examples poorly. In order to avoid this issue, three strategies to cope with imbalanced data were considered: cost matrix, sampling, and the SMOTE filter [29, 30].
In cost matrices, it is possible to set the cost of misclassifying positive and negative instances. In our case, this cost is inversely proportional to the number of instances of the corresponding class in the TS, i.e., the cost of misclassifying a positive instance, which belongs to the minority class, is much higher than the cost of misclassifying a negative instance. As a result, the weights of both classes in the training process are equalized.
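As a small worked example of costs that are inversely proportional to class size, the sketch below derives one misclassification cost per class from the class counts. The normalization (total divided by the number of classes times the class count) is an assumption chosen so that both classes end up with the same total weight; the function name is hypothetical.

```python
def inverse_frequency_costs(class_counts):
    """Return a misclassification cost per class, inversely proportional
    to the number of training instances of that class."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * count) for label, count in class_counts.items()}


# Class sizes reported for TSHA: 235 positive and 4,352 negative examples.
costs = inverse_frequency_costs({"positive": 235, "negative": 4352})
```

With these counts, misclassifying a positive instance costs roughly 18.5 times as much as misclassifying a negative one (4,352 / 235), so the summed weight of each class is the same.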
Sampling and SMOTE were applied to alter the number of instances in the TS, such that the number of instances in each class became even. To this end, sampling with replacement was used to undersample the majority class instances, while keeping the minority class instances unchanged. The SMOTE technique, in turn, was used to oversample the minority class elements, which eliminates the possibility of information loss. To achieve this, SMOTE combines the features of existing instances with the features of their nearest neighbors to create additional synthetic instances.
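The interpolation idea behind SMOTE can be sketched in a few lines of Python. This is a simplified illustration and not the Weka filter used in this work: base instances are drawn at random instead of being visited in turn, distances are Euclidean, and the function name, the value k=5, and the fixed seed are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class
    instances and their k nearest minority-class neighbours.
    Requires at least two minority instances."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority instances.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from x to nn
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nn)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority instances, the minority class grows without duplicating existing examples and without discarding any majority examples.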
We selected a set of well-known learning algorithms to be used in conjunction with the above methods for imbalanced data. After inspecting the literature on computational methods that apply learning tasks to miRNA discovery [19, 20, 31, 32], we chose the following learning algorithms to evaluate: SVM, multilayer perceptron (MLP), and random forest (RF). Our intention was to perform experiments varying the combination of methods for imbalanced data with these learning algorithms to select the most appropriate approach. The results shown in the next section demonstrate that the best strategy among the possible combinations is SMOTE + RF.
To deploy all these ML methods in our computational tool, we used the API of the Weka machine learning toolkit, version 3.7.11, which implements all algorithms and techniques mentioned above. In particular for the SVM approach, we tested two implementations provided in the Weka API: sequential minimal optimization (SMO) and LibSVM.
The analyses of time and space presented here do not consider the phase of ML model construction, because it is performed once and the resulting model is applied as many times as needed.
Concerning space, given an input sequence of size n and a sliding window of size m, the space required by the algorithm as a function of these variables has complexity Θ(n+m²). Notice that it is necessary to store the whole sequence and the triangular base pairing matrix. For large input sequences, such as an entire chromosome or even a whole genome, it is clear that n≫m. In this case, we can consider the space complexity as Θ(n).
Regarding time, for the same variables above, we can assume n window subsequences of size m to analyze, i.e., n executions of the three-stage pipeline, and an m×m matrix to explore in each case. Considering an extreme scenario in every stage where each of the m² entries of the matrix has to be processed, the matrix scanning has complexity O(m³), because for each position in the matrix, m additional entry accesses are needed to process the respective diagonals. Therefore, the whole algorithm time complexity is O(nm³).
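The two bounds can be written out step by step as follows. The numeric line uses the window size m = 150 reported for the experiments and is only meant to give a sense of the constant hidden in the worst-case bound.

```latex
\begin{align*}
S(n, m) &= \Theta(n) + \Theta(m^2) = \Theta(n + m^2)
          = \Theta(n) \quad \text{when } n \gg m, \\
T(n, m) &= \underbrace{O(n)}_{\text{windows}} \cdot
           \underbrace{m^2}_{\text{matrix entries}} \cdot
           \underbrace{O(m)}_{\text{diagonal accesses}}
          = O(n\, m^3), \\
m = 150 &\;\Rightarrow\; m^2 = 22{,}500, \qquad m^3 = 3{,}375{,}000 .
\end{align*}
```

Since the window advances 10 nt at a time, the number of windows is about n/10, which changes only the constant factor and not the asymptotic bound.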
Results and discussion
Experiments were performed on a Linux machine with an Intel Core 2 Duo T6600 processor at 2.2 GHz and 3 GB of RAM. Because we used the same platform and processor reported in the miRNAFold work, we could directly compare our running times with the times the miRNAFold authors reported for their work and for other methods.
The predictions were evaluated by sensitivity (SN) and selectivity (SL), defined as SN = TP / (TP + FN) and SL = TP / (TP + FP), where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives. Furthermore, the geometric mean (GM) of SN and SL is provided in each experiment, as suggested by Gudyś et al. for the case of imbalanced data.
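For reference, the three measures can be computed as in the sketch below, using the standard definitions SN = TP / (TP + FN), SL = TP / (TP + FP), and GM = sqrt(SN × SL), expressed as percentages. The function names are hypothetical and the counts in the usage lines are illustrative, not results from the experiments.

```python
import math


def sensitivity(tp, fn):
    """Percentage of known pre-miRNAs that were recovered."""
    return 100.0 * tp / (tp + fn)


def selectivity(tp, fp):
    """Percentage of predictions that are true (also known as precision)."""
    return 100.0 * tp / (tp + fp)


def geometric_mean(sn, sl):
    """Geometric mean of sensitivity and selectivity."""
    return math.sqrt(sn * sl)


# Illustrative counts only: 90 hits, 10 misses, 135 false positives.
sn = sensitivity(tp=90, fn=10)
sl = selectivity(tp=90, fp=135)
gm = geometric_mean(sn, sl)
```

The geometric mean penalizes a method that reaches high sensitivity only by reporting many false positives, which is why it is informative for imbalanced data.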
Deciding the ML methods to be part of Mirnacle
In order to decide the best method among the ones selected for treating the imbalance problem, all combinations of these methods with the already mentioned ML approaches were tested using TSHA1, TSHA2, and TSHA3. We kept the algorithms’ parameters with their default values [33–36].
Table 1. Comparison of different combinations of methods for imbalanced data and learning algorithms. A 10-fold cross validation was performed in each case using TSHA1, TSHA2, and TSHA3 for the respective stages
Table 2. Comparison of the selected classifiers combined with the SMOTE filter using TSHS1, TSHS2, and TSHS3 (A) as well as TSMM1, TSMM2, and TSMM3 (B) for the respective stages. Each ML algorithm was tested for each TS through a 10-fold cross validation
Comparing Mirnacle with other methods for pre-miRNA ab initio prediction
After defining the best ML approach to use, we could complete our computational tool and compare it with other previously proposed methods for pre-miRNA ab initio prediction. In our experiments, only ab initio methods of the third category, mentioned earlier, were considered.
The Mirnacle parameters are: Minimum exact-stem size, sliding window size, the probability threshold of each of the three stages, minimum pre-miRNA size, and maximum pre-miRNA size. In all experiments, the minimum exact-stem size, the sliding window size, the minimum pre-miRNA size, and the maximum pre-miRNA size were set to 4, 150, 50, and 150, respectively.
Table 3. Comparison of the methods for pre-miRNA ab initio prediction using sequence HAD. The results of five distinct combinations of discriminant probabilities for the three respective stages are shown in parentheses for Mirnacle
Table 4. Comparison of the methods for pre-miRNA ab initio prediction using the Homo sapiens sequence HSD (A) and the Mus musculus sequence MMD (B). The results of Mirnacle are with thresholds (0.3, 0.3, 0.7) for the three stages, respectively
The results shown in Tables 3 and 4 for TVM were taken from its publication, while the results for miRNAFold, miRPara, CID-miRNA, and VMir were obtained from the work of Tempel and Tahi. Using the same criterion as the authors of miRNAFold, a predicted pre-miRNA is considered true if the distance from its center to the center of the known hairpin is less than or equal to 10% of the size of the latter.
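The matching criterion can be expressed as a short check. In this sketch, the coordinates are assumed to be inclusive start and end positions on the same sequence, and the function name is hypothetical.

```python
def is_true_prediction(pred_start, pred_end, known_start, known_end,
                       tolerance=0.10):
    """Return True if the predicted hairpin's center lies within
    `tolerance` times the known hairpin's size of the known center.
    Coordinates are assumed to be inclusive."""
    pred_center = (pred_start + pred_end) / 2
    known_center = (known_start + known_end) / 2
    known_size = known_end - known_start + 1
    return abs(pred_center - known_center) <= tolerance * known_size
```

For a known hairpin spanning positions 101 to 200 (size 100), a prediction spanning 105 to 210 is accepted because its center is 7 nt away, whereas a prediction spanning 130 to 220 is rejected because its center is 24.5 nt away.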
It can be seen in Table 3 that the best GM value of Mirnacle was 88.91, representing an increase of approximately 44% over the best result among the other methods, namely, GM = 61.77 produced by TVM. Concerning selectivity, which is the main bottleneck of the previously proposed approaches, Mirnacle delivered a selectivity of 81.51, which means more than a two-fold increase compared with TVM. Notice that a high sensitivity value was kept by Mirnacle. In Table 3, the five results shown for different thresholds for the three stages can serve as a reference for the threshold values to be used depending on a particular objective. If one is interested in high selectivity to reduce biological validations, the combination (0.8, 0.8, 0.7) is a good option, for instance.
Concerning the running time, Mirnacle’s performance was highly variable. This is because low thresholds in the first and the second stages mean a less strict filter of exact and non-exact stems, i.e., the third stage will contain more sequences to extend to a complete hairpin, which leads to a longer running time. High thresholds in the first and second stages, on the other hand, mean fewer non-exact stems to expand at the end, speeding up the process. Notice that the matrix exploration in the third phase for inspecting different possibilities of a complete hairpin is the most computationally expensive part. Comparing the Mirnacle execution that took 14 minutes and 58 seconds with the running time of other methods, Mirnacle was faster only than CID-miRNA. However, considering that we substantially improved selectivity, the time saved in laboratory experiments is likely more significant. The time taken by TVM is not reported because its authors do not mention any experiment to measure time. Furthermore, TVM is available only as a webserver, which makes it infeasible to measure its running time in a fair way.
Table 4 shows that Mirnacle provided even more significant improvements in selectivity for the other datasets. There is a 20-fold increase compared with TVM for sequence HSD (Table 4A), and a 6-fold increase compared with miRNAFold for sequence MMD (Table 4B), while high sensitivity is maintained. The results for sequence HSD are particularly remarkable. For sequence MMD, Mirnacle missed one pre-miRNA, resulting in slightly lower sensitivity compared with TVM. However, the improvement in selectivity produced by Mirnacle led to a much higher GM value for this method.
In this work, we propose an extension of the miRNAFold method for pre-miRNA ab initio prediction to address the low selectivity issue of miRNAFold and other ab initio approaches. Our experiments have shown that our method, termed Mirnacle, substantially increased the selectivity compared with previously proposed procedures, while keeping high sensitivity.
To achieve these results, the main improvements implemented in Mirnacle were: The analysis of negative training examples, in addition to positive examples; the use of machine learning techniques, namely, SMOTE combined with random forest; and the inclusion of other important hairpin features. Furthermore, Mirnacle allows the user to provide positive and negative examples to generate new models, which results in great flexibility in the use of our computational tool, i.e., as long as appropriate training examples are available, Mirnacle can be, in principle, used for other organisms of interest.
As future work, we intend to improve Mirnacle’s running time. First, the calculation of MFE and deltaG will be part of Mirnacle’s code, in order to eliminate calls to the Vienna RNA package. Second, and most importantly, Mirnacle will implement parallel approaches, such as GPU (graphics processor unit) computing, as the analyses of the subsequences in a genome are completely independent. Therefore, the parallelization of such analyses is expected to bring a substantial gain in performance.
API: Application programming interface
GPU: Graphics processor unit
HAD: Human artificial data
HSD: Homo sapiens data
MMD: Mus musculus data
MFE: Minimum free energy
SVM: Support vector machines
snRNA: Small nuclear RNA
snoRNA: Small nucleolar RNA
SMO: Sequential minimal optimization
This research is supported by CAPES, CNPq, FAPEMIG, FAPERJ, and IFNMG.
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 18, 2016. Proceedings of X-meeting 2015: 11th International Conference of the AB3C + Brazilian Symposium on Bioinformatics: bioinformatics. The full contents of the supplement are available online https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-18.
Publication charges for this article have been funded by CAPES, CNPq, and FAPEMIG.
Availability of data and materials
Mirnacle is available from: http://www.labinfo.lncc.br/mirnacle. The user can provide positive and negative examples to construct training sets, generate models, and search for new pre-miRNAs in a DNA or RNA sequence.
YBM, ATRV, and FRC designed all analyses; YBM and APO were responsible for carrying out the analyses; YBM and FRC wrote the initial draft of the manuscript; APO and ATRV contributed to subsequent revisions leading to the final draft. All authors read and approved the final paper.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Ha M, Kim VN. Regulation of microRNA biogenesis. Nat Rev Mol Cell Biol. 2014; 15(8):509–24.
2. Suh MR, Lee Y, Kim JY, Kim SK, Moon SH, Lee JY, Cha KY, Chung HM, Yoon HS, Moon SY, et al. Human embryonic stem cells express a unique set of microRNAs. Dev Biol. 2004; 270(2):488–98.
3. Williams AH, Liu N, van Rooij E, Olson EN. MicroRNA control of muscle development and disease. Curr Opin Cell Biol. 2009; 21(3):461–9.
4. van Rooij E, Olson EN. MicroRNA therapeutics for cardiovascular disease: Opportunities and obstacles. Nat Rev Drug Discov. 2012; 11(11):860–72.
5. Wang D, Zhang Z, O’Loughlin E, Wang L, Fan X, Lai EC, Yi R. MicroRNA-205 controls neonatal expansion of skin stem cells by modulating the PI(3)K pathway. Nat Cell Biol. 2013; 15(10):1153–63.
6. Shivdasani RA. MicroRNAs: Regulators of gene expression and cell differentiation. Blood. 2006; 108(12):3646–53.
7. Ambros V. The functions of animal microRNAs. Nature. 2004; 431(7006):350–5.
8. Poy MN, Eliasson L, Krutzfeldt J, Kuwajima S, Ma X, MacDonald PE, Pfeffer S, Tuschl T, Rajewsky N, Rorsman P, et al. A pancreatic islet-specific microRNA regulates insulin secretion. Nature. 2004; 432(7014):226–30.
9. Hilton C, Neville M, Karpe F. MicroRNAs in adipose tissue: Their role in adipogenesis and obesity. Int J Obes. 2013; 37(3):325–32.
10. Pereira DM, Rodrigues PM, Borralho PM, Rodrigues CM. Delivering the promise of miRNA cancer therapeutics. Drug Discov Today. 2013; 18(5):282–9.
11. Carrington JC, Ambros V. Role of microRNAs in plant and animal development. Science. 2003; 301(5631):336–8.
12. Terai G, Okida H, Asai K, Mituyama T. Prediction of conserved precursors of miRNAs and their mature forms by integrating position-specific structural features. PLoS ONE. 2012; 7(9):11. doi:10.1371/journal.pone.0044314.
13. Hansen TB, Venø MT, Kjems J, Damgaard CK. miRdentify: High stringency miRNA predictor identifies several novel animal miRNAs. Nucleic Acids Res. 2014; 42(16):124.
14. Tempel S, Tahi F. A fast ab-initio method for predicting miRNA precursors in genomes. Nucleic Acids Res. 2012; 40(11):80.
15. Grundhoff A, Sullivan CS, Ganem D. A combined computational and microarray-based approach identifies novel microRNAs encoded by human gamma-herpesviruses. RNA. 2006; 12(5):733–50.
16. Lorenz R, Bernhart SH, zu Siederdissen CH, Tafer H, Flamm C, Stadler PF, Hofacker IL. ViennaRNA package 2.0. Algorithms Mol Biol. 2011; 6:26.
17. Tyagi S, Vaz C, Gupta V, Bhatia R, Maheshwari S, Srinivasan A, Bhattacharya A. CID-miRNA: A web server for prediction of novel miRNA precursors in human genome. Biochem Biophys Res Commun. 2008; 372(4):831–4.
18. Joachims T. Advances in Kernel Methods, Chap. Making large-scale support vector machine learning practical. Cambridge: MIT Press; 1999, pp. 169–84.
19. Kumar S, Ansari FA, Scaria V. Prediction of viral microRNA precursors based on human microRNA precursor sequence and structural features. Virol J. 2009; 6(1):129.
20. Wu Y, Wei B, Liu H, Li T, Rayner S. MiRPara: A SVM-based software tool for prediction of most probable microRNA coding regions in genome scale sequences. BMC Bioinformatics. 2011; 12(1):107.
21. Markham N, Zuker M. UNAFold: Software for nucleic acid folding and hybridization. In: Keith J, editor. Bioinformatics. Methods in Molecular Biology. Totowa: Humana Press: 2008. p. 3–31.
22. Kozomara A, Griffiths-Jones S. miRBase: Annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 2014; 42(D1):D68–D73.
23. Titov II, Vorozheykin PS. Ab initio human miRNA and pre-miRNA prediction. J Bioinforma Comput Biol. 2013; 11(6):1343009.
24. Xue C, Li F, He T, Liu GP, Li Y, Zhang X. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics. 2005; 6(1):310.
25. Titov II, Vorobiev DG, Ivanisenko VA, Kolchanov NA. A fast genetic algorithm for RNA secondary structure analysis. Russ Chem Bull. 2002; 51(7):1135–44.
26. Paschoal AR, Maracaja-Coutinho V, Setubal JC, Simões ZLP, Verjovski-Almeida S, Durham AM. Non-coding transcription characterization and annotation. RNA Biol. 2012; 9(3):274–282.
27. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: An update. ACM SIGKDD Explor Newsl. 2009; 11(1):10–8.
28. Batuwita R, Palade V. microPred: Effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics. 2009; 25(8):989–95.
29. Chawla NV. Data mining for imbalanced datasets: An overview. In: Maimon O, Rokach L, editors. Data Mining and Knowledge Discovery Handbook. Boston: Springer US: 2010. p. 875–86.
30. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. J Artif Int Res. 2002; 16(1):321–357.
31. Ahmadi H, Ahmadi A, Azimzadeh-Jamalkandi S, Shoorehdeli MA, Salehzadeh-Yazdi A, Bidkhori G, Masoudi-Nejad A. HomoTarget: a new algorithm for prediction of microRNA targets in Homo sapiens. Genomics. 2013; 101(2):94–100.
32. Gudyś A, Szcześniak MW, Sikora M, Makałowska I. HuntMi: An efficient and taxon-specific approach in pre-miRNA identification. BMC Bioinformatics. 2013; 14(1):83.
33. Weka SMO classifier. http://weka.sourceforge.net/doc.stable/weka/classifiers/functions/SMO.html. Accessed 10 May 2016.
34. Weka LibSVM classifier. http://weka.sourceforge.net/doc.stable/weka/classifiers/functions/LibSVM.html. Accessed 10 May 2016.
35. Weka Multilayer Perceptron classifier. http://weka.sourceforge.net/doc.stable/weka/classifiers/functions/MultilayerPerceptron.html. Accessed 10 May 2016.
36. Weka Random Forest classifier. http://weka.sourceforge.net/doc.stable/weka/classifiers/trees/RandomForest.html. Accessed 10 May 2016.