Volume 10 Supplement 1
HHMMiR: efficient de novo prediction of microRNAs using hierarchical hidden Markov models
© Kadri et al; licensee BioMed Central Ltd. 2009
Published: 30 January 2009
MicroRNA s (miRNAs) are small non-coding single-stranded RNAs (20–23 nts) that are known to act as post-transcriptional and translational regulators of gene expression. Although, they were initially overlooked, their role in many important biological processes, such as development, cell differentiation, and cancer has been established in recent times. In spite of their biological significance, the identification of miRNA genes in newly sequenced organisms is still based, to a large degree, on extensive use of evolutionary conservation, which is not always available.
We have developed HHMMiR, a novel approach for de novo miRNA hairpin prediction in the absence of evolutionary conservation. Our method implements a Hierarchical Hidden Markov Model (HHMM) that utilizes region-based structural as well as sequence information of miRNA precursors. We first established a template for the structure of a typical miRNA hairpin by summarizing data from publicly available databases. We then used this template to develop the HHMM topology.
Our algorithm achieved average sensitivity of 84% and specificity of 88%, on 10-fold cross-validation of human miRNA precursor data. We also show that this model, trained on human sequences, works well on hairpins from other vertebrate as well as invertebrate species. Furthermore, the human trained model was able to correctly classify ~97% of plant miRNA precursors. The success of this approach in such a diverse set of species indicates that sequence conservation is not necessary for miRNA prediction. This may lead to efficient prediction of miRNA genes in virtually any organism.
Plant and animal miRNAs differ not only in their biogenesis, but also in target-miRNA interactions. Plant miRNAs base pair with their targets with perfect or near-perfect complementarity and they regulate their targets mostly through mRNA cleavage at single sites in coding regions. Animal miRNAs usually base pair with 3' UTRs of the mRNAs at multiple target sites through imperfect complementarity. Due to these and other differences, it has been suggested that this regulation mechanism may have evolved independently in plants and animals . Some viruses have also been shown to encode miRNAs that play a role in expression regulation of host genes .
The first animal miRNA genes, let-7 and lin-4, were discovered in Caenorhabditis elegans by forward genetics [11–13]. Currently, miRNA genes are biochemically identified by cloning and sequencing size-fractionated cDNA libraries. The main limitation of this method is that lowly expressed miRNAs may be missed . Although deep sequencing can help overcome this problem, this is currently a costly solution. Still, some miRNAs may be difficult to clone due to their sequence composition and possible post-transcriptional modifications [14–16]. Deep sequencing is being used on a large scale to identify small non-coding RNAs, but this is an expensive method and can only identify miRNAs expressed in a single cell type or in a given condition.
Computational predictive methods are fast and inexpensive and a number of approaches have been developed to predict miRNA genes, genome-wide. However, most of these approaches depend heavily on conservation of hairpins in closely related species [17–20]. Some methods have used clustering or profiling to identify miRNAs, [17, 21, 22]. The approach of Bentwich et al.  is interesting in that the whole genome is folded and scores are assigned to hairpins based on various features, including hairpin structural features and folding stability.
Machine learning approaches in the past have used support vector machines with high dimensional basis functions for classification of genomics hairpins [22, 24, 25]. Some of these methods depend on cross-species conservation for classification, while others do motif finding using multiple alignments. More recently, HMMs have been used in modelling miRNAs using both, evolutionary information and features related to the secondary structure .
Hierarchical Hidden Markov Models
Hierarchical Hidden Markov Models (HHMMs) constitute a generalization of Hidden Markov Models (HMMs). They have been successfully used for modelling stochastic levels and length scales . In biology, HHMMs have been used in the past to model vertebrate splice sites  and more recently in modelling cis-regulatory modules . An HHMM has two types of states: internal states and production states. Each internal state has its own HHMM but cannot emit symbols by itself. It can activate a sub-state by a vertical transition. Sub-states can also make vertical transitions, until the lowest level in the hierarchy (production state) is reached. Production states are the only states that can emit symbols from the alphabet via their own probability distributions. Sub-states at the same level of hierarchy will be activated through horizontal transitions till an "end state" is reached. Every level has only one "end state" for each parent state that shifts control back to the parent. Thus, each internal state can emit sequences instead of single symbols. The node at the highest level of the hierarchy is called the "root" node while the leaf nodes are the productions states. Please refer to Methods for information about HHMM parameters and their estimation.
In this article, we report the results on the performance of an HHMM we developed for modelling miRNA hairpins. Although the model was trained on human sequences only, it was able to classify accurately hairpins from species as distant as worm, flies and plants, indicating that the degree of sequence and structural conservation for these genes may be high.
Characteristics of miRNA hairpins in various taxonomic groups.
HHMMiR is built around the miRNA precursor template illustrated in Figure 2a. The figure presents the four characteristic regions of stem-loop of a typical miRNA gene as described above. The length distributions of each of these regions are derived from Table 1. Each region, except the loop itself has three states: match, mismatch, and insertion/deletion (indel). Match means a base pairing at that position in the stem-loop, while mismatch means bulges on both arms at that position in the folded hairpin.Indel means that a base in one strand has no counterpart in the opposite strand. The loop will only have the indel state. Examples of these states are presented in Figure 2a.
Datasets and alphabet selection
The training dataset contained a total 527 human miRNA precursors (positive dataset) and ~500 random hairpins (negative dataset), based on criteria derived from summarization (see Methods). The RNAfold program from Vienna Package  was used to obtain the secondary structure of these hairpins with the minimum fold energy (mfe). The parameters of the model were estimated using a modified Baum-Welch algorithm (see Methods for details on data sets and algorithms). All tests were conducted with 10-fold cross validation with random sampling.
Results for different alphabet sizes: Σ1 (larger alphabet) shows better accuracy than Σ2 (smaller alphabet)
It is surprising that Σ1 performs better than Σ2, because one would expect that mismatches in the stem-loop would not be characteristic of the miRNA sequence, since they do not contribute to the base pairing of the stem and thus the overall folding energy, on which other algorithms are based . Furthermore, Σ1 alphabet has more parameters. In order to rule out that the better performance is due to parameter overfitting, we repeated training with multiple datasets of different sizes and the results remained the same (data not shown). In the remaining of this paper we use the Σ1 alphabet.
Training algorithms: performance evaluation
Results for cross-validation using different algorithms.
Testing prediction efficiency in other organisms
Results of tests on other species.
% correctly predicted
Comparison with other approaches
Results for comparison between two precursor prediction methods.
Triplet SVM (%)
New human hairpins in registry at the time.
Epstein Barr virus
Folded genome hairpins from Chromosome 19
Negative hairpin Set
MiRNA genes constitute one of the most conserved mechanisms for gene regulation across all animal and plant species. The characteristics of the precursor miRNA stem-loops are well conserved in both vertebrate and invertebrate animals and fairly conserved between animals and plants. As seen in Table 1, plant hairpins tend to be generally longer than those in animals, while vertebrates have shorter precursors than invertebrates. Although, the "extension" and "pri-extension" regions may vary in length between animals and plants (much longer in plants), the lengths of the "miRNA" and "loop" regions are more similar. Thus, even across evolutionary time, the basic characteristics of miRNAs have not changed dramatically.
We designed a template for a typical precursor miRNA stem-loop and we built an HHMM based on it. HHMMiR was able to attain an average sensitivity of 84% and specificity of 88% on 10-fold cross validation of human data. We trained HHMMiR on human sequences and the resulting model was able to successfully identify a large percentage of not only mouse, but also invertebrate, plant and virus miRNAs (Table 4). This is an encouraging result showing that HHMMiR may be very useful in predicting miRNA genes across long evolutionary distances without the requirement for evolutionary conservation of sequences. This would be very beneficial for identification of miRNA hairpins in organisms that do not have closely related species sequenced, such as Strongylocentrotus purpuratus (sea urchin) and Ornithorhynchus anatinus (platypus) .
This is the first time a hierarchical probabilistic model has been used for classification and identification of miRNA hairpins. Probabilistic learning was previously applied by Nam et al.  for identifying the miRNA pattern/motif in hairpins. The advantage of the hierarchy used by our HHMMiR is that it parses each hairpin into four distinct regions and processes each of them separately. This represents better the biological role of each region, which is reflected in the distinct length distributions and neighbourhood base-pairing characteristic of that region. Furthermore, the underlying HHMM provides an intuitive modelling of these regions. We compared two modifications of the MLE and Baum-Welch algorithms for modelling the negative datasets, and we found them to perform similarly. Baum-Welch was selected for this study, since it does not require (random) labelling of the negative set.
The drawback of HHMMiR is that it depends on the mfe structure the RNAfold program returns . In the future, we will test more folding algorithms or use the probability distribution of a number of top scoring folding energy structures returned by this package.
The success of our approach shows that the conservation of the miRNA mechanism may be at a much deeper level than expected. Further developments of the HHMMiR algorithm include the extension of the precursor template model (Figure 3) to be able to predict pri-miRNA genes with multiple stem-loops. Another extension would be to train a model to decode all HHMMiR predicted hairpins to identify the miRNA genes in them. Finally, it will be interesting to extend our method to include evolutionary information, which will allow us to assess the significance of conservation in predicting miRNA genes.
Data collection and processing
MiRNA genes were obtained from the microRNA registry, version 10.1 (December 2007) , which contains 3265 miRNAs from animals and 870 from plants. For training HHMMiR, we used the residual 525 human hairpins, after filtering out precursor genes with multiple loops. Each gene was folded with the RNAfold program, which is part of the Vienna package , using the default parameters to obtain the secondary structure with minimum fold energy. The negative set consists of coding regions and random genomic segments from the human genome that were obtained using the UCSC genome browser . These regions were folded and processed as described below.
After the hairpins are extracted, we process them to an input format representing the hairpin's secondary structure (Figure 5 and Figure 2) to be compatible with the HHMM shown in Figure 3. The labelling is done only for training data. For the purpose of labelling, the miRNA is first mapped to the folded hairpin (on either or both arms), and then the region representing the miRNA is labelled as the duplex miRNA (R) region. Our method does not consider the 3' overhangs generated during Dicer processing. The main bulge is labelled as the loop (L), whereas the remaining region between loop and miRNA is represented as the extension (E). The rest of the hairpin beyond the miRNA is labelled as pri-extension (P). A detailed description of these regions is given in the Results section.
Parameter estimation and testing
Two separate HHMM models are trained, one on positive data set (miRNAs and their corresponding hairpins) and the other on negative data set (hairpins, randomly chosen from the coding parts of the genome). The hairpins are pre-processed and labelled (if needed) before parameter estimation. Baum-Welch requires no labelling, but for MLE, we applied random labelling, as described above (Figure 2a).
Now we will define the various probabilities that are required to be calculated for parameter estimation.
(i) finished at started at o t ) is the forward probability of emitting the substring o t ... ot+kof the observation sequence by the parent state q d such that it was entered at o t and the subsequence ended at substate and thus, it was the last state activated.
(ii) is the probability of making a vertical transition from parent q d to just before the emission of o t .
(iii) is the probability of making a horizontal transition from to where both are substates of q d after the emission of o t and before the emission of ot+1.
Measures of accuracy
Measures for accuracy calculation.
False Discovery Rate (FDR)
Matthew's Correlation Coefficient (MCC)
List of abbreviations used
hierarchical hidden Markov model
minimum fold energy
maximum likelihood estimate.
The authors would like to thank Paul Samollow, Chakra Chennubhotla, Eleanor Feingold and an anonymous reviewer for helpful suggestions. PVB was supported by NIH grant 1R01LM009657-01. This research was supported in part by the National Science Foundation through TeraGrid resources provided by Pittsburgh Supercomputing Center. Supplementary material can be found at the journal's web site and at our web site .
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
- Lee Y, Kim M, Han J, Yeom KH, Lee S, Baek SH, Kim VN: MicroRNA genes are transcribed by RNA polymerase II. Embo J 2004, 23(20):4051–4060. 10.1038/sj.emboj.7600385PubMed CentralView ArticlePubMedGoogle Scholar
- Cai X, Hagedorn CH, Cullen BR: Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. Rna 2004, 10(12):1957–1966. 10.1261/rna.7135204PubMed CentralView ArticlePubMedGoogle Scholar
- Yi R, Qin Y, Macara IG, Cullen BR: Exportin-5 mediates the nuclear export of pre-microRNAs and short hairpin RNAs. Genes Dev 2003, 17(24):3011–3016. 10.1101/gad.1158803PubMed CentralView ArticlePubMedGoogle Scholar
- Chendrimada TP, Gregory RI, Kumaraswamy E, Norman J, Cooch N, Nishikura K, Shiekhattar R: TRBP recruits the Dicer complex to Ago2 for microRNA processing and gene silencing. Nature 2005, 436(7051):740–744. 10.1038/nature03868PubMed CentralView ArticlePubMedGoogle Scholar
- Lee Y, Ahn C, Han J, Choi H, Kim J, Yim J, Lee J, Provost P, Radmark O, Kim S, et al.: The nuclear RNase III Drosha initiates microRNA processing. Nature 2003, 425(6956):415–419. 10.1038/nature01957View ArticlePubMedGoogle Scholar
- Schwarz DS, Hutvagner G, Du T, Xu Z, Aronin N, Zamore PD: Asymmetry in the assembly of the RNAi enzyme complex. Cell 2003, 115(2):199–208. 10.1016/S0092-8674(03)00759-1View ArticlePubMedGoogle Scholar
- Lagos-Quintana M, Rauhut R, Yalcin A, Meyer J, Lendeckel W, Tuschl T: Identification of tissue-specific microRNAs from mouse. Curr Biol 2002, 12(9):735–739. 10.1016/S0960-9822(02)00809-6View ArticlePubMedGoogle Scholar
- Lewis BP, Burge CB, Bartel DP: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 2005, 120(1):15–20. 10.1016/j.cell.2004.12.035View ArticlePubMedGoogle Scholar
- Millar AA, Waterhouse PM: Plant and animal microRNAs: similarities and differences. Functional & integrative genomics 2005, 5(3):129–135. 10.1007/s10142-005-0145-2View ArticleGoogle Scholar
- Sarnow P, Jopling CL, Norman KL, Schutz S, Wehner KA: MicroRNAs: expression, avoidance and subversion by vertebrate viruses. Nature reviews 2006, 4(9):651–659. 10.1038/nrmicro1473PubMedGoogle Scholar
- Lee RC, Feinbaum RL, Ambros V: The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 1993, 75(5):843–854. 10.1016/0092-8674(93)90529-YView ArticlePubMedGoogle Scholar
- Wightman B, Ha I, Ruvkun G: Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell 1993, 75(5):855–862. 10.1016/0092-8674(93)90530-4View ArticlePubMedGoogle Scholar
- Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE, Bettinger JC, Rougvie AE, Horvitz HR, Ruvkun G: The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature 2000, 403(6772):901–906. 10.1038/35002607View ArticlePubMedGoogle Scholar
- Berezikov E, Cuppen E, Plasterk RH: Approaches to microRNA discovery. Nat Genet 2006, 38(Suppl):S2–7. 10.1038/ng1794View ArticlePubMedGoogle Scholar
- Yang W, Chendrimada TP, Wang Q, Higuchi M, Seeburg PH, Shiekhattar R, Nishikura K: Modulation of microRNA processing and expression through RNA editing by ADAR deaminases. Nature structural & molecular biology 2006, 13(1):13–21. 10.1038/nsmb1041View ArticleGoogle Scholar
- Yang Z, Ebright YW, Yu B, Chen X: HEN1 recognizes 21–24 nt small RNA duplexes and deposits a methyl group onto the 2' OH of the 3' terminal nucleotide. Nucleic Acids Res 2006, 34(2):667–675. 10.1093/nar/gkj474PubMed CentralView ArticlePubMedGoogle Scholar
- Ohler U, Yekta S, Lim LP, Bartel DP, Burge CB: Patterns of flanking sequence conservation and a characteristic upstream motif for microRNA gene identification. Rna 2004, 10(9):1309–1322. 10.1261/rna.5206304PubMed CentralView ArticlePubMedGoogle Scholar
- Grad Y, Aach J, Hayes GD, Reinhart BJ, Church GM, Ruvkun G, Kim J: Computational and experimental identification of C. elegans microRNAs. Mol Cell 2003, 11(5):1253–1263. 10.1016/S1097-2765(03)00153-9View ArticlePubMedGoogle Scholar
- Lai EC, Tomancak P, Williams RW, Rubin GM: Computational identification of Drosophila microRNA genes. Genome Biol 2003, 4(7):R42. 10.1186/gb-2003-4-7-r42PubMed CentralView ArticlePubMedGoogle Scholar
- Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB, Bartel DP: The microRNAs of Caenorhabditis elegans. Genes Dev 2003, 17(8):991–1008. 10.1101/gad.1074403PubMed CentralView ArticlePubMedGoogle Scholar
- Legendre M, Lambert A, Gautheret D: Profile-based detection of microRNA precursors in animal genomes. Bioinformatics 2005, 21(7):841–845. 10.1093/bioinformatics/bti073View ArticlePubMedGoogle Scholar
- Sewer A, Paul N, Landgraf P, Aravin A, Pfeffer S, Brownstein MJ, Tuschl T, van Nimwegen E, Zavolan M: Identification of clustered microRNAs using an ab initio prediction method. BMC Bioinformatics 2005, 6: 267. 10.1186/1471-2105-6-267PubMed CentralView ArticlePubMedGoogle Scholar
- Bentwich I, Avniel A, Karov Y, Aharonov R, Gilad S, Barad O, Barzilai A, Einat P, Einav U, Meiri E, et al.: Identification of hundreds of conserved and nonconserved human microRNAs. Nat Genet 2005, 37(7):766–770. 10.1038/ng1590View ArticlePubMedGoogle Scholar
- Pfeffer S, Sewer A, Lagos-Quintana M, Sheridan R, Sander C, Grasser FA, van Dyk LF, Ho CK, Shuman S, Chien M, et al.: Identification of microRNAs of the herpesvirus family. Nat Methods 2005, 2(4):269–276. 10.1038/nmeth746View ArticlePubMedGoogle Scholar
- Xue C, Li F, He T, Liu GP, Li Y, Zhang X: Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics 2005, 6: 310. 10.1186/1471-2105-6-310PubMed CentralView ArticlePubMedGoogle Scholar
- Terai G, Komori T, Asai K, Kin T: miRRim: a novel system to find conserved miRNAs with high sensitivity and specificity. Rna 2007, 13(12):2081–2090. 10.1261/rna.655107PubMed CentralView ArticlePubMedGoogle Scholar
- Fine S, Singer Y, Tishby N: The Hierarchical Hidden Markov Model: Analysis and Applications. Machine Learning 1998, 32: 41–62. 10.1023/A:1007469218079View ArticleGoogle Scholar
- Hu M, Ingram C, Sirski M, Pal C, Swamy S, Patten C: A Hierarchical HMM Implementation for Vertebrate Gene Splice Site Prediction. In Technical Report. Department of Computer Science, University of Waterloo; 2000.Google Scholar
- Lin T, Ray P, Sandve GK, Uguroglu S, Xing EP: BayCis: A Bayesian Hierarchical HMM for Cis-Regulatory Module Decoding in Metazoan Genomes. In Research in Computational Molecular Biology (RECOMB), 12th Annual International Conference: 2008. Singapore: Springer; 2008:66–81.View ArticleGoogle Scholar
- Saetrom P, Snove O, Nedland M, Grunfeld TB, Lin Y, Bass MB, Canon JR: Conserved microRNA characteristics in mammals. Oligonucleotides 2006, 16(2):115–144. 10.1089/oli.2006.16.115View ArticlePubMedGoogle Scholar
- Hofacker IL: Vienna RNA secondary structure server. Nucleic Acids Res 2003, 31(13):3429–3431. 10.1093/nar/gkg599PubMed CentralView ArticlePubMedGoogle Scholar
- Nam JW, Shin KR, Han J, Lee Y, Kim VN, Zhang BT: Human microRNA prediction through a probabilistic co-learning model of sequence and structure. Nucleic Acids Res 2005, 33(11):3570–3581. 10.1093/nar/gki668PubMed CentralView ArticlePubMedGoogle Scholar
- Samollow PB: The opossum genome: insights and opportunities from an alternative mammal. Genome Res 2008, 18(8):1199–1215. 10.1101/gr.065326.107PubMed CentralView ArticlePubMedGoogle Scholar
- Griffiths-Jones S: miRBase: the microRNA sequence database. Methods Mol Biol 2006, 342: 129–138.PubMedGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res 2002, 12(6):996–1006.PubMed CentralView ArticlePubMedGoogle Scholar
- Catlett C, Allcock WE, Andrews P, Aydt R, Bair R, Balac N, Banister B, Barker T, Bartelt M, Beckman P, et al.: TeraGrid: Analysis of Organization, System Architecture, and Middleware Enabling New Types of Applications. In High Performance Computing and Grids in Action. Volume 16. Edited by: Grandinetti L. Amsterdam: IOS Press; 2008.Google Scholar
- Benos laboratory web server[http://www.benoslab.pitt.edu/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.