MapMi: automated mapping of microRNA loci
© Guerra-Assunção and Enright. 2010
Received: 5 October 2009
Accepted: 16 March 2010
Published: 16 March 2010
A large effort to discover microRNAs (miRNAs) has been under way. Currently miRBase is their primary repository, providing annotations of primary sequences, precursors and probable genomic loci. In many cases miRNAs are identical or very similar between related (or in some cases more distant) species. However, miRBase focuses on those species for which miRNAs have been directly confirmed. In addition, specific miRNAs or their loci are sometimes not annotated even in well-covered species. We sought to address these gaps by developing a computational system for automated mapping of miRNAs within and across species. Given the sequence of a known miRNA in one species, it is relatively straightforward to determine likely loci of that miRNA in other species. Our primary goal is not the discovery of novel miRNAs but the mapping of validated miRNAs in one species to their most likely orthologues in other species.
We present MapMi, a computational system for automated miRNA mapping across and within species. This method has a sensitivity of 92.20% and a specificity of 97.73%. Using the latest release (v14) of miRBase, we obtained 10,944 unannotated potential miRNAs when MapMi was applied to all 21 species in Ensembl Metazoa release 2 and 46 species from Ensembl release 55.
The pipeline and an associated web-server for mapping miRNAs are freely available at http://www.ebi.ac.uk/enright-srv/MapMi/. In addition, precomputed mappings of miRBase miRNAs across a large number of species are provided.
Recently, miRNAs have been shown to be a large and diverse class of regulators [1, 2]. A large effort has been under way to clone and sequence miRNAs using a variety of technologies in multiple species and tissues [3, 4]. These molecules are 18-22 nt long and function by binding to the 3'UTRs of their target transcripts. This binding event is targeted via complementarity between the miRNA and the target sequence. The binding of a miRNA to its target transcript causes repression of translation and also destabilisation of the transcript. miRNAs have been implicated in a large and growing number of diseases and processes across both animal and plant kingdoms [7, 8]. The miRBase database is the primary repository for these data. It focuses on both nomenclature and the recording of precursor and mature sequences and their probable genomic loci. Currently many deposited miRNAs are derived from model organisms (e.g. Human, C. elegans). Given that many miRNAs are highly conserved between species, it is likely, for example, that a miRNA discovered in C. elegans will also be present in C. briggsae. In other cases, even within one species, there may be multiple genomic loci that could encode a detected mature miRNA sequence, and not all of these may be annotated in miRBase. This implicit bias towards model organisms hampers miRNA research in other organisms and makes evolutionary and phylogenetic analysis of miRNA families across species extremely difficult. Given a mature miRNA sequence in one species, it is possible to detect the likely location of its orthologue in another species using both sequence analysis and RNA secondary structure prediction. Our assumption here is that an orthologous miRNA will have a high degree of similarity to the mature miRNA sequence, and that identified orthologous loci should be able to form the stem-loop structure typical of miRNA precursors.
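The stem-loop assumption can be illustrated with a minimal sketch. It assumes a predicted structure is already available in dot-bracket notation (the format produced by folding tools such as RNAfold); the function names and the idea of testing for exactly one hairpin are illustrative assumptions, not the criteria MapMi itself applies.

```python
import re


def is_single_stem_loop(structure: str) -> bool:
    """Return True if a dot-bracket string describes exactly one hairpin.

    Hypothetical helper: a single hairpin has all opening brackets
    before all closing brackets, in equal numbers.
    """
    brackets = structure.replace(".", "")
    match = re.fullmatch(r"(\(+)(\)+)", brackets)
    return bool(match) and len(match.group(1)) == len(match.group(2))


def paired_fraction(structure: str) -> float:
    """Fraction of positions that are base-paired (0.0 for an empty string)."""
    if not structure:
        return 0.0
    return 1 - structure.count(".") / len(structure)
```

A candidate locus whose folded flanking region fails a test of this kind (for example, two separate hairpins or no pairing at all) would be a poor precursor candidate regardless of how well the mature sequence matches.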
Some groups use ad hoc methods for miRNA mapping analysis. However, such approaches are generally not available to the community, have not been validated, or are too specific for general use. For example, miROrtho provides web access but not methods or raw data, while CoGemiR provides raw data but does not allow sequence searches. Another tool, miRNAminer, requires the user to provide both the mature sequence and the precursor sequence and runs on a limited set of species. For these reasons, it is very difficult to directly compare the existing methods to MapMi in terms of performance. However, where possible we have compared predictions from MapMi against CoGemiR, miRNAminer and miROrtho (see Additional file 1). The most complete comparison is with miROrtho, where there is a high degree of overlap between the methods for the species where their data are freely available. When human miRBase miRNAs are used as a reference set, both methods predict a shared set of 478 loci, while miROrtho predicts 49 loci that MapMi does not and MapMi detects 139 loci not detected by miROrtho (see Additional file 1, Figure S4). Many methods have focused on the prediction of novel miRNAs from genomic hairpins, which is a non-trivial problem. In our case we focus on the simpler task of mapping an identified miRNA in one species to others using both sequence similarity and RNA secondary structure. While our system has not been designed for predicting the loci of novel miRNAs, it can leverage the data produced by other methods by extending them to other species. We describe our approach to miRNA mapping and demonstrate that it performs well, discriminating between true miRNAs and false positives. The approach is freely available as both software and a web interface. Furthermore, we provide precomputed mappings of all miRBase miRNA sequences across 46 Ensembl genomes and 21 Ensembl Metazoa genomes.
We will maintain this resource through subsequent updates of miRBase for all species available in Ensembl.
Finally, the web version of MapMi provides detailed further analysis capabilities for precomputed results. This includes the generation and display of Maximum Likelihood phylogenetic trees (PhyML & PhyloWidget), multiple sequence alignments (MUSCLE & Jalview) and RNA structural logos (RNALogo).
The negative dataset was generated by using ushuffle to generate 10 and 100 shuffles per initial nucleotide sequence. Due to their nucleotide composition, some of the 4,237 initial sequences could not be shuffled the required number of times. The resulting datasets contained 42,366 and 423,343 random shuffled sequences respectively. These datasets were mapped against all 67 genomes under analysis.
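The shortfall is small: 4,237 sequences would give 42,370 and 423,700 shuffles if every sequence could be shuffled 10 and 100 times, so the two datasets are 4 and 357 sequences short. The sketch below shows why low-complexity sequences run out of distinct shuffles. It is a simplification under stated assumptions: it uses a plain mononucleotide shuffle rather than the dinucleotide-preserving shuffle performed by ushuffle, it excludes the original sequence from the results, and the function name and parameters are hypothetical.

```python
import random


def distinct_shuffles(seq: str, wanted: int,
                      max_tries: int = 10_000, seed: int = 0) -> list[str]:
    """Collect up to `wanted` distinct shuffles of `seq`.

    Simplified stand-in for ushuffle: letters are permuted freely, so only
    mononucleotide composition is preserved. Fewer than `wanted` results are
    returned when the sequence does not have enough distinct permutations.
    """
    rng = random.Random(seed)
    found: set[str] = set()
    tries = 0
    while len(found) < wanted and tries < max_tries:
        letters = list(seq)
        rng.shuffle(letters)
        candidate = "".join(letters)
        if candidate != seq:  # assumption: the original is not a valid negative
            found.add(candidate)
        tries += 1
    return sorted(found)
```

A 22 nt sequence made of 21 A and one C has only 22 distinct arrangements, so a request for 100 shuffles cannot be satisfied, whereas a sequence of mixed composition easily yields the requested number.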
Repeat masking is applied to the genomes prior to the analysis to avoid detecting repeat elements that contain sequences similar to known miRNAs. As a consequence, however, some miRBase-annotated miRNAs are masked, which reduces the sensitivity of our method (see also Additional file 1, Tables S1 and S2).
Table 1. MapMi mapping results (totals).
Total Loci in miRBase: 5974
Overlapping Loci (1): 5093    New Loci (1): 5232
Overlapping Loci (2): 5081    New Loci (2): 1365
Correct Name Ratio:
We evaluate the performance of the scoring function (see Equation 1) by comparing score distributions from a positive dataset containing 4,237 unique sequences from Metazoan species deposited in miRBase (Release 14) to a negative dataset composed of dinucleotide-shuffled versions of the sequences in the positive control (see Implementation). The score distributions from both positive and negative control sequences are shown in Additional file 1, Figure S1. This illustrates that real miRNAs score significantly better than shuffled miRNAs according to the scoring function described above. It also allows us to derive reasonable thresholds for large-scale mapping projects that balance sensitivity and specificity (Additional file 1, Table S4).
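As a minimal sketch of how a threshold can be derived from the two score distributions, the code below computes sensitivity and specificity at a given cut-off and picks the cut-off that maximises their sum. It assumes that higher scores indicate better candidates; the selection rule (Youden's J) and the function names are illustrative choices, not necessarily the procedure used to produce Table S4.

```python
def sensitivity_specificity(pos_scores, neg_scores, threshold):
    """Sensitivity and specificity when scores >= threshold are accepted."""
    true_pos = sum(s >= threshold for s in pos_scores)
    true_neg = sum(s < threshold for s in neg_scores)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)


def best_threshold(pos_scores, neg_scores):
    """Observed score that maximises sensitivity + specificity (Youden's J)."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    return max(
        candidates,
        key=lambda t: sum(sensitivity_specificity(pos_scores, neg_scores, t)),
    )


# Toy scores for illustration only; these are not MapMi results.
positives = [30, 35, 40, 28]
negatives = [5, 10, 25, 15]
threshold = best_threshold(positives, negatives)
print(threshold, sensitivity_specificity(positives, negatives, threshold))
```

Other selection rules, such as fixing a minimum specificity and maximising sensitivity, fit the same pattern by changing the `key` function.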
To assess the performance of our pipeline when predicting miRNA orthologues in a more general context, we analysed MapMi predictions of horse miRNAs. Horse was chosen because it was introduced only in the latest release of miRBase. We used Metazoan miRNAs deposited in miRBase v13, which do not include horse sequences, to predict horse miRNAs. The overlap between MapMi predictions and horse miRNAs deposited in miRBase v14 was 82.99% (Additional file 1, Table S5). The ability of our classifier function to distinguish miRNA hairpins from other genomic hairpins was verified by classifying 8,494 previously reported non-miRNA hairpins. We obtained a correctly classified ratio of 93.14%.
Further verification was done for the genomes for which miRBase genomic coordinates are available, to assess how many MapMi predictions overlap with miRBase annotated miRNA loci and how many of those are correctly named. We found that MapMi predictions overlap 85.05% of miRBase loci (5,081 of 5,974), and that 99.09% of the overlapping predictions are assigned the same name as in miRBase (Table 1).
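The comparison against annotated loci can be sketched as follows. This is an illustrative reconstruction, not the evaluation code used for Table 1: the `Locus` record, the requirement that overlapping loci share chromosome and strand, and the function names are all assumptions.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # inclusive
    strand: str
    name: str


def overlaps(a: Locus, b: Locus) -> bool:
    """True if two loci share at least one base on the same chromosome and strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def compare(predicted: list[Locus], annotated: list[Locus]) -> tuple[int, int, int]:
    """Return (annotated loci overlapped, of those with a same-name hit, new loci)."""
    recovered = [a for a in annotated
                 if any(overlaps(p, a) for p in predicted)]
    same_name = [a for a in recovered
                 if any(overlaps(p, a) and p.name == a.name for p in predicted)]
    new = [p for p in predicted
           if not any(overlaps(p, a) for a in annotated)]
    return len(recovered), len(same_name), len(new)


# Toy loci for illustration only.
annotated = [Locus("1", 100, 180, "+", "mir-1"),
             Locus("1", 500, 580, "-", "mir-2"),
             Locus("2", 50, 130, "+", "mir-3")]
predicted = [Locus("1", 105, 185, "+", "mir-1"),
             Locus("1", 510, 590, "-", "mir-2b"),
             Locus("3", 10, 90, "+", "mir-4")]
print(compare(predicted, annotated))
```

With the totals reported above, 5,081 overlapping loci out of 5,974 miRBase loci corresponds to the quoted 85.05%. For genome-scale data the quadratic scan would normally be replaced by an interval index, but the counting logic stays the same.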
Nine miRNAs appear to be highly conserved across the majority of species (Additional file 1, Table S6). These miRNAs include the well-known let-7 family, previously known to be highly conserved. Conversely, a total of 636 miRNAs were found to be species-specific, mostly in Chicken, C. elegans, Cow, Platypus, Human and Mouse. This result may arise because some organisms are more heavily profiled (e.g. Human and Mouse). Additionally, some species have few related species available for comparison (e.g. X. tropicalis) and would therefore appear to have an excess of species-specific miRNAs. Saccharomyces cerevisiae is not believed to possess machinery for miRNA processing; however, it is present in Ensembl and was retained as a negative control. As expected, no miRNAs were found in S. cerevisiae. These results indicate that while miRBase has excellent coverage of many species, it appears to capture only a fraction of the total number of miRNAs in some species. Hence, we believe that these results can complement miRBase.
We present a new system for miRNA mapping through sequence similarity and secondary structure, which is available both as a stand-alone tool and an online web resource. We demonstrate the selectivity and sensitivity of the approach on a variety of datasets and have applied it to a large number of genomes. This is particularly useful for recently sequenced genomes where miRNA information may be absent or sparse. Using this approach we have mapped miRNA loci in 67 genomes, many of which are not present in miRBase. We provide a web-interface and a database of pre-computed miRNA loci, multiple sequence alignments and phylogenetic trees for many genomes and we hope this system will prove useful to the community.
Project name: MapMi
Project home page: http://www.ebi.ac.uk/enright-srv/MapMi/
Operating system(s): Platform independent (Web-service), Linux and MacOS X (Standalone version)
Programming language: Perl
Other requirements: Bowtie, Dust, RNAfold (For standalone version only)
License: GNU GPL
The authors would like to acknowledge the other members of the Enright lab at EMBL-EBI.
We also thank Tzu-Hao Chang of National Central University, Taiwan for kindly making a standalone version of the RNALogo software available.
JAG-A would like to thank Mali Salmon-Divon (EMBL-EBI) for useful comments and discussion and Gregory Jordan (EMBL-EBI) for assisting with the integration of PhyloWidget.
JAG-A is a member of Clare Hall College of the University of Cambridge.
JAG-A was supported by fellowships SFRH/BI/33193/2007 and SFRH/BD/33527/2008 from the Fundação para a Ciência e Tecnologia as part of the Ph.D. Program in Computational Biology of the Instituto Gulbenkian de Ciência, Oeiras, Portugal. The Ph.D. program is also sponsored by Fundação Calouste Gulbenkian and Siemens SA.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.