Human microRNA target analysis and gene ontology clustering by GOmir, a novel stand-alone application
© Roubelakis et al; licensee BioMed Central Ltd. 2009
Published: 16 June 2009
microRNAs (miRNAs) are single-stranded RNA molecules of about 20–23 nucleotides in length found in a wide variety of organisms. miRNAs regulate gene expression by interacting with target mRNAs at specific sites in order to induce cleavage of the message or inhibit translation. Predicting or verifying the mRNA targets of specific miRNAs is a difficult process of great importance.
GOmir is a novel stand-alone application consisting of two separate tools: JTarget and TAGGO. JTarget integrates miRNA target prediction and functional analysis by combining the predicted target genes from the TargetScan, miRanda, RNAhybrid and PicTar computational tools with the experimentally supported targets from TarBase, and by providing a full gene description and functional analysis for each target gene. TAGGO, on the other hand, is designed to automatically group gene ontology annotations, taking advantage of the Gene Ontology (GO), in order to extract the main attributes of sets of proteins. GOmir thus integrates two separate Java applications into one stand-alone Java application.
GOmir, using up to five different databases, reports predicted miRNA targets accompanied by (a) a full gene description, (b) a functional analysis and (c) a detailed gene ontology clustering. Additionally, a reverse search initiated by a potential target can also be conducted. GOmir can be freely downloaded from BRFAA.
microRNAs (miRNAs) are 20- to 23-nucleotide-long single-stranded RNAs that post-transcriptionally regulate gene expression [1, 2]. miRNAs inhibit the translation of mRNA into protein and promote mRNA degradation. In this way, miRNAs play an important role in various cell processes such as proliferation, differentiation, apoptosis and development, as well as in cancer and various other diseases [1, 2], and thus represent potential targets for therapeutic applications. The biogenesis of miRNAs is a complicated process involving two different cellular compartments. First, in the nucleus, a primary miRNA (pri-miRNA) is transcribed from the genomic DNA by RNA polymerase II. The size of this primary product varies from 100 to 1000 nucleotides in length. Then, the pri-miRNA is cleaved by Drosha and DGCR8 to form a hairpin loop precursor called pre-miRNA. The 60–70 nucleotide long pre-miRNA is loaded onto Exportin 5 and Ran-GTP in order to be exported into the cytoplasm. A mature miRNA (20–23 nucleotides) is then released by the RNase III endonuclease complex including Dicer and the trans-activator RNA (tar)-binding protein TRBP. The mature miRNA then inhibits translation of an mRNA into a protein by imperfect base pairing to one or more mRNA sequences [1, 4, 5]. The identification of human miRNAs and their respective targets is of great importance and involves both computational and experimental approaches. Prediction servers such as TargetScan, miRanda, RNAhybrid, PicTar and the recent DIANA-MicroT 3.0 give information on miRNA-target interactions. Recent reports have described computationally correlated expression of miRNAs and their target mRNAs and proteins, giving a detailed functional description of the latter [4, 11]. Herein, we describe GOmir, a new stand-alone application for human miRNA target prediction and ontology clustering, consisting of two different components, JTarget and TAGGO.
JTarget combines the data from four different prediction databases (TargetScan, miRanda, RNAhybrid and PicTar) and also from the experimental database TarBase, whereas TAGGO gives detailed assignments from Gene Ontology (GO) resources to gene products. TAGGO uses one of the most reliable biological ontologies, the Gene Ontology, the main goal of which is to provide a well structured, precisely defined and controlled vocabulary for describing the roles of genes and gene products in any organism. GO was initiated back in 1998, as a collaborative effort to build consistency of gene product descriptions among different databases, initially including three model organisms. Since then, many plant, animal and microbial genomes have been assimilated. GO was developed into three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. Thus, GOmir serves as a reliable tool for miRNA target prediction and, more interestingly, provides assignments from GO resources for these gene products, exploring in this way the functional aspects of miRNAs in more detail.
Table: Number of common targets found for several miRNAs from 1, 2, 4 or 5 databases (recovered column headings: "TargetScan and miRanda", "Compare All (4)").
Description of the common target genes of miR-21 from 4 databases
BOL, BOULE-LIKE (DROSOPHILA)
BROMODOMAIN CONTAINING 1
CHROMOSOME 4 OPEN READING FRAME 16
CHROMODOMAIN HELICASE DNA BINDING PROTEIN 7
CILIARY NEUROTROPHIC FACTOR RECEPTOR
NUCLEAR FACTOR I/B
POLY(RC) BINDING PROTEIN 1
RAS P21 PROTEIN ACTIVATOR (GTPASE ACTIVATING PROTEIN RAS P21)
STROMAL ANTIGEN 2
TRANSFORMING GROWTH FACTOR, BETA-INDUCED, 68KDA
TRANSIENT RECEPTOR POTENTIAL CATION CHANNEL, SUBFAMILY M, MEMBER 7
Functions of the common target genes of miR-21 from 5 databases
NFIB is capable of activating transcription and replication.
RASA1 is an inhibitory regulator of the Ras-cyclic AMP pathway.
STAG2 is a component of cohesin complex, required for the cohesion of sister chromatids after DNA replication.
NTF3 promotes the survival of visceral and proprioceptive sensory neurons.
TarBase comparative analyses
Using miR-21 as an example, TargetScan predicts 186 targets, whereas miRanda, RNAhybrid, PicTar-4 way and PicTar-5 way predict 832, 274, 151 and 41 targets, respectively, for the same microRNA (Figure 2B). A GOmir analysis comparing 4 databases (TargetScan, miRanda, RNAhybrid and PicTar-4 way) revealed 12 common predicted targets. To date, TarBase describes 4 experimentally supported targets of miR-21 (PDCD4, TPM1, SERPINB5 and PTEN). Of these, PDCD4 was also predicted by the GOmir comparative analysis of 3 out of 4 databases (TargetScan, miRanda and PicTar-4 way), which illustrates the importance of selecting targets common to different miRNA prediction databases.
Gene ontology analysis by TAGGO
miR-21 regulates TGF-β target in HeLa cells
GOmir is a novel stand-alone application designed to elucidate the interactions of human miRNAs with their respective targets by using the data sets retrieved from four different computational miRNA target prediction databases, increasing in this way the validity of the results. In this study, the RNAhybrid database was included for the first time in a computational tool that combines the results from different miRNA databases. The validity of the computationally predicted targets is confirmed by recent experimental studies for certain miRNAs. For example, GOmir indicated NFI-A as a possible target for miR-223 by comparing 4 out of 4 databases (miRBase, TargetScan, RNAhybrid and PicTar-4 way). Further experimental studies by Fazi et al. confirmed this prediction and showed that miR-223 plays a crucial role during granulopoiesis by downregulating NFI-A. GOmir provides a detailed gene description of the predicted targets accompanied by a functional analysis. A reverse search initiated by a potential target can also be performed to find the predicted interacting miRNAs. Comparison with the experimentally supported target database, TarBase, is also provided. In a second step, a detailed gene ontology clustering, including all the respective graphic charts and diagrams for the predicted targets, is provided by the TAGGO module of GOmir. In this way, any group of human miRNAs and respective targets can be analysed, with functional and ontology information provided easily, in a short period of time and without using a web-based interface. The resulting common targets among different databases for a given miRNA may facilitate the selection of candidates for further experimental analyses. In this respect, our preliminary results on experimentally validating GOmir comparative predictions showed that miR-21 regulates TGF-β expression at the mRNA level.
GOmir is a stand-alone application for studying miRNA interactions with the respective targets by using the data sets retrieved from miRBase, TargetScan, RNAhybrid, PicTar-4 way and PicTar-5 way, as well as from the experimental database TarBase. GOmir provides a detailed gene description of the predicted targets accompanied by functional and gene ontology analyses.
1) Data acquisition
Data derived from human miRNA target prediction tools, such as TargetScan, miRanda, RNAhybrid, PicTar and TarBase, were used. The TargetScan database was obtained from the TargetScan website. For the miRanda tool, the most up-to-date data were downloaded from the miRBase (Sanger Centre) web site. Similarly, the data from the RNAhybrid database were retrieved from the mirnamap website. The PicTar data were obtained from the UCSC genome browser database. Finally, the TarBase data were retrieved from the DIANA lab website. The database files were processed in order to retain only the human target genes. For gene description and functional analyses, three database files were downloaded from the DAVID Bioinformatics database, in order to implement the "Find gene description" and "Find gene function" applications and correlate in this way each gene product with a description and a function analysis, respectively.
2) Data integration
The database files from the four miRNA target prediction tools were reduced to the human-related information in order to minimise their size, and all human miRNAs were paired with their respective targets. The TargetScan database file contains miRNA families instead of individual miRNAs. Therefore, the miRNA families file corresponding to the respective miRNAs was downloaded as well. Different gene ID systems (RefSeq ID, gene symbol, Ensembl ID) are used among different databases. In order to correlate the data among different data sets, the NCBI website and the DAVID Bioinformatics Database were used. The files downloaded from the DAVID database for the "Find genes description" and "Find genes functions" procedures contained pairs of DAVID ID number/gene symbol, DAVID ID number/gene description and DAVID ID number/gene functions, and were reduced to the human-related information. For the functionality and performance of the application, we decided to create a database from all the files described above. We used SQLite, a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine, which is ideal for internal databases used by distributable, stand-alone applications. We imported the information from our data files into a SQLite database file, which is necessary for JTarget functionality and is downloaded along with the entire application installation package.
For JTarget, the search for miRNA target genes within a single database is implemented by executing a "SELECT microRNA, target FROM database_name WHERE microRNA=microRNA_name" SQL query against the database. The common targets from several database tools are found by performing inner joins among the results of the respective "SELECT" statements. JTarget offers further functionality besides the prediction of common miRNA target genes. After target gene selection for a given miRNA, the user can search for a description of these target genes or for their functions. These two options are implemented by executing "SELECT" queries against the entire database in order to correlate each gene with a description and/or functions, respectively. Finally, the JTarget tool is connected to TAGGO through a button named "TAGGO", which enables the clustering of the genes. A temporary file is constructed from the output file of a miRNA target search and is then passed to the TAGGO tool.
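The single-database query and the inner-join search for common targets described above can be sketched as follows. This is a minimal illustration in Python with the standard `sqlite3` module, not GOmir's Java code: the table names (one hypothetical table per prediction tool), the helper names and the two toy rows per tool are assumptions. The toy rows only mirror what the text reports for miR-21 (NFIB common to all four tools, PDCD4 missing from RNAhybrid) and are not database contents.

```python
import sqlite3

# Hypothetical table names, one per prediction tool (not GOmir's real schema).
TOOLS = ("targetscan", "miranda", "rnahybrid", "pictar4")


def build_demo_db():
    """Create an in-memory SQLite database holding a few toy miRNA/target pairs."""
    conn = sqlite3.connect(":memory:")
    for tool in TOOLS:
        conn.execute(f"CREATE TABLE {tool} (microRNA TEXT, target TEXT)")
    toy_rows = {
        "targetscan": ["NFIB", "PDCD4"],
        "miranda": ["NFIB", "PDCD4"],
        "rnahybrid": ["NFIB"],
        "pictar4": ["NFIB", "PDCD4"],
    }
    for tool, genes in toy_rows.items():
        conn.executemany(
            f"INSERT INTO {tool} VALUES (?, ?)",
            [("hsa-miR-21", gene) for gene in genes],
        )
    return conn


def targets_in(conn, mirna, tool):
    """Single-database search, following the SELECT statement quoted in the text."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    sql = f"SELECT microRNA, target FROM {tool} WHERE microRNA = ?"
    return sorted(row[1] for row in conn.execute(sql, (mirna,)))


def common_targets(conn, mirna, tools):
    """Targets predicted for `mirna` by every tool in `tools`, via inner joins."""
    if not tools or any(tool not in TOOLS for tool in tools):
        raise ValueError("unknown or empty tool list")
    sql = f"SELECT DISTINCT t0.target FROM {tools[0]} AS t0"
    for i, tool in enumerate(tools[1:], start=1):
        sql += (
            f" INNER JOIN {tool} AS t{i}"
            f" ON t{i}.microRNA = t0.microRNA AND t{i}.target = t0.target"
        )
    sql += " WHERE t0.microRNA = ? ORDER BY t0.target"
    return [row[0] for row in conn.execute(sql, (mirna,))]


if __name__ == "__main__":
    db = build_demo_db()
    print(common_targets(db, "hsa-miR-21", list(TOOLS)))
    print(common_targets(db, "hsa-miR-21", ["targetscan", "miranda", "pictar4"]))
```

Table names are checked against a fixed whitelist before being interpolated into the SQL string, because SQLite placeholders can bind values but not identifiers.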
Gene Ontology (GO) is divided into three ontology aspects which yield information common to all living organisms. The Molecular Function (MF) and Cellular Component (CC) aspects answer the questions of what a gene product does and where its active form can be found, whereas the Biological Process (BP) aspect clarifies the biological objective of a gene product. Each ontology aspect is structured as a Directed Acyclic Graph (DAG), a graph with no cyclic paths (no loops), with its nodes representing the ontology terms (and their intrinsic properties) and its edges the relations between the nodes. Since the ontology is a DAG, each term can have more than one parent and thus multiple paths connecting it to the root. Each GO term has a unique identifier which is used as a database cross-reference in the collaborating databases. Each gene product-GO term pair is followed by an Evidence Code (EC) which indicates how an annotation to a particular term is supported. There are fourteen different ECs. In general, the higher the specificity of a term, the lower its level inside the ontology hierarchy, and vice versa. Proteins are often annotated with terms of medium or low level in the ontology. This provides a huge amount of information that can be misleading when the aim is to pinpoint the main characteristics and functions of a protein or a protein set. To obtain a more global view of the attributes of a protein or a protein set, a way to assign more general terms to proteins is needed. Finding more general categories for the function and localization of a protein is equivalent to tracking the most general terms of GO which are relevant to its annotated GO terms, as more generic terms (those high in the ontology) mainly serve as abstractions which demonstrate the broader role of their children. GO is continuously expanding and improving its structure, thus serving as a dynamic ontology.
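To make the DAG property concrete, the sketch below walks a small toy graph in which one term has two parents, and collects both its full ancestor set and every path to the root. The term labels (`root`, `a`, `b`, `c`, `d`) and the function names are hypothetical placeholders and do not correspond to real GO identifiers or to TAGGO's code.

```python
# Toy DAG stored as child -> list of parents; "c" has two parents, as GO terms may.
TOY_PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c"],
}


def all_ancestors(term, parents):
    """Collect every ancestor of `term`, following all parent links."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def paths_to_root(term, parents):
    """Enumerate every path from `term` up to a parentless (root) term."""
    direct = parents.get(term, ())
    if not direct:
        return [[term]]
    return [[term] + path for p in direct for path in paths_to_root(p, parents)]


if __name__ == "__main__":
    print(sorted(all_ancestors("d", TOY_PARENTS)))
    for path in paths_to_root("d", TOY_PARENTS):
        print(" -> ".join(path))
```

Because `c` has two parents, the leaf `d` reaches the root along two distinct paths, which is why any method that generalises a term has to consider all of them.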
TAGGO implements general terms in the GO DAG structure and automatically produces biologically meaningful results. A method to estimate the specificity of a term is the evaluation of its Information Content (IC). In Algorithmic Information Theory, the information content of an individual object is a measure of the degree of difficulty to define or describe that object. In other words, high information content implies more intense effort to process an object. In biological terms, the higher the information content of a GO term, the more specific this term is, and vice versa. To translate this theoretical definition into a mathematical formula, it is necessary to consider that the number of times a term occurs denotes how general this term is. It is not even necessary to encounter the term itself; encountering any of its children suffices. According to the True Path Rule, a rule imposed in order to ensure the validity of GO entries, the pathway from a child term to its top-level parent(s) must always be true. In other words, a term holds all the attributes of its ancestors and can be considered one of them. Measurement of the degree of specificity of a term is complicated by the fact that the local density of GO terms and the length of branches vary. Furthermore, "leaves" should have the same IC, as they provide the most detailed descriptions at a given time.
ICnormal(c), the Normalised Information Content of a term c, ranges from 0 (root) to 1 (leaf).
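The formula for the normalised IC is not reproduced in the text above, so the sketch below uses a stand-in and is not necessarily the formula TAGGO uses. It assumes a structure-based definition, IC(c) = 1 - log(desc(c) + 1) / log(N), where desc(c) is the number of descendants of term c and N is the total number of terms. This choice satisfies the properties stated in the text: the root scores 0, every leaf scores 1, and a term is counted through any of its children. The toy graph and all function names are hypothetical.

```python
import math

# Toy DAG stored as child -> list of parents (hypothetical labels, not GO IDs).
TOY_PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c"],
}


def descendant_counts(parents):
    """Count, for every term, how many distinct terms lie below it in the DAG."""
    children = {term: set() for term in parents}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, set()).add(child)
    counts = {}
    for term in children:
        seen = set()
        stack = list(children[term])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(children.get(node, ()))
        counts[term] = len(seen)
    return counts


def normalised_ic(parents):
    """Assumed normalised IC: 0 for the root, 1 for every leaf."""
    counts = descendant_counts(parents)
    n_terms = len(counts)
    if n_terms < 2:
        raise ValueError("at least two terms are needed to normalise")
    return {
        term: 1.0 - math.log(n_desc + 1) / math.log(n_terms)
        for term, n_desc in counts.items()
    }


if __name__ == "__main__":
    for term, value in sorted(normalised_ic(TOY_PARENTS).items()):
        print(f"{term}: {value:.3f}")
```

With this definition the specificity of a term depends only on how much of the ontology sits beneath it, which sidesteps the varying branch lengths mentioned above. A corpus-based IC, computed from annotation frequencies, would be an equally plausible reading of the text.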
The main input file of TAGGO is a list of proteins produced experimentally, e.g. by a large scale analysis. A SwissProt accession number, gene symbol or International Protein Index (IPI) identifier can be used to identify each protein. To load the GO structure, a GO file is used as input. The OBO v.1.0, OBO v.1.2, GO (which is deprecated), OBO-XML and RDF formats are supported. To map gene products to GO terms, the organism of origin must be selected. This triggers the program to load the GO annotation (GOA) file corresponding to the selected species. Each GOA entry provides information about the database which contributes the annotation, the date of the annotation, the object which is annotated, its synonym, its type (e.g. gene, transcript, protein), its assigned GO term, the ontology to which this term belongs and evidence about the credibility of the annotation. Users are strongly advised to use the latest GO and GOA files, which can be downloaded from the GO FTP site. To increase versatility and robustness, the user has the opportunity to exclude GOA entries supported by less reliable Evidence Codes (ECs). Thus, the output file may hold only the GO annotations of the input proteins that are supported by well established methods. To exclude very generic terms from the classification, non-desired terms can be specified and a normalised information content threshold can be set for each of the three aspects (the default value is 0.04, i.e. 4%). Finally, the directory where the results will be stored is chosen and all data are submitted. When the program starts running, a file which contains the GO terms of each protein for all three aspects is created. Then, the protein dataset is categorised into general GO terms, as follows: all parents of each term are found (considering all possible pathways to the root) and sorted according to ascending information content.
The most general term which does not belong to the non-desired terms of the corresponding aspect is considered a category, unless all of the ten most general parents belong to the user-specified non-desired terms; in that case, the term is assigned to the "NO ENTRY" category. The proteins with their assigned GO categories are gathered and duplicate categories for a given protein are removed. The output is visualised in pie and bar charts which show the percentage of each GO category in the given protein dataset, in all GO aspects. Moreover, Venn lists for all aspects are created to indicate the overlaps of GO categories for the given protein dataset. These lists can be imported into VennMaster to create Venn diagrams. These three types of output indicate how many proteins share a common GO category. The analysis performed describes general aspects and functions of the proteins.
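The categorisation procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than TAGGO's implementation: the toy DAG, the hard-coded IC values, the protein identifiers and all function names are hypothetical. It is also assumed that ancestors with an IC below the threshold are skipped in the same way as non-desired terms, and ties in IC are broken alphabetically.

```python
from collections import Counter

# Toy DAG (child -> parents) with illustrative normalised IC values.
TOY_PARENTS = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"], "d": ["c"]}
TOY_IC = {"root": 0.0, "a": 0.32, "b": 0.32, "c": 0.57, "d": 1.0}


def all_ancestors(term, parents):
    """Collect every ancestor of `term` over all paths to the root."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def pick_category(term, parents, ic, non_desired, threshold=0.04, window=10):
    """Return the most general acceptable ancestor of `term`, else "NO ENTRY"."""
    ranked = sorted(all_ancestors(term, parents), key=lambda t: (ic[t], t))
    for candidate in ranked[:window]:  # only the ten most general parents
        if candidate not in non_desired and ic[candidate] >= threshold:
            return candidate
    return "NO ENTRY"


def categorise_proteins(annotations, parents, ic, non_desired):
    """Map each protein to its set of categories (a set removes duplicates)."""
    return {
        protein: {pick_category(t, parents, ic, non_desired) for t in terms}
        for protein, terms in annotations.items()
    }


def category_percentages(assigned):
    """Percentage of proteins carrying each category, as plotted in the charts."""
    counts = Counter(cat for cats in assigned.values() for cat in cats)
    return {cat: 100.0 * n / len(assigned) for cat, n in counts.items()}


if __name__ == "__main__":
    annotations = {"P1": ["d"], "P2": ["c", "d"], "P3": ["a"]}
    assigned = categorise_proteins(annotations, TOY_PARENTS, TOY_IC, {"root"})
    print(assigned)
    print(category_percentages(assigned))
```

In this toy run the protein annotated only with `a` falls into "NO ENTRY", because its single ancestor is the excluded root. This follows the text literally, in that only the parents of an annotated term are considered as candidate categories, not the term itself.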
GOmir GUI implementation and prerequisites
Both tools were developed in the Java programming language. We used Swing, the widget toolkit for Java, to develop the graphical user interface. For the JTarget database implementation, we selected the SQLite SQL database engine, which does not need any server to be installed and is very compact. In addition, the Spring Framework and JFreeChart libraries were used for the implementation of the TAGGO chart functionality. GOmir can be installed on any Microsoft Windows or Linux operating system with Java Runtime Environment 1.5.0 (JRE 5.0) pre-installed.
The human HeLa cell line was obtained from the American Type Culture Collection (ATCC, Manassas, VA). Cells were maintained in Dulbecco's modified Eagle's medium (DMEM; Sigma-Aldrich Ltd, Gillingham, Dorset, UK) supplemented with 10% (v/v) fetal bovine serum (FBS) (Gibco-BRL, Paisley, Scotland, UK).
miR-21 mimic (Applied Biosystems, Foster City, CA) at a concentration of 0.4 μM, miR-21 antagonist at a concentration of 0.3 μM (Exiqon, Vedbaek, Denmark), or miR-21 scrambled antagonist at a concentration of 0.3 μM (Exiqon, Vedbaek, Denmark) were transiently transfected independently into HeLa cells using the Lipofectamine 2000 reagent (Gibco-BRL) in a 1:2.5 ratio, according to the manufacturer's protocol.
RNA extraction and semi-quantitative RT-PCR analysis of cells
RNAs from transfected or non-transfected HeLa cells were extracted with Trizol (Gibco-BRL). cDNAs were reverse transcribed from 1 μg of RNA using the MMLV reverse transcriptase enzyme and kit (Promega Ltd, Madison, WI) according to the manufacturer's instructions. PCR analysis was carried out using the following primer pair: hTGF-β1 F: 5'-GCAACAATTCCTGGCGATACC-3' and hTGF-β1 R: 5'-GCCCTCAATTTCCCCTCCAC-3'. Semi-quantitative PCR analysis of the hTGF-β1 transcript was performed using the Dolphin ID imaging software (Dolphin Imaging, Chatsworth, CA, USA) after normalizing to the β-actin endogenous control (primers for β-actin, F: 5'-TCTACAATGAGCTGCGTGTG-3' and R: 5'-CAACTAAGTCATAGTCCGCC-3').
We would like to thank Karin Söderman and Fotis Psomopoulos for offering generous help and constructive comments throughout the work.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 6, 2009: European Molecular Biology Network (EMBnet) Conference 2008: 20th Anniversary Celebration. Leading applications and technologies in bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S6.
- Buchan JR, Parker R: Molecular biology. The two faces of miRNA. Science 2007, 318(5858):1877–1878. doi:10.1126/science.1152623
- Chen JF, Mandel EM, Thomson JM, Wu Q, Callis TE, Hammond SM, Conlon FL, Wang DZ: The role of microRNA-1 and microRNA-133 in skeletal muscle proliferation and differentiation. Nat Genet 2006, 38(2):228–233. doi:10.1038/ng1725
- Song L, Tuan RS: MicroRNAs and cell differentiation in mammalian development. Birth Defects Res C Embryo Today 2006, 78(2):140–149. doi:10.1002/bdrc.20070
- Megraw M, Sethupathy P, Corda B, Hatzigeorgiou AG: miRGen: a database for the study of animal microRNA genomic organization and function. Nucleic Acids Res 2007, 35(Database issue):D149–155. doi:10.1093/nar/gkl904
- Selbach M, Schwanhausser B, Thierfelder N, Fang Z, Khanin R, Rajewsky N: Widespread changes in protein synthesis induced by microRNAs. Nature 2008, 455(7209):58–63. doi:10.1038/nature07228
- Lewis BP, Burge CB, Bartel DP: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 2005, 120(1):15–20. doi:10.1016/j.cell.2004.12.035
- John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS: Human MicroRNA targets. PLoS Biol 2004, 2(11):e363. doi:10.1371/journal.pbio.0020363
- Rehmsmeier M, Steffen P, Hochsmann M, Giegerich R: Fast and effective prediction of microRNA/target duplexes. RNA 2004, 10(10):1507–1517. doi:10.1261/rna.5248604
- Krek A, Grun D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P, da Piedade I, Gunsalus KC, Stoffel M, et al.: Combinatorial microRNA target predictions. Nat Genet 2005, 37(5):495–500. doi:10.1038/ng1536
- Maragkakis M, Alexiou P, Papadopoulos LG, Reczko M, Simossis AV, Riback M, Kourtis K, Goumas G, Koukis K, Dalamagas T, et al.: DIANA-MicroT 3.0: An integrative function analysis tool for microRNAs. 2008, in press.
- Nam S, Kim B, Shin S, Lee S: miRGator: an integrated system for functional annotation of microRNAs. Nucleic Acids Res 2008, 36(Database issue):D159–164.
- Sethupathy P, Corda B, Hatzigeorgiou AG: TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA 2006, 12(2):192–197. doi:10.1261/rna.2239606
- Gene Ontology (GO)[http://geneontology.org/]
- Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4(5):P3. doi:10.1186/gb-2003-4-5-p3
- Fazi F, Rosa A, Fatica A, Gelmetti V, De Marchis ML, Nervi C, Bozzoni I: A minicircuitry comprised of microRNA-223 and transcription factors NFI-A and C/EBPalpha regulates human granulopoiesis. Cell 2005, 123(5):819–831. doi:10.1016/j.cell.2005.09.023
- RNAhybrid, mirnamap[http://mirnamap.mbc.nctu.edu.tw/]
- UCSC genome browser database[http://genome.ucsc.edu/]
- DIANA lab website[http://diana.cslab.ece.ntua.gr/tarbase/]
- DAVID Bioinformatics database[http://david.abcc.ncifcrf.gov/home.jsp]
- NCBI website[http://www.ncbi.nlm.nih.gov/]
- SQL database engine[http://www.sqlite.org/]
- The Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res 2001, 11(8):1425–1433. doi:10.1101/gr.180801
- Chaitin G: Algorithmic Information Theory. In Encyclopedia of Statistical Science. Volume 1. New York: Wiley; 1982:38–41.
- Bérard S, Tichit L, Herrmann C: ClusterInspector: a tool to visualize ontology-based relationships between biological entities. Actes des Journées Ouvertes Biologie Informatique Mathématiques. Lyon 2005, 447–457.
- Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19(10):1275–1283. doi:10.1093/bioinformatics/btg153
- GO FTP site[ftp://ftp.geneontology.org/pub/go/]
- Kestler HA, Muller A, Gress TM, Buchholz M: Generalized Venn diagrams: a new method of visualizing complex genetic set relations. Bioinformatics 2005, 21(8):1592–1595. doi:10.1093/bioinformatics/bti169
- Java Runtime Environment 1.5.0 (JRE 5.0)[http://www.java.com/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.