MitoRes: a resource of nuclear-encoded mitochondrial genes and their products in Metazoa
© Catalano et al; licensee BioMed Central Ltd. 2006
Received: 03 May 2005
Accepted: 24 January 2006
Published: 24 January 2006
Mitochondria are sub-cellular organelles that have a central role in energy production and in other metabolic pathways of all eukaryotic respiring cells. In the last few years, with more and more genomes being sequenced, a huge amount of data has been generated providing an unprecedented opportunity to use the comparative analysis approach in studies of evolution and functional genomics with the aim of shedding light on molecular mechanisms regulating mitochondrial biogenesis and metabolism.
In this context, the problem of the optimal extraction of representative datasets of genomic and proteomic data assumes crucial importance. Specialised resources for nuclear-encoded mitochondria-related proteins already exist; however, no mitochondrial database is currently available with the same features as MitoRes, which is an update of the MitoNuc database extensively modified in its structure, data sources and graphical interface. It contains data on nuclear-encoded mitochondria-related products for any metazoan species for which this type of data is available and also provides comprehensive sequence datasets (gene, transcript and protein) as well as useful tools for their extraction and export.
MitoRes (http://www2.ba.itb.cnr.it/MitoRes/) consolidates information from publicly available external sources and automatically annotates it into a relational database. It also clusters proteins on the basis of their sequence similarity and interconnects them with genomic data. The search engine and sequence management tools allow the query/retrieval of the database content and the extraction and export of sequences (gene, transcript, protein) and related sub-sequences (intron, exon, UTR, CDS, signal peptide and gene flanking regions) ready to be used for in silico analysis.
The tool we describe here has been developed to support lab scientists and bioinformaticians alike in the characterization of molecular features and evolution of mitochondrial targeting sequences. Its facilities for sequence retrieval and extraction let the user overcome the obstacles encountered when combining different bioinformatic resources, and the completeness of the sequence collection allows intra- and interspecies comparison at different biological levels (gene, transcript and protein).
Mitochondria are sub-cellular organelles which play a central role in many important metabolic pathways and are essential for energy production. The control of mitochondrial biogenesis and function strongly depends on the coordinated activity of both the nuclear and mitochondrial genomes and, as a consequence, the molecular mechanisms regulating mitochondrial transcription, translation, post-transcriptional modification, signalling, import, folding and assembly of the mitochondrial components are extremely complicated and still not entirely defined. Part of the effort to define them is the elucidation of transcriptional co-regulation networks, which can be seen as one of the most important levels at which nucleus-mitochondrion network connections emerge. The analysis and comparison of nuclear-encoded mitochondria-related sequences within and between organisms could be of great help for their functional characterization, but it requires a sufficient quantity of data and appropriate tools for their selection, extraction and analysis. The mitochondrial databases currently available that collect data on nuclear-encoded mitochondrial components are essentially protein sequence databases, such as MitoProteome, MitoP2 and the Human Mitochondrial Protein Database (HMPDb). MitoProteome and HMPDb include only human data, while MitoP2 collects mitochondrial protein data on four different organisms: man, mouse, yeast and Neurospora. These databases, although searchable, provide neither transcript and gene sequences nor tools for the massive extraction of sequence data. MitoDrome, a mitochondrial database developed by our group, differs from the above-mentioned databases because it adds the gene and the deduced transcript sequences to protein annotation and provides efficient tools for the retrieval and extraction of sequences, but it is restricted to only three dipteran species: D. melanogaster, D. pseudoobscura and A. gambiae.
MitoRes, the database we present here, is a specialized mitochondrial resource which has been developed to complement the other available mitochondrial databases in their biological utility and application. In particular, it aims to meet the need for a comprehensive resource of mitochondria-related sequences and, to this end, it collects and integrates data on gene, transcript and protein sequences of any metazoan species from the most accredited worldwide sources. MitoRes provides efficient tools for the retrieval and massive extraction of any type of nucleus-encoded mitochondria-related sequence and sub-sequence (i.e., gene, gene flanking regions, intron, exon, transcript, UTR, CDS, protein and signal peptide) ready to be used for in silico analysis.
It also performs intra- and interspecies comparison of the protein sequences collected, and integrates information on protein similarity with genomic data. The integration of the protein and gene data helps users to easily assess the conservation of both the protein sequence and the gene structure when compared with their counterparts in other organisms, thus enabling correlations not possible on the basis of protein similarity alone and facilitating the selection/extraction of the best candidates for further and deeper investigation.
In the following sections we describe the construction and content of MitoRes, its graphical interface and usefulness.
Construction and content
MitoRes is an interconnected knowledge management system consisting of a relational database, a Web Graphical User Interface (GUI) and a sequence export manager tool. It is derived from the MitoNuc database, which has been completely rebuilt and extensively modified to better meet users' needs in terms of the availability of a comprehensive and non-redundant collection of data, as well as flexible access to information and rapid extraction of sequences.
Data sources and implementation
The MitoRes database has been implemented as a relational database using the freely available MySQL Database Management System (DBMS). Three main interconnected building blocks constitute its information content: protein, transcript and gene modules.
The information content of the database is generated by an automated procedure composed of a suite of BioPerl and C programs which retrieve and integrate data from external sources, compare and cluster proteins on the basis of their sequence similarity and populate the database records automatically.
In particular, gene chromosome location, sequence and structural information (exon/intron organization) are extracted from the ENSEMBL genome database. Transcript sequences, polyadenylation recognition signal and location, along with transcript function, are extracted from the RefSeq database, while UTR regions are generated using the UTR database as a reference. Protein sequences, along with information on sub-cellular location, tissue specificity, Enzyme Classification code, signal peptide, gene name and its synonyms (if any) are extracted from the UniProt database. The NCBI Taxonomy database is used as a reference for information on the biological source of the protein and the GO database for protein classification.
The list of UniProt ACs obtained through this procedure is then used to populate the MitoRes database using two different programs that act in consecutive steps. The first one, a BioPerl script using the Perl APIs of ENSEMBL, retrieves related genomic information from the ENSEMBL database. The ENSEMBL transcripts extracted, which are only those fully supported by a UniProt link, are checked for consistency by comparing their translated sequence against the UniProt sequence with the blast2seq program, and are then used to populate the gene module tables. The second program, a C program using the EMBOSS and UTRdb C libraries, queries the UniProt, UTR and RefSeq databases and retrieves transcript and protein sequences and associated data to populate the other two MitoRes modules, transcript and protein.
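The consistency check in the first step can be pictured with a minimal sketch. The Python code below is not the authors' pipeline, which relies on BioPerl, the ENSEMBL Perl API and blast2seq; it only translates a coding sequence with the standard genetic code and measures how well the result matches a reference protein. The function names, the ungapped position-by-position comparison (used here in place of a BLAST alignment) and the acceptance threshold are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def identity_fraction(a: str, b: str) -> float:
    """Ungapped identity over the longer sequence (a stand-in for blast2seq)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def is_consistent(cds: str, reference_protein: str, min_identity: float = 1.0) -> bool:
    """Hypothetical acceptance rule: the translated CDS must match the reference."""
    return identity_fraction(translate(cds), reference_protein) >= min_identity
```

A transcript whose translation disagrees with the reference protein would be rejected under this rule; the real procedure may tolerate differences that an exact comparison does not.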
The last step of the MitoRes database annotation is the analysis and clustering of the protein sequences. The protein sequences stored in the MitoRes database are extracted and examined for their sequence similarity using an "all-versus-all" pair-wise global alignment and then clustered into sub-groups (Clusters) on the basis of a threshold sequence similarity value of at least 60%. The entire procedure has been automated by a Perl script which runs the EMBOSS Stretcher program. The procedure iteratively runs the pair-wise alignment for each protein against the entire collection of sequences and generates a cluster for each iterative step. Data deriving from the protein clustering procedure are automatically uploaded, by means of Perl scripts, into a database table, named Cluster, belonging to the protein module. Each Cluster takes its name from its "Leader protein", i.e. the first sequence used by the alignment procedure for comparison with all the others present in the MitoRes collection. According to this analysis, only about 10% of the proteins currently present in MitoRes have no counterpart among the other proteins in the database.
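The clustering step can be illustrated with a short sketch under stated assumptions. The Python code below replaces EMBOSS Stretcher with a plain Needleman-Wunsch global alignment using unit scores, uses percent identity as a simple proxy for the similarity value, and assumes that a protein already placed in a Cluster is not reused as a Leader; none of these choices is confirmed by the description above, and the function names are hypothetical.

```python
def global_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Percent identity over the columns of a Needleman-Wunsch global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back one optimal path, counting identical and total columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return 100.0 * matches / columns if columns else 0.0


def cluster_proteins(proteins: dict[str, str], threshold: float = 60.0) -> dict[str, list[str]]:
    """Greedy leader clustering: each still-unassigned protein becomes a Leader."""
    clusters: dict[str, list[str]] = {}
    unassigned = list(proteins)
    while unassigned:
        leader = unassigned.pop(0)
        members, remaining = [leader], []
        for name in unassigned:
            if global_identity(proteins[leader], proteins[name]) >= threshold:
                members.append(name)
            else:
                remaining.append(name)
        clusters[leader] = members
        unassigned = remaining
    return clusters
```

In this sketch a protein with no counterpart above the threshold ends up as the only member of its own Cluster, which corresponds to the roughly 10% of MitoRes proteins mentioned above.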
The core entity of the MitoRes database is the protein, so that one MitoRes entry is generated for each nuclear-encoded mitochondria-related protein reported in the UniProt database. The association of the related nuclear transcript and gene data depends on the availability of relevant annotation in the RefSeq and ENSEMBL databases respectively; thus some entries may be complete whereas others may contain only information on a protein and transcript or only on a protein. Regardless of the completeness of the information reported, the general structure of a MitoRes entry comprises sections for gene, transcript and protein.
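The protein-centred layout described above can be sketched as three linked tables. The example below uses Python's built-in sqlite3 module rather than MySQL, and every table name, column name and identifier in it is invented for illustration; the actual MitoRes schema is not given in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema: the protein table is the core, the others are optional.
CREATE TABLE protein    (entry_id TEXT PRIMARY KEY, uniprot_ac TEXT NOT NULL);
CREATE TABLE transcript (entry_id TEXT REFERENCES protein(entry_id), refseq_id TEXT);
CREATE TABLE gene       (entry_id TEXT REFERENCES protein(entry_id), ensembl_id TEXT);

-- Made-up placeholder records, not real accessions.
INSERT INTO protein    VALUES ('E1', 'UNIPROT_AC_1'), ('E2', 'UNIPROT_AC_2'), ('E3', 'UNIPROT_AC_3');
INSERT INTO transcript VALUES ('E1', 'REFSEQ_ID_1'), ('E2', 'REFSEQ_ID_2');
INSERT INTO gene       VALUES ('E1', 'ENSEMBL_ID_1');
""")


def entry_completeness(connection: sqlite3.Connection) -> dict[str, str]:
    """Report which modules carry data for each protein-centred entry."""
    rows = connection.execute("""
        SELECT p.entry_id,
               t.refseq_id  IS NOT NULL AS has_transcript,
               g.ensembl_id IS NOT NULL AS has_gene
        FROM protein p
        LEFT JOIN transcript t ON t.entry_id = p.entry_id
        LEFT JOIN gene g       ON g.entry_id = p.entry_id
        ORDER BY p.entry_id
    """).fetchall()
    labels = {}
    for entry_id, has_transcript, has_gene in rows:
        label = "protein"
        if has_transcript:
            label += " + transcript"
        if has_gene:
            label += " + gene"
        labels[entry_id] = label
    return labels
```

The LEFT JOINs keep every protein in the result even when no transcript or gene annotation exists, which mirrors the statement that some entries contain only a protein.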
Each database entry is identified by an Identification (ID) code and by an Accession number (AC) which serves for unambiguous retrieval and citation from release to release. Gene and transcript sequences are graphically represented through the dynamic construction of physical maps that describe their structure, sequence orientation and genomic localization. A tool tip window displays exon/intron features of the gene (number, length, start and end positions on the genomic sequence). Two clickable buttons at the top of the entry, "Associated Cluster" and "Export sequence", give the user direct access to the associated protein Cluster entry and to the export management tool for sequence extraction, respectively.
In some cases, the same entry code is assigned to more than one MitoRes entry that can be distinguished by the presence of a '_' sign followed by a progressive number. This happens when more than one copy of a gene, differing in their number of exons or in their genomic locations, has been annotated for the same protein in the ENSEMBL database. The convention adopted by MitoRes of generating different entries for the same protein for each available gene annotation allows the user to be aware of their existence and to extract only the sequences that are more suitable for their own analysis.
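The naming convention can be sketched in a few lines of Python. The function below is hypothetical and assumes that the progressive number follows the order in which gene annotations are supplied; the article does not say how MitoRes orders them. The mouse COX5B codes and chromosomes are those of the example discussed later in the text, while the third record is a made-up placeholder.

```python
from collections import Counter


def assign_entry_ids(annotations: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Append '_<n>' to an entry code only when it has several gene annotations."""
    totals = Counter(code for code, _ in annotations)
    seen: Counter = Counter()
    result = []
    for code, location in annotations:
        if totals[code] > 1:
            seen[code] += 1
            result.append((f"{code}_{seen[code]}", location))
        else:
            result.append((code, location))
    return result


example = assign_entry_ids([
    ("MMUSCOX5B", "chromosome 13"),
    ("MMUSCOX5B", "chromosome 1"),
    ("EXAMPLE", "single annotation"),  # placeholder entry code
])
```

Entries with a single gene annotation keep their plain code, so the suffix itself signals to the user that alternative gene copies exist.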
Mitochondria-related sequences, despite being available for a large number of metazoan organisms, are dispersed among many heterogeneous resources and this poses a major problem of optimal information extraction. MitoRes tries to fill this gap by collecting and integrating the information on all mitochondrial related sequences in Metazoa from the most accredited worldwide resources and providing a user-friendly web interface through which the user can browse and query the database, extract sequences, compare protein sequence and gene structure among different organisms and perform the alignment of any sequence against the MitoRes collections.
The Web GUI of MitoRes has been built using the Seagull PHP framework. BioPerl modules are used to build the gene and transcript maps and for sequence management in the export tool. A horizontal menu bar, accessible from any MitoRes GUI application, provides links to the MitoRes home page, to the search form, to the Cluster section, to the BLAST search tool and to an extensive on-line users' Manual.
The MitoRes search tool provides several query options and returns a sortable table of the retrieved entries and of the key information available for them. Users can query the MitoRes database through the "Quick search" option, the more elaborate search form and/or through the Cluster section.
The "Quick search" option in the MitoRes home page takes as input a MitoRes ID, UniProt AC and/or gene name and also accepts lists of these search terms.
The query result page presents a sortable list of MitoRes entries matching the query criteria, along with appropriate summary information, set as defaults by the system (organism, gene name, chromosome location and protein description) or chosen by the user during the customization of the query using the "Show" check boxes (see Figure 2A). From the query result page, users can explore the complete contents of each entry and perform the export of associated sequences using the Sequence export form (Figure 2C).
Users can also browse the database using the protein Cluster GUI, where database entries of proteins sharing a certain degree of sequence similarity are grouped in Clusters (see Data sources and implementation paragraph). On this page Clusters are listed in a table that reports the name of the Cluster, the function of the Leader Protein, the list of MitoRes entries belonging to the same Cluster, and provides the link to the Cluster entry.
A good example of the flexibility and utility of MitoRes is the COX5B gene shown in Figure 2. The query is carried out by combining two search criteria, species and gene name, and using the option for the display of the gene map (Figure 2A) on the query result page (Figure 2B). Because the system accepts several search terms for the same search criterion, the same gene can be searched in two different species at once, namely H. sapiens and M. musculus in the example described. The query returns 3 matching records. It is immediately evident that two different copies of this gene exist in mouse, one located on chromosome 13 and the other on chromosome 1. The comparison of the gene maps allows the user to quickly assess that the gene structure is extremely conserved in man and mouse and that only one of the two mouse gene copies (MitoRes entry: MMUSCOX5B_2) could be the putative orthologue; the second copy (MitoRes entry: MMUSCOX5B_1), on chromosome 13, probably derives from a retrotransposition event that either occurred after the divergence of the two organisms or produced a copy that was subsequently lost in man.
The degree of similarity between the protein sequences can be quickly assessed by browsing the COX5B Cluster entry (Figure 3). The information reported in the Cluster entry allows the user to assess that 1) protein annotations are also available for R. norvegicus, S. scrofa and B. taurus, 2) the protein is well conserved in all the organisms for which annotation is available, 3) the degree of similarity with the human protein is, as expected, higher in S. scrofa and R. norvegicus than in M. musculus, 4) genomic annotation is also available for R. norvegicus but not for S. scrofa and B. taurus, 5) the gene structure in R. norvegicus also seems identical to that of the other organisms and finally, 6) only M. musculus has a second COX5B gene copy. At this point users have all the information on COX5B available in MitoRes, and can decide which sequences best fit their requirements, thus bypassing the tedious record-by-record query steps or BLAST database searches which would otherwise be necessary.
Sequence export tool
The sequence-export manager tool can run sequence extraction from the entry view page, the query results table and the Cluster view page. The web view of the sequence export form for extracting sequence data from the entries matching a search is shown in Figure 2C. Depending on the user's specific requirements, the system is able to extract: 1) the unprocessed protein sequence and the signal peptide; 2) the complete mRNA sequence, the CDS and the untranslated regions; 3) the complete gene sequence or part of it and the gene flanking regions up to 5000 bp; 4) all the intron and exon sequences or only those specifically chosen by users.
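How gene-level sub-sequences follow from exon coordinates can be shown with a small sketch. The Python function below is an illustration only, not the BioPerl-based export tool: it assumes a forward-strand gene, 1-based inclusive exon coordinates on the supplied genomic string, and introns defined as the gaps between consecutive exons. The function and key names are hypothetical; the 5000 bp flank limit is taken from the text.

```python
MAX_FLANK = 5000  # upper limit on flanking regions stated in the text


def extract_features(genomic: str, exons: list[tuple[int, int]], flank: int = MAX_FLANK) -> dict:
    """Slice gene, exon, intron and flanking sequences out of a genomic string.

    `exons` holds 1-based inclusive (start, end) pairs on the forward strand.
    """
    if not exons:
        raise ValueError("at least one exon is required")
    if not 0 <= flank <= MAX_FLANK:
        raise ValueError(f"flank must be between 0 and {MAX_FLANK} bp")

    exons = sorted(exons)
    gene_start, gene_end = exons[0][0], exons[-1][1]
    # An intron spans the positions between the end of one exon and the start of the next.
    introns = [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]

    def cut(start: int, end: int) -> str:
        return genomic[start - 1:end]  # convert to 0-based, end-exclusive slicing

    return {
        "gene": cut(gene_start, gene_end),
        "exons": [cut(s, e) for s, e in exons],
        "introns": [cut(s, e) for s, e in introns],
        "upstream": genomic[max(0, gene_start - 1 - flank):gene_start - 1],
        "downstream": genomic[gene_end:gene_end + flank],
    }
```

Flanking regions are simply truncated when the genomic string is shorter than the requested flank; a reverse-strand gene would additionally need reverse complementation, which this sketch leaves out.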
One particularly noteworthy feature of the system is the extraction of gene intron sequences, which is not possible from any other public resource. Furthermore, no limits are imposed on the number of sequences which can be extracted and exported. File formats for saving sequence(s) include: FASTA, EMBL, GenBank and SwissProt.
The BLAST search tool facilitates database searching for the functional characterization of unknown sequences. Through this application the user can choose whether to perform the BLAST search against the protein, gene or RNA collections of the MitoRes database.
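Of the export formats listed, FASTA is the simplest to picture. The helper below is a generic Python sketch of FASTA serialization, not the MitoRes export code; the function name, the record headers and the 60-character line width are assumptions.

```python
def to_fasta(records: dict[str, str], width: int = 60) -> str:
    """Serialize {header: sequence} pairs as FASTA text with wrapped lines."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    lines = []
    for header, sequence in records.items():
        lines.append(f">{header}")
        # Wrap the sequence into fixed-width lines.
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

Any number of records can be passed in one call, in keeping with the absence of limits on the number of exported sequences.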
Availability and requirements
MitoRes is freely available for academic and non-academic users at http://www2.ba.itb.cnr.it/MitoRes/. User registration, free of charge, is required only for the use of the BLAST tool. Comments, suggestions and corrections may be sent by e-mail to email@example.com.
This work has been supported by the Ministero dell'Istruzione, dell'Università e della Ricerca, Italy (projects: FIRB "Bioinformatica per la Genomica e la Proteomica", PON 2000–2006 "Project BIG"; Cluster C03 – Progetto 2 L. 488/92; Contributi Straordinari del D.D. n.1105).
- Saraste M: Oxidative phosphorylation at the fin de siecle. Science 1999, 283(5407):1488–1493. 10.1126/science.283.5407.1488
- Garesse R, Vallejo CG: Animal mitochondrial biogenesis and function: a regulatory cross-talk between two genomes. Gene 2001, 263(1–2):1–16. 10.1016/S0378-1119(00)00582-5
- Nisoli E, Clementi E, Moncada S, Carruba MO: Mitochondrial biogenesis as a cellular signaling framework. Biochem Pharmacol 2004, 67(1):1–15. 10.1016/j.bcp.2003.10.015
- Kelly DP, Scarpulla RC: Transcriptional regulatory circuits controlling mitochondrial biogenesis and function. Genes Dev 2004, 18(4):357–368. 10.1101/gad.1177604
- Cotter D, Guda P, Fahy E, Subramaniam S: MitoProteome: mitochondrial protein sequence database and annotation system. Nucleic Acids Res 2004, (32 Database):D463–467. 10.1093/nar/gkh048
- Andreoli C, Prokisch H, Hortnagel K, Mueller JC, Munsterkotter M, Scharfe C, Meitinger T: MitoP2, an integrated database on mitochondrial proteins in yeast and man. Nucleic Acids Res 2004, (32 Database):D459–462. 10.1093/nar/gkh137
- The Human Mitochondrial Protein Database [http://bioinfo.nist.gov:8080/examples/servlets/index.html]
- Sardiello M, Licciulli F, Catalano D, Attimonelli M, Caggese C: MitoDrome: a database of Drosophila melanogaster nuclear genes encoding proteins targeted to the mitochondrion. Nucleic Acids Res 2003, 31(1):322–4. 10.1093/nar/gkg123
- Attimonelli M, Catalano D, Gissi C, Grillo G, Licciulli F, Liuni S, Santamaria M, Pesole G, Saccone C: MitoNuc: a database of nuclear genes coding for mitochondrial proteins. Update 2002. Nucleic Acids Res 2002, 30(1):172–173. 10.1093/nar/30.1.172
- Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, et al.: The Ensembl genome database project. Nucleic Acids Res 2002, 30(1):38–41. 10.1093/nar/30.1.38
- Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001, 29(1):137–140. 10.1093/nar/29.1.137
- Mignone F, Grillo G, Licciulli F, Iacono M, Liuni S, Kersey PJ, Duarte J, Saccone C, Pesole G: UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res 2005, (32 Database):D141–146.
- Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al.: UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 2004, (32 Database):D115–119. 10.1093/nar/gkh131
- Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2000, 28(1):10–14. 10.1093/nar/28.1.10
- Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, et al.: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, (32 Database):D258–261.
- ITB-Bari SRS server [http://www.ba.itb.cnr.it/srs/]
- Tatusova TA, Madden TL: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 1999, 174(2):247–250.
- EMBOSS Documentation [http://emboss.sourceforge.net/docs/]
- EMBOSS Stretcher program [http://emboss.sourceforge.net/apps/stretcher.html]
- The Open Source PHP framework [http://seagull.phpkitchen.com/]
- The Bioperl Project [http://bio.perl.org/]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.