RefNetBuilder: a platform for construction of integrated reference gene regulatory networks from expressed sequence tags
© Li et al; licensee BioMed Central Ltd. 2011
Published: 18 October 2011
Gene Regulatory Networks (GRNs) provide integrated views of gene interactions that control biological processes. Many public databases contain biological interactions extracted from experimentally validated literature reports, but most furnish only information for a few genetic model organisms. In order to provide a bioinformatic tool for researchers who work with non-model organisms, we developed RefNetBuilder, a new platform that allows construction of putative reference pathways or GRNs from expressed sequence tags (ESTs).
RefNetBuilder was designed to have the flexibility to extract and archive pathway or GRN information from public databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG). It features sequence alignment tools such as BLAST to allow mapping ESTs to pathways and GRNs in model organisms. A scoring algorithm was incorporated to rank and select the best match for each query EST. We validated RefNetBuilder using DNA sequences of Caenorhabditis elegans, a model organism having manually curated KEGG pathways. Using the earthworm Eisenia fetida as an example, we demonstrated the functionalities and features of RefNetBuilder.
The RefNetBuilder provides a standalone application for building reference GRNs for non-model organisms on a number of operating system platforms with standard desktop computer hardware. As a new bioinformatic tool aimed for constructing putative GRNs for non-model organisms that have only ESTs available, RefNetBuilder is especially useful to explore pathway- or network-related information in these organisms.
Gene regulatory networks (GRNs) offer integrated views of gene interactions that control biological processes. Meanwhile, a number of reverse engineering approaches have been developed to infer GRNs. For instance, Boolean network , probabilistic Boolean network , modelling algorithms using mutual information (e.g., CLR  and ARACNE ), and dynamic Bayesian network . The accuracy of computationally inferred GRNs is often evaluated using manually curated pathway or interaction information of model organisms. Such information as functional annotation and relevant biological interactions associated with a particular gene is available from many online resources [6–9]. These public databases contain genetic interactions retrieved from literature with experimental validations, but unfortunately, only a few well-studied model organisms have been curated. The same types of genetic interaction information do not exist for non-model species despite a wealth of transcriptome-wide expressed sequence tags (ESTs) for the specific organisms of interests.
Although experimentally validated interactions among genes or proteins are deposited in the public databases, limitations in accessibility and scalability make retrieving and integrating relevant information difficult. Several bioinformatic toolkits have been developed to extract biological interactions from public databases for well-studied model organisms. For example, BioNetBuilder [10, 11] and NetMatch  are Cytoscape  plug-ins for retrieving, integrating, visualization and analysis of known biological networks. However, these programs cannot be applied to species that have no or limited genetic interaction information. Other tools such as BlastPath  and OmicViZ , also Cytoscape plug-ins, allow network mapping across species based on sequence homology. But they only map a query species to its closely related model organisms; and have limitations in the number of query genes / proteins. For many less-studied non-model organisms, their related species are often unavailable on the well-annotated model organisms list. Recently, an automatic genome annotation and pathway reconstruction tool named KAAS (KEGG Automatic Annotation Server) was developed for organisms with complete genome sequences [16, 17]. To the best of our knowledge, no tools are currently available that provide an integrated environment for building GRNs for less-studied non-model organisms from incomplete genomic or EST sequencing data. This motivates us to develop Reference Network Builder or RefNetBuilder, a cyber-based platform that constructs homologous reference GRNs, to fill this gap.
The intended applications of RefNetBuilder include: (1) build putative, reference GRNs/pathways for non-model organisms; (2) provide biological prior knowledge of GRNs that may assist in assessing and improving computational GRN inference models; (3) help to interpret and compare the GRNs reconstructed from wet-lab experiments; and (4) serve as a gene set selection tool for GRN reconstruction because many computational models can only accommodate a limited number of genes (nodes) from high dimensional microarray datasets.
Mapping of homologous genes
Homology among proteins and DNA is often concluded on the basis of sequence similarity. The Basic Local Alignment Search Tool (BLAST) [18, 19] is one of the most popular and widely-used algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotide sequences of DNA. A BLAST search enables comparison of a query sequence with a library or database of sequences. The library sequences that resemble the query sequence above a certain threshold are identified. In RefNetBuilder, the program blastx is used, after formatting the database of sequences, to map gene fragments of the query organism to select multiple model organisms in the KEGG (Kyoto Encyclopedia of Genes and Genomes) database. The rationale behind this selection is that many gene structures and functions, as well as pathways, are conserved in evolution. The default settings for the program were used and we limited the maximum target sequences to be one so that the best hit for a query sequence was picked. The cutoff for expect value (E-value) was set at 10 by default and the matching sequence that had a higher E-value (>10) were considered no statistical similarity. The E-value, along with the percentage of identity (pident) and the length of the identity (nident), was recorded.
Public databases of genetic interactions
Although many public databases contain information of genetic interactions associated with a particular pathway, pathway annotation is generally sparse for organisms other than human, mouse and rat. Many other organisms with fully sequenced genomes have very limited pathway annotation, which are usually located in dedicated databases that are difficult to retrieve. KEGG  is a collection of online databases dealing with genomes, enzymatic pathways, and biochemistry. The KEGG PATHWAY database archives information on molecular interaction networks, such as pathways and complexes, information about genes and proteins generated by genome projects, and information about biochemical compounds and reactions. In RefNetBuilder, all the systematic reference pathways/networks in the KEGG databases have been extracted and loaded into our own pathway annotation database. There are two major categories of reference pathways, namely metabolic pathways and non-metabolic pathways. The non-metabolic pathways capture the perturbed reaction/interaction networks for genetic information processing, environmental information processing, other cellular processes, and human diseases. The molecular network shown in each pathway map is a graph consisting of nodes (e.g., genes, proteins, small molecules, etc.) and edges (reactions, interactions and relations). In general, if two genes in the pathway map are connected with an edge, they are considered to have a regulatory relationship. Each gene extracted from the KEGG GENES database is assigned a unique KEGG Orthology (KO) identifier (KOID). The KO entry represents an ortholog group that is linked to a gene product in the KEGG pathway diagram. Thus, the BLAST scores between a query sequence and the reference sequence set from the KEGG GENES database are computed, and homologs are found in the reference set (Figure 2).
RefNetBuilder: Reference networks for non-model organisms
After BLAST between query genes and the reference gene set from KEGG GENES database, homologs are found for each query sequence. Then, homologs ranked above the threshold are selected as ortholog candidates based on the BLAST score. Ortholog candidates are divided into KO groups according to the annotation of the KEGG GENES database and each query sequence is mapped with the corresponding KO group (Figure 2).
Interpretation and integration of networks
Based on the results of mapping between query sequences and KO reference genes from the KEGG GENES database, all the reference pathways extracted from the KEGG database are interpreted by highlighting those KO reference genes if they are mapped to a query sequence from the non-model organism. That is, for each pathway map, the node (representation of ortholog gene) is highlighted in the red colour if it is the best hit for a query sequence, and gene names are replaced by its corresponding KO group identification. The rest of the structure on the map remains the same as in the original map from KEGG database. By using the KGML-ED tool , the customized interpretation of pathway maps that include mapping information of query gene and KO reference gene are generated and can be used as a graphical representation of reference GRNs/pathways for the query non-model organism.
Results and discussion
We first tested the accuracy of RefNetBuilder by reassigning KO identifiers to the Caenorhabditis elegans (nematode) genes queried against seven other model species curated in the KEGG GENES database. The seven organisms are Anopheles gambiae (mosquito), Apis mellifera (honey bee), Drosophila melanogaster (fruit fly), Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat) and Schistosoma mansoni (flatworm). Currently, 3913 C. elegans genes are annotated with a KO identifier number (KOID). The test results (see Additional file1) show that RefNetBuilder was able to assign each of the C. elegans gene a KOID with a 70.9% accuracy, i.e., 2773 assigned KOIDs matching the original KEGG curated KOIDs. This accuracy rate is comparable to the 62.5% ~90.1% sensitivity reported for KAAS by querying against a representative set of model species .
To demonstrate the functionality and features of RefNetBuilder, we used a non-model organism, the earthworm Eisenia fetida, as an example. A total of 43,803 E. fetida ESTs were queried against the above-mentioned eight model organisms. After processing through RefNetBuilder, 9,187 of these ESTs were assigned to 3,134 unique KOIDs that were mapped to 267 pathways out of the entire 317 KEGG pathways (see Additional file2). A subset of 2,574 earthworm ESTs identified as differentially expressed genes in response to chemical perturbations (unpublished data) was also annotated using RefNetBuilder. Results (see Additional file3) show that 604 of these ESTs were assigned to 450 unique KOIDs that belong to 226 KEGG pathways (88 metabolism and 138 non-metabolism pathways), with 218 ESTs being mapped to metabolic pathways, 460 to non-metabolic pathways, and 74 to both.
The above derived pathway information is currently being used for computational inference of GRNs from a large earthworm microarray dataset. Meanwhile, other curated pathway databases such as the Pathway Interaction Database (PID) , Reactome  and the BioCyc Tier 1 databases  are being added to the RefNetBuilder platform (Figure 1). This platform has the flexibility to expand and include more interaction information as it becomes available in the future.
Here we presented the development of RefNetBuilder, a new tool aimed for constructing GRNs for non-model organisms that have only ESTs available. Researchers who wish to explore pathway- or network-related bioinformatic information in these organisms may find this tool especially useful.
Availability and requirements
Project name: The RefNetBuilder Platform
Project Available at: http://orca.st.usm.edu/cbbl/refnet
Operating system(s): Windows XP, Vista(x86), Vista(x64), Linux, MacOS Programming languages: Perl
Other requirements: MySQL Server, ActivePerl, Blast
Any restrictions to use by non-academics: None
This work was supported by the US Army Corps of Engineers Environmental Quality Program under contract #W912HZ-08-2-0011 and the NSF EPSCoR project “Modeling and Simulation of Complex Systems” (NSF #EPS–0903787). Permission was granted by the Chief of Engineers to publish this information.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 10, 2011: Proceedings of the Eighth Annual MCBIOS Conference. Computational Biology and Bioinformatics for a New Decade. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S10.
- Liu W, Lahdesmaki H, Dougherty ER, Shmulevich I: Inference of Boolean networks using sensitivity regularization. EURASIP J Bioinform Syst Biol 2008, -780541.Google Scholar
- Shmulevich I, Dougherty ER, Zhang W: Gene perturbation and intervention in probabilistic Boolean networks. Bioinformatics 2002, 18: 1319–1331. 10.1093/bioinformatics/18.10.1319View ArticlePubMedGoogle Scholar
- Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, et al.: Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 2007, 5: e8. 10.1371/journal.pbio.0050008PubMed CentralView ArticlePubMedGoogle Scholar
- Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla FR, et al.: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006, 7(Suppl 1):S7. 10.1186/1471-2105-7-S1-S7PubMed CentralView ArticlePubMedGoogle Scholar
- Zou M, Conzen SD: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005, 21: 71–79. 10.1093/bioinformatics/bth463View ArticlePubMedGoogle Scholar
- Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW: BIND--The Biomolecular Interaction Network Database. Nucleic Acids Res 2001, 29: 242–245. 10.1093/nar/29.1.242PubMed CentralView ArticlePubMedGoogle Scholar
- Ceol A, Chatr AA, Licata L, Peluso D, Briganti L, Perfetto L, et al.: MINT, the molecular interaction database: 2009 update. Nucleic Acids Res 2010, 38: D532-D539. 10.1093/nar/gkp983PubMed CentralView ArticlePubMedGoogle Scholar
- Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, et al.: Human Protein Reference Database--2009 update. Nucleic Acids Res 2009, 37: D767-D772. 10.1093/nar/gkn892PubMed CentralView ArticlePubMedGoogle Scholar
- Stark C, Breitkreutz BJ, Chatr-Aryamontri A, Boucher L, Oughtred R, Livstone MS, et al.: The BioGRID Interaction Database: 2011 update. Nucleic Acids Res 2011, 39: D698-D704. 10.1093/nar/gkq1116PubMed CentralView ArticlePubMedGoogle Scholar
- Konieczka JH, Drew K, Pine A, Belasco K, Davey S, Yatskievych TA, et al.: BioNetBuilder2.0: bringing systems biology to chicken and other model organisms. BMC Genomics 2009, 10(Suppl 2):S6. 10.1186/1471-2164-10-S2-S6PubMed CentralView ArticlePubMedGoogle Scholar
- Avila-Campillo I, Drew K, Lin J, Reiss DJ, Bonneau R: BioNetBuilder: automatic integration of biological networks. Bioinformatics 2007, 23: 392–393. 10.1093/bioinformatics/btl604View ArticlePubMedGoogle Scholar
- Ferro A, Giugno R, Pigola G, Pulvirenti A, Skripin D, Bader GD, et al.: NetMatch: a Cytoscape plugin for searching biological networks. Bioinformatics 2007, 23: 910–912. 10.1093/bioinformatics/btm032View ArticlePubMedGoogle Scholar
- Lopes CT, Franz M, Kazi F, Donaldson SL, Morris Q, Bader GD: Cytoscape Web: an interactive web-based network browser. Bioinformatics 2010, 26: 2347–2348. 10.1093/bioinformatics/btq430PubMed CentralView ArticlePubMedGoogle Scholar
- Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T: PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res 2004, 32: W83-W88. 10.1093/nar/gkh411PubMed CentralView ArticlePubMedGoogle Scholar
- Xia T, Dickerson JA: OmicsViz: Cytoscape plug-in for visualizing omics data across species. Bioinformatics 2008, 24: 2557–2558. 10.1093/bioinformatics/btn473PubMed CentralView ArticlePubMedGoogle Scholar
- Aoki-Kinoshita KF, Kanehisa M: Gene annotation and pathway mapping in KEGG. Methods Mol Biol 2007, 396: 71–91. 10.1007/978-1-59745-515-2_6View ArticlePubMedGoogle Scholar
- Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M: KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 2007, 35: W182-W185. 10.1093/nar/gkm321PubMed CentralView ArticlePubMedGoogle Scholar
- Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL: NCBI BLAST: a better web interface. Nucleic Acids Res 2008, 36: W5-W9. 10.1093/nar/gkn201PubMed CentralView ArticlePubMedGoogle Scholar
- Ye J, McGinnis S, Madden TL: BLAST: improvements for better sequence analysis. Nucleic Acids Res 2006, 34: W6-W9. 10.1093/nar/gkl164PubMed CentralView ArticlePubMedGoogle Scholar
- Klukas C, Schreiber F: Dynamic exploration and editing of KEGG pathway diagrams. Bioinformatics 2007, 23: 344–350. 10.1093/bioinformatics/btl611View ArticlePubMedGoogle Scholar
- Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, et al.: PID: the Pathway Interaction Database. Nucleic Acids Res 2009, 37: D674-D679. 10.1093/nar/gkn653PubMed CentralView ArticlePubMedGoogle Scholar
- Vastrik I, D'Eustachio P, Schmidt E, Gopinath G, Croft D, de B B, et al.: Reactome: a knowledge base of biologic pathways and processes. Genome Biol 2007, 8: R39. 10.1186/gb-2007-8-3-r39PubMed CentralView ArticlePubMedGoogle Scholar
- Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahren D, et al.: Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 2005, 33: 6083–6089. 10.1093/nar/gki892PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.