Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases
© Berger et al. 2007
Received: 07 May 2007
Accepted: 04 October 2007
Published: 04 October 2007
In recent years, mammalian protein-protein interaction network databases have been developed. The interactions in these databases are either extracted manually from low-throughput experimental biomedical research literature, extracted automatically from literature using techniques such as natural language processing (NLP), generated experimentally using high-throughput methods such as yeast-2-hybrid screens, or predicted using an assortment of computational approaches. Genes or proteins identified as significantly changing in proteomic experiments, or identified as disease susceptibility genes in genomic studies, can be placed in the context of protein interaction networks in order to assign these genes and proteins to pathways and protein complexes.
Genes2Networks is a software system that integrates the content of ten mammalian interaction network datasets. Filtering techniques to prune low-confidence interactions were implemented. Genes2Networks is delivered as a web-based service using AJAX. The system can be used to extract relevant subnetworks created from "seed" lists of human Entrez gene symbols. The output includes a dynamic, linkable, three-color web-based network map and a statistical analysis report that identifies significant intermediate nodes used to connect the seed list.
Genes2Networks is powerful web-based software that can help experimental biologists to interpret lists of genes and proteins such as those commonly produced through genomic and proteomic experiments, as well as lists of genes and proteins associated with disease processes. This system can be used to find relationships between genes and proteins from seed lists, and predict additional genes or proteins that may play key roles in common pathways or protein complexes.
The rapid increase in experimentally identified binary interactions between proteins has brought us to a stage where we are now able to start viewing how these interactions and components come together to form large functional regulatory networks. However, it is impossible for researchers to keep up with the ever-expanding literature. The emergence of high-throughput experimental technologies, such as yeast-2-hybrid screens [2, 3], cDNA microarrays [4, 5] and mass spectrometry, as well as databases that mine legacy experimental literature [7, 8], allows for the construction of large networks. Networks, formally graphs, are simple abstract representations of biomolecular interactions where cellular components are represented as nodes, and interactions connect these nodes through links.
The construction of cellular network datasets has several valuable uses. Network representation allows for extraction of global topological, statistical and structural properties such as connectivity distribution, clustering, and the identification of network motifs or graphlets. These measurements provide clues about the design principles of intracellular organization. Interaction network datasets can also be used to predict unidentified interactions [13, 14], or used as a starting point for quantitative computational modeling. Additionally, interaction networks can assist in interpreting experimental results when identified lists of proteins or genes from proteomic or genomic experiments or computational studies can be placed in their contextual local interaction networks.
We used only mammalian (mouse/rat/human) interactions recorded in the following datasets: BIND, HPRD, IntAct, DIP, MINT, Rual et al., Stelzl et al., Ma'ayan et al., PDZBase, and PPID [19, 26]. All interactions from these databases/datasets were determined experimentally and include a PubMed reference to the primary research article that describes the experiments used to identify the interactions. Some of the databases contain interactions that were manually extracted from the literature (e.g. HPRD); some datasets are the result of high-throughput experimental data (e.g. Rual et al. and Stelzl et al.); whereas some databases contain both low- and high-throughput interactions (e.g. BIND, IntAct, and DIP). Consolidation of interactions from the ten different network databases was accomplished by combining human/mouse/rat Entrez gene symbols using information from Swiss-Prot. The consolidated network created from the ten datasets contains 44,877 interactions and 11,033 nodes. This network is stored in a structured, space-delimited flat-file text format. This file is loaded into the program using a hash data structure implemented in the C language for fast loading and access. In this initial implementation we do not include datasets of interactions created via in-silico ab-initio interaction prediction methods, or interactions inferred from model-organism orthologs, such as those collected in OPHID, HPID, IntNetDB, and POINT. The datasets we used describe mostly binary interactions, but in rare cases complexes containing more than two proteins are listed. These were excluded from the merged dataset. Nodes in the ten datasets are provided with accession codes linking them to entries describing genes and proteins in databases such as Swiss-Prot and NCBI's Entrez Gene. HPRD and PPID [19, 26] are not included in the public web interface application since these databases require a license for redistribution.
Currently, HPRD and PPID data are only available to internal users at Mount Sinai School of Medicine.
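As a rough illustration of the loading step, the sketch below reads space-delimited interaction records into a hash-based adjacency map. It is written in Python rather than C, and the record layout (source symbol, target symbol, PubMed ID, database name), the function name `load_network`, and the sample records are assumptions made for the example, not the actual Genes2Networks file format.

```python
from collections import defaultdict

def load_network(lines):
    """Build an undirected adjacency map from space-delimited records.

    Assumed (hypothetical) record layout: source target pubmed_id database
    """
    adjacency = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        source, target = fields[0], fields[1]
        # Sets collapse the same interaction reported by several databases.
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency

# Placeholder records; the PubMed IDs are made up for illustration.
sample = """GRB2 SOS1 1000001 HPRD
GRB2 SOS1 1000001 BIND
SOS1 HRAS 1000002 HPRD
"""

network = load_network(sample.splitlines())
n_nodes = len(network)
n_links = sum(len(neighbors) for neighbors in network.values()) // 2
print(n_nodes, n_links)
```

A dictionary of sets gives constant-time neighbor lookup on average, which is the property the hash structure described above is meant to provide.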
Many of the interactions and components listed in the ten databases that we used are the result of high-throughput experiments such as yeast-2-hybrid screens [2, 3] and mass spectrometry. These interactions are considered low-quality since these techniques often report many false positives. Thus, we applied a simple filtering approach that allows users to exclude interactions originating from articles that report many interactions, and/or to include only interactions reported by several different papers. The rationale for this filtering approach is the assumption that a research article that reports many interactions is likely reporting the results of a high-throughput technique, which tends to produce many false positives. Conversely, interactions that are reported in many different research articles, and appear in multiple databases, can be given more confidence because they have been reported multiple times independently. Hence, users may choose to include only interactions from low-throughput studies with multiple references to improve the reliability of the consolidated network. Users are presented with list-boxes and text-boxes that allow adjustment of the filtering thresholds. More sophisticated filtering techniques that implement machine learning methods such as support vector machines (SVM), and that take into account more knowledge about the interactions (e.g. experimental method used, impact factor of journals), are planned for future implementations.
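The two thresholds described above can be sketched as follows. This is a minimal Python illustration, not the Genes2Networks implementation: the function name `filter_interactions`, the parameter names, the default values, and the sample records are all hypothetical, and counting references only after discarding high-volume articles is a design choice made for the example.

```python
from collections import Counter, defaultdict

def filter_interactions(records, max_per_article=10, min_references=2):
    """Keep interactions supported by enough low-throughput articles.

    records: iterable of (protein_a, protein_b, pubmed_id) tuples.
    max_per_article: drop evidence from articles reporting more
        interactions than this (assumed to be high-throughput studies).
    min_references: require at least this many remaining articles.
    """
    # Treat A-B and B-A as the same interaction and drop duplicate rows.
    unique = {(frozenset((a, b)), pmid) for a, b, pmid in records}
    per_article = Counter(pmid for _, pmid in unique)

    support = defaultdict(set)
    for pair, pmid in unique:
        if per_article[pmid] <= max_per_article:
            support[pair].add(pmid)
    return {pair for pair, pmids in support.items()
            if len(pmids) >= min_references}

# Made-up records; "pmid_HT" stands in for a high-throughput article.
records = [
    ("GRB2", "SOS1", "pmid_A"),
    ("SOS1", "GRB2", "pmid_B"),
    ("SOS1", "HRAS", "pmid_A"),
    ("X1", "X2", "pmid_HT"),
    ("X1", "X3", "pmid_HT"),
    ("X2", "X3", "pmid_HT"),
    ("GRB2", "SOS1", "pmid_HT"),
]

strict = filter_interactions(records, max_per_article=2, min_references=2)
relaxed = filter_interactions(records, max_per_article=2, min_references=1)
print(len(strict), len(relaxed))
```

Setting `min_references=1` disables the multiple-reference requirement, and a very large `max_per_article` disables the per-article cutoff, which mirrors the "and/or" choice given to users.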
To enhance accessibility to the core Genes2Networks software, we developed a state-of-the-art web-based interface. This interface allows users to input lists of human Entrez Gene symbols in a textbox or through uploading a text file. As genes are added, the system validates the entries using NCBI's e-utils. The validation is achieved by searching the NCBI gene database, with the input entry, while restricting the organism to human. If an exact match is not found, the user is presented with a list of suggestions with links to choose the intended matching entry. By clicking on a highlighted gene symbol from the list of suggestions, the gene can be added to the seed list.
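A query of this kind might be built as in the sketch below, which only constructs the request URL for the NCBI E-utilities `esearch` endpoint and does not send it. The helper name `build_gene_query_url` and the exact search term (the `[Gene Name]` and `[Organism]` field tags) are assumptions for illustration; the queries actually issued by Genes2Networks are not given in the text.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_gene_query_url(symbol):
    """Return an esearch URL that looks up a symbol in the Gene database,
    restricted to human entries (assumed query form)."""
    term = f"{symbol}[Gene Name] AND Homo sapiens[Organism]"
    params = {"db": "gene", "term": term, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"

url = build_gene_query_url("TP53")
print(url)
```

An empty result list for such a query would correspond to the "no exact match" case, at which point a broader search could supply the list of suggestions.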
The significance of each intermediate node is computed from four quantities: "a" is the number of links from the intermediate node being examined to nodes from the input seed list, "b" is the total number of links for the intermediate node in the consolidated background reference network, "c" is the total number of links in the output subnetwork, and "d" is the total number of links in the consolidated background reference network. The resulting ranked list of intermediates is displayed underneath the subnetwork map viewer.
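The scoring formula itself is not reproduced in this text. The sketch below therefore assumes a binomial proportions z-score, which is one test that can be built from the four quantities a, b, c and d: it compares the fraction of a node's links that reach the seed list (a/b) with the fraction of background links that fall in the subnetwork (c/d). The function names and the example counts are hypothetical.

```python
import math

def intermediate_z_score(a, b, c, d):
    """Assumed binomial proportions z-score for one intermediate node.

    a: links from the intermediate to seed-list nodes
    b: total links of the intermediate in the background network
    c: total links in the output subnetwork
    d: total links in the background network
    """
    if b <= 0 or not 0 < c < d:
        raise ValueError("need b > 0 and 0 < c < d")
    p_node = a / b
    p_background = c / d
    std_error = math.sqrt(p_background * (1 - p_background) / b)
    return (p_node - p_background) / std_error

def rank_intermediates(candidates, c, d):
    """Sort intermediates by z-score, most seed-specific first.

    candidates: dict mapping node name to an (a, b) tuple.
    """
    scored = [(name, intermediate_z_score(a, b, c, d))
              for name, (a, b) in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up counts: a specific node with few links versus a large hub.
candidates = {"SPECIFIC": (4, 10), "HUB": (4, 200)}
ranked = rank_intermediates(candidates, c=50, d=1000)
for name, z in ranked:
    print(f"{name}\t{z:.2f}")
```

Under this assumption, a highly connected hub that touches the seed list only a few times scores low, while a sparsely connected node with most of its links to seed nodes scores high, which is the specificity that the ranking is meant to capture.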
Several commercial and academic initiatives have been attempting to address the need for integration, consolidation, visualization, querying and organization of information about binary mammalian protein-protein interactions and signaling pathways from sparse sources. For example, Cytoscape is Java-based desktop software for protein and gene network visualization. Cytoscape's plug-ins allow for analysis and integration of experimental data as well as integration with Gene Ontology. One Cytoscape plug-in, called cPath, is a data warehouse that joins together databases stored in PSI-MI XML format. Other similar software platforms include PIANA, Pathway Studio, ProViz, PATIKA, and Ingenuity. Some are commercial products and some were developed by academic laboratories and are freely available. Genes2Networks provides several advantages over existing systems: the consolidated network made from the ten databases, after filtering, is a high-quality yet comprehensive dataset; the user interface is an intuitive Web 2.0-enabled web application; the system is free for academic users; and the system provides predictions about intermediate components and their involvement with the proteins and genes from seed lists by ranking intermediates according to the specificity of their interactions with the seed list. Genes2Networks is suitable for analysis of diverse proteomic and genomic experimental results. The web interface and visualization provide easy access and a user-friendly environment, eliminating the need for training.
Project name: Genes2Networks
Project home page: http://actin.pharm.mssm.edu/genes2networks
Operating system: Platform independent
Other requirements: The HPRD and PPID datasets are only available to Mount Sinai School of Medicine users due to licensing restrictions.
License: GNU GPL
Any restrictions to use by non-academics: License needed. Users should contact firstname.lastname@example.org
This research was supported by NIH Grant No. GM-054508 and an advanced center grant from NYSTAR to Ravi Iyengar. We thank the anonymous reviewers for their useful comments.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.