Ontological visualization of protein-protein interactions
© Drabkin et al. 2005
Received: 09 December 2004
Accepted: 11 February 2005
Published: 11 February 2005
Cellular processes require the interaction of many proteins across several cellular compartments. Determining the collective network of such interactions is an important aspect of understanding the role and regulation of individual proteins. The Gene Ontology (GO) is used by model organism databases and other bioinformatics resources to provide functional annotation of proteins. The annotation process provides a mechanism to document the binding of one protein with another. We have constructed protein interaction networks for mouse proteins utilizing the information encoded in the GO annotations. The work reported here presents a methodology for integrating and visualizing information on protein-protein interactions.
GO annotation at Mouse Genome Informatics (MGI) captures 1318 curated, documented interactions. These include 129 binary interactions and 125 interactions involving three or more gene products. Three networks involve over 30 partners, the largest involving 109 proteins. Several tools are available at MGI to visualize and analyze these data.
Curators at the MGI database annotate protein-protein interaction data from experimental reports from the literature. Integration of these data with the other types of data curated at MGI places protein binding data into the larger context of mouse biology and facilitates the generation of new biological hypotheses based on physical interactions among gene products.
Cellular processes require the interaction of many proteins across several cellular compartments. Interactions can range in stability from persistent, such as between members of a stable complex, to transient, such as binding while being phosphorylated. Determining the collective network of such interactions should provide insight into which processes the individual members participate, and how they may be regulated.
Understanding protein interaction networks requires two steps. First, the interacting proteins must be identified, usually by experimental methods. Second, the significance of the interaction networks must be assessed. Recently, there has been a focus on devising large-scale screening methods to collect data on interacting proteins [1–3]. Additionally, several strategies have been used to predict networks based on small peptide interaction, analysis of co-evolution of protein families, analysis of orthology, and co-inheritance. However, many of these types of studies are hindered by their inability to place the significance of the interaction networks in the broader biological context.
In addition to the large screening efforts, a significant amount of specific protein-protein interaction data has been reported in the literature over the years. Quite often, these studies report on only a few interacting proteins. It is difficult to place these isolated, yet specific, reports in the larger biological context and interconnect them with other data. Recently, there have been efforts to extract such literature-based interaction information using text mining, or combinations of text mining and other predictive methods. These can then be integrated into larger protein-protein interaction datasets. The work reported here presents a methodology for integrating and exploring information on protein-protein interactions.
Model Organism Databases (MODs) have been collecting diverse types of data about the genes and proteins from their respective organisms since the early 1990s (e.g. [10–13]). The goal of these databases is to integrate information about these organisms, placing experimental data in the context of the biology of the organism as a whole. Biological information on gene sequence, function, tissue-specific and developmental expression, as well as associated genetic and mutant phenotype data, is incorporated into these systems. Documenting protein-protein interactions and integrating them with other data types makes it possible to assess the significance of the interactions and to place these molecular interactions into a greater biological context.
The Mouse Genome Informatics system (MGI) is the MOD for the laboratory mouse. MGI integrates not only data used for GO annotation, but also data on a variety of aspects of mouse biology including gene sequence, orthologs, embryonic gene expression, alleles and their phenotypes, strains, and chromosome feature maps [15, 16]. MGI provides highly curated information to the research community and to other bioinformatics resources.
The Gene Ontology Consortium provides the biological community with a structured vocabulary that enables consistent functional annotation of genes and gene products. Guidelines for the use of the GO vocabulary are provided by the Consortium. Users of the GO are required to submit their annotations in a specified format, which is then made available to the public via the GO database. Each annotation row lists the object being annotated, the GO term that is being assigned, an evidence code specifying the type of evidence that was used to make the assignment, and a reference. The format of the annotation includes "modifier" fields, which can be used to modify either the use of the term or the use of the evidence code. One important modifier field is the "with" field. This field can be used to specify an external database link and provides the ability to qualify or support a given evidence code with a specific gene, nucleic acid sequence, protein sequence, or allele.
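The annotation row described above can be pictured as a small record. The following is a minimal sketch only: it assumes a simplified, hypothetical five-column, tab-delimited layout (object, GO term, evidence code, reference, "with" value) rather than the Consortium's actual submission format, and the reference and protein identifiers in the example are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene: str        # object being annotated
    go_id: str       # GO term assigned
    evidence: str    # evidence code, e.g. "IPI"
    reference: str   # supporting reference
    with_field: str  # optional "with" identifier; empty string if absent


def parse_row(line: str) -> Annotation:
    """Parse one row of the simplified, hypothetical layout described above."""
    fields = line.rstrip("\n").split("\t")
    # Pad short rows so that an annotation without a "with" value still parses.
    fields += [""] * (5 - len(fields))
    gene, go_id, evidence, reference, with_field = fields[:5]
    return Annotation(gene, go_id, evidence, reference, with_field)


# Placeholder identifiers, for illustration only.
row = parse_row("Tipin\tGO:0005515\tIPI\tREF:0001\tUniProt:EXAMPLE1")
```

Keeping the "with" value as its own field makes it straightforward to pair each annotated gene product with the partner named by the curator.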
In the course of over six years, curators at MGI have made 79690 annotations to 15231 gene products using 3742 GO terms (All database statistics used in this paper are from the MGI release as of 7/30/04). The curation policy focuses on experiments in which the murine protein gene product is investigated. Many of the detailed annotations have been added on a paper-by-paper basis using the MGI literature collection that contains primary experimental information about mouse genes from over 90,000 references. The accumulation and use of these papers in annotation has been, for the most part, undirected. However, the structure of the GO and the relationships among terms allow grouping of the gene products that share common annotations. Such strategies may reveal hitherto unsuspected relationships between these proteins.
"Protein binding" (GO:0005515), as used by the GO in the Molecular Function ontology, is defined as "interacting selectively with any protein or protein complex". This term has 70 sub-terms. A gene product can be annotated to "protein binding" using the IPI (inferred from physical interaction) evidence code and the "with" or "inferred from" field when the protein that it binds to has been specifically identified. In the case of the IPI evidence code, the "with" field requires a protein identifier, such as a SwissProt/Trembl ID (now UniProt). MGI curators use this evidence code to curate experimental evidence that demonstrates protein interactions.
MGI presently has 1851 genes annotated to the term GO:0005515, "protein binding", or its sub-terms. These genes have 2247 annotations to this term, indicating that some of the gene products must bind more than one protein. These annotations were made independently over the years as curators entered data reference by reference. By collecting all of these annotation pairs and identifying shared partners, it is possible to search for more complex networks that were not necessarily identified in any single piece of the original research literature.
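Identifying shared partners among annotation pairs amounts to finding the connected components of an undirected graph. The sketch below illustrates that idea with toy labels; it is not the procedure used to generate the MGI networks, and the function name `interaction_networks` is hypothetical.

```python
from collections import defaultdict


def interaction_networks(pairs):
    """Group binding pairs into connected networks (sets of gene symbols)."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    networks = []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        networks.append(component)
    # Largest networks first.
    return sorted(networks, key=len, reverse=True)


# Toy pairs: A-B and B-C share partner B, so they merge into one network.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
networks = interaction_networks(pairs)
```

Two pairs that were curated from different papers end up in the same network as soon as they share one gene product, which is how larger networks can emerge from independently entered annotations.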
There are instances in the annotations where a protein product has been shown to bind another protein but nothing else is known about its biological role. In these cases, MGI curators make an annotation to "protein binding", but also use special annotations to indicate that nothing is known about the cellular location of the gene product (GO:0008372, "cellular_component unknown") or the process it is involved in (GO:0000004, "biological_process unknown"). A simple example is seen in the case of TIPIN (timeless interacting protein) (Figure 3B). It has been shown to bind the protein product of Timeless, a homolog of the Drosophila gene. However, GO annotation of Timeless indicates that it is involved in the biological processes of lung development and branching morphogenesis, and thus we would predict that Tipin, which is currently annotated to "biological_process unknown", might also play a role in these processes. Additionally, the Gene Expression index in MGI indicates that Tipin is expressed in spatial and temporal patterns similar to those of Timeless, supporting the hypothesis that Tipin may be involved in similar processes and that the interaction may be significant. These inferences can form the basis for directed experiments, such as studying the effects of antisense RNA inhibition, as has been done for Timeless.
Cellular location may also be inferred from protein interactions. SOCS1 (suppressor of cytokine signaling 1) has "kinase inhibitor activity" (GO:0019210) and has been implicated in the "cytokine and chemokine mediated signaling pathway" (GO:0019221) and the JAK-STAT cascade (GO:0007259). However, its cellular location has not been documented in the available mouse literature. Analysis of the SOCS1 protein using predictive software such as Psort and SubLoc predicts that SOCS1 is a nuclear protein. However, there is as yet no direct evidence that this is so. The murine SOCS1 binds to JAK2 (Figure 3D), which has been reported to be localized to the cytoplasm. Therefore, we might expect that SOCS1 is also localized to the cytoplasm. Thus, algorithmic evidence predicts that SOCS1 is localized to the nucleus, whereas its interaction with JAK2 suggests that it may also be localized to the cytoplasm. These two independent predictions could stimulate investigations by direct experimentation. Although these types of analyses can be repeated for several proteins, they become unwieldy when analyzing networks larger than a few components.
For the set of 109 proteins shown in Figure 5A, fifty-one of the gene products have annotations that fall into the "signal transduction" bin (Figure 6A). A number of the gene products in Figure 5B have been annotated to processes involved in proliferation (twenty proteins) and protein metabolism (seventeen), and twenty-two are nuclear (Figure 6B and 6C). Finally, fifteen of the gene products in the third largest set are involved in transport (Figure 6D). In all of these cases, one might begin to develop hypotheses to test whether the unannotated members of the networks may be involved in these processes.
Tools such as GO_Term_Finder and its graphical counterpart Vlad can be useful in finding commonality as well as in suggesting additional information about the roles of proteins in the cell, which could then be tested experimentally. GO_Term_Finder computes the significance of the annotations shared by a selected set of genes, relative to the annotations of the entire set, using a hypergeometric distribution. In this study, the entire set is the set of all genes in MGI with GO annotation. For example, for the 109 gene products shown in Figure 5A, thirty-two have process annotations for signal transduction or one of its subterms (p < 1.0E-23), suggesting that the interaction of the proteins may depict a large signal transduction network. Thirty-six of the 109 gene products currently have either no annotation to the process ontology or are annotated to "biological_process unknown". These proteins may also be involved in the process of signal transduction. Seventeen of the proteins depicted in the 40-member network (Figure 5B) have been annotated to "regulation of the cell cycle" (GO:0000074, p < 1.0E-26). Therefore, 1190002H23Rik is likely involved in regulation of the cell cycle. Further support for this is that this protein has been annotated to be involved in the "cell cycle" based on sequence similarity to human RGC32.
Finally, twelve of the proteins displayed in Figure 5C have annotations to exocytosis or its children in common (GO:0006887, p < 1.0E-23).
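The hypergeometric significance test mentioned above can be sketched in a few lines. This is a minimal illustration of the upper-tail probability only, not GO_Term_Finder's implementation: the function name `hypergeom_tail` and the toy numbers are assumptions, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated to the term of interest."""
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probability of seeing exactly i annotated genes, for i = k..upper.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy numbers, not taken from the MGI data: a background of 10 genes,
# 4 annotated to the term, and a network of 3 genes with 2 so annotated.
p = hypergeom_tail(k=2, n=3, K=4, N=10)
```

Here `N` is the number of genes in the background set, `K` the number of those annotated to the term, `n` the size of the network, and `k` the number of network members carrying the annotation. A small value indicates that the term is shared more often within the network than random sampling from the background would predict.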
The networks suggested by the collection of annotations to this GO term involve interactions that are more or less stable under experimental conditions. A gene product is shown to have protein binding activity by a variety of direct assays such as yeast two-hybrid screening, co-immunoprecipitation and other immunoaffinity methods, GST or other tag pull-down assays, fluorescence resonance energy transfer, or other direct measurements. Due to the nature of some of the assays, caution must be taken when attributing significance. For example, false positives may be obtained from yeast two-hybrid assays for a variety of reasons. Therefore, confirmation by other methods, such as co-immunoprecipitation, may strengthen the likelihood of the implied interaction. Currently, the GO annotation does not allow for the capture of any distinction among these assays, with the result that they are all included together. Despite these serious considerations, large data sets can be effectively examined using these procedures, and the results can provide a basis for directed hypotheses and experimentation.
The Mouse Genome Informatics system integrates not only data used for GO annotation, but also data on a variety of aspects of mouse biology including embryonic gene expression, alleles and their phenotypes, and chromosome location. The integration of these datasets allows for complex queries, such as "list all genes expressed in the liver at Theiler Stage 15, located on chromosome 12, and annotated to 'protein binding' AND 'nucleus'". The integration of protein-protein network visualization into such queries can aid in determining the significance of more complex interaction networks. By combining the above query with our graphical tools, it is possible to get a graphical view of all protein interaction networks in the nucleus of a 9.5 dpc mouse embryo. As annotation progresses and becomes more complete, these types of queries will become increasingly informative.
During the generation of the interaction sets, it was found that programs such as Graphviz could easily reveal missing annotations based on the interaction of two proteins. When information about a protein comes from different sources, a curator working from a single reference may not necessarily record all of the information implied by a physical interaction, such as cellular location in the example above. Views such as those produced by Graphviz can help curators to spot missing data, and they may at some point be useful in themselves to display annotations.
MGI curators aggressively adopted the use of the "with" field when annotating to "protein binding" during the early stages of annotation efforts at the database. Similar networks may also be mined from the GO data sets available from the other model organism databases participating in the GO. Recently, Lehner and Fraser used GO annotation to analyze a human interaction set predicted from orthology to yeast, Drosophila, and C. elegans interaction sets. The GO is used by many species-specific organism databases to annotate gene products. The use of these annotation sets to construct species-specific interaction sets will complement curated interaction resources such as BIND and HPRD in guiding hypothesis generation and suggesting specific experimental investigations.
We have demonstrated that functional annotations curated via GO hierarchies can be used to obtain a summary set from independent annotations to "protein-binding" to form protein-protein interaction networks. The members of these protein-protein interaction sets can be further examined for additional shared GO annotations. Integration of these data with the other types of data curated at MGI places protein binding data into the larger context of mouse biology and will aid in the discovery of new biological knowledge based on physical interactions among gene products.
Gene annotations for protein binding interactions are made by manual inspection of published literature. In every case, experimental evidence is supplied in the manuscript to support the interaction that is reported. Annotation of genes to other GO terms is made by a variety of methods including the conservative translation of functional information contained in SwissProt protein records, conservative inference from InterPro domains, and manual curation of the published literature.
Data were obtained from the Mouse Genome Informatics system by use of custom SQL queries to collect all markers that had been annotated to "protein binding" or its children using the IPI evidence code. The protein sequence identifier in the "inferred from" field was matched to the appropriate gene in the database. The final output consisted of a two-column file, with column 1 being the first protein and column 2 the protein it binds. This formed the basic data set that was passed to Graphviz for display. Additional Perl scripts were used to separate out each individual network.
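A two-column list of this kind maps directly onto Graphviz's DOT language. The sketch below is illustrative only and is written in Python rather than the Perl used by the authors; the function name `pairs_to_dot` is hypothetical, and the choice of an undirected graph with duplicate edges collapsed is an assumption, since the text does not specify how the graphs were configured.

```python
def pairs_to_dot(pairs, name="interactions"):
    """Render binding pairs as an undirected Graphviz DOT graph.

    Assumes gene symbols contain no double quotes (no escaping is done).
    """
    lines = [f"graph {name} {{"]
    seen = set()
    for a, b in pairs:
        edge = frozenset((a, b))
        if edge in seen:
            # A-B and B-A describe the same physical interaction.
            continue
        seen.add(edge)
        lines.append(f'    "{a}" -- "{b}";')
    lines.append("}")
    return "\n".join(lines)


# Two interactions discussed in the text; the reversed pair is collapsed.
dot_text = pairs_to_dot(
    [("Tipin", "Timeless"), ("Timeless", "Tipin"), ("Socs1", "Jak2")]
)
```

Saving `dot_text` to a file and running a Graphviz layout program on it, for example `dot -Tpng interactions.dot -o interactions.png`, should produce one drawing in which each unconnected network appears as a separate cluster of nodes.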
The two-column lists were also used as the basis for data files listing all unique genes in each network. These were then used as input files for GO_Slim Tool and GO_Term_Finder. These files are available on the MGI ftp site http://ftp.informatics.jax.org.
We wish to thank Lucie Hutchins and Lori Corbani for local assistance with this project. We would also like to thank Joel Richardson for input on the use of GraphViz. MGI database resources are funded by NHGRI (HG00330), NIH/NICHD (HD33745), and NCI (CA89713). The Gene Ontology Project is funded by NHGRI (HG02273).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.