- Methodology article
- Open Access
Integration of relational and hierarchical network information for protein function prediction
© Jiang et al; licensee BioMed Central Ltd. 2008
- Received: 14 March 2008
- Accepted: 22 August 2008
- Published: 22 August 2008
In the current climate of high-throughput computational biology, the inference of a protein's function from related measurements, such as protein-protein interaction relations, has become a canonical task. Most existing technologies pursue this task as a classification problem, on a term-by-term basis, for each term in a database, such as the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functions. However, ontology structures are essentially hierarchies, with certain top to bottom annotation rules which protein function predictions should in principle follow. Currently, the most common approach to imposing these hierarchical constraints on network-based classifiers is through the use of transitive closure to predictions.
We propose a probabilistic framework to integrate information in relational data, in the form of a protein-protein interaction network, and a hierarchically structured database of terms, in the form of the GO database, for the purpose of protein function prediction. At the heart of our framework is a factorization of local neighborhood information in the protein-protein interaction network across successive ancestral terms in the GO hierarchy. We introduce a classifier within this framework, with computationally efficient implementation, that produces GO-term predictions that naturally obey a hierarchical 'true-path' consistency from root to leaves, without the need for further post-processing.
A cross-validation study, using data from the yeast Saccharomyces cerevisiae, shows our method offers substantial improvements over both standard 'guilt-by-association' (i.e., Nearest-Neighbor) and more refined Markov random field methods, whether in their original form or when post-processed to artificially impose 'true-path' consistency. Further analysis of the results indicates that these improvements are associated with increased predictive capabilities (i.e., increased positive predictive value), and that this increase is consistent uniformly with GO-term depth. Additional in silico validation on a collection of new annotations recently added to GO confirms the advantages suggested by the cross-validation study. Taken as a whole, our results show that a hierarchical approach to network-based protein function prediction, that exploits the ontological structure of protein annotation databases in a principled manner, can offer substantial advantages over the successive application of 'flat' network-based methods.
- Gene Ontology
- Receiver Operating Characteristic Curve
- Positive Predictive Value
- Negative Predictive Value
- Directed Acyclic Graph
Proteins are fundamental to the complex molecular and biochemical processes taking place within organisms. An understanding of their role is therefore critical in biology and bio-related areas, for purposes ranging from general knowledge to the development of targeted medicine and diagnostics. High-throughput sequencing technology has identified a tremendous number of genes with no known functional annotation. On average, as many as 70% of the genes in a genome have poorly known or unknown functions . Not surprisingly, therefore, the prediction of protein function has become an important and urgent problem in functional genomics.
Protein function prediction can take many forms. The traditional and most popular methodologies use homology modeling and sequence similarity to infer biochemical function [2, 3]. In simple cases, such as certain families of ribosomal proteins, globins, kinases or caspases, these procedures work reasonably well. Sequence similarity has been used with great success for inference of molecular function. For biological process and pathway annotation, guilt by association using functional linkage methods has been a popular choice in recent years.
For example, microarrays are often used to cluster proteins into groups of genes that respond concordantly to a given environmental stimuli. When these groups are strongly enriched in proteins in a given biological process such as insulin signaling and also contain proteins without annotation we often take the leap of faith and predict the unknown proteins to be associated with this process as well. Similarly, when two proteins are found to interact in a high throughput assay we also tend to use this as evidence of functional linkage.
However, enrichment and guilt by association are often highly misleading and can lead to a very high false positive rate if not used with caution. The work in  and several other papers, e.g., [5–7], attempted to frame these inference problems in a precise network-based probabilistic framework. Here we attempt to make a fundamental advance in this area, by augmenting the network-based perspective to additionally make explicit use of the structure of the GO hierarchy to compute more precise probabilities, thereby improving on the quality of predictions made by the inference algorithms.
More broadly, the work in this paper is important in demonstrating that an important role can be played in this context by the knowledge captured in biological ontologies, when properly harnessed. That this should be the case is not obvious a priori. For example, while many scientists use GO in their daily research, it can be (and has been) claimed that overlap among categories, as well as the inherent ambiguity and semantic complexity of naming biological functions and processes, can frequently lead to misleading interpretations and wild goose chases. Classic statistical approaches are based on flat disjoint categories, and quantitative measures of annotation similarity such as through semantic similarity remain somewhat ad hoc.
Nevertheless, despite such concerns, our work here shows that in the present context of automated protein function prediction, the leverage of hierarchies grounded in biological ontologies can yield real, quantifiable advantages over 'flat' network-based approaches.
Computational protein function prediction is typically treated as a classification problem. From this perspective, given a protein i and the label G of a potential function for that protein, the goal is to predict whether or not i has label G, using a classifier built from a set of training cases and additional related data. Such related data can be of many types (e.g., protein interaction data, gene expression data, protein localization data) but often can be summarized in the form of a functional linkage graph (e.g., protein-protein interaction network, gene association network). The labels G typically derive from a database of terms.
Databases of labels G are commonly structured in a hierarchical form (more formally, as a directed acyclic graph (DAG)). The Gene Ontology (GO) database is one such example http://www.geneontology.org/. Viewed as a DAG, nodes represent labels and a link represents the is_a and part_of relations between labels. Function assignments to proteins must obey the true-path rule: if a child term (i.e., more specific term) describes the gene product, then all its parent terms (i.e., less specific terms) must also apply to that gene product. Fig. 1(b) shows a small schematic illustration. For any protein labeled with term A, it may or may not have term D, the child of A; on the other hand, if it does not have term A, it surely does not have D.
This annotation rule suggests that when predicting the label of a term in the hierarchy, it is helpful to first consider whether the protein has the parent term or not. Thus, informative are not only the neighbors labeled with the term of interest but also those labeled with the parent. For instance, to predict the label of term D for the central protein in Fig. 1(a), we want to use both the neighbors labeled with A and the neighbors labeled with D. The exploitation of GO hierarchy is not novel, and indeed is natural. It has been used in functional annotation of genes, as mentioned in the Related Work section, as well as for other purposes, such as identifying over- and under-representation of GO terms in a gene dataset, and clustering functionally related genes [8–10].
As currently practiced in most instances, prediction of protein function is done with classifiers trained separately for each possible label G, as in [4, 7, 11, 12]. (Please also see the section of Related Work.) But, as just discussed, the overall collection of labels to be assigned generally has a hierarchical structure to it i.e., the labels are related to each other in a specific manner. This structure typically is enforced only after producing an initial set of predictions, as post-processing steps, either using transitive closure, , or using more sophisticated methods, [13, 14].
Most existing methods, as discussed earlier, predict protein function in a term-by-term fashion, without considering the relationship among terms. Suppose the probabilities in the plot are obtained from one of such methods. If we apply a cut-off of 0.5, which is a commonly used threshold in this field, we will predict that the protein is NOT annotated with term A, since the probability of having A is 0.4, less than 0.5; and is annotated with A's child C. This violates the true-path rule, since if the protein is predicted not having term A, then it is not having any of A's descendent terms. On the other hand, the protein is predicted to be labeled with both terms B and D, with probabilities of 0.7 and 0.5, respectively, which obeys the true-path rule, with the prediction on D as a false positive. Such a violation to the true-path rule is not uncommon.
The basic premise of this paper is that reference to this hierarchical relationship among labels is best incorporated in the initial stage of constructing a classifier, as a valuable source of information in and of itself. Our objective here is to demonstrate the power of this premise and to show that it may be tapped in the form of a single, coherent probabilistic classifier. In particular, we develop a probability model that integrates relational data and a hierarchy of labels, and illustrate its advantages in predicting protein function using a PPI network and the Gene Ontology (GO) hierarchy.
Many methodologies have been proposed to predict protein functions. Most of the earlier methods tend to use a single source of protein information, such as PPI. Typical examples include the "Nearest-Neighbor" algorithm, also known as "guilt-by-association" principle, and the Binomial-Neighborhood (BN) method .
These earlier methods were followed later by a surge of interest in combining heterogeneous sources of protein information. For example, a machine learning approach integrating datasets of PPI, gene expression, hydropathy profile and amino acid sequences, in the form of different kernels, has been introduced . Various genome-wide data can also be employed in a Bayesian framework to produce posterior probability for function assignments [5, 6]. And a Markov Random Field model combining PPI network and protein domain information was introduced in . A common characteristic of these methods is detecting protein functions individually, without considering the relationship among them. As remarked, a pitfall of this is that the predictions may conflict with the true-path rule of ontologies.
Motivated in part by seminal work of , combining protein data and ontology structure has recently become a focus. One approach is using a Bayesian network structure to correct inconsistent function predictions, by calculating the largest posterior probability of the true-path consistent labels, given the predictions from independent classifiers for each of the proteins . Similar work has been done in , where multiple classifiers are built and training data are modified according to the GO hierarchy. A Bayesian model consisting of a set of nested multinomial logit models, where a prior describing correlations of parameters for nearby GO terms is trained by the hierarchy, has been proposed in . Observing the fact that a protein is actually associated with multiple GO terms, this problem can also be treated as a hierarchical multi-label classification task [13, 17]. Yielding various degrees of improvement in prediction accuracy, these methods all seek to take advantage of the hierarchical label structure. However, importantly, we note that all of those that predict at multiple depths in the GO hierarchy take a separate step to correct true-path inconsistent predictions, rather than producing them directly in a probabilistically coherent fashion.
In summary, combining relational protein data, such as PPI, and hierarchical structures, as in GO, in one probabilistic model to predict true-path consistent function labels, has to the best of our knowledge not been done to date. This task is the focus of our work.
Ontologies like GO are structured as directed acyclic graphs (DAG's), where a child term may have multiple parent terms. The DAG structure, with alternative paths from the root to internal and leaf terms, is one of the reasons that formal approaches to annotation predictions have been difficult. It is well known that computing the most likely assignment of values to variables in a DAG of size N given their conditional probabilities on the arcs is a classical NP-hard problem in graphical models. In fact, variants of this problem are actually formally harder by some theoretical considerations. Therefore, people routinely use tree approximations of probability distributions, which goes back to the work in . In our work, clearly, a tree-based approach is the first step to something concrete, rather than ad hoc. We will show in the following sections that, as a way of balance, and in light of our results, it would appear that a tree is a good compromise between ad hoc and completely rigorous usage of the DAG.
We apply a minimal spanning tree (MST) algorithm to transform a DAG into a tree-structured hierarchy, by preserving the link between the child and the parent with the heaviest weight w, where w is the empirical conditional probability of having the child term given having the parent, based on a given PPI training set. Each GO term, in such a hierarchy, may still have more than one child term, but only one parent term (if the term itself is not the root of the hierarchy).
As a result of this transformation, there now exists a unique path from the root term to any non-root term. That is, let G d denote a term at the d-th level below the root. For example, d = 1 if the term is a child of the root. Then in our tree-structured hierarchy there is always a unique path of the form G d , Gd-1, ..., G1, G0, with G0 being the root, and Gi-1being the parent of G i . For example, in Fig. 1(b), the result of applying our MST algorithm would be to drop the (B, E) link.
We propose to build a classifier in this setting based on the use of hierarchical conditional probabilities of the form . Here i indexes a certain protein, and G d is a GO term of interest. The binary variable = 1 indicates that protein i is labeled with G d ; otherwise, it takes the value -1. Finally, denotes the status of all of protein i's neighbors in the PPI network, across all GO terms, as well as the status for protein i of all of the ancestor terms of G d . We will refer to as the neighborhood status of i.
In the remainder of this section, we present certain model assumptions that in turn lead to a particular form for the probabilities , as well as an efficient algorithm for their computation.
We assume that labels on proteins obey a Markov property with respect to the PPI. That is, that the labeling of a protein is independent of any other proteins given that of its neighbors. Similarly, we assume that a Markov property holds on the GO tree-structured hierarchy, meaning that for a given protein the status of a GO term label is independent of that of the other terms, given that of its parent.
G ch is the child term; G pa is its parent;
is the number of i's neighbors labeled with the G ch , and is the number of neighbors labeled with G pa ;
p1 is the probability with which neighbors of i are independently labeled with G ch (being already labeled with G pa ), given i is labeled with G ch ;
p0 is the probability with which neighbors of i are independently labeled with G ch (being already labeled with G pa ), given i is NOT labeled with G ch but is labeled with G pa .
We refer to this overall set of model assumptions as the Hierarchical Binomial-Neighborhood (HBN) assumptions, in reference to their extension of the Binomial-Neighborhood (BN) assumptions of . Note that the form of the probabilities above assumes that , the number of neighbors with the child term, is independent of the neighborhood size N, given , the number of neighbors with the parent. This condition seems reasonable since, recall that, by the true-path rule, only those among i's neighbors that are labeled with the parent term can possibly have the child term. In other words, those neighbors with the child form a subset of those neighbors with the parent.
Parameters p1 and p0 are term-specific: different terms have different p1 and p0. For a given term G ch , all proteins share the same p1 and p0. They are estimated by pseudo-likelihood approach, from the labeled training data, separately for each G ch to be predicted. When calculating , , we use only the neighbors in the training set.
More specifically, assume there are n proteins in the training set, with m proteins labeled with G ch and n - m proteins not labeled with G ch . To simplify notation, let kch, iand kpa, ibe protein i's training neighbors labeled with G ch and G pa , respectively. For the m G ch -annotated proteins, we haveKch, i~Binomial(kpa, i, p1),
with = 1.
with = -1. Estimators and are formally pseudo-likelihood estimators.
An issue of estimation is the lack of data. Few data will affect the predictability and interpretability of the terms. Thus, we focus on terms with at least 5 proteins annotated with in the GO dataset. In principle, more formal work could be done, by using smoothing techniques and Empirical Bayes approaches, which we are exploring in our current work. It appears that improvement is not uniform, and the issue clearly requires separate consideration and will likely form a substantial component of a separate paper. Its subtlety likely is due to the well-known issue of classifiers doing well for classification while still being off-target for estimation .
Also notice that we use one-hop neighborhoods in this paper, i.e., neighbors that are directly connected to the protein of study. The extension to larger neighborhoods could be easily done, and would likely yield some improvement in predictive performance, but at the expense of some additional mathematical overhead, replacing the BN framework with one like those in [20–24]. Our choice to use a one-hop neighborhood structure here simply reflects a desire of maintaining a certain transparency in our model development, so as to emphasize primarily the effect of adding hierarchical information.
Local Hierarchical Conditional Probability
where = 1 - f.
Global Hierarchical Conditional Probability
Equipped with the local hierarchical conditional probability, for any non-root GO term G d in the hierarchy, we now derive an expression for , the probability that protein i is annotated with G d given its neighborhood status.
Note that by the true-path rule we have , where Gd-1is the parent of G d .
where is the local hierarchical neighborhood information on the parent-child GO term pair, G m and Gm-1.
As we go down along the path from the root in the hierarchy, the probability that protein i is labeled with a more specific term is always no more than the probability of any of its ancestors. If the label of a term is predicted as -1, according to some pre-chosen threshold, the labels for every descendent below will also be assigned as -1. Thus, our model is guaranteed to produce GO term label assignments that comply with the true-path rule. Most existing methods for protein function prediction use ad-hoc enforcement to correct predictions in order to maintain true-path consistency.
Classification using our Hierarchical Binomial-Neighborhood (HBN) model may be accomplished using a straightforward top-to-bottom algorithm. Specifically, for a given protein i, and a pre-determined threshold t, we proceed from the child terms of the root in the MST representation of the GO hierarchy in the following fashion.
initialize PROB = 1
for m = 1: d max ,
while ∃ unlabeled terms G m at level m,
if >t, set = 1
else set = -1 and propagate to all descendants of G m
Notice that setting the labels at each step is not necessary. However, doing so facilitates the computation efficiency, by avoiding the calculation of the probabilities below the threshold. By letting t = 0, we can obtain all probabilities. The fact that we can do this is a direct outcome of the fact that our predictions are guaranteed to obey the true-path rule.
For a given protein, the algorithm requires at most O(N GO ) steps, where N GO is the number of GO terms, and therefore, for N Protein proteins, no more than O(N Protein N GO ) steps are needed. Hence, the algorithm is linear in the size of both the PPI and the GO networks. In practice, it has been found to be quite fast, particularly because each protein can be expected to have a large proportion of -1 labels, and once a -1 is assigned to a term it is simply propagated to all descendant terms.
The PPI data used in this paper is from the yeast Saccharomyces cerevisiae, as updated in January 2007 at http://www.thebiogrid.org/. There are 5143 genes (nodes) and 31190 non-redundant physical interactions (edges), after deleting self-interacting and unannotated nodes, and genetic interactions.
The Gene Ontology used is biological process, updated in June 2006, as posted at http://www.geneontology.org/. From the biological perspective, more specific terms are more interesting than less specific ones, and we therefore only predict for terms with 300 or less genes annotated in the database. As a result, the entire biological process ontology breaks down into 47 sub-hierarchies. In addition, to avoid extremes with little to no information, we only predict for terms with at least 5 genes. We also delete GO:0000004, biological function unknown. The total number of terms predicted is 1037. The GO term annotations used to train the model are updated in June 2006, from http://www.geneontology.org/.
From the initial data, a set of labels is constructed in a way that follows the true-path rule. Specifically, for any protein-term association in the data, we assign a +1 label to the term for that protein, as well as to all of the ancestors in the transitive closure of that term in the GO hierarchy. We repeat this for all protein-term associations to get the set of all positive labels. We assign -1 to all other protein-term pairs.
Please visit http://math.bu.edu/people/kolaczyk/software.html for the datasets used in this paper and the Matlab scripts for the HBN algorithm.
We apply our Hierarchical Binomial-Neighborhood (HBN) method, as well as the "Nearest-Neighbor" (NN) algorithm and the Binomial-Neighborhood (BN) method of , to the data just described, using a 5-fold cross-validation design. The HBN and BN methods each produce a probability of protein-term association, while the NN algorithm similarly produces a number between 0 and 1 (i.e., the fraction of a protein's neighbors in the PPI network possessing the term in question). For each test fold, representing 20% of the proteins, all GO term annotations are taken as unknown, and predictions of protein-term associations are made with each of the three methods, based on comparison of their output to a threshold t ∈ [0, 1], using the annotations in the other four folds as training data.
We use three metrics by which to evaluate the performance characteristics of each classification method. The first is the standard Receiver Operating Characteristic (ROC) curve, which evaluates a classifier's performance in a manner that aggregates across all terms. We examine ROC curves both for the overall GO hierarchy and within each of the 47 sub-hierarchies.
Since the ROC curve, as a metric, is 'flat', in that it ignores any hierarchical structure among terms, we use as a second metric a hierarchical performance measure, called hF β , proposed in [25, 26] and defined as follows. For a hierarchy of GO terms and any protein i that is annotated with the hierarchy root, first take the transitive closure of all of the most specific +1 predictions and change -1's into +1's, if there is any. Note that this step is only necessary here for "Nearest-Neighbor" and the "Binomial-Neighborhood" method.
Next, for each protein i, calculate the true positive (TP), false positive (FP), and false negative (FN) counts, based on the true labels of all terms in the hierarchy and the corrected predictions, denoted as TP i ,
where β ∈ [0, ∞) is a tuning parameter. In this paper, we use hF1 with equal weights on precision and recall, simply denoted as hF. Note that hF, hP and hR are all scaled between 0 and 1, with higher hF indicating better performance over the hierarchy.
where f is the prevalence of a given term. This quantity therefore indicates relative performance of a given classifier, in comparison with a method that simply predicts proteins to have a given term with a priori probability f.
G is the target GO term, GO:0045941;
k is the number of training neighbors labeled with G;
N is the training neighborhood size;
is the probability with which neighbors are independently labeled with G, given protein YGL017W is labeled with G;
is the probability with which neighbors are independently labeled with G, given protein YGL017W is NOT labeled with G;
f* is the relative frequency of G in the training set, and = 1 - f*.
Parameters from Nearest-Neighbor (NN), Binomial-Neighborhood (BN) and Hierarchical Binomial-Neighborhood (HBN)
k = 2
k = 2
k = 2
N = 6
N = 6
N = 6
p1 = 0.2927
p0 = 0.0992
f* = 0.0106
f = 0.2186
P = 0.3333
P = 0.3381
P = 0.6566
Recall that, as a result of our predicting only for GO terms annotated with less than 300 proteins in the database, the full biological process hierarchy actually breaks into 47 sub-hierarchies. Examination of performance on these sub-hierarchies provides some sense of the extent to which the HBN performance improvements are uniform across the GO hierarchy. We compute a ROC curve and hF plot for each of the sub-hierarchies (See additional file 1: ROC curves and hF plots for 47 sub-hierarchies in cross-validation study). Numerical comparison of the corresponding AUCs finds, at a 5% significance level that HBN improves on BN in 38 of the 47 sub-hierarchies, according to the ROC curves, 19 of the sub-hierarchies, according to the hF plots, and 18 commonly between them. Conversely, BN outperforms HBN in only 1 of the 47 sub-hierarchies, according to the ROC curves, and 9 of the sub-hierarchies, according to the hF plots. (NN was uniformly the worse performer.)
These ROC plots are constructed using the original BN (and NN) predictions, without any correction for "true-path" consistency. However, the overwhelming improvement of HBN over BN indicated by the ROC curves is actually similar when the initial predictions of BN are post-processed by applying transitive closure. Specifically, HBN improves on BN in 28 of the sub-hierarchies, while BN outperforms HBN in only 4 sub-hierarchies. These results strongly suggest the validity of our premise as to the importance of incorporating hierarchical information in the GO database in the initial construction of a classifier. The hF plots, which incorporate transitive closure for BN (and NN) directly into their definition, and are designed to provide a more accurate summary of classification accuracy with hierarchically related class labels, support this conclusion. The gains of HBN over BN, although reduced, are still substantial, with HBN outperforming BN in just over 40% of the 47 hierarchies, and BN outperforming HBN, in less than 20%.
Shown in Fig. 12 is the log-odds PPV of all three methods, for NPV = 0.987, as a function of depth in the GO hierarchy. We see that the improvement in positive predictive capabilities of HBN is fairly uniform across depths. A one-sided paired t-test at each depth confirmed the differences to be highly significant (i.e., p-values roughly 0.001 or less) at depths 3, 4, and 5, but not at depths 1, 2, 6, 7, or 8. We note, however, that the lack of significance at the latter depths is likely partly driven by sample size, since at each of these depths there were less than 30 cases of positive protein-term predictions by all three methods used in calculating LO-PPV, while at the other three depths there were well in excess of 100.
In Silico Validation Results
Recall that the above results are based on gene-GO term annotations in the January 2007 GO database. As an in silico proxy to in vitro validation, beyond that of the cross-validation study, we examined the performance of HBN, in comparison to NN and BN, when applied to new gene-GO term annotations found in the updated May 2007 database. Here our goal is to evaluate the robustness of our cross-validation results for predicting naturally occurring unknowns (i.e., as opposed to those left out in a random fashion through cross-validation).
We applied HBN, BN, and NN in each of the 47 sub-hierarchies to genes that (i) were annotated with only the root term in the June 2006 database, and (ii) were assigned more specific functions in that sub-hierarchy in the May 2007 database. There were a total of 508 genes that had received at least one new annotation in one of the sub-hierarchies, with as few as 1 gene and as many as 74 genes per hierarchy. There were 33 sub-hierarchies having such genes. The methods were compared for their accuracy through the hF function. We present the hF plots for only those sub-hierarchies (17) with sufficiently many annotations to yield meaningful results (See additional file 2: hF plots for 17 sub-hierarchies in in silico study); the hF measures for the others are trivial, due to too few new annotations. Over 40% (i.e., 7 out of 17) of these hF plots find HBN to work best in correctly detecting more specific associations, over a reasonably broad range of threshold values; in the majority of the remaining plots, HBN yields results similar to the at least one of the two other methods.
Overall, these results suggest that the performance advantages of HBN indicated by the cross-validation study are, if anything, potentially understated.
For a well-studied organism, such as S. cerevisiae, one can make certain inferences about genes for which there is no annotated function. First, it is likely that the gene has low sequence similarity to any gene of known function, thus preventing the most straightforward computational methods of predicting gene function. Secondly, it is likely that no altered phenotype is observed upon protein overexpression, knockdown, or knockout, foiling first-pass experimental attempts to discover gene function. In these cases, the next step would involve more elaborate experimental methods, which would typically be guided by a co-expression analysis of publicly available microarray data. The experiments selected will, in general, be time-consuming, costly, different for each gene being investigated, and offer modest chances for success. Thus, the development of more sophisticated and accurate methods of computational prediction of function which could precisely guide experimental activity remains a top priority.
Biological and biomedical ontologies have become a prominent, and perhaps indispensable, tool in bioinformatics and biological research. GO in particular has been used in numerous papers to detect biological process enrichment of co-expressed genes, identify biological processes associated with disease, etc. However in the vast majority of applications the hierarchical nature of GO is actually not being used directly. For example, in enrichment testing such as GSEA or GNEA we typically test for every biological process if the differentially expressed genes in some condition are associated with this process more than expected by chance.
Thus while GO and other ontologies obviously organize biological knowledge in an intuitive fashion, the structure is not typically exploited for actual inference by predictive analysis tools. This is rather different from evolutionary analysis tools and genetics frameworks where probabilistic ancestor/descendant relationships in phylogenies (hierarchies) are exploited very directly with substantial practical and theoretical benefits.
Our work here suggests that similar developments of probabilistic frameworks are not only feasible, but promising, for improved protein function inference with gene ontologies. In addition, it suggests the need for further research to be done to clarify the utility of different representations for such purposes. Finally, it also raises the prospect of re-engineering ontologies or other similar representations, from the perspective of seeking to provide maximal value for probabilistic inference programs.
We have developed a probabilistic framework for automated prediction of protein function using relational information (e.g., a network of protein-protein interactions) which exploits the hierarchical structure of ontologies, and guarantees the predictions obey a 'true-path' annotation rule. We have evaluated the performance of our method and compared it with two other network-based methods by both cross-validation and an in silico study, on the genome of yeast, for terms from the biological process category in the Gene Ontology. Results showed that our proposed method, by utilizing the ontological structure, significantly improved the prediction accuracy and the capability of detecting positive annotations over the hierarchies. Furthermore, our analysis suggests that such improvement persists across the ontology depths.
The authors thank Russ Greiner for early feedback. This work was supported in part by NHGRI grant R01 HG003367-01A1, NIH award GM078987, NSF grant ITR-048715, and ONR award N00014-06-1-0096.
- Murali TM, Wu CJ, Kasif S: The art of gene function prediction. Nature Biotechnology 2006, 24: 1474–1475. 10.1038/nbt1206-1474View ArticlePubMedGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, 32: D138–41. [Database issue]. 10.1093/nar/gkh121PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Letovsky S, Kasif S: Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 2003, 19: i197-i204. 10.1093/bioinformatics/btg1026View ArticlePubMedGoogle Scholar
- Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D: A bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae ). Proc Natl Acad Sci USA 2003, 100: 8348–8353. 10.1073/pnas.0832373100PubMed CentralView ArticlePubMedGoogle Scholar
- Lee I, Date SV, Adai AT, Marcotte EM: A probabilistic functional network of yeast genes. Science 2004, 306: 1555–1558. 10.1126/science.1099511View ArticlePubMedGoogle Scholar
- Nariai N, Kolaczyk ED, Kasif S: Probabilistic protein function prediction from heterogeneous genome-wide data. PLoS ONE 2007, 2(3):e337. 10.1371/journal.pone.0000337PubMed CentralView ArticlePubMedGoogle Scholar
- Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004, 20(9):1464–5. 10.1093/bioinformatics/bth088View ArticlePubMedGoogle Scholar
- Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinder-open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 2004, 20(18):3710–5. 10.1093/bioinformatics/bth456PubMed CentralView ArticlePubMedGoogle Scholar
- Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B: GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol 2004, 5(12):R101. 10.1186/gb-2004-5-12-r101PubMed CentralView ArticlePubMedGoogle Scholar
- Lanckriet GRG, Bie TD, Cristianini N, Jordan MI, Noble WS: A statistical framework for genomic data fusion. Bioinformatics 2004, 20: 2626–2635. 10.1093/bioinformatics/bth294View ArticlePubMedGoogle Scholar
- Deng M, Chen T, Sun F: An integrated analysis of protein function prediction. Journal of Computational Biology 2004, 11: 463–475. 10.1089/1066527041410346View ArticlePubMedGoogle Scholar
- Barutcuoglu Z, Schapire RE, Troyanskaya OG: Hierarchical multi-label prediction of gene function. Bioinformatics 2006, 22: 830–836. 10.1093/bioinformatics/btk048View ArticlePubMedGoogle Scholar
- Eisner R, Poulin B, Szafron D, Lu P, Greiner R: Improving protein function prediction using the hierarchical structure of the Gene Ontology. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2005.Google Scholar
- Koller D, Sahami M: Hierarchically classifying documents using very few words. proceedings of the 14th International Conference on Machine Learning (ICML) 1997., 223:Google Scholar
- Shahbaba B, Neal M: Gene function classification using Bayesian models with hierarchy-based priors. BMC Bioinformatics 2006, 7: 448. 10.1186/1471-2105-7-448PubMed CentralView ArticlePubMedGoogle Scholar
- Blockeel H, Schietgat L, Struyf J, Clare ADS: Hierarchical multilabel classification trees for gene function prediction. Probabilistic Modeling and Machine Learning in Structural and Systems Biology (PMSB) 2006.Google Scholar
- Chow CK, Liu CN: Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 1968, IT-14(3):462–467. 10.1109/TIT.1968.1054142View ArticleGoogle Scholar
- Friedman JH: On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1997, 1: 55–77. 10.1023/A:1009778005914View ArticleGoogle Scholar
- Samanta MP, Liang S: Predicting protein functions from redundancies in large-scale protein interaction networks. PNAS 2003, 100: 12579–12583. 10.1073/pnas.2132527100PubMed CentralView ArticlePubMedGoogle Scholar
- Brun C, Chevenet F, Martin D, Wojcik J, Guenoche A, Jacq B: Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biology 2003, 5: R6. 10.1186/gb-2003-5-1-r6PubMed CentralView ArticlePubMedGoogle Scholar
- Chua HN, Sung WK, L W: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 2006, 22(13):1623–1630. 10.1093/bioinformatics/btl145View ArticlePubMedGoogle Scholar
- Chua HN, Sung WK, L W: Using indirect protein interactions for the prediction of Gene Ontology functions. BMC Bioinformatics 2007, 8: S8. 10.1186/1471-2105-8-S4-S8PubMed CentralView ArticlePubMedGoogle Scholar
- Navieva E, Jin K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21: i302-i310. 10.1093/bioinformatics/bti1054View ArticleGoogle Scholar
- Kiritchenko S, Famili F, Matwin S, Nock R: Learning and evaluation in the presence of class hierarchies: application to text categorization. Proceedings of the 19th Canadian Conference on Artificial Intelligence 2006, NRC: 48737.Google Scholar
- Kiritchenko S, Matwin S, Famili AF: Hierarchical text categorization as a tool of associating genes with gene ontology codes. Proceedings of the 2nd European Workshop on Data Mining and Text Mining in Bioinformatics 2004, NRC: 48050.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.