Integration of relational and hierarchical network information for protein function prediction
 Xiaoyu Jiang^{1},
 Naoki Nariai^{2},
 Martin Steffen^{3, 4},
 Simon Kasif^{2, 4} and
 Eric D Kolaczyk^{1}Email author
https://doi.org/10.1186/147121059350
© Jiang et al; licensee BioMed Central Ltd. 2008
Received: 14 March 2008
Accepted: 22 August 2008
Published: 22 August 2008
Abstract
Background
In the current climate of highthroughput computational biology, the inference of a protein's function from related measurements, such as proteinprotein interaction relations, has become a canonical task. Most existing technologies pursue this task as a classification problem, on a termbyterm basis, for each term in a database, such as the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functions. However, ontology structures are essentially hierarchies, with certain top to bottom annotation rules which protein function predictions should in principle follow. Currently, the most common approach to imposing these hierarchical constraints on networkbased classifiers is through the use of transitive closure to predictions.
Results
We propose a probabilistic framework to integrate information in relational data, in the form of a proteinprotein interaction network, and a hierarchically structured database of terms, in the form of the GO database, for the purpose of protein function prediction. At the heart of our framework is a factorization of local neighborhood information in the proteinprotein interaction network across successive ancestral terms in the GO hierarchy. We introduce a classifier within this framework, with computationally efficient implementation, that produces GOterm predictions that naturally obey a hierarchical 'truepath' consistency from root to leaves, without the need for further postprocessing.
Conclusion
A crossvalidation study, using data from the yeast Saccharomyces cerevisiae, shows our method offers substantial improvements over both standard 'guiltbyassociation' (i.e., NearestNeighbor) and more refined Markov random field methods, whether in their original form or when postprocessed to artificially impose 'truepath' consistency. Further analysis of the results indicates that these improvements are associated with increased predictive capabilities (i.e., increased positive predictive value), and that this increase is consistent uniformly with GOterm depth. Additional in silico validation on a collection of new annotations recently added to GO confirms the advantages suggested by the crossvalidation study. Taken as a whole, our results show that a hierarchical approach to networkbased protein function prediction, that exploits the ontological structure of protein annotation databases in a principled manner, can offer substantial advantages over the successive application of 'flat' networkbased methods.
Keywords
Background
Proteins are fundamental to the complex molecular and biochemical processes taking place within organisms. An understanding of their role is therefore critical in biology and biorelated areas, for purposes ranging from general knowledge to the development of targeted medicine and diagnostics. Highthroughput sequencing technology has identified a tremendous number of genes with no known functional annotation. On average, as many as 70% of the genes in a genome have poorly known or unknown functions [1]. Not surprisingly, therefore, the prediction of protein function has become an important and urgent problem in functional genomics.
Protein function prediction can take many forms. The traditional and most popular methodologies use homology modeling and sequence similarity to infer biochemical function [2, 3]. In simple cases, such as certain families of ribosomal proteins, globins, kinases or caspases, these procedures work reasonably well. Sequence similarity has been used with great success for inference of molecular function. For biological process and pathway annotation, guilt by association using functional linkage methods has been a popular choice in recent years.
For example, microarrays are often used to cluster proteins into groups of genes that respond concordantly to a given environmental stimuli. When these groups are strongly enriched in proteins in a given biological process such as insulin signaling and also contain proteins without annotation we often take the leap of faith and predict the unknown proteins to be associated with this process as well. Similarly, when two proteins are found to interact in a high throughput assay we also tend to use this as evidence of functional linkage.
However, enrichment and guilt by association are often highly misleading and can lead to a very high false positive rate if not used with caution. The work in [4] and several other papers, e.g., [5–7], attempted to frame these inference problems in a precise networkbased probabilistic framework. Here we attempt to make a fundamental advance in this area, by augmenting the networkbased perspective to additionally make explicit use of the structure of the GO hierarchy to compute more precise probabilities, thereby improving on the quality of predictions made by the inference algorithms.
More broadly, the work in this paper is important in demonstrating that an important role can be played in this context by the knowledge captured in biological ontologies, when properly harnessed. That this should be the case is not obvious a priori. For example, while many scientists use GO in their daily research, it can be (and has been) claimed that overlap among categories, as well as the inherent ambiguity and semantic complexity of naming biological functions and processes, can frequently lead to misleading interpretations and wild goose chases. Classic statistical approaches are based on flat disjoint categories, and quantitative measures of annotation similarity such as through semantic similarity remain somewhat ad hoc.
Nevertheless, despite such concerns, our work here shows that in the present context of automated protein function prediction, the leverage of hierarchies grounded in biological ontologies can yield real, quantifiable advantages over 'flat' networkbased approaches.
Objective
Computational protein function prediction is typically treated as a classification problem. From this perspective, given a protein i and the label G of a potential function for that protein, the goal is to predict whether or not i has label G, using a classifier built from a set of training cases and additional related data. Such related data can be of many types (e.g., protein interaction data, gene expression data, protein localization data) but often can be summarized in the form of a functional linkage graph (e.g., proteinprotein interaction network, gene association network). The labels G typically derive from a database of terms.
Databases of labels G are commonly structured in a hierarchical form (more formally, as a directed acyclic graph (DAG)). The Gene Ontology (GO) database is one such example http://www.geneontology.org/. Viewed as a DAG, nodes represent labels and a link represents the is_a and part_of relations between labels. Function assignments to proteins must obey the truepath rule: if a child term (i.e., more specific term) describes the gene product, then all its parent terms (i.e., less specific terms) must also apply to that gene product. Fig. 1(b) shows a small schematic illustration. For any protein labeled with term A, it may or may not have term D, the child of A; on the other hand, if it does not have term A, it surely does not have D.
This annotation rule suggests that when predicting the label of a term in the hierarchy, it is helpful to first consider whether the protein has the parent term or not. Thus, informative are not only the neighbors labeled with the term of interest but also those labeled with the parent. For instance, to predict the label of term D for the central protein in Fig. 1(a), we want to use both the neighbors labeled with A and the neighbors labeled with D. The exploitation of GO hierarchy is not novel, and indeed is natural. It has been used in functional annotation of genes, as mentioned in the Related Work section, as well as for other purposes, such as identifying over and underrepresentation of GO terms in a gene dataset, and clustering functionally related genes [8–10].
As currently practiced in most instances, prediction of protein function is done with classifiers trained separately for each possible label G, as in [4, 7, 11, 12]. (Please also see the section of Related Work.) But, as just discussed, the overall collection of labels to be assigned generally has a hierarchical structure to it i.e., the labels are related to each other in a specific manner. This structure typically is enforced only after producing an initial set of predictions, as postprocessing steps, either using transitive closure, [4], or using more sophisticated methods, [13, 14].
Most existing methods, as discussed earlier, predict protein function in a termbyterm fashion, without considering the relationship among terms. Suppose the probabilities in the plot are obtained from one of such methods. If we apply a cutoff of 0.5, which is a commonly used threshold in this field, we will predict that the protein is NOT annotated with term A, since the probability of having A is 0.4, less than 0.5; and is annotated with A's child C. This violates the truepath rule, since if the protein is predicted not having term A, then it is not having any of A's descendent terms. On the other hand, the protein is predicted to be labeled with both terms B and D, with probabilities of 0.7 and 0.5, respectively, which obeys the truepath rule, with the prediction on D as a false positive. Such a violation to the truepath rule is not uncommon.
The basic premise of this paper is that reference to this hierarchical relationship among labels is best incorporated in the initial stage of constructing a classifier, as a valuable source of information in and of itself. Our objective here is to demonstrate the power of this premise and to show that it may be tapped in the form of a single, coherent probabilistic classifier. In particular, we develop a probability model that integrates relational data and a hierarchy of labels, and illustrate its advantages in predicting protein function using a PPI network and the Gene Ontology (GO) hierarchy.
Related Work
Many methodologies have been proposed to predict protein functions. Most of the earlier methods tend to use a single source of protein information, such as PPI. Typical examples include the "NearestNeighbor" algorithm, also known as "guiltbyassociation" principle, and the BinomialNeighborhood (BN) method [4].
These earlier methods were followed later by a surge of interest in combining heterogeneous sources of protein information. For example, a machine learning approach integrating datasets of PPI, gene expression, hydropathy profile and amino acid sequences, in the form of different kernels, has been introduced [11]. Various genomewide data can also be employed in a Bayesian framework to produce posterior probability for function assignments [5, 6]. And a Markov Random Field model combining PPI network and protein domain information was introduced in [12]. A common characteristic of these methods is detecting protein functions individually, without considering the relationship among them. As remarked, a pitfall of this is that the predictions may conflict with the truepath rule of ontologies.
Motivated in part by seminal work of [15], combining protein data and ontology structure has recently become a focus. One approach is using a Bayesian network structure to correct inconsistent function predictions, by calculating the largest posterior probability of the truepath consistent labels, given the predictions from independent classifiers for each of the proteins [13]. Similar work has been done in [14], where multiple classifiers are built and training data are modified according to the GO hierarchy. A Bayesian model consisting of a set of nested multinomial logit models, where a prior describing correlations of parameters for nearby GO terms is trained by the hierarchy, has been proposed in [16]. Observing the fact that a protein is actually associated with multiple GO terms, this problem can also be treated as a hierarchical multilabel classification task [13, 17]. Yielding various degrees of improvement in prediction accuracy, these methods all seek to take advantage of the hierarchical label structure. However, importantly, we note that all of those that predict at multiple depths in the GO hierarchy take a separate step to correct truepath inconsistent predictions, rather than producing them directly in a probabilistically coherent fashion.
In summary, combining relational protein data, such as PPI, and hierarchical structures, as in GO, in one probabilistic model to predict truepath consistent function labels, has to the best of our knowledge not been done to date. This task is the focus of our work.
Methods
Ontologies like GO are structured as directed acyclic graphs (DAG's), where a child term may have multiple parent terms. The DAG structure, with alternative paths from the root to internal and leaf terms, is one of the reasons that formal approaches to annotation predictions have been difficult. It is well known that computing the most likely assignment of values to variables in a DAG of size N given their conditional probabilities on the arcs is a classical NPhard problem in graphical models. In fact, variants of this problem are actually formally harder by some theoretical considerations. Therefore, people routinely use tree approximations of probability distributions, which goes back to the work in [18]. In our work, clearly, a treebased approach is the first step to something concrete, rather than ad hoc. We will show in the following sections that, as a way of balance, and in light of our results, it would appear that a tree is a good compromise between ad hoc and completely rigorous usage of the DAG.
We apply a minimal spanning tree (MST) algorithm to transform a DAG into a treestructured hierarchy, by preserving the link between the child and the parent with the heaviest weight w, where w is the empirical conditional probability of having the child term given having the parent, based on a given PPI training set. Each GO term, in such a hierarchy, may still have more than one child term, but only one parent term (if the term itself is not the root of the hierarchy).
As a result of this transformation, there now exists a unique path from the root term to any nonroot term. That is, let G_{ d }denote a term at the dth level below the root. For example, d = 1 if the term is a child of the root. Then in our treestructured hierarchy there is always a unique path of the form G_{ d }, G_{d1}, ..., G_{1}, G_{0}, with G_{0} being the root, and G_{i1}being the parent of G_{ i }. For example, in Fig. 1(b), the result of applying our MST algorithm would be to drop the (B, E) link.
We propose to build a classifier in this setting based on the use of hierarchical conditional probabilities of the form $P({Y}_{{G}_{d}}^{(i)}=1\mathcal{X})$. Here i indexes a certain protein, and G_{ d }is a GO term of interest. The binary variable ${Y}_{{G}_{d}}^{(i)}$ = 1 indicates that protein i is labeled with G_{ d }; otherwise, it takes the value 1. Finally, $\mathcal{X}$ denotes the status of all of protein i's neighbors in the PPI network, across all GO terms, as well as the status for protein i of all of the ancestor terms of G_{ d }. We will refer to $\mathcal{X}$ as the neighborhood status of i.
In the remainder of this section, we present certain model assumptions that in turn lead to a particular form for the probabilities $P({Y}_{{G}_{d}}^{(i)}=1\mathcal{X})$, as well as an efficient algorithm for their computation.
Assumptions
We assume that labels on proteins obey a Markov property with respect to the PPI. That is, that the labeling of a protein is independent of any other proteins given that of its neighbors. Similarly, we assume that a Markov property holds on the GO treestructured hierarchy, meaning that for a given protein the status of a GO term label is independent of that of the other terms, given that of its parent.
where

G_{ ch }is the child term; G_{ pa }is its parent;

${k}_{{G}_{ch}}$ is the number of i's neighbors labeled with the G_{ ch }, and ${k}_{{G}_{pa}}$ is the number of neighbors labeled with G_{ pa };

p_{1} is the probability with which neighbors of i are independently labeled with G_{ ch }(being already labeled with G_{ pa }), given i is labeled with G_{ ch };

p_{0} is the probability with which neighbors of i are independently labeled with G_{ ch }(being already labeled with G_{ pa }), given i is NOT labeled with G_{ ch }but is labeled with G_{ pa }.
We refer to this overall set of model assumptions as the Hierarchical BinomialNeighborhood (HBN) assumptions, in reference to their extension of the BinomialNeighborhood (BN) assumptions of [4]. Note that the form of the probabilities above assumes that ${k}_{{G}_{ch}}$, the number of neighbors with the child term, is independent of the neighborhood size N, given ${k}_{{G}_{pa}}$, the number of neighbors with the parent. This condition seems reasonable since, recall that, by the truepath rule, only those among i's neighbors that are labeled with the parent term can possibly have the child term. In other words, those neighbors with the child form a subset of those neighbors with the parent.
Parameters p_{1} and p_{0} are termspecific: different terms have different p_{1} and p_{0}. For a given term G_{ ch }, all proteins share the same p_{1} and p_{0}. They are estimated by pseudolikelihood approach, from the labeled training data, separately for each G_{ ch }to be predicted. When calculating ${k}_{{G}_{ch}}$, ${k}_{{G}_{pa}}$, we use only the neighbors in the training set.
More specifically, assume there are n proteins in the training set, with m proteins labeled with G_{ ch }and n  m proteins not labeled with G_{ ch }. To simplify notation, let k_{ch, i}and k_{pa, i}be protein i's training neighbors labeled with G_{ ch }and G_{ pa }, respectively. For the m G_{ ch }annotated proteins, we haveK_{ch, i}~Binomial(k_{pa, i}, p_{1}),
with ${Y}_{{G}_{ch}}^{(i)}$ = 1.
with ${Y}_{{G}_{ch}}^{(i)}$ = 1. Estimators ${\stackrel{\u02c6}{p}}_{1}$ and ${\stackrel{\u02c6}{p}}_{0}$ are formally pseudolikelihood estimators.
An issue of estimation is the lack of data. Few data will affect the predictability and interpretability of the terms. Thus, we focus on terms with at least 5 proteins annotated with in the GO dataset. In principle, more formal work could be done, by using smoothing techniques and Empirical Bayes approaches, which we are exploring in our current work. It appears that improvement is not uniform, and the issue clearly requires separate consideration and will likely form a substantial component of a separate paper. Its subtlety likely is due to the wellknown issue of classifiers doing well for classification while still being offtarget for estimation [19].
Also notice that we use onehop neighborhoods in this paper, i.e., neighbors that are directly connected to the protein of study. The extension to larger neighborhoods could be easily done, and would likely yield some improvement in predictive performance, but at the expense of some additional mathematical overhead, replacing the BN framework with one like those in [20–24]. Our choice to use a onehop neighborhood structure here simply reflects a desire of maintaining a certain transparency in our model development, so as to emphasize primarily the effect of adding hierarchical information.
Local Hierarchical Conditional Probability
where $\overline{f}$ = 1  f.
Global Hierarchical Conditional Probability
Equipped with the local hierarchical conditional probability, for any nonroot GO term G_{ d }in the hierarchy, we now derive an expression for $P({Y}_{{G}_{d}}^{(i)}=1\mathcal{X})$, the probability that protein i is annotated with G_{ d }given its neighborhood status.
Note that by the truepath rule we have $P({Y}_{{G}_{d}}^{(i)}=1,{Y}_{{G}_{d1}}^{(i)}=1\mathcal{X})=0$, where G_{d1}is the parent of G_{ d }.
where ${\mathcal{X}}_{LOCA{L}_{m}}$ is the local hierarchical neighborhood information on the parentchild GO term pair, G_{ m }and G_{m1}.
As we go down along the path from the root in the hierarchy, the probability that protein i is labeled with a more specific term is always no more than the probability of any of its ancestors. If the label of a term is predicted as 1, according to some prechosen threshold, the labels for every descendent below will also be assigned as 1. Thus, our model is guaranteed to produce GO term label assignments that comply with the truepath rule. Most existing methods for protein function prediction use adhoc enforcement to correct predictions in order to maintain truepath consistency.
Algorithm
Classification using our Hierarchical BinomialNeighborhood (HBN) model may be accomplished using a straightforward toptobottom algorithm. Specifically, for a given protein i, and a predetermined threshold t, we proceed from the child terms of the root in the MST representation of the GO hierarchy in the following fashion.
initialize PROB = 1
for m = 1: d_{ max },
while ∃ unlabeled terms G_{ m }at level m,
compute ${\text{PROB}}_{{G}_{m}}\leftarrow {\text{PROB}}_{{G}_{m1}}\times $
$P({Y}_{{G}_{m}}^{(i)}=1{Y}_{{G}_{m1}}^{(i)}=1;{\mathcal{X}}_{LOCA{L}_{m}})$
if ${\text{PROB}}_{{G}_{m}}$ >t, set ${Y}_{{G}_{m}}^{(i)}$ = 1
else set ${Y}_{{G}_{m}}^{(i)}$ = 1 and propagate to all descendants of G_{ m }
end
end
Notice that setting the labels at each step is not necessary. However, doing so facilitates the computation efficiency, by avoiding the calculation of the probabilities below the threshold. By letting t = 0, we can obtain all probabilities. The fact that we can do this is a direct outcome of the fact that our predictions are guaranteed to obey the truepath rule.
For a given protein, the algorithm requires at most O(N_{ GO }) steps, where N_{ GO }is the number of GO terms, and therefore, for N_{ Protein }proteins, no more than O(N_{ Protein }N_{ GO }) steps are needed. Hence, the algorithm is linear in the size of both the PPI and the GO networks. In practice, it has been found to be quite fast, particularly because each protein can be expected to have a large proportion of 1 labels, and once a 1 is assigned to a term it is simply propagated to all descendant terms.
Results
Data
The PPI data used in this paper is from the yeast Saccharomyces cerevisiae, as updated in January 2007 at http://www.thebiogrid.org/. There are 5143 genes (nodes) and 31190 nonredundant physical interactions (edges), after deleting selfinteracting and unannotated nodes, and genetic interactions.
The Gene Ontology used is biological process, updated in June 2006, as posted at http://www.geneontology.org/. From the biological perspective, more specific terms are more interesting than less specific ones, and we therefore only predict for terms with 300 or less genes annotated in the database. As a result, the entire biological process ontology breaks down into 47 subhierarchies. In addition, to avoid extremes with little to no information, we only predict for terms with at least 5 genes. We also delete GO:0000004, biological function unknown. The total number of terms predicted is 1037. The GO term annotations used to train the model are updated in June 2006, from http://www.geneontology.org/.
From the initial data, a set of labels is constructed in a way that follows the truepath rule. Specifically, for any proteinterm association in the data, we assign a +1 label to the term for that protein, as well as to all of the ancestors in the transitive closure of that term in the GO hierarchy. We repeat this for all proteinterm associations to get the set of all positive labels. We assign 1 to all other proteinterm pairs.
Please visit http://math.bu.edu/people/kolaczyk/software.html for the datasets used in this paper and the Matlab scripts for the HBN algorithm.
CrossValidation Study
We apply our Hierarchical BinomialNeighborhood (HBN) method, as well as the "NearestNeighbor" (NN) algorithm and the BinomialNeighborhood (BN) method of [4], to the data just described, using a 5fold crossvalidation design. The HBN and BN methods each produce a probability of proteinterm association, while the NN algorithm similarly produces a number between 0 and 1 (i.e., the fraction of a protein's neighbors in the PPI network possessing the term in question). For each test fold, representing 20% of the proteins, all GO term annotations are taken as unknown, and predictions of proteinterm associations are made with each of the three methods, based on comparison of their output to a threshold t ∈ [0, 1], using the annotations in the other four folds as training data.
Evaluation
We use three metrics by which to evaluate the performance characteristics of each classification method. The first is the standard Receiver Operating Characteristic (ROC) curve, which evaluates a classifier's performance in a manner that aggregates across all terms. We examine ROC curves both for the overall GO hierarchy and within each of the 47 subhierarchies.
Since the ROC curve, as a metric, is 'flat', in that it ignores any hierarchical structure among terms, we use as a second metric a hierarchical performance measure, called hF_{ β }, proposed in [25, 26] and defined as follows. For a hierarchy of GO terms and any protein i that is annotated with the hierarchy root, first take the transitive closure of all of the most specific +1 predictions and change 1's into +1's, if there is any. Note that this step is only necessary here for "NearestNeighbor" and the "BinomialNeighborhood" method.
Next, for each protein i, calculate the true positive (TP), false positive (FP), and false negative (FN) counts, based on the true labels of all terms in the hierarchy and the corrected predictions, denoted as TP_{ i },
where β ∈ [0, ∞) is a tuning parameter. In this paper, we use hF_{1} with equal weights on precision and recall, simply denoted as hF. Note that hF, hP and hR are all scaled between 0 and 1, with higher hF indicating better performance over the hierarchy.
where f is the prevalence of a given term. This quantity therefore indicates relative performance of a given classifier, in comparison with a method that simply predicts proteins to have a given term with a priori probability f.
An Illustration
where

G is the target GO term, GO:0045941;

k is the number of training neighbors labeled with G;

N is the training neighborhood size;

${p}_{1}^{\ast}$ is the probability with which neighbors are independently labeled with G, given protein YGL017W is labeled with G;

${p}_{0}^{\ast}$ is the probability with which neighbors are independently labeled with G, given protein YGL017W is NOT labeled with G;

f* is the relative frequency of G in the training set, and ${\overline{f}}^{\ast}$ = 1  f*.
Parameters from NearestNeighbor (NN), BinomialNeighborhood (BN) and Hierarchical BinomialNeighborhood (HBN)
NN  BN  HBN 

k = 2  k = 2  k = 2 
N = 6  N = 6  N = 6 
.  ${p}_{1}^{\ast}$ = 0.0661  p_{1} = 0.2927 
.  ${p}_{0}^{\ast}$ = 0.0085  p_{0} = 0.0992 
.  f* = 0.0106  f = 0.2186 
P = 0.3333  P = 0.3381  P = 0.6566 
CrossValidation Results
Recall that, as a result of our predicting only for GO terms annotated with less than 300 proteins in the database, the full biological process hierarchy actually breaks into 47 subhierarchies. Examination of performance on these subhierarchies provides some sense of the extent to which the HBN performance improvements are uniform across the GO hierarchy. We compute a ROC curve and hF plot for each of the subhierarchies (See additional file 1: ROC curves and hF plots for 47 subhierarchies in crossvalidation study). Numerical comparison of the corresponding AUCs finds, at a 5% significance level that HBN improves on BN in 38 of the 47 subhierarchies, according to the ROC curves, 19 of the subhierarchies, according to the hF plots, and 18 commonly between them. Conversely, BN outperforms HBN in only 1 of the 47 subhierarchies, according to the ROC curves, and 9 of the subhierarchies, according to the hF plots. (NN was uniformly the worse performer.)
These ROC plots are constructed using the original BN (and NN) predictions, without any correction for "truepath" consistency. However, the overwhelming improvement of HBN over BN indicated by the ROC curves is actually similar when the initial predictions of BN are postprocessed by applying transitive closure. Specifically, HBN improves on BN in 28 of the subhierarchies, while BN outperforms HBN in only 4 subhierarchies. These results strongly suggest the validity of our premise as to the importance of incorporating hierarchical information in the GO database in the initial construction of a classifier. The hF plots, which incorporate transitive closure for BN (and NN) directly into their definition, and are designed to provide a more accurate summary of classification accuracy with hierarchically related class labels, support this conclusion. The gains of HBN over BN, although reduced, are still substantial, with HBN outperforming BN in just over 40% of the 47 hierarchies, and BN outperforming HBN, in less than 20%.
Shown in Fig. 12 is the logodds PPV of all three methods, for NPV = 0.987, as a function of depth in the GO hierarchy. We see that the improvement in positive predictive capabilities of HBN is fairly uniform across depths. A onesided paired ttest at each depth confirmed the differences to be highly significant (i.e., pvalues roughly 0.001 or less) at depths 3, 4, and 5, but not at depths 1, 2, 6, 7, or 8. We note, however, that the lack of significance at the latter depths is likely partly driven by sample size, since at each of these depths there were less than 30 cases of positive proteinterm predictions by all three methods used in calculating LOPPV, while at the other three depths there were well in excess of 100.
In Silico Validation Results
Recall that the above results are based on geneGO term annotations in the January 2007 GO database. As an in silico proxy to in vitro validation, beyond that of the crossvalidation study, we examined the performance of HBN, in comparison to NN and BN, when applied to new geneGO term annotations found in the updated May 2007 database. Here our goal is to evaluate the robustness of our crossvalidation results for predicting naturally occurring unknowns (i.e., as opposed to those left out in a random fashion through crossvalidation).
We applied HBN, BN, and NN in each of the 47 subhierarchies to genes that (i) were annotated with only the root term in the June 2006 database, and (ii) were assigned more specific functions in that subhierarchy in the May 2007 database. There were a total of 508 genes that had received at least one new annotation in one of the subhierarchies, with as few as 1 gene and as many as 74 genes per hierarchy. There were 33 subhierarchies having such genes. The methods were compared for their accuracy through the hF function. We present the hF plots for only those subhierarchies (17) with sufficiently many annotations to yield meaningful results (See additional file 2: hF plots for 17 subhierarchies in in silico study); the hF measures for the others are trivial, due to too few new annotations. Over 40% (i.e., 7 out of 17) of these hF plots find HBN to work best in correctly detecting more specific associations, over a reasonably broad range of threshold values; in the majority of the remaining plots, HBN yields results similar to the at least one of the two other methods.
Overall, these results suggest that the performance advantages of HBN indicated by the crossvalidation study are, if anything, potentially understated.
Discussion
For a wellstudied organism, such as S. cerevisiae, one can make certain inferences about genes for which there is no annotated function. First, it is likely that the gene has low sequence similarity to any gene of known function, thus preventing the most straightforward computational methods of predicting gene function. Secondly, it is likely that no altered phenotype is observed upon protein overexpression, knockdown, or knockout, foiling firstpass experimental attempts to discover gene function. In these cases, the next step would involve more elaborate experimental methods, which would typically be guided by a coexpression analysis of publicly available microarray data. The experiments selected will, in general, be timeconsuming, costly, different for each gene being investigated, and offer modest chances for success. Thus, the development of more sophisticated and accurate methods of computational prediction of function which could precisely guide experimental activity remains a top priority.
Biological and biomedical ontologies have become a prominent, and perhaps indispensable, tool in bioinformatics and biological research. GO in particular has been used in numerous papers to detect biological process enrichment of coexpressed genes, identify biological processes associated with disease, etc. However in the vast majority of applications the hierarchical nature of GO is actually not being used directly. For example, in enrichment testing such as GSEA or GNEA we typically test for every biological process if the differentially expressed genes in some condition are associated with this process more than expected by chance.
Thus while GO and other ontologies obviously organize biological knowledge in an intuitive fashion, the structure is not typically exploited for actual inference by predictive analysis tools. This is rather different from evolutionary analysis tools and genetics frameworks where probabilistic ancestor/descendant relationships in phylogenies (hierarchies) are exploited very directly with substantial practical and theoretical benefits.
Our work here suggests that similar developments of probabilistic frameworks are not only feasible, but promising, for improved protein function inference with gene ontologies. In addition, it suggests the need for further research to be done to clarify the utility of different representations for such purposes. Finally, it also raises the prospect of reengineering ontologies or other similar representations, from the perspective of seeking to provide maximal value for probabilistic inference programs.
Conclusion
We have developed a probabilistic framework for automated prediction of protein function using relational information (e.g., a network of proteinprotein interactions) which exploits the hierarchical structure of ontologies, and guarantees the predictions obey a 'truepath' annotation rule. We have evaluated the performance of our method and compared it with two other networkbased methods by both crossvalidation and an in silico study, on the genome of yeast, for terms from the biological process category in the Gene Ontology. Results showed that our proposed method, by utilizing the ontological structure, significantly improved the prediction accuracy and the capability of detecting positive annotations over the hierarchies. Furthermore, our analysis suggests that such improvement persists across the ontology depths.
Declarations
Acknowledgements
The authors thank Russ Greiner for early feedback. This work was supported in part by NHGRI grant R01 HG00336701A1, NIH award GM078987, NSF grant ITR048715, and ONR award N000140610096.
Authors’ Affiliations
References
 Murali TM, Wu CJ, Kasif S: The art of gene function prediction. Nature Biotechnology 2006, 24: 1474–1475. 10.1038/nbt12061474View ArticlePubMedGoogle Scholar
 Bateman A, Coin L, Durbin R, Finn RD, Hollich V, GriffithsJones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, 32: D138–41. [Database issue]. 10.1093/nar/gkh121PubMed CentralView ArticlePubMedGoogle Scholar
 Altschul SF, Madden TL, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
 Letovsky S, Kasif S: Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 2003, 19: i197i204. 10.1093/bioinformatics/btg1026View ArticlePubMedGoogle Scholar
 Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D: A bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae ). Proc Natl Acad Sci USA 2003, 100: 8348–8353. 10.1073/pnas.0832373100PubMed CentralView ArticlePubMedGoogle Scholar
 Lee I, Date SV, Adai AT, Marcotte EM: A probabilistic functional network of yeast genes. Science 2004, 306: 1555–1558. 10.1126/science.1099511View ArticlePubMedGoogle Scholar
 Nariai N, Kolaczyk ED, Kasif S: Probabilistic protein function prediction from heterogeneous genomewide data. PLoS ONE 2007, 2(3):e337. 10.1371/journal.pone.0000337PubMed CentralView ArticlePubMedGoogle Scholar
 Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004, 20(9):1464–5. 10.1093/bioinformatics/bth088View ArticlePubMedGoogle Scholar
 Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinderopen source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 2004, 20(18):3710–5. 10.1093/bioinformatics/bth456PubMed CentralView ArticlePubMedGoogle Scholar
 Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B: GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol 2004, 5(12):R101. 10.1186/gb2004512r101PubMed CentralView ArticlePubMedGoogle Scholar
 Lanckriet GRG, Bie TD, Cristianini N, Jordan MI, Noble WS: A statistical framework for genomic data fusion. Bioinformatics 2004, 20: 2626–2635. 10.1093/bioinformatics/bth294View ArticlePubMedGoogle Scholar
 Deng M, Chen T, Sun F: An integrated analysis of protein function prediction. Journal of Computational Biology 2004, 11: 463–475. 10.1089/1066527041410346View ArticlePubMedGoogle Scholar
 Barutcuoglu Z, Schapire RE, Troyanskaya OG: Hierarchical multilabel prediction of gene function. Bioinformatics 2006, 22: 830–836. 10.1093/bioinformatics/btk048View ArticlePubMedGoogle Scholar
 Eisner R, Poulin B, Szafron D, Lu P, Greiner R: Improving protein function prediction using the hierarchical structure of the Gene Ontology. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2005.Google Scholar
 Koller D, Sahami M: Hierarchically classifying documents using very few words. proceedings of the 14th International Conference on Machine Learning (ICML) 1997., 223:Google Scholar
 Shahbaba B, Neal M: Gene function classification using Bayesian models with hierarchybased priors. BMC Bioinformatics 2006, 7: 448. 10.1186/147121057448PubMed CentralView ArticlePubMedGoogle Scholar
 Blockeel H, Schietgat L, Struyf J, Clare ADS: Hierarchical multilabel classification trees for gene function prediction. Probabilistic Modeling and Machine Learning in Structural and Systems Biology (PMSB) 2006.Google Scholar
 Chow CK, Liu CN: Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 1968, IT14(3):462–467. 10.1109/TIT.1968.1054142View ArticleGoogle Scholar
 Friedman JH: On bias, variance, 0/1loss, and the curseofdimensionality. Data Mining and Knowledge Discovery 1997, 1: 55–77. 10.1023/A:1009778005914View ArticleGoogle Scholar
 Samanta MP, Liang S: Predicting protein functions from redundancies in largescale protein interaction networks. PNAS 2003, 100: 12579–12583. 10.1073/pnas.2132527100PubMed CentralView ArticlePubMedGoogle Scholar
 Brun C, Chevenet F, Martin D, Wojcik J, Guenoche A, Jacq B: Functional classification of proteins for the prediction of cellular function from a proteinprotein interaction network. Genome Biology 2003, 5: R6. 10.1186/gb200351r6PubMed CentralView ArticlePubMedGoogle Scholar
 Chua HN, Sung WK, L W: Exploiting indirect neighbours and topological weight to predict protein function from proteinprotein interactions. Bioinformatics 2006, 22(13):1623–1630. 10.1093/bioinformatics/btl145View ArticlePubMedGoogle Scholar
 Chua HN, Sung WK, L W: Using indirect protein interactions for the prediction of Gene Ontology functions. BMC Bioinformatics 2007, 8: S8. 10.1186/147121058S4S8PubMed CentralView ArticlePubMedGoogle Scholar
 Navieva E, Jin K, Agarwal A, Chazelle B, Singh M: Wholeproteome prediction of protein function via graphtheoretic analysis of interaction maps. Bioinformatics 2005, 21: i302i310. 10.1093/bioinformatics/bti1054View ArticleGoogle Scholar
 Kiritchenko S, Famili F, Matwin S, Nock R: Learning and evaluation in the presence of class hierarchies: application to text categorization. Proceedings of the 19th Canadian Conference on Artificial Intelligence 2006, NRC: 48737.Google Scholar
 Kiritchenko S, Matwin S, Famili AF: Hierarchical text categorization as a tool of associating genes with gene ontology codes. Proceedings of the 2nd European Workshop on Data Mining and Text Mining in Bioinformatics 2004, NRC: 48050.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.