Assessment of protein set coherence using functional annotations
© Chagoyen et al. 2008
Received: 08 July 2008
Accepted: 20 October 2008
Published: 20 October 2008
Analysis of large-scale experimental datasets frequently produces one or more sets of proteins that are subsequently mined for functional interpretation and validation. To this end, a number of computational methods have been devised that rely on the analysis of functional annotations. Although current methods provide valuable information (e.g. significantly enriched annotations, pairwise functional similarities), they do not specifically measure the degree of homogeneity of a protein set.
In this work we present a method that scores the degree of functional homogeneity, or coherence, of a set of proteins on the basis of the global similarity of their functional annotations. The method uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set. As such, it can be used as a first step in the validation of sets expected to be homogeneous prior to further functional interpretation.
We evaluate our method by analysing known biologically relevant sets as well as random ones. The known relevant sets comprise macromolecular complexes, cellular components and pathways described for Saccharomyces cerevisiae, which are mostly significantly coherent. Finally, we illustrate the usefulness of our approach for validating 'functional modules' obtained from computational analysis of protein-protein interaction networks. Matlab code and supplementary data are available at http://www.cnb.csic.es/~monica/coherence/
An increasing number of functional data are available at different genome databases and resources spanning all biological levels. Functional information is usually provided as annotations associated with gene products using functional terms from controlled vocabularies and ontologies. This information is being exploited to perform 'functional computations' in quite different contexts and applications. A first classification of these functional methods distinguishes between predictive and descriptive approaches.
Predictive approaches are intended to infer new functional annotations for a gene product or a set of them from available data (some recent reviews can be found in [2–4]). Most methods use implicit functional information from experimental data (e.g. sequences, gene expression data, protein-protein interactions or phylogenetic profiles) while some approaches rely only on explicit functional information such as existing annotations [5–7] or a combination of annotations and literature references.
In contrast, descriptive approaches are intended to perform functional validation and interpretation of experimental results. The objective of these methods is to compare new experimental data with the current state of knowledge as stored in curated databases. In this way, experimental data can be validated and new insights can be highlighted from the analysis. Among descriptive methods, a distinction can be made between those that perform functional analysis of a protein set and those that perform pairwise functional analysis.
Given a set of proteins obtained from experimental or computational analysis, currently available methods are able to extract those functional annotations that best describe that protein set [9–11] or to classify it into subsets using functional annotations [12–15]. Nevertheless, the most widely used functional methods for analyzing protein sets are those described as annotation 'enrichment'. These methods are used to find functional terms that are statistically significant in a protein set given a reference set (typically a whole organism or the genes spotted in a DNA microarray). A large variety of tools are available to perform such analyses (see a recent review or the Gene Ontology (GO) web site http://www.geneontology.org/go.tools.html). Those tools first retrieve all annotations of a protein set of interest from a functional scheme. The number of proteins annotated with each functional term is then counted in both the input and reference sets. Finally, a statistical test (e.g. χ2, binomial, hypergeometric or Fisher's exact test) is applied to measure the significance of each functional term, and this is subsequently adjusted for multiple testing. The result of this type of analysis is therefore a list of functional terms with their corresponding p-values. Those terms with p-values indicating statistical significance are considered representative and therefore give information about the 'enriched' functions in the protein set. Although some methods have been developed to obtain enriched co-annotations (e.g.), most tools analyze functional terms independently, thus providing a view of the local significant functions of a protein set.
In addition, several studies have been reported that aim to establish a similarity score for a pair of proteins, accounting for the resemblance of their functional annotations. To this end, several similarity measurements have been described [13, 15, 18–23], each following different, though in many cases related, approaches. Pairwise protein similarities can be computed through a combination of functional term-term similarities (as in) or by measuring global protein-protein functional similarity directly (as in). These measurements can be applied to any controlled vocabulary scheme, although most of them exploit the hierarchical nature of functional ontologies such as Gene Ontology and the MIPS Functional Catalogue (FunCat).
Although these methods provide valuable information, they do not specifically address the issue of functional homogeneity, i.e. whether a set of proteins participates in related cellular processes, performs similar molecular activities, confers similar phenotypes, etc. An experimental set of proteins is usually grouped on the basis of shared experimental features (gene expression profiles, interaction partners, etc), and it is expected that such a set can be distinguished from a random set when considering a particular functional aspect. Therefore, a method that measures the degree of overall functional homogeneity of a protein set would be useful for validating experimentally or computationally derived sets, highlighting those that merit further investigation. For example, when protein-protein interaction networks are analyzed to discover functional modules, protein clusters could first be filtered on the basis of functional homogeneity, avoiding any additional functional interpretation for those heterogeneous cases.
To this end we propose a new descriptive method, based on functional annotations, that evaluates the statistical significance of the overall homogeneity of a protein set. Given a set of proteins, we first compute its degree of homogeneity (in terms of a functional coherence score) accounting for the global similarity of their functional profiles. This coherence score is computed using a previously reported global pairwise functional similarity measure. Then we assess whether this score is statistically significant given a reference set (usually a complete organism, or the set of genes present in the experimental setting). This significance is measured in terms of the number of proteins in the reference that are also similar to the set at its particular coherence level. Note that a very homogeneous protein set (with a high coherence score) will not be statistically significant in the context of a reference set if it contains only a few proteins of the reference that are functionally related. On the other hand, a relatively homogeneous set (with a lower coherence score) might be significant if it contains a sufficient number of functionally related proteins of the reference.
To the best of our knowledge no previous method relying on functional annotations has addressed this task specifically. Nevertheless, previous studies have sought to evaluate the overall functional coherence of a set of proteins using literature analysis [26, 27]. In these methods a coherence score is assigned to a group of proteins from the perspective of the relevant published literature. The literature is known to report information that is both related and complementary to functional annotations. It is therefore expected that the overall functional coherence of a protein set could also be computed from functional annotations. Nevertheless, it is not obvious how to compute that overall functional coherence from the output of current enrichment analysis tools. As noted by Zheng and Lu, standard enrichment methods present some drawbacks, including: (i) they ignore the relationships among GO terms; (ii) when multiple GO terms are 'enriched' within a protein group, it is difficult to derive a quantitative metric that gives an overall reflection of the functional relationships of the proteins or their statistical significance evaluations. In the present work, we have addressed these limitations by providing a complementary descriptive method that (i) considers relationships among functional terms, both hierarchical and arising from co-annotation, and (ii) measures the overall functional homogeneity of a protein set and its statistical significance.
A protein is represented as an n-dimensional vector, each dimension corresponding to one of the n functional annotations of the reference set (in this work, the complete genome). Therefore, each functional term will correspond to a coordinate of the vector space representation. In the case of hierarchical functional schemes (e.g. Gene Ontology and MIPS FunCat) this representation is constructed by assigning 1 to each functional term annotated to a gene product and to its corresponding ancestor terms in the hierarchy. The remaining vector coordinates are equal to 0.
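The construction described above can be sketched as follows. This is a minimal illustration, not the authors' Matlab code: the `parents` mapping, the `vocabulary` ordering and the function names are hypothetical, and the hierarchy is assumed to be given as a child-to-parents dictionary.

```python
def ancestors(term, parents):
    """Collect every ancestor of `term` in a hierarchy given as {term: set of parent terms}."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def profile(direct_terms, parents, vocabulary):
    """Binary vector over `vocabulary`: 1 for each annotated term and its ancestors, 0 elsewhere."""
    full = set(direct_terms)
    for term in direct_terms:
        full |= ancestors(term, parents)
    return [1 if term in full else 0 for term in vocabulary]


# Toy hierarchy (hypothetical terms): c -> b -> a, d -> a
parents = {"c": {"b"}, "b": {"a"}, "d": {"a"}}
vocabulary = ["a", "b", "c", "d"]
vector = profile({"c"}, parents, vocabulary)
```

Here a protein annotated only with the specific term `c` also receives 1 at its ancestors `b` and `a`, while the unrelated sibling branch `d` stays at 0.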
To account for the specificity and generality of functional terms, a weighting scheme is applied to this vector representation using the information content of each term. The information content (IC) of a term is inversely related to its probability of annotation in the reference set Pr(t). The weight is formally calculated as:
$$ w_{IC}(t) = -\ln(\Pr(t)) = -\ln\left(\frac{\#\mathrm{genes}_t}{m}\right) \qquad (1) $$
where Pr(t) is the probability of annotation of a term t, estimated as the number of gene products associated with t (#genes_t) divided by the total number of protein-term associations (m) in a reference set R. Note that the total number of gene products associated with t is the sum of those directly annotated with t and those annotated with any of its descendants in the functional hierarchy.
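As an illustration of equation (1), the weights could be computed as below. This is a sketch under stated assumptions: the input format (`propagated`, a mapping from protein to its terms with ancestors already included) is hypothetical, and m is assumed to count the propagated protein-term associations.

```python
from math import log


def ic_weights(propagated):
    """Return {term: -ln(#genes_t / m)} for a reference set.

    `propagated` maps each protein to the set of its terms, ancestors included,
    so #genes_t already counts annotations to descendants of t.
    """
    counts = {}
    for terms in propagated.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    m = sum(counts.values())  # assumption: total propagated protein-term associations
    return {term: -log(count / m) for term, count in counts.items()}


# Toy reference (hypothetical): two proteins, one general and one specific term
weights = ic_weights({"p1": {"root", "x"}, "p2": {"root"}})
```

The general term `root` (annotated to both proteins) receives a smaller weight than the specific term `x`, which is the intended effect of the weighting.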
where p_i • p_j is the dot product between the two vectors p_i and p_j.
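Since the text later names the cosine as the pairwise similarity used in this work, the dot product above can be read as the numerator of a cosine measure. The sketch below assumes that reading, with normalisation by the Euclidean norms of the two IC-weighted vectors; returning 0 for an all-zero vector is an additional convention chosen here.

```python
from math import sqrt


def cosine_similarity(p_i, p_j):
    """Cosine of the angle between two (weighted) annotation vectors of equal length."""
    dot = sum(a * b for a, b in zip(p_i, p_j))
    norm_i = sqrt(sum(a * a for a in p_i))
    norm_j = sqrt(sum(b * b for b in p_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # convention assumed here for proteins without annotations
    return dot / (norm_i * norm_j)
```

With non-negative weights the value lies between 0 (no shared terms) and 1 (identical profiles up to scale).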
where |P| denotes the cardinality of the set, i.e. the number of distinct elements it contains.
Therefore, the coherence score will range from 0 (no coherence) to 1 (full coherence, corresponding to exactly the same functional annotations for all proteins in S).
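The Discussion describes the coherence score as the average pairwise similarity between all distinct protein pairs of a set. A minimal sketch under that description follows; the function name and the `sim` callable (any pairwise similarity returning values in [0, 1]) are hypothetical.

```python
from itertools import combinations


def coherence_score(proteins, sim):
    """Mean of sim(a, b) over all distinct unordered pairs of `proteins`."""
    pairs = list(combinations(proteins, 2))
    if not pairs:
        raise ValueError("a set needs at least two proteins")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Toy usage with a set-overlap similarity standing in for the real measure
def overlap(a, b):
    return len(a & b) / len(a | b)


identical = [frozenset({"x", "y"})] * 3
disjoint = [frozenset({"x"}), frozenset({"y"}), frozenset({"z"})]
```

Three proteins with exactly the same annotations give a score of 1, and three proteins with no shared annotation give 0, matching the range stated above.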
To assess the significance of the coherence score calculated for a set S in the context of a reference set R, we take into account the proteins in R that are functionally related to S. The definition of functional-relatedness is somewhat arbitrary. Therefore, for evaluation purposes, we use three different criteria to decide whether a protein is functionally related to the set S. In turn, these three criteria define three neighbourhoods in the n-dimensional functional space. Therefore, each criterion is established for a set in the context of a reference and for the particular coherence score obtained for the set as computed in equation (4). These three criteria are as follows.
The first criterion defines proteins to be functionally related to S if their similarity to the set, as defined in equation (3), is greater than or equal to the coherence score of the set. This establishes a neighbourhood around the most homogeneous proteins of the set. Proteins in S fulfilling this criterion are defined as the 'core' of S, denoted as C(S). Thus, according to the first criterion, a protein p ∈ R is functionally related to S if sim(p, S) ≥ score(S).
The second criterion defines proteins to be functionally related to S if their similarity to at least one protein in C(S), as defined in equation (2), is greater than or equal to the coherence score of the set. This second neighbourhood can be described as open to the core of the set (as it captures proteins similar to one protein in the core). Thus, according to the second criterion, a protein p ∈ R is functionally related to S if ∃ p_i ∈ C(S), sim(p, p_i) ≥ score(S).
The third criterion defines proteins to be functionally related to S if their similarity to at least one protein in S, as defined in equation (2), is greater than or equal to the coherence score of the set. This third neighbourhood is open to the set (as it captures all proteins similar to one protein in the set). Thus, according to the third criterion, a protein p ∈ R is functionally related to S if ∃ p_j ∈ S, sim(p, p_j) ≥ score(S).
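The three criteria can be sketched as set filters over the reference. Several details are assumptions made for illustration only: sim(p, S) is taken to be the mean similarity of p to the other members of S, a protein counts as similar to itself, and all function names are hypothetical.

```python
def sim_to_set(p, S, sim):
    """Assumed form of sim(p, S): mean similarity of p to the members of S other than p."""
    others = [q for q in S if q != p]
    if not others:
        return 0.0
    return sum(sim(p, q) for q in others) / len(others)


def related_proteins(R, S, score, sim):
    """Return the proteins of R related to S under criteria 1, 2 and 3."""
    core = [p for p in S if sim_to_set(p, S, sim) >= score]           # C(S)
    crit1 = {p for p in R if sim_to_set(p, S, sim) >= score}          # close to the set
    crit2 = {p for p in R if any(sim(p, c) >= score for c in core)}   # close to a core protein
    crit3 = {p for p in R if any(sim(p, q) >= score for q in S)}      # close to any set member
    return crit1, crit2, crit3


# Toy data (hypothetical): annotation sets and an overlap similarity
profiles = {"a": {1, 2}, "b": {1, 2}, "c": {1}, "d": {9}}


def sim(x, y):
    return len(profiles[x] & profiles[y]) / len(profiles[x] | profiles[y])


crit1, crit2, crit3 = related_proteins(["a", "b", "c", "d"], ["a", "b", "c"], 2 / 3, sim)
```

In this toy case the neighbourhoods are nested, with the third criterion the most permissive about which reference proteins count as related.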
Given a reference set R with r elements 'functionally related to S', the p-value gives the probability of drawing s or more elements 'functionally related to S' when |S| elements are selected from R at random. In this work, |S| is the cardinality of the protein set to be analyzed, and |R| is the total number of gene products in the genome taken as reference. We obtain a p-value for each of the criteria described above (pv1, pv2 and pv3 respectively).
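The probability described above corresponds to the upper tail of a hypergeometric distribution, assuming the |S| elements are drawn without replacement. A sketch using exact integer arithmetic (the function and parameter names are hypothetical):

```python
from math import comb


def coherence_pvalue(s, r, set_size, ref_size):
    """P(X >= s): chance of drawing at least s related proteins when `set_size`
    proteins are sampled without replacement from a reference of `ref_size`
    proteins, r of which are functionally related to the set."""
    upper = min(r, set_size)
    tail = sum(
        comb(r, k) * comb(ref_size - r, set_size - k)
        for k in range(s, upper + 1)
    )
    return tail / comb(ref_size, set_size)


# Toy case: all 3 related proteins of a 10-protein reference fall in a set of size 3
p = coherence_pvalue(s=3, r=3, set_size=3, ref_size=10)
```

This makes the dependence noted in the text explicit: the same s can be significant or not depending on r, |S| and |R|.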
In summary, the coherence score of a protein set provides a global measure of the functional homogeneity of its proteins. Meanwhile, the significance measures we propose (pv1, pv2, pv3) account for the probability of obtaining a set from a reference with a given number of proteins functionally related to that set, just by chance. Note that the definitions of the three criteria for functional relatedness depend on the coherence score of the set. In this sense, the greater the coherence score, the fewer proteins in the reference will be found to be functionally related. Nevertheless, a particular set with a high coherence score might not be significantly coherent given the reference if it contains only a few of the proteins in the reference that are functionally related to the set (the exact number of proteins to be significant depends on both the size of the set and the number of similar proteins in the reference). Meanwhile, a set with a relatively low coherence score can be significantly coherent with respect to a reference if it contains a certain number of proteins of the reference that are functionally related at that coherence level.
We have assessed the validity of our method by performing several analyses. First, we evaluate the method by comparing the results obtained from the analysis of protein sets known to be homogeneous to those obtained from randomly created sets. Secondly, we analyze its robustness in terms of the functional similarity used, the completeness of functional annotation of the organism and the inclusion or exclusion of annotations obtained by automatic methods. Finally, we demonstrate the usefulness of our approach for a particular application: the validation of functional modules obtained from the analysis of protein-protein interaction networks.
To assess the validity of our method for characterizing the functional coherence of a set of proteins, as well as its significance, we analyzed both positive and random sets in the context of one of the most complete and expert-validated annotated genomes: Saccharomyces cerevisiae. In this scenario, our positive sets (those that are expected to be functionally homogeneous) correspond to macromolecular complexes, cellular components and proteins participating in the same pathway. As proteins in a complex or component act co-ordinately, participating in one or more cellular processes, these protein sets are expected to be significantly coherent from the biological process point of view. The same is expected in the case of proteins in the same pathway. Therefore, we restrict our analysis to GO 'biological process' terms (Gene Ontology annotation release 2007–12).
Ninety-eight protein sets containing at least two proteins annotated with a 'biological process' term were compiled from the KEGG pathways of S. cerevisiae. The coherence scores of these sets are in the range of 0.06–1, with set sizes between 2 and 147 proteins. Only 4 pathways are not significantly coherent (pv1 > 0.05), namely 'Limonene and pinene degradation', 'Lipoic acid metabolism', 'Tryptophan metabolism' and 'Alkaloid biosynthesis II'.
Non-significant cellular components
nuclear envelope lumen
AMP-activated protein kinase complex
extrinsic to vacuolar membrane
extrinsic to mitochondrial inner membrane
late endosome membrane
integral to mitochondrial outer membrane
extrinsic to organelle membrane
internal side of plasma membrane
The catalogue of MIPS complexes comprises both curated data and the results of systematic analyses of protein complexes based solely on high-throughput methods [32–34]. We have analyzed those complexes separately (see Figure 2). Two hundred and seventeen protein sets corresponding to expert-annotated complexes contained at least two proteins with 'biological process' annotations. Their coherence scores range from 0.07 to 1, with set sizes in the range 2 to 81 proteins. Only two of these were not significant according to pv1: 'Mitochondrial processing complexes' (440.20) and 'DNA helicases' (410.40.40). The data from systematic analyses included 224 sets obtained by Gavin et al., 532 by Ho et al. and 62 by Krogan et al.
In order to ensure that our method does not provide significant sets by chance, we analyzed various randomly created sets of different sizes. Out of a total of 100,000 random sets, with a uniform size distribution from 2 to 200 proteins at 2-protein intervals (similar to the sizes of most positive sets), 4455 were found to be statistically significant (p-value < 0.05) according to pv1, 4379 according to pv2 and 682 according to pv3. These figures imply an FDR at or below a p-value of 0.05 using pv1, 0.045 using pv2 and a lower 0.0068 using pv3. The numbers of highly significant sets (p-value < 0.001) drop to 115 (pv1), 104 (pv2) and 20 (pv3) (with corresponding FDR at or below a p-value of 0.001 of 0.0015 using pv1, 0.0010 using pv2 and a lower 0.0002 using pv3). Additional file 1 shows the coherence scores and p-values (pv1, pv2 and pv3) of random sets plotted against size. As expected, the coherence scores of larger random sets tend towards the mean pairwise similarity of the whole genome (0.115).
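The random-set experiment can be outlined as below. This is only a sketch of the sampling scheme described above: `pvalue_fn` is a hypothetical stand-in for any of pv1, pv2 or pv3, the seed and function names are illustrative, and sizes are cycled so that each size occurs equally often.

```python
import random


def random_sets(reference, n_sets, sizes=range(2, 201, 2), seed=0):
    """Yield `n_sets` random protein sets with sizes 2, 4, ..., 200 used equally often."""
    rng = random.Random(seed)
    usable = [k for k in sizes if k <= len(reference)]
    for i in range(n_sets):
        yield rng.sample(reference, usable[i % len(usable)])


def fraction_significant(sets, pvalue_fn, alpha=0.05):
    """Share of sets whose p-value falls below `alpha`."""
    sets = list(sets)
    hits = sum(1 for s in sets if pvalue_fn(s) < alpha)
    return hits / len(sets)


# Toy usage with a placeholder reference of 500 integer identifiers
reference = list(range(500))
sets = list(random_sets(reference, 200))
```

Passing the real significance function as `pvalue_fn` and a genome-wide protein list as `reference` would give the fraction of random sets called significant at a chosen threshold.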
As shown in Figure 2, expert-annotated datasets (GOcc annotations, curated MIPS complexes and KEGG pathways) are mostly significant (e.g. 94–99% with pv1 < 0.05). Nevertheless, they exhibit a wide range of coherence scores, in some cases even less than that expected by chance. This means that most sets corresponding to known macromolecular complexes, cellular components and pathways are significant in the context of the global functional landscape of S. cerevisiae, though some of them are quite heterogeneous. On the other hand, the proportion of significantly coherent sets corresponding to complexes derived from high-throughput methods stored in the MIPS catalogue [32–34] is lower than for the expert-annotated datasets according to the three criteria (see Figure 2). Furthermore, the results of the analysis of random sets confirm that the probability of obtaining significant and highly significant coherence scores in such sets is very low.
As most expert-annotated data on known biologically meaningful sets are statistically significant, while the probability of obtaining significant sets just by chance is low, the measures proposed in this work seem to be valuable criteria for assessing the significance of the functional coherence of a protein set. Therefore, this significance can be used as a means of validating new experimental or hypothetical functional modules (e.g. co-expressed genes, protein-protein interaction clusters).
To evaluate the extent to which the statistical significance of the coherence score depends on various conditions such as functional similarity and completeness of annotation, we conducted the following experiments.
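Significant sets, Jaccard similarity

Dataset (criterion)              p-value < 0.05       p-value < 0.001
GO cellular components (pv1)     460 (-10, -1.99%)    425 (-9, -1.79%)
GO cellular components (pv2)     440 (-9, -1.79%)     384 (-8, -1.59%)
GO cellular components (pv3)     381 (-19, -3.78%)    320 (-18, -3.58%)
MIPS curated complexes (pv1)     214 (-1, -0.46%)     201 (-3, -1.38%)
MIPS curated complexes (pv2)     212 (-3, -1.38%)     182 (-9, -4.15%)
MIPS curated complexes (pv3)     207 (-6, -2.76%)     180 (-13, -5.99%)
KEGG pathways (pv1)              94 (0, 0.00%)        91 (0, 0.00%)
KEGG pathways (pv2)              92 (-4, -4.08%)      88 (-1, -1.02%)
KEGG pathways (pv3)              92 (0, 0.00%)        83 (+1, +1.02%)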
GO annotation statistics (S. pombe non-IEA versus S. pombe all)
Assignment of GO terms to gene products can be inferred from electronic annotations that have not yet been reviewed by a curator. Therefore, it might be desirable in some cases to rely only on expert-validated annotations. As all the annotations provided for S. cerevisiae are expert-validated (non-IEA codes), we analyzed GO cellular components for a closely related organism for which IEA annotations are plentiful: Schizosaccharomyces pombe (S. pombe). Nearly 20% of the assignments of biological process (BP) terms were inferred from electronic annotations, with 270 products annotated only with IEA codes. The electronic annotations increase the number of BP terms per product (from 2.0 to 2.4) and also increase the number of cellular components analyzed. The analyses performed with and without IEA annotations give very similar results (see Figure 3).
Some recent work in the analysis of protein-protein interactions (PPIs) has concentrated on the detection of the modular organization of cellular function. A functional module can be described as a group of physically or functionally linked molecules that work together to achieve a relatively distinct function. Macromolecular complexes, cellular components and biological pathways are well-known examples of functional modules. Generally, computational methods try to find functional modules from a PPI network fulfilling topological constraints (e.g. densely connected regions for protein complexes), which are further tested for a common cellular function or relationship to an already-described complex. Nevertheless, there is a lack of reliable criteria for evaluating the quality of complexes derived from the analysis of PPI networks, making it difficult to assess the biological relevance of the derived modules.
Information about the overlap with known complexes, cellular co-localization, average semantic similarities for pairs of interacting proteins, and phenotype divergence [37–39] has been used to assess the quality of modules obtained from network analysis. As the preliminary results obtained from our study of MIPS complexes show (see Figure 2), there are proportionately more significant sets within the curated complexes than among the complexes obtained from systematic analysis [32–34]. This suggests that our method can be used to qualify a potential module in terms of its homogeneity and completeness through the analysis of 'biological process' annotations.
Chen & Yuan used an extension of a betweenness-based partition for analyzing a weighted graph built from the integration of various proteomics and microarray datasets.
Krogan et al. obtained a new TAP-MS interaction network and used a Markov clustering algorithm to detect complexes.
Dutkowski & Tiuryn detected conserved functional modules through the alignment of yeast, worm and fly PPI networks. We have analyzed the protein sets corresponding to yeast proteins in these modules.
The conserved modules identified by Dutkowski & Tiuryn show the highest percentage of significant sets, although they describe fewer modules. In their analysis, evolutionary constraints were used as a guarantee to ensure the biological significance of functional units.
Moreover, the proportion of significant complexes is greater in the data obtained by the analysis of the Consolidated network by Pu et al. than in those obtained by Krogan et al. Therefore, this larger proportion of significant complexes agrees with other quality parameters computed by, namely overlap with known complexes and co-localization.
In this work we present a descriptive method, based on the analysis of functional annotations, for scoring the degree of homogeneity of a protein set and assessing its significance in the context of a reference set. The method has been evaluated using positive and randomly created datasets. Analysis of known biologically meaningful protein sets corresponding to macromolecular complexes, cellular components and pathways of S. cerevisiae revealed that most of them are significant in the context of the organism used as reference. However, the coherence scores obtained vary considerably, from very homogeneous sets to fairly heterogeneous ones. This shows that the overall similarity of functional annotations (i.e. the coherence score) is not, on its own, a good indicator of the functional completeness and separation of a protein set in the context of an organism. Therefore, in addition to measuring the functional homogeneity, a statistical assessment is performed.
The coherence score proposed in this work is based on previously defined pairwise functional similarities. Pairwise similarity methods are increasingly used in quite different bioinformatics applications, such as prediction of protein-protein interaction data, prioritization of disease candidate genes, missing value estimation in microarray data and prediction of novel gene function. Nevertheless, they have not so far been used to quantify the functional homogeneity of a protein set. For example, the average semantic similarity of interacting proteins was previously used by Pu et al. to evaluate the quality of modules obtained from network analysis. The coherence score described in this work is expected to correlate with that measure, since it is defined as the average pairwise similarity between all distinct protein pairs. Nevertheless, the two measures are not directly comparable, for two reasons. First, the average similarity was obtained by Pu et al. for pairs of interacting proteins within the same module. In contrast, the coherence score in the present work is computed over all protein-protein pairs within a protein set, as we are not using data on interactions themselves. Secondly, pairwise similarity is computed using different approaches. The similarity used by Pu et al., as described in, is computed by averaging all functional term-term similarities between two proteins. Specifically, a similarity is first established among functional terms, using information from the GO hierarchy, and then similarity between proteins is computed by averaging pairwise term similarities. As semantic similarity accounts for the average term-term similarities of two proteins, it might underestimate or overestimate overall similarity, in contrast to the cosine and Jaccard similarities used in the present work, which exhibit a wider range of values, from 0 (no common terms) to 1 (exactly the same terms).
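The range property noted above for the Jaccard similarity is easy to see in a small sketch. The unweighted, set-based form shown here is an assumption (the text does not state whether term weights enter the Jaccard variant), and the function name is hypothetical.

```python
def jaccard_similarity(terms_i, terms_j):
    """Shared terms divided by all terms of the two proteins (ancestors included)."""
    union = terms_i | terms_j
    if not union:
        return 0.0  # convention assumed here for two unannotated proteins
    return len(terms_i & terms_j) / len(union)
```

Two proteins with no common terms score 0, and two proteins with exactly the same terms score 1, whereas an average over term-term similarities would not, in general, reach either extreme.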
Therefore, the coherence score and corresponding p-values are shown to be valuable indicators of the global functional homogeneity of a protein set, complementing the functional analysis performed by currently available methods. To illustrate the type of information provided by our method and other functional methods, as well as their complementary relationship, we provide the results of the analysis of 'biological process' annotations of one of the functional modules obtained in: 'Module 39' (see additional file2). The exact application of the coherence score together with other functional analysis methods will depend on the type of analysis desired. If homogeneous sets are expected, our method can be used for validation in order to discard those that are heterogeneous. This is the case for the discovery of functional modules from protein-protein interaction networks, where protein clusters can first be filtered on the basis of functional homogeneity, avoiding any additional functional interpretation of those cases that are clearly heterogeneous. In contrast, if novel functional associations are sought, further analysis should be performed on those sets that are not highly homogeneous.
Both the coherence score and significance measures are computed from a set of functional annotations, from which as a first step a similarity is established. This similarity therefore depends, among other things, on the completeness of a genome annotation. In addition, we have applied our method to the analysis of S. cerevisiae sets, using an alternative similarity measure (Jaccard), to an incipient annotation project, C. albicans, and to a genome with nearly 20% of biological process term annotations inferred from electronic resources, S. pombe. As with other methods based on functional annotations, the completeness of annotations is by far the most important limiting factor in our methodology.
Finally, to illustrate the usefulness of our method, we have applied it to various protein sets corresponding to hypothetical functional modules and complexes obtained from PPI network analysis. Our results seem to agree with and complement other validation criteria, such as evolutionary conservation and overlap with known complexes.
We have presented a method that scores the degree of homogeneity, or coherence, of a protein set on the basis of the global similarity of their functional annotations. It uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set.
We can conclude that our method is complementary to previous descriptive functional analysis approaches. On the one hand, like enrichment methods, it analyzes a protein set. On the other, like some pairwise similarity methods, it measures the functional relatedness of proteins from a global point of view. Finally, as in enrichment methods, a statistical test is performed, in our case to evaluate the significance of the global coherence score of the protein set in the context of a reference set. However, in contrast to enrichment methods, it does not provide a functional interpretation of the protein set, as it reports two numerical values (coherence score and corresponding p-value) but not functional terms. As such it is a good filter prior to functional interpretation in cases where numerous protein sets are obtained (e.g. protein clusters obtained from protein interaction networks, gene expression clusters).
The coherence score and corresponding significance measures proposed in this work can therefore be used for validation of experimental sets where functionally homogeneous protein groups are expected. This is the case for, inter alia, cluster and bicluster analysis of gene expression profiles, protein-protein interaction clusters and sets of hypothetically homologous proteins.
We thank Janusz Dutkowski and Jerzy Tiuryn, as well as Shuye Pu and Shoshana J. Wodak, for providing their data for analysis. Special thanks to Pedro Carmona-Saez for fruitful discussions, and Federico Abascal for comments on the manuscript.
This work has been partially funded by the Spanish grants BIO2007-67150-C03-02, S-Gen-0166/2006, CYTED-505PI0058, TIN2005-5619, PR27/05-13964-BSCH. APM acknowledges the support of the Spanish Ramón y Cajal program.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.