BMC Bioinformatics (BioMed Central)
Methodology article
Assessment of protein set coherence using functional annotations

Background
Analysis of large-scale experimental datasets frequently produces one or more sets of proteins that are subsequently mined for functional interpretation and validation. To this end, a number of computational methods have been devised that rely on the analysis of functional annotations. Although current methods provide valuable information (e.g. significantly enriched annotations, pairwise functional similarities), they do not specifically measure the degree of homogeneity of a protein set.

Results
In this work we present a method that scores the degree of functional homogeneity, or coherence, of a set of proteins on the basis of the global similarity of their functional annotations. The method uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set. As such, it can be used as a first step in the validation of sets expected to be homogeneous prior to further functional interpretation.

Conclusion
We evaluate our method by analysing known biologically relevant sets as well as random ones. The known relevant sets comprise macromolecular complexes, cellular components and pathways described for Saccharomyces cerevisiae, which are mostly significantly coherent. Finally, we illustrate the usefulness of our approach for validating 'functional modules' obtained from computational analysis of protein-protein interaction networks. Matlab code and supplementary data are available at


CKS1p
Cyclin-dependent protein kinase regulatory subunit and adaptor; modulates proteolysis of M-phase targets through interactions with the proteasome; role in transcriptional regulation, recruiting proteasomal subunits to target gene promoters
* negative regulation of transposition, RNA-mediated
* re-entry into mitotic cell cycle after pheromone arrest
* regulation of cyclin-dependent protein kinase activity

CDC28p
Catalytic subunit of the main cell cycle cyclin-dependent kinase (CDK); alternately associates with G1 cyclins (CLNs) and G2/M cyclins (CLBs) which direct the CDK to specific substrates
* negative regulation of meiotic cell cycle
* negative regulation of mitotic cell cycle
* negative regulation of transcription, DNA-dependent
* positive regulation of DNA replication during S phase
* positive regulation of meiotic cell cycle
* positive regulation of mitotic cell cycle
* positive regulation of transcription, DNA-dependent
* protein amino acid phosphorylation
* regulation of budding cell apical growth
* regulation of double-strand break repair via homologous recombination
* regulation of filamentous growth

CLB3p
B-type cyclin involved in cell cycle progression; activates Cdc28p to promote the G2/M transition; may be involved in DNA replication and spindle assembly; accumulates during S phase and G2, then targeted for ubiquitin-mediated degradation
* G2/M transition of mitotic cell cycle
* regulation of cyclin-dependent protein kinase activity
* S phase of mitotic cell cycle

SRL3p
Cytoplasmic protein that, when overexpressed, suppresses the lethality of a rad53 null mutation; potential Cdc28p substrate
* nucleobase, nucleoside, nucleotide and nucleic acid metabolic process

Information on proteins was obtained from the SGD (http://www.yeastgenome.org).

Enrichment analysis provides a view on significant local functions. In this case:

Analysis of enriched annotations:
• The most significant biological processes are 'regulation of cyclin-dependent protein kinase activity' (annotated in 5 proteins of the set) and 'regulation of cell-cycle' (annotated in 7 proteins).
Only by inspecting the hierarchy of GO terms (which SGD GO Term Finder also provides as a graphical representation) can we check that:
• 'regulation of cyclin-dependent protein kinase activity' is a child term of 'regulation of cell-cycle'.
• Except for 'regulation of DNA recombination', the remaining significant terms shown in the table are redundant (they are parents of either 'regulation of CDK activity' or 'regulation of cell-cycle').
Non-significant terms are discarded and are therefore usually not considered in the functional interpretation of the set.
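The two steps of an enrichment analysis discussed here (scoring a term for over-representation, then spotting terms that are redundant with a more specific one) can be sketched as follows. This is a minimal illustration, not the procedure used by SGD GO Term Finder or by this article: the hypergeometric tail is assumed as the enrichment test, the function names `hypergeom_pvalue`, `ancestors` and `most_specific` are hypothetical, and `toy_parents` is a toy hierarchy in which only the CDK activity / cell cycle relation comes from the text above.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance that at least k of n sampled proteins carry a term
    that annotates K of the N proteins in the reference set."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def ancestors(term, parents):
    """All ancestors of `term` in a hierarchy given as {child: set_of_parents}."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_specific(terms, parents):
    """Drop every term that is an ancestor of another term in the list."""
    term_set = set(terms)
    redundant = set()
    for term in terms:
        redundant |= ancestors(term, parents) & term_set
    return [t for t in terms if t not in redundant]


# Toy hierarchy (assumption): the root term is made up for illustration.
toy_parents = {
    "regulation of cyclin-dependent protein kinase activity": {"regulation of cell cycle"},
    "regulation of cell cycle": {"regulation of biological process"},
    "regulation of DNA recombination": {"regulation of biological process"},
}

significant = [
    "regulation of cell cycle",
    "regulation of cyclin-dependent protein kinase activity",
    "regulation of DNA recombination",
]
filtered = most_specific(significant, toy_parents)
```

In this sketch `filtered` keeps only the most specific terms, so 'regulation of cell cycle' is dropped as a parent of the CDK activity term. Whether to keep such a parent when it annotates more proteins of the set is a design choice that the hierarchy alone does not settle.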

Analysis of pair-wise similarities:
Similarities between all pairs in the set, computed using the cosine similarity of weighted representations (see Methods for details), are:

This type of pair-wise similarity analysis provides a view on global functional relationships (all terms are taken into consideration). The analysis is able to highlight, among others, that:
• CLB3p and CLB4p are annotated with exactly the same terms (their similarity is equal to 1.0).
• The proteins most closely related by their overall annotations are SIC1p, CLN1p, CLB3p and CLB4p (with similarities higher than 0.85).
• The protein most dissimilar to the other proteins in the set is SRL3p.
Pair-wise similarities can be used for further types of analysis, e.g. functional clustering of the set or analysis of coherence (see next).
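To make the pair-wise computation concrete, the following is a minimal sketch of cosine similarity between weighted annotation vectors. The weighting scheme is an assumption: the actual weights are defined in Methods, and the inverse-frequency weights in the hypothetical `idf_weights` function are only a stand-in. The annotation sets are taken from the protein descriptions above (CLB4p is given the same terms as CLB3p, as stated in the text), and the four-protein reference set is a toy, so the intermediate values are not expected to match the article's similarities.

```python
from math import log, sqrt


def idf_weights(annotations):
    """Assumed weighting: terms that annotate fewer proteins weigh more."""
    n = len(annotations)
    counts = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: log(n / c) + 1.0 for term, c in counts.items()}


def cosine(terms_a, terms_b, weights):
    """Cosine similarity of two term sets viewed as weighted vectors."""
    dot = sum(weights[t] ** 2 for t in terms_a & terms_b)
    norm_a = sqrt(sum(weights[t] ** 2 for t in terms_a))
    norm_b = sqrt(sum(weights[t] ** 2 for t in terms_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a protein without annotations is similar to nothing
    return dot / (norm_a * norm_b)


clb_terms = {
    "G2/M transition of mitotic cell cycle",
    "regulation of cyclin-dependent protein kinase activity",
    "S phase of mitotic cell cycle",
}
annotations = {
    "CKS1p": {
        "negative regulation of transposition, RNA-mediated",
        "re-entry into mitotic cell cycle after pheromone arrest",
        "regulation of cyclin-dependent protein kinase activity",
    },
    "CLB3p": set(clb_terms),
    "CLB4p": set(clb_terms),
    "SRL3p": {"nucleobase, nucleoside, nucleotide and nucleic acid metabolic process"},
}

weights = idf_weights(annotations)
names = sorted(annotations)
# Similarity for every unordered pair of proteins in the toy set.
pairwise = {
    (a, b): cosine(annotations[a], annotations[b], weights)
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
```

With these inputs the CLB3p/CLB4p pair comes out at 1.0 up to floating-point rounding, SRL3p scores 0.0 against every other protein because it shares no term with them, and CKS1p falls in between through its single shared term.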