Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data
© Xu et al; licensee BioMed Central Ltd. 2008
Received: 18 March 2008
Accepted: 06 November 2008
Published: 06 November 2008
Researchers interested in analysing the expression patterns of functionally related genes usually hope to improve the accuracy of their results beyond the boundaries of currently available experimental data. Gene ontology (GO) data provides a novel way to measure the functional relationship between gene products. Many approaches have been reported for calculating the similarities between two GO terms, known as semantic similarities. However, biologists are more interested in the relationship between gene products than in the scores linking the GO terms. To highlight the relationships among genes, recent studies have focused on functional similarities.
In this study, we evaluated five functional similarity methods using both protein-protein interaction (PPI) and expression data of S. cerevisiae. The receiver operating characteristics (ROC) and correlation coefficient analysis of these methods showed that the maximum method outperformed the other methods. Statistical comparison of multiple- and single-term annotated proteins in biological process ontology indicated that genes with multiple GO terms may be more reliable for separating true positives from noise.
This study demonstrated the reliability of current approaches that elevate the similarity of GO terms to the similarity of proteins. Suggestions for further improvements in functional similarity analysis are also provided.
Gene Ontology (GO) is a structured, controlled vocabulary that describes gene products in terms of their functions, and it has become popular among the known taxonomies. The root ontology (ALL) of GO consists of three independent ontologies: biological process (BP), molecular function (MF), and cellular component (CC). GO data provide a way to measure the functional relationship between gene products, which is the basis of most gene correlation studies. Researchers interested in functionally related genes usually hope to improve the accuracy of their results beyond the limits of the currently available experimental data. Adding prior knowledge, for example by computing the semantic similarity between genes, may partially address this problem. Most semantic-based applications follow a three-step approach: semantic similarity calculation for pairs of GO terms, functional similarity calculation over all possible combinations of the related GO terms, and further analysis such as clustering [3–6]. However, the methods that elevate the similarity of GO terms to the similarity of proteins still require optimization.
In semantic-based applications, the similarity among GO terms must be computed before the similarity between gene products can be investigated. Fortunately, methods for calculating the similarity between two terms in the well-known semantic tree WordNet are well established. When GO emerged, these measures were widely adopted to quantify the relatedness of GO terms. In 2003, Lord et al. found that sequence similarity was broadly consistent with semantic similarity. Since then, several approaches have been developed, ranging from traditional methods in which the distance between two given nodes is calculated [4, 5] to information-theoretic models in which the overall information of the tree structure is measured [3, 6]. Similarity information derived from GO has also supported functional module studies. Several evaluations [10–12] have favoured information theory-based methods, such as those proposed by Resnik and Lin, because these methods are insensitive to link density, a key limitation of distance-based measures. In this study, Resnik's method was chosen because it showed the best performance in most evaluations.
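As an illustration of the information-theoretic idea, Resnik's measure scores two terms by the information content of their most informative common ancestor. The following is a minimal sketch under stated assumptions: the toy ontology, the term names and the annotation counts are hypothetical, and the natural logarithm is an arbitrary choice; it is not the implementation used in this study.

```python
import math

# Hypothetical toy ontology (child -> parents); real GO is a much larger DAG.
PARENTS = {
    "root": [],
    "transport": ["root"],
    "vesicle_transport": ["transport"],
    "er_to_golgi": ["vesicle_transport"],
    "cell_cycle": ["root"],
    "mitosis": ["cell_cycle"],
}

# Hypothetical number of gene products annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0, "transport": 2, "vesicle_transport": 3,
    "er_to_golgi": 5, "cell_cycle": 4, "mitosis": 6,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_probabilities():
    """p(t): fraction of annotations attached to t or to any descendant of t."""
    cumulative = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    total = cumulative["root"]
    return {term: count / total for term, count in cumulative.items()}


def resnik(term_a, term_b, prob):
    """Information content, log(1/p), of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(math.log(1.0 / prob[c]) for c in common)


prob = term_probabilities()
# Terms on the same branch share a specific ancestor and score above zero.
print(resnik("er_to_golgi", "vesicle_transport", prob))
# Terms that only share the root score zero, because p(root) = 1.
print(resnik("er_to_golgi", "mitosis", prob))
```

Because the score depends only on how often a common ancestor is used, it does not change when extra intermediate links are added between two terms, which is the property that distinguishes it from edge-counting measures.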
Assessment of functional similarity based on protein-protein interactions
The functional similarity methods introduced above, abbreviated as Max, Ave, Tao, Schlicker and Wang, were tested by receiver operating characteristic (ROC) analysis. ROC analysis grades the performance of classifiers and rankers as a trade-off between specificity and sensitivity. The area under the ROC curve (AUC) is often taken as a measure of prediction performance: an area of 0.5 represents random forecasts, while an area of 1 reflects perfect forecasts. A total of 6,459 S. cerevisiae protein interactions were retrieved from the Database of Interacting Proteins (DIP) and filtered on the basis of the reliability of their GO annotation. We used 5,946 protein interactions (including 2,466 proteins) annotated in BP, 4,267 (1,945 proteins) in MF, 6,121 (2,534 proteins) in CC and 4,088 (1,850 proteins) in ALL as the positive datasets. Negative datasets containing the same number of protein pairs were randomly generated, as required by the ROC analysis (see Methods).
Table 1. Areas under ROC
Table 2. The combined GO annotations of S. cerevisiae, M. musculus and H. sapiens genes in BP, MF and CC. Columns: 4 annotations (%), 3 annotations (%), 2 annotations (%), 1 annotation (%). Rows: S. cerevisiae BP, M. musculus BP, H. sapiens BP.
We performed the same analysis on an additional yeast PPI data set from the Munich Information Center for Protein Sequences (MIPS) Comprehensive Yeast Genome Database (CYGD). The CYGD PPIs (14,545 interactions and 5,134 proteins, release date: 19 April 2007) were manually compiled from the literature and published large-scale experiments. Of these, the 12,741 interactions and 4,343 proteins that have GO annotations were used in this additional evaluation. The results were similar to those described above. For example, the AUC values for the BP ontology were as follows: Max, 0.73; Ave, 0.67; Tao, 0.67; and Wang, 0.70. The details are shown in the supplementary files [see Additional files 2 and 3].
Assessment of functional similarity based on microarray data
In this study, we used the expression and PPI datasets of S. cerevisiae to evaluate five popular functional similarity algorithms. The use of these datasets as statistical standards has been widely reported [4, 6, 25–28]. Two evaluation approaches were used because information on gene function obtained from various laboratory studies has shown that no single dataset is ideal for testing a knowledge database. In this study, 1,226 proteins but only 36 protein pairs were found to overlap between the PPI and expression datasets; the majority of the proteins had different interactions in the two datasets. Note that for the expression data, only gene pairs with an absolute correlation value exceeding 0.6 were regarded as expression-related, because little linearity was detected at lower values, as shown in Fig. 5. Thus, 624 proteins and 4,052 interactions were unique to the PPI dataset, and 1,229 proteins and 69,024 correlations were unique to the expression dataset. The large share of unique genes, and consequently of unique relationships, in these two datasets supported our assumptions. Although our results on the performance of functional similarity measures were quite promising, unexpected semantic similarities may have been obtained because of poorly annotated genes. This problem may diminish in the near future as gene annotations are continuously refined.
Inconsistencies in individual examples did exist between the PPI and expression datasets. For example, approximately 25% of the interacting proteins in S. cerevisiae had an unexpectedly low expression correlation coefficient (below 0.1). A plausible reason is that the 79 expression profiles in Eisen's data may not be sufficient to detect coexpression of the interacting proteins; more expression data would therefore be required. In addition, false positives are unavoidable in the PPI dataset. Another possibility is that these genes are simply not related at the expression level.
Therefore, the use of multiple standards (PPI and expression data) to evaluate the performance of functional similarity approaches has the advantage of both coverage and reliability.
In all tests, the Max method consistently showed the best performance. Shared annotations or closely related GO terms, which lead to high semantic scores, were probably one of the reasons. Of the 5,946 gene pairs, 2,622 had semantic scores above 6, and 2,278 of these (87% of 2,622) were contributed by shared GO terms in the multiple annotation dataset of BP ontology. To illustrate the biological details, we present two case studies, one supported by expression data and the other by PPI data. First, we considered two cell division control proteins, CDC20 and CDC26, which have similar expression profiles. Although there is no PPI information on these two genes, their significant coexpression correlation value of 0.77 suggested a functional relationship, which was also indicated by their shared GO annotations (refer to Fig. 1) and confirmed by other studies [29, 30]. The functional similarity score calculated by the Max method was 8.9. CDC20 and CDC26 also have unique annotations, such as 'anaphase-promoting complex activation during mitotic cell cycle' and 'protein ubiquitination', which led to some low semantic scores; as a result, the Ave method yielded a score of only 1.9. Second, the two interacting proteins SEC23 and BOS1 (described in the DIP dataset) have several annotations, among which 'ER to Golgi vesicle-mediated transport' is their shared term. According to an earlier report, their PPI occurs during the process of 'ER to Golgi vesicle-mediated transport'. The similarity score obtained by the Max method was 9.5, whereas that obtained by the Ave method was 0.8.
In these examples, the Ave method, which averages all related semantic scores, may compensate for some annotation mistakes but evidently leads to a much lower functional similarity score. The Wang and Tao methods, which use the best hits of each GO term subset, improved the accuracy; however, averaging all the best hits still led to a relatively low score. The Max and Schlicker methods, which take the best scores, showed much better results. Although genes annotated with multiple terms may be associated in several ways, the most likely association is through strong relationships, usually indicated by their shared or closely related terms. Other unique or unrelated annotations usually introduce noise into the calculation of functional similarity. However, it must be acknowledged that in the Max method, any annotation mistake may lead to false positive results. Note that in Fig. 2, the performance of the Ave method was relatively stable when the test dataset changed. Based on our preliminary findings and on the earlier conclusion that the sum of the similarity scores of matched GO terms for two proteins performs best in subnuclear localization prediction, a weighted average of all related semantic scores that favours multiple shared terms may yield better results than any of the algorithms evaluated here, since false annotations are less likely when multiple shared terms are present. In future studies, we will introduce an improved algorithm and some new software tools.
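The contrast between taking the maximum, averaging every term pair, and averaging the best hits can be made concrete with a small sketch. It assumes that a term-level similarity function is already available; the term names and scores below are hypothetical, and `best_match_average` is a simplified stand-in for the best-hit family rather than the exact Tao or Wang formula.

```python
def max_similarity(terms_a, terms_b, sim):
    """Max: the single highest term-pair score."""
    return max(sim(a, b) for a in terms_a for b in terms_b)


def ave_similarity(terms_a, terms_b, sim):
    """Ave: the mean over all term pairs."""
    scores = [sim(a, b) for a in terms_a for b in terms_b]
    return sum(scores) / len(scores)


def best_match_average(terms_a, terms_b, sim):
    """Mean of each term's best hit in the other protein's term set."""
    best_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


# Hypothetical term-pair scores: one shared term plus one unique term per protein.
SCORES = {
    ("shared", "shared"): 9.0,
    ("shared", "unique_b"): 1.0,
    ("unique_a", "shared"): 0.5,
    ("unique_a", "unique_b"): 0.0,
}


def toy_sim(term_a, term_b):
    return SCORES[(term_a, term_b)]


protein_a = ["shared", "unique_a"]
protein_b = ["shared", "unique_b"]
print(max_similarity(protein_a, protein_b, toy_sim))      # 9.0
print(ave_similarity(protein_a, protein_b, toy_sim))      # 2.625
print(best_match_average(protein_a, protein_b, toy_sim))  # 4.875
```

With one strongly matching shared term and two unrelated unique terms, the maximum keeps the full score, the best-hit average is pulled down by the unique terms, and the plain average is pulled down furthest, which mirrors the ordering seen in the CDC20/CDC26 and SEC23/BOS1 examples.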
The functional similarity methods were tested in the ALL, BP, MF and CC ontologies to evaluate their respective performances. As shown in Table 1, most methods performed best in ALL, performed well in BP and performed worst in MF. Note that the best performance in ALL resulted from our particular approach to test data collection: the ALL dataset contained the fewest proteins and protein interactions. For an inclusive ALL dataset that provides broad protein coverage across BP, MF and CC (containing 2,764 proteins and 6,424 positive protein interactions), the performance drops below that of the BP ontology (0.785 and 0.676 for the Max and Schlicker methods, respectively). To balance coverage against performance, BP annotations would be the best choice, while the MF dataset would be the least informative among the S. cerevisiae datasets. New functional similarity algorithms need to weight the contributions of BP, MF and CC differently to obtain both good performance and good coverage.
To explain why the MF dataset of S. cerevisiae performed worst and was the least informative, the numbers of genes with single and multiple annotations were collated, as shown in Table 2, and compared with those in BP and CC. It is very likely that the functional similarity methods could not be distinguished in the MF ontology because of its high proportion of single-term annotations, which were far less common in BP. This raises the question of whether multiple annotations were responsible for the good performance of the BP dataset. Further analysis in the BP ontology confirmed this possibility: multiple-term annotations lead to a more reliable functional similarity calculation. However, the AUC improvements came at a cost. Approximately 60% of the protein interactions were not covered when single-term annotations were eliminated. To obtain high accuracy as well as extensive coverage, both multiple-term and single-term annotations should be considered, but they should be treated differently. Table 2 shows an example of the distribution of single and multiple GO annotations in S. cerevisiae. Similar distributions are observed in the human and mouse datasets; thus, if such functional similarity algorithms are extended to higher organisms and multiple-term and single-term annotations are treated differently, as shown here, the results are expected to be promising.
Our evaluation of functional similarity approaches was based on S. cerevisiae datasets that have been continuously revised and improved. These datasets contain sufficient data to obtain accurate results. The results should contribute to the automated integration of prior and background knowledge in large-scale biological data mining. In particular, they provide supporting information and suggestions for improving current and future applications of semantic similarity algorithms, such as functional similarity search tools, mRNA coexpression analysis, PPI prediction and gene clustering.
Five popular functional similarity methods were evaluated using PPI and expression datasets of S. cerevisiae to obtain sufficient gene coverage and reliable results. The tests were consistently in favour of the simple maximum method. The results suggested that functional similarity algorithms should introduce different weights for the BP, MF and CC terms and for multiple annotations. In particular, multiple and single annotations should be treated differently for greater reliability together with total coverage. Although these findings were based on the information obtained from the S. cerevisiae datasets, there is a good possibility of extending this study to higher organisms such as humans and mice. Functional similarity in favour of knowledge represented by GO will contribute more to gene function studies in the near future.
Data acquisition and data processing
The GO annotation files were downloaded from SGD (October 2007 release) and contained 23,814 GO terms, subdivided into 13,916 biological process (BP) terms, 7,879 molecular function (MF) terms and 2,019 cellular component (CC) terms. These vocabularies had a spindle-shaped distribution across 16 levels of depth, with the 8th level containing the most terms. Annotations inferred from electronic annotation (IEA) were eliminated from further analysis because of their lack of reliability.
We retrieved 6,459 distinct PPIs (including 2,772 proteins) in S. cerevisiae from DIP (release date: 7 October 2007). Because genes annotated with terms from the top levels of the directed acyclic graph (DAG) structure of GO would create noise, only terms at the 3rd level and below (3rd–16th levels) were retained, resulting in 5,946 protein pairs (2,466 proteins) from BP, 4,267 pairs (1,945 proteins) from MF, and 6,121 pairs (2,534 proteins) from the CC ontology. These were used as positive datasets for ROC curve analysis. For ALL (root ontology), 4,088 protein pairs comprising 1,850 proteins were used. Note that the ALL dataset contained the fewest proteins because each gene was required to have 3rd- to 16th-level annotations in the BP, MF and CC ontologies simultaneously.
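The level-based filtering described above can be sketched as follows. This is only an illustration: the toy ontology and protein names are hypothetical, and defining a term's level as one plus its shortest distance from the root (with the root at level 1) is an assumption, since a term in a DAG can be reached by paths of different lengths.

```python
from collections import deque


def term_levels(children, root):
    """Breadth-first search from the root; the root is level 1 (assumed convention)."""
    levels = {root: 1}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in levels:  # keep the shortest path to the root
                levels[child] = levels[term] + 1
                queue.append(child)
    return levels


def filter_annotations(annotations, levels, min_level=3):
    """Keep only terms at min_level or deeper; drop proteins left without terms."""
    kept = {}
    for protein, terms in annotations.items():
        deep_terms = {t for t in terms if levels.get(t, 0) >= min_level}
        if deep_terms:
            kept[protein] = deep_terms
    return kept


# Hypothetical toy ontology (parent -> children) and annotations.
CHILDREN = {
    "BP": ["transport", "cell_cycle"],
    "transport": ["vesicle_transport"],
    "vesicle_transport": ["er_to_golgi"],
    "cell_cycle": ["mitosis"],
}
ANNOTATIONS = {
    "P1": {"transport", "er_to_golgi"},
    "P2": {"cell_cycle"},
    "P3": {"mitosis"},
}

levels = term_levels(CHILDREN, "BP")
print(filter_annotations(ANNOTATIONS, levels))
```

In this toy case P2 is dropped because its only annotation sits at level 2, which is the same mechanism that shrinks the ALL dataset when a protein must retain annotations in all three ontologies.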
The expression dataset was taken from the study of Eisen et al., which contained 79 gene expression profiles of S. cerevisiae. Since most of the proteins in the Eisen dataset were well annotated in GO, 2,461 non-IEA annotated genes were obtained.
ROC curve analysis
ROC analysis grades the performance of classifiers and rankers as a trade-off between specificity and sensitivity. The positive datasets are described above. Negative datasets with the same number of protein pairs were generated by randomly choosing proteins from the non-positive genes in the GO annotation files. To compare the reliability of multiple and single GO annotations, positive and negative PPI datasets containing 2,414 protein pairs annotated with multiple GO terms were built for the BP ontology. The ROC and ROCR libraries in the R programming language were employed to calculate the AUCs and draw the graphs.
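For readers who want to see what the AUC measures, the sketch below computes it directly as the probability that a positive pair scores higher than a negative pair (ties count one half), and builds a size-matched negative set by random sampling. It is written in plain Python rather than with the R packages used in the study; the function names are hypothetical, and sampling from all pairs that are absent from the positive set is an assumption about how negatives are drawn.

```python
import random


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sample_negative_pairs(proteins, positive_pairs, n_pairs, seed=0):
    """Draw n_pairs distinct random protein pairs that are not known positives.

    Assumes enough non-positive pairs exist; otherwise the loop would not end.
    """
    rng = random.Random(seed)
    positives = {frozenset(pair) for pair in positive_pairs}
    negatives = set()
    while len(negatives) < n_pairs:
        pair = frozenset(rng.sample(proteins, 2))
        if pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Hypothetical similarity scores for positive and negative protein pairs.
print(auc([0.9, 0.4], [0.5, 0.1]))  # 0.75
print(sample_negative_pairs(["A", "B", "C", "D"], [("A", "B")], 3))
```

The pairwise loop is quadratic in the number of pairs, which is acceptable for a few thousand pairs per class; a rank-based formulation gives the same value more efficiently on larger sets.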
Pearson correlation analysis
Millions of gene pairs were derived from the Eisen dataset for further analysis of the correlation between gene expression and semantic similarity. For each gene pair, the semantic similarity score and the absolute value of the expression correlation were calculated. The Pearson correlation coefficient was used to calculate the expression correlation. Following Sevilla's study, we split the gene pairs into 100 groups according to their absolute expression correlation values and then calculated the average correlation value and the average similarity score in each interval. Finally, the correlation coefficients between expression correlation and semantic similarity were computed. Our study covered four aspects separately (ALL: all hierarchies of GO; BP: biological process; MF: molecular function; CC: cellular component). The barplot function in the R programming language was used to visualize the correlation coefficients for all procedures.
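The binning procedure can be outlined in a few lines. The sketch below is illustrative only: it assumes 100 equal-width intervals over absolute correlation values in [0, 1], skips empty intervals, and uses synthetic data; the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def binned_means(pairs, n_bins=100):
    """Group (abs_expression_corr, similarity) pairs into equal-width intervals.

    Returns one (mean correlation, mean similarity) tuple per non-empty interval.
    """
    bins = [[] for _ in range(n_bins)]
    for corr, sim in pairs:
        index = min(int(corr * n_bins), n_bins - 1)  # corr == 1.0 goes in the last bin
        bins[index].append((corr, sim))
    means = []
    for members in bins:
        if members:
            mean_corr = sum(c for c, _ in members) / len(members)
            mean_sim = sum(s for _, s in members) / len(members)
            means.append((mean_corr, mean_sim))
    return means


# Synthetic gene pairs whose similarity grows linearly with expression correlation.
pairs = [(i / 1000, 10 * i / 1000) for i in range(1000)]
means = binned_means(pairs)
r = pearson([c for c, _ in means], [s for _, s in means])
print(len(means), r)
```

Averaging within intervals before correlating removes most pair-level scatter, so the resulting coefficient describes the trend across intervals rather than the agreement for individual gene pairs.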
This work was supported by the China National High-tech 863 Program (2006AA02Z335).
1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25: 25–29.
2. Azuaje F, Al-Shahrour F, Dopazo J: Ontology-driven approaches to analyzing data in functional genomics. Methods in molecular biology (Clifton, NJ) 2006, 316: 67–86.
3. Brameier M, Wiuf C: Co-clustering and visualization of gene expression data and gene ontology terms for Saccharomyces cerevisiae using self-organizing maps. Journal of biomedical informatics 2007, 40: 160–173.
4. Lee SG, Hur JU, Kim YS: A graph-theoretic modeling on GO space for biological interpretation of gene clusters. Bioinformatics (Oxford, England) 2004, 20: 381–388.
5. Cheng J, Cline M, Martin J, Finkelstein D, Awad T, Kulp D, Siani-Rose MA: A knowledge-based clustering algorithm driven by Gene Ontology. Journal of biopharmaceutical statistics 2004, 14: 687–700.
6. Wang H, Azuaje F, Bodenreider O: An ontology-driven clustering method for supporting gene expression analysis. Computer-Based Medical Systems, 2005 Proceedings 18th IEEE Symposium on; 23–24 June 2005, 389–394.
7. Budanitsky A, Hirst G: Semantic Distance in WordNet: An Experimental, Application-oriented Evaluation of Five Measures. In Workshop on WordNet and Other Lexical Resources 2001.
8. Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics (Oxford, England) 2003, 19: 1275–1283.
9. Wu H, Su Z, Mao F, Olman V, Xu Y: Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic acids research 2005, 33: 2822–2837.
10. Wang H, Azuaje F, Bodenreider O, Dopazo J: Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. Computational Intelligence in Bioinformatics and Computational Biology, 2004 CIBCB '04 Proceedings of the 2004 IEEE Symposium on 2004, 25–31.
11. Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martinez-Cruz LA, Corrales FJ, Rubio A: Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2005, 2: 330–338.
12. Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics (Oxford, England) 2006, 22: 967–973.
13. Resnik P: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence 1995, 448–453.
14. Lin D: An Information-Theoretic Definition of Similarity. Proceedings of the Fifteenth International Conference on Machine Learning 1998.
15. Speer N, Spieth C, Zell A: A memetic clustering algorithm for the functional partition of genes based on the gene ontology. Computational Intelligence in Bioinformatics and Computational Biology, 2004 CIBCB '04 Proceedings of the 2004 IEEE Symposium on 2004, 252–259.
16. Tao Y, Sam L, Li J, Friedman C, Lussier YA: Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics (Oxford, England) 2007, 23: i529–538.
17. Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC bioinformatics 2006, 7: 302.
18. Schlicker A, Albrecht M: FunSimMat: a comprehensive functional similarity database. Nucleic acids research 2008, 36: D434–439.
19. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms. Bioinformatics (Oxford, England) 2007, 23: 1274–1281.
20. Lei Z, Dai Y: Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction. BMC bioinformatics 2006, 7: 491.
21. Dellaire G, Farrall R, Bickmore WA: The Nuclear Protein Database (NPD): sub-nuclear localisation and functional annotation of the nuclear proteome. Nucleic acids research 2003, 31: 328–330.
22. Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D: DIP: the database of interacting proteins. Nucleic acids research 2000, 28: 289–291.
23. Guldener U, Munsterkotter M, Kastenmuller G, Strack N, van Helden J, Lemer C, Richelles J, Wodak SJ, Garcia-Martinez J, Perez-Ortin JE, et al.: CYGD: the Comprehensive Yeast Genome Database. Nucleic acids research 2005, 33: D364–368.
24. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 1998, 95: 14863–14868.
25. Aytuna AS, Gursoy A, Keskin O: Prediction of protein-protein interactions by combining structure and sequence conservation in protein interfaces. Bioinformatics (Oxford, England) 2005, 21: 2850–2855.
26. Guo Y, Yu L, Wen Z, Li M: Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic acids research 2008, 36: 3025–3030.
27. Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S, Ghosh D, Pandey A, Chinnaiyan AM: Probabilistic model of the human protein-protein interaction network. Nature biotechnology 2005, 23: 951–959.
28. Zhao XM, Wang RS, Chen L, Aihara K: Uncovering signal transduction networks from high-throughput data by integer linear programming. Nucleic acids research 2008, 36: e48.
29. Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science (New York, NY) 1998, 282: 699–705.
30. Hwang LH, Murray AW: A novel yeast screen for mitotic arrest mutants identifies DOC1, a new gene involved in cyclin proteolysis. Molecular biology of the cell 1997, 8: 1877–1887.
31. Newman AP, Shim J, Ferro-Novick S: BET1, BOS1, and SEC22 are members of a group of interacting yeast genes required for transport from the endoplasmic reticulum to the Golgi complex. Molecular and cellular biology 1990, 10: 3405–3414.
32. Zhang P, Zhang J, Sheng H, Russo JJ, Osborne B, Buetow K: Gene functional similarity search tool (GFSST). BMC bioinformatics 2006, 7: 135.
33. Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics (Oxford, England) 2005, 21: 3940–3941.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.