- Methodology article
- Open Access
Prediction of enzyme function by combining sequence similarity and protein interactions
© Espadaler et al; licensee BioMed Central Ltd. 2008
- Received: 31 July 2007
- Accepted: 27 May 2008
- Published: 27 May 2008
A number of studies have used protein interaction data alone for protein function prediction. Here, we introduce a computational approach for annotation of enzymes, based on the observation that similar protein sequences are more likely to perform the same function if they share similar interacting partners.
The method has been tested against the PSI-BLAST program using a set of 3,890 protein sequences from which interaction data was available. For protein sequences that align with at least 40% sequence identity to a known enzyme, the specificity of our method in predicting the first three EC digits increased from 80% to 90% at 80% coverage when compared to PSI-BLAST.
Our method can also be used in proteins for which homologous sequences with known interacting partners can be detected. Thus, our method could increase 10% the specificity of genome-wide enzyme predictions based on sequence matching by PSI-BLAST alone.
- Enzyme Commission
- Sequence Pair
- Enzyme Commission Number
- Protein Interaction Data
- Sequence Identity Threshold
While the amount of genome sequence information is increasing exponentially, the annotation of protein sequences remains a problem, both in terms of quality and quantity . Bioinformatics-based annotation of uncharacterized proteins is still one of the most challenging problems in biology . The classical approach involves transfer of annotation from a functionally characterized protein to its functionally uncharacterized homologs. Although, several studies have highlighted the limitations of such methods[1, 3, 4], they have been extensively used on annotating proteins and in particular enzymes [5, 6].
About half of all proteins with experimentally characterized functions have enzymatic activity, making enzymes the largest single class of proteins . The Enzyme Commission (EC) uses four numbers (integers) separated by periods to classify the functions of enzymes . The first three digits describe the overall type of an enzymatic reaction, while the last digit represents the substrate specificity of the catalyzed reaction. The accuracy of transferring an enzymatic annotation between two globally aligned protein sequences has been reported to significantly drop under 60% sequence identity . To address this limitation, we introduce for first time an approach that combines sequence similarity search and comparative protein interaction data to increase the confidence in automatic enzyme annotation. Our hypothesis relies in the rationale that homologous proteins perform similar functions when associated with similar interacting partners. Therefore, two sufficiently similar proteins with common interactions should probably share the same first three EC numbers (common enzymatic function).
Protein-protein interactions have been used for functional annotation by several different approaches, such as Markov random fields [8, 9], minimization of interactions among proteins from different functional categories , message passing algorithms , neighbourhood weights , network-flow algorithms , the number of common interaction partners , and the combination of common interaction partners and common domains . However, the results from some of the existing methods are limited by the need to know the function of interacting partners to annotate the query protein. This limitation is even more dramatic for annotating enzymatic function because enzymes usually do not interact with other enzymes of the same function. Possible exceptions are enzymes involved in proteolytic (e.g., clotting cascade) or signalling cascades (e.g., MAP-kinase cascades). Moreover, the benchmarks of such methods did not account for the fact that protein families have different distributions in different genomes. Therefore, the accuracy obtained for a method in a given genome can be biased due to the specific representation of protein families within the genome. This problem has been already addressed by averaging the results for each protein families according to PFAM [5, 6] or by describing the degree of function conservation versus sequence identity [15, 16].
Next, we outline the impact of protein interactions on annotation based on sequence similarity alone (Results). Then, we discuss the implications of our combined approach for functional annotation in general (Discussion). Finally, we describe our approach in detail (Methods).
As mentioned above, our method relies in the rationale that homologous proteins perform similar functions when associated with similar interacting partners. Moreover, based on the observation that homologous proteins interact with similar partners in similar ways , our method applies an additional transference of interactions by means of homology (we name this as expansion of interactions ). Thus, if two proteins do not share exactly the same interaction, but interact with homologous proteins, our method is still applicable. Our method, called ModFun, has been trained and tested using a benchmark set of proteins with known enzymatic function and known interactions or with detectable homologous with known interactions. The benchmark set was randomly split in two sets with similar number of proteins. The first set (for training) was used to find the best criteria of similarity between sequences for the transference of interactions (expansion) and the threshold to filter the predictions using common interactions. The second set (for testing) was used to test the increase in accuracy for function assignment of our approach with respect to PSI-BLAST.
Analysis of the data set
A pair of homologous proteins P1 – P2 can be related through common interaction partners. The method requires that the proteins used in the expansion procedure (P1 and P1') perform the same enzymatic function. Previous work suggested 60% identity of a BLAST alignment as a reliable cutoff for the conservation of the enzymatic function as defined by the first three EC digits . However, to a lesser extent the BLAST e-value  has also been reported as indicative of enzymatic function conservation . Therefore, to expand an interaction between two proteins to their homologs, both sequence identity and BLAST e-value cutoffs need to be fulfilled, hereby referred as the expansion cutoff. We have explored expansion cutoffs ranging from 20% to 70% for the sequence identity and 0.0001 for the e-value.
Protein interactions were used in combination with homology detection only for sequence pairs below a certain sequence identity cutoff, referred as the filter cutoff. We have explored filter cutoffs ranging from 25% to 65% sequence identity.
ROC Analysis and Validation
At 40% sequence identity threshold, measured from the PSI-BLAST alignment, ModFun was able to annotate enzymes from the training set with 10% higher specificity than PSI-BLAST for the sensitivity of 80% (Figure 2b). The utility of the expansion procedure is further demonstrated by comparing the ROC curves obtained for different expansion cutoffs. For instance, for a specificity of 90%, the expansion procedure increases the coverage from 63% (no expansion) to 80% (optimal parameters).
Statistical significance of the enrichment by ModFun
The ratio of correct functional assignments (using the first three EC digits) in DIP-SP is 1.5% and increases to 3.5% when using proteins sharing common interactions with the query sequence (Methods). Such result indicates that there is an enrichment of the proteins with the same enzymatic function within the set of relatives with common interacting partners. To quantify the statistical significance of such enrichment, we compared the enrichment for each query using the Wilcoxon test  against 100 randomly selected sets of interactions with the same number of relatives from DIP-SP. The corresponding p-value (< 7 × 10-52) quantifies the statistical significance of enrichment in the set of relatives by means of common interactions.
Impact of our approach on annotating the genome of S. cerevisiae
We described, implemented, and tested a method that uses information about sequence similarity and protein-protein interactions to perform enzyme annotations between remotely related protein sequences. The method was tested on a set of proteins with known interactions containing 1,227 enzymes and 2,663 non-enzymes. Most of the existing methods have been tested by application to the S. cerevisiae proteome [8–10, 13–15]. In this work, we have taken advantage of available protein interaction data from several organisms to test our approach. Although S. cerevisiae accounts for a large fraction of the 1,227 enzymes used in the SP-DIP set of known interactions (54%), other organisms were also represented (i.e., E. coli, H. sapiens, D. melanogaster, and H. pylori with 9%, 9%, 8%, and 7% of the sequences, respectively).
Previous studies have stressed the need to compensate for overrepresented and underrepresented protein families to obtain reliable estimates of the first three digits of an enzymatic function [5, 6]. Members of an overrepresented protein family are more likely to find pairs from the same family, therefore yielding a high number of true positives. Moreover, since some protein families account for a larger fraction of the dataset than other families, statistics obtained from these families could bias the general statistics towards higher values of function conservation between pairs. In this work, we have addressed this issue by averaging our results within protein families. The results show that considering protein interactions increases the degree of enzyme function conservation for sequence pairs in the 40–55% identity range, as calculated by PSI-BLAST. Therefore, annotation transfers may be performed with increased confidence between such sequence pairs if similar interacting partners are found. For pairs with higher percentage of identity (>50%), sequence similarity alone is a good indication of function conservation (i.e., conservation of at least the first three EC digits).
A genome-wide test was performed for the S. cerevisiae sequences, for which abundant protein interactions data is available. By using ModFun, ~19% of all known enzymes in the yeast genome would have benefited from the increase in the confidence of their functional annotations. Moreover, the results show that our method can be applied to proteins without known interactions, via an "expansion" procedure based on known interactions of their close homologs. Because proteins involved in the expansion should perform the same enzymatic function, only homologs above 60% identity by BLAST should be considered in the expansion . However, here we show that homologs with 40% identity can be used, as long as the e-value of BLAST is smaller than 0.0001. For example, without using interaction data from yeast, 6% of all known yeast enzymes still benefit from the expansion.
Although not all protein interactions need to be determined in order to achieve full coverage by our method, the increase of the number of experimentally determined protein-protein interactions will likely result in a larger applicability of ModFun. In this work, we have also shown that the identification of only one similar interacting partner for a low sequence identity pair of proteins is enough to increase the reliability of the annotation transfer. Moreover, a more complete knowledge of the sets of interacting partners is likely to improve the accuracy of ModFun by allowing the scoring of protein pairs by the number of similar interaction partners, in addition to their sequence similarity.
In conclusion, ModFun provides a higher confidence in functional annotation from sequence than sequence-based methods alone. In particular, about 20% of the enzymes would be incorrectly predicted by using only PSI-BLAST.
To test our approach, we relied on the DIP database (Dec 2004 release) , the ENZYME database (release 36.0, Jan 2005) , the InterPro database (release 9.0, Feb 2005) , and the UniProt database (release 9.0, Feb 2005) .
Proteins with EC-codes were extracted from the SWISSPROT subset of UniProt. We excluded those proteins that (i) have EC numbers with undetermined digits; (ii) have more than one EC number; and (iii) are annotated as "probable", "hypothetical", "putative", "by similarity", "by homology" or "fragment" in the SWISSPROT keyword record [5, 6]. These criteria resulted in the SP-EC dataset containing 49,885 protein sequences.
Additionally, we extracted proteins from the SWISSPROT database that (i) have known interactions in the DIP database, by means of their "AC" codes; (ii) have PFAM mappings in InterPro; and, (iii) are not annotated with keywords such as "probable", "hypothetical", "putative", "by similarity", "by homology" or "fragment". The resulting subset of 3,890 SWISSPROT entries (i.e., SP-DIP) included 2,663 proteins that do not have an EC code (i.e., considered non-enzymes) and 1,227 proteins that were present in the SP-EC dataset (i.e., considered enzymes).
To test our procedure, profiles were built for each protein in the SP-DIP dataset by running the PSI-BLAST program  with default parameters against UniProt  for three iterations (or up to convergence). Enzyme-enzyme and enzyme-non-enzyme pairs were collected by searching with these profiles against the SP-DIP dataset. The outputs of the PSI-BLAST searches were filtered to remove self-matches and alignments shorter than 30 residues.
Relating sequence pairs through interacting partners
Grouping of enzymes into families
Sequences from SP-DIP dataset were classified into 1,926 families according to their PFAM domain architecture. Families containing enzyme sequences were further split according to their first three EC digits. These families are referred to as EC families.
Family-averaged sensitivity and specificity
where N i is the total number of families finding sequence matches above threshold i, and N is the total number of families in the set.
Optimal values for the expansion and filter identity cutoffs were selected on the basis of the relative improvement over PSI-BLAST. This improvement was defined as 100 × (D1 - D2)/D2, where D1 and D2 are the minimum distance of the ROC curve to the upper right corner of the plot for our method (ModFun) and PSI-BLAST, respectively.
Enzyme function conservation as a function of sequence identity
JE was supported by predoctoral fellowship from the Generalitat de Catalunya and CERBA (Spain). EQ acknowledges grants from the Spanish Ministerio de Educación y Ciencia (BIO2007-67904-C02-01). FXA acknowledges grants from the Spanish Ministerio de Educación y Ciencia (BIO2007-68046). AS acknowledges the financial support by the Sandler Family Supporting Foundation, IBM, HP, Netapps, and Intel for hardware gifts, and NIH grants GM74945, GM74929, GM71790, and GM54762. MAM-R acknowledges support from the Spanish Ministerio de Educación y Ciencia (BIO2007-66670). BO acknowledges grants from Generalitat de Catalunya (CIDEM), Spanish Ministerio de Educación y Ciencia (MEC BIO2005-00533 and PROFIT PSE-010000-2007-1) and by European Union INFOBIOMED-NoE (IST-507585) and @NEURIST (IST-2004-027703).
- Iliopoulos I, Tsoka S, Andrade MA, Enright AJ, Carroll M, Poullet P, Promponas V, Liakopoulos T, Palaios G, Pasquier C, Hamodrakas S, Tamames J, Yagnik AT, Tramontano A, Devos D, Blaschke C, Valencia A, Brett D, Martin D, Leroy C, Rigoutsos I, Sander C, Ouzounis CA: Evaluation of annotation strategies using an entire genome sequence. Bioinformatics 2003, 19: 717–726. 10.1093/bioinformatics/btg077View ArticlePubMedGoogle Scholar
- Friedberg I: Automated protein function prediction--the genomic challenge. Brief Bioinform 2006, 7: 225–242. 10.1093/bib/bbl004View ArticlePubMedGoogle Scholar
- Valencia A, Pazos F: Computational methods for the prediction of protein interactions. Curr Opin Struct Biol 2002, 12: 368–373. 10.1016/S0959-440X(02)00333-0View ArticlePubMedGoogle Scholar
- Devos D, Valencia A: Practical limits of function prediction. Proteins 2000, 41: 98–107. 10.1002/1097-0134(20001001)41:1<98::AID-PROT120>3.0.CO;2-SView ArticlePubMedGoogle Scholar
- Rost B: Enzyme function less conserved than anticipated. J Mol Biol 2002, 318: 595–608. 10.1016/S0022-2836(02)00016-5View ArticlePubMedGoogle Scholar
- Tian W, Skolnick J: How well is enzyme function conserved as a function of pairwise sequence identity? J Mol Biol 2003, 333: 863–882. 10.1016/j.jmb.2003.08.057View ArticlePubMedGoogle Scholar
- Bairoch A: The ENZYME database in 2000. Nucleic Acids Res 2000, 28: 304–305. 10.1093/nar/28.1.304PubMed CentralView ArticlePubMedGoogle Scholar
- Letovsky S, Kasif S: Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 2003, 19 Suppl 1: i197–204. 10.1093/bioinformatics/btg1026View ArticlePubMedGoogle Scholar
- Deng M, Chen T, Sun F: An integrated probabilistic model for functional prediction of proteins. J Comput Biol 2004, 11: 463–475. 10.1089/1066527041410346View ArticlePubMedGoogle Scholar
- Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nat Biotechnol 2003, 21: 697–700. 10.1038/nbt825View ArticlePubMedGoogle Scholar
- Leone M, Pagnani A: Predicting protein functions with message passing algorithms. Bioinformatics 2005, 21: 239–247. 10.1093/bioinformatics/bth491View ArticlePubMedGoogle Scholar
- McDermott J, Bumgarner R, Samudrala R: Functional annotation from predicted protein interaction networks. Bioinformatics 2005, 21: 3217–3226. 10.1093/bioinformatics/bti514View ArticlePubMedGoogle Scholar
- Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21 Suppl 1: i302–10. 10.1093/bioinformatics/bti1054View ArticlePubMedGoogle Scholar
- Samanta MP, Liang S: Predicting protein functions from redundancies in large-scale protein interaction networks. Proc Natl Acad Sci U S A 2003, 100: 12579–12583. 10.1073/pnas.2132527100PubMed CentralView ArticlePubMedGoogle Scholar
- Okada K, Kanaya S, Asai K: Accurate extraction of functional associations between proteins based on common interaction partners and common domains. Bioinformatics 2005, 21: 2043–2048. 10.1093/bioinformatics/bti305View ArticlePubMedGoogle Scholar
- Espadaler J, Romero-Isart O, Jackson RM, Oliva B: Prediction of protein-protein interactions using distant conservation of sequence patterns and structure relationships. Bioinformatics 2005.Google Scholar
- Aloy P, Stark A, Hadley C, Russell RB: Predictions without templates: new folds, secondary structure, and contacts in CASP5. 2003, 436–456.Google Scholar
- Espadaler J, Aragues R, Eswar N, Marti-Renom MA, Querol E, Aviles FX, Sali A, Oliva B: Detecting remotely related proteins by their interactions and sequence similarity. Proc Natl Acad Sci U S A 2005, 102: 7151–7156. 10.1073/pnas.0500831102PubMed CentralView ArticlePubMedGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, 32 Database issue: D138–41. 10.1093/nar/gkh121View ArticleGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Wilcoxon F: Individual Comparisons by Ranking Methods. Biometrics 1945, 1: 80–83. 10.2307/3001968View ArticleGoogle Scholar
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32 Database issue: D449–51. 10.1093/nar/gkh086View ArticleGoogle Scholar
- Mulder N, Apweiler R: InterPro and InterProScan: Tools for Protein Sequence Classification and Comparison. Methods Mol Biol 2007, 396: 59–70.View ArticlePubMedGoogle Scholar
- Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: The Universal Protein Resource (UniProt). Nucleic Acids Res 2005, 33: D154–9. 10.1093/nar/gki070PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.