Quantitative sequence-function relationships in proteins based on gene ontology
© Sangar et al; licensee BioMed Central Ltd. 2007
Received: 05 February 2007
Accepted: 08 August 2007
Published: 08 August 2007
The relationship between divergence of amino-acid sequence and divergence of function among homologous proteins is complex. The assumption that homologs share function – the basis of transfer of annotations in databases – must therefore be regarded with caution. Here, we present a quantitative study of sequence and function divergence, based on the Gene Ontology classification of function. We determined the relationship between sequence divergence and function divergence in 6828 protein families from the PFAM database. Within families there is a broad range of sequence similarity from very closely related proteins – for instance, orthologs in different mammals – to very distantly-related proteins at the limit of reliable recognition of homology.
We correlated the divergence in sequences determined from pairwise alignments, and the divergence in function determined by path lengths in the Gene Ontology graph, taking into account the fact that many proteins have multiple functions. Our results show that, among homologous proteins, the proportion of divergent functions decreases dramatically above a threshold of sequence similarity at about 50% residue identity. For proteins with more than 50% residue identity, transfer of annotation between homologs will lead to an erroneous attribution of a totally dissimilar function in fewer than 6% of cases. This means that for very similar proteins (more than about 50% identical residues) the chance of completely incorrect annotation is low; however, because of the phenomenon of recruitment, it is still non-zero.
Our results describe general features of the evolution of protein function, and serve as a guide to the reliability of annotation transfer, based on the closeness of the relationship between a new protein and its nearest annotated relative.
Assignment of function to gene products in the absence of direct experimental information is an important challenge of computational molecular biology [1–3]. In annotating proteins from newly-sequenced genomes, it is a common practice to transfer functional annotation from a homologous protein [4–8]. This approach depends on the assumptions that: (1) because homologous proteins have similar sequences and structures, they have similar functions, and (2) the annotation of the source homologue is correct. Often, but certainly not always, these assumptions are valid.
In this study we quantitatively assess the relationship between the divergence of protein function and the divergence of amino acid sequence in families of homologous proteins. In addition to illuminating the process by which proteins evolve altered and novel functions, the results provide guidance about the expected accuracy of transfer of functional annotation among homologous proteins in databases.
The most general evidence for protein homology, and inference of shared function, depends on comparative analysis of sequences and structures. PSI-BLAST [9] and Hidden Markov Models [10] identify distant homologs from multiple sequence alignments. Other techniques include the training of support vector machines [11] and neural networks [12] on protein features such as charge distribution and hydrophobicity to predict protein function. Structure comparisons improve the accuracy of inference of function in the absence of direct experimental evidence. These include the use of information from domains [13] and motifs [14–16]. Fleming et al. [17] combined structural and sequence alignments of proteins in an annotation tool named PHUNCTIONER.
Despite the sensitivity of these tools for detecting homologs and predicting function, many authors have pointed out that because closely-related proteins can change function, either through divergence to a related function or by recruitment for a very different function, annotations based only on homology can be incorrect [18–28].
Two problems that have arisen in studying the evolution of protein function and evaluating the expected accuracy of functional annotation transfer have been (1) standardization of terminology in describing function, and (2) defining a measure of the "distance" between functions. The Enzyme Commission classification has been very valuable but deals with only one class of protein functions. In 2000, The Gene Ontology (GO) Consortium formulated a newer and more general classification of protein functions and the relationships among them. Unlike the EC classification, which is a strict hierarchy, the GO scheme has the form of Directed Acyclic Graphs (DAGs), specialized to three domains: Molecular Function, Biological Process, and Cellular Component.
Enzyme Commission identifiers form a strict four-level hierarchy, or tree. For example, isopentenyl-diphosphate Δ-isomerase is assigned EC number 5.3.3.2, where the initial 5 specifies the most general category, 5 = isomerases; 5.3 comprises intramolecular isomerases; 5.3.3 those enzymes that transpose C=C bonds; and the full identifier 5.3.3.2 specifies a particular reaction. Note that the EC classifies reactions, not enzymes. To compare functional assignments of two proteins according to the EC classification, it is conventional to ask at how many levels of the hierarchy the EC numbers agree.
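The conventional EC comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the study; the function name and the treatment of "-" as an unspecified level are assumptions.

```python
def ec_levels_in_common(ec_a: str, ec_b: str) -> int:
    """Count the leading EC levels (0 to 4) on which two EC numbers agree."""
    shared = 0
    for level_a, level_b in zip(ec_a.split("."), ec_b.split(".")):
        # Assumption: "-" marks an unspecified level and never counts as agreement.
        if level_a != level_b or level_a == "-":
            break
        shared += 1
    return shared


print(ec_levels_in_common("5.3.3.2", "5.3.3.2"))  # identical reaction
print(ec_levels_in_common("5.3.3.2", "5.3.1.1"))  # both intramolecular isomerases
print(ec_levels_in_common("5.3.3.2", "1.1.1.1"))  # different top-level class
```

Because the EC scheme is a tree, agreement at a deeper level implies agreement at every shallower level, so a single integer is enough to summarize the comparison.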
In contrast, the GO classification is not a tree, but a more general type of graph. Each node is labeled by a general or specific protein function. Edges in the graph correspond to relationships between more general and more specific functions, that is, child-parent relationships. For example, the node "protein binding" is a child of the node containing the more general function "binding". The number of levels – the length of the path from any leaf to the root – is not constant. The structure of the GO DAG induces a measure of distances between functions, which will be used to quantify sequence-function relationships in proteins (see Materials and Methods).
Our current work treats the Molecular Function component of the GO classification. The GO Molecular Function graph forms a network that has characteristics in common with other biological networks. In the Gene Ontology DAGs, the average in-degree is 1.36 (that is, on average a node or GO ID has 1.36 parents). The in-degree distribution is intermediate between an exponential and a power function. The out-degree varies widely, from 1 to 298. Three nodes have very high out-degree, with 122, 238 and 298 children. The out-degree distribution follows a power law, showing that there are hubs, or highly connected nodes. The total degree (in-degree + out-degree) distribution for the Molecular Function ontology has a mean of 2.69, and follows a power law.
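As an illustration of how such degree statistics can be obtained, the following Python sketch computes the mean in-degree and the out-degree of each node from a child-to-parents map. The toy graph and all names are hypothetical, and whether the root is included in the average is an assumption (here it is excluded, since it has no parents).

```python
from collections import Counter

# Hypothetical toy DAG (not real GO data): child term -> set of parent terms.
TOY_PARENTS = {
    "binding": {"molecular_function"},
    "catalytic activity": {"molecular_function"},
    "protein binding": {"binding"},
    "ion binding": {"binding"},
    "calcium ion binding": {"ion binding"},
    "toy term": {"binding", "catalytic activity"},  # a node with two parents
}


def degree_summary(parents):
    """Return (mean in-degree over non-root nodes, out-degree per parent node)."""
    in_degree = {node: len(ps) for node, ps in parents.items()}
    out_degree = Counter(p for ps in parents.values() for p in ps)
    mean_in = sum(in_degree.values()) / len(in_degree)
    return mean_in, dict(out_degree)


mean_in, out_degree = degree_summary(TOY_PARENTS)
print(round(mean_in, 2))      # mean number of parents per node
print(out_degree["binding"])  # number of children of "binding"
```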
1.1 Assignment of functions to proteins
Neither the Enzyme Commission nor the GO classifications of protein function constitutes an assignment of function to any particular protein. Both provide only a framework for making such assignments. The PIR database at Georgetown University  associates Gene Ontology Identifiers (GO IDs) with individual proteins. The annotation of each protein may include several GO IDs. Indeed, annotation with any function logically implies annotation with all more-inclusive functions, all the way up to the root of the graph. (Note, however, that annotations of proteins by GO terms in databases do not always explicitly contain all the ancestors of every function that appears.) Therefore for each protein we extracted the distal (= most precise) GO IDs to represent the function of the protein (see Materials and Methods).
1.2 The relationship between sequence divergence and function divergence
Many proteins with similar sequences have similar functions; for example, mammalian hemoglobins transport oxygen and carbon dioxide. For mammalian hemoglobins, transfer of annotation among homologs gives correct results. However, other families of homologs contain proteins with different functions. For example, hen egg white lysozyme and baboon α-lactalbumin have 37% identical residues in optimal sequence alignment, and retain very similar mainchain structures, but have unrelated functions. The contrast between mammalian hemoglobins and lysozyme/α-lactalbumin illustrates a general correlation between divergence of sequence and divergence of function. That is, mammalian hemoglobins have similar sequences and similar functions; lysozyme and α-lactalbumin have more distantly related sequences and dissimilar functions.
However, there are many exceptions to this correlation. In the duck, eye lens crystallins are identical in sequence to liver enolase and lactate dehydrogenase . This is an example of "recruitment" – unrelated function with little or even no sequence change. This threatens to produce incomplete or even erroneous annotations, if annotation is passed freely among homologs. Conversely, some proteins very distantly related in sequence nevertheless retain similar function.
Several groups have studied the relationship between sequence similarity and functional similarity based on the Enzyme Commission classification. Those studies were necessarily limited to proteins with enzymatic functions:
In studying the relationship between sequences and EC classifications of proteins, Wilson, Kreychman & Gerstein , Todd, Orengo & Thornton , and Devos & Valencia  reached similar (although not identical) optimistic conclusions. Wilson, Kreychman & Gerstein  concluded that for pairs of single-domain proteins, at levels of sequence identity > 40%, precise function is conserved, and for levels of sequence identity > 25%, broad functional class is conserved (according to a functional classification that uses the EC hierarchy for enzymes, and supplements it with material from FLYBASE  for non-enzymes.) The study of Todd, Orengo & Thornton  analyzed only the homologous pairs of enzymes and reported that approximately 90% of pairs of proteins with sequence identity > 40% conserve all four EC numbers. Even at 30% sequence identity, Todd, Orengo & Thornton found conservation of three levels of the EC hierarchy for 70% of homologous pairs of enzymes. Devos & Valencia  reached very similar conclusions; they also reported the ability to predict correctly the agreement of FSSP categories  and SWISS-PROT  keywords, as a function of the level of sequence similarity.
Our work pursues the question of the relationship between divergence of sequence and function in homologous proteins, using the Molecular Function DAG of Gene Ontology for the classification of function. Use of the GO classification allows extension of the earlier work to proteins with non-enzymatic functions, permitting a comprehensive study of functions of proteins.
The steps of our analyses were as follows: For each pair of homologous proteins from a PFAM family, we recorded the % identical residues in the optimal alignment as a measure of sequence divergence, and we measured the functional distance between the sets of distal GO IDs associated with the two proteins. We based our definition of the distance between sets of annotations on a generalization of the simple minimum-path-length measure of the distance between two single GO ID's (see Materials and Methods).
2. Results and Discussion
We analyzed 6828 PFAM families (out of a total of 7863 in v. 18.0). The families ranged widely in size, from 2 to > 1200 proteins. Most families were relatively small; 85% of those studied had between 2 and 30 members.
For each pair of proteins within each family we determined the sequence similarity and the set of minimum distances between distal GO IDs (see Materials and Methods). For the functional distances, we distinguished, and analyzed separately, divergent functions within the same branch of the GO DAG (which we call similar functions); and entirely different functions, for which the root of DAG was the lowest common ancestor (which we call dissimilar functions).
A primary goal is to describe the relationship between sequence divergence and functional divergence.
2.1 The EF-hand family
The EF-hand family is typical and provides illustrative results. This family contains 498 proteins comprising two classes of functions: signaling and buffering/transport. EF-hand proteins involved in signaling include the best-known members of the family such as calmodulin, troponin C and S100B. These proteins typically undergo a calcium-dependent conformational change which opens a target binding site. EF-hand proteins involved in buffering/transport include calbindin D9k. These do not undergo calcium-dependent conformational changes [38, 39].
Similar and dissimilar functions have different distributions
Similar functions (Figure 3) show a dominant peak at distance = 0 (that is, identical function), and a subsidiary peak at 7. It is interesting that the different bins of sequence identity show distributions of similar shape. What distinguishes the distribution of pairs of closely-related proteins from that of pairs of distantly-related proteins is not so much a progressive increase in the set of functional distances represented as a decrease in the number of pairs with identical function. The distribution of dissimilar functions (Figure 4) of course excludes the peaks at functional distance zero or one, and is uneven, with peaks between 6 and 10 and a very few pairs at a GO distance of 12. The high spikes are artifacts of the normalization, arising in cases where there are very few data. There is a high peak at functional distance 6 for pairs of proteins with 80–100% sequence identity, signifying either recruitment or incomplete annotation (or both).
The graph combining all similar and dissimilar functions (Figure 5) showed three distinct peaks, at 0, 6, and 10; the peaks at 6 and 10 reflect dissimilar functions. Two factors contribute to the non-smoothness of the distribution: (1) proteins in the EF-hand family are annotated by a relatively small set of distal GO IDs, and (2) there is an uneven distribution of pairs of proteins with different degrees of sequence similarity. The high peaks at 6 and 10 in Figure 6 arise from pairs of proteins with 0–10% sequence identity. (The peak at distance 7, prominent in Figure 3, is less prominent in Figure 5 because there are far fewer pairs with similar functions than with dissimilar functions.)
As the sequences progressively diverge, there is a systematic decrease in the number of pairs with distance 0 (identical function) (see Table 1). The % of pairs with 0 distance is approximately constant (about 35%) for bins of sequence similarity < 40%, and then increases sharply. The distribution of similar functions in pairs of proteins with 80–100% sequence identity has a unique peak at 0.
Table 1. Increase in the percentage of similar functions with increasing sequence similarity in the experimental data for the EF-hand family. Columns: range of sequence identity; fraction of comparisons with similar functions.
The data suggest the interesting result that there is a threshold at about 40% sequence identity, at which the observed behavior changes. For pairs of proteins with 0–40% residue identity, the distribution is largely independent of sequence identity. Above 40% sequence identity, there is a significant increase in similar functions over dissimilar ones. These results, shown in Figure 5 for the EF-hand family, were also observed when all the PFAM data were combined and the contribution of dissimilar functions to each range of sequence identity calculated.
2.2 Combined PFAM data
Figure 8 shows quantitatively how the distribution of functional divergence depends on the divergence of sequence. For example, Figure 8b describes PFAM families containing between 31 and 60 proteins. The data show generally that as the sequence identity decreases, the percentage of non-identical functions (distances > 0) increases. This graph also contains an example of recruitment (the peak at distance = 6, for the proteins with 81–100% sequence similarity). For proteins with 81–100% sequence identity, 15% of the comparisons have a distance of 6.
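The kind of tabulation behind such a plot can be sketched as follows: protein pairs are grouped into bins of sequence identity, and the fraction of functional distances equal to zero is computed for each bin. This is an illustrative Python sketch only; the bin width of 20%, the counting of each reported distance as one comparison, and the example values are all assumptions rather than details taken from the study.

```python
from collections import defaultdict


def identical_function_fraction(pairs, bin_width=20):
    """pairs: iterable of (percent_identity, list_of_go_distances).

    Returns {bin_start: fraction of distances equal to 0} for non-empty bins.
    """
    counts = defaultdict(lambda: [0, 0])  # bin start -> [zero distances, all distances]
    for identity, distances in pairs:
        # 100% identity is folded into the top bin instead of opening a new one.
        start = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        for d in distances:
            counts[start][0] += d == 0
            counts[start][1] += 1
    return {start: zero / total for start, (zero, total) in sorted(counts.items())}


# Made-up example pairs: (percent identity, GO distances between annotation sets).
example = [(95.0, [0, 0]), (85.0, [0, 6]), (30.0, [0, 4, 7]), (12.0, [6, 10])]
print(identical_function_fraction(example))
```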
The data shown in Figure 8 confirm an "action zone" between 40% sequence identity and 60% sequence identity. This range of sequence identities shows the highest change in the identical functions (GO distance = 0). This suggests a threshold in the behavior: sequence divergence below 50–60% residue identity "releases" function to diverge, or alternatively, functional divergence by mechanisms other than recruitment generally requires > 40% amino acid substitution. Pairs of proteins with all values of sequence identity > 60% had nearly the same contribution from dissimilar functions. This means that for pairs of proteins with > 60% residue identity the fraction of dissimilar functions did not vary strongly with sequence divergence. A similar threshold behavior appears in the relationship between percentage of dissimilar functions and sequence identity. There is a steep increase in the percentage of dissimilar functions as the sequence identity fell below 40%.
Figure 9 shows a systematic difference between residue identities in the ranges 0–50% and 50–100%. Between 0 and 50% there is a linear decrease in the value of the mean, and almost no further change between 50 and 100%. This is consistent with a threshold of behavior change between 40% and 60% sequence identity. The data in the middle quartiles (25–75%) also decrease with increasing sequence identity, showing that most of the data are zeros and that the number of outliers also decreases as sequence identity increases.
2.3 Comparison with sequence-function correlation based on the Enzyme Commission classification
Other investigators have studied the relation of divergence of function based on the EC classification. Of course, these studies were limited to proteins with enzymatic functions. In a result typical of these studies, Wilson et al. reported a threshold at 30 – 40% sequence identity for onset of more prevalent function divergence in their comparison of sequence and function conservation using Enzyme Commission classification (See ref 33, Figure 7A and 7D).
2.4 Comparison of experimental and non-experimental annotations
It could be argued that for purposes of judging the reliability of transfer of annotation among homologous proteins, the comparisons of annotations described in the previous sections are flawed, because the data contain annotations produced by transfer. In order to explore this, we did separate calculations limited to experimentally-based annotations.
GO provides for recording the source of functional assignments, which may be experimental or inferred. The GO consortium classified possible sources of annotation, and ranked them according to suggested reliability. The most direct evidence is experimental: evidence codes TAS (Traceable Author Statement), IDA (Inferred from Direct Assay), IMP (Inferred from Mutant Phenotype), IGI (Inferred from Genetic Interaction) and IPI (Inferred from Physical Interaction). Less direct sources of annotation are ISS (Inferred from Sequence Similarity), IEA (Inferred from Electronic Annotation) and NAS (Non-Traceable Author Statement). We used these evidence codes to compare sequence-function relationships for experimental and non-experimental annotations of proteins.
We extracted the proteins of the EF-hand family for which every annotation had experimental support. This reduced the number of proteins from 498 to 47 (9.5%). We formed two mutually exclusive sets: (1) proteins with only experimentally verified annotations, and (2) proteins with no experimentally verified annotations. We collected all the GO IDs for the proteins from both sets and determined the common and different terms. The experimentally-based set had 30 unique annotations and the non-experimental set had 65 unique annotations. Both sets of annotations varied from very specific to quite general functions.
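The partition into the two sets can be illustrated with a short Python sketch. The grouping of evidence codes follows the text above; the protein names and annotations are hypothetical, and the handling of proteins with mixed evidence (placed in neither set) is an assumption about how "mutually exclusive" was applied.

```python
# Evidence codes treated as experimental, following the grouping in the text.
EXPERIMENTAL_CODES = {"TAS", "IDA", "IMP", "IGI", "IPI"}


def split_by_evidence(annotations):
    """annotations: protein -> list of (go_id, evidence_code).

    Returns (proteins with only experimental annotations,
             proteins with no experimental annotations).
    """
    only_experimental, no_experimental = set(), set()
    for protein, terms in annotations.items():
        codes = {code for _, code in terms}
        if not codes:
            continue  # unannotated proteins belong to neither set
        if codes <= EXPERIMENTAL_CODES:
            only_experimental.add(protein)
        elif codes.isdisjoint(EXPERIMENTAL_CODES):
            no_experimental.add(protein)
        # Proteins with mixed evidence fall into neither set (assumption).
    return only_experimental, no_experimental


# Hypothetical toy annotations.
toy = {
    "PROT_A": [("GO:0005509", "IDA")],
    "PROT_B": [("GO:0005509", "IEA")],
    "PROT_C": [("GO:0005509", "IDA"), ("GO:0003755", "IEA")],
}
print(split_by_evidence(toy))
```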
Many GO IDs in the non-experimental set did not appear in the experimental set. This raises the question of what these annotations were based on.
The set of non-experimentally-based annotations included more precise functions than the experimental set.
For instance, there is solid experimental evidence that proteins of the EF-hand families bind calcium and zinc; however, some proteins of the EF-hand family are annotated as binding magnesium and iron. The magnesium and iron binding annotations are given the evidence code IEA (Inferred from Electronic Annotation). Although the non-experimental and experimental annotations share the idea of cation binding, the details – the identity of the metals – are different. Moreover, the non-experimental annotations include specific ligands for which experimental evidence has not been attributed to homologues.
For another example, the non-experimental set contained a GO ID which did not appear with experimental support as the annotation of any homologue: peptidyl-prolyl cis-trans isomerase activity (GO:0003755). The protein FKBP9_MOUSE is annotated with this function with the evidence code IEA. However, a literature search revealed that Shadidy et al. reported in 1999 that FKBP9_MOUSE contains an EF-hand domain and showed experimentally-measured peptidyl-prolyl cis-trans isomerase activity . The annotation correctly assigned the function but did not report that the assignment was grounded in experimental evidence.
Such observations persuaded us to leave the experimental/non-experimental comparison at the qualitative level. We conclude that the results of Figure 8 based on a mixture of experimental and non-experimental annotations: (1) probably underestimate, to some extent, the extent of divergence of protein function as a function of amino acid sequence divergence, and (2) probably overestimate, to some extent, the danger of introduction of error in annotation transfer.
Available data permit a quantitative study of the relation between divergence of sequence and divergence of function in proteins, based on the Gene Ontology functional classification.
Sequence divergence is generally accompanied by higher likelihood of divergence in function, although the phenomenon of recruitment provides exceptions in which proteins of similar sequence can perform very different functions.
There is a threshold at about 50% sequence similarity below which function divergence is enhanced. This is consistent with the conclusions of the previous authors, who used the EC functional classification.
If we were given only the amino acid sequence of a protein of unknown function, and asked to estimate the probability that transferring annotation from the closest homologue in the databanks would not lead to annotation errors, we would base the answer on the distribution of similar and dissimilar functions in homologous proteins only. The variation among different families suggests that it is worth looking at the families individually. This is consistent with the conclusions of Ranea et al. , who also observed that families evolve at different rates depending on their functional class.
Databases are prone to error, because the recording of experimental sources of functional annotation is a labor-intensive human activity, and because once introduced, errors tend to propagate. Given the crucial importance of annotation in biomedical research, the development of objective methods for quality control and correction of annotations in databases has been recognized as essential.
Sources of data
We downloaded the Gene Ontology network file from the GO consortium website  and PFAM domains from the Washington University, St Louis, PFAM server. The March 2005 release of PFAM contained 7868 protein families. PFAM contains seed and full alignments of proteins in each family. We used the seed alignments, which are high-quality alignments that do not change substantially between releases. PFAM uses these alignments as the basis for doing full alignments for the respective PFAM families [42, 43].
The PIR database at Georgetown University provided the GO IDs for each protein. PIR presents GO IDs in Molecular Function, Biological Process and Cellular Component categories. We used only the Molecular Function assignments.
For each protein we identified the distal GO ID(s) in its annotation set. A distal GO ID is a GO ID included in the annotation of the protein, for which no more specific (descendant) GO ID is part of the annotation of the same protein. For example, suppose that in some database a protein is given as its functional annotation three GO IDs: 0016788 (hydrolase activity, acting on ester bonds), 0004320 (oleoyl-[acyl-carrier protein] hydrolase activity), and 0000036 (acyl carrier activity) (see Figure 3). GO ID 0004320 is a descendant of 0016788. In this chain of descent in the GO graph, 0004320 is the more precise annotation, distal to 0016788. However, 0000036 is not in the same chain of descent. From these annotations we would retain only GO:0004320 and GO:0000036 as distal GO IDs.
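The selection of distal GO IDs can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the toy parent map collapses the real chain of descent between the two hydrolase terms into a single edge.

```python
def ancestors(term, parents):
    """Return every ancestor of a term (the term itself is excluded)."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def distal_terms(annotation, parents):
    """Keep the terms that are not an ancestor of another term in the annotation."""
    implied = set()
    for term in annotation:
        implied |= ancestors(term, parents)
    return set(annotation) - implied


# Toy child -> parents map; intermediate GO terms are omitted (assumption).
TOY_GO_PARENTS = {
    "GO:0016788": {"GO:0003674"},  # hydrolase activity, acting on ester bonds
    "GO:0004320": {"GO:0016788"},  # descendant of GO:0016788
    "GO:0000036": {"GO:0003674"},  # acyl carrier activity, on a separate branch
}

annotation = ["GO:0016788", "GO:0004320", "GO:0000036"]
print(sorted(distal_terms(annotation, TOY_GO_PARENTS)))
```

In this example GO:0016788 is dropped because it is implied by its descendant GO:0004320, while GO:0000036 is kept because no other annotated term descends from it.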
As working data then, for every protein domain we had an amino acid sequence (from PFAM), and the distal GO ID(s) (from PIR). For each pair of proteins in each family, we measured the similarity of the sequence and function.
Calculation of sequence similarity
We aligned amino acid sequences by the standard dynamic programming algorithm using the BLOSUM62 matrix. We used the alignment program MUSCLE, with default gap weighting. A comparison of sequence similarities computed from MUSCLE pairwise alignments with values computed from the alignments inherent in PFAM showed that the differences were generally so small as to make no significant differences in the results presented (see additional file 1). We note that the pairwise sequence alignment approach provides a measure of sequence similarity that is stable and independent of any realignment or reclassification that PFAM may adopt.
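Given a pairwise alignment, the percentage of identical residues can be computed as in the Python sketch below. The choice of denominator is an assumption: here only columns in which neither sequence has a gap are counted, and other conventions (for example, the length of the shorter sequence) would give somewhat different values. The example sequences are made up.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identical residues over ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # assumption: gapped columns are excluded from the count
        compared += 1
        identical += res_a == res_b
    return 100.0 * identical / compared if compared else 0.0


# Toy alignment: 5 ungapped columns, 4 of them identical.
print(percent_identity("MKV-LA", "MRVQLA"))
```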
Calculation of functional similarity
We represented the functional divergence of the two proteins by the distance between their sets of annotations; that is, between the sets of distal GO IDs assigned to each protein. The measure of the distance between sets of annotations was based on a measure of the distance between individual GO IDs. We defined the distance between two individual GO IDs as the number of edges in a minimal-length path between the two nodes in the GO DAG that passes through the lowest common ancestor of the two nodes. Based on this, we needed to define a measure of distance between sets of distal GO IDs, as might appear in the annotation of a protein with multiple functions.
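The distance between two individual GO IDs can be sketched in Python as below. Because a node in a DAG may have several parents, the lowest common ancestor need not be unique; this sketch takes the minimum over all common ancestors, which is one reading of the definition above. The helper that flags "dissimilar" pairs (only the root in common) and the toy graph are likewise illustrative assumptions.

```python
from collections import deque


def upward_distances(term, parents):
    """Fewest edges from a term up to itself (0) and to each of its ancestors."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def go_distance(term_a, term_b, parents):
    """Length of the shortest path between two terms via a common ancestor."""
    up_a = upward_distances(term_a, parents)
    up_b = upward_distances(term_b, parents)
    common = up_a.keys() & up_b.keys()
    if not common:
        raise ValueError("terms share no ancestor")
    return min(up_a[c] + up_b[c] for c in common)


def is_dissimilar(term_a, term_b, parents, root):
    """True when the root is the only ancestor the two terms share."""
    common = upward_distances(term_a, parents).keys() & upward_distances(term_b, parents).keys()
    return common == {root}


# Hypothetical toy DAG: child -> set of parents.
TOY_DAG = {"A": {"root"}, "B": {"root"}, "A1": {"A"}, "A2": {"A"}, "B1": {"B"}}

print(go_distance("A1", "A2", TOY_DAG))            # siblings under A
print(go_distance("A1", "B1", TOY_DAG))            # path runs through the root
print(is_dissimilar("A1", "B1", TOY_DAG, "root"))  # different branches
```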
Case (a) is the simplest: each protein has one annotation and the minimum path length between them has length 4. The distance function should report this value.
In case (b), each protein has two annotated functions, appearing on two branches of the graph. However, the two proteins show similarity of both functions: for each X there is an O at distance 2. Although the distance between the leftmost X and the rightmost O is 4, it does not seem reasonable to report a functional distance of 4 between these two proteins. Therefore it would not be suitable to define a distance function as the set of minimal path lengths between every X, O pair.
In case (c), one protein has two annotated functions but the other has only one. The proteins share a similar function: the rightmost X and the O have distance 2. However, in this case, compared to (b), it is relevant that the distance from the leftmost node labeled X to the closest node labeled O is 4. This is a genuine difference between the annotated functions of the two proteins. It may be that protein X was recruited for a novel function not similar to the function of protein O. It is also possible that protein O shares the other function with protein X but is not so annotated. In any event, the distance function should report both 4 and 2. This implies that it would not be suitable to define a distance function as the minimum X-O distance for all X-O pairs.
We therefore adopted the following definition of the difference in functional annotations between two proteins, X and O. For each distal GO ID X, we determine the minimum distance to all the distal GO IDs O, and for each distal GO ID O, we determine the minimum distance to all the distal GO IDs X. This set of values represents the distance between the annotation sets X and O. In the cases shown in Figure 12, the distances reported would be: (a) 4, (b) 2, 2, (c) 2, 2, 4.
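The definition can be sketched in Python as follows, using case (c) as the worked example. The sketch takes a literal reading of the definition and keeps one value for every distal GO ID on either side; how repeated values are counted is an assumption, so multiplicities may differ from the listing above for other cases. The term names and the toy distance table are hypothetical.

```python
def annotation_set_distances(terms_x, terms_o, pair_distance):
    """For each term on either side, the minimum distance to the other side."""
    if not terms_x or not terms_o:
        raise ValueError("both proteins need at least one distal GO ID")
    from_x = [min(pair_distance(x, o) for o in terms_o) for x in terms_x]
    from_o = [min(pair_distance(o, x) for x in terms_x) for o in terms_o]
    return sorted(from_x + from_o)


# Toy pairwise distances for case (c): protein X has two functions, protein O one.
TOY_DISTANCES = {frozenset(("x1", "o1")): 4, frozenset(("x2", "o1")): 2}


def toy_distance(a, b):
    """Symmetric lookup in the toy table; a term is at distance 0 from itself."""
    return 0 if a == b else TOY_DISTANCES[frozenset((a, b))]


print(annotation_set_distances(["x1", "x2"], ["o1"], toy_distance))
```

Reporting the whole set of values, rather than a single minimum or maximum, preserves both the shared function (the distances of 2) and the unmatched function (the distance of 4).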
Any classification scheme may vary in the fineness with which it distinguishes different regions of its domain. Because for protein function (unlike for sequence or structure) there is no natural metric, there is no direct way to calibrate distances between nodes in either the EC or GO classifications. The problem is somewhat more acute for the GO classification because of the variable depths of the DAG. We explored the possibility of "normalizing" the GO distances according to the local depth of the DAG, but were unable to do this in a consistent way, largely because of the non-uniqueness of the lengths of the paths from any node up to the root.
In order to demonstrate that the problem will not seriously affect the results in at least most cases, we did the following calculation: For all distal nodes in the GO DAG (that is, all nodes that had no lower nodes = nodes farther from the root) we determined the minimal-length path from the root to the distal node. The result was that for 85% of the distal nodes, the minimal path lengths to the root were between 4 and 6. For these cases any reasonable "normalization factor" will vary only between 0.8 and 1.2. Nevertheless, for the particular application: "Given a novel sequence, what is the likelihood of error in transferring annotation between homologues?" we explicitly recommend a homologous-family-by-homologous family approach, in which one would in most cases be comparing functions in similar sections of the GO DAG. For these, the fineness of the distinctions between related functions would be comparable, and the differences in overall depths of different nodes would be a controlled quantity.
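The depth calculation described above can be sketched in Python. The toy graph and names are hypothetical; the sketch finds every node without children and reports the length of its shortest path to the root, from which the fraction of distal nodes in a given depth range follows directly.

```python
from collections import deque


def min_depth_to_root(term, parents, root):
    """Fewest edges from a term up to the root (breadth-first search)."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        if node == root:
            return dist[node]
        for parent in parents.get(node, ()):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    raise ValueError(f"{term} is not connected to {root}")


def distal_node_depths(parents, root):
    """Minimal depth of every node that has no children."""
    has_children = {p for ps in parents.values() for p in ps}
    leaves = [node for node in parents if node not in has_children]
    return {leaf: min_depth_to_root(leaf, parents, root) for leaf in leaves}


# Hypothetical toy DAG: "f" reaches the root by a short and a long route.
TOY_TREE = {"a": {"root"}, "b": {"a"}, "c": {"b"}, "d": {"c"},
            "e": {"d"}, "f": {"e", "a"}, "g": {"e"}}

depths = distal_node_depths(TOY_TREE, "root")
in_range = sum(4 <= d <= 6 for d in depths.values()) / len(depths)
print(depths, in_range)
```

The node "f" shows the non-uniqueness mentioned above: its paths to the root have lengths 2 and 6, and only the minimal one is reported.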
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.