Volume 14 Supplement 3
Function prediction from networks of local evolutionary similarity in protein structure
© Erdin et al.; licensee BioMed Central Ltd. 2013
Published: 28 February 2013
Annotating protein function with both high accuracy and sensitivity remains a major challenge in structural genomics. One proven computational strategy has been to group a few key functional amino acids into templates and search for these templates in other protein structures, so as to transfer function when a match is found. To this end, we previously developed Evolutionary Trace Annotation (ETA) and showed that diffusing known annotations over a network of template matches on a structural genomic scale improved predictions of function. In order to further increase sensitivity, we now let each protein contribute multiple templates rather than just one, and also let the template size vary.
Retrospective benchmarks in 605 Structural Genomics enzymes showed that multiple templates increased sensitivity by up to 14% when combined with single template predictions even as they maintained the accuracy over 91%. Diffusing function globally on networks of single and multiple template matches marginally increased the area under the ROC curve over 0.97, but in a subset of proteins that could not be annotated by ETA, the network approach recovered annotations for the most confident 20-23 of 91 cases with 100% accuracy.
We improve the accuracy and sensitivity of predictions by using multiple templates per protein structure when constructing networks of ETA matches and diffusing annotations.
Predicting function for the ever-increasing number of sequences and structures produced by Genomics Centers remains a major problem [1, 2]. Fewer than 0.5% of current UniProt database protein annotations come from experiments, showing that, despite recent advances in high-throughput functional screening, experiments are far from sufficiently scalable to characterize protein function on a massive scale. For the foreseeable future, therefore, computational annotation of protein function will remain essential.
The computational tools for this purpose are based on sequence or structure. Here, we focus on the latter, with the rationale that structure is more conserved than sequence during evolutionary divergence. As a result, any global similarities that exist between structures (as described by CATH or SCOP codes) may indicate functional similarities that are not recognizable from sequence comparisons alone. One pitfall, however, is that proteins with the same fold, for example TIM barrels, can often perform unrelated functions. Thus, fold-based prediction may not be specific for a majority of functions. Another pitfall is that some functions can be carried out by different folds. For example, subtilisin and trypsin are both serine proteases with different structures. However, both share an identical catalytic triad of His-Ser-Asp, suggesting that functional similarity may be detected from local similarity of just a few key catalytic residues [13–16].
This observation spurred template approaches on the hypothesis that if relatively few residues determine binding or catalytic activity, then the presence of these identical residues in identical geometries may suggest an identical function even if the structure is different [13–16]. This raises a series of challenges: to detect those key functional residues, to locate them in other structures, and to make sure that these matches are not random. The Evolutionary Trace Annotation (ETA) pipeline was developed to address each of these problems. ETA generates templates using Evolutionary Trace (ET) to identify putative functional sites, and their key residues, in a protein structure [18–20]. This use of ET obviates the need for any prior information on functional mechanisms in order to build templates. These ET-based templates are then matched to other structures, and matched pairs of structures are accepted when the matched sites have sufficient evolutionary and geometric similarities. The requirement for evolutionary similarity specifically ensures that these matches are at sites that are functionally important in both proteins, which decreases the likelihood that the match is due to random chance. Finally, ETA transfers function between the matched proteins, from an annotated structure to an unannotated structure. A public ETA webserver is available for that purpose.
We systematically tested this protocol for enzymes [17, 23, 24] and non-enzymes using Enzyme Commission (EC) numbers and Gene Ontology terms as functional classifications. The accuracy was 92% and 94% for enzymes and non-enzymes, respectively, with sensitivity near 50% in both [23, 24]. To raise sensitivity, we then pooled together all ETA matches into a network of protein structures and let functional information diffuse globally within it from proteins of known function to unannotated ones. The accuracy improved to 96% at 65% coverage in a test set of 1217 Structural Genomics enzymes. Similarly, others have also used network analyses to link functional information to unknown proteins in three diverse enzyme superfamilies.
In this work, we aim to raise sensitivity further by allowing more than one template for each protein. A single template may on occasion fail to capture a true functional site, and if a structure has multiple functional sites, secondary functional annotations would be missed. To further relax our requirements, we also test smaller templates with five residues instead of the default six. In a test set of 605 Structural Genomics enzymes, the combination of these innovations increased sensitivity by 14% with a modest decrease in accuracy (5%) over the default ETA. Moreover, when we applied network diffusion, performance rose further to yield an area-under-the-curve of 0.97 regardless of the ETA variant.
Results and discussion
We can extend this approach by organizing all matches into a network. In this network, each protein structure is a node; each ETA match is a symmetric edge; and all known functions are treated as distinct node labels (see Figure 1B). We calculate an edge weight from the RMSD and the differences in ET scores in an ETA match. Then, for each function, we label nodes based on whether they are known to have a given function, known not to have it, or whether this is unknown. We can then set up a minimization problem to find a set of diffused labels that aims to preserve the initial labeling while at the same time assigning neighboring nodes similar labels (see Methods). This process is repeated for all possible labels, i.e. functions, so as to yield a normalized confidence score for every possible pairing of nodes and labels, i.e. for every possible protein function prediction. We then evaluate the accuracy of the highest confidence prediction.
Next, we sought to assess performance in a subgroup of especially difficult cases. To this end, we identified 73 query proteins that share no more than 30% sequence identity with any annotated protein in the target set with the same function. The results (Figure 2B) show that smaller, multiple templates increased sensitivity with only a slight, tolerable decrease in accuracy. The best overall performance in terms of F-measure came from single six-residue templates, followed by multiple six-residue templates (see Figure 2B).
In order to compare each template selection mode with a sequence-based annotation strategy, we identified, for each protein in both query sets, the closest hit in the target set based on sequence identity between the query and the target. We label this strategy SeqID. Each template selection mode outperformed SeqID in terms of accuracy and F-measure in both test sets (see Figure 2). For the non-trivial proteins, the four ETA template selection modes yielded an average F-measure of 0.643, versus 0.017 for SeqID. This result shows that our template-based strategy is especially useful if the protein of interest does not have close homologs.
Prediction similarity of four ETA methods with one another (A) in cases of 605 Structural Genomics enzymes, and (B) in cases of 73 non-trivial Structural Genomics enzymes.
This led us to an iterative annotation strategy that applies first the template selection method with the best accuracy, followed in further rounds by the next best, and so on. The methods were ordered by their accuracies as 6R > 5R > M6R > M5R following Figure 2A. As a result, sensitivity rose to 83.8% and accuracy to 91.4%. In Figure 2B, the accuracy order is 6R > M6R > 5R > M5R. This strategy achieved accuracy of 67.4% and sensitivity of 49.2% in this particularly challenging test set.
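The iterative strategy described above can be sketched as a simple cascade: methods are tried in decreasing order of accuracy, and the first one that yields a prediction is kept. This is only an illustrative sketch; the function name `cascade_predict` and the toy lookup tables standing in for real ETA runs are assumptions, not part of the ETA software.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: a predictor maps a protein id to an EC number,
# or returns None when it makes no prediction.
Predictor = Callable[[str], Optional[str]]


def cascade_predict(
    protein_id: str, predictors: Sequence[Tuple[str, Predictor]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each method in the given order; keep the first prediction found."""
    for method_name, predict in predictors:
        ec_number = predict(protein_id)
        if ec_number is not None:
            return method_name, ec_number
    return None, None  # no method produced a prediction


# Toy lookup tables (made-up data) standing in for the four ETA modes.
toy_6r = {"P1": "3.4.21.4"}
toy_5r = {"P1": "1.1.1.1", "P2": "2.7.1.1"}
toy_m6r = {}
toy_m5r = {}

# Order follows the accuracy ranking 6R > 5R > M6R > M5R quoted in the text.
predictors = [
    ("6R", toy_6r.get),
    ("5R", toy_5r.get),
    ("M6R", toy_m6r.get),
    ("M5R", toy_m5r.get),
]
```

For the non-trivial set, the same function would be called with the predictor list reordered as 6R, M6R, 5R, M5R.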
To evaluate the ability of our confidence score to identify accurate predictions, we plotted cumulative accuracy against confidence (z-score) (Additional file 1). The predictions separate into three regions: over a confidence value of 2.0, predictions are nearly 100% accurate across all ETA modes. In the range between 2.0 and 0.5, accuracy begins to drop: In this range, predictions are 96% accurate regardless of ETA variant. Finally, below 0.5, accuracy declines steeply, with 39%, 22%, 20% and 12% accuracy for multiple five-residue, multiple six-residue, five-residue and six-residue networks, respectively.
This work aimed to increase the sensitivity of function annotation in protein structures through alternative ET-based template matching strategies. We found that using more than one template per protein raised sensitivity by 6-9% at a cost of 1-5% in accuracy. Overall, the use of multiple templates did yield a better annotation performance, as shown by the increase in F-measure. Furthermore, network diffusion based on multiple-template matches outperformed single-template network diffusion, with a 3-8% increase in AUC, in recovering the functional classification of cases where ETA completely failed. Interestingly, the simultaneous use of multiple templates arising from all of the structures under consideration was always best, and robust enough to be insensitive to the size or number of templates per protein. Thus, when functional labels are diffused over a network of ETA matches, the result plateaus at a very high AUC value between 0.96 and 0.97. Most usefully, these experiments define prediction confidence thresholds that distinguish reliable predictions from those that are less so, and from those that are likely to be no better than chance. The results show that among proteins with no ETA prediction, diffusion over a network can add novel predictions. More broadly, this strategy of competitive annotation diffusion over ETA networks should help accurate large-scale annotation of the structural proteome.
Data Sets: Our query test set contains 605 enzymes from Structural Genomics centers that cover 348 distinct full EC numbers. Every pair of proteins in this set shares less than 90% sequence identity. The non-trivial test set consists of 73 proteins from the query set. These proteins share at most 30% sequence identity with any protein of the same function at the full EC level in the target set. Target proteins were selected from the 17824 proteins of 2008PDB90 with the criterion that their truncation ratio be greater than 0.95. The truncation ratio is defined as the ratio of the number of amino acids in the structure to the actual sequence length of the protein; it shows how much of the protein is solved in the structure. 8537 proteins meet this criterion, and 3082 of them carry full EC annotations that cover 1190 distinct catalytic functions in the UniProt/Swiss-Prot/TrEMBL and PDB databases. Additional file 2 shows the composition of these three sets in terms of PFAM protein families. Only three and two proteins had unassigned PFAM accessions in the 605-query and 73-query test sets, respectively, while the 3082-protein target set had only 14 proteins with no PFAM accession. The most represented protein families in the 605-query test set and the 3082-protein target set were Aminotran_1_2 (PFAM:PF00155) with 13 proteins and Adh_short (PFAM:PF00106) with 60 proteins, respectively. The non-trivial query set of 73 proteins had three most represented protein families, Hydrolase_3 (PFAM:PF08282), Pyr_redox_2 (PFAM:PF07992) and DeoC (PFAM:PF01791), each associated with three proteins. Therefore, the most represented protein families account for 2%, 4% and 2% of the 605-protein and 73-protein query test sets and the 3082-protein target set, respectively.
Evolutionary Trace: In this work, we used the real-valued version of the Evolutionary Trace (ET) algorithm, whose detailed description can be found elsewhere [19, 35]. Briefly, the ET algorithm first identifies homologous sequences for a query protein by performing BLAST searches over the NCBI Entrez non-redundant protein sequence database. The identified BLAST hits share a minimum of 20% and a maximum of 95% sequence identity with the query sequence. Next, ClustalW generates a multiple sequence alignment of the identified sequence hits for the query protein. A phylogenetic tree is then obtained using the UPGMA algorithm. Finally, the ET algorithm calculates evolutionary ranks for each amino acid in the query sequence by combining integer ET and Shannon entropy, which quantifies the correlation of variations in the multiple sequence alignment with branching in the phylogenetic tree.
Template selection: In this work, we used five- and six-residue templates, relying on a previous study showing that these sizes maximized accuracy and sensitivity. Template creation was described in detail elsewhere. Briefly, the template picker algorithm performs the following steps. ETA first applies the real-valued version of the ET algorithm (see the above section) to assign evolutionary importance ranks to the amino acid residues in a protein structure of interest, and sorts clusters of important residues according to their evolutionary importance rank from the lowest (the most important) to the highest (the least important). The cluster size increases as the rank increases. In the case of single templates of five or six residues, ETA first identifies the cluster of at least 11 important surface residues whose accessible surface area, computed by DSSP, is greater than 2 Å². Each residue is represented by the coordinates of its Cα atom. The template selection algorithm then identifies the center of mass (CM) of the chosen cluster and takes the closest best-ranked amino acid as the first residue of the template. To identify the other residues of the template, the algorithm iteratively selects the most important remaining residues that are closest to the midpoint between the CM of the residues selected in the previous iteration and the CM of the chosen cluster. Residue positions in the templates are labeled by both the original residue types in the structure and any non-gapped residue type that is seen at least twice at that position in the multiple sequence alignment. In order to generate multiple templates, ETA first identifies a cluster containing at least 11 important surface residues as the first cluster. Next, ETA identifies other clusters, if any, that do not completely overlap with the first cluster, and that contain surface residues with better evolutionary importance ranks than those in the first cluster.
Additional templates from those clusters were generated by following the same steps as in the case of single templates. Therefore, the number of templates for a given protein was dictated by the number of distinct clusters of evolutionarily important surface residues identified for that protein. For example, if there is no distinct cluster other than the cluster chosen for single templates, the template selection algorithm yields only a single template.
Template searching: ETA uses the paired-distance matching (PDM) algorithm to probe geometric similarities between query templates and other structures. PDM first identifies all the residues in the other structure that are identical in type to the first residue of the query template. In the next iteration, PDM identifies the residues that are identical in type to the second template residue. PDM retains only the pairs of residues whose inter-residue distance is within 2.5 Å of the distance between the first and second template residues. Next, residues that are identical in type to the third template residue are identified. PDM again applies the distance constraint of 2.5 Å to all pairwise distances in order to select the third residue. In further iterations, these steps are repeated to select the other residues in the target structure. Each match is assigned a root mean square deviation (RMSD) value to quantify the geometric similarity.
Match filtering: In the next round, ETA eliminates self-matches and matches with RMSD greater than 2 Å. The remaining matches are fed into a Support Vector Machine (SVM) that is trained for enzymes. Each match is represented by a six- or seven-dimensional vector, depending on the template size. One dimension is for geometric similarity, whereas the others quantify evolutionary similarity. The latter is computed as the absolute value of the percentile-rank difference between a query template residue and the matched residue in the target structure. We use the same six-residue SVM for five-residue ETA by constructing a virtual sixth residue as the average of the other five positions. In the end, the SVM selects the significant matches.
Reciprocal match: Geometric and evolutionary similarity is probed in both directions, by matching template(s) from the query protein onto target structures and templates from target structures onto the query protein. Reciprocal matches constitute the intersection of the significant matches selected by the SVM in both directions.
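The iterative search described above can be illustrated with a small sketch. It is a simplified stand-in, not the PDM implementation used by ETA: each template position is assumed to carry a single residue type (the real templates allow several types per position), residues are reduced to Cα coordinates, and the RMSD calculation after superposition is left out. The function name `pdm_matches` and the toy coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A residue is (one-letter type, C-alpha coordinates); simplified on purpose.
Residue = Tuple[str, Tuple[float, float, float]]


def pdm_matches(
    template: Sequence[Residue], target: Sequence[Residue], tol: float = 2.5
) -> List[Tuple[int, ...]]:
    """Return tuples of target indices that match the template residue by residue.

    A candidate residue extends a partial match only if it has the required
    type and every distance to already-matched residues is within `tol`
    angstroms of the corresponding template distance.
    """
    partial: List[List[int]] = [[]]
    for t_type, t_xyz in template:
        extended: List[List[int]] = []
        for combo in partial:
            for j, (r_type, r_xyz) in enumerate(target):
                if r_type != t_type or j in combo:
                    continue
                consistent = all(
                    abs(
                        math.dist(r_xyz, target[c][1])
                        - math.dist(t_xyz, template[i][1])
                    )
                    <= tol
                    for i, c in enumerate(combo)
                )
                if consistent:
                    extended.append(combo + [j])
        partial = extended
    return [tuple(combo) for combo in partial]


# Toy data (made up): a three-residue template and a five-residue target.
toy_template = [("H", (0.0, 0.0, 0.0)), ("S", (3.0, 0.0, 0.0)), ("D", (0.0, 4.0, 0.0))]
toy_target = [
    ("G", (10.0, 10.0, 10.0)),
    ("H", (1.0, 1.0, 1.0)),
    ("S", (4.0, 1.0, 1.0)),
    ("D", (1.0, 5.0, 1.0)),
    ("S", (20.0, 1.0, 1.0)),  # right type, but far outside the distance tolerance
]
toy_matches = pdm_matches(toy_template, toy_target)
```

Pruning partial matches at every step keeps the search from enumerating all residue combinations of the target structure.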
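As a rough sketch of how a match could be turned into an SVM input, the snippet below builds the feature vector from an RMSD value and the percentile ranks of the paired residues. It assumes that the "virtual sixth residue" of a five-residue match is the mean of the five rank differences; the text does not spell out this detail, so treat it as one possible reading. The function names are hypothetical, and no SVM is trained here.

```python
from typing import List, Sequence

RMSD_CUTOFF = 2.0  # angstroms; matches above this are discarded before the SVM


def passes_rmsd_filter(rmsd: float, cutoff: float = RMSD_CUTOFF) -> bool:
    """Keep only matches whose RMSD does not exceed the cutoff."""
    return rmsd <= cutoff


def match_features(
    rmsd: float,
    query_ranks: Sequence[float],
    target_ranks: Sequence[float],
    svm_size: int = 6,
) -> List[float]:
    """Build [rmsd, |rank difference| per residue pair] for one match.

    A five-residue match is padded with a virtual sixth value so that it fits
    a six-residue SVM (assumption: the pad is the mean of the five differences).
    """
    if len(query_ranks) != len(target_ranks):
        raise ValueError("query and target must have the same number of residues")
    diffs = [abs(q - t) for q, t in zip(query_ranks, target_ranks)]
    if len(diffs) == svm_size - 1:
        diffs.append(sum(diffs) / len(diffs))
    if len(diffs) != svm_size:
        raise ValueError("unsupported template size")
    return [rmsd] + diffs


# Toy percentile ranks (made up) for a five-residue match.
toy_features = match_features(1.0, [10, 20, 30, 40, 50], [12, 18, 30, 45, 50])
```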
Function prediction: In the final round, ETA assigns votes to the functions of unique reciprocal matches and thereby suggests the function with the highest vote as a predicted function. If a particular protein is identified as a significant match several times for a query protein, its function gets only one vote.
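The reciprocal-match and voting steps can be sketched as follows. This is a minimal illustration under stated assumptions: significant matches are given as collections of protein identifiers, annotations as a hypothetical dictionary from protein identifier to a set of EC numbers, and ties between equally voted functions are broken alphabetically, a choice the text does not specify.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set


def reciprocal_matches(
    query_to_target: Iterable[str], target_to_query: Iterable[str]
) -> Set[str]:
    """Targets that are significant in both search directions."""
    return set(query_to_target) & set(target_to_query)


def predict_function(
    reciprocal: Iterable[str], annotations: Dict[str, Set[str]]
) -> Optional[str]:
    """Give each matched protein one vote per annotated function; return the winner."""
    votes: Counter = Counter()
    for target in set(reciprocal):  # a protein votes once, however often it matched
        for ec_number in annotations.get(target, set()):
            votes[ec_number] += 1
    if not votes:
        return None
    # Assumption: alphabetical tie-break among equally voted functions.
    return max(sorted(votes), key=votes.get)


# Toy data (made up). Protein "B" appears twice but votes only once.
toy_forward = ["A", "B", "B", "C"]
toy_backward = ["B", "C", "D"]
toy_annotations = {"B": {"1.1.1.1"}, "C": {"1.1.1.1"}, "D": {"2.7.1.1"}}
toy_reciprocal = reciprocal_matches(toy_forward, toy_backward)
toy_prediction = predict_function(toy_reciprocal, toy_annotations)
```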
Performance measures: We used accuracy, sensitivity and F-measure to assess ETA's performance, defined as follows: Accuracy = TP/(TP+FP) and Sensitivity = TP/(TP+FN), where TP (true positive) means that the known and predicted functions agree at the fourth EC level, FP (false positive) means that the predicted and known functions do not agree at the fourth EC level, and FN (false negative) represents the cases with no prediction. The F-measure is $F_\beta = \frac{(1+\beta^2) \cdot \mathrm{accuracy} \cdot \mathrm{sensitivity}}{\beta^2 \cdot \mathrm{accuracy} + \mathrm{sensitivity}}$, where β equals 0.5 to put more emphasis on accuracy.
Sequence identity: In order to calculate the sequence identity between two proteins, we first aligned their sequences with ClustalW. Sequence identity is defined as the ratio of the number of aligned identical residues in both sequences to the total number of aligned residues.
Prediction similarity: ETA's performance for a given template selection category was expressed as a vector of length N, where N is the size of the query test set. Each element of the vector represents a particular query protein, which is labeled with a correct, incorrect or no-prediction tag. Similarity between the performances of categories 1 and 2 is assessed by $S_{12} = \frac{1}{N}\sum_{i=1}^{N} \delta(t_i^1, t_i^2)$, where $t_i^1$ denotes the type of tag attached to the ith protein in the query set based on the category 1 annotation scheme. Then, $\delta(t_i^1, t_i^2) = 1$ if $t_i^1 = t_i^2$, else $\delta(t_i^1, t_i^2) = 0$.
ETA networks: ETA networks are constructed in three parts: query-query, query-target and target-target. In each part, an ETA variant is used to identify significant reciprocal matches. For example, in the query-query part, ETA identifies reciprocal matches for each protein in the query set among the other (non-self) proteins of the query set. The query-target part is the default ETA protocol. In the target-target part, ETA identifies reciprocal matches for each protein in the target set among the other (non-self) proteins of the target set.
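The three measures defined above are straightforward to compute from counts. The sketch below follows those definitions with β = 0.5; the function names and the example counts are illustrative only and do not correspond to any result reported here.

```python
def accuracy(tp: int, fp: int) -> float:
    """Accuracy as defined in the text: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity: TP / (TP + FN), where FN counts queries with no prediction."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(acc: float, sens: float, beta: float = 0.5) -> float:
    """Weighted F-measure; beta < 1 puts more emphasis on accuracy."""
    beta_sq = beta * beta
    denominator = beta_sq * acc + sens
    return (1 + beta_sq) * acc * sens / denominator if denominator else 0.0


# Made-up counts: 90 correct, 10 incorrect, 50 queries without a prediction.
toy_acc = accuracy(90, 10)
toy_sens = sensitivity(90, 50)
toy_f = f_measure(toy_acc, toy_sens)
```

With β = 0.5, a method that trades some sensitivity for accuracy scores higher than one that makes the opposite trade.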
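A minimal sketch of this identity calculation is shown below. It takes two already aligned sequences of equal length and assumes that "aligned residues" means alignment columns in which neither sequence has a gap; the alignment step itself (ClustalW) is not reproduced, and the function name is hypothetical.

```python
def sequence_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Fraction of identical residues among columns where both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap
    ]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return identical / len(columns)
```

For example, `sequence_identity("ACDEF", "ACDGF")` counts four identical residues over five aligned columns.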
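One plausible reading of this measure is the fraction of query proteins that receive the same tag under both template selection categories. The sketch below implements that reading; the normalization by N and the tag strings are assumptions, as is the function name.

```python
from typing import Sequence


def prediction_similarity(tags_1: Sequence[str], tags_2: Sequence[str]) -> float:
    """Fraction of query proteins with identical tags in two categories."""
    if len(tags_1) != len(tags_2) or not tags_1:
        raise ValueError("tag vectors must be non-empty and of equal length")
    agreements = sum(1 for a, b in zip(tags_1, tags_2) if a == b)
    return agreements / len(tags_1)


# Toy tag vectors (made up) for four query proteins.
toy_tags_6r = ["correct", "incorrect", "none", "correct"]
toy_tags_5r = ["correct", "none", "none", "incorrect"]
toy_similarity = prediction_similarity(toy_tags_6r, toy_tags_5r)
```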
ETA provides the rmsd and ETScore which collectively describe the quality of a template match. An ETScore reflects the total difference between the evolutionary score of matched residues. The rmsd is the geometric difference between the template and the matching structure. μrmsd is the average rmsd for all matches in the network, σrmsd is the standard deviation for all rmsds in the network, μETScore is the average ETScore for all matches in the network and σETScore is the standard deviation of all ETScores in the network.
We store the resulting network in an adjacency matrix, W. If there are n nodes, W is n by n, and Wij is set to the weight calculated above if a match is detected between proteins i and j, or to 0 if there is no such match. We then create an n-dimensional label vector y for a particular function present in our network by setting yi, 0 ≤ i < n, to 1 if node i is known to have that function, to -1 if it is known not to have that function, or to 0 if the protein has unknown function (or is a member of the test set). We use these elements to pose an optimization problem:
$$\min_{f} \; \sum_{i} (f_i - y_i)^2 + \frac{\alpha}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2$$
The n-dimensional vector f will hold the predicted functional scores after diffusion. The first term penalizes nodes that deviate highly from their initial label, encouraging the system to keep the initial knowledge. The second term penalizes neighbors that have different labels according to the weight of their connection, encouraging functions to propagate through the network. The parameter α trades off these two conditions. This can be efficiently solved as follows: $f = (I + \alpha L)^{-1} y$, where L = D - W is the Laplacian matrix (D is the diagonal matrix, Dii = Σj wij). We repeat this process for every function represented in the network and compare the resulting values in f for each node with unknown function. To do this, we normalize the values in f across all nodes with unknown function (those entries in y initially set to 0) by taking zi = (fi - fμ)/fσ, where fμ is the average fi across all proteins with unknown function and fσ is the standard deviation of fi across all proteins with unknown function. This results in a z-score, which we use as a measure of confidence. The function with the largest z-score for node i becomes our prediction. The Cytoscape plugin for ETA networks is available at http://mammoth.bcm.tmc.edu/networks/. Multiple-template and five-residue template extensions will be incorporated into http://mammoth.bcm.tmc.edu/eta/.
The authors gratefully acknowledge grant support from the National Institutes of Health, NIH GM079656 and GM066099, and from the National Science Foundation, NSF CCF 0905536 and DBI 1062455. This research was funded partially by a training fellowship from the Keck Center of the Gulf Coast Consortia, on the Training Program in Biomedical Informatics, National Library of Medicine (NLM) T15LM007093.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 3, 2013: Proceedings of Automated Function Prediction SIG 2011 featuring the CAFA Challenge: Critical Assessment of Function Annotations. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S3.
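The diffusion step can be illustrated with a small, dependency-free sketch. It solves the linear system (I + αL) f = y, which is the stationarity condition of the quadratic objective described above, and then converts the scores of unlabeled nodes to z-scores. The edge weights, the value of α, the use of the population standard deviation, and all function names are assumptions made for illustration; a real network would call for a sparse linear solver rather than the plain Gaussian elimination used here.

```python
import statistics
from typing import Dict, List, Sequence


def solve_linear(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def diffuse(W: Sequence[Sequence[float]], y: Sequence[float], alpha: float = 1.0) -> List[float]:
    """Return f satisfying (I + alpha * L) f = y, with L = D - W."""
    n = len(y)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        degree = sum(W[i])  # D_ii is the sum of weights on row i
        for j in range(n):
            laplacian_ij = (degree if i == j else 0.0) - W[i][j]
            A[i][j] = (1.0 if i == j else 0.0) + alpha * laplacian_ij
    return solve_linear(A, [float(v) for v in y])


def z_scores(f: Sequence[float], y: Sequence[float]) -> Dict[int, float]:
    """Normalize diffused scores over the nodes whose initial label is 0."""
    unknown = [i for i, label in enumerate(y) if label == 0]
    values = [f[i] for i in unknown]
    mean = sum(values) / len(values)
    spread = statistics.pstdev(values)  # assumption: population standard deviation
    return {i: (f[i] - mean) / spread if spread else 0.0 for i in unknown}


# Toy chain network 0 - 1 - 2 with unit weights; node 0 carries the function.
toy_W = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
toy_y = [1, 0, 0]
toy_f = diffuse(toy_W, toy_y, alpha=1.0)
toy_z = z_scores(toy_f, toy_y)
```

In the toy chain, the unlabeled node adjacent to the annotated one receives the higher diffused score, as expected from the smoothness term.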
1. Huang S: The practical problems of post-genomic biology. Nat Biotechnol. 2000, 18 (5): 471-472. 10.1038/75235.
2. Eisenberg D, Marcotte EM, Xenarios I, Yeates TO: Protein function in the post-genomic era. Nature. 2000, 405 (6788): 823-826. 10.1038/35015694.
3. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40 (D1): D71-D75.
4. Rentzsch R, Orengo CA: Protein function prediction-the power of multiplicity. Trends Biotechnol. 2009, 27 (4): 210-219. 10.1016/j.tibtech.2009.01.002.
5. Lee D, Redfern O, Orengo C: Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol. 2007, 8 (12): 995-1005. 10.1038/nrm2281.
6. Chothia C, Lesk AM: The relation between the divergence of sequence and structure in proteins. Embo J. 1986, 5 (4): 823-826.
7. Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, Dibley M, Redfern O, Pearl F, Nambudiry R, Reid A: The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 2007, 35 (Database): D291-297. 10.1093/nar/gkl959.
8. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247 (4): 536-540.
9. Erdin S, Lisewski AM, Lichtarge O: Protein function prediction: towards integration of similarity metrics. Curr Opin Struct Biol. 2011, 21 (2): 180-188. 10.1016/j.sbi.2011.02.001.
10. Nagano N, Orengo CA, Thornton JM: One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. J Mol Biol. 2002, 321 (5): 741-765. 10.1016/S0022-2836(02)00649-6.
11. Redfern O, Dessailly B, Orengo C: Exploring structure and function paradigm. Curr Op Struct Biol. 2008, 18: 394-402. 10.1016/j.sbi.2008.05.007.
12. Dodson G, Wlodawer A: Catalytic triads and their relatives. Trends Biochem Sci. 1998, 23 (9): 347-352. 10.1016/S0968-0004(98)01254-7.
13. Wallace AC, Laskowski RA, Thornton JM: Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Sci. 1996, 5 (6): 1001-1013.
14. Shulman-Peleg A, Nussinov R, Wolfson HJ: Recognition of functional sites in protein structures. J Mol Biol. 2004, 339 (3): 607-633. 10.1016/j.jmb.2004.04.012.
15. Polacco BJ, Babbitt PC: Automated discovery of 3D motifs for protein function annotation. Bioinformatics. 2006, 22 (6): 723-730. 10.1093/bioinformatics/btk038.
16. Meng EC, Polacco BJ, Babbitt PC, Rigden DJ: 3D Motifs. From Protein Structure to Function with Bioinformatics. Edited by: Rigden DJ. 2009, Springer Netherlands, 187-216.
17. Kristensen DM, Ward RM, Lisewski AM, Erdin S, Chen BY, Fofanov VY, Kimmel M, Kavraki LE, Lichtarge O: Prediction of enzyme function based on 3D templates of evolutionarily important amino acids. BMC Bioinformatics. 2008, 9: 17. 10.1186/1471-2105-9-17.
18. Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996, 257 (2): 342-358. 10.1006/jmbi.1996.0167.
19. Mihalek I, Res I, Lichtarge O: A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol. 2004, 336 (5): 1265-1282. 10.1016/j.jmb.2003.12.078.
20. Wilkins AD, Lua R, Erdin S, Ward RM, Lichtarge O: Sequence and structure continuity of evolutionary importance improves protein functional site discovery and annotation. Protein Sci. 2010, 19 (7): 1296-1311. 10.1002/pro.406.
21. Kristensen DM, Chen BY, Fofanov VY, Ward RM, Lisewski AM, Kimmel M, Kavraki LE, Lichtarge O: Recurrent use of evolutionary importance for functional annotation of proteins based on local structural similarity. Protein Sci. 2006, 15 (6): 1530-1536. 10.1110/ps.062152706.
22. Ward RM, Venner E, Daines B, Murray S, Erdin S, Kristensen DM, Lichtarge O: Evolutionary Trace Annotation Server: automated enzyme function prediction in protein structures using 3D templates. Bioinformatics. 2009, 25 (11): 1426-1427. 10.1093/bioinformatics/btp160.
23. Ward RM, Erdin S, Tran TA, Kristensen DM, Lisewski AM, Lichtarge O: De-orphaning the structural proteome through reciprocal comparison of evolutionarily important structural features. PLoS ONE. 2008, 3 (5): e2136. 10.1371/journal.pone.0002136.
24. Erdin S, Ward RM, Venner E, Lichtarge O: Evolutionary trace annotation of protein function in the structural proteome. J Mol Biol. 2010, 396 (5): 1451-1473. 10.1016/j.jmb.2009.12.037.
25. International Union of Biochemistry and Molecular Biology. Nomenclature Committee. Webb EC: Enzyme nomenclature 1992: recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzymes. 1992, San Diego: Academic Press.
26. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
27. Venner E, Lisewski AM, Erdin S, Ward RM, Amin SR, Lichtarge O: Accurate protein structure annotation through competitive diffusion of enzymatic functions over a network of local evolutionary similarities. PLoS One. 2010, 5 (12): e14286. 10.1371/journal.pone.0014286.
28. Brown SD, Babbitt PC: Inference of Functional Properties from Large-scale Analysis of Enzyme Superfamilies. J Biol Chem. 2012, 287 (1): 35-42. 10.1074/jbc.R111.283408.
29. Byres E, Alphey MS, Smith TK, Hunter WN: Crystal structures of Trypanosoma brucei and Staphylococcus aureus mevalonate diphosphate decarboxylase inform on the determinants of specificity and reactivity. J Mol Biol. 2007, 371 (2): 540-553. 10.1016/j.jmb.2007.05.094.
30. Voynova NE, Fu Z, Battaile KP, Herdendorf TJ, Kim JJ, Miziorko HM: Human mevalonate diphosphate decarboxylase: characterization, investigation of the mevalonate diphosphate binding site, and crystal structure. Arch Biochem Biophys. 2008, 480 (1): 58-67. 10.1016/j.abb.2008.08.024.
31. Thoden JB, Holden HM, Firestine SM: Structural analysis of the active site geometry of N5-carboxyaminoimidazole ribonucleotide synthetase from Escherichia coli. Biochemistry. 2008, 47 (50): 13346-13353. 10.1021/bi801734z.
32. Goihberg E, Dym O, Tel-Or S, Levin I, Peretz M, Burstein Y: A single proline substitution is critical for the thermostabilization of Clostridium beijerinckii alcohol dehydrogenase. Proteins. 2007, 66 (1): 196-204.
33. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
34. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K: The Pfam protein families database. Nucleic Acids Res. 2010, 38 (Database): D211-222. 10.1093/nar/gkp985.
35. Wilkins A, Erdin S, Lua R, Lichtarge O: Evolutionary trace for prediction and redesign of protein functional sites. Methods Mol Biol. 2012, 819: 29-42. 10.1007/978-1-61779-465-0_3.
36. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
37. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.
38. Waterman M: 2000, London/Boca Raton: Chapman & Hall/CRC.
39. Shannon CE, Weaver W: 1949, Urbana: University of Illinois Press.
40. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-2637. 10.1002/bip.360221211.
41. Shin H, Lisewski AM, Lichtarge O: Graph sharpening plus graph integration: a synergy that improves protein functional classification. Bioinformatics. 2007, 23 (23): 3217-3224. 10.1093/bioinformatics/btm511.
42. Bachman BJ, Venner E, Lua RC, Erdin S, Lichtarge O: ETAscape: analyzing protein networks to predict enzymatic function and substrates in Cytoscape. Bioinformatics. 2012, 28 (16): 2186-2188. 10.1093/bioinformatics/bts331.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.