Effective inter-residue contact definitions for accurate protein fold recognition
© Yuan et al.; licensee BioMed Central Ltd. 2012
Received: 5 March 2012
Accepted: 29 October 2012
Published: 9 November 2012
Effective encoding of residue contact information is crucial for protein structure prediction because, unlike other commonly used scoring terms, it captures long-range residue interactions. Residue contact information can be incorporated into structure prediction in several ways: as statistical potentials, or as constraints in ab initio structure prediction. To seek the most effective definition of residue contacts for template-based protein structure prediction, we evaluated 45 different contact definitions, varying the bases of contacts and the distance cutoffs, in terms of their ability to identify proteins of the same fold.
We found that, overall, the residue contact pattern distinguishes protein folds best when contacts are defined for residue pairs whose Cβ atoms are within 7.0 Å of each other. Lower fold recognition accuracy was observed when inaccurate threading alignments were used to identify common residue contacts between protein pairs. In the case of threading, alignment accuracy strongly influences the fraction of common contacts identified among proteins of the same fold, which in turn affects the fold recognition accuracy. The largest deterioration in fold recognition was observed for β-class proteins when the threading methods were used, because the average alignment accuracy was worst for this fold class. When the results of fold recognition were examined for individual proteins, we found that the effective contact definition depends on the fold of the protein. A larger distance cutoff is often advantageous for capturing the spatial arrangement of secondary structures that are not physically in contact. For capturing contacts between neighboring β strands, the distance between Cα atoms is better than the Cβ-based distance because the side chains of interacting residues on β strands sometimes point in opposite directions.
Residue contacts defined by a Cβ-Cβ distance of 7.0 Å work best overall, among the definitions tested, for identifying proteins of the same fold. We also found that effective contact definitions differ from fold to fold, suggesting that using a residue contact definition specific to each template will improve the performance of threading.
Keywords: Protein structure prediction; Threading; Fold recognition; Structural features; Residue-residue contact; Protein fold
The tertiary structure of proteins provides crucial information for understanding the molecular mechanisms of biological functions. Protein structures also serve as a platform for various branches of biotechnology, including drug design [1, 2] and protein engineering [3–5]. Although protein structures have been solved by experiments at an increasing rate, a flood of new sequences has been determined even more rapidly owing to advances in sequencing technologies [6, 7]. Taking advantage of the growing database of experimentally solved protein structures, computational structure prediction methods, especially template-based methods, are expected to play a more significant role in providing structures of newly sequenced proteins [9–12]. However, computing accurate structure models is still not always possible, especially when the available template structures do not share significant sequence similarity with a target sequence. Template-based structure prediction methods usually employ structure-based scoring terms together with sequence matching terms to enhance structure recognition and alignment accuracy [14–18]. The structure-based terms used include secondary structure prediction, main-chain angle propensity, burial/exposure status, residue depth, and the number of residue contacts for each amino acid. These terms are commonly derived from statistics of structural properties observed in representative structures (knowledge-based statistical potentials). Among the various structure-based terms, residue-residue contact potentials [21–23] are unique in that they capture long-range interactions in a protein structure. A proper encoding of residue contact information is crucial for structure prediction because, in principle, a full distance map or a residue contact map contains sufficient information for reconstructing the tertiary structure of a protein.
It has also been shown that a certain fraction of erroneous or missing contacts can be tolerated when modeling the native structure of proteins [26–28]. When contact information is used as constraints in an “ab initio” structure prediction method, even very sparse residue contact information, for example one contact for every eight residues in a protein sequence, is sufficient to reconstruct the native structure. Correct identification of residue contacts is also important for template-based structure prediction, since contact maps are usually well conserved between proteins of the same fold even at very low sequence identity. There are two strategies for using residue contact information in structure prediction. One is to predict residue contacts from a protein sequence [31–37] and use them as constraints or as an additional scoring term in a structure prediction procedure. The other is to employ a knowledge-based statistical residue contact potential that takes into account the general propensity of residue interactions. Various types of contact potentials have been proposed and applied to protein structure prediction [21–23, 39, 40]. They share the same principle but vary in the details of their design. For example, they differ in the definition of residue contacts, in the reference state, and in whether they consider dependence on distance and orientation. There are also contact potentials that consider more than two residues in contact [41, 42]. Here, we examined various definitions of residue contacts to identify the most effective definitions in the context of fold recognition. In contrast to previous works that evaluated contact maps in terms of the accuracy of protein structure reconstruction [26–28], we examine which definitions of residue contacts can effectively distinguish proteins of the same fold from those of other folds. Thus, the information contained in residue contacts that is specific to each protein fold is evaluated in a purely practical scenario of fold recognition.
Concretely, we prepared 45 different contact definitions that combine three different contact bases, i.e. Cα, Cβ, and heavy atoms, with 15 distance cutoffs. Using these 45 definitions, we examined how well the contact maps produced by each definition can distinguish proteins of the same fold from others. The similarity of the contact maps of two proteins is defined as the fraction of common contacts between the two proteins, where equivalent residues are identified either by structural superimposition or by a threading method. The purpose of using threading methods is to simulate the actual situation of threading, where an alignment between a query sequence and a template structure is not always accurate. We found that distance cutoffs of 7.5/7.0 Å, 7.0/6.5 Å, and 4.5/5.0 Å perform best for contact definitions using Cα, Cβ, and heavy atoms, respectively, for identifying protein pairs of the same fold. These cutoffs worked consistently well when threading-based alignments were used to identify equivalent residues in protein pairs. On average, contact maps effectively distinguish proteins of the same fold from others when contacting residue pairs occupy 4.1–6.9% of the whole contact map. We also found that effective contact definitions differ from fold to fold, suggesting that using a residue contact definition specific to each template will improve threading performance.
Structural retrieval performance using different contact definitions
AUC values of the best contact cutoff values for the three alignment methods
Heavy atom 4.5/5.0 Å
Structure retrieval with common contacts when threading alignments were used
Contact map occupancy
Structural retrieval evaluated with TM-score
Fold recognition using residue contacts of different sequence separation ranges
Fold recognition with relaxed contact matching
We further examined fold recognition with a relaxed definition of common contacts. A pair of residue contacts in two proteins is considered common when the contacts occur within ±1 residue of each other in a given structural alignment. Although the results do not differ much from those obtained with the original definition of common contacts, “blurring” contacts made fold recognition slightly worse for all three types of alignments. AUC decreased from 0.907 to 0.888 for the TM-align alignments, from 0.847 to 0.837 for the HHpred alignments, and from 0.717 to 0.699 for the SUPRB alignments. The AUC values are for the contact definition of Cβ-Cβ 6.5 Å.
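The relaxed matching described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that contacts are stored as sets of index pairs (i, j) with i < j, that the alignment is a dictionary from query residue index to template residue index, and that the ±1 tolerance is applied on the template side after mapping through the alignment. The function name and data layout are hypothetical.

```python
def relaxed_common_contacts(contacts_a, contacts_b, alignment, tolerance=1):
    """Return the query contacts that have a template contact within +/- tolerance.

    contacts_a, contacts_b: sets of (i, j) residue index pairs with i < j
                            (hypothetical layout, assumed for this sketch).
    alignment: dict mapping a query residue index to its aligned template index.
    tolerance: allowed shift in residues; 0 reproduces strict matching.
    """
    shifts = range(-tolerance, tolerance + 1)
    common = set()
    for i, j in contacts_a:
        # Both residues of the query contact must be aligned.
        if i not in alignment or j not in alignment:
            continue
        m, n = alignment[i], alignment[j]
        # Accept the contact if any template pair near (m, n) is in contact.
        if any(tuple(sorted((m + dm, n + dn))) in contacts_b
               for dm in shifts for dn in shifts):
            common.add((i, j))
    return common


if __name__ == "__main__":
    query = {(0, 5), (1, 8)}
    template = {(11, 15), (20, 30)}
    aln = {0: 10, 5: 15, 1: 11, 8: 18}
    print(relaxed_common_contacts(query, template, aln))               # relaxed
    print(relaxed_common_contacts(query, template, aln, tolerance=0))  # strict
```

Setting `tolerance=0` gives the strict definition, so the same function can be used to compare the two matching schemes on identical alignments.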
Fold recognition for different structural classes
Best contact definitions for individual folds
In this work, we tested 45 different residue contact definitions in the context of fold recognition. To investigate the pure ability of contact patterns to distinguish folds, we introduced the fraction of common contacts (FCC) of protein pairs and examined how well FCC computed with different definitions selects proteins of the same fold from the rest of the protein pairs of different folds. To examine how much incorrect alignments in threading affect fold recognition accuracy, we also used two threading methods, HHpred and SUPRB, to determine corresponding residues of proteins. We found that, overall, the Cβ-Cβ distance of 7.0 Å works best for identifying proteins of the same fold, consistently for structural alignments and threading alignments. A qualitative difference between the threading alignments and the structural alignments is that the former prefer larger distance cutoffs for defining contacts, because larger cutoffs are more tolerant of misalignments (Figure 10). In the case of threading, alignment accuracy strongly influences the fraction of common contacts identified among proteins of the same fold (Figure 2), which in turn affects fold recognition accuracy (Figures 1, 5). Threading alignment accuracy turned out to be relatively poor for all-β proteins (Figure 8), and thus those proteins have lower fold recognition accuracy (Figure 7). Finally, we found that the effective contact definition for identifying folds depends on the fold (Figure 10). A larger distance cutoff is advantageous for capturing the spatial arrangement of the secondary structures of a fold that are not physically in contact. For capturing contacts between neighboring β strands, considering Cα atoms is better than Cβ, because the side chains sometimes point in opposite directions (Figure 11C). The results of this work suggest two potential directions for implementing residue contacts to improve fold recognition. Since a larger distance cutoff is effective in capturing local topology of proteins, employing a “long-distance” interaction potential for residues that are 6.5 Å to 12 Å apart may improve recognition accuracy. The long-distance interaction potential may be used as a scoring term in threading together with a regular contact potential (e.g. for contacts defined within 4.5 Å between heavy atoms). Another idea is to use different fold-specific contact definitions (Figures 10, 11) for each structure in a template database.
This study focused on seeking effective inter-residue contact definitions for template-based protein structure prediction. Residue contacts defined by a Cβ-Cβ distance of 7.0 Å work best overall, among the definitions tested, for identifying proteins of the same fold. We also found that effective contact definitions differ from fold to fold, suggesting that using a residue contact definition specific to each template will improve the performance of threading.
Dataset of domain structures of globular proteins
Two sets of domain structures of globular proteins were selected according to the SCOP database (release 1.73), one for representative protein folds and another for representative superfamilies. We selected protein folds that have at least three superfamilies, from each of which one domain structure was selected. Entries were discarded if their PDB files contain only Cα traces. In total, 194 folds were selected. The number of structures per fold ranges from 3 to 110. In total, there are 2167 structures in the fold dataset. Similarly, a dataset of 250 representative superfamilies containing a total of 1672 structures was selected. Each superfamily in the dataset contains at least three families, from each of which one structure was selected. In the following, we explain the experimental procedure for the fold dataset; the same procedure was performed on the superfamily dataset.
Construction of contact maps
For each structure in the datasets, we constructed contact maps using 45 different contact definitions: three contact bases for an amino acid residue, i.e. Cα, Cβ (the Cα atom is used for glycine), and heavy atoms from two residues, each combined with 15 distance cutoffs (4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 10.0, 12.0, 15.0, 20.0, 30.0, 50.0, and 100.0 Å). To eliminate obvious contacts between neighboring residues, we only considered contacts between amino acid residues that are at least three residues apart in the primary sequence.
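As an illustration of the construction described above, the following Python sketch enumerates the three contact bases and 15 cutoffs (45 combinations) and builds a contact map for one of them. It is a minimal sketch under stated assumptions, not the authors' code. Each residue is assumed to be a dictionary from atom name to coordinates that holds heavy atoms only, list positions are assumed to equal sequence positions (no chain breaks), and a heavy-atom contact is assumed to mean that the closest pair of heavy atoms lies within the cutoff. All names are hypothetical.

```python
from itertools import combinations
from math import dist  # Python 3.8+

# The 15 distance cutoffs (in angstroms) listed in the text.
CUTOFFS = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0,
           10.0, 12.0, 15.0, 20.0, 30.0, 50.0, 100.0]
BASES = ["CA", "CB", "heavy"]  # hypothetical labels for the three bases
DEFINITIONS = [(basis, cutoff) for basis in BASES for cutoff in CUTOFFS]


def residue_distance(res_i, res_j, basis):
    """Distance between two residues under the given contact basis.

    Each residue is a dict {atom_name: (x, y, z)} of heavy atoms
    (an assumed layout for this sketch).
    """
    if basis == "heavy":
        # Assumption: use the closest pair of heavy atoms.
        return min(dist(a, b) for a in res_i.values() for b in res_j.values())
    if basis == "CB":
        # Glycine has no CB, so its CA is used instead, as stated in the text.
        return dist(res_i.get("CB", res_i["CA"]), res_j.get("CB", res_j["CA"]))
    return dist(res_i["CA"], res_j["CA"])


def contact_map(residues, basis, cutoff, min_separation=3):
    """Return the set of contacting pairs (i, j), i < j, for one definition."""
    contacts = set()
    for i, j in combinations(range(len(residues)), 2):
        # Skip pairs that are fewer than min_separation residues apart.
        if j - i < min_separation:
            continue
        if residue_distance(residues[i], residues[j], basis) <= cutoff:
            contacts.add((i, j))
    return contacts


if __name__ == "__main__":
    # Toy chain: five CA-only residues spaced 3.8 angstroms apart on a line.
    chain = [{"CA": (3.8 * i, 0.0, 0.0)} for i in range(5)]
    print(len(DEFINITIONS), contact_map(chain, "CA", 12.0))
```

Storing a contact map as a set of index pairs keeps the later comparison of two maps a matter of set membership tests. The fraction of pairs in contact can be read off as the size of the set divided by the number of residue pairs considered.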
Common contacts between two protein structures
The aim of this work is to examine how well residue contacts determined by each of the 45 definitions can distinguish proteins of the same fold from the others. To identify common contacts between two protein structures (more precisely, between the contact maps of the two protein structures), we need an alignment of the two proteins that identifies structurally equivalent residues between them. Alignments were obtained using three methods: TM-align, HHpred, and SUPRB. TM-align is a structure alignment method, which aligns two tertiary structures using a dynamic programming algorithm and scores the superimposition with the TM-score. We consider structural alignments calculated by TM-align as the gold standard of the alignments. The latter two methods, HHpred and SUPRB, are threading methods. For a pair of proteins, the sequence of one of them is threaded (aligned) onto the structure of the other. The purpose of using the threading methods is to introduce realistic errors that can occur in the alignment process of threading. HHpred uses a hidden Markov model that characterizes proteins with sequence profiles and predicted secondary structures. SUPRB is a threading method that uses a composite scoring function with sequence profile, solvent accessibility, secondary structure matching, main-chain angle preference, and a residue contact potential term. In this experiment we removed the contact potential term from the scoring function. Given the contact maps of two proteins and an alignment (by TM-align, HHpred, or SUPRB), the fraction of common contacts (FCC) was computed as follows. Suppose residues $a_i$ and $a_j$ in protein A (the query) are aligned with residues $b_m$ and $b_n$ in another protein B (the template), respectively. If the $(a_i, a_j)$ pair and the $(b_m, b_n)$ pair are each in contact within their own protein, then we count them as a common contact between the two proteins.
Finally, the FCC for the query protein is computed as the number of common contacts relative to the total number of contacts in the query protein. FCC ranges from 0 to 1.
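The counting of common contacts can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. It assumes that contact maps are sets of index pairs (i, j) with i < j and that the alignment is a dictionary from query residue index to template residue index. The normalization used here, dividing the number of common contacts by the number of contacts in the query, is an assumption of this sketch. Function names are hypothetical.

```python
def common_contacts(contacts_a, contacts_b, alignment):
    """Return the contacts of query A that are also contacts in template B.

    contacts_a, contacts_b: sets of (i, j) residue index pairs with i < j
                            (assumed layout).
    alignment: dict mapping a residue index in A to its aligned index in B.
    """
    common = set()
    for i, j in contacts_a:
        # A contact can only be common if both of its residues are aligned.
        if i in alignment and j in alignment:
            m, n = alignment[i], alignment[j]
            pair = (m, n) if m < n else (n, m)
            if pair in contacts_b:
                common.add((i, j))
    return common


def fraction_of_common_contacts(contacts_a, contacts_b, alignment):
    """FCC for the query A; always between 0 and 1.

    Assumption of this sketch: the denominator is the number of contacts
    in the query protein.
    """
    if not contacts_a:
        return 0.0
    shared = common_contacts(contacts_a, contacts_b, alignment)
    return len(shared) / len(contacts_a)


if __name__ == "__main__":
    query = {(0, 4), (1, 5), (2, 7)}
    template = {(10, 14), (11, 16)}
    aln = {0: 10, 4: 14, 1: 11, 5: 15, 2: 12}
    print(fraction_of_common_contacts(query, template, aln))
```

In the toy example only the query contact (0, 4) maps onto a template contact, so one of the three query contacts is common. With this normalization FCC is not symmetric: swapping query and template generally gives a different value.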
Identification of proteins of the same fold/superfamily by fraction of common contacts
For a group of proteins of the same fold, FCC was computed for each pair of them. As a reference, we took one protein from each fold (thus 194 proteins in total) and computed FCC between the selected protein of the fold and the selected proteins of the other folds. The difference between the FCC values of proteins within the same fold and those across different folds reflects the fold recognition ability of a given definition of residue contacts. For each fold group, we sorted the protein pairs of the same fold and those from different folds by their FCC and computed the receiver operating characteristic (ROC) curve. For each contact definition, an average ROC curve was computed by averaging the true positive rates of all the folds at the same false positive rate.
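The evaluation above can be sketched as follows in Python. This is a minimal sketch rather than the authors' procedure. It assumes that each protein pair carries an FCC score and a Boolean label (True for a same-fold pair), that the area under the curve is obtained with the trapezoidal rule, and that curves from different folds are averaged on a shared grid of false positive rates using step interpolation. These choices and all function names are assumptions of this example.

```python
def roc_curve(scores, labels):
    """Return ROC points (fpr, tpr), ranking pairs from highest to lowest score.

    labels[k] is True when pair k is a same-fold pair (a positive).
    """
    positives = sum(1 for is_pos in labels if is_pos)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative pair")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, is_pos) in enumerate(ranked):
        if is_pos:
            tp += 1
        else:
            fp += 1
        # Emit one point per distinct score so that ties are handled jointly.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at(points, fpr):
    """Highest true positive rate reached at or below the given fpr (step rule)."""
    return max(y for x, y in points if x <= fpr)


def average_roc(curves, grid):
    """Average the true positive rates of several curves at each fpr in grid."""
    return [(f, sum(tpr_at(c, f) for c in curves) / len(curves)) for f in grid]


if __name__ == "__main__":
    fold_1 = roc_curve([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
    fold_2 = roc_curve([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
    print(auc(fold_1), auc(fold_2))
    print(average_roc([fold_1, fold_2], [0.0, 0.5, 1.0]))
```

Averaging on a common grid of false positive rates is what allows curves from folds with different numbers of members to be combined into one curve per contact definition.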
The authors thank Lillian Liu for proofreading the manuscript. This work has been supported by grants from the National Institutes of Health (R01GM075004, R01GM097528), National Science Foundation (EF0850009, IIS0915801, DMS0800568), and National Research Foundation of Korea Grant funded by the Korean Government (NRF-2011-220-C00004).
- Hillisch A, Pineda LF, Hilgenfeld R: Utility of homology models in the drug discovery process. Drug Discov Today 2004, 9: 659–669. doi:10.1016/S1359-6446(04)03196-4
- Takeda-Shitaka M, Takaya D, Chiba C, Tanaka H, Umeyama H: Protein structure prediction in structure based drug design. Curr Med Chem 2004, 11: 551–558. doi:10.2174/0929867043455837
- Ashworth J, Havranek JJ, Duarte CM, Sussman D, Monnat RJ Jr, Stoddard BL, Baker D: Computational redesign of endonuclease DNA binding and cleavage specificity. Nature 2006, 441: 656–659. doi:10.1038/nature04818
- Jiang L, Althoff EA, Clemente FR, Doyle L, Rothlisberger D, Zanghellini A, Gallaher JL, Betker JL, Tanaka F, Barbas CF III, Hilvert D, Houk KN, Stoddard BL, Baker D: De novo computational design of retro-aldol enzymes. Science 2008, 319: 1387–1391. doi:10.1126/science.1152692
- Saven JG: Computational protein design: engineering molecular diversity, nonnatural enzymes, nonbiological cofactor complexes, and membrane proteins. Curr Opin Chem Biol 2011, 15: 452–457. doi:10.1016/j.cbpa.2011.03.014
- Mardis ER: Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 2008, 9: 387–402. doi:10.1146/annurev.genom.9.081307.164359
- Metzker ML: Sequencing technologies - the next generation. Nat Rev Genet 2010, 11: 31–46. doi:10.1038/nrg2626
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res 2000, 28: 235–242. doi:10.1093/nar/28.1.235
- Pieper U, Eswar N, Davis FP, Braberg H, Madhusudhan MS, Rossi A, Marti-Renom M, Karchin R, Webb BM, Eramian D, Shen MY, Kelly L, Melo F, Sali A: MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res 2006, 34: D291–D295. doi:10.1093/nar/gkj059
- Kihara D, Skolnick J: Microbial Genomes have over 72% structure assignment by the threading algorithm PROSPECTOR_Q. Proteins 2004, 55: 464–473. doi:10.1002/prot.20044
- Zhang Y: Progress and challenges in protein structure prediction. Curr Opin Struct Biol 2008, 18: 342–348. doi:10.1016/j.sbi.2008.02.004
- Chen H, Kihara D: Effect of using suboptimal alignments in template-based protein structure prediction. Proteins 2011, 79: 315–334. doi:10.1002/prot.22885
- Kinch L, Yong SS, Cong Q, Cheng H, Liao Y, Grishin NV: CASP9 assessment of free modeling target predictions. Proteins 2011, 79(Suppl 10): 59–73.
- Qu X, Swanson R, Day R, Tsai J: A guide to template based structure prediction. Curr Protein Pept Sci 2009, 10: 270–285. doi:10.2174/138920309788452182
- Liu S, Zhang C, Liang S, Zhou Y: Fold recognition by concurrent use of solvent accessibility and residue depth. Proteins 2007, 68: 636–645. doi:10.1002/prot.21459
- Zhou H, Zhou Y: Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins 2004, 55: 1005–1013. doi:10.1002/prot.20007
- Skolnick J, Kihara D: Defrosting the frozen approximation: PROSPECTOR–a new approach to threading. Proteins 2001, 42: 319–331. doi:10.1002/1097-0134(20010215)42:3<319::AID-PROT30>3.0.CO;2-A
- Skolnick J, Kihara D, Zhang Y: Development and large scale benchmark testing of the PROSPECTOR 3.0 threading algorithm. Proteins 2004, 56: 502–518. doi:10.1002/prot.20106
- Adamczak R, Porollo A, Meller J: Combining prediction of secondary structure and solvent accessibility in proteins. Proteins 2005, 59: 467–475. doi:10.1002/prot.20441
- Yang YD, Park C, Kihara D: Protein structure prediction without optimizing weighting factors for scoring function. Biophys J 2009, 96: 653a.
- Sippl MJ: Knowledge-based potentials for proteins. Curr Opin Struct Biol 1995, 5: 229–235. doi:10.1016/0959-440X(95)80081-6
- Skolnick J, Jaroszewski L, Kolinski A, Godzik A: Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct? Protein Sci 1997, 6: 676–688.
- Zhou H, Skolnick J: GOAP: a generalized orientation-dependent, all-atom statistical potential for protein structure prediction. Biophys J 2011, 101: 2043–2052. doi:10.1016/j.bpj.2011.09.012
- Kihara D: The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci 2005, 14: 1955–1963. doi:10.1110/ps.051479505
- Taketomi H, Ueda Y, Go N: Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. Int J Pept Protein Res 1975, 7: 445–459.
- Vassura M, Di LP, Margara L, Mirto M, Aloisio G, Fariselli P, Casadio R: Blurring contact maps of thousands of proteins: what we can learn by reconstructing 3D structure. BioData Min 2011, 4: 1. doi:10.1186/1756-0381-4-1
- Duarte JM, Sathyapriya R, Stehr H, Filippis I, Lappe M: Optimal contact definition for reconstruction of contact maps. BMC Bioinformatics 2010, 11: 283. doi:10.1186/1471-2105-11-283
- Vendruscolo M, Kussell E, Domany E: Recovery of protein structure from contact maps. Fold Des 1997, 2: 295–306. doi:10.1016/S1359-0278(97)00041-2
- Li W, Zhang Y, Kihara D, Huang YJ, Zheng D, Montelione GT, Kolinski A, Skolnick J: TOUCHSTONEX: protein structure prediction with sparse NMR data. Proteins 2003, 53: 290–306. doi:10.1002/prot.10499
- Rodionov MA, Johnson MS: Residue-residue contact substitution probabilities derived from aligned three-dimensional structures and the identification of common folds. Protein Sci 1994, 3: 2366–2377. doi:10.1002/pro.5560031221
- Li Y, Fang Y, Fang J: Predicting residue-residue contacts using random forest models. Bioinformatics 2011, 27: 3379–3384. doi:10.1093/bioinformatics/btr579
- Shackelford G, Karplus K: Contact prediction using mutual information and neural nets. Proteins 2007, 69(Suppl 8): 159–164.
- Frenkel-Morgenstern M, Magid R, Eyal E, Pietrokovski S: Refining intra-protein contact prediction by graph analysis. BMC Bioinformatics 2007, 8(Suppl 5): S6. doi:10.1186/1471-2105-8-S5-S6
- Cheng J, Baldi P: Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics 2007, 8: 113. doi:10.1186/1471-2105-8-113
- Hamilton N, Burrage K, Ragan MA, Huber T: Protein contact prediction using patterns of correlation. Proteins 2004, 56: 679–684. doi:10.1002/prot.20160
- Fariselli P, Olmea O, Valencia A, Casadio R: Prediction of contact maps with neural networks and correlated mutations. Protein Eng 2001, 14: 835–843. doi:10.1093/protein/14.11.835
- Vullo A, Walsh I, Pollastri G: A two-stage approach for improved prediction of residue contact maps. BMC Bioinformatics 2006, 7: 180. doi:10.1186/1471-2105-7-180
- Kihara D, Lu H, Kolinski A, Skolnick J: TOUCHSTONE: an ab initio protein structure prediction method that uses threading-based tertiary restraints. Proc Natl Acad Sci U S A 2001, 98: 10125–10130. doi:10.1073/pnas.181328398
- Miyazawa S, Jernigan RL: An empirical energy potential with a reference state for protein fold and sequence recognition. Proteins 1999, 36: 357–369. doi:10.1002/(SICI)1097-0134(19990815)36:3<357::AID-PROT10>3.0.CO;2-U
- Miyazawa S, Jernigan RL: Estimation of effective inter-residue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 1985, 18: 534–552. doi:10.1021/ma00145a039
- Gniewek P, Leelananda SP, Kolinski A, Jernigan RL, Kloczkowski A: Multibody coarse-grained potentials for native structure recognition and quality assessment of protein models. Proteins 2011, 79: 1923–1929. doi:10.1002/prot.23015
- Krishnamoorthy B, Tropsha A: Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics 2003, 19: 1540–1548. doi:10.1093/bioinformatics/btg186
- Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005, 33: 2302–2309. doi:10.1093/nar/gki524
- Hildebrand A, Remmert M, Biegert A, Soding J: Fast and accurate automatic structure prediction with HHpred. Proteins 2009, 77(Suppl 9): 128–132.
- Xu J, Zhang Y: How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 2010, 26: 889–895. doi:10.1093/bioinformatics/btq066
- Zemla A: LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res 2003, 31: 3370–3374. doi:10.1093/nar/gkg571
- Vehlow C, Stehr H, Winkelmann M, Duarte JM, Petzold L, Dinse J, Lappe M: CMView: interactive contact map visualization and analysis. Bioinformatics 2011, 27: 1573–1574. doi:10.1093/bioinformatics/btr163
- Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 2008, 36: D419–D425.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.