Inter-residue distances derived from fold contact propensities correlate with evolutionary substitution costs
© Williams and Doherty; licensee BioMed Central Ltd. 2004
Received: 02 June 2004
Accepted: 18 October 2004
Published: 18 October 2004
The wealth of information on protein structure has led to a variety of statistical analyses of the role played by individual amino acid types in the protein fold. In particular, the contact propensities between the various amino acids can be converted into folding energies that have proved useful in structure prediction. The present study addresses the relationship of protein folding propensities to the evolutionary relationship between residues.
The contact preferences of residue types observed in a representative sample of protein structures are converted into a residue similarity matrix or inter-residue distance matrix. Remarkably, these distances correlate excellently with evolutionary substitution costs. Residue vectors are derived from the distance matrix. The residue vectors give a concrete picture of the grouping of residues into families sharing properties crucial for protein folding.
Inter-residue distances have proved useful in showing the explicit relationship between contact preferences and evolutionary substitution rates. It is proposed that the distance matrix derived from structural analysis may be useful in aligning proteins where remote homologs share structural features. Residue vectors derived from the distance matrix illustrate the spatial arrangement of residues and point to ways in which they can be grouped.
The large number of protein crystal structures available has naturally led to statistical analyses of protein folding and protein interaction, in the hope that these will point to intrinsic residue characteristics and therefore serve as aids in protein fold and interaction prediction. The first such analysis was performed by Miyazawa and Jernigan [1–3], who deduced a statistical protein folding potential, the MJ matrix, from residue contact propensities in a set of monomeric protein crystal structures. The MJ matrix has been used in various in silico folding experiments, reviewed by Jernigan and Bahar [4], and shown to point to the essentially hydrophobic nature of folding interactions [5]. An analysis of the MJ matrix has enabled a reduction in sequence complexity by grouping residues into families [6]. A more detailed study of crystal interactions focusing on hydrogen bond distributions has resulted in mean force potentials that have been successfully used in ligand prediction [7]. It is therefore reasonable to conclude that the statistical approach has pointed to an intrinsic residue:residue potential. In this study we show that crystal contact statistics can be used to define an inter-residue similarity score that is strongly correlated with an evolutionary substitution cost. As this score is not based on aligning homologous proteins, it can complement similarity scores derived from substitution matrices when aligning remotely homologous but structurally similar proteins.
If D(P) is a genuine measure of the distance between residue types, then residues sharing physical properties should lie close together. More crucially, we expect that residues that are distant according to D(P) will be difficult to mutate into one another, and that close residues will be easy to interchange. This is because the factors that determine mutation rates are dominated by those affecting the structural integrity of the protein. Such factors include residue hydropathy, size and charge. Substitution matrices such as PAM and Blosum are determined from mutation rates in aligned protein sequences [8, 9]. We can define an amino acid distance matrix from a substitution matrix in a similar way to $D_{ij}$ above.
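The comparison between a structure-derived distance matrix and a substitution-derived one can be sketched as a Pearson correlation over the off-diagonal entries. This is a minimal illustration only: the function names and the toy 3x3 matrices are hypothetical, and the conversion of a PAM or Blosum matrix into distances is not reproduced here.

```python
from math import sqrt


def upper_triangle(matrix):
    """Return the off-diagonal upper-triangle entries of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def matrix_correlation(a, b):
    """Correlate two symmetric distance matrices, ignoring the diagonal."""
    return pearson(upper_triangle(a), upper_triangle(b))


# Hypothetical toy matrices for three residue types (not data from the study).
structural = [[0, 1, 4], [1, 0, 3], [4, 3, 0]]
substitution = [[0, 2, 8], [2, 0, 6], [8, 6, 0]]
r = matrix_correlation(structural, substitution)
```

Only the upper triangle is used so that each residue pair is counted once and the trivially zero diagonal does not inflate the correlation.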
Relating amino acids through a structurally defined distance measure should provide a useful tool for aligning remotely homologous protein sequences. Also, a distance measure naturally leads us to look for a vector representation of the amino acids. In much the same way as average hydropathy plots are useful in structural analysis we expect that average vector profiles will also pick out various structural features. Given a vector for each residue type we can visualise the residues in some abstract space and look for natural groupings of residues and thereby find ways of reducing the effective number of residues.
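By analogy with an average hydropathy plot, an average vector profile could be computed with a sliding window along the sequence. The sketch below assumes a simple unweighted window; the function name, the window length and the two-dimensional toy vectors are hypothetical and are not the residue vectors derived in this study.

```python
def average_vector_profile(sequence, residue_vectors, window=7):
    """Average the residue vectors over every window of consecutive residues.

    sequence        -- string of one-letter residue codes
    residue_vectors -- dict mapping residue code to a tuple of coordinates
    window          -- number of consecutive residues averaged together
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    vectors = [residue_vectors[aa] for aa in sequence]
    dim = len(vectors[0])
    profile = []
    for start in range(len(sequence) - window + 1):
        chunk = vectors[start:start + window]
        profile.append(tuple(sum(v[k] for v in chunk) / window for k in range(dim)))
    return profile


# Hypothetical two-dimensional vectors for three residue types.
toy_vectors = {"A": (1.0, 0.0), "K": (-1.0, 0.0), "G": (0.0, 1.0)}
profile = average_vector_profile("AAKG", toy_vectors, window=2)
```

Each entry of the returned profile is itself a vector, so a profile in several dimensions can distinguish stretches that a single hydropathy value would average to the same number.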
A representative set of crystal structures was compiled from the PDBselect25 database [10], which contains structures sharing at most 25% sequence homology. We required that side chain coordinates were defined and restricted chain lengths to more than 50 and fewer than 500 residues. This gave 1073 structures, on which the statistical analysis was performed. Residues are held to be in contact if any of their respective side chain atoms are within a given distance of each other. Only residue pairs that are not neighbours along the chain are considered in the analysis of intra-molecular contacts.
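The contact criterion described above can be sketched as follows. The cutoff distance is left as a parameter because its value is not stated here, and the reading of "not neighbours along the chain" as a sequence separation of at least two is an assumption. The function names and the data layout (a list of residue type and side chain coordinate pairs) are hypothetical.

```python
from math import dist


def in_contact(side_chain_a, side_chain_b, cutoff):
    """True if any atom of one side chain lies within cutoff of any atom of the other."""
    return any(dist(p, q) <= cutoff for p in side_chain_a for q in side_chain_b)


def contact_pairs(residues, cutoff, min_separation=2):
    """List index pairs (i, j) of residues in contact along a single chain.

    residues       -- list of (residue_type, [(x, y, z), ...]) in chain order
    cutoff         -- contact distance, in the units of the coordinates
    min_separation -- smallest |i - j| considered (assumption: 2 skips chain neighbours)
    """
    pairs = []
    for i in range(len(residues)):
        for j in range(i + min_separation, len(residues)):
            if in_contact(residues[i][1], residues[j][1], cutoff):
                pairs.append((i, j))
    return pairs


# Hypothetical four-residue chain laid out along the x axis.
toy_chain = [
    ("ALA", [(0.0, 0.0, 0.0)]),
    ("LEU", [(1.0, 0.0, 0.0)]),
    ("VAL", [(2.0, 0.0, 0.0), (9.0, 0.0, 0.0)]),
    ("LYS", [(10.0, 0.0, 0.0)]),
]
toy_contacts = contact_pairs(toy_chain, cutoff=2.5)
```

A residue with an empty side chain atom list, as would be the case for glycine under this layout, never registers a contact.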
The dominant driving force of folding, at least in defining the crude fold, is hydrophobicity, and it is apparent that residues with similar hydrophobicities are grouped together. Residues of similar size also tend to be close in this space. To make a direct comparison between existing residue scales and our vectors we can project the residue vectors onto a line. Here the amino acid scalars (one-dimensional vectors) $d_i$ are defined such that the mismatch between the distances $D_{ij}$ and the scalar separations $|d_i - d_j|$ is minimal. We find that these distance matrix derived scalars have a correlation of 0.65 with the Kyte-Doolittle hydrophobicity scale [12] and a correlation of 0.53 with an amino acid volume scale [13]. It is clear therefore that the residue vectors capture a combination of factors determining protein folding.
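One way to obtain scalars whose separations approximate a distance matrix is sketched below. It assumes a least-squares objective, the sum over pairs of $(D_{ij} - |d_i - d_j|)^2$, and uses plain gradient descent with a fixed step. Both the objective and the optimiser are assumptions made for illustration and are not taken from this study; the function names and toy matrix are hypothetical.

```python
def stress(D, d):
    """Assumed objective: squared mismatch between D[i][j] and |d[i] - d[j]|."""
    n = len(d)
    return sum((D[i][j] - abs(d[i] - d[j])) ** 2
               for i in range(n) for j in range(i + 1, n))


def fit_scalars(D, initial, rate=0.05, steps=500):
    """Minimise the assumed stress by fixed-step gradient descent.

    initial -- starting scalars; they should be distinct, since tied
               values contribute no gradient
    rate    -- step size; it needs to shrink roughly as 1/n for n residue types
    """
    d = list(initial)
    n = len(d)
    for _ in range(steps):
        grad = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i == j or d[i] == d[j]:
                    continue
                sign = 1.0 if d[i] > d[j] else -1.0
                grad[i] -= 2.0 * (D[i][j] - abs(d[i] - d[j])) * sign
        d = [x - rate * g for x, g in zip(d, grad)]
    return d


# Hypothetical distances generated by three points at 0, 1 and 3 on a line.
toy_D = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
fitted = fit_scalars(toy_D, initial=[0.0, 0.1, 0.2])
```

The fitted scalars are determined only up to a shift and a reflection, so comparisons with other scales should use a correlation coefficient rather than the raw values. In one dimension the result can depend on the ordering of the starting values, so several starting points may be needed for real data.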
It is worth noting that a scalar reduction of the distance matrix can be obtained by a principal eigenvector analysis. In a principal eigenvector reduction of the contact propensity matrix we have $P_{ij} = \lambda e_i e_j$, where $\lambda$ is the principal eigenvalue and $e_i$ is the $i$-th component of the principal eigenvector, and consequently our distance matrix has a scalar representation in terms of $e_i$. It is not surprising that the eigenvector is closely related to our scalar; in fact r(e,d) = 0.98. There are many hydrophobicity scales in the literature [14] and some are remarkably similar to our scalar amino acid representation, for example r = 0.95 for the Wertz & Scheraga scale [15]. However, the highly correlated scales are derived from residue burial statistics in protein structures and are therefore not independent of our statistic.
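The principal eigenvector reduction $P_{ij} \approx \lambda e_i e_j$ can be illustrated with a power iteration on a symmetric matrix. The sketch assumes all entries are positive, as contact propensities would be, so that a positive starting vector converges to the dominant eigenvector. The function names and the 2x2 toy matrix are hypothetical.

```python
from math import sqrt


def principal_eigenpair(matrix, iterations=200):
    """Power iteration for the dominant eigenvalue and unit eigenvector
    of a symmetric matrix with positive entries."""
    n = len(matrix)
    vec = [1.0 / sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    applied = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * a for v, a in zip(vec, applied))
    return eigenvalue, vec


def rank_one(eigenvalue, vec):
    """Rebuild the rank-one approximation lambda * e_i * e_j."""
    return [[eigenvalue * a * b for b in vec] for a in vec]


# Hypothetical matrix that is exactly rank one: lambda = 5, e = (2, 1) / sqrt(5).
toy_P = [[4.0, 2.0], [2.0, 1.0]]
lam, e = principal_eigenpair(toy_P)
approx = rank_one(lam, e)
```

For a real propensity matrix the rank-one reconstruction is only approximate, and the size of the residual indicates how much of the matrix a single scalar per residue can capture.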
We have generated full atom residue:residue contact propensity profiles for intra-molecular interactions from a non-redundant crystal structure database. Recasting the contact propensity matrix as a distance matrix, we see that close residues are those with a low evolutionary substitution cost. The structure derived distance measures can serve as additional scores when aligning proteins where remote homologs share structural features. The distance matrix led us naturally to derive effective residue vectors. We found that residues sharing similar physical characteristics, such as hydrophobicity and volume, are grouped together. In contrast to the MJ matrix analysis, we find that a scalar representation for the residues is inadequate to capture the complexity of the propensity distance matrix. The most successful scalar representation for the amino acid residues has been the hydropathy scale. Representing a sequence as a smoothed hydropathy profile through wavelet analysis or simple averaging has resulted in many effective analytical tools, such as periodic structure prediction [16], remote homology analysis, helix prediction and transmembrane prediction [17]. It is therefore probable that a higher dimensional vector representation of the amino acids will lead to a more subtle sequence analysis. The distance matrix may also serve as an additional tool in sequence alignment, as it gives a measure of the structural cost of residue mutation; this is an idea we hope to pursue in a future study.
In this study we have shown that inter-residue distance matrices and residue vectors allow us to make an explicit connection between amino acid interaction preferences observed in protein structures and amino acid evolutionary substitution costs. When problems are encountered with aligning structurally related proteins that are remote homologs then the structurally defined distance matrix may prove to be an effective supplement to existing substitution rate derived matrices. The distance matrix leads naturally to an amino acid vector representation. Projecting the vectors onto a two-dimensional plane illustrates ways in which the amino acids can be grouped and their effective number thereby reduced.
The database used in the present study was compiled from the PDBselect25 [10] list of representative proteins with known crystal structure that share less than 25% sequence homology. The structural coordinates were downloaded by automated ftp from the NCBI protein data bank. All programmes were written in C, compiled with Metrowerks CodeWarrior and run on a PC. In brief, the contact propensity statistics were compiled by reading the amino acid sequence and atomic coordinates for the specified chain of each pdb structure file in turn. The number of possible pairings of amino acid type i with amino acid type j, $N_{ij}$, was counted together with the number of these pairings corresponding to a pair with side chain atoms within a given distance of each other, $C_{ij}$. The contact propensity matrix is given by $P_{ij} = C_{ij} / N_{ij}$. The residue vectors were defined such that the mismatch between the distances $D_{ij}$ and the separations of the corresponding residue vectors is minimal. The minimisation was carried out by a standard Newton-Raphson steepest descent iteration.
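The counting step can be sketched for a single chain as follows. The sketch assumes that the propensity is the ratio of observed contacts to possible pairings, $C_{ij} / N_{ij}$, and that chain neighbours are excluded by requiring a sequence separation of at least two. The function name, the data layout and the toy chain are hypothetical, and this is Python rather than the C used for the original programmes.

```python
from collections import Counter


def contact_propensity(types, contacts, min_separation=2):
    """Return {(type_a, type_b): C / N} for one chain.

    types          -- residue types in chain order
    contacts       -- iterable of index pairs (i, j) found to be in contact
    min_separation -- smallest |i - j| counted as a possible pairing (assumption)
    """
    possible = Counter()  # N_ij: possible pairings of each type pair
    observed = Counter()  # C_ij: pairings actually in contact
    n = len(types)
    for i in range(n):
        for j in range(i + min_separation, n):
            possible[tuple(sorted((types[i], types[j])))] += 1
    for i, j in contacts:
        observed[tuple(sorted((types[i], types[j])))] += 1
    return {pair: observed[pair] / count for pair, count in possible.items()}


# Hypothetical five-residue chain with three contacts.
toy_types = ["L", "A", "L", "K", "L"]
toy_contact_list = [(0, 2), (2, 4), (1, 4)]
toy_propensity = contact_propensity(toy_types, toy_contact_list)
```

Type pairs are stored with sorted keys so that the resulting matrix is symmetric. For a set of structures the two counters would be accumulated over all chains before taking the ratio.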
1. Miyazawa S, Jernigan RL: Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 1985, 18: 534–552.
2. Miyazawa S, Jernigan RL: Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 1996, 256: 623–644. 10.1006/jmbi.1996.0114
3. Miyazawa S, Jernigan RL: An empirical energy potential with a reference state for protein fold and sequence recognition. Proteins 1999, 36(3):357–369. 10.1002/(SICI)1097-0134(19990815)36:3<357::AID-PROT10>3.3.CO;2-L
4. Jernigan RL, Bahar I: Structure-derived potentials and protein simulations. Curr Opin Struct Biol 1996, 6: 195–209. 10.1016/S0959-440X(96)80075-3
5. Li H, Tang C, Wingreen NS: Nature of Driving Force for Protein Folding: A Result From Analyzing the Statistical Potential. Phys Rev Lett 1997, 79: 765–768. 10.1103/PhysRevLett.79.765
6. Wang J, Wang W: Grouping of residues based on their contact interactions. Phys Rev E Stat Nonlin Soft Matter Phys 2002, 65(4 Pt 1):041911.
7. Grzybowski BA, Ishchenko AV, Kim CY, Topalov G, Chapman R, Christianson DW, Whitesides GM, Shakhnovich EI: Combinatorial computational method gives new picomolar ligands for a known enzyme. Proc Natl Acad Sci 2002, 99: 1270–1273. 10.1073/pnas.032673399
8. Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. Edited by: Dayhoff MO. Washington, DC: National Biomedical Research Foundation; 1978:345–352.
9. Henikoff S, Henikoff JG: Amino Acid Substitution Matrices from Protein Blocks. Proc Natl Acad Sci 1992, 89: 10915–10919.
10. Hobohm U, Sander C: Enlarged representative set of protein structures. Protein Sci 1994, 3: 522–524.
11. Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJ: Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J Mol Biol 1987, 195: 957–961. 10.1016/0022-2836(87)90501-8
12. Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982, 157: 105–132.
13. Zamyatnin AA: Protein volume in solution. Prog Biophys Mol Biol 1972, 24: 107–123. 10.1016/0079-6107(72)90005-3
14. Cornette JL, Cease KB, Margalit H, Spouge JL, Berzofsky JA, DeLisi C: Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J Mol Biol 1987, 195: 659–685.
15. Wertz DH, Scheraga HA: Influence of water on protein structure. An analysis of the preferences of amino acid residues for the inside or outside and for specific conformations in a protein molecule. Macromolecules 1978, 11: 9–15.
16. Murray KB, Gorse D, Thornton JM: Wavelet transforms for the characterization and detection of repeating motifs. J Mol Biol 2002, 316: 341–363. 10.1006/jmbi.2001.5332
17. Kim J, Moriyama EN, Warr CG, Clyne PJ, Carlson JR: Identification of novel multi-transmembrane proteins from genomic databases using quasi-periodic structural properties. Bioinformatics 2000, 16: 767–775. 10.1093/bioinformatics/16.9.767
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.