Prediction of antigenic epitopes on protein surfaces by consensus scoring
© Liang et al; licensee BioMed Central Ltd. 2009
Received: 7 April 2009
Accepted: 22 September 2009
Published: 22 September 2009
Prediction of antigenic epitopes on protein surfaces is important for vaccine design. Most existing epitope prediction methods focus on protein sequences to predict continuous epitopes that are linear in sequence. Only a few structure-based epitope prediction algorithms are available, and they have not yet shown satisfactory performance.
We present a new antigen Epitope Prediction method that uses ConsEnsus Scoring (EPCES) from six different scoring functions: residue epitope propensity, conservation score, side-chain energy score, contact number, surface planarity score, and secondary structure composition. Applied to unbound antigen structures from an independent test set, EPCES predicted antigenic epitopes with 47.8% sensitivity, 69.5% specificity and an AUC value of 0.632. The performance of the method is statistically similar to that of other published methods. The AUC value of EPCES is about 0.034 higher than the best result among existing algorithms.
Our work shows that consensus scoring of multiple features performs better than any single term. The successful prediction is also due to the new residue epitope propensity score, which is based on atomic solvent accessibility.
Realistic prediction of protein surface regions that are preferentially recognized by antibodies (antigenic epitopes) can help in the design of vaccine components and immuno-diagnostic reagents. Antigenic epitopes are classified as continuous or discontinuous epitopes. If the residues involved in an epitope are contiguous in the polypeptide chain, the epitope is called a continuous or linear epitope. A discontinuous or non-linear epitope, on the other hand, is composed of residues that are not necessarily contiguous in the polypeptide sequence but are in spatial proximity on the surface of the protein structure. A significant fraction of epitopes are discontinuous in the sense that antibody binding is not fully determined by a linear peptide segment but is also influenced by adjacent surface regions [1].
However, the majority of available epitope prediction methods focus on continuous epitopes, because it is convenient to take only the amino acid sequence of a protein as input. Such prediction methods are based on amino acid properties including hydrophilicity [2, 3], solvent accessibility [4], secondary structure [5], flexibility [6], and antigenicity [7]. In addition, based on known linear epitope databases such as Bcipep [8] and FIMM [9], some methods use machine learning algorithms such as Hidden Markov Models (HMM) [10], Artificial Neural Networks (ANN) [11], and Support Vector Machines (SVM) [12, 13] to locate linear epitopes. A study by Blythe and Flower demonstrated that a linear epitope prediction method using single-scale amino acid property profiles was not able to predict epitope location reliably [14], whereas Greenbaum et al. [15] showed that machine learning algorithms using a combination of more than one amino acid property scale could improve prediction accuracy.
Unlike linear epitope prediction, only a small number of studies have so far addressed the prediction of discontinuous epitopes using structural information of a target protein. Although such studies are of high importance, they are limited by the small number of available structures of antibody-antigen complexes. Several databases, such as IEDB [16], SACS [17], and CED [18], have collected the existing structures of antibody-antigen complexes from the Protein Data Bank (PDB). With the 3-dimensional structures of proteins as input, a few methods have been designed to predict putative antigenic epitopes by using conservation score, amino acid statistics, accessibility, and spatial information [19–24]. Ponomarenko and Bourne [25] evaluated DiscoTope and CEP along with six other protein binding site prediction methods by benchmarking on 62 epitope structures and 82 antibody-antigen structures. They concluded that none of those prediction methods has a performance exceeding 40% precision and 46% recall. Clearly, there is still a large gap between the strong need for antigenic epitope prediction and the low accuracy that existing prediction methods can achieve.
Use of multiple features could potentially improve performance in predicting antigenic epitopes, but this raises another question: do the properties that are effective for the limited number of antigens with available complex structures work equally well for all antigens? In this study, we tested 6 properties that were previously used in protein/antibody binding site prediction, using the published databases plus the most recently released PDB entries. We found that the performance of the 6 terms differed considerably between the two databases. Nevertheless, consensus prediction with the 6 terms resulted in reasonable accuracy for both databases.
Protein Dataset 1
48 antigen-antibody complexes with resolution <3.0 Å were selected from the 59 representative antigen-antibody complexes compiled by Ponomarenko and Bourne: 2ADF, 1FE8, 1BGX, 1E6J, 1EGJ, 1FSK, 1H0D, 1IQD, 1JRH, 1LK3, 1MHP, 1NL0, 1NSN, 1OAZ, 1ORS, 1PKQ, 1RJL, 1SY6, 1TZI, 1WEJ, 1YJD, 1YY9, 1ZTX, 2JEL, 1A14, 1NCA, 1BVK, 1JHL, 1NDG, 1P2C, 1JPS, 1AR1, 1EO8, 1QFU, 1EZV, 1OSP, 1FJ1, 1FNS, 1G9M, 1R3J, 1N8Z, 1NFD, 1TQB, 2VDL, 1V7M, 1XIW, 2AEP, and 1R0A. All entries were released before January 2006 except for 2VDL, which was the new version of original entry 1TXV. This dataset was used to derive residue epitope propensities.
Protein Dataset 2
22 antigen-antibody complexes and their unbound structures were selected from the protein docking Benchmark 2.0 [26]. Benchmark 2.0 was published in 2005 and overlaps with Protein Dataset 1. The complex structures in this dataset were used to locate the antibody binding sites. Interface residues on the surface of the unbound antigens were used to optimize the parameters of the binding site prediction method; this dataset therefore served as the training set.
Protein Dataset 3
This dataset was curated by us and served as an independent test set. It contains 17 antigen-antibody complexes released between February 2006 and October 2008. Within this window, querying the PDB with the keywords "antibody" and "complex" and a resolution <3.0 Å returned 180 entries. All complexes of antibodies with non-protein ligands were manually removed from those 180 structures. Subsequently we performed a sequence alignment between the antigens in the remaining complexes and those in Protein Dataset 1. A complex was kept if the maximum sequence identity between its antigen and any antigen in Protein Dataset 1 was less than 35% in local alignment. For a complex with a maximum sequence identity in the range of 35–50%, we accepted the complex if its binding topology was not the same as that of the corresponding complex in Protein Dataset 1. The same criterion was also applied to any two complexes within Protein Dataset 3 itself. As a result, a total of 17 antigen-antibody complexes were selected. The unbound structures of the antigens in those 17 complexes were also obtained from the PDB. The structure with the best resolution was selected if more than one protein structure was available in the PDB. When an antigen's unbound structure was not available, its bound structure in a complex with another protein was used for evaluation.
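The redundancy filter described above can be sketched as a small decision function. This is only an illustration: the function name and arguments are hypothetical, the sequence identity and the binding topology comparison are assumed to have been determined beforehand, and the rejection of complexes above 50% identity is an assumption, since the text does not state it explicitly.

```python
def keep_complex(max_identity, same_topology):
    """Decide whether a candidate complex enters the independent test set.

    max_identity  -- highest local-alignment sequence identity (percent)
                     between the candidate antigen and any reference antigen
    same_topology -- True if the candidate binds its antibody with the same
                     topology as the most similar reference complex
    """
    if max_identity < 35.0:
        return True                 # dissimilar antigen: always kept
    if max_identity <= 50.0:
        return not same_topology    # similar antigen: kept only for a new binding mode
    return False                    # assumption: above 50% identity is rejected


print(keep_complex(42.0, same_topology=False))
```

The same function would be applied both against Protein Dataset 1 and pairwise within the candidate set.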
Definition of surface residues, surface patches, and interface residues
Following previous work, we consider an amino acid residue a surface residue if the relative accessibility of its side chain is greater than 6% with a probe radius of 1.2 Å. Also, to conform with previous work on protein binding site prediction, a surface patch is defined as a central surface residue and its 19 nearest surface neighbors in space. Solvent vector constraints were applied in order to avoid patches sampled on different sides of a protein surface. An interface residue is a surface residue whose solvent accessibility decreases by more than 1 Å² upon association.
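A minimal sketch of these three definitions follows. It assumes that accessibilities have already been computed by an external tool and that each surface residue is represented by a single coordinate (which atom is used is not stated in the text, so this is an assumption). The solvent vector constraints are omitted, and all function names are hypothetical.

```python
import math


def is_surface(relative_side_chain_accessibility, cutoff=0.06):
    """Surface residue: relative side-chain accessibility above 6%."""
    return relative_side_chain_accessibility > cutoff


def surface_patches(coords, patch_size=20):
    """Map each surface residue index to itself plus its nearest surface neighbours.

    coords holds one (x, y, z) point per surface residue. The central residue
    has distance zero to itself, so it is the first member of its own patch.
    """
    patches = {}
    for i, centre in enumerate(coords):
        by_distance = sorted(range(len(coords)),
                             key=lambda j: math.dist(centre, coords[j]))
        patches[i] = by_distance[:patch_size]
    return patches


def interface_residues(asa_unbound, asa_bound, min_loss=1.0):
    """Indices of residues losing more than min_loss square angstroms on binding."""
    return [i for i, (free, bound) in enumerate(zip(asa_unbound, asa_bound))
            if free - bound > min_loss]
```

With the default `patch_size=20`, each patch contains the central residue and its 19 nearest surface neighbours, as in the definition above.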
Six terms for antibody binding site prediction
Residue epitope propensity, conservation score, side chain energy score, contact number, surface planarity score, and secondary structure composition were exploited for antibody binding site prediction. We previously used the first three terms for protein-protein interface prediction (PINUP). In an independent comparative study, the PINUP method showed the highest prediction accuracy among published interface prediction approaches. The last three terms have already been used for antibody binding site prediction by others. We describe the details of the six terms in the following paragraphs.
Residue epitope propensity
The residue epitope propensity at sequence position i depends on the contributions of residue type r to the antibody binding site and to the protein surface area, on the relative accessible surface area S_r of residue r at sequence position i, and on the average relative accessible surface area of surface residues of type r. The Cα atom of Gly is considered a side chain atom for convenience. Since antigen-antibody interfaces have a different residue composition from other protein-protein interfaces, we used Protein Dataset 1 to derive the residue antibody binding site propensity instead of using the former residue interface propensity score. The two contributions were obtained from statistical analysis of Protein Dataset 1. Some antigens in Dataset 1 have multiple epitopes; residues belonging to any of the epitopes were considered antibody-binding interface residues. The propensity values for the 20 amino acid residues were obtained from the statistical analyses of 41 antigens in Protein Dataset 1.
Residue conservation score
The conservation score is computed from M_ir, the self-substitution score in the position-specific substitution matrix generated by PSI-BLAST for residue type r at sequence position i, and B_rr, the diagonal element of BLOSUM62 for residue type r. Usually, protein-protein interface residues are more conserved than other surface residues because of functional constraints, and hence conserved surface residues in the unbound structure are predicted as interface residues. The residues in the antibody-binding site, however, are less conserved than other surface residues because of the pressure exerted by the host immune system. Unconserved residues are therefore considered putative antibody binding site residues.
Side-chain energy score
The exact expression for the side chain energy score can be found in Eq. (3) of the PINUP paper. It was calculated from the side-chain energies of all possible rotamers for a given residue type at a sequence position, while all other sequence positions keep their native residue types and observed atomic coordinates. The weights of the energy function were optimized so that the native residue was predicted to be energetically favorable at each position of the training proteins. The assumption is that residues at the antibody binding site may have a higher energy score than other surface residues, so that the free energy of the antigen-antibody system can decrease significantly upon association.
Contact number
The residue contact number is the number of Cα atoms in the antigen within a distance of 10 Å of the Cα atom of residue i. A residue with a small contact number was considered an antibody binding site residue.
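The contact number is simple to compute from Cα coordinates alone. The sketch below assumes the coordinates have already been parsed into (x, y, z) tuples; excluding the residue itself from its own count and treating the 10 Å limit as inclusive are assumptions, as the text does not specify either.

```python
import math


def contact_numbers(ca_coords, cutoff=10.0):
    """For each residue, count the other C-alpha atoms within cutoff angstroms."""
    counts = []
    for i, a in enumerate(ca_coords):
        n = sum(1 for j, b in enumerate(ca_coords)
                if j != i and math.dist(a, b) <= cutoff)
        counts.append(n)
    return counts


# Four residues placed along a line, in angstroms.
print(contact_numbers([(0, 0, 0), (5, 0, 0), (12, 0, 0), (30, 0, 0)]))
```

Because a small contact number indicates a putative binding site residue, the values would need to be negated (or ranked in ascending order) before being combined with scores where higher means more likely.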
Surface planarity score
The planarity of each surface patch was calculated by evaluating the root mean squared (rms) deviation of all the Cα atoms in the surface patch from the least squares plane through the atoms. The rms deviations were inverted so that a high planarity score indicates a planar patch, which was interpreted as a putative antibody binding site.
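The rms deviation from the least squares plane can be obtained without fitting the plane explicitly: it equals the square root of the smallest eigenvalue of the covariance matrix of the centred coordinates. The sketch below uses the closed-form eigenvalue solution for symmetric 3x3 matrices so that it needs only the standard library. How the deviation was inverted is not stated in the text, so the negation used in `planarity_score` is an assumption, and the function names are hypothetical.

```python
import math


def smallest_eigenvalue(a):
    """Smallest eigenvalue of a symmetric 3x3 matrix (trigonometric solution)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:                       # already diagonal
        return min(a[0][0], a[1][1], a[2][2])
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))   # guard against rounding
    phi = math.acos(r) / 3.0
    return q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)


def rms_plane_deviation(points):
    """RMS distance of 3D points from their least-squares plane."""
    n = len(points)
    cx, cy, cz = (sum(p[i] for p in points) / n for i in range(3))
    dev = [(x - cx, y - cy, z - cz) for x, y, z in points]
    cov = [[sum(d[i] * d[j] for d in dev) / n for j in range(3)]
           for i in range(3)]
    return math.sqrt(max(0.0, smallest_eigenvalue(cov)))


def planarity_score(points):
    """Assumed inversion: negate the deviation so flatter patches score higher."""
    return -rms_plane_deviation(points)
```

A perfectly flat patch gives a deviation of zero and therefore the highest possible score under this convention.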
Secondary structure composition
This score was defined as the fraction of the 20 patch residues that form turns or loops. Following Chou & Fasman's method, an α-helix was defined as four or more consecutive residues having ϕ, ψ angles within 40° of (-60°, -50°), and a β-sheet as three or more consecutive residues having ϕ, ψ angles within 40° of (-120°, 110°) or (-140°, 135°). The remaining regions were considered turns and loops.
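The assignment rule above can be sketched as follows. The sketch assumes that "within 40°" applies to each of the two angles separately, that angle differences wrap around at 360°, and that a (ϕ, ψ) pair is available for every residue; these details and all names are assumptions made for illustration.

```python
HELIX = [(-60.0, -50.0)]
SHEET = [(-120.0, 110.0), (-140.0, 135.0)]


def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def near(phi, psi, centre, tol=40.0):
    """True if both angles lie within tol degrees of the reference pair."""
    return angle_diff(phi, centre[0]) <= tol and angle_diff(psi, centre[1]) <= tol


def assign_secondary_structure(phi_psi):
    """Label residues 'H' (helix), 'E' (sheet) or 'L' (turn or loop)."""
    n = len(phi_psi)
    labels = ["L"] * n
    for label, centres, min_run in (("H", HELIX, 4), ("E", SHEET, 3)):
        start = None
        for i in range(n + 1):
            hit = i < n and any(near(*phi_psi[i], c) for c in centres)
            if hit and start is None:
                start = i
            elif not hit and start is not None:
                if i - start >= min_run:          # run is long enough to count
                    labels[start:i] = [label] * (i - start)
                start = None
    return labels


def turn_loop_fraction(labels, patch):
    """Fraction of patch residues labelled as turn or loop."""
    return sum(labels[i] == "L" for i in patch) / len(patch)
```

Runs shorter than the minimum length fall back to the turn and loop class, which is why isolated helix-like residues still raise the score of a patch.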
Prediction of discontinuous epitopes
Prediction with one term
Given the structure of an antigen, all of its surface residues were sampled, and hence all of its surface patches could be obtained. The score of a patch for one scoring function is the average of the scores of its 20 residues. Based on a chosen threshold, the central residues of the top-ranked patches were predicted as interface residues. For the secondary structure composition and contact number scores, a patch that was not ranked above the threshold but scored the same as a top-ranked patch was also added to the top-ranked patch set.
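A minimal sketch of the single-term procedure is given below. It assumes that every term has been oriented so that a higher score means a more likely binding site, and that patches are stored as a mapping from the central residue to its member residues; the function names and the rounding of the threshold to a whole number of patches are assumptions.

```python
def patch_scores(residue_scores, patches):
    """Average the per-residue scores over the members of each patch."""
    return {centre: sum(residue_scores[k] for k in members) / len(members)
            for centre, members in patches.items()}


def top_ranked(scores, fraction, include_ties=False):
    """Central residues of the best-scoring fraction of patches.

    include_ties adds patches that score the same as the lowest selected
    patch, which matters for discrete scores such as the contact number.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, round(fraction * len(ranked)))
    selected = set(ranked[:n_keep])
    if include_ties:
        cutoff = scores[ranked[n_keep - 1]]
        selected |= {c for c in ranked if scores[c] == cutoff}
    return selected
```

Varying `fraction` changes how many residues are predicted and therefore trades sensitivity against specificity.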
Prediction with consensus scoring
To take advantage of the multiple features, we used a voting mechanism with the six scoring functions described above. A patch was considered an interface patch if five of the six terms scored it into the top-ranked patch set. We did not require votes from all six scoring functions because a surface patch with a small contact number cannot have a high planarity score at the same time. The number of predicted residues is the same for each single term, but the threshold determining how many top-ranked patches are kept can be varied to yield predictions with different sensitivities.
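The voting step can be sketched with a counter over the per-term selections. The sketch interprets "five of the six terms" as at least five votes, which is an assumption; the term names used as dictionary keys are only labels.

```python
from collections import Counter


def consensus_patches(term_selections, min_votes=5):
    """Patches placed in the top-ranked set by at least min_votes terms.

    term_selections maps a term name to the set of patches (identified by
    their central residue) that the term ranked above the threshold.
    """
    votes = Counter()
    for selected in term_selections.values():
        votes.update(selected)
    return {patch for patch, n in votes.items() if n >= min_votes}


selections = {
    "propensity": {1, 2, 3},
    "conservation": {1, 2},
    "side_chain_energy": {1, 2, 4},
    "contact_number": {1, 3},
    "planarity": {1, 2},
    "turns_loops": {1, 2},
}
print(sorted(consensus_patches(selections)))
```

In this toy example patch 1 receives six votes and patch 2 receives five, so both pass, while patches 3 and 4 do not.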
Patch score derived by unevenly averaged single-residue scores
In this scheme, E_residue(k) is the score of residue k in the patch, d is the distance between residue k and the central residue of the patch, and T is a parameter optimized during training.
Sensitivity and precision were defined as the ratios of the number of correctly predicted interface residues to the number of real interface residues and to the number of all predicted interface residues, respectively. Specificity was defined as the fraction of correctly predicted non-interface surface residues among all observed non-interface surface residues. As recommended by Ponomarenko & Bourne, the area under the receiver operating characteristic curve (AUC) was used as the primary evaluation metric. A receiver operating characteristic (ROC) curve plots sensitivity against (1 - specificity). To obtain the ROC curve, we increased the number of predicted residues (or, in consensus prediction, the number of residues predicted by each single term) in steps of 1% of the total surface residues. A Java program downloaded from http://pages.cs.wisc.edu/~richm/programs/AUC/ was used to calculate the AUC.
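These measures can be illustrated with residue sets. The AUC function below is a plain trapezoidal stand-in, not the Java program cited above; it assumes the curve is anchored at (0, 0) and (1, 1), and all names are hypothetical.

```python
def evaluate(predicted, interface, surface):
    """Sensitivity, precision and specificity from sets of residue indices."""
    tp = len(predicted & interface)
    fp = len(predicted - interface)
    fn = len(interface - predicted)
    tn = len(surface - interface - predicted)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def roc_auc(points):
    """Trapezoidal area under the ROC curve.

    points is a list of (sensitivity, specificity) pairs, one per threshold.
    """
    curve = sorted([(0.0, 0.0), (1.0, 1.0)]
                   + [(1.0 - spec, sens) for sens, spec in points])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Calling `evaluate` once per threshold and passing the resulting pairs to `roc_auc` mirrors the 1% stepping procedure described above.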
Results and Discussions
Predictions with one term
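The residue epitope propensity values for the twenty amino acid residues. [Table content not recovered.]
AUC values for the training and testing datasets predicted by each single term. [Table content not recovered; columns: training set (a), testing set (b); surviving row labels: binding site propensity, side chain energy score, fraction of turns and loops.]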
Prediction results for the training dataset with consensus scoring
The precision of the conservation score always increases for both the training and testing sets as the number of predicted residues decreases. Residues scored above the cutoff value by conservation alone were therefore also considered interface residues. Furthermore, when the residues predicted by each single term are fewer than 28% of the total surface residues, no interface residues are predicted by consensus scoring for at least one training protein, and interface residues are predicted only by the conservation score. We tried cutoff values of 5%, 10%, 15%, and 20%, and the AUC values were 0.619, 0.622, 0.626, and 0.618, respectively. The cutoff value yielding the best AUC value (15%) was selected.
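Prediction results for the training set with 6 combined terms. [Table content not recovered; surviving column headers: No. of Surface Residues, No. of Interface Residues.]
Prediction results for the testing set. [Table content not recovered; surviving column headers: No. of Surface Residues, No. of Interface Residues.]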
Comparison with other epitope prediction methods
In this study, we investigated the residue antibody binding site propensity based on atomic solvent accessibility for the 20 amino acids. The AUC values for the training and testing sets were 0.637 and 0.577, respectively, when the propensity score was used as a single term for prediction. Currently, two other algorithms using multiple features for antibody binding site prediction are available, DiscoTope and PEPITO. These methods use similar antibody binding site propensity scores at the residue level. We also tried using the propensity score of DiscoTope for comparative predictions. The AUC values were 0.587 and 0.551 for our training and testing sets, respectively. The propensity score based on atomic solvent accessibility thus performs slightly better than the residue-level propensity score for both datasets.
Comparison with other algorithms
Unlike DiscoTope 1.2 and BEpro, our algorithm has lower prediction accuracy for the bound structures than for the unbound structures because of the inclusion of the side chain energy score. The interface residues of a bound antigen are buried in the complex and usually have a lower temperature factor than other surface residues. In the bound form these side chains have systematically lower energies than in the unbound form, which in our algorithm contributes unfavorably to the score. Predictions with the side chain energy score as a single term yielded AUC values of 0.555 and 0.569 for the training and testing sets, respectively, for unbound structures, and 0.532 and 0.521 for bound structures.
An important conclusion of the present study is that antibody binding site prediction is more difficult than prediction of other protein binding regions. A combination of multiple surface features that allows relatively accurate prediction of protein binding sites in general shows poorer performance in the case of antibody binding site prediction. Another important issue is that a given protein usually contains not one but several putative antibody binding sites, whereas an antibody-antigen complex structure usually reveals only one of these possible antigenic epitopes. In addition, care must be taken in evaluating prediction methods when a relatively small number of antibody-antigen complexes is used as the testing set. A prediction algorithm may work reasonably well on one testing set but show poorer prediction accuracy on new targets because of different interface properties. More training proteins are required for developing new prediction algorithms in the future. Nevertheless, the study demonstrated that consensus scoring of widely used features for binding site prediction performed better than any single term on the independent test set. The prediction accuracy was improved further by utilizing a residue epitope propensity based on atomic solvent accessibility. However, a detailed comparison with other published methods indicated that the overall performance of our combined approach is similar to that of existing methods.
The EPCES program is available upon request. A web-based EPCES application is available at: http://www.t38.physik.tu-muenchen.de/programs.htm.
We thank Dr. S. Fiorucci for helpful discussions. This project was supported by funding under the Sixth Research Framework Programme of the European Union (FP6 STREP "BacAbs", ref. LSHB-CT-2006-037325). Calculations were performed using computational resources of the CLAMV (Computational Laboratories for Analysis, Modeling and Visualization) at Jacobs University Bremen, Germany.
1. Van Regenmortel MHV: Mapping Epitope Structure and Activity: From One-Dimensional Prediction to Four-Dimensional Description of Antigenic Specificity. Methods 1996, 9: 465–472. doi:10.1006/meth.1996.0054
2. Parker JM, Guo D, Hodges RS: New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 1986, 25: 5425–5432. doi:10.1021/bi00367a013
3. Hopp TP, Woods KR: Prediction of protein antigenic determinants from amino acid sequences. Proc Natl Acad Sci USA 1981, 78: 3824–3828. doi:10.1073/pnas.78.6.3824
4. Emini EA, Hughes JV, Perlow DS, Boger J: Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol 1985, 55: 836–839.
5. Pellequer JL, Westhof E, Van Regenmortel MH: Correlation between the location of antigenic sites and the prediction of turns in proteins. Immunol Lett 1993, 36: 83–99. doi:10.1016/0165-2478(93)90072-A
6. Karplus PA, Schulz GE: Prediction of Chain Flexibility in Proteins - a Tool for the Selection of Peptide Antigens. Naturwissenschaften 1985, 72: 212–213. doi:10.1007/BF01195768
7. Kolaskar AS, Tongaonkar PC: A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett 1990, 276: 172–174. doi:10.1016/0014-5793(90)80535-Q
8. Saha S, Bhasin M, Raghava GP: Bcipep: a database of B-cell epitopes. BMC Genomics 2005, 6: 79. doi:10.1186/1471-2164-6-79
9. Schonbach C, Koh JL, Sheng X, Wong L, Brusic V: FIMM, a database of functional molecular immunology. Nucleic Acids Res 2000, 28: 222–224. doi:10.1093/nar/28.1.222
10. Larsen JE, Lund O, Nielsen M: Improved method for predicting linear B-cell epitopes. Immunome Res 2006, 2: 2. doi:10.1186/1745-7580-2-2
11. Saha S, Raghava GP: Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 2006, 65: 40–48. doi:10.1002/prot.21078
12. Chen J, Liu H, Yang J, Chou KC: Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids 2007, 33: 423–428. doi:10.1007/s00726-006-0485-9
13. El-Manzalawy Y, Dobbs D, Honavar V: Predicting linear B-cell epitopes using string kernels. J Mol Recognit 2008, 21: 243–255. doi:10.1002/jmr.893
14. Blythe MJ, Flower DR: Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci 2005, 14: 246–248. doi:10.1110/ps.041059505
15. Greenbaum JA, Andersen PH, Blythe M, Bui HH, Cachau RE, Crowe J, Davies M, Kolaskar AS, Lund O, Morrison S, et al.: Towards a consensus on datasets and evaluation metrics for developing B-cell epitope prediction tools. J Mol Recognit 2007, 20: 75–82. doi:10.1002/jmr.815
16. Peters B, Sidney J, Bourne P, Bui HH, Buus S, Doh G, Fleri W, Kronenberg M, Kubo R, Lund O, et al.: The immune epitope database and analysis resource: from vision to blueprint. PLoS Biol 2005, 3: e91. doi:10.1371/journal.pbio.0030091
17. Allcorn LC, Martin AC: SACS--self-maintaining database of antibody crystal structure information. Bioinformatics 2002, 18: 175–181. doi:10.1093/bioinformatics/18.1.175
18. Huang J, Honda W: CED: a conformational epitope database. BMC Immunol 2006, 7: 7. doi:10.1186/1471-2172-7-7
19. Kulkarni-Kale U, Bhosle S, Kolaskar AS: CEP: a conformational epitope prediction server. Nucleic Acids Res 2005, 33: W168–171. doi:10.1093/nar/gki460
20. Andersen PH, Nielsen M, Lund O: Prediction of residues in discontinuous B-cell epitopes using protein 3D structures. Protein Sci 2006, 15: 2558–2567. doi:10.1110/ps.062405906
21. Rapberger R, Lukas A, Mayer B: Identification of discontinuous antigenic determinants on proteins based on shape complementarities. J Mol Recognit 2007, 20: 113–121. doi:10.1002/jmr.819
22. Caoili SE: A structural-energetic basis for B-cell epitope prediction. Protein Pept Lett 2006, 13: 743–751. doi:10.2174/092986606777790502
23. Rubinstein ND, Mayrose I, Pupko T: A machine-learning approach for predicting B-cell epitopes. Mol Immunol 2009, 46: 840–847. doi:10.1016/j.molimm.2008.09.009
24. Ponomarenko J, Bui HH, Li W, Fusseder N, Bourne PE, Sette A, Peters B: ElliPro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinformatics 2008, 9: 514. doi:10.1186/1471-2105-9-514
25. Ponomarenko JV, Bourne PE: Antibody-protein interactions: benchmark datasets and prediction tools evaluation. BMC Struct Biol 2007, 7: 64. doi:10.1186/1472-6807-7-64
26. Mintseris J, Wiehe K, Pierce B, Anderson R, Chen R, Janin J, Weng Z: Protein-Protein Docking Benchmark 2.0: an update. Proteins 2005, 60: 214–216. doi:10.1002/prot.20560
27. Liang S, Zhang J, Zhang S, Guo H: Prediction of the interaction site on the surface of an isolated protein structure by analysis of side chain energy scores. Proteins 2004, 57: 548–557. doi:10.1002/prot.20238
28. Jones S, Thornton JM: Analysis of protein-protein interaction sites using surface patches. J Mol Biol 1997, 272: 121–132. doi:10.1006/jmbi.1997.1234
29. Liang S, Zhang C, Liu S, Zhou Y: Protein binding site prediction using an empirical scoring function. Nucleic Acids Res 2006, 34: 3698–3707. doi:10.1093/nar/gkl454
30. Jones S, Thornton JM: Prediction of protein-protein interaction sites using patch analysis. J Mol Biol 1997, 272: 133–143. doi:10.1006/jmbi.1997.1233
31. Zhou HX, Qin S: Interaction-site prediction for protein complexes: a critical assessment. Bioinformatics 2007, 23: 2203–2209. doi:10.1093/bioinformatics/btm323
32. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89: 10915–10919. doi:10.1073/pnas.89.22.10915
33. Liang S, Grishin NV: Effective scoring function for protein sequence design. Proteins 2004, 54: 271–281. doi:10.1002/prot.10560
34. Jones S, Thornton JM: Analysis of protein-protein interaction sites using surface patches. J Mol Biol 1997, 272: 121–132. doi:10.1006/jmbi.1997.1234
35. Chou PY, Fasman GD: Empirical predictions of protein conformation. Annu Rev Biochem 1978, 47: 251–276. doi:10.1146/annurev.bi.47.070178.001343
36. Sweredoski MJ, Baldi P: PEPITO: improved discontinuous B-cell epitope prediction using multiple distance thresholds and half sphere exposure. Bioinformatics 2008, 24: 1459–1460. doi:10.1093/bioinformatics/btn199
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.