Evaluation and integration of existing methods for computational prediction of allergens
© Wang et al.; licensee BioMed Central Ltd. 2013
Published: 8 March 2013
Allergy involves a series of complex reactions and factors that contribute to the development of the disease and the triggering of symptoms, including rhinitis, asthma, atopic eczema, skin sensitivity, and even acute and fatal anaphylactic shock. Prediction and evaluation of potential allergenicity is important for the safety evaluation of foods and other environmental factors. Although several computational approaches for assessing the potential allergenicity of proteins have been developed, their performance, relative merits and shortcomings have not been compared systematically.
To evaluate and improve the existing methods for allergen prediction, we collected an up-to-date definitive dataset consisting of 989 known allergens and a large number of putative non-allergens. The three most widely used computational approaches for allergen prediction, namely the sequence-, motif- and SVM-based (Support Vector Machine) methods, were systematically compared using defined parameters, and we found that the SVM-based method outperformed the other two with higher accuracy and specificity. The sequence-based method with the criteria defined by FAO/WHO (FAO: Food and Agriculture Organization of the United Nations; WHO: World Health Organization) has a higher sensitivity of over 98% but a low specificity. The advantage of the motif-based method is its ability to visualize the key motif within the allergen. Notably, the performance of both the FAO/WHO sequence-based method and the motif eliciting strategy could be improved by optimizing their parameters. To facilitate allergen prediction, we integrated these three methods in a web-based application, proAP, which provides a global search of known allergens and a powerful tool for allergen prediction. Flexible parameter setting and batch prediction were also implemented. proAP can be accessed at http://gmobl.sjtu.edu.cn/proAP/main.html.
This study comprehensively evaluated sequence-, motif- and SVM-based computational prediction approaches for allergens and optimized their parameters to obtain better performance. These findings may provide helpful guidance for researchers in allergen prediction. Furthermore, we integrated these methods into a web application, proAP, which makes customizable allergen search and prediction much easier for users.
Allergy and other hypersensitivity reactions to foods and environmental factors are major causes of chronic ill health in the world [1, 2], affecting about 25% of the population [3, 4]. Allergens include proteins in food, cold air, hot air, ultraviolet rays, metal, and so on. Among these, allergenic proteins may pose great dangers to health. Therefore, assessment of the potential allergenicity of proteins is essential for food production.
Over the last 15 years, several documents have been officially released providing guidance on the definition of potentially allergenic proteins [5–7]. In 1996, the ILSI (International Life Sciences Institute) Allergy and Immunology Institute provided a science-based decision tree approach to assess the allergenic concerns associated with the introduction of gene products into new plant varieties. The Codex Alimentarius Commission revised the 'decision tree' twice, in 2001 and 2003, to achieve better performance. DuPont Experimental Station presented a "weight-of-evidence" approach, which takes into account a variety of factors and approaches for an overall assessment of allergenic potential [7, 8]. This guideline suggested an assessment covering the source of novel proteins, the similarity of the target proteins to known allergens at the primary sequence level, their physicochemical properties, protein abundance, etc.
To enforce the requirement of evaluating the allergenicity of novel proteins, several computational approaches have been developed for effectively screening proteins for possible allergenicity. The first computational approach, proposed by the FAO/WHO consultation group in 2001, defined a protein as possibly allergenic if it has an exact match of a stretch of six or more consecutive identical amino acids (rule 1) or more than 35% identity within any window of 80 amino acids (rule 2) in comparison with any known allergen. This sequence-based approach has been widely accepted for allergen prediction using web tools such as Allermatch, AllerTool and AllergenPro [9–11]. However, it was reported in 2003 that only 1 of 200 "positive matches" represents a true allergen when using the FAO/WHO guidelines. Subsequently, a motif-based approach using the secondary structure of proteins was proposed for allergen prediction, increasing precision from 37.6% to 94.8% while recall decreased from 97.0% to 86.2%. In 2006, a statistical learning method based on SVM (support vector machine) was developed using the principle of pattern recognition [13–17]. Furthermore, two additional approaches, the epitope- and ARPs-based (Allergen Representative Peptides) methods, were reported using common subsequences of target proteins [13–20]. These two methods are limited by the small number of known epitopes and allergenic domains.
Although a variety of computational methods for allergen prediction have been reported, no comprehensive comparison of these methods exists. A previous study attempted to compare motif-, epitope-, ARPs- and SVM-based approaches, but the sequence-based method was not included and only one motif for one subset was selected for prediction, which may lead to low sensitivity. In this article, we comprehensively evaluated the performance of the sequence-based, motif-based and SVM-based allergen prediction approaches using training and testing datasets. Further, these approaches were integrated and optimized in a web-based application, proAP, to provide a comprehensive, integrative and friendly resource for allergen prediction.
Methods and materials
Methods for allergen prediction
As mentioned above, the sequence-based approach was proposed by FAO/WHO and requires amino acid sequence similarity analysis against known allergens. A Wordmatch program written in Perl was developed to meet the requirement of FAO/WHO rule 1; this method searches for short subsequences (words) that have perfect identity with an allergen entry. To implement rule 2, the query protein sequence was divided into windows of 80 amino acids using a sliding window with a step of a single residue, and each of these windows was then aligned to all allergen sequences using blast-2.2.23. The wordsize (the number of consecutive identical amino acids exactly matched) and the identity threshold were made configurable.
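The two FAO/WHO rules can be illustrated with a minimal Python sketch. It is not the Perl Wordmatch program used in this study: the function names `word_match` and `sliding_windows` are hypothetical, the treatment of queries shorter than one window (returned whole) is an assumption, and the alignment of each window against the allergen database, done here with BLAST, is left out.

```python
from typing import Iterator, Tuple


def word_match(query: str, allergen: str, wordsize: int = 6) -> bool:
    """Rule 1 sketch: True if query and allergen share an identical
    stretch of `wordsize` consecutive residues."""
    # All words of length `wordsize` found in the allergen sequence.
    words = {allergen[i:i + wordsize]
             for i in range(len(allergen) - wordsize + 1)}
    # Slide over the query and look each word up in the set.
    return any(query[i:i + wordsize] in words
               for i in range(len(query) - wordsize + 1))


def sliding_windows(query: str, size: int = 80,
                    step: int = 1) -> Iterator[Tuple[int, str]]:
    """Rule 2 preprocessing sketch: yield (start, window) pairs.
    Each window would then be aligned to the known allergens and
    flagged if identity exceeds the chosen threshold (35% by default)."""
    if len(query) <= size:
        # Assumption: a short query is treated as a single window.
        yield 0, query
        return
    for start in range(0, len(query) - size + 1, step):
        yield start, query[start:start + size]
```

Both `wordsize` and the window parameters are exposed as arguments, mirroring the configurable wordsize and identity threshold described above.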
Motif eliciting strategy
Unlike the sequence-based approach, the motif-based approach relies on the protein secondary structure (motif) instead of the primary structure (amino acid sequence). It involves extracting characteristic motifs from known allergens and then comparing the query proteins with these motifs. Starting with the positive dataset, the following steps were performed iteratively until no motif with an E-value less than 0.01 was found: the most relevant motif contained in the allergen sequences was identified using the MEME motif discovery tool; the generalized profile of the identified motif was scaled on the allergens with MAST; matching allergen sequences were removed from the allergen database; and the remaining sequences were submitted to the next iteration of motif discovery.
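The iterative structure of this strategy can be sketched in Python as follows. The sketch only shows the control flow: `elicit_motifs`, `toy_discover` and `toy_matches` are hypothetical names, and the two toy functions are simple stand-ins (a shared 3-mer with a made-up E-value, and a substring test) for the MEME discovery and MAST scanning steps, which are external programs and are not called here.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def elicit_motifs(
    sequences: List[str],
    discover: Callable[[List[str]], Optional[Tuple[str, float]]],
    matches: Callable[[str, str], bool],
    e_cutoff: float = 0.01,
) -> List[str]:
    """Repeat discover -> scan -> remove until no motif passes the cutoff."""
    remaining = list(sequences)
    motifs: List[str] = []
    while remaining:
        found = discover(remaining)          # stand-in for the MEME step
        if found is None or found[1] >= e_cutoff:
            break                            # no motif with E-value < cutoff
        motif = found[0]
        kept = [s for s in remaining if not matches(motif, s)]  # MAST step
        if len(kept) == len(remaining):
            break                            # guard: motif matched nothing
        motifs.append(motif)
        remaining = kept                     # next iteration sees the rest
    return motifs


def toy_discover(seqs: List[str]) -> Optional[Tuple[str, float]]:
    """Toy stand-in: the 3-mer shared by most sequences, fake E-value."""
    counts = Counter(k for s in seqs
                     for k in {s[i:i + 3] for i in range(len(s) - 2)})
    if not counts:
        return None
    kmer, n = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return kmer, (0.001 if n >= 2 else 1.0)


def toy_matches(motif: str, seq: str) -> bool:
    """Toy stand-in: a sequence matches if it contains the motif."""
    return motif in seq
```

Passing the discovery and scanning steps in as callables keeps the loop independent of how the external tools are actually invoked.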
Feature vectors computation in SVM-based prediction
SVM (support vector machine) is a statistical learning method that performs classification by constructing a hyperplane in an N-dimensional space that optimally separates the data into two categories. For allergen prediction, the features of the known allergens and the non-allergens are taken as input to the SVM for modeling, and the SVM then predicts the query as allergen or non-allergen according to the model.
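The amino acid composition, i.e. the fraction of each of the 20 amino acids in a protein sequence, was used as the feature:
$$\text{Fraction of amino acid } i = \frac{\text{number of amino acid } i}{\text{total number of amino acids in the protein}} \quad (1)$$
where $i$ can be any amino acid. These compositions were then utilized as input vectors of dimension 20 for testing.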
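As an illustration of the 20-dimensional composition vector, a minimal Python sketch is given below. The function name `composition`, the alphabetical ordering of the residues, and the choice to ignore non-standard residue codes (such as X or B) when counting are assumptions made for this example, not details taken from the study.

```python
from typing import List

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in `sequence`
    as a vector of length 20."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for residue in sequence.upper():
        if residue in counts:       # assumption: skip non-standard codes
            counts[residue] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

A vector of this form, computed for every allergen and non-allergen, is the kind of input an SVM library expects for training and prediction.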
The performances of all computational methods applied in this study were evaluated using ten-fold cross-validation. The dataset was randomly partitioned into ten subsets, where each subset had nearly equal number of allergens and non-allergens. Of the ten subsets, a single set was retained as the validation data for testing the method, and the remaining nine subsets were used as training data. This process was then repeated 10 times with each of the ten subsets used exactly once as the validation data. The overall performance of a method was the average performance over ten subsets.
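The ten-fold procedure described above can be sketched in plain Python. This is one possible reading, in which allergens and non-allergens are each distributed evenly over the folds; the names `stratified_folds` and `cross_validate`, the fixed random seed, and the `evaluate` callback (which stands for training on one index list and scoring on the other) are all hypothetical.

```python
import random
from typing import Callable, List, Sequence


def stratified_folds(labels: Sequence[int], k: int = 10,
                     seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds with nearly equal numbers
    of each class in every fold."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)     # deal indices round-robin
    return folds


def cross_validate(labels: Sequence[int],
                   evaluate: Callable[[List[int], List[int]], float],
                   k: int = 10, seed: int = 0) -> float:
    """Hold out each fold once, train on the other k - 1 folds,
    and return the average score over the k runs."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k
```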
Several statistical measures were used to evaluate the performance of each allergen prediction method presented in this study and are briefly described below:
Sensitivity, also referred to as recall, is the percentage of correctly predicted allergens. It is given by Eq. (2):
$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad (2)$$
Specificity is the percentage of correctly predicted non-allergens. It is given by Eq. (3):
$$\text{Specificity} = \frac{TN}{TN + FP} \quad (3)$$
Accuracy is the proportion of correctly predicted proteins. It is given by Eq. (4):
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \quad (4)$$
In the formulas above, TP and FN refer to true positives and false negatives, and TN and FP refer to true negatives and false positives.
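The three measures can be computed from predicted and true labels as in the short Python sketch below. The function names and the convention that label 1 denotes an allergen and 0 a non-allergen are assumptions for this example; values are returned as fractions rather than percentages.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FN, TN, FP), with 1 = allergen and 0 = non-allergen."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true allergens predicted as allergens."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true non-allergens predicted as non-allergens."""
    return tn / (tn + fp)


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all proteins predicted correctly."""
    return (tp + tn) / (tp + fn + tn + fp)
```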
The web server was built on a LAMP development environment, and Perl was used as the programming language for the processing operations. The software versions are: Linux (CentOS_5.5 http://www.centos.org/); Apache (httpd_2.3.8 http://httpd.apache.org/); MySQL (MySQL-5.5.7 http://dev.mysql.com/downloads/); PHP (php_5.3.3 http://www.php.net/); and Perl (perl_5.12.2 http://www.perl.org/).
Optimization of analysis parameters
Motif-based approach's performance on different MAST E-values (values not recovered)
Running time of each approach for querying one protein (rows include FAO/WHO rule 1 with n* = 6 and FAO/WHO rule 1 with n* = 8; values not recovered)
Integrative web-based server
This study comprehensively evaluated the existing computational methods and provided a guide for predicting protein allergens. We built a uniform test dataset composed of all known allergens and putative non-allergens to evaluate the most commonly used computational allergen prediction methods with ten-fold cross-validation. The comparison showed that the SVM-based method has significant advantages in accuracy and processing time over the sequence-based and motif-based ones, whereas the FAO/WHO criteria have a higher sensitivity and the motif-based approach may give a view of the key allergenic motif. Although a number of resources for allergen search or prediction have been reported previously, some of them provide only a search of known allergens, such as WHO/IUIS Allergen Nomenclature. Even in the other tools, only one or two computational methods of allergen prediction are available [9–11, 13, 15, 17]. Accordingly, we built an integrative web application that includes the most commonly used methods and provides individual or combined allergen prediction online in addition to the search of known allergens, so that users can combine individual or multiple methods in a customized way according to their own purposes. Moreover, batch prediction in proAP is a very useful feature in practice that has not yet been implemented in any existing web tool.
The impacts of wordmatch and the sliding window in the sequence-based method were also analyzed, and the performance of the motif-based approach over a range of E-values was investigated and presented in this study. These results are very helpful for optimizing parameters in allergen prediction. Low specificity was obtained under the FAO/WHO criteria, and this improved significantly when we increased the number of matched amino acids or the identity threshold. It should be noted, however, that the computational complexity may rise accordingly when a longer matched sequence is required. In the long term, both the motif-based and SVM-based methods have a "re-build" problem, because one has to re-extract the motifs and re-build the SVM model whenever new allergens are identified and added to the positive database.
Furthermore, there are several issues that could be addressed in future studies. Firstly, the existing computational methods predict allergenicity with good precision for proteins that have high sequence similarity to known allergens, but they are less effective when the overall similarity is low. We still cannot clearly explain why one protein is more likely to become an allergen than another. Since allergenic proteins have been reported to have specific physiological functions and highly similar folding structures [31–34], incorporating protein family classification and folding or 3D structures into allergen prediction would help to address this issue. Secondly, more features besides the amino acid sequence, such as biochemical characteristics and subcellular location, can be included in SVM-based prediction. Lastly, the feature components may be ranked and selected by statistical methods to optimize the performance of the predictor [35, 36].
In summary, we systematically evaluated the performance of commonly used approaches for allergen prediction and developed an integrative web-based application, proAP, which offers users a more comprehensive, friendly and flexible way to search for or predict allergenic proteins.
This work was supported by funds from the National Basic Research Program of China (973 Program) (2012CB720804), the National Transgenic Plant Special Fund (2011ZX08011-006, 2011ZX08011-002, 2011BAK10B03), the Shanghai Pujiang Talent Program (12PJ1406600) and the Program for "Chen Xing" Young Scholars, Shanghai Jiao Tong University. We acknowledge Dr. Bing Zhang at Vanderbilt University for useful comments on this manuscript.
This supplement was funded by National Transgenic Plant Special Fund (2011ZX08011-006) and Program for "Chen Xing" Young Scholars, Shanghai Jiao Tong University.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 4, 2013: Special Issue on Computational Vaccinology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S4
- Taylor SL: Protein allergenicity assessment of foods produced through agricultural biotechnology. Annu Rev Pharmacol Toxicol. 2002, 42: 99-112. 10.1146/annurev.pharmtox.42.082401.130208.
- Lee YH, Sinko PJ: Oral delivery of salmon calcitonin. Adv Drug Deliv Rev. 2000, 42: 225-238. 10.1016/S0169-409X(00)00063-6.
- Mekori YA: Introduction to allergic diseases. Crit Rev Food Sci Nutr. 1996, 36 (Suppl.): S1-S18.
- Nieuwenhuizen NE, Lopata AL: Fighting food allergy: Current Approaches. Ann N Y Acad Sci. 2005, 1056: 30-45. 10.1196/annals.1352.003.
- Metcalfe DD, Astwood JD, Townsend R, Sampson HA, Taylor SL, Fuchs RL: Assessment of the allergenic potential of foods derived from genetically engineered crop plants. Crit Rev Food Sci Nutr. 1996, 36 (Suppl.): S165-S186.
- Codex Alimentarius Commission: Joint FAO/WHO Food Standard Program Codex Alimentarius Commission. 2001, Rome.
- FAO/WHO: Evaluation of allergenicity of Genetically Modified Foods. Report of a Joint FAO/WHO Expert Consultation on Allergenicity of Foods Derived from Biotechnology. 2003, Rome.
- Ladics GS: Current codex guidelines for assessment of potential protein allergenicity. Food Chem Toxicol. 2008, 46 (suppl. 10): S20-S23.
- Fiers MW, Kleter GA, Nijland H, Peijnenburg AA, Nap JP, van Ham RC: Allermatch™, a webtool for the prediction of potential allergenicity according to current FAO/WHO Codex alimentarius guidelines. BMC Bioinformatics. 2004, 5: 133. 10.1186/1471-2105-5-133.
- Zhang ZH, Koh JL, Zhang GL, Choo KH, Tammi MT, Tong JC: AllerTool: a web server for predicting allergenicity and allergic cross-reactivity in proteins. Bioinformatics. 2007, 23 (4): 504-506. 10.1093/bioinformatics/btl621.
- Kim C, Kwon S, Lee G, Lee H, Choi J, Kim Y, Hahn J: A database for allergenic proteins and tools for allergenicity prediction. Bioinformation. 2009, 3 (8): 344-345. 10.6026/97320630003344.
- Stadler MB, Stadler BM: Allergenicity prediction by protein sequence. FASEB J. 2003, 17 (9): 1141-1143.
- Saha S, Raghava GP: AlgPred: prediction of allergenic proteins and mapping of IgE epitopes. Nucleic Acids Research. 2006, 34: W202-W209. 10.1093/nar/gkl343.
- Soeria-Atmadja D, Lundell T, Gustafsson MG, Hammerling U: Computational detection of allergenic proteins attains a new level of accuracy with in silico variable-length peptide extraction and machine learning. Nucleic Acids Res. 2006, 34: 3779-3793. 10.1093/nar/gkl467.
- Martinez Barrio A, Soeria-Atmadja D, Nistér A, Gustafsson MG, Hammerling U, Bongcam-Rudloff E: EVALLER: a web server for in silico assessment of potential protein allergenicity. Nucleic Acids Research. 2007, 35: W694-W700. 10.1093/nar/gkm370.
- Cui J, Han LY, Li H, Ung CY, Tang ZQ, Zheng CJ, Cao ZW, Chen YZ: Computer prediction of allergen proteins from sequence-derived protein structural and physicochemical properties. Mol Immunol. 2007, 44 (4): 514-520. 10.1016/j.molimm.2006.02.010.
- Muh HC, Tong JC, Tammi MT: AllerHunter: A SVM-Pairwise System for Assessment of Allergenicity and Allergic Cross-Reactivity in Proteins. PLoS One. 2009, 4 (6): e5861. 10.1371/journal.pone.0005861.
- Ivanciuc O, Midoro-Horiuti T, Schein CH, Xie L, Hilliman GR, Goldblum RM, Braun W: The property distance index PD predicts peptides that cross-react with IgE antibodies. Mol Immunol. 2009, 46 (5): 873-883. 10.1016/j.molimm.2008.09.004.
- Schein CH, Ivanciuc O, Braun W: Structural Database of Allergenic Proteins (SDAP). Food Allergy. Edited by: Maleki, SJ. 2006, ASM Press, Washington D.C, 257-283.
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147: 195-197. 10.1016/0022-2836(81)90087-5.
- Perl 5.14.1. [http://www.perl.org/]
- Blast-2.2.23. [ftp://ftp.ncbi.nih.gov/blast/]
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology: 1994. 1994, Menlo Park, California, 28-36.
- Bailey TL, Gribskov M: Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998, 14: 48-54. 10.1093/bioinformatics/14.1.48.
- Chang C-C, Lin C-J: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2 (27): 1-27.
- Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000, 16: 412-424. 10.1093/bioinformatics/16.5.412.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proceedings of National Academy of Sciences of the United States of America. 1988, 85 (8): 2444-2448. 10.1073/pnas.85.8.2444.
- Marsh DG, Goodfriend L, King TP, Lowenstein H, Platts-Mills TA: Allergen nomenclature. Bull World Health Organ. 1986, 64: 767-74.
- Hoffmann-Sommergruber K: Pathogenesis-related (PR)-proteins identified as allergens. Biochem Soc Trans. 2002, 30 (Pt 6): 930-935.
- Ledesma A, Villalba M, Rodriguez R: Cloning, expression and characterization of a novel four EF-hand Ca(2+)-binding protein from olive pollen with allergenic activity. FEBS Lett. 2000, 466 (1): 192-196. 10.1016/S0014-5793(99)01790-1.
- Riascos JJ, Weissinger AK, Weissinger SM, Burks AW: Hypoallergenic legume crops and food allergy: factors affecting feasibility and risk. J Agric Food Chem. 2010, 58 (1): 20-27. 10.1021/jf902526y.
- Breiteneder H, Mills EN: Molecular properties of food allergens. J Allergy Clin Immunol. 2005, 115 (1): 14-23. 10.1016/j.jaci.2004.10.022.
- Peng H, Long F, Ding C: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005, 27: 1226-1238.
- Huang T, Shi XH, Wang P, He Z, Feng KY, Hu L, Kong X, Li YX, Cai YD, Chou KC: Analysis and prediction of the metabolic stability of proteins based on their sequential features, subcellular locations and interaction networks. PLoS One. 2010, 5 (6): e10972. 10.1371/journal.pone.0010972.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.