The value of position-specific scoring matrices for assessment of protein allergenicity
© Lim et al; licensee BioMed Central Ltd. 2008
- Published: 12 December 2008
Bioinformatics tools are commonly used for assessing potential protein allergenicity. While these methods have achieved good accuracies for highly conserved sequences, they are less effective when the overall similarity is low. In this study, we assessed the feasibility of using position-specific scoring matrices as a basis for predicting potential allergenicity in proteins.
Two simple methods for predicting potential allergenicity in proteins, based on general and group-specific allergen profiles, are presented. Testing results indicate that the performance of both methods is comparable to the best results of other methods. The group-specific profile approach, with a sensitivity of 84.04% and specificity of 96.52%, gives results similar to those obtained using the general profile approach (sensitivity = 82.45%, specificity = 96.92%).
We show that position-specific scoring matrices are highly promising for constructing computational models suitable for allergenicity assessment. These data suggest it may be possible to apply a targeted approach for allergenicity assessment based on the profiles of allergens of interest.
- Support Vector Machine
- Matthews Correlation Coefficient
- Dipeptide Composition
- Allergen Profile
Atopic allergy and other forms of hypersensitivity reactions pose a major concern for public health, affecting up to 25% of the population in industrial nations [1, 2]. With the rapid growth in the number of genetically modified (GM) foods, biopharmaceuticals and other biotechnology-derived products, identifying potential allergenicity in proteins has become crucial in product safety assessment [3, 4].
Laboratory-based allergenicity assessment methods such as the skin prick test and RAST (radioallergosorbent test) are often rigorous and time-consuming, so bioinformatics tools have gained favor as a way to accelerate the discovery of novel allergens. Guidelines to evaluate the allergenic potential of proteins have been jointly proposed by the Food and Agriculture Organization (FAO)/World Health Organization (WHO) Expert Consultation on Allergenicity of Foods Derived from Biotechnology [5]. According to the bioinformatics section of the guidelines, a protein is a potential allergen if it shares with a known allergen either a stretch of ≥ 6 contiguous identical amino acids or ≥ 35% sequence similarity over a window of 80 amino acids.
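The two FAO/WHO criteria can be illustrated with a minimal sketch. It is not the guideline's reference procedure: the function names are hypothetical, the 80-residue window is compared without gaps, and percent identity is used as a stand-in for the similarity measure, whereas practical implementations rely on a proper sequence alignment.

```python
def shares_exact_kmer(query: str, allergen: str, k: int = 6) -> bool:
    """True if the two sequences share an identical run of k contiguous residues."""
    kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in kmers for i in range(len(query) - k + 1))


def max_window_identity(query: str, allergen: str, window: int = 80) -> float:
    """Best fraction of identical positions over any ungapped pair of windows.

    Simplification (assumption): no gaps are allowed, and sequences shorter
    than `window` are compared over their common length but still divided
    by `window`.
    """
    w = min(window, len(query), len(allergen))
    if w == 0:
        return 0.0
    best = 0.0
    for i in range(len(query) - w + 1):
        for j in range(len(allergen) - w + 1):
            matches = sum(a == b for a, b in zip(query[i:i + w], allergen[j:j + w]))
            best = max(best, matches / window)
    return best


def is_potential_allergen(query, known_allergens, k=6, window=80, min_identity=0.35):
    """Flag the query if either criterion is met against any known allergen."""
    return any(
        shares_exact_kmer(query, a, k) or max_window_identity(query, a, window) >= min_identity
        for a in known_allergens
    )
```

Because a single shared hexapeptide is enough to trigger the rule, a sketch like this also makes it easy to see why the scheme tends to produce many false positives.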
Although useful in some cases, the FAO/WHO joint recommendation has been shown to produce a large number of false positives, resulting in specificities that are too low to be of practical use [6, 7]. To address these drawbacks, more sophisticated bioinformatics tools have been developed. These include support vector machines (SVM) [8], Gaussian classification algorithms [9, 10], wavelet transform models [11], allergen motifs [12], IgE sequence comparisons [13, 14] and the use of allergen-representative peptides (ARP) [15]. While these systems are effective for high-similarity allergen sequences, they are less effective when the overall similarity is low [16].
Position-specific scoring matrices (PSSM) have been very successful for detecting distantly related protein sequences [17–19], but have yet to be applied to assessing the allergenic potential of proteins. In this study, we examine the feasibility of using PSSM as a basis for developing an effective allergenicity prediction system. As will be seen below, the use of iterative PSI-BLAST in combination with various filters for accuracy optimization shows great promise for constructing general and group-specific profiles suitable for allergenicity assessment.
Prediction quality of the profile-based methods
General profile model
The predictive performance of the general allergen profile approach is in accordance with expected allergenic patterns in proteins, with an accuracy (ACC) greater than 85% (SE > 82%, SP > 85%) for E-value cutoffs of ≤ 10⁻¹. The approach performs best at an E-value threshold of 10⁻⁹ (ACC = 95.02%). At this threshold, the sensitivity and specificity of the model are 82.45% and 96.92%, respectively.
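The reported accuracy can be cross-checked against the sensitivity and specificity, since accuracy is their average weighted by class size. The sketch below assumes the test-set sizes given in the Methods (302 allergens and 2,000 non-allergens per fold) and that all folds have the same size, so the relation also holds for fold-averaged values; the function name is hypothetical.

```python
def accuracy_from_rates(sensitivity: float, specificity: float,
                        n_pos: int, n_neg: int) -> float:
    """Accuracy implied by sensitivity and specificity for given class sizes."""
    true_pos = sensitivity * n_pos   # expected correctly predicted allergens
    true_neg = specificity * n_neg   # expected correctly predicted non-allergens
    return (true_pos + true_neg) / (n_pos + n_neg)


# General profile model at the 1e-9 threshold: SE = 82.45%, SP = 96.92%.
acc = accuracy_from_rates(0.8245, 0.9692, n_pos=302, n_neg=2000)
print(f"{acc:.2%}")  # 95.02%
```

Because non-allergens outnumber allergens by roughly 6.6 to 1, the accuracy is dominated by the specificity.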
Group-specific profile model
Allergen sequences are currently classified into 9 major groups by the IUIS Allergen Nomenclature Sub-Committee (http://www.allergen.org) – i) weeds, ii) fungi, iii) grasses, iv) trees, v) mites, vi) animals, vii) insects, viii) food, and ix) others. We constructed group-specific profiles based on all 9 major allergen groups, and tested their capability in predicting allergen sequences. As illustrated in Table 1, the approach achieved performance similar to that of the general profile model, and can predict allergens with high accuracy (ACC > 84%, SE > 84%, SP > 84%) at E-value thresholds of ≤ 10⁻¹. The best performance is observed at the E-value threshold of 10⁻⁹ (ACC = 94.88%). At this threshold, the sensitivity and specificity of the model are 84.04% and 96.52%, respectively.
Average prediction quality of the group-specific profiles. Performance of group-specific profile models at an E-value threshold of 10⁻⁹.
Comparison with existing methods
To benchmark the performance of the profile-based prediction methods, the five testing datasets, each consisting of 302 allergen sequences and 2000 non-allergen sequences, were used to evaluate six available techniques – the FAO/WHO evaluation scheme, SVM global description approach, SVM amino acid composition approach, SVM dipeptide composition approach, MEME motif discovery tool and ARP technique. The overall performance of each technique is indicated by the average performance over the five datasets.
Comparison of the performance between the profile-based methods and existing allergenicity prediction systems. Methods listed in the table: general profile model; group-specific profile model; SVM (global description); SVM (aa composition); SVM (dipeptide composition); MEME/MAST motifs.
Profile-based methods are shown to be highly promising for assessing potential allergenicity and cross-reactivity in proteins, with sensitivities and specificities of over 80%. The strength of such models lies in their ability to detect distantly related protein homologues through the use of iterated profiles [17–19]. To date, the exact mechanisms of allergy remain unclear, as the structural, functional or biochemical properties of allergens that lead to allergic responses have yet to be elucidated. The allergen profiles constructed in this study may also be used as a basis for identifying common amino acid residues or physicochemical properties that support allergenicity.
The dataset was shuffled randomly and partitioned into five sets for five-fold cross validation, each time using one set for testing and the remaining four sets for training. Each training set contains 1,208 experimentally determined allergens and 8,000 non-allergens, while each testing set contains 302 experimentally determined allergens and 2,000 non-allergens.
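The partitioning can be sketched as follows. The fold sizes quoted above imply 1,510 allergens and 10,000 non-allergens in total, and that the two classes were split separately so that every fold has the same composition; this stratification is an inference from those numbers, and the function names and placeholder identifiers are hypothetical.

```python
import random


def stratified_folds(allergens, non_allergens, k=5, seed=0):
    """Shuffle each class separately and deal it into k equal-composition folds."""
    rng = random.Random(seed)
    pos, neg = list(allergens), list(non_allergens)
    rng.shuffle(pos)
    rng.shuffle(neg)
    return [{"allergens": pos[i::k], "non_allergens": neg[i::k]} for i in range(k)]


def train_test_splits(folds):
    """Yield (train, test) pairs, holding out each fold once."""
    for i, test in enumerate(folds):
        rest = [f for j, f in enumerate(folds) if j != i]
        train = {
            "allergens": [s for f in rest for s in f["allergens"]],
            "non_allergens": [s for f in rest for s in f["non_allergens"]],
        }
        yield train, test


# Placeholder identifiers standing in for real sequences.
folds = stratified_folds([f"A{i}" for i in range(1510)],
                         [f"N{i}" for i in range(10000)])
splits = list(train_test_splits(folds))
```

Each of the five splits then holds 1,208 allergens and 8,000 non-allergens for training and 302 allergens and 2,000 non-allergens for testing, matching the sizes stated above.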
Method 1: general allergen profiles
During the initial screening step, a PSI-BLAST search (10 iterations, E-value threshold 10⁻³) was performed on each allergen sequence in the training set against all other allergen sequences in the dataset. This generates a profile, or PSSM, for each allergen protein sequence. In this study, a minimum of two sequences was used for constructing a profile.
In the optimization step, another round of PSI-BLAST searches was performed on each of the selected allergen sequences using eight different E-value thresholds (10, 1, 10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁶ and 10⁻⁹). This generates eight profiles for each allergen sequence, one per E-value threshold. Each of the eight profiles was tested by RPS-BLAST using allergen sequences in the training set as queries. For each allergen sequence in the training dataset, the best profile (the one with the highest accuracy) was selected and incorporated into the predictive model. This approach produces a collection of general allergen profiles optimized for accuracy and performance.
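The selection logic of the optimization step can be sketched independently of the BLAST tooling. In the sketch below, `accuracy_of` is a hypothetical callback standing in for "build a profile for this sequence with PSI-BLAST at this threshold and score it with RPS-BLAST on the training set"; how ties between thresholds were broken is not stated in the text, so the sketch simply keeps the first best threshold in the listed order.

```python
from typing import Callable, Dict, Iterable, Sequence

# The eight E-value thresholds listed in the optimization step.
E_VALUE_THRESHOLDS = (10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-6, 1e-9)


def select_best_profiles(
    sequence_ids: Iterable[str],
    accuracy_of: Callable[[str, float], float],
    thresholds: Sequence[float] = E_VALUE_THRESHOLDS,
) -> Dict[str, float]:
    """For each sequence, keep the threshold whose profile scored the highest accuracy."""
    best = {}
    for seq_id in sequence_ids:
        # max() returns the first maximal element, so ties go to the
        # threshold that appears earliest in `thresholds` (an assumption).
        best[seq_id] = max(thresholds, key=lambda t: accuracy_of(seq_id, t))
    return best
```

The resulting mapping from sequence to chosen threshold identifies which of the eight candidate profiles would enter the final profile collection.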
Method 2: group-specific allergen profiles
This method predicts protein allergenicity by performing an RPS-BLAST search against a database of group-specific allergen profiles optimized for accuracy and performance.
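A minimal sketch of how RPS-BLAST hits might be turned into a prediction is given below. It assumes the search results are available in the 12-column tab-separated BLAST tabular layout, with the query identifier in the first column and the E-value in the eleventh, and that a query is called an allergen when its best hit against any profile has an E-value at or below the chosen cutoff. Neither the output format nor the exact decision rule is specified in the text, and the function names are hypothetical.

```python
def best_evalues(tabular_lines):
    """Map each query ID to the smallest E-value among its hits."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        best[query_id] = min(evalue, best.get(query_id, float("inf")))
    return best


def predict_allergens(query_ids, tabular_lines, cutoff=1e-9):
    """Label a query True (allergen) if its best hit meets the E-value cutoff."""
    best = best_evalues(tabular_lines)
    return {q: best.get(q, float("inf")) <= cutoff for q in query_ids}
```

Queries with no hit at all receive an infinite best E-value and are therefore labeled non-allergens at any finite cutoff.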
Allergen sequences in the training set were partitioned into nine groups – i) weeds, ii) fungi, iii) grasses, iv) trees, v) mites, vi) animals, vii) insects, viii) food, and ix) others, according to the recommendation by the IUIS Allergen Nomenclature Sub-Committee. For the screening phase, PSI-BLAST was performed by partitioning allergens into the 9 major groups and using individual groups of allergens as the training dataset. This generates profiles specific to each particular group of allergens, which are subsequently optimized according to their predictive accuracy and used for constructing group-specific allergenicity prediction systems.
The Matthews correlation coefficient (MCC) returns a value between -1 and 1: MCC = 1 for 100% agreement of the prediction, MCC = 0 for completely random prediction and MCC = -1 for 100% disagreement of the prediction.
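For reference, the performance measures used in this study can be computed from the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The sketch below uses the standard definitions of sensitivity, specificity, accuracy and MCC; the function name is hypothetical, and returning 0 when a denominator vanishes is a convention adopted here, not something taken from the text.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity (SE), specificity (SP), accuracy (ACC) and MCC from counts."""
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): MCC is reported as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}
```

Unlike accuracy, the MCC is not inflated by the large excess of non-allergens in the test sets, which makes it a useful complement to ACC on imbalanced data.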
Five-fold cross validation
Five-fold cross validation was performed to assess the quality of all predictive models described in this study. In k-fold cross-validation, k random, (approximately) equal-sized, disjoint partitions of the sample data are constructed, and a given model is trained on (k-1) partitions and tested on the excluded partition. The results are averaged after k such experiments, and the observed error rate may be taken as an estimate of the error rate expected upon generalization to new data.
This work was supported in part by grant R-154-000-265-112 from the National University of Singapore.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 12, 2008: Asia Pacific Bioinformatics Network (APBioNet) Seventh International Conference on Bioinformatics (InCoB2008). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S12.
- Mekori YA: Introduction to allergic diseases. Crit Rev Food Sci Nutr 1996, 36: S1–18.
- Nieuwenhuizen NE, Lopata AL: Fighting food allergy: current approaches. Ann NY Acad Sci 2005, 1056: 30–45. 10.1196/annals.1352.003
- Goodman RE, Hefle SL, Taylor SL, Ree RV: Assessing genetically modified crops to minimize the risk of increased food allergy: a review. Int Arch Allergy Immunol 2005, 137: 153–166. 10.1159/000086314
- Heppenheimer TA: The growth of genetically modified foods. Am Herit Invent Technol 2003, 19: 16–25.
- FAO/WHO: Codex Principles and Guidelines on Foods Derived from Biotechnology. Joint FAO/WHO Food Standards Programme, Rome, Italy; 2003.
- Fiers MW, Kleter GA, Nijland H, Peijnenberg AA, Nap JP, van Ham RC: Allermatch, a webtool for the prediction of potential allergenicity according to current FAO/WHO Codex alimentarius guidelines. BMC Bioinformatics 2004, 5: 133. 10.1186/1471-2105-5-133
- Hileman RE, Silvanovich A, Goodman RE, Rice EA, Holleschak G, Astwood JD, et al.: Bioinformatic methods for allergenicity assessment using a comprehensive ALLERGEN database. Int Arch Allergy Immunol 2002, 128: 280–291. 10.1159/000063861
- Cui J, Han LY, Li H, Ung CY, Tang ZQ, Zheng CJ, Cao ZW, Chen YZ: Computer prediction of allergen proteins from sequence-derived protein structural and physicochemical properties. Mol Immunol 2007, 4: 514–520. 10.1016/j.molimm.2006.02.010
- Soeria-Atmadja D, Zorzet A, Gustafsson MG, Hammerling U: Statistical evaluation of local alignment features predicting allergenicity using supervised classification algorithms. Int Arch Allergy Immunol 2004, 133: 101–112. 10.1159/000076382
- Zorzet A, Gustafsson M, Hammerling U: Prediction of food protein allergenicity: a bioinformatic learning systems approach. In Silico Biol 2002, 2: 525–534.
- Li KB, Isaac P, Krishnan A: Predicting allergenic proteins using wavelet transform. Bioinformatics 2004, 20: 2572–2578. 10.1093/bioinformatics/bth286
- Stadler MB, Stadler BM: Allergenicity prediction by protein sequence. FASEB J 2003, 17: 1141–1143.
- Ivanciuc O, Schein CH, Braun W: SDAP: database and computational tools for allergenic proteins. Nucleic Acids Res 2003, 31: 359–362. 10.1093/nar/gkg010
- Saha S, Raghava GPS: AlgPred: prediction of allergenic proteins and mapping of IgE epitopes. Nucleic Acids Res 2006, 34: W202-W209. 10.1093/nar/gkl343
- Björklund AK, Soeria-Atmadja D, Zorzet A, Hammerling U, Gustafsson MG: Supervised identification of allergen-representative peptides for in silico detection of potentially allergenic proteins. Bioinformatics 2005, 21: 39–50. 10.1093/bioinformatics/bth477
- Tong JC, Tammi MT: Methods and protocols for the assessment of protein allergenicity and cross-reactivity. Front Biosci 2008, 13: 4882–4888. 10.2741/3047
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
- Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292: 195–202. 10.1006/jmbi.1999.3091
- Xie D, Li A, Wang M, Fan Z, Feng H: LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acids Res 2005, 33: 105–110. 10.1093/nar/gki359
- Tong JC, Zhang GL, Tan TW, August JT, Brusic V, Ranganathan S: Prediction of HLA-DQ3.2β ligands: Evidence of multiple registers in class II binding peptides. Bioinformatics 2006, 22: 1232–1238. 10.1093/bioinformatics/btl071
- King TP, Hoffman D, Lowenstein H, Marsh DG, Platts-Mills TA, Thomas W: Allergen nomenclature. WHO/IUIS allergen nomenclature subcommittee. Int Arch Allergy Immunol 1994, 105: 224–233.
- Breiteneder H, Mills EN: Molecular properties of food allergens. J Allergy Clin Immunol 2005, 115: 14–23. 10.1016/j.jaci.2004.10.022
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, et al.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31: 365–370. 10.1093/nar/gkg095
- Dennis AB, Ilene KM, David JL, James O, David LW: Genbank. Nucleic Acids Res 2005, 33: D34-D38. 10.1093/nar/gni032
- Mari A, Mari V, Ronconi A: Allergome – a database of Allergenic molecules: structure and data implementations of a web-based resource. J Allergy Clin Immunol 2005, 115: S87. 10.1016/j.jaci.2004.12.359
- Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16: 412–424. 10.1093/bioinformatics/16.5.412
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.