
Classification epitopes in groups based on their protein family



The humoral immune response is based on the interaction between antibodies and antigens for the clearance of pathogens and foreign molecules. This interaction occurs at specific positions known as antigenic determinants or B-cell epitopes. The experimental identification of epitopes is costly and time-consuming. The use of in silico methods to help discover new epitopes is therefore an appealing alternative, given the importance of biomedical applications such as vaccine design, disease diagnostics, anti-venoms and immunotherapeutics. However, prediction performance is not optimal, with an accuracy of around 70%. Further research could increase our understanding of the biochemical and structural properties that characterize a B-cell epitope.


We investigated whether linear epitopes from the same protein family share common properties. This hypothesis led us to analyze physico-chemical (PCP) and predicted secondary structure (PSS) features of a curated dataset of epitope sequences available in the literature and belonging to two different groups of antigens (metalloproteinases and neurotoxins). Using data mining techniques, we identified statistically significant parameters that distinguish neurotoxin epitopes from metalloproteinase epitopes, and both from random sequences. After five-fold cross-validation, PCP-based models obtained area under the curve (AUC) and accuracy values above 0.9 for regression, decision tree and support vector machine.


We demonstrated that the antigen's family can be inferred from properties within a single group of linear epitopes (metalloproteinases or neurotoxins). We also identified the characteristics that represent these two epitope groups, including their similarities to and differences from random peptides and their respective amino acid sequences. These findings open new perspectives for improving epitope prediction by considering the specific protein family of the antigen. We expect that they will help to improve current computational mapping methods based on physico-chemical properties, owing to their potential application during epitope discovery.


Living organisms often encounter a pathogenic virus, microbe or other foreign molecule during their lifetime [1]. The B cells of the immune system recognize the antigen of the foreign body or pathogen through their membrane-bound immunoglobulin receptors and later produce antibodies against this antigen [2, 3]. The recognized sites on the antigen's surface, known as epitopes, represent the minimum wedge recognized by the immune system [4]. Therefore, epitopes lie at the heart of the humoral immune response [5]. The rapid reaction to a previously encountered antigen depends on the binding ability of the antibodies found in the immune system of the organism [6], the physico-chemical properties of the epitope and its structural conformation [7]. Thus, understanding epitope characteristics and how they are recognized in sufficient detail would allow us to identify and predict their position in the antigen [8].

The main objective of epitope prediction is to design a molecule that can replace an antigen in the process of either antibody production or antibody detection [4, 9-11]. Such a molecule can be synthesized in the case of peptides or, in the case of a larger protein, produced by yeast after the gene is cloned into an expression vector [12]. After 30 years of research, it is known that the optimum size of peptides possessing cross-reactive immunogenicity is between 10 and 15 amino acids [13]. The earliest efforts to understand and predict B-cell epitopes were based on amino acid properties such as flexibility [14], hydropathy [15], antigenicity [7], beta turns [16] and accessibility [17]. Epitope prediction is important for designing epitope-based vaccines and precise diagnostic tools, such as diagnostic immunoassays for the detection, isolation and characterization of associated molecules in various disease states. These benefits are of undoubted medical importance [18, 19].

Recently developed prediction methods face several challenges, such as data quality [20, 7], a limited number of positive learning examples [21] and the difficulty of choosing appropriate negative learning examples [22]. Negative training samples may harbor genuine B-cell epitopes and affect the training procedure, resulting in poor classification performance [23, 24]. Moreover, none of the published work took the protein family or function into account to predict epitopes [25].

The present study explores the possibility that epitopes belonging to the same protein family share common properties. For this purpose, amino acid statistics, physico-chemical properties and structural properties were compared with each other [26] for two protein groups. This assumption is based on previous studies showing that there are trends in amino acid composition and shared properties for intravenous immunoglobulins [27]. Despite the difficulty of distinguishing epitopes from non-epitopes [28], the addition of information such as evolutionary and propensity scales has proved helpful for epitope prediction [21]. It is therefore reasonable to assume that including information about the antigen's protein family may help to improve prediction.


Dataset composition

We obtained 106 experimentally validated linear B-cell epitopes for two groups of antigens (metalloproteinases and neurotoxins) extracted from PubMed.

They were manually curated until September 2012 following several search criteria based on the keywords epitope, metalloproteinase, proteinase, peptidase, toxin and neurotoxin, used jointly and separately. Redundancy was removed for repeated sequences using 100% identity as the threshold, and the maximum epitope length was fixed at 32 amino acids. As non-epitope data, we created 49 linear random peptides, a number proportional to the mean number of epitopes in the metalloproteinase and neurotoxin groups. These random peptides are based on statistics from the UniProtKB/Swiss-Prot dataset, meaning that the overall amino acid composition of the random peptides matches the percentages found in the UniProt database. The final set contained 99 non-redundant epitopes (29 from metalloproteinases and 70 from neurotoxins) and 49 random peptides, as shown in Additional file 1.
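The dataset preparation described above can be sketched in a few lines. This is only an illustration under stated assumptions: "100% identity" is read as an exact string match, the function names are hypothetical, and the background frequencies are a uniform placeholder rather than the UniProtKB/Swiss-Prot percentages, which would have to be supplied by the reader.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def remove_redundancy(sequences, max_length=32):
    """Drop exact duplicates (read here as 100% identity) and overlong sequences."""
    seen = set()
    kept = []
    for seq in sequences:
        seq = seq.strip().upper()
        if seq and len(seq) <= max_length and seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept

def random_peptides(n, length, background, seed=0):
    """Sample n peptides of fixed length from background residue frequencies."""
    rng = random.Random(seed)
    residues = list(background)
    weights = [background[r] for r in residues]
    return ["".join(rng.choices(residues, weights=weights, k=length))
            for _ in range(n)]

# Placeholder background: uniform frequencies, NOT the Swiss-Prot values.
uniform_background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

# Hypothetical input sequences, used only to exercise the functions.
epitopes = remove_redundancy(["ACDKLM", "acdklm", "GHIKW", "A" * 40])
negatives = random_peptides(n=49, length=14, background=uniform_background)
```

Sampling residues independently reproduces the background composition only approximately for a set of 49 short peptides, so an exact match to the database percentages would need an additional adjustment step.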

Feature selection for data mining analysis

In this study, we generated and used 33 physico-chemical parameters: aliphatic index, GRAVY, isoelectric point, the percentage content of each amino acid, and amino acid groups such as hydrophobic (AVILMFYW), positively charged (RHK), negatively charged (DE), uncharged (STNQ) and special (SGP), as described by Gasteiger, with the difference that each feature was transformed to a percentage to remove the effect of length differences between epitope sequences [29]. We also used 6 predicted secondary structure properties (strand, helix, coil, relative surface accessibility, absolute surface accessibility and z-fit), which were calculated with the Netsurf algorithm [29]. These parameters were calculated for the three groups under study (metalloproteinase, neurotoxin and random) and the results were compared using the Welch two-sample t-test available in the statistical software R. In total, we evaluated 3 different matrices to discover how much sequence-derived information was needed to obtain a good classification: the first based purely on PCP information, the second on PSS data only, and the third formed by adding the PSS features to the PCP matrix.
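As a rough illustration of how some of these sequence-derived features can be computed, the sketch below derives amino acid percentages, residue-group percentages, GRAVY (using the Kyte-Doolittle hydropathy scale) and the aliphatic index (Ikai's formula). It is not the authors' pipeline: the function names are hypothetical, only the 20 standard residues are assumed to occur, and the isoelectric point, the "special" group and the PSS features are left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale, used for GRAVY.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Residue groups as listed in the text (the "special" group is omitted here).
GROUPS = {
    "hydrophobic": "AVILMFYW",
    "positive": "RHK",
    "negative": "DE",
    "uncharged": "STNQ",
}

def composition(seq):
    """Percentage of each of the 20 standard amino acids in seq."""
    return {aa: 100.0 * seq.count(aa) / len(seq) for aa in AMINO_ACIDS}

def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

def aliphatic_index(seq):
    """Ikai's aliphatic index from the mole percentages of A, V, I and L."""
    pct = composition(seq)
    return pct["A"] + 2.9 * pct["V"] + 3.9 * (pct["I"] + pct["L"])

def pcp_features(seq):
    """Length-independent feature dictionary for one peptide."""
    seq = seq.upper()
    features = {f"pct_{aa}": v for aa, v in composition(seq).items()}
    for name, members in GROUPS.items():
        features[f"pct_{name}"] = sum(features[f"pct_{aa}"] for aa in members)
    features["gravy"] = gravy(seq)
    features["aliphatic_index"] = aliphatic_index(seq)
    return features

# Hypothetical four-residue peptide, used only to exercise the functions.
example = pcp_features("AVKD")
```

Expressing every count as a percentage is what makes peptides of 4 and 32 residues comparable in the same matrix.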

Selection of data mining methods and statistical analysis

The Konstanz Information Miner (KNIME) [30] was used to evaluate k-means (KM), decision tree [31] (DT), naive Bayes classifier (NB) and support vector machine [32] (SVM) on the matrices generated from our dataset. The free software environment R for statistical computing and graphics was used to create the linear multiple regression (LMR) models. For LMR, the nominal class variable was transformed into a numerical variable for the two groups: a positive value, log(0.99/(1-0.99)), for metalloproteinases and a negative value, log(0.01/(1-0.01)), for neurotoxins. The linear model function available in R was used to fit a model in which the class variable was expressed as a linear combination of the feature variables. For each fitted model a p-value was calculated, and the model was rejected for any p-value greater than 0.005. The predicted score of the model was rescaled to the range 0 to 1 with the inverse of the transformation above, exp(predicted value)/(1 + exp(predicted value)). The performance of all generated models was evaluated at every possible decision threshold with the ROCR package, using AUC (the area under the curve formed by the true and false positive rates) and accuracy, which together give an overall view of the performance of the classification method used [33].
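The regression scoring and evaluation steps can be illustrated with a small self-contained sketch. It assumes a single feature instead of the full feature matrix, uses plain least squares rather than R's linear model function, and computes AUC and accuracy directly rather than through ROCR. The toy data and function names are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Map a regression score back to the interval (0, 1)."""
    return math.exp(x) / (1 + math.exp(x))

# Class encoding described in the text.
TARGET = {"metalloproteinase": logit(0.99), "neurotoxin": logit(0.01)}

def fit_simple_regression(x, y):
    """Ordinary least squares for y = a + b*x with a single feature."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def auc(scores, labels):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def best_accuracy(scores, labels):
    """Highest accuracy over decision thresholds taken from the scores."""
    best = 0.0
    for t in sorted(set(scores)):
        correct = sum((s >= t) == (l == 1) for s, l in zip(scores, labels))
        best = max(best, correct / len(labels))
    return best

# Hypothetical toy data: one feature value per epitope.
feature = [12.0, 10.0, 9.0, 3.0, 2.0, 4.0]
classes = ["metalloproteinase"] * 3 + ["neurotoxin"] * 3
a, b = fit_simple_regression(feature, [TARGET[c] for c in classes])
scores = [inverse_logit(a + b * v) for v in feature]
labels = [1 if c == "metalloproteinase" else 0 for c in classes]
```

Encoding the classes as logit(0.99) and logit(0.01) rather than 1 and 0 keeps the regression target finite, and the inverse logit then returns scores that can be read on a 0 to 1 scale.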


Statistical differences of amino acid composition between metalloproteinase and neurotoxin linear epitopes compared with random sequences

The dataset contains 11 metalloproteinases and 16 neurotoxins. These two protein families (or groups) contain 29 and 70 epitopes, respectively, with an average sequence length of 13.8 amino acids (aa). The minimum length was 4 aa and the maximum 32 aa. The negative (non-epitope) set contained 49 sequences of 14 aa in length (Table 1).

Table 1 Dataset composition

These epitope groups also showed variation when compared with our non-epitope control for the amino acids K, C, A, V and I in metalloproteinases and R, K, D, N, Q, C, A, I, K, M and W in neurotoxins (Table 2, columns 2 and 3). As expected, we also detected differences in other parameters such as aliphatic index, grand average of hydropathy and isoelectric point (Table 2, last three rows). We were therefore able to identify common characteristics in epitope composition within each antigen group, as well as differences between the neurotoxin and metalloproteinase epitope groups.
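The comparison of group means with the Welch two-sample t-test can be sketched as follows. The sketch computes only the t statistic and the Welch-Satterthwaite degrees of freedom; the p-value additionally requires the t distribution, which R's t.test provides. The sample values are hypothetical and are not taken from Table 2.

```python
import math

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = sum(sample_a) / na, sum(sample_b) / nb
    # Unbiased sample variances; the two groups are NOT assumed to share one.
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    se_a, se_b = var_a / na, var_b / nb
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df

# Hypothetical percentages of one amino acid in two small groups of peptides.
group_epitopes = [12.0, 15.0, 9.0, 14.0]
group_random = [5.0, 6.0, 7.0, 6.0, 6.0]
t_stat, dof = welch_t(group_epitopes, group_random)
```

Welch's variant suits this dataset because the groups differ in size (29, 70 and 49 sequences) and need not have equal variances.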

Table 2 Analysis of means for all datasets with the Welch two-sample t-test

Decision tree and multiple regression models can distinguish linear B-cell epitopes from two different antigen groups

We investigated whether we could determine if an epitope belonged to a neurotoxin or a metalloproteinase based on the statistically significant differences observed in epitope amino acid composition, isoelectric point, GRAVY and aliphatic index (Table 2). For this purpose, we used five different methods: SVM, NB, DT, KM and LMR.

Our analysis used three different input matrices for each algorithm, as described before: only physico-chemical properties (PCP), only predicted secondary structure (PSS) and the combination of both (PCP+PSS). The performance of all data mining methods, expressed as AUC values, is shown in Table 3. All the methods except KM were able to group and correctly distinguish both groups of epitopes. As expected, the best results were obtained with SVM, followed by similar performance from the much simpler techniques LMR and DT.
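A minimal sketch of how the three input matrices and a five-fold split could be organized is given below. The class sizes and feature counts (33 PCP, 6 PSS) come from the text; the feature values are placeholders, and stratifying the folds by class is an assumption, since the text does not say how the folds were drawn.

```python
import random

def combine(pcp_rows, pss_rows):
    """Concatenate PCP and PSS feature vectors row by row (the PCP+PSS matrix)."""
    return [pcp + pss for pcp, pss in zip(pcp_rows, pss_rows)]

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Class sizes from the text; the feature values themselves are placeholders.
labels = ["metalloproteinase"] * 29 + ["neurotoxin"] * 70
pcp = [[0.0] * 33 for _ in labels]
pss = [[0.0] * 6 for _ in labels]
matrices = {"PCP": pcp, "PSS": pss, "PCP+PSS": combine(pcp, pss)}
folds = stratified_folds(labels, k=5)
```

Each fold then serves once as the held-out test set while the remaining four are used for training, and the same folds can be reused across the three matrices so that the methods are compared on identical splits.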

Table 3 Performance of all data mining methods shown as AUC and accuracy.

When PSS features were used as input, a reduction in performance of 0.1-0.3 in AUC was observed for the LMR and NB techniques (Table 3). Only SVM and DT obtained an AUC above 0.9, while all the other methods performed poorly, with an AUC of 0.65 for LMR and close to 0.5 for the others. With the combined properties, SVM reached an AUC of 1.0 and LMR showed a slight increase from 0.9 to 1.0. On the other hand, DT, NB and KM stayed the same (Table 3). These results indicate that the type of input used (PSS or PCP) was not significant, and the models based on PCP were the simplest to analyze and understand. The most stable AUC results were obtained with the DT method, for which all the matrices analyzed resulted in an AUC of around 0.95.

The DT and LMR techniques are statistical approaches that showed results similar to those of SVM, which is a non-statistical classifier. These methods allowed us to discriminate between epitopes belonging to metalloproteinases and neurotoxins and to identify the important properties within these groups. The relevant features used by the LMR and DT models to classify the epitope groups can be found in Table 4.

Table 4 Properties used by the classification models up to the 8th rank out of 39.

We observed which amino acids were critical for differentiating neurotoxin epitopes from metalloproteinase epitopes. In the LMR model, the amino acids asparagine (N), glutamine (Q) and serine (S), and in the DT model the amino acids lysine (K), aspartate (D) and methionine (M), were key to achieving good classification (above 0.9 AUC) (Table 4).
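To illustrate how a decision tree can single out an individual amino acid percentage as a discriminating feature, the sketch below searches for the best single split on one feature (a one-level tree, or decision stump). It is a simplified stand-in for the KNIME decision tree used in the study; the data are hypothetical and the direction of the split carries no biological meaning.

```python
def best_stump(values, labels):
    """Find the threshold on one feature that best separates two classes.

    Returns (threshold, class_if_above, accuracy) for the rule
    'predict class_if_above when value > threshold, else the other class'.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("best_stump expects exactly two classes")
    ordered = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]
    best = None
    for t in candidates:
        # Try both orientations of the rule for this threshold.
        for above, below in (classes, classes[::-1]):
            correct = sum((above if v > t else below) == l
                          for v, l in zip(values, labels))
            acc = correct / len(labels)
            if best is None or acc > best[2]:
                best = (t, above, acc)
    return best

# Hypothetical percentages of one amino acid; not taken from the paper's dataset.
feature_pct = [2.0, 4.0, 5.0, 11.0, 13.0, 16.0]
group = ["metalloproteinase"] * 3 + ["neurotoxin"] * 3
threshold, class_above, accuracy = best_stump(feature_pct, group)
```

A full decision tree repeats this search recursively on each side of the split and over all 39 features, which is why the features chosen near the root are read as the most informative ones.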


Amino acid composition has been investigated for proteins related to the B-cell response [34] and as a key to understanding protein-protein interactions [35, 36], alongside its role in the prediction of epitopes for both T and B cells [37]. Epitopes are rich in charged and polar amino acids and low in aliphatic hydrophobic amino acids when the epitope amino acid distribution is compared with either the entire PDB database [38] or the antigen [39, 40]. Rubinstein [39] also suggested that the amino acid Tyr is significantly over-represented in epitopes and that Val is significantly depleted. Interestingly, the residues Arg and Lys are more frequent in the epitopes of our dataset, along with other differences such as aliphatic index and GRAVY. These particularities are probably a result of focusing on common features in a diverse epitope group, a phenomenon that was also evident in the amino acid composition found in epitopes of papilloma viruses [22]. PCP-based methods have been explored in detail for epitope prediction [40], with some limitations in terms of specificity and precision, as seen in SVM models with AUC values of 0.85 for amino acid composition and 0.58, where the accuracy never surpasses 0.8 [26].

Our study suggests an improvement in performance when a single epitope group is targeted, resulting in AUC and accuracy above 0.9. We included groups of amino acids based on type of charge and side chain because of the concept that amino acids work cooperatively in protein-protein interfaces [41]. Our results indicate that these amino acid groups, such as hydrophobic, polar or special amino acids (CGP), do not possess significance for the prediction models by themselves but may add value when combined with single amino acid statistics.

The secondary structure of epitopes has also been investigated by several authors [42-44], and epitopes are in general reported to have significantly fewer strands and helices and significantly more loops than the rest of the antigen [8, 38]. The over-representation of loops is small but significant and agrees with the perception that protein-protein binding sites are flexible regions [41]. The overall secondary structure of epitopes has been reported to be different from that of regular protein-protein interfaces [23], based on crystal structures available in the PDB, indicating some structural particularities of the Ab-Ag interaction [45]. These particularities could also be family-restricted, which would be interesting to explore with computational methods even though secondary structure is predicted from sequence with an accuracy of 79% [46]; however, the DT outcome showed no real relevance of PSS features when applied to epitope classification. The inclusion of predicted secondary structure, as is commonly done [40], could be a source of misleading results for the prediction, an issue that has been reviewed briefly in the literature [47].

The features that characterize each epitope group could represent the complementary data needed to improve epitope prediction. For example, adding evolutionary information to the prediction improved performance [48], despite recent studies reporting that no relation exists between epitope and antigen sequences [28]. We showed that a wide range of data mining methods, including support vector machine [21], decision tree [48], regression [26] and the naive Bayes classifier, had similarly successful results, shedding some light on the question of which characteristics are important for these epitope groups. It is important to note that we used amino acid percentages [4], in contrast to some recent epitope prediction methods that prefer propensities [12]. The data normalization applied in the present study is based on the assumption that each feature is equally relevant for any protein sequence-based analysis [9]. We also demonstrated that, regardless of the method, it was possible to classify the studied groups, which points to the importance of the quality of the data used [49].


Our study indicates that linear epitopes that belong to a single protein family share common properties, which differ from those of epitopes from other families, as demonstrated for neurotoxins and metalloproteinases. We confirmed our hypothesis with five different data mining algorithms, probabilistic and non-probabilistic, which showed similar results except for k-means. The proposed models allowed us to separate the studied groups from random sequences based on UniProt statistics. The models based only on PCP features were sufficient to show and identify the differences between epitope groups. We therefore demonstrate that considering the epitope's protein family can reveal unseen patterns within epitope groups that could be used to improve epitope discovery.



Support Vector Machine (SVM)
Naive Bayes (NB)
Decision Tree (DT)
Linear Multiple Regression (LMR)
Protein Data Bank (PDB)
Position Specific Matrix
Absolute Surface Area
Relative Surface Area
Area Under the Curve (AUC)
Receiver Operating Characteristic
Metalloproteinase epitopes
Metalloproteinase proteins
Neurotoxin epitopes
Neurotoxin proteins


  1. Cochrane Norris Charles: Thucydides and the Science of History. 1929, Oxford University Press, 35 (3): 584-585. Apr

  2. Burnet FM: A modification of Jerne's theory of antibody. Australian Journal of Science. 1957, 20: 67-69.

  3. Jerne NK: The natural-selection theory of antibody formation. Proceedings of the National Academy of Sciences. 1955, 41: 849-857.

  4. Perlow DS, Boger J, Emini EA, Hughes JV: Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol. 1985, 55 (3): 836-839.

  5. Silverstein AM: A History of Immunology. 1989, Academic Press, San Diego

  6. Abbas Andrew, Lichtman Abul: Cellular and Molecular Immunology. 2005, 5 (1): 3-14.

  7. Greenbaum JA, Andersen PH, Blythe M, Bui HH, Cachau RE, Crowe J, Davies M, Kolaskar AS, Lund O, Morrison S, Mumey B, Ofran Y, Pellequer JL, Pinilla C, Ponomarenko JV, Raghava GP, van Regenmortel MH, Roggen EL, Sette A, Schlessinger A, Sollner J, Zand M, Peters B: Towards a consensus on datasets and evaluation metrics for developing B-cell epitope prediction tools. J. Mol. Recognit. 2007, 20 (2): 75-82.

  8. Yang J, Chou KC, Chen J, Liu H: Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids. 2007, 33 (3): 423-428. Jan

  9. Hopp TP, Woods KR: Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl. Acad. Sci. U.S.A. 1981, 78 (6): 3824-3828. Jun

  10. Toth I, Moyle PM: Modern subunit vaccines: development, components, and research opportunities. ChemMedChem. 2013, 8 (3): 360-376. Mar

  11. Ditzel HJ, Williamson RA, Burton DR, Parren PW, Poignard P: Antibodies in human infectious disease. Immunol Res. 2000, 21 (2-3): 265-278.

  12. Patel VL, Shortliffe EH, Stefanelli M, Szolovits P, Berthold MR, Bellazzi R, Abu-Hanna A: The coming of age of artificial intelligence in medicine. Artif Intell Med. 2009, 46 (1): 5-17. May

  13. Sivalingam GN, Shepherd AJ: An analysis of B-cell epitope discontinuity. Mol. Immunol. 2012, 51 (3-4): 304-309. Jul

  14. Karplus M, McCammon JA: The dynamics of proteins. Sci. Am. 1986, 254 (4): 42-51. Apr

  15. Parker JM, Guo D, Hodges RS: New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry. 1986, 25 (19): 5425-5432. Sep

  16. Pellequer JL, Westhof E: PREDITOP: a program for antigenicity prediction. J Mol Graph. 1993, 11 (3): 204-210. Sep

  17. Davydov I, Tonevitski AG: Linear B-cell epitope prediction. Mol. Biol. (Mosk.). 2009, 43 (1): 166-174.

  18. Atassi MZ, Azzazy HM, Highsmith WE: Phage display technology: clinical applications and recent innovations. Clin. Biochem. 2002, 35 (6): 425-445. Sep

  19. Blythe MJ, Flower DR: Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci. 2005, 14 (1): 246-248. Jan

  20. Deng Houtao, Runger George, Tuv Eugene: Bias of importance measures for multi-valued attributes and solutions. Lecture Notes in Computer Science. 2011, 6792: 293-300.

  21. Wang HW, Lin YC, Pai TW, Chang HT: Prediction of B-cell linear epitopes with a combination of support vector machine classification and amino acid propensity identification. J Biomed Biotechnol. 2011, 2011: 432830. doi: 10.1155/2011/432830. Epub 2011 Aug 23

  22. Subramanian N, Chinnappan S: Prediction of promiscuous epitopes in the e6 protein of three high risk human papilloma viruses: a computational approach. Asian Pac. J. Cancer Prev. 2013, 14 (7): 4167-4175.

  23. Zhou E, Ruan Y, Kurgan J, Gao L, Faraggi J: BEST: improved prediction of B-cell epitopes from antigen sequences. PLoS ONE. 2012, 7 (6): e40104. Jun

  24. El-Manzalawy Y, Dobbs D, Honavar V: Predicting linear B-cell epitopes using string kernels. J. Mol. Recognit. 2008, 21 (4): 243-255.

  25. Kolaskar PC, Tongaonkar AS: A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett. 1990, 276: 172-174.

  26. Singh H, Ansari HR, Raghava GP: Improved method for linear B-cell epitope prediction using antigen's primary sequence. PLoS ONE. 2013, 8 (5): e62216.

  27. Luštrek M, Lorenz P, Kreutzer M, Qian Z, Steinbeck F, Wu D, Born N, Ziems B, Hecker M, Blank M, Shoenfeld Y, Cao Z, Glocker MO, Li Y, Fuellen G, Thiesen HJ: Epitope predictions indicate the presence of two distinct types of epitope-antibody-reactivities determined by epitope profiling of intravenous immunoglobulins. PLoS ONE. 2013, 8 (11): e78605. Nov 11. doi: 10.1371/journal.pone.0078605. eCollection 2013

  28. Ofran Y, Kunik V: The indistinguishability of epitopes from protein surface is explained by the distinct binding preferences of each of the six antigen-binding loops. Protein Eng Des Sel. 2013, 26 (10): 599-609. Oct

  29. Petersen Bent, Petersen Nordahl Thomas, Andersen Pernille, Nielsen Morten, Lundegaard Claus: A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Structural Biology. 2009, 9: 51. doi:10.1186/1472-6807-9

  30. Berthold Michael, Cebron Nicolas, Dill Fabian, Gabriel Thomas, Otter Tobias, Meinl Thorsten, Ohl Peter, Sieb Christoph, Thiel Kilian, Wiswedel Bernd: KNIME: The Konstanz Information Miner. Studies in Classification, Data Analysis, and Knowledge Organization. Springer. ISSN:1431-8814. 2007

  31. Bremel EJ, Homan RD: An integrated approach to epitope analysis I: Dimensional reduction, visualization and prediction of MHC binding using amino acid principal components and regression approaches. Immunome Res. 2010, 6 (7): 1745-7580. Nov

  32. Kam D, Tong YW, Wee JC, Simarmata LJ: SVM-based prediction of linear B-cell epitopes using Bayes Feature Extraction. BMC Genomics. 2010, 2 (11): 1471-2164.

  33. R Core Team: R: A Language and Environment for Statistical Computing. 2014, R Foundation for Statistical Computing. Vienna, Austria

  34. Kurosaki T: Regulation of B-cell signal transduction by adaptor proteins. Nat. Rev. Immunol. 2002, 2 (5): 354-363. May

  35. Jones S, Thornton JM: Principles of protein-protein interactions. Proc. Natl. Acad. Sci. U.S.A. 1996, 93 (1): 13-20. Jan

  36. Su CW, Lin EC, Cheng SY, Liu R, Hu J: Computational prediction of heme-binding residues by exploiting residue interaction network. PLoS ONE. 2011, 6 (10): e25560.

  37. Greenbaum JA, Emami H, Hoof I, Salimi N, Damle R, Sette A, Peters B, Vita R, Zarebski L: The immune epitope database 2.0. Nucleic Acids Res. 2010, 854-862. Nov, D

  38. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1): 235-242. Jan

  39. Rubinstein ND, Mayrose I, Halperin D, Yekutieli D, Gershoni JM, Pupko T: Computational characterization of B-cell epitopes. Mol. Immunol. 2008, 45 (12): 3477-3489. Jul

  40. Zhao M, Li Q, Zhang W, Liu J: Predicting linear B-cell epitopes by using sequence-derived structural and physicochemical features. Int J Data Min Bioinform. 2012, 6 (5): 557-569.

  41. Janin J, Chothia C: The structure of protein-protein recognition sites. J. Biol. Chem. 1990, 265 (27): 16027-16030. Sep

  42. Reimer U: Prediction of linear B-cell epitopes. Methods Mol Biol. 2009, 524: 335-344.

  43. Toseland CP, Clayton DJ, McSparron H, Hemsley SL, Blythe MJ, Paine K, Doytchinova IA, Guan P, Hattotuwagama CK, Flower DR: AntiJen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data. Immunome Res. 2005, 1 (1): 4. Oct

  44. Zhao L, Wong L, Lu L, Hoi SC, Li J: B-cell epitope prediction through a graph model. BMC Bioinformatics. 2012, 13 (Suppl 17): S20.

  45. Keskin O, Ma B, Rogale K, Gunasekaran K, Nussinov R: Protein-protein interactions: organization, cooperativity and mapping in a bottom-up Systems Biology approach. Phys Biol. 2005, 2 (2): 24-35. Jun

  46. Pellequer JL, Westhof E, Van Regenmortel MH: Correlation between the location of antigenic sites and the prediction of turns in proteins. Immunol. Lett. 1993, 36 (1): 83-99. Apr

  47. Bourne PE, Ponomarenko JV: Antibody-protein interactions: benchmark datasets and prediction tools evaluation. BMC Struct Biol. 2007, 2: 7-64. Oct

  48. Saha S, Raghava GP: Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins. 2006, 65 (1): 40-48. Oct

  49. Saha S, Bhasin M, Raghava GP: Bcipep: a database of B-cell epitopes. BMC Genomics. 2005


This research and funding for publication were supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES-Brazil; Toxinologia No 23038000825/2011-63), Fundação de Amparo a Pesquisa do Estado de Minas Gerais, Brazil (FAPEMIG-Brazil) and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq-Brazil).

This article has been published as part of BMC Bioinformatics Volume 16 Supplement 19, 2015: Brazilian Symposium on Bioinformatics 2014. The full contents of the supplement are available online at

Author information

Authors and Affiliations


Corresponding author

Correspondence to Carlos Chavez-Olortegui.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Carlos Chavez Olortegui: Advising, professional orientation, results review and science encouragement.

Edgar Ernesto Gonzalez Kozlova: Data mining models and statistical analysis.

Benjamin Thomas Viart: Statistical analysis advising.

Liza Figueredo Felicori: Hypothesis help and advising.

Ricardo Andrez Machado de Avila: Hypothesis help and advising, general advising, results review and science encouragement.

Electronic supplementary material


Additional file 1: The dataset of sequences used in this work is available in this .csv file, which contains four columns. The first column shows the PubMed ID of the paper from which the sequence was extracted. The second column contains the sequence. The third column contains the sequence IDs from the GenBank, UniProt or PDB databases. The fourth column contains the class of the sequence, which can be neurotoxin, metalloproteinase or random. The column separator in this .csv file is a standard semicolon ";". (CSV 7 KB)

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit

The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Kozlova, E.E.G., Viart, B.T., de Avila, R.A.M. et al. Classification epitopes in groups based on their protein family. BMC Bioinformatics 16 (Suppl 19), S7 (2015).
