Positive-unlabeled learning for the prediction of conformational B-cell epitopes
© Ren et al. 2015
Published: 9 December 2015
The incomplete ground truth of training data of B-cell epitopes is a demanding issue in computational epitope prediction. The challenge is that only a small fraction of the surface residues of an antigen are confirmed as antigenic residues (positive training data); the remaining residues are unlabeled. As some of these unlabeled residues may group together to form novel but currently unknown epitopes, it is misguided to classify all unlabeled residues uniformly as negative training data, as the traditional supervised learning scheme does.
We propose a positive-unlabeled learning algorithm to address this problem. The key idea is to distinguish between epitope-likely residues and reliable negative residues in unlabeled data. The method has two steps: (1) identify reliable negative residues using a weighted SVM with a high recall; and (2) construct a classification model on the positive residues and the reliable negative residues. Complex-based 10-fold cross-validation was conducted to show that this method outperforms the commonly used predictors DiscoTope 2.0, ElliPro and SEPPA 2.0 in every aspect. We conducted four case studies, in which the approach was tested on antigens of West Nile virus, dihydrofolate reductase, beta-lactamase, and two Ebola antigens whose epitopes are currently unknown. All the results were assessed on a newly-established data set of antigen structures not bound by antibodies, instead of on antibody-bound antigen structures. Bound structures may contain unfair binding information, such as bound-state B-factors and protrusion index, which could exaggerate the epitope prediction performance. Source code is available on request.
A B-cell epitope is a small surface area of an antigen that interacts with an antibody. It is a much safer and more economical target than an entire inactivated antigen for the design and development of vaccines against infectious diseases [1, 2]. More than 90% of epitopes are conformational epitopes, which are discontinuous in sequence but compact in 3D structure after folding [2, 3]. The most accurate way to identify conformational epitopes is to conduct wet-lab experiments to obtain the bound structures of antigen-antibody complexes. Given the vast number of antigens and of epitope candidates on known antigens, the wet-lab approach is labour-intensive and does not scale.
The computational approach to identifying B-cell epitopes is to predict new epitopes with algorithms trained on wet-lab confirmed epitope data. Early methods explored essential characteristics of epitopes and found useful individual features, including hydrophobicity [4, 5], flexibility, secondary structure, protrusion index (PI), accessible surface area (ASA), relative accessible surface area (RSA) and B-factor [9, 10]. However, none of these single characteristics is sufficient to locate B-cell epitopes accurately. Later, advanced conformational epitope prediction methods emerged, integrating window strategies, statistical ideas and compound features [2, 11–14]. Recently, many epitope predictors have used machine learning techniques, such as Naive Bayesian learning and random forest classification [10, 16].
All these methods have overlooked the incomplete ground truth of the training data of epitopes. Traditional methods simply divide the training data into positive (i.e., confirmed epitope residues) and negative (i.e., non-epitope residues) classes. In fact, the non-epitope residues are unlabeled residues, and they may contain a significant number of undiscovered antigenic residues (i.e., potentially positive). It is therefore misguided to treat all the unlabeled residues uniformly as negative training data. Classification models trained on such biased data can suffer significantly impaired prediction performance.
An intuitive way to address this problem is to train the models on positive samples only (one-class learning). One-class SVM [17, 18] was developed for this purpose, but its performance does not seem to be satisfactory. Positive-unlabeled learning (PU learning) provides another direction. It learns from both positive and unlabeled samples, and exploits the distribution of the unlabeled data to reduce labeling errors in the training samples and thereby enhance prediction performance. One idea in PU learning is to assign each sample a score indicating the probability of it being a positive sample. For example, Lee and Liu first fitted the samples to a specific distribution by weighted logistic regression and then scored the samples. Another idea is the bagging strategy, in which a series of classifiers is constructed by randomly sampling unlabeled data, and these classifiers are then combined using aggregation techniques. A third idea is a two-step model: reliable negative (RN) samples are first obtained from the unlabeled data, then a classifier is built by applying a classification algorithm to the positive and reliable negative samples [19, 22–24].
We introduce a novel two-step PU learning algorithm. The first step is to identify reliable negative samples from unlabeled data by a weighted SVM with a recall threshold set at a high level. High recall means that the majority of positive samples are correctly identified; thus, if an unlabeled sample is predicted as negative, it has a high probability of being a non-epitope residue. Accordingly, unlabeled residues predicted as negative can be annotated as reliable negative samples. A classifier (a weighted SVM model) is then trained on the positive and reliable negative samples to predict novel antigenic residues and epitopes. Our method is called PUPre (Positive-Unlabeled Prediction).
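The first step can be illustrated with a minimal sketch. It assumes that decision scores have already been produced by some classifier (a weighted SVM in PUPre; any scorer serves for the illustration) and shows only how a high-recall cutoff turns unlabeled residues into reliable negatives. The function names, the 0.8 recall target and the toy scores are assumptions made for this example, not values from the paper.

```python
import math
from typing import List, Sequence


def recall_threshold(pos_scores: Sequence[float], target_recall: float) -> float:
    """Highest cutoff at which at least `target_recall` of the known
    positives score at or above the cutoff."""
    if not pos_scores:
        raise ValueError("need at least one positive score")
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    ranked = sorted(pos_scores, reverse=True)
    # Number of positives that must be recovered; the small epsilon guards
    # against floating-point noise in target_recall * len(ranked).
    k = max(1, math.ceil(target_recall * len(ranked) - 1e-9))
    return ranked[k - 1]


def reliable_negatives(unlabeled_scores: Sequence[float], cutoff: float) -> List[int]:
    """Indices of unlabeled samples scored below the high-recall cutoff."""
    return [i for i, s in enumerate(unlabeled_scores) if s < cutoff]


# Toy scores (hypothetical): higher means more epitope-like.
pos_scores = [0.9, 0.8, 0.4, 0.3, -0.2]
unlabeled_scores = [0.5, 0.1, -0.7, 0.3, -1.0]

cutoff = recall_threshold(pos_scores, target_recall=0.8)
rn_indices = reliable_negatives(unlabeled_scores, cutoff)
# Step two would train a new classifier on the positives plus rn_indices only.
```

Because the cutoff is chosen so that most confirmed epitope residues fall above it, an unlabeled residue that still falls below it is unlikely to be a hidden positive, which is the rationale given above for calling it a reliable negative.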
The performance of PUPre was evaluated on a newly-established data set of unbound structures of antigens. We would like to point out that most existing epitope prediction methods have been evaluated on bound-state structures of antigens [2, 11, 13, 26]. Bound-state data has two limitations. Firstly, bound-state structures contain binding site information. Secondly, if an antigen can be bound by multiple antibodies, only the epitope of the antibody present in the bound-state structure is annotated as an epitope, while the epitopes bound by the other antibodies are taken as non-epitope. Such annotation inflates the number of false negative labels.
We conducted complex-based 10-fold cross-validation for performance evaluation, in which all the residues of 10% randomly selected complexes are held out for testing in each round (not 10% of randomly selected residues). We show that PUPre performs better than commonly used conformational B-cell epitope predictors, such as DiscoTope 2.0, ElliPro and SEPPA 2.0. The use of PUPre was also demonstrated through its application to antigens of West Nile virus, dihydrofolate reductase, beta-lactamase, and two Ebola antigens (whose epitopes are currently unknown) to show its usefulness in real-life applications for the prediction of unknown epitopes. To understand the importance of species information in epitope prediction, a species-wise feature analysis was also conducted on the newly-established unbound structure data set. We found that the divergence between epitopes and normal surface areas is large, suggesting that the prediction methods are useful for all species. We also note that certain species differ in important structural features and in amino acid composition. We speculate that it may be possible to enhance epitope prediction performance by using species information in the future.
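The difference between a complex-based split and a residue-based split can be made concrete with a short sketch. It assigns whole complexes to folds, so no complex contributes residues to both the training and the test side of a round. The function name, the fixed seed and the toy identifiers are assumptions for illustration; the paper does not describe its splitting code.

```python
import random
from typing import List, Sequence, Tuple


def complex_based_folds(
    complex_ids: Sequence[str], n_folds: int = 10, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Split residue indices into folds so that all residues of a complex
    land in the same fold. `complex_ids[i]` is the complex of residue i."""
    complexes = sorted(set(complex_ids))
    if len(complexes) < n_folds:
        raise ValueError("need at least as many complexes as folds")
    random.Random(seed).shuffle(complexes)
    # Deal the shuffled complexes round-robin into the folds.
    fold_of = {c: i % n_folds for i, c in enumerate(complexes)}
    folds = []
    for fold in range(n_folds):
        train = [i for i, c in enumerate(complex_ids) if fold_of[c] != fold]
        test = [i for i, c in enumerate(complex_ids) if fold_of[c] == fold]
        folds.append((train, test))
    return folds


# Toy data (hypothetical): 20 complexes with 3 residues each.
complex_ids = [f"complex_{c}" for c in range(20) for _ in range(3)]
folds = complex_based_folds(complex_ids)
```

Splitting by residue instead would let neighbouring residues of the same antigen appear on both sides of a round, which tends to make the evaluation optimistic.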
Large-scale bound-state structure data sets have previously been constructed in the literature and used in other studies [2, 11, 13, 26, 27] for epitope prediction and feature analysis. The use of bound-state structures can result in two problems. One is that bound-state structures contain a large amount of explicit binding information, which can bias the characterization of epitopes and exaggerate the prediction performance. The other is that they can aggravate the issue of false negative annotations when an antigen can be bound by multiple antibodies: only the epitope of the antibody in the bound structure is labeled as an epitope site, and all the epitopes of other antibodies are marked as non-epitope. To overcome these two problems, our predictions and analysis were based on a newly-established unbound-state structure data set. As the data set does not contain information about the binding site, a more accurate characterization of unbound-state epitopes is expected. The use of unbound-state structures can also reduce the false negative annotations by aggregating multiple epitopes on the same antigen. These unbound-state structures were manually organized in terms of species and disease, which is especially useful for species/disease-specific feature analysis.
The construction of the unbound-state structure data set requires reference information from bound structures. We used the following steps to obtain the bound structures with epitope annotations:
Collect bound structures of antigen-antibody complexes. Bound structures were collected by text search of 'complex' and 'antibody/Fab/Fv/VHH' from the PDB database dated 9th Sep 2014, which retrieved 1596 structures.
Filter the bound structures. A bound structure was removed if it met any of the following conditions: (1) there is no antibody chain; (2) there is a chain of 'DNA/RNA/Fc/T-cell/receptor'; (3) the resolution is worse (numerically greater) than 3 Å; (4) the antigen chain is shorter than 25 residues [2, 28]. In total, 598 bound structures of antigen-antibody complexes were retained.
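The four removal conditions of the filtering step can be written as a single predicate. The sketch below assumes each structure has already been summarized into a small record; the record fields, the chain-type labels and the placeholder identifiers are hypothetical and do not come from the paper or from any PDB parsing library.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Chain types that disqualify a complex, as listed in condition (2).
EXCLUDED_CHAIN_TYPES: Tuple[str, ...] = ("DNA", "RNA", "Fc", "T-cell", "receptor")


@dataclass(frozen=True)
class BoundStructure:
    structure_id: str            # placeholder identifier, not a real PDB entry
    has_antibody_chain: bool
    chain_types: Tuple[str, ...]
    resolution_angstrom: float
    antigen_length: int          # number of residues in the antigen chain


def keep_structure(s: BoundStructure) -> bool:
    """True if none of the four removal conditions applies."""
    if not s.has_antibody_chain:                                # condition (1)
        return False
    if any(t in EXCLUDED_CHAIN_TYPES for t in s.chain_types):   # condition (2)
        return False
    if s.resolution_angstrom > 3.0:                             # condition (3)
        return False
    if s.antigen_length < 25:                                   # condition (4)
        return False
    return True


structures: List[BoundStructure] = [
    BoundStructure("ok", True, ("antigen", "Fab"), 2.1, 129),
    BoundStructure("no_antibody", False, ("antigen",), 1.8, 200),
    BoundStructure("has_dna", True, ("antigen", "Fab", "DNA"), 2.0, 150),
    BoundStructure("low_res", True, ("antigen", "Fab"), 3.4, 150),
    BoundStructure("short", True, ("antigen", "Fab"), 2.5, 12),
]
retained = [s.structure_id for s in structures if keep_structure(s)]
```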
Subsequently, the steps to build the unbound-state structure data set are:
Obtain candidate unbound-state structures of antigens. An antigen structure in unbound state is selected as a candidate if it has more than 70% sequence similarity to any antigen in bound state (i.e., the 598 bound structures). In this way, there may be multiple candidate unbound-state antigen structures that are similar to the same bound-state antigen, but only one is used for mapping in the next step. Priority is given to the candidate with the highest similarity to the bound-state antigen and with the better resolution. Bound-state structures are removed if their antigens do not have high similarity to any antigen in unbound state.
Map the epitopes onto the unbound-state structures. The epitopes extracted from the bound structures were mapped onto the corresponding unbound-state structures by structure alignment. An epitope was retained if it could be completely aligned with the unbound structure. This step reduces the false negative annotations: if various bound structures share the same antigen, their epitopes will be mapped on the same unbound-state structure. For example, 1VFB and 2EIZ are bound structures of lysozyme and antibodies, and their distinct epitopes were mapped onto the same unbound-state structure 2VB1 to reduce the false negative annotations. In this step, 308 epitopes were mapped onto 92 unbound-state structures.
Remove duplicate units. For the 92 unbound-state structures, only one asymmetric unit was retained for each structure.
A residue was retained for the unbound data set if its ASA was more than 0 Å². This is because a candidate epitope residue must at least be exposed in order to contribute to the binding affinity. We used the relatively low threshold of 0 Å² to preserve the ground truth of the epitopes.
By following these steps, a data set of 92 unbound-state structures was constructed (Additional File 1) which contained 2123 confirmed epitope residues labeled as positive (Additional File 2), and 16615 residues marked as unlabeled.
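The candidate selection rule in the first construction step can be sketched as follows. The text states only that the candidate with the highest similarity and with higher resolution has priority; ranking by similarity first and breaking ties by resolution is an assumption of this sketch, as are the record fields and the placeholder identifiers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    structure_id: str           # placeholder identifier, not a real PDB entry
    similarity_pct: float       # sequence similarity to the bound-state antigen
    resolution_angstrom: float  # smaller value means better resolution


def pick_candidate(
    candidates: Sequence[Candidate], min_similarity_pct: float = 70.0
) -> Optional[Candidate]:
    """Choose one unbound-state structure for a bound-state antigen.

    Assumed ordering: highest similarity first, then best resolution.
    Returns None when no candidate exceeds the similarity threshold."""
    eligible = [c for c in candidates if c.similarity_pct > min_similarity_pct]
    if not eligible:
        return None
    return max(eligible, key=lambda c: (c.similarity_pct, -c.resolution_angstrom))


candidates = [
    Candidate("u1", 95.0, 2.5),
    Candidate("u2", 95.0, 1.8),
    Candidate("u3", 99.0, 2.9),
    Candidate("u4", 60.0, 1.0),
]
best = pick_candidate(candidates)
```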
Feature vector representation for residues
Description of features.
AAIndex (80% similarity)
Some of these features may make little contribution to the characterization and identification of epitopes. A non-parametric hypothesis test (Wilcoxon rank-sum test on the epitope residues and the unlabeled residues) was used to find out which features better characterize epitopes. The p-value reflects the significance: the smaller the p-value, the better the feature characterizes epitope residues. The features were then ranked by p-value, and only those with a p-value less than 1e-4 were retained as important features. To avoid over-fitting, the hypothesis test was conducted on 62 (2/3 of 92) randomly selected complexes each time. This procedure was repeated nine times and produced nine lists of important features. The final features were selected by majority voting. The procedure helps to identify highly useful features for discovering unknown epitopes. Eighty-nine basic features were ultimately selected. Sequence window features and structure window features were also added to the vector to reflect the impact of sequentially or structurally adjacent residues on epitope residues. Please refer to  for detailed steps to derive these sequence window features and structure window features.
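The selection procedure above (rank-sum test on a random two-thirds of the complexes, repeated nine times, then majority voting) can be sketched in plain Python. The rank-sum p-value here uses the normal approximation with average ranks for ties and no tie correction of the variance, which is an assumption of this sketch; the paper does not say which implementation was used. All names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Residue = Tuple[Dict[str, float], bool]  # (feature values, is_epitope)


def rank_sum_p_value(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = sorted([(v, True) for v in x] + [(v, False) for v in y])
    rank_sum_x, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # tied ranks i+1 .. j share this value
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1])
        i = j
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return math.erfc(abs(rank_sum_x - mean) / sd / math.sqrt(2.0))


def select_features(
    data: Dict[str, List[Residue]],
    features: Sequence[str],
    rounds: int = 9,
    fraction: float = 2 / 3,
    alpha: float = 1e-4,
    seed: int = 0,
) -> List[str]:
    """Keep features that pass the test in a majority of the sampling rounds."""
    rng = random.Random(seed)
    complexes = sorted(data)
    n_sample = min(len(complexes), max(1, math.ceil(fraction * len(complexes))))
    votes: Counter = Counter()
    for _ in range(rounds):
        residues = [r for c in rng.sample(complexes, n_sample) for r in data[c]]
        for name in features:
            epitope = [f[name] for f, is_epi in residues if is_epi]
            unlabeled = [f[name] for f, is_epi in residues if not is_epi]
            if rank_sum_p_value(epitope, unlabeled) < alpha:
                votes[name] += 1
    return [name for name in features if votes[name] > rounds / 2]


# Toy data (hypothetical): "asa" separates the classes, "noise" does not.
toy = {
    f"complex_{c}": [({"asa": 50.0 + k, "noise": float(k)}, True) for k in range(4)]
    + [({"asa": float(k % 4), "noise": float(k % 4)}, False) for k in range(12)]
    for c in range(12)
}
selected = select_features(toy, ["asa", "noise"])
```

Voting across resampled subsets means a feature is kept only if it is significant regardless of which complexes happen to be drawn, which is the over-fitting safeguard the text describes.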
PU learning has already been explored for text mining [19, 20, 22], disease gene identification [29–31], and protein function identification. However, this advanced learning approach has not been explored for the prediction of epitopes.
To study which factor contributes to this major improvement, we also designed two baseline algorithms using linear SVM. In the baseline algorithms, we simply use weight = 10 on the rare epitope data to handle the issue of data imbalance, and calibrate the penalty factor c in linear SVM to obtain optimum performance. The penalty factor controls the trade-off between the margin and the training errors.
The raw baseline algorithm is trained and evaluated on all the epitope and unlabeled residues. It is used to investigate whether selected features are effective. The preprocessing baseline algorithm was designed on the basis of the following observation. Many of the unbound structures are multimeric, and in most cases PDB files only record parts of the symmetric units. Clearly, the interfaces between target chains and other chains in symmetric units cannot become epitopes. Thus, a preprocessing procedure is deployed to enhance the performance: we first calculate the complete structure according to the PDB file and then detect and remove the interfaces with other chains. Without loss of generality, we assume an antigen structure has two chains A and B. A residue on chain A is defined as an internally interacting residue if a heavy atom of this residue is within 4Å distance of any heavy atom of chain B. In a training process, those internal interactions are excluded from the training data, taken as neither positive nor unlabeled residues; in a testing process, they are labeled as negative. The internal interactions on our data set are provided in Additional File 3.
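The definition of an internally interacting residue (a heavy atom within 4 Å of any heavy atom of the other chain) translates directly into a distance check. The sketch below assumes the heavy-atom coordinates have already been extracted and the complete symmetric assembly has already been built; the data layout, function name and toy coordinates are hypothetical. The brute-force double loop is chosen for clarity and would be slow on large structures.

```python
from typing import Dict, List, Sequence, Tuple

Atom = Tuple[float, float, float]  # heavy-atom coordinates in angstroms


def internally_interacting(
    residues_a: Dict[int, Sequence[Atom]],
    atoms_b: Sequence[Atom],
    cutoff: float = 4.0,
) -> List[int]:
    """Residue numbers on chain A with any heavy atom within `cutoff`
    angstroms of any heavy atom of chain B."""
    cutoff_sq = cutoff * cutoff

    def close(a: Atom, b: Atom) -> bool:
        # Compare squared distances to avoid a square root per atom pair.
        return sum((p - q) ** 2 for p, q in zip(a, b)) <= cutoff_sq

    return sorted(
        number
        for number, atoms in residues_a.items()
        if any(close(a, b) for a in atoms for b in atoms_b)
    )


# Toy coordinates (hypothetical): residue 1 sits 3 A from chain B, residue 2 sits 7 A away.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0), (10.0, 1.0, 0.0)]}
chain_b = [(3.0, 0.0, 0.0)]
interface = internally_interacting(chain_a, chain_b)
```

Following the text, residues returned by such a check would be dropped from the training data and labeled negative at test time.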
Results and Discussion
This section presents validation results in two parts. The performance of PUPre under the complex-based 10-fold cross-validation is reported first, followed by its detailed prediction performance on the four antigens in our case studies. Our methods and all the comparing partners in this section receive exactly the same inputs: all residues of the chains listed in Additional File 1.
Prediction results by complex-based 10-fold cross-validation
The performance of complex-based 10-fold cross-validation.
Method: Recall, Precision, F-score, MCC
Baseline(r), raw: 0.58 ± 0.002, 0.17 ± 0.003, 0.26 ± 0.003, 0.17 ± 0.004
Baseline with preprocessing: 0.59 ± 0.001, 0.18 ± 0.003, 0.27 ± 0.003, 0.18 ± 0.004
PUPre: 0.71 ± 0.015, 0.18 ± 0.002, 0.28 ± 0.003, 0.21 ± 0.005
The most distinguishing feature of PUPre is its high recall. It achieves an excellent recall of 0.71, while its precision of 0.18 is the highest among the four predictors. This indicates that most of the epitope residues have been correctly identified. Though ElliPro has a competitive recall, its precision of 0.12 is only slightly better than random.
We can also see that the raw baseline algorithm (Table 3, Baseline(r)) outperforms the three other predictors except that its recall is lower than ElliPro. This implies that the selected features and the method of feature space construction are as effective as expected. By integrating the preprocessing procedure, the new baseline algorithm improves performance in every aspect: the F-score has increased from 0.26 to 0.27, the MCC has improved from 0.17 to 0.18, the recall has increased from 0.58 to 0.59 (implying that more epitope residues have been identified), and precision has increased from 0.17 to 0.18 (implying that a greater proportion of predicted epitope residues are true epitope residues). The removal of internally interacting residues was conducted as a preprocessing step rather than postprocessing. Thus, these extreme negative cases can be removed before training, and it will help predictors to focus on the more confusing residues. The performance results of DiscoTope 2.0, ElliPro, SEPPA 2.0 with a similar postprocessing procedure to remove the internally interacting residues are shown in Additional File 4: Table S1.
Compared with the two baseline algorithms, PUPre achieves an overall improvement in performance: the F-score has increased from 0.27 to 0.28 and the MCC has improved from 0.18 to 0.21. With precision unchanged, recall shows a significant increase from 0.59 to 0.71, indicating the effectiveness of the PU learning algorithm: more confirmed epitope residues are re-discovered (predicted) and there is potential to discover new epitopes.
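For reference, the four reported measures can be computed from residue-level confusion counts as sketched below. The sketch assumes the F-score is the balanced F1 measure, which the text does not state explicitly, and the counts in the example are made up for illustration rather than taken from the experiments.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Recall, precision, F1 and Matthews correlation coefficient (MCC).
    Each measure falls back to 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f_score": f_score, "mcc": mcc}


# Made-up counts: 6 true positives, 2 false positives, 4 false negatives, 8 true negatives.
metrics = classification_metrics(tp=6, fp=2, fn=4, tn=8)
```

MCC uses all four cells of the confusion matrix, so it is less easily inflated than recall alone when epitope residues are a small minority of the surface.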
In epitope prediction, handling the more ambiguous residues has always been difficult. The nature of epitope residues is complicated; simply using the distribution of certain features (even important features) is insufficient to distinguish these 'middle points'. Additional File 5: Figure S1 gives an example to illustrate this difficulty. As can be seen, there is no clear boundary among these samples that is able to correctly classify the confirmed epitope residues (positive). A more systematic machine learning method that utilizes the distribution of more useful features could be a better choice. In PUPre, two strategies were employed against these ambiguous residues. The first strategy was the preprocessing procedure to remove internally interacting residues before step one. These internally interacting residues are a kind of extremely negative residue, and were removed before training and testing. Thus, the predictor is able to focus on the more ambiguous points. The second strategy was to train a new SVM predictor with optimized F-score based on the positive and RN residues. The distribution of positive and RN residues was utilized in this more systematic way to distinguish the more ambiguous points that were not labeled in step one.
Four case studies
PUPre was tested on three antigens with known epitopes that were not used as training data, to see whether these known epitopes can be correctly re-discovered (predicted). PUPre was also applied to two Ebola virus antigen structures, whose epitopes are currently unknown, to predict novel epitopes. We note that all these antigens are only distantly related to the antigens in the training data.
Prediction results for an antigen of West Nile virus
PDB entry 4OIE is an unbound structure of an antigen of West Nile virus. The epitope site of 4OIE was annotated with the reference information from PDB entry 4OII (protein NS1 of the West Nile virus binding with antibody 22NS1), which is the only bound structure of a similar antigen in PDB. The epitope consists of 21 residues. PUPre was constructed on 91 structures of our data set to predict the epitopes of 4OIE after the unbound structure 4OIE was removed from the training data. The sequence similarity between 4OIE and each of the 91 training structures was calculated by BLAST; the highest sequence similarity is only at 11.0%, confirming they are not related.
Prediction results on West Nile virus 4OIE.
There are multiple symmetrical units in 4OIE. The footprint of other symmetrical units on this chain (internal interactions, colored in blue) cannot be epitope candidates. The integrated preprocess allows PUPre to avoid recognizing this area as an epitope site (there is no blue sphere in Figure 3(b)), while the other predictors all mistook internally interacting residues for epitope sites. For example, DiscoTope 2.0 predicted two internally interacting residues TRP-210 and ASP-234 as epitope residues (see blue spheres in Figure 3(c)).
Prediction results for dihydrofolate reductase antigen
PDB entry 4NX7 is an unbound structure of an antigen of dihydrofolate reductase. This structure has multiple epitopes that have been confirmed at six other PDB entries: 3K74, 4EJ1, 4EIG, 4EIZ, 4I1N and 4I13. Again, the PUPre classifier was trained on the remaining 91 unbound structures after 4NX7 was removed from the unbound structure data set. The best sequence similarity between 4NX7 and the remaining 91 structures is only 12.6%.
Prediction results on dihydrofolate reductase 4NX7.
We also use this case study to illustrate the impact of non-standard components on epitope prediction. There are four non-standard components in 4NX7 (in yellow dots in Figure 4(b) and (d)): MN (manganese), FOL (folic acid), BME (beta-mercaptoethanol) and NAP (NADP, nicotinamide adenine dinucleotide phosphate). The non-standard components have a sophisticated impact on the folding of the protein as well as on the binding with antibodies. As shown in Figure 4(b) and (d), the residues alongside the non-standard components are unlikely to be epitope candidates, since they are difficult for antibodies to bind. Predictors and most feature extraction methods have failed to deal with this issue.
Prediction results for beta-lactamase antigen
Our third case study is on an unbound structure of an antigen of beta-lactamase obtained from Bacillus licheniformis (PDB ID: 2WK0). The epitope site was mapped from the bound structure 4M3K. There are two symmetrical Chains A and B in 2WK0, and Chain A is used here as an example. The best sequence similarity of 2WK0 with the training data (91 unbound structures) is at 16.7%.
Prediction results on beta-lactamase 2WK0.
Predicted epitopes for Ebola virus antigen
Ebola is a fatal infectious disease that caused an epidemic in Africa in 2014. There are two unbound structures of the matrix proteins of Ebola virus, 1ES6 and 4LD8, stored in PDB. Neither of them can be aligned with any bound structure in PDB with greater than 30% sequence similarity, which means that their bound structures with antibodies have not been determined or published; thus their epitopes cannot be determined from any complex structure data. We attempted to make predictions for the currently unknown epitopes of Ebola antigens with the four structure-based predictors PUPre, DiscoTope 2.0, ElliPro and SEPPA 2.0.
Taking 1ES6 as an example, GLY-84, PRO-85, LYS-86, ALA-128 and GLY-201 of Chain A were predicted to be epitope residues by all four methods, and some nearby residues, e.g., SER-83, VAL-87, THR-129, GLN-167, GLN-170, ALA-202, ASN-227 and THR-232, were predicted as epitope residues by three methods. It is interesting to see that these residues are spatially close to each other. As aggregated antigenic residues are more likely to constitute an epitope, these residues are good candidates for forming novel epitope sites on 1ES6 of the Ebola matrix protein.
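Counting how many predictors agree on each residue, as done above for 1ES6, amounts to a simple vote. The sketch below uses placeholder residue labels and placeholder predictor names; it does not reproduce any actual output of PUPre, DiscoTope 2.0, ElliPro or SEPPA 2.0.

```python
from collections import Counter
from typing import Dict, List, Set


def consensus_residues(predictions: Dict[str, Set[str]], min_votes: int) -> List[str]:
    """Residues predicted as epitope by at least `min_votes` predictors."""
    votes = Counter(r for residues in predictions.values() for r in residues)
    return sorted(r for r, n in votes.items() if n >= min_votes)


# Placeholder predictions (hypothetical labels, not real predictor output).
predictions = {
    "predictor_a": {"R1", "R2", "R3"},
    "predictor_b": {"R1", "R2"},
    "predictor_c": {"R1", "R2", "R4"},
    "predictor_d": {"R1", "R4"},
}
all_four = consensus_residues(predictions, min_votes=4)
at_least_three = consensus_residues(predictions, min_votes=3)
```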
The identification of important features plays a key role in various areas of biological research [33, 34]. Feature analysis is a detailed approach to understanding the particular properties or compound properties of antigen-antibody interfaces that distinguish them from protein-protein binding sites and from the other surface residues. For the purpose of accurately predicting currently unknown epitopes from unbound structures, it is useful for feature analysis to be conducted on a large-scale unbound-state structure data set. Traditional feature analysis has usually been conducted on bound-state structure data sets, which introduces bias into the investigation of structural features such as RSA, ASA, PI and B-factor. To understand the unique properties of antigens of different species, we also carried out species-specific feature analyses for virus, bacteria and mammals.
The Wilcoxon rank-sum hypothesis test was used to rank all features extracted from our large-scale unbound structure data set with 18738 residues. The features consist of a total of 239 physico-chemical, evolutionary and structural features. To avoid over-fitting, nine data sets by independent sampling were used.
Top 20 features selected by Wilcoxon rank-sum hypothesis test.
Average rank over the nine samplings (rounded rank) for the listed entries: 1.11 ≈ 1; 1.89 ≈ 2; 3.00 = 3; 5.67 ≈ 6; 5.67 ≈ 6; 7.22 ≈ 7; 7.22 ≈ 7; 7.67 ≈ 8; 8.33 ≈ 8; 10.78 ≈ 11; 12.22 ≈ 12; 12.44 ≈ 12; 12.67 ≈ 13; 15.89 ≈ 16; 16.33 ≈ 16; 18.44 ≈ 18.
The distinction between epitope residues and surface residues in these top ranked features is significant (the p-values are all below 1e-9). ASA and RSA: the median ASA of epitope residues is 67.7 Å², while that of other surface residues is 37.2 Å²; the median RSA of epitope residues is 43.8% and that of other surface residues is 24.6%. This indicates that epitope residues are more exposed than other surface residues. PI is an important feature often taken into account in the identification of epitopes [8, 13]. The median PI of epitope residues is 0.709, and that of other surface residues is 0.436, suggesting that epitopes protrude more than the rest of the surface. B-factor characterizes the mobility of residues, and is claimed to be an effective feature in epitope prediction [9, 10]. The B-factor was normalized on each antigen, because B-factor may be influenced by experimental conditions such as resolution. The median normalized B-factor of epitope sites is 0.31, while that of other surface sites is -0.06, indicating that the epitope sites are more flexible than the other surface sites. More details are reported in Additional File 5: Figure S2-S4. Since we assumed that some of the unlabeled residues are undiscovered antigenic residues, the distributions of these features for epitope residues and for true non-epitope residues are expected to differ even more sharply.
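One common way to normalize B-factors within a single antigen is the z-score, sketched below. The text does not give the normalization formula, so the z-score (using the population standard deviation) is an assumption of this sketch, as are the function name and the toy values.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def normalize_b_factors(b_factors: Sequence[float]) -> List[float]:
    """Z-score B-factors within one antigen structure (assumed scheme).
    A structure with constant B-factors maps to all zeros."""
    values = list(b_factors)
    if not values:
        return []
    mean = fmean(values)
    sd = pstdev(values)
    if sd == 0.0:
        return [0.0] * len(values)
    return [(b - mean) / sd for b in values]


# Toy B-factors (hypothetical) for three residues of one antigen.
normalized = normalize_b_factors([10.0, 20.0, 30.0])
```

Normalizing per structure makes values from crystals solved at different resolutions comparable, so that a positive value means more mobile than the average residue of the same antigen.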
Species-specific feature analysis
Species have unique differences in morphology and structure. Investigating whether the epitopes of antigens of different species have distinct characteristics would assist the construction of epitope predictors using species information. We organized the whole data set into seven sub-groups: virus (group 0), parasite (group 1), bacteria (group 2), mammal (group 3), insect (group 4), plant (group 5) and other microbes (group 6). We conducted species-specific feature analysis for groups 0, 2 and 3 only; the other groups all have a small number of samples and so were excluded from the analysis.
Difference between virus (e0), bacteria (e2) and mammals (e3) by the rank sum test.
Amino acid composition
The secondary structure distribution of different species exhibits a similar trend, as shown in Additional File 5: Figures S6 and S7, but the distribution of a specific secondary structure varies across species.
To deal with the issue of incomplete ground truth of training data in B-cell epitope prediction, we have designed a PU learning algorithm based on weighted SVM. A preprocessing procedure was incorporated to remove the internal interactions within the unbound structure of antigens. The integrated framework is named PUPre. A complex-based 10-fold cross-validation process was deployed to evaluate the prediction performance. The results show that PUPre performance exceeds three other commonly used conformational B-cell epitope predictors DiscoTope 2.0, ElliPro and SEPPA 2.0, and two well-designed baseline algorithms, demonstrating the effectiveness of its features, preprocessing procedure and the PU learning algorithm. PUPre was tested on antigens from West Nile virus, dihydrofolate reductase, and beta-lactamase to illustrate the detailed performance of the prediction methods. It was also used for the prediction of unknown epitopes on an antigen of Ebola virus. A species-specific feature analysis was conducted which shows that similar trends exist between epitope and surface in different species, which enables traditional predictors to be useful for all species; the details vary, however, thus refinement by using species information may help to enhance prediction performance. Incomplete training data is a long-neglected but key issue in epitope prediction, as it seriously prevents further performance improvement by traditional methods. PU learning provides a promising direction to pursue to resolve this issue.
This research work was partially supported by a UTS 2013 Early Career Research Grant, an ARC Discovery Project (DP130102124), and the China Scholarship Council. We thank Sue Felix for her efforts in proofreading this manuscript.
Publication charges for this article have been funded by ARC Discovery Project DP130102124.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 18, 2015: Joint 26th Genome Informatics Workshop and 14th International Conference on Bioinformatics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S18.
- Groot ASD, Rappuoli R: Genome-derived vaccines. Expert Review of Vaccines. 2004, 3 (1): 59-76.View ArticlePubMedGoogle Scholar
- Andersen PH, Nielsen M, Lund O: Prediction of residues in discontinuous B-cell epitopes using protein 3D structures. Protein Science. 2006, 15 (11): 2558-2567.View ArticleGoogle Scholar
- Barlow DJ, Edwards MS, Thornton JM: Continuous and discontinuous protein antigenic determinants. Nature. 1986, 322 (6081): 747-748. 10.1038/322747a0View ArticlePubMedGoogle Scholar
- Hopp TP, Woods KR: Prediction of protein antigenic determinants from amino acid sequences. Proceedings of the National Academy of Sciences. 1981, 78 (6): 3824-3828.View ArticleGoogle Scholar
- Parker JMR, Guo D, Hodges RS: New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: Correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry. 1986, 25 (19): 5425-5432.View ArticlePubMedGoogle Scholar
- Karplus PA, Schulz GE: Prediction of chain flexibility in proteins. Naturwissenschaften. 1985, 72 (4): 212-213.View ArticleGoogle Scholar
- Pellequer JL, Westhof E, Van Regenmortel MHV: Correlation between the location of antigenic sites and the prediction of turns in proteins. Immunology Letters. 1993, 36 (1): 83-99.View ArticlePubMedGoogle Scholar
- Thornton JM, Edwards MS, Taylor WR, Barlow DJ: Location of 'continuous' antigenic determinants in the protruding regions of proteins. The EMBO Journal. 1986, 5 (2): 409-PubMedPubMed CentralGoogle Scholar
- Liu R, Hu J: Prediction of discontinuous B-cell epitopes using logistic regression and structural information. J Proteomics Bioinform. 2011, 4: 010-015.Google Scholar
- Ren J, Liu Q, Ellis J, Li J: Tertiary structure-based prediction of conformational B-cell epitopes through B factors. Bioinformatics. 2014, 30 (12): 264-273.
- Kulkarni-Kale U, Bhosle S, Kolaskar AS: CEP: A conformational epitope prediction server. Nucleic Acids Research. 2005, 33 (Suppl 2): 168-171.
- Moreau V, Fleury C, Piquer D, Nguyen C, Novali N, Villard S, Laune D, Granier C, Molina F: PEPOP: Computational design of immunogenic peptides. BMC Bioinformatics. 2008, 9 (1): 71-
- Ponomarenko J, Bui HH, Li W, Fusseder N, Bourne PE, Sette A, Peters B: ElliPro: A new structure-based tool for the prediction of antibody epitopes. BMC Bioinformatics. 2008, 9 (1): 514-
- Sweredoski MJ, Baldi P: PEPITO: Improved discontinuous B-cell epitope prediction using multiple distance thresholds and half sphere exposure. Bioinformatics. 2008, 24 (12): 1459-1460.
- Rubinstein ND, Mayrose I, Martz E, Pupko T: Epitopia: A web-server for predicting B-cell epitopes. BMC Bioinformatics. 2009, 10 (1): 287-
- Zhang W, Xiong Y, Zhao M, Zou H, Ye X, Liu J: Prediction of conformational B-cell epitopes from 3D structures by random forests with a distance-based feature. BMC Bioinformatics. 2011, 12 (1): 341-
- Manevitz LM, Yousef M: One-class SVMs for document classification. Journal of Machine Learning Research. 2002, 2: 139-154.
- Chang CC, Lin CJ: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST). 2011, 2 (3): 27-
- Liu B, Lee WS, Yu PS, Li X: Partially supervised classification of text documents. Proceedings of the Nineteenth International Conference on Machine Learning (ICML): 8-12 July 2002. Edited by: Sammut, C., Hoffmann, A.G. 2002, Sydney, The University of New South Wales (UNSW), 2: 387-394.
- Lee WS, Liu B: Learning with positive and unlabeled examples using weighted logistic regression. Proceedings of the Twentieth International Conference on Machine Learning (ICML): 21-24 August 2003. Edited by: Fawcett, T., Mishra, N. 2003, Washington DC, HP Labs, 3: 448-455.
- Mordelet F, Vert JP: A bagging SVM to learn from positive and unlabeled examples. Pattern Recognition Letters. 2014, 37: 201-209.
- Yu H, Han J, Chang KCC: PEBL: Positive example based learning for web page classification using SVM. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD): 23-25 July 2002. Edited by: Zaiane, O.R., Goebel, R., Hand, D., Keim, D., Ng, R. 2002, Edmonton, ACM, 239-248.
- Li X, Liu B: Learning to classify texts using positive and unlabeled data. Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI): 9-15 August 2003; Acapulco. Edited by: Gottlob, G., Walsh, T. 2003, IJCAI Organization, 3: 587-592.
- Liu B, Dai Y, Li X, Lee WS, Yu PS: Building text classifiers using positive and unlabeled examples. Proceedings of the Third IEEE International Conference on Data Mining (ICDM): 19-22 November 2003. Edited by: Wu, X., Tuzhilin, A., Shavlik, J. 2003, Melbourne, Florida, IEEE, 179-186.
- Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research. 2008, 9: 1871-1874.
- Sun J, Wu D, Xu T, Wang X, Xu X, Tao L, Li YX, Cao ZW: SEPPA: A computational server for spatial epitope prediction of protein antigens. Nucleic Acids Research. 2009, 37 (Suppl 2): 612-616.
- Kringelum JV, Nielsen M, Padkjaer SB, Lund O: Structural analysis of B-cell epitopes in antibody:protein complexes. Molecular Immunology. 2013, 53 (1): 24-34.
- Qi T, Qiu T, Zhang Q, Tang K, Fan Y, Qiu J, Wu D, Zhang W, Chen Y, Gao J: SEPPA 2.0-more refined server to predict spatial epitope considering species of immune host and subcellular localization of protein antigen. Nucleic Acids Research. 2014, 42 (Web Server): 59-63.
- Mordelet F, Vert JP: ProDiGe: Prioritization of disease genes with multitask machine learning from positive and unlabeled examples. BMC Bioinformatics. 2011, 12 (1): 389-
- Yang P, Li X, Mei JP, Kwoh CK, Ng SK: Positive-unlabeled learning for disease gene identification. Bioinformatics. 2012, 28 (20): 2640-2647.
- Yang P, Li X, Chua HN, Kwoh CK, Ng SK: Ensemble positive unlabeled learning for disease gene identification. PLoS One. 2014, 9 (5): 97079-
- Bhardwaj N, Gerstein M, Lu H: Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique. BMC Bioinformatics. 2010, 11 (Suppl 1): 6-
- Jones S, Thornton JM: Principles of protein-protein interactions. Proceedings of the National Academy of Sciences. 1996, 93 (1): 13-20.
- Nayal M, Honig B: On the nature of cavities on protein surfaces: Application to the identification of drug-binding sites. Proteins: Structure, Function, and Bioinformatics. 2006, 63 (4): 892-906.
- Chen J, Liu H, Yang J, Chou KC: Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids. 2007, 33 (3): 423-428.
- Kringelum JV, Lundegaard C, Lund O, Nielsen M: Reliable B cell epitope predictions: Impacts of method development and improved benchmarking. PLoS Computational Biology. 2012, 8 (12): 1002829-
- Liu Q, Li J: Protein binding hot spots and the residue-residue pairing preference: A water exclusion perspective. BMC Bioinformatics. 2010, 11 (1): 244-
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.