Positive-unlabeled learning for the prediction of conformational B-cell epitopes

Background The incomplete ground truth of B-cell epitope training data is a challenging issue in computational epitope prediction. Only a small fraction of the surface residues of an antigen are confirmed as antigenic residues (positive training data); the remaining residues are unlabeled. Because some of these unlabeled residues may group together to form novel but currently unknown epitopes, it is misguided to classify all of them uniformly as negative training data, as the traditional supervised learning scheme does.

Results We propose a positive-unlabeled learning algorithm to address this problem. The key idea is to distinguish between epitope-likely residues and reliable negative residues in the unlabeled data. The method has two steps: (1) identify reliable negative residues using a weighted SVM with a high recall; and (2) construct a classification model on the positive residues and the reliable negative residues. Complex-based 10-fold cross-validation showed that this method outperforms the commonly used predictors DiscoTope 2.0, ElliPro and SEPPA 2.0 in every aspect. We also conducted four case studies, in which the approach was tested on antigens of West Nile virus, dihydrofolate reductase, beta-lactamase, and two Ebola antigens whose epitopes are currently unknown. All results were assessed on a newly established data set of antigen structures not bound by antibodies, rather than on antibody-bound antigen structures. Antibody-bound structures may contain unfair binding information, such as bound-state B-factors and protrusion index, which could exaggerate epitope prediction performance. Source code is available on request.
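The two-step scheme can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in a linear SVM trained by subgradient descent on a class-weighted hinge loss, uses toy two-dimensional standardized features, and assumes that a reliable negative is any unlabeled residue that the high-recall step-1 model still scores below zero. The function names, the weight of 10 on positives, and all hyperparameters are hypothetical choices for illustration.

```python
def train_weighted_linear_svm(X, y, pos_weight=1.0, lam=0.01, lr=0.01, epochs=200):
    """Fit a linear SVM by subgradient descent on a class-weighted hinge loss.

    X is a list of feature vectors and y a list of labels in {+1, -1}.
    pos_weight > 1 makes errors on positives costlier, which pushes the
    model toward high recall on the positive class.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            cost = pos_weight if yi > 0 else 1.0
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularisation shrinks the weights at every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: move toward classifying xi correctly.
                w = [wj + lr * cost * yi * xj for wj, xj in zip(w, xi)]
                b += lr * cost * yi
    return w, b


def decision(model, x):
    """Signed score of x under a (weights, bias) model."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def two_step_pu(X_pos, X_unl, pos_weight=10.0):
    """Two-step positive-unlabeled learning (illustrative sketch only)."""
    # Step 1: treat every unlabeled residue as negative, but up-weight the
    # positives so that the model favours recall on epitope residues.
    X = X_pos + X_unl
    y = [1] * len(X_pos) + [-1] * len(X_unl)
    step1 = train_weighted_linear_svm(X, y, pos_weight=pos_weight)
    # Assumption: unlabeled residues scored below zero are reliable negatives.
    reliable_neg = [x for x in X_unl if decision(step1, x) < 0]
    if not reliable_neg:
        raise ValueError("no reliable negatives found; lower pos_weight")
    # Step 2: train the final classifier on positives vs. reliable negatives.
    X2 = X_pos + reliable_neg
    y2 = [1] * len(X_pos) + [-1] * len(reliable_neg)
    step2 = train_weighted_linear_svm(X2, y2, pos_weight=1.0)
    return step2, reliable_neg


# Toy data: two standardized features per residue (hypothetical values).
X_pos = [[1.0, 0.8], [1.2, 1.0], [0.8, 1.1], [0.9, 0.9]]
X_unl = [[1.0, 1.0],                      # epitope-like but unlabeled
         [-1.0, -0.9], [-0.8, -1.2], [-1.1, -1.0], [-0.9, -0.7], [-1.2, -1.1]]
model, reliable_neg = two_step_pu(X_pos, X_unl)
```

In this toy setting the epitope-like unlabeled point is kept out of the negative set, so the final model is not trained to reject it. A kernel SVM and a tuned decision threshold would be natural replacements for the simplifications made here.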


Jing Ren, Qian Liu, John Ellis and Jinyan Li
Full list of author information is available at the end of the article

Figure S1: The distribution of residues on RSA and PI.
Figure S1 presents the distribution of positive and unlabeled points on relative solvent accessibility (RSA) and protrusion index (PI).

Figure S1
The distribution of residues on RSA and PI. The confirmed epitope residues are colored in red, while the unlabeled residues are colored in blue.

In most cases, a residue needs to be exposed to be identified by antibodies [1]. Figure S2 illustrates the distribution of accessible surface area (ASA) and RSA in epitope residues and other surface residues (ASA > 0). The median ASA of epitope residues is 67.7 Å², while that of other surface residues is 37.2 Å²; the median RSA of epitope residues is 43.8%, while that of other surface residues is 24.6%. The differences in these two features between epitope residues and other surface residues are significant, with p-values (rank-sum test) of 3.1e-127 and 1.3e-124, respectively. This indicates that epitope residues are more exposed than other surface residues.

Table S2. Previous studies also suggested that this feature is important for the identification and prediction of epitopes [2,3]. The distribution of PI is shown in Figure S3: the median PI of epitope residues is 0.709, and that of other surface residues is 0.436; the difference is significant, with a p-value of 1.2e-74. This suggests that epitope residues are more protrusive than other surface residues.

The B factor characterizes the mobility of residues and has been claimed to be an effective feature in epitope prediction [4,5]. It ranks 7th in Additional File 4: Table S2. Figure S4 shows the distribution of the normalized B factor in epitope and other surface residues. The B factor is normalized within each antigen because it may be influenced by the structure-determination conditions, such as resolution. The median normalized B factor of epitope sites is 0.31, while that of other surface sites is -0.06. The two distributions differ markedly, with a p-value of 4.0e-30, which implies that epitope sites are more flexible than other surface sites.

Figure S5 shows the ratio of amino acids between the epitope (or interaction) and surface residues.
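The per-antigen B-factor normalization and the rank-sum comparisons used for the feature distributions can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the normalization is a z-score within each antigen, and it uses the normal approximation to the two-sided Wilcoxon rank-sum test with mid-ranks for ties and no tie correction of the variance. Both function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def normalize_b_factors(b_factors):
    """Z-score the B factors of one antigen (assumed normalization)."""
    mu = mean(b_factors)
    sigma = pstdev(b_factors)
    if sigma == 0:
        # All residues share the same B factor: nothing to rescale.
        return [0.0 for _ in b_factors]
    return [(b - mu) / sigma for b in b_factors]


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p). Tied values receive their average rank; the variance
    is not corrected for ties, so p is approximate when ties are common.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = mid_rank
        i = j + 1
    # Rank sum of the first sample and its null mean and standard deviation.
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(a), len(b)
    mu_w = n1 * (n1 + n2 + 1) / 2
    sigma_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu_w) / sigma_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Toy usage with hypothetical raw B factors for one antigen.
raw = [12.0, 15.0, 40.0, 38.0, 14.0, 45.0]
is_epitope = [False, False, True, True, False, True]
norm = normalize_b_factors(raw)
epitope_b = [v for v, e in zip(norm, is_epitope) if e]
surface_b = [v for v, e in zip(norm, is_epitope) if not e]
z, p = rank_sum_test(epitope_b, surface_b)
```

Normalizing within each antigen before pooling keeps a low-resolution structure with uniformly high B factors from dominating the comparison.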

Figure S5
The ratio between the epitope (or interaction) and surface amino acids. The amino acids are sorted by hydrophobicity: the amino acids on the left side are more hydrophilic, while the amino acids on the right side are more hydrophobic.
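A minimal sketch of how such a per-amino-acid ratio could be computed. It assumes the ratio is the frequency of an amino acid among epitope residues divided by its frequency among surface residues; the original does not spell out the formula, so that definition and the function name `composition_ratio` are assumptions for illustration.

```python
from collections import Counter


def composition_ratio(epitope_residues, surface_residues):
    """Ratio of epitope frequency to surface frequency per amino acid.

    Both arguments are sequences of one-letter amino acid codes. Only
    amino acids that occur on the surface are reported, which avoids
    division by zero.
    """
    epitope_counts = Counter(epitope_residues)
    surface_counts = Counter(surface_residues)
    n_epitope = len(epitope_residues)
    n_surface = len(surface_residues)
    ratios = {}
    for aa, count in surface_counts.items():
        epitope_freq = epitope_counts[aa] / n_epitope
        surface_freq = count / n_surface
        ratios[aa] = epitope_freq / surface_freq
    return ratios


# Toy usage with hypothetical residue lists.
ratios = composition_ratio(list("KKDG"), list("KKDGLLAV"))
```

Under this definition, a ratio above 1 means the amino acid is enriched in epitopes relative to the surface, and a ratio below 1 means it is depleted.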

Figure S6: Secondary structure
Secondary structure is another conventional feature used in epitope prediction [6]. Figure S6 shows the distribution of secondary structures. Compared with surface sites, epitope sites are richer in turns and poorer in beta sheets. Additionally, epitopes and interactions show clearly different preferences; for example, interactions contain more alpha helices than epitopes.

Figure S8: Overlapping between epitopes and internal interactions.
Epitopes and internally interacting residues may overlap at their edges, as shown in Figure S8. This is the reason for the slight decrease in recall in Additional File 4: Table S1. Of the 1648 internally interacting residues in our unbound data set, 165 (about 10%) overlap with epitope residues.

Figure S8
Overlapping between epitopes and internal interactions in 3K7B. The epitope is colored in red, the internally interacting residues are colored blue, and the overlapping residues are colored orange. The other chain that interacts with the target chain is shown in gray.