Datasets
The datasets used in a study depend on its goals and scope. Some conformational epitope prediction models are built on bound structures, while others are built on unbound structures. Therefore, we use both bound and unbound datasets to evaluate and compare models.
We use the dataset published by Rubinstein as the benchmark bound dataset [26]. The bound dataset consists of 66 non-redundant Ag-Ab structures, available at: http://epitopia.tau.ac.il/trainData/.
We use Liang's dataset as the benchmark unbound dataset [28]. Liang's dataset was compiled as follows: (1) 22 antigen-antibody complexes and their unbound structures were sourced from protein docking Benchmark 2.0 [37]; 59 representative antigen-antibody complexes were provided by [38]; 17 antigen-antibody complex structures were collected from [27]; (2) these structures were merged, and the complexes without an available unbound structure were removed. Finally, a total of 48 complexes and their unbound structures were retained as the benchmark unbound dataset, available at: http://sysbio.unl.edu/services/.
In addition, we use an independent test set compiled from entries of the Conformational Epitope Database (CED) [39], which contains 19 antigen structures with annotated epitopes. This dataset is available at: http://sysbio.unl.edu/services/.
We also compile a benchmark dataset of 83 antigen sequences from Rubinstein's structure dataset, available at http://code.google.com/p/my-project-bpredictor/downloads/list, so that the sequence-based models can be compared fairly with the structure-based models.
Epitope definition
Several definitions have been used for epitopes inferred from the X-ray structures of Ag-Ab complexes, such as the loss of accessible surface area upon antibody binding or the distance between antigen residues and antibody residues. However, the study in [38] indicated that different epitope definitions are likely to give similar results. Hence, we follow the commonly used distance-based definition. Specifically, an antigen residue is defined as an epitope residue if its distance to any antibody residue is less than 4 Å, where the distance between two residues is the minimal Euclidean distance between the centers of any of their non-hydrogen atoms.
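The distance-based definition can be illustrated with a minimal sketch. It assumes that each residue is given simply as a list of non-hydrogen atom coordinates; this data layout, the function names, and the toy coordinates are hypothetical and serve only as an illustration.

```python
import math

# Hypothetical layout: a residue is a list of (x, y, z) coordinates of its
# non-hydrogen atoms; a chain is a list of such residues.

def min_atom_distance(res_a, res_b):
    """Minimal Euclidean distance between any two non-hydrogen atoms of two residues."""
    return min(math.dist(a, b) for a in res_a for b in res_b)

def epitope_residues(antigen, antibody, cutoff=4.0):
    """Indices of antigen residues closer than `cutoff` angstroms to any antibody residue."""
    return [
        i for i, ag_res in enumerate(antigen)
        if any(min_atom_distance(ag_res, ab_res) < cutoff for ab_res in antibody)
    ]

# Toy coordinates (not taken from any real structure).
antigen = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],    # nearest atom lies 3.5 A from the antibody atom
    [(10.0, 0.0, 0.0), (11.5, 0.0, 0.0)],  # nearest atom lies 5.0 A from the antibody atom
]
antibody = [[(5.0, 0.0, 0.0)]]

print(epitope_residues(antigen, antibody))
```

With the 4 Å cutoff, only the first toy residue is labeled as an epitope residue.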
Thick surface patch
A residue is defined as a surface residue if its relative accessible surface area (RASA), calculated by the DSSP program [40], is more than 5%. When surface patches are used to describe the spatial characteristics of antigen residues, epitope residues and non-epitope residues are assumed to differ in their surface patches. However, a surface patch includes only surface residues, which raises a question: are the adjacent interior residues unimportant or unnecessary for representing the spatial context? Interior residues clearly cannot be epitope residues, but they may still influence surface residues and contribute to the formation of epitope sites. The impact of interior residues therefore should not be neglected and needs to be investigated. In this study, we propose a new concept, the 'thick surface patch'. Formally, the thick surface patch of a surface residue is defined as the set of its n nearest adjacent residues, including interior neighbors as well as surface neighbors. For simplicity, the thick surface patch and the surface patch are both referred to as a 'residue patch' in the following sections.
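The difference between the two kinds of patch can be sketched as follows. The sketch assumes that residues are stored in a dictionary of non-hydrogen atom coordinates, that RASA values are given as fractions, that residue-residue distance is measured between the nearest non-hydrogen atoms, and that the central residue itself is not counted among its neighbors; all names and toy values are hypothetical.

```python
import math

def nearest_neighbours(center, residues, n):
    """The n residues nearest to `center`, ranked by nearest heavy-atom distance."""
    def distance(other):
        return min(math.dist(a, b) for a in residues[center] for b in residues[other])
    return sorted((r for r in residues if r != center), key=distance)[:n]

def thick_surface_patch(center, residues, n=20):
    """Thick surface patch: interior and surface neighbours are both eligible."""
    return nearest_neighbours(center, residues, n)

def surface_patch(center, residues, rasa, n=20, cutoff=0.05):
    """Conventional surface patch: only neighbours with RASA above the cutoff are eligible."""
    exposed = {r: xyz for r, xyz in residues.items()
               if r == center or rasa[r] > cutoff}
    return nearest_neighbours(center, exposed, n)

# Toy data: residue "A2" is buried (RASA of 1%), the others are exposed.
residues = {
    "A1": [(0.0, 0.0, 0.0)],
    "A2": [(3.0, 0.0, 0.0)],
    "A3": [(0.0, 4.0, 0.0)],
    "A4": [(0.0, 0.0, 6.0)],
}
rasa = {"A1": 0.40, "A2": 0.01, "A3": 0.30, "A4": 0.20}

print(thick_surface_patch("A1", residues, n=2))
print(surface_patch("A1", residues, rasa, n=2))
```

In the toy example the buried residue "A2" enters the thick surface patch of "A1" but is skipped by the conventional surface patch.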
Adjacent residue distance feature
The residue patch is critical for conformational epitope prediction. However, the contributions of the residues in a patch may differ and may depend on their distances to the central residue. Since existing methods usually use patches of 20 residues, the analysis is carried out on patches of this size.
To test whether the distances between adjacent residues and the central residue affect the state of the central residue, we calculate the average distance between the adjacent residues and the central residue of each patch, and compare this average between epitope patches and non-epitope patches for each central residue type. The results reveal that non-epitope patches have a significantly smaller average distance than epitope patches (P = 1.59 × 10^-10 by paired t-test for the bound dataset, see Figure 1).
As a further test, we compare the distance between the k-th nearest adjacent residue (k = 1, 2, ..., 20) and the central residue in epitope patches versus non-epitope patches. The results show that the distance distribution in epitope patches is significantly different from that in non-epitope patches (P = 1.20 × 10^-7 by paired t-test for the bound dataset, see Figure 2).
The statistical analysis on the bound dataset thus suggests that the average distance and the distance distribution of a patch may help to distinguish epitope patches from non-epitope patches. A similar conclusion can be drawn for the unbound dataset (data not shown).
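For readers unfamiliar with the paired comparison used above, the following sketch computes the paired t statistic from two matched lists of average distances. The input values are invented for illustration and are not results from this study; obtaining the P value additionally requires the t distribution with n - 1 degrees of freedom, for which a library routine such as scipy.stats.ttest_rel could be used.

```python
import math
import statistics

def paired_t_statistic(xs, ys):
    """t statistic of a paired t-test on two matched samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two matched samples with at least two pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))

# Invented average distances (in angstroms), one pair per central residue type.
epitope_avg = [5.0, 6.0, 7.0]
non_epitope_avg = [4.0, 4.5, 6.0]

t_value = paired_t_statistic(epitope_avg, non_epitope_avg)
print(round(t_value, 6))
```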
Based on the above study, we propose an adjacent residue distance feature based on the distance between the adjacent residue and the central residue, which is defined as follows:
where x_i represents an adjacent residue in a patch (i = 1, 2, ..., n), and d_i is the Euclidean distance between x_i and the central residue (based on the nearest non-hydrogen atoms). In this feature, the contributions of the adjacent residues in a patch are represented quantitatively and depend on their relative distances to the central residue.
Descriptors for residue patch
When constructing prediction models, each patch must be represented as a feature vector built from physicochemical and structural features. In addition to the adjacent residue distance feature, several popular physicochemical and structural features are used.
Relative accessible surface area: it is an important factor influencing antigen-antibody binding, and a greater relative area of a surface residue means a greater probability of being an epitope residue. The relative accessible surface area of a residue is calculated by dividing its accessible surface area by the accessible surface area of the fully exposed amino acid. The accessible surface areas of surface residues are calculated by using the DSSP program [40], and the fully exposed amino acid areas can be obtained from [41].
Evolutionary conservation: functional regions on protein surfaces are usually more evolutionarily conserved than other regions, but the study of antigen crystal structures draws the opposite conclusion. A statistical test reveals that evolutionary conservation can significantly distinguish epitopes from non-epitope regions [42]. To calculate conservation scores, the primary sequence of the antigen chain to be predicted is aligned against the non-redundant protein database by using the BLAST program (with the number of iterations set to 3), and a position-specific scoring matrix (PSSM) is returned. The conservation score of the residue at sequence position i is then calculated by the following function:
Here, M_ir is the value of residue type r at sequence position i according to the PSSM, and B_rr is the diagonal element of BLOSUM62 for residue type r. The same function is used in [28].
Secondary structure: secondary structures have been shown to be important factors in the Ag-Ab interaction, and epitopes are likely to have specific secondary structure elements compared with non-epitope surfaces [42]. Here, we use DSSP to calculate the secondary structures of surface residues, and each secondary structure (helix, sheet or coil) is represented as a three-bit string: (1, 0, 0), (0, 1, 0) and (0, 0, 1), respectively.
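The three-bit encoding can be sketched as below. DSSP reports eight secondary structure states, and the reduction to helix, sheet and coil used here (H, G, I as helix; E, B as sheet; everything else as coil) is a commonly used convention that is assumed for illustration, not a mapping stated in this study.

```python
# Assumed reduction of DSSP's eight-state codes to three states.
HELIX_CODES = {"H", "G", "I"}
SHEET_CODES = {"E", "B"}

def encode_secondary_structure(dssp_code):
    """Map a one-letter DSSP code to the (helix, sheet, coil) three-bit encoding."""
    if dssp_code in HELIX_CODES:
        return (1, 0, 0)
    if dssp_code in SHEET_CODES:
        return (0, 1, 0)
    return (0, 0, 1)  # turns, bends and unassigned residues are treated as coil

print([encode_secondary_structure(code) for code in "HET"])
```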
Amino acid composition: amino acid composition is widely used in protein function analysis and classification. In the Ag-Ab interaction, some amino acid types are significantly overrepresented in epitopes, and others are underrepresented, thus the amino acid composition can be used to differentiate epitope patches from non-epitope patches [42]. For a patch, the percentage of each amino acid type is calculated as the amino acid composition.
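As a small illustration of the composition calculation, the sketch below returns the percentage of each of the 20 standard amino acid types in a patch given as one-letter codes. The function name and the toy patch are hypothetical, and non-standard residue codes are simply ignored, which is an assumption of this sketch.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(patch_residues):
    """Percentage of each standard amino acid type among the residues of a patch."""
    counts = Counter(patch_residues)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("patch contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

composition = amino_acid_composition("AAKD")  # toy patch of four residues
print(composition["A"], composition["K"], composition["W"])
```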
With respect to these physicochemical and structural features, each residue in a residue patch can be represented as a feature vector of 7 dimensions (1 for relative accessible surface area, evolutionary conservation, the adjacent residue distance, and amino acid composition, respectively, 3 for secondary structure). As a result, a patch of n residues is represented by a 7 × n -dimensional feature vector.
The strategy for the imbalanced dataset
Many real datasets are imbalanced, that is, the instances of one class make up the majority of the data. Common machine learning methods do not handle imbalanced datasets well, and they are usually combined with additional strategies to address the problem. There are two common approaches to dealing with imbalanced datasets. One is to assign a high cost to the misclassification of the minority class and redesign the classifier to minimize the error rate. The other is to downsize the majority class or upsize the minority class.
An approach based on data bootstrapping and voting is used here to deal with the imbalanced data. It is summarized as follows:
1. Let A be the training set, A- the set of negative instances and A+ the set of positive instances, where there are many more negative instances than positive instances.
2. Random data sampling is performed n times on the set A- to obtain n data subsets (indexed by i = 1, 2, ..., n), each of the same size as A+.
3. Each subset is combined with A+ to generate n different training sets, and a random forest model is built on each training set. In total, n models are obtained.
4. Given a new instance, the n random forest models (sub-classifiers) produce n binary decision values, and a voting strategy is used to make the final decision.
Random forest and data bootstrapping are implemented with the Weka package [43], using default parameters.
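The four steps can be sketched in a few lines. To keep the example self-contained, a nearest-centroid classifier stands in for the random forest; it is explicitly not the model used in this study, and the sampling is done without replacement, which is also an assumption. All function names and the toy data are hypothetical.

```python
import random

def balanced_subsets(negatives, positives, n_models, seed=0):
    """Step 2: draw n_models random subsets of the negatives, each the size of the positives."""
    rng = random.Random(seed)
    return [rng.sample(negatives, len(positives)) for _ in range(n_models)]

def train_centroid_model(negatives, positives):
    """Stand-in sub-classifier (NOT a random forest): assign the nearest class centroid."""
    def centroid(rows):
        return [sum(column) / len(rows) for column in zip(*rows)]
    c_neg, c_pos = centroid(negatives), centroid(positives)

    def predict(x):
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        return 1 if d_pos < d_neg else 0
    return predict

def train_ensemble(negatives, positives, n_models=11, seed=0):
    """Step 3: one sub-classifier per balanced training set."""
    subsets = balanced_subsets(negatives, positives, n_models, seed)
    return [train_centroid_model(subset, positives) for subset in subsets]

def positive_votes(models, x):
    """Step 4a: number of sub-classifiers that predict the positive class."""
    return sum(model(x) for model in models)

def vote(models, x, cutoff=None):
    """Step 4b: positive if at least `cutoff` sub-classifiers vote positive (default: half)."""
    if cutoff is None:
        cutoff = len(models) / 2
    return 1 if positive_votes(models, x) >= cutoff else 0

# Toy data: ten negatives near x = 0 and three positives near x = 5.
negatives = [(0.1 * i, 0.0) for i in range(10)]
positives = [(5.0, 0.0), (5.2, 0.0), (4.8, 0.0)]

models = train_ensemble(negatives, positives, n_models=11)
print(vote(models, (4.5, 0.0)), vote(models, (0.2, 0.0)))
```

Keeping the raw vote count available through positive_votes, rather than only the final label, makes it possible to vary the voting cutoff later when a ROC curve is needed.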
Performance evaluation metrics
The performance of the models is evaluated by leave-one-out cross-validation (LOOCV) and an independent test. In this study, the LOOCV procedure is slightly different from the usual one in that it operates on whole structures: for a dataset of n structures, each time n-1 structures are used to train the model and the remaining structure is used to test it. In the independent test, the prediction models are trained on the training set and then tested on the independent test structures.
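The structure-level splitting can be sketched as a small generator. The structure identifiers below are placeholders, not entries of the benchmark datasets.

```python
def leave_one_structure_out(structures):
    """Yield (training_structures, test_structure) pairs, holding out one structure at a time."""
    for i, held_out in enumerate(structures):
        yield structures[:i] + structures[i + 1:], held_out

# Placeholder identifiers for three structures.
splits = list(leave_one_structure_out(["s1", "s2", "s3"]))
for train, test in splits:
    print(train, test)
```

Because all residues of the held-out structure are excluded from training together, residues of the same antigen never appear on both sides of a split.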
The performance of models is scored by several metrics, i.e. sensitivity (SN), specificity (SP), F-measure (F), accuracy (ACC) and the area under ROC curve (AUC).
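For reference, the standard definitions of the first four metrics are reproduced below. They are assumed to be the ones intended here, and F is assumed to be the balanced F-measure (F1) computed for the epitope class.

```latex
\begin{align}
SN  &= \frac{TP}{TP + FN} \\
SP  &= \frac{TN}{TN + FP} \\
ACC &= \frac{TP + TN}{TP + TN + FP + FN} \\
F   &= \frac{2\,TP}{2\,TP + FP + FN}
\end{align}
```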
Here, TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. AUC is used as the primary evaluation metric. To calculate AUC, we use a voting cutoff to make the final prediction and then vary the cutoff to obtain different SN and SP values. The SN, SP, ACC and F scores in the following tables are calculated at the cutoff at which half of all sub-classifiers give a positive decision.
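A minimal sketch of this evaluation is given below. It assumes that each test residue comes with the number of sub-classifiers that voted positive, that a residue is predicted positive when its vote count reaches the cutoff, that the metrics follow their standard definitions with F taken as the balanced F-measure, and that the area under the ROC curve is obtained by the trapezoidal rule; the vote counts and labels are invented.

```python
def confusion_counts(votes, labels, cutoff):
    """TP, TN, FP, FN when a residue is called positive if its votes reach the cutoff."""
    tp = tn = fp = fn = 0
    for v, y in zip(votes, labels):
        predicted = 1 if v >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        elif predicted == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """SN, SP, ACC and F from the confusion counts (both classes must be present)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    f = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return sn, sp, acc, f

def auc_from_votes(votes, labels, n_models):
    """Area under the ROC curve obtained by sweeping the voting cutoff."""
    points = []
    for cutoff in range(n_models + 1, -1, -1):  # from "no positives" to "all positives"
        tp, tn, fp, fn = confusion_counts(votes, labels, cutoff)
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (1 - SP, SN)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented example: 10 sub-classifiers, six test residues.
votes = [9, 7, 4, 6, 2, 1]
labels = [1, 1, 1, 0, 0, 0]

sn, sp, acc, f = metrics(*confusion_counts(votes, labels, cutoff=5))
auc = auc_from_votes(votes, labels, n_models=10)
print(round(sn, 3), round(sp, 3), round(acc, 3), round(f, 3), round(auc, 3))
```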