Epitope-based vaccines (EVs) make use of short, antigen-derived peptides (corresponding to epitopes) that are administered to trigger a protective humoral (B-cell epitopes) and/or cellular (T-cell epitopes) immune response. T-cell epitopes are presented to T cells in association with major histocompatibility complex (MHC) proteins. While cytotoxic T cells recognize intracellular peptides displayed by MHC class I molecules (CD8+ T-cell epitopes), T helper cells recognize peptides that are taken up from the extracellular space and displayed by MHC class II molecules (CD4+ T-cell epitopes). The peptide:MHC complex (pMHC) interacts with the T-cell receptor, leading to T-cell activation and the subsequent induction of a cellular immune response. EVs offer several potential benefits over traditional vaccines, including precise control over immune response activation, the ability to focus on the most relevant antigen regions (conserved and/or highly immunogenic), and production and biosafety advantages due to their chemically simple and well-characterized composition. CD4+ T-cell epitopes play a key role in EV design, as the cognate help provided by these cells is essential for the generation of vigorous humoral and cytotoxic CD8+ T-cell responses. Because experimental screening of large sets of peptides is time-consuming and costly, in silico methods that facilitate CD4+ T-cell epitope mapping on protein antigens are paramount for EV development.
Human MHC class II (HLA class II) proteins are made up of two transmembrane chains, an α-chain (34 kDa) and a β-chain (29 kDa), which together shape the peptide-binding groove. This region defines five pockets that mostly interact with individual residues of the peptide. The HLA genes are the most polymorphic in the human genome. Currently, the IMGT/HLA database lists 1679 HLA class II alleles associated with three classical loci (1267 DR, 223 DQ and 189 DP alleles), corresponding to 931 distinct HLA class II DR protein variants (allotypes; as of September 2012), and this number continues to grow at a rate of 200 alleles per year. This huge diversity poses serious problems for vaccine design, as different alleles are expressed at dramatically different frequencies in different ethnicities. Individuals carry different sets of alleles, which likely respond to different sets of peptides from a given pathogen.
The MHC class II binding groove is open at both ends, allowing peptides (~9-22 residues long) to project out of the groove. This causes ambiguity in their positional alignment and makes the alignment a fundamental step in predicting binding. Two classes of methods for CD4+ T-cell epitope prediction have emerged. The first category (data-driven methods) relies on peptide sequence comparisons to identify binding motifs, and includes the pioneering method SYFPEITHI and matrix-based approaches such as position-specific binding profiles (e.g. Rankpep, ARB and SMM-align). TEPITOPE and TEPITOPEpan are based on the so-called “pocket profiles”, and MultiRTA is based on thermodynamic principles. Another group of data-driven approaches involves machine learning, which has been proposed to capture subtle MHC class II-binding patterns (e.g. NN-align, NetMHCIIPan-2.0 and MHCIIMulti). The second category involves structure-based methods. These methods perform molecular modeling calculations on pMHCs in order to estimate the binding energies, thus offering independence from experimental binding data. A recent analysis showed them to be better than random, but inferior to state-of-the-art data-driven approaches.
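The alignment ambiguity described above can be illustrated with a minimal sketch. It slides a fixed-length window along a peptide and scores each candidate core against a position-specific profile, which is the general idea behind matrix-based predictors. Everything specific here is an assumption made for illustration: the 9-residue core length, the names `CORE_LEN`, `TOY_PROFILE`, `score_core` and `best_core`, and the profile values themselves, which are invented and do not come from any published matrix or from the methods cited above.

```python
from typing import Dict, Tuple

# Assumed core length; the text only states the overall peptide range (~9-22).
CORE_LEN = 9

# Hypothetical toy profile (invented values, not a published matrix).
# Keys are 0-based positions inside the core; unlisted residues score 0.
TOY_PROFILE: Dict[int, Dict[str, float]] = {
    0: {"F": 2.0, "Y": 1.5, "L": 1.0},
    3: {"D": 1.0, "E": 0.5},
    5: {"T": 1.0, "S": 0.5},
    8: {"L": 1.0, "A": 0.5},
}


def score_core(core: str) -> float:
    """Sum the per-position scores of one candidate core."""
    return sum(TOY_PROFILE.get(i, {}).get(res, 0.0) for i, res in enumerate(core))


def best_core(peptide: str) -> Tuple[int, str, float]:
    """Return (offset, core, score) of the highest-scoring window.

    Ties are resolved in favour of the earliest offset.
    """
    if len(peptide) < CORE_LEN:
        raise ValueError("peptide is shorter than the assumed core length")
    best_offset = 0
    best_score = score_core(peptide[:CORE_LEN])
    for offset in range(1, len(peptide) - CORE_LEN + 1):
        score = score_core(peptide[offset:offset + CORE_LEN])
        if score > best_score:  # strict comparison keeps the earliest best window
            best_offset, best_score = offset, score
    return best_offset, peptide[best_offset:best_offset + CORE_LEN], best_score


if __name__ == "__main__":
    print(best_core("AAFKKDKTKKLAA"))
```

A 13-residue peptide yields five candidate windows, and the register chosen determines which residues are matched to which profile positions. This is why the alignment step matters as much as the scoring itself.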
The few methods able to cope with the extent of HLA class II polymorphism are collectively referred to as pan-specific approaches. Although these methods (NetMHCIIPan-2.0, TEPITOPEpan, MultiRTA) are potentially suitable for EV design, they neither account for the full allotypic diversity of human ethnic populations nor take immunodominance into consideration. Predivac, the new tool described in this paper, implements a different pan-specific approach based on the concept of specificity-determining residues (SDRs). This methodology has been previously described by our group for the prediction of substrate specificity of protein kinases [18-20]. While the binding interface of a protein can be extensive, only a small group of SDRs is responsible for specific interactions. SDRs have mainly been studied in peptide recognition domains (e.g., PDZ, SH3 and kinase domains) of proteins with roles in signalling pathways. Identifying mutations that alter specificity may require a large amount of experimental work; therefore, a number of computational approaches have been developed to predict SDRs. Most of these methods are based on multiple sequence alignments and the use of statistical analysis and evolutionary information to identify SDRs. Other approaches combine multiple sequence alignments with structural information of the binding site [23, 24] or physical properties. Like other bioinformatics methods for CD4+ T-cell epitope prediction, Predivac focuses on predicting pMHC complex formation. The method assumes that T cells with the required specificity will be present in the T-cell repertoire. However, despite improvements in the performance of methods predicting MHC class II peptide binding, a recent study showed that state-of-the-art methods are still unsuccessful in predicting CD4+ T-cell epitopes, highlighting the need to develop new approaches that cope better with epitope discovery.
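The SDR idea mentioned above, that a small set of residues governs specificity, can be sketched generically as follows. The code extracts the residues at a given set of alignment columns and ranks reference proteins by their identity to a query at those columns. This is only an illustration of the concept and is not presented as Predivac's actual procedure: the function names, the example sequences, the column indices and the plain identity measure are all hypothetical choices.

```python
from typing import Dict, List, Sequence, Tuple


def sdr_signature(aligned_seq: str, sdr_positions: Sequence[int]) -> str:
    """Concatenate the residues found at the given 0-based alignment columns."""
    return "".join(aligned_seq[p] for p in sdr_positions)


def sdr_identity(sig_a: str, sig_b: str) -> float:
    """Fraction of identical residues between two equal-length signatures."""
    if not sig_a or len(sig_a) != len(sig_b):
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def rank_by_sdr_similarity(
    query: str,
    references: Dict[str, str],
    sdr_positions: Sequence[int],
) -> List[Tuple[str, float]]:
    """Rank reference proteins by SDR identity to the query (highest first).

    All sequences are assumed to come from the same multiple sequence
    alignment, so that column indices are comparable between them.
    """
    query_sig = sdr_signature(query, sdr_positions)
    scored = [
        (name, sdr_identity(query_sig, sdr_signature(seq, sdr_positions)))
        for name, seq in references.items()
    ]
    # Sort by descending identity, then by name for a deterministic order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical aligned fragments and hypothetical SDR columns.
    positions = [1, 4, 6, 9]
    refs = {"ref_A": "ARNDCQEGHI", "ref_B": "AKNDCQEGHL", "ref_C": "GGGGGGGGGG"}
    print(rank_by_sdr_similarity("ARNDCQEGHI", refs, positions))
```

Restricting the comparison to a handful of columns means that two proteins can differ across most of their sequence and still be treated as similar in specificity, provided they agree at the positions that matter.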
A significant source of complexity in EV design comes from the fact that most of the response is mounted against a few so-called immunodominant epitopes, despite the presence of many potential epitopes within an immunogen. This restricted antigenic specificity of T cells poses a serious challenge for EV design, as vaccine formulations built on epitopes that do not dominate the immune response will not induce effective protection in the vaccinated organism. Recent evidence suggests that pMHC kinetic stability plays a central role in controlling MHC class II peptide immunogenicity. In concordance with this model, a strong correlation has been observed between high-affinity HLA class II peptide binding, immunodominance and promiscuous CD4+ T-cell recognition. Several studies support this correlation for both MHC class I and class II proteins. Predivac was consequently developed using high-affinity binding data, on the assumption that such data bias the method toward capturing the underlying peptide features that correlate with promiscuity and immunodominance, two properties that are fundamental for EV design.
In this study, we introduce the pan-specific method Predivac for CD4+ T-cell epitope prediction, which is based on the SDR concept previously applied to protein phosphorylation site prediction. We assessed the performance of Predivac by cross-validation and compared its predictive performance with that of several state-of-the-art methods in terms of HLA class II peptide binding and CD4+ T-cell epitope prediction. The comparisons showed that Predivac performed comparably to ten competing methods in high-affinity binding prediction, but delivered the highest specificity in CD4+ T-cell epitope identification, with a particularly strong performance in immunodominant epitope identification, compared to three other pan-specific approaches.