An automated framework for understanding structural variations in the binding grooves of MHC class II molecules

Background MHC/HLA class II molecules are important components of the immune system and play a critical role in processes such as phagocytosis. Understanding peptide recognition properties of the hundreds of MHC class II alleles is essential to appreciate determinants of antigenicity and ultimately to predict epitopes. While there are several methods for epitope prediction, each differing in their success rates, there are no reports so far in the literature to systematically characterize the binding sites at the structural level and infer recognition profiles from them. Results Here we report a new approach to compare the binding sites of MHC class II molecules using their three dimensional structures. We use a specifically tuned version of our recent algorithm, PocketMatch. We show that our methodology is useful for classification of MHC class II molecules based on similarities or differences among their binding sites. A new module has been used to define binding sites in MHC molecules. Comparison of binding sites of 103 MHC molecules, both at the whole groove and individual sub-pocket levels has been carried out, and their clustering patterns analyzed. While clusters largely agree with serotypic classification, deviations from it and several new insights are obtained from our study. We also present how differences in sub-pockets of molecules associated with a pair of autoimmune diseases, narcolepsy and rheumatoid arthritis, were captured by PocketMatch13. Conclusion The systematic framework for understanding structural variations in MHC class II molecules enables large scale comparison of binding grooves and sub-pockets, which is likely to have direct implications towards predicting epitopes and understanding peptide binding preferences.


Background
Major histocompatibility complex (MHC) class II molecules are important components of the immune system and play a critical role in processes such as phagocytosis. Antigenic peptide binding by these molecules is a prerequisite for triggering immune responses. The diversity in antigen recognition is achieved through hundreds of class II alleles labelled by their serotypes, each differing from the others in terms of the residues at the binding site and their precise three dimensional arrangement.
The nature of binding site of an MHC class II molecule ( Figure 1) has an important bearing on the immune system of an individual [1,2]. MHC class II molecules provide important clues in understanding autoimmune diseases (e.g. [3][4][5]) and susceptibility to pathogens. In the context of tuberculosis, it has been reported that different MHC alleles bind peptides from Mycobacterium tuberculosis with different specificities, influencing an individual's susceptibility to infection [6][7][8].
A thorough knowledge of the structure of the binding site is useful in designing or identifying peptide antigens for rational vaccine design. In addition, knowledge of similar or dissimilar sites aid in understanding peptide specificities. While a general appreciation of the differences between a pair of structures can be obtained through interactive molecular graphics software tools, a thorough characterization of the differences and their mapping to individual residues in the corresponding structures, and more importantly obtaining a quantitative perspective of the extent of similarities, necessarily requires a systematic method for their analysis.
We have recently reported a new algorithm PocketMatch [9] based on alignment of sorted distance elements binned into point-type-pair bins. An important step that precedes pocket comparison is the definition of the binding site itself. In the previous study, all residues (or any atoms in them) that were present in a 4 Å zone around any atom of the ligand were taken to constitute the site. This approach though common, is rather simplistic and more detailed methods to define the binding site need to be explored to have more accurate site definitions. Here we incorporate a new module for defining binding sites and apply it for a large scale comparison of binding sites in the MHC class II molecules.
The modified algorithm is referred to as PocketMatch 13 hereafter. Further, we show that our algorithm is useful for classification of MHC class II molecules based on binding site analysis. The algorithm captures the overall shape, detailed geometry and the chemistry at the binding sites. This analysis also aids in understanding peptide preferences by different alleles which may become the first step in the optimal design of allele specific antigens.

Results and Discussion
We report a new approach for a large scale comparison of binding sites in protein structures and apply it for comparing and classifying a set of 103 MHC class II molecules. The method, which utilizes structural features of the whole site as well as of the sub-pockets, also serves as a high resolution framework to systematically understand similarities and differences among alleles. We have used this to identify automatically intra-and inter-allelic variations in the binding grooves of molecules in the data set, and to explore the structural basis for correlations with disease.

Inter-allelic variations
To investigate similarities across MHC molecules of different types, one MHC molecule was selected from each of the 65 Protein Data Bank (PDB) entries in the dataset, and all-against-all comparisons were carried out on this set of 65 molecules (Table 1). Binding site similarity scores (PM 13 Scores ) were computed for all the pairs of molecules both at the level of whole groove and sub-pocket levels. Cladograms were generated to show similarities and differences in PM 13 Scores across the dataset, both at the level of the whole groove, and at the level of the five sub-pockets (Figures 2 and Figures S1-S4 in Additional file 1). In addition to considering whole binding groove, it is important to know how the similarities of the sub-pockets (P1, P4, P6, P7, P9) vary as these are the ones that determine peptide specificity.
Some MHC molecules of the same type are in different branches of the cladogram calculated for the whole groove, however clustering at the sub-pocket level was more in line with the different MHC molecule types, particularly for the P4 sub-pocket. This suggests that the P4 sub-pocket is more structurally conserved within an allele, but difference occurs across alleles. The importance of the P4 sub-pocket has been noted in many studies (e.g. [1,2,10]).
Some different MHC molecules are grouped together in the same branch in some of the sub-pocket trees. In these cases, the PM 13 Scores highlight similarities that would otherwise be difficult to spot in a large dataset. These can be followed up by looking for independent observations about these similarities that have been reported in the literature. The matching alleles, corresponding PDB codes and PM 13 Scores for pairs of sub-pockets are listed in (Table 2), where the significance of the grouping of different alleles is discussed and supporting references are presented.   Meinl and co-workers [27] also report similarity between the two allele types in recognition of myelin basic protein. The P1 similarity between the two alleles is reported by Jurcevic and co-workers [28].
Though P1 score is high, the other subpocket scores are low (less than 0.34) which is in accordance with study by Texier and co-workers [29] that reports difference between the two alleles in their peptide binding properties.  To analyze the net distribution of similarity scores with respect to each other for each of the five sub-pockets, a histogram is plotted for various bins of PM 13 Scores ( Figure 3). Each bin corresponds to a range of PM 13 Scores. For example, bin-5 corresponds to a PM 13 Score range of [0.5 to 0.6); bin-7 to the range [0.7 to 0.8) and so on. The histogram shows that P1 and P9 score highly at bin 6, corresponding to [0.6 to 0.7) of PM 13 Score. The histogram gives an indication of the overall distribution of scores for each sub-pocket viewed in the context of others. This could possibly mean overrepresentation of data or true conservation of these two sub-pockets.
This analysis has implications for understanding subtle differences that otherwise go undetected and aid in understanding antigen recognition preferences by different alleles and range of antigens recognized by a given allele.

Intra-allelic variations
Some MHC molecules are present more than once in the PDB entries in the dataset (Table 1). In these cases, PocketMatch 13 can be used to highlight differences in the peptide binding sites in different structures for the same allele.
The sites are first compared by considering the whole binding grooves. In many cases, as expected, PM 13 Scores are high, indicating strong similarities in the binding sites of a given allele. However, there are cases where PM 13 Scores are low for different structures of the same molecule, for example different structures of DR1 and DR5 give similarity scores as low as 0.44 (Table S1 in Additional file 1). These differences can be explored by examining the individual sub-pockets within the binding grooves (see Methods). While many pairs of corresponding sub-pockets score highly, indicating similarity in the structures of the sub-pockets, in some cases the scores are significantly lower. This can be due to differences in MHC side chain conformations giving rise to different sets of intra-site distances, or can be due to determination of which MHC atoms are accessible to a probe sphere and are thus included in sub-pocket calculations. Sub-pockets highlighted by PocketMatch 13 to be  The two independent molecules in the crystal structure of DQ8 [PDB:1S9V] differ from each other at the P9 sub-pocket ( Figure 4B); the difference between the two molecules at the P9 position is noted by [11]. This analysis indicates that PocketMatch 13 is sufficiently sensitive to capture subtle differences that exist among molecules belonging to the same allele.
Correlation with disease: case studies Several MHC class II alleles are known to be either positively or negatively associated with certain diseases, and this motivates studies to identify the reasons for disease susceptibility in terms of three-dimensional molecular structure [1]. For example, Jones et al. [1] review the structures of alleles that are known to be positively or negatively associated with various diseases, including narcolepsy and rheumatoid arthritis (RA). We have used PocketMatch 13 to examine the binding grooves of alleles discussed by Jones et al.  were compared to those in a model structure of HLA-DQ6.1 (negatively associated with the disease). These molecules differ at only a few positions in the b chain. PocketMatch 13 identified the P4 sub-pocket corresponding to the Thr6 residue of the peptide to be the most dissimilar between these two structures ( Table 3). The residues Ala13bb and Tyr26b in HLA-DQ6.2 changed to Gly13b and Leu26b in HLA-DQ6.1 in the neighbourhood of peptide residue Thr6, corresponding to P4 ( Figure 5A); this difference is captured by the PocketMatch 13 algorithm.
In case of RA, alleles HLA-DR4.1, HLA-DR4.4 and HLA-DR1 are positively associated with the disease, while HLA-DR4.2 is neutral or negative [1]. The a chains of these four MHC molecules are the same (DRA*0101), and sequence comparison of the b chains with ClustalW [12] gives sequence identities of -DR4.   The low P4 scores are in line with that study. The superposition of these two alleles is shown in Figure 5B. The P4 peptide residue has Gln70b and Lys71b present in HLA-DR4.1 within 3.0 Å of the residue whereas an Asp at the position 70b and only Glu71b are present in the case of the model built for HLA-DR4.2.
All-against-all PM 13 Scores are presented in Table 4B, C. The scores indicate low PM 13 Score of [PDB:1DLH] to others in the P7 region of the binding site. Work by Rosloniec and co-workers found that mutation of the residue at the P7 position to an alanine has affected T cell stimulation more with DR4 than with DR1 [13]. The involvement of P7 sub-pocket in peptide recognition specificity is also discussed in [10]. In carrying out these case studies, model structures have been a useful supplement to the set of experimentally determined MHC class II molecules. We envisage future studies that make use of larger sets of model structures where the binding grooves have been modelled consistently using the same protocol [14].

Conclusion
A strategy for automatically comparing MHC class II binding grooves and sub-pockets based on their chemical nature and geometry is presented. Comparisons are facilitated by a pre-processing step in which MHCpeptide complexes are extracted from PDB files, and chains and structurally equivalent residue positions are relabelled consistently. Pocket similarity scores calculated by PocketMatch 13 can be used as the basis for clustering pockets based on their structural and chemical characteristics.
The framework we report can be used to carry out large scale comparison of binding grooves and sub-pockets, both to highlight differences in the binding grooves of MHC molecules of the same kind, and to identify similarities in the binding grooves of different MHC alleles. Investigations of MHC alleles associated with narcolepsy and rheumatoid arthritis demonstrate that binding grooves of alleles that are positively associated with an autoimmune disease can be compared with

Binding site comparison
Binding sites are represented in a frame invariant manner by distances between pairs of points, partitioned into bins, and pairs of sites are compared based on alignment of sorted sequences of distances. The sorted arrays are then aligned and scored to finally obtain comparison scores.
Molecules can be clustered based on their comparison scores.
In this study, the points used are the centres of those atoms lining the binding site. These are determined by considering accessibility to a probe sphere with radius 1.4 Å. Those MHC atoms whose accessibility is reduced by the presence of the peptide are determined to be part of the peptide binding site. Similarly, the MHC atoms that comprise individual pockets are identified as the set of atoms whose accessibility is reduced by the presence of the peptide residue at position P1, P4, P6, P7 or P9. The ProtOr radii from Table 2 of [18] are used for protein atomic groups in accessibility calculations.
The corresponding pockets between a pair of MHC binding sites are compared on large scale in an allagainst-all comparison scheme. The shape signature of each pocket, capturing chemical nature and geometric distribution of atoms, is derived based on the distance lists concept used in PocketMatch [9].
• To compare a pair of sites, each of the 91 lists is chosen in one site together with the corresponding list from the other site, and the cumulative number of similar distance elements is determined.
• A pair of distances from two lists is marked a match if the distance differ at most by a threshold of 0.5.
We call the tuned version of PocketMatch for the MHC class II binding site comparison, by considering solvent accessible atoms and 13 atomic group types, Pocket-Match 13 .
The numerator is simply the number of matching intrasite distances. However, the denominator can be the number of intra-site distances in either the smaller site or in the larger sitethese give rise to two PM 13 Score values, referred to as PMSMax and PMSMin, respectively. Unless stated otherwise, PM 13 Score refers to the PMSMin value.
PM 13 Score values decrease as the similarity between a pair of binding grooves decreases ( Figure 6). The rate at which the scores decrease is affected by the threshold chosen for site comparison, since this affects the number of matching distance elements between a pair of distance-sequences.
To illustrate the effect of perturbing the conformation of a binding groove, the coordinates of atoms in the binding groove of [PDB:1JWS] (A, B, C chains) were perturbed randomly, and an ensemble of 1000 structures was generated with root mean square deviation (RMSD) values up to 5 Å with respect to the original [PDB:1JWS] structure. We have used a similar strategy for sensitivity analysis for the original PocketMatch algorithm [9] and found that a threshold of 0.5 Å was adequate to distinguish between similar and dissimilar sites. Figure 6A shows the PM 13 Scores obtained by comparing the original [PDB:1JWS] structure with each of the perturbed structures in the ensemble. Rather than perturbing the atomic coordinates randomly, an alternative method for generating an ensemble of perturbed conformations would be to use conformations from a molecular dynamics trajectory. To investigate the effect of altering the chemical nature of the binding groove while retaining its original geometry, the atomic group labels of some of the atomic groups in the binding groove of [PDB:1JWS] (A, B, C chains) were re-assigned randomly, and PocketMatch 13 was used to compare the modified binding groove with the original one ( Figure 6B). Figures 6A and 6B demonstrate that PM 13 Scores capture differences due to both the geometry and the chemical nature of the binding groove.

Cladogram generation
Given a set of binding sites (whole groove or subpockets), one way of visualizing the relationships among these is to generate a cladogram based on distances between pairs of sites. The distance between a pair of sites is defined here to be 1-PM 13 Score between the two sites. The cladogram generation program is based on the neighbour joining method available in Phylip-3.67 [19] which generates trees in Newick format, which can be visualized and labelled using MEGA [20]. When generating cladograms, data were input to the program in descending order of PM 13 Scores.