A consistent and simple representation of both persistent and variable predicted pockets across arbitrarily large sets of related structures simplifies the interpretation of data from which it may otherwise be difficult to extract meaning. Even in a simple case, the predictions of two programs on a single structure, it is not obvious how to compare the two pocket predictions quantitatively. With many conformations, the mass of prediction points becomes spread out and ambiguous, because an increasing number of conformers brings increased local structural variation and smearing caused by changes in the overall global alignment. With the structural dissimilarities that occur in sets of homologues, comparisons can become harder still and necessitate an approach that deals effectively with the problems of structure and sequence alignment. All three of these problems can benefit from the probabilistic approach taken here, which assesses the extent to which predictions of pocket-lining atoms or residues hold for the dataset as a whole.
This atom-centred approach avoids the complex issue of how to define each individual pocket and compare pockets between structures, focusing instead on groups of residues or atoms whose roles in pocket formation are most (or least) variable as a whole. The corollary is that these gross dataset properties do not provide detailed geometric or structural information on individual pockets. The fraction of structures of an ensemble in which equivalent atoms or residues are involved in pocket formation does not, of course, contain all the information present in the original outputs of the prediction programs. For example, if a group of residues forms a pocket in all members of an ensemble, their Provar scores will readily identify them, but will not indicate whether the pocket has the same geometry in each structure. Our use of 'persistent' and 'variable' to describe pockets reflects the identity of the residues forming the pocket (and indirectly its location) and not necessarily the shape of the pocket. However, in the case of variable pockets, variation in the number and location of residues forming the pocket will almost certainly be accompanied by shape changes. The aim in summarising information using Provar's scoring schemes is to allow insight into large amounts of data that are otherwise difficult to visualise. Once regions of interest are identified, further, more focused analyses may be possible.
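The quantity discussed above, the fraction of structures in which an equivalent residue is involved in pocket formation, can be illustrated with a minimal sketch. It assumes that each structure's prediction has already been reduced to a set of pocket-lining residue identifiers that are equivalent across the ensemble; the function name, the input layout and the residue labels are hypothetical and are not taken from the Provar software itself.

```python
from collections import Counter


def pocket_lining_fraction(predictions):
    """Return, for each residue, the fraction of structures in which it
    is predicted to line a pocket.

    `predictions` is a list with one entry per structure; each entry is
    a collection of residue identifiers (an assumed input layout).
    """
    n_structures = len(predictions)
    if n_structures == 0:
        raise ValueError("at least one structure is required")

    counts = Counter()
    for lining in predictions:
        # set() guards against a residue being listed twice for one structure
        counts.update(set(lining))

    return {residue: count / n_structures for residue, count in counts.items()}


# Hypothetical four-member ensemble with made-up residue labels
ensemble = [
    {"A10", "A11", "A45"},
    {"A10", "A45"},
    {"A10", "A12"},
    {"A10"},
]
scores = pocket_lining_fraction(ensemble)
```

In this toy ensemble the residue labelled A10 lines a pocket in every structure (a persistent position), whereas A45 does so in only half of them (a variable position). As noted above, neither value says anything about the geometry of the pockets concerned.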
We have not attempted to re-validate the outputs of existing pocket prediction software (which have in any case recently been critically evaluated for their ability to predict small-molecule binding sites). In this regard, the Provar methodology described here merely aids comparison of different prediction software. Provar analysis readily shows that pocket prediction programs give somewhat different outputs (Figure 1, Tables 1 and 2). Which programs are best suited to particular investigations in the context of analysis of sets of structures remains to be tested, and it may make sense to combine the results of prediction algorithms that have different strengths and weaknesses. In this latter case, Provar scores provide a straightforward basis for creating a summary or consensus of several programs (Figure 5).
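One simple way in which per-residue scores from several programs might be combined is sketched below. This is an illustration under stated assumptions, not necessarily the scheme used to produce Figure 5: it takes the unweighted mean of the per-program scores, treats a residue that a program never reports as scoring 0.0 for that program, and uses hypothetical program names and residue labels.

```python
def consensus_scores(per_program_scores):
    """Average per-residue scores over several prediction programs.

    `per_program_scores` maps a program name to a dict of
    residue identifier -> score in [0, 1]. A residue absent from a
    program's dict is treated as 0.0 for that program (an assumption).
    """
    if not per_program_scores:
        raise ValueError("at least one program is required")

    n_programs = len(per_program_scores)
    # Collect every residue reported by any program
    residues = set()
    for scores in per_program_scores.values():
        residues.update(scores)

    return {
        residue: sum(s.get(residue, 0.0) for s in per_program_scores.values())
        / n_programs
        for residue in residues
    }


# Hypothetical scores from two unnamed programs
combined = consensus_scores({
    "program_a": {"A10": 1.0, "A45": 0.5},
    "program_b": {"A10": 0.5, "A12": 0.5},
})
```

An unweighted mean gives every program equal influence; weighting by some measure of each program's reliability would be an alternative, but would require the kind of validation that is outside the scope of this work.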
In the examples presented here, we have mainly been concerned with visualising the persistence or variation of pockets in a protein's conformational or superfamily ensemble in the absence of any ligand, and with identifying instances where pockets formed in these ensembles correlate with the persistence or variation of experimentally known ligand binding sites. Such identification (or prediction) of features of ligand binding sites is a widespread application of pocket prediction software when applied to individual structures. The Provar algorithm provides a means to visualise the results of such analyses on large sets of related structures.
For a kinase superfamily, pockets whose locations are highly conserved across homologues were readily identified and correspond to the enzyme's active and allosteric regulatory sites (Figure 7). There is potential for similar analyses of other, less well-understood protein superfamilies to identify common features that could then become the target of functional investigation.
We have seen how Provar visualisations allow us to identify pockets that are present in members of an ensemble but may be absent from an individual crystal structure. For a conformational ensemble of Bcl-2, Provar analysis indicates an extended binding groove among simulated apo conformations compared with that of the crystal structure (Figure 8). We have shown that analyses of conformational ensembles of apo structures usually recover more of the known PPI inhibitor binding sites than analyses of single static structures, but that the precise outcomes of such analyses are rather dependent on the pocket prediction software used. Again, Provar scoring provides a convenient approach to comparing such results.
Provar analysis of pocket predictions on simulated ensembles may help guide ligand design efforts by indicating which regions of the protein's surface may adapt to accommodate larger (or smaller) ligands. The residue-based Provar scores themselves could be further analysed to identify subsets of conformations (or subfamilies) in which particular residues are involved in pocket formation. Such subsets may then find a use in computational design efforts, e.g., docking, where they may increase the diversity of candidate ligands, which in turn increases the likelihood of finding one that simultaneously satisfies the requirements of specificity, affinity and ADME-Tox. In the kinases, identifying variable pocket-lining regions bordering conserved regions may be helpful when designing inhibitors that are specific to a particular kinase or kinase subset.
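The selection of conformational subsets suggested above could be approached as in the following minimal sketch. It assumes the same hypothetical input as a per-structure list of pocket-lining residue sets; the function name and residue labels are illustrative and do not describe an existing Provar feature.

```python
def conformations_with_residues(predictions, residues_of_interest):
    """Return the indices of the conformations in which every residue of
    interest is predicted to be pocket-lining.

    `predictions` is a list with one collection of residue identifiers
    per conformation (an assumed input layout).
    """
    wanted = set(residues_of_interest)
    return [
        index
        for index, lining in enumerate(predictions)
        if wanted <= set(lining)  # subset test: all wanted residues present
    ]


# Hypothetical four-member ensemble with made-up residue labels
ensemble = [
    {"A10", "A11", "A45"},
    {"A10", "A45"},
    {"A10", "A12"},
    {"A10"},
]
selected = conformations_with_residues(ensemble, {"A10", "A45"})
```

Here the first two conformations (indices 0 and 1) are selected, and these would be the ones carried forward to, for example, a docking calculation. Requiring all residues of interest at once is one possible criterion; requiring any of them, or a minimum number, would give larger subsets.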
In common with many other forms of structural analysis, the type and quality of inferences made from Provar visualisation depend on an appropriate choice of structure set. We anticipate that a judicious combination of evidence obtained from both sets of homologues (where suitable) and simulated conformational ensembles of individual proteins may provide the most insight into the variability of pockets, as illustrated with the IL-2:IL-2R interface. In binding-site prediction applications, care must be taken to exclude any structures that have ligands bound. When comparing homologous structures, it is necessary to create a set of proteins or domains that are representative of the members of the superfamily, but sufficiently dissimilar from each other to avoid bias towards the features of the members with the most numerous structures. However, other applications of the Provar approach may require different criteria, e.g., it may be of interest to compare sets of apo and ligand-containing structures to identify structural changes leading to pocket formation upon ligand binding that may suggest sites for allosteric regulation.