TAP score: torsion angle propensity normalization applied to local protein structure evaluation

Background Experimentally determined protein structures may contain errors and require validation. Conformational criteria based on the Ramachandran plot are mainly used to distinguish between distorted and adequately refined models. While the readily available criteria are sufficient to detect totally wrong structures, establishing the more subtle differences between plausible structures remains more challenging. Results A new criterion, called TAP score, measuring local sequence to structure fitness based on torsion angle propensities normalized against the global minimum and maximum is introduced. It is shown to be more accurate than previous methods at estimating the validity of a protein model in terms of commonly used experimental quality parameters on two test sets representing the full PDB database and a subset of obsolete PDB structures. Highly selective TAP thresholds are derived to recognize over 90% of the top experimental structures in the absence of experimental information. Both a web server and an executable version of the TAP score are available at . Conclusion A novel procedure for energy normalization (TAP) has significantly improved the possibility to recognize the best experimental structures. It will allow the user to more reliably isolate problematic structures in the context of automated experimental structure determination.


Background
The number of experimentally determined protein threedimensional (3D) structures deposited in the protein data bank (PDB) [1] is increasing exponentially over the years and being progressively automated. The vast majority of such 3D structures is produced by X-ray crystallography. In the case of limited resolution and imperfect phase information often available to the crystallographer, building and refining such a protein model is a process that can depend on the experimentalist. Errors are almost unavoidable and the quality of the refined models has to be eval-uated in order to assess their validity [2]. Errors come in various classes [3] and can nowadays range from mistraced segments of the protein to locally incorrect backbone and/or side chain conformations. Identification of such errors can be achieved with a combination of experimental and computational parameters.
Several quality measures based on experimental parameters exist for X-ray crystallography and have been reviewed [2]. X-ray resolution is an index of the quality of the experimental data. It is related to the amount of available data, i.e. to the parameter-to-observation ratio [4] and is an indirect indicator of the maximum attainable details of the protein model. In contrast, the R-free value [5] represents a measure of the fit between the refined structure and the electron density map, highlighting the quality of the refinement. The relationship with resolution is not straightforward, but it is generally assumed that higher resolution structures will produce lower R-free values [6]. A different type of information is contained in the Luzzati [7] and σ a [8] plots. These estimate the mean positional error for all atoms in the protein model. These parameters require the structure factors, but yield the expected uncertainty of atoms in the protein model in a single number estimate (in Å). A simpler estimation of the mean positional error is the diffraction precision index based on Rfree (DPI), which can be computed from the readily available data contained in the PDB files [9].
A wide range of computational quality parameters have been developed and reviewed over the years [2,10,11]. Generally speaking, it is possible to distinguish geometric, energetic and conformational criteria. Geometric criteria are mainly standard values for bond lengths and angles derived from small molecule data. These form strong restraints and are generally enforced during the refinement process, so they possess little validation power. Energetic criteria are based on evaluation of interaction preferences or profiles [12][13][14][15][16]. While these methods can provide insight into the quality of the structure, their interpretation in experimental terms and feedback into the refinement process is rather difficult.
The most promising validation criteria are based on conformational criteria. The best example is the Ramachandran plot [17] of backbone (ϕ,φ) torsion angles. While each amino acid type may, in theory, adopt a large number of different conformations, large areas of the Ramachandran plot are almost empty. This is due to steric clashes deriving from the local geometry of the polypeptide chain. The main chain (ϕ,φ) torsion angles are usually not restrained during refinement and this makes the Ramachandran plot a powerful validation tool [2,18]. Several tools have been developed to estimate the quality of a protein model based on the Ramachandran plot [16,[18][19][20][21][22]. Of these, PROCHECK [19] and WHAT_CHECK [16] are perhaps the most frequently used methods for validation in X-ray crystallography as they are used for judging structures to be deposited in the PDB, combining several stereochemical checks and measures of torsion angle compatibility. HOPPscore [22] has been recently developed to take into account higher order backbone torsion angle maps.
Several of these methods (e.g. WHAT_CHECK) are able to pinpoint the really wrong structures through a detailed analysis of different aspects of protein structures. Once a structure falls into the range of roughly plausible folds however the situation becomes more complicated. It is possible to construct structures with acceptable values for the standard criteria that are largely incompatible with the protein sequence. In the present work we focus on this aspect of experimental structure validation. Given roughly plausible structures, is it possible to quantify the degree of "nativeness" and highlight the best structures?
One possible limit to the previous methods is the difficulty in establishing a quantitative correspondence scale between different structures. I.e. how much is score X for structure A better or worse than score Y for structure B? The answer is not obvious, as the reference state is different for structures A and B. One solution would be a normalization procedure adapted to a particular structure. To the best of our knowledge, this has not been done yet. For this reason, we derive a novel measure of sequence to structure compatibility based on the normalization of torsion angle propensities including the side chain. The normalization involves definition of the global minimum and maximum of the protein sequence. The normalized propensity (called TAP) will be shown to be more accurate than several previous methods at quantifying the degree of "nativeness" of a protein model in terms of commonly used experimental quality parameters on two test sets representing the full PDB database and obsolete PDB structures. A comparison with several energetic criteria on standard protein decoys and theoretical models has already been addressed elsewhere [23,24].

Baseline comparison on the all PDB set
Available experimentally derived quality parameters for the all PDB set representing 13,691 structures is summarized in Table 1. While the X-ray resolution information is available for all structures considered, some of the older structures are lacking an R-free value. The cross-validated Luzzati and σ A plots are present only in roughly one out of three structures, mainly because deposition was not mandatory for a long time. While it is possible to calculate both parameters from the structure factors, we have chosen to limit our comparison only to the publicly available cases.
The Pearson correlation coefficients (cc) between the different experimental quality parameters were calculated for the available data (see Table 2). As expected, the correlation between experimental parameters is usually very high (cc > 0.8). The main exception is R-free with cc <= 0.62 to the X-ray resolution and DPI (see also Figure 1). As information contained in the various measures appears largely redundant, we restrict further analysis to resolution and Rfree. R-free is not a perfect measure, but rather available for more structures and, perhaps, less inaccurate than the other quality parameters.
In order to estimate the performance of available computational methods, we compare our method with PRO-CHECK [19], WHAT_CHECK [16], HOPPscore [22] and FRST [23]. This covers all the range from energetic to geometric and conformational criteria. Table 3 shows the Pearson correlation coefficients between computational and experimental parameters for the all PDB set. The best performance is seen for methods using Ramachandran plot analysis, i.e. TAP and WC_Rama, once again confirming its utility in structure validation [2,10].
For the TAP score, the intermediate (ϕ,φ) bin size of 10 degrees shows the highest correlation (see also Figure 1). A similar trend was already observed [24]. This probably maximizes the tradeoff between precise transitions and lack of data to discriminate certain sparsely populated regions in the Ramachandran plot. Note that a larger background distribution, e.g. covering the entire PDB, was excluded to avoid biasing the comparison.
It is apparent from Table 3 that some methods work significantly better than others. The statistical potentials (except the Ramachandran plot based TORS) and several geometric criteria do not yield good correlation coefficients. Performance of the TAP score against R-free is particularly interesting, as the correlation coefficients for Rfree are overall lower. TAP has a higher correlation against R-free (-0.66; see Table 3) than the X-ray resolution (0.62; see Table 2) has. In order to evaluate the effect of the background distribution on the performance of TAP, a further test was made using the TAP score based on NMR derived torsion angles. The data reported in Table 3 shows that, while the usage of high-quality data improves the performance, TAP-NMR still significantly outperforms many other methods. For the sake of simplicity, further analysis was restricted to TAP and the most diverse parameters. As conformational criteria we have chosen PROCHECK, WC_Rama, WC_Chi1&2, Hopp1, Hopp2 and Hopp5. For the energetic criteria we have restricted our analysis to WC_Pack2, SOLV, RAPDF and TORS.

Detailed comparison on the obsolete PDB set
A set of 494 pairs of obsolete PDB structures and their replacement was analyzed in order to evaluate the capacity of the TAP score to discriminate better models for the same protein. Figure 2 shows the number of times each method correctly assigned a better score to the newer, and therefore more accurate, model in the obsolete PDB set. The results for each individual methods and the combination with TAP are shown. None of the methods discriminates all 494 structures, and the statistical potentials in particular have difficulty recognizing the improved structures, while TAP has one of the highest single recognition rates. Combining methods improves the overall performance, with TAP typically contributing more unique information than the other method. Figure 3 shows a different analysis of the data in terms of separation Z-score, i.e. the normalized difference between the obsolete and replacement structure. Here it is again apparent that the Ramachandran plot based methods outperform the others, with TAP coming a close second with WC_Chi1&2 after WC_Rama. Taken together, the Ramachandran plot based methods (especially TAP and WC_Rama) seem able to qualitatively discriminate improved from obsolete structures.

Quantifying absolute model accuracy
Another interesting question related to structure quality is whether the methods under consideration are able to quantify the difference between structures, i.e. reliably identify the top x% structures present in the PDB. This problem was addressed in terms of fraction enrichment on R-free (see Materials and Methods), with results shown in Figure 4. Unsurprisingly, the statistical potentials perform worse than the conformational methods. Of the latter group, it is worth noting how WC_Chi1&2, based on side chain rotamer preferences, performs worse than  methods using backbone preferences. The difference between TAP and WC_Rama suggests that energy normalization is improving recognition especially with the highest quality structures. This idea is confirmed when comparing TAP to TORS, the torsion angle propensity energy before normalization.

Confidence estimates
Since it can be very useful to derive TAP score threshold levels indicating the expected quality of a model, the all PDB set was used to derive confidence estimates. The resolution and R-free parameter distributions were analyzed on the all PDB set to estimate average and standard deviation (σ) (see Table 4). Three derived threshold levels are shown in Table 5. The medium quality class is based on commonly accepted conservative parameter settings, i.e. resolution <= 2.5 A and R-free < 0.30, and excludes the 25% lowest quality structures. The high quality class was defined to be just above the average for resolution and Rfree, selecting ca. the 40% top structures. The very high quality class was defined at one σ above average and rep-resents ca. the 10% top structures. The distribution therefore appears not to be truly normal, but somewhat skewed towards lower quality structures. Similarly, TAP score thresholds were chosen from the average to be -1 σ (low), 0 (medium) and + 1 σ (high).
The results for TAP on all three experimental quality classes are expressed in terms of accuracy and coverage (see Materials and Methods) and shown in Table 5 for the all PDB set. As can be expected, it becomes gradually more difficult for TAP to discriminate the structures with increasing quality level. At the same time, coverage drops with increasing TAP threshold. Taking the intersection between both, TAP recognizes ca. 90% of the medium quality structures with 90% accuracy. These values drop to ca. 75% for the high and 35% for the very high quality structures. Even in the latter case it implies a significant enrichment in discrimination with respect to a random predictor. To the best of our knowledge, this type of analysis was not performed before. In the context of automated experimental structure determination, it will allow TAP-10 vs. R-free scatter plot on the all PDB set the user to isolate problematic structures for manual refinement and could prove a valid addition to the PDB data deposition procedures.

Database effects vs. novel approach
The Ramachandran plot, i.e. (ϕ,φ) torsion angle preferences, has been seen as a powerful tool for validating experimental protein models for a long time [10,[18][19][20]22]. Usually, the Ramachandran plot is used only as a rather qualitative tool to discriminate grossly mistraced structures from plausible ones. PROCHECK and HOPPscore both consider a generic Ramachandran plot for the twenty amino acids divided in discrete classes. WHAT_CHECK-2 uses a more sophisticated Z-score analysis of the Ramachandran plot. All of these do not appear to discriminate effectively the compatibility between sequence and structure, nor the subtle differences between amino acids.
The main advantage of TAP consists in effectively measuring the compatibility of the sequence with the proposed structure in a detailed, quantitative way. Energy score normalization is a novel concept which could be applied because TAP is based on a single body potential. This is not usually applicable to pair wise (or higher order) potentials, where it is difficult to estimate the maximum or minimum interaction between an amino acid and its surroundings. The benefits of energy normalization are apparent from the comparison between TAP and TORS, the torsion angle potential on which it is based. Where TORS gives rough indications, TAP (despite using similar information) has greater accuracy.
Since torsion angles are not generally restrained in X-ray crystallography, this compatibility is orthogonal to the data used in refinement and should be expected to give a good indication of the degree of "nativeness" of the protein model. To support this view, rather than a simple improvement based on database growth, a variant of TAP was derived from NMR data. Even in this case, where the Ramachandran plot is on average rather blurred, TAP-NMR still outperforms other validation tools. This supports the idea that it is capturing the relationship between sequence and structure rather than a tighter clustering in torsion angle space.
Perhaps the most important feature of the TAP score is that it simultaneously combines five different torsion angles into a single pseudo-energy value. Adding more torsion angles was previously shown to improve the overall discrimination of protein decoys [24], as it captures the subtle interplay between them. For instance, it is known that the ω torsion angle varies slightly depending on the (ϕ,φ) angles [10,25]. Even more widely used is the dependence of the χ torsion angles on (ϕ,φ), which is widely accepted in side chain modeling [26,27]. A TAP variant based solely on the (ϕ,φ) angles performs significantly worse, with correlation coefficients of -0.53 and -0.42 for resolution resp. R-free. This view has been recently reinforced by an elegant statistical and conformational analysis of the electron density of protein side chains showing a vast majority of all residues in high resolution X-ray structures to have rotameric side chain positions [28]. Therefore, where HOPPscore successfully extends the concept of Ramachandran plot to higher order (ϕ,φ) torsion angle pairs, TAP score explores the avenue of capturing the interplay between protein backbone and side chain. Figure 6 shows an example of TAP score depending on χ 1 side chain conformation.
The main limit of the TAP score approach is the independence of subsequent residues in the calculation of the global minimum and maximum for normalization. These minimum and maximum are likely to be overestimated, as compatibility along the polypeptide chain is not guaranteed. This may result in impossibly "knotted" structures having the best pseudo-energy and the native structure being lower in normalized score. In principle, adding information about the preceding residue's (ϕ,φ) torsion angles would alleviate this situation and has been shown to yield more discriminative pseudo-energies [24]. However, it is only possible to calculate the global optimum for normalization precisely because it is a single body potential. Adding a dependence on the preceding residue would transform it into a two-body potential, making the estimation of the global optimum problematic. Calculating the global optimum on such a two-body potential is an optimization problem in itself.
The discriminative power of torsion angle propensities has implications for the accuracy of empirical force fields such as AMBER [29] or CHARMM [30]. Torsion angle propensities derive from the subtle interactions between neighboring residues which cannot be captured very precisely by currently available physico-chemical models [31]. This knowledge is one of the reasons for the success of modern de novo folding methods based on assembly of short peptide fragments [31][32][33][34].
Small changes in the AMBER torsion angle parameter between param94 and param96 have drastic effects on the energy landscape [35]. It may be argued that addition of a Ramachandran plot propensity parameter could improve the capacity of a force field to capture the local geometric details more precisely. This approach is frequently selected in loop modelling [36][37][38] where it is important to reconcile the structural restraints with sequence preference. The method of Fiser and co-workers [38] in particular uses the CHARMM bonded potential augmented by a Ramachandran plot propensity term and statistical nonbonded potential to generate loop conformations by minimization. Adding a properly calibrated (ϕ,φ) torsion angle propensity term to a force field may therefore help to improve convergence in energy minimization for molecular mechanics simulations. The option to calculate such values for every single residue also opens up the interesting possibility to use the TAP score as a valuable Histogram of correctly recognized improved structures on the obsolete PDB set for TAP and ten previous methods Figure 2 Histogram of correctly recognized improved structures on the obsolete PDB set for TAP and ten previous methods. The number of correct predictions by both methods (blue), TAP only (green) and only the second method (red). The total height corresponds to the performance of combining TAP and the other method. Individual performance of each method (except TAP) is the sum of the first two terms (i.e. red and blue).
tool during refinement of a crystallographic model, in addition to the already available geometric validation tools.
An interesting question is why some crystal structures exhibit higher TAP scores than others. A cursory analysis reveals that the highest scoring structures are short helical bundles, e.g. Hemoglobin. The lowest scoring structures are diverse and contain any combination of α-helices and β-sheets. Even in the TOP500H database used for deriving the propensities, the TAP scores vary between 0.781 and 0.907 (avg = 0.847; σ = 0.017). This is comparable to the TAP score of very high experimental quality structures varying between 0.754 and 0.889 (avg = 0.822; σ = 0.017). This may be caused by the rough approximation of the global optimum overestimating some folds more than others due to some intrinsic feature, e.g. higher contact order or lower degree of flexibility. A better way to calculate the global optimum would be needed to exclude this possibility. As local interactions appear to impose the selection of certain amino acids at each structural position in the fold [39], it will however be worth investigating whether proteins whose native structure lies closer to the global optimum could perhaps also have a better sequence to structure compatibility.

Conclusion
We have presented a novel method for the evaluation of the quality of protein models determined by X-ray crystallography, demonstrated both on a large-scale dataset and a set of obsolete PDB structures. The TAP score is based on a relative pseudo-energy calculated simultaneously from the backbone and side chain torsion angle propensities, normalized against the global minimum and maximum for the protein sequence under consideration. Our results show a quantitative relationship between TAP score and the overall quality of experimental structures as expressed in terms of sequence to structure compatibility. TAP score can improve the confidence in quality validation of protein models derived from automated experimental procedures.

Torsion angle potential
The torsion angle potential was developed as part of the FRST scoring function [23] for model quality estimation which later has been extended [24]. It is a statistical poten-Histogram of separation Z-score on the obsolete PDB set for TAP and ten previous methods

Figure 3
Histogram of separation Z-score on the obsolete PDB set for TAP and ten previous methods.
tial [40] based on torsion angle propensities calculated in analogy to the one defined by Shortle [41]. If x describes a discrete torsion angle combination and A is a particular amino acid type, a propensity P can be defined as the fraction of two probabilities: Where N A,x is the number of amino acid type A with torsion angle combination x, N A the total number of amino acids of type A, N x the number of amino acids with torsion angle combination x and N total the total number of amino acids. The three terms N A , N x and N total can be derived from a background distribution of native structures, while N A,x is the observed state in the model being evaluated.
Since the background distribution shows sharp transitions between highly populated and disallowed regions, no pseudo counts are used. Where needed, N x is set to a single count to avoid division by zero. A pseudo-energy scoring function E for a protein composed of n residues can be defined as: The torsion angle pseudo-energy score E i for the i-th residue A of a protein is thus a measure of the log propensity that amino acid type A(i) will have torsion angle combination x(i). E i < 0 indicates that A(i) is favoured relative to the mean of all 20 amino acids, whereas E i > 0 indicates that it is disfavoured. In order to be applied, both the background distribution and the relevant torsion angles have to be defined.
The Top500H database [42] was chosen to derive the background distribution in order to have a representative subset of high quality structures that is small enough to allow unbiased large-scale benchmarking of the PDB. It is a non-redundant, hand-picked set of 500 high-resolution X-ray crystallographic protein structures resolved to 1.8 Å or better resolution with no obvious errors and less than 60% sequence identity. In order to assess the effect of the background distribution quality, an alternative ensemble of 609 NMR structures (9,578 models) was also used. As the Ramachandran plot of NMR structures is largely deter- Fraction enrichment plot for TAP and ten previous methods on the all PDB set Figure 4 Fraction enrichment plot for TAP and ten previous methods on the all PDB set.
mined by the force field used in refinement, this alternative ensemble contains blurred transitions and serves to highlight the effect of the background distribution on TAP score accuracy (see discussion).
Two free parameters, the choice of torsion angles to represent and the discretization of the data, have to be chosen in order to define the measured torsion angle combinations. The (ϕ,φ) angles were discretized as either 5, 10 or 20 degree bins. Since other torsion angles are less informative, but still important, a limited number of bins was used to represent the additional torsion angles [24]. Three bins were defined for the ω angle, distinguishing values [-180°, -150°], [+150°, +180°] and the rest. This was found to model the distribution of ω angles, where the cis (0°) state is very rare (except for proline) and the trans state preference is somewhat influenced by the Ramachandran plot [25] and has a slightly bimodal distribution around 180° (data not shown). Both the χ1 and χ2 torsion angles were discretized in eight bins centered on the canonical rotamer preferences. The total number of data points, i.e. non-terminal residues with all torsion angles available from the TOP500H is 100,245.
The torsion angle potential is a single body potential, representing the local structural preferences coded by each single residue in a polypeptide chain. Unlike other conventional statistical potentials, which are typically based on two-body interactions, it is therefore straightforward to calculate the global minimum E >min and maximum E max for a given protein sequence of length n: where E min (E max ) is the sum of the lowest (highest) pseudo-energy, i.e. highest (lowest) propensity, torsion angle combination for a residue of type i. Note that this definition makes no assumption about the physical plausibility of the overall conformation. Indeed, it is entirely possible that a sequence of minimal (or maximal) states would produce an impossibly "knotted" structure.
Given E min and E max of a protein sequence, it is possible to normalize the torsion angle potential score E as follows: The normalized torsion angle propensity TAP gives a rough indication of the degree of "nativeness" of a protein model. The value will be close to 1 for the native structure and close to 0 for structures with largely incompatible sequences. It is therefore a measure of compatibility between sequence and structure.

Data sets
In order to evaluate the method for structure quality estimation on a large set, we downloaded the ASTRAL database [43]

Methods used for comparison
The comparison with published methods is based on PROCHECK [19], WHAT_CHECK [16], HOPPscore [22] and FRST [23]. All three programs were either down-loaded from the author's website or directly requested (HOPPscore). For PROCHECK, the overall G-factor was used. WHAT_CHECK analysis is based on nine overall quality indicators available from the Pdbfinder2 database [45]. These are: overall quality (WC_Qual) expressed as a sum of various terms, torsion angles (WC_Tors), Ramachandran plot appearance (WC_Rama), chirality (WC_Chir), backbone conformation (WC_Back), rotamer normality (WC_Rot), χ-1/χ-2 rotamer normality (WC_Chi1&2) and 1 st and 2 nd generation packing quality (WC_pack1 and WC_pack2). It should be noted that WC_Pack1 and WC_Pack2 are measures based on contact analysis. For HOPPscore, the five (Hopp-5) through single residue (Hopp-1) scores were calculated with default parameters. FRST is a linear combination of four different statistical potentials [23]: a pairwise potential (RAPDF), solvation potential (SOLV), a simplified count of main chain hydrogen bonds (HYDB) and the torsion angle potential (TORS) on which TAP is built. All five (partial) potentials were used for analysis. The PROCHECK, WC_Rama, Hopp1 and TORS scores essentially represent a quantification of the structural fit with the Ramachandran plot.
Accuracy and coverage plot for TAP score on the all PDB set Figure 5 Accuracy and coverage plot for TAP score on the all PDB set. Accuracy (acc) and coverage (cov) are plotted for three experimental quality classes (medium, high, very high) as a function of TAP score threshold.
Ramachandran plot of valine (a) and serine (b) residues Figure 6 Ramachandran plot of valine (a) and serine (b) residues. In each plot, the χ1 torsion angle is fixed at a rotamer position (trans, g-, g+) and the (φ,ϕ) TAP score landscape (resolution 5° × 5°) is plotted from white to black. Darker colours represent higher (i.e. better) TAP scores. Note that the less favoured regions are those where a given residue is less frequently present than the average of the twenty types. ) σ (6) acc TP TP FN = + ( ) (7) cov ( ) = + TP TP FP (8) Additional file 1