Predicting zinc binding at the proteome level

Background Metalloproteins are proteins capable of binding one or more metal ions, which may be required for their biological function, for regulation of their activities or for structural purposes. Metal-binding properties remain difficult to predict as well as to investigate experimentally at the whole-proteome level. Consequently, the current knowledge about metalloproteins is only partial. Results The present work reports on the development of a machine learning method for the prediction of the zinc-binding state of pairs of nearby amino-acids, using predictors based on support vector machines. The predictor was trained using chains containing zinc-binding sites and non-metalloproteins in order to provide positive and negative examples. Results based on strong non-redundancy tests prove that (1) zinc-binding residues can be predicted and (2) modelling the correlation between the binding state of nearby residues significantly improves performance. The trained predictor was then applied to the human proteome. The present results were in good agreement with the outcomes of previous, highly manually curated, efforts for the identification of human zinc-binding proteins. Some unprecedented zinc-binding sites could be identified, and were further validated through structural modelling. The software implementing the predictor is freely available at: Conclusion The proposed approach constitutes a highly automated tool for the identification of metalloproteins, which provides results of comparable quality with respect to highly manually refined predictions. The ability to model correlations between pairwise residues allows it to obtain a significant improvement over standard 1D based approaches. In addition, the method permits the identification of unprecedented metal sites, providing important hints for the work of experimentalists.


Background
Knowledge about the capability to bind metal ions is important when investigating the function of an experimentally uncharacterized protein. Unfortunately, the identification of bound metal ions can be quite difficult experimentally, especially when attempted at the whole proteome scale. Some results in this direction (metalloproteomics) have been recently reported [1][2][3], but these techniques are still far from becoming available for routine application. Furthermore, experimental approaches may suffer from biases such as incorporation of the wrong metal cofactor in vivo, removal of the metal ion(s) during protein purification procedures, binding of metals at adventitious sites [4]. Within this frame, bioinformatics tools are thus important to guide in the design and in the interpretation of experiments. The prediction of metal binding capabilities is a challenging task for which the development of reliable tools is still in progress [5].
In this paper, we investigate the use of machine learning approaches to automatically annotate metal-binding proteins on the whole-proteome scale. In particular, we focus on an important class of structural and functional sites involving the binding of zinc ions. Zinc is essential for Life and is the second most abundant transition metal ion in living organisms after iron. In contrast to other transition metal ions, such as copper and iron, zinc(II) does not undergo redox reactions thanks to its filled d-shell. In Nature, it has essentially two possible roles: catalytic or structural [6,7]. In humans, zinc has a crucial importance in the complex network of inter-molecular interactions responsible for the proper regulation of protein expression. Indeed, a major role of zinc is in the stabilization of the structure of a huge number of transcription factors such as zinc fingers, which constitute a significant share of the human proteome [8,9]. Only a subset of the natural amino acids can coordinate zinc ions with their side chains. In addition, the binding sites are locally constrained by the requirements on the side chain geometry imposed by coordination chemistry. For these reasons, several sites can be identified with high precision by mining regular expression patterns along the protein sequence while simultaneously inspecting amino acid conservation near the (putative) site [10]. A potential problem with the use of regular expression patterns is that they are usually quite specific but may give a low coverage (many false negatives). On the other hand, a support vector machine (SVM) predictor based on multiple alignments outperforms a predictor based on PROSITE [11] patterns in discriminating between cysteines bound to prosthetic groups and cysteines involved in disulfide bridges [12].
The application of a similar approach to the prediction of zinc-binding properties is not straightforward because most supervised learning algorithms (including SVM) build upon the assumption that examples are sampled independently. Unfortunately, this assumption can be violated when formulating prediction of metal binding sites as a traditional 1D prediction problem. The autocorrelation between the metal binding state is a consequence of the fact that most binding sites contain at least two coordinating residues with short sequence separation. Autocorrelation problems have been recently identified in the context of relational learning [13] and collective classification solutions have been proposed based on proba-bilistic learners [14,15]. In a recent work [16] we tried to address the autocorrelation problem in the context of metal binding site prediction by developing a two stage approach, where a bi-recurrent neural network refines residue-level SVM predictions by jointly considering all SVM outputs from residues in the same chain when computing the refined prediction for each residue. While the approach performs better then the local SVM predictor alone, such improvement is still not statistically significant. In this work we followed a different approach which aims at exploiting the regularities of zinc-binding sites in terms of sequence separation between ligands. The use of information on the sequential distance between cysteines was recently shown to improve performance in the task of disulfide connectivity prediction [17]. Our solution is based on a reformulation of the learning problem where examples formed by pairs of sequentially close residues are considered. Most of the zinc-binding sites contain at least one of such pairs, which in the following will be named semi-patterns. We developed a semi-pattern SVM trained to predict the zinc-binding attitude of a full semipattern. A traditional 1D SVM predictor was employed to account for the isolated ligands, and the final prediction for a given residue was computed by a gating network combining the probability of belonging to a zinc-binding semi-pattern and that of being an isolated ligand. In the following we will refer to the learning architecture as SP-SVM in order to stress the importance of the semi-pattern prediction as well as the role of the SVM components.
The method was tested on a representative non-redundant set of zinc-binding protein chains in order to assess its generalization power on new chains. Two evaluation procedures were employed, a full leave-one-out procedure on a subset with pairwise HSSP-value up to five, and a k-fold cross validation procedure guaranteeing that no test chain was remotely homologous with respect to any chain in the training set (see details in Results). This second test is a stronger requirement with respect to other common approaches to remove redundancy. A significant improvement over the traditional 1D prediction approach was observed. We additionally used the trained predictor to analyze the entire human proteome and observed a good agreement with previous, manually curated, annotations.

PDB data preparation
A data set of high-quality annotated chains was extracted from the Protein Data Bank (PDB) [23] by selecting all the structures deposited in the PDB at June 2005 and containing at least one zinc ion in the coordinate file. Structures binding zinc spuriously because of experimental settings (e.g. high zinc concentration in the crystallization buffer) were removed. Homologs were removed, by retaining only one representative chain. This procedure resulted in a set of 305 unique chains. Amino acids binding to the zinc ion(s) were detected using a threshold of 3 Å for the distance between the metal and the protein donor atoms. In order to provide negative examples of non metal-binding proteins, an additional set was generated by performing a single run of UniqueProt [18] with zero HSSP-value on PDB entries that are not metalloproteins. We thus obtained a second data set of 2,369 chains. Zinc-binding chains whose structure had been solved in the apo (i.e. without metal) form were removed from the ensemble of non-metalloproteins. We computed multiple alignment profiles for all chains using PSI-Blast [24] on the nonredundant (nr) NCBI protein database. In order to reduce noise in the training data we ignored residues whose profile had a relative weight less than 0.015, indicating that too few sequences had aligned at that position. This also allowed to discard poly-histidine tags which are attached at either the N-or C-terminus of some chains in the PDB, as a result of protein engineering aimed at making protein purification easier.

Analysis of zinc-binding sites
The choice of predicting zinc-binding sites by modelling semi-patterns was motivated by an extensive analysis of the characteristics of the sites, which we briefly report in this section.
Zinc-binding sites of zinc metalloenzymes are traditionally divided into two main groups [6]: catalytic (if the ions bind a molecule directly involved in a reaction) and structural (stabilizing the folding of the protein but not involved in any reaction). In addition, zinc may influence quaternary structure; we consider these cases as belonging to a third site type (interface), which also lacks a catalytic role. Site types can be heuristically correlated to the number of coordinating residues in the same chain. The distribution of site types obtained in this way is reported in Table 1. Table 2 reports the observed binding frequencies grouped by amino acid type and site type. As expected, cysteine, histidine, aspartic acid and glutamic acid are the only residues that bind zinc with a high enough frequency. It is interesting to note that such residues show different binding attitudes with respect to the site type. While cysteines are mainly involved in structural sites and histidines participate to both Zn4 and Zn3 sites with similar frequency, aspartic and glutamic acids are much more common in catalytic sites. The fact that multiple residues coordinate a single zinc ion implies that there is a strong correlation between the bonding state of residues within a given protein chain. Such correlation is often tied to the sequence separation between residues, as many binding sites contain pairs or sequentially close residues. We empirically measured the zinc-bonding state correlation between the residues in our chains. In Figure 1(a) the prior probability of zinc-binding for a residue is compared to the same probability conditioned on the presence of another zincbinding residue within a certain separation, for different values of the separation threshold. Figure 1(b) reports the correlation coefficient between the bonding state of pairs of residues, again varying the separation threshold between them. Both curves show a very similar behavior, with the highest peak for a distance of less then three residues, and a small one for a distance of around twenty residues. It can be noted that correlation tends to a non zero residual asymptotic value as distance grows, quantifying the contribution due to the fact of belonging to the same chain.

Patterns of binding sites
Metal binding sites can be described by patterns characterized by the type of residues coordinating the same ion and their sequence separation. Table 3 reports the most commonly occurring zinc-binding patterns together with their number of occurrences within our data set. Note that more than one pattern can match at a given site. Many of Top: Distribution of site types (according to the number of coordinating residues in the same chain) in the 305 zinc-chain data set. The second column is the number of sites for each site type; the third column is the number of chains having at least one site of the type specified in the row. Bottom: Number of chains containing multiple site types. The second row gives the number of chains that contain at least one site for each of the types specified in the first row. Statistics over the 305 zinc chains (464 binding sites) divided by amino acid and site type. N a is the amino acid occurrence number in corresponding site type; f a is the observed percentage of each amino acid in a given site type; f s is the observed percentage of each site type for a given amino acid.
All is the total number of times a given amino acid binds zinc in general.
Correlation between zinc-binding residues Figure 1 Correlation between zinc-binding residues. (a) Probabilities of zinc binding for a given residue: prior and conditioned on the presence of another zinc binding residue within a certain separation, (b) Correlation between the targets of pairs of residues within a given distance.
these sites, especially the structural ones, contain pairs of coordinating residues whose sequence separation is within seven residues. Such pairs are identified by an S (standing for "short") in the Type column of Table 3, as opposed to L (standing for "long") which identifies pairs of residues with a sequence separation of at least eight residues. In the following, a pattern formed by a single pair of nearby coordinating residues is called a semi-pattern.
Most structural sites consist of two semi-patterns whose distance ranges between 8 and 29 (the SLS Type in Table  3). Catalytic sites typically contain a semi-pattern and a single residue (SL or LS Type). Finally, interface sites are observed as a single semi-pattern in each chain. Figure 2 shows the fraction of sites (a) and zinc chains (b) containing at least once the semi-pattern [CHDE] x(0-7) [CHDE]. These observations suggest a partial solution to the relational auto-correlation problem based on binary classification of semi-patterns to predict binding sites.

Evaluation of SVM-based predictors
A traditional 1D SVM predictor was compared to the full SP-SVM architecture, in order to assess the significance of the proposed approach. While aspartic and glutamic acids coordinate zinc ions less frequently than cysteines and histidines (see Table 2), they are far more abundant in protein chains. This yielded an extremely unbalanced data set, and forced us to initially focus on cysteine and histidine residues only (we will refer to such predictor as SP CH -SVM). Moreover, we labelled a [CH] x(0-7) [CH] semipattern as positive if both candidate residues bound a zinc ion, even if they were not actually binding the same ion. Preliminary experiments showed this to be a better choice than considering such a case as a negative example, allowing to recover a few positive examples, especially for semi- Binding site patterns ordered by number of matches, N, in the 464 sites. Note that more than one pattern can match at a given site, thus the total number of matches is greater than the total number of sites. Square brackets denote alternative binding residues, x(·) denotes a sequence of residues of an arbitrary length, x(n -m) denotes a sequence of between n and m residues, x(> n) denotes a sequence of more than n residues. The type column highlights some common binding site patterns: S stands for "short" and refers to x(0-7), L stands for "long" and refers to x(> 7).
[CHDE] x(0-7) [CHDE] coverage  pattern matches with longer gaps. Model selection was performed by a stratified 4-fold cross validation procedure on the full data set, aimed at tuning Gaussian kernel width, C regularization parameter, window size and parameters of the sigmoids of the gating network. Due to the strong unbalance of the data set, accuracy is not a reliable measure of performance. We used the area under the recall-precision curve (AURPC) for both model selection and final evaluation, as it is especially suitable for extremely unbalanced data sets. We also computed the area under the ROC curve (AUC) to further assess the significance of the results.
Generalization performances of the best models for the local predictor and the gating network were assessed with two different procedures. First, we evaluated generalization over non-homologous chains. We repeatedly run UniqueProt [18] with HSSP-value equal to five starting from the full data set and stopping when then program found only clusters of singletons, thus assuring that no two chains had an HSSP-value greater than the threshold. We then run a full leave-one-out (LOO) procedure on the resulting data set, which consisted of 230 zinc-binding chains and 1,949 negative ones. Second, we evaluated generalization over chains which had no remote homologue in the training set. To this aim, we employed a stratified five fold cross validation (CV) procedure on the full data set. Few (38) non-metalloprotein chains were removed in this procedure as they lacked the information about SCOP [19] classification, which prevented us from assigning them to the correct CV fold. In fact, we distributed protein chains over the CV folds by ensuring that two chains having a zinc-binding domain belonging to the same SCOP [19] superfamily always appeared in the same CV fold, and two free chains (which were employed as negative examples) having a domain in the same SCOP superfamily also appeared in the same CV fold. In this way, we measure generalization across different superfamilies, a setting in which not even remote homology modelling techniques could be successfully applied for prediction. Note that by k-fold cross-validation we mean splitting the data in k subsets (commonly called folds) and using one of them in turn for testing. The term "fold" in SCOP has a totally different meaning.
In the LOO procedure, the local predictor and the gating network obtained an AURPC equal to 0.590 and 0.633 respectively. Figure 3 reports full recall-precision curves, showing that the gating network consistently outperforms the local predictor. While cysteines are far better predicted than histidines, both predictions are improved by the use of the gating network. AUC values were 0.895 ± 0.007 and 0.914 ± 0.006 for local predictor and gating network respectively, where the confidence intervals are the standard error of the Wilcoxon-Mann-Whitney statistic, confirming that the gating network attains a significant improvement over the local predictor.
The CV procedure gave similar results. The local predictor and the gating network obtained an AURPC equal to 0.428 and 0.500 respectively, and full recall-precision curves are shown in Figure 4. Both methods are able to generalize over non remotely-homologue chains, and the performance of the gating network is still significantly higher than that of the local predictor, as confirmed by AUC values of 0.890 ± 0.006 and 0.867 ± 0.007 respectively.
Protein-level predictions were computed by requiring that at least three residues within the chain were predicted to bind zinc with a given probability, as computed by the gating network (Eq. (2)). By varying such probability we obtained a recall-precision curve at the chain level.  Finally, we investigated the viability of training a predictor for all the four amino acids involved in zinc binding (it will be referred to as SP CHDE -SVM), trying to overcome the disproportion issue. On the rationale that binding residues should be well conserved because of their important functional role, we put a threshold on the residue conservation (Pr(X)) in the multiple alignment profile in order to consider it as a candidate target. By requiring that Pr(D) + Pr(E) ≥ 0.8, we more than halved the unbalance in the data set for the local predictor. At the level of semi-patterns, we realized that such a threshold produced a rea-sonable unbalance only for gap lengths between one and three, and thus decided to ignore semi-patterns containing aspartic or glutamic acid with gaps of different lengths. While global performances were almost unchanged, aspartic acid and glutamic acid alone obtained a value of the AUC of 0.74 ± 0.03 and 0.70 ± 0.06 respectively in the LOO procedure and 0.73 ± 0.03 and 0.65 ± 0.05 in the CV procedure (with respect to the 0.5 baseline), showing that performances are significantly better than random. However, results on these two residues are still preliminary and further work is required to provide a prediction quality comparable to that obtained for cysteines and histidines. It is interesting to note that at the level of chain classification, the only difference that can be noted by using sites, as reported in Table 2. The list of protein chains employed in the two experimental settings, together to the splits of the 5-fold cross validation procedure and the model parameters obtained in the tuning phase, are available in the additional file 1.

Predictions for the human proteome
A bioinformatic analysis of the content of the human proteome in terms of zinc-binding proteins is already available [9]. In that work, putative zinc-binding proteins were identified based on the occurrence of known (from the PDB) zinc-binding patterns together with some sequence similarity around the pattern, following a previously proposed methodology [10]. These results were integrated by those independently obtained by i) text-mining the available annotations of human genes and ii) using Pfam protein domains described as having zinc-binding properties to scan the proteome. These three search approaches cumulatively allowed identification of zinc-binding proteins in the entire PDB with a precision of 78% and a recall of 89% [9]. This strategy is intrinsically limited in that it can exploit thoroughly existing information but cannot predict new binding sites. Nevertheless, when applied to the human proteome, it identified ab. 3,200 human chains that are potentially zinc binding. Of these, 53% were identified independently by all three approaches, and 76% were identified by at least two methods [9]. These results required a significant degree of manual care (e.g. in the selection of Pfam domains to be searched) and contain a certain degree of subjectivity (e.g. due to the fact that several gene annotations are relatively speculative). The present approach, which is fully automated, has a performance on the PDB only slightly worse than that of the manually curated methodology described in [9], while providing the unique opportunity of predicting unprecedented zinc-binding patterns and thus entirely new classes of zinc-proteins, as discussed in detail below.
To meaningfully compare the presently developed SVMbased approach and the above-described published work, the SP-SVM was used to scan the same human proteome version for putative zinc-binding chains. In the present approach a chain is dubbed as zinc-binding if the predictor assigns a probability of being zinc-binding greater than 0.7 to at least three residues in the chain. By doing so, we switch from per-residue prediction (SP-SVM output) to a per-protein prediction. Indeed, the output most relevant for the biologists is the prediction of zinc-binding capabilities at the entire protein level.
The SP CH -SVM identified 2,833 putative human zincbinding chains, which constitute the predicted human zinc-proteome. The results obtained employing the SP CHDE -SVM are very similar to those of the SP CH -SVM, possibly because the comparatively small number of available examples of sites containing aspartic and glutamic acids as ligands limits the training of the machine.
Comparing the present results with those previously published, we verified that 965 out of 3,207 putative zincbinding chains were not confirmed by the SP CH -SVM. However, not all the 3,207 human proteins found in [9] are equally likely to be true zinc-binding chains, and three different qualitative levels of likelihood were identified [9]. Figure 7 shows that the distribution of the chains retrieved by the SP CH -SVM is in agreement with the results of the previously published work. Only about 350 chains were previously found by two or three independent approaches but have not been retrieved by the SP CH -SVM. These 350 protein chains mainly comprise (i) chains that have a domain annotated as zinc-binding but lack any metal-binding pattern; (ii) chains that have a pattern composed by only 2 ligands; (iii) chains that contain at least one domain annotated as zinc-binding but not yet CV: SP CH -SVM predictions at a protein level structurally characterized (therefore no metal-binding pattern can be associated to these chains). About 100 chains have not been confirmed by the SP CH -SVM for no obvious reason, corresponding to about 3% of the previously predicted human zinc-proteome.
About 600 proteins not detected in [9] are predicted to be zinc-binding proteins by the SP CH -SVM. This group comprises some false positives like the tumor necrosis factor receptors or keratin associated proteins which use the predicted zinc-binding cysteines to make disulphide bridges, as well as some possible true positives previously undetected. As an example, a potential zinc-binding site was found in a chain annotated as hypothetical and functionally uncharacterized ([RefSeq:NP_060357.1]). For this chain it was possible to build a 3D-model on the basis of the X-ray structure of the Plasmodium falciparum homologue ([PDB:1ZSO] [20]). The alignment of these two sequences shows that the Plasmodium falciparum protein does not present any potential ligand corresponding to the predicted binding cysteines (CX(2)CX(33)C), and in fact the available structure does not contain any metal. The theoretical structural model of the human protein ( Figure 8) shows that these three cysteines are close in space. The metal-binding pattern probably involves another cysteine which was predicted with a probability (0.67) only slightly below the threshold. Homologues of these proteins are only found in eukaryota. The metalbinding pattern is conserved (additional file 2), except in sequences from Alveolata (Plasmodium, Cryptosporidium, Theileria). Another case worth mentioning is that of about 50 human chains annotated as ubiquitin-specific protease and predicted to bind zinc with a CX(2)CX(43-50)CX(2)C pattern. All these proteins are involved in the deubiquitination process and belong to the same family (the UBP family), which contains highly divergent sequences. Their catalytic domain has been structurally characterized and appears similar to an extended right hand, ready to receive the substrate [21]. 3D homology models of these 50 chains confirm their potential capability of binding zinc. As shown in Figure 9 the potential zinc-binding pattern falls at the tip of the fingers of the hand, a region that is directly involved in the interaction with the substrate [21]. The zinc ion may thus stabilize the structure in this peripheral part of the chain, indirectly contributing to the interaction with ubiquitin. A structure of bovine UBP41 released after preparation of this work confirmed the prediction of above zinc-binding site ([PDB:2HD5] [22]). The zinc-binding site is next to the region of interaction of this enzyme with ubiquitin.
Finally, it must be noted that in some cases the SVMs do not predict all the ligands in the structure with a high probability but can predict only a part of the pattern or include erroneous residues in the pattern. An explicative example is the binding-site prediction for the ADAM-TS family. This family, which has not yet been structurally characterized, comprises Zn-dependent endopeptidases using the HX(3)HX(5)H motif to bind the catalytic zinc ion. For all these chains the SVMs predicted the first two histidines as ligands with a high probability (more than 0.7) while the third histidine is often predicted with very low values (average value = 0.32). Chain-level comparisons between SPCH-SVM and results in [9] are available in the additional file 3.

Conclusion
In the present work we have described a novel approach based on SVMs to the prediction of zinc-binding capabilities at the level of an entire proteome. The method has been trained using the structures available in the PDB where zinc was bound in a physiologically relevant manner. This should maximize, but cannot guarantee, that the properties predicted are relevant also in vivo and not just in vitro. However, due to the complexity of the processes controlling the insertion of metal cofactors in proteins and, in particular, due to the fact that they are under kinetic rather than thermodynamic control, it is not possible to exclude that a protein predicted here to be zincbinding will in vivo bind other metal ions (e.g. iron, copper). With all these caveats in mind, the present approach constitutes a highly automated tool for the identification of metalloproteins, which provides results of comparable quality with respect to highly manually refined predictions. In addition, it permits the identification of unprecedented metal sites, providing important hints for the work of experimentalists. The performance of the proposed method was evaluated on strong non-redundancy tests showing a significant improvement due to correla- Figure 7 Predictions on the Human proteome. Identification of previously detected human zinc-binding chains by the SP CH -SVM (dark gray: chains retrieved by the predictor, light gray: chains not retrieved by the predictor) as a function of the reliability of the previous results [9].

Predictions on the Human proteome
) that a single residue binds zinc can now be computed by the following gating network: P(Y b = 1|x) = P(Y s = 1|x) + (1 -P(Y s = 1|x))P(Y l = 1|x) (2) where P(Y l = 1|x) is the probability of zinc binding from the local predictor, while P(Y s = 1|x) is the probability of x being involved in a positive semi-pattern, approximated as the maximum between the probabilities for each semipattern x is actually involved in.

Validation through homology modelling
We attempted to model the 3D structure of all the human chains retrieved by the present SP CH -SVM but not reported in the literature or previously predicted to be zinc-binding. Appropriate templates were looked for in the PDB, by searching for proteins of known structure having a sequence identity greater than 30% to the target. Structural models were built using the program Modeller-6v2 [30]. The input alignment for Modeller was obtained with ClustalW [31].