Prediction of the functional class of metal-binding proteins from sequence derived physicochemical properties by support vector machine approach

Metal-binding proteins play important roles in structural stability, signaling, regulation, transport, immune response, metabolism control, and metal homeostasis. Because of their functional and sequence diversity, it is desirable to explore additional methods for predicting metal-binding proteins irrespective of sequence similarity. This work explores support vector machines (SVM) as such a method. SVM prediction systems were developed by using 53,333 metal-binding and 147,347 non-metal-binding proteins, and evaluated by an independent set of 31,448 metal-binding and 79,051 non-metal-binding proteins. The computed prediction accuracy is 86.3%, 81.6%, 83.5%, 94.0%, 81.2%, 85.4%, 77.6%, 90.4%, 90.9%, 74.9% and 78.1% for calcium-binding, cobalt-binding, copper-binding, iron-binding, magnesium-binding, manganese-binding, nickel-binding, potassium-binding, sodium-binding, zinc-binding, and all metal-binding proteins respectively. The accuracy for the non-member proteins of each class is 88.2%, 99.9%, 98.1%, 91.4%, 87.9%, 94.5%, 99.2%, 99.9%, 99.9%, 98.0%, and 88.0% respectively. Comparable accuracies were obtained by using a different SVM kernel function. Our method predicts 67% of the 87 metal-binding proteins non-homologous to any protein in the Swissprot database and 85.3% of the 333 proteins of known metal-binding domains as metal-binding. These results suggest the usefulness of SVM for facilitating the prediction of metal-binding proteins. Our software can be accessed at the SVMProt server.

Metal-binding proteins have been identified by such experimental approaches as absorbance spectroscopy [21], gel electrophoresis [22], metal-affinity columns and shift assay [23], chromatography [16], mass spectroscopy [22], NMR [6], and combined spectroscopic studies [24]. However, some of these methods require a purified or semi-purified target of interest, do not facilitate identification of unknown targets from complex protein mixtures, or require multi-step processes and very specialized equipment, which limits their application ranges [23]. Therefore, there is a need to explore other methods, including computational approaches, for facilitating the identification of metal-binding proteins to complement these experimental methods.
Several computational methods have been explored for identifying and characterizing metal-binding proteins. In many metalloproteins, the metal ions tightly bind to the proteins and their metal-bound structures could be accurately determined by x-ray crystallography [3,5,10,17]. Thus structural information has been used for predicting metal-binding sites based on the detection of principal liganding residues and metal-ligand complex architectures [25,26], the use of common local structural parameters [25], combination of sequence and structural profiles [27], analysis of bond strength contributions [28], and the computation of force fields [29,30]. But for those proteins with loosely or temporarily bound metals, such as enzymes that use metal ions as cofactors, the specific metal binding sites are often poorly characterized or unknown [6]. Therefore, sequence-based computational methods appear to be useful for these types of proteins and those without 3D structures. Apart from sequence similarity methods, the recently explored sequence-based methods include metal-binding sites sequence motifs [31,32], multiple sequence alignments against known metal-binding proteins [26], and neural networks of sequence segments of amino acids of higher metal-binding propensity [33]. Moreover, combinatorial use of multiple structural, sequence alignments and annotation methods has been found to be highly useful for improving prediction accuracy of metal-binding proteins [26].
Because of the sequence, structural and functional diversity of metal-binding proteins [1-5,8,9,11-15,17], it is desirable to explore additional methods that predict metal-binding proteins directly from sequence or sequence-derived properties. This work explored a statistical learning method, support vector machines (SVM), as such an approach. SVM has been successfully used for predicting the functional classes of molecule-binding proteins such as RNA-binding proteins [34,35], DNA-binding proteins [35], lipid-binding proteins [36], and transporters [37] from sequence-derived structural and physicochemical properties and irrespective of sequence similarity. Metal-binding proteins involve a substantially more diverse spectrum of proteins than most of the other classes of proteins. For instance, the zinc-binding proteins of 16,072 sequence entries belong to 765 Pfam [38] domain families, while the EC2.7 enzymes and RNA-binding proteins of similar number of sequence entries (14,171 and 14,208) belong to 548 and 378 Pfam families respectively. The diverse spectrum of proteins poses a more critical test for constructing an SVM prediction system.
Metal-binding proteins are diverse in sequence, structure, and function [1-5,8,9,11-15,17]. Nonetheless, metal cations generally bind to centers of high hydrophilicity and reduce the enthalpy of a system upon binding [30,39], and metal ions bind to a shell of polar hydrophilic residues surrounded by a shell of non-polar residues [25]. The binding sites of some metal-ligand complexes have specific structural architectures [25]. To some extent, these metal-binding features are similar to the binding features of other molecule-binding proteins such as RNA-binding proteins, DNA-binding proteins and transporters, which are also diverse in sequence, structure and function and whose binding capability is mediated by certain structural and physicochemical characteristics [36,40,41]. Therefore, it is expected that SVM is also applicable to the prediction of metal-binding proteins.
In this paper, we developed SVM prediction systems for 10 metal-binding classes and for all metal-binding proteins. These classes are calcium-binding, cobalt-binding, copper-binding, iron-binding, magnesium-binding, manganese-binding, nickel-binding, potassium-binding, sodium-binding and zinc-binding. In addition to the estimate of the prediction accuracy by using an independent set of proteins, the performance of our developed SVM prediction systems was further evaluated by four additional tests to determine the usefulness of SVM for predicting novel metal-binding proteins and the applicability of other kernel functions. One is the evaluation of the prediction accuracies when homologous proteins are considered as one. The second is the prediction of metal-binding proteins non-homologous to any protein in Swissprot database [42]. The third is to study whether the known metal-binding domains can be predicted as metal-binding by our SVM systems. The fourth is to study the performance of SVM with a different kernel function.

Overall prediction accuracy
The statistics of the datasets and prediction results of specific metal-binding classes and all metal-binding proteins are given in Table 1. In this Table, TP, FN, TN, FP, SE, and SP stand for true positive (correctly predicted metal-binding proteins of a specific class), false negative (metal-binding proteins of a specific class incorrectly predicted as non-class-members), true negative (correctly predicted non-class-members), false positive (non-class-members incorrectly predicted as members of a specific class of metal-binding proteins), the predicted sensitivity (accuracy for members in each metal-binding class), and the predicted specificity (accuracy for non-members of each metal-binding class). The SE for the class of calcium-binding, cobalt-binding, copper-binding, iron-binding, magnesium-binding, manganese-binding, nickel-binding, potassium-binding, sodium-binding, zinc-binding, and all metal-binding proteins is 86
A direct comparison with results from previous metal-binding protein prediction studies may not be entirely appropriate because of the differences in the protein classes predicted, datasets, protein descriptors, prediction methods and parameters. Nonetheless, a tentative comparison may provide a crude estimate of the level of accuracy of our method with respect to those achieved by other studies of metal-binding proteins. The reported SEs of other studies are in the range of 87-93% for calcium-binding proteins [28] and 90-97% for all metal-binding proteins [25,27,29,33]. Thus the corresponding SEs of 93.8% and 78.9% of our SVM prediction systems are comparable to those of other studies despite the use of a significantly higher number, and thus a more diverse range, of proteins in our studies.
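The sensitivity and specificity defined above can be computed directly from the four counts. The following is a minimal sketch; the function names and the example counts are hypothetical and are not taken from Table 1.

```python
def sensitivity(tp: int, fn: int) -> float:
    """SE: percentage of class members that are predicted as members."""
    return 100.0 * tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """SP: percentage of non-members that are predicted as non-members."""
    return 100.0 * tn / (tn + fp)


# Hypothetical counts chosen only for illustration (not from Table 1).
tp, fn, tn, fp = 90, 10, 950, 50
print(f"SE = {sensitivity(tp, fn):.1f}%, SP = {specificity(tn, fp):.1f}%")
```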
The prediction accuracy of the non-members of each metal-binding class appears to be better than that of the members. The higher prediction accuracy for non-members likely results from the availability of a more diverse set of non-members than of members, which enables SVM to perform a better statistical learning for recognition of non-members. Based on the statistics provided on the webpage of the Pfam database [38], there are over 8,000 families of proteins, from which one can generate a diverse set of non-members for each metal-binding class. Because the number of members differs from the number of non-members in each class, each dataset is imbalanced. SVM based on an imbalanced dataset tends to produce feature vectors that push the hyperplane towards the side with the smaller number of data [43], which can lead to a reduced accuracy for the set with either a smaller number of samples or less diversity. This might partly explain why the prediction accuracy for members is generally lower than that for non-members. It is however inappropriate to simply reduce the number of non-members to artificially match that of members, since this compromises the diversity needed to fully represent all non-members. Computational methods for re-adjusting the biased shift of the hyperplane are being explored [44]. Application of these methods may help improve SVM prediction accuracy in this and other cases involving unbalanced data.
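One common way to counter such an imbalance without discarding non-members is to weight the misclassification penalty of each class inversely to its size. The sketch below illustrates that generic heuristic only; it is not the method of [44] and was not used in this work. The function name is hypothetical, and the counts are those quoted in the abstract, used purely to illustrate the size of the imbalance.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Inverse-frequency weights: total / (number_of_classes * class_size)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts quoted in the abstract, used here only as an illustration.
counts = {"metal-binding": 53333, "non-metal-binding": 147347}
weights = balanced_class_weights(counts)
for label, w in weights.items():
    print(f"{label}: weight {w:.3f}")
```

With these weights the smaller class receives the larger penalty, so each class contributes equally to the total weighted error.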

Prediction of novel metal-binding proteins
One particular application of SVM is the prediction of novel metal-binding proteins that are non-homologous to other proteins [45]. To test this capability, Swiss-Prot   Those proteins found in the SVM training sets are then removed. As shown in Table 2, a total of 87 proteins are found from this process, and 58 (66.7%) of these proteins were correctly predicted as metal-binding by our SVM classification systems. Therefore, our SVM classification systems appear to show reasonably good capability for predicting novel metal-binding proteins based on the set of proteins tested. Nine of the 29 incorrectly predicted novel metal-binding proteins are nucleic acid binders. These include endonucleases Cfr10I, CviJI, EcoRV, PvuII and BslI, transcription activators chrR and rep2, meiosis-specific protein HOP1, and protein suppressor of variegation 3-7. One possible reason for the misclassification of these nucleic acid binding proteins is that spatial conservation rather than sequence conservation plays the dominant role in the formation of the active site where metal ions are clustered [47], which makes these proteins more difficult to predict than classes of proteins with more apparent sequence signatures.

Prediction of metal-binding proteins with specific structural characteristics
A number of metal-binding proteins contain metal-binding domains [31,48,49] or motifs [31,32]. Several families of such metal-binding proteins have been documented; examples include the zinc finger family [50], the EF hand family [51], and the Fer4 family [52]. These families have distinctive structural features responsible for metal binding. Thus the performance of SVM prediction of metal-binding proteins can be evaluated by examining whether or not proteins containing one of these domains or motifs can be correctly classified as metal-binding.

SVM prediction performance by using a different kernel function
Apart from the Gaussian kernel function of sequence-derived physicochemical properties used in this work, several other kernel functions have been developed and applied for SVM analysis of proteins and DNAs [57-59]. It is of interest to test the usefulness of some of these kernel functions for predicting metal-binding proteins. The string-kernel function has been extensively used and it has shown promising potential for protein and DNA studies [57,58]. This kernel function is constructed by comparison of sequences of classes of proteins or DNAs and the assignment of individual weights to amino acids or nucleotides to describe physicochemical or other characteristics of the proteins and DNAs. In this work, this kernel function is used to develop three SVM systems for predicting the classes of nickel-binding, potassium-binding, and sodium-binding proteins. The spectrum kernel with mismatches [60] is used to generate the string kernel for each protein. Testing results obtained with the independent set of proteins for each class show that the SE is 75.1%, 89.5% and 88.7%, and the SP is 99.0%, 98.7% and 97.8% for these classes respectively. Thus comparable prediction performance can be achieved by using string-kernel SVM, which suggests the usefulness of this and other kernel functions for SVM prediction of metal-binding proteins.

Comparison of SVM prediction performance with those of other methods
To compare the prediction performance of SVM with those of other methods, our SVM classification system for predicting zinc-binding proteins was used to scan the human genome, and the predicted zinc-binding proteins were compared with those predicted by one or combinations of three other methods [26]. These methods have been used for searching potential zinc-binding proteins in the human genome by means of (i) zinc-binding pattern identification via structural comparison with all available X-ray structural data, (ii) multiple sequence alignment based on libraries of zinc-binding domains, and (iii) analysis of sequence annotations [26]. SVM predicted a total of 4,518 zinc-binding proteins compared to 3,207 predicted by at least one of the other three methods, 2,770 of which are mutually predicted. The percentages of mutually predicted proteins are significantly higher for those proteins predicted by combinations of the other three methods. The numbers of proteins predicted by at least two and by all three of the other methods are 2,430 and 1,684 respectively, 2,256 and 1,615 of which are also predicted by SVM. Therefore, SVM is capable of predicting most of the zinc-binding proteins predicted by the combinations of the other three methods. SVM appears to predict a higher number of zinc-binding proteins than each of the other three methods. Apart from the expected prediction error, the reported problems of the other three methods associated with structurally uncharacterized, non-conserved, or unclearly annotated zinc-binding proteins [26] may also contribute to the discrepancy between SVM and the other methods. For example, two SVM-predicted proteins that are not predicted by the other three methods, forkhead box protein P1 and TRAF-interacting protein, are annotated as zinc-binding in GO and described to contain ZINC_FINGER_C2H2_1 in PROSITE.
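The overlap percentages implied by the counts above can be checked with a few lines of arithmetic. This is only a worked calculation on the numbers quoted in the text; the variable names are illustrative.

```python
# (proteins predicted by the other methods, of which also predicted by SVM)
overlap = {
    "at least one method": (3207, 2770),
    "at least two methods": (2430, 2256),
    "all three methods": (1684, 1615),
}

percent_shared = {
    label: 100.0 * shared / total for label, (total, shared) in overlap.items()
}
for label, pct in percent_shared.items():
    print(f"{label}: {pct:.1f}% also predicted by SVM")
# at least one method: 86.4%, at least two: 92.8%, all three: 95.9%

# Share of the 4,518 SVM predictions confirmed by at least one other method.
svm_confirmed = 100.0 * 2770 / 4518
print(f"SVM predictions confirmed by another method: {svm_confirmed:.1f}%")
```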

Contribution of feature properties to the classification of metal-binding proteins
In this work, a total of nine feature properties were used to describe the physicochemical characteristics of each protein; these properties have been routinely used for the prediction of other molecule-binding proteins [61]. It has been reported that not all feature properties contribute equally to the classification of proteins; some have been found to play a relatively more prominent role than others in specific aspects of proteins [62]. It is therefore of interest to examine which feature properties play a more prominent role in the classification of metal-binding proteins.
In earlier studies, the contribution of each individual feature property to protein classification was investigated by separately conducting classification using each feature property [36,40,41]. The same method was employed here. An analysis of the classification of the group of all metal-binding proteins seems to suggest that, in order of prominence, hydrophobicity, solvent accessibility, polarity and composition play a more prominent role than the other feature properties. Hydrophobicity has been shown to be important for metal-protein interactions in that metal-binding sites usually appear in clusters with a hydrophobic environment. High-affinity metal-binding sites in some proteins are located at sequence segments with specific amino acid composition [63], and specific sequence motifs have been used for predicting metal-binding proteins [64,65]. It was also found that the polarity and solvent accessibility of the binding site influence the functional properties of metal-binding proteins [66]. Therefore, our prediction results are consistent with these experimental findings.

Conclusion
SVM appears to be a potentially useful tool for the prediction of metal-binding proteins of different classes. The prediction accuracy may be further enhanced by expanding our knowledge about metal-binding proteins, particularly for the small metal-binding classes, by a more refined representation of the structural and physicochemical properties of proteins, and by improving prediction algorithms, for example through better treatment of imbalanced datasets. The SVM-derived metal-binding protein classification systems developed in this work can be accessed, free of charge for academic use, at the SVM-Prot server [67].

Selection of metal-binding and non-metal-binding proteins
All metal-binding proteins used in this study were collected from a comprehensive search of the Swiss-Prot database [42]. A total of 33295 metal-binding protein sequences were obtained. Most of these proteins can be classified into one of the 10 metal-binding classes, and the number of proteins is 5426, 1467, 1645, 9462, 9688, 4214, 705, 1240, 1567 and 16072 in the calcium-binding, cobalt-binding, copper-binding, iron-binding, magnesium-binding, manganese-binding, nickel-binding, potassium-binding, sodium-binding and zinc-binding class respectively. Some proteins were found to belong to more than one class. The distribution of all these proteins in different kingdoms and in the top 10 host species is given in Table 3, and that of the four largest classes of metal-binding proteins is given in Table 4. From these two Tables one finds that these proteins are from a diverse range of species and all species appear to be fairly adequately represented.
All distinct members in each class were used to construct a positive dataset for the corresponding SVM prediction system. A negative dataset, representing non-class members, was selected by a well-established procedure [45,68,69] such that all proteins are grouped into domain families [38] and the representative proteins of those families unrelated to the specific metal-binding class are used as negative samples. Members of the other metal-binding classes were included in the negative dataset if they are not a member of the class being studied. These datasets were divided into separate training, testing and independent evaluation sets by the following procedure. First, proteins were clustered into groups based on their distance in the (65) structural and physicochemical feature space by using the hierarchical clustering method. An upper limit of 20 on the largest separation was used for each cluster. One representative protein was randomly selected from each group to form a training set that is sufficiently diverse and broadly distributed in the feature space. One or up to 50% of the remaining proteins in each group were randomly selected to form the testing set. The selected proteins from each group were further checked to ensure that they are distinct from proteins from other groups. The remaining proteins were used as the independent evaluation set, which is also of a reasonable level of diversity. Moreover, an analysis of the similarity of the proteins in each cluster shows that the majority of the proteins in a cluster are non-homologous. Thus, the testing and independent evaluation sets are expected to be useful for their respective tasks of fine-tuning the parameters of an SVM classification system and evaluating its prediction performance. The statistics of the members and non-members of the datasets of each metal-binding class are given in Table 1.
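The splitting step described above can be sketched as follows, assuming the clusters have already been computed. The function name, the fixed random seed and the toy clusters are hypothetical. The phrase "one or up to 50%" is read here as "half of the remaining proteins, but at least one when any remain", which is an interpretation rather than a detail confirmed by the text.

```python
import random


def split_clusters(clusters, seed=0):
    """Split pre-computed clusters into training, testing and independent sets.

    One random member per cluster goes to training; of the remainder, half
    (at least one, if any remain) goes to testing; the rest is independent.
    """
    rng = random.Random(seed)
    train, test, independent = [], [], []
    for members in clusters:
        pool = list(members)
        rng.shuffle(pool)
        train.append(pool[0])
        rest = pool[1:]
        n_test = max(1, len(rest) // 2) if rest else 0
        test.extend(rest[:n_test])
        independent.extend(rest[n_test:])
    return train, test, independent


# Toy clusters of hypothetical protein identifiers.
clusters = [["p1", "p2", "p3", "p4", "p5"], ["q1"], ["r1", "r2"]]
train, test, independent = split_clusters(clusters)
print(len(train), len(test), len(independent))  # 3 3 2
```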

Derivation of structural and physicochemical properties from protein sequence
Construction of the feature vector for each protein is based on the formula used in the prediction of RNA-binding proteins [69], protein-protein interaction [70], protein fold recognition [62], and protein functional family prediction [68]. Given the sequence of a protein, its amino acid composition and the properties of every constituent amino acid are computed and then used to generate this vector. The computed amino acid properties include hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility [68].
For each of these properties, amino acids are divided into three groups such that those in a particular group are regarded as having approximately the same property. For instance, amino acids can be divided into hydrophobic (CVLIMFW), neutral (GASTPHY), and polar (RKEDQN) groups. Three descriptors, composition (C), transition (T), and distribution (D), are introduced to describe the global composition of each of these properties. C is the number of amino acids of a particular property (such as hydrophobicity) divided by the total number of amino acids in a protein sequence. T characterizes the percent frequency with which amino acids of a particular property are followed by amino acids of a different property. D measures the chain length within which the first, 25%, 50%, 75% and 100% of the amino acids of a particular property are located respectively.
A hypothetical protein sequence AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE, as shown in Figure 1, has 16 alanines (n1 = 16) and 14 glutamic acids (n2 = 14). Overall, for each property there are 21 elements representing these three descriptors: 3 for C, 3 for T and 15 for D. The feature vector of a protein is constructed by combining the 21 elements of all of these properties and the 20 elements of amino acid composition in sequential order.
There is some level of overlap in the descriptors for hydrophobicity, polarity, and surface tension. Thus the dimensionality of the feature vectors may be reduced by principal component analysis (PCA). Our own study suggests that the use of PCA-reduced feature vectors only moderately improves the accuracy for some of the families. It is thus unclear to what extent this overlap affects the accuracy of SVM classification. It is noted that reasonably accurate results have been obtained using these overlapping descriptors in various protein classification studies [62,68,70-72].
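The three descriptors can be illustrated on the hypothetical sequence above, treating alanine and glutamic acid as two separate groups. This is a minimal sketch of the definitions given in the text, not the authors' implementation: the function name and the group labels are assumptions, all descriptors are expressed as percentages, and when 25%, 50% or 75% of a group is not a whole number of residues the count is rounded down, which is also an assumption.

```python
from itertools import combinations


def ctd(seq, group_of):
    """Composition, transition and distribution descriptors, in percent."""
    labels = [group_of[aa] for aa in seq]
    groups = sorted(set(group_of.values()))
    n = len(labels)
    # C: share of residues that belong to each group.
    comp = {g: 100.0 * labels.count(g) / n for g in groups}
    # T: share of adjacent residue pairs that switch between two groups.
    trans = {}
    for g1, g2 in combinations(groups, 2):
        changes = sum(1 for a, b in zip(labels, labels[1:]) if {a, b} == {g1, g2})
        trans[(g1, g2)] = 100.0 * changes / (n - 1)
    # D: chain position (percent of length) of the first residue and of the
    # residues marking 25%, 50%, 75% and 100% of each group (rounded down).
    dist = {}
    for g in groups:
        pos = [i + 1 for i, lab in enumerate(labels) if lab == g]
        if not pos:
            dist[g] = [0.0] * 5
            continue
        marks = [1] + [max(1, len(pos) * q // 4) for q in (1, 2, 3, 4)]
        dist[g] = [100.0 * pos[m - 1] / n for m in marks]
    return comp, trans, dist


seq = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"
comp, trans, dist = ctd(seq, {"A": "group1", "E": "group2"})
print(round(comp["group1"], 2), round(comp["group2"], 2))  # 53.33 46.67
print(round(trans[("group1", "group2")], 2))               # 51.72
print([round(v, 2) for v in dist["group1"]])               # [3.33, 16.67, 40.0, 66.67, 96.67]
```

With three groups per property, the same function yields 3 composition values, 3 transition values and 15 distribution values, matching the 21 elements per property stated in the text.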

Support Vector Machines method
The algorithms of SVM and its applications to proteins are extensively described in the literature [68,69,73]. Thus only a brief description is given here. A linear SVM constructs a hyperplane that separates two different classes of feature vectors with a maximum margin. One class represents metal-binding proteins and the other non-metal-binding proteins. This hyperplane is constructed by finding a vector $w$ and a parameter $b$ that minimize $\|w\|^2$ subject to the following conditions: $w \cdot x_i + b \ge 1$ for $y_i = +1$ (positive class) and $w \cdot x_i + b \le -1$ for $y_i = -1$ (negative class). Here $x_i$ is a feature vector, $y_i$ is the group index, $w$ is a vector normal to the hyperplane, $|b|/\|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|^2$ is the squared Euclidean norm of $w$.
A nonlinear SVM projects feature vectors into a high dimensional feature space by using a kernel function such as a Gaussian kernel function $K(x_i, x_j) = \exp\left(-\|x_j - x_i\|^2 / 2\sigma^2\right)$, where $\sigma$ is the kernel width.
The linear SVM procedure is then applied to the feature vectors in this feature space. After the determination of $w$ and $b$, a given vector $x$ can be classified by using $\operatorname{sign}[(w \cdot x) + b]$; a positive or negative value indicates that the vector $x$ belongs to the positive or negative class respectively.
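The classification step can be sketched in a few lines. The example assumes the common Gaussian kernel form exp(-||x - y||^2 / (2 sigma^2)) and the standard kernel expansion in which the dot product w · x is replaced by a weighted sum of kernel values over the support vectors. The support vectors, coefficients, bias and sigma below are toy values chosen for illustration, not parameters of the systems described here.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2)) (assumed form)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def classify(x, support_vectors, labels, alphas, b, sigma=1.0):
    """Return +1 or -1 from the sign of the kernel decision value.

    A decision value of exactly zero is assigned to the positive class here.
    """
    score = b + sum(
        alpha * y * gaussian_kernel(sv, x, sigma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1


# Toy model: one support vector per class (hypothetical values).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [+1, -1]
alphas = [1.0, 1.0]
print(classify((0.1, 0.1), support_vectors, labels, alphas, b=0.0))  # 1
print(classify((1.9, 2.0), support_vectors, labels, alphas, b=0.0))  # -1
```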
The performance of SVM has been measured by the positive, negative and overall prediction accuracies $P_p$ = TP/(
Figure 1. The sequence of a hypothetical protein for illustration of the derivation of the feature vector of a protein.