Predicting zinc binding at the proteome level
- Andrea Passerini†1,
- Claudia Andreini†2,
- Sauro Menchetti1,
- Antonio Rosato2 and
- Paolo Frasconi1
© Passerini et al; licensee BioMed Central Ltd. 2007
Received: 30 August 2006
Accepted: 05 February 2007
Published: 05 February 2007
Metalloproteins are proteins capable of binding one or more metal ions, which may be required for their biological function, for regulation of their activities or for structural purposes. Metal-binding properties remain difficult to predict as well as to investigate experimentally at the whole-proteome level. Consequently, the current knowledge about metalloproteins is only partial.
The present work reports on the development of a machine learning method for the prediction of the zinc-binding state of pairs of nearby amino-acids, using predictors based on support vector machines. The predictor was trained using chains containing zinc-binding sites and non-metalloproteins in order to provide positive and negative examples. Results based on strong non-redundancy tests prove that (1) zinc-binding residues can be predicted and (2) modelling the correlation between the binding state of nearby residues significantly improves performance. The trained predictor was then applied to the human proteome. The present results were in good agreement with the outcomes of previous, highly manually curated, efforts for the identification of human zinc-binding proteins. Some unprecedented zinc-binding sites could be identified, and were further validated through structural modelling. The software implementing the predictor is freely available at: http://zincfinder.dsi.unifi.it
The proposed approach constitutes a highly automated tool for the identification of metalloproteins, which provides results of comparable quality with respect to highly manually refined predictions. The ability to model correlations between pairwise residues allows it to obtain a significant improvement over standard 1D based approaches. In addition, the method permits the identification of unprecedented metal sites, providing important hints for the work of experimentalists.
Knowledge about the capability to bind metal ions is important when investigating the function of an experimentally uncharacterized protein. Unfortunately, the identification of bound metal ions can be quite difficult experimentally, especially when attempted at the whole proteome scale. Some results in this direction (metalloproteomics) have been recently reported [1–3], but these techniques are still far from becoming available for routine application. Furthermore, experimental approaches may suffer from biases such as incorporation of the wrong metal cofactor in vivo, removal of the metal ion(s) during protein purification procedures, or binding of metals at adventitious sites. Within this frame, bioinformatics tools are thus important to guide the design and the interpretation of experiments. The prediction of metal binding capabilities is a challenging task for which the development of reliable tools is still in progress.
In this paper, we investigate the use of machine learning approaches to automatically annotate metal-binding proteins on the whole-proteome scale. In particular, we focus on an important class of structural and functional sites involving the binding of zinc ions. Zinc is essential for life and is the second most abundant transition metal ion in living organisms after iron. In contrast to other transition metal ions, such as copper and iron, zinc(II) does not undergo redox reactions thanks to its filled d-shell. In nature, it has essentially two possible roles: catalytic or structural [6, 7]. In humans, zinc has a crucial importance in the complex network of inter-molecular interactions responsible for the proper regulation of protein expression. Indeed, a major role of zinc is in the stabilization of the structure of a huge number of transcription factors such as zinc fingers, which constitute a significant share of the human proteome [8, 9]. Only a subset of the natural amino acids can coordinate zinc ions with their side chains. In addition, the binding sites are locally constrained by the requirements on the side chain geometry imposed by coordination chemistry. For these reasons, several sites can be identified with high precision by mining regular expression patterns along the protein sequence while simultaneously inspecting amino acid conservation near the (putative) site. A potential problem with the use of regular expression patterns is that they are usually quite specific but may give a low coverage (many false negatives). On the other hand, a support vector machine (SVM) predictor based on multiple alignments outperforms a predictor based on PROSITE patterns in discriminating between cysteines bound to prosthetic groups and cysteines involved in disulfide bridges.
The application of a similar approach to the prediction of zinc-binding properties is not straightforward because most supervised learning algorithms (including SVM) build upon the assumption that examples are sampled independently. Unfortunately, this assumption can be violated when formulating prediction of metal binding sites as a traditional 1D prediction problem. The autocorrelation between the metal binding states of residues is a consequence of the fact that most binding sites contain at least two coordinating residues with short sequence separation. Autocorrelation problems have been recently identified in the context of relational learning, and collective classification solutions have been proposed based on probabilistic learners [14, 15]. In a recent work we tried to address the autocorrelation problem in the context of metal binding site prediction by developing a two stage approach, where a bi-recurrent neural network refines residue-level SVM predictions by jointly considering all SVM outputs from residues in the same chain when computing the refined prediction for each residue. While the approach performs better than the local SVM predictor alone, the improvement is not statistically significant. In this work we followed a different approach which aims at exploiting the regularities of zinc-binding sites in terms of sequence separation between ligands. The use of information on the sequential distance between cysteines was recently shown to improve performance in the task of disulfide connectivity prediction. Our solution is based on a reformulation of the learning problem where examples formed by pairs of sequentially close residues are considered. Most of the zinc-binding sites contain at least one such pair; in the following, these pairs are called semi-patterns. We developed a semi-pattern SVM trained to predict the zinc-binding state of a full semi-pattern.
A traditional 1D SVM predictor was employed to account for the isolated ligands, and the final prediction for a given residue was computed by a gating network combining the probability of belonging to a zinc-binding semi-pattern and that of being an isolated ligand. In the following we will refer to the learning architecture as SP-SVM in order to stress the importance of the semi-pattern prediction as well as the role of the SVM components.
The method was tested on a representative non-redundant set of zinc-binding protein chains in order to assess its generalization power on new chains. Two evaluation procedures were employed, a full leave-one-out procedure on a subset with pairwise HSSP-value up to five, and a k-fold cross validation procedure guaranteeing that no test chain was remotely homologous with respect to any chain in the training set (see details in Results). This second test is a stronger requirement with respect to other common approaches to remove redundancy. A significant improvement over the traditional 1D prediction approach was observed. We additionally used the trained predictor to analyze the entire human proteome and observed a good agreement with previous, manually curated, annotations.
Results and discussion
PDB data preparation
A data set of high-quality annotated chains was extracted from the Protein Data Bank (PDB) by selecting all the structures deposited in the PDB as of June 2005 and containing at least one zinc ion in the coordinate file. Structures binding zinc spuriously because of experimental settings (e.g. high zinc concentration in the crystallization buffer) were removed. Homologs were removed by retaining only one representative chain. This procedure resulted in a set of 305 unique chains. Amino acids binding to the zinc ion(s) were detected using a threshold of 3 Å for the distance between the metal and the protein donor atoms. In order to provide negative examples of non metal-binding proteins, an additional set was generated by performing a single run of UniqueProt with zero HSSP-value on PDB entries that are not metalloproteins. We thus obtained a second data set of 2,369 chains. Zinc-binding chains whose structure had been solved in the apo (i.e. without metal) form were removed from the ensemble of non-metalloproteins. We computed multiple alignment profiles for all chains using PSI-Blast on the non-redundant (nr) NCBI protein database. In order to reduce noise in the training data we ignored residues whose profile had a relative weight less than 0.015, indicating that too few sequences had aligned at that position. This also allowed us to discard poly-histidine tags, which are attached at either the N- or C-terminus of some chains in the PDB as a result of protein engineering aimed at making protein purification easier.
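The 3 Å distance criterion for detecting zinc ligands can be illustrated with a minimal Python sketch. The function name `zinc_ligands`, the input layout (plain coordinate tuples rather than parsed PDB records), and the choice of an inclusive comparison (`<=`) are assumptions made for illustration; the text only states that a 3 Å threshold was used.

```python
import math

def zinc_ligands(zinc_coords, donor_atoms, cutoff=3.0):
    """Return the sorted residue ids whose donor atom lies within `cutoff`
    angstroms of at least one zinc ion.

    zinc_coords: list of (x, y, z) tuples, one per zinc ion.
    donor_atoms: list of (residue_id, (x, y, z)) tuples, one per donor atom.
    Both layouts are hypothetical; real input would come from a PDB parser.
    """
    ligands = set()
    for zx, zy, zz in zinc_coords:
        for residue_id, (x, y, z) in donor_atoms:
            distance = math.sqrt((x - zx) ** 2 + (y - zy) ** 2 + (z - zz) ** 2)
            # Assumption: the threshold is inclusive.
            if distance <= cutoff:
                ligands.add(residue_id)
    return sorted(ligands)
```

For example, with one zinc ion at the origin and donor atoms at 2.1 Å, 2.3 Å and about 8.7 Å, only the first two residues would be reported as ligands.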
Analysis of zinc-binding sites
The choice of predicting zinc-binding sites by modelling semi-patterns was motivated by an extensive analysis of the characteristics of the sites, which we briefly report in this section.
Distribution of zinc site types
# Coordinating Residues
2 (interface – Zn2)
3 (catalytic – Zn3)
4 (structural – Zn4)
Amino acid statistics on zinc sites
Patterns of binding sites
Zinc binding site patterns
Binding Site Patterns
[CHDE] x(·) [CHDE] x(·) [CHDE] x(·) [CHDE]
[CH] x(·) [CH] x(·) [CH] x(·) [CH]
[CHDE] x(0–7) [CHDE] x(·) [CHDE] x(0–7) [CHDE]
[CHDE] x(0–7) [CHDE] x(> 7) [CHDE] x(0–7) [CHDE]
[CHDE] x(·) [CHDE] x(·) [CHDE]
[C] x(·) [C] x(·) [C] x(·) [C]
[CHDE] x(·) [CHDE]
[CHDE] x(0–7) [CHDE] x(> 7) [CHDE]
[CH] x(·) [CH] x(·) [CH]
[CHDE] x(> 7) [CHDE] x(0–7) [CHDE]
[CH] x(·) [CH]
[CHDE] x(0–7) [CHDE] x(> 7) [CHDE] x(> 7) [CHDE]
[CHDE] x(> 7) [CHDE] x(0–7) [CHDE] x(0–7) [CHDE]
[DE] x(·) [DE]
[DE] x(·) [DE] x(·) [DE]
[CHDE] x(> 7) [CHDE] x(> 7) [CHDE] x(0–7) [CHDE]
[CHDE] x(0–7) [CHDE] x(0–7) [CHDE] x(> 7) [CHDE]
[DE] x(·) [DE] x(·) [DE] x(·) [DE]
Evaluation of SVM-based predictors
A traditional 1D SVM predictor was compared to the full SP-SVM architecture, in order to assess the significance of the proposed approach. While aspartic and glutamic acids coordinate zinc ions less frequently than cysteines and histidines (see Table 2), they are far more abundant in protein chains. This yielded an extremely unbalanced data set, and forced us to initially focus on cysteine and histidine residues only (we will refer to this predictor as SP_CH-SVM). Moreover, we labelled a [CH] x(0–7) [CH] semi-pattern as positive if both candidate residues bound a zinc ion, even if they were not actually binding the same ion. Preliminary experiments showed this to be a better choice than considering such a case as a negative example, as it allowed us to recover a few positive examples, especially for semi-pattern matches with longer gaps. Model selection was performed by a stratified 4-fold cross validation procedure on the full data set, aimed at tuning Gaussian kernel width, C regularization parameter, window size and parameters of the sigmoids of the gating network. Due to the strong imbalance of the data set, accuracy is not a reliable measure of performance. We used the area under the recall-precision curve (AURPC) for both model selection and final evaluation, as it is especially suitable for extremely unbalanced data sets. We also computed the area under the ROC curve (AUC) to further assess the significance of the results.
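The semi-pattern labelling rule can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and it assumes that every pair of candidate residues with a gap of at most seven counts as a candidate semi-pattern, even when other candidate residues fall inside the gap (as the wildcard `x` in the pattern suggests).

```python
def candidate_semi_patterns(sequence, residues="CH", max_gap=7):
    """List all (i, j) index pairs (0-based, i < j) of candidate residues
    separated by a gap of 0..max_gap intervening residues."""
    positions = [i for i, aa in enumerate(sequence) if aa in residues]
    pairs = []
    for a, i in enumerate(positions):
        for j in positions[a + 1:]:
            gap = j - i - 1
            if gap > max_gap:
                break  # positions are sorted, so later ones are even farther
            pairs.append((i, j))
    return pairs

def label_semi_patterns(pairs, binding_positions):
    """Label a semi-pattern as positive when both residues bind zinc,
    regardless of whether they bind the same ion."""
    return {
        (i, j): (i in binding_positions and j in binding_positions)
        for i, j in pairs
    }
```

With the sequence `"CAACH"` the candidate pairs are `(0, 3)`, `(0, 4)` and `(3, 4)`; if only positions 0 and 3 bind zinc, only `(0, 3)` is labelled positive.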
Generalization performances of the best models for the local predictor and the gating network were assessed with two different procedures. First, we evaluated generalization over non-homologous chains. We repeatedly ran UniqueProt with HSSP-value equal to five starting from the full data set and stopping when the program found only clusters of singletons, thus assuring that no two chains had an HSSP-value greater than the threshold. We then ran a full leave-one-out (LOO) procedure on the resulting data set, which consisted of 230 zinc-binding chains and 1,949 negative ones. Second, we evaluated generalization over chains which had no remote homologue in the training set. To this aim, we employed a stratified five-fold cross validation (CV) procedure on the full data set. A few (38) non-metalloprotein chains were removed in this procedure as they lacked the information about SCOP classification, which prevented us from assigning them to the correct CV fold. In fact, we distributed protein chains over the CV folds by ensuring that two chains having a zinc-binding domain belonging to the same SCOP superfamily always appeared in the same CV fold, and two free chains (which were employed as negative examples) having a domain in the same SCOP superfamily also appeared in the same CV fold. In this way, we measure generalization across different superfamilies, a setting in which not even remote homology modelling techniques could be successfully applied for prediction. Note that by k-fold cross-validation we mean splitting the data in k subsets (commonly called folds) and using one of them in turn for testing. The term "fold" in SCOP has a totally different meaning.
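The superfamily constraint on the CV folds can be illustrated with a simplified Python sketch. It assumes each chain maps to exactly one superfamily identifier and uses a greedy size-balancing heuristic; both are assumptions for illustration. The procedure described above additionally handles multi-domain chains and stratification, which this sketch does not attempt.

```python
from collections import defaultdict

def superfamily_folds(chain_superfamilies, n_folds=5):
    """Assign chains to CV folds so that chains sharing a superfamily
    always land in the same fold.

    chain_superfamilies: dict mapping chain id -> superfamily id
    (hypothetical input layout). Returns dict chain id -> fold index.
    """
    groups = defaultdict(list)
    for chain, superfamily in chain_superfamilies.items():
        groups[superfamily].append(chain)

    fold_sizes = [0] * n_folds
    assignment = {}
    # Place the largest superfamilies first, each into the currently
    # smallest fold, to keep fold sizes roughly balanced.
    for superfamily in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        fold = fold_sizes.index(min(fold_sizes))
        for chain in groups[superfamily]:
            assignment[chain] = fold
        fold_sizes[fold] += len(groups[superfamily])
    return assignment
```

Because whole superfamilies are moved as a unit, no test chain shares a superfamily with any training chain, which is the property the evaluation relies on.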
Finally, we investigated the viability of training a predictor for all four amino acids involved in zinc binding (referred to as SP_CHDE-SVM), trying to overcome the disproportion issue. On the rationale that binding residues should be well conserved because of their important functional role, we put a threshold on the residue conservation (Pr(X)) in the multiple alignment profile in order to consider it as a candidate target. By requiring that Pr(D) + Pr(E) ≥ 0.8, we more than halved the imbalance in the data set for the local predictor. At the level of semi-patterns, we realized that such a threshold produced a reasonable imbalance only for gap lengths between one and three, and thus decided to ignore semi-patterns containing aspartic or glutamic acid with gaps of different lengths. While global performances were almost unchanged, aspartic acid and glutamic acid alone obtained AUC values of 0.74 ± 0.03 and 0.70 ± 0.06 respectively in the LOO procedure and 0.73 ± 0.03 and 0.65 ± 0.05 in the CV procedure (with respect to the 0.5 baseline), showing that performances are significantly better than random. However, results on these two residues are still preliminary and further work is required to provide a prediction quality comparable to that obtained for cysteines and histidines. It is interesting to note that at the level of chain classification, the only difference that can be noted by using [CHDE] instead of [CH] is an improvement in the performances for the Zn3 binding sites, as shown in Figures 5(c) and 6(c). This is perhaps not surprising given that half of [DE] residues binding zinc are contained in Zn3 sites, as reported in Table 2. The list of protein chains employed in the two experimental settings, together with the splits of the 5-fold cross validation procedure and the model parameters obtained in the tuning phase, are available in additional file 1.
Predictions for the human proteome
A bioinformatic analysis of the content of the human proteome in terms of zinc-binding proteins is already available. In that work, putative zinc-binding proteins were identified based on the occurrence of known (from the PDB) zinc-binding patterns together with some sequence similarity around the pattern, following a previously proposed methodology. These results were integrated with those independently obtained by i) text-mining the available annotations of human genes and ii) using Pfam protein domains described as having zinc-binding properties to scan the proteome. These three search approaches cumulatively allowed identification of zinc-binding proteins in the entire PDB with a precision of 78% and a recall of 89%. This strategy is intrinsically limited in that it can thoroughly exploit existing information but cannot predict new binding sites. Nevertheless, when applied to the human proteome, it identified about 3,200 human chains that are potentially zinc binding. Of these, 53% were identified independently by all three approaches, and 76% were identified by at least two methods. These results required a significant degree of manual care (e.g. in the selection of Pfam domains to be searched) and contain a certain degree of subjectivity (e.g. due to the fact that several gene annotations are relatively speculative). The present approach, which is fully automated, has a performance on the PDB only slightly worse than that of the manually curated methodology described above, while providing the unique opportunity of predicting unprecedented zinc-binding patterns and thus entirely new classes of zinc-proteins, as discussed in detail below.
To meaningfully compare the presently developed SVM-based approach and the above-described published work, the SP-SVM was used to scan the same human proteome version for putative zinc-binding chains. In the present approach a chain is labelled as zinc-binding if the predictor assigns a probability of being zinc-binding greater than 0.7 to at least three residues in the chain. By doing so, we switch from a per-residue prediction (SP-SVM output) to a per-protein prediction. Indeed, the output most relevant for biologists is the prediction of zinc-binding capabilities at the entire protein level.
The SP_CH-SVM identified 2,833 putative human zinc-binding chains, which constitute the predicted human zinc-proteome. The results obtained employing the SP_CHDE-SVM are very similar to those of the SP_CH-SVM, possibly because the comparatively small number of available examples of sites containing aspartic and glutamic acids as ligands limits the training of the machine.
Finally, it must be noted that in some cases the SVMs do not predict all the ligands in the structure with a high probability but can predict only a part of the pattern or include erroneous residues in the pattern. An illustrative example is the binding-site prediction for the ADAM-TS family. This family, which has not yet been structurally characterized, comprises Zn-dependent endopeptidases using the HX(3)HX(5)H motif to bind the catalytic zinc ion. For all these chains the SVMs predicted the first two histidines as ligands with a high probability (more than 0.7) while the third histidine is often predicted with very low values (average value = 0.32). Chain-level comparisons between SP_CH-SVM and the previously published results are available in additional file 3.
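The switch from per-residue to per-chain prediction amounts to a simple counting rule, sketched below in Python. The function name and the list-of-probabilities input are hypothetical; the default values 0.7 and 3 are the ones stated in the text.

```python
def is_zinc_binding_chain(residue_probs, threshold=0.7, min_residues=3):
    """Return True when at least `min_residues` residues have a predicted
    zinc-binding probability strictly greater than `threshold`."""
    confident = sum(1 for p in residue_probs if p > threshold)
    return confident >= min_residues
```

Note that the comparison is strict, following the wording "greater than 0.7", so a residue scored at exactly 0.7 does not count.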
In the present work we have described a novel approach based on SVMs to the prediction of zinc-binding capabilities at the level of an entire proteome. The method has been trained using the structures available in the PDB where zinc was bound in a physiologically relevant manner. This should maximize, but cannot guarantee, that the properties predicted are relevant also in vivo and not just in vitro. However, due to the complexity of the processes controlling the insertion of metal cofactors in proteins and, in particular, due to the fact that they are under kinetic rather than thermodynamic control, it is not possible to exclude that a protein predicted here to be zinc-binding will in vivo bind other metal ions (e.g. iron, copper). With all these caveats in mind, the present approach constitutes a highly automated tool for the identification of metalloproteins, which provides results of comparable quality with respect to highly manually refined predictions. In addition, it permits the identification of unprecedented metal sites, providing important hints for the work of experimentalists. The performance of the proposed method was evaluated on strong non-redundancy tests showing a significant improvement due to correlation modelling. The present SVMs exploit well the occurrence in metal-binding sites of cysteine and histidine residues, while there is room for improving the performance with respect to sites containing aspartic and glutamic acid residues.
Prediction using SVM
Many applications of machine learning to 1D prediction tasks use a simple vector representation obtained by forming a window of flanking residues centered around the site of interest. Evolutionary information is incorporated in this representation by computing multiple alignment profiles. In this approach, each example is represented as a vector of size d = (2k + 1)p, where k is the size of the window and p the size of the position specific descriptor. In this paper we developed a learning architecture which expands such representation in order to address the relational auto-correlation problem described in the Background. A local predictor based on SVM [26–28] uses the standard window representation for classifying the zinc-binding state of individual residues. Multiple alignment profiles are enriched by two indicators of profile quality, namely the entropy and the relative weight of gapless real matches to pseudocounts. An additional flag is included to mark positions ranging out of the sequence limits, resulting in an all-zero profile. We thus obtain a position specific descriptor of size p = 23. The correlation between nearby residues is modeled by an SVM semi-pattern predictor, trained to predict the binding state of pairs of residues close in sequence. A candidate semi-pattern is a pair of residues separated by a gap of δ residues, with δ ranging from zero to seven. The task is to predict whether the semi-pattern is part of a zinc-binding site. Each example is represented by a window of local descriptors (based on multiple alignment profiles) centered around the semi-pattern, including the gap between the candidate residues. An ad-hoc semi-pattern kernel (K_sp) measuring the similarity between two semi-patterns was developed in the following way: given two vectors x and z, of size d_x and d_z, representing semi-patterns with gap length δ_x and δ_z respectively,

$$K_{sp}(x, z) = \langle x[1:w],\, z[1:w] \rangle + \langle x[d_x - w + 1:d_x],\, z[d_z - w + 1:d_z] \rangle + K_{gap}(x[w + 1:d_x - w],\, z[w + 1:d_z - w]) \qquad (1)$$

where v[i : j] is the sub-vector of v that extends from i to j, and w = (k + 1)p. The first two contributions compute the dot products between the left and right windows around the semi-patterns, including the two candidate residues, whose sizes do not vary regardless of the gap lengths. K_gap is the kernel between the gaps separating the candidate residues, and is computed as:

$$K_{gap}(u, v) = \begin{cases} K_{\mu gap}(u, v) + \langle u, v \rangle & \text{if } |u| = |v| \\ K_{\mu gap}(u, v) & \text{otherwise} \end{cases}$$

K_μgap computes the dot product between the average position specific descriptors within each gap, and if the two gaps have the same length, the full dot product between the descriptors in the gaps is added.
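The verbal description of the semi-pattern kernel can be turned into a short Python sketch. This is a reading of the text, not the released implementation: vectors are plain lists, function names are hypothetical, an empty gap is assumed to contribute a zero average descriptor, and the "Gaussian kernel on top" is assumed to take the usual form exp(-γ (K(x,x) - 2K(x,z) + K(z,z))), with γ a hypothetical width parameter.

```python
import math

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def mean_descriptor(gap, p):
    """Average the position specific descriptors (size p each) in a gap."""
    n = len(gap) // p
    if n == 0:
        return [0.0] * p  # assumption: an empty gap averages to zero
    return [sum(gap[r * p + c] for r in range(n)) / n for c in range(p)]

def k_gap(u, v, p):
    """Kernel between two gaps: dot product of the average descriptors,
    plus the full dot product when the gaps have the same length."""
    value = dot(mean_descriptor(u, p), mean_descriptor(v, p))
    if len(u) == len(v):
        value += dot(u, v)
    return value

def k_sp(x, z, k, p):
    """Semi-pattern kernel: left window + right window + gap kernel,
    where each window spans k flanking residues plus one candidate."""
    w = (k + 1) * p
    left = dot(x[:w], z[:w])
    right = dot(x[len(x) - w:], z[len(z) - w:])
    gap = k_gap(x[w:len(x) - w], z[w:len(z) - w], p)
    return left + right + gap

def gaussian_on_top(kernel, x, z, gamma):
    """Gaussian kernel built on the distance induced by `kernel`."""
    squared_distance = kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)
    return math.exp(-gamma * squared_distance)
```

A usage example would wrap `k_sp` with fixed `k` and `p`, for instance `gaussian_on_top(lambda a, b: k_sp(a, b, 0, 2), x, z, 0.5)`. Splitting the vector into fixed-size windows and a variable-size gap is what lets semi-patterns with different gap lengths be compared at all.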
We employ a Gaussian kernel on top of both the linear kernel of the local predictor and the semi-pattern kernel (Eq. (1)). To get a better performance, we combine the single output from the local predictor on a given residue and the (possibly empty) set of outputs from the semi-pattern based predictor by a gating network. In order to combine two predictors, it is preferable to convert their SVM functional margins into conditional probabilities using the sigmoid function approach suggested by Platt:

$$P(Y = 1|x) = \frac{1}{1 + \exp(-A f(x) - B)}$$

where f(x) is the SVM output for example x and sigmoid slope (A) and offset (B) are parameters to be learned from data. The probability P(Y_b = 1|x) that a single residue binds zinc can now be computed by the following gating network:

$$P(Y_b = 1|x) = P(Y_s = 1|x) + (1 - P(Y_s = 1|x))\, P(Y_l = 1|x) \qquad (2)$$

where P(Y_l = 1|x) is the probability of zinc binding from the local predictor, while P(Y_s = 1|x) is the probability of x being involved in a positive semi-pattern, approximated as the maximum of the probabilities of the semi-patterns in which x is actually involved.
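The sigmoid conversion and the gating network can be sketched in Python as follows. Function names are hypothetical, the sign convention of the sigmoid follows the formula as written in this paper, and a residue that belongs to no candidate semi-pattern is assumed to have a semi-pattern probability of zero, so that the gating output reduces to the local prediction.

```python
import math

def platt_probability(f, A, B):
    """Map an SVM output f to a probability with slope A and offset B,
    using the sign convention of the formula in the text."""
    return 1.0 / (1.0 + math.exp(-A * f - B))

def gating(p_local, p_semi_patterns):
    """Combine the local probability with the semi-pattern probabilities.

    p_semi_patterns: probabilities of all semi-patterns the residue is
    involved in (possibly empty). Their maximum approximates P(Y_s = 1|x).
    """
    p_s = max(p_semi_patterns, default=0.0)  # assumption: empty set -> 0
    return p_s + (1.0 - p_s) * p_local
```

For example, a residue with local probability 0.2 that takes part in semi-patterns scored 0.5 and 0.3 receives 0.5 + 0.5 × 0.2 = 0.6. The combination never falls below either input, which matches the intent that either source of evidence is sufficient to flag a residue.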
Validation through homology modelling
We attempted to model the 3D structure of all the human chains retrieved by the present SP_CH-SVM but not reported in the literature or previously predicted to be zinc-binding. Appropriate templates were searched for in the PDB by looking for proteins of known structure having a sequence identity greater than 30% to the target. Structural models were built using the program Modeller-6v2. The input alignment for Modeller was obtained with ClustalW.
Availability and requirements
Project Name: Zinc Finder
Project home page: http://zincfinder.dsi.unifi.it
Operating system(s): Platform independent
Programming language: C++
Other requirements: C++ compiler
License: GNU GPL
Any restrictions to use by non-academics: none
The work of A.P., S.M., and P.F. is supported by EU STREP APrIL II (contract no. FP6-508861) and EU NoE BIOPATTERN (contract no. FP6-508803).
- Shi W, Zhan C, Ignatov A, Manjasetty BA, Marinkovic N, Sullivan M, Huang R, Chance MR: Metalloproteomics: high-throughput structural and functional annotation of proteins in structural genomics. Structure (Camb) 2005, 13(10):1473–1486. 10.1016/j.str.2005.07.014
- Scott RA, Shokes JE, Cosper NJ, Jenney FE, Adams MWW: Bottlenecks and roadblocks in high-throughput XAS for structural genomics. Journal of Synchrotron Radiation 2005, 12: 19–22. 10.1107/S0909049504028791
- Hogbom M, Ericsson UB, Lam R, Bakali HM, Kuznetsova E, Nordlund P, Zamble DB: A High Throughput Method for the Detection of Metalloproteins on a Microgram Scale. Mol Cell Proteomics 2005, 4(6):827–834. 10.1074/mcp.T400023-MCP200
- Chaudhuri BN, Ko J, Park C, Jones TA, Mowbray SL: Structure of D-allose binding protein from Escherichia coli bound to D-allose at 1.8 Å resolution. J Mol Biol 1999, 286: 1519–1531. 10.1006/jmbi.1999.2571
- Bertini I, Rosato A: Bioinorganic Chemistry Special Feature: Bioinorganic chemistry in the postgenomic era. PNAS 2003, 100: 3601–3604. 10.1073/pnas.0736657100
- Vallee BL, Auld DS: Functional zinc-binding motifs in enzymes and DNA-binding proteins. Faraday Discuss 1992, 93: 47–65. 10.1039/fd9929300047
- Bertini I, Sigel A, Sigel H, (Eds): Handbook on Metalloproteins. 1st edition. Marcel Dekker, New York; 2001.
- Tupler R, Perini G, Green MR: Expressing the human genome. Nature 2001, 409(6822):832–833. 10.1038/35057011
- Andreini C, Banci L, Bertini I, Rosato A: Counting the zinc-proteins encoded in the human genome. J Proteome Res 2006, 5: 196–201. 10.1021/pr050361j
- Andreini C, Bertini I, Rosato A: A hint to search for metalloproteins in gene banks. Bioinformatics 2004, 20(9):1373–1380. 10.1093/bioinformatics/bth095
- Hulo N, Sigrist CJA, Saux VL, Langendijk-Genevaux PS, Bordoli L, Gattiker A, Castro ED, Bucher P, Bairoch A: Recent improvements to the PROSITE database. Nucleic Acids Res 2004, (32 Database):134–137. 10.1093/nar/gkh044
- Passerini A, Frasconi P: Learning to discriminate between ligand-bound and disulfide-bound cysteines. Protein Eng 2004, 17(4):367–373. 10.1093/protein/gzh042
- Jensen D, Neville J: Linkage and autocorrelation cause feature selection bias in relational learning. Proceedings of the Nineteenth International Conference on Machine Learning (ICML2002) 2002.
- Taskar B, Abbeel P, Koller D: Discriminative probabilistic models for relational data. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann; 2002.
- Jensen D, Neville J, Gallagher B: Why collective inference improves relational classification. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2004.
- Passerini A, Punta M, Ceroni A, Rost B, Frasconi P: Identifying cysteines and histidines in transition-metal-binding sites using support vector machines and neural networks. Proteins: Structure, Function, and Bioinformatics 2006. [Early View]
- Tsai CH, Chen BJ, Chan CH, Liu HL, Kao CY: Improving disulfide connectivity prediction with sequential distance between oxidized cysteines. Bioinformatics 2005, 21(24):4416–4419. 10.1093/bioinformatics/bti715
- Mika S, Rost B: UniqueProt: creating sequence-unique protein data sets. Nucleic Acids Res 2003, 31(13):3789–3791. 10.1093/nar/gkg620
- Murzin A, Brenner S, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540. 10.1006/jmbi.1995.0159
- Holmes MA, Buckner FS, Van Voorhis WC, Mehlin C, Boni E, Earnest TN, DeTitta G, Luft J, Lauricella A, Anderson L, Kalyuzhniy O, Zucker F, Schoenfeld LW, Hol WGJ, Merritt EA: Structure of the conserved hypothetical protein MAL13P1.257 from Plasmodium falciparum. Acta Crystallographica Section F 2006, 62(3):180–185.
- Hu M, Li P, Li M, Li W, Yao T, Wu JW, Gu W, Cohen RE, Shi Y: Crystal structure of a UBP-family deubiquitinating enzyme in isolation and in complex with ubiquitin aldehyde. Cell 2002, 111(7):1041–1054. 10.1016/S0092-8674(02)01199-6
- Renatus M, Parrado SG, D'Arcy A, Eidhoff U, Gerhartz B, Hassiepen U, Pierrat B, Riedl R, Vinzenz D, Worpenberg S, Kroemer M: Structural basis of ubiquitin recognition by the deubiquitinating protease USP2. Structure 2006, 14(8):1293–1302. 10.1016/j.str.2006.06.012
- Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, Fagan P, Marvin J, Padilla D, Ravichandran V, Schneider B, Thanki N, Weissig H, Westbrook JD, Zardecki C: The Protein Data Bank. Acta Cryst 2002, D58: 899–907.
- Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
- Rost B, Sander C: Improved prediction of protein secondary structure by use of sequence profiles and neural networks. Proc Natl Acad Sci USA 1993, 90(16):7558–7562. 10.1073/pnas.90.16.7558
- Cortes C, Vapnik V: Support Vector Networks. Machine Learning 1995, 20: 1–25.
- Schölkopf B, Smola A: Learning with Kernels. Cambridge, MA: The MIT Press; 2002.
- Shawe-Taylor J, Cristianini N: Kernel methods for pattern analysis. Cambridge Univ. Press; 2004.
- Platt J: Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers. Edited by: Smola A, Bartlett P, Schölkopf B, Schuurmans D. MIT Press; 2000.
- Sali A, Blundell T: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993, 234: 779–815. 10.1006/jmbi.1993.1626
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673–4680. 10.1093/nar/22.22.4673
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.