Investigation and identification of protein γ-glutamyl carboxylation sites
© Lee et al; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
Skip to main content
© Lee et al; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
Carboxylation is a modification of glutamate (Glu) residues which occurs post-translation that is catalyzed by γ-glutamyl carboxylase in the lumen of the endoplasmic reticulum. Vitamin K is a critical co-factor in the post-translational conversion of Glu residues to γ-carboxyglutamate (Gla) residues. It has been shown that the process of carboxylation is involved in the blood clotting cascade, bone growth, and extraosseous calcification. However, studies in this field have been limited by the difficulty of experimentally studying substrate site specificity in γ-glutamyl carboxylation. In silico investigations have the potential for characterizing carboxylated sites before experiments are carried out.
Because of the importance of γ-glutamyl carboxylation in biological mechanisms, this study investigates the substrate site specificity in carboxylation sites. It considers not only the composition of amino acids that surround carboxylation sites, but also the structural characteristics of these sites, including secondary structure and solvent-accessible surface area (ASA). The explored features are used to establish a predictive model for differentiating between carboxylation sites and non-carboxylation sites. A support vector machine (SVM) is employed to establish a predictive model with various features. A five-fold cross-validation evaluation reveals that the SVM model, trained with the combined features of positional weighted matrix (PWM), amino acid composition (AAC), and ASA, yields the highest accuracy (0.892). Furthermore, an independent testing set is constructed to evaluate whether the predictive model is over-fitted to the training set.
Independent testing data that did not undergo the cross-validation process shows that the proposed model can differentiate between carboxylation sites and non-carboxylation sites. This investigation is the first to study carboxylation sites and to develop a system for identifying them. The proposed method is a practical means of preliminary analysis and greatly diminishes the total number of potential carboxylation sites requiring further experimental confirmation.
Carboxylation is a post-translational modification (PTM) of glutamate (Glu) residues in proteins that is primarily involved in the blood clotting cascade specifically occurring in factors II, VII, IX, and X, protein C, protein S, as well in some bone proteins [1, 2]. Vitamin K is a critical cofactor in the post-translational conversion of Glu residues to γ-carboxyglutamate (Gla) residues . Carboxylation is catalyzed by γ-glutamyl carboxylase  and proceeds in the lumen of the endoplasmic reticulum . The vitamin K-dependent carboxylase transforms Glu to Gla when carbon dioxide (CO2) is added at the γ-position in the presence of oxygen (O2) and reduced vitamin K . The carboxylated proteins can be activated when Gla domain binds Ca2+. Since Glu is a weak Ca2+ chelator and Gla is a much stronger one, the transformation by the vitamin K-dependent carboxylase greatly increases the Ca2+-binding capacity of a protein . Studies conducted over the last few years have revealed that the γ-glutamyl carboxylated proteins in vertebrates can be categorized into three main groups . The first group comprises the carboxylated proteins with an amino terminal Gla domain, and includes vitamin K-dependent blood coagulation factors and co-regulators of blood coagulation . The second group is composed of osteocalcin and matrix Gla protein (MGP) [10, 11], and includes three and five Gla residues, respectively, which are critical to the regulation of bone growth and extraosseous calcification [12, 13]. The third group is the γ-glutamyl carboxylase, itself, which includes Gla residues .
Morris et al. used mass spectrometry to reveal the processive behavior of vitamin K-dependent carboxylation wherein multiple carboxylations occur during a single substrate binding event . The efficient carboxylation of native substrates depends on the binding of a conserved region to the carboxylase with a submicromolar affinity constant [15, 16]. Owing to the biological importance of γ-glutamyl carboxylation, more attention has been paid to mass spectrometric analyses in order to identify experimentally confirmed carboxylation sites. Nevertheless, in vitro identification of protein carboxylation sites requires a very large amount of time and effort. Meanwhile, in silico methods have the potential to characterize carboxylated sites before experiments are conducted. Additionally, in silico identification presents a more feasible means of preliminary analysis with the potential to significantly diminish the number of potential carboxylation sites requiring further experimental verification.
Given the importance of γ-glutamyl carboxylation in biological mechanisms, this work focuses on investigating the substrate site specificity of carboxylation sites. The experimentally validated γ-glutamyl carboxylations were mainly gathered from UniProtKB, a universal protein resource . Numerous experimental carboxylation sites in humans have been taken from HPRD , which contains curated data that the authors manually extracted from the PubMed literature database. Amino acid side chains that undergo post-translational modification tend to be accessible on protein surfaces . Therefore, this investigation not only examines the amino acid composition around carboxylation sites, but also considers structural characteristics, including solvent-accessible surface area (ASA) and secondary structure. In order to characterize the structural properties of the tertiary structures of carboxylated proteins, experimentally identified carboxylation sites are mapped to its corresponding positions from the protein sequences of Protein Data Bank (PDB) . However, most of the collected carboxylation sites lack the corresponding PDB tertiary structures. Accordingly, two tools, RVP-Net [21, 22] and PSIPRED , are utilized to calculate the ASA values and the secondary structures of amino acids in protein sequences, respectively.
Each feature is examined to evaluate its capacity to differentiate carboxylation sites from non-carboxylation sites. A support vector machine (SVM) is utilized to establish a predictive model with various features, including the positional weighted matrix of amino acids, amino acid composition, ASA, and secondary structure. Five-fold cross-validation is utilized to test the predictive performance of the generated SVM model. To prevent the generated model from an overestimation of predictive power, homologous sequences are eliminated by applying a window length of 2n+1 to carboxylation sites. After building and evaluating the model, the selected models providing the best accuracy are further tested using a data set of independent testing. The independent test set, which is not included in the training set, is then adopted to evaluate whether the selected model is over-fitting to the training data. Finally, the training features and window length providing the highest predictive accuracy for the model are used to construct a web-based system for identifying γ-glutamyl carboxylation.
A comprehensive PTM resource dbPTM , which collects PTM data from release 15.0 of UniProtKB  and release 8.0 of HPRD , comprises of 1123 Gla residues in 182 protein entries from multiple organisms. After the non-experimental sites, annotated as “by similarity”, “potential” and “probable” in the “MOD_RES” fields of UniProtKB, have been removed, 463 experimental carboxylation sites from 134 carboxylated proteins are obtained. It is observed that the carboxylation site occurs on Glu residues. In this investigation, all carboxylated Glu residues are regarded as the positive set. On the other hand, Glu residues, found in the experimental carboxylated proteins, which are not annotated as carboxylation sites are regarded as the negative set. Consequently, a total of 854 non-carboxylated Glu sites are defined as negative set. This work aims to investigate the structural characteristics of protein carboxylation sites; with reference to a previous work , features such as amino acid composition, accessible surface area (ASA), and secondary structure are explored. The extracted features are used to construct a predictive model and are evaluated in terms of its ability to differentiate carboxylation sites from non-carboxylation sites.
To avoid an overestimation in the cross-validation, the removal of homologous sequences in the positive data is done by using a window size of 2n+1 centered on the carboxylation site. With reference to the method of homology reduction in N-Ace , two carboxylated proteins having more than 30% sequence similarity are regarded as homologous proteins. Then, for every two homologous protein sequences, BL2SEQ  is applied to re-align the fragment sequences with 2n+1 amino acids centered on the carboxylated sites. For two fragment sequences having a sequence similarity higher than 50%, only one fragment sequence is kept while the other is removed the positive set. The same approach is also applied to generate the non-homologous negative set.
Statistics of experimentally verified carboxylation sites in training data and independent testing data.
All data (UniProt release 15.0 and HPRD 8.0)
Training data (non-homologous)
Independent testing data (non-homologous)
Number of carboxylated proteins
Number of carboxylated glutamate residues
Number of non-carboxylated glutamate residues
This work not only investigates the composition of amino acids that surround carboxylation sites, but also takes the solvent accessible surface area (ASA), and secondary structure (SS) into account. A window size of 2n+1 is utilized to extract fragment sequences from positive and negative sets of training data. Various values of n changing from four to ten are applied to decide the optimal window length. The amino acid composition (AAC) , which refers to the relative frequencies of twenty amino acids in a given window length, is regarded as the elementary feature in the investigation of carboxylation sites . A vector of 20 elements for amino acid composition specifies the number of occurrences of the twenty amino acids normalized by the total number of residues in a fragmented sequence with window length 2n+1. To investigate the position-specific amino acid composition around the carboxylation sites, a positional weighted matrix (PWM) is determined using non-homologous positive set of training data . The PWM specifies the occurring frequency of amino acids in each position of a fragment. A matrix of (2n+1)×m elements, where 2n+1 refers to the window length and m contains 21 elements that stands for the 20 amino acids and one terminal signal, is referred to in order to encode each fragment sequence in the training data.
The amino acids that undergoes post-translational modification were reported to be exposed on the surface of a protein . Thus, the accessible surface area (ASA) surrounding the carboxylation sites is considered. Due to the fact that almost all of the experimental carboxylated proteins do not contain a corresponding PDB tertiary structure, RVP-Net [21, 22] is utilized to calculate the ASA value, which is the percentage of the solvent-accessible area of each amino acid on a protein sequence. RVP-net is a prediction tool to determine the value of residual ASAs using neighborhood information and yields a mean absolute difference of 18.0 – 19.5% between the predicted and experimental values of ASA . In this work, the full-length sequences of carboxylated proteins are submitted to RVP-Net to calculate the ASA value for all amino acids. The ASA values of the amino acids surrounding the carboxylation site are normalized to zero to one.
In investigating secondary structures surrounding carboxylation sites, PSIPRED  was utilized to predict the secondary structure from a given protein sequence. PSIPRED applied two feed-forward neural networks to predict secondary structure using the results from PSI-BLAST (Position Specific Iterated - BLAST) . PSIPRED 2.0 has been reported as the top out of 20 evaluated methods by achieving a mean Q3 score of 80.6% for a test data containing 40 domains which have no significant similarity to PDB structures . The full-length sequences of carboxylated proteins are submitted to PSIPRED to obtain the secondary structure of all amino acids. The resulted data of PSIPRED is encoded in terms of “H” for helix, “E” for sheet, and “C” for coil. In order to transform the three terms into numeric vectors, a three-dimensional binary vector is applied: helix (H) is encoded as “100,” sheet (E) is encoded as “010,” and coil (C) is encoded as “001”.
In this study, support vector machine (SVM) is employed in order to create predictive models that utilize the explored features. In terms of binary classification, SVM adopts a kernel function to map samples into a higher dimensional space and subsequently determines a hyper-plane for effectively discriminating between the two classes of samples with a maximum margin and a minimum inaccuracy. LibSVM  is employed to generate a binary prediction model using the positive and negative training sets. The kernel function of SVM is radial basis function (RBF): . Moreover, two SVM parameters, cost and gamma value, are tuned to yield a highest accuracy.
where TP, FN, TN, and FP represent the numbers of true positives, false negatives, true negatives, and false positives, respectively. Since a two-class classification is of very different sizes, the Matthews correlation coefficient (MCC) is usually regarded as a balanced measure which takes into account the true and false positives and negatives . The range of MCC values is between -1 and +1: a measure of +1 stands for a perfect prediction, 0 is a random prediction and -1 represents an inverse prediction. Furthermore, the parameters of a predictive model, such as window length, cost value of SVM, and gamma value of SVM, are optimized to achieve a highest accuracy. Finally, the SVM model yielding the highest accuracy is selected for further independent testing.
To compare amino acid compositions between positive and negative data, the web-based tool, TwoSampleLogo , was utilized to detect significant differences in terms of the position-specific symbol compositions between two sets of multiple sequence alignments. Figure 1B shows the position-specific differences in amino acid composition between carboxylation sites (302 sequences) and non-carboxylation sites (567 sequences). With a 15-mer window length (from -7 to +7), the most prominent feature of carboxylation sites is the abundant and uniform Glu residue (negatively charged amino acid) that surrounds positions -7, -6, -4, -3, -1, +1, +3, +4, +6 and +7. Another interesting feature is the slight enrichment of arginine (positively charged amino acid) at positions -7, -5, -4, -1, and +3. The distant amino acids in the primary sequence, which may be immediately adjacent to the Gla residue in a three-dimensional structure, differ significantly between carboxylation and non-carboxylation sites. Also notable is the depleted Glu residue at flanking positions, particularly positions -2 and +2, which are close to the carboxylation sites.
Cross-validation performance of the predictive models trained with various features.
Positional Weighted Matrix of flanking Amino Acids (AA_PWM)
Amino Acid Composition (AAC)
Accessible Surface Area (ASA)
Secondary structure (SS)
AA_PWM + AAC
AA_PWM + ASA
AA_PWM + SS
AA_PWM + AAC + ASA
AA_PWM + AAC + SS
AA_PWM + AAC + ASA + SS
Additionally, the predictive power of the model that is trained using a hybrid combination of AAC, AA_PWM, ASA, and SS is evaluated. With respect to the performance achieved using individual features, AA_PWM is crucial for training a model with other individual features. The model trained using the combination of AA_PWM and ASA substantially outperforms those trained with other combinations but slightly outperforms one that is trained with AA_PWM alone. Overall, the model trained with AA_PWM, AAC and ASA has the best accuracy. The predictive precision, sensitivity, specificity, accuracy, and MCC of the best model are 0.836, 0.860, 0.910, 0.892, and 0.765, respectively. However, some of the models that are trained with ASA or SS do not outperform that trained with AA_PWM alone. In conclusion, five-fold cross-validation indicates that the model that is trained using a combination of AA_PWM, AAC and ASA performs best, and is therefore adopted in further independent testing.
Comparison of performances between cross-validation and independent testing.
Five-fold cross validation
Number of positive data
Number of negative data
Matthews Correlation Coefficient
Statistics of InterPro functional annotations in 134 carboxylated proteins.
Number of carboxylated proteins
Gamma-carboxyglutamic acid-rich (GLA) domain
Coagulation factor, Gla domain
Peptidase S1/S6, chymotrypsin/Hap
Serine/cysteine peptidase, trypsin-like
Peptidase S1A, chymotrypsin
EGF-like region, conserved site
EGF-like, type 3
EGF-type aspartate/asparagine hydroxylation site
Bone matrix, Gla protein
Peptidase S1/S6, chymotrypsin/Hap, active site
EGF-like calcium-binding, conserved site
Peptidase S1A, coagulation factor VII/IX/X/C/Z
Although the importance of carboxylation in the blood clotting cascade, bone growth and extraosseous calcification has been established, studies on carboxylation are still subject to technical limitations - especially when the concern is regarding substrate site specificity in γ-glutamyl carboxylation. Following the collection of experimentally verified carboxylation sites from UniProtKB and HPRD, 463 experimentally verified carboxylation sites have been identified in 134 carboxylated proteins. In this investigation, after the construction of a training data set and an independent test data set, the composition of the flanking amino acids among the training data is studied. The analysis of position-specific differences in amino acid composition reveals that the most prominent feature is the abundance of a uniform residue of glutamate around carboxylation sites. Another important characteristic is the slight increase in the abundance of arginine (positively charged amino acid) at positions -7, -5, -4, -1, and +3. This investigation indicates that the distant amino acids in the primary sequence, which may be immediately adjacent to the γ-carboxyglutamate (Gla) residue in a three-dimensional structure, differ notably between carboxylation sites and non-carboxylation sites.
Structural characteristics such as solvent-accessible surface area and secondary structure are also investigated. In the study of ASA curves, the region from -7 to -4 exhibits notable differences between carboxylation sites and non-carboxylation sites. The mean percentage of ASA values on the carboxylated glutamate residues is 37.8% which shows that it is greatly exposed to the solvent. The analysis of the secondary structure shows that the flanking regions of the carboxylated sites are likely to be present on the coil (loop) and helix structures. The five-fold cross-validation evaluation demonstrates that the SVM model trained using the combined features of positional weighted matrix, amino acid composition, and accessible surface area has the highest accuracy. Additionally, independent testing reveals that the proposed model can differentiate carboxylation sites from non-carboxylation sites when the data are truly blind to the cross-validation process.
Although the independent tests verify that the proposed method performs accurately and robustly, some issues warrant further investigation. First, the evolutionary conservation of carboxylation sites should be examined. The analysis of functional domains (See Supplementary Information) shows that carboxylated proteins can be categorized into three main functional groups: Gla domain, peptidase, and EGF-like proteins. However, carboxylated proteins with the same functional domain may be homologous. The carboxylation sites of homologous proteins should exhibit evolutionary conservation. Second, the structural affinities of carboxylation sites, particularly when flanking amino acids are not conserved, should be studied in greater detail [48, 49]. The intrinsic disordered regions, B-factor, protein linker region, and other factors should also be explored at the vicinity of carboxylation sites in PDB entries. Finally, the biological function of carboxylated proteins should be investigated. Analysis of gene ontology , the occurrence of other PTMs, and the network context of protein-protein interaction  may offer a clue to the functions of carboxylated proteins.
The authors sincerely appreciate the National Science Council of the Republic of China for financially supporting this research under Contract Numbers of NSC 99-2320-B-155-001 and NSC 100-2221-E-155-079.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 13, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S13.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.