- Open Access
Investigation and identification of protein γ-glutamyl carboxylation sites
- Tzong-Yi Lee1Email author,
- Cheng-Tsung Lu†1,
- Shu-An Chen†1,
- Neil Arvin Bretaña†1,
- Tzu-Hsiu Cheng1,
- Min-Gang Su1 and
- Kai-Yao Huang1
© Lee et al; licensee BioMed Central Ltd. 2011
- Published: 30 November 2011
Carboxylation is a modification of glutamate (Glu) residues which occurs post-translation that is catalyzed by γ-glutamyl carboxylase in the lumen of the endoplasmic reticulum. Vitamin K is a critical co-factor in the post-translational conversion of Glu residues to γ-carboxyglutamate (Gla) residues. It has been shown that the process of carboxylation is involved in the blood clotting cascade, bone growth, and extraosseous calcification. However, studies in this field have been limited by the difficulty of experimentally studying substrate site specificity in γ-glutamyl carboxylation. In silico investigations have the potential for characterizing carboxylated sites before experiments are carried out.
Because of the importance of γ-glutamyl carboxylation in biological mechanisms, this study investigates the substrate site specificity in carboxylation sites. It considers not only the composition of amino acids that surround carboxylation sites, but also the structural characteristics of these sites, including secondary structure and solvent-accessible surface area (ASA). The explored features are used to establish a predictive model for differentiating between carboxylation sites and non-carboxylation sites. A support vector machine (SVM) is employed to establish a predictive model with various features. A five-fold cross-validation evaluation reveals that the SVM model, trained with the combined features of positional weighted matrix (PWM), amino acid composition (AAC), and ASA, yields the highest accuracy (0.892). Furthermore, an independent testing set is constructed to evaluate whether the predictive model is over-fitted to the training set.
Independent testing data that did not undergo the cross-validation process shows that the proposed model can differentiate between carboxylation sites and non-carboxylation sites. This investigation is the first to study carboxylation sites and to develop a system for identifying them. The proposed method is a practical means of preliminary analysis and greatly diminishes the total number of potential carboxylation sites requiring further experimental confirmation.
- Amino Acid Composition
- Support Vector Machine Model
- Accessible Surface Area
- Matthews Correlation Coefficient
- Carboxylation Site
Carboxylation is a post-translational modification (PTM) of glutamate (Glu) residues in proteins that is primarily involved in the blood clotting cascade specifically occurring in factors II, VII, IX, and X, protein C, protein S, as well in some bone proteins [1, 2]. Vitamin K is a critical cofactor in the post-translational conversion of Glu residues to γ-carboxyglutamate (Gla) residues . Carboxylation is catalyzed by γ-glutamyl carboxylase  and proceeds in the lumen of the endoplasmic reticulum . The vitamin K-dependent carboxylase transforms Glu to Gla when carbon dioxide (CO2) is added at the γ-position in the presence of oxygen (O2) and reduced vitamin K . The carboxylated proteins can be activated when Gla domain binds Ca2+. Since Glu is a weak Ca2+ chelator and Gla is a much stronger one, the transformation by the vitamin K-dependent carboxylase greatly increases the Ca2+-binding capacity of a protein . Studies conducted over the last few years have revealed that the γ-glutamyl carboxylated proteins in vertebrates can be categorized into three main groups . The first group comprises the carboxylated proteins with an amino terminal Gla domain, and includes vitamin K-dependent blood coagulation factors and co-regulators of blood coagulation . The second group is composed of osteocalcin and matrix Gla protein (MGP) [10, 11], and includes three and five Gla residues, respectively, which are critical to the regulation of bone growth and extraosseous calcification [12, 13]. The third group is the γ-glutamyl carboxylase, itself, which includes Gla residues .
Morris et al. used mass spectrometry to reveal the processive behavior of vitamin K-dependent carboxylation wherein multiple carboxylations occur during a single substrate binding event . The efficient carboxylation of native substrates depends on the binding of a conserved region to the carboxylase with a submicromolar affinity constant [15, 16]. Owing to the biological importance of γ-glutamyl carboxylation, more attention has been paid to mass spectrometric analyses in order to identify experimentally confirmed carboxylation sites. Nevertheless, in vitro identification of protein carboxylation sites requires a very large amount of time and effort. Meanwhile, in silico methods have the potential to characterize carboxylated sites before experiments are conducted. Additionally, in silico identification presents a more feasible means of preliminary analysis with the potential to significantly diminish the number of potential carboxylation sites requiring further experimental verification.
Given the importance of γ-glutamyl carboxylation in biological mechanisms, this work focuses on investigating the substrate site specificity of carboxylation sites. The experimentally validated γ-glutamyl carboxylations were mainly gathered from UniProtKB, a universal protein resource . Numerous experimental carboxylation sites in humans have been taken from HPRD , which contains curated data that the authors manually extracted from the PubMed literature database. Amino acid side chains that undergo post-translational modification tend to be accessible on protein surfaces . Therefore, this investigation not only examines the amino acid composition around carboxylation sites, but also considers structural characteristics, including solvent-accessible surface area (ASA) and secondary structure. In order to characterize the structural properties of the tertiary structures of carboxylated proteins, experimentally identified carboxylation sites are mapped to its corresponding positions from the protein sequences of Protein Data Bank (PDB) . However, most of the collected carboxylation sites lack the corresponding PDB tertiary structures. Accordingly, two tools, RVP-Net [21, 22] and PSIPRED , are utilized to calculate the ASA values and the secondary structures of amino acids in protein sequences, respectively.
Each feature is examined to evaluate its capacity to differentiate carboxylation sites from non-carboxylation sites. A support vector machine (SVM) is utilized to establish a predictive model with various features, including the positional weighted matrix of amino acids, amino acid composition, ASA, and secondary structure. Five-fold cross-validation is utilized to test the predictive performance of the generated SVM model. To prevent the generated model from an overestimation of predictive power, homologous sequences are eliminated by applying a window length of 2n+1 to carboxylation sites. After building and evaluating the model, the selected models providing the best accuracy are further tested using a data set of independent testing. The independent test set, which is not included in the training set, is then adopted to evaluate whether the selected model is over-fitting to the training data. Finally, the training features and window length providing the highest predictive accuracy for the model are used to construct a web-based system for identifying γ-glutamyl carboxylation.
Data collection and preprocessing
A comprehensive PTM resource dbPTM , which collects PTM data from release 15.0 of UniProtKB  and release 8.0 of HPRD , comprises of 1123 Gla residues in 182 protein entries from multiple organisms. After the non-experimental sites, annotated as “by similarity”, “potential” and “probable” in the “MOD_RES” fields of UniProtKB, have been removed, 463 experimental carboxylation sites from 134 carboxylated proteins are obtained. It is observed that the carboxylation site occurs on Glu residues. In this investigation, all carboxylated Glu residues are regarded as the positive set. On the other hand, Glu residues, found in the experimental carboxylated proteins, which are not annotated as carboxylation sites are regarded as the negative set. Consequently, a total of 854 non-carboxylated Glu sites are defined as negative set. This work aims to investigate the structural characteristics of protein carboxylation sites; with reference to a previous work , features such as amino acid composition, accessible surface area (ASA), and secondary structure are explored. The extracted features are used to construct a predictive model and are evaluated in terms of its ability to differentiate carboxylation sites from non-carboxylation sites.
To avoid an overestimation in the cross-validation, the removal of homologous sequences in the positive data is done by using a window size of 2n+1 centered on the carboxylation site. With reference to the method of homology reduction in N-Ace , two carboxylated proteins having more than 30% sequence similarity are regarded as homologous proteins. Then, for every two homologous protein sequences, BL2SEQ  is applied to re-align the fragment sequences with 2n+1 amino acids centered on the carboxylated sites. For two fragment sequences having a sequence similarity higher than 50%, only one fragment sequence is kept while the other is removed the positive set. The same approach is also applied to generate the non-homologous negative set.
Statistics of experimentally verified carboxylation sites in training data and independent testing data.
All data (UniProt release 15.0 and HPRD 8.0)
Training data (non-homologous)
Independent testing data (non-homologous)
Number of carboxylated proteins
Number of carboxylated glutamate residues
Number of non-carboxylated glutamate residues
This work not only investigates the composition of amino acids that surround carboxylation sites, but also takes the solvent accessible surface area (ASA), and secondary structure (SS) into account. A window size of 2n+1 is utilized to extract fragment sequences from positive and negative sets of training data. Various values of n changing from four to ten are applied to decide the optimal window length. The amino acid composition (AAC) , which refers to the relative frequencies of twenty amino acids in a given window length, is regarded as the elementary feature in the investigation of carboxylation sites . A vector of 20 elements for amino acid composition specifies the number of occurrences of the twenty amino acids normalized by the total number of residues in a fragmented sequence with window length 2n+1. To investigate the position-specific amino acid composition around the carboxylation sites, a positional weighted matrix (PWM) is determined using non-homologous positive set of training data . The PWM specifies the occurring frequency of amino acids in each position of a fragment. A matrix of (2n+1)×m elements, where 2n+1 refers to the window length and m contains 21 elements that stands for the 20 amino acids and one terminal signal, is referred to in order to encode each fragment sequence in the training data.
The amino acids that undergoes post-translational modification were reported to be exposed on the surface of a protein . Thus, the accessible surface area (ASA) surrounding the carboxylation sites is considered. Due to the fact that almost all of the experimental carboxylated proteins do not contain a corresponding PDB tertiary structure, RVP-Net [21, 22] is utilized to calculate the ASA value, which is the percentage of the solvent-accessible area of each amino acid on a protein sequence. RVP-net is a prediction tool to determine the value of residual ASAs using neighborhood information and yields a mean absolute difference of 18.0 – 19.5% between the predicted and experimental values of ASA . In this work, the full-length sequences of carboxylated proteins are submitted to RVP-Net to calculate the ASA value for all amino acids. The ASA values of the amino acids surrounding the carboxylation site are normalized to zero to one.
In investigating secondary structures surrounding carboxylation sites, PSIPRED  was utilized to predict the secondary structure from a given protein sequence. PSIPRED applied two feed-forward neural networks to predict secondary structure using the results from PSI-BLAST (Position Specific Iterated - BLAST) . PSIPRED 2.0 has been reported as the top out of 20 evaluated methods by achieving a mean Q3 score of 80.6% for a test data containing 40 domains which have no significant similarity to PDB structures . The full-length sequences of carboxylated proteins are submitted to PSIPRED to obtain the secondary structure of all amino acids. The resulted data of PSIPRED is encoded in terms of “H” for helix, “E” for sheet, and “C” for coil. In order to transform the three terms into numeric vectors, a three-dimensional binary vector is applied: helix (H) is encoded as “100,” sheet (E) is encoded as “010,” and coil (C) is encoded as “001”.
Model learning and evaluation
In this study, support vector machine (SVM) is employed in order to create predictive models that utilize the explored features. In terms of binary classification, SVM adopts a kernel function to map samples into a higher dimensional space and subsequently determines a hyper-plane for effectively discriminating between the two classes of samples with a maximum margin and a minimum inaccuracy. LibSVM  is employed to generate a binary prediction model using the positive and negative training sets. The kernel function of SVM is radial basis function (RBF): . Moreover, two SVM parameters, cost and gamma value, are tuned to yield a highest accuracy.
where TP, FN, TN, and FP represent the numbers of true positives, false negatives, true negatives, and false positives, respectively. Since a two-class classification is of very different sizes, the Matthews correlation coefficient (MCC) is usually regarded as a balanced measure which takes into account the true and false positives and negatives . The range of MCC values is between -1 and +1: a measure of +1 stands for a perfect prediction, 0 is a random prediction and -1 represents an inverse prediction. Furthermore, the parameters of a predictive model, such as window length, cost value of SVM, and gamma value of SVM, are optimized to achieve a highest accuracy. Finally, the SVM model yielding the highest accuracy is selected for further independent testing.
Amino acid composition at the vicinity of carboxylation sites
To compare amino acid compositions between positive and negative data, the web-based tool, TwoSampleLogo , was utilized to detect significant differences in terms of the position-specific symbol compositions between two sets of multiple sequence alignments. Figure 1B shows the position-specific differences in amino acid composition between carboxylation sites (302 sequences) and non-carboxylation sites (567 sequences). With a 15-mer window length (from -7 to +7), the most prominent feature of carboxylation sites is the abundant and uniform Glu residue (negatively charged amino acid) that surrounds positions -7, -6, -4, -3, -1, +1, +3, +4, +6 and +7. Another interesting feature is the slight enrichment of arginine (positively charged amino acid) at positions -7, -5, -4, -1, and +3. The distant amino acids in the primary sequence, which may be immediately adjacent to the Gla residue in a three-dimensional structure, differ significantly between carboxylation and non-carboxylation sites. Also notable is the depleted Glu residue at flanking positions, particularly positions -2 and +2, which are close to the carboxylation sites.
Structural characteristics at carboxylation sites
Determination of optimal window size based on amino acid composition
Evaluation of the effectiveness of studied features in identifying carboxylation sites
Cross-validation performance of the predictive models trained with various features.
Positional Weighted Matrix of flanking Amino Acids (AA_PWM)
Amino Acid Composition (AAC)
Accessible Surface Area (ASA)
Secondary structure (SS)
AA_PWM + AAC
AA_PWM + ASA
AA_PWM + SS
AA_PWM + AAC + ASA
AA_PWM + AAC + SS
AA_PWM + AAC + ASA + SS
Additionally, the predictive power of the model that is trained using a hybrid combination of AAC, AA_PWM, ASA, and SS is evaluated. With respect to the performance achieved using individual features, AA_PWM is crucial for training a model with other individual features. The model trained using the combination of AA_PWM and ASA substantially outperforms those trained with other combinations but slightly outperforms one that is trained with AA_PWM alone. Overall, the model trained with AA_PWM, AAC and ASA has the best accuracy. The predictive precision, sensitivity, specificity, accuracy, and MCC of the best model are 0.836, 0.860, 0.910, 0.892, and 0.765, respectively. However, some of the models that are trained with ASA or SS do not outperform that trained with AA_PWM alone. In conclusion, five-fold cross-validation indicates that the model that is trained using a combination of AA_PWM, AAC and ASA performs best, and is therefore adopted in further independent testing.
Evaluation of γ-glutamyl carboxylation predictive model using independent test set
Comparison of performances between cross-validation and independent testing.
Five-fold cross validation
Number of positive data
Number of negative data
Matthews Correlation Coefficient
Investigation of functional domains in carboxylated proteins
Statistics of InterPro functional annotations in 134 carboxylated proteins.
Number of carboxylated proteins
Gamma-carboxyglutamic acid-rich (GLA) domain
Coagulation factor, Gla domain
Peptidase S1/S6, chymotrypsin/Hap
Serine/cysteine peptidase, trypsin-like
Peptidase S1A, chymotrypsin
EGF-like region, conserved site
EGF-like, type 3
EGF-type aspartate/asparagine hydroxylation site
Bone matrix, Gla protein
Peptidase S1/S6, chymotrypsin/Hap, active site
EGF-like calcium-binding, conserved site
Peptidase S1A, coagulation factor VII/IX/X/C/Z
Although the importance of carboxylation in the blood clotting cascade, bone growth and extraosseous calcification has been established, studies on carboxylation are still subject to technical limitations - especially when the concern is regarding substrate site specificity in γ-glutamyl carboxylation. Following the collection of experimentally verified carboxylation sites from UniProtKB and HPRD, 463 experimentally verified carboxylation sites have been identified in 134 carboxylated proteins. In this investigation, after the construction of a training data set and an independent test data set, the composition of the flanking amino acids among the training data is studied. The analysis of position-specific differences in amino acid composition reveals that the most prominent feature is the abundance of a uniform residue of glutamate around carboxylation sites. Another important characteristic is the slight increase in the abundance of arginine (positively charged amino acid) at positions -7, -5, -4, -1, and +3. This investigation indicates that the distant amino acids in the primary sequence, which may be immediately adjacent to the γ-carboxyglutamate (Gla) residue in a three-dimensional structure, differ notably between carboxylation sites and non-carboxylation sites.
Structural characteristics such as solvent-accessible surface area and secondary structure are also investigated. In the study of ASA curves, the region from -7 to -4 exhibits notable differences between carboxylation sites and non-carboxylation sites. The mean percentage of ASA values on the carboxylated glutamate residues is 37.8% which shows that it is greatly exposed to the solvent. The analysis of the secondary structure shows that the flanking regions of the carboxylated sites are likely to be present on the coil (loop) and helix structures. The five-fold cross-validation evaluation demonstrates that the SVM model trained using the combined features of positional weighted matrix, amino acid composition, and accessible surface area has the highest accuracy. Additionally, independent testing reveals that the proposed model can differentiate carboxylation sites from non-carboxylation sites when the data are truly blind to the cross-validation process.
Although the independent tests verify that the proposed method performs accurately and robustly, some issues warrant further investigation. First, the evolutionary conservation of carboxylation sites should be examined. The analysis of functional domains (See Supplementary Information) shows that carboxylated proteins can be categorized into three main functional groups: Gla domain, peptidase, and EGF-like proteins. However, carboxylated proteins with the same functional domain may be homologous. The carboxylation sites of homologous proteins should exhibit evolutionary conservation. Second, the structural affinities of carboxylation sites, particularly when flanking amino acids are not conserved, should be studied in greater detail [48, 49]. The intrinsic disordered regions, B-factor, protein linker region, and other factors should also be explored at the vicinity of carboxylation sites in PDB entries. Finally, the biological function of carboxylated proteins should be investigated. Analysis of gene ontology , the occurrence of other PTMs, and the network context of protein-protein interaction  may offer a clue to the functions of carboxylated proteins.
The authors sincerely appreciate the National Science Council of the Republic of China for financially supporting this research under Contract Numbers of NSC 99-2320-B-155-001 and NSC 100-2221-E-155-079.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 13, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S13.
- Suttie JW: Vitamin K-dependent carboxylase. Annu Rev Biochem 1985, 54: 459–477. 10.1146/annurev.bi.54.070185.002331View ArticlePubMedGoogle Scholar
- Burnier JP, Borowski M, Furie BC, Furie B: Gamma-carboxyglutamic acid. Mol Cell Biochem 1981, 39: 191–207. 10.1007/BF00232574View ArticlePubMedGoogle Scholar
- Olson RE, Suttie JW: Vitamin K and gamma-carboxyglutamate biosynthesis. Vitam Horm 1977, 35: 59–108.View ArticlePubMedGoogle Scholar
- Wu SM, Cheung WF, Frazier D, Stafford DW: Cloning and expression of the cDNA for human gamma-glutamyl carboxylase. Science 1991, 254(5038):1634–1636. 10.1126/science.1749935View ArticlePubMedGoogle Scholar
- Bristol JA, Ratcliffe JV, Roth DA, Jacobs MA, Furie BC, Furie B: Biosynthesis of prothrombin: intracellular localization of the vitamin K-dependent carboxylase and the sites of gamma-carboxylation. Blood 1996, 88(7):2585–2593.PubMedGoogle Scholar
- Morris DP, Stevens RD, Wright DJ, Stafford DW: Processive post-translational modification. Vitamin K-dependent carboxylation of a peptide substrate. The Journal of biological chemistry 1995, 270(51):30491–30498. 10.1074/jbc.270.51.30491View ArticlePubMedGoogle Scholar
- Vermeer C: Gamma-carboxyglutamate-containing proteins and the vitamin K-dependent carboxylase. Biochem J 1990, 266(3):625–636.PubMed CentralView ArticlePubMedGoogle Scholar
- Bandyopadhyay PK, Garrett JE, Shetty RP, Keate T, Walker CS, Olivera BM: Gamma-glutamyl carboxylation: an extracellular posttranslational modification that antedates the divergence of molluscs, arthropods, and chordates. Proc Natl Acad Sci U S A 2002, 99(3):1264–1269. 10.1073/pnas.022637099PubMed CentralView ArticlePubMedGoogle Scholar
- Kulman JD, Harris JE, Xie L, Davie EW: Identification of two novel transmembrane gamma-carboxyglutamic acid proteins expressed broadly in fetal and adult tissues. Proc Natl Acad Sci U S A 2001, 98(4):1370–1375. 10.1073/pnas.98.4.1370PubMed CentralView ArticlePubMedGoogle Scholar
- Price PA, Urist MR, Otawara Y: Matrix Gla protein, a new gamma-carboxyglutamic acid-containing protein which is associated with the organic matrix of bone. Biochemical and biophysical research communications 1983, 117(3):765–771. 10.1016/0006-291X(83)91663-7View ArticlePubMedGoogle Scholar
- Price PA, Poser JW, Raman N: Primary structure of the gamma-carboxyglutamic acid-containing protein from bovine bone. Proc Natl Acad Sci U S A 1976, 73(10):3374–3375. 10.1073/pnas.73.10.3374PubMed CentralView ArticlePubMedGoogle Scholar
- Luo G, Ducy P, McKee MD, Pinero GJ, Loyer E, Behringer RR, Karsenty G: Spontaneous calcification of arteries and cartilage in mice lacking matrix GLA protein. Nature 1997, 386(6620):78–81. 10.1038/386078a0View ArticlePubMedGoogle Scholar
- Ducy P, Desbois C, Boyce B, Pinero G, Story B, Dunstan C, Smith E, Bonadio J, Goldstein S, Gundberg C, et al.: Increased bone formation in osteocalcin-deficient mice. Nature 1996, 382(6590):448–452. 10.1038/382448a0View ArticlePubMedGoogle Scholar
- Berkner KL, Pudota BN: Vitamin K-dependent carboxylation of the carboxylase. Proc Natl Acad Sci U S A 1998, 95(2):466–471. 10.1073/pnas.95.2.466PubMed CentralView ArticlePubMedGoogle Scholar
- Knobloch JE, Suttie JW: Vitamin K-dependent carboxylase. Control of enzyme activity by the "propeptide" region of factor X. The Journal of biological chemistry 1987, 262(32):15334–15337.PubMedGoogle Scholar
- Pan LC, Price PA: The propeptide of rat bone gamma-carboxyglutamic acid protein shares homology with other vitamin K-dependent protein precursors. Proc Natl Acad Sci U S A 1985, 82(18):6109–6113. 10.1073/pnas.82.18.6109PubMed CentralView ArticlePubMedGoogle Scholar
- Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al.: UniProt: the Universal Protein knowledgebase. Nucleic acids research 2004, 32(Database issue):D115–119.PubMed CentralView ArticlePubMedGoogle Scholar
- Peri S, Navarro JD, Kristiansen TZ, Amanchy R, Surendranath V, Muthusamy B, Gandhi TK, Chandrika KN, Deshpande N, Suresh S, et al.: Human protein reference database as a discovery resource for proteomics. Nucleic acids research 2004, 32(Database issue):D497–501.PubMed CentralView ArticlePubMedGoogle Scholar
- Pang CN, Hayen A, Wilkins MR: Surface accessibility of protein post-translational modifications. J Proteome Res 2007, 6(5):1833–1845. 10.1021/pr060674uView ArticlePubMedGoogle Scholar
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235–242. 10.1093/nar/28.1.235PubMed CentralView ArticlePubMedGoogle Scholar
- Ahmad S, Gromiha MM, Sarai A: RVP-net: online prediction of real valued accessible surface area of proteins from single sequences. Bioinformatics 2003, 19(14):1849–1851. 10.1093/bioinformatics/btg249View ArticlePubMedGoogle Scholar
- Ahmad S, Gromiha MM, Sarai A: Real value prediction of solvent accessibility from amino acid sequence. Proteins 2003, 50(4):629–635. 10.1002/prot.10328View ArticlePubMedGoogle Scholar
- McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics 2000, 16(4):404–405. 10.1093/bioinformatics/16.4.404View ArticlePubMedGoogle Scholar
- Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH: dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res 2006, 34(Database issue):D622–627.PubMed CentralView ArticlePubMedGoogle Scholar
- Shien DM, Lee TY, Chang WC, Hsu JB, Horng JT, Hsu PC, Wang TY, Huang HD: Incorporating structural characteristics for identification of protein methylation sites. J Comput Chem 2009, 30(9):1532–1543. 10.1002/jcc.21232View ArticlePubMedGoogle Scholar
- Lee TY, Hsu JB, Lin FM, Chang WC, Hsu PC, Huang HD: N-Ace: using solvent accessibility and physicochemical properties to identify protein N-acetylation sites. J Comput Chem 2010, 31(15):2759–2771. 10.1002/jcc.21569View ArticlePubMedGoogle Scholar
- Tatusova TA, Madden TL: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 1999, 174(2):247–250. 10.1111/j.1574-6968.1999.tb13575.xView ArticlePubMedGoogle Scholar
- Lee TY, Chen SA, Hung HY, Ou YY: Incorporating distant sequence features and radial basis function networks to identify ubiquitin conjugation sites. PLoS One 2011, 6(3):e17331. 10.1371/journal.pone.0017331PubMed CentralView ArticlePubMedGoogle Scholar
- Wong YH, Lee TY, Liang HK, Huang CM, Wang TY, Yang YH, Chu CH, Huang HD, Ko MT, Hwang JK: KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res 2007, 35(Web Server issue):W588–594.PubMed CentralView ArticlePubMedGoogle Scholar
- Huang HD, Lee TY, Tzeng SW, Horng JT: KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res 2005, 33(Web Server issue):W226–229.PubMed CentralView ArticlePubMedGoogle Scholar
- Dubchak I, Holbrook SR, Kim SH: Prediction of protein folding class from amino acid composition. Proteins 1993, 16(1):79–91. 10.1002/prot.340160109View ArticlePubMedGoogle Scholar
- Chang WC, Lee TY, Shien DM, Hsu JB, Horng JT, Hsu PC, Wang TY, Huang HD, Pan RL: Incorporating support vector machine for identifying protein tyrosine sulfation sites. J Comput Chem 2009.Google Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Bryson K, McGuffin LJ, Marsden RL, Ward JJ, Sodhi JS, Jones DT: Protein structure prediction servers at University College London. Nucleic Acids Res 2005, 33(Web Server issue):W36–38.PubMed CentralView ArticlePubMedGoogle Scholar
- Chang C-C, Lin C-J: LIBSVM : a library for support vector machines. In . ; 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.Google Scholar
- Chou KC, Shen HB: Recent progress in protein subcellular location prediction. Anal Biochem 2007, 370(1):1–16. 10.1016/j.ab.2007.07.006View ArticlePubMedGoogle Scholar
- Ron K: A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence 1995, 2(12):1137–1143.Google Scholar
- Carugo O: Detailed estimation of bioinformatics prediction reliability through the fragmented prediction performance plots. BMC Bioinformatics 2007, 8: 380. 10.1186/1471-2105-8-380PubMed CentralView ArticlePubMedGoogle Scholar
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res 2004, 14(6):1188–1190. 10.1101/gr.849004PubMed CentralView ArticlePubMedGoogle Scholar
- Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 1990, 18(20):6097–6100. 10.1093/nar/18.20.6097PubMed CentralView ArticlePubMedGoogle Scholar
- Furie BC, Ratcliffe JV, Tward J, Jorgensen MJ, Blaszkowsky LS, DiMichele D, Furie B: The gamma-carboxylation recognition site is sufficient to direct vitamin K-dependent carboxylation on an adjacent glutamate-rich region of thrombin in a propeptide-thrombin chimera. The Journal of biological chemistry 1997, 272(45):28258–28262. 10.1074/jbc.272.45.28258View ArticlePubMedGoogle Scholar
- Vacic V, Iakoucheva LM, Radivojac P: Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics 2006, 22(12):1536–1537. 10.1093/bioinformatics/btl151View ArticlePubMedGoogle Scholar
- Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, et al.: InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 2000, 16(12):1145–1150. 10.1093/bioinformatics/16.12.1145View ArticlePubMedGoogle Scholar
- Bairoch A: PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res 1991, 19(Suppl):2241–2245.PubMed CentralView ArticlePubMedGoogle Scholar
- Attwood TK, Beck ME, Bleasby AJ, Parry-Smith DJ: PRINTS--a database of protein motif fingerprints. Nucleic Acids Res 1994, 22(17):3590–3596.PubMed CentralPubMedGoogle Scholar
- Sonnhammer EL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997, 28(3):405–420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-LView ArticlePubMedGoogle Scholar
- Corpet F, Gouzy J, Kahn D: The ProDom database of protein domain families. Nucleic Acids Res 1998, 26(1):323–326. 10.1093/nar/26.1.323PubMed CentralView ArticlePubMedGoogle Scholar
- Lee TY, Lin ZQ, Hsieh SJ, Bretana NA, Lu CT: Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences. Bioinformatics 2011, 27(13):1780–1787. 10.1093/bioinformatics/btr291View ArticlePubMedGoogle Scholar
- Lee TY, Bretana NA, Lu CT: PlantPhos: using maximal dependence decomposition to identify plant phosphorylation sites with substrate site specificity. BMC Bioinformatics 2011, 12(1):261. 10.1186/1471-2105-12-261PubMed CentralView ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet 2000, 25(1):25–29. 10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Lee TY, Bo-Kai Hsu J, Chang WC, Huang HD: RegPhos: a system to explore the protein kinase-substrate phosphorylation network in humans. Nucleic Acids Res 2011, 39(Database issue):D777–787.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.