dbOGAP - An Integrated Bioinformatics Resource for Protein O-GlcNAcylation

Background Protein O-GlcNAcylation (or O-GlcNAc-ylation) is an O-linked glycosylation involving the transfer of β-N-acetylglucosamine to the hydroxyl group of serine or threonine residues of proteins. Growing evidence suggests that protein O-GlcNAcylation is common and is analogous to phosphorylation in modulating a broad range of biological processes. However, compared to phosphorylation, the amount of protein O-GlcNAcylation data is relatively limited and its annotation in databases is scarce. Furthermore, a bioinformatics resource for O-GlcNAcylation is lacking, and an O-GlcNAcylation site prediction tool is much needed. Description We developed a database of O-GlcNAcylated proteins and sites, dbOGAP, primarily based on literature published since O-GlcNAcylation was first described in 1984. The database currently contains ~800 proteins with experimental O-GlcNAcylation information, of which ~61% are human, and 172 proteins have a total of ~400 identified O-GlcNAcylation sites. The O-GlcNAcylated proteins are primarily nucleocytoplasmic, including membrane- and non-membrane bounded organelle-associated proteins. The known O-GlcNAcylated proteins exert a broad range of functions including transcriptional regulation, macromolecular complex assembly, intracellular transport, translation, and regulation of cell growth or death. The database also contains ~365 potential O-GlcNAcylated proteins inferred from known O-GlcNAcylated orthologs. Additional annotations, including other protein posttranslational modifications, biological pathways and disease information, are integrated into the database. We developed an O-GlcNAcylation site prediction system, OGlcNAcScan, based on a Support Vector Machine and trained using protein sequences with known O-GlcNAcylation sites from dbOGAP. The site prediction system achieved an area under the ROC curve of 74.3% in five-fold cross-validation.
The dbOGAP website was developed to allow for performing search and query on O-GlcNAcylated proteins and associated literature, as well as for browsing by gene names, organisms or pathways, and downloading of the database. Also available from the website, the OGlcNAcScan tool presents a list of predicted O-GlcNAcylation sites for given protein sequences. Conclusions dbOGAP is the first public bioinformatics resource to allow systematic access to the O-GlcNAcylated proteins, and related functional information and bibliography, as well as to an O-GlcNAcylation site prediction tool. The resource will facilitate research on O-GlcNAcylation and its proteomic identification.

Background O-GlcNAcylation, or O-GlcNAc-ylation to distinguish it from acylation, is an O-linked glycosylation involving the β-attachment of a single N-acetylglucosamine (GlcNAc) to the serine (Ser)/threonine (Thr) residues catalyzed by O-GlcNAc transferase (OGT), whose removal is catalyzed by O-GlcNAcase (OGA) [1]. The two O-GlcNAc cycling enzymes OGT and OGA are each encoded by a single gene in mammalian species. Unlike N-linked or mucin-type O-linked glycosylation, O-GlcNAcylation occurs primarily in nucleocytoplasmic proteins [1]. Analogous to phosphorylation, the modification is dynamic and the O-GlcNAc moiety is not further extended [1]. O-GlcNAcylation is also often reciprocal to phosphorylation at the same or adjacent Ser/Thr residues [1][2][3], which led to a "Yin-Yang" hypothesis on protein functions modulated by the two post-translational modifications (PTMs) [4] through competitively blocking each other's occupancy at given sites. For example, reciprocal O-GlcNAcylation and phosphorylation at the same Ser16 of murine estrogen receptor β (ERβ) modulate the degradation of ERβ by stabilizing or destabilizing the protein, respectively [5]. Similarly, O-GlcNAcylation of p53 at Ser149 is associated with decreased phosphorylation at the adjacent Thr155, resulting in decreased p53 ubiquitination and subsequent degradation, thus stabilizing p53 [6]. In contrast to the enormous body of research on phosphorylation, the amount of research on O-GlcNAcylation has been disproportionately small due to difficulties in detecting the O-GlcNAc group, partly because it is labile, dynamic, and substoichiometric [7]. Over 600 proteins have been reported to be O-GlcNAcylated since the modification was first identified in 1984 [8], many of which were identified in recent years [1][2][3][9][10][11] as a result of improved mass spectrometry technologies.
Growing evidence now suggests that O-GlcNAcylation is very common and has broad roles in physiology and diseases, especially through its reciprocal interplay with phosphorylation, e.g., regulation of insulin signaling, transcription, and roles in diabetes and neurodegenerative diseases [2]. A number of bioinformatics databases have been developed for protein post-translational modifications, including general PTM databases such as dbPTM [12] and databases for specific PTM types, such as protein phosphorylation (e.g., PhosphoELM [13], PhosphoSite [14]), protein glycosylation [15], ubiquitination [16] and protease cleavage [17]. By contrast, there has been no database dedicated to O-GlcNAcylated proteins and sites, and their annotations are also scarce in protein databases, e.g., only ~100 experimental O-GlcNAcylation sites for 35 proteins are currently annotated in UniProtKB [18]. Moreover, O-GlcNAcylation annotations have not been included in the specialized glycosylation databases (e.g., GlycoBase, the Functional Glycomics Gateway) [15,19].
Because of growing interest in studying the crucial roles of O-GlcNAcylation in cell signaling and many other cellular processes, identifying the site motifs and computationally predicting O-GlcNAcylation sites have become important bioinformatics tasks to assist those studies. Unlike N-linked glycosylation, which has a consensus motif of "Asn-X-Thr/Ser", O-linked glycosylation, including mucin-type O-glycosylation and O-GlcNAc glycosylation, has no well-defined sequence motif yet. Past efforts in developing prediction methods for O-glycosylation have mostly focused on the mucin type [20][21][22][23]. To the best of our knowledge, there has been only one site prediction tool for O-GlcNAcylation, YinOYang, which is an artificial neural network system trained on sequence fragments of ~40 O-GlcNAcylation sites available at the time [24]. The motif of O-GlcNAcylation remains poorly defined, and there is a pressing need to develop an O-GlcNAcylation site prediction tool based on the much greater number of experimental O-GlcNAcylation sites available now.
Here we report the development of a database of O-GlcNAcylated proteins and sites (dbOGAP) for all currently known O-GlcNAcylated proteins reported from literature, and of an O-GlcNAcylation site prediction system (OGlcNAcScan) based on nearly 400 O-GlcNAcylation sites. Both the database and the prediction system are available through the dbOGAP web site, which serves as a public bioinformatics resource to facilitate research on O-GlcNAcylated proteins and to assist proteomic identification of O-GlcNAcylation sites.

The Database Development
The primary data source used for developing the dbOGAP database is literature about O-GlcNAcylated proteins published since O-GlcNAcylation was first discovered in the early 1980s [8]. Figure 1 depicts the overall workflow of the dbOGAP database and web site development. About 500 original and review articles related to protein O-GlcNAcylation and/or the O-GlcNAc cycling enzymes OGT and OGA were retrieved from PubMed (April 2010). Abstracts and full-length articles were used to identify experimentally determined O-GlcNAcylated proteins and sites. The proteins were then mapped to UniProtKB entry records based on sequences and/or sequence identifiers (IDs), followed by manual verification. O-GlcNAcylated proteins and sites determined only from large-scale mass spectrometry (MS) without further validation using targeted MS and/or additional biochemical methods were annotated with evidence tags (e.g., "LS: MALDI-TOF-MS"). Orthologs of known O-GlcNAcylated proteins with identified O-GlcNAcylation sites were populated based on the HomoloGene groups [25] and/or BLAST neighbors [26], where the potential O-GlcNAcylation sites on the orthologs were inferred based on the conserved Ser/Thr residues. The experimental or inferred O-GlcNAcylation was attributed to literature (PubMed ID) or inference (from orthologs), respectively. A small number of currently annotated O-GlcNAcylated proteins in UniProtKB were also integrated into dbOGAP with the source attributed. Additional protein annotations, including other protein modifications (e.g., phosphorylation) and site features, Gene Ontology, pathways and disease information, were integrated into dbOGAP from the UniProtKB [18] or iProClass [27] databases.
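The ortholog-based site inference described above can be illustrated with a minimal sketch. It assumes that a pairwise alignment between the known protein and its ortholog is already available as two gapped strings, and that any Ser or Thr in the ortholog at the aligned column counts as conserved; the function name and this conservation rule are illustrative assumptions, not the actual dbOGAP pipeline.

```python
def infer_ortholog_sites(aligned_known, aligned_ortholog, known_sites):
    """Transfer known sites to an ortholog through a pairwise alignment.

    aligned_known / aligned_ortholog: gapped sequences of equal length ('-' = gap).
    known_sites: 1-based positions of experimental sites in the ungapped known protein.
    Returns {known_position: ortholog_position} for columns where the ortholog
    carries Ser or Thr (assumed conservation rule).
    """
    inferred = {}
    known_pos = 0  # running 1-based position in the ungapped known protein
    orth_pos = 0   # running 1-based position in the ungapped ortholog
    for a, b in zip(aligned_known, aligned_ortholog):
        if a != "-":
            known_pos += 1
        if b != "-":
            orth_pos += 1
        # Only columns holding a known site and an ortholog Ser/Thr qualify.
        if a != "-" and b in ("S", "T") and known_pos in known_sites:
            inferred[known_pos] = orth_pos
    return inferred


# Toy alignment (made-up sequences): the ortholog has one extra residue (Q).
sites = infer_ortholog_sites("MKT-SPA", "MRTQSPA", {3, 4})
```

In this toy case the site at position 4 of the known protein maps to position 5 of the ortholog because of the inserted residue, which is why positions are tracked separately for each sequence.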

The O-GlcNAc Site Prediction
An O-GlcNAcylation site prediction system, OGlcNAcScan, was developed based on annotated O-GlcNAcylation sites in dbOGAP using the SVMlight implementation of Support Vector Machine (SVM) [28]. The training data set of the prediction system consists of 373 positive instances, which are experimental O-GlcNAcylation sites in 167 protein sequences from dbOGAP, and 29,897 negative instances, which are the remaining un-annotated Ser/Thr sites in the same protein sequences. Given a Ser/Thr site, n upstream and n downstream amino acids were regarded as its sequence context, and the resulting 2n+1 amino acids, including the Ser or Thr residue in the middle, were converted into a vector of binary values (0 or 1) using the widely used sparse encoding method described, for example, in Julenius et al. 2005 [21]. Note that if the site is fewer than n amino acids away from a sequence terminus, the end-of-sequence symbol is padded at that terminus as many times as needed to derive a fixed-length sequence fragment. In this encoding method, each amino acid type and the end-of-sequence symbol is coded with 21 binary values (e.g., 100...0 (one followed by 20 zeros) for Ala, 010...0 for Arg, ..., and 000...1 for end-of-sequence), and the resulting feature vector consists of 21 × (2n+1) binary values. For different values of n, we trained SVM classifiers with the RBF kernel. The classifier parameters, C and g, were optimized through 5-fold cross-validation tests, where classifiers were trained and tested on four-fifths and the remaining one-fifth of the data set, respectively, repeated five times. We also explored other sequence encoding methods, such as frequencies of amino acid types [21,23] and gappy bi-grams/dimers [22], but the orthodox sparse encoding method with n = 5 yielded the best prediction performance.
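As an illustration of the sparse encoding described above, the following sketch converts the sequence context of a Ser/Thr site into a binary vector of 21 × (2n+1) values. The ordering of the amino acid alphabet beyond Ala and Arg, the "-" padding symbol, and all function names are assumptions made for this example; they are not taken from the OGlcNAcScan implementation.

```python
# Assumed alphabet order: Ala first, Arg second, end-of-sequence symbol last.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # 20 standard residues
END = "-"                             # hypothetical end-of-sequence symbol
ALPHABET = AMINO_ACIDS + END          # 21 symbols in total


def window(seq, pos, n=5):
    """Return the 2n+1 fragment centred on 0-based index pos, padded with END."""
    padded = END * n + seq + END * n
    return padded[pos:pos + 2 * n + 1]


def sparse_encode(fragment):
    """Encode each symbol as 21 binary values and concatenate the results.

    Symbols outside ALPHABET (e.g., X or U) raise ValueError in this sketch.
    """
    vector = []
    for symbol in fragment:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(symbol)] = 1
        vector.extend(one_hot)
    return vector


def encode_sites(seq, n=5):
    """Encode every Ser/Thr site; returns (1-based position, vector) pairs."""
    return [(i + 1, sparse_encode(window(seq, i, n)))
            for i, aa in enumerate(seq) if aa in "ST"]
```

With n = 5 each site yields a vector of 21 × 11 = 231 values, of which exactly 11 are set to 1. Vectors of this form could then be passed to any SVM implementation that supports an RBF kernel.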

The Database and the Web site Implementation
The dbOGAP database is implemented using the open source relational database management system, MySQL, with tables to store and manage the O-GlcNAcylation protein entries, O-GlcNAcylation sites from different sources and related literature information. The database is deployed on the RedHat Enterprise Linux operating system (version 5.5). The Apache web server (version 2.2.15, http://httpd.apache.org/) with the security-enhancing module ModSecurity (version 2.5.10, http://www.modsecurity.org/) was deployed for the dbOGAP web site. All data query and retrieval from the dbOGAP web site is accomplished by scripts written in Perl, PHP and JavaScript.
(Table 1). Overall, the number of currently identified O-GlcNAcylation sites is only ~11% (404/3687) of that of phosphorylation sites on all known

Functional profiles of O-GlcNAcylated proteins
We analyzed Gene Ontology (GO) profiles of currently known human O-GlcNAcylated proteins (~490) using the DAVID tool [29]. We first examined the major enriched GO categories of O-GlcNAcylated proteins annotated with GO terms at higher levels of the GO hierarchy (covering ≥10% of the proteins) (Table 2). As shown by the GO Cellular Component profiling, O-GlcNAcylated proteins are mostly those of nucleocytoplasmic distribution, including membrane or non-membrane bounded organelles, cytosol, cytoskeleton, and nuclear compartments. The O-GlcNAcylated proteins mainly possess nucleotide and nucleic acid binding activities and transcription regulator activities (GO Molecular Function), and participate in transcriptional regulation, macromolecular complex assembly, intracellular transport, translation, regulation of cell cycle and apoptosis, and regulation of macromolecule metabolic process (GO Biological Process).
We further examined the O-GlcNAcylated proteins for enrichment of GO terms at deeper levels of the GO hierarchy. As summarized in [Additional file 1, Supplementary Table S1], the top enriched GO biological processes relate to protein translation, carbohydrate (glucose) metabolism, RNA processing/splicing, and RNA/protein transport, followed by macromolecular complex and organelle organization, regulation of cell cycle and cell death, chromosome organization and transcription, regulation of protein and other small molecule metabolisms. The enriched GO molecular functions include nucleoside, nucleotide and nucleic acid binding, transcription factor activity, protein binding and other molecular activities. The enriched GO cellular components include cytosol, organelle lumen and non-membrane-bounded organelles, nuclear compartments such as nucleoplasm, nuclear pore and nucleolus, ribosome and cytoskeleton, nuclear protein complexes and chromatin, membrane and vesicle associated spaces, and contractile associated proteins. Notably, although significant proportions of known O-GlcNAcylated proteins are associated with intracellular membranes or the inner side of the plasma membrane, only a few plasma transmembrane proteins, such as glucose transporters and Notch receptor, were reported to be O-GlcNAcylated [30][31][32]. Therefore, O-GlcNAcylated proteins are primarily nucleocytoplasmic and are engaged in broad biological functions.

The O-GlcNAcylation Site Prediction
Figure 3 (Above) shows the graphical representation of sequence patterns surrounding the O-GlcNAcylation sites annotated in dbOGAP using the "Two Sample Logo" tool [42]. Enrichment of amino acids at the -3 to +2 positions of the modified Ser/Thr, PPV(S/T)TA, can be observed. However, the amino acid enrichment at each position independently is not sufficient for defining a sequence motif for O-GlcNAcylation sites. OGlcNAcScan was designed to exploit sequence properties through SVM for the site prediction. The system achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 74.3% (Figure 3, Below) in a five-fold cross-validation test. AUC is a widely used performance measure of binary classifiers. A perfect classifier yields an AUC of 100% while random guessing yields 50%. Although the AUC value of OGlcNAcScan is relatively low, at least the following two factors need to be considered for its interpretation. First, the fraction of positive instances is extremely low in this task, i.e., 373 (1.23%) of 30,270 Ser/Thr sites are annotated O-GlcNAcylation sites in dbOGAP. Some of the past studies on PTM site prediction reported the performance of prediction systems on a balanced data set, where sampled negative sites were used in the evaluation data set (e.g., the ratio of positive to negative sites was made to be 1:1 (50% positive) or 1:5 (16.7% positive)). In fact, the relative improvement of our trained SVM classifier, when compared to random guessing [43], can be as high as 14-fold (i.e., the precision of the classifier can be 14 times higher than the original rate of positive sites of 1.23%). The second factor to be considered is that negative instances in the evaluation data set may include not-yet-annotated true O-GlcNAcylation sites, which could have lowered the performance measures. We believe, however, that sequence-based prediction of O-GlcNAcylation sites is inherently challenging.
Additional training data through further annotation of proteins and sites, as well as incorporation of other feature types, such as physicochemical properties of amino acids and protein structure information, may help improve the performance.
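To make the two performance measures discussed above concrete, the following sketch computes AUC as the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site, and the relative improvement over random guessing as precision divided by the base rate of positive sites. Only the counts 373 and 30,270 come from the text; the function names and the example scores are made up for illustration.

```python
def auc(pos_scores, neg_scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def lift(precision, base_rate):
    """Relative improvement of a classifier over random guessing."""
    return precision / base_rate


# Base rate of annotated sites among all Ser/Thr sites in the training data.
BASE_RATE = 373 / 30270  # about 0.0123, i.e., 1.23%

# Made-up classifier scores, for illustration only.
toy_auc = auc([0.9, 0.6, 0.4], [0.5, 0.3, 0.2, 0.1])

# A precision 14 times the base rate (about 17%) corresponds to a 14-fold lift.
example_lift = lift(14 * BASE_RATE, BASE_RATE)
```

The pairwise definition of AUC used here is quadratic in the number of sites and is meant only to show what the measure expresses; a rank-based computation would be preferable for data sets of the size described above.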

The dbOGAP Web Site
The dbOGAP web site provides two primary functionalities: searching, querying and browsing of O-GlcNAcylated proteins and their related annotations, and de novo prediction of O-GlcNAcylation sites (Figure 4, #1 and #2). The dbOGAP database can be searched based on gene/protein names or identifiers, pathway names, or PubMed IDs. The protein entries can also be browsed based on gene names, organisms or pathways. The OGlcNAcScan site prediction system allows input of a protein sequence in FASTA format or a UniProtKB identifier (AC or ID) for site prediction. In addition, users can contribute their annotations to the database based on literature or from unpublished proteomic data on newly identified O-GlcNAcylation sites (Figure 4, #3). All O-GlcNAcylation related literature citations are also available for browsing (Figure 4, #4).

The O-GlcNAcylated protein entry
The dbOGAP protein entries are assigned unique IDs (e.g., OG00001) and are mapped to the corresponding UniProtKB IDs (e.g., 1433B_HUMAN) and accessions (e.g., P31946). The entry report provides detailed O-GlcNAcylation information and evidence attributions, including experimental and inferred O-GlcNAcylation data (Figure 5). O-GlcNAcylated residues and positions, as well as other modification sites (e.g., phosphorylation) and site features (e.g., binding sites), can be visualized in the context of protein sequences. The entry record also provides additional annotations such as GO, pathways (e.g., KEGG, PID and Reactome), protein-protein interactions (e.g., IntAct), protein families (e.g., Pfam) and diseases (OMIM), as well as additional protein bibliography integrated from UniProt and iProClass. Hyperlinks to source databases are provided for integrated annotations in dbOGAP entry records.

The O-GlcNAcScan report
The OGlcNAcScan report page provides a list of predicted O-GlcNAcylation sites for a given query sequence (Figure 6). The list can be sorted based on the

Discussion
Up to now, the amount of data published on protein O-GlcNAcylation is only a fraction of that on phosphorylation, and its biological role is much less understood. Since 2006, the identification of O-GlcNAcylated proteins and sites has been growing rapidly due to improved mass spectrometry technologies and O-GlcNAc enrichment techniques [7][8][9]. The dbOGAP database provides a timely bioinformatics resource that gives the community ready access to the known and potential O-GlcNAcylated proteins and sites.
While a large number of O-GlcNAcylated proteins and sites were identified in recent years, many were determined based on large-scale mass spectrometry and would need to be further validated. Although O-GlcNAcylation has been known to occur primarily in nucleocytoplasmic proteins, the GO profiles show that O-GlcNAcylated proteins are localized in a broad range of intracellular compartments. Interestingly, some O-GlcNAcylated proteins are of unusual classes, e.g., adenylate kinase 2 (AK2, UniProtKB: KAD2_HUMAN) [44], localized in the mitochondrial intermembrane space, and alpha-1-inhibitor 3 (A1i3, UniProtKB: A1I3_RAT) [45], a secreted protein. Although false positive identification of O-GlcNAcylation is not uncommon from mass spectrometry, it is possible that such proteins may indeed be O-GlcNAcylated. It is known that OGT has at least three isoforms differing in N-terminal sequences with an identical catalytic domain: the mitochondrial form (mOGT) and two nucleocytoplasmic forms (ncOGT and sOGT) [46,47]. The mOGT form was shown to be associated with the mitochondrial inner membrane [46], which is consistent with the observation of O-GlcNAcylation of the mitochondrial protein AK2. There are a total of ~11 O-GlcNAcylated proteins in dbOGAP that are known to be secreted or to have secreted forms besides cytoplasmic ones. It is possible that only the cytoplasmic forms of some of these proteins are O-GlcNAcylated while the secreted ones are not, although experimental validation is needed. Thus, the types and/or sources of O-GlcNAcylation identification have been assigned to protein entries as evidence attribution to annotations in the dbOGAP database.
Figure 5. The dbOGAP protein entry view (shown is human AKT1). The entry report provides general protein information as well as specific O-GlcNAcylation information in the context of other posttranslational modifications and site features. The literature evidence (PMID) for the O-GlcNAc sites (e.g., S473 and T308) is given. Clicking on any site will display the residue in the neighboring sequence context (pointed by blue arrow). If the O-GlcNAcylation sites are inferred from orthologs with known sites (e.g., T308 of mouse AKT1, pointed by red arrow, inferred from human AKT1 shown in the inset), sequence alignment for the inferred sites can be displayed (lower portion of the inset). Other annotations, including gene ontology and pathway, derived from UniProtKB and iProClass, are also included in the entry record (below the sequence section, not shown).
The OGlcNAcScan site prediction system provides a much needed tool for studying protein glycosylation as well as phosphorylation. Since the site prediction is primarily based on the protein sequence context, some secreted proteins may be erroneously predicted even with a relatively high score, e.g., T298 in mucin 4 (UniProtKB: MUC4_HUMAN) predicted with a score of 0.287, though it is unlikely to be O-GlcNAcylated. In such cases, a cautionary note is given to indicate that a protein sequence being predicted is known to have "secreted" form(s). With the continuing growth of O-GlcNAcylation site data, the OGlcNAcScan tool will be further enhanced through retraining the SVM model, as well as by integrating physicochemical properties and structural information into the SVM prediction model.
Figure 6. The O-GlcNAcylation site prediction result from OGlcNAcScan (shown is human ankyrin-1). The section at the bottom displays a ranked list of predicted O-GlcNAcylation sites (e.g., S1162 as the top one). The rank is based on the output value of the SVM classifier, which is converted into "Estimated Precision" and "Lift" scores (see help page linked from the top of the page for explanation). The estimated precision score is an estimated lower bound of the precision (e.g., the score of 0.3910 indicates that at least 39.1% of sites assigned similar SVM output scores are O-GlcNAcylation sites), and the Lift score is an index of relative improvement through the classifier, which is calculated as the estimated precision divided by a constant value corresponding to the initial rate of positive sites (i.e., ~0.0123). All displayed potential sites are shown as red "S/T" in the sequence section (middle). Clicking on any predicted site highlights the residue in the sequence (arrow).

Conclusion
In conclusion, the dbOGAP database and web site are the first of their kind in the public domain to provide ready access to a curated and systematic collection of protein O-GlcNAcylation information, and to a state-of-the-art O-GlcNAcylation site prediction system, OGlcNAcScan, to assist proteomic identification of O-GlcNAc modification sites. Thus, the dbOGAP resource should help the biological community study the broad roles of O-GlcNAcylation in physiology and diseases.