Systematic analysis of human kinase genes: a large number of genes and alternative splicing events result in functional and structural diversity
© Milanesi et al; licensee BioMed Central Ltd 2005
Published: 1 December 2005
Protein kinases are a well-defined family of proteins, characterized by the presence of a common kinase catalytic domain and playing a significant role in many important cellular processes, such as proliferation, maintenance of cell shape and apoptosis. In many members of the family, additional non-kinase domains contribute further specialization, affecting subcellular localization, protein binding and regulation of activity, among others. About 500 genes encode members of the kinase family in the human genome; although many of them are well-known genes, a larger number code for more recently identified proteins, or for unknown proteins identified as kinases only through computational studies.
A systematic in silico study performed on the human genome led to the identification of 5 genes, on chromosomes 1, 11, 13, 15 and 16 respectively, and 1 pseudogene on chromosome X; some of these genes are reported as kinases by NCBI but are absent from other databases, such as KinBase. Comparative analysis of 483 gene regions and subsequent computational analysis, aimed at identifying unannotated exons, indicate that a large number of kinase genes may code for alternatively spliced forms or be incorrectly annotated. An automated InterProScan analysis was performed to study domain distribution and combination in the various families. Other structural features were also added to the annotation process, including the putative presence of transmembrane alpha helices and the propensity of cysteines to participate in disulfide bridges.
The predicted human kinome was extended by identifying both additional genes and potential splice variants, resulting in a varied panorama where functionality may be searched at the gene and protein level. Structural analysis of kinase protein domains, as defined in multiple sources, together with prediction of transmembrane alpha helices and signal peptides, provides hints for function assignment. The results of the human kinome analysis are collected in the KinWeb database, available for browsing and searching over the internet, where all results from the comparative analysis and the gene structure annotation are provided alongside the domain information. Kinases may be searched by domain combinations, and the corresponding genes may be viewed in a graphic browser at various levels of magnification, up to gene organization on the full chromosome set.
Eukaryotic protein kinases (ePKs) are important players in virtually every signalling pathway involved in normal development and disease: the transduction, amplification and integration of many intracellular and intercellular processes require specific regulation, which is often provided by protein phosphorylation. Most ePKs belong to a single superfamily, characterized by a contiguous stretch of approximately 250 amino acids that constitutes the catalytic domain (ePK domain). A much smaller number of protein kinases do not share this catalytic domain with other kinases, and are often collectively called atypical kinases. The availability of complete sequences for human and vertebrate genomes stimulated computational searches of the whole sequence in order to identify additional unknown protein kinases: Swissprot, Uniprot, ENSEMBL [3–5] and other commonly used databases all annotate different numbers of kinase proteins or genes. Manning et al., in a systematic attempt to establish the full set of human kinases (kinome), identified 478 ePKs and 106 kinase pseudogenes in the human genome, in addition to 40 atypical protein kinases lacking sequence similarity in the ePK domain. This study was more recently extended to cover additional species. In many cases the ability of protein kinases to regulate biological events depends on the presence, along with the kinase domain, of other functional domains involved in regulation, interaction with other protein partners or subcellular localization. These non-catalytic domains extend the already wide diversification of these proteins and, at the same time, help to explain the high degree of this functional diversification, suggesting alternative targets for structural and functional analysis.
In this article we try to expand the kinase superfamily by identifying novel protein kinases, either encoded in previously unidentified genes or generated via alternative processing steps from the known ones, via comparative genomics analysis. We also try to characterize the gene products by protein sequence analysis, by means of different tools of structure/domain prediction, including some machine learning-based methods specifically developed [9, 10] to predict transmembrane alpha helices and disulfide bond propensity for cysteine residues. The data are collected in a database, where all the information from the present study may be publicly accessed.
Results and discussion
The human kinase gene set
Genomic loci identified after the described procedure
Entrez Gene Name
phosphatidylinositol 4-kinase catalytic beta polypeptide
similar to bone morphogenetic protein receptor, type IA precursor; activin A receptor, type II-like kinase 3
similar to NEK1 (NimA-related protein kinase 1)
similar to casein kinase I alpha
similar to 3-phosphoinositide dependent protein kinase-1 (hPDK1)
mitogen-activated protein kinase kinase 4 pseudogene
Kinase gene analysis
basic features concerning sequence content and similarity with the orthologous counterpart, such as GC%, number of gaps, length, sequence identity and polarity;
genomic localization, according to ENSEMBL data, concerning chromosomal coordinates, relationship with the target and the closest gene (i.e. distances from transcription and coding start and from transcription and coding end, according to gene orientation), sequence type (i.e. intergenic, intronic, exonic) and number of known SNPs;
predicted functional features, concerning the identification of motifs and putative signals, such as transcription factor binding sites, exonic splicing enhancers (ESE), RNA secondary structures, palindromes and tandem repeats.
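As an illustration of the basic features listed above, the following minimal sketch computes GC%, number of gap columns and percent identity for a pair of aligned human and mouse sequences. It is not the pipeline used in this study: the function names, the toy sequences and the choice of the full alignment length (gap columns included) as the identity denominator are assumptions made for the example.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G/C among unambiguous A/C/G/T bases."""
    s = seq.upper()
    acgt = sum(s.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (s.count("G") + s.count("C")) / acgt


def alignment_stats(human: str, mouse: str, gap: str = "-") -> dict:
    """Aligned length, gap columns and percent identity for two aligned rows.

    Assumption: identity is computed over the full alignment length,
    gap columns included.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned rows must have equal length")
    h, m = human.upper(), mouse.upper()
    gaps = sum(1 for a, b in zip(h, m) if a == gap or b == gap)
    matches = sum(1 for a, b in zip(h, m) if a == b and a != gap)
    length = len(h)
    identity = 100.0 * matches / length if length else 0.0
    return {"length": length, "gaps": gaps, "identity": identity}


# Toy aligned rows (hypothetical, not taken from the CST set).
human_row = "ACGTGC-ATGCA"
mouse_row = "ACGTGCTATG-A"

stats = alignment_stats(human_row, mouse_row)
human_gc = gc_percent(human_row.replace("-", ""))
print(stats, round(human_gc, 1))
```

Other denominators for identity (for example, ungapped columns only) are equally defensible; the point is only that each feature is a simple function of the aligned pair.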
Many CSTs reported in Fig. 1a as intergenic or intronic (i.e. non-exonic) could of course represent additional, previously unidentified exons, either constitutive or alternatively spliced, and their identification would lead to further expansion of the available kinome. Several computational tests were therefore directed at determining coding potential, including identification of long open reading frames, calculation of codon frequency and periodicity, statistics on synonymous codon usage, and coincidence with exons and suboptimal exons of genes predicted by running GENSCAN on the selected human genomic regions. In addition, human and mouse EST collections were scanned by BLAST for similarity to human and mouse CSTs, and genomic annotations were compared to highlight differences between human and mouse, such as human CSTs annotated as intergenic or intronic whose mouse counterpart is annotated as exonic. The last four tests, i.e. coincidence with GENSCAN exons, matches with human or mouse ESTs and annotation as exon in mouse, are, alone or in combination, particularly convenient criteria to assess exon potential: about a third of the CSTs initially marked as intronic or intergenic were positive for at least one of these criteria and should be considered "exon-like" (Fig. 1b). Concomitant positivity for more than one criterion is often observed and allows sequences to be ranked in classes of increasing potential to be unannotated constitutive or alternatively spliced exons (Fig. 1c). The number of kinase genes containing such CSTs is large, and these CSTs are likely to represent unannotated constitutive exons or to code for alternatively spliced isoforms: about half of the genes contain at least one region positive for two or more of the four criteria mentioned above (Fig. 1d).
Two CSTs identified in CDKL2 (cyclin-dependent kinase-like 2), a member of the cdc2-related serine/threonine kinase subfamily, contain putative exons inserted into the 3' end of the coding region (Fig. 2b). The resulting protein differs from CDKL2 in its C-terminal domain, where the length of a random coil region located at the carboxy-end of the protein is increased; previous analysis of mouse cDNA clones already revealed multiple variants generated by alternative splicing events in this region. These exons are not described in the numerous species where the protein has been reported, with the only exception of rabbit, where a similar sequence from deep cerebellar nuclei is described as the only form available. Despite their similarity with cdc2, most members of this family have roles other than cell cycle regulation and are expressed in terminally differentiated cells of the nervous system. Consistently, a human EST ending in a polyA+ tail confirms inclusion of these exons in human brain RNA and provides evidence for a brain-specific form of the protein.
Similarly, VRK1 (vaccinia-related kinase 1) shows an extra exon contained within a CST, which is suggestive of an alternative splicing event localized at the 3' end of the VRK1 gene and affecting the low-complexity domain at the C-terminus of the protein (Fig. 2c). The protein, a member of a new group of human serine/threonine kinases and known to prevent p53 ubiquitination via phosphorylation of Thr-18 [16, 17], has only one isoform according to the SWISSPROT protein database, while five alternative products are reported as predicted for the murine one.
Structure-function relationship in kinase proteins
Further information in our data mining system comes from an exhaustive prediction of transmembrane domains and other structural features. This information helps to clarify the connection between structure and function of known proteins, and also makes it possible to formulate hypotheses about the role and the subcellular localization of novel proteins. The kinase sequence set was filtered with machine learning-based methods specifically suited to predict signal peptides, transmembrane domains of the alpha-helical type and the propensity of cysteine residues to form disulfide bridges, which allowed these characteristics to be annotated on a predictive basis. The results are shown in Fig. 3c. We found that 13.5% of the kinase sequences are endowed with signal peptides, suggesting that these proteins may be secreted via the SEC-dependent secretory pathway; 40.9% are endowed with at least one transmembrane domain distinct from the signal peptide, a number substantially higher than the 18.4% annotated as containing a TM domain in the Swissprot database; 15% of the kinases are endowed with at least one disulfide bridge.
The information stored for each kinase gene consists of annotations automatically extracted from public databanks or literature such as:
alternative names as defined in the HUGO database;
family classification according to Manning;
transcript variants and genomic contig names and coordinates from RefSeq;
functional annotations from Gene-Entrez;
information about transcripts, exons and genomic coordinates from Ensembl;
direct links to RefSeq, Gene-Entrez, OMIM, Ensembl and SwissProt databases, together with the Kinbase protein and mRNA sequences.
These annotations are stored alongside the results from the present analysis on kinase genes and proteins. The available data include:
type and position of detected domains;
predictions for secondary structure, transmembrane domains and cysteine propensity to form disulfide bridges;
mouse orthologous kinase genes;
CSTs common to human and mouse;
all CST annotations.
The predicted human kinome was extended by identifying kinase genes through a custom-built pipeline and by identifying a large number of non-exonic, apparently non-coding, highly conserved sequences through comparative analysis. Some of these conserved sequences were annotated as exon-like and may be responsible for additional protein variability through alternative processing; others may play different roles, for example contributing to the regulation of gene expression. Domain analysis and prediction of structural features provide further information, resulting in a varied panorama where functionality may be searched at the gene or protein level. All results from the comparative analysis and the gene structure annotation are provided alongside the domain information in the KinWeb database, which can be browsed and searched over the internet; kinases can be searched by domain combinations and the corresponding genes visualized, including annotation of conserved sequences. A graphic browser is used to view kinase genes at various levels of magnification, from single exons up to gene organization on the full chromosome set.
Kinase gene identification
A contiguous stretch of approximately 90 amino acids, containing the well-known "DxxxxN, DFG, APE, DxxxxG" motif, was extracted from an arbitrarily chosen kinase, ABL, and used as input to a three-iteration PSI-BLAST search of a query database containing the whole kinase dataset identified by Manning and coworkers. The resulting Position Specific Score Matrix (PSSM) was used as a query to perform tBLASTn against all human chromosome sequences available from NCBI, release April 14 2003; human sequences had been previously masked to remove sequences coding for kinase genes contained in the starting set. Sequence regions matching the PSSM were extracted and extended 200 kb upstream and downstream, and full-length gene prediction was performed on the resulting genomic regions with GenomeScan http://genes.mit.edu/. This software predicts genes on the basis of an input protein expected to be similar to the gene product encoded in the DNA sequence. We found such proteins by a BLASTX comparison of our sequences to all known proteins.
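The extension step can be sketched as follows. This is only an illustration under stated assumptions: coordinates are taken to be 1-based and inclusive, windows are clipped at the chromosome ends, and overlapping windows are merged before gene prediction; the clipping and merging rules, the function names and the toy coordinates are not described in the text and are hypothetical.

```python
FLANK = 200_000  # 200 kb upstream and downstream, as stated in the text


def extend_hit(start, end, chrom_length, flank=FLANK):
    """Extend a hit by `flank` bases on each side, clipped to the chromosome.

    Coordinates are assumed to be 1-based and inclusive.
    """
    return max(1, start - flank), min(chrom_length, end + flank)


def merge_windows(windows):
    """Merge overlapping windows (an assumption; the text does not say
    how overlapping regions were handled)."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy hits on a hypothetical 1 Mb chromosome.
chrom_length = 1_000_000
hits = [(150_000, 150_300), (420_000, 420_270), (900_000, 900_240)]

windows = [extend_hit(s, e, chrom_length) for s, e in hits]
regions = merge_windows(windows)
print(regions)
```

Each resulting region would then be passed, together with a similar protein, to the gene prediction step.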
Kinase domain identification
For each human hit, all the features (gene name, alternative names, classification) were stored in a table of a relational database, along with protein, mRNA and kinase catalytic domain sequences. Pseudogenes were manually curated and inserted into a distinct relational database. All putative kinases were then analyzed with InterProScan for complete domain annotation. InterProScan is freely available under the GNU licence agreement from the EBI's ftp server ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/. The output, generated in XML format, is parsed by a Perl script that extracts all the annotations and records them in a MySQL relational database, which can be consulted through interfaces written in PHP and viewed with common browsers over the internet.
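To illustrate how domain annotations stored in a relational database support searches by domain combination, the sketch below uses an in-memory SQLite database in place of MySQL. The table layout, column names, kinase identifiers and domain placements are hypothetical and do not reproduce the KinWeb schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE kinase (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE domain_hit (
        kinase_id INTEGER REFERENCES kinase(id),
        domain    TEXT,
        dom_start INTEGER,
        dom_end   INTEGER
    );
    """
)

# Toy annotations: placeholder kinase names and positions, for illustration only.
toy = {
    "KIN_A": [("SH3", 60, 120), ("SH2", 125, 210), ("Pkinase", 240, 490)],
    "KIN_B": [("Pkinase", 30, 280)],
    "KIN_C": [("SH2", 10, 95), ("Pkinase", 130, 380), ("SH2", 400, 480)],
}
for name, hits in toy.items():
    kinase_id = conn.execute(
        "INSERT INTO kinase (name) VALUES (?)", (name,)
    ).lastrowid
    conn.executemany(
        "INSERT INTO domain_hit VALUES (?, ?, ?, ?)",
        [(kinase_id, dom, start, end) for dom, start, end in hits],
    )


def kinases_with_domains(conn, domains):
    """Names of kinases that carry every domain in `domains`."""
    domains = sorted(set(domains))
    if not domains:
        return []
    placeholders = ",".join("?" for _ in domains)
    sql = (
        "SELECT k.name FROM kinase k "
        "JOIN domain_hit d ON d.kinase_id = k.id "
        f"WHERE d.domain IN ({placeholders}) "
        "GROUP BY k.id, k.name "
        "HAVING COUNT(DISTINCT d.domain) = ? "
        "ORDER BY k.name"
    )
    return [row[0] for row in conn.execute(sql, (*domains, len(domains)))]


print(kinases_with_domains(conn, ["Pkinase", "SH2"]))
print(kinases_with_domains(conn, ["SH3", "SH2", "Pkinase"]))
```

Counting distinct domain names in the HAVING clause ensures that a kinase with two copies of one domain is not mistaken for one carrying two different domains.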
Human/mouse orthologous regions corresponding to kinase genes were taken from ENSEMBL annotation, when available, or identified manually on the basis of sequence conservation. The limits of the genomic sequences were set between 20 kb and 250 kb, depending on the distance of the closest known gene. Species-specific repeats were masked and BLASTZ was used for comparison. The final set of about 33000 CSTs was selected according to the parameters described in the "Results" section.
Annotation of human and murine CSTs was carried out through a pipeline composed of several independent modules. The pipeline is based on PHP scripts and includes: classification of CST type according to Ensembl gene definitions, coding capability according to Ensembl exon definitions, GC content, and distances from the analysed and closest genes and coding regions.
A number of programs were run on the whole CST set to annotate specific features: equicktandem and palindrome were used to identify direct and inverted repeats; marscan to annotate MAR sites; tcode, syco and getorf to assess coding potential; GeneSplicer http://cbcb.umd.edu/software/GeneSplicer/ to detect splice sites; GENSCAN was used for ab initio prediction of transcripts and suboptimal exons. equicktandem, palindrome, marscan, tcode, syco and getorf are EMBOSS applications http://emboss.sourceforge.net.
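The classification of CST type can be sketched as an interval-overlap test against gene and exon coordinates. The pipeline itself is written in PHP; this Python sketch only illustrates the logic, and its rules are assumptions: any overlap with an exon is treated as "exonic", overlap with a gene but no exon as "intronic", and the distance to a gene is taken as the coordinate difference between the nearest ends. All coordinates are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_cst(cst: Interval, genes: List[Interval], exons: List[Interval]) -> str:
    """Classify a CST as exonic, intronic or intergenic (assumed rules)."""
    if any(overlaps(cst, exon) for exon in exons):
        return "exonic"
    if any(overlaps(cst, gene) for gene in genes):
        return "intronic"
    return "intergenic"


def distance_to_closest_gene(cst: Interval, genes: List[Interval]) -> Optional[int]:
    """0 if the CST overlaps a gene, otherwise the coordinate difference
    between the CST and the nearest gene end; None if no genes are given."""
    distances = []
    for gene in genes:
        if overlaps(cst, gene):
            return 0
        if gene[0] > cst[1]:
            distances.append(gene[0] - cst[1])  # gene lies downstream
        else:
            distances.append(cst[0] - gene[1])  # gene lies upstream
    return min(distances) if distances else None


# Toy annotation on one strand of a hypothetical region.
genes = [(1000, 5000), (9000, 12000)]
exons = [(1000, 1200), (3000, 3300), (4800, 5000), (9000, 9500), (11500, 12000)]

csts = [(1100, 1150), (2000, 2100), (6000, 6200)]
labels = [classify_cst(c, genes, exons) for c in csts]
print(labels, distance_to_closest_gene((6000, 6200), genes))
```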
BLAST was run on all CSTs to annotate matches within the human and mouse genomes: matches with a score higher than 50 or an E-value better than 10^-5 were kept as annotations in the database. Similarly, results of BLAST runs of all CSTs against human and mouse EST libraries were annotated when the E-value was better than 10^-20 and length and identity were higher than 30 and 90, respectively.
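The two filters can be written down explicitly as follows. This is a minimal sketch, not the code used in the study: the `Hit` record and its field names are hypothetical, "better" is read as "lower" for E-values, and the length and identity thresholds are assumed to be expressed in residues and percent, respectively, since the text does not state the units.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST match (hypothetical record layout)."""
    query: str
    subject: str
    score: float
    evalue: float
    length: int      # alignment length; unit assumed to be residues
    identity: float  # assumed to be a percentage


def keep_genome_hit(hit: Hit) -> bool:
    """Genome matches: score higher than 50 OR E-value better than 1e-5."""
    return hit.score > 50 or hit.evalue < 1e-5


def keep_est_hit(hit: Hit) -> bool:
    """EST matches: E-value better than 1e-20 AND length > 30 AND identity > 90."""
    return hit.evalue < 1e-20 and hit.length > 30 and hit.identity > 90


# Toy hits with made-up values, for illustration only.
hits = [
    Hit("cst1", "chr1", 62.0, 3e-4, 40, 95.0),
    Hit("cst2", "chr2", 41.0, 2e-7, 35, 92.0),
    Hit("cst3", "est9", 180.0, 1e-45, 120, 97.5),
    Hit("cst4", "est3", 48.0, 0.01, 28, 100.0),
    Hit("cst5", "est5", 150.0, 1e-30, 100, 88.0),
]

genome_kept = [h.query for h in hits if keep_genome_hit(h)]
est_kept = [h.query for h in hits if keep_est_hit(h)]
print(genome_kept, est_kept)
```

Note that the genome filter is a disjunction while the EST filter is a conjunction, so the EST criterion is considerably stricter.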
Exon-like definition
Mouse counterpart is annotated as exonic
CST matches with GENSCAN exons or suboptimal exons
CST matches with one or more human EST
Mouse counterpart matches with one or more rodent EST
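The four criteria above lend themselves to a simple scoring scheme: a CST is "exon-like" if at least one criterion is met, and the number of criteria met can be used to rank CSTs. The sketch below illustrates this reading; the class and function names, the placeholder gene names and the evidence values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CstEvidence:
    """Outcome of the four exon-like tests for one non-exonic CST."""
    mouse_exonic: bool   # mouse counterpart annotated as exonic
    genscan_exon: bool   # matches a GENSCAN exon or suboptimal exon
    human_est: bool      # matches one or more human ESTs
    rodent_est: bool     # mouse counterpart matches one or more rodent ESTs

    def n_positive(self) -> int:
        """Number of criteria met (0 to 4), usable as a ranking class."""
        return sum(
            (self.mouse_exonic, self.genscan_exon, self.human_est, self.rodent_est)
        )

    def is_exon_like(self) -> bool:
        """Exon-like: positive for at least one criterion."""
        return self.n_positive() >= 1


def genes_with_strong_cst(
    evidence_by_gene: Dict[str, List[CstEvidence]], min_criteria: int = 2
) -> List[str]:
    """Genes with at least one CST positive for `min_criteria` or more criteria."""
    return sorted(
        gene
        for gene, csts in evidence_by_gene.items()
        if any(cst.n_positive() >= min_criteria for cst in csts)
    )


# Toy evidence for placeholder genes.
evidence = {
    "GENE_A": [CstEvidence(False, True, True, False), CstEvidence(False, False, False, False)],
    "GENE_B": [CstEvidence(True, False, False, False)],
    "GENE_C": [CstEvidence(False, False, False, False)],
}

exon_like_counts = {
    gene: sum(cst.is_exon_like() for cst in csts) for gene, csts in evidence.items()
}
print(exon_like_counts, genes_with_strong_cst(evidence))
```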
Database construction and web interface
CSTs are loaded into an SQL database. The pipeline manages CST import together with automatic annotation. The web interface consists of PHP scripts, which query the database and dynamically generate the result pages. The graphic visualization tool has been developed in PHP and is based on the GD graphics library http://www.boutell.com/gd/.
Signal peptide prediction
The entire set of proteins was filtered with SPEPlip, a neural network-based method for predicting the presence of a signal peptide, trained and tested on a set of experimentally derived signal peptides from eukaryotes and prokaryotes. SPEPlip identifies the presence of sorting signals and predicts their cleavage sites. Its reported accuracy is 97%. It can be accessed through the web page at http://gpcr.biocomp.unibo.it/.
All-alpha membrane proteins prediction
All-alpha membrane proteins constitute a functionally relevant subset of the whole proteome. Their content ranges from about 10 to 30% of the cell proteins, based on sequence comparison and specific predictive methods. ENSEMBLE is an ensemble of methods, containing a cascade-neural network (NN) and two different hidden Markov models (HMM). It was trained and tested in cross validation on 59 well resolved membrane proteins, available when the method was implemented. ENSEMBLE scores with a per-protein accuracy of 90% for topography and 71% for topology. When tested on a low resolution set of 151 proteins, with no homology with the 59 proteins, the per-protein accuracy of ENSEMBLE is 76% for topography and 68% for topology.
Disulfide bond prediction
The propensity of cysteine residues to form disulfide bonds was predicted with a Hidden Neural Network-based method, starting from the residue sequence of the protein chain. The method reaches accuracies as high as 89% per cysteine residue and 86% per protein, and in this respect is superior to other predictors of the same category.
We thank M.V. Barone for useful discussions. This work was supported by CISI "Comune di Milano", MIUR: "Functional genomics", "Bioinformatics for Genome and Proteome" and Laboratory of Interdisciplinary Technologies in Bioinformatics (LITBIO) RBLA0332RH, FIRB projects, MIUR Grant 12/2000 to CEINGE.
1. Cohen P: The role of protein phosphorylation in human health and disease. The Sir Hans Krebs Medal Lecture. Eur J Biochem 2001, 268(19):5001–5010. 10.1046/j.0014-2956.2001.02473.x
2. Hanks SK: Genomic analysis of the eukaryotic protein kinase superfamily: a perspective. Genome Biol 2003, 4(5):111. 10.1186/gb-2003-4-5-111
3. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: The Universal Protein Resource (UniProt). Nucleic Acids Res 2005, 33(Database):D154–9. 10.1093/nar/gki070
4. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, Copley R, Courcelle E, Das U, Durbin R, Fleischmann W, Gough J, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McDowall J, Mitchell A, Nikolskaya AN, Orchard S, Pagni M, Ponting CP, Quevillon E, Selengut J, Sigrist CJ, Silventoinen V, Studholme DJ, Vaughan R, Wu CH: InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33(Database):D201–5. 10.1093/nar/gki106
5. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinsci F, London D, Longden I, McVicker G, Melsopp C, Meidl P, Potter S, Proctor G, Rae M, Rios D, Schuster M, Searle S, Severin J, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Birney E: Ensembl 2005. Nucleic Acids Res 2005, 33(Database):D447–53. 10.1093/nar/gki138
6. Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S: The protein kinase complement of the human genome. Science 2002, 298(5600):1912–34. 10.1126/science.1075762
7. Caenepeel S, Charydczak G, Sudarsanam S, Hunter T, Manning G: The mouse kinome: Discovery and comparative genomics of all mouse protein kinases. PNAS 2004, 101: 11707–11712. 10.1073/pnas.0306880101
8. Krupa A, Srinivasan N: The repertoire of protein kinases encoded in the draft version of the human genome: atypical variations and uncommon domain combinations. Genome Biol 2002, 3(12):RESEARCH0066. 10.1186/gb-2002-3-12-research0066
9. Fariselli P, Finocchiaro G, Casadio R: SPEPlip: the detection of signal peptide and lipoprotein cleavage sites. Bioinformatics 2003, 19(18):2498–9. 10.1093/bioinformatics/btg360
10. Martelli PL, Fariselli P, Casadio R: An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins. Bioinformatics 2003, 19(Suppl 1):i205-i211. 10.1093/bioinformatics/btg1027
11. Boccia A, Petrillo M, di Bernardo D, Guffanti A, Mignone F, Confalonieri S, Luzi L, Pesole G, Paolella G, Ballabio A, Banfi S: DG-CST (Disease Gene Conserved Sequence Tags), a database of human-mouse conserved elements associated to disease genes. Nucleic Acids Res 2005, 33(Database):D505–10. 10.1093/nar/gki011
12. Johnson DE, Lu J, Chen E, Werner S, Williams LT: The human fibroblast growth factor receptor genes: a common structural arrangement underlies the mechanism for generating receptor forms that differ in their third immunoglobulin domain. Mol Cell Biol 1991, 11(9):4627–4634.
13. Werner S, Duan DS, de Vries C, Peters KG, Johnson DE, Williams LT: Differential splicing in the extracellular region of fibroblast growth factor receptor 1 generates receptor variants with different ligand-binding specificities. Mol Cell Biol 1992, 12(1):82–8.
14. Beer HD, Vindevoghel L, Gait MJ, Revest JM, Duan DR, Mason I, Dickson C, Werner S: Fibroblast growth factor (FGF) receptor 1-IIIb is a naturally occurring functional receptor for FGFs that is preferentially expressed in the skin and the brain. J Biol Chem 2000, 275(21):16091–7. 10.1074/jbc.275.21.16091
15. Sassa T, Gomi H, Itohara S: Postnatal expression of Cdkl2 in mouse brain revealed by LacZ inserted into the Cdkl2 locus. Cell Tissue Res 2004, 315(2):147–56. Epub 2003 Nov 7. 10.1007/s00441-003-0828-8
16. Nezu J, Oku A, Jones MH, Shimane M: Identification of two novel human putative serine/threonine kinases, VRK1 and VRK2, with structural similarity to vaccinia virus B1R kinase. Genomics 1997, 45(2):327–31. 10.1006/geno.1997.4938
17. Vega FM, Sevilla A, Lazo PA: p53 Stabilization and accumulation induced by human vaccinia-related kinase 1. Mol Cell Biol 2004, 24(23):10366–80. 10.1128/MCB.24.23.10366-10380.2004
18. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268: 78–94. 10.1006/jmbi.1997.0951
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.