Systematic analysis of human kinase genes: a large number of genes and alternative splicing events result in functional and structural diversity

Background Protein kinases are a well-defined family of proteins, characterized by the presence of a common kinase catalytic domain and playing a significant role in many important cellular processes, such as proliferation, maintenance of cell shape and apoptosis. In many members of the family, additional non-kinase domains contribute further specialization, affecting subcellular localization, protein binding and regulation of activity, among others. About 500 genes encode members of the kinase family in the human genome; although many of them are well-known genes, a larger number code for proteins identified more recently, or for unknown proteins recognized as kinases only through computational studies. Results A systematic in silico study of the human genome led to the identification of 5 genes, on chromosomes 1, 11, 13, 15 and 16 respectively, and 1 pseudogene on chromosome X; some of these genes are reported as kinases by NCBI but are absent from other databases, such as KinBase. Comparative analysis of 483 gene regions and subsequent computational analysis, aimed at identifying unannotated exons, indicate that a large number of kinase genes may code for alternatively spliced forms or be incorrectly annotated. An automated InterProScan analysis was performed to study domain distribution and combination in the various families. Other structural features were also added to the annotation process, including the putative presence of transmembrane alpha helices and the propensity of cysteine residues to participate in disulfide bridges. Conclusion The predicted human kinome was extended by identifying both additional genes and potential splice variants, resulting in a varied panorama where functionality may be searched at the gene and protein level. 
Structural analysis of kinase protein domains, as defined in multiple sources, together with prediction of transmembrane alpha helices and signal peptides, provides hints for function assignment. The results of the human kinome analysis are collected in the KinWeb database, available for browsing and searching over the internet, where all results from the comparative analysis and the gene structure annotation are made available alongside the domain information. Kinases may be searched by domain combinations and the corresponding genes may be viewed in a graphic browser at various levels of magnification, up to gene organization on the full chromosome set.


Background
Eukaryotic protein kinases (ePKs) are important players in virtually every signalling pathway involved in normal development and disease: the transduction, amplification and integration of many intracellular and intercellular processes require specific regulation, often achieved by protein phosphorylation [1]. Most ePKs belong to a single superfamily, characterized by a contiguous stretch of approximately 250 amino acids that constitutes the catalytic domain (ePK domain) [2]. A much smaller number of protein kinases do not share this catalytic domain with other kinases, and are often collectively called atypical kinases. The availability of complete sequences for human and other vertebrate genomes stimulated computational searches of the whole sequence to identify additional unknown protein kinases: Swissprot, Uniprot, ENSEMBL [3][4][5] and other commonly used databases all annotate different numbers of kinase proteins or genes. Manning et al. [6], in a systematic attempt to establish the full set of human kinases (kinome), identified 478 ePKs and 106 kinase pseudogenes in the human genome, in addition to 40 atypical protein kinases lacking sequence similarity in the ePK domain. This study was more recently extended to cover additional species [7]. In many cases the ability of protein kinases to regulate biological events depends on the presence, along with the kinase domain, of other functional domains involved in regulation, interaction with other protein partners or subcellular localization [8]. These non-catalytic domains extend the already wide diversification of these proteins and, at the same time, help to explain the high degree of this functional diversification, suggesting alternative targets for structural and functional analysis.
In this article we try to expand the kinase superfamily by identifying, via comparative genomics analysis, novel protein kinases either encoded in previously unidentified genes or generated from known genes via alternative processing steps. We also try to characterize the gene products by protein sequence analysis, using different tools for structure/domain prediction, including some machine learning-based methods specifically developed [9,10] to predict transmembrane alpha helices and the disulfide bond propensity of cysteine residues. The data are collected in a database, where all the information from the present study may be publicly accessed.

Results and discussion
The human kinase gene set
The exact number of protein kinases present in the human genome is still debatable. The attempt of Manning et al. to identify the full set of human kinases, by a combination of computational and laboratory findings, showed that the number is much lower than originally expected and close to 500; the exact number is still subject to change, following refinements of the genome sequence and the introduction of better prediction tools. The INTERPRO ID IPR000719, corresponding to the domain name "Protein kinase" [8], finds 500 human genes in ENSEMBL, 396 in the UNIPROT-Swissprot section and 1268 in the full Uniprot; a set of 452 unique entries in ENSEMBL refer to the gene ontology entries for protein kinase activity or more specific definitions. KinBase, currently the most comprehensive pool of human kinases, lists 518 kinases (478 typical ePKs and 40 atypical aPKs) together with 106 pseudogenes. While the present analysis was being performed, Caenepeel et al. [7] published the mouse kinome, identifying 510 mouse genes as orthologs of the 518 human KinBase kinases; as for human kinases, many mouse genes were not mapped or annotated in ENSEMBL. The KinBase collection, the most accurate set available to date, was used as the starting data set for the present work: a PSSM obtained by PSI-BLAST analysis of this set, using a 90 amino acid input sequence from the catalytic domain, was used to search for kinases in the full human assembly. This workflow, described under "Methods", led to the identification of 5 genes, on chromosomes 1, 11, 13, 15 and 16 respectively, and 1 pseudogene on chromosome X. One of them (PIK4CB) is annotated as a kinase and the other four as predicted genes in NCBI (Table 1). Moreover, 7 pseudogenes and the 'KIS' kinase gene showed a chromosomal localization different from the KinBase annotation.

Kinase gene analysis
Starting from the new pool of kinase genes, including the 518 from Manning and 5 new ones from our analysis, we proceeded to a comparative analysis of human and mouse orthologous kinase genes, aimed at identifying small genomic conserved sequence tags (CSTs). As functional sequences tend to evolve at a slower rate than neutral sequences, information about conserved regions in these genes, obtained from alignment of DNA sequences from different species such as human and mouse, provides a useful opportunity to extend knowledge about coding sequences, alternatively spliced exons and other regulatory elements. The analysis was performed with a tool previously developed in our lab for the identification of CSTs in disease genes [11]. Briefly, given a list of target genes, human genomic regions spanning the selected transcripts are identified and compared to their mouse counterparts, as defined by ENSEMBL orthology definitions; when target genes are very close, they are included in the same genomic region. Stretches longer than 100 bp and sharing a minimum of 70% identity are selected as CSTs and assigned to the nearest gene. As ENSEMBL gene codes were not available for all human kinases, gene codes for the kinase gene set were determined according to gene name or contig position and manually verified. From the initial set, 483 genes could be analyzed: human and mouse sequences, corresponding to genomic regions spanning the selected kinase transcripts, were extracted from ENSEMBL, masked for repeats and compared for the identification of CSTs. The comparison led to the identification of about 35000 human (and mouse) CSTs, which together cover about 8% of the selected human regions. Kinase CSTs are mostly unique and half of them correspond to the expected exons of genes contained in the analyzed regions. A summary of the CSTs grouped by gene type is reported in Fig. 1a.
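The selection rule stated above (longer than 100 bp, at least 70% identity) can be illustrated with a minimal Python sketch. It is not the tool of ref. [11]: the `AlignedBlock` class, the function names, and the choices to measure length in alignment columns and to count only identical A/C/G/T columns as matches are all assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 100   # "longer than 100 bp" (thresholds taken from the text)
MIN_IDENTITY = 0.70   # "a minimum of 70% identity"


@dataclass
class AlignedBlock:
    """Hypothetical container for one gapped human/mouse alignment block."""
    human_seq: str  # aligned human sequence, '-' marks a gap
    mouse_seq: str  # aligned mouse sequence, same length as human_seq

    def __post_init__(self):
        if len(self.human_seq) != len(self.mouse_seq):
            raise ValueError("aligned sequences must have equal length")


def identity(block: AlignedBlock) -> float:
    """Fraction of alignment columns holding the same A/C/G/T base."""
    columns = len(block.human_seq)
    if columns == 0:
        return 0.0
    pairs = zip(block.human_seq.upper(), block.mouse_seq.upper())
    matches = sum(h == m and h in "ACGT" for h, m in pairs)
    return matches / columns


def is_cst(block: AlignedBlock) -> bool:
    """Apply both thresholds; length is counted in alignment columns (assumption)."""
    return len(block.human_seq) > MIN_LENGTH_BP and identity(block) >= MIN_IDENTITY


def select_csts(blocks):
    """Keep only the blocks that satisfy the CST thresholds."""
    return [b for b in blocks if is_cst(b)]
```

Masked or ambiguous positions (for example N) never count as matches in this sketch, so heavily masked blocks fall below the identity threshold.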
The CST collection underwent several annotation steps involving:
• basic features concerning sequence content and similarity with the orthologous counterpart, such as GC%, number of gaps, length, sequence identity and polarity;
• genomic localization, according to ENSEMBL data, concerning chromosomal coordinates, relationship with the target and the closest gene (i.e. distances from transcription and coding start and from transcription and coding end, according to gene orientation), sequence gene type (i.e. intergenic, intronic, exonic) and number of known SNPs;
• predicted functional features, concerning the identification of motifs and putative signals, such as transcription factor binding sites, exonic splicing enhancers (ESE), RNA secondary structures, palindromes and tandem repeats.
Many CSTs reported in Fig. 1a as intergenic or intronic (i.e. non-exonic) could of course represent additional, previously unidentified exons, either constitutive or alternatively spliced, and their identification would lead to further expansion of the available kinome. Several computational tests were therefore directed at determining the coding potential, including identification of long open reading frames, calculation of codon frequency and periodicity, statistics on synonymous codon usage, and coincidence with exons and suboptimal exons of genes predicted by running GENSCAN on the selected human genomic regions. In addition, human and mouse EST collections were scanned by BLAST for similarity to human and mouse CSTs, and genomic annotations were compared to highlight differences between human and mouse, such as human CSTs annotated as intergenic or intronic whose mouse counterpart is annotated as exonic. The last four tests, i.e. coincidence with GENSCAN exons, matches with human or mouse ESTs and annotation as exon in mouse, are, alone or in combination, particularly convenient criteria to assess exon potential: about a third of the CSTs initially marked as intronic or intergenic were positive for at least one of these criteria and should be considered "exon-like" (Fig. 1b). Concomitant positivity for more than one criterion is often observed and allows sequences to be ranked in classes of higher potential to be unannotated constitutive or alternatively spliced exons (Fig. 1c). The number of kinase genes containing such CSTs is large, and these CSTs are likely to represent unannotated constitutive exons or to code for alternatively spliced isoforms: about half of the genes contain at least one region positive for two or more of the four criteria mentioned above (Fig. 1d).
Manual analysis of some exon-like CSTs identified with the procedure described above revealed potential kinase isoforms that could be validated against information currently available in the literature. The human gene encoding fibroblast growth factor receptor 1 (FGFR1) includes an unannotated exon which, if inserted into the mRNA, produces an FGFR1-IIIb form not currently annotated in the databases used. This exon is consistent with the common gene organization seen, both in human and mouse, in the genes coding for other FGFR isoforms, which underlies the mechanism for generating receptors differing in their third immunoglobulin domain: three different exons (IIIa, IIIb and IIIc) encode possible alternatives within the third Ig-like domain at the N-terminal end of the molecule (Fig. 2a). A literature search provided experimental evidence supporting the existence of FGFR1-IIIa and IIIc transcript variants in both human and mouse, and of a cDNA containing the IIIb exon in mouse. A recombinant protein containing the human IIIb sequence is active; RNase protection and PCR experiments confirm the existence of the mRNA encoding FGFR1-IIIb in mouse and human brain [12][13][14].
Two CSTs identified in CDKL2 (cyclin-dependent kinase-like 2), a member of the cdc2-related serine/threonine kinase subfamily, contain putative exons inserted into the 3' end of the coding region (Fig. 2b). The resulting protein differs from CDKL2 in its C-terminal domain, where the length of a random coil region located at the carboxy end of the protein is increased; previous analysis of mouse cDNA clones already revealed multiple variants generated by alternative splicing events in this region. These exons are not described in the numerous species where the protein has been reported, with the only exception of rabbit, where a similar sequence from deep cerebellar nuclei is described as the only form available [15]. Despite their similarity with cdc2, most members of this family have roles other than cell cycle regulation and are expressed in terminally differentiated cells of the nervous system. Consistently, a human EST ending in a poly(A) tail confirms their inclusion in human brain RNA and provides evidence for a brain-specific form of the protein.
Similarly, VRK1 (vaccinia-related kinase 1) shows an extra exon contained within a CST, suggestive of an alternative splicing event localized at the 3' end of the VRK1 gene and affecting the low complexity domain at the C-terminus of the protein (Fig. 2c). The protein, a member of a new group of human serine/threonine kinases and known to prevent p53 ubiquitination via phosphorylation of Thr-18 [16,17], has only one isoform according to the SWISSPROT protein database, while five alternative products are reported as predicted for the murine protein.
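One of the coding-potential tests mentioned above, the search for long open reading frames, can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used: here an open reading frame is taken to be a stop-free run of codons (no start codon is required), all six frames are scanned, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string; non-ACGT symbols are left unchanged."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_stop_free_run(seq: str, frame: int) -> int:
    """Length, in codons, of the longest stop-free run in one reading frame (0, 1 or 2)."""
    seq = seq.upper()
    best = current = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            current = 0          # a stop codon closes the current run
        else:
            current += 1
            best = max(best, current)
    return best


def longest_orf_codons(seq: str) -> int:
    """Longest stop-free run over the three forward and three reverse frames."""
    strands = (seq.upper(), reverse_complement(seq))
    return max(longest_stop_free_run(s, f) for s in strands for f in range(3))
```

A CST whose longest stop-free run spans most of its length would be a stronger exon candidate than one interrupted by stops in every frame; any cut-off for that comparison would have to be chosen separately.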

Structure-function relationship in kinase proteins
The human kinase genes, in addition to the catalytic domain, contain several other domains involved in various regulation processes. In order to evaluate the domain distribution in all human kinases, an automated procedure was developed, based on an InterProScan analysis of the protein kinase sequences. The procedure led to the identification of 20 families, 91 different domains, 12 [...] (Fig. 3a). The domains found are reported in Fig. 3b, sorted by frequency; domains appearing fewer than 6 times are not reported in the figure. A majority of the human protein kinases contain at least one domain other than the catalytic kinase domain. Many domains are useful for function assignment: 60 kinases [...] domain tends to maintain similar lengths, on average 20 amino acids long. The large number and diversity of non-kinase domains contained within protein kinases is directly related to the high degree of functional diversification, which greatly depends on their ability to interact with a large number of other cellular proteins, mainly via additional subunits or domains. The concomitant presence of additional, non-catalytic domains may thus lead to the understanding of possible interacting partners and of the intracellular pathways recruited, ultimately hinting at a specific cell function.
[Figure caption: Some kinase genes containing extra exons]
[Figure caption: The KinWeb database]
Further information in our data mining system comes from an exhaustive prediction of transmembrane domains and other structural features. This information helps to better understand the connection between structure and function of known proteins, and also permits hypotheses to be formulated about the role and subcellular localization of novel proteins. Filtering the kinase sequence set with machine learning-based methods, specifically suited to predict signal peptides, transmembrane protein domains of the alpha-helical type and the propensity of cysteine residues to form disulfide bridges, allowed these characteristics to be annotated on a predictive basis. The results are shown in Fig. 3c. We found that 13.5% of the kinase sequences are endowed with signal peptides, suggesting that these proteins may be secreted via the SEC-dependent secretory pathway; 40.9% are endowed with at least one transmembrane domain different from the signal peptide, a number substantially higher than the 18.4% annotated as containing a TM domain in the Swissprot database; 15% of the kinases are endowed with at least one disulfide bridge.

KinWeb database
All the results produced by the analysis have been integrated with the information about kinase genes derived from public databanks into a new database, KinWeb, which is available as a public access site at the following addresses: http://www.itb.cnr.it/kinweb and http://kinweb.ceinge.unina.it. The human kinome may be accessed through a graphic genome browser and investigated at the genomic level, starting from kinase gene locations and, progressively adding detail, down to the level of gene structure and corresponding CSTs. Kinase genes may be searched on the basis of structural features, such as domain combinations, and of various annotations, e.g. gene name or kinase group (Fig. 4). It is also possible to use BLAST for similarity searches between a given sequence and kinase proteins or cDNAs. A sequence may also be compared through HMMER with the full set of available catalytic domains.
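A search by domain combination, as described above, amounts to finding the genes annotated with every domain in a requested set. The sketch below shows one way to express this in SQL from Python. It does not reproduce the KinWeb schema: the `kinase_domain` table, its columns and the gene names `GENE_A` to `GENE_C` are placeholders, and an in-memory SQLite database stands in for the MySQL database used by the site.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory table of (gene, domain) pairs with placeholder data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kinase_domain (gene TEXT, domain TEXT)")
    rows = [
        ("GENE_A", "Protein kinase"), ("GENE_A", "SH2"), ("GENE_A", "SH3"),
        ("GENE_B", "Protein kinase"), ("GENE_B", "SH2"),
        ("GENE_C", "Protein kinase"),
    ]
    conn.executemany("INSERT INTO kinase_domain VALUES (?, ?)", rows)
    return conn


def genes_with_domains(conn: sqlite3.Connection, domains) -> list:
    """Return genes annotated with ALL of the requested domains, sorted by name."""
    wanted = sorted(set(domains))
    if not wanted:
        return []
    placeholders = ",".join("?" for _ in wanted)
    sql = (
        f"SELECT gene FROM kinase_domain WHERE domain IN ({placeholders}) "
        "GROUP BY gene HAVING COUNT(DISTINCT domain) = ? ORDER BY gene"
    )
    # A gene qualifies only if it matches as many distinct domains as were requested.
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]
```

With the placeholder data, `genes_with_domains(build_demo_db(), ["SH2", "SH3"])` returns only `GENE_A`, the single gene carrying both domains.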
The information stored for each kinase gene consists of annotations automatically extracted from public databanks or literature, such as:
• alternative names as defined in the HUGO database;
• family classification according to Manning;
• transcript variants and genomic contig names and coordinates from RefSeq;
• functional annotations from Gene-Entrez;
• information about transcripts, exons and genomic coordinates from Ensembl;
• direct links to RefSeq, Gene-Entrez, OMIM, Ensembl and SwissProt databases, together with the KinBase protein and mRNA sequences.
These annotations are stored alongside the results from the present analysis on kinase genes and proteins. The available data include:
• type and position of detected domains;
• predictions for secondary structure, transmembrane domains and propensity of cysteine residues to form disulfide bridges;
• mouse orthologous kinase genes;
• CSTs common to human and mouse;
• all CST annotations.
The CST elements and their complete annotations are associated with the corresponding gene, but may also be seen, with the help of a graphic browser, in their chromosomal context and in relation to the exons of the gene transcripts; a color code is used to label CSTs according to the various annotations, including the number and type of BLAST matches found (Fig. 5). A link leads from each human kinase gene to the orthologous mouse gene, for which information on structure and CSTs, also stored within the database and accessible through the graphic browser, is available. Altogether the data provide an exhaustive analysis of various aspects of gene and protein features for each kinase, integrating data from the literature and other databases with information about gene organization, sequence conservation and protein structural predictions obtained within the present analysis.

Conclusion
The predicted human kinome was extended by identifying kinase genes through a custom-built pipeline and by identifying, through comparative analysis, a large number of non-exonic, apparently non-coding, highly conserved sequences. Some of these conserved sequences were annotated as exon-like and may be responsible for additional protein variability through alternative processing; others may play different roles, for example contributing to the regulation of gene expression. Domain analysis and prediction of structural features provide further information, resulting in a varied panorama where functionality may be searched at the gene or protein level. All results from the comparative analysis and the gene structure annotation are available alongside the domain information in the KinWeb database, which can be browsed and searched over the internet and where it is possible to search for kinases by domain combinations and to visualize the corresponding genes, including the annotation of conserved sequences. A graphic browser is used to view kinase genes at various levels of magnification, from single exons up to gene organization on the full chromosome set.

Kinase gene identification
A contiguous stretch of approximately 90 amino acids, containing the well-known "DxxxxN, DFG, APE, DxxxxG" motif, was extracted from an arbitrarily chosen kinase, ABL, and used as input to a three-iteration PSI-BLAST search of a database containing the whole kinase dataset identified by Manning and coworkers [6]. The resulting Position Specific Score Matrix (PSSM) was used as a query to perform tBLASTn against all human chromosome sequences available from NCBI, release April 14 2003; human sequences had previously been masked to remove sequences coding for kinase genes contained in the starting set. Sequence regions matching the PSSM were extracted and extended 200 kb upstream and downstream, and full-length gene prediction was run on the resulting genomic region with GenomeScan (http://genes.mit.edu/). This software allows prediction of genes on the basis of an input protein expected to be similar to the gene product encoded in the DNA sequence. We found such proteins by a BLASTX comparison of our sequences to all known proteins.
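The region-extension step described above (200 kb added on each side of a PSSM match before gene prediction) can be sketched as follows. The function names and the 0-based half-open coordinates are illustrative choices, and merging overlapping windows into a single region is an assumption not stated in the text.

```python
FLANK_BP = 200_000  # extension on each side, as stated in the text


def extend_hit(start: int, end: int, chrom_length: int, flank: int = FLANK_BP):
    """Extend a hit [start, end) by `flank` bp on both sides, clipped to the chromosome."""
    if not (0 <= start < end <= chrom_length):
        raise ValueError("hit coordinates must lie within the chromosome")
    return max(0, start - flank), min(chrom_length, end + flank)


def merge_windows(windows):
    """Merge overlapping or touching windows (assumption: nearby hits share one region)."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: grow it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, a hit at 50,000 to 60,000 on a 1 Mb sequence is extended to the window 0 to 260,000, since the upstream flank is clipped at the start of the sequence.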

Kinase domain identification
For each human hit, all the features (gene name, alternative names, classification) were stored in a table of a relational database, along with the protein, mRNA and kinase catalytic domain sequences. Pseudogenes were manually curated and inserted into a distinct relational database. All putative kinases were then analyzed with InterProScan for complete domain annotation. InterProScan [4] is freely available under the GNU licence agreement from the EBI ftp server (ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/). The output, generated in XML format, is parsed by a Perl script in order to extract all the annotations, which are recorded in a MySQL relational database that can be consulted through interfaces written in PHP and visualized with common browsers over the internet.

CST identification
Human/mouse orthologous regions corresponding to kinase genes were taken from the ENSEMBL annotation, when available, or identified manually on the basis of sequence conservation. The limits of the genomic sequences were set between 20 kb and 250 kb, depending on the distance of the closest known gene. Species-specific repeats were masked and BLASTZ was used for comparison. The final set of about 33000 CSTs was selected according to the parameters described in the "Results" section.

CST annotation
Annotation of human and murine CSTs was carried out through a pipeline formed of several independent modules. The pipeline is based on PHP scripts and includes: classification of CST type according to Ensembl gene definitions, coding capability according to Ensembl exon definitions, GC content, and distances from the analysed and closest genes and coding regions.
A number of programs were run on the whole CST set to annotate specific features: equicktandem and palindrome were used to identify direct and inverted repeats; marscan to annotate MAR sites; tcode, syco and getorf to assess coding potential; GeneSplicer (http://cbcb.umd.edu/software/GeneSplicer/) to detect splice sites; GENSCAN for ab initio prediction of transcripts and suboptimal exons. equicktandem, palindrome, marscan, tcode, syco and getorf are EMBOSS applications (http://emboss.sourceforge.net).
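Two of the basic annotations listed above, GC content and CST type, are simple enough to sketch in Python. The actual pipeline is written in PHP and works from Ensembl definitions, so the code below is only an illustration: intervals are 0-based and half-open, any overlap with an exon is treated as "exonic", and ambiguous bases are ignored when computing GC%; all three choices are assumptions.

```python
def gc_percent(seq: str) -> float:
    """GC content as a percentage of unambiguous bases (N and gaps are ignored)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def classify_cst(cst, gene_span, exons) -> str:
    """Label a CST as 'exonic', 'intronic' or 'intergenic' relative to one gene.

    cst and gene_span are (start, end) tuples; exons is a list of (start, end) tuples.
    """
    if any(overlaps(*cst, *exon) for exon in exons):
        return "exonic"
    if overlaps(*cst, *gene_span):
        return "intronic"
    return "intergenic"
```

In a fuller version the classification would be repeated against every gene in the region, with the CST assigned to the nearest one.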

BLAST searches
BLAST was run on all CSTs to annotate matches within the human and mouse genomes: matches with a score higher than 50 or an E-value better than 10^-5 were kept as annotations in the database. Similarly, results of BLAST runs of all CSTs against human and mouse EST libraries with an E-value better than 10^-20 and with length and identity higher than 30 and 90 respectively were annotated.
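The two filters described above can be written as simple predicates. The sketch below assumes that hits have already been parsed into a small record; the `Hit` class and its field names are hypothetical, and reading the EST thresholds as an alignment length above 30 positions and an identity above 90% is an interpretation of the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical parsed BLAST hit."""
    score: float      # alignment score
    evalue: float     # expectation value
    length: int       # alignment length
    identity: float   # percent identity (0-100)


def keep_genomic_hit(hit: Hit) -> bool:
    """Genome matches: score higher than 50 OR E-value better than 1e-5."""
    return hit.score > 50 or hit.evalue < 1e-5


def keep_est_hit(hit: Hit) -> bool:
    """EST matches: E-value better than 1e-20, length above 30 and identity above 90."""
    return hit.evalue < 1e-20 and hit.length > 30 and hit.identity > 90
```

Note the different logic of the two filters: the genomic filter accepts a hit that passes either threshold, while the EST filter requires all three conditions at once.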

Exon-like definition
The CSTs are annotated as exon-like when one or more of the following conditions are met:
- the mouse counterpart is annotated as exonic
- the CST matches GENSCAN exons or suboptimal exons
- the CST matches one or more human ESTs
- the mouse counterpart matches one or more rodent ESTs
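The definition above, together with the ranking by number of positive criteria described in the Results, can be expressed as a short Python sketch. The `CstEvidence` record and the function names are hypothetical; each boolean is assumed to have been computed beforehand by the corresponding annotation step.

```python
from dataclasses import dataclass


@dataclass
class CstEvidence:
    """One boolean flag per exon-like criterion (hypothetical record)."""
    mouse_exonic: bool   # mouse counterpart annotated as exonic
    genscan_exon: bool   # matches a GENSCAN exon or suboptimal exon
    human_est: bool      # matches one or more human ESTs
    rodent_est: bool     # mouse counterpart matches one or more rodent ESTs


def evidence_count(e: CstEvidence) -> int:
    """Number of criteria that are positive (0 to 4)."""
    return sum((e.mouse_exonic, e.genscan_exon, e.human_est, e.rodent_est))


def is_exon_like(e: CstEvidence) -> bool:
    """A CST is exon-like when at least one criterion is positive."""
    return evidence_count(e) >= 1


def rank_by_evidence(csts: dict) -> list:
    """Names of exon-like CSTs, most supported first (ties broken by name)."""
    exon_like = [name for name, e in csts.items() if is_exon_like(e)]
    return sorted(exon_like, key=lambda name: (-evidence_count(csts[name]), name))
```

CSTs positive for two or more criteria would come first in the ranked list, matching the idea of classes of higher exon potential.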