KinOrtho: a method for mapping human kinase orthologs across the tree of life and illuminating understudied kinases

Background Protein kinases are one of the largest druggable families of signaling proteins and are involved in various human diseases, including cancers and neurodegenerative disorders. Despite their clinical relevance, nearly 30% of the 545 human protein kinases remain highly understudied. Comparative genomics is a powerful approach for predicting and investigating the functions of understudied kinases. However, incomplete knowledge of kinase orthologs across fully sequenced kinomes severely limits the application of comparative genomics approaches for illuminating understudied kinases. Here, we introduce KinOrtho, a query- and graph-based orthology inference method that combines full-length and domain-based approaches to map one-to-one kinase orthologs across 17 thousand species.
Results Using multiple metrics, we show that KinOrtho performed better than existing methods in identifying kinase orthologs across evolutionarily divergent species and eliminated potential false positives by flagging sequences without a proper kinase domain for further evaluation. We demonstrate the advantage of using domain-based approaches for identifying domain fusion events, highlighting a case involving the understudied serine/threonine kinase TAOK1 and the metabolic kinase PIK3C2A, which show high co-expression in human cells. We also identify evolutionary fission events involving the understudied OBSCN kinase domains, further highlighting the value of domain-based orthology inference approaches. Using KinOrtho-defined orthologs, Gene Ontology annotations, and machine learning, we propose putative biological functions of several understudied kinases, including the role of TP53RK in cell cycle checkpoint(s), the involvement of TSSK3 and TSSK6 in acrosomal vesicle localization, and potential functions for the ULK4 pseudokinase in neuronal development.
Conclusions In sum, KinOrtho is a novel query-based tool for identifying one-to-one orthologous relationships across thousands of proteomes. We exploit KinOrtho here to identify kinase orthologs and show that its well-curated kinome ortholog set can serve as a valuable resource for illuminating understudied kinases. The KinOrtho framework can be extended to any protein family of interest.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04358-3.


Supplementary Results and Discussion
Unique KinOrtho-defined human kinase orthologs
We mapped the comparison results onto the human kinome tree to visualize orthology detection trends across human kinases. We calculated an overlap ratio from the number of one-to-one human kinase orthologs identified by both KinOrtho and the other state-of-the-art orthology inference methods (the overlapping set; Figure S2). The average overlap ratio for each human kinase is shown in Figure S3, and the ten kinases with the least overlap are labeled. The orthologs of these ten human kinases, such as the understudied kinases "Cyclin-dependent kinase 19" (CDK19; synonym: CDK11), CDK11B (synonym: PITSLRE), and CDK3, were mainly identified by KinOrtho alone (average overlap ratios ranged from 0.10 to 0.35). Notably, from the 78 reference proteomes, KinOrtho uniquely identified six CDK19, three CDK3, and one CDK17 ortholog. To validate these ten KinOrtho-unique CDK orthologs, we built a CDK family tree using these ten sequences and the CDK orthologs identified by at least 90% of the compared methods (Figure S4). All ten clustered in the correct subfamilies. Of the six KinOrtho-unique CDK19 orthologs, two were the outgroup of CDK8 orthologs, and four were the outgroup of both CDK8 and CDK19 orthologs within the CDK8 subfamily. The CDK17 ortholog uniquely identified by KinOrtho was the outgroup of all CDK16, CDK17, and CDK18 orthologs within the PCTAIRE subfamily. Although two of the KinOrtho-unique CDK3 orthologs are annotated as "CDK1" in C. albicans and baker's yeast, the phylogenetic tree placed them closer to either CDK3 or CDK2 than to CDK1. The phylogenetic analysis thus provides additional support for these KinOrtho-unique orthologs. Although other orthology inference methods also place these orthologs in orthologous groups, experimental characterization of these subfamilies will be needed to establish their true orthology.
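To make the overlap ratio concrete, the following minimal sketch computes an average overlap ratio for one human kinase. The exact formula is not stated in the text, so the definition used here (the fraction of KinOrtho orthologs that another method also reports, averaged over methods) is an assumption, and the method names and ortholog identifiers are hypothetical.

```python
from statistics import mean

def overlap_ratio(kinortho, other):
    """Fraction of KinOrtho orthologs also reported by another method.

    ASSUMPTION: this denominator (the KinOrtho set) is illustrative only;
    the source does not give the exact definition.
    """
    if not kinortho:
        return 0.0
    return len(kinortho & other) / len(kinortho)

def average_overlap_ratio(kinortho, others):
    """Mean overlap ratio of the KinOrtho set against each compared method."""
    return mean(overlap_ratio(kinortho, calls) for calls in others.values())

# Hypothetical one-to-one ortholog calls for a single human kinase.
kinortho_calls = {"sp1|A", "sp2|B", "sp3|C", "sp4|D"}
other_methods = {
    "method_1": {"sp1|A", "sp2|B"},  # shares 2 of 4 orthologs
    "method_2": {"sp1|A"},           # shares 1 of 4 orthologs
    "method_3": set(),               # shares none
}

avg = average_overlap_ratio(kinortho_calls, other_methods)
print(f"average overlap ratio = {avg:.2f}")
```

Under this assumed definition, a low average means that most orthologs of the kinase were reported by KinOrtho alone.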
KinOrtho distinguishes orthologs based on the protein kinase domain from orthologs based on other conserved domains
Unlike most traditional orthology detection methods, KinOrtho combines orthologs based on full-length sequences with orthologs found using only the kinase domains. This allows KinOrtho to distinguish orthologs at the protein domain level and to select orthologs that share a closer functional relationship based on the domain of interest, here the kinase domain. It also helps identify orthologs that stem from regions other than the kinase domain; such sequences may share no kinase function and can lead to erroneous ortholog assignments. Figure S6a illustrates an example of such a relationship: for Peripheral plasma membrane protein CASK (CASK), orthology is inferred from matches to other functional domains defined by Pfam [1] rather than to the kinase domain. Nearly 20% of human kinase orthologs were inferred based on similarity in non-kinase domains (Figure S6b). KinOrtho effectively flagged such relationships, thus preventing misleading inferences of function and regulation. Within the human kinome, we found a higher ratio of orthologs missing the kinase domain in the calcium and calmodulin-regulated kinase (CAMK) group (6 out of the top 10 kinases; Figure S6c). This could be attributed to the length of the protein sequences and the number of functional domains. The sequence lengths of four giant CAMK kinases, Kalirin (KALRN; synonym: Trad), Triple functional domain protein (TRIO), Obscurin (OBSCN), and Titin (TTN), range from 2,986 to 34,350 amino acids, with 12, 13, 57, and 310 functional domains, respectively. On average, human protein kinases longer than 750 amino acids have a higher ratio of full-length orthologs missing the kinase domain than shorter kinases (ratio = 0.043 vs. 0.004; p-value = 3.0e-8, Wilcoxon rank-sum test).
In addition, human protein kinases with more than one functional domain have a higher ratio than those with a single functional domain (ratio = 0.036 vs. 0.002; p-value = 2.9e-11, Wilcoxon rank-sum test). For such multi-domain sequences, KinOrtho offers the flexibility to identify orthologs based on the domain of interest.
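The group comparisons above rely on the Wilcoxon rank-sum test. As a minimal sketch of how such a comparison works, the code below implements the two-sided test with the normal approximation and without tie or continuity correction. The choice of this variant is an assumption (the text does not state which implementation was used), and the per-kinase ratios are hypothetical values, not the study data.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    Returns (z, p), where z > 0 means the values in x tend to rank higher.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                         # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2             # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return z, p

# HYPOTHETICAL per-kinase ratios of orthologs missing the kinase domain.
long_kinases = [0.05, 0.04, 0.06, 0.03, 0.07]     # e.g. > 750 amino acids
short_kinases = [0.00, 0.01, 0.005, 0.002, 0.02]  # e.g. <= 750 amino acids

z, p = rank_sum_test(long_kinases, short_kinases)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Because the test uses only ranks, it does not assume that the ratios are normally distributed, which suits skewed quantities such as these.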
Comparison with KinBase, a database with domain-based kinase classifications
We next evaluated the congruence of KinOrtho with the widely used classification of the human kinome developed by Manning et al. [2]. We retrieved the Manning classification of kinomes for humans and 14 other model organisms from KinBase [3]. We first applied KinOrtho to the proteomes of these 15 model organisms to collect kinase orthologous relationships (Table S3). We identified 9,282 orthologous relationships, 4,210 of which were ortholog pairs. Because KinBase has no orthology information, we consider an ortholog pair identified by KinOrtho to be in agreement with the KinBase classification if both of its protein members are classified in the same family/subfamily by KinBase. Among the 4,210 ortholog pairs, 3,658 (86.9%) were between two proteins of the same classification. After investigating the disagreements between KinOrtho and KinBase, we found that the ABC1, Alpha, PIKK, and RIO families in KinBase are each classified into multiple groups. For example, the understudied human kinase "Uncharacterized aarF domain-containing protein kinase 2" (ADCK2) is classified in the ABC1-C subfamily of the ABC1 family under the Atypical group, whereas its ortholog in L. major is classified in the same subfamily and family under the protein kinase-like (PKL) group. After correction for these four families, 3,777 (89.7%) ortholog pairs were in agreement with the KinBase classification. Of the remaining 433 ortholog pairs, 346 had at least one protein belonging to either a species-specific or an unclassified family/subfamily (named with "-Unique" or "-Unclassified"); all the others, except for two ortholog pairs, had two proteins classified in the same family but in different subfamilies. These two pairs were the Tyrosine-protein kinase receptor UFO (AXL) proteins in humans and mice vs. their ortholog in A. queenslandica, "AqueK368".
In KinBase, the AXL kinase domain is classified in the AXL family under the TK group, while AqueK368 is classified in the MET family under the TK group. Based on the profile hidden Markov model (HMM) analysis performed by KinBase using its in-house profile, AqueK368 can be classified in either the AXL or the MET family with close scores.
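The agreement check described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: classifications are modeled as (group, family, subfamily) tuples, the "correction" for families split across groups is modeled by ignoring the group label, the empty subfamily strings are placeholders, and the protein identifiers (including the hypothetical kinase X pair) are invented for the example.

```python
def agreement(pairs, labels, ignore_group=False):
    """Count ortholog pairs whose two members carry the same classification.

    labels maps a protein ID to a (group, family, subfamily) tuple.
    With ignore_group=True only (family, subfamily) is compared, which is an
    ASSUMED way to model the correction for families split across groups.
    """
    start = 1 if ignore_group else 0
    agree = sum(1 for a, b in pairs if labels[a][start:] == labels[b][start:])
    return agree, agree / len(pairs)

# Illustrative labels; the IDs and the empty subfamily fields are placeholders.
labels = {
    "human_ADCK2": ("Atypical", "ABC1", "ABC1-C"),
    "Lmajor_ADCK2_ortholog": ("PKL", "ABC1", "ABC1-C"),
    "human_AXL": ("TK", "AXL", ""),
    "AqueK368": ("TK", "MET", ""),
    "human_kinase_X": ("CMGC", "CDK", "CDK8"),  # hypothetical pair
    "mouse_kinase_X": ("CMGC", "CDK", "CDK8"),
}
pairs = [
    ("human_ADCK2", "Lmajor_ADCK2_ortholog"),
    ("human_AXL", "AqueK368"),
    ("human_kinase_X", "mouse_kinase_X"),
]

strict_count, strict_frac = agreement(pairs, labels)
relaxed_count, relaxed_frac = agreement(pairs, labels, ignore_group=True)
print(f"strict: {strict_count}/{len(pairs)}, group-corrected: {relaxed_count}/{len(pairs)}")
```

In this toy input, the ADCK2 pair disagrees only in its group label and so counts as an agreement after the correction, while the AXL/AqueK368 pair differs at the family level and disagrees either way.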
KinOrtho is also a profile- and annotation-free orthology inference method
In addition to being query-centric, KinOrtho is a profile- and annotation-free tool that identifies the functional domains of target sequences based solely on BLAST results, unlike other domain-based methods that require prior domain annotations [4, 5, 6, 7, 8]. To check whether adding Pfam's detailed annotation improves orthology detection, we collected 22 thousand protein sequences with Pfam-annotated protein kinase domains ("Pkinase" or "Pkinase_Tyr" in Pfam) as query sequences (Table S4). Using these expanded query sequences required 30 times as many sequence comparisons during the initial homology search and the subsequent all-vs-all homology search. Despite this computational cost, the number of orthologs increased by only 9.8%, and performance was not significantly better, and was even worse on some metrics ("Pfam" and "PfamSub" in Figure S9), supporting the simplified BLAST-based domain search implemented in KinOrtho.

Intrinsic issue of in-paralog identification
In the experiment using expanded query sequences based on Pfam annotations, we found that the ratio of in-paralogs to all orthologous relationships increased from 2.5% to 4.8% (Table S4). Moreover, when we applied KinOrtho to the proteomes in KinBase, the ratio of in-paralogs rose to 15.6% (Table S3). This reflects an intrinsic difficulty in identifying real in-paralogs. When only a limited number of species is considered, many proteins have no close sequence in other species and may therefore be defined as in-paralogs of other proteins in their own species. As a graph-based method, KinOrtho also takes in-paralogs into account when identifying orthologous groups. Although the cluster analysis helped KinOrtho mitigate this issue, an extensive collection of reference proteomes across the entire tree of life is needed to accurately infer orthologous relationships using graph-based methods.

Query*: the relationship is between a query sequence and any other sequences. The confusion matrix of each GO domain represents the number of kinase-GO term pairs observed/predicted as present/absent. Full-length*: full-length pipeline with e-value threshold = 10 1 . Domain-based*: domain-based pipeline with e-value threshold = 10 1 . Pkinase(_Tyr): the domains "Pkinase" and "Pkinase_Tyr" in Pfam.

Table S5
Statistics of compared methods

Figure S1
Benchmarking based on nine metrics. The title of each plot represents the evaluation metric. KinOrtho results are marked in red. The dotted line represents the Pareto frontier, which runs over the participants with the best efficiency (except KinOrtho). The arrow in the plot shows the optimal corner.
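For readers unfamiliar with the Pareto frontier mentioned in the caption, the following minimal sketch shows how non-dominated methods can be identified on a two-metric plot. It assumes that higher is better on both axes (that is, the optimal corner is the upper right), which need not hold for every benchmark metric. The method names and scores are hypothetical.

```python
def pareto_frontier(points):
    """Return the names of points not dominated by any other point.

    points is a list of (name, x, y); higher is ASSUMED better on both axes.
    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, x, y in points:
        dominated = any(
            ox >= x and oy >= y and (ox > x or oy > y)
            for _, ox, oy in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# HYPOTHETICAL benchmark scores (metric_1, metric_2) for four methods.
methods = [
    ("A", 0.9, 0.5),
    ("B", 0.7, 0.8),
    ("C", 0.6, 0.6),  # dominated by B, which is better on both metrics
    ("D", 0.5, 0.9),
]

front = pareto_frontier(methods)
print(front)
```

To draw the frontier over the other participants only, as the caption describes, the method of interest would be filtered out of the list before the call.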

Figure S10
Determining the optimal number of clusters for clustering phylogenetic profiles. The gap statistic method was used to determine the optimal number of k-means clusters. The dashed line indicates the optimal number of clusters.