
Analyses of domains and domain fusions in human proto-oncogenes



Understanding the constituent domains of oncogenes, their origins and their fusions may shed new light on the initiation and development of cancers.


We have developed a computational pipeline for identification of functional domains of human genes, prediction of the origins of these domains and their major fusion events during evolution through integration of existing and new tools of our own. An application of the pipeline to 124 well-characterized human oncogenes has led to the identification of a collection of domains and domain pairs that occur substantially more frequently in oncogenes than in human genes on average. Most of these enriched domains and domain pairs are related to tyrosine kinase activities. In addition, our analyses indicate that a substantial portion of the domain-fusion events of oncogenes took place in metazoans during evolution.


We expect that the computational pipeline for domain identification, domain origin and domain fusion prediction will prove to be useful for studying other groups of genes.


An oncogene is a modified gene that promotes unregulated proliferation of cells, increasing the chance that a normal cell develops into a tumor cell, possibly resulting in cancer [1]. The normal copy of such a gene is called a proto-oncogene. The first oncogene, SRC, was discovered in a chicken retrovirus in 1970 [2]. Since then, numerous oncogenes have been identified and classified into different groups based on their cellular functions. As of now, oncogenes have been identified at all levels of signal transduction cascades that control cell growth, proliferation and differentiation [13].

Protein domains are compact and semi-independent units of a protein, each of which may consist of one or more contiguous segments of a peptide chain and have its own biological function [3]. They are generally viewed as the basic unit of protein function and evolution. Various sequence- and structure-based methods have been developed for the identification of protein domains [4–6], and several domain databases, such as DALI [7], PFAM [8], SMART [9] and Prodom [10], have been established.

Recent studies on oncogenes and cancer pathology have pointed to the importance of individual domains and domain fusions in oncogenesis. It has been reported that genes containing domains from specific domain families may have particular relevance to human cancer [11–13]. For example, the tyrosine kinase domain is known to play significant roles in the development of numerous diseases such as cancer [11]. Another example is the ATM-related domain that is required for histone acetyltransferase recruitment and Myc-dependent oncogenesis [12]. Additionally, CML, a form of leukaemia, is associated with the fusion of the Bcr and Abl genes or their constituent domains [13]. Therefore, understanding the constituent domains of oncogenes as well as their origins may shed new light on the initiation and development of cancers.

In this study, we have developed an integrated computational pipeline for studying domain composition, domain fusion and domain origin. Specifically, our computational pipeline includes the following key components: (1) identification of the origin of each component domain of known oncogenes and the relevant fusion events; (2) co-occurrence analysis of oncogene domains; (3) identification of the domains and domain pairs that appear more frequently in oncogenes than in the background, namely the collection of all human genes; and (4) functional analyses of the identified frequent domains and domain pairs. We then applied this pipeline to all well-characterized human oncogenes, and made a number of new and interesting observations. To the best of our knowledge, this is the first comprehensive analysis specifically addressing the domain composition, origin and fusion of oncogenes.

Results and discussion

Using the computational procedures outlined in Material and Methods, we have carried out a detailed analysis of the origins and functions of oncogene domains and co-occurring domains.

A. Origin of oncogene domains

Origin of distinct domains in cellular organisms

103 distinct domains [see Additional file 1] have been identified from 124 oncogenes, based on Pfam domain assignments. We have accounted for domain subtypes, i.e., different alignments of a specific domain within one clan, by using a single domain ID to denote all the corresponding subtypes. In our dataset, there exist two alignments, SH3_1 and SH3_2, for the SH3 domain. The same holds for the SAM domain, where SAM_PNT is the entry for the SAM domain and two different alignments, SAM_1 and SAM_2, exist for this entry. Although they have different accession numbers in Pfam, we simply use SH3 and SAM_PNT to denote these two types of domains, respectively. The distribution of these domains' origins across different cellular organisms is given in Figure 1. About 50% (55/103) of oncogene domains have their origins in the early stages of organismal evolution prior to the emergence of the metazoans, and no domains are found to have arisen in mammals. It should be noted that the original subtractive search results (see Material and Methods) have been further refined through our literature survey to take potential horizontal gene transfer (HGT) into consideration. Based on the literature search, we found that the SWIB domain and the non-enzymatic domains ig and SAM are likely to have arisen in eukaryotes. Their homologs identified in prokaryotes likely resulted from HGT from eukaryotes [14]. Likewise, tyrosine kinases (Pkinase_Tyr) probably originated in eukaryotes, and their presence in bacteria may be explained most parsimoniously by HGT events [14].

Figure 1

Distributions of origins of 105 oncogene domains across cellular organisms. Archaea: 1 (1%); Bacteria: 17 (16%); Archaea_Bacteria: 22 (21%); Eukaryota: 19 (18%); Metazoa: 30 (29%); Chordata: 16 (15%); Mammalia: 0 (0%); Homo sapiens: 0 (0%).

In order to further analyze the statistical difference between the domain origin distribution of oncogenes and that of other genes, we have compared our results with those of Lipika et al. [15], who presented an analysis of the origins of the conserved domains in the whole human proteome. Table 1 presents the enrichment ratios, with their p-values, of oncogene domains that originated from 8 categories (i.e., Bacteria_only, Archaea_only, Bacteria_archaea, low level Eukaryotes, Metazoan, Chordate, Mammalian, Homo sapiens) compared with the whole domain dataset in human genes. Our results indicate that the origin distribution of oncogene domains is largely consistent with that reported by Lipika et al. [15] for the whole human proteome, except for domains of bacterial or metazoan origin.

Table 1 Enrichment analysis of oncogene domain origination distribution compared with background human genome.

Domain functions

We divided the oncogene domains into groups based on their GO annotation (Table 2). These oncogene domains show diverse functions, including regulation of transcription and apoptosis, protein kinase activity and DNA/RNA/protein binding activity.

Table 2 Main function groups of oncogene domains.

Further analyses suggest that domains with different functions might have come from different origins (Figure 1; Table 2). For example, domains related to immunoglobulin and tyrosine kinase (e.g., SH2, SH3, I-set, and V-set) are found in archaea, bacteria or in both. These domains are known to be closely related to oncogenesis [16]. (Note that two other important oncogenesis-related domains, Pkinase_Tyr and ig, originated in eukaryotes but were horizontally transferred to prokaryotes [14].) Other domains such as rhodopsin domains (7tm_1, 7tm_3), cyclin-dependent kinase (CDK) domains (Cyclin_N, Cyclin_C) and the intracellular signalling domains (PH, CH) seem to have originated in eukaryotes. Several domains related to the development of the nervous system, such as wnt, ephrin_lbd and Sema, seem to have originated in metazoans. In addition, functional domains required by vertebrates, such as hormones involved in mitogenic and inflammatory activity (Myc_N, Myc_LZ, Maf_N, Cys_knot), seem to have originated in chordates.

Domains originated from viruses

Among the 103 identified oncogene domains, 38 are found to be present in viruses (Table 3). The three most frequently occurring domains in virus proteins are Helicase_C, Ank and DEAD. Ank has been reported in diverse groups of proteins such as enzymes, toxins and transcription factors. The existence of Ank in both prokaryotes and viruses may have resulted from horizontal gene transfers [17]. The Helicase domain family (including Helicase_C and DEAD) is reportedly related to hepatitis virus-associated hepatocellular carcinoma and involved in cell growth control [18]. In addition, some other families such as Zinc finger domains (zf-C3HC4, zf-C2H2, zf-C4), Immunoglobulin-related domains (ig, V-set, I-set) and protein-tyrosine kinase related domains (Pkinase_Tyr, SH2, SH3) also have remote homologs in viruses, and all three of these domain families are closely related to oncogenesis. Overall, 20 of the 38 virus-originated domains are known to be related to oncogenesis.

Table 3 38 oncogene domains present in virus dataset (367,752 proteins).

B. Oncogene domain fusion

Domain fusion in cellular organisms

We have identified 50 whole domain fusion events in the 124 oncogenes. Among them, 21 contain two distinct domains (domain pairs) and the others contain at least three different domains. Their initial appearance in cellular organisms and their presence/absence in viruses are given [see Additional file 2].

Fused domains in viruses

Among the 50 domain fusion events, 7 have been identified in viruses. These 7 fused domains can be divided into 4 categories according to their functions: pkinase-related domain fusion ({SH2, SH3, Pkinase_Tyr}, {SH2, FCH, Pkinase_Tyr}); platelet-derived growth factor domain fusion ({PDGF, PDGF_N}); helicase-related domain fusion ({DEAD, Helicase_C}); and DNA/ligand-binding domain fusion ({Hormone_recep, zf-C4}; {HLH, Myc_N, Myc-LZ}; {HLH, Myc_N}). Interestingly, ~90% of the virus proteins harbouring these fused domains come from the Potyviridae family, and almost all of the remainder come from the Orthoretrovirinae family. Potyviridae is one of the largest and most important families of plant viruses. Although the relationship between retroviruses and cancer has been widely established [19–21], the possible link between Potyviridae and oncogenesis is unknown.

C. Proteome-wide patterns of origins of oncogenes

We have also examined the origins of all the oncogenes as a whole. Our goal is to find out at what stage in evolution all component domains of an oncogene are fused together for the first time, considered as the origin of the oncogene [see Additional file 3].

Among the 24 oncogenes whose initial domain fusions occurred in prokaryotes, 20 have the same domain fusions in viruses (Table 4). It seems that domains with prokaryotic origins tend to be present in viruses.

Table 4 24 oncogenes whose domain fusion events arose in prokaryotes.

We have divided the oncogenes into six categories according to their functions: signal transducers, non-receptor kinases, growth factors, growth factor receptors, transcription factors and others. Based on our examination of the oncogene origins, we have observed some general relationships between the origins and the functional categories of the oncogenes (Table 5).

Table 5 General classification of oncogene origins according to their functions.

Signal transducers

In our dataset, most of the oncogenes acting as signal transducers originated from prokaryotes. We have observed that a large number of such genes contain the Ras and Pkinase domains, and are involved in signal transduction, protein binding and kinase activities. It is believed that most ras proteins exist in an inactive state in the resting cell where they bind GDP [22], and their oncogenesis is closely related to their interactions with other receptors.

Non-receptor kinases

Non-receptor kinase oncogenes are mostly tyrosine kinases, discovered through retroviral transduction and/or DNA transfection, that do not have a receptor-like transmembrane domain. These proteins are partly associated with the inner surface of the plasma membrane, and are more related to cell differentiation than to proliferation. Another group of serine/threonine kinases such as RAF1 also belongs to this category. Our analysis shows that all the oncogenes of this group originated from metazoans.

Growth factors

Only one oncogene, PDGFB (sis), is known to be a growth factor. This gene encodes one of the two polypeptide chains that together constitute PDGF, a platelet-derived growth factor domain. Our analysis shows that the PDGF domain generally originated in metazoans or chordates, and the corresponding oncogene first came into being in chordates.

Growth factor receptors

The ERBB oncogene family was originally isolated from chicken erythroleukemia, encoding an epidermal growth factor (EGF) receptor [1]. Several other oncogenes also encode proteins with a receptor-like domain, including KIT and ROS [1]. These oncogenes consist of an extracellular ligand-binding domain, a transmembrane domain and an intracellular domain. Our analysis shows that these genes generally originated in metazoans or chordates, representing important regulatory proteins involved in phosphorylation [23].

Transcription factors

Transcription factors are nuclear proteins that regulate the expression of their target genes. They typically belong to multi-gene families that share common DNA-binding domains such as zinc fingers. Our data show that oncogenes acting as transcription factors mostly originated in chordates, and a few of them (25%) came from metazoans. It has been speculated that the pathologically activated forms of these transcription factors no longer fulfil their physiological regulating functions but act as carcinogens [1, 24].

Many oncogenes of this category have been identified in our dataset. One representative is JUN, which can bind tightly to other nuclear onco-proteins. In addition, a substantial portion of oncogenes in this category belongs to the myc gene family that is related to nuclear transcription and myeloblastosis. It has been reported that the Myc genes have been found in a wide variety of vertebrates, including mammals, birds, amphibians, and fish [25, 26]. The myeloblastosis function in these oncogenes may have evolved in response to some specific needs by chordates.

Programmed cell death regulators

The first oncogene shown to regulate programmed cell death is BCL2 [27]. Several other oncogenes related to apoptosis have also been identified in our dataset. We found that these oncogenes often originated in metazoans. The mechanisms of apoptosis have not been fully elucidated, but previous studies indicate that the process of apoptosis is controlled by a diverse range of cell signals which may originate either extracellularly (extrinsic inducers) or intracellularly (intrinsic inducers) [1, 27]. This type of complex cell signal network may be more active and required by metazoans.

D. Frequent domains and domain pairs in oncogenes

Oncogene domain co-occurrence graph

We have constructed a domain co-occurrence graph for 124 oncogenes, which consists of 105 domains (nodes) and 141 co-occurring domain pairs (edges), as shown in Figure 2. The graph has 8 connected components, each containing at least 3 nodes, with the largest component having 37 nodes and 82 edges. The graph has a sparse but highly clustered structure. The few highly connected nodes representing domains like Pkinase_Tyr, SH2, SH3 form hubs of the (co-occurrence) network.

A large-scale analysis of co-occurrence networks of the protein domains collected from the ProDom, Pfam and Prosite domain databases was previously performed by Wuchty [28], who found that these networks exhibit small-world and scale-free properties. In our study, the same properties were observed for the oncogene domain network (Figure 3): the network is sparse but highly clustered, and a few domains are connected to many different domains, forming hubs that predominantly shape the topology of the network.
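The node-degree frequency distribution referred to above can be tabulated directly from an edge list. The following is a minimal sketch, not the authors' implementation; the function name `degree_frequency` and the toy edge list are hypothetical, and nodes without any edge are not counted.

```python
from collections import Counter

def degree_frequency(edges):
    """Return {degree: fraction of nodes with that degree}.

    `edges` is an iterable of (domain_a, domain_b) pairs, each listed once.
    Nodes that appear in no edge are ignored (an assumption of this sketch).
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = Counter(degree.values())   # how many nodes have each degree
    total = sum(counts.values())        # number of nodes with degree >= 1
    return {k: counts[k] / total for k in sorted(counts)}

# Toy star-shaped graph: one hub connected to three other nodes.
toy_edges = [("hub", "x"), ("hub", "y"), ("hub", "z")]
freq = degree_frequency(toy_edges)
```

In a hub-dominated network of the kind described, most of the probability mass sits at low degrees, with a small fraction of nodes at high degrees.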

Figure 2

Oncogene domain co-occurrence graph consisting of 105 domains. Each node is labelled with a domain name. The weight of each edge represents the co-occurrence frequency across all the 124 oncogenes.

Figure 3

Frequency distribution of node degrees in the oncogene domain network. The distribution follows a generalized power law. Parameter values of the fit (solid curve) are a = 1.125, b = -0.887 and r = 0.101.

Frequent domains and domain pairs

We have identified a number of domains and domain pairs that are highly frequent in oncogenes, compared with those in the background human genome. We consider such domains significant if they show high occurrence frequencies and high numbers of co-occurring domains in the oncogenes but not in the background set. These domains are Pkinase_Tyr, SH2, SH3, RhoGEF and fn3, with functions related to signal transduction, enzymatic activity, and cell surface binding. Moreover, Pkinase_Tyr has protein-tyrosine kinase activity, and SH2 and SH3 mediate protein interactions. They are known to play a key role in diverse biological processes such as growth, differentiation, metabolism and apoptosis in response to external and internal stimuli.

To find out which domain pairs occur more frequently in oncogenes than in the background genome, we have carried out enrichment analyses of the domain pairs. Table 6 gives the domain pairs with p-values below 10^-6. It should be noted that 10^-6 is a rather stringent cut-off based on our experience in identifying frequent domain pairs from the background.

Table 6 Frequent domain pairs XY in the oncogene graph compared with the background genome.

We expect that domain fusions might have brought new functions to their host proteins. This type of functional transformation has been reported previously [25, 29–31]. For instance, the SH3 and SH2 domains frequently appear together in various signalling proteins involved in recognition of phosphorylated tyrosine [30], where SH2 localizes tyrosine-phosphorylated sites and SH3 binds to target proteins [31]. Another example is that the bHLH motif and the "Myc boxes" co-exist in the Myc gene family. bHLH uses a common mechanism for DNA binding and dimerization, while the Myc boxes, on the other hand, appear to be unique to the Myc family and are involved in transcription activation and neoplastic transformation [25]. While the individual functions of these two domains are generally understood, their synergistic effects in their bound protein complex are not known [25].

Two significant triad domain fusions, {SH2, SH3, Pkinase_Tyr} and {Furin-like, Recep_L_domain, Pkinase_Tyr}, are found (Figure 2), and they form six fused domain pairs (shown in Table 6). Pkinase_Tyr is known to be related to protein tyrosine kinase activities and amino acid phosphorylation. The other two domains, Furin-like and Recep_L_domain, are involved in signal transduction by receptor tyrosine kinases [32]. It is also noteworthy that domains corresponding to the tyrosine kinase family are among the most frequent families in oncogenes. These domains may carry essential functions as standalone domains and may also extend their functionality to accomplish complex tasks in combination with other domains.

E. Phylogenetic profiling diversities of frequent domains and domain pairs

Diverse origins of frequent domains and domain pairs are found in cellular organisms through our phylogenetic profile analyses, which provide complementary information to our earlier analysis of domains and domain pairs. Phylogenetic profiling is a computational technique for functional analyses of domains and their fusions [33]. We have calculated the phylogenetic profiles of all oncogene domains and domain pairs to find their taxonomic distribution across 495 cellular genomes, grouped into 7 taxa: archaea, bacteria, protozoa, viridiplantae, fungi, metazoan-invertebrates, and metazoan-chordates. The phylogenetic profiles of frequent domains and domain pairs are listed in Table 7.
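As an illustration of what such a profile might look like in practice, the sketch below counts, for each of the 7 taxa, how many genomes contain a given domain or domain pair. It is a minimal sketch under stated assumptions, not the procedure used in the study: the function `phylogenetic_profile`, the input dictionaries and the toy genome identifiers are all hypothetical, and the presence of a feature in a genome is assumed to have been determined beforehand.

```python
TAXA = [
    "archaea", "bacteria", "protozoa", "viridiplantae",
    "fungi", "metazoan-invertebrates", "metazoan-chordates",
]

def phylogenetic_profile(feature, genome_features, genome_taxon):
    """Return {taxon: (genomes containing feature, genomes in taxon)}.

    feature         -- a domain name, or a frozenset of two names for a pair
    genome_features -- {genome_id: set of features detected in that genome}
    genome_taxon    -- {genome_id: one of TAXA}
    """
    present = {t: 0 for t in TAXA}
    total = {t: 0 for t in TAXA}
    for genome, features in genome_features.items():
        taxon = genome_taxon[genome]
        total[taxon] += 1
        if feature in features:
            present[taxon] += 1
    return {t: (present[t], total[t]) for t in TAXA}

# Toy input (hypothetical genomes, not real annotations).
pair = frozenset({"SH2", "SH3"})
toy_features = {
    "genome_1": {"SH2"},
    "genome_2": {"SH2", "SH3", pair},
    "genome_3": set(),
}
toy_taxon = {
    "genome_1": "bacteria",
    "genome_2": "metazoan-chordates",
    "genome_3": "bacteria",
}
profile_sh2 = phylogenetic_profile("SH2", toy_features, toy_taxon)
profile_pair = phylogenetic_profile(pair, toy_features, toy_taxon)
```

Treating a domain pair as a feature in its own right keeps the same counting code usable for both single domains and pairs.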

Table 7 Phylogenetic profiling analysis of frequent individual domains and domain pairs through 7 taxa from 495 genomes.

Our data show that nearly all frequent individual domains originated in prokaryotes, and have a wide distribution across many genomes, while the frequent domain pairs almost all first emerged in metazoans (Table 7). Therefore, while individual domains may have early origins, most frequent domain pairs first came together in higher organisms. Although multi-domain proteins are more common in higher organisms, it is not clear if this observation about frequent oncogene domain pairs is generally true for any domain pairs from any groups of genes, which will be left for future study.


We have analyzed the origins of component domains and domain fusions of oncogenes, and studied the unique characteristics of the oncogene domain pairs in comparison with those in the background human genome. Most of these domains and domain pairs are functionally related to protein tyrosine kinase activities, which are closely related to cancer pathophysiology. Our phylogenetic profile analysis provides additional evidence to support our observation that frequent domain pairs in oncogenes tend to originate in higher organisms. The knowledge gained from this computational study may provide useful insights into the complex processes of oncogenesis.


A. Data sources

124 proto-oncogenes of Homo sapiens were collected from the CNIO OncoChip project website and the Cancer Genome Anatomy Project database [see Additional file 4], and their protein sequences were obtained from the Uniprot database [34] (only the primary protein form was used). The pre-calculated domain structures of these proteins were retrieved from the Pfam-A database (version 21.0) [8], using HMMER [8] and RPS-BLAST [8] (E-value cutoff 0.001; sequences were masked for coiled-coils and low complexity regions). Our list includes all the important proto-oncogenes previously reported in the literature [2, 35–37]. All these proto-oncogenes were manually curated based on the published literature.

A proto-oncogene only becomes an oncogene when mutation or over-expression takes place [37]. Note that "oncogenes" are different from "cancer genes". Commonly we consider oncogenes as those involved in uncontrollable cell growth, while cancer genes generally refer to genes that are identified with somatic or germline mutations in cancer tissues. Futreal et al. recently conducted a census of human cancer genes on the basis of genetic evidence [16], whose cancer-gene list partly overlaps our oncogene list. Throughout the rest of the paper, we use oncogenes to refer to proto-oncogenes for simplicity of terminology.

Two sets of genomes and their encoded proteins were used in our study, one including the whole set of proteins encoded in 495 sequenced genomes (with 34 archaea, 422 bacteria and 39 eukaryotes) from the Integr8 database (release 58) [38] and the other including 367,752 protein sequences with Pfam annotations from 6,774 sequenced virus genomes. The second data set was downloaded from the Uniprot database at the FTP site [39].

The complete set of proteins of Homo sapiens with Pfam domain annotation was downloaded from the Integr8 database, which contains 25,025 protein sequences without splicing isoforms. This dataset served as the background for our statistical analyses.

It should be noted that currently there is no well-accepted benchmark dataset for oncogenes. Since our data were mainly selected from the CNIO OncoChip project and the Cancer Genome Anatomy Project database, some bias may exist when compared with other datasets. One future plan of our work is to investigate several other cancer gene datasets, including those identified by exon sequencing studies such as the TCGA [40] dataset, the dataset from the group at Johns Hopkins [41] and the cancer gene lists compiled by Futreal et al. [16], to derive a more comparative dataset of oncogenes.

B. A computational pipeline for domain analyses of oncogenes

Our computational pipeline for identification of the origins of oncogene domains and domain fusion events consists of three main steps (Figure 4). The first step is to predict the origins of domains and domain fusion events in oncogenes, which is done through application of a subtractive search procedure [15], in conjunction with identification and analyses of horizontal gene transfers to avoid pitfalls, which could potentially lead to misclassification of domain origins in prokaryotes. The second step is to perform comparative analyses on domains between oncogenes and the background, namely the whole collection of human proteins. Domains and domain pairs with higher occurrence frequencies in oncogenes than in the background are identified, through an analysis of a domain co-occurrence graph. Detailed analyses on these domains and domain pairs are carried out in the third step of the pipeline, through a combination of a domain/domain pair enrichment analysis and a phylogenetic profile analysis (see following sections for details).

Figure 4

A computational pipeline for prediction of origins of oncogene domains. Different components of the pipeline are colour-coded with yellow for prediction of domain origins, blue for analysis of oncogene domain co-occurrence and red for analysis of evolutionary characteristics of frequent domains and domain pairs.

Subtractive search

First we generate the domain list of all the 124 oncogenes, and search them against all the sequenced genomes, which are organized into a simplified taxonomy tree, including viruses, archaea, bacteria, eukaryotes, plus a few increasingly finer subclasses of eukaryotes leading to Homo sapiens, namely metazoans, chordates and mammals (Figure 5). The questions we ask here are (a) for each domain, where did it occur for the first time going from the simplest class of organisms to the most complex one? (b) for each pair of co-occurring domains, where did the co-occurrence take place for the first time in the aforementioned taxonomy?

In Figure 5, the term other_node is used to denote the group of organisms excluding the next higher node in the taxonomy. For example, if node B is next to node A in the taxonomy, 'other_node A' refers to all species from 'nodeA minus nodeB'. Thus, for node eukaryote, 'other_eukaryota' refers to all species from eukaryotes minus metazoans. Briefly, the tracing procedure starts from the organisms in a bottom-level group, and goes up the taxonomy tree to higher organisms in each of the groups. It should be noted that when a remote homolog of a domain is found at one other_node, its node of origin will be its immediate lower major node. Then the hit domains will be subtracted from the set and the others will be searched against the higher level other_node until all other_nodes are searched along the whole taxonomy. For example, if a domain is found at the other_chordata node, then its node of origin will be the chordata node. When a domain does not have a hit when searched against all the other_node genomes, then its node of origin will be considered as Homo sapiens [15].
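The tracing procedure described above can be sketched as a simple loop over the taxonomy levels. This is a minimal sketch, not TaxDom itself: the level names, the `trace_origins` function and the `hits` dictionary (which stands in for the results of the remote-homolog searches) are assumptions made for illustration, and the prokaryotic level is collapsed into a single group for brevity.

```python
# Levels ordered from the bottom of the simplified taxonomy upward.
# Each entry: (group of genomes searched, node of origin assigned on a hit).
LEVELS = [
    ("prokaryotes", "Prokaryota"),      # collapsed here; an assumption
    ("other_eukaryota", "Eukaryota"),   # eukaryotes minus metazoans
    ("other_metazoa", "Metazoa"),       # metazoans minus chordates
    ("other_chordata", "Chordata"),     # chordates minus mammals
    ("other_mammalia", "Mammalia"),     # mammals minus Homo sapiens
]

def trace_origins(domains, hits):
    """Assign a node of origin to each domain by subtractive search.

    domains -- iterable of domain names
    hits    -- {domain: set of groups in which a remote homolog was found}
    """
    remaining = set(domains)
    origin = {}
    for group, node in LEVELS:
        found = {d for d in remaining if group in hits.get(d, set())}
        for d in found:
            origin[d] = node
        remaining -= found              # subtract hits before moving up
    for d in remaining:                 # no hit at any other_node
        origin[d] = "Homo sapiens"
    return origin

# Toy input with hypothetical domain names.
toy_hits = {
    "dom_a": {"prokaryotes", "other_metazoa"},
    "dom_b": {"other_chordata"},
    "dom_c": set(),
}
toy_origin = trace_origins(["dom_a", "dom_b", "dom_c"], toy_hits)
```

Because domains are removed from the working set as soon as they are assigned, a domain found at several levels receives the earliest one, which is the behaviour the text describes.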

Figure 5

A simplified taxonomy. For cellular organisms, each ellipse represents a major taxonomic class. Each rectangle represents all organisms covered by its parent class but not covered under its sibling ellipse.

We have used three types of nodes of origin for bacteria and archaea depending on the presence of remote homologs, i.e. archaea_only (first hit only in archaea, but not in bacteria), bacteria_only (first hit only in bacteria but not in archaea) and archaea_bacteria (first hit in both archaea and bacteria).

A tool package, TaxDom, was developed to execute the procedure outlined above and to facilitate visualization of the search results. The program is written in Perl and Java.

Refinement of subtractive search results

Horizontal gene transfer (HGT) has played a substantial role in organismal and genome evolution [42]. Gene transfer from prokaryotes to eukaryotes, particularly in the context of organellar endosymbiosis, is a major evolutionary phenomenon [43]. However, horizontal transfer in the opposite direction, i.e., from eukaryotes to bacteria or archaea, has been reported only anecdotally [14]. Although this type of transfer may occur only rarely, we have manually curated the tracing results generated by the subtractive search procedure, based on an extensive literature search, to correct the predicted origin of any domain that has been reported in the literature to have emerged in eukaryotes first and then been transferred to prokaryotes.

Domain co-occurrence graph and enrichment analysis

We define a domain co-occurrence graph [44] as follows. Each node represents a distinct domain in the oncogenes, and two nodes are linked by an edge if they co-occur in some proteins in one of the reference genomes. Each edge has a weight defined as the number of co-occurrences of the corresponding domains in the same protein. Note that this graph is not necessarily a connected graph.
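A graph of this kind can be built from a table of per-protein domain assignments. The sketch below is illustrative only and is not the authors' code: the function names and the toy protein identifiers are hypothetical, and it assumes that each protein contributes at most one co-occurrence per domain pair.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(protein_domains):
    """Return (nodes, weights) for a domain co-occurrence graph.

    protein_domains -- {protein_id: list of domain names in that protein}
    weights         -- {(domain_a, domain_b): number of proteins with both},
                       with each pair stored in sorted order
    """
    nodes = set()
    weights = defaultdict(int)
    for domains in protein_domains.values():
        distinct = sorted(set(domains))
        nodes.update(distinct)
        for a, b in combinations(distinct, 2):
            weights[(a, b)] += 1
    return nodes, dict(weights)

def connected_components(nodes, weights):
    """Return the connected components as a list of sets of nodes."""
    adjacency = {n: set() for n in nodes}
    for a, b in weights:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                     # depth-first traversal
            n = stack.pop()
            if n not in component:
                component.add(n)
                stack.extend(adjacency[n] - component)
        seen |= component
        components.append(component)
    return components

# Toy input: hypothetical proteins, not real annotations.
toy = {
    "protein_1": ["SH2", "SH3", "Pkinase_Tyr"],
    "protein_2": ["SH2", "Pkinase_Tyr"],
    "protein_3": ["HLH"],
}
nodes, weights = build_cooccurrence_graph(toy)
components = connected_components(nodes, weights)
```

The single-domain protein in the toy input yields an isolated node, which shows why the graph need not be connected.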

The following defines the enrichment ratio [45] of domains between oncogenes and the background human genome, for identification of domain pairs with higher co-occurrence frequencies in oncogenes compared to the whole human genome. Let

N = the number of proteins in the background set,

n_s = the number of proteins in the background that contain domain pair s,

M = the number of proteins in the oncogene set, and

m_s = the number of oncogene proteins that contain domain pair s.

We use the following formula to calculate the enrichment ratio of proteins that contain a specific domain pair in oncogenes and its p-value, knowing that it follows a hypergeometric distribution [45]:

$$\text{Enrichment\_ratio} = \frac{m_s / M}{n_s / N}$$

$$p\text{-value} = \begin{cases} \displaystyle\sum_{k=m_s}^{n_s} \frac{\binom{M}{k}\binom{N-M}{n_s-k}}{\binom{N}{n_s}}, & \text{enrichment\_ratio} \ge 1 \\[2ex] \displaystyle\sum_{k=0}^{m_s} \frac{\binom{M}{k}\binom{N-M}{n_s-k}}{\binom{N}{n_s}}, & \text{enrichment\_ratio} < 1 \end{cases}$$
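The enrichment ratio and the hypergeometric p-value defined above can be computed with exact integer binomial coefficients. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the numbers in the example are toy values rather than results from the study, and the sketch assumes n_s > 0. The upper tail is summed up to min(M, n_s), which gives the same value as summing up to n_s because the terms with k > M are zero.

```python
from math import comb

def enrichment_ratio(m_s, M, n_s, N):
    """(m_s / M) / (n_s / N): frequency in oncogenes over frequency in background."""
    return (m_s / M) / (n_s / N)

def hypergeom_p_value(m_s, M, n_s, N):
    """One-sided hypergeometric p-value for a domain pair s.

    Upper tail (k >= m_s) if the pair is enriched (ratio >= 1),
    lower tail (k <= m_s) if it is depleted (ratio < 1).
    """
    total = comb(N, n_s)

    def pmf(k):
        # Probability that exactly k of the n_s pair-containing proteins
        # fall within the M oncogene proteins.
        return comb(M, k) * comb(N - M, n_s - k) / total

    if enrichment_ratio(m_s, M, n_s, N) >= 1:
        ks = range(m_s, min(M, n_s) + 1)
    else:
        ks = range(0, m_s + 1)
    return sum(pmf(k) for k in ks)

# Toy example: 10 background proteins, 4 of them "oncogenes";
# the pair occurs in 3 background proteins, 2 of which are oncogenes.
ratio = enrichment_ratio(2, 4, 3, 10)
p_enriched = hypergeom_p_value(2, 4, 3, 10)
p_depleted = hypergeom_p_value(0, 4, 3, 10)
```

Exact integer arithmetic avoids overflow in the intermediate binomial coefficients for a background of tens of thousands of proteins, at some cost in speed.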

Availability and requirements

TaxDom, the computational tool that we developed for visualizing the domain evolution and domain-fusion events presented in this study, is freely accessible at


References

  1. Pierotti MA, Frattini M, Sozzi G: Oncogenes. In Cancer Medicine. 7th edition. Edited by: James F. Holland et al. Lea & Febiger, London; 2007.

  2. Martin GS: The road to Src. Oncogene 2004, 23: 7910–7917. 10.1038/sj.onc.1208077

  3. Vogel C, Bashton M, Kerrison ND, Chothia C, Teichmann SA: Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol 2004, 14: 208–216. 10.1016/

  4. Holm L, Sander C: The FSSP database of structurally aligned protein fold families. Nucleic Acids Res 1994, 22: 3600–3609.

  5. Siddiqui AS, Barton GJ: Continuous and discontinuous domains: An algorithm for the automatic generation of reliable protein domain definitions. Protein Sci 1995, 4: 872–884.

  6. Swindells MB: A procedure for detecting structural domains in proteins. Protein Sci 1995, 4: 103–112.

  7. Holm L, Sander C: Dictionary of recurrent domains in protein structures. Proteins 1998, 33: 88–96. 10.1002/(SICI)1097-0134(19981001)33:1<88::AID-PROT8>3.0.CO;2-H

  8. Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer ELL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue): D247–D251. 10.1093/nar/gkj149

  9. Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signalling domains. Proc Natl Acad Sci 1998, 95: 5857–5864. 10.1073/pnas.95.11.5857

  10. Servant F, Bru C, Carrere S, Courcelle E, Gouzy J, Peyruc D, Kahn D: ProDom: automated clustering of homologous domains. Brief Bioinform 2002, 3: 246–251. 10.1093/bib/3.3.246

  11. Robinson DR, Wu YM, Lin SF: The protein tyrosine kinase family of the human genome. Oncogene 2000, 19: 5548–5557. 10.1038/sj.onc.1203957

  12. Park J, Kunjibettu S, McMahon SB, Cole MD: The ATM-related domain of TRRAP is required for histone acetyltransferase recruitment and Myc-dependent oncogenesis. Genes Dev 2001, 15: 1619–1624. 10.1101/gad.900101

  13. Westbrook CA, Hooberman AL, Spino C, Dodge RK, Larson RA, Davey F, Wurster-Hill DH, Sobol RE, Schiffer C, Bloomfield CD: Clinical Significance of the BCR-ABL Fusion Gene in Adult Acute Lymphoblastic Leukemia: A Cancer and Leukemia Group B Study. Blood 1992, 80(12):2983–2990.

  14. Ponting CP, Aravind L, Schultz J, Bork P, Koonin EV: Eukaryotic signalling domain homologues in archaea and bacteria. Ancient ancestry and horizontal gene transfer. J Mol Biol 1999, 289(4):729–745. 10.1006/jmbi.1999.2827

  15. Pal LR, Guda C: Tracing the origin of functional and conserved domains in the human proteome: implications for protein evolution at the modular level. BMC Evolutionary Biology 2006, 6: 91. 10.1186/1471-2148-6-91

  16. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177–83. 10.1038/nrc1299

  17. Bork P: Hundreds of ankyrin-like repeats in functionally diverse proteins: mobile modules that cross phyla horizontally? Proteins: Structure, Function, and Genetics 1993, 17(4):363–74. 10.1002/prot.340170405

  18. Chang PC, Chi CW, Chau GY, Li FY, Tsai YH, Wu JC: DDX3, a DEAD box RNA helicase, is deregulated in hepatitis virus-associated hepatocellular carcinoma and is involved in cell growth control. Oncogene 2006, 25: 1991–2003. 10.1038/sj.onc.1209239

  19. Robinson HL: Retroviruses and cancer. Rev Infect Dis 1982, 4(5):1015–25.

  20. Beral V, Newton R, Weiss RA, eds: Infection and Human Cancer. Cancer Surveys 1998, 33: 1–396.

  21. Coffin J, Hughes SH, Varmus HE, eds: Retroviruses. Cold Spring Harbor Laboratory Press, New York; 1997.

  22. Hurley JB, Simon MI, Teplow DB: Homologies between signal transducing G proteins and gene products. Science 1984, 226(4676):860–862. 10.1126/science.6436980

  23. Klein G: Cellular Oncogene Activation. Marcel Dekker Inc, NY; 1988.

  24. Banerjee R, Caruccio L, Zhang YJ, Mckercher S, Santelia RM: Effects of carcinogen-induced transcription factors on the activation of hepatitis B virus expression in human hepatoblastoma HepG2 cells and its implication on hepatocellular carcinomas. Hepatology 2000, 32(2):367–74. 10.1053/jhep.2000.9197

  25. Atchley WR, Fitch WM: Myc and Max: Molecular Evolution of a Family of Proto-Oncogene Products and Their Dimerization Partner. Proc Natl Acad Sci 1995, 92: 10217–10221. 10.1073/pnas.92.22.10217

  26. Walker CW, Boom JD, Marsh AG: First non-vertebrate member of the myc gene family is seasonally expressed in an invertebrate testis. Oncogene 1992, 7(10):2007–2012.

  27. Korsmeyer SJ: Bcl-2 initiates a new category of oncogenes: regulators of cell death. Blood 1992, 80: 879–886.

  28. Wuchty S: Scale-free behavior in protein domain networks. Mol Biol Evol 2001, 18: 1694–1702.

  29. Bashton M, Chothia C: The Generation of New Protein Functions. Structure 2007, 15: 85–99. 10.1016/j.str.2006.11.009

  30. Hegyi H, Gerstein M: Annotation Transfer for Genomics: Measuring Functional Divergence in Multi-Domain Proteins. Genome Res 2001, 11: 1632–1640. 10.1101/gr.183801

  31. Vogel C, Berzuini C, Bashton M: Supra-domains: Evolutionary Units Larger than Single Protein Domains. J Mol Biol 2004, 336: 809–823. 10.1016/j.jmb.2003.12.026

  32. Raz E, Schejter ED, Shilo BZ: Interallelic complementation among DER/flb alleles: implications for the mechanism of signal transduction by receptor-tyrosine kinases. Genetics 1991, 129(1):191–201.

  33. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Grothe R, Yeates TO: Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc Natl Acad Sci 1999, 96: 4285–4288. 10.1073/pnas.96.8.4285

  34. The UniProt Consortium: The Universal Protein Resource (UniProt). Nucleic Acids Res 2007, 35: D193–197. 10.1093/nar/gkl929

  35. Darmoul D, Gratio V, Devaud H, Peiretti F, Laburthe M: Activation of proteinase-activated receptor 1 promotes human colon cancer cell proliferation through epidermal growth factor receptor transactivation. Mol Cancer Res 2004, 2(9):514–522.

  36. Espinosa AV, Porchia L, Ringel MD: Targeting BRAF in thyroid cancer. Br J Cancer 2007, 96(1):16–20. 10.1038/sj.bjc.6603520

  37. Weinberg RA: The Biology of Cancer. 1st edition. Garland Science, London; 2006.

  38. Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I, Gattiker A, Kulikova T, Faruque N, Duggan K, Mclaren P, Reimholz B, Duret L, Penel S, Reuter I, Apweiler R: Integr8 and genome reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res 2005, 33: D297–D302.

  39. The Uniprot virus data[]

  40. The Cancer Genome Atlas Research Network: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455(7216):1061–1068. 10.1038/nature07385

  41. Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N: The Consensus Coding Sequences of Human Breast and Colorectal Cancers. Science 2006, 314: 268–274. 10.1126/science.1133427

  42. Salzberg SL, White O, Peterson J, Eisen JA: Microbial Genes in the Human Genome: Lateral Transfer or Gene Loss? Science 2001, 292: 1903–1906. 10.1126/science.1061036

  43. Doolittle WF: You are what you eat: a gene transfer ratchet could account for bacterial genes in eukaryotic nuclear genomes. Trends Genet 1998, 14(8):307–311. 10.1016/S0168-9525(98)01494-2

  44. Ye Y, Godzik A: Comparative analysis of protein domain organization. Genome Res 2004, 14: 343–353. 10.1101/gr.1610504

  45. Xing Y, Xu Q, Lee C: Widespread production of novel soluble protein isoforms by alternative splicing removal of transmembrane anchoring domains. FEBS Letters 2003, 555: 572–578. 10.1016/S0014-5793(03)01354-1



Acknowledgements

This work was supported in part by the National Science Foundation (DBI-0354771, ITR-IIS-0407204, CCF-0621700, DBI-0542119), the National Institutes of Health (R01GM075331) and the Distinguished Cancer Clinicians and Scientists Program of the Georgia Cancer Coalition. JH acknowledges support from a Research and Creativity Award from East Carolina University. The authors would also like to thank the other members of the Computational Systems Biology Laboratory for helpful discussions.

Author information



Corresponding authors

Correspondence to Xiuzi Ye or Ying Xu.

Additional information

Authors' contributions

QL designed and implemented the computational pipeline and drafted the manuscript. JH was responsible for the evolutionary analysis of the oncogenes. HL and PW participated in the preparation and analysis of the oncogene data. YX and XY conceived the study and coordinated the data analyses and the writing of the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: Supplementary S2. 103 domains encoded from oncogene proteins. (XLS 34 KB)

Additional file 2: Supplementary S3. 50 whole domain fusions of oncogenes. (XLS 22 KB)

Additional file 3: Supplementary S4. Proteome-wide patterns of origin nodes in oncogene proteins. (XLS 82 KB)

Additional file 4: Supplementary S1. Oncogene list. (XLS 64 KB)


About this article

Cite this article

Liu, Q., Huang, J., Liu, H. et al. Analyses of domains and domain fusions in human proto-oncogenes. BMC Bioinformatics 10, 88 (2009).



Keywords

  • Phylogenetic Profile
  • Domain Pair
  • Enrichment Ratio
  • Domain Fusion
  • Computational Pipeline