Protein domain organisation: adding order
© Kummerfeld and Teichmann; licensee BioMed Central Ltd. 2009
Received: 07 May 2008
Accepted: 29 January 2009
Published: 29 January 2009
Domains are the building blocks of proteins. During evolution, they have been duplicated, fused and recombined, to produce proteins with novel structures and functions. Structural and genome-scale studies have shown that pairs or groups of domains observed together in a protein are almost always found in only one N to C terminal order and are the result of a single recombination event that has been propagated by duplication of the multi-domain unit.
Previous studies of domain organisation have used graph theory to represent the co-occurrence of domains within proteins. We build on this approach by adding directionality to the graphs and connecting nodes based on their relative order in the protein. Most of the time, the linear order of domains is conserved. However, using the directed graph representation we have identified non-linear features of domain organisation that are over-represented in genomes. Recognising these patterns and unravelling how they have arisen may allow us to understand the functional relationships between domains and how the protein repertoire has evolved.
We identify groups of domains that are not linearly conserved, but instead have been shuffled during evolution so that they occur in multiple different orders. We consider 192 genomes across all three kingdoms of life and use domain and protein annotation to understand the functional significance of these groups.
To identify these features and assess their statistical significance, we represent the linear order of domains in proteins as a directed graph and apply graph theoretical methods. We describe two higher-order patterns of domain organisation, clusters and bi-directionally associated domain pairs, and explore their functional importance and phylogenetic conservation.
Taking into account the order of domains, we have derived a novel picture of global protein organisation. We found that all genomes have a higher than expected degree of clustering and more domain pairs in forward and reverse orientation in different proteins relative to random graphs with identical degree distributions. While these features were statistically over-represented, they are still fairly rare. Looking in detail at the proteins involved, we found strong functional relationships within each cluster. In addition, the domains tended to be involved in protein-protein interaction and were able to function as independent structural units. A particularly striking example was the human Jak-STAT signalling pathway, which makes use of a set of domains in a range of orders and orientations to provide nuanced signalling functionality. This illustrated the importance of functional and structural constraints (or lack thereof) on domain organisation.
One of the driving forces behind protein evolution is the duplication and shuffling of domains [1–3]. There are approximately 1500 known domain superfamilies that have been combined in different ways to form the protein repertoire (SCOP version 1.65, corroborated by other work). Domains can be thought of as the building blocks of proteins. During evolution, pairs and groups of domains have joined to form multi-domain proteins. In many cases, these groups have then been preserved and duplicated to generate higher-order combinations [6–8]. Unravelling how this duplication and rearrangement has occurred allows us to understand the functional relationships between domains and determine how proteins have evolved.
Structural and genome-scale studies have shown that pairs of domains that are adjacent on proteins are usually the result of a single recombination event which is preserved and duplicated as a unit. Earlier work found that the N-C terminal order of any particular pair of domains is almost always preserved across protein space. That is, if the domain pair A-B is observed, B-A is unlikely to occur. It has also been shown that this extends to triplets of domains, and two- and three-domain patterns that are over-represented have been identified. Multi-domain architectures are almost always the result of a single recombination event, with convergent evolution to generate a particular pattern of domains being rare. Further, structural analysis of the geometry of adjacent domains indicated that most domain pairs we observe next to each other on a protein have become joined once in evolution. Recent studies have found higher rates of domain architecture re-invention, but the proportion of these cases is still low.
Some domains occur in many different domain architectures. Such domains are relatively rare, typically involved in protein-protein interactions, and most often located at the ends of proteins or found as single-domain proteins.
Previous global studies of domain organisation have used undirected graphs to represent the co-occurrence of domains within proteins [16–23]. These studies represented proteins as a graph with vertices corresponding to domains and edges linking domains that are found within one protein. This model views a protein as a "bag of domains" because all domains on the protein are linked irrespective of their order or relative positions. For example, one study analysed the global properties of the network, showing that the domain graph has small-world and scale-free topology, and another used the domain graph representation to compare changes in domains and combinations between genomes. These analyses concentrated on the global properties of the network and used the "bag of domains" model, which does not account for evolutionary changes manifested through re-ordering of domains. We set out to consider the functional and evolutionary importance of domain order by building a model that accounts for the relative position of domains on a protein. Given the importance of domain order as shown by structural and sequence-based studies (described above), incorporation of information about sequential domain arrangements may lead to novel insights into how proteins are organised.
Results and discussion
0.1 Global features of domain organisation
We consider proteins in terms of their domains, based on the SCOP database superfamily definition of a domain. Domains were assigned to proteins using SUPERFAMILY database (v1.65) predictions for 192 completely sequenced genomes (19 archaea, 129 bacteria and 44 eukaryotes [see Additional file 2]). These assignments are used to determine the sequential order of domains along a protein, termed the protein's domain architecture. We represented domain architectures as a directed graph with superfamilies as nodes and directed edges linking adjacent domains from the N- to C-terminus (Figure 1). We chose the SUPERFAMILY database because it represents domains that are highly divergent but evolutionarily related.
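The directed representation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy architectures and the choice to ignore tandem repeats of the same superfamily are all assumptions made here. The "bag of domains" construction used in earlier work is included for contrast.

```python
def build_directed_domain_graph(architectures):
    """Map each domain to the set of domains found immediately C-terminal to it."""
    graph = {}
    for architecture in architectures:
        for domain in architecture:
            graph.setdefault(domain, set())  # register single-domain proteins too
        for n_side, c_side in zip(architecture, architecture[1:]):
            if n_side != c_side:  # assumption: tandem repeats add no self-edge
                graph[n_side].add(c_side)
    return graph


def build_bag_of_domains_graph(architectures):
    """Undirected model: link every pair of domains that share a protein."""
    graph = {}
    for architecture in architectures:
        members = set(architecture)
        for domain in members:
            graph.setdefault(domain, set()).update(members - {domain})
    return graph


# Hypothetical architectures, each listed from N- to C-terminus.
toy_architectures = [
    ["SH3", "SH2", "Kinase"],
    ["SH2", "SH3"],
    ["Kinase"],
]
directed = build_directed_domain_graph(toy_architectures)
bag = build_bag_of_domains_graph(toy_architectures)
```

In the directed graph, SH3 and Kinase are linked only through SH2, whereas the bag-of-domains graph joins them directly because they occur in the same protein.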
This section presents the global properties of the directed domain graph, establishes its topology and compares it with the previously studied "bag-of-domains" model [16, 17]. The properties we considered were: mean degree, degree distribution, mean clustering coefficient, characteristic path length and network density. We explain each network parameter, describe the values observed for the directed domain graph in comparison to random expectation and discuss their biological significance.
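Two of these parameters can be illustrated with a short sketch. The definitions used here are assumptions, since the exact formulas are not given in this text: the clustering coefficient follows the usual Watts-Strogatz form computed on the undirected projection of the graph, and the characteristic path length is the mean shortest directed path over all reachable ordered pairs of nodes. The graph is a dictionary mapping each node to the set of its successors.

```python
from collections import deque
from itertools import combinations


def undirected_neighbours(graph):
    """Ignore edge direction and self-edges; return node -> set of neighbours."""
    nbrs = {node: set() for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            if src != dst:
                nbrs[src].add(dst)
                nbrs.setdefault(dst, set()).add(src)
    return nbrs


def mean_clustering_coefficient(graph):
    """Average over nodes of (links among neighbours) / (possible links)."""
    nbrs = undirected_neighbours(graph)
    coeffs = []
    for adjacent in nbrs.values():
        k = len(adjacent)
        if k < 2:
            coeffs.append(0.0)  # assumption: nodes with < 2 neighbours score 0
            continue
        links = sum(1 for a, b in combinations(adjacent, 2) if b in nbrs[a])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0


def characteristic_path_length(graph):
    """Mean shortest directed path length over reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from this source
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("nan")


# Hypothetical four-node graph: A->B, A->C, B->C, D->A.
toy_graph = {"A": {"B", "C"}, "B": {"C"}, "C": set(), "D": {"A"}}
toy_clustering = mean_clustering_coefficient(toy_graph)
toy_path_length = characteristic_path_length(toy_graph)
```

Other conventions exist, for example excluding low-degree nodes from the clustering average, and they would give different absolute values.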
The degree distribution [see Additional file 1] of the directed domain graph follows a power-law: a small number of superfamilies have many neighbours, while the majority have only one or two. Networks with a power-law degree distribution are described as having scale-free topology.
In order to assess the significance of the domain graph's global network properties, we compared the observed values with those expected at random. The randomly expected values were determined by calculating the network properties for 1000 random graphs. We were specifically interested in the network properties that are impacted by domain order: the clustering relationships of domains, the density and connectedness of the network.
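Deviations from random expectation are reported in this paper in units of standard deviations. A minimal sketch of that comparison follows, assuming the sample standard deviation of the random ensemble is used (the text does not say which estimator was applied); the function name and the numbers are illustrative only.

```python
from statistics import mean, stdev


def deviation_in_standard_deviations(observed, random_values):
    """Return (observed - mean of random values) / sample standard deviation."""
    mu = mean(random_values)
    sigma = stdev(random_values)  # assumption: sample, not population, estimator
    if sigma == 0:
        raise ValueError("random values have zero spread; deviation undefined")
    return (observed - mu) / sigma


# Illustrative numbers only, not values from the paper.
example_deviation = deviation_in_standard_deviations(5.0, [1.0, 2.0, 3.0])
```

In practice `random_values` would hold one value per random graph, for example 1000 mean clustering coefficients.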
In designing our randomisation strategy it was important to consider both the known properties of the graph and the parameters of interest. We knew that the domain graph was scale-free. A simple randomisation approach that preserves only the number of nodes and edges generates a truly random graph, with a degree distribution that follows a Poisson distribution. By definition, such random graphs will exhibit significantly different global properties from their scale-free counterparts, making comparisons between observed network properties and these random graphs meaningless.
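One common way to obtain random graphs that keep the degree distribution of the original is repeated edge swapping, which preserves the in-degree and out-degree of every node. The sketch below uses that technique as an assumption; the randomisation procedure actually used in this study is not detailed in this text. The function name and the toy edge list are hypothetical.

```python
import random
from collections import Counter


def degree_preserving_shuffle(edges, n_swaps, seed=None):
    """Randomise a directed edge list by swapping the targets of edge pairs.

    Assumes the input edges are unique and contain no self-edges.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_list
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        new_first, new_second = (a, d), (c, b)
        # Reject swaps that would create a self-edge or duplicate an edge.
        if a == d or c == b or new_first in edge_set or new_second in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {new_first, new_second}
        edge_list[i], edge_list[j] = new_first, new_second
    return edge_list


# Hypothetical directed edges between five nodes.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A"), ("A", "C")]
shuffled_edges = degree_preserving_shuffle(toy_edges, n_swaps=100, seed=1)
```

Because each accepted swap exchanges only the targets of two edges, every node keeps the same number of incoming and outgoing edges, so the scale-free degree distribution is retained while the wiring is randomised.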
[Table: Global properties of the domain architecture network. Rows: mean clustering coefficient, characteristic path length and network density (% connected); column: deviation from observed. Values not recovered.]
0.2 Genome domain graphs
The domain graph above included domain architectures from all proteins in our set of 192 genomes, providing a broad picture of protein organisation. Because proteins from different genomes are included in a single graph, the domains represented may not all interact, either physically or during evolution. In order to focus on evolutionary and functional features of domain organisation, we considered the global network parameters for the graphs of domains from individual genomes and phylogenetic groups of genomes. We calculated the graph parameters expected at random separately for each genome or group through 1000 randomisations of each graph.
[Table: Global properties of the genome domain architecture network. Entries include Homo sapiens 22.34d and Mus musculus 22.32b. Values not recovered.]
In contrast, the percentage of connected nodes and average path length are not consistently higher or lower than expected. The percentage of connected domains for the majority of genomes is slightly lower (between 0.5 and 2 standard deviations) than expected, while around 20 bacterial species have values slightly higher than expected. The average path lengths present a more complicated picture. For the all-genome graph, the observed path length is longer than expected. However, for all the archaea, a large proportion of eukaryotes and around half of the bacteria, the average path length is slightly lower than expected. For the most part the single-genome values fall within two standard deviations of the mean and may not be significant. This suggests that average path lengths are no different from the values expected at random and are therefore not influenced or controlled by functional or evolutionary constraints.
Section 1 investigates this variability between genomes further by looking on a case-by-case basis at the functional significance of these clusters and their phylogenetic distribution.
0.3 Bi-directional paths
Earlier studies, including [9], demonstrated that N-C terminal domain order is generally conserved across proteins. That is, if domain A is found N-terminal to domain B, the reverse combination (BA) tends not to exist. Other studies looked in detail at the structures of proteins that contradict this rule, such that one protein has domain combination AB while a second contains BA. They found that in general the relative structural orientation of domains A and B was different in the forward compared to the reverse case, and that the proteins have different functions. The clusters above illustrate that there are cases where pairs of domains are observed in both orientations; we call these bi-directional paths.
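Finding bi-directional paths in a directed domain graph amounts to checking, for every edge from A to B, whether an edge from B to A also exists. The sketch below is illustrative only: the function names are hypothetical, and the percentage is defined here as the share of linked domain pairs seen in both orientations, which is an assumption about how such a statistic could be computed.

```python
def bidirectional_pairs(graph):
    """Return unordered domain pairs linked in both N-to-C orientations."""
    pairs = set()
    for a, successors in graph.items():
        for b in successors:
            if a != b and a in graph.get(b, ()):
                pairs.add(frozenset((a, b)))
    return pairs


def percent_bidirectional(graph):
    """Percentage of linked (unordered) domain pairs observed in both orders."""
    linked = {
        frozenset((a, b))
        for a, successors in graph.items()
        for b in successors
        if a != b
    }
    if not linked:
        return 0.0
    return 100.0 * len(bidirectional_pairs(graph)) / len(linked)


# Hypothetical graph: A-B occurs in both orders, B-C in one order only.
toy_graph = {"A": {"B"}, "B": {"A", "C"}, "C": set()}
toy_pairs = bidirectional_pairs(toy_graph)
toy_percent = percent_bidirectional(toy_graph)
```

The same function can be applied to randomised graphs to estimate how many bi-directional pairs would be expected by chance.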
[Figure: % bidirectional links]
Table: Human bidirectionally-linked domains (pairing of entries not recovered)
- P-loop containing nucleoside triphosphate hydrolases
- P-loop containing nucleoside triphosphate hydrolases
- Concanavalin A-like lectins/glucanases
- P-loop containing nucleoside triphosphate hydrolases
- p53-like transcription factors
- Serine proteinase inhibitor lekti
- Ovomucoid/PCI-1 like inhibitors
- RNA-binding domain, RBD
- CCCH zinc finger
- "Winged helix" DNA-binding domain
- C2H2 and C2HC zinc fingers
- Microbial and mitochondrial ADK, insert "zinc finger" domain
1 Local features of domain organisation
The global analysis described above highlighted features of domain organisation that occur more often than expected at random and that cannot be explained by a simple linear model of domain combination evolution. First, the level of clustering within the directed domain graphs is higher than expected at random. This means that some groups of domains have recombined in multiple different ways. Second, some domain pairs are found in two different N-C terminal orders, and this occurs more often than expected at random. That is, domain A is sometimes followed and other times preceded by B, providing evidence of functional but not evolutionary links. This section discusses these features locally, using functional annotation to investigate why they occur and how they are distributed across phylogenetic groups.
1.1 Function of domain clusters
One of the most pronounced features of the domain graph is the higher than expected mean clustering coefficient. For the complete domain graph, the mean clustering coefficient is 22.5 standard deviations above that of the random graphs. In comparison, the characteristic path length and network density are only +2 and -5 standard deviations from the random mean. A high mean clustering coefficient indicates that the domain graph includes groups of nodes whose neighbours are interlinked. In the context of domain organisation, this means that there are groups of promiscuous domains that occur in multiple different combinations.
To investigate the functional and evolutionary features of these clusters, we extracted groups of domains with inter-linking neighbours from genome level domain organisation graphs for 192 completely sequenced genomes. This gave us a list of proteins present in each cluster. In order to identify functional relationships, we used the KEGG pathway database  and Gene Ontology (GO) annotation. This allowed us to extract clusters of domains with every node represented in a particular functional category. We focused on GO molecular function and biological process categories with between 5 and 500 member proteins (in order to exclude both overly specific and very large, non-specific functional groups). We found that for every cluster, all nodes belong to at least one common GO category; significantly more than expected by chance (p < 0.001 [see Additional file 3]). Statistical significance was assessed by comparing the proportion of randomly selected groups of domains that belong to a common functional class with the observed proportion. Random groups were sampled from the entire domain graph and chosen to be the same size as each observed cluster; randomisation was carried out 1,000 times per cluster.
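The random-group comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the annotation table is a hypothetical dictionary from domain to functional categories, random groups are drawn without replacement from all domains, and the function returns the fraction of random groups whose members share at least one category.

```python
import random


def shares_category(domains, annotations):
    """True if every domain in the group has at least one category in common."""
    category_sets = [annotations.get(domain, set()) for domain in domains]
    return bool(category_sets) and bool(set.intersection(*category_sets))


def random_shared_fraction(group_size, all_domains, annotations,
                           n_random=1000, seed=None):
    """Fraction of random same-size groups whose members share a category."""
    rng = random.Random(seed)
    pool = sorted(all_domains)  # sorted so a given seed is reproducible
    hits = sum(
        shares_category(rng.sample(pool, group_size), annotations)
        for _ in range(n_random)
    )
    return hits / n_random


# Hypothetical annotation table: domain -> set of functional categories.
toy_annotations = {
    "A": {"signalling"},
    "B": {"signalling"},
    "C": {"transport"},
    "D": {"metabolism"},
}
toy_fraction = random_shared_fraction(
    2, toy_annotations.keys(), toy_annotations, n_random=200, seed=0
)
```

Comparing this fraction with the proportion of observed clusters that share a category gives an empirical measure of how unusual the observed functional coherence is.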
The clusters we describe are an extreme form of re-use, because not only have individual domains recombined with multiple partners, but their partners have recombined with each other. Aside from signal-transduction domains, the two central human clusters (shown in Figure 7) also include protein-interaction domains; for example, the Ankyrin repeat domain. These clusters are examples where the proteins involved make use of the protein interaction domains in different combinations and with other partners to diversify or specialise their functions.
This suggests that functional and structural constraints (or lack thereof) can lead to exceptional arrangements of domains. For instance, multi-domain structures that are essentially beads-on-a-string with no fixed interface between domains are more likely to be functional in multiple orientations than their tightly (structurally) inter-linked counterparts. Many of the domains we observe within our clusters can function independently of their neighbours.
1.2 Phyletic patterns of human domain clusters
The clusters that we observed in Homo sapiens are almost exclusively eukaryote-specific (shown in Figure 6) and for the most part the domains that occur within the clusters are themselves found only in eukaryotes. Even for the domains that are found in bacteria, the particular combinations we observe in clusters are peculiar to eukaryotes.
The great majority of human clusters are also found in chimpanzee, and almost all of these are present in mouse and rat. Looking at more distantly related chordate genomes, the conservation declines rapidly, with only the central cluster common to chicken and many links missing in Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis. If we look to even more distantly related species, for example Saccharomyces cerevisiae, none of the clusters are conserved.
Conclusion
The arrangement of domains within proteins has been studied previously using a graph representation where nodes are domains and edges join domains observed within a single protein. A shortcoming of this representation is that it does not take into account the N to C terminal arrangement of domains on the protein. We have developed a directed graph model of domain organisation that considers order and relative N to C terminal position. By investigating the global properties of the network, we have shown that domain clustering occurs significantly more often than expected at random.
Considering each genome in isolation we found that the high degree of clustering observed for the multi-genome dataset also holds for each genome individually. However, the characteristic path length and percentage of connected nodes are not very different from the randomly expected values. These findings suggest that the domain organisation of individual genomes varies but all show a higher than expected degree of clustering.
Focusing in detail on domain clusters, we identified functional constraints that make this arrangement highly preferable to the organism. Clusters in human are almost exclusively eukaryote-specific and have roles in signal transduction and protein-protein interaction.
Finally, we observe pairs of domains found in forward and reverse orientation in different proteins more often than would be expected at random. While previous work has shown that this phenomenon is rare in terms of the number of occurrences in proteins, we see the opposite trend for the existence of such domain pairs. The functions of the domains that occur in both orientations are similar to those found in clusters, suggesting a common underlying functional or structural cause.
SKK is supported by an Australian NHMRC CJ Martin Postdoctoral Fellowship.
References
1. Brenner SE, Hubbard T, Murzin A, Chothia C: Gene duplications in the H. influenzae genome. Nature 1995, 378: 140. doi:10.1038/378140a0
2. Teichmann SA, Park J, Chothia C: Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc Natl Acad Sci USA 1998, 95: 14658–14663. doi:10.1073/pnas.95.25.14658
3. Apic G, Gough J, Teichmann SA: An insight into domain combinations. Bioinformatics 2001, 17 Suppl 1: S83-S89.
4. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540.
5. Orengo C, Thornton J: Protein families and their evolution-a structural perspective. Annual Review of Biochemistry 2005, 74: 867–900. doi:10.1146/annurev.biochem.74.082803.133029
6. Koonin EV, Wolf YI, Karev GP: The structure of the protein universe and genome evolution. Nature 2002, 420(6912):218–223. doi:10.1038/nature01256
7. Muller A, MacCallum RM, Sternberg MJE: Structural characterization of the human proteome. Genome Res 2002, 12(11):1625–1641. doi:10.1101/gr.221202
8. Gerrard DT, Bornberg-Bauer E: doMosaic – Analysis of the mosaic-like domain arrangements in proteins. Informatica 2003, 27: 15–20.
9. Apic G, Huber W, Teichmann SA: Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J Struct Funct Genomics 2003, 4(2–3):67–78. doi:10.1023/A:1026113408773
10. Vogel C, Berzuini C, Bashton M, Gough J, Teichmann S: Supra-domains: evolutionary units larger than single protein domains. J Mol Biol 2004, 336: 809–23. doi:10.1016/j.jmb.2003.12.026
11. Gough J: Convergent evolution of domain architectures (is rare). Bioinformatics 2005, 21(8):1464–1471. doi:10.1093/bioinformatics/bti204
12. Bashton M, Chothia C: The geometry of domain combination in proteins. J Mol Biol 2002, 315(4):927–939. doi:10.1006/jmbi.2001.5288
13. Forslund K, Henricson A, Hollich V, Sonnhammer EL: Domain tree-based analysis of protein architecture evolution. Mol Biol Evol 2008, 25: 254–64. doi:10.1093/molbev/msm254
14. Basu MK, Carmel L, Rogozin IB, Koonin EV: Evolution of protein domain promiscuity in eukaryotes. Genome Res 2008, 18(3):449–61. doi:10.1101/gr.6943508
15. Weiner J 3rd, Moore AD, Bornberg-Bauer E: Just how versatile are domains? BMC Evol Biol 2008, 8: 285. doi:10.1186/1471-2148-8-285
16. Wuchty S: Scale-free behavior in protein domain networks. Mol Biol Evol 2001, 18: 1694–702.
17. Ye Y, Godzik A: Comparative analysis of protein domain organization. Genome Res 2004, 14: 343–353. doi:10.1101/gr.1610504
18. Weiner J, Beaussart F, Bornberg-Bauer E: Domain deletions and substitutions in the modular protein evolution. FEBS J 2006, 273(9):2037–47. doi:10.1111/j.1742-4658.2006.05220.x
19. Przytycka T, Davis G, Song N, Durand D: Graph theoretical insights into evolution of multidomain proteins. J Comput Biol 2006, 13(2):351–63. doi:10.1089/cmb.2006.13.351
20. Wuchty S, Almaas E: Evolutionary cores of domain co-occurrence networks. BMC Evol Biol 2005, 5: 24. doi:10.1186/1471-2148-5-24
21. Cohen-Gihon I, Nussinov R, Sharan R: Comprehensive analysis of co-occurring domain sets in yeast proteins. BMC Genomics 2007, 8: 161. doi:10.1186/1471-2164-8-161
22. Qian J, Luscombe NM, Gerstein M: Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J Mol Biol 2001, 313(4):673–81. doi:10.1006/jmbi.2001.5079
23. Dokholyan NV, Shakhnovich B, Shakhnovich EI: Expanding protein universe and its origin from the biological Big Bang. Proc Natl Acad Sci USA 2002, 99(22):14132–6. doi:10.1073/pnas.202497999
24. Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 2001, 313(4):903–919. doi:10.1006/jmbi.2001.5080
25. Barabasi A, Albert R: Emergence of scaling in random networks. Science 1999, 286: 509–512. doi:10.1126/science.286.5439.509
26. Newman ME, Strogatz SH, Watts DJ: Random graphs with arbitrary degree distributions and their applications. Physical review E, Statistical, nonlinear, and soft matter physics 2001, 64(2 Pt 2):026118.
27. Bollobas B, Borgs C, Chayes J, Riordan O: Directed scale-free graphs. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics; 2003:132–139.
28. Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393: 440–442. doi:10.1038/30918
29. Apic G, Gough J, Teichmann SA: Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 2001, 310(2):311–325. doi:10.1006/jmbi.2001.4776
30. Todd AE, Orengo CA, Thornton JM: Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 2001, 307(4):1113–1143. doi:10.1006/jmbi.2001.4513
31. Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000, 28: 27–30. doi:10.1093/nar/28.1.27
32. Pawson T, Nash P: Assembly of cell regulatory systems through protein interaction domains. Science 2003, 300(5618):445–452. doi:10.1126/science.1083653
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.