Protein comparison at the domain architecture level
© Lee and Lee; licensee BioMed Central Ltd. 2009
- Published: 3 December 2009
The general method used to determine the function of newly discovered proteins is to transfer annotations from well-characterized homologous proteins. The process of selecting homologous proteins can largely be classified into sequence-based and domain-based approaches. Domain-based methods have several advantages for identifying distant homology and homology among proteins with multiple domains, as compared to sequence-based methods. However, these methods are challenged by large families defined by 'promiscuous' (or 'mobile') domains.
Here we present a measure of domain architecture similarity, called Weighed Domain Architecture Comparison (WDAC), which can be used to identify homologs of multidomain proteins. To distinguish promiscuous domains from conventional protein domains, we assigned a weight score to each Pfam domain extracted from RefSeq proteins, based on its abundance and versatility. To measure the similarity of two domain architectures, we used cosine similarity, a similarity measure from information retrieval. We combined sequence similarity with domain architecture comparison to rank proteins that belong to the same domain architecture. Using human and nematode proteomes, we compared WDAC with an unweighted domain architecture comparison method (DAC) to evaluate the effectiveness of domain weight scores. We found that WDAC is better at identifying homology among multidomain proteins.
Our analysis indicates that considering domain weight scores in domain architecture comparisons improves protein homology identification. We developed a web-based server to allow users to compare their proteins with protein domain architectures.
- Cosine Similarity
- Weight Score
- Domain Architecture
- Pfam Domain
- Inverse Document Frequency
Homology identification is part of a broad spectrum of genomic analyses, including the annotation of new whole genome sequences, the construction of comparative maps, the analysis of whole genome duplications and comparative approaches to identifying regulatory motifs. The general method used to determine the function of newly discovered proteins is to transfer annotation from well-characterized homologous proteins sharing a common ancestry. Current methods for the identification of homologous proteins can be largely classified into sequence-based and domain-based approaches. Sequence comparison methods, such as BLAST and FASTA, are the commonly used traditional approaches to identify homologous genes [4, 5]. These methods assume that sequences with significant similarity share common ancestry, i.e. are homologs. However, the existence of multi-domain proteins and complex evolutionary mechanisms poses difficulties for sequence-based methods.
Domain-based methods use information about the domains contained in proteins. Domains are the building blocks of all proteins, and present one of the most useful levels at which protein function can be understood. Although the concept of a 'domain' now permeates biological descriptions, there are several definitions directed at different levels of the protein. In structural biology, a domain is defined as a spatially distinct, compact and stable protein structural unit that could conceivably fold and function in isolation. Domains are also defined as distinct regions of protein sequence that are highly conserved throughout evolution. These are described as sequence homologs and are often present in different molecular contexts. Sequence-based domain definitions represent one of the most convenient and practically important levels at which the evolution and function of both proteins and domains can be understood.
Domain-based approaches generally identify homologous proteins by comparing protein domain architecture, which is the linear order of the individual domains in a multidomain protein. About two thirds of proteins in prokaryotes and 80% of proteins in eukaryotes are multi-domain proteins. Studies of domain-based methods indicate that comparing domain architecture is a useful method for classifying evolutionarily related proteins and detecting evolutionarily distant homologs. Several studies have proposed tools for domain architecture comparison, such as CDART and PDART. However, these methods are challenged by large families defined by 'promiscuous' (or 'mobile') domains, which combine in many ways with other domains to form different proteins. Promiscuous domains typically have functions auxiliary to the primary role of the protein, acting as signal transducers or adapters [14, 15]. They also play a major role in creating diversity of protein domain architecture in the proteome. Because they are not directly related by homology, they should be given less importance in homology identification than key domains. Another problem inherent to these domain-based measures is that they treat all proteins in a domain architecture equally, so they cannot discriminate among proteins belonging to the same domain architecture. Since most domain architectures consist of many proteins, identification methods are needed to determine which protein, within a set of proteins belonging to the same domain architecture, is the most homologous to the query protein.
Here we present a measure of domain architecture similarity, called Weighed Domain Architecture Comparison (WDAC), which can be used to identify homologs of multidomain proteins. The key ideas are the use of weight scores for domain promiscuity and the combination of domain architecture comparison with a sequence similarity method. The weight scores are calculated based on a domain's frequency and versatility in RefSeq proteins. The effectiveness of our method is evaluated using human and nematode proteomes. We developed a web-based server to allow users to compare their proteins with protein domain architectures. The server is available at http://wdac.kr/.
In this study we used the Pfam database to analyze the domain organization of proteins. Pfam is a large and widely used database of protein domains and families. Pfam contains curated multiple sequence alignments for each family, as well as profile hidden Markov models for finding these domains in new sequences. Pfam also provides better genomic coverage than structure-based domain assignments, such as CATH and SCOP, particularly for membrane proteins.
Measuring the strength of domain promiscuity
where $p_t$ is the total number of proteins and $p_d$ is the number of proteins containing domain $d$.
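As a rough illustration of how the total protein count and the per-domain protein count defined above could enter an abundance term, the sketch below assumes the standard inverse document frequency form, a logarithm of the ratio of the two counts. The function name `inverse_abundance`, the choice of base 2, and the example counts are assumptions made for illustration; they are not taken from the method description, and the versatility component of the weight score is not modelled here.

```python
import math

def inverse_abundance(p_t: int, p_d: int) -> float:
    """IDF-style abundance term, log2(p_t / p_d).

    Assumed form: p_t is the total number of proteins and p_d is the
    number of proteins that contain the domain of interest.
    """
    if p_d <= 0 or p_d > p_t:
        raise ValueError("p_d must satisfy 0 < p_d <= p_t")
    return math.log2(p_t / p_d)

# Illustrative counts only: a domain present in every protein carries no
# information, whereas a domain seen in a single protein scores highest.
common = inverse_abundance(1024, 1024)
rare = inverse_abundance(1024, 1)
```

Under this assumed form, abundant domains receive low scores and rare domains receive high ones, which matches the stated goal of giving promiscuous domains less influence.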
Comparison of domain architectures using weight scores
The range of the cosine similarity is [0, 1], where 1 indicates that x and y have the same domains and 0 indicates that they share no domains.
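The behaviour at the two ends of this range can be checked with a small sketch. It assumes each architecture is encoded as a vector with one component per domain, equal to that domain's weight score when the domain is present and 0 otherwise; the function name `weighted_cosine`, the domain labels, and the weight values are hypothetical.

```python
import math
from typing import Dict, Iterable

def weighted_cosine(x: Iterable[str], y: Iterable[str],
                    weight: Dict[str, float]) -> float:
    """Cosine similarity between two domain sets.

    Each set is treated as a vector whose component for domain d is
    weight[d] if d is present and 0 otherwise (an assumed encoding).
    """
    xs, ys = set(x), set(y)
    dot = sum(weight[d] ** 2 for d in xs & ys)          # shared domains only
    norm_x = math.sqrt(sum(weight[d] ** 2 for d in xs))
    norm_y = math.sqrt(sum(weight[d] ** 2 for d in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical domain labels and weights, for illustration only.
weights = {"DomA": 3.0, "DomB": 4.0, "DomC": 1.0}
same = weighted_cosine(["DomA", "DomB"], ["DomB", "DomA"], weights)
none = weighted_cosine(["DomA"], ["DomC"], weights)
partial = weighted_cosine(["DomA", "DomC"], ["DomA", "DomB"], weights)
```

Because the weights are non-negative, the result stays within [0, 1]. In the `partial` case, sharing the low-weight `DomC` would contribute far less to the score than sharing the high-weight `DomB`.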
Pipeline for domain architecture comparison
We developed a web-based server to provide a back-end pipeline for protein homology and to allow users to compare their protein sequences with a domain architecture database. The web interface is implemented with static HTML and CGI scripts, and MySQL DBMS is used to store the database.
Obtaining weight scores of protein domains
We downloaded 6,042,750 protein sequences from the RefSeq database (Release 32). The domain content of the sequences was analyzed with Pfam 23.0, which contains 10,340 families. The Pfam domain annotations of all RefSeq proteins were obtained from the Similarity Matrix of Proteins (SIMAP) database. We filtered domain hits in proteins with a cutoff E-value of 0.01 and excluded proteins without Pfam signatures. We then extracted all the Pfam domains from the Pfam-annotated proteins.
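The filtering step can be sketched as follows. The `DomainHit` record layout and the `architectures` function are hypothetical and do not reflect the SIMAP data format; whether the 0.01 cutoff is inclusive is also an assumption.

```python
from typing import Dict, List, NamedTuple

class DomainHit(NamedTuple):
    """Hypothetical record for one Pfam hit on one protein."""
    protein_id: str
    pfam_id: str
    start: int
    evalue: float

def architectures(hits: List[DomainHit],
                  max_evalue: float = 0.01) -> Dict[str, List[str]]:
    """Drop hits above the E-value cutoff and list the remaining domains
    of each protein in N- to C-terminal order. Proteins left without any
    hit are simply absent from the result."""
    kept = sorted((h for h in hits if h.evalue <= max_evalue),
                  key=lambda h: (h.protein_id, h.start))
    result: Dict[str, List[str]] = {}
    for h in kept:
        result.setdefault(h.protein_id, []).append(h.pfam_id)
    return result

# Illustrative hits: P1 keeps two domains, P2 has no hit below the cutoff.
hits = [
    DomainHit("P1", "PF00023", 50, 1e-5),
    DomainHit("P1", "PF00005", 10, 1e-10),
    DomainHit("P1", "PF00001", 90, 0.5),
    DomainHit("P2", "PF00005", 5, 0.2),
]
archs = architectures(hits)
```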
Because domains are differently distributed in the three kingdoms and some domains are absent or present in one or two kingdoms, we assigned three kingdom-specific weight scores to each domain based on its abundance and versatility in the three kingdoms. To measure domain abundance, we obtained the kingdom-specific protein frequency for each domain. Most domains occur in a hundred or fewer proteins, but a few domains are highly duplicated and occur in over 10,000 sequences. The most abundant domain in the three kingdoms is ABC_tran (PF00005), appearing in 54,980 bacterial proteins. To measure domain versatility, we obtained kingdom-specific N- and C-side distinct domain families adjacent to each domain. We found that most domains have one or two distinct adjacent domain families. The features of the obtained domain versatility are consistent with earlier reports that the number of different partner domains for a single domain or for a domain combination follows a power law distribution: many domains or domain combinations have only very few different N-terminal or C-terminal partner domains. The most versatile domain in the three kingdoms is Ank (PF00023), having 220 distinct partner domain families in eukaryotes.
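The two quantities described above can be computed from a mapping of proteins to ordered domain lists, as in the following sketch. The function name and input layout are hypothetical. Pooling N-side and C-side neighbours into a single set, and counting a tandem repeat of the same family as a partner, are simplifying assumptions. For kingdom-specific values the function would be run once per kingdom.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def abundance_and_versatility(
    archs: Dict[str, List[str]]
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (abundance, versatility) per domain.

    abundance[d]   = number of proteins containing d
    versatility[d] = number of distinct families directly adjacent to d
                     on either side (pooled; an assumption)
    """
    proteins_with: Dict[str, Set[str]] = defaultdict(set)
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for protein_id, domains in archs.items():
        for i, d in enumerate(domains):
            proteins_with[d].add(protein_id)
            if i > 0:                      # N-side neighbour
                neighbours[d].add(domains[i - 1])
            if i + 1 < len(domains):       # C-side neighbour
                neighbours[d].add(domains[i + 1])
    abundance = {d: len(p) for d, p in proteins_with.items()}
    versatility = {d: len(neighbours[d]) for d in proteins_with}
    return abundance, versatility

# Toy architectures with placeholder domain names.
toy = {"p1": ["A", "B", "C"], "p2": ["A", "D"], "p3": ["B"]}
abundance, versatility = abundance_and_versatility(toy)
```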
The distribution of weight scores across the three kingdoms of life (scores binned in intervals of 20, from 0 - 20 up to 120 - 140).
Top ten promiscuous domains in the three kingdoms. Numbers in parentheses are the weight scores of the Pfam domains.
where $f_{11}$ is the number of domains common to both sequences $X$ and $Y$, $f_{10}$ is the number of domains in $X$, and $f_{01}$ is the number of domains in $Y$.
We extracted all complete human and nematode protein sequences from RefSeq, yielding 32,999 human and 23,220 nematode protein sequences. Among these proteins, 23,295 human and 14,522 nematode proteins have detectable Pfam domain information. Among the human proteins, we selected 9,764 proteins that contain at least two Pfam domains and performed domain architecture comparisons between the selected human proteins (≥ two domains) and those from the nematode proteome (≥ one domain) using the WDAC and DAC algorithms.
To validate homologous pairs of human and nematode proteins, we used the HomoloGene database, an NCBI resource that curates sets of orthologs from the annotated genes of several completely sequenced eukaryotic genomes. Among the 44,481 groups in HomoloGene release 61, we selected 2,559 groups that contain both selected human proteins and nematode proteins. From the comparison results, we extracted the WDAC and DAC results in which the query (human) and the best-matched protein (nematode) have the same HomoloGene ID. The numbers of true positives in the WDAC and DAC results are 2,328 (91%) and 2,175 (85%), respectively, which indicates that considering weight scores in domain architecture comparison can improve homology identification.
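The validation criterion can be sketched as a simple count: a query is a true positive when its best match carries the same HomoloGene group ID. The function name, the dictionary layout, and the toy identifiers below are hypothetical. The last two lines show that the reported percentages are consistent with dividing the true positive counts by the 2,559 selected groups.

```python
from typing import Dict, Optional

def count_true_positives(best_match: Dict[str, Optional[str]],
                         group_of: Dict[str, int]) -> int:
    """Count queries whose best-matched protein shares their group ID."""
    tp = 0
    for query, hit in best_match.items():
        q_group = group_of.get(query)
        if hit is not None and q_group is not None \
                and group_of.get(hit) == q_group:
            tp += 1
    return tp

# Toy data with made-up identifiers: only h1/n1 share a group.
group_of = {"h1": 10, "h2": 20, "n1": 10, "n2": 30}
best = {"h1": "n1", "h2": "n2"}
tp = count_true_positives(best, group_of)

# Percentages implied by the reported counts over 2,559 groups.
wdac_pct = round(100 * 2328 / 2559)
dac_pct = round(100 * 2175 / 2559)
```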
Construction of web server
Several current homology identification methods compare domain architectures. However, these methods are challenged by large families defined by promiscuous domains. To cope with the promiscuous domain problem, we present a method for measuring the similarity among protein domain architectures based on their Pfam-A domain annotations. The Pfam database may contain a small number of false positives and false negatives. Nevertheless, it is currently one of the most useful and practical domain annotation databases for protein sequences. In this study, we used domain weight scores derived from the abundance and versatility of domains. Our analysis indicates that considering domain weight scores in domain architecture comparison improves the performance of protein homology identification. The WDAC algorithm is also effective in resolving some issues that have baffled traditional sequence-based comparison methods, such as the comparison of proteins with promiscuous domain(s). The WDAC algorithm and its web server could be used to explore the underlying evolutionary relationships among proteins at the level of their whole domain architectures, rather than at the single-domain or protein sequence level.
Other papers from the meeting have been published as part of BMC Genomics Volume 10 Supplement 3, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Computational Biology, available online at http://www.biomedcentral.com/1471-2164/10?issue=S3.
BL was supported by the KRIBB Research Initiative Program and by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. M10869030002-08N6903-00210). DL was supported by the Ministry of Knowledge Economy, Korea, under the ITRC support program supervised by the IITA (IITA-2008-C1090-0801-0001) and Korea Institute of Science and Technology Information Supercomputing Center.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S15.
- Song N, Sedgewick RD, Durand D: Domain architecture comparison for multidomain homology identification. J Comput Biol 2007, 14(4):496–516. doi:10.1089/cmb.2007.A009
- Punta M, Ofran Y: The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS Comput Biol 2008, 4(10):e1000160. doi:10.1371/journal.pcbi.1000160
- Ponting CP, Russell RR: The natural history of protein domains. Annu Rev Biophys Biomol Struct 2002, 31: 45–71. doi:10.1146/annurev.biophys.31.082901.134314
- Lee B, Hong T, Byun SJ, Woo T, Choi YJ: ESTpass: a web-based server for processing and annotating expressed sequence tag (EST) sequences. Nucleic acids research 2007, (35 Web Server):W159–162. doi:10.1093/nar/gkm369
- Lee B, Shin G: CleanEST: a database of cleansed EST libraries. Nucleic acids research 2009, (37 Database):D686–689. doi:10.1093/nar/gkn648
- Song N, Joseph JM, Davis GB, Durand D: Sequence similarity network reveals common ancestry of multidomain proteins. PLoS Comput Biol 2008, 4(4):e1000063. doi:10.1371/journal.pcbi.1000063
- Hollich V, Sonnhammer EL: PfamAlyzer: domain-centric homology search. Bioinformatics (Oxford, England) 2007, 23(24):3382–3383. doi:10.1093/bioinformatics/btm521
- Chothia C, Gough J, Vogel C, Teichmann SA: Evolution of the protein repertoire. Science 2003, 300(5626):1701–1703. doi:10.1126/science.1085371
- Lin K, Zhu L, Zhang DY: An initial strategy for comparing proteins at the domain architecture level. Bioinformatics (Oxford, England) 2006, 22(17):2081–2086. doi:10.1093/bioinformatics/btl366
- Tordai H, Nagy A, Farkas K, Banyai L, Patthy L: Modules, multidomain proteins and organismic complexity. The FEBS journal 2005, 272(19):5064–5078. doi:10.1111/j.1742-4658.2005.04917.x
- Fong JH, Geer LY, Panchenko AR, Bryant SH: Modeling the evolution of protein domain architectures using maximum parsimony. Journal of molecular biology 2007, 366(1):307–315. doi:10.1016/j.jmb.2006.11.017
- Geer LY, Domrachev M, Lipman DJ, Bryant SH: CDART: protein homology by domain architecture. Genome research 2002, 12(10):1619–1623. doi:10.1101/gr.278202
- Bjorklund AK, Ekman D, Light S, Frey-Skott J, Elofsson A: Domain rearrangements in protein evolution. Journal of molecular biology 2005, 353(4):911–923. doi:10.1016/j.jmb.2005.08.067
- Basu MK, Carmel L, Rogozin IB, Koonin EV: Evolution of protein domain promiscuity in eukaryotes. Genome research 2008, 18(3):449–461. doi:10.1101/gr.6943508
- Lee B, Lee D: DAhunter: a web-based server that identifies homologous proteins by comparing domain architecture. Nucleic Acids Res 2008, (36 Web Server):W60–64. doi:10.1093/nar/gkn172
- Basu MK, Poliakov E, Rogozin IB: Domain mobility in proteins: functional and evolutionary implications. Brief Bioinform 2009, 10(3):205–216. doi:10.1093/bib/bbn057
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research 2007, (35 Database):D61–65. doi:10.1093/nar/gkl842
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, et al.: The Pfam protein families database. Nucleic acids research 2008, (36 Database):D281–288.
- Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, Dibley M, Redfern O, Pearl F, Nambudiry R, Reid A, et al.: The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic acids research 2007, (35 Database):D291–297. doi:10.1093/nar/gkl959
- Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic acids research 2008, (36 Database):D419–425.
- Vogel C, Teichmann SA, Pereira-Leal J: The relationship between domain duplication and recombination. Journal of molecular biology 2005, 346(1):355–365. doi:10.1016/j.jmb.2004.11.050
- Yu S, Van Vooren S, Tranchevent LC, De Moor B, Moreau Y: Comparison of vocabularies, representations and ranking algorithms for gene prioritization by text mining. Bioinformatics (Oxford, England) 2008, 24(16):i119–125. doi:10.1093/bioinformatics/btn291
- Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, De Moor B: TXTGate: profiling gene groups with text-based information. Genome Biol 2004, 5(6):R43. doi:10.1186/gb-2004-5-6-r43
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. doi:10.1093/nar/25.17.3389
- Rattei T, Tischler P, Arnold R, Hamberger F, Krebs J, Krumsiek J, Wachinger B, Stumpflen V, Mewes W: SIMAP--structuring the network of protein similarities. Nucleic Acids Res 2008, (36 Database):D289–292.
- Balestre M, Von Pinho RG, Souza JC, Lima JL: Comparison of maize similarity and dissimilarity genetic coefficients based on microsatellite markers. Genet Mol Res 2008, 7(3):695–705. doi:10.4238/vol7-3gmr458
- Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2009, (37 Database):D5–15. doi:10.1093/nar/gkn741
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.