DODO: an efficient orthologous genes assignment tool based on domain architectures. Domain based ortholog detection
© Chen et al; licensee BioMed Central Ltd. 2010
Published: 15 October 2010
Orthologs are genes derived from the same ancestor gene loci after speciation events. Orthologous proteins usually have similar sequences and perform comparable biological functions. Therefore, ortholog identification is useful in annotations of newly sequenced genomes. With rapidly increasing number of sequenced genomes, constructing or updating ortholog relationship between all genomes requires lots of effort and computation time. In addition, elucidating ortholog relationships between distantly related genomes is challenging because of the lower sequence similarity. Therefore, an efficient ortholog detection method that can deal with large number of distantly related genomes is desired.
An efficient ortholog detection pipeline DODO (DOmain based Detection of Orthologs) is created on the basis of domain architectures in this study. Supported by domain composition, which usually directly related with protein function, DODO could facilitate orthologs detection across distantly related genomes. DODO works in two main steps. Starting from domain information, it first assigns protein groups according to their domain architectures and further identifies orthologs within those groups with much reduced complexity. Here DODO is shown to detect orthologs between two genomes in considerably shorter period of time than traditional methods of reciprocal best hits and it is more significant when analyzed a large number of genomes. The output results of DODO are highly comparable with other known ortholog databases.
DODO provides a new efficient pipeline for detection of orthologs in a large number of genomes. In addition, a database established with DODO is also easier to maintain and could be updated relatively effortlessly. The pipeline of DODO could be downloaded from http://18.104.22.168:16080/dodo_web/home.htm
Orthologous gene identification is an important step in comparative genomics. The word orthologs originally refer to genes in different species derived from the same locus in their last common ancestor. Since orthologs are genes derived from the same ancestor gene, orthologs often have similar amino acid sequences and expected to perform the same or similar cellular function [1, 2]. These properties make orthologs useful in functional genomics analysis. In addition to reconstructing the phylogeny and revealing the evolution history of species, orthologs could also be applied to genome annotation and protein-protein interaction prediction [3, 4]. The orthologs can be treated as corresponding genes in different species after species evolved and consequently it is an important issue to detect this kind of ortholog relationship between species.
A number of methods have been developed for orthologs detection. In practice, orthologs are defined through reciprocal best hits (RBH) from primary protein sequences between two species by various algorithms. For instance, the COG, InParanoid, and orthDB are built based on such RBH approach [6–8]. Beside RBH, tree-based methods such as those for reconstructing the LOFT, COCO-CL and HOPS database have also been developed [9–11], where trees are established via heuristic calculations of sequence similarity and the orthologous relationships are inferred from the tree structures. Some databases such as the Ensembl Compara and HomoloGene are constructed with both RBH and phylogenetic tree information [12, 13]. In addition, some methods identify orthologs by reconstructing genome rearrangement events in closely related genomes such as MSOAR and MultiMSOAR [14, 15].
With the advance of high throughput sequencing technologies, it is anticipated a dramatic increase in the number of completed genomes. Two challenges are posed to ortholog identification. The first issue is the speed of analyzing a large number of proteins. Increasing number of genomes necessitate faster method for data analysis and processing. Another issue is the ability to identify orthologs in distantly related species where sequence similarity might be low. However, the complexity and computation time of the RBH methods increase considerably as mutual comparisons are needed between each pair of species. For example, it needs 4,950 times of mutual comparisons between pair of genomes to identify ortholog relationships among 100 genomes and for 1000 genomes it would need 499,500 times of sequence comparison and alignments. Thus, new methods that can identify orthologous relationships among a large number of genomes, some of which are distantly related, in a reasonable time are beneficial. Here we propose an efficient and function-based new ortholog detection method called DODO (DOmain based Detection of Orthologs) to overcome the hurdles in ortholog identification from a large number of genomes.
DODO pipeline is designed for efficient discovery of the orthologous relationship between an anchor genome of interest (or well studied) and other genome(s). DODO detects homolog groups aided by protein domain information. In the beginning, DODO classifies proteins into groups based on both their domain composition and architecture. Domains are the functional units of proteins. Proteins having the same domain architecture likely have the same cellular function which implies homology in structures and functions. While the similarity between primary sequence of orthologs may decrease dramatically in distantly related species, the domain composition is more likely to be conserved through evolution due to the functional constraint [16, 17]. The domain architecture based method could be applied to detect homologous relationships between distantly related species. After proteins of the same domain architecture are grouped together, DODO further refines the orthologous relationship within each homolog group by identifying RBH among the smaller protein set. This strategy of ortholog searching in smaller groups instead of the whole genome makes DODO an efficient pipeline.
In addition to efficiency, database established by DODO could also be easily updated and practically the DODO results are comparable to those predicted by the traditional RBH methods. Adding new species into the database does not require reprocessing of he previously analyzed species which already existed in the database - a procedure necessitated by the traditional RBH methods. For traditional RBH methods, to update a database consisted of n existed old species, the newly added m species will cost n*m times of mutual comparisons between each pair of existed old genomes and newly added genomes. Instead, to update a database constructed by DODO only needs m times of domain identification for those newly input genomes no matter how many species already included in the database. It is easier to maintain and update an ortholog database efficiently in this schema.
The DODO pipeline, which can be freely download and executed locally, is written in Python. Given input the protein sequences in FASTA format, the pipeline will run RBS-BLAST, cluster the proteins with the same domains, and finally output a report the ortholog groups automatically. DODO requires BLAST for domain identification and similarity search. The ortholog group assignment is done in two steps. Proteins are assigned into homolog groups based on their domain information and then further classified by RBH within homolog groups.
Grouping of proteins according to domain architecture
Domain assignment is performed with RPS-BLAST for each protein sequence using Pfam v23  as the source database. Default parameters are used except the expected value which is set to below 0.01. Domain hit(s) information is then extracted from hits in the RPS-BLAST result files. Proteins having the same domain composition and order are grouped together into one group. Proteins without Pfam domain information are all grouped into an uncharacterized group for further analysis.
Assigning the ortholog group
For some of the proteins, the information of protein domain alone may not be sufficient to determine the orthologous relationship. These groups may contain the same protein architecture, but some of them may nevertheless be very different at the sequence level and thus their ortholog relationship could be resolved. This is especially evident on expended paralogous gene families. Therefore, proteins within the same domain architectural group are further sub-classified with the RBH method. Choosing one species as anchor, BLASTP is performed to identify RBH between the anchor species and all the other species. These final sets of groups are then reported as the ortholog groups.
The output of DODO pipeline is a text file containing the ortholog information. Orthologs identified based on both domain information and RBH have IDs starting with 'PfamArcNu' while orthologs identified based purely on RBH have IDs starting with 'NoDomainInfo'. The domain architecture for orthologs could be found in the file PfamArcMap.txt under the project folder.
DODO first clustered proteins into groups based on their domain architectures and then found orthologous relationship within each group. This strategy speeds up the ortholog identification procedure and facilitates the maintenance of ortholog database. Here we investigate the efficiency of DODO and compare the performance of DODO against published databases.
Computation time comparison
A dataset of 21,673 human and 23,497 mouse protein sequences used in InParanoid  is utilized to demonstrate the relative short processing time of DODO. The comparison was done on a Linux server with 16GB RAM and 4*AMD Operon CPU. The total computation time of DODO was 21,263 seconds (5.91 hours) while the InParanoid pipeline took 135,585 seconds (37.66 hours). This result shows that, even considering only two species, DODO can identify the orthologous relationships within these species in about 15.7% of the time that the conventional RBH takes. This difference in computation time will become larger as more species are analyzed. The computation time of the conventional RBH method grows roughly proportionally to the square of the number of species. On the other hand, DODO compares each species to the same domain database only once, regardless of how many species were in comparison. Therefore DODO has significant advantage over conventional RBH in terms of the process time. This is increasingly important as more and more genomes are being sequenced and analyzed today than ever before.
Comparison of DODO ortholog groups with the HomoloGene release 64
HomoloGene  is a homolog sequence database which was constructed based on both sequence information and phylogeny information. It records the homolog relationship between 20 completely sequenced eukaryotic genomes. We extracted the 300,701 protein sequences that are used in HomoloGene release 64 from RefSeq and those sequences are a subset of a total of 330,610 protein sequences originally used in HomoloGene release 64 reconstruction. Using human as the anchor species, DODO identified 18,202 ortholog groups. These cover 92.7% of homolog groups containing human proteins in the HomoloGene dataset. We investigated whether those ortholog groups identified with DODO was a subset of groups reported in HomoloGene. Since HomoloGene is a database of homologs, each group in HomoloGene is likely to be a superset of orthologs. We found that 46.7% of ortholog groups identified with DODO have exactly the same classification as HomoloGene and 89.5% of them have more than half of the proteins present in the corresponding ortholog groups in HomoloGene 64.
Comparison of DODO ortholog groups with InParanoid
Orthologs detection in 100 genomes in InParanoid
Examples of DODO identified ortholog groups that were not identified in InParanoid.
Ensembl human gene id
number of species
Average a.a. length
DODO detects ortholog based on domain compositions instead of primary protein sequences and has brought up several advantages in the aspect of biology. As shown in the results above, DODO was able to detect most orthologs in several published databases. In addition, it can detect orthologs having short sequences and lower sequence similarity if information of the domain architecture is evident. This strategy finds orthologs based more directly on functional constraints. As a result, ortholog groups detected with DODO are thought to have similar if not the same biological functions in organisms. Ortholog detected by this strategy will be helpful in the annotations of newly sequenced genomes of which the functions of genes are interested. The domain compositions of proteins should be more conserved than primary sequence since the sequence of proteins are susceptible to mutation while the function of proteins are under greater constraints. The protein domain composition is responsible for protein function and is thus more likely to be conserved than primary sequences in distantly related genomes.
In addition to the relative high efficiency of DODO, an orthologous database built with DODO is less costly to maintain comparing to other methods. When a new genome is added to the database, sequences of this genome could be assigned into their homolog groups based purely on their domain architecture without searching through existing genomes. Further ortholog assignment could be simply achieved through the sequence comparison between the sequence(s) from the newly input genome and the sequence from anchor genome within each homolog groups. The two-step approach of DODO will largely reduce the computation complexity when an established database is updated.
A few limitations do exist with our method. Since DODO detects ortholog based on the domain architecture, the accuracy and sensitivity of domain identification directly affect the performance of DODO. DODO cannot detect orthologs having different reported domain architecture. Indeed, these phenomena can explain most ortholog groups reported by InParanoid but cannot be found with DODO as shown in the results. There are also sequences having domain(s) on only a small part of the sequence, which may lead to a wrong homolog group classification and end in no orthologous relationship identified. This limitation of protein domain information is inherent in the method thus cannot be avoided. However, this limitation will be improved as new domains are identified, less characterized domains, such as PfamB are used or domain detection method is improved in the future. As we can expect, removing the redundancy in domain database or considering the domain match length may improve the domain identification on proteins .
In summary, DODO could efficiently detect orthologs having the same domain architecture even when these orthologs have short sequences or low sequence similarity. Those same domain architecture orthologs are likely to perform the same biological function and could be beneficial in annotation of newly sequenced genome. An ortholog database built by DODO is easy to update. However, the performance of DODO is highly dependent on the domain detection step.
Several protein evolution events increase the difficulty of ortholog detection, such as gene loss, gene duplication and domain rearrangement . Gene loss events are known to hinder detection of ortholog in many RBH based methods. For DODO, if it occurs in genomes other than the anchor genome, this will not have significant influence on the prediction results. However, if gene loss occurs in the anchor genome, DODO could not detect ortholog relationships since there is no corresponding gene to start with in the anchor genome. This kind of missing ortholog group can be completely avoided by taking multiple genomes as the anchor genomes as shown in Figure 5. Even though there was a gene lost event in genome A, the ortholog group 3 could be identified while take other genome as the anchor genome. In the case of gene duplication, there are two different kinds of duplication. One is in-paralog, where duplication happened after the separation from the common ancestor and the other is out-paralog, where duplication happened before the speciation. For out-paralogs, DODO can detect them as separate different ortholog groups only if there was no gene loss or domain changing event. However, in the in-paralog DODO can lose one (or several) of the in-paralog(s), since DODO only keeps the RBH in the final report. That is, only the most similar in-paralog will be included in the ortholog group. Still the in-paralogs will be classified into the same domain architecture group. For the domain rearrangement events, there are tree-based methods RIO and Orthostrapper which already have been used to build ortholog relationships at the domain level [23, 24]. These two methods generate confidence values from ortholog bootstrap support. Orthostrapper is used to build the HOPS database, which is a orthologous protein domain database. RIO and HOPS built ortholog relationships at the domain level instead of the protein level and need taxonomic information in advance while DODO built ortholog relationship between proteins and does not require the taxonomy information. Indeed, our ortholog detection is heavily based on domain architecture; hence it is affected by evolutionary events such as domain rearrangement, domain deletion or domain insertion event. DODO cannot detect orthologous relationships if there are those domain changing events in the evolution histories of the proteins.
An efficient and sensitive ortholog detection method DODO is proposed. DODO could be useful in ortholog relationship construction or update of ortholog relationships especially when taking lots of organisms into consideration. In addition, most orthologous relationships detected with DODO are composed of the proteins having the same domain composition. Ortholog detection based on domain information may disclose the more biologically meaningful ortholog groups. This ortholog identification tool will be useful for those newly sequenced genome annotations using well studied genome as anchor. Indeed, DODO was able to detect most ortholog groups recorded in the known orthologous databases as well as discover new ortholog groups having relative short or dissimilar sequences but the same domain architecture. Given the high efficiency and sensitivity, DODO could be a useful method to analyze sequences produced from many genome projects.
Availability and requirements
Project name: DODO
Project home page: http://22.214.171.124:16080/dodo_web/home.htm.
Operating system: Linux, Mac OS X
Programming language: Python
Software requirements: installation of BLAST
This work is supported in part by grant from Academia Sinica and National Science Council. InParanoid code was kindly given by Stockholm Bioinformatics Centre.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 7, 2010: Ninth International Conference on Bioinformatics (InCoB2010): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S7.
- Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19(2):99–13. 10.2307/2412448View ArticlePubMedGoogle Scholar
- Fitch WM: Homology a personal view on some of the problems. Trends Genet 2000, 16(5):227–231. 10.1016/S0168-9525(00)02005-9View ArticlePubMedGoogle Scholar
- Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet 2005, 6(5):361–375. 10.1038/nrg1603View ArticlePubMedGoogle Scholar
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96(8):4285–4288. 10.1073/pnas.96.8.4285PubMed CentralView ArticlePubMedGoogle Scholar
- Kuzniar A, van Ham RC, Pongor S, Leunissen JA: The quest for orthologs: finding the corresponding gene across genomes. Trends Genet 2008, 24(11):539–551. 10.1016/j.tig.2008.08.009View ArticlePubMedGoogle Scholar
- Kriventseva EV, Rahman N, Espinosa O, Zdobnov EM: OrthoDB: the hierarchical catalog of eukaryotic orthologs. Nucleic Acids Res 2008, (36 Database):D271–275.Google Scholar
- Ostlund G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O, Sonnhammer EL: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res 2010, (38 Database):D196–203. 10.1093/nar/gkp931Google Scholar
- Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000, 28(1):33–36. 10.1093/nar/28.1.33PubMed CentralView ArticlePubMedGoogle Scholar
- Jothi R, Zotenko E, Tasneem A, Przytycka TM: COCO-CL: hierarchical clustering of homology relations based on evolutionary correlations. Bioinformatics 2006, 22(7):779–788. 10.1093/bioinformatics/btl009PubMed CentralView ArticlePubMedGoogle Scholar
- Storm CE, Sonnhammer EL: Comprehensive analysis of orthologous protein domains using the HOPS database. Genome Res 2003, 13(10):2353–2362. 10.1101/gr1305203PubMed CentralView ArticlePubMedGoogle Scholar
- van der Heijden RT, Snel B, van Noort V, Huynen MA: Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics 2007, 8: 83. 10.1186/1471-2105-8-83PubMed CentralView ArticlePubMedGoogle Scholar
- Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al.: Ensembl 2007. Nucleic Acids Res 2007, (35 Database):D610–617. 10.1093/nar/gkl996Google Scholar
- Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2009, 37(Database issue):D5–15. 10.1093/nar/gkn741PubMed CentralView ArticlePubMedGoogle Scholar
- Fu Z, Chen X, Vacic V, Nan P, Zhong Y, Jiang T: MSOAR: a high-throughput ortholog assignment system based on genome rearrangement. J Comput Biol 2007, 14(9):1160–1175. 10.1089/cmb.2007.0048View ArticlePubMedGoogle Scholar
- Fu Z, Jiang T: Clustering of main orthologs for multiple genomes. J Bioinform Comput Biol 2008, 6(3):573–584. 10.1142/S0219720008003540View ArticlePubMedGoogle Scholar
- Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, et al.: Pfam: clans, web tools and services. Nucleic Acids Res 2006, (34 Database):D247–251. 10.1093/nar/gkj149Google Scholar
- Vogel C, Bashton M, Kerrison ND, Chothia C, Teichmann SA: Structure, function and evolution of multidomain proteins. Current opinion in structural biology 2004, 14(2):208–216. 10.1016/j.sbi.2004.03.011View ArticlePubMedGoogle Scholar
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, et al.: The Pfam protein families database. Nucleic Acids Res 2008, (36 Database):D281–288.Google Scholar
- Bashton M, Chothia C: The geometry of domain combination in proteins. Journal of molecular biology 2002, 315(4):927–939. 10.1006/jmbi.2001.5288View ArticlePubMedGoogle Scholar
- Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of molecular biology 2001, 314(5):1041–1052. 10.1006/jmbi.2000.5197View ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25–29. 10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Levitt M: Nature of the protein universe. Proc Natl Acad Sci USA 2009, 106(27):11079–11084. 10.1073/pnas.0905029106PubMed CentralView ArticlePubMedGoogle Scholar
- Storm CE, Sonnhammer EL: Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics 2002, 18(1):92–99. 10.1093/bioinformatics/18.1.92View ArticlePubMedGoogle Scholar
- Zmasek CM, Eddy SR: RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 2002, 3: 14. 10.1186/1471-2105-3-14PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.