DODO: an efficient orthologous genes assignment tool based on domain architectures. Domain based ortholog detection
© Chen et al. 2010
Published: 15 October 2010
Orthologs are genes derived from the same ancestral gene locus after a speciation event. Orthologous proteins usually have similar sequences and perform comparable biological functions, so ortholog identification is useful in annotating newly sequenced genomes. With the rapidly increasing number of sequenced genomes, constructing or updating ortholog relationships among all genomes requires considerable effort and computation time. In addition, elucidating ortholog relationships between distantly related genomes is challenging because of their lower sequence similarity. An efficient ortholog detection method that can handle a large number of distantly related genomes is therefore desirable.
In this study we present DODO (DOmain based Detection of Orthologs), an efficient ortholog detection pipeline built on domain architectures. Because domain composition is usually directly related to protein function, DODO can facilitate ortholog detection across distantly related genomes. DODO works in two main steps. Starting from domain information, it first assigns proteins to groups according to their domain architectures and then identifies orthologs within those groups, where the search complexity is much reduced. DODO is shown here to detect orthologs between two genomes in considerably less time than the traditional reciprocal best hit approach, and the advantage becomes more significant when a large number of genomes are analyzed. The output of DODO is highly comparable with that of other known ortholog databases.
DODO provides a new, efficient pipeline for detecting orthologs in a large number of genomes. In addition, a database established with DODO is easier to maintain and can be updated with relatively little effort. The DODO pipeline can be downloaded from http://22.214.171.124:16080/dodo_web/home.htm
Orthologous gene identification is an important step in comparative genomics. The term orthologs originally refers to genes in different species that derive from the same locus in their last common ancestor. Because orthologs derive from the same ancestral gene, they often have similar amino acid sequences and are expected to perform the same or similar cellular functions [1, 2]. These properties make orthologs useful in functional genomics analysis. In addition to reconstructing phylogenies and revealing the evolutionary history of species, orthologs can be applied to genome annotation and protein-protein interaction prediction [3, 4]. Orthologs can be treated as corresponding genes in different species after those species diverged, and detecting this kind of relationship between species is consequently an important task.
A number of methods have been developed for ortholog detection. In practice, orthologs are often defined through reciprocal best hits (RBH) of primary protein sequences between two species, found by various algorithms. For instance, COG, InParanoid, and OrthoDB are built on such an RBH approach [6–8]. Besides RBH, tree-based methods such as those used to construct the LOFT, COCO-CL and HOPS databases have also been developed [9–11]; in these, trees are established via heuristic calculations of sequence similarity and orthologous relationships are inferred from the tree structures. Some databases, such as Ensembl Compara and HomoloGene, are constructed with both RBH and phylogenetic tree information [12, 13]. In addition, some methods, such as MSOAR and MultiMSOAR, identify orthologs by reconstructing genome rearrangement events in closely related genomes [14, 15].
With the advance of high-throughput sequencing technologies, a dramatic increase in the number of completed genomes is anticipated. This poses two challenges to ortholog identification. The first is the speed of analyzing a large number of proteins: an increasing number of genomes necessitates faster methods for data analysis and processing. The second is the ability to identify orthologs in distantly related species, where sequence similarity might be low. The complexity and computation time of RBH methods increase considerably with the number of genomes because mutual comparisons are needed between each pair of species. For example, identifying ortholog relationships among 100 genomes requires 4,950 pairwise genome comparisons, and 1,000 genomes would require 499,500. Thus, new methods that can identify orthologous relationships among a large number of genomes, some of which are distantly related, in a reasonable time would be beneficial. Here we propose an efficient, function-based ortholog detection method called DODO (DOmain based Detection of Orthologs) to overcome these hurdles in ortholog identification from a large number of genomes.
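The pair counts quoted above follow from the number of unordered genome pairs, n(n-1)/2. The short check below reproduces them; the function name is chosen here for illustration and is not part of DODO.

```python
from math import comb


def pairwise_genome_comparisons(n_genomes: int) -> int:
    """Number of unordered genome pairs, n * (n - 1) / 2."""
    return comb(n_genomes, 2)


for n in (100, 1000):
    print(n, pairwise_genome_comparisons(n))
```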
The DODO pipeline is designed for efficient discovery of orthologous relationships between an anchor genome of interest (or a well-studied genome) and one or more other genomes. DODO detects homolog groups with the aid of protein domain information. It first classifies proteins into groups based on both their domain composition and their domain architecture. Domains are the functional units of proteins, and proteins with the same domain architecture are likely to have the same cellular function, which implies homology in structure and function. While the primary sequence similarity of orthologs may decrease dramatically in distantly related species, domain composition is more likely to be conserved through evolution because of functional constraints [16, 17]. A domain architecture based method can therefore be applied to detect homologous relationships between distantly related species. After proteins with the same domain architecture are grouped together, DODO refines the orthologous relationships within each homolog group by identifying RBH among this smaller protein set. Searching for orthologs in smaller groups instead of the whole genome is what makes DODO an efficient pipeline.
In addition to being efficient, a database established by DODO can be updated easily, and in practice DODO results are comparable to those predicted by traditional RBH methods. Adding new species to the database does not require reprocessing the previously analyzed species already in the database, a step that traditional RBH methods cannot avoid. With traditional RBH methods, updating a database of n existing species with m newly added species costs n*m mutual comparisons between each pair of existing and newly added genomes. Updating a database constructed by DODO instead requires only m rounds of domain identification, one per newly added genome, no matter how many species the database already includes. This scheme makes an ortholog database easier to maintain and update efficiently.
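To make the update cost concrete, the sketch below counts the two quantities named in the text: n*m genome-pair comparisons for an RBH database versus m domain identification runs for a DODO database. It counts only what the paragraph counts; the function names and the example sizes are illustrative assumptions.

```python
def rbh_update_comparisons(n_existing: int, m_new: int) -> int:
    """Genome-pair comparisons between existing and new genomes (n * m)."""
    return n_existing * m_new


def dodo_update_domain_runs(m_new: int) -> int:
    """Domain identification runs: one per newly added genome."""
    return m_new


# Illustrative sizes only: 100 genomes already in the database, 5 added.
n_existing, m_new = 100, 5
print("RBH comparisons:", rbh_update_comparisons(n_existing, m_new))
print("DODO domain runs:", dodo_update_domain_runs(m_new))
```

The DODO count is independent of n_existing, which is the point of the comparison.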
The DODO pipeline, which can be freely downloaded and executed locally, is written in Python. Given protein sequences in FASTA format as input, the pipeline runs RPS-BLAST, clusters proteins with the same domains, and finally outputs a report of the ortholog groups automatically. DODO requires BLAST for domain identification and similarity search. Ortholog group assignment is done in two steps: proteins are first assigned to homolog groups based on their domain information and then further classified by RBH within each homolog group.
Domain assignment is performed with RPS-BLAST for each protein sequence, using Pfam v23 as the source database. Default parameters are used except for the expected value (E-value) cutoff, so that only hits with an E-value below 0.01 are kept. Domain hit information is then extracted from the RPS-BLAST result files. Proteins having the same domain composition and order are placed in one group. Proteins without Pfam domain information are all placed in an uncharacterized group for further analysis.
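The grouping step can be sketched as follows. This is a minimal illustration, not DODO's actual code: it assumes the RPS-BLAST hits have already been parsed into simple records (the `DomainHit` type and its fields are hypothetical), orders domains by start position, and ignores overlapping hits. An empty architecture stands for the uncharacterized group.

```python
from collections import defaultdict
from typing import NamedTuple


class DomainHit(NamedTuple):
    """Hypothetical record for one parsed domain hit."""
    protein_id: str
    domain_id: str
    start: int
    evalue: float


EVALUE_CUTOFF = 0.01  # threshold stated in the text


def group_by_architecture(protein_ids, hits, cutoff=EVALUE_CUTOFF):
    """Map each ordered domain tuple to the proteins that carry it."""
    per_protein = defaultdict(list)
    for hit in hits:
        if hit.evalue < cutoff:
            per_protein[hit.protein_id].append(hit)

    groups = defaultdict(list)
    for pid in protein_ids:
        ordered = sorted(per_protein.get(pid, []), key=lambda h: h.start)
        architecture = tuple(h.domain_id for h in ordered)
        groups[architecture].append(pid)  # () means no domain information
    return dict(groups)


# Made-up example data.
hits = [
    DomainHit("hs1", "PF00069", 10, 1e-30),
    DomainHit("hs1", "PF00017", 300, 1e-10),
    DomainHit("mm1", "PF00017", 290, 1e-9),
    DomainHit("mm1", "PF00069", 12, 1e-28),
    DomainHit("mm2", "PF00069", 5, 0.5),  # fails the E-value cutoff
]
groups = group_by_architecture(["hs1", "mm1", "mm2"], hits)
print(groups)
```

Using the ordered tuple as the key means that both composition and order must match for two proteins to share a group, as the text requires.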
For some proteins, domain information alone may not be sufficient to determine the orthologous relationship. Proteins in a group share the same domain architecture, but some of them may nevertheless be very different at the sequence level, so their ortholog relationships cannot be resolved from domains alone. This is especially evident in expanded paralogous gene families. Therefore, proteins within the same domain architecture group are further sub-classified with the RBH method. With one species chosen as the anchor, BLASTP is performed to identify RBH between the anchor species and all the other species. The resulting final groups are then reported as the ortholog groups.
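The RBH refinement inside one architecture group can be sketched as below. The sketch assumes that BLASTP scores between the anchor species and one other species are already available as a dictionary keyed by (query, subject); that data layout and the function names are assumptions made for illustration.

```python
def best_hit(query, candidates, score):
    """Return the highest-scoring candidate for query, or None if no hit."""
    scored = [(score.get((query, c), 0.0), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s > 0]
    return max(scored)[1] if scored else None


def reciprocal_best_hits(anchor_proteins, other_proteins, score):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = []
    for a in anchor_proteins:
        b = best_hit(a, other_proteins, score)
        if b is not None and best_hit(b, anchor_proteins, score) == a:
            pairs.append((a, b))
    return pairs


# Made-up scores for one architecture group with two paralogs per species.
score = {
    ("hsA", "mmA"): 900, ("hsA", "mmB"): 400,
    ("hsB", "mmA"): 450, ("hsB", "mmB"): 850,
    ("mmA", "hsA"): 900, ("mmA", "hsB"): 450,
    ("mmB", "hsA"): 400, ("mmB", "hsB"): 850,
}
rbh_pairs = reciprocal_best_hits(["hsA", "hsB"], ["mmA", "mmB"], score)
print(rbh_pairs)
```

Because the candidates come from a single architecture group rather than a whole proteome, each search covers far fewer sequences.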
The output of the DODO pipeline is a text file containing the ortholog information. Orthologs identified from both domain information and RBH have IDs starting with 'PfamArcNu', while orthologs identified purely from RBH have IDs starting with 'NoDomainInfo'. The domain architecture of each ortholog group can be found in the file PfamArcMap.txt under the project folder.
DODO first clusters proteins into groups based on their domain architectures and then finds orthologous relationships within each group. This strategy speeds up ortholog identification and facilitates the maintenance of an ortholog database. Here we investigate the efficiency of DODO and compare its performance against published databases.
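When post-processing a report, the two ID prefixes are enough to tell which evidence supports a group. The helper below is a small sketch; only the prefixes come from the text, while the function name and the numeric suffixes in the example IDs are assumptions.

```python
def evidence_for_group(group_id: str) -> str:
    """Classify an ortholog group ID by the prefix described in the text."""
    if group_id.startswith("PfamArcNu"):
        return "domain architecture + RBH"
    if group_id.startswith("NoDomainInfo"):
        return "RBH only"
    return "unknown"


# Hypothetical IDs; the real suffix format is not specified here.
for gid in ("PfamArcNu12", "NoDomainInfo7"):
    print(gid, "->", evidence_for_group(gid))
```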
A dataset of 21,673 human and 23,497 mouse protein sequences used in InParanoid was utilized to demonstrate the relatively short processing time of DODO. The comparison was run on a Linux server with 16 GB of RAM and four AMD Opteron CPUs. The total computation time of DODO was 21,263 seconds (5.91 hours), while the InParanoid pipeline took 135,585 seconds (37.66 hours). This result shows that, even with only two species, DODO can identify the orthologous relationships between them in about 15.7% of the time that conventional RBH takes. The difference in computation time will become larger as more species are analyzed: the computation time of the conventional RBH method grows roughly in proportion to the square of the number of species, whereas DODO compares each species to the same domain database only once, regardless of how many species are compared. DODO therefore has a significant advantage over conventional RBH in processing time, which is increasingly important as more genomes are sequenced and analyzed than ever before.
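The reported figures can be checked with a few lines of arithmetic; the variable names are illustrative.

```python
dodo_seconds = 21_263
inparanoid_seconds = 135_585

dodo_hours = dodo_seconds / 3600
inparanoid_hours = inparanoid_seconds / 3600
time_fraction = dodo_seconds / inparanoid_seconds

print(f"DODO: {dodo_hours:.2f} h, InParanoid: {inparanoid_hours:.2f} h")
print(f"DODO takes {time_fraction:.1%} of the InParanoid time")
```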
HomoloGene is a homolog sequence database constructed from both sequence and phylogeny information. It records the homolog relationships among 20 completely sequenced eukaryotic genomes. From RefSeq we extracted 300,701 protein sequences used in HomoloGene release 64; these are a subset of the 330,610 protein sequences originally used to construct that release. Using human as the anchor species, DODO identified 18,202 ortholog groups, which cover 92.7% of the homolog groups containing human proteins in the HomoloGene dataset. We then investigated whether the ortholog groups identified with DODO were subsets of the groups reported in HomoloGene. Since HomoloGene is a database of homologs, each HomoloGene group is likely to be a superset of orthologs. We found that 46.7% of the ortholog groups identified with DODO have exactly the same classification as in HomoloGene, and 89.5% of them have more than half of their proteins present in the corresponding groups in HomoloGene 64.
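A comparison of this kind can be sketched as below. It assumes each group is keyed by its human (anchor) protein and stored as a set of protein IDs, and it reads "more than half" as the fraction of a DODO group's proteins that also appear in the matching reference group. Both the data layout and that reading are assumptions, and the IDs are made up.

```python
def compare_to_reference(dodo_groups, reference_groups):
    """Count exact and majority-overlap matches for shared anchor proteins."""
    shared = dodo_groups.keys() & reference_groups.keys()
    exact = majority = 0
    for anchor in shared:
        ours, theirs = dodo_groups[anchor], reference_groups[anchor]
        if ours == theirs:
            exact += 1
        if ours and len(ours & theirs) / len(ours) > 0.5:
            majority += 1
    return {"compared": len(shared), "exact": exact, "majority": majority}


dodo_groups = {
    "hs1": {"hs1", "mm1", "rn1"},
    "hs2": {"hs2", "mm2"},
    "hs3": {"hs3", "dm9", "ce9", "sc9"},
}
reference_groups = {
    "hs1": {"hs1", "mm1", "rn1"},
    "hs2": {"hs2", "mm2", "rn2"},
    "hs3": {"hs3", "mm3"},
}
summary = compare_to_reference(dodo_groups, reference_groups)
print(summary)
```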
Table. Examples of DODO-identified ortholog groups that were not identified in InParanoid (columns: Ensembl human gene ID; number of species; average amino acid length).
DODO detects orthologs based on domain composition instead of primary protein sequence, which brings several biological advantages. As shown in the results above, DODO was able to detect most of the orthologs in several published databases. In addition, it can detect orthologs with short sequences and lower sequence similarity when the domain architecture is evident. This strategy finds orthologs more directly on the basis of functional constraints. As a result, ortholog groups detected with DODO are expected to have similar, if not identical, biological functions across organisms. Orthologs detected in this way will be helpful in annotating newly sequenced genomes when gene function is of interest. Protein domain composition is responsible for protein function: protein sequences are susceptible to mutation while protein function is under greater constraint, so domain composition is more likely than primary sequence to be conserved in distantly related genomes.
In addition to the relatively high efficiency of DODO, an ortholog database built with DODO is less costly to maintain than one built with other methods. When a new genome is added to the database, its sequences can be assigned to homolog groups purely on the basis of their domain architecture, without searching through the existing genomes. Ortholog assignment can then be completed simply by comparing, within each homolog group, the sequence(s) from the newly added genome with the sequence(s) from the anchor genome. This two-step approach greatly reduces the computational complexity of updating an established database.
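The update procedure described above can be sketched as a lookup followed by a small set of anchor comparisons. The index structure, function name, and IDs below are hypothetical and only illustrate the idea that new proteins are matched to groups by architecture and then compared with anchor sequences alone.

```python
def plan_update_comparisons(anchor_index, new_proteins):
    """For each new protein, list the anchor proteins it must be compared to.

    anchor_index: architecture tuple -> anchor protein IDs in that group
    new_proteins: new protein ID -> architecture tuple
    """
    plan = {}
    for protein_id, architecture in new_proteins.items():
        plan[protein_id] = list(anchor_index.get(architecture, []))
    return plan


anchor_index = {
    ("PF00069",): ["hsK1", "hsK2"],
    ("PF00017", "PF00069"): ["hsS1"],
}
new_proteins = {
    "dr1": ("PF00069",),
    "dr2": ("PF00017", "PF00069"),
    "dr3": ("PF07714",),  # architecture absent from the anchor genome
}
plan = plan_update_comparisons(anchor_index, new_proteins)
total_comparisons = sum(len(targets) for targets in plan.values())
print(plan, total_comparisons)
```

No sequence from the other genomes already in the database appears in the plan, which is why the update cost does not grow with the database size.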
Our method does have a few limitations. Since DODO detects orthologs based on domain architecture, the accuracy and sensitivity of domain identification directly affect its performance. DODO cannot detect orthologs whose reported domain architectures differ. Indeed, this explains most of the ortholog groups that are reported by InParanoid but not found by DODO, as shown in the results. There are also sequences in which domains cover only a small part of the sequence, which may lead to a wrong homolog group classification and ultimately to no orthologous relationship being identified. This dependence on protein domain information is inherent in the method and cannot be avoided entirely. However, it can be reduced as new domains are identified, as less characterized domains such as Pfam-B are used, or as domain detection methods improve. Removing redundancy in the domain database or taking the domain match length into account may also improve domain identification on proteins.
In summary, DODO can efficiently detect orthologs having the same domain architecture even when these orthologs have short sequences or low sequence similarity. Orthologs with the same domain architecture are likely to perform the same biological function and can be beneficial in annotating newly sequenced genomes. An ortholog database built by DODO is easy to update. However, the performance of DODO is highly dependent on the domain detection step.
Several protein evolution events increase the difficulty of ortholog detection, such as gene loss, gene duplication and domain rearrangement. Gene loss events are known to hinder ortholog detection in many RBH based methods. For DODO, gene loss in genomes other than the anchor genome has no significant influence on the prediction results. However, if gene loss occurs in the anchor genome, DODO cannot detect the ortholog relationship because there is no corresponding gene to start from in the anchor genome. This kind of missing ortholog group can be avoided by using multiple genomes as anchors, as shown in Figure 5: even though there was a gene loss event in genome A, ortholog group 3 can be identified when another genome is taken as the anchor. Gene duplication comes in two kinds. In-paralogs arise when the duplication happened after the separation from the common ancestor, and out-paralogs arise when the duplication happened before the speciation. DODO can detect out-paralogs as separate ortholog groups only if there was no gene loss or domain changing event. For in-paralogs, however, DODO can lose one or several of them, since it keeps only the RBH in the final report; that is, only the most similar in-paralog is included in the ortholog group. The in-paralogs are still classified into the same domain architecture group. For domain rearrangement events, the tree-based methods RIO and Orthostrapper have already been used to build ortholog relationships at the domain level [23, 24]. These two methods generate confidence values from ortholog bootstrap support. Orthostrapper is used to build the HOPS database, which is an orthologous protein domain database.
RIO and HOPS build ortholog relationships at the domain level instead of the protein level and need taxonomic information in advance, while DODO builds ortholog relationships between proteins and does not require taxonomic information. Because our ortholog detection relies heavily on domain architecture, it is affected by evolutionary events such as domain rearrangement, domain deletion or domain insertion. DODO cannot detect orthologous relationships if such domain changing events occurred in the evolutionary histories of the proteins.
We have proposed DODO, an efficient and sensitive ortholog detection method. DODO can be useful for constructing or updating ortholog relationships, especially when many organisms are taken into consideration. In addition, most orthologous relationships detected with DODO are composed of proteins having the same domain composition, so ortholog detection based on domain information may reveal more biologically meaningful ortholog groups. This tool will be useful for annotating newly sequenced genomes using a well-studied genome as the anchor. Indeed, DODO was able to detect most ortholog groups recorded in known ortholog databases as well as to discover new ortholog groups whose members have relatively short or dissimilar sequences but the same domain architecture. Given its high efficiency and sensitivity, DODO could be a useful method for analyzing sequences produced by many genome projects.
Project name: DODO
Project home page: http://126.96.36.199:16080/dodo_web/home.htm.
Operating system: Linux, Mac OS X
Programming language: Python
Software requirements: installation of BLAST
This work is supported in part by funding from Academia Sinica and the National Science Council. The InParanoid code was kindly provided by the Stockholm Bioinformatics Centre.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 7, 2010: Ninth International Conference on Bioinformatics (InCoB2010): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S7.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.