Algorithm of OMA for large-scale orthology inference
© Roth et al. 2008
Received: 22 August 2008
Accepted: 04 December 2008
Published: 04 December 2008
Skip to main content
© Roth et al. 2008
Received: 22 August 2008
Accepted: 04 December 2008
Published: 04 December 2008
OMA is a project that aims to identify orthologs within publicly available, complete genomes. With 657 genomes analyzed to date, OMA is one of the largest projects of its kind.
The algorithm of OMA improves upon standard bidirectional best-hit approach in several respects: it uses evolutionary distances instead of scores, considers distance inference uncertainty, includes many-to-many orthologous relations, and accounts for differential gene losses. Herein, we describe in detail the algorithm for inference of orthology and provide the rationale for parameter selection through multiple tests.
OMA contains several novel improvement ideas for orthology inference and provides a unique dataset of large-scale orthology assignments.
The classification of genes according to evolutionary relations is essential for many aspects of comparative and functional genomics. Evolutionary relations are often described as pairwise relations. Two genes that share a common ancestor are defined as homologs, while genes that are similar in sequence without a common origin are termed analogs. Homologs can be divided into several classes : orthologs, which originate from a speciation event; paralogs, which originate from gene duplication; and xenologs, which originate from horizontal gene transfer. Orthologs are valuable in numerous analyses, including reconstruction of species phylogenies, protein function inference, database annotation, and genomic context analysis.
Evolutionary relations can also be defined with respect to a third gene. Paralogs are classified as out-paralogs or in-paralogs . In-paralogs are genes that diverged by a duplication that occurred after a speciation event of reference. The term co-orthologs is used occasionally to describe the same scenario from the perspective of a third gene that is orthologous to both genes. In contrast, out-paralogs are paralogs that diverged before a particular speciation event of reference.
To address the need for reliable sources of orthologs, several initiatives have been created for better orthologs prediction Commonly, there are two classes of prediction methods: phylogeny based methods, which compare gene trees with species trees (e.g. NOTUNG , Orthostrapper , RIO , Softparsmap , LOFT , Ensembl ) and pairwise based methods, which perform homology search with (optional) subsequent clustering (e.g. BBH , COG , InParanoid , KOG , OrthoMCL , RSD , MultiParanoid , Roundup , Homologene , eggNOG ).
In 2005, we introduced the OMA orthology prediction project with the goal to classify all orthologs in completely sequenced genomes . OMA is a pairwise based method with a number of distinctive features: alignments are performed using an efficient implementation of full Smith-Waterman dynamic programming  (as opposed to methods with lower sensitivity such as BLAST), confidence intervals explicitly consider estimation uncertainty, and exclusion of paralogs is achieved using sequences in third-party genomes as "witnesses of non-orthology" .
Since then, we have substantially improved the OMA algorithm. Orthology is now inferred on the basis of evolutionary distances rather than alignment scores, the predicted orthologs are no longer limited to one-to-one orthologs. We build groups of orthologs using a maximum-weight clique algorithm. A web interface now enables interactive exploration of the predictions . In addition, the number of complete genomes under analysis has increased to over 657, which requires efficient solutions regarding computation speed and memory consumption.
In this paper, we describe the current OMA algorithm in detail, motivate our parameter selection, and offer a discussion about the method and results.
Step 1: To find homology, we compute pairwise alignments between all pairs of sequences for all genes in all genomes. Pairs with significant alignment scores are retained as candidate pairs.
Step 2: Orthologs are usually the closest genes in two genomes, because they started diverging at speciation, whereas paralogs started diverging at a duplication prior to speciation. Genes across genomes that are mutually the most closely related sequences, taking into account inference uncertainty, are upgraded to stable pairs.
Step 3: In cases where an ortholog is missing, we seek to avoid erroneous classification of paralogs as orthologs (pseudo-orthologs) by verifying stable pairs with sequences in a third genome that can act as witness of evolution. Pairs that pass the verification step are upgraded to verified pairs, and pairs that do not pass are paralogs and referred to as broken pairs.
Step 4: For some applications, such as species tree reconstruction, it is advantageous to cluster orthologs into orthologous groups. Pairs of sequences in such groups are termed group pairs.
In the following, we describe each of the four steps, and motivate all parameter choices.
The goal of the first step of the process is homology detection. All pairs of protein sequences from complete genomes are aligned using full dynamic programming. There are several advantages of using protein sequences rather than using DNA sequences. Very distant homologies are difficult to find at the DNA level, and protein sequences suffer less from convergence due to mutational biases. Also, the length of a protein is one third of that of the corresponding DNA sequence, a considerable advantage given that the time complexity of aligning sequences is quadratic with respect to length. The disadvantage to using protein sequences instead of DNA is that complications arising from multiple gene products must be handled explicitly by selecting the longest splice variant as well as isoforms with at least 10% non-redundant positions. The sequences used by OMA are from public databases (mainly Genbank  for Prokaryotes and Ensembl  for Eukaryotes) and all data are checked for consistency and quality.
Homology is established in two sub-steps. First, alignments between all sequences are performed using full, local dynamic programming with a fixed PAM matrix to find all homologous sequences [20, 24]. Second, significant alignments (score > 85) are refined by searching among all PAM distances the scoring matrix that maximizes the alignment score. Since scores are the log of odd ratios, the PAM number of this matrix corresponds to the maximum likelihood of evolutionary distance. Empirically, we observed that with a mixture of homologous and non-homologous pairs of sequences as input, the PAM-224 matrix yields alignment scores that are on average closest to the ones obtained in the refinement part. Thus, this is the fixed matrix that we use in the first part. Refined alignments with scores above 181 (which roughly corresponds to an E-value of 10-14) are considered significant. With scores below this value, the proportion of candidate pairs that end up being predicted as orthologs decreases rapidly (data not shown).
The all-against-all step is computationally expensive, and the run time increases quadratically with the total number of amino acids in a protein sequence. The use of a heuristic-based algorithm such as BLAST could potentially increase the speed of the homology search, but modern implementations of Smith-Waterman using SIMD instructions are almost as fast as BLAST . Moreover, most of the time is consumed by estimating evolutionary distances.
Since we consider entire proteins as the basic evolutionary unit, why then not use global alignments? Protein ends are often variable, and thus, it is reasonable to ignore them by using local alignments. To guarantee that a significant fraction of a sequence is aligned, we use a length tolerance criterion. The length of the shorter aligned sequence must be at least the fraction ℓ of the longest sequence. That is min (|a 1|, |a 2|) > ℓ·max(|s 1|, |s 2|)
where a 1 and a 2 are the lengths of the aligned subsequences of s 1 and s 2. Alignments that pass both the length and score criteria are upgraded to candidate pairs (CP).
The parameter ℓ is determined by two tests. The first test, the triangle inequality test, is performed over all candidate pairs. Under a time-reversible Markovian model, the evolutionary distances between homologous sequences should obey the triangle inequality condition which requires that in a triplet of sequences, any distance between two sequences be less than or equal to the sum of the other two distances. Because these distances are estimates, this property is expected only to hold within a confidence interval.d xz ≤ d xy + d xz
Many proteins consists of several domains originating from gene fusions, deletions, and internal repetitions. The majority of multi-domain proteins have evolved by the stepwise insertions of single domains . In the second test, candidate pairs are verified by testing the assumption that the number of domains for orthologous sequences are in agreement, including identical domains (e.g. repetitions). Domain information is obtained from the Pfam database and consists of conserved protein regions and domains . The amount of proteins with the same number of domains increase with stricter length tolerance, but a "plateau" is observed for 0.6 < ℓ < 0.9 (Figure 2).
Figure 2 shows the results of the two validation tests and also that the number of orthologous relations (i.e. VP) decreases with increasing length criteria. A trade-off exists between sensitivity and selectivity. A value of ℓ = 0.61 is a good compromise between minimizing triangle inequality violations and numbers of different domains while still including enough ambiguous alignments.
In the second step of the algorithm, potential orthologs are detected by the identification of sequences in two genomes that are more closely related to each other than to any other sequence in the other sequence. We term these sequences stable pairs (SP). This name was chosen due to its close association with the stable marriage problem in computer science.
To measure the relatedness of sequences, either similarity scores or evolutionary distances can be used. Most methods employ the similarity score ("best hit"), because it is directly obtained by the alignment process and the highest scoring sequence is usually the most closely related sequence. However, scores do not constitute a direct measure of relatedness. In particular, they depend on protein lengths. Evolutionary distances such as PAM units, though more expensive to compute, constitute a sounder measure of relatedness, because distances are additive in their expected value (i.e. they are expected to equal the sums of branch lengths between the species) and have well characterized statistical properties.
A tolerance interval is used to allow the inclusion of more than one potential ortholog, as this becomes necessary when a gene duplication event occurred after speciation. The tolerance threshold can be defined by including similarity scores in an interval below the top score, or by using variance of distance estimates to compute a confidence interval.
As mentioned above, the use of confidence intervals is necessary to account for many-to-many orthologous relations, which arise when duplications occurred after speciation. Additionally, distance estimation is subject to inference uncertainty and, thus, true orthologs may not have the shortest estimated distance.
where d is a pairwise maximum likelihood distance estimate and k, the tolerance parameter of the standard deviation between the two distances, where . An estimate of the variance is obtained by the distance estimation, while efficient estimation of covariance for this case was previously shown .
The tolerance parameter k controls the balance between sensitivity (more true orthologs as stable pairs) and selectivity (few out-paralogs as stable pairs). The optimal value of k for our purpose is determined using the out-paralog test.
Although the construction of stable pairs is likely to identify the corresponding ortholog of each sequence, at least one special case exists in which systematic failure will occur: differential gene loss. This problem affects all pairwise approaches, and is shown in Figure 6A. An ancient duplication event is followed by two speciation events resulting in three species X, Y, and Z. In two of these species, each of the duplicates is lost (e.g. x2 and y1), and as a result, when comparing species X and Y, x1 and y2 are the highest scoring match. In such a case, (x 1, y 2), although paralogs, form a stable pair.
Each stable pair is verified by comparison to all other genomes. Stable pairs for which no witness of non-orthology could be found are termed verified pairs (VP) and are likely to be orthologs. Furthermore, stable pairs that are not verified were defined as broken pairs (BP) and are likely to correspond to paralogs.
Such cases of differential gene losses are not uncommon in nature. Among yeasts for instance, approximately 5% of stable pairs are detected as non-orthologous using the procedure described above.
Note that although both the verification of the SP step and the out-paralogy test detect non-orthologs, the test requires knowledge of the species tree. To keep the orthology prediction independent from such (often uncertain) knowledge, we only used the out-paralogy test for parameter fitting, and only in cases where the species topology is undisputed.
The final step of the algorithm creates groups of orthologs. Such grouping is non-trivial, because orthology is defined over pairs of sequences and is not necessarily a transitive relation. For instance, a sequence in one genome may form several verified pairs with sequences in another genomes, corresponding to several orthologous relations (co-orthologs). These in turn cannot be orthologous to each other. In OMA, we address this problem by making available both pairwise orthologous relations (the verified pairs) and groups of genes in which all pairs are orthologs. Though the OMA groups leave out orthologous relations, they are useful for some applications, such as species tree inference.
To validate our methods and to compare different algorithms for clique construction, a species tree is built from the orthologous groups produced by each algorithm, and the fit of the data to the tree is measured using the dimensionless index . This technique assumes that if the groups inferred by the clique algorithms correctly predict orthology, the data will have a better fit to the species tree.
For verification, 100 trials using various genomes and different taxa are computed using four different clique versions. Maximum size clique chose the largest clique in the graph starting with the highest scoring edge (but does not use any other edge information). Maximum size score clique is an extension that uses the sum of the edge weights and selects the higher scoring clique from several maximum cliques of same size. The above described algorithm, maximum edge weight clique, is used twice, first with the scores and then with the distances complement as edge weights.
Results of clustering
Maximum size score
Maximum edge weight (distance)
Maximum edge weight (score)
Sequence pairs and their corresponding evolutionary relationships
All pairs (AP)
Candidate pairs (CP)
Stable pairs (SP)
Broken pairs (BP)
Verified pairs (VP)
Group pairs (GP)
Verified pairs represent a useful resource that describe many-to-many orthology while pseudo-orthologs from differential gene loss have been removed. In other words, the most similar sequences may not be orthologous, and for this reason, all stable pairs are verified using a third genome as a witness of non-orthology. A critical assumption in the verification of stable pairs is that in at least some genomes both copies of a duplication event are present. It is possible that no duplicates remain and that paralogy cannot be detected by sequence similarity. However, the increasing number of completed genomes also increases the chance of observing duplicates in a genome. Both paralogs are often present in multiple genomes. For example, when predicting orthologs for the subset of Firmicutes, (the subset is used for computational reasons) 75% of broken pairs had more than one witness of non-orthology.
Lateral gene transfer (LGT) events of homologous sequences (xenologs) are difficult to distinguish from duplication events. Two genes may appear to be duplicates when in fact, they are not. This issue affects most orthology prediction methods. In the case of OMA, the verification step helps reducing the adverse impact of LGT; furthermore, we are investigating reliable ways of excluding the most obvious cases of LGT.
Two genes that in one organism may be truly orthologous to one fused genes in another organism. OMA considers the entire protein, rather than domains, to be the basic evolutionary unit. Users interested in gene fusion-fission events, or in domain evolution, may view this as a limitation. We have chosen to exclude such scenarios, due to the difficulty to separate these events from the cases where the domains in proteins arose form different evolutionary unrelated domains. Although part of the sequence may have diverged through speciation, another part is clearly non-homologous. If orthology is defined at the domain level, a gene could be orthologous to two or more sequences that are completely unrelated. In terms of function, which is often inferred from orthology, genes with different domains are unlikely to be similar. Finally, restricting potential orthology to genes with a majority of homologous positions presents the advantage of not only avoiding these problems, but also reducing computational complexity.
In OMA, orthologous groups consist of close orthologs, which are useful to build species trees. The results of grouping close orthologs is represented by an ortholog matrix. In this matrix, rows correspond to groups of orthologous genes, and columns correspond to genomes. A non-empty element in Mi,jindicates that a genome j has a member in an orthologous group i. The members of a group possess at most one close ortholog in each genome.
In cases where duplication events occurred after a speciation event, several orthologous relationships exist and are often referred to as co-orthologs or in-paralogs . We group orthologs such that the most similar protein sequence belong together using maximum edge weight cliques. It should be noted that the most similar sequences do not necessarily have the most similar function .
Using cliques to construct groups is a strict requirement, because if an edge is missing (due to weak similarity or misclassification), the group gets split. For applications where this is problematic, users can devise their own grouping strategy from the orthologous pairs, which we also make available.
The all-against-all step is computationally expensive, and the time complexity is O((n 1 + n 2 + ... + n k )2) where n k represents the number of sequences in the k th genome. As of November 2008, we have computed nearly 6 trillion sequence alignments. A total of approximately 12 Hexaflop or roughly 500 years of CPU time. Of these alignments, 3.2 billion were considered significant (i.e. score > 130). This dataset constitutes a valuable resource for comparative studies and is available upon request. The subsequent steps of the algorithm are comparatively fast.
The performances of OMA are compared to other projects in a separate article . The study includes COG, KOG, EggNOG, InParanoid, OrthoMCL, Ensembl, Homologene, and RoundUp. The study tests ortholog predictions on the basis of phylogeny (through reconstruction of orthologous gene trees and through comparison with phylogenetic analyses from the literature) and on the basis of function conservation (in terms of GO annotation, EC number classification, expression level, and gene neighborhood conservation). The results of OMA are among the best in the phylogenetic tests. In functional tests, it also performs well where high functional specificity is required, at the expense of a lower recall than projects such as OrthoMCL or EggNOG.
In terms of size and with 657 genomes analyzed, OMA is by a wide margin the largest orthologs inference effort (the second largest, EggNOG, includes 373 genomes). Our website is regularly updated as new species get included.
Orthology is interesting for a wide range of bioinformatics analyses, including functional annotation, phylogenetic inference, or genome evolution. This paper describes and motivates the algorithm of OMA for predicting orthologous relationships among complete genomes. The algorithm takes a pairwise approach, thus neither requiring tree reconstruction nor reconciliation, and offers the following improvements over the standard bidirectional best hit approach: i) the use of evolutionary distance instead of score, ii) a tolerance that allows the inclusion of one-to-many and many-to-many orthologs, iii) consideration of uncertainty in distance estimations, iv) detection of potential differential gene losses. The algorithm is characterized by four parameters that are optimized using independent tests. The current status of the project and the project results, including phylogenetic trees derived from the data, are available online .
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.