- Methodology article
- Open Access
RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs
© Zmasek and Eddy; licensee BioMed Central Ltd. 2002
- Received: 14 March 2002
- Accepted: 16 May 2002
- Published: 16 May 2002
When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (that diverged by gene duplication). The utility of phylogenetic information in high-throughput genome annotation ("phylogenomics") is widely recognized, but existing approaches are either manual or not explicitly based on phylogenetic trees.
Here we present RIO (Resampled Inference of Orthologs), a procedure for automated phylogenomics using explicit phylogenetic inference. RIO analyses are performed over bootstrap resampled phylogenetic trees to estimate the reliability of orthology assignments. We also introduce supplementary concepts that are helpful for functional inference. RIO has been implemented as Perl pipeline connecting several C and Java programs. It is available at http://www.genetics.wustl.edu/eddy/forester/. A web server is at http://www.rio.wustl.edu/. RIO was tested on the Arabidopsis thaliana and Caenorhabditis elegans proteomes.
The RIO procedure is particularly useful for the automated detection of first representatives of novel protein subfamilies. We also describe how some orthologies can be misleading for functional inference.
- Gene Tree
- Query Sequence
- Pairwise Distance
- Pfam Family
- HMMER Package
Accurate computational protein function analysis is an important way of extracting value from primary sequence data. Due to the large amount of data, automated systems seem unavoidable (at least for initial, prioritizing steps). Such efforts are complicated, for a variety of reasons. Many proteins belong to large families, as suggested by Dayhoff . Such families are often composed of subfamilies related to each other by gene duplication events. For example, Ingram  showed that human α, β, and γ chains of hemoglobins are related to each other by gene duplications. Gene duplication allows one copy to assume a new biological role through mutation, while the other copy preserves the original functionality [3, 4]. Hence, subfamilies often differ in their biological functionality yet still exhibit a high degree of sequence similarity.
Other complications in functional analysis include: ignoring the multi-domain organization of proteins; error propagation caused by transfer of information from previously erroneously annotated sequences; insufficient masking of low complexity regions; and alternative splicing .
Typically, automated sequence function analysis relies on pairwise sequence similarity and programs such as BLAST  or FASTA . Annotating a sequence by transferring annotation from its most similar sequence(s) tends to produce overly specific annotation. In contrast, analyses using profile search algorithms such as HMMER http://hmmer.wustl.edu/ and Pfam  classify sequences too generally. They recognize that a query sequence belongs to a certain family (or, to be more precise, indicate which domain(s) the query is likely to contain), but do not subclassify the sequence.
It is infeasible to completely automate functional analysis, because it is impossible to precisely define what protein "function" means. However, a principle of phylogenomics is that orthologous sequences (that diverged by speciation) are more likely to conserve protein function than paralogous sequences (that diverged by gene duplication). Orthology and paralogy are well defined and can be inferred from gene and species trees. One useful and automatable phylogenomics approach would be as follows: if a novel sequence has orthologs, annotation can be transferred from them (as in best BLAST analysis); if there are no orthologs, the sequence is classified as just a family member (as in Pfam/InterPro analysis) and flagged as possibly the first representative of a novel subfamily. At the core of such approaches stands therefore the distinction between orthologs and paralogs, and hence the ability to discriminate between duplication and speciation events on a gene tree. Various efficient algorithms to infer gene duplications on a gene tree by comparing it to a species tree have been described (for example: by Eulenstein , and by Zhang ). We developed a simple algorithm (named SDI for Speciation Duplication Inference) that appears to solve this problem even more efficiently on realistic data sets, though it has an asymptotic worst-case running time that is less favorable .
In practice, phylogenetic trees are unreliable. Errors in trees will produce spurious inferred duplications. This is obviously problematic if duplications are to be used as indicators of potential functional changes. Therefore, instead of determining the orthologs of a query sequence on just one gene tree, inference could be performed over bootstrap resampled gene trees [14, 15] to estimate of the reliability of the assignments. Here we describe and test a procedure – RIO (for Resampled Inference of Orthologs) – which allows to perform such analyses in an automated fashion. We present results of using RIO to analyze a plant (A. thaliana) and an animal (the nematode C. elegans) proteome.
Orthologs are defined as two genes that diverged by a speciation event. Paralogs are defined as two genes that diverged by a duplication event . Other concepts derived from gene trees can be useful for functional prediction. We introduce and justify three such concepts ("super-orthologs", "ultra-paralogs", and "subtree-neighbors"):
Definition 1. Given a rooted gene tree with duplication or speciation assigned to each of its internal nodes, two sequences are super-orthologous if and only if each internal node on their connecting path represents a speciation event.
Hence, the query sequences in Figure 3 have no super-orthologs. In contrast, the rat, mouse, and wheat sequences in Figure 1A are super-orthologs pf the human query sequence. By definition, the super-orthologs of a given sequence are a subset of its orthologs.
Certain sequences underwent multiple recent duplications, resulting in large species specific sequence families, such as the C. elegans seven-transmembrane proteins acting as odorant and chemosensory receptors [19, 20]. For query sequences belonging to such sequence families, orthologs (if present) are less effective for predicting specific information. In these cases, paralogs of the same (sub) family might be more informative for functional prediction (as long as the duplications indeed happened "late" in evolutionary times). To formalize this, we introduce the concept of "ultra-paralogs":
Definition 2. Given a rooted gene tree with duplication or speciation assigned to each of its internal nodes, two sequences are ultra-paralogous if and only if the smallest subtree containing them both contains only internal nodes representing duplications.
Definition 3. Given a completely binary and rooted gene tree, the k-subtree-neighbors of a sequence q are defined as all sequences derived from the k-level parent node of q, except q itself (the level of q itself is 0, q's parent is 1, and so forth).
The RIO procedure
We use the Pfam protein family database  as a source of high quality curated multiple sequence alignments and profile HMMs (Hidden Markov Models, see  for a review), as well as programs from the HMMER package http://hmmer.wustl.edu/. RIO can easily be adapted to work with different sources of alignments and different alignment programs. For tree reconstruction, the neighbor joining (NJ) algorithm  is used, since it is reasonably fast, can handle alignments of large numbers of sequences, and does not assume a molecular clock. NJ recreates the correct additive tree as long as the input distances are additive , and is effective even if additivity is only approximated .
Input: A query protein sequence Q with unknown function.
A curated multiple alignment A from the Pfam database for the protein family that Q belongs to (as determined by hmmpfam from the HMMER package).
A profile HMM H for the protein family that Q belongs to.
Optional: A gene tree based on the multiple alignment A and the query Q annotated with orthology bootstrap confidence values for the query Q.
1. Query sequence Q is aligned to the existing alignment A (using hmmalign from the HMMER package and the Pfam profile HMM H).
2. The alignment is bootstrap resampled x times (usually, x = 100).
4. An unrooted phylogenetic tree is inferred for each of the x multiple alignments by neighbor joining , resulting in x gene trees. Each tree is rooted by a modified version of our SDI algorithm  that minimized the number of duplications postulated (this is discussed in more detail later).
5. For each of the x rooted gene trees: For each node it is inferred whether it represents a duplication or a speciation event by comparing the gene tree to a trusted species tree.
6. For each sequence s in the gene tree (except Q): Count the number of gene trees where s is orthologous to Q (see Figure 6 for an illustration of steps 5. and 6.). Bootstrap confidence values for super-orthologies, ultra-paralogies and subtree-neighbors are calculated analogously.
Precalculation of pairwise distances for increased time efficiency
The most time consuming step in the procedure described above is the calculation of pairwise distances. [The time complexity is O(xLN2), N being the number of sequences, L being their length, and x being the number of bootstrap resamples. On an average Intel processor the wall clock time for 100 bootstrapped datasets of a typical Pfam multiple alignment is in the range of hours.]
Since the query sequence is aligned to stable Pfam alignments, it is possible to precalculate the pairwise distances for each alignment and store the results. Then, when RIO is being used to analyze a query sequence, only the distances of the query to each sequence in the Pfam alignment have to be calculated. This step becomes thus O(xLN) instead of O(xLN2).
To do this correctly, the aligned query sequence has to be bootstrap resampled in exactly the same way as was used for precalculating the pairwise distances of the Pfam alignment. For this purpose, bootstrap positions (e.g. which aligned columns from the Pfam alignment were chosen in a particular bootstrap sample) are saved to a file. With this file it is possible to bootstrap the new alignment of N+1 sequences (Pfam alignment plus query sequence) in precisely the same manner, so the NxN precalculated distances are valid for the (N+1)x(N+1) distance matrix. The alignment method must also guarantee that the original Pfam multiple alignment remains unchanged when the query sequence is aligned to it. This requires specially prepared Pfam full alignments and profile HMMs that are created with the HMMER software as follows:
Input: Original Pfam full alignment A.
Output: "aln" file containing RIO-ready full alignment
"hmm" file containing a RIO-ready profile HMM
"nbd" file containing pairwise distances
"bsp" file bootstrap positions file
"pwd" file containing pairwise distances for bootstrap resampled alignment
1. Remove sequences from species not in RIO's master species tree from alignment A. If A does not contain enough sequences (<6), abort.
2. Run hmmbuild -o A' on A, using the same options as were used to build the original Pfam HMM for A, resulting in alignment A'. (HMMER's construction procedure slightly modifies the input alignment in ways that are usually unimportant, but which matter to bootstrapping in RIO.) Keep A' as the "aln" file.
3. Run hmmbuild with "--hand" option on A', resulting in HMM H' (using the same options as were used to build the original HMM for A). Calibrate H' with hmmcalibrate and keep as "hmm" file.
4. Remove non-consensus (insert) columns from A' (these are annotated by HMMER), resulting in alignment A".
5. Calculate pairwise distances for A", resulting in the "nbd" file (non-bootstrapped distances).
6. Bootstrap resample the columns of A", resulting in the "bsp" file (bootstrap positions file).
7. Calculate pairwise distances for bootstrapped A", resulting in the "pwd" file.
Rooting of gene trees
The concept of speciation and duplication is only meaningful on rooted gene trees, but the neighbor joining algorithm infers unrooted trees. We use a simple parsimony criterion for rooting. Gene trees are rooted on each branch, resulting in 2N-3 differently rooted trees for a gene tree of N sequences. For each of these, the number of inferred duplications is determined. From the trees with a minimal number of duplications (if there is more than one) the tree with the shortest total height is chosen as the rooted tree. Empirical studies on gene trees based on 1750 Pfam alignments show that about 60% of trees rooted in such a way have their root in the same position that direct midpoint rooting  would place it.
Naively performing a full duplication/speciation analysis on each of 2N-3 differently rooted trees results in a overall time complexity of O(N2) or worse, but this can be avoided. For the purpose of the following discussion it is assumed that our SDI algorithm for speciation/duplication inference is employed, but the idea applies to all algorithms based on a mapping function M defined as follows :
Definition 4. Let G be the set of nodes in a rooted binary gene tree and S the set of nodes in a rooted binary species tree. For any node g ∈ G, let γ (g) be the set of species in which occur the extant genes descendant from g. For any node s ∈ S, let σ (s) be the set of species in the external nodes descendant from s. For any g ∈ G, let M(g) ∈ S be the smallest (lowest) node in S satisfying γ (g) ⊆ σ (M(g)).
Duplications are then defined using M(g) as follows:
Definition 5. Let g1 and g2 be the two child nodes of an internal node g of a rooted binary gene tree G. Node g is a duplication if and only if M(g) = M(g1) or M(g) = M(g2).
The main task of most algorithms for duplication inference is the calculation of M. After M has been calculated for any rooted gene tree G it is possible to explore different root placements without having to recalculate M for every node of G. As long as the root is moved one node at the time, M has to be recalculated only for two nodes: the one node which was child 1 (if the new root is placed on a branch originating from child 1 of the previous root) or child 2 (otherwise) of the previous root, as well as for the new root itself. Hence, two postorder traversal steps (child 1 or 2 of the old root, then the new root) in the SDI algorithm are all that is needed. The new sum of duplications is determined by keeping track of the change in duplication/speciation status in the two recalculated nodes as well as in the previous root. Performing this over the whole gene tree (some nodes will be visited twice) it is possible to explore all possible root placements and calculate the resulting duplications in practically linear time. The pseudocode algorithm is as follows:
Algorithm for speciation duplication inference combined with rooting
Input : binary gene tree G, rooted binary species tree S.
Output: G with "duplication" or "speciation" assigned to each internal node and rooted in such a way that the sum of duplications is minimized.
root gene tree G at the midpoint of any branch;
set B = getBranchesInOrder(G);
SDIse(G, S) (see );
for each branch b in B:
set n1 = child 1 of root of G;
set n2 = child 2 of root of G;
root G at the midpoint of branch b;
updateM(n1, n2, G);
if (sum of duplications in G <dmin):
set dmin = d;
set Gdmin = G;
updateM( n 1 , n 2 , G )
set r = root of G;
if (child 1 of r == n1 || child 2 of r == n1):
calculateMforNode( n )
if (n != external):
set a = M(child 1 of n);
set b = M(child 2 of n);
while (a != b):
if (a >b):
set a = parent of a;
set b = parent of b;
set M(n) = a;
if (M(n) == M(child 1 of n) || M(n) == M(child 2 of n):
n is duplication;
n is speciation;
getBranchesInOrder( G )
set n = root of G;
set i = 0;
while !(n == root && indicator of n == 2):
if (n != external && indicator of n != 2):
if (indicator of n == 0):
set indicator of n = 1;
set n = child 1 of n;
set indicator of n = 2;
set n = child 2 of n;
if (parent of n != root):
set B [i ] = branch connecting n and parent of n;
set B [i ] = branch connecting child 1 of root and child 2 of root;
set i = i + 1;
if (parent of n != root &&n != external):
set B [i ] = branch connecting n and parent of n;
set i = i + 1;
set n = parent of n;
Master species tree
Duplication inference requires a species tree. For this purpose, a single completely binary master species tree was compiled manually, containing 249 of the most commonly encountered species in Pfam (spanning Archaea, Bacteria, and Eukaryotes). This tree is based mainly on information from Maddison's "Tree of Life" project http://tolweb.org/tree/phylogeny.html, NCBI's taxonomy database http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html, the "Deep Green" project http://ucjeps.berkeley.edu/bryolab/greenplantpage.html, and [29–32]. This master tree groups nematodes and arthropods into a clade of ecdysozoans (molting animals) as proposed by Aguinaldo , a classification which is still controversial. The tree is available in NHX format  at http://www.genetics.wustl.edu/eddy/forester/tree_of_life_bin_1-4.nhx.
RIO is implemented in a Perl pipeline of several software programs as follows. Alignment of the query sequence is done programs from the HMMER package http://hmmer.wustl.edu/. Bootstrapping is performed by a bespoke C program. Maximum likelihood pairwise distances are calculated using BLOSUM matrices  by a modified version of TREE-PUZZLE . Neighbor joining trees are calculated by a modified version of NEIGHBOR from the PHYLIP package http://evolution.genetics.washington.edu/phylip.html. Rooting and duplication inference are accomplished by "SDIunrooted" – a Java implementation of our SDI algorithm which incorporates various methods for rooting (see above). The actual counting of orthologs is performed by methods of the Java class "RIO". These programs, with the exception of HMMER, are part of the FORESTER package and are available under the GNU license at http://www.genetics.wustl.edu/eddy/forester/.
RIO is also available as an analysis webserver at http://www.rio.wustl.edu/. The pairwise distance and tree calculations are parallelized in this version (currently, ten 1.26 GHz Pentium III processors are being used).
Precalculation of pairwise distances
Pairwise distances to be used in RIO analyses were calculated using the "full" alignments (as opposed to the smaller curated "seed" alignments) from Pfam 6.6 (August 2001, 3071 families, ). Sequences from species not present in the master species tree were removed from the alignments. For computational efficiency reasons, alignments that still contained more than 600 sequences were further pruned; sequences not originating from SWISS-PROT were discarded, and sequences from certain mammals were excluded (mouse, rabbit, hamsters, goat, all primates except human), since mammals are likely to be oversampled in most Pfam families. For some extremely large families [immunoglobulin domain (PF00047), protein kinase domain (PF00069), collagen triple helix repeat (PF01391), and rhodopsin-type 7 transmembrane receptor (PF00001)], all mammalian sequences except those from human and rat were excluded.
Alignments of average length <30 amino acids (<40 for zinc finger domains) or with <6 sequences were not analyzed, because of lack of phylogenetic signal. For all other families, pairwise distances for 100 bootstrap samples were prepared. Following the above rules, pairwise distances were precalculated for 2384 alignments from a total of 3071 in Pfam 6.6 (75 alignments were too short and 612 alignments contained less than six sequences from species in the master species tree).
Phylogenomic analyses of the A. thaliana and C. elegans proteomes
The input for RIO consists of a query protein sequence together with a Pfam alignment for a protein family that the query belongs to. Before RIO could be applied we therefore had to determine the matching domains for each protein in the A. thaliana and C. elegans proteomes. For proteins composed of different domains, a RIO analysis is performed for each domain individually.
The source for protein sequences were: ATH1.pep.03202001, a flatfile database of 25,579 A. thaliana amino acid sequences (hypothetical, predicted and experimentally verified) that have been identified as part of the Arabidopsis Genome Initiative (AGI) http://www.arabidopsis.org/info/agi.html, and wormpep 43, a flatfile database of 19,730 C. elegans amino acid sequences http://www.sanger.ac.uk/Projects/C_elegans/wormpep/.
The program hmmpfam (version 2.2 g) from the HMMER package was used to search each protein sequence in ATH1.pep.03202001 and wormpep 43 against Pfam 6.6. Only domains with a score above the so-called Pfam gathering cutoff were reported ("cut_ga" option) in order to include only confident domain assignments.
The sum of domains assigned to the 25,579 A. thaliana protein sequences was 17,847 (counting multiple copies of the same domain in one protein as one). 12,431 sequences matched one domain (containing possibly multiple copies of this one domain). 1,982 sequences matched two different domains (containing possibly multiple copies of both). 453 sequences matched three or more different domains (containing possibly multiple copies of each). Therefore, a total of 14,866 (58%) sequences from ATH1.pep.03202001 could be assigned to one or more Pfam families.
Similarly, a sum of 12,314 domains was assigned to the 19,769 C. elegans protein sequences. 7,698 sequences matched one domain, 1,632 matched two different domains, and 388 matched three or more different domains. Thus, 9,718 (49%) sequences from wormpep 43 could be assigned to one or more Pfam families.
RIO was then used to analyze each protein sequence matching one or more Pfam families. The results from these analyses can be found at http://www.genetics.wustl.edu/eddy/forester/rio_analyses/. The approximate time requirement was between two and three weeks, performed on eight Pentium III 800 Mhz processors.
How many sequences can be analyzed with RIO?
Number of domains which can be analyzed with RIO
Protein sequences in proteome
Sum of domains assigned to proteome
Domain sequences analyzed with RIO
Sum of individual RIO analyses
Correspondingly, from the 12,314 C. elegans domain sequences matching a Pfam family, 11,287 (92%) could be analyzed with RIO using the precalculated distances. 901 (7%) domain sequences were not analyzed because the corresponding Pfam alignments were either too short or did not contain enough sequences. 53 (0.4%) domain sequences were not analyzed because the E-value for the match to their profile HMM was below the threshold of 0.01. In addition, we did not analyze the 73 C. elegans sequences matching the immunoglobulin family (PF00047), because we considered the phylogenetic signal in this alignment to be questionable. Furthermore, most of the 73 sequences contain multiple copies of the immunoglobulin domain (for example, CE08028 contains 48 immunoglobulin domains) and we therefore worried that the results from this family might skew our overall results. The sum of RIO analyses was 14,740.
Thus, a little less than half of each proteome can be analyzed by RIO. The most important factor is whether a protein sequence has a match to a Pfam domain family.
RIO analysis of lactate/malate dehydrogenase family members
RIO analysis of A. thaliana lactate/malate dehydrogenase family members
RIO top 1 hit (highest orthology value)
Domain used for analysis:
L-LDH (o = 91%, n = 3%)
L-LDH (o = 94%, n = 12%)
MDH (o = 2%, n = 98%)
cytoplasmic MDH (o = 40%, n = 78%)
mitochondrial NAD-dependent MDH
mitochondrial MDH (o = 89%, n = 100%)
mitochondrial MDH (o = 94%, n = 66%)
putative glyoxysomal MDH precursor
MDH (o = 55%, n = 0%)
glyoxysomal MDH (o = 95%, n = 37%)
mitochondrial NAD-dependent MDH, putative
MDH (o = 89%, n = 100%)
mitochondrial MDH (o = 84%, n = 80%)
Chloroplast NAD-dependent MDH
MDH (o = 87%, n = 21%)
MDH (o = 85%, n = 6%)
microbody NAD-dependent MDH
glyoxysomal MDH (o = 100%, n = 100%)
glyoxysomal MDH (o = 80%, n = 97%)
MDH (o = 2%, n = 100%)
MDH (o = 38%, n = 75%)
cytoplasmic MDH (o = 5%, n = 99%)
MDH (o = 31%, n = 84%)
MDH (o = 60%, n = 30%)
chloroplast NADP-MDH (EC 220.127.116.11) (o = 68%, n = 82%)
RIO analysis of C. elegans lactate/malate dehydrogenase family members
RIO top 1 hit (highest orthology value)
Domain used for analysis:
L-LDH (o = 75%, n = 61%)
L-LDH (B chain) (o = 66%, n = 23%)
Member of the MDH protein family (predicted)
MDH (o = 42%, n = 16%)
MDH (o = 53%, n = 34%)
Putative MDH, possible ortholog of H. sapiens Hs.75375 gene product (cytoplasmic MDH) (predicted)
cytoplasmic MDH (o = 13%, n = 95%)
MDH (o = 76%, n = 52%)
Sequences with no orthologs in the current databases
Top orthology bootstrap values of RIO analyses
Top orthology bootstrap values [%]
A. thaliana (total: 14,905)
C. elegans (total: 11,287)
We do not think it is possible at this stage to determine reliable threshold values for "true orthologs" or "absence of orthologs". Such thresholds are very likely to be different for different Pfam families since families vary in the phylogenetic signal their alignment contains. Some sequences that are very likely to be true orthologs nonetheless exhibit marginal orthology bootstrap values (in the range of 70% or even lower).
One might expect that each query sequence that appears to have no orthologs is connected with scenario similar to the one described above for F28P22_13. Yet, this is clearly not the case, for the following reasons: (i) Gene duplications might not be followed by functional modification (many Pfam families are composed of sequences which have all the same function, at least at the resolution of the current annotation). (ii) Some Pfam families are composed solely of sequences originating from closely related (or the same) species (such as PF02362, the B3 DNA binding domain of higher plants). For such families, query sequences from the same species group are expected to have low orthology values. In such cases the concept of subtree-neighbors and ultra-paralogs is more useful than orthologs. (iii) Erroneous RIO results caused by an insufficient phylogenetic signal (due to short sequences, for example) can lead to low orthology values. For this reason, RIO also outputs the average bootstrap value for the consensus tree to give the user a hint about the amount of phylogenetic signal in the alignment used.
Inconsistency between orthology bootstrap values and sequence similarity
The numbers of sequences for which the orthology bootstrap values do not correspond to sequence similarity
Number of query sequences
Manual inspection of the RIO output leads to the following, somewhat unexpected, conclusion. In many cases a discrepancy between orthology bootstrap values and sequence similarity is caused by orthologs in only phylogenetically distant (relatively to the query sequence) species. This can lead to errors if functional annotation is blindly transferred from these orthologs to the query. As an example of this, the results of analyzing the A. thaliana O-methyltransferase F16P17_38 are shown in Figures 10 and 11. (Complete files are at here.) Even though the F16P17_38 sequence is orthologous to the bacterial hydroxyneurosporene methyltransferases (EC 2.1.1.-)  it would be dangerous to annotate it as such. A more reasonable annotation for this query would be to annotate it based on subtree-neighbors and hence call it a plant O-methyltransferase. An indication of this problem (besides a discrepancy between orthology bootstrap values and sequence similarity) is the meeting of the following three conditions: A query sequence has (i) likely orthologs and (ii) likely subtree-neighbors in other species than the query itself, yet (iii) there is no significant overlap between the orthologs and the subtree-neighbors.
We were unable to find convincing examples in the C. elegans and A. thaliana proteomes where wrong sequence similarity based annotations might be caused by unequal rates of evolution (as illustrated in Figure 2). This is not to say that such cases do not exist in those two proteomes, but they are likely to be quite rare. Similarly to the issues described in the previous section, the detection of such examples is complicated by the fact that for many cases in which a discrepancy between orthology bootstrap values and sequence similarity exists, all sequences in the Pfam alignment appear to have to same function, the Pfam family is lineage specific, or the annotations are too poor/confusing to make any kind of inference.
RIO is a procedure for automated phylogenomics. The RIO procedure appears to be particularly useful for the detection of first representatives of novel protein subfamilies. Sequence similarity based methods can be misleading in these cases since every query is always "most similar to something", whereas RIO can detect the absence of orthologs.
Storm, Sonnhammer, and colleagues have recently developed similar ideas and procedures in a program called ORTHOSTRAPPER [44, 45]. One distinction between the two approaches is that ORTHOSTRAPPER's orthology determination procedure does not employ a species tree for duplication inference; it uses a heuristic based on sequence similarity rather than a formally correct phylogenetic means of inferring orthology. Another distinction is that ORTHOSTRAPPER uses uncorrected observed mismatches as a sequence distance measure, rather than estimating evolutionary distances. In general, RIO brings more of the power of known phylogenetic inference algorithms to bear on the problem of proteomic annotation.
Super-orthology is a very stringent criterion. If a query sequence is likely to have super-orthologs, they represent an excellent source to transfer functional annotation from. In contrast, the absence of super-orthologs does not imply that a function for a query sequence cannot be inferred (in the two proteomes analyzed in this work, most sequences appear to have no super-orthologs in Pfam 6.6).
Ultra-paralogs are sequences in the same species as the query and are likely to be the result of recent duplications and therefore might not have yet undergone much functional divergence. Operationally, splice variants can also be thought of as ultra-paralogs (at least as long as protein sequences are considered).
Subtree-neighbors have two uses: (i) If the subtree-neighbors of the query sequence exhibit (partial) agreement in their functional annotations, the elements in which they agree might be used to infer a (partial) function for the query. This is useful for query sequences that are appear to have no orthologs in the current databases. (ii) For query sequences that do have orthologs, absence of overlap between the sequences considered orthologous and those which appear to be subtree-neighbors raises a red flag, indicating that the orthologs are in phylogenetically distant species relative to the query. Transferring annotation from such orthologs is risky. In this case, subtree-neighbors are a more reliable source to transfer annotation from.
RIO outputs warnings if the distance of the query sequence to other sequences is unusually short or long, relative to other branch lengths on the tree. The usefulness of this was not investigated in this work.
A RIO procedure based on Pfam alignments analyzes each protein domain individually since Pfam is protein family database based on individual domains . In some respects, it would be preferable to analyze whole protein sequences, but well curated databases of complete protein alignments are not available (to our knowledge). However, domain-by-domain analysis is not necessarily disadvantageous. Due to domain shuffling many proteins are mosaic proteins, composed of domains with different evolutionary histories [46, 47]. For such proteins it makes much sense to analyze each domain individually. Furthermore, mosaic proteins from sufficiently distant species might be impossible to be aligned over more than one domain at the time, since they are unlikely to exhibit the same domain organization. The same is true for multiple copies of the same domain in protein: Each of them is analyzed individually (such proteins often differ in their number of domain copies and could therefore not be aligned from end to end for the whole family).
In general, the concept of "annotation consensus" is very important in this work (for example consensus between subtree-neighbors, or between subtree-neighbors and orthologs). We have employed this notion loosely. A useful future extension would be to incorporate automated annotation consensus detection into RIO. This would include annotation of internal nodes of a gene tree with a "biological function". Automated consensus detection is trivial for a highly formalized notation system, such as EC numbers (the consensus of EC 18.104.22.168 and EC 22.214.171.124 is EC 1.1.1, a oxidoreductase acting on the CH-OH group of donors with NAD+ or NADP+ as acceptor ). Obviously, it is much more difficult to analyze natural language annotations in the same manner. Perhaps this could be accomplished by utilizing the set of structured vocabularies of the Gene Ontology (GO) project http://www.geneontology.org/.
This work was supported primarily by a grant from Monsanto Company, and also by the Howard Hughes Medical Institute and grant HG01363 from the NIH National Human Genome Research Institute.
- Dayhoff MO: The origin and evolution of protein superfamilies. Fed Proc 1976, 35: 2132–2138.PubMedGoogle Scholar
- Ingram VM: Gene evolution and the haemoglobins. Nature 1961, 189: 704–708.View ArticlePubMedGoogle Scholar
- Haldane JBS: The causes of evolution. New York and London: Harper & Brothers Publishers; 1932.Google Scholar
- Ohno S: Evolution by gene duplication. New York: Springer-Verlag; 1970.View ArticleGoogle Scholar
- Galperin MY, Koonin EV: Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol 1998, 1: 55–67.PubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. 10.1006/jmbi.1990.9999View ArticlePubMedGoogle Scholar
- Pearson WR: Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 1990, 183: 63–98.View ArticlePubMedGoogle Scholar
- Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res 2000, 28: 263–266. 10.1093/nar/28.1.263PubMed CentralView ArticlePubMedGoogle Scholar
- Eisen JA: Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 1998, 8: 163–167.View ArticlePubMedGoogle Scholar
- Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29: 22–28. 10.1093/nar/29.1.22PubMed CentralView ArticlePubMedGoogle Scholar
- Eulenstein O: Vorhersage von Genduplikationen und deren Entwicklung in der Evolution. GMD Research Series 1998., 20:Google Scholar
- Zhang L: On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J Comput Biol 1997, 4: 177–187.View ArticlePubMedGoogle Scholar
- Zmasek CM, Eddy SR: A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 2001, 17: 821–828. 10.1093/bioinformatics/17.9.821View ArticlePubMedGoogle Scholar
- Mueller LD, Ayala FJ: Estimation and interpretation of genetic distance in empirical studies. Genet Res 1982, 40: 127–137.View ArticlePubMedGoogle Scholar
- Felsenstein J: Confidence limits on phylogenies: An approach using the bootstrap. Evolution 1985, 39: 783–791.View ArticleGoogle Scholar
- Arabidopsis-Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408: 796–815. 10.1038/35048692View ArticleGoogle Scholar
- C. elegans-Sequencing-Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998, 282: 2012–2018. 10.1126/science.282.5396.2012View ArticleGoogle Scholar
- Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19: 99–113.View ArticlePubMedGoogle Scholar
- Mombaerts P: Seven-transmembrane proteins as odorant and chemosensory receptors. Science 1999, 286: 707–711. 10.1002/(SICI)1097-010X(20000601)286:7<707::AID-JEZ5>3.0.CO;2-PView ArticlePubMedGoogle Scholar
- Troemel ER: Chemosensory signaling in C. elegans. Bioessays 1999, 21: 1011–1020. 10.1002/(SICI)1521-1878(199912)22:1<1011::AID-BIES5>3.0.CO;2-VView ArticlePubMedGoogle Scholar
- Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14: 755–763. 10.1093/bioinformatics/14.9.755View ArticlePubMedGoogle Scholar
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4: 406–425.PubMedGoogle Scholar
- Studier JA, Keppler KJ: A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol 1988, 5: 729–731.PubMedGoogle Scholar
- Atteson K: The performance of the neighbor-joining method of phylogeny reconstruction. In: Mathematical Hierarchies and Biology (Edited by: Mirkin B, McMorris F, Roberts F, Rzhetsky A). American Mathematical Society 1997, 133–148.Google Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89: 10915–10919.PubMed CentralView ArticlePubMedGoogle Scholar
- Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure (Edited by: Silver Springs, MD). Natl Biomed Res Found 1978, 5: 345–352.Google Scholar
- Swofford DL, Olsen GJ, Waddell PJ, Hillis DM: Phylogenetic Inference. In: Molecular systematics (Edited by: Hillis DM, Moritz C, Mable BK. Sunderland, MA). Sinauer Associates 1996.Google Scholar
- Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G: Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool 1979, 28: 132–168.View ArticleGoogle Scholar
- Aguinaldo AM, Turbeville JM, Linford LS, Rivera MC, Garey JR, Raff RA, Lake JA: Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 1997, 387: 489–493. 10.1038/387489a0View ArticlePubMedGoogle Scholar
- Barns SM, Delwiche CF, Palmer JD, Pace NR: Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc Natl Acad Sci U S A 1996, 93: 9188–9193. 10.1073/pnas.93.17.9188PubMed CentralView ArticlePubMedGoogle Scholar
- Morris SC: Metazoan phylogenies: falling into place or falling to pieces? A palaeontological perspective. Curr Opin Genet Dev 1998, 8: 662–667. 10.1016/S0959-437X(98)80034-8View ArticlePubMedGoogle Scholar
- de Rosa R, Grenier JK, Andreeva T, Cook CE, Adoutte A, Akam M, Carroll SB, Balavoine G: Hox genes in brachiopods and priapulids and protostome evolution. Nature 1999, 399: 772–776. 10.1038/21631View ArticlePubMedGoogle Scholar
- Zmasek CM, Eddy SR: ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 2001, 17: 383–384. 10.1093/bioinformatics/17.4.383View ArticlePubMedGoogle Scholar
- Strimmer K, von Haeseler A: Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 1996, 13: 964–969.View ArticleGoogle Scholar
- Felsenstein J: PHYLIP – Phylogeny Inference Package. Cladistics 1989, 5: 164–166.Google Scholar
- Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000, 28: 45–48. 10.1093/nar/28.1.45PubMed CentralView ArticlePubMedGoogle Scholar
- Dennis D, Kaplan NO: D and L-lactic acid dehydrogenase in Lactobacillus plantarum. J. Biol. Chem. 1960, 235: 810–818.PubMedGoogle Scholar
- Banaszak LJ, Bradshaw RA: Malate dehydrogenase. In: The Enzymes (Edited by: Boyer PD). New York: Academic Press 1975, 11: 369–396.Google Scholar
- Johnson HS: NADP-malate dehydrogenase: photoactivation in leaves of plants with Calvin cycle photosynthesis. Biochem. Biophys. Res. Commun. 1971, 43: 703–709.View ArticlePubMedGoogle Scholar
- Webb EC: Enzyme nomenclature. San Diego: Academic Press 1992.Google Scholar
- Jendrossek D, Kratzin HD, Steinbuchel A: The Alcaligenes eutrophus ldh structural gene encodes a novel type of lactate dehydrogenase. FEMS Microbiol Lett 1993, 112: 229–235. 10.1016/0378-1097(93)90169-3View ArticlePubMedGoogle Scholar
- Wyrambik D, Grisebach H: Enzymic synthesis of lignin precursors. Further studies on cinnamyl-alcohol dehydrogenase from soybean-cell-suspension cultures. Eur. J. Biochem. 1979, 97: 503–509.View ArticlePubMedGoogle Scholar
- Armstrong GA, Alberti M, Leach F, Hearst JE: Nucleotide sequence, organization, and nature of the protein products of the carotenoid biosynthesis gene cluster of Rhodobacter capsulatus. Mol Gen Genet 1989, 216: 254–268.View ArticlePubMedGoogle Scholar
- Storm CE, Sonnhammer ELL: Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics 2002, 18: 92–99. 10.1093/bioinformatics/18.1.92View ArticlePubMedGoogle Scholar
- Remm M, Storm CE, Sonnhammer ELL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 2001, 314: 1041–1052. 10.1006/jmbi.2000.5197View ArticlePubMedGoogle Scholar
- Patthy L: Evolution of the proteases of blood coagulation and fibrinolysis by assembly from modules. Cell 1985, 41: 657–663.View ArticlePubMedGoogle Scholar
- Doolittle RF: The genealogy of some recently evolved vertebrate proteins. Trends Biochem Sci 1985, 10: 233–237. 10.1016/0968-0004(85)90140-9View ArticleGoogle Scholar
- Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res 2001, 11: 1425–1433. 10.1101/gr.180801View ArticleGoogle Scholar
- Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P, Robertson LS, Skrzypek MS, Braun BR, Hopkins KL, Kondu P, Lengieza C, Lew-Smith JE, Tillberg M, Garrels JI: YPD, PombePD and WormPD: model organism volumes of the BioKnowledge library, an integrated resource for protein information. Nucleic Acids Res 2001, 29: 75–79. 10.1093/nar/29.1.75PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.