5.1 Synthetic inputs
For evaluating the performances of our algorithms we designed the following simulation: 1) We generated random trees with 12 - 52 nodes by random hierarchical clustering of the trees' leaves, and generated random sets of 1000 - 3000 OL s that are related to these trees (see the Methods section). The labelings were sampled from the uniform distribution U [0, 3].
-
2)
In these random inputs, we "planted" solutions, which are OL s (with 100 - 300 orthologs) that have high co-evolutionary score (identical co-evolution) in large subtree (e.g. 5 - 20 nodes) of the input tree. We added additive noise with uniform distribution U[-0.15, 0.15] to each component of the "planted" solutions.
-
3)
We implemented the two algorithms, Tree Grower and Tree Splitter, on these inputs. 4) Currently there are no other algorithms that were designed specifically for discovering local patterns of co-evolution. Thus, we compared the performances of the algorithms to two popular bi-clustering algorithms (SAMBA [29] and the algorithm of Cheng and Church (C&C) [30]). To this end, we used two measures of performances: False Positive (FP) rate, which is the fraction of orthologs (OFP) or tree branches (BFP) in the output that are not part of a 'planted' solution, and False Negative (FN) rate, which is the fraction of orthologs (OFN) or tree branches (BFN) in the 'planted' solution that do not appear in the output. Figure 9 includes a summary of the simulation study. As can be seen, the performances of our algorithms are very good and far exceed the performances of the competing bi-clustering algorithms. For example, when considering all the synthetic inputs, the average OFN, OFP, BFN, and BFP of the Tree Splitter are 0.002, 0.25, 0.07, and 0.14 respectively. For comparison, the average OFN, OFP, BFN, and BFP of the algorithm of C&C are 0.52, 0.76, 0.16, and 0.61 respectively. This result justifies designing algorithms that are specific for solving the co-evolutionary problem, instead of using general bi-clustering algorithms.
Finally, our simulation showed that there are many inputs where the Tree Splitter algorithm outperforms the Tree Grower algorithm. However, there are cases where the Tree Grower gave better results (the intuition for this phenomenon was given in section 3.1). Thus, we employed both algorithms in the biological analysis.
5.2 Biological Inputs: Results and Discussion
In this section, we describe our main biological findings. The full lists of all the co-evolving sets that were found along with their local co-evolutionary patterns, and their functional enrichments appear in additional files 5, 6 and 7. The two main goals of this section are: 1) to describe a variety of biological examples that can be analyzed by our approach; 2) to depict some new biological insights related to this analysis.
The biological datasets describe the evolution of diverse sets of organisms and OL s, along different time ranges. The Eukaryote dataset includes both multicellular and unicellular organisms and describes evolution along 1642 million years. The fungi are unicellular organisms that appeared 837 million years ago. The mammals are multi-cellular organisms that appeared 197 million years ago (see [16, 31] for the divergence times of the different phylogenetic groups).
In the case of the fungi, we analyzed both dN/dS and CN. The dN/dS dataset includes conserved OL s that have exactly one ortholog in each organism while the fungi CN dataset includes OL with varying number of orthologs in each organism (see section 4.5). In the case of the eukaryote dataset, we analyzed only CN. In Addition to the analysis of OL s we also analyzed the local co-evolution of mammalian signaling pathways (based on dN/dS) and fungi complexes (based on CN; see section 4.5).
The rest of this section includes comparisons between the different measures of co-evolution and a summary of our findings in each of the biological datasets. As mentioned in the methods section, the reported signals of co-evolution can be attributed both the different datasets that we analyzed and/or our computational approach. The fact that some of the signals appear in more than one of the analyzed dataset demonstrates that these signals are very robust. On the other hand, the fact that some enrichment appear in only part of the datasets may be attributed to the fact that the sets of OL s in each database are different, to the different measures of co-evolution that was used, and/or to the different type of OL s that was used.
5.2.1 Comparison between the Different Measures of Co-evolution
The purpose of this subsection is to compare the different measures of co-evolution that are described in this work and to show that they are not redundant. To this end, we first compared the local co-evolutionary patterns found by the different measures of co-evolution. We defined two sets of co-evolving OL s to be identical if they have at least 70% similarity (measured by the corresponding Jaccard coefficient [32]) both when considering their OL s and when comparing the corresponding set of branches on which they co-evolve. Figures 10A, B, and 10C include such a comparison. Each table corresponds to one dataset, and each cell in these three tables corresponds to a comparison of two measures. These cells contain the fraction of the results that are not identical when comparing the corresponding plots of our approach. By our definition, 0 denotes identical sets of results while 1 denotes completely non-identical sets of results. As can be seen, the values in most of the cells are much closer to 1 than 0, demonstrating that the different measures of co-evolution are not redundant.
Our approach can detect regions in the evolutionary tree where sets of orthologs exhibits co-evolution. By definition, this can not be done by clustering; we demonstrate this point in the next sections (see, for example, section 5.2.3). In this section, we demonstrate that also the OL s found by local and global approaches are different. To this end, we compared the results found by our approach to those obtained by a global clustering (k-means with various values of k). In this case, we only compared the OL s in each solution and used the same definition as described above. We compared the global and local results for each measure in each dataset. Figure 11D includes such a comparison. Again, as can be seen, the values in most of the cells are much closer to 1 than 0, demonstrating that many of the results found by our local approach can not be detected by a global clustering.
5.2.2 Local Co-Evolution of Cellular Processes: A Global View
The small fungi dN/dS dataset, the small fungi CN dataset, and the eukaryote CN dataset relate to orthologs (single genes) and not to complexes/pathways as the other two datasets. Thus, it is possible to compute functional enrichment for the resulting sets that co-evolve locally (Methods, subsection 4.3.2).
A summary of these results appears in Figures 11A, B and 13. As can be seen, 10% - 56% of the co-evolving sets that we found are functionally enriched. This fact demonstrates that groups of genes with similar functionality tend to undergo local co-evolution.
Figure 11B depicts the main GO functions that were enriched in the co-evolving sets of OL s in each of the three datasets. As can be seen, our analysis shows that there are cellular processes, such as metabolism and regulation, that exhibit co-evolution in all the three datasets.
Figure 12 includes a concentrated view on the co-evolution of the cellular processes that are related to metabolism and regulation in the three biological datasets. The figure depicts the regions in the evolutionary trees where we detected co-evolving sets of OL s that are enriched with metabolic and regulatory GO functions. This figure also includes information about the corresponding measures of co-evolution that were used for detecting each of the co-evolving sets of OL s. As can be seen, the fact that these two groups of cellular functions exhibit local pattern of co-evolution is robust to the type of the OL, the measure of co-evolution, and the input dataset used.
Additional file 6 includes a comparison between the co-evolving sets of OL s found by our approach and by SAMBA for the Fungi dN/dS dataset; it shows that many cellular functions were found to be enriched by our approach but not by SAMBA.
5.2.3 Fungi Copy Number and dN/dS
The two fungi datasets are interesting since they enable us to compare the two types of co-evolution: co-evolution via similar/correlative dN/dS (Figure 13A), and evolution via similar/correlative gene copy number (Figure 13B). Many metabolic cellular functions (e.g. metabolism of amino acids), and cellular functions that are related to regulation (e.g. translation) exhibit local co-evolutionary patterns both via changes in copy number and via changes in dN/dS. Though the GO enrichments that appear in Figure 13A and in Figure 13B are similar, it is important to note that the OL s (and thus the co-evolving sets of OL s) in the two cases are completely different. This fact emphasizes the centrality of these processes in the fungi evolution.
One explanation of this phenomenon is the fact that fungi datasets includes both anaerobic organisms (S. cerevisiae, S. bayanus and S. glabrata) and aerobic organisms (A. nidulans, C. albicans, D. hansenii, K. lactis, and Y. lipolytica) [33]; and the switch between these two types of metabolism required the co-evolution of various metabolic processes.
We discovered two regions where many of the fungal genes underwent positive selection. By definition, such regions in the evolutionary tree can not be discovered by global clustering methods. The larger set of OL s (554 orthologs) exhibits positive selection along the branch (11, 12) (see Figure 13A) probably following the whole genome duplication event that has occurred at this bifurcation [34]. This whole genome duplication event probably served as a driving force underlying this burst of positive selection, by relaxing the functional constraints acting on each of the gene copies (see for example [35]). Interestingly, this branch also partites the fungi into two groups, anaerobic and aerobic, that were mentioned above. This fact further supports the centrality of metabolism in fungi evolution.
Another set of OL s (11 orthologs) exhibits positive selection along the subtree with the nodes 13, 14, and 15 (see Figure 13A). The branch between nodes 13 and 14, leads to a subgroup (D. hansenii and C. albicans) that evolved a modified version of the genetic code [36], and the branch between nodes 13 and 15 leads to Y. lipolytica (which is a sole member in one of the three taxonomical clusters of the Saccharomycotina[37]). All the results for these datasets appear in additional file 5 and additional file 7.
5.2.4 Eukaryote Copy Number
As mentioned, this biological dataset gives a wider evolutionary view than the fungi datasets. Cellular processes that are related to metabolism, signaling, and mRNA processing exhibit co-evolutionary patterns along this dataset (see Figures 11B and 13). One striking phenomenon is that many of these co-evolving sets (87%) exhibited co-evolution (according to all the measures of co-evolution) along the subtrees of the Animalia and Fungi, and excluding the subtree of the Plantae (see illustration in Figure 11C).
It is possible that this result is partially related to the fact that the analyzed subtrees of the Plantae included only one organism with relatively high evolutionary distance from other organisms. However, we also found two possible biological explanations for this phenomenon: First, many gene modules changed their functionality after the split between the Plantae and the two other groups (Animalia and Fungi). Cases where homologous protein complexes in Plantae and Animalia have rather different functions were reported in the past. For example, the COP9 signalosome, a repressor of photomorphogenesis in Plantae, regulates completely different developmental processes in Animalia[38, 39]. Our analysis, however, may suggest that this is a wide scale phenomenon.
Second, it is possible that there is a relatively higher rate of changes in the protein-protein interactions along the split between the Plantae and the two other groups (i.e. more pairs of protein gain/lose new interactions). Thus, these results suggest that the protein-protein interaction network of Plantae may be relatively different from that of the other groups (see [40] for a comparison of protein-protein interaction networks). To the best of our knowledge, an alignment of the protein interaction network of a plant and organisms from the other two groups has not been performed yet. When such an alignment will be performed, it will be possible to check this hypothesis.
All the results for these datasets appear in additional files 8 and additional files 7.
5.2.5 Co-Evolution of Cellular Functions
The functional enrichments of the co-evolving OL s can teach us about functional interdependencies between cellular functions and about the co-evolution of cellular functions. We found many subtrees where sets of OL s that are enriched with various GO functions exhibited co-evolution. For example, Translation and Gene expression exhibited a copy number based co-evolution in the fungi subtree that is under internal node 12 (Figure 12B), as expected from two coordinated biological processes in charge of producing RNA or proteins from the corresponding genes (DNA sequences).
Additional cellular processes showed coordinated evolution. For example, Translation and Amino acid metabolic process exhibited co-evolution in the Eukaryotes (Figure 12C) in the subtree that included nodes 1, 2, 3, 4, 5, and 8 (as detected by copy number variations). The link between these two processes is probably not direct. A possible explanation is that the evolution of the metabolism of various Amino Acids (AA) altered the composition of the AA pool in the fungi cell. These changes were then followed by a corresponding evolution of the translation machinery to cope with the new AA pool.
5.2.6 Co-Evolution of Fungal complexes
We implemented our approach to find groups of complexes that exhibit correlative (Spearman Correlation) patterns of co-evolution along parts of the Fungi evolutionary tree (Figure 13A; see the Method section).
To discover complexes that co-evolve with other complexes in specific parts of the phylogenetic tree, we divided the evolutionary tree into the three parts that are marked in Figure 6C (Hemiascomycota, Euascomycota, and Archeascomycota). Then, we computed for each complex the number of solutions (co-evolving groups) that include it in each of these three parts of the tree (all the results appear in additional files 9). We focused on complexes whose co-evolution with other complexes is time dependent (i.e. it is relatively higher in a narrow part of the evolutionary tree).
We found that several complexes exhibit different levels of co-evolution with other complexes along different parts of the evolutionary tree. For example, the complexes: Transcription elongation factor and Transcription export which are important for mRNA production, as well as the Proteasome core complex sensu Eukaryota and Proteasome regulatory particle lid subcomplex sensu Eukaryota, in charge of protein degradation, exhibit relatively higher level of co-evolution in the subtree of the Hemiascomycota. These complexes affect general protein amounts in the cell at two different levels, transcription (mRNA formation) and protein stability (protein degradation). In the sub-tree of the Euascomycota we see co-evolution of the Mediator complex and the U2 snRNP. These two complexes affect mRNA level by influencing the rate of transcription and the rate of splicing, respectively. Finally, the SLIK SAGA-like complex, encoding a chromatin remodelling complex, as well as the mRNA cleavage factor and the Spliceosome, involved in mRNA processing, exhibit relatively higher level of co-evolution in the subtree of the Archeascomycota.
Notably, all the complexes whose co-evolution was enriched in specific branches of the tree are involved in basic gene expression processes at all possible levels (mRNA creation, stability and processing, protein creation and stability). A recent work of Man and Pilpel [33] showed that differential translation efficiency of orthologous genes can produce phenotypic divergence of Fungi. Our results may suggest a similar and wider picture where the co-evolution of various gene expression processes is involved in phenotypic divergence.
5.2.7 Co-Evolution of Mammalian Signaling Pathways
Similarly to the previous subsection, we implemented our approach to find groups of signaling pathways that exhibit correlative (Spearman Correlation) and absolutely similar (L1 norm) pattern of co-evolution along parts of the mammalian evolutionary tree (Figure 13B; see the Methods section).
To discover co-evolution of specific pathways in specific parts of the phylogenetic tree, we divided the evolutionary tree into the three parts that are marked in Figure 6D (Rodentia, Laurasitheria, and Primates). Then, we computed for each pathway the number of solutions (co-evolving groups) that include that pathway in each of these three parts of the tree (all the results appear in additional file 10). We focused on those signaling pathways whose co-evolution is time dependent (i.e. it is relatively higher in a narrow part of the evolutionary tree).
In this case, we found that in general pathways exhibit relatively homogenous levels of co-evolution along different parts of the evolutionary tree. However, also in this case, for the L1 norm, some of the pathways exhibit accelerated levels of co-evolution in particular branches. For example, the pathways Toll-like Receptor Signaling, a pathogen-associated pattern recognition receptor, Cell Cycle: G1/S Checkpoint Regulation, and Cell Cycle: G2/M Checkpoint Regulation exhibit relatively higher levels of co-evolution in the subtrees Rodentia and Laurasitheria. Interestingly, in the latter subtree co-evolution can also be seen between these pathways and IL-6 Signaling, which plays a central role in inflammation. The association between basic cellular checkpoints and the response to external insults such as pathogens is intriguing and deserves further investigation.
Finally, in the subtree of the Primates we observe co-evolution of pathways related to neurotransmission and neuronal evolution (e.g. GABA Receptor Signaling, the main inhibitory neurotransmitter in mammalian CNS, TR/RXR Activation, related to activation of the thyroid hormone, and Amyotrophic Lateral Sclerosis Signaling, a disorder of the motor neurons).