Algorithms
Our modularitydriven approach to phylogenetic network analysis, MOPHY, which is illustrated in Figure 1, can be summarized as follows:
1. Considering each extant network individually, identify network modules that represent functional and topological properties of each network.
2. Project modules identified on each network to other extant networks, based on the conservation of functional and topological properties, to obtain a module map for each species. A module map can be thought of as a mathematical representation of the conservation of extant network modules in the corresponding species.
3. Using module maps, compare networks of diverse species to construct network phylogenies.
4. Using resulting network phylogenies, investigate the evolutionary histories of extant network modules to gain insights on the evolution of functional modularity.
In the rest of this section, we formulate this approach based on biologically sound abstractions and present algorithms to effectively solve the resulting computational problems.
Problem Formulation
An interaction network G = (V, E) is specified by a set V of proteins and a set E of interactions between these proteins. For v
_{
i
}, v
_{
j
}∈ V, v
_{
i
}
v
_{
j
}∈ E indicates that proteins v
_{
i
}and v
_{
j
}interact with each other. Interactions are also associated with reliability scores, specified by weight function w: V × V → [0, 1], where larger w(v
_{
i
}, v
_{
j
}) indicates higher confidence in the existence of an interaction between v
_{
i
}and v
_{
j
}[22].
In this paper, we particularly focus on proteinprotein interaction (PPI) networks. However, the proposed method can be easily extended to different sources of data that relates to the organization of cellular systems (e.g., gene expression, regulatory networks, metabolic pathways)  to provide an integrated view and analysis of cellular organization. Consider a set
= {G
_{1}, G
_{2}, ..., G
_{
K
}} of K PPI networks, each belonging a different extant species.
The proteins in these networks can be associated with each other through a sequence based similarity measure, ψ:
×
→ [0, 1], where
= V
_{1} ∪ V
_{2} ∪ ...∪ V
_{
K
}is the set of all proteins in all networks. In practice, we estimate the sequence similarity scores using BLAST Evalues [1]. Let E
_{
ij
}denote the Evalue of the most significant bidirectional BLAST hit between proteins v
_{
i
}and v
_{
j
}. We define ψ (v
_{
i
}, v
_{
j
}) = 1 + 1/log(E
_{
ij
}), if E
_{
ij
}<ω, and 0 otherwise, where ω denotes the threshold for the Evalue being considered significant. If ψ (v
_{
i
}, v
_{
j
}) > 0, i.e., E
_{
ij
}<ω, we call v
_{
i
}and v
_{
j
}homologs. Note that ψ is quite sparse in practice, i.e., most proteins have only a few homologs in other species.
Network Modularity
An important class of network components that are used in characterizing the networks consists of functional modules [23]. A functional module is in general defined as a set of macromolecules (e.g., proteins) that perform a distinct function together (e.g., protein complexes, signaling pathways) [5]. It is shown that these functional modules generally manifest themselves in interaction networks as subgraphs with distinctive local properties. For instance, protein complexes would have high inner connectivity in the network, while being somewhat loosely connected to the rest of the network [24]. Furthermore, it is observed that functionally modular groups of proteins and their interactions are likely to be conserved together [9, 10, 25]. Consequently, modular subgraphs provide an excellent substrate for understanding the evolutionary relationship between networks that belong to diverse species. For this reason, we use conservation of modular subgraphs as an indicator of evolutionary proximity between networks. Note that, comparative methods that target identification of conserved functional modules on a group of networks are also available. However, there are several advantages to our approach of discovering modular subgraphs on each network individually, followed by projection of these subgraphs on other networks: (i) multiple graph alignment is a computationally expensive problem [8, 15], (ii) noisy and incomplete nature of available interaction data makes it particularly difficult to account for mismatches (e.g., does a missing interaction indicate divergence of function or is it just an artifact of missing data?), and (iii) by identifying modular subgraphs in each extant network individually, we can quantify not only the conservation, but also divergence of modularity in different evolutionary lineages and understand the evolutionary histories of functional modules in diverse species.
Proximity as an indicator of modularity
In order to assess the modularity of a group of proteins in a network, we use a proximity based measure, motivated by three key observations: (i) Network proximity is shown to be correlated with functional similarity [26, 27]. (ii) PPI networks are generally incomplete and prone to noise, therefore existence of alternate paths may account for missing interactions [19]. (iii) Existing evidence suggests that existence of alternate paths relaxes the evolutionary pressure on conserving the interaction between two proteins [28].
Consequently, while assessing the modularity of a subgraph in a different network, it might be evolutionarily more relevant to consider the conservation of network proximity rather than conservation of individual interactions.
Let
denote a simple path connecting proteins v
_{
i
}and v
_{
j
}in network G. We define the reliability of this path as the aggregate reliability of all interactions on the path, i.e.,
. Then, we estimate the proximity between two proteins with respect to G as the reliability of the most reliable path between these two proteins, i.e.,
, where ∏ (v
_{
i
}, v
_{
j
}) denotes the set of all simple paths between v
_{
i
}and v
_{
j
}. Note that, according to this measure, proximity decays exponentially by path length. This is in agreement with the relation between functional similarity and network proximity [26]. If the interactions are not associated with reliability scores (i.e., the edges are unweighted, as in many of the existing PPI databases), we assign a score of 1/2 for all interactions, to ensure exponential decay of the proximity measure. In addition, mutual clustering coefficient, as a measure of (sub)graph connectivity, is used to complement proximity in identifying network modules, which is very effective in breaking ties in the unweighted case (please see Section Module Identification for details).
Projection of proximity on other networks
Now consider evaluating the proximity of two proteins with respect to another network,
G' ≠
G, with a view to assess the conservation of their functional association. In order to measure the proximity of
v
_{
i
}and
v
_{
j
}with respect to
G', we aggregate the proximities of their homologs in
V (
G'). For this purpose, we normalize the similarity measure with respect to network
G ' to obtain
, where
v
_{
i
}∉
V (
G') and
v
_{
k
}∈
V(
G') Then, the proximity of
v
_{
i
}and
v
_{
j
}∉
G' with respect to
G' is defined as
Modularity and projection of modularity
We can now define the modularity of a set of proteins with respect to a network. For a set
S of proteins, all in the same species,
i.e.,
S ⊆
V (
G) for some
G ∈
, the modularity of
S with respect to network
G' ∈
is given by
which is the average proximity between all pairs of proteins in S with respect to G' In other words, the modularity of a group of proteins in one network with respect to another network measures the extent to which the modularity of the homologs of these proteins is conserved in the corresponding network, through aggregation of proximities of all homologs. Observe that our framework provides a universal view of modularity by facilitating assessment of the modularity of a set of proteins in one species with respect to a network that belongs to another species.
Concerns could be raised that the proposed measure might be heavily influenced by the sequence information, resulting in an inadvertently sequence based analysis (as opposed to network based). However, as we demonstrate in Section Testing, the biological signals captured by this measure depend on network topology and the use of sequence information is minimal. Namely, we randomly rewire all interactions (i.e., add 100% noise to all networks) and keep sequence similarities as they are. In this case, the performance of the proposed measure in phylogeny reconstruction becomes equivalent to that of a random algorithm (which is otherwise significantly better, even at the presence of 50% noise). This result indicates that biological signal is lost if the network is completely noisy, even if the sequence information is completely preserved.
Module Identification
In our framework, network modules are identified on each network individually, and then projected on other networks to construct module maps. In order to identify modular subgraphs in a single network, we use a hierarchical clustering algorithm with a hybrid similarity measure that integrates the concepts of proximity and connectivity for network clustering. Integration of these two alternate measures enables discovery of tightly coupled subgraphs with low diameter.
Proximity vs. connectivity
An important problem associated with the application of proximity based clustering to molecular interaction networks is that, these networks are "ultrasmall", i.e., distances between most pairs of proteins are very low. Particularly when the interactions are not associated with reliability scores, there are many pairs of proteins with identical proximity at early steps of hierarchical clustering, resulting in many candidate pairs of clusters to be merged. This problem is often alleviated by running the algorithm multiple times with random decisions, and then reconciling the clusters based on the number of times each pair of proteins are assigned to the same cluster.
Mutual clustering coefficient complements proximity
In this study, we propose an alternate approach, which uses mutual clustering coefficient [
29] between two sets of proteins as a secondary measure of similarity between clusters. Mutual clustering coefficient captures the overlap between the interaction profiles of two proteins, providing an estimate of the likelihood that the two proteins together belong to a functional module. We generalize the notion of mutual clustering coefficient to pairs of clusters (sets of proteins) as follows. For a set
S
_{
i
}⊆
V of proteins in network
G = (
V,
E), let
N
_{
i
}= {
v
_{
k
}∈
V: ∃
v
_{
l
}∈
S
_{
i
}such that
v
_{
k
}
v
_{
l
}∈
E} be the set of interacting partners of the proteins in
S
_{
i
}. Then the mutual clustering coefficient of clusters
S
_{
i
}and
S
_{
j
}is given by
Note that this is a measure of the statistical significance of the number of shared interacting partners, which is modeled as a hypergeometric random variable.
Our algorithm uses mutual clustering coefficient as a tiebreaker. In other words, at each step, it merges two clusters with maximum proximity if this pair of clusters is unique. Otherwise, among all pairs of clusters with maximum proximity, it merges the pair with largest mutual clustering coefficient. Consequently, at early stages of the algorithm, mutual clustering coefficient effectively acts as the primary similarity criterion. On the other hand, the proximity criterion ensures formation of clusters with the minimal diameter throughout the course of the algorithm. Once a hierarchical clustering is obtained using this method, we identify the modular subgraphs by properly choosing thresholds on proximity. In Section Testing, we evaluate the effect of the proximity threshold on the performance of our algorithms in detail.
Constructing Module Maps
Once modular subgraphs are identified on each network, we project these subgraphs to all networks and construct a module map for each network. Module maps can be thought of as feature vectors, where features represent the modularity of each subgraph in the corresponding network. The modularity score of each module with respect to each network is calculated as in Equation 2. To be more precise, assume that we identify
m
_{
j
}modules in network
G
_{
j
}. Let the set of these modules be
. Then, for each network
G
_{
i
}, 1 ≤
i ≤
K, the module map
f
_{
ij
}of
G
_{
i
}with respect to
G
_{
j
}is an
m
_{
j
}dimensional vector, defined as
Consequently, the module map of network G
_{
i
}, 1 ≤ i ≤ K is represented as f
_{
i
}= [f
_{i1}, f
_{i2}, ..., f
_{
iK
}]. Network modules represent a particular functional component of each network. These modules, when considered altogether, provide a high level representation of each network. In other words, the module maps are signatures of each network and can be utilized to identify the overall phylogeny of these networks.
Phylogenetic Tree Reconstruction
Once modules are projected to each network and module maps are created, we use these module maps as feature vectors that characterize the cellular organization in each species. Namely, we compute the pairwise distance between each pair of species by comparing module maps, and apply a traditional phylogenetic tree reconstruction algorithm (i.e., neighbor joining [30]) based on these pairwise distances.
Clearly, a straightforward way of estimating evolutionary distances between pairs of species is to consider the correlation between their module maps, i.e., to define
. However, this method is likely to be significantly affected by the incompleteness of data, since some modules may not exist in some species just because of the unavailability of interaction data. For this reason, while computing the evolutionary distance between two species, we only consider the modules that are identified on the networks of these two species. This avoids the bias introduced by the large number of modules in species for which more comprehensive data is available. Furthermore, we consider onedirectional conservation of a single module as a bidirectional hit, to account for missing data. For instance, in the PPI data obtained from DIP, M. musculus (mouse) PPI data is relatively incomplete compared to H. sapiens (human) PPI data. However, available mouse network is quite similar to parts of the human network. Therefore, almost all modules identified in mouse PPI network are also conserved in human PPI network.
Consequently, by considering only the conservation of modules in the smaller network,
i.e., defining
we avoid the effect of missing interactions in the network when incomplete data is used.
Simulation of Network Evolution
In order to evaluate and calibrate our algorithms for network based phylogeny reconstruction, we apply our method to synthetic networks that are generated by simulating network evolution. The goal of this simulation is to generate a set of networks with known underlying phylogeny, construct a phylogenetic tree for these networks using our methods, and compare the underlying and reconstructed trees to evaluate the accuracy of our algorithms.
Network Evolution Models
We generate synthetic networks by utilizing a model that can accurately reflect the basics of network evolution. There are a variety of duplication based evolution models proposed to model the growth of PPI networks [21, 31, 32]. These models are inspired by theoretical models of molecular and functional evolution and aim at reproducing the topological properties observed on extant PPI networks. Among these models, we employ a slightly modified version of an iterative model called duplication mutationcomplementation (DMC), which is shown to regenerate the subgraph distribution of D. melanogaster PPI network significantly better than other network generation models [20]. The DMC model works in iterations and each iteration consists of three steps. (1) Gene duplication: A protein v
_{
o
}is chosen uniformly at random and duplicated to create a new protein v, with all interactions of v
_{
o
}. (2) Mutation/Complementation: For each pair of interactions v
_{
o
}
v' and vv', one is chosen and deleted with a probability p, based on the notion that the duplication results in redundant protein functionality, relaxing the evolutionary pressure on preserving one of the redundant interactions. (3) SelfInteraction: An interaction is added between v
_{
o
}and v with probability p'. In our experiments, we choose p = 0.7 and p' = 0.1 as suggested in [20].
Incorporating Speciation
Unlike other network generation models that focus on evolving a single network, in this work, we evolve multiple networks in parallel to construct a phylogenetic tree of networks. Starting with a single network, the networks evolve concurrently with respect to an evolutionary timeline. We assume that, at each branch, speciation occurs with a constant rate r
_{
s
}, resulting in two networks and splitting the evolution process into two new branches.
Throughout network evolution, a species S = (G, n
_{
c
}, n
_{
f
}) is specified by its PPI network G, the current size of its network n
_{
c
}, and the target size n
_{
f
}. The steps taken while generating the phylogenetic tree of these species can be summarized as follows: Initially, the phylogenetic tree has only one species at the root position with (G, n
_{
c
}, n
_{
f
}), where G contains only two interacting proteins, i.e., n
_{
c
}is set to 2. The value of n
_{
f
}is assigned randomly from a Gaussian distribution (in our experiments, we use μ = 3000, σ = 1000 to generate diverse networks with size that reflect that of extant networks accurately).
At each iteration t of the evolutionary process, for each species S that has not reached to the target interaction network size, i.e., n
_{
c
}
< n
_{
f
}, the following actions are performed:

S is evolved by a single iteration of the DMC model. This results in the addition of a new node to G, incrementing n_{
c
}by one.

If the network size reaches the target value after this addition, i.e., n
_{
c
}= n
_{
f
}, the network is recorded as a leaf network.

Otherwise, a speciation event takes place for S with probability r
_{
s
}. This process is analogous to generating two new species and replacing the original S with an internal node. These new species initially have the same set of interactions with their parent. They are also assigned new n
_{
f
}values right after the speciation, chosen from the same distribution. Note that, common proteins in these new species are recorded as homologs.
When all species reach their targeted network sizes, the simulation terminates with leaves representing the resulting species. Note that each speciation event increases the total number of species by one. The desired number of networks for a given distribution of target network size (number of proteins in each network) can be achieved by properly adjusting r
_{
s
}and the number of iterations.
The evolutionary distances between species are recorded according to the occurrence of speciation events with respect to evolutionary timeline. This enables evaluation of our algorithms in terms of its accuracy in estimating evolutionary distances, as well as the topology of the resulting phylogeny. Furthermore, to obtain similarity scores between proteins in different species, we record homologs after each speciation event and assign scores to each pair of proteins based on the number of duplications that occur after their common ancestor is split into two different proteins. More precisely, we set ψ (P
_{
i
}, P
_{
j
}) = 1/D
_{
ij
}, where D
_{
ij
}is the number of duplication events that occur after the common ancestor of these two proteins is split.
Testing
In this subsection, we evaluate the performance of MOPHY on simulated, as well as real data  in terms of (i) success in accurately reconstructing the underlying phylogeny, (ii) robustness to noise and missing data, and (iii) performance as compared to existing algorithms.
Results on Simulated Data
We test our method on synthetic networks generated by simulation of the evolutionary process. In our experiments, in order to keep the size of the networks at a realistic scale with sufficient variability, we set the average number of proteins in a network to 3000, with a standard deviation of σ = 1000. Here, average network size is kept relatively smaller as compared to that of extant networks for feasibility constraints, since these experiments are performed multiple times to assess statistical significance and the effect of varying parameters. Our results on extant networks show that the method also scales to larger networks and is applicable in practice. Using this configuration, we generate ten networks for each experiment. For all experiments, we generate five different instances, and for each performance figure, we report the average over these five instances. Note that, in these experiments, the interactions are not associated with reliability scores.
Evaluating performance:Comparison of phylogenetic trees
In order to quantify the performance of a tree reconstruction method, it is necessary to compare the reconstructed tree with the underlying tree based on a sound measure of similarity between two phylogenetic trees. For this purpose, we investigate the similarity between the two phylogenetic trees by using two methods: (i) Symmetric Difference of Robinson and Foulds [33] is defined as the total number of partitions that are on one tree and not on the other. We use this measure as a metric of success in reconstructing the topology of the phylogenetic tree. (ii) Nodal distance [34] takes into account the branch lengths and computes a similarity metric by comparing the sum of distances of every node pair in each tree. We use this method to evaluate the performance of algorithms in capturing the evolutionary distances between different networks.
Comparison of performance with other methods
We compare the performance of MOPHY in reconstructing the correct phylogeny to that of four alternate methods

RDL: An existing method for networkbased phylogeny reconstruction, which uses relative description length to assess the similarity between networks [16].

Random Modules: This method implements an algorithm similar to that of MOPHY, but it uses random groups of proteins as modules. These random modules are selected in a way that they reflect the modules incorporated by MOPHY in terms of their quantity and size distribution. We use this method as a reference method to assess the contribution of the information on modularity in reconstructing the correct phylogeny.

Only Protein Similarity: This method incorporates only the similarities between proteins to reconstruct a phylogenetic tree. Namely, we still compute feature vectors for each network, but each entry of the feature vector represents the conservation (score of best sequence similarity match) of a single protein. The purpose of using this method as a reference is to assess the contribution of the use of network information (proximity and modularity) in reconstructing the correct phylogeny.

Random Homolog Selection: In this method, we investigate the impact of the assessment of homology on the performance of MOPHY. Namely, homologous proteins across different species are chosen randomly instead of using sequence similarity of proteins. By the comparison of MOPHY against this method, we aim to verify that the assessment of conservation in MOPHY is not arbitrary; MOPHY rather makes effective use of sequence homology to assess conservation of network modules.
The comparison of the performances of these methods over five different instances, obtained through simulation of network evolution, is shown in Table 1. As seen on the table, MOPHY performs drastically better than any of the four alternate methods in terms of minimizing the nodal distance between the correct evolutionary history and the reconstructed evolutionary history. Furthermore, the Random Modules method performs clearly better than RDL, suggesting that incorporation of network proximity, i.e., aggregation of interactions, is more useful than incorporation of network topology, i.e., incorporation of single interactions, in capturing the similarity of networks. However, comparison of the performances of Random Modules and Only Protein Similarity suggests that, when modularity is not considered, incorporation of network information provides marginal improvement. Results obtained by using Random Homolog Selection method are also significantly less accurate as compared to those obtained by MOPHY, indicating that MOPHY makes use of homology information provided by sequence similarity effectively.
Design parameters and module selection
In MOPHY, the module identification process can be tuned by adjusting several parameters: (i) The threshold on proximity adjusts the tradeoff between the tightness and comprehensiveness of modules (higher threshold on proximity results in smaller and more tightly coupled modules). Since the interactions in the simulated networks are unweighted, we use diameter, i.e., the maximum distance between two proteins in a module, to represent the proximity threshold. (ii) As multiple modules are identified in each network, using all modules in phylogeny reconstruction may lead to problems associated with highdimensionality. Therefore, we investigate the effect of network coverage provided by the modules considered, where coverage is defined as the percentage of proteins included in the selected modules. (iii) In order to understand which modules are more informative, we consider two different module selection strategies: most specific, i.e., the set of smallest (with size ≥ 3) modules for a given coverage or most comprehensive, i.e., the set of largest modules for a given coverage.
Performance of MOPHY for different parameters
Detailed statistics on the comparison of underlying and reconstructed phylogenetic trees for systematic experiments on simulated instances are shown in Tables 2, 3, and in Figure 2. To evaluate the performance of MOPHY statistically, we evaluate the statistical significance of its performance with respect to the Random Modules method. The performance difference between MOPHY and Random Modules can be thought of as an indicator of the usefulness of relying on conservation of modular network structures as opposed to arbitrary (groups of) proteins and their interactions. We quantify the statistical significance of the performance difference between MOPHY and Random Modules based on Student's ttest to compare the means of two populations. The pvalue for an experiment gives the probability that an algorithm that incorporates sequence conservation and network proximity, but not modularity can achieve as good as MOPHY solely based on chance. As seen in Table 2, for any configuration of parameters, the accuracy of the topology of the phylogenetic tree reconstructed by MOPHY is highly significant. In general, more specific (smaller) modules appear to be more informative. Indeed, as seen in Table 3, when evolutionary distances are considered, the performance with more comprehensive (larger) modules is not statistically significant. Furthermore, performance degrades with increasing diameter (less proximity), suggesting that conservation of tightly coupled modules is more informative in reconstructing evolutionary histories. The effect of coverage on performance is shown in Figure 2(a) and 2(b). When more specific modules are used, the effect of coverage on performance is marginal. This indicates that careful selection of a concise set of small, tightly coupled modules may be adequate to reconstruct network phylogenies accurately. Finally, it is interesting to note that the randomized method performs better with large clusters, which is probably due to the increased likelihood that a random group of proteins will contain an informative subset of proteins.
Robustness against noise and missing data
Currently available PPI data is likely to be highly noisy and incomplete. Hence, we evaluate the robustness of MOPHY against random noise and incompleteness of data. For the purpose of observing the effect of noise, after generating the networks via simulation of network evolution, we randomly perturb the resulting networks by repeatedly swapping randomly selected interactions. Furthermore, in order to evaluate the effect of missing interactions in our experiments, we apply two interaction removal strategies, namely uncorrelated and correlated removals. For the uncorrelated removal method, a certain percentage of protein interactions in each separate network is removed randomly. Whereas for the latter removal method, if an interaction is selected for removal from one network, then one of the interactions among the homologs of the interacting proteins is also removed from all the other networks where it exists.
The behavior of the performance of MOPHY with respect to noise rate (percentage of interactions that are swapped) and missing interactions is shown in Figure 2(c). These experiments are performed for diameter = 3, coverage = 60%. As seen in the figure, although the accuracy of MOPHY decreases with noise and missing interactions as expected, the performance difference between MOPHY and the randomized method is significant even at the presence of 50% noise or 40% correlated missing data.
This observation suggests that MOPHY can be used to extract meaningful information on evolutionary histories of networks even when the networks are highly noisy and incomplete. On the other hand, the performance of MOPHY degrades more rapidly with increasing number of uncorrelated missing interactions. This is expected since in the case of uncorrelated missing interactions, after a sufficient number of interactions are removed, the network distance between the homologs of two interacting proteins in one species becomes infinite in the network of another species.
However, as evident in Figure 2(c), MOPHY's performance is significant (p <0.01) with respect to the random algorithm even when 20% of the interactions are removed from the networks at random. Moreover, note that the performance of the randomized method is not affected by noise, and the performance of MOPHY becomes equivalent to that of the randomized method at the presence 100% noise (i.e., random edge swapping is repeated for a sufficiently large number of iterations). These results indicate that the biological signals captured by MOPHY depend on network topology and the use of network proximity and modularity provide significant information on conservation of function that is beyond sequence similarity.
Results on Extant PPI Networks
We test our method on the available PPI networks from seven diverse species. The PPI data is obtained from the Database of Interacting Proteins (DIP) [35]. These networks include those of D. melanogaster (7471 proteins, 22656 interactions), S. cerevisiae (4968 proteins, 17286 interactions), E. coli (1848 proteins, 5930 interactions), C. elegans (2646 proteins, 3977 interactions, H. sapiens (1334 proteins, 1539 interactions), H. pylori (710 proteins, 1359 interactions), and M. musculus (414 proteins, 337 interactions). Although the network sizes in this database vary dramatically for different species, MOPHY can effectively deal with such incompleteness by considering each pair of species separately and considering the conservation smaller network's modules in the larger network (as discussed in Section Phylogenetic Tree Reconstruction).
To reconstruct the phylogeny of these seven networks via MOPHY, we use the most specific modules that contain at least three proteins and set the coverage to 50%. While identifying homolog proteins, we set the BLAST Evalue threshold score ω to 0.05. As in our experiments on simulated data, we compare MOPHY with four alternate methods; (i) RDL, (ii) using random modules, (iii) using only protein similarities, and (iv) random homolog selection method. For reference, we also consider the phylogenetic tree that is reconstructed based on sequenced genomes [36], which is shown in Figure 3(a). The phylogenetic trees reconstructed based on the seven PPI networks by MOPHY, RDL, using only protein similarities, using random modules and random homolog selection method are shown in Figures 3(b), (c), (d), (e)) and 3(f) respectively. Unlike other methods, the tree reconstructed by MOPHY complies well with common knowledge on the underlying phylogeny of these seven diverse species and is also consistent with the whole genome based phylogeny. It should be noted that, as evident in Figure 3, networkbased distance measures tend to overestimate evolutionary distances between extant species. Therefore, methods for normalizing the estimated distances between networks are necessary.
Incidentally, these results also provide evidence supporting the Coelomata topology in the Coelomata vs. Ecdysozoa debate regarding the evolutionary relationship between nematodes, anthropodes, and vertebrates, which has also been supported recently through rigorous analysis of the conservation patterns in intron positions [37]. It is worth to note that, due to limited availability of data, PPI networks differ significantly in size from one species to another. This actually introduces a lot of artificial variation between networks, which might, on a common graph measure, overwhelm desired biological signals. Indeed, as seen in Figure 3(c), RDL is significantly affected by the variability in data availability; it assigns mouse PPI network to the same clade with prokaryotic networks, presumably because the interaction data for this species is quite limited. On the other hand, by focusing on the signals harbored by some more informative modules, we avoid the interference of this global difference among networks. The discriminative power of MOPHY suggests that functional modules identified on PPI networks are important carriers of evolutionary messages. These functional hotspots convey some information beyond that of the apparent graphical variation among the networks, which help overcome the artificial bias commonly introduced to PPI networks by noise or unavailability of proteinprotein interactions.