Several methods have been proposed for assessing the functional similarity of biological entities (genes, proteins, domains) [17–19]. Since the functional categories into which these entities are classified are themselves interrelated through a taxonomy (e.g., Gene Ontology), similarity measures must consider the underlying taxonomy when comparing molecules in terms of their functional annotation [20]. Various approaches take into account different factors, including taxonomic distance, specificity/generality (rank in the hierarchy) of common ancestors, and the number of molecules associated with the functional terms being compared (statistical significance or information content). Since most molecules are associated with multiple functional terms, assessing the functional similarity between two molecules poses the additional challenge of evaluating the similarity between two sets of terms, as opposed to a pair of terms. In [9], we developed an information theoretic measure for computing the similarity of two sets of terms associated with a pair of molecules, and showed that this measure is superior to other composite measures computed by applying associative operators (average, max, etc.) to pairwise term similarity measures.
In this paper, we generalize and extend our results to quantify the functional coherence (or similarity) of a set of biomolecules (as opposed to a pair). Since each molecule corresponds to a set of annotations, the problem is one of quantifying the coherence of a set of sets of terms. A straightforward approach would compute the similarity of each pair of molecules in the set and aggregate these pairwise similarities using associative operators (min, max, average). Pairwise similarities (similarity of two sets of annotations) may themselves be computed using our information theoretic measure. An alternate approach to the problem, proposed in this paper, computes the coherence of the set of molecules without computing intermediate pairwise similarity scores. We show that the latter approach is strictly superior to the former in quantifying the coherence of a set of biomolecules. We validate this claim by applying our proposed measure, along with several other currently used measures, to a test group of known functionally related proteins. We also apply the measures to randomly generated groups and identify the measures that induce the greatest separation between the test and random groups.
Finally, in order to study the correlation between functional coherence and topological proximity in networks, we also need a measure of topological proximity. Traditional measures of topological proximity rely on the shortest path between two nodes. While this measure is well suited to well-curated and complete datasets, it is susceptible to missing interactions and noise. A single false positive or false negative may lead to a significant (erroneous) perturbation in shortest-path based measures. Measures based on random walks with restart [16], on the other hand, are more resilient to incomplete and noisy data. We consider both classes of measures of topological proximity, and evaluate their correlation with various functional similarity measures for both protein interaction (PPI) and domain interaction (DDI) networks. We show that a combination of random-walk based topological proximity and our similarity measure [9] yields the strongest correlation between network proximity and functional coherence.
Concepts and ontologies
Let $C = \{c_i \mid 1 \le i \le N\}$ be a finite partially ordered set of concepts. In terms of Gene Ontology (GO), these concepts represent the GO terms in the sub-ontologies (i.e., molecular function, biological process, and cellular component). Without loss of generality, we refer to concepts as terms throughout this paper. Terms are related to each other through "is a" and "part of" relationships, such that $c_i \to c_j$ denotes that $c_i$ is a/part of $c_j$. Note that, if $c_i \to c_j$, then the molecules associated with $c_i$ are also associated with $c_j$; this is known as the true path rule. Based on these relationships, we define a binary relation over $C$, denoted by $\preceq$. We say $c_j$ is an ancestor of $c_i$, denoted by $c_i \preceq c_j$, if and only if either $c_i \to c_j$, or for some $\ell \ge 1$, there exist $c_{k_m}$ for $1 \le m \le \ell$ such that $c_i \to c_{k_1}$, $c_{k_m} \to c_{k_{m+1}}$ for $1 \le m < \ell$, and $c_{k_\ell} \to c_j$ ($c_j$ is an ancestor of $c_i$ in the GO hierarchy). Two terms $c_i$, $c_j$ are comparable, denoted by $c_i \sim c_j$, if either $c_j \preceq c_i$ or $c_i \preceq c_j$. If $c_i$ and $c_j$ are comparable, then the length of the shortest path between $c_i$ and $c_j$ is given by $L(c_i, c_j) = L(c_j, c_i) = \ell + 1$ for the minimum such $\ell$ (with $\ell = 0$ when the two terms are directly related).
We denote the set of ancestors of a term $c_i$ by $A_i = \{c_k \in C \mid c_i \preceq c_k\}$. Note that not all ancestors of a term are comparable, since the GO hierarchy is a directed acyclic graph, as opposed to a tree. We represent the root term of GO with a terminal concept $r$, such that $c_i \preceq r$ for all $c_i \in C$.
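The ancestor sets defined above can be obtained by a simple traversal of the parent links. The following Python fragment is a minimal sketch under stated assumptions: the toy ontology (`PARENTS`, with terms `r`, `c2`, `c3`, `c5`) is hypothetical, and the convention that a term belongs to its own ancestor set is assumed here so that self-similarities are well defined.

```python
def ancestors(term, parents):
    """Return the ancestor set of `term`, including `term` itself.

    The reflexive convention is an assumption of this sketch.
    """
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def comparable(a, b, parents):
    """Two terms are comparable if one is an ancestor of the other."""
    return a in ancestors(b, parents) or b in ancestors(a, parents)


# Hypothetical toy ontology: child -> direct parents (is-a / part-of links).
PARENTS = {
    "r": [],
    "c2": ["r"],
    "c3": ["r"],
    "c5": ["c2", "c3"],  # two parents: the hierarchy is a DAG, not a tree
}

print(sorted(ancestors("c5", PARENTS)))  # ['c2', 'c3', 'c5', 'r']
print(comparable("c2", "c3", PARENTS))   # False, although both are ancestors of c5
```

The last line illustrates why ancestors of a single term need not be comparable in a directed acyclic graph.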
Semantic similarity of terms
Semantic similarity measures quantify the similarity between two terms based on the underlying taxonomical relationships. The information content based measure of semantic similarity quantifies similarity between a pair of terms by taking into account the distribution of terms among molecules. Specifically, it rewards shared terms that are infrequent over those that are frequent. Let $G_c$ be the set of molecules associated with term $c$ in the available database, with $G_r$ being the set of all molecules. The information content of a term is defined as $I(c) = -\log_2(|G_c|/|G_r|)$ [20]. Clearly, $I(r) = 0$, and as a consequence of the true path rule, $I(c_j) \ge I(c_i)$ for $c_j \preceq c_i$. Then, the semantic similarity between two terms is defined as
$$\delta_I(c_i, c_j) = \max_{c \in A_i \cap A_j} I(c) = I(\lambda(c_i, c_j)).$$
Here, $\lambda(c_i, c_j) = \arg\max_{c \in A_i \cap A_j} I(c)$ is said to be the minimum common ancestor of $c_i$ and $c_j$.
Observe that this measure does not take into account the specificity of terms with identical common ancestors. This problem can be alleviated by normalizing the similarity between two terms by the self-similarities of the terms being compared, e.g., by
$$\delta_{JC}(c_i, c_j) = \frac{1}{I(c_i) + I(c_j) - 2 I(\lambda(c_i, c_j)) + 1}$$
[21]. Note that this measure has a well-defined maximum of 1 and offers a bounded interpretation (ranging from 0 to 1) of Resnik's metric. We now generalize these term-similarity measures to set-similarity.
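As an illustration of the term-level quantities discussed in this section, the sketch below computes information content, the information content of the most informative common ancestor (Resnik's metric), and a normalized variant of the form $1/(I(c_i) + I(c_j) - 2I(\cdot) + 1)$. The ontology (`PARENTS`) and the annotation counts (`COUNTS`) are hypothetical toy data, and the function names are ours.

```python
import math

# Hypothetical toy data: direct parents and number of annotated molecules.
PARENTS = {"r": [], "c1": ["r"], "c2": ["r"], "c3": ["r"],
           "c4": ["c2"], "c6": ["c3"]}
COUNTS = {"r": 6, "c1": 2, "c2": 3, "c3": 3, "c4": 3, "c6": 3}


def ancestors(term):
    """Ancestor set of a term, including the term itself (assumed)."""
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def info(term):
    """I(c) = -log2(|G_c| / |G_r|), written as log2(|G_r| / |G_c|)."""
    return math.log2(COUNTS["r"] / COUNTS[term])


def resnik(ci, cj):
    """Information content of the most informative common ancestor."""
    return max(info(c) for c in ancestors(ci) & ancestors(cj))


def jiang_conrath(ci, cj):
    """Similarity normalized by the self-similarities of both terms."""
    return 1.0 / (info(ci) + info(cj) - 2.0 * resnik(ci, cj) + 1.0)


print(resnik("c4", "c2"))         # 1.0 (the common ancestor c2 has I = 1)
print(resnik("c4", "c6"))         # 0.0 (only the root is shared)
print(jiang_conrath("c4", "c4"))  # 1.0 (identical terms)
```

In this toy setting, two terms that share only the root obtain a normalized similarity of 1/3, whereas identical terms reach the maximum of 1.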
Functional similarity of molecules
Biomolecules are generally associated with multiple molecular functions and are often involved in multiple processes. Consequently, annotations of molecules correspond to sets of terms, as opposed to individual terms. While assessing the similarity of sets of terms, we assume that the sets are non-redundant, i.e., each set consists of terms that are not comparable. This can be easily enforced by ensuring that each branch in the hierarchy is represented by at most one term in each set. In GO, this involves considering only the most specific annotations associated with a gene, which provides a non-redundant representation of functional annotation. In this representation, the association between the gene and the ancestors of the most specific term is implied by the true path rule.
An important challenge in the assessment of the functional coherence of sets is that these sets are often incomplete (that is, for many molecules, some of their functions are unknown). Therefore, a reliable measure is one that rewards the abundance of similar terms in the two sets, but does not penalize the existence of unrelated terms in one of the sets, since the relation between these terms and the other set may be currently unknown. Simple associative measures that aggregate the similarity of pairs of terms in the two sets, such as average ($\rho_A$) [17], maximum ($\rho_M$) [22], or average of maximums ($\rho_H$) [18], do not satisfy these properties [9].
Motivated by these considerations, in prior work, we extended the notion of minimum common ancestors to sets of terms, and generalized the concept of information content from a single term to a set of terms [9]. Let
$$\Lambda(S_i, S_j) = \bigsqcup_{c_k \in S_i,\, c_l \in S_j} \lambda(c_k, c_l)$$
be the minimum common ancestor set of term sets $S_i$ and $S_j$, where $\lambda(c_k, c_l)$ is the minimum common ancestor of terms $c_k$ and $c_l$, and $\sqcup$ denotes a generalized union operator that preserves non-redundancy by keeping the most specific terms. The similarity between two term sets is defined as the information content of the set of minimum common ancestors, i.e.,
$$\rho_I(S_i, S_j) = I(\Lambda(S_i, S_j)) = -\log_2 \frac{|G_{\Lambda(S_i, S_j)}|}{|G_r|},$$
where $G_{\Lambda(S_i, S_j)} = \bigcap_{c \in \Lambda(S_i, S_j)} G_c$ is the set of molecules that are associated with all terms in the set $\Lambda(S_i, S_j)$. Note that $\rho_I$ also needs to be normalized with respect to self-similarities, i.e., $\rho_{JC}(S_i, S_j) = 1/(\rho_I(S_i, S_i) + \rho_I(S_j, S_j) - 2\rho_I(S_i, S_j) + 1)$.
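The set-level construction can be sketched as follows. This is an illustrative Python fragment rather than the reference implementation: the ontology (`PARENTS`), the molecule-to-term assignments (`GENES`), and all function names are hypothetical, ties between equally informative common ancestors are broken toward the deeper term (an assumption), and the sketch assumes that at least one molecule carries all terms of the minimum common ancestor set.

```python
import math

# Hypothetical toy ontology and annotations (closed under the true path rule).
PARENTS = {"r": [], "c1": ["r"], "c2": ["r"], "c3": ["r"],
           "c4": ["c2"], "c6": ["c3"]}
GENES = {
    "r": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "c1": {"g4", "g6"},
    "c2": {"g1", "g2", "g3"},
    "c3": {"g3", "g4", "g5"},
    "c4": {"g1", "g2", "g3"},
    "c6": {"g3", "g4", "g5"},
}


def ancestors(term):
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def info(term):
    return math.log2(len(GENES["r"]) / len(GENES[term]))


def mca(ci, cj):
    """Minimum common ancestor; ties broken toward the deeper term."""
    common = ancestors(ci) & ancestors(cj)
    return max(common, key=lambda c: (info(c), len(ancestors(c)), c))


def most_specific(terms):
    """Generalized union: drop every term that has a descendant in the set."""
    return {t for t in terms
            if not any(t != u and t in ancestors(u) for u in terms)}


def mca_set(si, sj):
    return most_specific({mca(a, b) for a in si for b in sj})


def rho_i(si, sj):
    """Information content of the molecules carrying ALL terms of the MCA set."""
    shared = set.intersection(*(GENES[c] for c in mca_set(si, sj)))
    return math.log2(len(GENES["r"]) / len(shared))


def rho_jc(si, sj):
    return 1.0 / (rho_i(si, si) + rho_i(sj, sj) - 2.0 * rho_i(si, sj) + 1.0)


S1, S3, S4 = {"c4"}, {"c4", "c6"}, {"c1", "c6"}
print(mca_set(S1, S3))  # {'c4'}
print(rho_i(S1, S3))    # 1.0
print(rho_i(S1, S4))    # 0.0
```

Note that the unrelated term `c6` in `S3` does not lower the similarity between `S1` and `S3`, which is the behavior motivated above for incomplete annotations.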
Functional coherence of modules
Let $\mathcal{R}$ be a set of $n$ molecular entities (genes, proteins, domains), with each entity being associated with a set of terms, i.e., $\mathcal{R} = \{S_1, S_2, \ldots, S_n\}$. We aim to develop a measure $\sigma(\mathcal{R})$ to assess the functional coherence of this set, such that a larger $\sigma$ indicates more semantic similarity between the terms in sets $S_1, S_2, \ldots, S_n$. Without loss of generality, we call $\mathcal{R}$ a module, since the objective here can also be considered as assessing the modularity of $\mathcal{R}$. We consider various measures to assess the functional coherence of a module, which are discussed below. In order to illustrate each measure, we use a running example based on the ontology shown in Figure 1. In the figure, let $\mathcal{R}_1 = \{S_1, S_2, S_3, S_4\}$ be a module that can be interpreted as a complex composed of two sub-complexes $\mathcal{R}_2 = \{S_1, S_2, S_3\}$ (with the shared term $c_4$) and $\mathcal{R}_3 = \{S_3, S_4\}$ (with the shared term $c_6$), in which $S_3$ "bridges" the two sub-complexes $\mathcal{R}_2$ and $\mathcal{R}_3$.
Average of pairwise information content
A straightforward way of computing set coherence is to compute the average of the $n(n-1)/2$ pairwise set similarity scores [19, 23]:
$$\sigma_A(\mathcal{R}) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \rho(S_i, S_j),$$
where $\rho(S_i, S_j)$ denotes the pairwise similarity score of term sets $S_i$ and $S_j$. In our running example, the average pairwise information content of the molecules in complex $\mathcal{R}_1$ is given by $\sigma_A(\mathcal{R}_1) = (I(c_4) + I(c_4)/2 + 0 + I(c_4)/2 + 0 + I(c_6)/4)/6 = 3/8$, while that of sub-complex $\mathcal{R}_2$ is given by $\sigma_A(\mathcal{R}_2) = (I(c_4) + I(c_4)/2 + I(c_4)/2)/3 = 2/3$, given that $I(c_4) = I(c_6) = -\log_2(3/6) = 1$. Bridged complexes get a lower score than specialized complexes due to differences in sub-complex annotations.
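The arithmetic of the running example can be checked with a short Python sketch. Since Figure 1 is not reproduced here, the ontology (`PARENTS`) and the term sets are a hypothetical reconstruction chosen to be consistent with the values quoted in the text. The pairwise terms of the example (e.g., $I(c_4)/2$) are consistent with averaging term similarities over all term pairs of two sets, so that pairwise score is assumed below.

```python
import math
from itertools import combinations, product

# Hypothetical reconstruction of the running example.
PARENTS = {"r": [], "c1": ["r"], "c2": ["r"], "c3": ["r"],
           "c4": ["c2"], "c6": ["c3"]}
# Information content values as quoted in the text.
INFO = {"r": 0.0, "c1": math.log2(3), "c2": 1.0, "c3": 1.0,
        "c4": 1.0, "c6": 1.0}


def ancestors(term):
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def resnik(ci, cj):
    return max(INFO[c] for c in ancestors(ci) & ancestors(cj))


def rho_average(si, sj):
    """Assumed pairwise score: mean term similarity over all term pairs."""
    scores = [resnik(a, b) for a, b in product(si, sj)]
    return sum(scores) / len(scores)


def sigma_average(module, rho=rho_average):
    """Average of the n(n-1)/2 pairwise set similarity scores."""
    scores = [rho(si, sj) for si, sj in combinations(module, 2)]
    return sum(scores) / len(scores)


S1, S2, S3, S4 = ["c4"], ["c4"], ["c4", "c6"], ["c1", "c6"]
print(sigma_average([S1, S2, S3, S4]))  # 0.375, i.e. 3/8
print(sigma_average([S1, S2, S3]))      # approximately 0.667, i.e. 2/3
```

Any other pairwise set similarity can be substituted through the `rho` argument.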
Generalized information content
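It is possible to extend the notion of the minimum common ancestor of pairs of terms to tuples of terms as
$$\lambda(c_1, c_2, \ldots, c_n) = \arg\max_{c \in \bigcap_{i=1}^{n} A_i} I(c).$$
In other words, the minimum common ancestor of a set of $n$ terms is defined as the most specific among the terms that are common ancestors of all $n$ terms in the set. Then, for each $n$-tuple $c_1 \in S_1, c_2 \in S_2, \ldots, c_n \in S_n$, the functional coherence of these terms can be quantified as $I(\lambda(c_1, c_2, \ldots, c_n))$. Consequently, the minimum common ancestor set of $S_1, S_2, \ldots, S_n$ can be computed as
$$\Lambda(\mathcal{R}) = \bigsqcup_{c_1 \in S_1, \ldots, c_n \in S_n} \lambda(c_1, c_2, \ldots, c_n),$$
leading to a generalization of the information content based measure:
$$\sigma_I(\mathcal{R}) = I(\Lambda(\mathcal{R})) = -\log_2 \frac{|G_{\Lambda(\mathcal{R})}|}{|G_r|}.$$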
In our running example, since $\lambda(c_4, c_1) = \lambda(c_4, c_6) = \lambda(c_4, c_1, c_6) = r$, we have $\Lambda(\mathcal{R}_1) = \{r\}$; thus, the generalized information content of complex $\mathcal{R}_1$ is $\sigma_I(\mathcal{R}_1) = I(r) = 0$. On the other hand, since $\Lambda(\mathcal{R}_2) = \{c_4\}$, we have $\sigma_I(\mathcal{R}_2) = I(c_4)$. As illustrated by this example, $\sigma_I$ is a rather conservative measure of functional coherence, and it only rewards specialized modules in which all molecules share very similar functions.
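The generalized measure can be sketched by enumerating one term per molecule, taking the most informative common ancestor of each tuple, and pruning the result to its most specific terms. The fragment below is a hedged illustration: the ontology, the annotations (`GENES`), and the function names are hypothetical, and the tie-breaking rule (deeper term first) is an assumption.

```python
import math
from itertools import product

# Hypothetical toy ontology and annotations (closed under the true path rule).
PARENTS = {"r": [], "c1": ["r"], "c2": ["r"], "c3": ["r"],
           "c4": ["c2"], "c6": ["c3"]}
GENES = {
    "r": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "c1": {"g4", "g6"},
    "c2": {"g1", "g2", "g3"},
    "c3": {"g3", "g4", "g5"},
    "c4": {"g1", "g2", "g3"},
    "c6": {"g3", "g4", "g5"},
}


def ancestors(term):
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def info(term):
    return math.log2(len(GENES["r"]) / len(GENES[term]))


def mca(terms):
    """Most informative ancestor common to ALL terms of the tuple."""
    common = set.intersection(*(ancestors(c) for c in terms))
    return max(common, key=lambda c: (info(c), len(ancestors(c)), c))


def most_specific(terms):
    return {t for t in terms
            if not any(t != u and t in ancestors(u) for u in terms)}


def sigma_generalized(module):
    """Information content of the minimum common ancestor set of a module."""
    lam = most_specific({mca(combo) for combo in product(*module)})
    shared = set.intersection(*(GENES[c] for c in lam))
    return math.log2(len(GENES["r"]) / len(shared))


S1, S2, S3, S4 = ["c4"], ["c4"], ["c4", "c6"], ["c1", "c6"]
print(sigma_generalized([S1, S2, S3, S4]))  # 0.0, since only r is common
print(sigma_generalized([S1, S2, S3]))      # 1.0, i.e. I(c4)
```

The enumeration grows with the product of the set sizes, so this direct form is only practical for small modules.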
Graph information content
We extend the graph information content measure proposed by Pesquita et al. [24]. The idea behind this approach is that, if a group of molecules is coherent, then the information content of the DAG induced by the intersection of ancestors is close to the information content of the DAG induced by the union of ancestors. In other words, defining $\mathcal{A}(S_i) = \bigcup_{c_k \in S_i} A_k$ as the ancestor set of $S_i$, the graph information content of set $\mathcal{R}$ is defined as
$$\sigma_G(\mathcal{R}) = \frac{\sum_{c \in \bigcap_{i=1}^{n} \mathcal{A}(S_i)} I(c)}{\sum_{c \in \bigcup_{i=1}^{n} \mathcal{A}(S_i)} I(c)}.$$
Observe that, if all molecules are annotated with the same set of terms, $\sigma_G(\mathcal{R})$ would be equal to one, and zero if they have no common terms. Similar to $\sigma_I$, a drawback of this measure is its sensitivity to outliers; that is, if a single molecule in the set is sufficiently different in function, it has a significant impact on the score. Indeed, in our running example, we have $\sigma_G(\mathcal{R}_1) = 0$, since the only term shared by all four molecules is $r$ and $I(r) = 0$, while $\sigma_G(\mathcal{R}_2) = (I(c_4) + I(c_2))/(I(c_4) + I(c_2) + I(c_6) + I(c_3)) = 1/2$, given that $I(c_2) = I(c_3) = -\log_2(3/6) = 1$.
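The ratio of shared to total information content can be checked on the running example with the following sketch. The ontology and term sets are a hypothetical reconstruction consistent with the values quoted in the text (Figure 1 is not reproduced here), and the guard against a zero denominator is our addition.

```python
import math

# Hypothetical reconstruction of the running example.
PARENTS = {"r": [], "c1": ["r"], "c2": ["r"], "c3": ["r"],
           "c4": ["c2"], "c6": ["c3"]}
INFO = {"r": 0.0, "c1": math.log2(3), "c2": 1.0, "c3": 1.0,
        "c4": 1.0, "c6": 1.0}


def ancestors(term):
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def ancestor_set(terms):
    """All terms implied by a molecule's annotations (true path rule)."""
    return set().union(*(ancestors(c) for c in terms))


def sigma_graph(module):
    """IC of terms shared by ALL molecules over IC of terms of ANY molecule."""
    sets = [ancestor_set(s) for s in module]
    shared = set.intersection(*sets)
    everything = set.union(*sets)
    total = sum(INFO[c] for c in everything)
    return sum(INFO[c] for c in shared) / total if total > 0 else 0.0


S1, S2, S3, S4 = ["c4"], ["c4"], ["c4", "c6"], ["c1", "c6"]
print(sigma_graph([S1, S2, S3, S4]))  # 0.0
print(sigma_graph([S1, S2, S3]))      # 0.5
```

Adding the single molecule `S4` drives the score from 1/2 to 0, which illustrates the sensitivity to outliers noted above.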
Weighted information content
Complexes are functionally cohesive modules, but they are often composed of sub-complexes, each performing a specific part of the general function of the complex [25]. However, as illustrated by our running example, generalized information content ($\sigma_I$) and graph information content ($\sigma_G$) require all molecules to be functionally coherent with each other for the module to be considered coherent. In order to provide a more relaxed and biologically motivated measure of functional coherence, we consider shared functionality between all combinations of molecules and weigh the information content of shared functionality by the number of molecules that contribute to the shared functionality.
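Specifically, let $\mathcal{A}(S_i)$ denote the ancestor set of $S_i$, and let $\mathcal{U}_i = \mathcal{A}(S_i) \setminus \bigcup_{j \ne i} \mathcal{A}(S_j)$ be the set of terms in the ancestor set of $S_i$ that are not shared with any other molecule in $\mathcal{R}$. Then, the weighted information content of set $\mathcal{R}$ is defined as the ratio of the information content of all terms that are shared by at least two molecules to the information content of all terms associated with at least one molecule in the set, with each term counted once for every molecule that carries it; that is:
$$\sigma_W(\mathcal{R}) = \frac{\sum_{i=1}^{n} \sum_{c \in \mathcal{A}(S_i) \setminus \mathcal{U}_i} I(c)}{\sum_{i=1}^{n} \sum_{c \in \mathcal{A}(S_i)} I(c)}.$$
In other words, we consider all the partial DAGs ($\mathcal{A}(S_i)$) generated by each $S_i$ in $\mathcal{R}$. All the terms that are part of an overlapping DAG correspond to shared information among those proteins. The numerator in the above equation corresponds to the information content of the overlapping DAG, while the denominator normalizes that score with the total information of the combined DAG. In our running example, we have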
$\sigma_W(\mathcal{R}_1) = (3I(c_4) + 3I(c_2) + 2I(c_6) + 2I(c_3))/(3I(c_4) + 3I(c_2) + 2I(c_6) + 2I(c_3) + I(c_1)) \approx 0.86$ and $\sigma_W(\mathcal{R}_2) = (3I(c_4) + 3I(c_2))/(3I(c_4) + 3I(c_2) + I(c_6) + I(c_3)) = 3/4$, given that $I(c_1) = -\log_2(2/6) \approx 1.6$. Since members of the module $\mathcal{R}_1$ share all functions other than $c_1$, this measure captures the coherence of the bridged module better than the other methods. This method only penalizes functions that a member does not share with the rest of the module.
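The weighting by the number of contributing molecules can be made concrete with a short sketch that counts, for every term, how many molecules of the module carry it. As before, the ontology and term sets are a hypothetical reconstruction consistent with the values quoted in the text, and the function names are ours.

```python
import math
from collections import Counter

# Hypothetical reconstruction of the running example.
PARENTS = {"r": [], "c1": ["r"], "c2": ["r"], "c3": ["r"],
           "c4": ["c2"], "c6": ["c3"]}
INFO = {"r": 0.0, "c1": math.log2(3), "c2": 1.0, "c3": 1.0,
        "c4": 1.0, "c6": 1.0}


def ancestors(term):
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def ancestor_set(terms):
    return set().union(*(ancestors(c) for c in terms))


def sigma_weighted(module):
    """Weighted IC: terms shared by >= 2 molecules over all terms,
    each term weighted by the number of molecules that carry it."""
    counts = Counter(c for s in module for c in ancestor_set(s))
    shared = sum(n * INFO[c] for c, n in counts.items() if n >= 2)
    total = sum(n * INFO[c] for c, n in counts.items())
    return shared / total if total > 0 else 0.0


S1, S2, S3, S4 = ["c4"], ["c4"], ["c4", "c6"], ["c1", "c6"]
print(round(sigma_weighted([S1, S2, S3, S4]), 2))  # 0.86
print(sigma_weighted([S1, S2, S3]))                # 0.75
```

In contrast to the measures that require a term to be shared by all molecules, the bridged module here scores higher than its sub-complex, because only the unshared term `c1` is penalized.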
Post-processing coherence scores
We now discuss how coherence scores are processed to make them comparable against each other for different module sizes and across various sub-ontologies.
Combination of sub-ontology scores
The scores discussed above can be based on any of the three sub-ontologies of GO. Since cellular component annotations are sparser than annotations of biological process and molecular function, we use the method proposed by Schlicker et al. [26]. For pairs of molecules, we combine the two coherence scores obtained from biological process and molecular function ontologies as:
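$$\rho = \frac{1}{2}\left[\left(\frac{\rho^{(BP)}}{\max \rho^{(BP)}}\right)^{2} + \left(\frac{\rho^{(MF)}}{\max \rho^{(MF)}}\right)^{2}\right],$$
where $\max \rho^{(BP)}$ and $\max \rho^{(MF)}$ are the maximum possible scores for biological process and molecular function, respectively. Module coherence scores ($\sigma$) are based only on the biological process ontology.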
Accounting for module size
In order to compare modules of different sizes, we normalize the functional coherence scores based on a background distribution that characterizes the coherence of modules of identical size. Specifically, for a given module $\mathcal{R}$, we generate a sufficiently large number of random modules of size $|\mathcal{R}|$ and compute the functional coherence of each of these modules. We then compute the size-adjusted coherence score of $\mathcal{R}$ by normalizing $\sigma(\mathcal{R})$ with respect to the average functional coherence of these random modules.
Index of detectability
In order to compare various measures of functional coherence, we assemble a positive (test) group and a randomly selected (control) group of proteins. The positive set comprises proteins that are known to be functionally related based on prior biological knowledge (i.e., they are known to exist in complexes and perform related functions). Clearly, if we plot coherence values for samples from the test set and from the control set, we expect to see two distinct distributions: samples from the test group are expected to have higher coherence scores than those from the control group. The separation of the two distributions induced by each method indicates the fitness of the measure in quantifying coherence in sample sets, in terms of distinguishing coherent and arbitrary sets. This separation is quantified by an index of detectability that is proportional to the area under the binormal ROC curve [27]; the index is computed from the coherence scores of the test modules T and the control modules C.
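For concreteness, one standard form of the detectability index from signal detection theory is the difference of the group means divided by the root mean square of the group standard deviations; under a binormal model, the area under the ROC curve is then the standard normal CDF evaluated at this index divided by $\sqrt{2}$. The sketch below uses that form as an assumption; the exact normalization used in this work may differ, and the function names and sample scores are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def detectability(test_scores, control_scores):
    """Assumed form: (mean_T - mean_C) / sqrt((var_T + var_C) / 2)."""
    spread = math.sqrt((variance(test_scores) + variance(control_scores)) / 2.0)
    return (mean(test_scores) - mean(control_scores)) / spread


def binormal_auc(index):
    """Area under the binormal ROC curve for the index defined above."""
    return NormalDist().cdf(index / math.sqrt(2.0))


# Hypothetical coherence scores for test (T) and control (C) modules.
test_scores = [0.8, 0.9, 1.0]
control_scores = [0.1, 0.2, 0.3]

d = detectability(test_scores, control_scores)
print(round(d, 3))       # 7.0
print(binormal_auc(0.0)) # 0.5: identical distributions cannot be separated
```

A larger index means that the coherence measure separates known complexes from random protein sets more clearly.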
Measure for topological proximity
The most commonly used measure of topological proximity is graph distance, where the distance between a pair of nodes in a connected graph is defined as the length of the shortest path between them. In the context of biological networks, there are several drawbacks to this measure. It is particularly susceptible to missing or incorrect data: a single missing edge may reduce proximity significantly, while a single false edge may increase proximity incorrectly [28]. Furthermore, this measure does not take into account the global structure and connectedness of the graph, in particular the existence of alternate paths between a pair of nodes.
Nodes connected to each other via disjoint paths are likely to be functionally closer than nodes that are connected via a single path. Indeed, evidence suggests that multiple alternate paths between functionally associated proteins are often conserved through evolution, owing to their contribution to robustness against perturbations, as well as amplification of signals [11].
To alleviate these drawbacks, we consider an alternate measure that captures the multi-faceted relationship between a pair of nodes [16]. This measure uses a random walk with periodic restarts to estimate the affinity between pairs of nodes. In this model, the random walk is initiated at node $i$, with neighbor transition probability proportional to edge weight, and at each step, the walk returns to source node $i$ with probability $c$. The proximity of node $j$ to node $i$ is defined as the relative amount of time spent at node $j$ by such an infinite random walk. It can be shown that the proximity of all nodes to node $i$ can be computed iteratively as
$$\mathbf{x}_i^{(t+1)} = (1 - c)\, W \mathbf{x}_i^{(t)} + c\, \mathbf{e}_i.$$
Here, $W$ is the stochastic matrix derived from the adjacency matrix of the network, $\mathbf{e}_i$ is the restart vector with $\mathbf{e}_i(j) = 1$ if $j = i$ and 0 otherwise, and $\mathbf{x}_i^{(0)} = \mathbf{e}_i$. Then, the proximity of node $j$ to node $i$ is given by $\lim_{t \to \infty} \mathbf{x}_i^{(t)}(j)$. Repeating this procedure for all proteins, we obtain a matrix of network proximity scores for all pairs of proteins. Note, however, that this measure of proximity is not symmetric (the proximity of $j$ to $i$ is not necessarily equal to the proximity of $i$ to $j$). Therefore, we take the average of the two proximity values to compute the proximity between a pair of proteins. Using the proposed measures of functional coherence and the random-walk based measure of topological proximity, we quantify the relationship between topological proximity and functional coherence by computing the correlation of the resulting matrices.
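The iterative computation of random walk with restart can be sketched in plain Python as follows. The restart probability of 0.3, the convergence tolerance, the column normalization of the adjacency matrix, and the four-node path graph are illustrative assumptions, not values taken from this work.

```python
def rwr_proximity(adjacency, source, restart=0.3, tol=1e-10, max_iter=1000):
    """Proximity of every node to `source` under random walk with restart."""
    n = len(adjacency)
    # Column-normalize so that each column of W sums to one.
    col_sums = [sum(adjacency[r][k] for r in range(n)) for k in range(n)]
    W = [[adjacency[r][k] / col_sums[k] if col_sums[k] else 0.0
          for k in range(n)] for r in range(n)]
    e = [1.0 if k == source else 0.0 for k in range(n)]
    x = e[:]
    for _ in range(max_iter):
        # One step: follow an edge with probability 1 - restart, else restart.
        nxt = [(1.0 - restart) * sum(W[r][k] * x[k] for k in range(n))
               + restart * e[r] for r in range(n)]
        delta = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if delta < tol:
            break
    return x


def symmetric_proximity(adjacency, i, j, **kwargs):
    """Average of the two directed proximities between nodes i and j."""
    return 0.5 * (rwr_proximity(adjacency, i, **kwargs)[j]
                  + rwr_proximity(adjacency, j, **kwargs)[i])


# Hypothetical unweighted path graph 0 - 1 - 2 - 3.
ADJ = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]

x = rwr_proximity(ADJ, 0)
print([round(v, 3) for v in x])
print(round(symmetric_proximity(ADJ, 0, 3), 4))
```

For large networks, a sparse matrix representation would be preferable to the dense nested lists used here for clarity.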
Materials
We obtain protein interaction data for S. cerevisiae and S. pombe from the BioGRID database [29], version 2.0.51. We filter the dataset to obtain a set of physical interactions between proteins, i.e., genetic interactions are removed based on the experimental systems (e.g., knockout experiments) listed on the BioGRID website. Integr8 [30] is used to map the proteins in the interaction dataset to their Uniprot names, keeping only those proteins that we can map to a Gene Ontology term using Integr8.
We obtain domain interaction data from the DOMINE database [31], version 1.1. This dataset is composed of known, as well as predicted, domain interactions. We partition this dataset based on the source and quality of the data. Struct interactions are inferred from PDB entries of protein complexes and are collected from iPfam and 3did. Comp-2 interactions are predicted by at least two computational methods that infer domain interactions from protein interaction networks using techniques such as maximum likelihood estimation, or from co-evolution of conserved sites in protein sequences. HC+MC interactions consist of high and medium confidence interactions (for details, please refer to [31]).
To test the functional coherence of sets, we obtain positive and random cases from GRIP [32]. GRIP generates positive cases from the MIPS CYGD complex catalogue [33] by picking sets from known complexes. For the random (wildtype) cases, GRIP selects proteins at random. We generate a total of 16 datasets, of which eight are made up of positive cases and eight are random. Each dataset consists of 2000 sets of proteins (complexes), with set sizes ranging from four to eleven proteins.
Gene Ontology Annotation (GOA) [34] release 47.0 dated 2009/03/09 is used to obtain annotation information for Uniprot proteins. GOA combines manual and automated inferences of gene product annotations. The mapping of Pfam-A domains to their Gene Ontology functions is obtained from pfam2go http://www.geneontology.org/external2go/pfam2go released on 2009/03/04. We use only the Biological Process and Molecular Function sub-ontologies of Gene Ontology [6] version 1.550 for evaluation, since the coverage for the Cellular Component sub-ontology is relatively sparse.