Gene product functional similarity
Quantitative measures of functional similarity between gene products have been used in many applications, e.g., to validate high-throughput protein interactions, to help the development of new pathway modeling tools and clustering methods, and to enable the identification of functionally related gene products independent of homology [32, 34]. GO [30] provides a good vocabulary system for estimating the functional relationships between gene products.
In the GO structure, terms and their relationships are represented as directed acyclic graphs. GO-based semantic similarity measures can be classified into two categories. The first category defines semantic similarity based on the GO structure: the similarity between two gene products is estimated as the number of nodes the two gene products share divided by the total number of nodes in the two graphs. The other category is based on information content, which is defined by the frequency with which each GO term occurs in an annotated data set [32]. These methods assume that the more information two terms share, as indicated by the information content of the terms, the more similar they are. In this paper, we adopted the second category to calculate the semantic similarity between gene products because it has been shown to be more effective than the former [32].
Similarity measure
The relevance similarity Sim_Rel [32] is calculated based on the probability of each term. The probability of a term is assumed to be its frequency freq(c) = Σ occur(c_i), where the sum runs over all terms c_i with c ∈ Ancestors(c_i), computed over the annotations of a database [35]. Note that, for each ancestor a of a term c, we have freq(a) ≥ freq(c), because the set of descendants of a contains all the descendants of c. Then the probability of a term c is defined as p(c) = freq(c)/freq(root), where freq(root) is the frequency of the root term. The probability is calculated independently for each ontology, and it is monotonically increasing as one moves up a path from a leaf to the root.
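The frequency propagation described above can be sketched as follows. The paper computed similarities with the SemSim package in R; this Python sketch, with hypothetical `annotations` and `parents` structures over a toy ontology, only illustrates the freq(c) and p(c) definitions:

```python
from collections import defaultdict

def term_probabilities(annotations, parents):
    """Estimate p(c) for each GO term from annotation counts.

    `annotations` maps gene products to their directly annotated terms;
    `parents` maps each term to its parent terms in a toy DAG.
    Both structures are assumptions of this sketch.
    """
    # occur(c_i): how often each term is used directly in annotations
    occur = defaultdict(int)
    for terms in annotations.values():
        for t in terms:
            occur[t] += 1

    def ancestors(term):
        """All ancestors of `term`, including the term itself."""
        seen, stack = set(), [term]
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(parents.get(t, []))
        return seen

    # freq(c) = sum of occur(c_i) over all terms c_i that have c as ancestor
    freq = defaultdict(int)
    for t, n in occur.items():
        for a in ancestors(t):
            freq[a] += n

    root_freq = max(freq.values())  # the root accumulates every occurrence
    return {t: freq[t] / root_freq for t in freq}
```

Because every occurrence propagates to the root, p(root) = 1 and p increases monotonically along any leaf-to-root path, as stated above.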
Based on the probability p(c) of each term, the information content (i.e., the amount of information shared by terms) can be measured. Schlicker et al. [32] called this kind of information content the relevance similarity Sim_Rel between a pair of terms. Sim_Rel defines the similarity between two terms in two parts:

Sim_Rel(c_1, c_2) = max_{c ∈ S(c_1, c_2)} [ (2 ln p(c) / (ln p(c_1) + ln p(c_2))) × (1 − p(c)) ]   (1)
S(c_1, c_2) is the set of common ancestors of terms c_1 and c_2. The first part evaluates the ratio between the commonality of the two terms and the information needed to fully describe them. The second part records the position of the two terms in the whole ontology. Schlicker et al. [32] studied many methods for computing the semantic similarity between GO terms, and it was shown in [32] that the measure in (1) considers as much information about the terms in GO as possible. Given a list of gene products G = {g_1, g_2, …, g_n}, the corresponding annotation terms for each gene product can be identified in GO as AT_i = {c_{i,1}, c_{i,2}, …, c_{i,g_i}}, where g_i is the total number of annotation terms in GO for gene product g_i. Finally, the semantic similarity between two gene products g_i and g_j can be calculated with
We used the Bioconductor package SemSim in the R project (http://bioconductor.org/packages/2.2/bioc/html/SemSim.html) to calculate the functional similarity between Yeast gene products.
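To make the term-level relevance similarity concrete, the following sketch implements Sim_Rel together with one plausible gene-level aggregation (a best-match average over the two annotation sets). The aggregation rule is an assumption for illustration only; the actual computation in this work was done with the SemSim package in R:

```python
import math

def sim_rel(c1, c2, p, anc):
    """Relevance similarity between two GO terms (after Schlicker et al.):
    maximize (2 ln p(c) / (ln p(c1) + ln p(c2))) * (1 - p(c)) over the
    common ancestors c of c1 and c2."""
    if p[c1] >= 1.0 or p[c2] >= 1.0:
        return 0.0  # root terms carry no information
    best = 0.0
    for c in anc[c1] & anc[c2]:  # S(c1, c2): common ancestors
        if 0.0 < p[c] < 1.0:
            ratio = 2 * math.log(p[c]) / (math.log(p[c1]) + math.log(p[c2]))
            best = max(best, ratio * (1 - p[c]))
    return best

def gene_similarity(at_i, at_j, p, anc):
    """Best-match average over the annotation sets AT_i and AT_j
    (an assumed aggregation rule for this sketch)."""
    best_i = sum(max(sim_rel(c, d, p, anc) for d in at_j) for c in at_i)
    best_j = sum(max(sim_rel(c, d, p, anc) for c in at_i) for d in at_j)
    return (best_i + best_j) / (len(at_i) + len(at_j))
```

Here `p` maps terms to probabilities and `anc` maps each term to the set of its ancestors including itself (both hypothetical inputs, e.g. produced from the annotation frequencies discussed earlier).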
Comparison of protein pairs and PPIs in terms of functional similarity
In order to investigate the functional similarity between each pair of gene products, we calculated Sim_Rel between Yeast gene products downloaded from SGD [36]. Meanwhile, we checked the distribution of functional similarity values in recorded Yeast PPIs downloaded from the MIPS database [33]. In total, 6201 Yeast proteins are included in SGD; among them, 4554 are covered in the MIPS database with 12316 protein-protein interactions (self-loops and edge direction are ignored here). Based on the Sim_Rel method (Eq. (1)), the functional similarity value of each gene product pair ranges from 0 to 1. A value close to one indicates high functional similarity, whereas a value close to zero indicates low similarity. We analyzed the distribution of the functional similarity values over all Yeast protein pairs and over the MIPS Yeast PPI network. Because some genes (1616) are not annotated in GO, a total of 4585 (6201 − 1616) genes were used as the input to the SemSim package, giving 4585 × 4584/2 = 10508820 similarity values for the corresponding gene pairs. 65 percent of the similarity values are zero, while 19 percent cannot be identified because some genes are annotated in different GO ontologies (e.g., GO-BP, GO-CC or GO-MF); the remaining 16 percent of gene product pairs have similarity values greater than zero, as shown in Figure 1. Again, most gene product pairs have small similarity values, which means that they are not very similar in terms of function. The gene product pairs with similarity values close to one are analyzed in detail below and used as prior information for semi-supervised mining of functional modules from PPI networks. Because some genes are not annotated by GO terms, their corresponding similarity values are zero or cannot be identified, and only about 10 percent of PPIs have a functional similarity value greater than zero; a higher similarity value means that the corresponding proteins are more similar to each other with regard to function.
The goal of functional module identification from PPI networks is to determine groups of cellular components and their interactions attributed to specific biological functions [29]. In other words, the proteins in one module are related to each other with regard to function. Usually, the PPI networks used here are recorded in existing PPI network databases, e.g., MIPS, where the major part of the PPI information is extracted by manual annotation from the yeast literature. However, the limited amount of literature makes such PPI information insufficient. Therefore, identifying modules based on such insufficient PPI information (i.e., the existing PPI network databases) will not achieve good performance. As shown in Figure 2a, ten proteins are listed with eight interactions recorded in an existing PPI network database. In this case, it is difficult for any method to identify functional modules. Actually, protein pairs without recorded interaction information may share common functions, shown in Figure 2b as dashed lines. Once such functional information is added, it becomes easy to derive the functional components for these ten proteins; in this example, two modules (a left module with a clique shape and a right module with a star shape) are finally found and circled in Figure 2b.
From the example in Figure 2, we can see that additional protein functional information is helpful for module identification. In this paper, GO was used to obtain protein functional information, indicated by the similarity value. In the next section, we describe how to use such functional information to supervise the mining of functional modules from PPI networks.
Functional module identification methods
Before introducing our proposed functional module identification methods, we briefly review existing popular functional module mining algorithms, including hierarchical clustering (HC) [37], Newman-Girvan (NG) [8], MCL [38] and MCODE [6].
Hierarchical clustering
From a computational point of view, functional modules are a special kind of subgraph in PPI networks, where each subgraph consists of a subset of nodes together with a specific set of edges connecting them. The hierarchical clustering method [37] is popular for network clustering, so we use it to obtain the base clusters, i.e., functional modules. An implementation of this hierarchical clustering algorithm (the agglomerative average-linkage hierarchical algorithm, Agnes) is available in the R project cluster package (http://cran.r-project.org/web/packages/cluster/index.html). Agnes finds clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until either the desired number of clusters has been obtained or all objects have been merged into a single cluster, yielding a complete agglomerative tree. The algorithm takes a similarity matrix as input. Because PPI networks are typical scale-free networks [40], we employ two different similarity metrics designed to capture topological properties of scale-free networks, Clustering Coefficient (S_cc) [3] and Neighborhood (S_nb) [39]; the corresponding clustering methods are called HC_cc and HC_nb respectively. The first similarity metric is based on the clustering coefficient, a popular metric from graph theory. The clustering coefficient [41] is a measure that represents the interconnectivity of a vertex's neighbors. The clustering coefficient of a vertex v with degree k_v can be defined as follows:
C(v) = 2 n_v / (k_v (k_v − 1))

where n_v denotes the number of triangles that go through node v. Essentially, if the edge between two nodes contributes a lot to the clustering coefficients of those nodes, then the nodes are considered similar and should be clustered together. Here the edge-clustering coefficient [3] is defined, in analogy with the usual node-clustering coefficient, as the number of triangles to which a given edge belongs, divided by the number of triangles that might potentially include it given the degrees of the adjacent nodes. More formally, for the edge connecting node i and node j, the edge-clustering coefficient is

S_cc(i, j) = z_{i,j} / min[(k_i − 1), (k_j − 1)]

where z_{i,j} is the number of triangles built on that edge, i.e., the number of common neighbors of node i and node j, and min[(k_i − 1), (k_j − 1)] is the maximal possible number of such triangles.
The idea behind this metric is that edges connecting nodes in different communities are included in few or no triangles and tend to have small values of S_cc(i, j). On the other hand, many triangles exist within clusters. Hence the coefficient S_cc(i, j) is a measure of how inter-community a link is. Note that S_cc(i, j) will be zero when k_i ≤ 1 or k_j ≤ 1, and also when z_{i,j} = 0.
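A direct implementation of the edge-clustering coefficient is straightforward; the adjacency-set representation below is an assumption of this sketch:

```python
def edge_clustering_coefficient(adj, i, j):
    """S_cc(i, j) = z_ij / min(k_i - 1, k_j - 1), where z_ij is the number
    of common neighbors of i and j. Returns 0 when either endpoint has
    degree <= 1, matching the convention in the text."""
    ki, kj = len(adj[i]), len(adj[j])
    denom = min(ki - 1, kj - 1)
    if denom <= 0:
        return 0.0
    z = len(adj[i] & adj[j])  # triangles built on the edge (i, j)
    return z / denom
```

On a triangle {1, 2, 3} with a pendant node 4 attached to node 3, the intra-triangle edges score 1 while the pendant edge scores 0, illustrating why low-S_cc edges tend to lie between communities.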
The second metric we use is a Neighborhood-based similarity metric, for which we use the well-known Czekanowski-Dice distance [39]. This metric uses the adjacency list of each node and favors nodes that have several common neighbors. Two nodes having no common neighbor will have the minimum similarity value (i.e., zero), while those interacting with exactly the same set of nodes will have the maximum value. The Neighborhood-based similarity metric is defined as

S_nb(i, j) = 1 − |Int(i) Δ Int(j)| / (|Int(i) ∪ Int(j)| + |Int(i) ∩ Int(j)|)

where Int(i) and Int(j) denote the adjacency lists (including the nodes themselves) of proteins i and j, respectively, and Δ represents the symmetric difference between the sets. The value of this metric ranges from 0 to 1. Note that, with this metric, nodes that do not interact with each other may have a nonzero similarity if they have common neighbors.
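The neighborhood metric can be sketched the same way; the formula below (one minus the Czekanowski-Dice distance) is one common formulation consistent with the properties stated above, and the adjacency-set input is an assumption:

```python
def neighborhood_similarity(adj, i, j):
    """S_nb(i, j) = 1 - |Int(i) sym-diff Int(j)| /
    (|Int(i) union Int(j)| + |Int(i) intersect Int(j)|),
    where Int(v) is the adjacency list of v including v itself."""
    int_i = adj[i] | {i}
    int_j = adj[j] | {j}
    sym_diff = len(int_i ^ int_j)  # symmetric difference
    return 1 - sym_diff / (len(int_i | int_j) + len(int_i & int_j))
```

Two non-interacting nodes that share a neighbor receive a nonzero score, while nodes with disjoint neighborhoods score zero.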
NewmanGirvan method
Newman and Girvan [8] first introduced the edge-betweenness measure for clustering networks in sociology and ecology to obtain communities. This measure favors edges between communities and disfavors edges within communities. As pointed out by Holme et al. [42], edge-betweenness uses properties calculated from the whole graph, allowing information from non-local features to be used in the clustering. Newman et al. introduced three different edge-betweenness measures: shortest-path, random-walk and current-flow. In this paper, we consider the shortest-path betweenness measure, which computes for each edge in the graph the fraction of shortest paths that pass through it. It is given by:
EB(e_{i,j}) = SP_{i,j} / SP_max

where SP_{i,j} is the number of shortest paths passing through edge e_{i,j} and SP_max is the maximum number of shortest paths passing through any edge in the graph. EB(e_{i,j}) denotes the shortest-path edge-betweenness value of the edge between nodes i and j; the edge-betweenness of an edge is thus the proportion of shortest paths the edge belongs to. The NG method can be viewed as a divisive clustering method: it starts with a single cluster containing all vertices and recursively splits the most appropriate cluster at the edges with large edge-betweenness values. The process continues until a stopping criterion (usually the number of splitting steps s) is reached.
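The shortest-path edge betweenness can be computed directly from its definition. The brute-force enumeration below is a hypothetical helper, not the implementation used in the paper, and is only practical for small graphs: for every node pair it enumerates all shortest paths by BFS, counts how many pass through each edge, and normalizes by the maximum count:

```python
from collections import defaultdict, deque
from itertools import combinations

def shortest_path_edge_betweenness(adj):
    """EB(e) = SP_e / SP_max over an undirected graph given as
    node -> set-of-neighbors (an assumed representation)."""
    sp_count = defaultdict(int)
    nodes = sorted(adj)
    for s, t in combinations(nodes, 2):
        # BFS from s, recording distances and shortest-path predecessors
        dist, preds = {s: 0}, defaultdict(list)
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
                if dist.get(v) == dist[u] + 1:
                    preds[v].append(u)
        if t not in dist:
            continue  # disconnected pair
        # walk back from t, accumulating edge usage over all shortest paths
        stack = [(t, [])]
        while stack:
            v, path = stack.pop()
            if v == s:
                for e in path:
                    sp_count[e] += 1
                continue
            for u in preds[v]:
                stack.append((u, path + [frozenset((u, v))]))
    sp_max = max(sp_count.values())
    return {e: c / sp_max for e, c in sp_count.items()}
```

On two triangles joined by a single bridge, the bridge edge carries every cross-cluster shortest path and therefore attains EB = 1, which is exactly why the NG method removes such edges first.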
MCL
The Markov Cluster algorithm (MCL) [38, 43] simulates a flow on the graph by calculating successive powers of the associated adjacency matrix. At each iteration, an inflation step is applied to enhance the contrast between regions of strong and weak flow in the graph. The process converges towards a partition of the graph into a set of high-flow regions (the clusters) separated by boundaries with no flow. The value of the inflation parameter (r) strongly influences the number of clusters: a larger number of smaller clusters is obtained as the inflation value r increases. The core idea behind this method is that clusters of related nodes are densely interconnected, and hence there should be more paths between pairs of nodes belonging to the same cluster than between pairs of nodes belonging to distinct clusters. Subsequently, MCL was used in [5, 9, 44] to identify functionally related clusters in the protein interaction networks of S. cerevisiae and Human, and the experimental results indicated that the identified modules did represent functional clusters within the network. In this paper, we used the MCL package (http://www.micans.org/mcl/#source) to mine the functional modules from PPI networks.
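The expansion/inflation loop can be sketched with NumPy. This is an illustration only (the paper used the MCL package from micans.org); the attractor-row cluster extraction and the 0.1 threshold are simplifications assumed here:

```python
import numpy as np

def mcl(adj_matrix, inflation=2.0, max_iter=100, tol=1e-6):
    """Minimal MCL sketch: alternate expansion (matrix squaring) and
    inflation (elementwise power + column normalization) until the flow
    matrix converges, then read clusters off the attractor rows."""
    M = np.array(adj_matrix, dtype=float)
    np.fill_diagonal(M, 1.0)               # self-loops stabilize the iteration
    M /= M.sum(axis=0, keepdims=True)      # column-stochastic start
    for _ in range(max_iter):
        expanded = M @ M                   # expansion: flow spreads
        inflated = expanded ** inflation   # inflation: strong flow wins
        inflated /= inflated.sum(axis=0, keepdims=True)
        converged = np.abs(inflated - M).max() < tol
        M = inflated
        if converged:
            break
    # attractors have mass on their own diagonal; their rows list members
    n = M.shape[0]
    clusters = []
    for i in range(n):
        if M[i, i] > 0.1:
            members = frozenset(j for j in range(n) if M[i, j] > 0.1)
            if members not in clusters:
                clusters.append(members)
    return clusters
```

With the default inflation of 2, two triangles joined by a single bridge separate into two clusters; raising the inflation parameter fragments the graph further, matching the behavior described above.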
MCODE
Molecular complex detection (MCODE) [6] is a method to detect densely connected regions. First, it assigns a weight to each vertex corresponding to its local neighborhood density, using the core-clustering coefficient instead of the clustering coefficient of each vertex. Next, starting from the top-weighted vertex (the seed vertex), it recursively moves outward, including in the cluster those vertices whose weight is above a given threshold (the Node Score Cutoff, t). During the clustering process, new members are added only if their node score deviates from the seed node's score by less than the cutoff threshold; therefore, small cutoff values create many more smaller clusters, and vice versa. The third stage post-processes the clustering results by increasing the size of each complex according to a given parameter (f), so that the already-defined modules may overlap. In this paper, we used the MCODE plugin in Cytoscape (http://baderlab.org/Software/MCODE) to mine the functional modules from PPI networks.
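The seed-and-grow logic can be illustrated as follows. The vertex weighting here uses the plain clustering coefficient times degree as a simplified stand-in for MCODE's core-clustering coefficient, so this is a rough sketch of the scheme rather than the MCODE algorithm itself:

```python
def mcode_like(adj, node_score_cutoff=0.2):
    """Simplified MCODE-style sketch: weight vertices by local density,
    then grow clusters outward from the highest-weight seed, admitting
    neighbors whose weight is within the cutoff of the seed's weight."""
    def weight(v):
        nbrs = adj[v]
        k = len(nbrs)
        if k < 2:
            return 0.0
        links = sum(len(adj[u] & nbrs) for u in nbrs) / 2  # edges among neighbors
        return (2 * links / (k * (k - 1))) * k             # density * degree

    w = {v: weight(v) for v in adj}
    unused = set(adj)
    clusters = []
    while unused:
        seed = max(unused, key=lambda v: w[v])
        if w[seed] == 0:
            break  # remaining vertices are too sparse to seed a cluster
        cluster, frontier = {seed}, [seed]
        while frontier:
            v = frontier.pop()
            for u in adj[v]:
                if (u in unused and u not in cluster
                        and w[u] >= w[seed] * (1 - node_score_cutoff)):
                    cluster.add(u)
                    frontier.append(u)
        unused -= cluster
        clusters.append(cluster)
    return clusters
```

As in MCODE, a small cutoff keeps clusters tight around dense cores (a 4-clique survives, pendant chains do not), while a large cutoff lets clusters absorb looser neighbors.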
Prior-knowledge-based functional module identification methods
Given prior information (usually in the form of pairwise constraints), semi-supervised learning approaches [26] can be implemented in two ways. One method is to restrict the solution space based on the pairwise constraints and then find a solution consistent with the constraints for the other, unlabeled data; examples include probabilistic models [45], hierarchical clustering [46] and spectral clustering [47]. The other method employs the prior information to learn a distance metric which can be used to compute the pairwise similarity, so that learning methods based on a similarity matrix can be adopted, such as [48, 49]. The key difficulty of semi-supervised learning is how to influence a learning algorithm with the prior information. An efficient and simple way to address this challenge is to encode the prior information into the data representation and then feed the data into an existing learning algorithm [50]. The other way is to use the prior information to supervise the learning process. In this section, we present these two approaches to prior-knowledge-based mining of functional modules from protein-protein interaction networks.

(a) Prior information is combined into the original data set to form a new data set, and then any existing module identification algorithm can be applied to the new data set.

(b) Prior information is used by the proposed learning algorithms:

Semi-supervised hierarchical clustering (ssHC): prior knowledge is used to construct transitive closures [46], which are then set as the initial clusters together with the remaining points.

Semi-supervised NG, semi-supervised MCL and semi-supervised MCODE (ssNG, ssMCL and ssMCODE, respectively): NG, MCL or MCODE is used to group PPIs into a relatively large number of sub-modules, and connections between sub-modules are then established according to the pairwise constraints.
The first approach (illustrated in Figure 3), encoding the prior information into the data representation, is easy to implement. As shown in Figure 2, the protein functional pairs identified from GO can be added into the original PPI network. Then the existing functional module identification methods (HC_cc [3], HC_nb [39], NG [8], MCL [5] and MCODE [6]) can be applied to the new combined PPI network. Furthermore, we propose a novel prior-knowledge-based learning framework (illustrated in Figure 4) that is based on the pairwise constraints and can use any existing module identification method, such as HC_cc, HC_nb, NG, MCL and MCODE. For HC_cc and HC_nb, the prior information in the form of protein functional pairs is used to construct transitive closures [46]. A transitive closure is constructed from the pairs of proteins with large functional similarity, which gives the constraint degree between each pair of proteins. If the similarity between proteins P_i and P_j is greater than a threshold (in this study, the best threshold was experimentally found to be 0.999), we say there is a must-link constraint between P_i and P_j. A set of constraints C is made up of all the must-link constraints. We then build an undirected graph G with one node for each point appearing in the constraints C, and an edge between two nodes if the corresponding points appear together in a must-link constraint. The connected components of G give the sets in the transitive closure. For instance, in our example in the last section, there are 162 transitive closures on 654 Yeast proteins with 1488 protein functional pairs, where different closures may cover different numbers of proteins. The biggest closure has 21 proteins and 207 pairs, while the smallest closures have 2 proteins and 1 pair, as shown in Figure 5.
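The transitive-closure construction reduces to finding the connected components of the must-link constraint graph, as the following sketch shows (the protein identifiers are hypothetical):

```python
from collections import defaultdict

def transitive_closures(must_links):
    """Build transitive closures from must-link constraints: nodes are the
    proteins appearing in constraints, closures are the connected
    components of the constraint graph."""
    graph = defaultdict(set)
    for a, b in must_links:
        graph[a].add(b)
        graph[b].add(a)
    seen, closures = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]  # depth-first traversal of one component
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(graph[v] - comp)
        seen |= comp
        closures.append(comp)
    return closures
```

Constraints (P1, P2) and (P2, P3) thus collapse into a single three-protein closure even though P1 and P3 were never directly paired, which is exactly the transitivity exploited above.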
These transitive closures, together with the other proteins not included in any closure, are set as the initial clusters of the hierarchical clustering methods. Next, the hierarchical clustering methods merge the pair of clusters with the smallest distance or largest similarity (here, clustering coefficient and neighborhood are used to measure cluster similarity, and average linkage is adopted to merge clusters). The merging procedure ends when the given number of clusters is obtained. These two semi-supervised hierarchical clustering methods (based on clustering coefficient and neighborhood) are denoted ssHC_cc and ssHC_nb respectively.
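The ssHC merging procedure can be sketched as below, starting from the closures (plus singletons) as initial clusters and merging by average linkage. Here `sim(a, b)` stands for any pairwise protein similarity such as S_cc or S_nb, and the inputs are illustrative assumptions:

```python
def ss_hierarchical(initial_clusters, sim, k):
    """ssHC sketch: begin from transitive closures plus singletons, then
    repeatedly merge the most similar pair of clusters (average linkage)
    until k clusters remain."""
    clusters = [set(c) for c in initial_clusters]
    while len(clusters) > k:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average linkage: mean similarity over all cross pairs
                s = sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                s /= len(clusters[i]) * len(clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        clusters[i] |= clusters.pop(j)
    return clusters
```

Because the closures are never split, the must-link constraints are guaranteed to be respected in the final clustering, which is the point of seeding the hierarchy with them.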
For NG, MCL and MCODE, we adopted a two-stage semi-supervised learning approach aided by the prior information (i.e., the protein functional pairs). In the first stage, the PPI network is grouped into a relatively large number of sub-modules by relaxing the parameters of the existing algorithms: a large number of splitting steps s is given to the NG algorithm, a large inflation value r is set in the MCL method, and a small cutoff t is set in the MCODE method. In the second stage, connections between sub-modules are established according to the protein pairs with high functional similarity. In this way, three semi-supervised methods, ssNG, ssMCL and ssMCODE, are designed to mine the functional modules.
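The second stage shared by ssNG, ssMCL and ssMCODE, reconnecting sub-modules through high-similarity protein pairs, can be sketched with a union-find merge (the sub-modules and pairs below are hypothetical placeholders for the output of stage one and the GO-derived pairs):

```python
def merge_submodules(submodules, functional_pairs):
    """Merge sub-modules that are linked by a high-similarity protein pair,
    using a simple union-find over sub-module indices."""
    parent = list(range(len(submodules)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # map each protein to the index of its sub-module
    where = {p: idx for idx, mod in enumerate(submodules) for p in mod}
    for a, b in functional_pairs:
        if a in where and b in where:
            ra, rb = find(where[a]), find(where[b])
            if ra != rb:
                parent[rb] = ra  # union the two sub-modules
    merged = {}
    for idx, mod in enumerate(submodules):
        merged.setdefault(find(idx), set()).update(mod)
    return list(merged.values())
```

Over-fragmented output from stage one is thus repaired wherever the prior information says two fragments belong to the same functional module, while fragments untouched by any constraint are left as they are.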