Genetic algorithm
A genetic algorithm (GA) is a bio-inspired meta-heuristic that is generally founded on the theory of evolution [15]. A GA searches for optimal solutions by sampling the search space at random to create a population of candidate solutions. It then applies genetic operators (e.g., mutation and crossover) to evolve the population into new generations that are hopefully fitter according to a given objective (fitness) function. Survival of an individual into the next population is normally based on its fitness, that is, survival of the fittest. However, the survival strategy normally does not preclude the survival of less fit individuals. Using a GA to solve a given problem requires the following problem-dependent design choices: a genetic representation of the problem solutions, the fitness function, candidate selection methods, and genetic operators (e.g., crossover and mutation). The basic steps of GA are the following [16]:
Spectral clustering
The graph clustering problem is that of finding the highly connected subgraphs (HCS) within the graph. The spectral clustering algorithm works by finding the minimum cut between two HCS subgraphs (clusters). The cut is the number of edges between the two distinct clusters. The minimum cut is obtained from the eigenvector \(x^{*}\) corresponding to the smallest positive eigenvalue of the generalized eigenproblem
$$ Qx = \lambda Dx, $$
where Q and D are the Laplacian matrix and the diagonal matrix of the graph, respectively. We consider the graph initially as one cluster, and proceed to obtain two clusters from it. We choose the size of the two clusters by applying the k-means clustering algorithm to \(x^{*}\) with k=2, which selects the split value among the eigenvector components that makes the objective function value as small as possible. By a recursive application of this procedure, we obtain a clustering of the entire network. The number, size, and density of the clusters are not pre-specified; they are determined by the network topology and by the threshold value of the objective function used to decide whether a cluster will be split again [17].
We apply a spectral clustering method to identify initial subnetworks and clusters in the Collins protein interaction network.
Objective functions
In this paper, we use the following three objective functions [18] to evaluate the quality of possible cluster structures. We compare the clustering achieved using these objective functions to the one achieved by our proposed objective function discussed later. We also compare clustering of all four objective functions to two gold standards.
- Min-Max-cut:
$$ \text{JM}_{cut}(V_{1},V_{2})=\frac{W_{12}}{W_{11}} + \frac{W_{12}}{W_{22}}. $$
- Ratio cut:
$$ \text{JR}_{cut}(V_{1},V_{2})=\frac{W_{12}}{|V_{1}|} + \frac{W_{12}}{|V_{2}|}. $$
- Normalized cut:
$$ \text{JN}_{cut}(V_{1},V_{2})=\frac{W_{12}}{d_{1}} + \frac{W_{12}}{d_{2}}, $$
where \( d_{k}=\sum _{i \in V_{k}} d_{i} \) is the sum of the degrees \(d_{i}\) of the vertices belonging to \(V_{k}\), \(k \in \{1,2\}\), and
$$W_{il} \equiv W(V_{i}, V_{l}) = \sum\limits_{j \in V_{i}, k \in V_{l}, (j,k) \in E} w_{jk},$$
where \(i,l \in \{1,2\}\) and \(w_{jk}\) is the weight on edge \((j,k)\).
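As an illustration, the three objectives can be evaluated for a given bipartition with a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the graph is stored as a hypothetical dictionary mapping an undirected edge `(u, v)` to its weight, each edge inside a part is counted once in \(W_{11}\) and \(W_{22}\), and all function names are illustrative.

```python
from typing import Dict, Set, Tuple

# Assumed representation: each undirected edge appears once as (u, v) -> weight.
Edges = Dict[Tuple[int, int], float]


def weight_between(edges: Edges, a: Set[int], b: Set[int]) -> float:
    """W(A, B): total weight of edges joining A and B (each edge counted once)."""
    total = 0.0
    for (u, v), w in edges.items():
        if (u in a and v in b) or (v in a and u in b):
            total += w
    return total


def degree_sum(edges: Edges, part: Set[int]) -> float:
    """d_k: sum of the weighted degrees of the vertices in `part`."""
    total = 0.0
    for (u, v), w in edges.items():
        # An internal edge contributes to the degree of both of its endpoints.
        total += w * ((u in part) + (v in part))
    return total


def min_max_cut(edges: Edges, v1: Set[int], v2: Set[int]) -> float:
    w12 = weight_between(edges, v1, v2)
    return w12 / weight_between(edges, v1, v1) + w12 / weight_between(edges, v2, v2)


def ratio_cut(edges: Edges, v1: Set[int], v2: Set[int]) -> float:
    w12 = weight_between(edges, v1, v2)
    return w12 / len(v1) + w12 / len(v2)


def normalized_cut(edges: Edges, v1: Set[int], v2: Set[int]) -> float:
    w12 = weight_between(edges, v1, v2)
    return w12 / degree_sum(edges, v1) + w12 / degree_sum(edges, v2)
```

For two unit-weight triangles {1,2,3} and {4,5,6} joined by the single edge (3,4), this sketch gives \(W_{12}=1\), \(W_{11}=W_{22}=3\), and \(d_{1}=d_{2}=7\). The functions do not guard against empty parts or parts without internal edges, which would lead to a division by zero.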
Clustering algorithm
In this section, we present a new overlapping clustering algorithm to help facilitate the different demands and purposes of cluster analysis. The structure of the new overlapping clustering algorithm, Algorithm 1, is shown in Fig. 1. Algorithm 1 employs GA for clustering the PPI network. Starting with an initial population of individuals (set of clusterings), the algorithm generates a new set of individuals using genetic operators (selection and mutation). The goal is to get individuals to converge to solutions (clusterings) of maximum fitness according to the objective function.
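The overall selection-and-mutation loop can be pictured with the following Python sketch. It is not Algorithm 1 itself, only a generic GA loop under assumptions that are ours: parents are chosen by binary tournament, survivors are the fittest among parents and offspring, and `fitness` and `mutate` are caller-supplied placeholders for the objective function and the mutation operator.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def evolve(
    population: Sequence[T],
    fitness: Callable[[T], float],
    mutate: Callable[[T, random.Random], T],
    generations: int,
    rng: random.Random,
) -> T:
    """Run a selection + mutation loop and return the fittest individual found."""
    pop: List[T] = list(population)
    for _ in range(generations):
        offspring: List[T] = []
        for _ in range(len(pop)):
            # Assumed parent selection: binary tournament between two random picks.
            a, b = rng.choice(pop), rng.choice(pop)
            parent = a if fitness(a) >= fitness(b) else b
            offspring.append(mutate(parent, rng))
        # Assumed survivor selection: keep the fittest individuals among parents
        # and offspring, so the population size stays constant.
        pop = sorted(pop + offspring, key=fitness, reverse=True)[: len(pop)]
    return max(pop, key=fitness)
```

Because survivors are drawn from parents and offspring together, the best fitness in the population never decreases from one generation to the next in this sketch. A less elitist survival rule would be needed to let less fit individuals survive.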
Representation and initialization
We represent each individual (a possible solution to the problem) as k lists \(\{c_{1}, c_{2}, c_{3}, \ldots, c_{k}\}\), where k is the number of clusters. Each list stores integers in the range {1,2,...,N}, where N is the size of the data set, as illustrated in Fig. 2. An element j of a list is the index of a node in the graph G modeling the PPI network. Elements of different lists can hold the same value j, which means that the protein with index j can belong to more than one cluster; this is the case of overlapping clustering.
The population is composed of a number (the population size) of individuals, or possible clusterings. We use two different methods to initialize the population. The first approach generates m random individuals, where m is the size of the population, as follows: for each individual consisting of k lists, every element is assigned a random integer value j in the range {1,2,...,N}, where N is the size of the data set. For example, as illustrated in Fig. 2, the node with index 70 is assigned to the cluster \(c_{1}\), while the node with index 8 is assigned to the two clusters \(c_{1}\) and \(c_{2}\). Such a method should preserve the variety among the individuals of the population, which is expected to be rather high.
In the second approach, we use the resulting clusterings of the spectral clustering algorithm [18] to create the initial population.
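The first (random) initialization could be sketched in Python as follows. The text does not specify how many elements each list holds, so `list_len` is a hypothetical parameter, as are the function names and the fixed seed; duplicates inside one list are dropped, while the same index may appear in several lists to encode overlap.

```python
import random
from typing import List

Individual = List[List[int]]  # k lists of node indices in 1..N


def random_individual(n_nodes: int, k: int, list_len: int, rng: random.Random) -> Individual:
    """Build one individual: k lists, each filled with random node indices."""
    individual: Individual = []
    for _ in range(k):
        # A set removes repeated draws within one cluster; different clusters
        # are drawn independently, so they may share indices (overlap).
        members = {rng.randint(1, n_nodes) for _ in range(list_len)}
        individual.append(sorted(members))
    return individual


def random_population(m: int, n_nodes: int, k: int, list_len: int, seed: int = 0) -> List[Individual]:
    """Generate m random individuals (candidate clusterings)."""
    rng = random.Random(seed)
    return [random_individual(n_nodes, k, list_len, rng) for _ in range(m)]
```

For instance, `random_population(m=50, n_nodes=1000, k=20, list_len=15)` would return 50 candidate clusterings, each with 20 clusters of at most 15 proteins.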
Density–based objective function
The objective function aims to calculate the fitness values for each individual of the population to indicate how well each individual is suited to be the solution of a given problem. In our case, the fitness value of an individual reflects the intra–cohesion of each cluster proposed by the individual, as well as the inter–cluster coupling of those clusters. The goal is to maximize intra-cohesion and minimize inter-coupling. We represent intra-cohesion and inter-coupling by the number of edges within and across clusters, respectively. We compute the fitness of an individual as follows:
$$\text{JD}_{cut}(C_{1},..., C_{k})=\sum_{k} \frac{W_{kk}}{A_{k} + W_{ki}}, $$
where \(W_{kk}\) is the number of edges in a cluster \(C_{k}\), \(W_{ki}\) is the number of edges that have one endpoint in \(C_{k}\), and \(A_{k}\) is the maximum possible number of edges in the cluster \(C_{k}\).
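A minimal Python sketch of this fitness computation is given below. It assumes a simple undirected, unweighted graph given as a list of edges, and it takes the maximum possible number of edges in a cluster of n nodes to be n(n-1)/2; both the representation and the handling of a zero denominator are our assumptions.

```python
from typing import Iterable, List, Set, Tuple


def density_fitness(edges: Iterable[Tuple[int, int]], clusters: List[Set[int]]) -> float:
    """Sum over clusters of inside_edges / (max_possible_edges + boundary_edges)."""
    edge_list = list(edges)
    fitness = 0.0
    for cluster in clusters:
        # Edges with both endpoints in the cluster (intra-cohesion).
        inside = sum(1 for u, v in edge_list if u in cluster and v in cluster)
        # Edges with exactly one endpoint in the cluster (inter-coupling).
        boundary = sum(1 for u, v in edge_list if (u in cluster) != (v in cluster))
        n = len(cluster)
        max_edges = n * (n - 1) // 2  # assumed: simple undirected graph
        denominator = max_edges + boundary
        if denominator > 0:  # assumed: an isolated singleton contributes 0
            fitness += inside / denominator
    return fitness
```

For two triangles {1,2,3} and {4,5,6} joined by the edge (3,4), each cluster has 3 internal edges, 1 boundary edge, and at most 3 possible edges, so the clustering scores 3/4 + 3/4 = 1.5. Merging all six nodes into a single cluster scores only 7/15, which shows how the measure rewards dense, well-separated clusters.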
Genetic operators
The most common operations used in genetic algorithms are selection, crossover, and mutation. We exclude the crossover operation because it induces too much exploration, which disrupts potentially good solutions. Parent selection is the process of selecting individuals from the current population to create offspring for the next generation. It favors individuals with high fitness values in the hope that their offspring will have higher fitness as well. There are many ways to select parents, or individuals, from the current population for reproduction. Algorithm 2 illustrates in detail the parent-selection process.
The mutation operation is defined as performing some changes in the values of a specific chromosome, or individual. Consequently, the GA may reach a better solution with the obtained individuals. We adapt the mutation operator used in [13] and modify it to be better suited to, and more efficient for, our problem. This operation can be described as follows: after selecting an individual to be mutated, its nodes are either moved from one cluster to another, or some nodes of the network are added to the selected individual, as shown in Fig. 3. Figure 3a shows the selected node of the cluster and Fig. 3b illustrates the cluster after adding the selected node's neighbors from the network. Algorithm 3 illustrates in detail the mutation process.
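The two mutation moves described above can be illustrated with the following Python sketch. It is not Algorithm 3: the way a node is picked, the probability `p_move` of choosing the move variant, and the adjacency dictionary `neighbors` are all assumptions made for the example.

```python
import random
from typing import Dict, List, Set


def mutate(
    individual: List[Set[int]],
    neighbors: Dict[int, Set[int]],
    rng: random.Random,
    p_move: float = 0.5,
) -> List[Set[int]]:
    """Return a mutated copy of `individual` (a list of clusters as sets)."""
    child = [set(c) for c in individual]  # leave the parent untouched
    non_empty = [i for i, c in enumerate(child) if c]
    if not non_empty:
        return child
    src = rng.choice(non_empty)
    node = rng.choice(sorted(child[src]))
    if len(child) > 1 and rng.random() < p_move:
        # Move: take the selected node out of its cluster and put it in another.
        dst = rng.choice([i for i in range(len(child)) if i != src])
        child[src].discard(node)
        child[dst].add(node)
    else:
        # Grow: add the selected node's network neighbors to its cluster.
        child[src] |= neighbors.get(node, set())
    return child
```

The grow move can introduce overlap, because an added neighbor may already belong to another cluster. The move variant keeps the number of assignments constant unless the node is already present in the destination cluster.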
Quality assessment
For quality assessment, we consider an approach that finds statistically significant matches between discovered clusters and the reference data, and summarizes them with precision (P), recall (R), and F-measure (the harmonic mean of precision and recall) [19]. This approach measures the level of correspondence between the discovered clusters and the reference data set by computing statistically significant matches between the two collections using the hypergeometric p-value, and uses these matches to evaluate the precision and recall of the suggested clustering solution as follows. Let \(\mathcal {C}\) be the initial set of discovered clusters, and let \(\hat {\mathcal {C}} \subseteq \mathcal {C}\) be the subset of clusters that have a significant match based on the hypergeometric p-value.
Here, the p-value is used to determine whether a discovered cluster is annotated by certain terms from the reference data set at a frequency greater than would be expected by chance. It is calculated according to the following hypergeometric distribution:
$$ \text{p-value} = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}}, $$
where N is the total number of proteins, M is the size of the list of proteins G annotated with the reference term of interest (protein complex), n is the number of proteins in a discovered cluster C, and k is the number of proteins shared between C and G.
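The hypergeometric tail probability can be computed exactly with integer binomial coefficients, as in the following Python sketch. The function and parameter names are hypothetical; the code implements the standard hypergeometric upper tail, i.e., the probability of observing at least `overlap` shared proteins when `cluster_size` proteins are drawn at random from `n_total`, of which `ref_size` belong to the reference complex.

```python
from math import comb


def hypergeom_p_value(n_total: int, ref_size: int, cluster_size: int, overlap: int) -> float:
    """P(X >= overlap) for X ~ Hypergeometric(n_total, ref_size, cluster_size)."""
    # Number of ways to draw the cluster with fewer than `overlap` reference proteins.
    below = sum(
        comb(ref_size, i) * comb(n_total - ref_size, cluster_size - i)
        for i in range(overlap)
    )
    return 1.0 - below / comb(n_total, cluster_size)
```

With 10 proteins in total, a reference complex of 4, and a cluster of 3 that lies entirely inside the complex, the sketch returns 4/120, which is the chance of drawing 3 complex members at random. In practice the resulting p-values would be compared against a significance threshold, possibly after a multiple-testing correction, neither of which is specified here.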
For each predicted cluster C, let the true positives (TP) be the set of proteins shared between the cluster C and a reference protein complex G, the false positives (FP) be the set of proteins that exist only in the cluster C, and the false negatives (FN) be the set of proteins that are members of the reference complex G but are not found in the cluster C. Hence, P, R, and F-measure are calculated according to the following equations:
$$\begin{array}{ccc} \mathrm{P} & = & \frac{|\text{TP}|}{|\text{TP} \cup \text{FP}|}, \\ &&\\ \mathrm{R} & = & \frac{|\text{TP}|}{|\text{TP} \cup \text{FN}|}, \\ &&\\ \text{F-measure}&= &2 \times \frac{\mathrm{P} \times \mathrm{R}}{\mathrm{P} + \mathrm{R}}. \end{array} $$
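For a single cluster and a single reference complex, these measures reduce to simple set operations, as the following Python sketch shows. The function name is hypothetical, `missed` denotes the proteins of the reference complex that are absent from the cluster, and returning 0 when a denominator vanishes is our convention.

```python
from typing import Set, Tuple


def precision_recall_f(cluster: Set[int], reference: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure of one cluster against one reference complex."""
    shared = len(cluster & reference)        # proteins in both sets
    only_cluster = len(cluster - reference)  # proteins found only in the cluster
    missed = len(reference - cluster)        # complex members absent from the cluster
    precision = shared / (shared + only_cluster) if cluster else 0.0
    recall = shared / (shared + missed) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For the cluster {1,2,3,4} and the reference complex {3,4,5}, two proteins are shared, two exist only in the cluster, and one complex member is missed, which gives P = 1/2, R = 2/3, and F-measure = 4/7.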