Protein complexes predictions within protein interaction networks using genetic algorithms
 Emad Ramadan^{1},
 Ahmed Naef^{1} and
 Moataz Ahmed^{1}
https://doi.org/10.1186/s12859-016-1096-4
© The Author(s) 2016
Published: 25 July 2016
Abstract
Background
Protein–protein interaction networks are receiving increased attention due to their importance in understanding life at the cellular level. A major challenge in systems biology is to understand the modular structure of such biological networks. Although clustering techniques have been proposed for clustering protein–protein interaction networks, those techniques suffer from some drawbacks. The application of earlier clustering techniques to protein–protein interaction networks in order to predict protein complexes within the networks does not yield good results due to the small-world and power-law properties of these networks.
Results
In this paper, we construct a new clustering algorithm for predicting protein complexes through the use of genetic algorithms. We design an objective function for exclusive clustering and overlapping clustering. We assess the quality of our proposed clustering algorithm using two gold-standard data sets.
Conclusions
Our algorithm can identify protein complexes that are significantly enriched in the gold-standard data sets. Furthermore, our method surpasses three competing methods (MCL, ClusterOne, and MCODE) in terms of the quality of the predicted complexes. The source code and accompanying examples are freely available at http://faculty.kfupm.edu.sa/ics/eramadan/GACluster.zip.
Background
 1. This approach recognizes that protein complexes are not cliques or near-cliques; the method is capable of identifying clusterings with varying densities depending on the local density of edges in subnetworks (i.e., in dense regions of the network, it clusters dense subgraphs; and in sparse regions of the network, it clusters sparse subgraphs).
 2. The approach is more robust and scalable. For example, the clustering algorithm can cluster large networks (such as the human protein interaction network), or a large number of networks (hundreds of bacterial networks), without problems, because the costs of the many steps of the algorithm increase modestly with the number of nodes and edges in the network.
 3. The algorithm can be tuned using parameters to obtain clusterings with a desired density and an average size of clusters.
Related works
Three major graph clustering approaches have been employed to identify protein complexes.
The first approach searches for subgraphs with specified connectivities, called network motifs, and characterizes these as complexes or parts of them. A complete subgraph (clique) is one such candidate, but other network motifs on small numbers of vertices have been identified through exhaustive searching. Due to the time complexities involved, this approach is restricted to searching for small subgraphs in large networks.
In the second, graph-growing approach, a cluster is grown around a seed vertex using greedy graph search algorithms. These are local algorithms that begin with a single node or several known nodes and then expand from there. The MCODE algorithm (Bader and Hogue [2]) starts with a single seed vertex and adds more vertices based on a precomputed set of weights. A vertex in the neighborhood of a cluster is added to it as long as its weight is close (within a threshold) to the weight of the seed vertex. Similarly, Bader [3] proposed the SEEDY algorithm, which progressively adds proteins to a seed protein to form complexes, based on a particular distance metric. Another software package, Complexpander by Asthana et al. [4], works in the same way to identify protein complexes that include the seed proteins from a PPI network. However, in our experience, this approach is less stable than the graph (global) clustering approach that we describe next: the clusters discovered depend strongly on the seed vertices chosen.
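The seed-growing idea can be illustrated with a minimal sketch. It is a simplification rather than the MCODE implementation: the vertex weights are taken as given, and the rule "within `threshold` of the seed weight" is an assumed stand-in for the actual acceptance criterion. The names `grow_cluster`, `adjacency`, and `vertex_weight` are hypothetical.

```python
def grow_cluster(adjacency, vertex_weight, seed, threshold):
    """Grow a cluster outward from `seed`.

    A neighbor of the current cluster is added when its precomputed weight
    is within `threshold` of the seed's weight (an assumed closeness rule).
    """
    cluster = {seed}
    frontier = [seed]
    while frontier:
        v = frontier.pop()
        for u in adjacency[v]:
            close = abs(vertex_weight[u] - vertex_weight[seed]) <= threshold
            if u not in cluster and close:
                cluster.add(u)
                frontier.append(u)
    return cluster


# Toy path graph a - b - c - d with made-up vertex weights.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
vertex_weight = {"a": 1.0, "b": 0.9, "c": 0.5, "d": 0.95}

cluster = grow_cluster(adjacency, vertex_weight, seed="a", threshold=0.2)
```

In this toy run, vertex `d` has a weight close to the seed's but is never reached, because `c` is rejected first. Starting from a different seed gives a different cluster, which is the seed dependence noted above.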
The third approach, the graph clustering approach, includes many variants. Algorithms in this category attempt to maximize or minimize certain cluster measures such as connection density, edge cut, or a novel distance metric between nodes in a cluster. In general, these are global algorithms that seek to optimize an objective function for the whole graph. One algorithm by Spirin and Mirny [5] employs superparamagnetic clustering (SPC), a technique based on a principle observed in physics, to maximize the cluster density. Another algorithm by Przulj et al. [6] uses the concept of a minimal cut, which is a partition of the nodes of the network into two complementary sets such that the least number of edges cross from one set to the other. Their method performs recursive minimal cuts until it ends up with densely connected subgraphs. Another method by King et al. [7], called restricted neighborhood search clustering (RNSC), begins by randomly assigning nodes to clusters, then reassigns nodes so as to minimize a cost function. Yet another such method by Enright et al. [8] uses Markov clustering (MCL) to simulate “flow” through the network. It does this by calculating increasing powers of the network’s adjacency matrix. With the increased powers, the areas of high flow become increasingly separated from those with little flow.
The methods described so far compute exclusive clusterings, i.e., they permit nodes to be members of at most one cluster. However, in biological systems many proteins and gene products participate in multiple functions [9]. Pereira-Leal et al. [10] used the MCL clustering algorithm in order to detect overlapping clusters. Their algorithm first turns a network with individual proteins as nodes into a network with protein interactions as nodes (the line graph of the input graph). Then, the MCL algorithm is used to cluster the network of interactions. Finally, the algorithm converts the identified clusters from the interaction line graph back to the original graph with proteins as nodes. When the interaction network clusters are converted back to the original network, the same protein can appear in multiple clusters. Nepusz et al. [11] proposed the ClusterOne algorithm to detect overlapping clusters. Like MCODE, it starts from a single seed vertex, but it then merges each pair of groups whose overlap score is above a specified threshold. Finally, it removes all clusters that have fewer than three vertices or whose density is below a given threshold. Ramadan et al. [12] used the spectral clustering algorithm in order to detect overlapping clusters. Their algorithm first finds all possible exclusive clusters using the spectral clustering method. Upon identifying all of the exclusive clusters, it defines bridges (nodes that are significantly connected to two or more clusters) by examining the boundary nodes in the exclusive clusters (nodes that are joined to other nodes outside the cluster). This gives highly connected clusters, but still permits overlapping clusters, as nodes in one cluster may be involved in another cluster.
Another overlapping clustering algorithm is the PROCOMOSS algorithm proposed by Mukhopadhyay et al. [13]. The PROCOMOSS algorithm detects overlapping clusters using the genetic algorithm technique. The authors rely on the properties captured in the graph modeling the PPI network, and they also utilize the GO terms to consider the biological properties of the proteins. Their approach can be described as follows. First, encode the chromosome as a vector of integer numbers representing the indices of the proteins in the protein set. Then, initialize the population by applying k-means clustering on both dimensions of the adjacency matrix A of the graph modeling the PPI network. Next, calculate the fitness values of each individual of the population using two objective functions. Finally, select parents by adopting the same method used in NSGA-II [14] and mutate the selected chromosome as follows: select a random node and then either remove that node or add its neighbors to the selected chromosome with the same probability. The main drawback of this algorithm is that the predicted clusters cover a small percentage of the network.
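The line-graph idea behind the interaction-based approach can be illustrated with a small sketch. It only shows the two conversions (proteins to interactions, and interaction clusters back to protein clusters); the clustering step itself is replaced by a hand-written, hypothetical result, and the function names are illustrative.

```python
from itertools import combinations


def line_graph(edges):
    """One node per interaction; two interactions are linked if they share a protein."""
    edges = [tuple(sorted(e)) for e in edges]
    adjacency = {e: set() for e in edges}
    for e1, e2 in combinations(edges, 2):
        if set(e1) & set(e2):
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency


def to_protein_clusters(edge_clusters):
    """Map clusters of interactions back to (possibly overlapping) protein clusters."""
    return [{protein for edge in cluster for protein in edge}
            for cluster in edge_clusters]


# Toy network: two triangles that share protein "C".
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("C", "D"), ("D", "E"), ("C", "E")]
lg = line_graph(edges)

# Hypothetical output of clustering the line graph (e.g., with MCL).
edge_clusters = [
    [("A", "B"), ("B", "C"), ("A", "C")],
    [("C", "D"), ("D", "E"), ("C", "E")],
]
protein_clusters = to_protein_clusters(edge_clusters)
```

Each interaction belongs to exactly one cluster, yet protein `C` ends up in both protein clusters, which is how an exclusive clustering of interactions yields an overlapping clustering of proteins.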
Methods
Genetic algorithm

Create an initial population of candidate solutions.
LOOP until any/all of the candidate solutions become solution(s):
 1. Compute the fitness value of each of these candidates.
 2. Select candidates based on their fitness values.
 3. Create offspring from the selected candidates using genetic operators.
 4. Mutate each of these offspring using genetic operators.
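The loop above can be sketched as a generic skeleton with pluggable operators. This is an illustrative outline, not the clustering algorithm of this paper: the toy problem (maximizing the number of ones in a bit list), the binary tournament selection, the mutation rate, and all function names are assumptions chosen only to make the skeleton runnable.

```python
import random


def genetic_algorithm(init_population, fitness, select, reproduce, mutate,
                      is_solution, max_generations=100):
    """Generic GA loop: evaluate, select, create offspring, mutate."""
    population = init_population()
    for _ in range(max_generations):
        scores = [fitness(c) for c in population]            # step 1
        if any(is_solution(c) for c in population):
            break
        parents = select(population, scores)                 # step 2
        offspring = [reproduce(p) for p in parents]          # step 3
        population = [mutate(child) for child in offspring]  # step 4
    return max(population, key=fitness)


# Toy problem (assumed for illustration): maximize the number of ones.
rng = random.Random(0)  # fixed seed so the run is repeatable
N_BITS = 8


def init_population():
    return [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(20)]


def fitness(candidate):
    return sum(candidate)


def select(population, scores):
    # Binary tournament: keep the fitter of two randomly drawn candidates.
    chosen = []
    for _ in population:
        i, j = rng.randrange(len(population)), rng.randrange(len(population))
        chosen.append(population[i] if scores[i] >= scores[j] else population[j])
    return chosen


def reproduce(parent):
    return list(parent)  # plain copy; no crossover in this sketch


def mutate(child):
    # Flip each bit independently with a small probability.
    return [1 - b if rng.random() < 0.05 else b for b in child]


best = genetic_algorithm(init_population, fitness, select, reproduce, mutate,
                         is_solution=lambda c: sum(c) == N_BITS,
                         max_generations=50)
```

Passing the operators as functions keeps the loop independent of how a candidate is represented, so the same skeleton could in principle be reused with clusterings as candidates.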
Spectral clustering
We apply a spectral clustering method to identify initial subnetworks and clusters in the Collins protein interaction network.
Objective functions

MinMaxcut:
$$ \text{JM}_{cut}(V_{1},V_{2})=\frac{W_{12}}{W_{11}} + \frac{W_{12}}{W_{22}}. $$
Ratio cut:
$$ \text{JR}_{cut}(V_{1},V_{2})=\frac{W_{12}}{|V_{1}|} + \frac{W_{12}}{|V_{2}|}. $$
Normalized cut:
$$ \text{JN}_{cut}(V_{1},V_{2})=\frac{W_{12}}{d_{1}} + \frac{W_{12}}{d_{2}}, $$
where \( d_{k}=\sum _{i \in V_{k}} d_{i} \) is the sum of the degrees of the vertices belonging to V _{ k }, k={1,2}, and
$$W_{il} \equiv W(V_{i}, V_{l}) = \sum\limits_{j \in V_{i}, k \in V_{l}, (j,k) \in E} w_{jk},$$
where i,l=1,2 and w _{ jk } is the weight on edge jk.
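As a worked illustration, the three cut measures can be evaluated for a two-way partition of a small graph. The sketch assumes symmetric weights stored in both directions, so that W(V_1,V_1) counts each internal edge twice, as it would in a sum over a symmetric weight matrix; that convention, the toy graph, and the function names are assumptions of the example.

```python
def build_adjacency(edges):
    """Symmetric adjacency map {node: {neighbor: weight}} from (j, k, w) triples."""
    adj = {}
    for j, k, w in edges:
        adj.setdefault(j, {})[k] = w
        adj.setdefault(k, {})[j] = w
    return adj


def W(adj, part_a, part_b):
    """Sum of w_jk over j in part_a and k in part_b with (j, k) an edge."""
    return sum(w for j in part_a for k, w in adj[j].items() if k in part_b)


def min_max_cut(adj, v1, v2):
    # Undefined (division by zero) if a part has no internal edges.
    w12 = W(adj, v1, v2)
    return w12 / W(adj, v1, v1) + w12 / W(adj, v2, v2)


def ratio_cut(adj, v1, v2):
    w12 = W(adj, v1, v2)
    return w12 / len(v1) + w12 / len(v2)


def normalized_cut(adj, v1, v2):
    w12 = W(adj, v1, v2)
    d1 = sum(sum(adj[i].values()) for i in v1)  # total degree of V1
    d2 = sum(sum(adj[i].values()) for i in v2)  # total degree of V2
    return w12 / d1 + w12 / d2


# Two unit-weight triangles joined by the single edge c - d.
adj = build_adjacency([
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
])
v1, v2 = {"a", "b", "c"}, {"d", "e", "f"}

jm = min_max_cut(adj, v1, v2)     # 1/6 + 1/6
jr = ratio_cut(adj, v1, v2)       # 1/3 + 1/3
jn = normalized_cut(adj, v1, v2)  # 1/7 + 1/7
```

All three measures share the numerator W_12 and differ only in how each part is normalized: by internal weight, by number of vertices, or by total degree.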
Clustering algorithm
Representation and initialization
The population is composed of a number (the population size) of individuals, each of which is a possible clustering. We use two different methods to initialize the population. The first approach generates m random individuals, where m is the size of the population, as follows: each individual consists of k lists, and each element of a list is randomly assigned an integer value j in the range {1,2,...,N}, where N is the size of the data set. For example, as illustrated in Fig. 2, the node with index 70 is assigned to the cluster c _{1}, while the node with index 8 is assigned to two clusters, c _{1} and c _{2}. Such a method should keep the variety among the individuals of the population rather high.
In the second approach, we use the resulting clusterings of the spectral clustering algorithm [18] to create the initial population.
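The random initialization can be sketched as follows. The text does not state how long each list is, so the fixed `cluster_size` here is an assumption, as are the function names and the choice to forbid duplicates within a single list.

```python
import random


def random_individual(n_nodes, k, cluster_size, rng):
    """One candidate clustering: k lists of node indices drawn from 1..n_nodes.

    Sampling is without replacement inside a list, but independent across
    lists, so the same node may appear in several clusters (overlap).
    """
    return [rng.sample(range(1, n_nodes + 1), cluster_size) for _ in range(k)]


def random_population(m, n_nodes, k, cluster_size, seed=0):
    """Population of m random individuals."""
    rng = random.Random(seed)
    return [random_individual(n_nodes, k, cluster_size, rng) for _ in range(m)]


population = random_population(m=10, n_nodes=100, k=4, cluster_size=5)
```

Because every list is drawn independently, the individuals differ widely from one another, which is the diversity that random initialization is meant to provide.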
Density–based objective function
Genetic operators
The most common operations used in genetic algorithms are selection, crossover, and mutation. We exclude the crossover operation because it causes too much exploration, which disrupts potentially good solutions. Parent selection is the process of selecting individuals from the current population to create offspring for the next generation. This process favors individuals with high fitness values in the hope that their offspring will have higher fitness as well. There are many ways to select parents, or individuals, from the current population for reproduction. Algorithm 2 illustrates in detail the parent-selection process.
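One common way to favor fitter individuals is fitness-proportional (roulette-wheel) selection, sketched below. It is shown only as an example of a selection scheme and should not be read as the content of Algorithm 2, which is not reproduced in this text; the function name and the toy values are hypothetical.

```python
import random


def select_parents(population, fitness_values, n_parents, rng):
    """Draw parents with probability proportional to fitness.

    Assumes non-negative fitness values with a positive sum, which is what
    random.choices requires of its weights.
    """
    return rng.choices(population, weights=fitness_values, k=n_parents)


rng = random.Random(0)
population = ["clustering_a", "clustering_b", "clustering_c"]

# With all the weight on one individual, only that individual is selected.
parents = select_parents(population, [0.0, 0.0, 1.0], n_parents=5, rng=rng)
```

With less extreme fitness values, weaker individuals are still chosen occasionally, which keeps some diversity in the population even without crossover.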
Quality assessment
We consider an approach for quality assessment that finds statistically significant matches between the discovered clusters and the reference data, and summarizes them using precision (P), recall (R), and F-measure (the harmonic mean of precision and recall) [19]. This approach measures the level of correspondence between the discovered clusters and the reference data set by computing statistically significant matches between the two collections using the hypergeometric p-value, and uses these matches to evaluate the precision and recall of the suggested clustering solution as follows. Let \(\mathcal {C}\) be the initial set of discovered clusters, and let \(\hat {\mathcal {C}} \subseteq \mathcal {C}\) be the subset of clusters that have a significant match based on the hypergeometric p-value.
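The two generic ingredients named above, the hypergeometric p-value of an overlap and the F-measure as a harmonic mean, can be sketched as follows. The significance cutoff and the exact precision and recall formulas used by the authors are not given in this text, so the sketch stops short of them; the function names and argument names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population_size, annotated, cluster_size, overlap):
    """Probability of seeing at least `overlap` annotated proteins by chance.

    Models drawing `cluster_size` proteins without replacement from
    `population_size` proteins, `annotated` of which belong to the
    reference complex (upper tail of the hypergeometric distribution).
    """
    upper = min(annotated, cluster_size)
    total = comb(population_size, cluster_size)
    tail = sum(comb(annotated, i)
               * comb(population_size - annotated, cluster_size - i)
               for i in range(overlap, upper + 1))
    return tail / total


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A 5-protein cluster that exactly matches a 5-protein complex among 10 proteins.
p_exact = hypergeom_pvalue(10, 5, 5, 5)  # 1 / C(10, 5) = 1 / 252

# Precision 0.60 and recall 0.74 give an F-measure of about 0.66.
f_example = f_measure(0.60, 0.74)
```

A small p-value means the overlap between a cluster and a reference complex is unlikely to arise by chance, and only such matches would count toward precision and recall.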
Results and discussion
Data source
We study the protein interaction network of yeast, since there are abundant high-confidence data sets for its protein interaction network. In our experiment, we applied our clustering algorithm to the Collins protein interaction network extracted from the BioGrid data set [20]. This network has 8,319 interactions among 1,004 proteins. It has an average degree of 16.57, where the degree of a node in a network is the number of links connected to the node; the density of this network is 0.016 (density is the ratio between the total number of connections and the potential connections that can exist in the network).
High-quality data collections are needed as gold standards to validate clustering approaches. We assess the coherence of the discovered clusters based on the Gene Ontology (GO) [21]. We have used the cellular component ontology from GO as the primary gold standard to compare the clusters obtained from the interaction data. We used the cellular component ontology in GO since it includes more proteins in the protein interaction network than the other ontologies. We have also used collections of protein complexes in yeast that have been culled from the literature and cataloged in the MIPS yeast genome database [22], as well as a hand-curated reference complex set called CYC2008 [23].
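The two summary statistics can be checked from the reported node and edge counts. The sketch assumes an undirected network without self-loops or duplicate edges, so each interaction contributes to the degree of two proteins and the number of possible connections is N(N-1)/2.

```python
n_proteins = 1004
n_interactions = 8319

# Each undirected interaction adds one to the degree of both endpoints.
average_degree = 2 * n_interactions / n_proteins

# Number of possible connections among n_proteins nodes.
possible_connections = n_proteins * (n_proteins - 1) // 2
density = n_interactions / possible_connections
```

Under these assumptions the average degree comes out at about 16.57 and the density at about 0.0165, consistent with the figures of 16.57 and 0.016 quoted above.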
Clustering comparisons
Table 1 Comparison of clustering algorithms on the Collins network. The populations in our method are initialized using spectral and random clusterings

| Method | #Cls | CYC2008 R | CYC2008 P | CYC2008 F-measure | MIPS R | MIPS P | MIPS F-measure | Discard |
|---|---|---|---|---|---|---|---|---|
| MCODE | 54 | 0.66 | 0.59 | 0.63 | 0.27 | 0.48 | 0.35 | 40 % |
| MCL | 75 | 0.65 | 0.45 | 0.54 | 0.27 | 0.34 | 0.30 | 19 % |
| ClusterOne | 114 | 0.55 | 0.43 | 0.49 | 0.20 | 0.34 | 0.25 | 18 % |
| Our method using spectral initialization | | | | | | | | |
| 1) Density cut | 162 | 0.74 | 0.60 | 0.66 | 0.32 | 0.45 | 0.37 | 14 % |
| 2) Maxmin cut | 180 | 0.71 | 0.47 | 0.60 | 0.38 | 0.40 | 0.39 | 15 % |
| 3) Normalized cut | 193 | 0.67 | 0.50 | 0.57 | 0.39 | 0.37 | 0.37 | 20 % |
| 4) Ratio cut | 161 | 0.73 | 0.38 | 0.50 | 0.39 | 0.33 | 0.36 | 17 % |
| Our method using random initialization | | | | | | | | |
| 5) Density cut | 164 | 0.72 | 0.54 | 0.62 | 0.30 | 0.41 | 0.35 | 18 % |
| 6) Maxmin cut | 162 | 0.71 | 0.45 | 0.56 | 0.40 | 0.35 | 0.38 | 17 % |
| 7) Normalized cut | 138 | 0.66 | 0.57 | 0.61 | 0.36 | 0.44 | 0.41 | 19 % |
| 8) Ratio cut | 154 | 0.61 | 0.55 | 0.58 | 0.34 | 0.43 | 0.38 | 18 % |
A study by Brohee and van Helden [24] that compared these algorithms (among others) showed that the MCODE and MCL algorithms, in particular, were very effective in identifying protein complexes from protein interaction networks. We investigated the performance of our method when compared to these two algorithms. In addition, we also investigated the performance of one of the recent algorithms for clustering (the ClusterOne algorithm). In short, we used the MCODE, MCL, and ClusterOne algorithms to extract clusters from the yeast Collins network.
Our clustering algorithm (Algorithm 1) based on initial spectral clustering with density cut as the objective function (version 1) has the lowest discard ratio (14 %) of all the approaches; a low discard ratio indicates that a high proportion of the proteins in the considered protein network are clustered. On the other hand, the MCODE algorithm has the highest discard ratio (40 %) because it searches for high-density clusters only. Version 1 of our algorithm also yields a high precision value with CYC2008 and a high recall value (most complexes formed by the proteins under study overlap well with the clusters computed from the protein network). MCODE has similar results, but with one major drawback: not all the proteins in the network are clustered, as illustrated by its high discard value. Our clustering algorithm outperforms the MCODE algorithm by a significant margin in terms of discard ratio (14 % versus 40 %) and recall (0.74 versus 0.66 on CYC2008). In addition, our algorithm with different objective functions and initializations (versions 1–8) discovers the most clusters, while MCODE predicts the fewest; the other approaches, MCL and ClusterOne, predict fewer clusters than our method and more clusters than MCODE, as illustrated in Table 1.
In comparison with the MCL and ClusterOne algorithms, our algorithm exhibits better correspondence with the complex catalog in the CYC2008 data set, and has higher recall and precision levels than those attained by MCL and ClusterOne.
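The discard ratio discussed above can be read as the share of network proteins that end up in no cluster. The sketch below uses that reading, which is an assumption since no formula is given in this text; the function name and the toy data are hypothetical.

```python
def discard_ratio(network_proteins, clusters):
    """Fraction of network proteins that are not a member of any cluster."""
    proteins = set(network_proteins)
    clustered = set()
    for cluster in clusters:
        clustered |= set(cluster)
    return len(proteins - clustered) / len(proteins)


# Toy example: 10 proteins, two overlapping clusters covering 8 of them.
proteins = [f"P{i}" for i in range(10)]
clusters = [{"P0", "P1", "P2", "P3", "P4"}, {"P4", "P5", "P6", "P7"}]

ratio = discard_ratio(proteins, clusters)  # P8 and P9 are discarded
```

Overlapping membership does not change the ratio, because a protein counts as clustered as soon as it appears in at least one cluster.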
Clustering quality
Table 2 A few of the clusters in the Collins network with the lowest p-values with GO components

| # | Size | GO ID | GO term | p-value | N% |
|---|---|---|---|---|---|
| 1 | 17 | GO:0030880 | RNA polymerase complex | 3.30986E-39 | 100.0 % |
| 2 | 8 | GO:0044428 | Nuclear part | 3.70274E-05 | 100.0 % |
| 3 | 7 | GO:0030126 | COPI vesicle coat | 1.37069E-21 | 100.0 % |
| 4 | 14 | GO:0044428 | Nuclear part | 7.23152E-10 | 100.0 % |
| 5 | 27 | GO:0005739 | Mitochondrion | 9.82318E-22 | 100.0 % |
| 7 | 18 | GO:0000502 | Proteasome complex | 1.76807E-40 | 100.0 % |
| 8 | 12 | GO:0005634 | Nucleus | 3.90352E-06 | 100.0 % |
| 9 | 7 | GO:0030008 | TRAPP complex | 1.02802E-20 | 100.0 % |
| 11 | 21 | GO:0005634 | Nucleus | 2.04087E-10 | 100.0 % |
| 12 | 10 | GO:0044425 | Membrane part | 4.18992E-10 | 100.0 % |
| 13 | 5 | GO:0035097 | Histone methyltransferase complex | 1.31389E-11 | 100.0 % |
| 14 | 5 | GO:0030126 | COPI vesicle coat | 1.18247E-14 | 100.0 % |
| 15 | 9 | GO:0016585 | Chromatin remodeling complex | 2.37606E-17 | 100.0 % |
| 16 | 15 | GO:0000502 | Proteasome complex | 2.20275E-33 | 100.0 % |
| 17 | 13 | GO:0043189 | Histone acetyltransferase complex | 1.21627E-39 | 100.0 % |
| 20 | 12 | GO:0016514 | SWI/SNF complex | 4.98150E-37 | 100.0 % |
| 21 | 60 | GO:0005634 | Nucleus | 2.15384E-32 | 100.0 % |
| 22 | 81 | GO:0043227 | Membrane-bound organelle | 4.87516E-23 | 100.0 % |
| 24 | 63 | GO:0044464 | Cell part | 3.42642E-05 | 98.4 % |
| 23 | 4 | GO:0031011 | INO80 complex | 4.13601E-07 | 75.0 % |
Conclusion
In this paper, we proposed a robust approach for identifying protein complexes in PPI networks. The approach uses a genetic algorithm (GA) to address the complex and heterogeneous nature of clusterings in protein networks. We designed a new objective function that maximizes intracluster cohesion and minimizes intercluster coupling. Experimental results have shown that our objective function performs better than other objective functions proposed in the literature to identify overlapping clusters in PPI networks. In general, our clustering approach is found to be more accurate and consistent than existing methods (i.e., MCL, ClusterOne, and MCODE) when compared with two reference sets, MIPS and CYC2008, using the Collins network.
In conclusion, our approach outperformed competing approaches and is capable of effectively detecting both dense and sparsely connected biologically relevant protein complexes with fewer discards.
Declarations
Acknowledgements
The authors wish to acknowledge King Fahd University of Petroleum and Minerals (KFUPM) for the use of its various facilities in carrying out this research. Many thanks are due to the anonymous referees for their detailed and helpful comments.
Declarations
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 7, 2016: Selected articles from the 12th Annual Biotechnology and Bioinformatics Symposium: bioinformatics. The full contents of the supplement are available online at <https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-7>.
Funding
Publication cost of this article was personally funded by the authors.
Availability of data and materials
The source code and data are freely available at http://faculty.kfupm.edu.sa/ics/eramadan/GACluster.zip.
Authors’ contributions
ER designed and directed the research. He also drafted the manuscript. AN carried out the study, developed and implemented the methodology. MA participated in design and discussion of the research, and helped to draft the manuscript. All authors read and approved the final manuscript.
Competing interests
All authors declare they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 1. Hartwell L, Hopfield J, Murray A. From molecular to modular cell biology. Nature. 1999; 402:47–52.
 2. Bader G, Hogue C. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003; 4(2):27.
 3. Bader J. Greedily building protein networks with confidence. Bioinformatics. 2003; 19(15):1869–74.
 4. Asthana S, et al. Predicting protein complex membership using probabilistic network reliability. Genome Res. 2004; 14(6):1170–5.
 5. Spirin V, Mirny L. Protein complexes and functional modules in molecular networks. Proc Nat Acad Sci. 2003; 100:12123–8.
 6. Przulj N, Wigle D, Jurisica I. Functional topology in a network of protein interactions. Bioinformatics. 2004; 20(3):340–8.
 7. King A, Przulj N, Jurisica I. Protein complex prediction via cost-based clustering. Bioinformatics. 2004; 20(17):3013–20.
 8. Enright A, Dongen SV, Ouzounis C. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002; 30(7):1575–84.
 9. Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005; 435:814–8.
 10. Pereira-Leal J, Enright A, Ouzounis C. Detection of functional modules from protein interaction networks. Proteins. 2004; 54(1):49–57.
 11. Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012; 9(5):471–2.
 12. Ramadan E, Osgood C, Pothen A. Discovering overlapping modules and bridge proteins in proteomic networks. Proc. ACM Int’l Conf. Bioinformatics and Computational Biology (BCB ’10). 2010;366–9.
 13. Mukhopadhyay A, Ray S, De M. Detecting protein complexes in a PPI network: a gene ontology based multiobjective evolutionary approach. Mol BioSyst. 2012; 8(11):3036–48.
 14. Deb K, et al. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput. 2002; 6:182–97.
 15. Holland JH. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence: U Michigan Press; 1975.
 16. Goldberg DE, et al. Genetic Algorithms in Search, Optimization, and Machine Learning vol. 412: Addison-Wesley Reading Menlo Park; 1989.
 17. Ramadan E, Osgood C, Pothen A. The architecture of a proteomic network in the yeast. Lect Notes Biomath. 2005; 3695:265–76.
 18. Ding C, et al. A MinMaxCut spectral method for data clustering and graph partitioning. Proc. IEEE Int’l Conf. Data Mining. 2001;107–14.
 19. Tan P, Steinbach M, Kumar V. Introduction to Data Mining: Pearson Addison Wesley; 2006.
 20. Reguly T, et al. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol. 2006; 5(4):11.
 21. Consortium TGO. GO: The Gene Ontology database and information resource. 2004. http://www.geneontology.org.
 22. Mewes H, et al. MIPS: a database for genomes and protein sequences. 2002. http://mips.gsf.de.
 23. Pu S, et al. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009; 37(3):825–31.
 24. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006; 7(1):488. doi:10.1186/1471-2105-7-488.
 25. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. GO::TermFinder - open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics. 2004; 20(18):3710–5.