  • Methodology article
  • Open access

MAE-FMD: Multi-agent evolutionary method for functional module detection in protein-protein interaction networks

Abstract

Background

Studies of functional modules in a Protein-Protein Interaction (PPI) network contribute greatly to the understanding of biological mechanisms. With the development of computer science, computational approaches have played an important role in detecting functional modules.

Results

We present a new approach that uses multi-agent evolution to detect functional modules in PPI networks. The proposed approach consists of two stages: solution construction for the agents in a population, and the evolutionary process of computational agents in a lattice environment, where each agent corresponds to a candidate solution to the functional-module detection problem in a PPI network. First, the approach uses a connection-based encoding scheme to model an agent, and employs a random-walk behavior that merges topological characteristics with functional information to construct a solution. Next, it applies several evolutionary operators, i.e., competition, crossover, and mutation, to realize information exchange among agents as well as solution evolution. Systematic experiments have been conducted on three benchmark testing sets of yeast networks. Experimental results show that the approach is more effective than several other existing algorithms.

Conclusions

The algorithm offers outstanding recall, F-measure, sensitivity, and accuracy while remaining competitive on other measures, so it can be applied to biological studies that require high accuracy.

Background

With the completion of the sequencing of the human genome, proteomic research has become one of the most important areas in the life sciences [1]. Proteomics is the systematic study of the diverse properties of proteins to provide detailed descriptions of the structure, function, and control of biological systems in health and disease [2]; analyzing the underlying relationships in protein data can considerably expand our insight into the roles of proteins in biological processes. That is, protein-protein interactions (PPIs) provide a good opportunity to systematically analyze the structure of a large living system and to understand its essential principles. Therefore, the analysis of PPI networks naturally serves as the basis for a better understanding of cellular organization, processes, and functions [3]. Biologists have found that cellular functions and biochemical events are coordinately carried out by groups of proteins interacting with each other in functional modules (or complexes), and that the modular structure of a complex network is critical to its function; identifying such functional modules (or complexes) in PPI networks is therefore very important for understanding the structures and functions of these fundamental cellular networks. In the last decade, biological experimental methods, e.g., tandem affinity purification with mass spectrometry [4, 5] and the protein-fragment complementation assay (PCA) [6], have been used to detect functional modules in PPI networks. However, these experimental methods have several limitations: they involve many processing steps and are very time-consuming, especially when dealing with a large-scale and densely connected PPI network. Therefore, computational approaches based on machine learning and data mining have been designed and have become useful complements to the experimental methods.
Over the last decade, a variety of classic clustering approaches, such as density-based clustering [7–9], hierarchical clustering [10–12], partition-based clustering [13–15], and flow simulation-based clustering [16–18], have been used to identify functional modules in PPI networks. In recent years, a number of new approaches have also emerged [19–21], which employ novel computational models to identify functional modules in a PPI network. In particular, some nature-inspired swarm intelligence algorithms have recently been applied to the detection of functional modules in PPI networks [22–25]. Although using computational approaches to detect protein functional modules in PPI networks has received considerable attention and researchers have proposed many detection ideas and schemes over the past few years [1], how to efficiently identify functional modules by means of novel computational approaches remains a vital and challenging scientific problem in computational biology.

Agent-based methods have previously been applied to certain search and optimization problems [26, 27]. In such methods, an agent is a computational entity that resides in and reacts to its local environment. During the process of interacting with its environment and companion agents, each agent increases its energy level as much as possible, so that the multi-agent evolution can achieve the ultimate goal of solving a global optimization problem. As another example of nature-inspired methods, multi-agent evolution has shown promise in producing low-cost, fast, and reasonably accurate solutions to certain computational problems, such as classification [28], clustering [29, 30], and social network community mining [31]. These encouraging applications are a significant motivation for our research; thus, in this paper we propose a novel multi-agent evolutionary method to detect functional modules in PPI networks, called MAE-FMD. Based on a probability model, MAE-FMD first employs a group of agents as a population to carry out random walks from a start protein to other proteins in a PPI network and to finish their individual solution encodings. Then, it randomly places these agents into an evolutionary environment modeled as a lattice, and performs agent-based operations, i.e., competition, cooperation, and mutation, in an attempt to increase the energy levels of the agents at each iteration. Experimental results and related comparisons show that MAE-FMD achieves better functional-module mining results.

Method

Basic ideas

In this section, we describe a global search algorithm based on a multi-agent evolutionary method for functional module detection, which consists of two phases: (1) the solution construction phase, and (2) the solution evolution phase. In the first phase, each agent traverses all the nodes of a PPI network through a random-walk process and forms its own solution. In the second phase, the population of agents (i.e., all solutions) is randomly placed into an evolutionary environment for iterative evolution until a predefined termination criterion is satisfied. During the evolution, an energy level is employed to evaluate the ability of an agent to solve a problem in the multi-agent system. The higher the energy level of an agent, the better the quality of the corresponding solution.

Agent representation and its construction

In the MAE-FMD algorithm, each agent corresponds to a candidate solution. An agent is encoded as a graph with N directed edges: A = {(1→a_1), (2→a_2), …, (i→a_i), …, (N→a_N)}, where i is a node label, a_i denotes the node connected from the ith node in the represented solution, and N is the number of nodes in the PPI network. Take the PPI network shown in Figure 1(a) as an example. It consists of eight nodes numbered from 1 to 8. Figure 1(b) gives the encoding of its corresponding agent, which can be translated into the graph structure given in Figure 1(c), where each connected component provides a group of nodes, corresponding to the same partition of the network as shown in Figure 1(a).

Figure 1. The connection-based encoding of an agent. (a) PPI network; (b) Encoding of an agent; (c) Represented solution.

To obtain a feasible solution, an agent proceeds from a start node and continuously employs a random-walk behavior to traverse the other nodes in a PPI network. At each time step, the agent is on a node, tries to move to a functionally related or similar node that is chosen probabilistically from its topologically adjacent nodes, and builds the corresponding connection. When there is no satisfying node, the agent ends its current traversal by pointing to itself, then randomly selects an untraversed node in the PPI network and begins a new traversal. This random-walk behavior is performed until all nodes have been processed. Thereafter, the agent has formed its solution. A main advantage of this scheme is that the number K of clusters is automatically determined by the number of components obtained by an agent; namely, nodes with a connected relationship are automatically classified into the same community during a later decoding process. Obviously, such an encoding method does not rely on knowing the number of clusters beforehand.

During the random-walk process, an agent constructs a solution by proceeding from a start node and moving to feasible neighborhood nodes in a step-by-step fashion. In each step, an agent k moves from node i to node j based on the following probability:

$$p_{ij}^{k}=\begin{cases}\dfrac{s_{i,j}+f_{i,j}}{\sum_{l\in U_i^k}\left(s_{i,l}+f_{i,l}\right)}, & \text{if } j\in U_i^k,\\[2mm] 0, & \text{otherwise,}\end{cases}$$
(1)

where $s_{i,j}$ denotes a measure of the connection strength between two nodes i and j from the viewpoint of topological structure, $f_{i,j}$ is the functional similarity score of the two nodes i and j, and $U_i^k$ is the set of available nodes, in which each node l (or j) is a neighbor of node i that has not yet been visited by the kth agent in the current traversal and satisfies $(s_{i,j}+f_{i,j})\ge\varepsilon$ (ε represents a specified strength threshold for the combination of topological and functional similarities).
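For illustration, one step of this probabilistic walk can be sketched as follows (a minimal Python sketch; the names `adjacency`, `strength`, `func_sim`, `visited`, and `eps` are our own, not taken from the paper's implementation):

```python
import random

def next_node(i, adjacency, strength, func_sim, visited, eps, rng=random):
    """Pick the next node j from i's unvisited neighbors with probability
    proportional to s_ij + f_ij (Eq. 1); return None when no neighbor
    passes the strength threshold eps, i.e., the traversal ends."""
    candidates = [j for j in adjacency[i]
                  if j not in visited and strength[i, j] + func_sim[i, j] >= eps]
    if not candidates:
        return None                      # breakpoint: end of the current traversal
    weights = [strength[i, j] + func_sim[i, j] for j in candidates]
    r = rng.random() * sum(weights)      # roulette-wheel selection
    acc = 0.0
    for j, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return j
    return candidates[-1]
```

Calling this repeatedly, recording each chosen link, and restarting from a random unvisited node whenever it returns None reproduces the solution-construction loop described above.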

Given two nodes i,jV, we compute their connection strength by using the structural similarity formula as follows [32]:

$$s_{i,j}=\frac{|\Gamma(i)\cap\Gamma(j)|}{|\Gamma(i)\cup\Gamma(j)|},$$
(2)

where Γ(i) is a set of the neighborhood nodes of node i, and |Γ(i)| is the size of the set.

Based on the annotation information of the Gene Ontology (GO), a functional similarity measure for proteins can be implemented. For two proteins i and j that are annotated with the GO term sets g_i and g_j, respectively, the functional similarity score can be calculated by [33]:

$$f_{i,j}=\frac{|g_i\cap g_j|}{|g_i\cup g_j|}.$$
(3)
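Both similarity scores are set-overlap ratios over sets, so a single helper suffices to illustrate them (an illustrative sketch, not the reference implementation):

```python
def jaccard(a, b):
    """Set-overlap score |a ∩ b| / |a ∪ b|, usable for the topological
    strength s_ij over neighbor sets and for the functional score f_ij
    over GO-term sets (Eqs. 2-3)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0          # convention for two empty annotation sets
    return len(a & b) / len(a | b)
```

For example, two proteins sharing two of four distinct GO terms score 0.5.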

Agent energy level and evolutionary environment

According to the meaning of energy level mentioned above, we are interested in searching for a graph partition with the largest energy level. To guarantee highly intra-connected and sparsely inter-connected modules, we adopt the modularity density function [34] to compute the energy level of an agent:

$$Energy(A)=\sum_{c=1}^{K}\left[\frac{e_c}{|E|}-\left(\frac{d_c}{2|E|}\right)^{2}\right],$$
(4)

where K is the number of modules detected by an agent A, e_c is the number of links between nodes in the cth module, |E| is the number of all links in the PPI network, and d_c is the sum of the degrees of the nodes in the cth module. During each evolutionary process, an agent tries to increase its energy level as much as possible by sensing its environment and performing reactive behaviors in order to survive.
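As a sketch of Equation 4, the energy of a partition can be computed as follows (illustrative Python with an edge-list graph representation of our own choosing):

```python
def energy(modules, edges):
    """Energy (Eq. 4) of a partition: sum over modules of
    e_c/|E| - (d_c/(2|E|))**2, with modules given as node groups
    and edges as an undirected edge list."""
    E = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    total = 0.0
    for module in modules:
        members = set(module)
        e_c = sum(1 for u, v in edges if u in members and v in members)  # intra-module links
        d_c = sum(degree.get(u, 0) for u in members)                     # total degree of the module
        total += e_c / E - (d_c / (2 * E)) ** 2
    return total
```

For a triangle plus a disjoint edge, the natural two-module partition scores 0.375, higher than any split that cuts the triangle.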

To realize the local perceptivity of agents, we select the common lattice structure used in [27, 29, 30] as the evolutionary environment, which is closer to the real evolutionary mechanism in nature than the population model of traditional Genetic Algorithms (GAs). All M agents in a population live in such a lattice environment. The size of the lattice is m×m, where m is an integer and $m=\sqrt{M}$. Each agent is randomly placed on a lattice point and can only interact with its neighbors. The agent lattice is shown in Figure 2. Each agent, which corresponds to a partition solution, occupies a circle in the evolutionary environment, where the data in a circle represent its position in the lattice structure, and two agents can interact with each other if and only if there is a line connecting them.

Figure 2. The lattice environment for agent evolution.

Suppose that the agent located at (u,v) is $A_{u,v}$, u,v = 1,2,…,m; then the neighborhood agents of $A_{u,v}$, Neighbor($A_{u,v}$), are defined as follows:

$$Neighbor(A_{u,v})=\{A_{u',v},\,A_{u,v'},\,A_{u'',v},\,A_{u,v''}\},$$
(5)

where $u'=\mathrm{mod}(u-1+m-1,m)+1$, $v'=\mathrm{mod}(v-1+m-1,m)+1$, $u''=\mathrm{mod}(u,m)+1$, $v''=\mathrm{mod}(v,m)+1$.
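The wrap-around neighborhood of Equation 5 can be sketched as follows (illustrative Python with 1-indexed coordinates as in the paper):

```python
def neighbors(u, v, m):
    """Four toroidal neighbors of lattice point (u, v) on an m x m
    lattice (Eq. 5), listed as (u', v), (u, v'), (u'', v), (u, v'')."""
    u1 = (u - 1 + m - 1) % m + 1   # u' : wraps from row 1 back to row m
    v1 = (v - 1 + m - 1) % m + 1   # v'
    u2 = u % m + 1                 # u'' : wraps from row m back to row 1
    v2 = v % m + 1                 # v''
    return [(u1, v), (u, v1), (u2, v), (u, v2)]
```

The modular arithmetic makes the lattice a torus, so corner agents also have exactly four neighbors.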

Evolutionary operators

In the above evolutionary environment, computational agents compete or cooperate with others so that they can gain higher energy levels. To simulate the evolutionary phenomenon in a more natural way, each agent can only sense its local environment, and its competition and cooperation behaviors can only take place between the agent and its neighborhood agents. That is, an agent interacts with its neighborhood agents, and useful information is transferred among them. In this way, information gradually diffuses through the whole lattice environment, so that the global evolution of the agent population is realized. To achieve this purpose, three basic operators are designed for detecting communities in a PPI network.

1) Competition operator. Suppose that the operator is performed on the agent located at (u,v), $A_{u,v}$ = ((1→a_1),(2→a_2),…,(N→a_N)), and that $H_{u,v}$ = ((1→h_1),(2→h_2),…,(N→h_N)) is the agent with the highest energy level among the neighborhood agents of $A_{u,v}$; namely, $H_{u,v}$ ∈ Neighbor($A_{u,v}$) and, for every A ∈ Neighbor($A_{u,v}$), Energy(A) ≤ Energy($H_{u,v}$). If Energy($A_{u,v}$) ≥ Energy($H_{u,v}$), $A_{u,v}$ is the winner, so it continues to live in its original lattice point; otherwise it dies as the loser, and its lattice point is occupied by $H_{u,v}$. $H_{u,v}$ has two candidate strategies to occupy a lattice point, and it selects one of them with a probability p_o. Let r(0,1) be a uniform random number generator whose values lie in (0,1). If r(0,1) < p_o, occupying strategy 1 is selected; otherwise occupying strategy 2 is carried out. In both occupying strategies, $H_{u,v}$ first generates its clone agent $C_{u,v}$ = ((1→c_1),(2→c_2),…,(N→c_N)), and then $C_{u,v}$ is placed on the lattice point to be occupied.

Let $s_{i,a_i}+f_{i,a_i}=Al_i$ and $s_{i,h_i}+f_{i,h_i}=Hl_i$, i = 1,2,…,N; namely, the connection strengths of $A_{u,v}$ are $Al_1, Al_2, \ldots, Al_N$, and the connection strengths of $H_{u,v}$ are $Hl_1, Hl_2, \ldots, Hl_N$, respectively. If a node points to itself rather than to any other node, we call it a breakpoint. In fact, a breakpoint represents the boundary between two different modules in a solution with N directed edges. To distinguish breakpoints, we assign $s_{i,a_i}$ a minimal sentinel value only when i = a_i in an agent encoding.

Strategy 1. For the connection with the lowest strength in $H_{u,v}$, $Hl_j=\mathrm{Min}(Hl_1,Hl_2,\ldots,Hl_N)$: if $Al_j>Hl_j$, then c_j is replaced with a_j in the new agent.

Strategy 2. Each $Al_i$ of $A_{u,v}$ is compared with the corresponding $Hl_i$ of $H_{u,v}$. If $Al_i>Hl_i$, then $c_i=a_i$ in the new agent.

In the following, we take a PPI network with 8 nodes as an example to illustrate these operators. A schematic diagram of the competition operator is given in Figure 3, where A = ((1→6),(2→2),(3→7),(4→8),(5→5),(6→5),(7→2),(8→8)) is an agent participating in a competition, H = ((1→1),(2→4),(3→7),(4→8),(5→5),(6→5),(7→1),(8→8)) is its neighborhood agent with the highest energy level, Energy(H) ≥ Energy(A), and A1 and A2 are two new agents produced by the competition operator, in which a shape marks a change in the encoding of the clone of H. Assuming that $Al_1>Hl_1$, $Al_7>Hl_7$ and $Hl_1=\mathrm{Min}(Hl_1,Hl_2,\ldots,Hl_8)$, A1 is the result of Strategy 1, where the link (1→1) is replaced with (1→6), while A2 is that of Strategy 2, where the two links (1→1) and (7→1) are replaced with (1→6) and (7→2), respectively.

Figure 3. Competition operator.

In fact, the two strategies in this operator are designed to play complementary roles: Strategy 1 only replaces the worst connection of the winner with the better information of the loser, while Strategy 2 favors preserving all of the loser's advantageous information.
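The two occupying strategies can be sketched as follows (illustrative Python; the list encodings and the strength arrays are hypothetical inputs of our own design, not the paper's data structures):

```python
def compete(loser, winner, winner_strengths, loser_strengths, strategy):
    """Build the clone that occupies the loser's lattice point.
    `loser`/`winner` are encoding lists where index i holds a_i (resp. h_i);
    *_strengths[i] holds s_{i,a_i}+f_{i,a_i} (resp. s_{i,h_i}+f_{i,h_i})."""
    clone = list(winner)
    if strategy == 1:
        # Strategy 1: only the winner's weakest connection may be replaced.
        j = min(range(len(winner)), key=lambda i: winner_strengths[i])
        if loser_strengths[j] > winner_strengths[j]:
            clone[j] = loser[j]
    else:
        # Strategy 2: keep every connection where the loser is stronger.
        for i in range(len(winner)):
            if loser_strengths[i] > winner_strengths[i]:
                clone[i] = loser[i]
    return clone
```

With the Figure 3 example (hypothetical strength values chosen so that $Al_1>Hl_1$, $Al_7>Hl_7$ and $Hl_1$ is minimal), Strategy 1 yields A1 and Strategy 2 yields A2.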

2) Crossover operator. Suppose that two parent agents are F1 = ((1→f_1),(2→f_2),…,(N→f_N)) and F2 = ((1→f′_1),(2→f′_2),…,(N→f′_N)), which randomly produce a child agent C1 = ((1→c_1),(2→c_2),…,(N→c_N)) by making use of their connection information and the corresponding crossover strategies. To obtain offspring of the two parent agents, the rules of the crossover operator are as follows.

Alternating link crossover rule. The rule works as follows: first, it chooses a link from the first parent at random; second, the link is extended with the appropriate link of the second parent; third, the partial tour created in this way is extended with the appropriate link of the first parent; and so on. This process is repeated until all the nodes in the PPI network have been traversed. During the generation of a candidate agent, once a chosen link would introduce a cycle into the partial tour, the next link is instead selected randomly from the links of the untraversed nodes in the corresponding parent.

The schematic diagram of the alternating link crossover rule is shown in Figure 4, where F1 = ((1→6),(2→8),(3→7),(4→4),(5→5),(6→5),(7→4),(8→8)) and F2 = ((1→6),(2→2),(3→7),(4→8),(5→5),(6→5),(7→7),(8→8)) are two parent agents, and C1 = ((1→6),(2→2),(3→7),(4→8),(5→5),(6→5),(7→4),(8→8)) is an offspring agent. In generating the candidate agent, the links (1→6),(5→5),(7→4) and (8→8) are selected from the first parent while the other links are taken from the second parent, and each shape represents the starting point of a new subtour in the candidate agent.
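A simplified sketch of this rule is given below (illustrative Python; we represent an encoding as a dict {node: successor} and treat any link back into already-traversed nodes as closing the current subtour, which is one possible reading of the cycle rule, not the paper's exact procedure):

```python
import random

def alternating_link_crossover(p1, p2, rng=random):
    """Take links from the two parents in turn; when the chosen link
    would revisit a traversed node (a cycle guard), close the subtour
    and restart from a random untraversed node."""
    parents = [p1, p2]
    child = {}
    unvisited = set(p1)
    node = min(unvisited)          # arbitrary deterministic start
    turn = 0
    while unvisited:
        unvisited.discard(node)
        succ = parents[turn][node]
        if succ == node or succ in child:     # breakpoint or would revisit
            child[node] = node                # close the current subtour
            if not unvisited:
                break
            node = rng.choice(sorted(unvisited))
        else:
            child[node] = succ
            node = succ
        turn = 1 - turn
    return child
```

Every node ends up with a successor drawn from one of the two parents (or itself, when a subtour is closed), so the child decodes into a valid partition.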

Figure 4. Alternating link crossover operator.

Alternating chunk crossover rule. Based on this rule, an offspring is constructed from two parent agents as follows: first, a subtour of random length is taken from the first parent; then this partial tour is extended by choosing a subtour of random length from the second parent; next, the partial tour is repeatedly extended by taking subtours from alternating parents until the full length of the solution is reached. For each subtour, the random length ranges from 1 to the number of remaining digits of the solution under construction. In generating a candidate agent, if a chosen link would introduce a cycle into the partial tour, the next link is selected randomly from the links of the untraversed nodes in the corresponding parent. Subtours of different lengths from the two parent agents are thus alternately chosen to construct a child agent.

Figure 5 gives an illustrative diagram of the alternating chunk crossover rule, where F1 = ((1→5),(2→5),(3→8),(4→4),(5→5),(6→6),(7→4),(8→6)) and F2 = ((1→2),(2→3),(3→3),(4→7),(5→5),(6→6),(7→5),(8→6)) are two parent agents, and C1 = ((1→5),(2→5),(3→8),(4→4),(5→5),(6→6),(7→5),(8→6)) is an offspring agent. In generating the candidate agent, we assume that the sizes of four chunks are determined as 3, 2, 2 and 1 by four random draws, and that chunks 1 and 3 are selected from the first parent while the other two chunks come from the second parent. The new agent is alternately constructed from the subtours of the different parents, where the shapes have the same meaning as in Figure 4.
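The chunk alternation itself can be sketched as follows (illustrative Python over 0-indexed list encodings; the cycle-repair step described above is omitted for brevity and is an acknowledged simplification):

```python
import random

def alternating_chunk_crossover(p1, p2, rng=random):
    """Copy chunks of random length (1 .. remaining digits) from the
    two parents in turn until the child reaches full length."""
    n = len(p1)
    parents = [p1, p2]
    child = []
    turn = 0
    while len(child) < n:
        size = rng.randint(1, n - len(child))           # random chunk length
        child.extend(parents[turn][len(child):len(child) + size])
        turn = 1 - turn                                 # alternate parents
    return child
```

Each position of the child is therefore inherited from exactly one parent, chunk by chunk.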

Figure 5. Alternating chunk crossover operator.

Obviously, the crossover operator performs a random search, which is carried out between an agent and its neighborhood agents to achieve cooperation with a crossover probability p_c. More specifically, if r(0,1) < p_c, the algorithm performs the crossover rules; otherwise, it skips them. Once a child agent attains a higher energy level than its parent agent after crossover, the initial agent with the lower energy level is replaced by the child agent.

3) Self-adaptive mutation operator. In addition to the behaviors of competition and cooperation, an agent can also increase its energy level by using a self-adaptive mutation operator, which depends on the degree of the agent's evolution and controls the number of digits to be mutated. The mechanism of the self-adaptive mutation operator is defined as:

$$n=\mathrm{ceil}\left(\frac{N\cdot l_i}{2\cdot r}\right),$$
(6)

where n is the number of mutation digits, l_i is the number of consecutive stagnation steps of the ith agent, and r is the maximum step length over which an agent may keep the same energy level. Note that n is associated not only with the encoding length of an agent (the network size) but also with the agent's evolutionary progress. More specifically, the larger the network, the larger the number of potential mutations; likewise, the longer an agent's evolution stagnates, the larger the number of potential mutations.

Based on a mutation probability p_m, n connection elements of an agent A = ((1→a_1),(2→a_2),…,(N→a_N)) are randomly selected when r(0,1) < p_m, and they are then mutated by replacing them with other nodes that may be connected within the corresponding module.
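A sketch of this operator is given below (illustrative Python; the reading of Equation 6 as n = ceil(N·l_i/(2r)) and the per-node candidate lists `candidates[i]` are our assumptions, not the paper's exact implementation):

```python
import math
import random

def mutate(agent, stagnation, r_max, p_m, candidates, rng=random):
    """With probability p_m, mutate n = ceil(N * l_i / (2 * r)) randomly
    chosen positions of the encoding, replacing each with another
    admissible connection for that node."""
    if rng.random() >= p_m:
        return agent                                  # no mutation this round
    n = math.ceil(len(agent) * stagnation / (2 * r_max))
    new = list(agent)
    for i in rng.sample(range(len(agent)), min(n, len(agent))):
        new[i] = rng.choice(candidates[i])            # small local perturbation
    return new
```

Longer stagnation (larger l_i) thus perturbs more positions, matching the self-adaptive behavior described above.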

Figure 6 gives an illustrative diagram of the mutation operator, where D = ((1→6),(2→2),(3→7),(4→8),(5→5),(6→5),(7→2),(8→8)) is the original agent, its mutation number is n = 2, M = ((1→6),(2→2),(3→7),(4→2),(5→5),(6→5),(7→4),(8→8)) is the mutated agent in which two elements have been replaced randomly, and the shapes mark the changes in the encoding of the new agent. Essentially, the mutation operator realizes a local search, which only applies a small perturbation to some elements (node connections) of an agent encoding. If a mutation increases the energy level of the current agent, the initial agent with the lower energy level is replaced by the new agent.

Figure 6. Self-adaptive mutation operator.

Guided by the energy level function, the MAE-FMD algorithm employs the competition, crossover and mutation operators to continually exchange useful information among agents and improve the energy levels of the initial agent population. During the competition process, if the current agent is the winner, it is kept alive; otherwise, the neighborhood agent with the highest energy level is selected, improved by combining it with the advantageous information of the current agent, and then replaces the current agent. Meanwhile, whenever a crossover or mutation operator produces a new agent with a higher energy level, the initial agent with the lower energy level is replaced by the new agent. By means of the three operators, the evolutionary process gradually converges to the solution with the largest energy level, which corresponds to the initial module structure of the PPI network.

Post-processing

After a number of iterations, we obtain the solution with the largest energy level. That is, the preliminary modules are generated by the multi-agent evolutionary method. To improve the detection quality, we adopt two post-processing strategies, based on topological and functional information, to produce the final modules. The first step merges similar preliminary modules in light of functional annotation information. A merged module results from two or more preliminary modules that are functionally close. The similarity S(M_S, M_T) between two modules M_S and M_T is measured by the functional similarity score defined as:

$$S(M_S,M_T)=\frac{\sum_{i\in M_S,\,j\in M_T}S(i,j)}{\min(|M_S|,|M_T|)},$$
(7)

where

$$S(i,j)=\begin{cases}1 & \text{if } i=j,\\ f_{ij} & \text{if } i\neq j \text{ and } (i,j)\in E,\\ 0 & \text{otherwise.}\end{cases}$$
(8)

The two modules with the highest similarity are iteratively merged until no two modules have a similarity larger than the merging threshold λ.
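The merging loop can be sketched as follows (illustrative Python; `pair_score` stands in for Equation 7 and is supplied by the caller):

```python
def merge_modules(modules, pair_score, lam):
    """Repeatedly merge the two most similar modules while the best
    pairwise score exceeds the merging threshold `lam`."""
    modules = [set(m) for m in modules]
    while len(modules) > 1:
        score, i, j = max(((pair_score(a, b), i, j)
                           for i, a in enumerate(modules)
                           for j, b in enumerate(modules) if i < j),
                          key=lambda t: t[0])
        if score <= lam:
            break                    # no pair is similar enough to merge
        modules[i] |= modules[j]     # absorb module j into module i
        del modules[j]
    return modules
```

Recomputing the best pair after every merge keeps the greedy procedure consistent as module sizes change.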

To exclude too sparsely connected nodes and the very small clusters generated above, we perform a filtering step based on the topological density of PPI network subgraphs. The density of a functional-module subgraph is measured by:

$$D_s=\frac{e_s}{n_s\cdot(n_s-1)/2},$$
(9)

where n_s is the number of nodes and e_s is the number of interactions in a subgraph s of the PPI network. Let δ be a threshold value; the clusters with D_s < δ, as well as those with |s| < 2, are filtered out from the clusters generated above. Through these two post-processing strategies, the preliminary modules are refined using both topological properties and functional similarity, and the potential functional modules hidden in the PPI network are produced.
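The filtering step can be sketched as follows (illustrative Python over an edge-list graph; a sketch under the assumptions noted in the comments, not the reference implementation):

```python
def filter_modules(modules, edges, delta):
    """Drop singleton clusters (|s| < 2) and clusters whose subgraph
    density e_s / (n_s (n_s - 1) / 2) falls below delta (Eq. 9)."""
    edge_set = {frozenset(e) for e in edges}
    kept = []
    for m in modules:
        m = set(m)
        n_s = len(m)
        if n_s < 2:                  # singleton clusters are discarded
            continue
        e_s = sum(1 for e in edge_set if e <= m)   # intra-cluster interactions
        if e_s / (n_s * (n_s - 1) / 2) >= delta:
            kept.append(m)
    return kept
```

With δ = 0.5, a triangle survives (density 1.0) while a three-node chain fragment with a single internal edge (density 1/3) is removed.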

Algorithm description and complexity analysis

The procedure of the proposed MAE-FMD algorithm consists of initialization, agent random walk and solution construction, multi-agent evolution, post-processing, and output of the detected modules. The detailed pseudocode is shown in Algorithm 1.

Based on the description of Algorithm 1, the complexity of MAE-FMD can be analyzed as follows. Let n_1 be the maximum node degree in the PPI network, and let n_2 be the maximum number of nodes in a module. In the initialization process, computing the connection strengths (similarities) and the numbers of common neighbors for all pairs of nodes is the time-consuming part. Since each node has at most n_1 neighbors, computing all of its available connection strengths costs O(n_1), so the time complexity of initialization is O(n_1·N). In the agent random-walk and solution construction process, the time complexity is O(M·N·n_1). In the multi-agent evolution process, the time complexity is O(K·n_2² + T·(M + 4M·N + 4M·N + M·N + K·n_2² + M)) ≈ O(T·(K·n_2² + M·N)). Generally speaking, K·n_2 ≈ N, so O(K·n_2) ≈ O(N); the time complexity of the multi-agent evolution process can thus be simplified to O(T·(n_2+M)·N). In the post-processing and output process, the time complexity is O(K²+K) ≈ O(K²). Thus, the overall complexity of MAE-FMD is about O(n_1·N) + O(M·N·n_1) + O(T·(n_2+M)·N) + O(K²). Because most PPI networks are small-world and scale-free, n_1 ≪ N, n_2 < N, and K ≪ N. Moreover, we usually select a constant (e.g., 100) as the agent population size, which is far less than the number of nodes in a large-scale PPI network. Therefore, the time complexity of MAE-FMD reduces to O(T·(n_2+M)·N) (n_2 < N for all PPI networks, M ≪ N for a large-scale PPI network), which is better than the O(N²) of most existing typical algorithms. Especially for a large-scale complex network with near-uniform community sizes, the efficiency of MAE-FMD is very promising for detecting modules in PPI networks.

Results and discussion

In this section, we use three different protein-protein interaction datasets to perform our empirical study. Using a range of evaluation metrics, we assess the performance of our algorithm and compare our test results with those of other existing algorithms on these PPI datasets. The experimental platform is a PC with a 2.13 GHz Core 2 CPU, 2.99 GB RAM, and Windows XP, and all algorithms are implemented in Java.

PPI datasets

We have performed our experiments on five publicly available benchmark PPI datasets, including four yeast datasets and one human dataset, namely the DIP data [35], Gavin data [36], MIPS data [37], DIP Scere20140703 and DIP Hsapi20140703. Table 1 summarizes the datasets used in our experiments, where the second column gives the web links, the third and fourth columns present the numbers of proteins and interactions in the source data, and the fifth and sixth columns present the numbers of proteins and interactions in the preprocessed data. A cleaning step, which deletes all self-connected and repeated interactions, is performed during data preprocessing. To evaluate the protein modules mined by our algorithm, the set of real functional modules from [38] is selected as the benchmark. This benchmark set, which consists of 428 protein functional modules, is constructed from three main sources: MIPS [37], Aloy et al. [39] and the SGD database [40], based on Gene Ontology (GO) annotations.

Table 1 Data sets used in our experiments

Evaluation metrics

At present, there are three popular families of measurements for evaluating the quality of detected modules and the overall performance of detection methods [41].

Precision, Recall, F-measure, and Coverage

Many research works use a neighborhood affinity score to assess the degree of matching between identified functional modules and real ones. The score NA(p,b) between an identified module p = (V_p, E_p) and a real module b = (V_b, E_b) in the benchmark module set is defined as:

$$NA(p,b)=\frac{|V_p\cap V_b|^{2}}{|V_p|\times|V_b|}.$$
(10)

If NA(p,b) ≥ ω, then p and b are considered to be matched (generally, ω = 0.2). Let P be the set of functional modules identified by a computational method and B be the set of real functional modules in the benchmark. The number of modules in P that match at least one real module is $N_{cp}=|\{p \mid p\in P, \exists b\in B, NA(p,b)\ge\omega\}|$, while the counterpart number in B is $N_{cb}=|\{b \mid b\in B, \exists p\in P, NA(p,b)\ge\omega\}|$. Thus, Precision and Recall can be defined as follows [42]:

$$Precision=\frac{N_{cp}}{|P|},$$
(11)

and

$$Recall=\frac{N_{cb}}{|B|}.$$
(12)

F-measure is the harmonic mean of Precision and Recall, and so can be used to evaluate the overall performance. It is defined as:

$$F=\frac{2\times Precision\times Recall}{Precision+Recall}.$$
(13)
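These matching-based metrics can be computed directly from the module sets (illustrative Python; module membership alone determines NA, so edges are not needed here):

```python
def na(p, b):
    """Neighborhood affinity (Eq. 10) between node sets p and b."""
    inter = len(set(p) & set(b))
    return inter * inter / (len(p) * len(b))

def precision_recall_f(predicted, benchmark, omega=0.2):
    """Precision, Recall, and F-measure (Eqs. 11-13) under the
    NA >= omega matching criterion."""
    n_cp = sum(1 for p in predicted if any(na(p, b) >= omega for b in benchmark))
    n_cb = sum(1 for b in benchmark if any(na(p, b) >= omega for p in predicted))
    precision = n_cp / len(predicted)
    recall = n_cb / len(benchmark)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Note that Precision counts matched predicted modules while Recall counts matched benchmark modules, so the two numerators generally differ.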

Moreover, Coverage assesses how many proteins in a PPI network can be clustered into the detected modules by a computational method. That is, it indicates the percentage of proteins assigned to any functional module, i.e., 1-Discard-rate, which can be defined as follows [43]:

$$Coverage=\frac{\left|\bigcup_{i=1}^{|P|}V_{p_i}\right|}{|V|},$$
(14)

where |V|=N denotes the size of the PPI network and V pi is the set of the proteins in the ith detected module.

Sensitivity, positive predictive value, and accuracy

Sensitivity (S_n), Positive predictive value (PPV) and Accuracy (Acc) are also common measures for assessing the performance of module detection methods. Let T_ij be the number of proteins shared by the ith benchmark module and the jth identified module. Then S_n and PPV can be defined as [38]:

$$S_n=\frac{\sum_{i=1}^{|B|}\max_j\{T_{ij}\}}{\sum_{i=1}^{|B|}N_i},$$
(15)

and

$$PPV=\frac{\sum_{j=1}^{|P|}\max_i\{T_{ij}\}}{\sum_{j=1}^{|P|}T_{.j}},$$
(16)

where N_i is the number of proteins in the ith benchmark module, and $T_{.j}=\sum_{i=1}^{|B|}T_{ij}$. Generally speaking, S_n assesses how many proteins in the real functional modules are covered by the predicted modules, while PPV indicates how likely the proteins in the identified modules are to be true positives.

As a general metric, the accuracy of an identification (Acc) can be calculated as the geometric mean of S n and PPV:

$$Acc=(S_n\times PPV)^{1/2}.$$
(17)
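These three measures derive from a single overlap matrix, which can be sketched as follows (illustrative Python):

```python
def sn_ppv_acc(benchmark, predicted):
    """Sensitivity, PPV, and Accuracy (Eqs. 15-17) from the overlap
    matrix T_ij = |benchmark_i ∩ predicted_j|."""
    T = [[len(set(b) & set(p)) for p in predicted] for b in benchmark]
    # S_n: best match per benchmark module over total benchmark proteins
    sn = sum(max(row) for row in T) / sum(len(b) for b in benchmark)
    # PPV: best match per predicted module over its total assigned overlap
    col_max = [max(T[i][j] for i in range(len(benchmark))) for j in range(len(predicted))]
    col_sum = [sum(T[i][j] for i in range(len(benchmark))) for j in range(len(predicted))]
    ppv = sum(col_max) / sum(col_sum)
    return sn, ppv, (sn * ppv) ** 0.5   # Acc is the geometric mean
```

Row maxima drive S_n and column maxima drive PPV, which is why the two measures penalize fragmentation and over-merging differently.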

p-value measure

Modules can be statistically evaluated using the p-value from the hypergeometric distribution, which is defined as [44]:

$$p=1-\sum_{i=0}^{k-1}\frac{\binom{|F|}{i}\binom{|V|-|F|}{|C|-i}}{\binom{|V|}{|C|}},$$
(18)

where |V| has the same meaning as in Equation 14, C is an identified module, |F| is the number of proteins in a reference function, and k is the number of proteins shared by the function and the module. The p-value is also known as a metric of functional homogeneity. It is understood as the probability that at least k proteins in a module of size |C| are included in a reference function of size |F| by chance. A low p-value indicates that the module closely corresponds to the function, because it is improbable that the network would produce the module by chance. Consequently, the minimum p-value over all modules indicates the general performance of each detection method.
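The hypergeometric tail of Equation 18 can be computed directly (illustrative Python using the standard-library binomial coefficient):

```python
from math import comb

def p_value(V, F, C, k):
    """Hypergeometric p-value (Eq. 18): probability that at least k of
    the |C| module proteins fall in a reference function of size |F|,
    within a network of |V| proteins."""
    return 1.0 - sum(comb(F, i) * comb(V - F, C - i) for i in range(k)) / comb(V, C)
```

For instance, with |V| = 10, |F| = 5 and |C| = 2, the probability that both module proteins fall in the function is C(5,2)/C(10,2) = 2/9, which the tail formula reproduces for k = 2.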

Effects of parameters

In this subsection, we take the Gavin data as an example to study the effects of the algorithm parameters involved in the multi-agent evolution and in the post-processing. These parameters include the agent population size (M), the strength threshold of connections (ε), the maximum step length with the same energy level (R), the selection probability (p_o), the crossover probability (p_c), the mutation probability (p_m), the merging threshold (λ), and the filtering threshold (δ). In each experiment, the value of a single parameter is varied while the values of the other parameters are kept fixed.

For the multi-agent random-walk and evolutionary processes, we take the maximum energy of an agent and the number of iterations as two evaluation metrics to test the performance of the algorithm. Ten executions are carried out independently for each parameter combination. Figure 7 reveals the effects of the three main parameters (M, ε, R) on the performance of the multi-agent method, using mean-value curves with error bars. Figure 7(a) shows the evolutionary performance for 7 different agent population sizes (M). The multi-agent evolutionary method is a population-based optimization algorithm, where the population size determines the number of solutions at each iteration. The left graph in Figure 7(a) shows the results for the maximum energy value, and the right graph shows the results for the number of iterations. As reflected in Figure 7(a), smaller maximum energy values and larger numbers of iterations are obtained when using a small number of agents. As the number of agents increases, the maximum energy slowly increases and the number of iterations decreases on the whole. The reason is that more agents provide more initial search points in the search space, so the search range at each iteration is larger, which helps the algorithm converge rapidly. However, beyond a sufficient number of agents, further increments neither obviously improve the maximum energy nor dramatically reduce the number of iterations; on the contrary, the search time per iteration increases with the number of agents. Therefore, to balance solution quality against running time, we recommend an agent population size of 225 (M=225).

Figure 7

The effects of three parameters (M, ε, R) on the multi-agent method performance. (a) the plots of the maximum energy value and the number of iterations for different values of M; (b) the plots of the maximum energy value and the number of iterations for different values of ε; and (c) the plots of the maximum energy value and the number of iterations for different values of R.

The strength threshold of connections, ε, is an important parameter in the solution-construction process of an agent: it controls the feasible neighborhood of each node during an agent's random walk. To investigate the effect of ε on our algorithm, we perform experiments with different values of ε. The results are presented in Figure 7(b), where the curve of the maximum energy has 2 distinct ranges over the values of ε, i.e., [0.05, 0.10] and [0.15, 0.60], while the curve of the number of iterations has 2 rough ranges, i.e., [0.05, 0.20] and [0.25, 0.6]. As ε increases, the maximum energy increases in the first range and then decreases dramatically in the second range. Nevertheless, our algorithm keeps the maximum energy value above 0.5 when ε lies in [0.05, 0.35]. For the curve of the number of iterations, the first range has far larger values than the second, though there are small fluctuations in both. The reason is as follows: the smaller ε is, the larger the feasible neighborhood of each node and the search space of a solution are, so the algorithm needs more iterations to find a better solution, and vice versa. It is worth noting that the algorithm easily falls into a local optimum if ε is too large, even though it then converges quickly. Combining the above experimental results with this analysis, we select ε=0.25 in our algorithm.

The maximum step length with the same energy level, R, is also a key parameter, playing an important role in determining when the evolution ends. Figure 7(c) shows the plots of the maximum energy value and the number of iterations for different R. From the left graph in Figure 7(c), the maximum energy value is insensitive to R and increases only slightly as R increases. From the right graph, the number of iterations increases more significantly as R increases. These results illustrate that a large value of R is bound to increase the number of iterations without necessarily yielding a better result. Considering the two factors together, we set R=60.

Figure 8 reveals the effects of the three operator parameters (p_o, p_c, p_m) on the performance of the multi-agent method, using mean-value curves with error bars. As shown in the left graph of Figure 8(a), the maximum energy is insensitive to the occupying probability p_o, remaining around 0.56 for all values of p_o. From the right graph in Figure 8(a), we can see that the number of iterations decreases as p_o increases on the whole. This is because the competition operator (Strategy 1 or Strategy 2) is performed regardless of the value of p_o, so the difference in the maximum energy is very small (e.g., the maximum gap of the average value is 0.006). However, since Strategy 2 preserves more of a loser's advantageous information than Strategy 1, using Strategy 2 excessively slows the convergence of the algorithm. Hence, we set p_o=0.5 in our algorithm to balance the two strategies. The relationship between the method performance and p_c is shown in Figure 8(b). Similar to the curve for p_o, the maximum energy values vary with p_c, but the differences are very small (e.g., the maximum gap of the average value is 0.009), which suggests that the multi-agent evolutionary process is also insensitive to the crossover probability. The number of iterations has three varying ranges as p_c increases, i.e., [0.1, 0.4), [0.4, 0.6] and (0.6, 0.9]: it decreases gradually in [0.1, 0.4), stays at nearly equal small values in [0.4, 0.6], and finally increases gradually in (0.6, 0.9]. This means that a moderate amount of crossover contributes to the convergence of the algorithm, whereas too few or too many crossover operations hinder it. To shorten the evolutionary process, we set p_c=0.5 in our algorithm. Figure 8(c) gives the curves of the evolutionary performance against the mutation probability p_m. As p_m increases, the maximum energy values slowly increase, and the number of iterations decreases with small fluctuations. That is, richer mutation not only benefits the convergence of the multi-agent evolutionary process but also yields a better maximum energy value. To obtain a good result and save evolution time, we select p_m=0.8 in our algorithm.

Figure 8

The effects of three operator parameters (p_o, p_c, p_m) on the multi-agent method performance. (a) indicates how the maximum energy and the number of iterations change as p_o increases; (b) indicates how the maximum energy and the number of iterations change as p_c increases; (c) indicates how the maximum energy and the number of iterations change as p_m increases.

For the post-processing stage, we employ the recall, F-measure, precision, sensitivity, accuracy and PPV metrics to evaluate algorithm performance. Figure 9 gives the effects of the merging threshold λ on the 6 performance metrics. Figure 9(a) demonstrates that the F-measure and recall increase with λ over the whole range, while the precision also increases with λ at the beginning but decreases after λ passes 1.0. Figure 9(b) shows the relationships between λ and the sensitivity, the accuracy and the PPV. The accuracy and the PPV follow the same trend: both increase subtly as λ increases. Conversely, the sensitivity decreases at the beginning and then stays at a low value (0.74) after λ reaches 1.4. As shown in Figure 9(a) and Figure 9(b), the larger λ is, the better the F-measure and accuracy seem to be. However, if λ is set too large, the number of clusters becomes too large because of the many small clusters. To balance the number and size of clusters, λ is set to 1.8 in the following experiments.

Figure 9

The effects of merging threshold λ on 6 performance metrics. (a) reveals the relation between the λ value and recall, F-measure and precision; and (b) displays the relation between the λ value and sensitivity, accuracy and PPV.

Figure 10 gives the effects of the filter threshold δ on the 6 performance metrics. As shown in Figure 10(a), the recall and F-measure have a similar trend: their values slowly increase with δ at the beginning and then gently decrease after δ reaches 0.12. However, the rates of change differ slightly, with the recall changing more than the F-measure. Meanwhile, the precision stays relatively stable around 0.45, apart from two small peaks at δ=0.04 and δ=0.12. Figure 10(b) investigates the relationships between δ and the PPV, the accuracy and the sensitivity. As δ increases, the three metrics show different tendencies. In detail, the sensitivity decreases noticeably from 0.75 to 0.52; the PPV increases from 0.30 to 0.32 for δ in [0.02, 0.16] and then keeps the larger value (0.32) for δ>0.16; and the accuracy holds steady at 0.46 as δ varies from 0.02 to 0.14 and then decreases slightly from 0.46 to 0.41 for δ in [0.14, 0.2]. The main reason for these different trends is that, as δ increases, only modules whose similarity is strong enough are merged, which increases the number of clusters and reduces the average cluster size. To strike a balance, we employ δ=0.12 in our algorithm.

Figure 10

The effects of filter threshold δ on 6 performance metrics. (a) reveals the relation between the δ value and recall, F-measure and precision; and (b) displays the relation between the δ value and sensitivity, accuracy and PPV.

Based on similar tests, we have determined the parameter sets for the other data sets, and Table 2 summarizes the parameters used in the following experiments.

Table 2 Summary of parameters used in our experiments

From these results, we can give some simple suggestions for presetting these parameters. For M, a population of a certain size is necessary for MAE-FMD to obtain good-quality solutions, but it should be kept small enough not to inflate the running time. For ε, a medium value in [0, 0.6] is recommended. For R, a smaller value favors rapid convergence. For p_o, p_c and p_m, a medium value in [0, 1] for p_o and p_c and a higher value for p_m save running time. The two post-processing parameters depend on the data set: for curated databases such as DIP and MIPS, λ and δ can be set to smaller values in their respective domains, whereas larger values have to be employed for noisier databases (such as Gavin).
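The values selected in the tuning experiments above for the Gavin data can be collected into a single configuration. This is an illustrative sketch only; the key names are ours, and the per-dataset values for the other networks are the ones given in Table 2.

```python
# Parameter values chosen for the Gavin data in the tuning experiments.
GAVIN_PARAMS = {
    "M": 225,        # agent population size
    "epsilon": 0.25, # strength threshold of connections
    "R": 60,         # maximum step length with the same energy level
    "p_o": 0.5,      # selection (occupying) probability
    "p_c": 0.5,      # crossover probability
    "p_m": 0.8,      # mutation probability
    "lambda": 1.8,   # merging threshold
    "delta": 0.12,   # filtering threshold
}
```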

Comparative evaluations

To demonstrate the strengths of the MAE-FMD method, we compared it to six competing methods: HAM-FMD, NACO-FMD, Coach, CFinder, MCL and MCODE. In our experiments, CFinder and MCL run without parameter settings; the only parameter of Coach is the filter threshold ω, which was set to 0.225; NACO-FMD runs with α=1.5, β=4 and δ=0.3; MCODE adopts the default parameter values provided by its binary executable; and HAM-FMD uses five different combinations of parameter values, (100, 0.5, 2, 4, 286, 0.2, 0.8, 0.1, 0.6), (300, 0.4, 2, 4, 510.8, 0.2, 0.8, 0.3, 0.4), (400, 0.5, 2, 4, 910.8, 0.2, 0.8, 0.1, 0.7), (500, 0.5, 1.5, 5, 1025.2, 0.2, 0.5, 0.1, 0.3) and (400, 0.5, 1.5, 5, 817.2, 0.2, 0.5, 0.1, 0.3), for the parameter set (m, ρ, α, β, Q, P_o, P_c, P_m, δ) on Gavin, DIP and MIPS, respectively.

The detailed comparative results of the various algorithms on the five different data sets are shown in Table 3, where "" denotes an invalid result. For each detection method, we list the number of clusters detected (Number of clusters), the average number of proteins per cluster (average module size), the number of detected modules that match at least one real module (N_cp), and the number of real modules that match at least one detected module (N_cb). Taking MAE-FMD on the Gavin data as an example, it detected 193 modules, of which 110 match 224 real modules; each of the 193 detected modules contains about 6 proteins. These results show that MAE-FMD generates smaller-scale clusters on most of the data, and that MCL does not effectively detect modules when a data set is largely sparse (i.e., the human interaction network).

Figures 11, 12, 13, 14 and 15 show the overall comparison results of these methods in terms of various evaluation metrics, including Coverage, Precision, Recall, F-measure, Sensitivity, PPV and Accuracy, for the five different data sets, respectively. From the first panel of these figures, we can conclude that our algorithm achieves good Coverage performance on all five data sets. For instance, the Coverage of our algorithm is the third highest among the seven algorithms on DIP, MIPS and DIPScere20140703, higher than that of the other four algorithms and lower only than those of NACO-FMD and MCL. The main reason is that these algorithms adopt different clustering mechanisms, which strongly influence the percentage of proteins clustered into functional modules in a PPI network. Essentially, MAE-FMD, NACO-FMD, HAM-FMD and MCL share two characteristics: 1) their three solution representations are established on the basis of all nodes of the PPI network.
For example, MCL uses a matrix representation of nodes, NACO-FMD employs an ordered sequence of nodes, while HAM-FMD and MAE-FMD adopt a connection encoding of nodes; 2) all four algorithms use random clustering mechanisms, though the specific methods differ. Both characteristics ensure that the clustering results include most of the nodes in the PPI network. However, MAE-FMD, NACO-FMD and HAM-FMD adopt similar filter operators in the post-processing stage, so their Coverage values are smaller than that of MCL. Moreover, HAM-FMD combines the random search mechanism used by NACO-FMD with the similar random mechanism used by MAE-FMD, so its Coverage value is smaller than those of NACO-FMD and MAE-FMD. For the highly sparse human data (i.e., DIPHsapi20140703), our algorithm obtains the best result, showing that MAE-FMD keeps good Coverage performance even when connections in the data set are seriously sparse.

From the second to fourth panels of these figures, we can see that the Precision values of our algorithm are 57%, 50%, 29.9%, 27%, and 18%, respectively. In detail, MAE-FMD obtains the second best result, inferior only to MCODE (72.5%), on Gavin; the third best result, inferior only to CFinder (51.4% and 30.9%) and MCODE (64.8% and 37.3%) on the DIP and MIPS data, and to CFinder (21%) and Coach (19%) on the DIPHsapi20140703 data; and the fourth best result, superior to NACO-FMD (22%), MCL (15%) and HAM-FMD (22%), on the DIPScere20140703 data. Further, our algorithm obtains the best Recall on the Gavin (67.9%), MIPS (46.9%) and DIPHsapi20140703 (17%) data, and is inferior only to NACO-FMD on the DIP data (by 1.3%) and to Coach on the DIPScere20140703 data (by 1%). In combination, our algorithm achieves the best F-measure on the Gavin, DIP, MIPS and DIPHsapi20140703 data, and the second best result on the DIPScere20140703 data.
That is, our algorithm obtains the highest F-measure value of 62.0% on the Gavin data, as shown in Figure 11, which is 31.77%, 14.84%, 16.2%, 17.31%, 17.3% and 18.92% higher than that of CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE; 52.2% on the DIP data, as shown in Figure 12, which is 12.4%, 5.3%, 7.9%, 15.6%, 2.0% and 16.4% higher than that of CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE; 36.6% on the MIPS data, as shown in Figure 13, which is 12.2%, 4.3%, 8.3%, 15.7%, 7.2% and 16.4% higher than that of CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE; 17% on the DIPHsapi20140703 data, as shown in Figure 15, which is 9%, 16%, 8.9%, 2% and 15% higher than that of CFinder, Coach, NACO-FMD, HAM-FMD and MCODE; and 37% on the DIPScere20140703 data, as shown in Figure 14, which is 13%, 7%, 14%, 7% and 20% higher than that of CFinder, NACO-FMD, MCL, HAM-FMD and MCODE, and only 0.3% lower than that of Coach.

From these figures, we can also observe that MAE-FMD achieves the best sensitivity on four data sets (Gavin, DIP, MIPS and DIPHsapi20140703) and the second best result on the remaining one (DIPScere20140703), which indicates that the modules detected by our algorithm cover the real functional modules to a great extent. More specifically, the sensitivity of our algorithm is 72.4% in Figure 11, which is 24.4%, 40.0%, 32.7%, 33.2%, 36.9% and 34.6% higher than that of the CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE algorithms on the Gavin data. Figure 12 shows that the sensitivity of our algorithm is 57.0%, which is 25.5%, 33.5%, 25.4%, 27.6%, 29.3% and 32.5% higher than that of the CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE algorithms on the DIP data. Figure 13 shows that the sensitivity of our algorithm is 36.2%, which is better than that of the CFinder (30.9%), Coach (20.7%), NACO-FMD (24.6%), MCL (22%), HAM-FMD (19.3%) and MCODE (15.9%) algorithms on the MIPS data.
Figure 15 shows that the sensitivity of our algorithm is 30%, which is much better than that of the CFinder (18%), Coach (16%), NACO-FMD (17%), HAM-FMD (24.7%) and MCODE (8%) algorithms on the DIPHsapi20140703 data. Though MAE-FMD achieves the second best result (59%) on the DIPScere20140703 data, it is inferior only to CFinder (61%) and much better than Coach (36%), NACO-FMD (50%), MCL (33%), HAM-FMD (47%) and MCODE (20%).

On the Gavin, MIPS, DIPScere20140703 and DIPHsapi20140703 data, MAE-FMD attains the best or second best PPV value, while its PPV performance is not outstanding on the DIP data. In detail, the PPV value of MAE-FMD is 30.7%, as shown in Figure 11, which is 9.9%, 4.7%, 1.4%, 0.4% and 5.0% higher than that of CFinder, Coach, NACO-FMD, MCL and MCODE, and 0.9% lower than that of HAM-FMD. Figure 12 shows that the PPV value of MAE-FMD is 29.9%, which is 4.4% and 1.4% higher than those of the CFinder and MCODE algorithms, and 1.1%, 3.5%, 5.2% and 5.4% lower than those of the Coach, NACO-FMD, MCL and HAM-FMD algorithms on the DIP data. In Figure 13, the PPV value of MAE-FMD is 34.2%, which is 15.3%, 10.6%, 1.2%, 5.1% and 8.2% higher than that of CFinder, Coach, NACO-FMD, MCL and MCODE, and only 2.1% lower than that of HAM-FMD. The PPV value of MAE-FMD is 32%, as shown in Figure 14, which is 17%, 9%, 1%, 1% and 15% higher than that of CFinder, Coach, NACO-FMD and MCODE, and equal to that of MCL. In Figure 15, the PPV value of MAE-FMD is 48%, which is equal to that of NACO-FMD, and 16%, 11% and 17% higher than that of CFinder, Coach and MCODE, while it is only 4% lower than that of HAM-FMD.

Overall, our algorithm achieves the highest Acc on all five tested data sets due to its balanced effort between Sensitivity and PPV. The Acc value of our algorithm is 47.2%, as shown in Figure 11, which is 15.6%, 18.2%, 13.1%, 12.8%, 13.7% and 16.1% higher than that of CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE on the Gavin data, respectively.
Figure 12 shows that the Acc value of our algorithm is 41.3%, which is 12.9%, 14.3%, 8.8%, 9.2%, 10.1% and 14.9% higher than that of CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE on the DIP data, respectively. Figure 13 shows that the Acc value of our algorithm is 35.2%, which is 11.0%, 13.1%, 6.8%, 9.9%, 8.7% and 14.9% higher than that of CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE on the MIPS data. Figure 14 shows that the Acc value of our algorithm is 43%, which is 12%, 14%, 3.4%, 11%, 5% and 25% higher than that of CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE on the DIPScere20140703 data. Similarly, our algorithm attains 38% on the Acc metric, which is 14%, 14%, 9%, 2% and 22% higher than that of CFinder, Coach, NACO-FMD, HAM-FMD and MCODE on the DIPHsapi20140703 data. These experimental results on the Acc performance show that our algorithm is superior to the other six algorithms.

Table 3 The results of various algorithms on different data sets
Figure 11

Comparative results of some methods in terms of various evaluation metrics for Gavin data.

Figure 12

Comparative results of some methods in terms of various evaluation metrics for DIP data.

Figure 13

Comparative results of some methods in terms of various evaluation metrics for MIPS data.

Figure 14

Comparative results of some methods in terms of various evaluation metrics for DIPScere20140703 data.

Figure 15

Comparative results of some methods in terms of various evaluation metrics for DIPHsapi20140703 data.

Table 4 compares the distributions of the p-values of the protein modules obtained by the 7 different algorithms on the DIP data, where the first column gives the different types of p-values, the second column lists the 7 algorithms, the third to eighth columns present the number of modules located in the corresponding range, and the ninth column shows, for each algorithm, the ratio of the number of modules with a p-value to the number of all modules detected. From these results, we find that MCODE, Coach, and CFinder have the three highest ratios in all the statistics; however, MCODE obtains the smallest number of modules while Coach obtains the largest. The ratio differences among the three swarm intelligence algorithms (MAE-FMD, HAM-FMD and NACO-FMD) are not obvious, particularly between MAE-FMD and HAM-FMD. MCL has the worst ratio for all three types of p-values. Moreover, it is worth noting that most modules with a p-value are concentrated in the range (1.0e-10, 1.0e-3], and only a few modules fall into the range (0, 1.0e-20], where MAE-FMD has obvious advantages compared with the other algorithms.

Table 4 Distribution comparisons of the p-values of protein modules obtained from different algorithms on DIP

To further investigate the computational results, 10 protein modules with low p-values and high matching rates predicted by the different algorithms using the DIP data are respectively presented in Tables 5, 6, 7, 8, 9, 10 and 11. In these tables, the first column is a cluster identifier; the second column indicates the number of proteins in each cluster; the third column gives the proteins in the predicted module; the fourth column lists the corresponding real protein module; and the fifth column gives the matching rate (%) between a predicted module and a real module, computed as N_pm/N_pc, where N_pm is the number of proteins belonging to the same MIPS module (real module) within the matched module, and N_pc is the number of proteins contained in the matched module. The last three columns show the corresponding p-values of the predicted module from the perspectives of Biological Process, Cellular Component and Molecular Function. From the matching-rate column, we can see that many of the protein modules detected by the seven algorithms match the benchmark modules well. The p-values of the modules in these tables are very low, which further demonstrates that the identified modules have high statistical significance across the three Gene Ontology categories.
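The matching rate N_pm/N_pc described above reduces to a simple set computation. The following is a minimal sketch; the function name and the protein identifiers in the example are illustrative.

```python
def matching_rate(predicted, real):
    """Matching rate N_pm / N_pc in percent: the fraction of the predicted
    module's proteins that belong to the matched real (benchmark) module."""
    predicted, real = set(predicted), set(real)
    return 100.0 * len(predicted & real) / len(predicted)
```

For example, a 4-protein predicted module sharing 3 proteins with its matched benchmark module has a matching rate of 75%.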

Table 5 Some functional modules predicted by MAE-FMD using DIP data
Table 6 Some functional modules predicted by CFinder using DIP data
Table 7 Some functional modules predicted by Coach using DIP data
Table 8 Some functional modules predicted by NACO-FMD using DIP data
Table 9 Some functional modules predicted by MCL using DIP data
Table 10 Some functional modules predicted by HAM-FMD using DIP data
Table 11 Some functional modules predicted by MCODE using DIP data

To explicitly illustrate the results obtained by our algorithm, we take two modules as examples. For the retromer module, corresponding to the first module in these seven tables, the seven algorithms achieve the same good performance in terms of p-values and matching rates; that is, the real retromer module is correctly detected by all seven algorithms. For the anaphase-promoting module (corresponding to the second module in Tables 7, 8, 9, 10 and 11), detected by the Coach, NACO-FMD, MCL, HAM-FMD and MCODE algorithms, the minimum p-value of our algorithm in Table 5 is 1.82e-31, which is much lower than those of the other five algorithms, whose minimum p-values are 1.38e-25, 1.38e-25, 2.08e-28, 2.08e-28 and 5.62e-24, respectively. The real anaphase-promoting module in the benchmark consists of 16 proteins, of which 1 protein (ygl116w) is isolated from the other proteins within the same module and 2 proteins (yir025w and ydr260c) do not exist in the DIP data. Thus, the real structure of the anaphase-promoting module, comprising 13 proteins, is shown in Figure 16(a). The module obtained by our algorithm consists of 13 proteins and matches all 13 proteins in the benchmark module (Figure 16(b)). Though the matching rates of the Coach, NACO-FMD, MCL, HAM-FMD and MCODE algorithms are also 100%, they cover only 11, 11, 12, 12 and 10 proteins of the real anaphase-promoting module, respectively (Figure 16(c), Figure 16(d) and Figure 16(e)). In addition, CFinder does not recover the real anaphase-promoting module; it instead finds a huge cluster that contains the 13 proteins of the real module plus 49 other proteins. In other words, this example demonstrates that our algorithm can accurately predict protein modules. To show more biological details of Figure 16, Table 12 provides additional information on the sixteen proteins in the anaphase-promoting module.

Figure 16

The anaphase-promoting module detected by various algorithms. (a) Benchmark; (b) MAE-FMD algorithm; (c) Coach and NACO-FMD algorithms; (d) MCL and HAM-FMD algorithms; and (e) MCODE algorithm.

Table 12 Sixteen proteins in anaphase-promoting module

Moreover, our algorithm also obtains some new modules on all five data sets. Table 13 lists 5 new modules with low p-values on the DIP data, which were not previously described and were not detected by the other six algorithms. This means that MAE-FMD has a certain exploratory ability in detecting functional modules in a PPI network.

Table 13 Some new functional modules predicted by MAE-FMD algorithm using DIP data

In this section, we have performed thorough comparisons among the MAE-FMD, CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE algorithms in terms of various evaluation metrics (e.g., F-measure, accuracy, and p-value). These comparisons from different perspectives show that MAE-FMD is a promising method for effectively identifying functional module structures in PPI networks. It should be noted that F-measure and accuracy are two comprehensive evaluation metrics whose values reflect the detection quality more objectively from different computational views. In terms of accuracy, MAE-FMD significantly outperforms CFinder, Coach, NACO-FMD, MCL, HAM-FMD and MCODE on the five protein data sets. In terms of F-measure, MAE-FMD also outperforms the other six algorithms on the Gavin, DIP, MIPS and DIPHsapi20140703 data, and is only slightly worse than Coach on the DIPScere20140703 data. On the other hand, since the p-value of a module reflects its biological significance, the more modules with low p-values an algorithm finds, the more biologically significant its output is. Though the number of modules discovered by MAE-FMD is smaller than those of most of the compared algorithms on the yeast data, the number of modules with low p-values discovered by MAE-FMD is no smaller. For instance, MAE-FMD detects 234 modules on the DIP data, fewer than HAM-FMD, NACO-FMD, Coach and MCL; however, the numbers of its modules located in (0, 1.0e-20] are 16, 21 and 10 for the three types of p-values, which are much larger than those of the other four algorithms. Moreover, MAE-FMD can identify some new modules that were not previously identified by other algorithms, especially for the human data. All these results show that MAE-FMD can identify more biologically meaningful functional modules.

In summary, the outstanding experimental results of MAE-FMD on five different data sets demonstrate that MAE-FMD is a robust algorithm whose performance does not depend on the underlying data.

Conclusions

To reveal unknown functional ties between proteins and predict functions for unknown proteins, researchers have maintained a great interest in mining functional modules from PPI networks over the past decade. However, how to accurately predict these protein modules with computational methods remains a highly challenging issue. This paper presented a multi-agent evolutionary approach called MAE-FMD, which achieves high accuracy in identifying functional modules in PPI networks. The most significant feature of MAE-FMD is that it combines random search and optimization mechanisms in the solution-construction and evolutionary processes. First, MAE-FMD employs a random-walk model that merges topological characteristics with functional information to construct a candidate solution for each agent, which effectively and reasonably finds a feasible solution. It then applies several simple evolutionary operators, i.e., competition, crossover, and mutation, to realize information exchange among agents during the evolution process: the competition operator replaces the worst connection information with better information to improve the winner agent; the crossover operator performs a random search in the solution space through cooperation between neighboring agents; and the mutation operator carries out randomized local searches. The experimental results indicate that our algorithm has outstanding recall, F-measure, sensitivity, accuracy and p-value characteristics and can discover new modules on the five benchmark data sets while keeping other performances competitive, so it can be applied to biological studies that require higher accuracy. It should be pointed out that the algorithm does not take overlapping functional modules into account under its current representation and evolution of solutions, and may require longer running times for larger-scale PPI networks due to the iterative evolution of the population. Thus, our future work includes investigating new strategies to further improve the time efficiency and to detect overlapping modules in PPI networks.

Endnote

a Because the underlying protein interaction data used in the paper do not provide temporal and spatial information, we use the concept of functional modules.

References

  1. Ji JZ, Zhang AD, Liu CN, Quan XM, Liu ZJ: Survey: functional module detection from protein-protein interaction networks. IEEE Trans Knowl Data Eng. 2014, 26 (2): 261-277.

    Article  Google Scholar 

  2. Patternson SD, Aebersold RH: Proteomics: the first decade and beyond. Nat Genet. 2003, 33: 311-323. 10.1038/ng1106.

    Article  Google Scholar 

  3. Zhang AD: Protein interaction networks: computational analysis. 2009, New York, USA: Cambridge University Press

    Book  Google Scholar 

  4. Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M: A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol. 1999, 17 (10): 1030-1032. 10.1038/13732.

    Article  PubMed  CAS  Google Scholar 

  5. Gavin AC, Boesche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dichson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, et al: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415 (6868): 141-147. 10.1038/415141a.

  6. Tarassov K, Messier V, Landry CR, Radinovic S, Molina MM, Shames I: An in vivo map of the yeast protein interactome. Science. 2008, 320 (5882): 1465-1470. 10.1126/science.1153878.

  7. Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4 (1): 2-10.1186/1471-2105-4-2.

  8. Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006, 22 (8): 1021-1023. 10.1093/bioinformatics/btl039.

  9. Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics. 2006, 7 (1): 207-10.1186/1471-2105-7-207.

  10. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science. 2002, 297: 1551-1555. 10.1126/science.1073374.

  11. Arnau V, Mars S, Marin I: Iterative cluster analysis of protein interaction data. Bioinformatics. 2005, 21 (3): 364-378. 10.1093/bioinformatics/bti021.

  12. Holme P, Huss M, Jeong H: Subnetwork hierarchies of biochemical pathways. Bioinformatics. 2003, 19: 532-538. 10.1093/bioinformatics/btg033.

  13. King AD, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics. 2004, 20 (17): 3013-3020. 10.1093/bioinformatics/bth351.

  14. Frey BJ, Dueck D: Clustering by passing messages between data points. Science. 2007, 315 (5814): 972-976.

  15. Abdullah A, Deris S, Hashim SZM, Jamil HM: Graph partitioning method for functional module detections of protein interaction network. Int Conf Comput Technol Dev. 2009, 1 (1): 230-234.

  16. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30 (7): 1575-1584. 10.1093/nar/30.7.1575.

  17. Pereira-Leal JB, Enright AJ, Ouzounis CA: Detection of functional modules from protein interaction networks. Proteins. 2004, 54: 49-57.

  18. Cho YR, Hwang W, Ramanathan M, Zhang AD: Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics. 2007, 8 (1): 265-10.1186/1471-2105-8-265.

  19. Hwang W, Cho YR, Zhang AD, Ramanathan M: CASCADE: a novel quasi all paths-based network analysis algorithm for clustering biological interactions. BMC Bioinformatics. 2008, 9: 64-10.1186/1471-2105-9-64.

  20. Inoue K, Li W, Kurata H: Diffusion model based spectral clustering for protein-protein interaction networks. PLoS ONE. 2010, 5 (9): e12623-10.1371/journal.pone.0012623.

  21. Wu M, Li XL, Kwoh CK, Ng SK: A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics. 2009, 10: 169-10.1186/1471-2105-10-169.

  22. Sallim J, Abdullah R, Khader AT: ACOPIN: An ACO algorithm with TSP approach for clustering proteins from protein interaction network. Second UKSIM European Symposium on Computer Modeling and Simulation. 2008, IEEE, 203-208.

  23. Wu S, Lei XJ, Tian JF: Clustering PPI network based on functional flow model through artificial bee colony algorithm. Seventh International Conference on Natural Computation. 2011, IEEE, 92-96.

  24. Ji JZ, Liu ZJ, Zhang AD, Jiao L, Liu CN: Improved ant colony optimization for detecting functional modules in protein-protein interaction networks. Information Computing and Applications, Volume 1. 2012, Heidelberg: Springer Berlin, 404-413.

  25. Ji JZ, Liu ZJ, Zhang AD, Yang CC, Liu CN: HAM-FMD: mining functional modules in protein-protein interaction networks using ant colony optimization and multi-agent evolution. Neurocomputing. 2013, 121: 453-469.

  26. Liu JM, Jing H, Tang YY: Multi-agent oriented constraint satisfaction. Artif Intell. 2002, 136: 101-144. 10.1016/S0004-3702(01)00174-6.

  27. Zhong WC, Liu J, Xue MZ, Jiao LC: A multiagent genetic algorithm for global numerical optimization. IEEE Trans Syst Man Cybernet (Part B). 2004, 34 (2): 1128-1141. 10.1109/TSMCB.2003.821456.

  28. Pan XY, Jiao LC, Liu F: Granular agent evolutionary algorithm for classification. ACTA ELECTRONICA SINICA. 2009, 37 (3): 628-633.

  29. Pan XY, Liu F, Jiao LC: Density sensitive based multi-agent evolutionary clustering algorithm. J Software. 2010, 21 (10): 2420-2431.

  30. Pan XY, Chen H: Multi-agent evolutionary clustering algorithm based on manifold distance. Proceedings of the 8th International Conference on Computational Intelligence and Security. 2012, Guangzhou, China, 123-127.

  31. Yang B, Huang J, Liu DY, Liu JM: A multi-agent based decentralized algorithm for social network community mining. Proceedings of the International Conference on Advances in Social Network Analysis and Mining. 2009, Athens, Greece, 78-82.

  32. Mete M, Tang FS, Xu XW, Yuruk N: A structural approach for finding functional modules from large biological networks. BMC Bioinformatics. 2008, 9 (Suppl 9): S19-10.1186/1471-2105-9-S9-S19.

  33. Schlicker A, Albrecht M: FunSimMat: a comprehensive functional similarity database. Nucleic Acids Res. 2008, 36: D434-D439.

  34. Guimerà R, Amaral LAN: Functional cartography of complex metabolic networks. Nature. 2005, 433 (7028): 895-900. 10.1038/nature03288.

  35. Xenarios I, Salwinski L, Duan X, Higney P, Kim S, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30: 303-305. 10.1093/nar/30.1.303.

  36. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, et al: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440 (7084): 631-636. 10.1038/nature04532.

  37. Mewes HW, et al: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004, 32 (Database issue): D41-D44.

  38. Friedel CC, Krumsiek J, Zimmer R: Bootstrapping the Interactome: unsupervised identification of protein complexes in yeast. RECOMB. 2008, 4955 (1): 3-16.

  39. Aloy P, Bottcher B, Ceulemans H, Leutwein C, Mellwig C, Fischer S, Gavin AC, Bork P, Superti-Furga G, Serrano L, Russell RB: Structure-based assembly of protein complexes in yeast. Science. 2004, 303 (5666): 2026-2029. 10.1126/science.1092645.

  40. Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, Sethuraman A, Weng S, Botstein D, Cherry JM: Saccharomyces genome database provides secondary gene annotation using the gene ontology. Nucleic Acids Res. 2002, 30 (1): 69-72. 10.1093/nar/30.1.69.

  41. Li XL, Wu M, Kwoh CK, Ng SK: Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics. 2010, 11 (suppl 1): S3-10.1186/1471-2164-11-S1-S3.

  42. Chua HN, Ning K, Sung WK, Leong HW, Wong L: Using indirect protein-protein interactions for protein complex prediction. J Bioinformatics Comput Biol. 2008, 6 (03): 435-466. 10.1142/S0219720008003497.

  43. Hwang W, Cho YR, Zhang AD, Ramanathan M: CASCADE: a novel quasi all paths-based network analysis algorithm for clustering biological interactions. BMC Bioinformatics. 2008, 9: 64-10.1186/1471-2105-9-64.

  44. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, Li G, Chen R: Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Res. 2003, 31 (9): 2443-2450. 10.1093/nar/gkg340.

Acknowledgements

This work is partly supported by National “973” Key Basic Research Program of China (2014CB744601), NSFC Research Program (61375059, 61332016), Specialized Research Fund for the Doctoral Program of Higher Education (20121103110031), and the Beijing Municipal Education Research Plan key project (Beijing Municipal Fund Class B) (KZ201410005004).

Author information

Corresponding author

Correspondence to Jun Zhong Ji.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JZJ and LJ conceived the work. LJ and JWL performed the experimental analysis. JZJ and CCY prepared the manuscript with revisions by ADZ. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Ji, J.Z., Jiao, L., Yang, C.C. et al. MAE-FMD: Multi-agent evolutionary method for functional module detection in protein-protein interaction networks. BMC Bioinformatics 15, 325 (2014). https://doi.org/10.1186/1471-2105-15-325
