Network construction
Studies have confirmed that integrating gene expression data with PPIN information helps to understand the complex multi-layered molecular architecture of human diseases. Therefore, two NSCLC gene expression datasets from the NCBI Gene Expression Omnibus (GEO) database are selected, and a GCN is constructed for each of them with reference to PPIN information. The detailed steps of data preprocessing and network construction are as follows. First, the limma package in R/Bioconductor is used to identify differentially expressed genes (DEGs), with the p values of the moderated t-statistics adjusted by the Benjamini–Hochberg method [16]. Genes with an adjusted p value below 0.05 are regarded as DEGs, and only interactions between DEGs are used to construct the GCNs.
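As a minimal sketch of this step (assuming a normalized, log-scale expression matrix `expr` with genes in rows and a two-level factor `group` separating normal and tumor samples; both object names are placeholders, not part of the original pipeline), the DEG selection could be written as:

```r
# Minimal sketch of the DEG step with limma (Bioconductor), assuming `expr`
# is a normalized, log-scale expression matrix (genes x samples) and `group`
# is a factor with levels "normal" and "tumor".
library(limma)

design <- model.matrix(~ group)              # intercept + tumor-vs-normal effect
fit    <- eBayes(lmFit(expr, design))        # moderated t-statistics per gene
tab    <- topTable(fit, coef = 2, number = Inf,
                   adjust.method = "BH")     # Benjamini-Hochberg adjustment
degs   <- rownames(tab)[tab$adj.P.Val < 0.05]  # genes kept as DEGs
```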
To estimate the interaction intensity between DEGs, a criterion called Gaussian Copula Mutual Information (GCMI) is adopted [17]. GCMI uses a statistical copula to combine the advantages of Gaussian parametric estimation with arbitrary marginal distributions, which makes it suitable for estimating the mutual information (MI) between two continuous variables. In addition, we use the \(sim_{Rel}\) score to quantify how functionally related the products of a pair of DEGs are; it provides a similarity criterion over the gene ontology (GO) terms of two gene products. The computation of \(sim_{Rel}\) is implemented in the R package GOSemSim [18]. \(sim_{Rel}\) is defined as follows,
$$sim_{Rel} \left( {c_{1} ,c_{2} } \right) = \mathop {\max }\limits_{{c \in S\left( {c_{1} ,c_{2} } \right)}} \left( {\frac{2 \cdot \log p\left( c \right)}{{\log p\left( {c_{1} } \right) + \log p\left( {c_{2} } \right)}} \cdot \left( {1 - p\left( c \right)} \right)} \right)$$
(1)
where \(S\left( {c_{1} ,c_{2} } \right)\) is the set of common ancestors of GO terms \(c_{1}\) and \(c_{2}\), and \(p\left( c \right)\) is the probability of occurrence of term \(c\). The \(sim_{Rel}\) similarity is later used to compute one of the fitness values in the "Fitness functions" section.
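As a hedged example (the Entrez gene IDs and the choice of the biological process ontology are placeholders, not part of the original pipeline), the gene-level \(sim_{Rel}\) similarity can be obtained with GOSemSim as follows:

```r
# Sketch of the sim_Rel ("Rel") similarity with GOSemSim, assuming Entrez gene
# IDs and the human annotation package org.Hs.eg.db.
library(GOSemSim)

hsGO <- godata("org.Hs.eg.db", ont = "BP")   # precompute IC values for BP terms
# Gene-level similarity; "Rel" applies Eq. 1 at the GO-term level and combines
# term pairs with the best-match average (BMA).
geneSim("7157", "672", semData = hsGO, measure = "Rel", combine = "BMA")
```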
After the correlation matrix between DEGs is obtained, it is compared with the PPIN. We downloaded the PPIN from the Human Protein Reference Database (HPRD), which contains 39,240 interactions [19]; the original protein–protein interaction information is provided in Additional file 1. If the edge corresponding to a correlation does not exist in the PPIN, that correlation is set to 0.
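A minimal sketch of this masking step, assuming the GCMI values are stored in a symmetric gene-by-gene matrix `gcmi_mat` (row and column names matching the PPIN identifiers) and the HPRD interactions in a two-column data frame `ppi` (both names hypothetical):

```r
# Sketch of restricting the DEG correlation matrix to PPIN-supported edges.
# `gcmi_mat` is a symmetric DEG-by-DEG matrix of GCMI values (assumed given),
# `ppi` is a two-column data frame of interacting gene identifiers from HPRD.
mask_by_ppin <- function(gcmi_mat, ppi) {
  genes <- rownames(gcmi_mat)
  adj   <- matrix(0, nrow(gcmi_mat), ncol(gcmi_mat),
                  dimnames = dimnames(gcmi_mat))
  keep  <- ppi[ppi[[1]] %in% genes & ppi[[2]] %in% genes, ]
  idx   <- cbind(as.character(keep[[1]]), as.character(keep[[2]]))
  adj[idx] <- 1                                 # mark PPIN-supported pairs
  adj[idx[, c(2, 1), drop = FALSE]] <- 1        # keep the matrix symmetric
  gcmi_mat * adj                                # correlations absent from PPIN -> 0
}
```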
DM-MOGA framework
Due to the high complexity of identifying disease modules in a large-scale GCN, heuristic strategies are required to guide the search. One of the most popular strategies is the evolutionary algorithm (EA), which is well suited to global optimization problems in discrete search spaces [20]. An EA is an optimization algorithm inspired by Darwin's principle of natural selection: each solution is represented as an individual in the population, and each individual is evaluated by one or more fitness functions that are optimized through the selection process. In this paper, we propose DM-MOGA, a new method based on a decomposition-based MOEA, to identify modules that can be regarded as biomarkers of NSCLC. The workflow of DM-MOGA is displayed in Fig. 1. In this section, we describe the framework of DM-MOGA, including pre-simplification with boundary correction, the chromosome encoding and initialization scheme, the MOEA operators, and the strategy for selecting the optimal solution. After the evolution is completed, the solution with the largest \(W^{\prime}\) in the Pareto front is taken as the final result, which contains hundreds of modules; we then keep only the largest module, which carries the most biological information, from the GCN.
Local module pre-simplification and boundary correction
To improve the scalability of DM-MOGA on large biological networks, a pre-simplification strategy for local modules (LMs) is adopted from [21] and executed before the evolution (see the sketch after this paragraph). The strategy randomly selects a node \(a\) from the network; an LM is then defined to contain node \(a\), its neighbor \(a_{k}\) with the largest degree, the neighbor \(a_{kk}\) of \(a_{k}\) that shares the largest number of common neighbors with \(a_{k}\), and all joint neighbors of \(a_{k}\) and \(a_{kk}\). Among the neighbors of the nodes already in the LM, those whose number of connections to the LM exceeds half of their degree are also added. Finally, a complete subgraph of order 3 in the LM is selected and contracted into a single node. In other words, during the evolution we do not consider the possibility that nodes of the same LM belong to different modules. The above operations are repeated on the remaining nodes of the network until every node is assigned to an LM.
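A minimal sketch of the LM construction on a binary adjacency matrix `A` (an illustrative re-implementation under stated assumptions, not the authors' code; the final contraction of a 3-clique into a single node is omitted here):

```r
# Sketch of the local-module (LM) construction on a binary adjacency matrix `A`
# (symmetric, zero diagonal). Returns an LM label for every node.
lm_presimplify <- function(A) {
  n <- nrow(A); deg <- rowSums(A)
  label <- rep(NA_integer_, n); lm_id <- 0L
  while (any(is.na(label))) {
    lm_id <- lm_id + 1L
    free <- which(is.na(label))
    a <- free[sample.int(length(free), 1)]              # random unassigned node a
    nb_a <- intersect(which(A[a, ] == 1), free)
    if (length(nb_a) == 0) { label[a] <- lm_id; next }  # isolated node forms its own LM
    ak <- nb_a[which.max(deg[nb_a])]                    # neighbor of a with largest degree
    nb_ak <- intersect(which(A[ak, ] == 1), free)
    shared <- sapply(nb_ak, function(v) sum(A[v, ] * A[ak, ]))
    akk <- nb_ak[which.max(shared)]                     # shares most neighbors with a_k
    lm <- unique(c(a, ak, akk,
                   intersect(which(A[ak, ] == 1 & A[akk, ] == 1), free)))
    repeat {                                            # absorb strongly attached neighbors
      cand <- setdiff(intersect(which(colSums(A[lm, , drop = FALSE]) > 0), free), lm)
      add <- cand[colSums(A[lm, cand, drop = FALSE]) > deg[cand] / 2]
      if (length(add) == 0) break
      lm <- c(lm, add)
    }
    label[lm] <- lm_id
  }
  label
}
```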
However, this strategy considers only the topology of the network and ignores edge weights, so the endpoints of edges with small weights do not necessarily belong to the same LM. We therefore develop a module boundary correction strategy. In this strategy, a matrix \(NodeTable\) is maintained for all nodes, in which each row represents a node: the first column is the index of the node, the second column is the module to which the node belongs after the LM pre-simplification strategy, and the third column is the new module to which the node belongs after the boundary correction.
The constructed GCN is denoted as \(G = \left( {V,E} \right)\), where \(V = \left\{ {v_{i} \left| {i = 1,2, \ldots ,N} \right.} \right\}\) is the set of nodes and \(E = \left\{ {e_{i} \left| {i = 1,2, \ldots ,M} \right.} \right\}\) is the set of connections between pairs of nodes. First, the weight \(\{ V_{1}^{w} ,V_{2}^{w} , \ldots ,V_{i}^{w} , \ldots V_{N}^{w} \}\) of each node in the network is calculated. Specifically, the most densely connected area of the subgraph made up of node \(i\) and its immediate neighbors is defined as the highest k-core, and \(V_{i}^{w}\) is the product of the density and the minimum degree of this highest k-core, where the density is \(D^{G} = \frac{2\left| E \right|}{{\left| V \right|\left( {\left| V \right| - 1} \right)}}\) [8]. Then, the attribute \(\{ V_{1}^{a} ,V_{2}^{a} , \ldots ,V_{i}^{a} , \ldots V_{N}^{a} \}\) of each node is obtained, where \(V_{i}^{a} = \sum\nolimits_{j} {E_{ij} }\) and \(j \in LC_{{NodeTable\left( {i,2} \right)}}\), i.e., \(j\) ranges over the LM to which node \(i\) is currently assigned. For each module, if node \(i\) satisfies \((V_{i}^{w} > 2)\& (V_{i}^{w} \ge \overline{{V^{w} }} )\), node \(i\) is kept in this module; otherwise, the third element of the corresponding row of node \(i\) in \(NodeTable\) is set to 0, indicating that node \(i\) will be reassigned afterwards. Second, for each LM, the node with the largest sum of connecting edge weights is selected as the seed, and the neighbors of this seed are recursively assigned to the same module. Nodes that still cannot be allocated to any LM are retained in the network as independent nodes.
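As an illustrative sketch of the two quantities used here (relying on the igraph R package for the k-core computation; `g` is the weighted GCN as an igraph object, `W` its weighted adjacency matrix, and `NodeTable` the three-column matrix described above, all assumed given):

```r
# Illustrative sketch of the node weight V_i^w and attribute V_i^a used in the
# boundary correction (an assumption-laden re-implementation, not the authors' code).
library(igraph)

# V_i^w: density of the highest k-core of the subgraph induced by node i and
# its immediate neighbors, multiplied by the minimum degree within that core.
node_weight <- function(g, i) {
  ego_g   <- induced_subgraph(g, c(i, as.numeric(neighbors(g, i))))
  core_no <- coreness(ego_g)
  core    <- induced_subgraph(ego_g, which(core_no == max(core_no)))
  nv      <- vcount(core)
  dens    <- if (nv > 1) 2 * ecount(core) / (nv * (nv - 1)) else 0
  dens * min(degree(core))
}

# V_i^a: total weight of edges from node i into the LM it was assigned to
# during pre-simplification (column 2 of NodeTable).
node_attr <- function(W, NodeTable, i) {
  members <- NodeTable[NodeTable[, 2] == NodeTable[i, 2], 1]
  sum(W[i, members])
}
```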
Fitness functions
The proposed DM-MOGA identifies disease modules by minimizing the following two fitness functions. The first is the Davies–Bouldin Index (DBI), originally proposed as a measure of clustering quality [22]. The basic idea of DBI is that the distance between nodes belonging to different clusters should be as large as possible, while the distance between nodes within a cluster should be as small as possible. Using the similarity matrix obtained by computing \(sim_{Rel}\) between genes, DBI is applied to assess how functionally similar the genes belonging to the same disease module are. The formula is as follows:
$$DBI = \frac{1}{{C_{num} }}\sum\limits_{i = 1}^{{C_{num} }} {\mathop {\max }\limits_{j \ne i} \frac{{S_{i} + S_{j} }}{{dist\left( {v_{i} ,v_{j} } \right)}}}$$
(2)
where \(C_{num}\) is the number of modules, \(dist\left( \cdot \right)\) is the \(sim_{Rel}\) similarity between two nodes, \(S_{i} = \frac{1}{{\left| {C_{i} } \right|}}\sum\nolimits_{{x_{j} \in C_{i} }} {dist\left( {x_{j} ,v_{i} } \right)}\) measures the extent of dispersion of module \(C_{i}\), \(C_{i}\) represents the i-th module, \(x_{j}\) is the j-th node of \(C_{i}\), and \(v_{i}\) represents the center of \(C_{i}\).
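A minimal sketch of this fitness on a precomputed \(sim_{Rel}\) matrix `S_rel` (genes in both rows and columns) and a module assignment vector `membership`; choosing the module center \(v_{i}\) as the member most similar on average to the others is our assumption, since the choice of center is not fixed above:

```r
# Sketch of the DBI fitness (Eq. 2) on a sim_Rel similarity matrix `S_rel`.
dbi_fitness <- function(S_rel, membership) {
  mods <- unique(membership)
  if (length(mods) < 2) return(NA_real_)                     # DBI needs >= 2 modules
  centers <- sapply(mods, function(m) {
    idx <- which(membership == m)
    idx[which.max(colMeans(S_rel[idx, idx, drop = FALSE]))]  # most central member as v_i
  })
  S <- sapply(seq_along(mods), function(k) {                 # dispersion S_i of module k
    idx <- which(membership == mods[k])
    mean(S_rel[idx, centers[k]])
  })
  terms <- sapply(seq_along(mods), function(i) {
    max(sapply(setdiff(seq_along(mods), i), function(j)
      (S[i] + S[j]) / S_rel[centers[i], centers[j]]))        # (S_i + S_j) / dist(v_i, v_j)
  })
  mean(terms)                                                # average over modules
}
```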
The second function is a modified clustering coefficient (\(CC^{\prime}\)) suitable for weighted networks. In general, the clustering coefficient quantifies how tightly one node and its neighbors are connected; here it is modified to a sum over the nodes of a module so that the module is evaluated as a whole. For an arbitrary module \(C_{i}\), the clustering coefficient is defined as:
$$CC_{i}^{\prime } = \sum\nolimits_{{x_{j} \in C_{i} }} {\frac{{2\sum\nolimits_{{x_{l} ,x_{k} \in R_{{x_{j} }} \wedge l \ne k}} {E\left( {x_{l} ,x_{k} } \right)} }}{{\left| {R_{{x_{j} }} } \right|\left( {\left| {R_{{x_{j} }} } \right| - 1} \right)}}}$$
(3)
where \(R_{{x_{j} }}\) is the set of neighbors of node \(x_{j}\). Since the result of the MOEA is a group of modules, the second fitness function is set to the maximum of \(CC_{i}^{\prime}\) over all modules.
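A sketch of this fitness on the weighted adjacency matrix `W` (assumed symmetric with a zero diagonal; the pair sum in the numerator of Eq. 3 is read here as running over unordered neighbor pairs):

```r
# Sketch of the modified clustering coefficient CC' (Eq. 3); the second fitness
# is the maximum of CC'_i over modules.
cc_fitness <- function(W, membership) {
  cc_node <- sapply(seq_len(nrow(W)), function(j) {
    nb <- which(W[j, ] > 0)                           # neighbors R_{x_j}
    if (length(nb) < 2) return(0)
    # sum(W[nb, nb]) counts each unordered pair twice, i.e. 2 * sum over pairs
    sum(W[nb, nb]) / (length(nb) * (length(nb) - 1))
  })
  cc_mod <- tapply(cc_node, membership, sum)          # CC'_i: sum over nodes of module i
  max(cc_mod)                                         # second fitness value
}
```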
The criterion \(W\) proposed by Zhao et al. [23] is used to select one solution from the Pareto front obtained by the MOEA. The original \(W\) was designed to extract a single module from unweighted social networks. To take all modules into consideration and to adapt to weighted GCNs, \(W\) is modified to the following form, and the solution with the largest \(W^{\prime}\) is taken as the final result:
$$W^{\prime} = \sum\limits_{i = 1}^{{C_{num} }} {\left( {\frac{{O\left( {C_{i} } \right)}}{{\left| {C_{i} } \right|^{2} }} - \frac{{B\left( {C_{i} } \right)}}{{\left| {C_{i} } \right|\left| {C_{i}^{^{\prime}} } \right|}}} \right)}$$
(4)
where \(C_{i}^{\prime }\) is the complement of \(C_{i}\), \(O\left( {C_{i} } \right) = \sum\limits_{{j,k \in C_{i} }} {E_{j,k} }\), \(B\left( {C_{i} } \right) = \sum\limits_{{j \in C_{i} ,k \in C_{i}^{^{\prime}} }} {E_{j,k} }\).
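A sketch of \(W^{\prime}\) on the weighted adjacency matrix `W` and a module assignment vector `membership` (both names hypothetical):

```r
# Sketch of the W' selection criterion (Eq. 4) on the weighted adjacency `W`.
wprime <- function(W, membership) {
  sum(sapply(unique(membership), function(m) {
    Ci  <- which(membership == m)
    out <- which(membership != m)                 # complement C_i'
    O   <- sum(W[Ci, Ci, drop = FALSE])           # internal weight (ordered pairs)
    B   <- sum(W[Ci, out, drop = FALSE])          # boundary weight to the complement
    O / length(Ci)^2 - B / (length(Ci) * max(length(out), 1))  # guard if C_i' is empty
  }))
}
```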
Multi-objective optimization based on decomposition
The underlying theory is the multi-objective optimization problem (MOP) based on Pareto optimality, in which a group of objective functions is optimized simultaneously:
$$\min F\left( x \right) = \left( {f_{1} \left( x \right),f_{2} \left( x \right), \ldots ,f_{k} \left( x \right)} \right)^{T}$$
(5)
where \(x = \left[ {x_{1} ,x_{2} , \ldots ,x_{N} } \right] \in \Omega\) and \(\Omega\) is the feasible region. The dominance relationship is then defined as follows: \(x_{A}\) dominates \(x_{B}\) (written as \(x_{A} \succ x_{B}\), with \(x_{A} ,x_{B} \in \Omega\)) if and only if:
$$\forall i \in \left\{ {1,2, \ldots ,k} \right\}\;f_{i} \left( {x_{A} } \right) \le f_{i} \left( {x_{B} } \right) \wedge \exists j \in \left\{ {1,2, \ldots ,k} \right\}\;f_{j} \left( {x_{A} } \right) < f_{j} \left( {x_{B} } \right)$$
(6)
If there is no vector \(x \in \Omega\) such that \(x \succ x^{*}\), \(x^{*}\) is called a non-dominated solution or Pareto-optimal solution.
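For minimization, the dominance test of Eq. 6 can be written as the following sketch, where `fA` and `fB` are the fitness vectors \(\left( f_{1}, \ldots, f_{k} \right)\) of two solutions:

```r
# Pareto dominance test of Eq. 6 (minimization): fA dominates fB.
dominates <- function(fA, fB) {
  all(fA <= fB) && any(fA < fB)
}

# Example with two objectives: c(0.4, 1.2) dominates c(0.5, 1.2).
dominates(c(0.4, 1.2), c(0.5, 1.2))   # TRUE
```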
MOEA/D-Net is a community detection method based on MOEA with decomposition. It decomposes a MOP into a number of scalar optimization subproblems and optimizes them simultaneously by population evolution. At each iteration, the population is made up of the best solution found for each subproblem since the beginning of evolution. In MOEA/D-Net, the popular Tchebycheff method is used to construct the aggregation function and therefore the scalar optimization subproblems are in the form:
$$\min g^{te} \left( {\left. x \right|\lambda_{i} ,z^{*} } \right) = \mathop {\max }\limits_{j = 1}^{2} \left\{ {\lambda_{i}^{j} \left| {F_{j} \left( x \right) - z_{j}^{*} } \right|} \right\}$$
(7)
where \(\lambda = \left\{ {\lambda_{1} ,\lambda_{2} , \ldots ,\lambda_{pop} } \right\}\) is a set of weight vectors uniformly distributed on \(\lambda_{i}^{1} + \lambda_{i}^{2} = 1\), with \(\lambda_{i} = \left\langle {\lambda_{i}^{1} ,\lambda_{i}^{2} } \right\rangle\), \(\lambda_{i}^{j} \in \left[ {0,1} \right]\), \(i = 1,2, \ldots ,pop\), and \(pop\) the population size; \(z^{*} = \left\langle {z_{1}^{*} ,z_{2}^{*} } \right\rangle\) is the reference point, in which each component \(z_{j}^{*}\) is the minimum value of the corresponding fitness function found in the population. For each weight vector \(\lambda_{i}\), the Euclidean distances to all other weight vectors are computed, and the neighborhood of \(\lambda_{i}\), denoted \(Neib_{i}\), is made up of the \(nm\) individuals whose weight vectors are closest to \(\lambda_{i}\), where \(nm\) is a predefined parameter. For each non-dominated individual there is a weight vector that makes it the optimal solution of Eq. 7, and each optimal solution of Eq. 7 is a Pareto-optimal solution of Eq. 5.
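Under the assumption of two objectives and evenly spaced weight vectors, the Tchebycheff aggregation of Eq. 7 and the neighborhood construction can be sketched as:

```r
# Tchebycheff aggregation g^te(x | lambda_i, z*) of Eq. 7 for a fitness vector fx.
tcheby <- function(fx, lambda_i, z_star) {
  max(lambda_i * abs(fx - z_star))
}

# Evenly spaced two-objective weight vectors on lambda^1 + lambda^2 = 1.
make_weights <- function(pop) {
  l1 <- seq(0, 1, length.out = pop)
  cbind(l1, 1 - l1)
}

# Neighborhood Neib_i: the nm weight vectors closest (Euclidean) to lambda_i.
make_neighborhoods <- function(lambda, nm) {
  d <- as.matrix(dist(lambda))
  t(apply(d, 1, function(row) order(row)[seq_len(nm)]))  # nm closest, incl. itself
}
```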
Initialization
Individuals are encoded and initialized with the locus-based adjacency encoding scheme that is popular in EA-based community detection algorithms [24]. In an individual, each element is initialized either with the index of a randomly chosen neighbor of the corresponding node or with the index of the node itself; the element is then repeatedly replaced by the index shared by most of the node's neighbors until it no longer changes. The other variable to be initialized is the reference point \(z^{*}\), which is set to the minimum of the two fitness functions over the initial population.
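A minimal sketch of this initialization on a binary adjacency matrix `A`, with the propagation bounded to a fixed number of passes (our safeguard, not part of the original description):

```r
# Locus-based adjacency initialization: element j holds either j itself or the
# index of one of node j's neighbors, then is repeatedly replaced by the value
# carried by most of j's neighbors.
init_individual <- function(A) {
  n <- nrow(A)
  ind <- sapply(seq_len(n), function(j) {
    nb <- which(A[j, ] > 0)
    if (length(nb) == 0) j else sample(c(j, nb), 1)   # node itself or a random neighbor
  })
  for (it in seq_len(n)) {                            # bounded propagation passes
    new_ind <- sapply(seq_len(n), function(j) {
      nb <- which(A[j, ] > 0)
      if (length(nb) == 0) return(ind[j])
      tab <- table(ind[nb])
      as.integer(names(tab)[which.max(tab)])          # value shared by most neighbors
    })
    if (all(new_ind == ind)) break
    ind <- new_ind
  }
  ind
}
```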
The main loop of DM-MOGA
DM-MOGA adopts a framework similar to that of MOEA/D-Net, proposed by Gong et al. [25]. The population is evolved as follows. Every individual \(p_{j} \left( {1 \le j \le pop} \right)\) performs crossover and mutation with another randomly selected individual to generate a \(child\). If the Tchebycheff value of the \(child\) is better than that of a neighbor in \(Neib_{j}\), that neighbor is replaced by the \(child\) and the reference point \(z^{*}\) is updated. Specifically, we choose two-point crossover because it preserves effective connections between nodes: two positions \(i\) and \(j\) (\(1 \le i \le j \le N\)) are selected at random, and the elements in \(\left[ {i,j} \right]\) are exchanged between the two parents. After crossover, an individual is randomly selected for neighbor-based mutation: according to the encoding strategy, the mutation operator randomly selects an element \(e_{l}\) of the individual and replaces the neighbor index stored in it with the index of another neighbor of the node corresponding to \(e_{l}\). DM-MOGA continues to evolve until the maximum number of generations is reached.
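A sketch of the two variation operators on the locus-based encoding (`A` is again the binary adjacency matrix; both functions are illustrative, not the authors' implementation):

```r
# Two-point crossover on the locus-based encoding: exchange the segment [i, j].
two_point_crossover <- function(p1, p2) {
  n  <- length(p1)
  ij <- sort(sample.int(n, 2))                 # cut points i <= j
  child <- p1
  child[ij[1]:ij[2]] <- p2[ij[1]:ij[2]]
  child
}

# Neighbor-based mutation: replace the stored neighbor index of a random element
# e_l with the index of another neighbor of the corresponding node.
neighbor_mutation <- function(ind, A) {
  l  <- sample.int(length(ind), 1)
  nb <- which(A[l, ] > 0)
  alt <- setdiff(nb, ind[l])
  if (length(alt) > 0) ind[l] <- alt[sample.int(length(alt), 1)]
  ind
}
```

The resulting \(child\) would then be evaluated with the two fitness functions, compared with the members of \(Neib_{j}\) through the Tchebycheff value sketched earlier, and used to update \(z^{*}\) as described above.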