Skip to main content

DM-MOGA: a multi-objective optimization genetic algorithm for identifying disease modules of non-small cell lung cancer



Constructing molecular interaction networks from microarray data and then identifying disease module biomarkers can provide insight into the underlying pathogenic mechanisms of non-small cell lung cancer. A promising approach for identifying disease modules in the network is community detection.


In order to identify disease modules from gene co-expression networks, a community detection method is proposed based on multi-objective optimization genetic algorithm with decomposition. The method is named DM-MOGA and possesses two highlights. First, the boundary correction strategy is designed for the modules obtained in the process of local module detection and pre-simplification. Second, during the evolution, we introduce Davies–Bouldin index and clustering coefficient as fitness functions which are improved and migrated to weighted networks. In order to identify modules that are more relevant to diseases, the above strategies are designed to consider the network topology of genes and the strength of connections with other genes at the same time. Experimental results of different gene expression datasets of non-small cell lung cancer demonstrate that the core modules obtained by DM-MOGA are more effective than those obtained by several other advanced module identification methods.


The proposed method identifies disease-relevant modules by optimizing two novel fitness functions to simultaneously consider the local topology of each gene and its connection strength with other genes. The association of the identified core modules with lung cancer has been confirmed by pathway and gene ontology enrichment analysis.

Peer Review reports


Lung cancer is the cancer with the highest mortality rate worldwide, and about 80% of cases are non-small cell lung cancer (NSCLC) that has a poor 5-year survival rate (average, 9–11 months) [1]. In recent years, research on molecular mechanisms of lung cancer has promoted the development of their corresponding targeted drugs which have greatly improved the survival and prognosis of patients [2]. Meanwhile, more and more evidence has indicated that a group of genes related to a specific disease do not work in isolation, on the contrary, they usually interact with each other, thus gene co-expression networks (GCNs) have become a competitive model. Analysis on GCNs can help researchers identify disease modules. A disease module is considered as a subnetwork that contains most of the disease-related genes with compact topological connections and closely related functions, providing a system-level understanding of disease pathogenesis [3].

In recent years, many methods have been proposed for identifying disease modules. In general, these methods can be divided into four categories: local expansion, machine learning, mathematical programming and evolutionary algorithm (EA) [4]. DISeAse MOdule Detection algorithm (DIAMOnD) is a typical method based on local expansion which uses known disease-associated genes (seeds) to iteratively expand modules by evaluating the significance of the number of connections between seeds and other genes [5]. Although this method can identify disease modules, the coverage of disease-related genes may be low. The node and edge Prioritization-based Community Analysis is a knowledge-guided and network-based integration method to reveal functional modules in non-small cell lung cancer [6]. The protein–protein interaction network is prioritized by a random walk algorithm based on NSCLC seed genes and integrating edge weights, and then a "community network" is constructed in combination with Girvan-Newman and Label Propagation algorithms. MTGO is another method for functional module detection based on prior biological knowledge and topology information [7]. It directly utilizes gene ontology (GO) terms during module detection and labels each module with the most appropriate GO term, thereby simplifying the functional interpretation of modules. Molecular Complex Detection (MCODE) is a popular method without using prior knowledge for seeds [8]. It calculates the weight for each vertex according to the local neighborhood density. Nodes with the largest weight are selected as seeds, then the method traverses outwards and incorporates nodes with the weight higher than a given threshold into the module. SWItch Miner (SWIM) is a method to identify small modules containing key regulatory genes (switch genes) by introducing three topological attribute statistics for nodes [9]. SWIM was applied to a dataset from The Cancer Genome Atlas (TCGA) to characterize the etiology of interesting diseases. Identifying disease modules through machine learning methods is another efficient way. Wu and Stein used Markov Clustering (MCL) to cluster the weighted gene functional interaction network into a series of disease modules to respectively identify prognostic biomarkers of breast cancer and ovarian cancer [10]. PS-MCL (Parallel Shotgun Coarsened MCL) was proposed by Lim et al., a parallel community detection method that outperforms MCL in both runtime and the division quality [11]. PS-MCL adopts an effective coarsening scheme called shotgun coarsening (SC) to improve the module fragmentation problem of MCL, while providing a multi-core parallel algorithm for community detection to increase scalability. In addition, machine learning based methods are more efficient to identify disease modules from multi-omics data. A greedy decision forest is proposed to identify community structure from molecular interaction networks [12]. It obtains a high degree of interpretability by using shapley additive explanations. A strongly interconnected disease module identification method called SigMod is proposed by Liu et al. based on mathematical programming [13]. It identifies disease modules by integrating the results of genome-wide association study (GWAS) and gene networks, as well as optimizing the binary quadratic objective function by a graph min-cut approach. EAs are popular and widely used in the field of disease module identification. Multi-objective evolutionary algorithm (MOEA) DiffCoMO identifies differential co-expression modules by maximizing the difference between module membership value of genes corresponding to two different infection stages [14]. A new method ModuleDiscoverer is proposed to identify regulatory modules from the protein–protein interaction network (PPIN) and gene expression data [15]. It uses a randomization heuristic-based approximation of community structure to discover modules according to the maximum clique enumeration problem.

In this research, we construct a GCN relying on the PPIN, then we develop a disease-related module identification method based on the multi-objective genetic algorithm, named DM-MOGA. This method is utilized to analyze the obtained network by optimizing two fitness functions which can evaluate the functional similarity and the density of the topological connection of modules, respectively. In addition, a boundary correction strategy is designed for local modules obtained by pre-simplification, to reconfirm the genes in the margin belonging to which module.


Network construction

Studies have confirmed that integrating gene expression data and PPIN helps people understand the complex multi-layered molecular structure of human diseases. Therefore, two gene expression datasets of NSCLC in the NCBI Gene Expression Omnibus (GEO) database are selected to construct GCNs with PPIN information being referred to, respectively. Detailed steps of data preprocessing and network construction are as follows. First of all, limma package in the R/Bioconductor software is utilized to identify differentially expressed genes (DEGs) whose t-statistics p value are adjusted by the Benjamini–Hochberg method [16]. Genes with the adjusted p value less than 0.05 are considered as DEGs. Only interactions between DEGs are used to construct GCNs.

To estimate the interaction intensity between DEGs, a new criterion called Gaussian Copula Mutual Information (GCMI) is introduced [17]. GCMI uses the concept of a statistical copula to provide the advantages of Gaussian parametric estimation for variables with any type of marginal distributions, and it is suitable for estimating MI between two continuous variables. At the same time, we use the \(sim_{Rel}\) score to calculate and compare functionally related products of a pair of DEGs which provides a similarity criterion for gene ontology (GO) terms of two gene products. The computation of \(sim_{Rel}\) has been implemented by the R package GOSemSim [18]. The definition of \(sim_{Rel}\) is as follows,

$$sim_{Rel} \left( {c_{1} ,c_{2} } \right) = \mathop {\max }\limits_{{c \in S\left( {c_{1} ,c_{2} } \right)}} \left( {\frac{2 \cdot \log p\left( c \right)}{{\log p\left( {c_{1} } \right) + \log p\left( {c_{2} } \right)}} \cdot \left( {1 - p\left( c \right)} \right)} \right)$$

where \(S\left( {c_{1} ,c_{2} } \right)\) is the set of common ancestors of GO terms \(c_{1}\) and \(c_{2}\), \(p\left( c \right)\) is the probability of \(c\). It is utilized to calculate a fitness value in "Fitness functions" section.

After obtaining the correlation matrix between DEGs, it is compared with the PPIN. We downloaded the PPIN from the human protein reference database (HPRD) which contains 39,240 interactions [19]. The original protein–protein interaction information is presented in Additional file 1. If a correlation does not exist in the PPIN, it will be modified to 0.

DM-MOGA framework

Due to the high complexity of identifying disease modules from a large-scale GCN, heuristic strategies are required to guide the search process. One of the most popular strategies is EA which are suitable for solving global optimization problems in the discrete search space [20]. EA is an optimization algorithm inspired by Darwin's principles of natural selection. Each solution is described as an individual in the population, and each individual is associated with one or more fitness functions optimized by natural selection process. In this paper, we propose a new method DM-MOGA based on MOEA and decomposition to identify modules which is regarded as biomarkers of NSCLC. The workflow of DM-MOGA is displayed in Fig. 1. In this section, we describe the framework of DM-MOGA, including pre-simplification with boundary correction, chromosome encoding and initialization scheme, operators of MOEA and the optimal solution selection strategy. After the evolution is completed, the result with the largest \(W^{\prime}\) in the Pareto front is considered as the final solution that contains hundreds of modules. We only select the biggest module involving more biological information from the GCN.

Fig. 1
figure 1

The DM-MOGA workflow

Local module pre-simplification and boundary correction

In order to improve the adaptability of DM-MOGA for large-scale biological networks, a pre-simplification strategy for local module (LM) is introduced from [21] and executed before the evolution. This strategy randomly selects one node \(a\) from the network, then a LM is defined as containing node \(a\), its neighbor \(a_{k}\) with the largest degree, \(a_{k}^{\prime } s\) neighbor \(a_{kk}\) that has the largest number of common neighbors with \(a_{k}\), and all joint neighbors of \(a_{k}\) and \(a_{kk}\). For neighbors of all nodes in the LM, those neighbors whose number of connections with LM is beyond half of its degree are also added to LM. Finally, a complete graph of order 3 in the LM is selected and simplified to a single node. Specifically, we will not consider the possibility that nodes in the LM do not belong to the same module during the evolution. Above operations are repeated to find another LM from the remaining nodes of the network until all nodes are assigned to a LM.

However, this strategy only considers the topology of the network without the weight of edges, and vertices of some edges with smaller weights do not necessarily belong to a LM. Therefore, we develop a module boundary correction strategy. In this strategy, the matrix \(NodeTable\) for all nodes is maintained, of which each row represents a node. The first column is the index of the node, the second column is the module to which the node belongs after the LM pre-simplification strategy, and the third column is the new module to which the node belongs after the boundary correction.

The constructed GCN is denoted as \(G = \left( {V,E} \right)\), where \(V = \left\{ {v_{i} \left| {i = 1,2, \ldots ,N} \right.} \right\}\) is the set of nodes, \(E = \left\{ {e_{i} \left| {i = 1,2, \ldots ,M} \right.} \right\}\) is the set of connections between a pair of nodes. Firstly, calculate the weight \(\{ V_{1}^{w} ,V_{2}^{w} , \ldots ,V_{i}^{w} , \ldots V_{N}^{w} \}\) for each node in the network. Specifically, the most densely connected area in the module made up of node \(i\) and its immediate neighbors is defined as the highest k-core, and \(V_{i}^{w}\) is the product of the density and the minimum degree of the highest k-core. The density of the highest k-core is \(D^{G} = \frac{2E}{{\left| V \right|\left( {\left| V \right| - 1} \right)}}\) [8]. Then, we get the attribute \(\{ V_{1}^{a} ,V_{2}^{a} , \ldots ,V_{i}^{a} , \ldots V_{N}^{a} \}\) for each node, where \(V_{i}^{a} = \sum\nolimits_{j} {E_{ij} }\), \(j \in LC_{{NodeTable\left( {i,2} \right)}}\). For each module, if node \(i\) satisfies \((V_{i}^{w} > 2)\& (V_{i}^{w} \ge \overline{{V^{w} }} )\), node \(i\) is reserved in this module; otherwise, the third element of the corresponding row of node \(i\) in the \(NodeTable\) is set to 0, indicating that node \(i\) will be reassigned afterwards. Secondly, for each LM, the node with the largest sum of the weight of connecting edges is selected as the seed node, and neighbors of this seed are also recursively assigned to this module. For those nodes that still cannot be allocated to a LM, they are retained in the network independently.

Fitness functions

The proposed DM-MOGA identifies disease modules by minimizing the following two fitness functions. The first function is Davies–Bouldin Index (DBI) which should have been a measure to evaluate the quality of clustering results [22]. The basic idea of DBI is to evaluate the distance between two clusters, considering that the distance between nodes belonging to different clusters should be as large as possible and the distance between nodes within a cluster should be as small as possible. With the similarity matrix obtained by calculating \(sim_{Rel}\) between genes, DBI is applied to assess the similarity of functions of genes belonging to the same disease module. The formula is as follows:

$$DBI = \frac{1}{{C_{num} }}\sum\limits_{i = 1}^{{C_{num} }} {\mathop {\max }\limits_{i;j \ne i} \frac{{S_{i} + S_{j} }}{{dist\left( {v_{i} ,v_{j} } \right)}}}$$

where \(C_{num}\) is the number of modules, \(dist\left( \cdot \right)\) is the \(sim_{Rel}\) similarity between two nodes, \(S_{i} = \frac{1}{{\left| {C_{i} } \right|}}\sum\nolimits_{{x_{j} \in C_{i} }} {dist\left( {x_{j} ,v_{i} } \right)}\) measures the extent of dispersion of module \(C_{i}\), \(C_{i}\) represents the i-th module, \(x_{j}\) is the j-th node of \(C_{i}\), and \(v_{i}\) represents the center of \(C_{i}\).

The second function is a modified clustering coefficient (\(CC^{\prime}\)) suitable for weighted networks. In general, \(CC^{\prime}\) quantifies the aggregation of one node and its neighbors, but it is modified to the sum of \(CC^{\prime}\) of nodes to evaluate the overall module, and the specific formula is as follows. For a random module \(C_{i}\), clustering coefficient is defined as:

$$CC_{i}^{\prime } = \sum\nolimits_{{x_{j} \in C_{i} }} {\frac{{2\sum\nolimits_{{x_{l} ,x_{k} \in R_{{x_{j} }} \wedge l \ne k}} {E\left( {x_{l} ,x_{k} } \right)} }}{{\left| {R_{{x_{j} }} } \right|\left( {\left| {R_{{x_{j} }} } \right| - 1} \right)}}}$$

where \(R_{{x_{j} }}\) is the set of neighbors of node \(x_{j}\). Since the result of MOEA is a group of modules, the second fitness function is set to the maximum of \(CC^{\prime}\).

The criterion \(W\) proposed by Zhao et al. is used to select a solution from the Pareto front obtained by MOEA as the final result [23]. The result with the largest \(W\) is considered as the final result. The original \(W\) is applied to extract a module from unweighted social networks. In order to put all modules into consideration and adapt to weighted GCNs, \(W\) is changed to the following form:

$$W^{\prime} = \sum\limits_{i = 1}^{{C_{num} }} {\left( {\frac{{O\left( {C_{i} } \right)}}{{\left| {C_{i} } \right|^{2} }} - \frac{{B\left( {C_{i} } \right)}}{{\left| {C_{i} } \right|\left| {C_{i}^{^{\prime}} } \right|}}} \right)}$$

where \(C_{i}^{\prime }\) is the complement of \(C_{i}\), \(O\left( {C_{i} } \right) = \sum\limits_{{j,k \in C_{i} }} {E_{j,k} }\), \(B\left( {C_{i} } \right) = \sum\limits_{{j \in C_{i} ,k \in C_{i}^{^{\prime}} }} {E_{j,k} }\).

Multi-objective optimization based on decomposition

The basic theory is multi-optimization problem (MOP) based on Pareto optimum which is to optimize a group of functions at the same time:

$$\min F\left( x \right) = \left( {f_{1} \left( x \right),f_{2} \left( x \right), \ldots ,f_{k} \left( x \right)} \right)^{T}$$

where \(x = \left[ {x_{1} ,x_{2} , \ldots ,x_{N} } \right] \in \Omega\) and \(\Omega\) is the feasible region. Then, the definition of dominance relationship is explained, that is, \(x_{A}\) dominates \(x_{B}\) (written as \(x_{A} \succ x_{B}\), \(x_{A} ,x_{B} \in \Omega\)) if and only if:

$$\forall i \in \left\{ {1,2, \ldots ,k} \right\}\;f_{i} \left( {x_{A} } \right) \le f_{i} \left( {x_{B} } \right) \wedge \exists j \in \left\{ {1,2, \ldots ,k} \right\}\;f_{j} \left( {x_{A} } \right) < f_{j} \left( {x_{B} } \right)$$

If there is no vector \(x \in \Omega\) such that \(x \succ x^{*}\), \(x^{*}\) is called a non-dominated solution or Pareto-optimal solution.

MOEA/D-Net is a community detection method based on MOEA with decomposition. It decomposes a MOP into a number of scalar optimization subproblems and optimizes them simultaneously by population evolution. At each iteration, the population is made up of the best solution found for each subproblem since the beginning of evolution. In MOEA/D-Net, the popular Tchebycheff method is used to construct the aggregation function and therefore the scalar optimization subproblems are in the form:

$$\min g^{te} \left( {\left. x \right|\lambda_{i} ,z^{*} } \right) = \mathop {\max }\limits_{j = 1}^{2} \left\{ {\lambda_{i}^{j} \left| {F_{j} \left( x \right) - z^{*} } \right|} \right\}$$

where \(\lambda = \left\{ {\lambda_{1} ,\lambda_{2} , \ldots ,\lambda_{pop} } \right\}\) is a series of weight vectors uniformly distributed on \(\lambda_{i}^{1} + \lambda_{i}^{2} = 1\), \(\lambda = \left\langle {\lambda_{i}^{1} ,\lambda_{i}^{2} } \right\rangle \in \left[ {0,1} \right]\), \(i = \left\{ {1,2, \ldots ,pop} \right\}\), \(pop\) is the population size, and \(z^{*} = \left\langle {z_{1}^{*} ,z_{2}^{*} } \right\rangle\) is the reference point in which each point \(z_{j}^{*}\) corresponds to the minimum value of a fitness function obtained from the population. For each target vector \(\lambda_{i}\), calculate the Euclidean distance between all weight vectors and \(\lambda_{i}\), and the neighborhood of \(\lambda_{i}\), denoted as \(Neib_{i}\), is made up of \(nm\) individuals with the smallest Euclidean distance to \(\lambda_{i}\), where \(nm\) is a predefined parameter. For each non-dominated individual, there is a weight vector that makes it the optimal solution of Eq. 7, and each optimal solution of Eq. 7 is a Pareto-optimal solution of Eq. 5.


Individuals are encoded and initialized based on the locus-based adjacency encoding schema which is popular in EA-based community detection algorithms [24]. In an individual, each element is initialized as a random neighbor index of its corresponding node or the index of the corresponding node itself, and then this element is recursively replaced by the index that most neighbors of this node share until the element is not changed. Another variable that needs to be initialized is the reference point \(z^{*}\), and it is set to the minimum of two fitness functions in the initial population.

The main loop of DM-MOGA

DM-MOGA adopts the similar framework with MOEA/D-Net that is proposed by Gong et al. [25]. In this method, the following procedure is applied to evolve the population. Every individual \(p_{j} \left( {1 \le j \le pop} \right)\) is used to perform the crossover and mutation operation with another randomly selected individual to generate a \(child\). If the Tchebycheff value of the \(child\) is better than a neighbor in \(Neib_{j}\), replace that neighbor with the \(child\) and update the reference point \(z^{*}\). Specifically, we choose the two-point crossover to take advantage of protecting the effective connection between nodes. We randomly select two elements \(i\) and \(j\) (i.e., \(1 \le i \le j \le N\)), and elements in \(\left[ {i,j} \right]\) are exchanged between two parents in the population. After the crossover operation is finished, an individual \(p_{j}\) is randomly selected for mutation, on which the neighbor-based mutation is performed. According to the encoding strategy, the mutation operator is to randomly select an element \(e_{l}\) in the individual \(p_{i}\) and replace the neighbor index in it with the index of other neighbors of the node corresponding to \(e_{l}\). DM-MOGA will continue to evolve until the maximum number of generations is reached.

Results and discussion


In the experiments, two NSCLC expression microarray datasets obtained using the same platform (GPL570) were downloaded from the GEO database. The first dataset (ID: GSE19804) [26] is balanced and contains 60 disease and control samples, respectively. The second dataset (ID: GSE19188) [27] contains 91 disease samples and 65 control samples. In the above two datasets, each sample contains the expression data of 21,879 genes. After differential expression analysis, 7669 DEGs and 10,496 DEGs were respectively selected from GSE19804 and GSE19188. Detailed information of these two datasets is shown in Table 1.

Table 1 Details of two datasets

Comparison with other methods

Ground-truth dataset

We integrated four kinds of lung cancer-related genes obtained from the MalaCards database as ground truth, including differentially expressed genes, genes related to lung cancer, genes contained in lung cancer related pathways, top affiliated genes of GO terms related to lung cancer [28].

Comparison methods

In the experiment, five methods were used to compare the performance with DM-MOGA, that is, a network reduction-based MOEA for community (module) detection (RMOEA), a disease module identification method SigMod, MCODE, and two classic module identification methods from the R package igraph, that is, Hierarchical Clustering [29] and Louvain [30].

To make a fair comparison, parameters in DM-MOGA and RMOEA were set to the same value, namely, the number of iterations \(max\_gen = 100\), the population size \(pop = 50\), the neighborhood size \(nm = 40\), and the mutation rate was 0.1. In addition, to ensure that SigMod can search for modules smoothly, the parameter \(maxjump\) was set to 27, and we used the default value for parameters of other methods.

Classification performance of disease and healthy samples

Fivefold cross-validation was applied on the largest module to verify its effectiveness as a biomarker. The set of samples is randomly divided into five parts with the same size, one of which is selected as the test set each time, and the other four parts are used for training (the train set). Support vector machine (SVM) is used as the classifier, and the value of five criteria (Accuracy, Precision, Recall, F1 and AUC) of each experiment is taken as the cross-validation result. Since there is a random value during the five-fold cross-validation, for each identified module, we performed five-fold cross-validation for ten times independently, and the final result was the average value of each criterion in experiments. Figures 2 and 3 respectively display the classification performance of the disease module identified by six methods on two NSCLC gene expression datasets. The serial number of the disease module is marked after the abbreviation of comparison methods. According to the figures, module 1 (M1) in community detection results of DM-MOGA obtained the best classification performance on the GSE19804 dataset whose value of five criteria is better than other methods. As for the GSE19188 dataset, the classification performance of the module detected by DM-MOGA was also significant which obtained the maximum value on four of the five metrics. Therefore, the effectiveness of the module obtained from DM-MOGA is verified in guiding the classification of disease and control samples. Moreover, the basic framework of RMOEA is similar to that of DM-MOGA, and both have the ability to effectively guide sample classification. The difference between them is that RMOEA lacks the boundary correction strategy which may lead to unstable results in ten independent experiments. The reason for SigMod, hierarchical clustering and the Louvain algorithm that fails to provide reasonable results might be the same, that is, the default values of their key parameters are not applicable for GCNs. MCODE tends to obtain smaller modules because it has strict conditions for expanding nodes in modules.

Fig. 2
figure 2

The value of the five-fold cross-validated classification index of the optimal module in GSE19804

Fig. 3
figure 3

The value of the five-fold cross-validated classification index of the optimal module in GSE19188

Enrichment analysis for the disease module

The quality of the identified module was further quantitatively studied by gene set enrichment analysis [31]. A hypergeometric test is used to estimate the enrichment of the identified module with reference to the ground truth set, where the statistical significance of the enrichment is defined as follows [32]:

$$p = \sum\limits_{{k = N_{gm} }}^{{N_{g} }} {\left[ {{{\left( {\begin{array}{*{20}c} {N_{g} } \\ k \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {N - N_{g} } \\ {N_{m} - k} \\ \end{array} } \right)} \mathord{\left/ {\vphantom {{\left( {\begin{array}{*{20}c} {N_{g} } \\ k \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {N - N_{g} } \\ {N_{m} - k} \\ \end{array} } \right)} {\left( {\begin{array}{*{20}c} N \\ {N_{m} } \\ \end{array} } \right)}}} \right. \kern-0pt} {\left( {\begin{array}{*{20}c} N \\ {N_{m} } \\ \end{array} } \right)}}} \right]}$$

where \(N_{g}\) is the number of genes in the ground-truth dataset, \(N_{m}\) is the number of genes in the identified module, \(N_{gm}\) is the number of genes that belong to both the ground truth set and the identified module. A smaller p value indicates a more significant enrichment of genes in the identified module. Figures 4 and 5 display the \(- \log \left( p \right)\) value of the module obtained by comparison methods running on GSE19804 and GSE19188, respectively. It can be observed that the proposed method obtains significantly better p value than the other methods on both two GCNs.

Fig. 4
figure 4

Enrichment of the module discovered by different methods from GSE19804

Fig. 5
figure 5

Enrichment of the module discovered by different methods from GSE19188

Effectiveness of GCMI

In order to study whether the gene–gene interaction metric can affect the efficiency of modules identified by DM-MOGA, we employed the most commonly used Pearson Correlation Coefficient (PCC) to reconstruct GCNs for comparison. Except for the interaction criterion, other steps of network construction remain unchanged. Then the proposed method was applied to detect modules on the new networks. Figures 6 and 7 respectively show the classification performance of the module identified by DM-MOGA based on different criteria on the two datasets. It can be observed that GCMI we choose to calculate edge weights can improve the biological significance of the disease module.

Fig. 6
figure 6

The classification performance of the module identified based on different interaction metrics from GSE19804

Fig. 7
figure 7

The classification performance of the module identified based on different interaction metrics from GSE19188

Identification of modules associated with lung cancer


Pathway enrichment analysis was implemented by the KOBAS v3.0 web server, in which four datasets are considered in the analysis, including the KEGG pathway [33], BioCyc, PANTHER, and Reactome [34, 35]. Pathways that are considered to be significant only if the Benjamini–Hochberg adjusted p value ≤ 0.05. Pathway enrichment analysis was performed on the disease modules of GSE19804 and GSE19188 independently, and ten pathways with the smallest p values (\(- \log \left( p \right)\)) were respectively displayed in Figs. 8 and 9. Moreover, in Additional file 2, we provided the enrichment significance of these pathways in the two disease modules. Except for the Pathways in cancer pathway that is from the KEGG pathway database, the other pathways in Figs. 8 and 9 are all from the Reactome pathway database [33, 34]. Each pathway was proved to be connected with lung cancer more or less. To be specific, recent studies have found that the mutation frequency of the PTEN locus in lung cancer is high, and there is a strong correlation between loss of PTEN function and positive expression (p value < 0.05) of EGFR, TGF-α and P-AKT signal transduction pathway (adjusted p value = 4.39E−61) in the development of NSCLC [36]. Besides, there are evidence demonstrating that other signaling pathways enriched in the two disease modules are associated with lung cancer, for instance, the PI3K/Akt signaling pathway (adjusted p value = 5.87E−28) inhibiting the metastasis of A549 cell line from lung adenocarcinoma, the receptor tyrosine kinases (RTKs) (adjusted p value = 2.1E−35) participating in the signal transmission across the plasma membrane, signaling by interleukin (adjusted p value = 4.29E−40) and cell cycle (adjusted p value = 1.97E−31) [37,38,39,40]. In Figs. 8 and 9, all pathways related to the immune system were confirmed to participate in the progression of NSCLC, including the immune system (adjust p-value = 3.23E-55), cytokine signaling in immune system (adjusted p value = 4.23E−40) and innate immune systems (adjusted p value = 6.65E−47) [41,42,43] RNA Polymerase II Transcription pathway participates gene expression (transcription) pathway (adjusted p value = 2.72E−50) and generic transcription pathway (adjusted p value = 7.61E−50). By inhibiting RNA polymerase II-dependent transcription (adjust p value = 1.29E−48), cell growth in the malignant cell line A549 can be effectively inhibited [44, 45]. Studies have found that metabolism of protein (adjusted p value = 1.53E−29) is related to cancer cachexia. In cancer cachexia, overall protein synthesis is decreasing that is directly proportional to tumor growth [46]. Some researchers believe that cancer is a developmental biology (adjusted p value = 2.48E−36) problem. They found that embryos and cancer have a number of common cellular and molecular features [47]. It is known that hemostatic biomarkers (adjusted p value = 3.31E−43) can affect the survival and venous thromboembolism (VTE) occurrence in lung cancer patients [48].

Fig. 8
figure 8

Top 10 significantly enriched pathway terms associated with genes in the identified module of GSE19804

Fig. 9
figure 9

Top 10 significantly enriched pathway terms associated with genes in the identified module of GSE19188

GO terms

GO enrichment analysis was implemented by the R package clusterProfiler [49]. In Figs. 10 and 11, the top ten GO terms that were most significantly enriched in the modules identified from the two datasets are respectively shown. The modules obtained in GSE19804 and GSE19188 were both significantly enriched in regulation of protein serine/threonine kinase activity (GO:0071900, adjusted p value = 3.87E−44). Under EGF-stimulated conditions, it is revealed that proteins interacting with B-Raf are enriched in regulation of protein serine/threonine kinase activity, and several interacting partners of B-Raf are enriched in NSCLC [50].

Fig. 10
figure 10

Top 10 significantly enriched GO terms associated with genes in the identified module of GSE19804

Fig. 11
figure 11

Top 10 significantly enriched GO terms associated with genes in the identified module of GSE19188

Besides, the core module from GSE19804 was mainly enriched in GO:1901653, GO:0043434, GO:0002683, GO:0051348, GO:0033673, GO:0030522, GO:0006352, GO:0006469, GO:0006367, corresponding to cellular response to peptide, response to peptide hormone, negative regulation of immune system process, negative regulation of transferase activity, negative regulation of kinase activity, intracellular receptor signaling pathway, DNA-templated transcription, initiation, negative regulation of protein kinase activity, transcription initiation from RNA polymerase II promoter. Several studies have confirmed that these GO terms are related to the pathogenesis and development of NSCLC. For instance, studies have found that licorice is a potential NSCLC treatment drug, and the targets of licorice are enriched in the biological process of cell response to peptide (GO:1901653, adjusted p value = 3.16E−16) which is an important way to stimulate the acquired immune system [51]. DHX36 plays a role in lung cancer cells by regulating signaling pathways such as response to peptide hormone (GO:0043434, adjusted p value = 1.57E−14) [52]. Interleukin-34 (IL-34) is significantly enriched in "negative regulation of immune system processes" (GO:0002683, adjusted p value = 7.18E−15) which is highly expressed in primary lung cancer tissues and associated with poor prognosis [53]. Exposure to cigarette smoke (CS) can cause injury to the epithelial cells of the respiratory tract and is considered to be one of the pathogenic factors of lung cancer. DEGs were extracted by comparing BEAS-2B cells (a he human bronchial epithelial cell line) before exposure to CS with after, and they are significantly enriched in the negative regulation of transferase activity (GO:0051348, adjusted p value = 2.21E−15) [54]. In addition, negative regulation of kinase activity (GO:0033673, adjusted p value = 1.57E−14), intracellular receptor signaling pathway (GO:0030522, adjusted p value = 3.78E−14), DNA-templated transcription, initiation (GO:0006352, adjusted p value = 1.57E−14), negative regulation of protein kinase activity (GO:0006469, adjusted p value = 2.25E−14) and transcription initiation from RNA polymerase II promoter (GO:0006367, adjusted p value = 1.10E−15) are the most enriched terms by GO enrichment analysis on key lung cancer-related gene sets that have been reported in other studies [55,56,57,58,59].

In Fig. 11, the identified module was mainly enriched in GO:0018108, GO:0018212, GO:0051090, GO:0043405, GO:0032147, GO:0071902, GO:0031098, GO:0051403, GO:0046777, corresponding to peptidyl-tyrosine phosphorylation, peptidyl-tyrosine modification, regulation of DNA-binding transcription factor activity, regulation of MAP kinase activity, activation of protein kinase activity, positive regulation of protein serine/threonine kinase activity, stress-activated protein kinase signaling cascade, stress-activated MAPK cascade, protein autophosphorylation. These GO terms have a certain correlation with lung cancer. Somatic variants that can be detected in the matched lymph node metastases but not in the primary lung cancer, are termed as LME-SMs genes. They are enriched in GO terms, for instance, peptidyl-tyrosine phosphorylation (GO:0018108, adjusted p value = 6.70E−33) and peptidyl-tyrosine modification (GO:0018212, adjusted p value = 1.16E−32) [60]. In [61], there are 35 genes that have been reported to be related to lung cancer. They are mainly related to GO terms in biological pathways, such as regulation of DNA-binding transcription factor activity (GO:0051090, adjusted p value = 1.51E−23), positive regulation of DNA-binding transcription factor activity, etc. In [62], it is found that cPLA2 is over-expressed in NSCLC cells transformed by oncogenic Ras, and cPLA2 is a well-known substrate of MAP kinase and closely related to the regulation of MAP kinase activity (GO:0043405, adjusted p value = 8.96E−32). Stem cell factor (SCF) and its receptor c-kit proto-oncogene are co-expressed in at least 70% of small cell lung cancer tumors and tumor-derived cell lines. The binding of SCF to c-Kit leads to receptor dimerization and activation of protein kinase activity (GO:0032147, adjusted p value = 1.62E−24) [63]. In other studies, the rest GO terms are also the most enriched terms in the results of GO enrichment analysis on key lung cancer-related gene sets [64,65,66].


In this work, to identify disease modules in GCNs, a multi-objective optimization method DM-MOGA is proposed based on the MOEA framework with decomposition. In DM-MOGA, the first step is to respectively construct the GCN on two NSCLC gene expression datasets, in which GCMI between all genes is calculated and considered as edge weights, and then the edges are filtered by referring to the prior knowledge of the PPI network. Secondly, DM-MOGA is separately executed on two GCNs that searches for disease modules by simultaneously optimizing two novel fitness functions, \(DBI\) and \(CC^{\prime}\). After the evolution is finished, the Pareto-optimal solution with the largest \(W^{\prime}\) is selected as the final result.

To examine the validity of disease modules obtained through the above process, a series of experiments performed. First of all, DM-MOGA was compared with several other module identification methods from the following aspects, specifically, the classification effect of disease and control samples guided by modules, the enrichment of modules in the disease-related gene set, and the validity of the edge weight criterion. Then, the correlation between modules and lung cancer was verified by pathway and GO term enrichment analysis. Experiments proved that the biological meaning of key modules obtained by DM-MOGA was more significant.

The proposed method possesses two main advantages. First, two fitness functions that have never been used for module identification problems are introduced which effectively improve the accuracy of the module in guiding patient classification. Second, the boundary correction strategy is designed for local modules, so that nodes with high correlation strength and low degree can be incorporated into the module. However, there are still some works in this field that can be further studied. On the one hand, it is necessary to develop fitness functions that are more suitable for disease module identification; on the other hand, studying the improvement strategies of EAs can further improve search efficiency.

Availability of data and materials

The code for DM-MOGA is available at The PPIN is available from HPRD (; see also Additional file 1). Information of genes associated with NSCLC is compiled from the MalaCards database ( Gene expression dataset GSE19804 and GSE19188 are available from the GEO database (GSE19804:; GSE19188: All tools used in this work are available from the respective links (the KOBAS v3.0 web server:, R package limma:; R package GOSemSim:; R package clusterProfiler:; and R package igraph:



Non-small cell lung cancer


Evolutionary algorithm


DISeAse MOdule Detection algorithm


Molecular complex detection


Markov clustering


Genome-Wide Association Study


Multi-objective evolutionary algorithm


Protein–protein interaction network


Gene expression omnibus


Differentially expressed genes


Gaussian copula mutual information


Gene ontology


Human Protein Reference Database


Local module


Davies–Bouldin Index


Clustering coefficient


Multi-optimization problem


Support vector machine


Pearson correlation coefficient


Receptor tyrosine kinases


Venous thromboembolism




Cigarette smoke


Lymph node


Stem cell factor


  1. Andonegui-Elguera SD, Zamora-Fuentes JM, Espinal-Enríquez J, Hernández-Lemus E. Loss of long distance co-expression in lung cancer. Front Genet. 2021;12:625741.

    Article  CAS  Google Scholar 

  2. Yousefi M, Bahrami T, Salmaninejad A, Nosrati R, Ghaffari P, Ghaffari SH. Lung cancer-associated brain metastasis: molecular mechanisms and therapeutic options. Cell Oncol. 2017;40(5):419–41.

    Article  CAS  Google Scholar 

  3. Mahapatra S, Mandal B, Swarnkar T. Biological networks integration based on dense module identification for gene prioritization from microarray data. Gene Reports. 2018;12:276–88.

    Article  Google Scholar 

  4. Tian Y, Su X, Su Y, Zhang X. EMODMI: a multi-objective optimization based method to identify disease modules. IEEE Trans Emerg Top Comput Intell. 2020;5(4):570–82.

    Article  Google Scholar 

  5. Sharma A, Menche J, Huang CC, Ort T, Zhou X, Kitsak M, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet. 2015;24(11):3005–20.

    Article  CAS  Google Scholar 

  6. Wang F, Han S, Yang J, Yan W, Hu G. Knowledge-guided, “community network” analysis reveals the functional modules and candidate targets in non-small-cell lung cancer. Cells. 2021;10(2):402.

    Article  CAS  Google Scholar 

  7. Vella D, Marini S, Vitali F, Di Silvestre D, Mauri G, Bellazzi R. MTGO: PPI network analysis via topological and functional module identification. Sci Rep. 2018;8(1):1–13.

    Article  Google Scholar 

  8. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform. 2003;4(1):1–27.

    Article  Google Scholar 

  9. Paci P, Colombo T, Fiscon G, Gurtner A, Pavesi G, Farina L. SWIM: a computational tool to unveiling crucial nodes in complex biological networks. Sci Rep. 2017;7(1):1–16.

    Google Scholar 

  10. Wu G, Stein L. A network module-based method for identifying cancer prognostic signatures. Genome Biol. 2012;13(12):1–14.

    Article  Google Scholar 

  11. Lim Y, Yu I, Seo D, Kang U, Sael L. PS-MCL: parallel shotgun coarsened Markov clustering of protein interaction networks. BMC Bioinform. 2019;20(13):1–12.

    Google Scholar 

  12. Pfeifer B, Baniecki H, Saranti A, Biecek P, Holzinger A. Multi-omics disease module detection with an explainable greedy decision forest. Sci Rep. 2022;12(1):1–15.

    Article  Google Scholar 

  13. Liu Y, Brossard M, Roqueiro D, Margaritte-Jeannin P, Sarnowski C, Bouzigon E, et al. SigMod: an exact and efficient method to identify a strongly interconnected disease-associated module in a gene network. Bioinformatics. 2017;33(10):1536–44.

    CAS  Google Scholar 

  14. Ray S, Maulik U. Identifying differentially coexpressed module during HIV disease progression: a multiobjective approach. Sci Rep. 2017;7(1):1–13.

    Article  Google Scholar 

  15. Vlaic S, Conrad T, Tokarski-Schnelle C, Gustafsson M, Dahmen U, Guthke R, et al. Modulediscoverer: identification of regulatory modules in protein-protein interaction networks. Sci Rep. 2018;8(1):1–11.

    Article  CAS  Google Scholar 

  16. Kumar SU, Saleem A, Kumar DT, Preethi VA, Younes S, Zayed H, et al. A systemic approach to explore the mechanisms of drug resistance and altered signaling cascades in extensively drug-resistant tuberculosis. In: Donev R, Karabencheva-Christova T, editors., et al., Advances in protein chemistry and structural biology. Amsterdam: Elsevier; 2021. p. 343–64.

    Google Scholar 

  17. Ince RA, Giordano BL, Kayser C, Rousselet GA, Gross J, Schyns PG. A statistical framework for neuroimaging data analysis based on mutual information estimated via a Gaussian copula. Hum Brain Mapp. 2017;38(3):1541–73.

    Article  Google Scholar 

  18. Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26(7):976–8.

    Article  CAS  Google Scholar 

  19. Keshava Prasad T, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, et al. Human protein reference database—2009 update. Nucleic Acids Res. 2009.

    Article  Google Scholar 

  20. Muraro D, Simmons A. An integrative analysis of gene expression and molecular interaction data to identify dys-regulated sub-networks in inflammatory bowel disease. BMC Bioinform. 2016;17(1):1–11.

    Article  Google Scholar 

  21. Zhang X, Zhou K, Pan H, Zhang L, Zeng X, Jin Y. A network reduction-based multiobjective evolutionary algorithm for community detection in large-scale complex networks. IEEE Trans Cybern. 2018;50(2):703–16.

    Article  Google Scholar 

  22. Karo IMK, MaulanaAdhinugraha K, Huda AF. A cluster validity for spatial clustering based on Davies Bouldin index and polygon dissimilarity function. In: 2017 second international conference on informatics and computing (ICIC); 2017. p. 1–6.

  23. Zhao Y, Levina E, Zhu J. Community extraction for social networks. Proc Natl Acad Sci. 2011;108(18):7321–6.

    Article  CAS  Google Scholar 

  24. Shahabi Sani N, Manthouri M, Farivar F. A multi-objective ant colony optimization algorithm for community detection in complex networks. J Ambient Intell Humaniz Comput. 2020;11(1):5–21.

    Article  Google Scholar 

  25. Gong M, Ma L, Zhang Q, Jiao L. Community detection in networks by using multiobjective evolutionary algorithm with decomposition. Phys A Stat Mech Appl. 2012;391(15):4050–60.

    Article  Google Scholar 

  26. Lu T-P, Tsai M-H, Lee J-M, Hsu C-P, Chen P-C, Lin C-W, et al. Identification of a novel biomarker, sema5a, for non–small cell lung carcinoma in nonsmoking women. Cancer Epidemiol Prev Biomark. 2010.

    Article  Google Scholar 

  27. Hou J, Aerts J, Den Hamer B, Van Ijcken W, Den Bakker M, Riegman P, et al. Gene expression-based classification of non-small cell lung carcinomas and survival prediction. PLoS ONE. 2010.

    Article  Google Scholar 

  28. Rappaport N, Twik M, Plaschkes I, Nudel R, Iny Stein T, Levitt J, et al. MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res. 2016.

    Article  Google Scholar 

  29. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4(1):17.

    Article  Google Scholar 

  30. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008;2008(10):P10008.

    Article  Google Scholar 

  31. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50.

    Article  CAS  Google Scholar 

  32. Liu X, Liu Z-P, Zhao X-M, Chen L. Identifying disease genes and module biomarkers by differential interactions. J Am Med Inform Assoc. 2012;19(2):241–8.

    Article  Google Scholar 

  33. Kanehisa M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 2019;28(11):1947–51.

    Article  CAS  Google Scholar 

  34. Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, et al. The reactome pathway knowledgebase. Nucleic Acids Res. 2022;50(D1):D687–92.

    Article  CAS  Google Scholar 

  35. KOBAS-i. Accessed 23 Aug 2021.

  36. Yun F, Jia Y, Li X, Yuan L, Sun Q, Yu H, et al. Clinicopathological significance of PTEN and PI3K/AKT signal transduction pathway in non-small cell lung cancer. Int J Clin Exp Pathol. 2013;6(10):2112.

    Google Scholar 

  37. Pan B, Shen J, Cao J, Zhou Y, Shang L, Jin S, et al. Interleukin-17 promotes angiogenesis by stimulating VEGF production of cancer cells via the STAT3/GIV signaling pathway in non-small-cell lung cancer. Sci Rep. 2015;5(1):1–13.

    Article  Google Scholar 

  38. Singhal S, Vachani A, Antin-Ozerkis D, Kaiser LR, Albelda SM. Prognostic implications of cell cycle, apoptosis, and angiogenesis biomarkers in non–small cell lung cancer: a review. Clin Cancer Res. 2005;11(11):3974–86.

    Article  CAS  Google Scholar 

  39. Lee Y-C, Lin H-H, Hsu C-H, Wang C-J, Chiang T-A, Chen J-H. Inhibitory effects of andrographolide on migration and invasion in human non-small cell lung cancer A549 cells via down-regulation of PI3K/Akt signaling pathway. Eur J Pharmacol. 2010;632(1–3):23–32.

    Article  CAS  Google Scholar 

  40. Pisick E, Jagadeesh S, Salgia R. Receptor tyrosine kinases and inhibitors in lung cancer. Sci World J. 2004;4:589–604.

    Article  CAS  Google Scholar 

  41. Carbone DP, Gandara DR, Antonia SJ, Zielinski C, Paz-Ares L. Non-small-cell lung cancer: role of the immune system and potential for immunotherapy. J Thorac Oncol. 2015;10(7):974–84.

    Article  CAS  Google Scholar 

  42. Rusek AM, Abba M, Eljaszewicz A, Moniuszko M, Niklinski J, Allgayer H. MicroRNA modulators of epigenetic regulation, the tumor microenvironment and the immune system in lung cancer. Mol Cancer. 2015;14(1):1–10.

    Article  CAS  Google Scholar 

  43. Chen J-B, Kong X-F, Qian W, Mu F, Lu T-Y, Lu Y-Y, et al. Two weeks of hydrogen inhalation can significantly reverse adaptive and innate immune system senescence patients with advanced non-small cell lung cancer: a self-controlled study. Med Gas Res. 2020;10(4):149.

    Article  CAS  Google Scholar 

  44. MacCallum DE, Melville J, Frame S, Watt K, Anderson S, Gianella-Borradori A, et al. Seliciclib (CYC202, R-Roscovitine) induces cell death in multiple myeloma cells by inhibition of RNA polymerase II–dependent transcription and down-regulation of Mcl-1. Can Res. 2005;65(12):5399–407.

    Article  CAS  Google Scholar 

  45. Koparal AT, Zeytinoglu M. Effects of carvacrol on a human non-small cell lung cancer (NSCLC) cell line, A549. Cytotechnology. 2003;43(1):149–54.

    Article  Google Scholar 

  46. Tijerina AJ. The biochemical basis of metabolism in cancer cachexia. Dimens Crit Care Nurs. 2004;23(6):237–43.

    Article  Google Scholar 

  47. Aiello NM, Stanger BZ. Echoes of the embryo: using the developmental biology toolkit to study cancer. Disease Models Mech. 2016;9(2):105–14.

    Article  CAS  Google Scholar 

  48. Reitter EM, Kaider A, Ay C, Quehenberger P, Marosi C, Zielinski C, et al. Longitudinal analysis of hemostasis biomarkers in cancer patients during antitumor treatment. J Thromb Haemost. 2016;14(2):294–305.

    Article  Google Scholar 

  49. Yu G, Wang L-G, Han Y, He Q-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics J Integr Biol. 2012;16(5):284–7.

    Article  CAS  Google Scholar 

  50. Zhang X-Y, Guo H, Huang Y, Hao P-Q, Yang Y, Liu Y, et al. Comparative interactome analysis reveals distinct and overlapping properties of Raf family kinases. Biochem Biophys Res Commun. 2019;514(4):1217–23.

    Article  CAS  Google Scholar 

  51. Zhu J, Huang R, Yang R, Xiao Y, Yan J, Zheng C, et al. Licorice extract inhibits growth of non-small cell lung cancer by down-regulating CDK4-Cyclin D1 complex and increasing CD8+ T cell infiltration. Cancer Cell Int. 2021;21(1):1–18.

    Article  Google Scholar 

  52. Cui Y, Li Z, Cao J, Lane J, Birkin E, Dong X, et al. The G4 resolvase DHX36 possesses a prognosis significance and exerts tumour suppressing function through multiple causal regulations in non-small cell lung cancer. Front Oncol. 2021;11:655757.

    Article  Google Scholar 

  53. Baghdadi M, Wada H, Nakanishi S, Abe H, Han N, Putra WE, et al. Chemotherapy-induced IL34 enhances immunosuppression by tumor-associated macrophages and mediates survival of chemoresistant lung cancer cells. Can Res. 2016;76(20):6030–42.

    Article  CAS  Google Scholar 

  54. Chen H, Chen X, Shen Y, Yin X, Liu F, Liu L, et al. Signaling pathway perturbation analysis for assessment of biological impact of cigarette smoke on lung cells. Sci Rep. 2021;11(1):1–15.

    Google Scholar 

  55. Xie ZC, Tang RX, Gao X, Xie QN, Lin JY, Chen G, et al. A meta-analysis and bioinformatics exploration of the diagnostic value and molecular mechanism of miR-193a-5p in lung cancer. Oncol Lett. 2018;16(4):4114–28.

    Google Scholar 

  56. Gao G, Yao Z, Shen J, Liu Y. Identification of Key miRNAs in the treatment of Dabrafenib-resistant melanoma. Biomed Res Int. 2021;2021:5524486.

    Article  Google Scholar 

  57. Shahid M, Azfaralariff A, Law D, Najm AA, Sanusi SA, Lim SJ, et al. Comprehensive computational target fishing approach to identify Xanthorrhizol putative targets. Sci Rep. 2021;11(1):1–11.

    Article  Google Scholar 

  58. Sarmadi VH, Ahmadloo S, Boroojerdi MH, John CM, Al-Graitte SJR, Lawal H, et al. Human mesenchymal stem cells-mediated transcriptomic regulation of leukemic cells in delivering anti-tumorigenic effects. Cell Transplant. 2020;29:0963689719885077.

    Article  Google Scholar 

  59. Jin Q, Lu J, Gao R, Xu J, Pan X, Wang L. Systematically deciphering the pharmacological mechanism of fructus aurantii via network pharmacology. Evid Based Complement Altern Med. 2021;2021:6236135.

    Article  Google Scholar 

  60. Chen Y, Mao B, Peng X, Zhou Y, Xia K, Guo H, et al. A comparative study of genetic profiles of key oncogenesis-related genes between primary lesions and matched lymph nodes metastasis in lung cancer. J Cancer. 2019;10(7):1642.

    Article  CAS  Google Scholar 

  61. Wang Y, Wang P, Liu M, Zhang X, Si Q, Yang T, et al. Identification of tumor-associated antigens of lung cancer: SEREX combined with bioinformatics analysis. J Immunol Methods. 2021;492: 112991.

    Article  CAS  Google Scholar 

  62. Boonstra J, Verkleij AJ. Regulation of enzyme activity in vivo is determined by its localization. Adv Enzyme Regul. 2004;44:61–73.

    Article  CAS  Google Scholar 

  63. Krystal GW, Hines SJ, Organ CP. Autocrine growth of small cell lung cancer mediated by coexpression of c-kit and stem cell factor. Can Res. 1996;56(2):370–6.

    CAS  Google Scholar 

  64. Casarrubios M, Cruz-Bermúdez A, Nadal E, Insa A, Campelo MdRG, Lázaro M, et al. Pretreatment tissue TCR repertoire evenness is associated with complete pathologic response in patients with NSCLC receiving neoadjuvant chemoimmunotherapy. Clin Cancer Res. 2021;27(21):5878–90.

    Article  CAS  Google Scholar 

  65. Zhuang Z, Chen Q, Huang C, Wen J, Huang H, Liu Z. A comprehensive network pharmacology-based strategy to investigate multiple mechanisms of HeChan tablet on lung cancer. Evid Based Complement Altern Med. 2020;2020:7658342.

    Article  Google Scholar 

  66. Eathiraj S, Palma R, Volckova E, Hirschi M, France DS, Ashwell MA, et al. Discovery of a novel mode of protein kinase inhibition characterized by the mechanism of inhibition of human mesenchymal-epithelial transition factor (c-Met) protein autophosphorylation by ARQ 197. J Biol Chem. 2011;286(23):20666–76.

    Article  CAS  Google Scholar 

Download references


Not applicable.


This work was funded by the National Science Foundation of China to J.S. (61972226), F.L. (61902216) and J-X.L. (61872220).

Author information

Authors and Affiliations



XZ wrote the main manuscript text. XZ and YS designed the project and acquired the data. FL and XK interpreted the results. JS, XZ and YS drafted the manuscript. After that, JXL revised and refined the manuscript substantively. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Yan Sun.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information


file 1: The protein–protein interaction information downloaded from HPRD.


file 2: The union of the 10 most significantly enriched pathways in the two disease modules and the genes overlapping between these modules.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Shang, J., Zhu, X., Sun, Y. et al. DM-MOGA: a multi-objective optimization genetic algorithm for identifying disease modules of non-small cell lung cancer. BMC Bioinformatics 24, 13 (2023).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Disease module identification
  • Biological network construction
  • Gene expression data
  • Genetic algorithm
  • Multi-objective optimization