 Research article
 Open Access
 Published:
ModularBoost: an efficient network inference algorithm based on module decomposition
BMC Bioinformatics volume 22, Article number: 153 (2021)
Abstract
Background
Given expression data, gene regulatory network(GRN) inference approaches try to determine regulatory relations. However, current inference methods ignore the inherent topological characters of GRN to some extent, leading to structures that lack clear biological explanation. To increase the biophysical meanings of inferred networks, this study performed datadriven module detection before network inference. Gene modules were identified by decompositionbased methods.
Results
ICAdecomposition based module detection methods have been used to detect functional modules directly from transcriptomic data. Experiments about timeseries expression, curated and scRNAseq datasets suggested that the advantages of the proposed ModularBoost method over established methods, especially in the efficiency and accuracy. For scRNAseq datasets, the ModularBoost method outperformed other candidate inference algorithms.
Conclusions
As a complicated task, GRN inference can be decomposed into several tasks of reduced complexity. Using identified gene modules as topological constraints, the initial inference problem can be accomplished by inferring intramodular and intermodular interactions respectively. Experimental outcomes suggest that the proposed ModularBoost method can improve the accuracy and efficiency of inference algorithms by introducing topological constraints.
Background
In recent years, systems biology has developed rapidly. With the continuous development of highthroughput analysis technologies such as proteomics and transcriptomics [1], it has become possible to infer gene regulatory networks (GRNs). The main purpose of GRN inference is to determine causal relations between genes. Such networks offer important information about regulation and boost people’s understanding about mechanisms.
After several decades of development, computational efficiency and accuracy of network inference algorithms have increased dramatically. Large gene networks of microorganisms and mammals can be reconstructed using transcriptomics datasets. Methods with high accuracy in inferring GRNs have been proposed. For example, the TIGRESS [2] and fused LASSO [3] based on linear regression have exhibited superior performance in computational efficiency, while machine learningbased methods such as GENIE3 [4] and GRNBoost2 [5, 6] in boosting framework are widely used due to their advantages in accuracy. Mutual informationbased CLR [7] and PIDC [8] can reveal the statistical dependencies among genes.
However, there are some limitations also exist in those GRNs inference methods. Inferred topologies usually lack clear biophysical explanations, limiting their applications such as diseasegene prediction and gene therapy. For GRN and protein–protein interaction (PPI) networks, a key character shared by biological networks is the socalled functional module or communities structure [9]. Each module corresponds to a subnetwork in which nodes are densely connected and exchange information frequently [10]. Besides, traditional approaches determine the casual relations between genes at a single stage, leading to considerable computational burden. Parameter estimation of GRN with topological constraints had exhibited advantages in computational efficiency [11]. According to the regulatory module theory, intermodular connections have a more tight association than the genes pairs in intramodule [12]. When the conventional inference task can be accomplished at multiple stages, the efficiency of network inference may be expected to be improved.
With accurately detected modules, it is feasible to develop an efficient inference framework that combines inherent modular structures with established inference algorithm. Plenty of module identification methods have been developed to detect functional modules from GRN, PPI, and other biological networks. For gene module detection approaches including CoReg and SigMod etc[13, 14], network topologies were required to be known. This is a strong prerequisite that is hard to be satisfied in real applications. In this context, datadriven module identification methods become crucial to identify modules directly from transcriptomic data. Decompositionbased and clusteringbased methods have attracted increasing attention due to the ability to detect gene modules from transcriptomic data [15]. Although gene modules can be detected, current researches focus on finding the biological explanations and relevant pathways to some extent [16]. From the viewpoint of network inference (NI), gene modules provide a constraint to guide the inference, leading to GRN with community structures.
Motivated by these topological characters, this study proposes a ModularBoost method to integrate decompositionbased module identification and boostingbased inference algorithm. Using ICAFDR, ModularBoost assigned genes to regulatory modules according to their expression data. Based on the detected gene modules, casual relations within gene modules were inferred by GRNBoost2 that is a topranking inference algorithm, while the regulatory relationships between modules were detected by linear sparse regression. Then ModularBoost normalized the scores from subnetworks to obtain the final network.
Among decompositionbased methods, ICAFDR, ICAzscore, and PCA have been implemented and compared with commonlyused clustering expression datasets [9]. And we selected the ICAFDR algorithm that demonstrates the highest accuracy in module identification. Besides, the performance of ModularBoost method was evaluated by singlecell expression and timeseries data. The simulated scRNAseq datasets are generated by BEELINE [17] and PIDC [8]. The three experimental scRNAseq data sets are from the SCODE project [18]. And the timeseries data sets are the S. aureus, E. coli and Yeast expression data from the Dream5 challenge [19]. As for the gold standard of gene modules, based on partly known gene regulatory edges, functional modules were extracted by graph theory or community detection methods. This work not only discusses the applicability and accuracy of ModuleBoost in network inference (NI), but also further analyzes the relation between datadriven module detection and NI.
Result
Modular inference of simulated scRNAseq datasets
Curated networks were extracted from Beeline project, which focused on GRN inference using single cell expression data. Different from traditional microarray datasets, singlecell data contains information about cell–cell variability and can be used to investigate behavior patterns of cell populations [20]. However, strong stochastic single cell expression data often lead to low accuracy in network inference. Even several algorithms including SCODE and PIDC have been developed [8, 18], the accuracy levels of GRN inference using single cell data were quietly low. Another bottleneck brought by single cell data was increasing computational burden, due to plenty of cell samples [21]. Cell–cell variability information in single cell expression data play a negative role in inferring TFgene relations, leading to low accuracy of inference in many cases.
In this case, the proposed ModularBoost approach aims to improve the accuracy in inference by introducing topological constraints. In the proposed ModularBoost method, ICAFDR based decomposition was used as inner part to detect functional modules directly from curated datasets. Competing methods include ICAFDR2, ICAzscore, PCAbased decomposition and Kmeans clustering methods. Performance index \(F_{rr}\) using four decomposition methods and Kmeans clustering were described in this research. Curated datasets from the GSD network had three experimental conditions, depending on the dropout rates. PIDC E. coliS denotes singlecell data with 700 cell samples, while both E. coliLL and E. coliLH represent datasets with 2000 samples. In addition, the E. coliLL and E. coliLH groups correspond to singlecell data with low and high dropout rates respectively. Evaluation metrics of the curated GSD network and PIDC E. coli network were compared in Table 1.
\(F_{rr}\) values were positively related with the accuracy level in module detection. Highest values in each column were displayed in bold. ICAFDR and ICAFDR2 required the number of gene modules \(n\_comp\) and the threshold of Qvalue q_cutoff according to the number of genes in GRN. For the module identification evaluation of curated networks with 19 genes, we set consistently \(n\_comp\) = 2 for five module identification methods. Meanwhile, PIDC E. coli network with 100 genes set \(n\_comp\) = 4. And the q_cutoff of ICAFDR, ICAFDR2, and PCA was \(10^{3}\). However, the q_cutoff of ICAzscore decomposition was different from ICAFDR due to the difference in statistical principles: \(\textit{q\_cutoff}\ ^{zscore}\) = 1.5. To eliminate the randomness of the heuristic algorithm, we repeated each gene network 10 times and took the average of \(F_{rr}\).
A pattern that could be found in Table 1 was that ICAFDR outperformed three decomposition methods and kmeans clustering, showing high accuracy in module detection. Furthermore, \(F_{rr}\) indexes obtained by ICAFDR were slightly higher than that from ICAFDR2. A possible explanation was that ICAFDR2 algorithm had taken the direction of regulatory edges into consideration, thus influencing the accuracy of gene module identification. Meanwhile, dropout rate, which was regarded as noise, had shown negative impacts on module detection. \(F_{rr}\) indexes from GSD70 were lower than that from GSD1 and GSD50, using five module detection methods.
Instead of reconstructing a network as whole, the ModularBoost approach accomplished the GRN inference task based on the identified modules, which were densely connected genes and TFs. Furthermore, intramodular and intermodular interactions between genes were inferred in two stages of inference.
It can be observed from Table 2 that the proposed ModularBoost approach obtained highest AUROC indexes among selected inference algorithms including ridge and GRNBoost2 methods. Highest AUROC and AUPR in each column were displayed in bold. This phenomenon validates the effectiveness of ModularBoost as well as the integration of two inference strategies. The computational time of ModularBoost in the three curated networks and PIDC networks are also less than that of GRNBoost2. ModularBoost \(time_{GSD1} = 85\) s, \(time_{GSD50} = 73\) s, \(time_{GSD70} = 78\) s, GRNBoost2 \(time_{GSD1} = 119\) s, \(time_{GSD50} = 116\) s, \(time_{GSD70} = 100\) s. ModularBoost \(time_{E. coliS} = 16\) s, \(time_{E. coliLL} = 155\) s, \(time_{E. coliLH} = 137\) s, GRNBoost2 \(time_{E. coliS} = 67\) s, \(time_{E. coliLL} = 1362\) s, \(time_{E. coliLH} = 965\) s. The AUROC and AUPR indexes of GRNBoost2 and the proposed ModularBoost indicated that the ModularBoost approach was able to accomplish the inference task using reduced computational resource, without obvious loss of accuracy. From this perspective, the ModularBoost approach can be used as an efficient solution to infer TFgene relations using single cell expression data. AUROC and AUPR indexes of GSD70 group were lower than two other groups, suggesting negative impact of dropout rate. Especially when the dropout rate increase from 50 to 70, such impacts were obvious, both in module detection and network inference. Meanwhile, this situation was also confirmed at PIDC datasets. AUROC and AUPR indexes tend to decrease when the singlecell data sample sizes and dropout rates increase. In this study, dropout rate played the role of noise in single cell expression data and has a negative influence in GRN inference accuracy.
To show the superior network prediction of ModuleBoost2, we compared ModularBoost with the ridge regression based inference method from the perspective of degree distribution [22]. The degree distribution \(p_k\) of GRNs follows approximately the powerlaw [23], and it is given by:
where k denotes degree, \(\alpha\) represents scaling coefficient and c is a constant.
The powerlaw distributions of PIDC E. coliS and E. coliLL were shown in Fig. 1. For simulated single cell expression datasets, EcoliS and EcoliLL share the same gold standard networks. These two cases may lead to different inferred structures, due to various complexity levels. When the sample size of expression data increases, the inference indexes tend to decline while the mismatch degree between inferred networks and gold standards has become more significant.
Under this circumstance, the proposed ModularBoost method can obtain GRNs that are higher topologically similar with gold standards, compared with ridge regression inference algorithm. In other words, ModularBoost is able to improve the interpretability of inferred networks to some degree.
Modular inference of experimental scRNAseq datasets
In this section, experimental single cell RNA sequencing(scRNAseq) datasets were used as the major information source in GRN inference. In biomedical and genomic research, scRNAseq datasets has played a crucial role in exploring dynamics of cell population and differentiation. Three scRNAseq datasets and relevant gold standard networks were provided by the SCODE project. In Table 3, PrE (primitive endoderm cells), MEF (mouse embryonic fibroblast cells), DE (definitive endoderm cells) denote various cell types from mouse and human respectively. Regulatory edges in gold standard covered a subset of 100 genes, ranging from 40\(\%\) to 60\(\%\). Based on known regulatory edges, gene modules were detected using graph theorybased methods, i.e. community detections methods. Those modules were intersected, corresponding to overlapping phenomenon.
The basic idea of module standards is to explore strongly interconnected components where nodes are mutually connected. In module 2 of PrE network, ETS1, EGR1 and SMAD7 formed a connected component. ETS1 and EGR1 are hub nodes due to high outdegrees. In known regulatory relations, the SMAD7 gene owns 12 outdegrees and 16 indegrees, indicating that this gene play a bridge node in the PrE network.
For the MEF network, those TFs such as KLF4 had high outdegrees and were believed to have high topological importance in information flows. According to known modules, overlapping existed, allowing part of genes belong to two or more functional modules.
In Fig. 2, gene modules were colored and visualized by the Gephi software to illustrate the topological positions. In the directed graphs, the physical sizes of nodes were positively related with their topological importance. Meanwhile, functional and topological neighborhoods were related but different. It can be observed from Fig. 2 that nodes in the same gene modules tend to densely connected and had topological relations. In this study, gene modules were determined by datadriven ICAFDR method, leaving a part of genes uncolored.
For PrE, MEF, DE networks, the numbers of gene modules were assigned as 3, 4, 4, according to prior information.
From Table 4, high \(F_{rr}\) indexes with bold obtained by ICAbased decomposition suggested their advantages in detection accuracy, compared with PCAdecomposition and kmeans clustering. And \(F_{rr}\) indexes of ICAFDR2 were lower than that of ICAFDR. This pattern was consistent with that in microarray datasets from curated subsection.
Based on the detected modules, the proposed ModularBoost approach accomplished the network inference using two kinds of regression: ridge regression and ensemble inference algorithm. This study chose cuttingedge ensemble inference named GRNBoost2 to determine intramodular casual interactions. Table 5 shows AUROC and AUPR indexes obtained by the ModularBoost method and three candidate inference algorithms. And highest AUROC and AUPR in each column were displayed in bold.
SCODE PrE and MEF datasets are singlecell expression data measured from mouse cells, while SCODE DE dataset denotes expression data of human cell populations. For PrE and MEF datasets, the proposed ModularBoost method obtained improved inference performance with reduced computational cost (ModularBoost \(time_{PrE} = 17s, time_{MEF} = 24s, time_{DE} = 11\) s, GRNBoost2 \(time_{PrE} = 335s, time_{MEF} = 230s, time_{DE} = 539\) s).
Figure 3 depicts both the PR and the ROC curves of four inference algorithms, during reconstruction of three SCODE networks. For PrE and MEF networks, the proposed ModularBoost approach outperforms three candidate inference methods in terms of AUPR and AUROC values. Moreover, the shape of the PR curve shows that the predicted lists inferred by ModularBoost method contains more true edges than other algorithms. For the DE dataset, the accuracy indexes of ModularBoost were lower than that of existing GRNBoost2. One possible explanation is the complex physiological process of gene regulation in human cell populations.
Modular inference of DREAM5 networks
As a typical kind of timeseries expression data, microarray datasets have long been used as information source in GRN inference. DREAM5 provided four networks as benchmarks for network inference research [19]. Considering the huge number of nodes and regulatory edges, the accuracy indexes obtained by existing methods were relatively low and the computational time was considerable. Among four DREAM5 networks, S. aureus, E. coli and Yeast networks had been used as benchmarks in this sections.
In DREAM5 challenge, ‘true’ regulatory relations were used as socalled gold standards to judge the accuracy degree of inference algorithms. Specially, the modular gold standard of DREAM5 was from Saelens [9]. As for the labels of gene modules, Sisima and Macisaac et. al provided two groups of incomplete module sets to quantitatively evaluate performance of candidate module identification methods [24]. Based on ICAFDR, the ModularBoost approach firstly detected gene modules directly from DREAM5 datasets. The comparison of ICAFDR with the other methods were shown in Table 6.
Minimal, Strict, and Interconnected denote three definitions of gene modules, according to connectivity patterns. The Minimal modules can be regarded as the overlapping sets of genes that shared at least one TF. And Strict modules correspond to the gene sets that are regulated by the same regulators. For Interconnected modules, genes in the same module are strongly interconnect. For the same gold standard network, the Interconnected standard usually obtains the least number of gene modules.
For E. coli network with 4511 genes and Yeast network with 5950 genes , we set \(n\_comp = [10, 20, \ldots , 100]\) and \({q_cutoff} = [10^{1}, 10^{2}, \ldots , 10^{13}]\), leading to 130 parameter combinations. The parameter q_cutoff of ICAzscore and PCA decomposition algorithms were settled as \({q\_cutoff}\ ^{zscore} = [0.5, 1, \ldots , 6.5]\), \({q\_cutoff}\ ^{pca} = [1, 0.75, 0.5, 0.25, 0.1, 0.075, 0.05, 0.025, 0.01, 0.0075, 0.005, 0.0025, 0.001]\). To reduce the stochastic impacts, decompositionbased module detection algorithms detected modules ten times for each parameter combination, and computed the average \(F_{rr}\) indexes. In Table 6, the values \(F_{rr}\) were obtained by taking the maximum value among 130 combinations.
It can be observed from Table 6 that decompositionbased approaches identify gene modules with enhanced accuracy than Kmeans method, under three module definitions. Among three module definitions, \(F_{rr}\) indexes of the Minimal and Strict groups were higher than that of Interconnected criteria. This phenomenon indicated that Minimal coregulation can be a suitable definition to evaluate module detection. To demonstrate the decomposition of ICAFDR, regulatory modules were colored in DREAM5 E. coli gene networks shown in Fig. 4.
In Fig. 4, nodes in the graph represent TFs or genes and edges denote regulations. Those nodes colored with the same color were assigned to the same functional module, according to ICAFDR decomposition in the ModularBoost method. With annotations, gene modules are crucial to deepen the understanding about regulation mechanisms with a given network. From this perspective, the ModularBoost method offered a datadriven solution to unveil functional modules directly from expression data, even without accurate annotations. Different from raw module identification, the ModularBoost method detects gene at the first stage, then infer intermodular and intramodular regulations at the second stage. Directed regulatory edges between TFgene pairs will be necessary to further analyze information flow and potential biophysical explanations.
In systems biology, densely connected nodes usually are related with specific cellular functions or diseases [25]. Under this circumstance, the ModularBoost method aims to provide reconstructed GRN topologies with clear community structures. This is an important character owned by the ModularBoost method. Other candidate inference algorithm including ridge regression based TIGRESS and ensemblebased GRNBoost2 majorly focus the whole network structure [2].
From the gene module detection outcomes of two DREAM5 networks, the ICAFDR part in the ModularBoost method accomplished the first stage of task efficiently. In the subsequent inference of subnetworks, the AUROC and AUPR indexes were compared in Table 7. And the highest values in each column were marked in bold.
It can be observed from the Table 7 that the ModularBoost approach obtained higher AUROC and AUPR indexes than standard ridge and GRNBoost2 in two DREAM5 networks. The proposed ModularBoost approach integrates ensemblebased GRNBoost2 and ridge inference methods by introducing gene modules as topological constraints. The core of TIGRESS algorithm was regularized regression, leading to higher AUPR indexes than conventional linear regression methods. Network inference tasks were performed on a computer with 8 GB RAM, Intel i79750H 2.60 GHz. GRNBoost2 algorithm took 1 h 57 min to complete the inference of the DREAM5 S. aureus network, while ModularBoost only needed 7 min for the same task. GRNBoost2 algorithm took 41 h 46 min on DREAM5 E. coli network, while ModularBoost took 1 h 8 min, reducing approximately 96% computing efforts. The DREAM5 Yeast network took 33 h 36 min with GRNBoost2 inference, while only used 1 h 9 min with ModularBoost. Obviously, compared with GRNBoost2, ModularBoost significantly improved the speed of network inference, and did not cause a significant decrease in the accuracy of network inference. One advantage of introducing topological constraints is to improve inference accuracy. Another benefit is to reduce computational burden, especially for GRN with thousands of regulatory edges.
For three types of GRNs, degree distributions were fitted by the powerlaw distribution, as shown in Fig. 5. Compared with ridgebased inference algorithm, GRNs obtained by the ModularBoost approach showed closer similarity with the gold standard networks. These results show that ModularBoostinferred networks are more similar with the ‘true’ network topology. The linear fitting parameters are shown in Table 8.
Discussion and conclusions
This work aims to develop a seamless framework to perform GRN inference based on module identification. In order to detect modules from expression data, ICAbased decomposition algorithms have been applied in the proposed ModularBoost algorithm. Among several candidate decomposition methods, ICAFDR had shown advantages in detection accuracy. In this case, ModularBoost employs the ICAFDR algorithm to detect gene module from transcriptomic data. In the subsequent network inference part, intramodular and intermodular interactions were determined by ensemblebased and sparse regressionbased algorithms respectively. The idea behind the ModularBoost is to introduce topological constraints to conventional network inference. Such topological constraints consider inherence community structures in GRN and other biological networks and can be introduced by datadriven approaches.
The proposed ModularBoost method can be also regarded as a lowweight solution to deal with timeseries and single cell expression data. Based on experimental outcomes about curated and scRNAseq datasets, the ModularBoost method is able to improve inference accuracy as well as to reduce computational time. It can be understood that decomposition of network inference can reduce the computational burden, since an original task was transferred to multiple subtasks. The purpose is to obtain topologies with better biophysical or biomedical explanations. To evaluate the effectiveness of identified regulatory modules, relevant annotations called Module gold will be necessary. This study applies module labels and compute \(F_{rr}\) index to quantitatively evaluate the performance of datadriven gene module identification.
ModularBoost methods
ICAFDR based gene module identification
Gene module correspond to the group of genes with similar expressive patterns and biological functions. Researches about gene module help researchers better understand disease modules and genedisease relations.
In general, the intent of the independent component analysis (ICA) is to find the hidden ‘independent component’ that refers to the gene module in this research [26]. When applied in the field of gene module detection, ICA usually splits express data matrix X into two matrices: a source matrix S and a mixing matrix A, which means \(X = AS\) is shown in Fig. 6.
For single cell expression matrix X, this study assumes that the columns and rows of X correspond to genes and cell samples respectively. The expression value \(x_{ij}\) of gene j at sample i is shown as
where \(a_{ik}\) reflects the contribution of cell sample i in component k and \(s_{kj}\) denotes the contribution of component k on gene j [16]. ICAdecomposition algorithms are different in adopting optimization standards for component independence. In this study, we chose an efficient ICA algorithm—FastICA. The goal of FastICA is to find an orthogonal rotation of prewhitened data through a fixedpoint iteration frame [27]. FastICA iteratively maximizes nonGaussian of the rotated components until convergence, and k independent signals that corresponded k gene modules are found in this process.
FastICA algorithm tries to find gene components in the source matrix that own nonGaussian characteristics. Each modular signal in the source matrix generally obeys a heavytailed normal distribution. Under this circumstance, those genes at the tails have a significant contribution to those components, while the majority of genes in peak have weak impact. In the next step, we applied the false discovery rate (FDR) estimation to assign genes to various functional modules.
The basic procedure of the ICAFDR algorithm is shown as Algorithm 1.
Input parameters include the number of gene modules \(n\_comps\), the maximum iterations \(max\_iter = 20{,}000\) in this research, and the threshold of Qvalue q_cutoff. The process of whitening was defined as Eq.3.
where \(\varvec{E}\) denotes the orthogonal matrix of eigenvectors of \(E\{\varvec{xx}^T\}\) and \(\varvec{D}\) is the diagonal matrix of its eigenvalues. The first goal of ICAFDR is maximizing nonGaussianity, and nonGaussianity is measured by the approximation of negentropy \(J_G(\varvec{w})\) given in the Eq.4.
The entropy \(H(\cdot )\) can be defined as Eq.5 in the ICAFDR:
where \(\upsilon\) is a Gaussian variable of unit variance and zero mean, and \(G(\cdot )\) is the nonquadratic function that is used to improve the robustness of estimation, such as:
The \(g(\cdot )\) that in Algorithm. 1 is the derivatives of the function in Eq.6:
FDR represents the number of false discoveries in an experiment divided by the total number of discoveries, and the discovery is statistical test that provides an acceptance threshold. Using hypotheses tests, this study first evaluated the statistical significance of genes in each modular signal. This yielded a p value for genes in each module, and a ranked list in descending order. Correction for multiple testing was performed by calculating a “Qvalue” from the p values and estimating FDR values [28]. The formula for calculating a Qvalue is defined by Eq.8
where \(p_{ik}\) is the \(i^{th}\) smallest p alue out of n gene p values for the k modular signal.
The fundamental principle of postprocess is assigning the genes with lower Qvalue than q_cutoff to a module, and the process was shown in Fig. 6. According to the number of genes and modules, the value of q_cutoff was selected from the set \(\{10^{1}, 10^{2}, \ldots , 10^{13}\}\).
The ICAbased decomposition also has several derivatives, including ICAFDR, ICAFDR2 and ICAzscore. ICAFDR2 is similar to ICAFDR but divides each component into two modules according to the signs of gene regulations, while ICAzscore replaces FDR indexes with zscores to detect module from source signals. As a tool to reduce dimension, the principal component analysis (PCA) can be used to visualize the similarities among the biological samples [29]. We tested the performance of these methods in the experiments to compare with the ICAFDR.
Decompositionbased GRN hybrid inference
Different NI algorithms have their unique characteristics, and the combination of multiple algorithms can provide a possible strategy to obtain networks with sparse and dense interactions [11]. The ICAFDR algorithm has divided genes into different modules. For GRNs, functionally correlated genes or TFs form gene modules in which the intramodular connections are tenser than of intermodular relations. In this section, based on the detected gene modules, we proposed an algorithm that uses GRNBoost2 to infer intramodular interactions and ridge regression to determine intermodular regulations, which conforms to community structures. Before calculating the intermodular connections, the proposed ModularBoost approach removes those gene pairs that are in the same module to reduce computational burden. The workflow of ModuleBoost is shown in Fig. 7.
Infer intramodular interactions using GRNBoost2
Based on a similar concept as GENIE3, GRNBoost2 infers regulators for every target gene purely from the gene expression matrix [4]. The conventional GRNBoost2 inference algorithm was based on Gradient Boosting Machine (GBM) regression that focuses on efficiency and had been a topranking algorithm in GRN inference. One character of GRNBoost2 is global estimation of decision tree number with a selftuning mechanism.
The set of \(n\_comps\) modules that decomposed by the ICAFDR is defined as \(M = \{ m_1, m_2, \ldots , m_k, \ldots , m_{n\_comps}\}\), where the \(m_k = \{g_1^k, g_2^k, \ldots , g_{k_n}^k\}\), \(k = 1,2,\ldots , n\_comps\), and there are \(k_n\) genes assigned in \(m_k\). In what follows, we applied GRNBoost2 at each module \(m_k\) and separately inferred the intramodular connections. Learning samples can be constructed as \(LS^k =\{\mathbf{x }_1, \mathbf{x }_2, \ldots , \mathbf{x }_s, \ldots , \mathbf{x }_N\}\), where N is the number of samples or experiments in the gene expression matrix, and \(\mathbf{x} _s = (x_s^1, x_s^2, \ldots , x_s^{k_n})^T\) is a vector of genes expression data in sth sample.
GRNBoost2 assumes that the expression levels of the genes in \(m_k\) can be represented by the other genes in the module with random noise. This indicates that \(\mathbf{x} _s^{j}\) can be defined as the vector of genes except gene j in sth observation samples, i.e. \(\mathbf{x} _s^{j} = (x_s^{1}, x_s^{2}, \ldots , x_s^{j1}, x_s^{j+1}, \ldots , x_s^{k_n})^T\). Therefore, expression behaviors of the target gene j are controlled by the other genes, shown by Eq.9:
where \(\varepsilon _s\) is a random noise with mean of zero. The function \(f_j(\cdot )\) exploits the expression of direct regulators of gene j, and it is trained from the learning sample \(LS_j^k = \{(\mathbf{x} _s^{j}, x_s^{j}), s = 1,\ldots ,N\}\). Meanwhile, the feature selection computes the confidence level \(w_{ij} (i \ne j)\) for the regulatory edge from gene i to gene j. For the tree regressionbased GRN inference method, the function \(f_j(\cdot )\) is determined by an ensemble of decision trees. At each tree node \(\phi\), the total reduction of the variance for the output variable is computed as:
where S is the set of samples at the tree node \(\phi\), \(S_l\) and \(S_r\) denote the left and right subtrees, \(Var(\cdot )\) represents the variance during splitting.
For GRNBoost2, each decision tree is trained by a random subset of samples that cover approximately 90\(\%\) of raw data, and 10\(\%\) others are called outofbag samples that improve the loss function [5]. Trees stop growing up when the loss function meets the earlystopping rule that the average loss improvement drops to 0.
Therefore, gene modules correspond to subnetworks with multiple topological importance levels. Gathering regulatory associations and sorting in descending order, the intramodular inference finalizes the output as shown in Fig. 7c.
Infer intermodular connections using sparse regression
The regulatory associations between regulators and target genes in the same module were inferred by ensemblebased GRNBoost2 algorithm, while the relationships controlled by regulators that outside the module are still unknown. The combination of sparse regression algorithms with biologically meaningful constraints may provide a promising solution to enhance accuracy of GRN inference. Based on this hypothesis, we recovered regulatory relations among genes in different modules by adopting regularized linear regression (ridge regression) and used a pretreatment to reduce computational complexity. The visualization of intermodular inference is available in Fig. 6d.
In the following, given n genes, m samples, and the expression matrix \(E_{m \times n}\), the linear regression problem can be defined as:
where \(\varvec{E}_t\) is a expression value vector of target gene \(g_t\in G = \{g_1, g_2, \ldots , g_n\}, t = 1,2, \ldots ,n\). The potential regulators of \(g_t\) that not in the same module with \(g_t\) are denoted by \(G^{t} = \{g_{r_1}, g_{r_2},\ldots , g_{r_t}\}\), and \(\varvec{E}_{r_1}, \varvec{E}_{r_2}, \ldots , \varvec{E}_{r_t}\) are the expression of \(G^{t}\). Detailed representation is:
where \(\alpha _{r_t,t}\) is a regression coefficient describing the influence of regulator \(g_{r_t}\) on the target gene \(g_t\), and \(\varvec{\beta }\) is a vector of noise in regression. For obtaining the solution vector \(\varvec{\alpha }\) of each target gene, the objective function of ridge regression is defined as:
where the quadratic penalty term \(\Vert \varvec{\alpha }\Vert _2^2\) makes the loss function convex and leads to a unique minimum.
After the inference of subnetworks and intermodular connections, a total of \(n \times (n1)\) causal relations and importance scores were computed. Then the regulatory edge scores of intramodular and intermodular relations were standardized using maximumminimum normalization:
According to the normalized important scores, the regulatory associations are combined and ranked in descending order. Therefore, we can calibrate those associations with the gold standard, as shown in Fig. 7e, and use AUROC index to examine whether the real regulatory relationships enrich at the top of the ranking.
Evaluation metrics
Evaluation metrics are used to quantitatively evaluate the performance of datadriven module identification methods. As indexes F measure, Rand index, and the normalized mutual information face the problem of detecting overlapping modules [9]. This study selects Recovery, Relevance, and \(F_{rr}\) to evaluate the accuracy of ICAFDRbased module identification methods for their high accuracy and efficiency in handling overlap. The three indexes are in the range of [0, 1], and reach the value 1 only when the observed module and the known module are exactly equal.
\(\hat{M}\) and M are the set of known and observed modules. In this first stage, a distance (similarity) matrix is computed by the Jaccard index between two modules sets.
The Recovery is used to match known modules with observed modules, and the Relevance index reflect the extent to which observed modules match with known modules.
Afterwards, the similarity score \(F_{rr}\) is summarized by mapping the known modules to detected ones and vice versa. A score quantifying the false negatives (Relevance) is calculated by averaging the similarities of known modules and picking out the best representatives in the detected modules. Another score that is related to false positives (Recovery) is computed in a similar style. An asymmetric method for module similarity is given by averging Relevance and Recovery as shown in Eq. 17
This study uses the values of area under receive operator curve (AUROC) and area under precisionrecall curves (AUPR) to reveal accuracy levels of the ModularBoost network inference algorithm. ModularBoost outputs a descending list of putative regulatory interactions. Picking only the top K edges in this list, we compared them with gold standards to assess the number of false positives (FP), true positives (TP), false negatives (FN), and true negatives (TN). ROC curve shows the tradeoff between false positive rate (FPR) and true positive rate (TPR) across different K thresholds, while PR describes the relationship between recall and precision. FPR, TPR, recall, and precision are expressed as:
Finally, the AUROC and the AUPR are respectively assessed by computing the area under ROC and PR curves.
Availability of data and materials
The datasets analyzed during the current study are available in the Github repository, https://github.com/cosinalee/ModularBoost.git
Abbreviations
 GRN:

Gene regulatory network
 NI:

Network inference
 PPI:

Protein–protein interaction
 FDR:

False discovery rate
 GSD:

Gonadal sex determination
 TF:

Transcriptional factor
 ICA:

Independent component analysis
 PCA:

Principal component analysis
 AUROC:

Area under receive operator curve
 AUPR:

Area under precisionrecall curve
 scRNAseq:

single cell RNA sequencing
 PrE:

Primitive endoderm cells
 MEF:

Mouse embryonic fibroblast cells
 DE:

Definitive endoderm cells
 GBM:

Gradient boosting machine
 FP:

False positives
 TP:

True positives
 FN:

False negatives
 TN:

true negatives
 FPR:

false positive rate
 TPR:

True positive rate
References
 1.
Deng Y, Jiang YH, Yang Y, He Z, Luo F, Zhou J. Molecular ecological network analyses. BMC Bioinf. 2012;13(1):113.
 2.
Haury AC, Mordelet F, VeraLicona P, Vert JP. Tigress: trustful inference of gene regulation using stability selection. BMC Syst Biol. 2012;6:145.
 3.
Omranian N, EloundouMbebi JMO, MuellerRoeber B, Nikoloski Z. Gene regulatory network inference using fused lasso on multiple data sets. Entific Rep. 2016;6(1):20533.
 4.
Irrthum A, Wehenkel L, Geurts P, et al. Inferring regulatory networks from expression data using treebased methods. PLoS ONE. 2010;5(9):12776.
 5.
Moerman T, Aibar Santos S, Bravo GonzálezBlas C, Simm J, Moreau Y, Aerts J, Aerts S. Grnboost2 and arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics. 2019;35(12):2159–61.
 6.
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–232.
 7.
Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. Largescale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007;5(1):8.
 8.
Chan TE, Stumpf MP, Babtie AC. Gene regulatory network inference from singlecell data using multivariate information measures. Cell Syst. 2017;5(3):251–67.
 9.
Saelens W, Cannoodt R, Saeys Y. A comprehensive evaluation of module detection methods for gene expression data. Nature Commun. 2018;9(1):1–12.
 10.
Chaussabel D, Baldwin N. Democratizing systems immunology with modular transcriptional repertoire analyses. Nat Rev Immunol. 2014;14(4):271–80.
 11.
Li W, Zhang W, Zhang J. A novel model integration network inference algorithm with clustering and hub genes finding. Molecular Inf. 2020;39(5):1900075.
 12.
Alsina L, Israelsson E, Altman MC, Dang KK, Ghandil P, Israel L, Von Bernuth H, Baldwin N, Qin H, Jin Z, et al. A narrow repertoire of transcriptional modules responsive to pyogenic bacteria is impaired in patients carrying lossoffunction mutations in MYD88 or IRAK4. Nat Immunol. 2014;15(12):1134–42.
 13.
Song Q, Grene R, Heath LS, Li S. Identification of regulatory modules in genome scale transcription regulatory networks. BMC Syst Biol. 2017;11(1):140.
 14.
Liu Y, Brossard M, Roqueiro D, MargaritteJeannin P, Sarnowski C, Bouzigon E, Demenais F. Sigmod: an exact and efficient method to identify a strongly interconnected diseaseassociated module in a gene network. Bioinformatics. 2017;33(10):1536–44.
 15.
Zhang W, Zhang F, Zhang J, Wang N. Hierarchical parameter estimation of GRN based on topological analysis. IET Syst Biol. 2018;12(6):294–303.
 16.
Rotival M, Zeller T, Wild PS, Maouche S, Szymczak S, Schillert A, Castagné R, Deiseroth A, Proust C, Brocheton J, et al. Integrating genomewide genetic variations and monocyte expression data reveals transregulated gene modules in humans. PLoS Genet. 2011;7(12):1002367.
 17.
Pratapa A, Jalihal AP, Law JN, Bharadwaj A, Murali T. Benchmarking algorithms for gene regulatory network inference from singlecell transcriptomic data. Nat Methods. 2020;17(2):147–54.
 18.
Matsumoto H, Kiryu H, Furusawa C, Ko MS, Ko SB, Gouda N, Hayashi T, Nikaido I. Scode: an efficient regulatory network inference algorithm from singlecell RNASEQ during differentiation. Bioinformatics. 2017;33(15):2314–21.
 19.
Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, Kellis M, Collins JJ, Stolovitzky G. Wisdom of crowds for robust gene network inference. Nat Methods. 2012;9(8):796–804.
 20.
Yuan Y, BarJoseph Z. Deep learning for inferring gene relationships from singlecell expression data. Proc Nat Acad Sci. 2019;116(52):27151–8.
 21.
Zhang W, Li W, Zhang J, Wang N. Data integration of hybrid microarray and single cell expression data to enhance gene network inference. Curr Bioinf. 2019;14(3):255–68.
 22.
de Matos Simoes R, Dehmer M, EmmertStreib F. Interfacing cellular networks of S. cerevisiae and E. coli: connecting dynamic and genetic information. BMC Genomics. 2013;14(1):324.
 23.
Ouma WZ, Pogacar K, Grotewold E. Topological and statistical analyses of gene regulatory networks reveal unifying yet quantitatively different emergent properties. PLoS Comput Biol. 2018;14(4):1006098.
 24.
MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E. An improved map of conserved regulatory sites for saccharomyces cerevisiae. BMC Bioinf. 2006;7(1):1–14.
 25.
Luo P, Li Y, Tian LP, Wu FX. Enhancing the prediction of diseasegene associations with multimodal deep learning. Bioinformatics. 2019;35(19):3735–42.
 26.
Nascimento M, Silva FFE, Sáfadi T, Nascimento ACC, Ferreira TEM, Barroso LMA, Ferreira Azevedo C, Guimarães SEF, Serão NVL. Independent component analysis (ICA) basedclustering of temporal RNASEQ data. PLoS ONE. 2017;12(7):0181195.
 27.
Hyvärinen A, Oja E. Independent component analysis: algorithms and applications. Neural Netw. 2000;13(4–5):411–30.
 28.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol). 1995;57(1):289–300.
 29.
Yao F, Coquery J, Lê Cao KA. Independent principal component analysis for biologically meaningful dimension reduction of large biological data sets. BMC Bioinf. 2012;13(1):24.
Acknowledgements
The authors are grateful preliminary research on GRN inference from Wenchao Li, which sparked the idea for this project. Li’s guidance on the principles of scRNAseq data also gives this research a broader perspective.
Funding
Not applicable.
Author information
Affiliations
Contributions
XL tested the performance of ModularBoost and wrote the manuscript. WZ drafted the initial idea, loaded the initial datasets, and wrote manuscript. JZ and GL were the supervision and guided the entire research process. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Li, X., Zhang, W., Zhang, J. et al. ModularBoost: an efficient network inference algorithm based on module decomposition. BMC Bioinformatics 22, 153 (2021). https://doi.org/10.1186/s1285902104074y
Received:
Accepted:
Published:
Keywords
 Regulatory network inference
 Gene module Decomposition
 GRNBoost2
 Linear regression