ModularBoost: an efficient network inference algorithm based on module decomposition

Background Given expression data, gene regulatory network(GRN) inference approaches try to determine regulatory relations. However, current inference methods ignore the inherent topological characters of GRN to some extent, leading to structures that lack clear biological explanation. To increase the biophysical meanings of inferred networks, this study performed data-driven module detection before network inference. Gene modules were identified by decomposition-based methods. Results ICA-decomposition based module detection methods have been used to detect functional modules directly from transcriptomic data. Experiments about time-series expression, curated and scRNA-seq datasets suggested that the advantages of the proposed ModularBoost method over established methods, especially in the efficiency and accuracy. For scRNA-seq datasets, the ModularBoost method outperformed other candidate inference algorithms. Conclusions As a complicated task, GRN inference can be decomposed into several tasks of reduced complexity. Using identified gene modules as topological constraints, the initial inference problem can be accomplished by inferring intra-modular and inter-modular interactions respectively. Experimental outcomes suggest that the proposed ModularBoost method can improve the accuracy and efficiency of inference algorithms by introducing topological constraints.

After several decades of development, computational efficiency and accuracy of network inference algorithms have increased dramatically. Large gene networks of microorganisms and mammals can be reconstructed using transcriptomics datasets. Methods with high accuracy in inferring GRNs have been proposed. For example, the TIGRESS [2] and fused LASSO [3] based on linear regression have exhibited superior performance in computational efficiency, while machine learning-based methods such as GENIE3 [4] and GRNBoost2 [5,6] in boosting framework are widely used due to their advantages in accuracy. Mutual information-based CLR [7] and PIDC [8] can reveal the statistical dependencies among genes.
However, there are some limitations also exist in those GRNs inference methods. Inferred topologies usually lack clear biophysical explanations, limiting their applications such as disease-gene prediction and gene therapy. For GRN and protein-protein interaction (PPI) networks, a key character shared by biological networks is the so-called functional module or communities structure [9]. Each module corresponds to a sub-network in which nodes are densely connected and exchange information frequently [10]. Besides, traditional approaches determine the casual relations between genes at a single stage, leading to considerable computational burden. Parameter estimation of GRN with topological constraints had exhibited advantages in computational efficiency [11]. According to the regulatory module theory, inter-modular connections have a more tight association than the genes pairs in intra-module [12]. When the conventional inference task can be accomplished at multiple stages, the efficiency of network inference may be expected to be improved.
With accurately detected modules, it is feasible to develop an efficient inference framework that combines inherent modular structures with established inference algorithm. Plenty of module identification methods have been developed to detect functional modules from GRN, PPI, and other biological networks. For gene module detection approaches including CoReg and SigMod etc [13,14], network topologies were required to be known. This is a strong prerequisite that is hard to be satisfied in real applications. In this context, data-driven module identification methods become crucial to identify modules directly from transcriptomic data. Decomposition-based and clustering-based methods have attracted increasing attention due to the ability to detect gene modules from transcriptomic data [15]. Although gene modules can be detected, current researches focus on finding the biological explanations and relevant pathways to some extent [16]. From the viewpoint of network inference (NI), gene modules provide a constraint to guide the inference, leading to GRN with community structures.
Motivated by these topological characters, this study proposes a ModularBoost method to integrate decomposition-based module identification and boosting-based inference algorithm. Using ICA-FDR, ModularBoost assigned genes to regulatory modules according to their expression data. Based on the detected gene modules, casual relations within gene modules were inferred by GRNBoost2 that is a top-ranking inference algorithm, while the regulatory relationships between modules were detected by linear sparse regression. Then ModularBoost normalized the scores from subnetworks to obtain the final network.
Among decomposition-based methods, ICA-FDR, ICA-zscore, and PCA have been implemented and compared with commonly-used clustering expression datasets [9].
And we selected the ICA-FDR algorithm that demonstrates the highest accuracy in module identification. Besides, the performance of ModularBoost method was evaluated by single-cell expression and time-series data. The simulated scRNA-seq datasets are generated by BEELINE [17] and PIDC [8]. The three experimental scRNA-seq data sets are from the SCODE project [18]. And the time-series data sets are the S. aureus, E. coli and Yeast expression data from the Dream5 challenge [19]. As for the gold standard of gene modules, based on partly known gene regulatory edges, functional modules were extracted by graph theory or community detection methods. This work not only discusses the applicability and accuracy of ModuleBoost in network inference (NI), but also further analyzes the relation between data-driven module detection and NI.

Modular inference of simulated scRNA-seq datasets
Curated networks were extracted from Beeline project, which focused on GRN inference using single cell expression data. Different from traditional microarray datasets, single-cell data contains information about cell-cell variability and can be used to investigate behavior patterns of cell populations [20]. However, strong stochastic single cell expression data often lead to low accuracy in network inference. Even several algorithms including SCODE and PIDC have been developed [8,18], the accuracy levels of GRN inference using single cell data were quietly low. Another bottleneck brought by single cell data was increasing computational burden, due to plenty of cell samples [21]. Cellcell variability information in single cell expression data play a negative role in inferring TF-gene relations, leading to low accuracy of inference in many cases.
In this case, the proposed ModularBoost approach aims to improve the accuracy in inference by introducing topological constraints. In the proposed ModularBoost method, ICA-FDR based decomposition was used as inner part to detect functional modules directly from curated datasets. Competing methods include ICA-FDR2, ICAzscore, PCA-based decomposition and K-means clustering methods. Performance index F rr using four decomposition methods and K-means clustering were described in this research. Curated datasets from the GSD network had three experimental conditions, depending on the dropout rates. PIDC E. coli-S denotes single-cell data with 700 cell samples, while both E. coli-LL and E. coli-LH represent datasets with 2000 samples. In addition, the E. coli-LL and E. coli-LH groups correspond to single-cell data with low and high dropout rates respectively. Evaluation metrics of the curated GSD network and PIDC E. coli network were compared in Table 1. F rr values were positively related with the accuracy level in module detection. Highest values in each column were displayed in bold. ICA-FDR and ICA-FDR2 required the number of gene modules n_comp and the threshold of Q-value q_cutoff according to the number of genes in GRN. For the module identification evaluation of curated networks with 19 genes, we set consistently n_comp = 2 for five module identification methods. Meanwhile, PIDC E. coli network with 100 genes set n_comp = 4. And the q_cutoff of ICA-FDR, ICA-FDR2, and PCA was 10 −3 . However, the q_cutoff of ICA-zscore decomposition was different from ICA-FDR due to the difference in statistical principles: q_cutoff zscore = 1.5. To eliminate the randomness of the heuristic algorithm, we repeated each gene network 10 times and took the average of F rr .
A pattern that could be found in Table 1 was that ICA-FDR outperformed three decomposition methods and k-means clustering, showing high accuracy in module detection. Furthermore, F rr indexes obtained by ICA-FDR were slightly higher than that from ICA-FDR2. A possible explanation was that ICA-FDR2 algorithm had taken the direction of regulatory edges into consideration, thus influencing the accuracy of gene module identification. Meanwhile, dropout rate, which was regarded as noise, had shown negative impacts on module detection. F rr indexes from GSD-70 were lower than that from GSD-1 and GSD-50, using five module detection methods.
Instead of reconstructing a network as whole, the ModularBoost approach accomplished the GRN inference task based on the identified modules, which were densely connected genes and TFs. Furthermore, intra-modular and inter-modular interactions between genes were inferred in two stages of inference.
It can be observed from Table 2 that the proposed ModularBoost approach obtained highest AUROC indexes among selected inference algorithms including ridge and GRNBoost2 methods. Highest AUROC and AUPR in each column were displayed in bold. This phenomenon validates the effectiveness of ModularBoost as well as the integration of two inference strategies. The computational time of ModularBoost in  indexes of GRNBoost2 and the proposed ModularBoost indicated that the Modular-Boost approach was able to accomplish the inference task using reduced computational resource, without obvious loss of accuracy. From this perspective, the ModularBoost approach can be used as an efficient solution to infer TF-gene relations using single cell expression data. AUROC and AUPR indexes of GSD-70 group were lower than two other groups, suggesting negative impact of dropout rate. Especially when the dropout rate increase from 50 to 70, such impacts were obvious, both in module detection and network inference. Meanwhile, this situation was also confirmed at PIDC datasets. AUROC and AUPR indexes tend to decrease when the single-cell data sample sizes and dropout rates increase. In this study, dropout rate played the role of noise in single cell expression data and has a negative influence in GRN inference accuracy.
To show the superior network prediction of ModuleBoost2, we compared Modular-Boost with the ridge regression based inference method from the perspective of degree distribution [22]. The degree distribution p k of GRNs follows approximately the powerlaw [23], and it is given by: where k denotes degree, α represents scaling coefficient and c is a constant.
The power-law distributions of PIDC E. coli-S and E. coli-LL were shown in Fig. 1. For simulated single cell expression datasets, Ecoli-S and Ecoli-LL share the same gold standard networks. These two cases may lead to different inferred structures, due to various complexity levels. When the sample size of expression data increases, the inference indexes tend to decline while the mismatch degree between inferred networks and gold standards has become more significant.
Under this circumstance, the proposed ModularBoost method can obtain GRNs that are higher topologically similar with gold standards, compared with ridge regression (1) log p k = −α log k + c Fig. 1 The power-law distributions of PIDC E. coli. Comparison of the power-law distribution and log 10 -log 10 degree for gold standard network, ModularBoost-inferred network, and Ridge-inferred network of E. coli inference algorithm. In other words, ModularBoost is able to improve the interpretability of inferred networks to some degree.

Modular inference of experimental scRNA-seq datasets
In this section, experimental single cell RNA sequencing(scRNA-seq) datasets were used as the major information source in GRN inference. In biomedical and genomic research, scRNA-seq datasets has played a crucial role in exploring dynamics of cell population and differentiation. Three scRNA-seq datasets and relevant gold standard networks were provided by the SCODE project. In Table 3, PrE (primitive endoderm cells), MEF (mouse embryonic fibroblast cells), DE (definitive endoderm cells) denote various cell types from mouse and human respectively. Regulatory edges in gold standard covered a subset of 100 genes, ranging from 40% to 60% . Based on known regulatory edges, gene modules were detected using graph theory-based methods, i.e. community detections methods. Those modules were intersected, corresponding to overlapping phenomenon.
The basic idea of module standards is to explore strongly interconnected components where nodes are mutually connected. In module 2 of PrE network, ETS1, EGR1 and SMAD7 formed a connected component. ETS1 and EGR1 are hub nodes due to high out-degrees. In known regulatory relations, the SMAD7 gene owns 12 out-degrees and 16 in-degrees, indicating that this gene play a bridge node in the PrE network.
For the MEF network, those TFs such as KLF4 had high out-degrees and were believed to have high topological importance in information flows. According to known modules, overlapping existed, allowing part of genes belong to two or more functional modules.
In Fig. 2, gene modules were colored and visualized by the Gephi software to illustrate the topological positions. In the directed graphs, the physical sizes of nodes were positively related with their topological importance. Meanwhile, functional and topological neighborhoods were related but different. It can be observed from Fig. 2 that nodes in the same gene modules tend to densely connected and had topological relations. In this study, gene modules were determined by data-driven ICA-FDR method, leaving a part of genes uncolored. From Table 4, high F rr indexes with bold obtained by ICA-based decomposition suggested their advantages in detection accuracy, compared with PCA-decomposition and k-means clustering. And F rr indexes of ICA-FDR2 were lower than that of ICA-FDR. This pattern was consistent with that in microarray datasets from curated subsection.
Based on the detected modules, the proposed ModularBoost approach accomplished the network inference using two kinds of regression: ridge regression and ensemble inference algorithm. This study chose cutting-edge ensemble inference named GRN-Boost2 to determine intra-modular casual interactions. Table 5 shows AUROC and AUPR indexes obtained by the ModularBoost method and three candidate inference algorithms. And highest AUROC and AUPR in each column were displayed in bold.
SCODE PrE and MEF datasets are single-cell expression data measured from mouse cells, while SCODE DE dataset denotes expression data of human cell   Figure 3 depicts both the PR and the ROC curves of four inference algorithms, during reconstruction of three SCODE networks. For PrE and MEF networks, the proposed ModularBoost approach outperforms three candidate inference methods in terms of AUPR and AUROC values. Moreover, the shape of the PR curve shows that the predicted lists inferred by ModularBoost method contains more true edges than other algorithms. For the DE dataset, the accuracy indexes of ModularBoost were lower than that of existing GRNBoost2. One possible explanation is the complex physiological process of gene regulation in human cell populations.

Modular inference of DREAM5 networks
As a typical kind of time-series expression data, microarray datasets have long been used as information source in GRN inference. DREAM5 provided four networks as benchmarks for network inference research [19]. Considering the huge number of nodes and regulatory edges, the accuracy indexes obtained by existing methods were relatively low and the computational time was considerable. Among four DREAM5 networks, S. aureus, E. coli and Yeast networks had been used as benchmarks in this sections.
In DREAM5 challenge, 'true' regulatory relations were used as so-called gold standards to judge the accuracy degree of inference algorithms. Specially, the modular gold standard of DREAM5 was from Saelens [9]. As for the labels of gene modules, Sisima and Macisaac et. al provided two groups of incomplete module sets to quantitatively evaluate performance of candidate module identification methods [24]. Based on ICA-FDR, the ModularBoost approach firstly detected gene modules directly from DREAM5 datasets. The comparison of ICA-FDR with the other methods were shown in Table 6.
Minimal, Strict, and Interconnected denote three definitions of gene modules, according to connectivity patterns. The Minimal modules can be regarded as the overlapping sets of genes that shared at least one TF. And Strict modules correspond to the gene sets that are regulated by the same regulators. For Interconnected modules, genes in the same module are strongly interconnect. For the same gold  To reduce the stochastic impacts, decomposition-based module detection algorithms detected modules ten times for each parameter combination, and computed the average F rr indexes. In Table 6, the values F rr were obtained by taking the maximum value among 130 combinations. It can be observed from Table 6 that decomposition-based approaches identify gene modules with enhanced accuracy than K-means method, under three module definitions. Among three module definitions, F rr indexes of the Minimal and Strict groups were higher than that of Interconnected criteria. This phenomenon indicated that Minimal co-regulation can be a suitable definition to evaluate module detection. To demonstrate the decomposition of ICA-FDR, regulatory modules were colored in DREAM5 E. coli gene networks shown in Fig. 4.
In Fig. 4, nodes in the graph represent TFs or genes and edges denote regulations. Those nodes colored with the same color were assigned to the same functional module, according to ICA-FDR decomposition in the ModularBoost method. With annotations, gene modules are crucial to deepen the understanding about regulation mechanisms with a given network. From this perspective, the ModularBoost method offered a datadriven solution to unveil functional modules directly from expression data, even without accurate annotations. Different from raw module identification, the ModularBoost method detects gene at the first stage, then infer inter-modular and intra-modular regulations at the second stage. Directed regulatory edges between TF-gene pairs will be necessary to further analyze information flow and potential biophysical explanations.
In systems biology, densely connected nodes usually are related with specific cellular functions or diseases [25]. Under this circumstance, the ModularBoost method aims to provide reconstructed GRN topologies with clear community structures. This is an important character owned by the ModularBoost method. Other candidate inference algorithm including ridge regression based TIGRESS and ensemble-based GRNBoost2 majorly focus the whole network structure [2].
From the gene module detection outcomes of two DREAM5 networks, the ICA-FDR part in the ModularBoost method accomplished the first stage of task efficiently. In the subsequent inference of subnetworks, the AUROC and AUPR indexes were compared in Table 7. And the highest values in each column were marked in bold.
It can be observed from the Table 7 that the ModularBoost approach obtained higher AUROC and AUPR indexes than standard ridge and GRNBoost2 in two DREAM5 networks. The proposed ModularBoost approach integrates ensemblebased GRNBoost2 and ridge inference methods by introducing gene modules as topological constraints. The core of TIGRESS algorithm was regularized regression, leading to higher AUPR indexes than conventional linear regression methods. Network inference tasks were performed on a computer with 8 GB RAM, Intel i7-9750H 2.60 GHz. GRNBoost2 algorithm took 1 h 57 min to complete the inference of the DREAM5 S. aureus network, while ModularBoost only needed 7 min for the same Obviously, compared with GRNBoost2, Modular-Boost significantly improved the speed of network inference, and did not cause a significant decrease in the accuracy of network inference. One advantage of introducing topological constraints is to improve inference accuracy. Another benefit is to reduce computational burden, especially for GRN with thousands of regulatory edges. For three types of GRNs, degree distributions were fitted by the power-law distribution, as shown in Fig. 5. Compared with ridge-based inference algorithm, GRNs obtained by the ModularBoost approach showed closer similarity with the gold standard networks. These results show that ModularBoost-inferred networks are more similar with the 'true' network topology. The linear fitting parameters are shown in Table 8.

Discussion and conclusions
This work aims to develop a seamless framework to perform GRN inference based on module identification. In order to detect modules from expression data, ICA-based decomposition algorithms have been applied in the proposed ModularBoost algorithm. Among several candidate decomposition methods, ICA-FDR had shown advantages in detection accuracy. In this case, ModularBoost employs the ICA-FDR algorithm to detect gene module from transcriptomic data. In the subsequent network inference part,  intra-modular and inter-modular interactions were determined by ensemble-based and sparse regression-based algorithms respectively. The idea behind the ModularBoost is to introduce topological constraints to conventional network inference. Such topological constraints consider inherence community structures in GRN and other biological networks and can be introduced by data-driven approaches. The proposed ModularBoost method can be also regarded as a low-weight solution to deal with time-series and single cell expression data. Based on experimental outcomes about curated and scRNA-seq datasets, the ModularBoost method is able to improve inference accuracy as well as to reduce computational time. It can be understood that decomposition of network inference can reduce the computational burden, since an original task was transferred to multiple sub-tasks. The purpose is to obtain topologies with better biophysical or biomedical explanations. To evaluate the effectiveness of identified regulatory modules, relevant annotations called Module gold will be necessary. This study applies module labels and compute F rr index to quantitatively evaluate the performance of data-driven gene module identification.

ICA-FDR based gene module identification
Gene module correspond to the group of genes with similar expressive patterns and biological functions. Researches about gene module help researchers better understand disease modules and gene-disease relations.
In general, the intent of the independent component analysis (ICA) is to find the hidden 'independent component' that refers to the gene module in this research [26]. When applied in the field of gene module detection, ICA usually splits express data matrix X into two matrices: a source matrix S and a mixing matrix A, which means X = AS is shown in Fig. 6.   Fig. 6 An overview of the ICA-FDR decomposition method. FastICA splits the expression X into a mixing matrix A and a source matrix S. Contained in the rows of S, the components reflect hidden biological processes influencing gene expression. The level of genetic influence on components are reflected by the heat color map, from dark (minimum) to red (maximum). FDR estimation determines which genes are assigned to each module For single cell expression matrix X, this study assumes that the columns and rows of X correspond to genes and cell samples respectively. The expression value x ij of gene j at sample i is shown as where a ik reflects the contribution of cell sample i in component k and s kj denotes the contribution of component k on gene j [16]. ICA-decomposition algorithms are different in adopting optimization standards for component independence. In this study, we chose an efficient ICA algorithm-FastICA. The goal of FastICA is to find an orthogonal rotation of prewhitened data through a fixed-point iteration frame [27]. FastICA iteratively maximizes non-Gaussian of the rotated components until convergence, and k independent signals that corresponded k gene modules are found in this process.
FastICA algorithm tries to find gene components in the source matrix that own non-Gaussian characteristics. Each modular signal in the source matrix generally obeys a heavy-tailed normal distribution. Under this circumstance, those genes at the tails have a significant contribution to those components, while the majority of genes in peak have weak impact. In the next step, we applied the false discovery rate (FDR) estimation to assign genes to various functional modules.
The basic procedure of the ICA-FDR algorithm is shown as Algorithm 1.
Input parameters include the number of gene modules n_comps , the maximum iterations max_iter = 20,000 in this research, and the threshold of Q-value q_cutoff. The process of whitening was defined as Eq.3.
(2) x ij = k a ik s kj where E denotes the orthogonal matrix of eigenvectors of E{xx T } and D is the diagonal matrix of its eigenvalues. The first goal of ICA-FDR is maximizing non-Gaussianity, and non-Gaussianity is measured by the approximation of negentropy J G (w) given in the Eq.4.
The entropy H (·) can be defined as Eq.5 in the ICA-FDR: where υ is a Gaussian variable of unit variance and zero mean, and G(·) is the non-quadratic function that is used to improve the robustness of estimation, such as: The g(·) that in Algorithm. 1 is the derivatives of the function in Eq.6: FDR represents the number of false discoveries in an experiment divided by the total number of discoveries, and the discovery is statistical test that provides an acceptance threshold. Using hypotheses tests, this study first evaluated the statistical significance of genes in each modular signal. This yielded a p value for genes in each module, and a ranked list in descending order. Correction for multiple testing was performed by calculating a "Q-value" from the p values and estimating FDR values [28]. The formula for calculating a Q-value is defined by Eq.8 where p ik is the i th smallest p-alue out of n gene p values for the k modular signal.
The fundamental principle of post-process is assigning the genes with lower Q-value than q_cutoff to a module, and the process was shown in Fig. 6. According to the number of genes and modules, the value of q_cutoff was selected from the set {10 −1 , 10 −2 , . . . , 10 −13 }.
The ICA-based decomposition also has several derivatives, including ICA-FDR, ICA-FDR2 and ICA-zscore. ICA-FDR2 is similar to ICA-FDR but divides each component into two modules according to the signs of gene regulations, while ICA-zscore replaces FDR indexes with z-scores to detect module from source signals. As a tool to reduce dimension, the principal component analysis (PCA) can be used to visualize the similarities among the biological samples [29]. We tested the performance of these methods in the experiments to compare with the ICA-FDR. (3)

Decomposition-based GRN hybrid inference
Different NI algorithms have their unique characteristics, and the combination of multiple algorithms can provide a possible strategy to obtain networks with sparse and dense interactions [11]. The ICA-FDR algorithm has divided genes into different modules. For GRNs, functionally correlated genes or TFs form gene modules in which the intramodular connections are tenser than of inter-modular relations. In this section, based on the detected gene modules, we proposed an algorithm that uses GRNBoost2 to infer intra-modular interactions and ridge regression to determine inter-modular regulations, which conforms to community structures. Before calculating the inter-modular connections, the proposed ModularBoost approach removes those gene pairs that are in the same module to reduce computational burden. The workflow of ModuleBoost is shown in Fig. 7.

Infer intra-modular interactions using GRNBoost2
Based on a similar concept as GENIE3, GRNBoost2 infers regulators for every target gene purely from the gene expression matrix [4]. The conventional GRNBoost2 inference algorithm was based on Gradient Boosting Machine (GBM) regression that focuses on efficiency and had been a top-ranking algorithm in GRN inference. One character of GRNBoost2 is global estimation of decision tree number with a self-tuning mechanism. The set of n_comps modules that decomposed by the ICA-FDR is defined as where ε s is a random noise with mean of zero. The function f j (·) exploits the expression of direct regulators of gene j, and it is trained from the learning sample LS k j = {(x −j s , x j s ), s = 1, . . . , N } . Meanwhile, the feature selection computes the confidence level w ij (i = j) for the regulatory edge from gene i to gene j. For the tree regression-based GRN inference method, the function f j (·) is determined by an ensemble of decision trees. At each tree node φ , the total reduction of the variance for the output variable is computed as: where S is the set of samples at the tree node φ , S l and S r denote the left and right subtrees, Var(·) represents the variance during splitting.
For GRNBoost2, each decision tree is trained by a random subset of samples that cover approximately 90% of raw data, and 10% others are called out-of-bag samples that improve the loss function [5]. Trees stop growing up when the loss function meets the early-stopping rule that the average loss improvement drops to 0. Therefore, gene modules correspond to subnetworks with multiple topological importance levels. Gathering regulatory associations and sorting in descending order, the intramodular inference finalizes the output as shown in Fig. 7c.

Infer inter-modular connections using sparse regression
The regulatory associations between regulators and target genes in the same module were inferred by ensemble-based GRNBoost2 algorithm, while the relationships controlled by regulators that outside the module are still unknown. The combination of sparse regression algorithms with biologically meaningful constraints may provide a promising solution to enhance accuracy of GRN inference. Based on this hypothesis, we recovered regulatory relations among genes in different modules by adopting regularized linear regression (ridge regression) and used a pre-treatment to reduce computational complexity. The visualization of inter-modular inference is available in Fig. 6d.
In the following, given n genes, m samples, and the expression matrix E m×n , the linear regression problem can be defined as: where E t is a expression value vector of target gene g t ∈ G = {g 1 , g 2 , . . . , g n }, t = 1, 2, . . . , n . The potential regulators of g t that not in the same module with g t are denoted by G −t = {g r 1 , g r 2 , . . . , g r t } , and E r 1 , E r 2 , . . . , E r t are the expression of G −t . Detailed representation is: where α r t ,t is a regression coefficient describing the influence of regulator g r t on the target gene g t , and β is a vector of noise in regression. For obtaining the solution vector α of each target gene, the objective function of ridge regression is defined as: where the quadratic penalty term α 2 2 makes the loss function convex and leads to a unique minimum.
After the inference of subnetworks and inter-modular connections, a total of n × (n − 1) causal relations and importance scores were computed. Then the regulatory edge scores of intra-modular and inter-modular relations were standardized using maximum-minimum normalization: According to the normalized important scores, the regulatory associations are combined and ranked in descending order. Therefore, we can calibrate those associations with the gold standard, as shown in Fig. 7e, and use AUROC index to examine whether the real regulatory relationships enrich at the top of the ranking.

Evaluation metrics
Evaluation metrics are used to quantitatively evaluate the performance of data-driven module identification methods. As indexes F measure, Rand index, and the normalized mutual information face the problem of detecting overlapping modules [9]. This study selects Recovery, Relevance, and F rr to evaluate the accuracy of ICA-FDR-based module identification methods for their high accuracy and efficiency in handling (11) E t = α r 1 ,t E r 1 + α r 2 ,t E r 2 + . . . + α r t ,t E r t + β t overlap. The three indexes are in the range of [0, 1], and reach the value 1 only when the observed module and the known module are exactly equal. M and M are the set of known and observed modules. In this first stage, a distance (similarity) matrix is computed by the Jaccard index between two modules sets.
The Recovery is used to match known modules with observed modules, and the Relevance index reflect the extent to which observed modules match with known modules.
Afterwards, the similarity score F rr is summarized by mapping the known modules to detected ones and vice versa. A score quantifying the false negatives (Relevance) is calculated by averaging the similarities of known modules and picking out the best representatives in the detected modules. Another score that is related to false positives (Recovery) is computed in a similar style. An asymmetric method for module similarity is given by averging Relevance and Recovery as shown in Eq. 17 This study uses the values of area under receive operator curve (AUROC) and area under precision-recall curves (AUPR) to reveal accuracy levels of the ModularBoost network inference algorithm. ModularBoost outputs a descending list of putative regulatory interactions. Picking only the top K edges in this list, we compared them with gold standards to assess the number of false positives (FP), true positives (TP), false negatives (FN), and true negatives (TN). ROC curve shows the trade-off between false positive rate (FPR) and true positive rate (TPR) across different K thresholds, while PR describes the relationship between recall and precision. FPR, TPR, recall, and precision are expressed as: Finally, the AUROC and the AUPR are respectively assessed by computing the area under ROC and PR curves.