Skip to main content
  • Methodology Article
  • Open access
  • Published:

Enhanced construction of gene regulatory networks using hub gene information



Gene regulatory networks reveal how genes work together to carry out their biological functions. Reconstructions of gene networks from gene expression data greatly facilitate our understanding of underlying biological mechanisms and provide new opportunities for biomarker and drug discoveries. In gene networks, a gene that has many interactions with other genes is called a hub gene, which usually plays an essential role in gene regulation and biological processes. In this study, we developed a method for reconstructing gene networks using a partial correlation-based approach that incorporates prior information about hub genes. Through simulation studies and two real-data examples, we compare the performance in estimating the network structures between the existing methods and the proposed method.


In simulation studies, we show that the proposed strategy reduces errors in estimating network structures compared to the existing methods. When applied to Escherichia coli, the regulation network constructed by our proposed ESPACE method is more consistent with current biological knowledge than the SPACE method. Furthermore, application of the proposed method in lung cancer has identified hub genes whose mRNA expression predicts cancer progress and patient response to treatment.


We have demonstrated that incorporating hub gene information in estimating network structures can improve the performance of the existing methods.


A gene regulatory network (GRN) describes interactions and regulatory relationships among genes. It provides a systematic understanding of the molecular mechanisms underlying biological processes by revealing how genes work together to form modules that carry out cell functions [14]. In addition, the visualization of genetic dependencies through the GRN facilitates the systematic interpretation and comprehension of analysis results from genome-wide studies using high-throughput data. GRNs have proven valuable in a variety of contexts, including identifying druggable targets [5], detecting driver genes in diseases [6], and even optimizing prognostic and predictive signatures [7].

Gene expression microarrays monitor the transcription activities of thousands of genes simultaneously, which provides a great opportunity to study the “relationships” among genes on a large scale. However, challenges lie in constructing large-scale GRNs from gene expression microarray data due to the small sample sizes of microarray studies and the extremely large solution space. Computational techniques and algorithms have been proposed to reconstruct GRNs from gene expression data, including probability-based approaches such as Bayesian networks [812], correlation-based approaches [13], likelihood-based approaches [1416], partial-correlation-based approaches [17, 18], and information-theory-based approaches [1922]. The existing methods are briefly reviewed in the Methods Section. Readers can also refer to Bansal et al. [23] and Allan et al. [24] for a more detailed review of network construction methods.

The sparse partial correlation estimation (SPACE) method, proposed by Peng et al. [18], considers a penalized regression approach to estimate edges in the GRN, which utilizes the sparse feature of the GRN. Comparative studies have shown that the SPACE method performs well in estimating sparse networks with high accuracy [24]. Peng et al. [18] also showed that the method was able to identify functional relevant molecular networks. In addition, recent studies of network analysis have revealed its advantage in detecting genes or modules associated with phenotypes [2527].

In gene networks, genes that have many interactions with other genes are defined as hub gens. Because of these interactions, hub genes usually play an important role in a biological system. For example, transcription factor (TF), a protein that binds to specific DNA sequences, can regulate a given set of genes. In humans, approximately 10% of genes in the genome code for around 2600 TFs [28]. The combinatorial human TFs account for most of the regulation activities in the human genome, especially during the development stage. As a result, the genes that code TFs, called TF-encoding genes, are usually regarded as hub genes. Furthermore, in cancer research, cancer genes (oncogenes or tumor suppressor genes) take part in tumor genesis and are likely to be hub genes in the genetic networks of tumors [29, 30]. Through decades of biological studies, knowledge on important genes (such as TFs or cancer genes) has been accumulated. Our hypothesis is that incorporating prior knowledge about hub genes can improve accuracy in estimating the gene network structure. It is worth noting that there is a reweighted 1 regularization method [31] that repeatedly estimates the structures and modifies the weights of the penalties by using the information on degrees from the previous estimation to encourage the appearance of the hubs. This method does not use the prior information obtainable from the other resources while our method uses additional information not contained in the observed dataset.

To explicitly account for the information on hub genes, we propose an extension of the SPACE method, which introduces an additional tuning parameter to open up the possibility of reducing penalization and increasing the likelihood of selecting the edges connected to such genes. We numerically show that the proposed method reduces errors in estimating network structures. Although we focus on extending the SPACE method in this paper, the idea can also be applied to penalized likelihood methods as well as to other penalized regression methods. Note that there is no rigorous definition of a hub in the context of a network; the definition of a hub varies depending on the sparsity of the network. For sparse protein networks, a hub is defined in [32] as a protein whose degree lies over the 0.95 quantile of the degree distribution or in [33] and [7] as a protein whose degree is greater than 7. In this paper, we conservatively define a hub as a node whose degree is both greater than 7 and above the 0.95 quantile of the degree distribution, because most nodes in sparse networks have relatively small degrees between 0 and 3.

In this study, we briefly introduce seven existing methods, including the SPACE and the graphical lasso, and propose the extended SPACE (ESPACE) method to incorporate the biological knowledge about important genes, i.e. network hubs. Moreover, it is worth noting that the ESPACE only incorporates the previously known biological information not contained in the observed dataset compared to the other existing methods. Through simulation studies, we show that the proposed approach reduces error in estimating the network structures compared to the seven other existing methods that we reviewed in the “Methods” section. Finally, we demonstrate the improvement of the ESPACE method compared to the SPACE method with two real-data examples.


Review of existing methods

Here, we briefly review the existing methods; the GeneNet [34], the NS [17], the GLASSO [15], the GLASSO-SF [31], the PCACMI [21], the CMI2NI [22], and the SPACE [18]. Let \(X_{i}^{k}\) be the expression level of the ith gene of the kth array for i=1,2,…,p and k=1,2,…,n. Let \(\mathbf {X}_{i} = \left (X_{i}^{1}, X_{i}^{2}, \ldots, X_{i}^{n}\right)^{T}\) so that observed gene expression data can be denoted by an n×p matrix X=(X 1;X 2;…;X p ) whose rows and columns denote arrays and genes, respectively. Suppose row vectors \(\mathbf {X}^{k}=\left (X_{1}^{k},X_{2}^{k},\ldots,X_{p}^{k}\right)\) for k=1,2,…,n are independently and identically distributed random vectors from the multivariate normal distribution with mean 0 and covariance matrix Σ. We assume that Σ is positive definite, and let ΩΣ −1=(ω ij )1≤i,jp be the inverse of the covariance matrix Σ, which is referred to as a concentration matrix or a precision matrix.


Schäfer and Strimmer [34] propose the linear shrinkage estimator for a covariance matrix and the Gaussian graphical model (GGM) selection based on the partial correlation obtained from their shrinkage estimator. With multiple testing procedure using the local false discovery rate [35], the GGM selection controls the false discovery rate under a pre-determined level α. Since Schäfer and Strimmer [34] provide their GGM selection procedure in the R package GeneNet, we denote their GGM selection procedure as GeneNet in this paper. To be specific, one of the most commonly used linear shrinkage estimators S for the covariance matrix Σ is

$$ S^{*} = \lambda^{*} T + (1-\lambda^{*}) S, $$

where S=(s ij )1≤i,jp is the sample covariance matrix, T=diag(s 11,s 22,…,s pp ) is the shrinkage target matrix, and \(\lambda ^{*} = {\sum \nolimits }_{i\neq j} \widehat {\text {Var}} (s_{ij}) / \left ({\sum \nolimits }_{i\neq j} s_{ij}^{2}\right)\) is the optimal shrinkage intensity. With this estimator S , the matrix of the partial correlations \(P = (\hat {\rho }^{ij})_{1 \le i,j \le p}\) is defined as \(\hat {\rho }^{ij} = -\hat {\omega }_{ij} / \sqrt {\hat {\omega }_{ii} \hat {\omega }_{jj}}\), where \(\hat {\Omega } = (\hat {\omega }_{ij})_{1 \le i,j \le p} = \left (S^{*}\right)^{-1}\).

To identify the significant edges, Schäfer and Strimmer [34] suppose the distribution of the partial correlations as the mixture

$$f(\rho) = \eta_{0} f_{0} (\rho,\nu) + (1-\eta_{0}) f_{1}(\rho), $$

where f 0 is the null distribution, f 1 is the alternative distribution corresponding to the true edges, and η 0 is the unknown mixing parameter. Using the algorithm in [35], GeneNet identifies significant edges that have the local false positive rate

$$\text{fdr}(\rho) = \frac{\hat{\eta}_{0} f_{0} (\rho, \hat{\nu})}{\hat{f}(\rho)} $$

smaller than the pre-determined level α, where f 0(ρ,ν)=|ρ|Be(ρ 2;0.5,(ν−1)/2), Be(x;a,b) is the density of the Beta distribution and ν is the reciprocal variance of the null ρ.

Neighborhood selection (NS)

Meinshausen and Bühlmann [17] propose the neighborhood selection (NS) method, which separately solves the lasso [36] problem and identifies edges with nonzero estimated regression coefficients for each node. Meinshausen and Bühlmann [17] prove that the NS method is asymptotically consistent in identifying the neighborhood of each node when the neighborhood stability condition is satisfied. Note that the neighborhood stability condition is related to the irrepresentable condition in linear model literature [37].

To be specific, for each node iV={1,2,…,p}, NS solves the following lasso problem

$$\hat{\beta}^{i,\lambda} = \operatornamewithlimits{arg\min}_{\beta\in \mathbb{R}^{p}: \beta_{i} = 0} ~\frac{1}{2} \| \mathbf{X}_{i} - \mathbf{X}\beta\|_{2}^{2} + \lambda \|\beta\|_{1}, $$

where \(\|\mathbf {x}\|_{2}^{2} = {\sum \nolimits }_{i=1}^{p} x_{i}^{2}\) and \(\|\mathbf {x}\|_{1} = {\sum \nolimits }_{i=1}^{p} |x_{i}|\) for \(\mathbf {x} \in \mathbb {R}^{p}\). With the estimate \(\hat {\beta }^{i,\lambda }\), NS identifies the neighborhood of the node i as \(N_{i}(\lambda) = \{ k~|~ \hat {\beta }_{k}^{i,\lambda } \neq 0\}\), which defines an edge set \(E_{i}^{\lambda } = \{(i,j)~|~ j \in N_{i}(\lambda)\}\). Since NS separately solves p lasso problems, contradictory edges may occur when we define the total edge set \(E^{\lambda } = \cup _{i=1}^{p} E_{i}^{\lambda }\), i.e., \(\hat {\beta }_{k}^{i,\lambda } \neq 0\) and \(\hat {\beta }_{i}^{k,\lambda } = 0\). To avoid these contradictory edges, NS suggests two types of edge sets E λ, and E λ, defined as follows:

$$\begin{array}{*{20}l} E^{\lambda,\wedge} = \left\{(i,j)~|~ i \in N_{j}(\lambda) ~\text{and}~ j \in N_{i}(\lambda)\right\},\\ E^{\lambda,\vee} = \left\{(i,j)~|~ i \in N_{j}(\lambda) ~\text{or}~ j \in N_{i}(\lambda)\right\}. \end{array} $$

Meinshausen and Bühlmann [17] mentioned these two edge sets have only small differences in their experience and the differences vanish asymptotically. Meinshausen and Bühlmann [17] also propose the choice of the tuning parameter λ i (α) for the ith node

$$\lambda_{i}(\alpha) = \|\mathbf{X}_{i}\|_{2} \tilde{\Phi}^{-1}\left(\frac{\alpha}{2p^{2}}\right), $$

where \(\tilde {\Phi } = 1 - \Phi \) and Φ is the distribution function of the standard normal distribution. With this choice of λ i (α) for i=1,2,…,p, the probability of falsely identifying edges in the network is bounded by the level α. Note that we estimate the edge set with E λ, and solve the lasso problems using the R package CDLasso proposed by [38] in this paper.

Graphical lasso (GLASSO)

Friedman et al. [15] propose the graphical lasso method that estimates a sparse inverse covariance matrix Ω by maximizing the 1 penalized log-likelihood

$$ l(\Omega) = \log |\Omega| - \text{tr}(S\Omega) - \lambda \|\Omega\|_{1}, $$

where S is the sample covariance matrix, tr(A) is the trace of A and A1 is the 1 norm of A for \(A \in \mathbb {R}^{p \times p}\).

To be specific, let W be the estimate of the covariance matrix Σ and consider partitioning W and S

$$ W = \left(\begin{array}{cc} W_{11} & w_{12}\\ w_{12}^{T} & w_{22} \end{array}\right),~ S = \left(\begin{array}{cc} S_{11} & s_{12}\\ s_{12}^{T} & s_{22} \end{array}\right), ~ \Omega = \left(\begin{array}{cc} \Omega_{11} & \omega_{12}\\ \omega_{12}^{T} & \omega_{22} \end{array}\right) $$

Motivated by [39], Friedman et al. [15] show that the solution \(\hat {\Omega }\) of (1) is equivalent to the inverse of W whose partitioned entity w 12 satisfies w 12=W 11 β , where β is the solution of the lasso problem

$$ \min_{\beta} ~ \frac{1}{2} \left\| W^{1/2}_{11} \beta - W_{11}^{-1/2} s_{12} \right\|_{2}^{2} + \lambda \|\beta\|_{1}. $$

Based on the above property, the graphical lasso sets the diagonal elements w ii =s ii +ρ and obtains the off-diagonal elements of W by repeatedly applying the following two steps:

  1. 1.

    Permuting the columns and rows to locate the target elements at the position of w 12.

  2. 2.

    Finding the solution w 12=W 11 β by solving the lasso problem (2).

until convergence occurs. After finding W, the estimate \(\hat {\Omega }\) is obtained from the relationships \(\omega _{12} = - \hat {\beta } \hat {\omega }_{22}\) and \(\hat {\omega }_{22} = 1/(w_{22} - w_{12}^{T}\hat {\beta })\), where \(\hat {\beta } = W_{11}^{-1} w_{12}\). This graphical lasso algorithm was proposed in [15] and had its computational efficiency improved in [16] and [40]. Witten et al. [16] provide the R package glasso version 1.7.

GLASSO with reweighted strategy for scale-free network (GLASSO-SF)

Liu and Ihler [31] propose the reweighted 1 regularization method to improve the performance of the estimation for the scale-free network whose degrees follows the power law distribution. Motivated by the fact that the existing methods work poorly for the scale-free networks, Liu and Ihler [31] consider changing the 1 norm penalty in the existing methods to the power law regularization

$$ p_{\lambda,\gamma}(\Omega) = \lambda \sum\limits_{i=1}^{p} \log \left(\|\omega_{-i}\|_{1} + \epsilon_{i} \right) + \gamma \sum\limits_{i=1}^{p} |\omega_{ii}|, $$

where λ and γ are nonnegative tuning parameters, ω i ={ω ij | ji}, \(\|\omega _{-i}\|_{1} = {\sum \nolimits }_{j\neq i} |\omega _{ij}|\), and ε i is a small positive number for i=1,2,…,p. Thus, Liu and Ihler [31] consider optimizing the following objective function

$$ f(\Omega; \mathbf{X}, \lambda, \gamma) = L(\mathbf{X},\Omega) + u_{L} \cdot p_{\lambda,\gamma}(\Omega), $$

where L(X,Ω) denotes the objective function of the existing method without its penalty terms, u L =1 if L is convex and u L =−1 if L is concave for Ω. Note that the choice of L is flexible. For instance, L(X,Ω) can be the log-likelihood function of Ω as in the graphical lasso or the squared loss function as in the NS and the SPACE. In this section, we suppose that L is concave for the purpose of notational simplicity.

To obtain the maximizer of f(Ω;X,λ,γ), Liu and Ihler [31] propose the iteratively reweighted 1 regularization procedure based on the minorization-maximization (MM) algorithm [41]. The reweighted procedure iteratively solves the following problem:

$$ \Omega^{(k+1)} = \operatornamewithlimits{arg\max}_{\Omega}~ L(\mathbf{X}, \Omega) - \sum\limits_{i=1}^{p} \sum\limits_{j\neq i} \eta_{ij}^{(k)} |\omega_{ij}| - \gamma \sum\limits_{i=1}^{p} |\omega_{ii}|, $$

where \(\Omega ^{(k)}= \left (\omega _{ij}^{(k)}\right)\) is the estimate at the kth iteration, \(\|\omega _{-i}^{(k)}\|_{1} = {\sum \nolimits }_{l \neq i} |\omega _{il}^{(k)}|\), and \(\eta _{ij}^{(k)} = \lambda \left (1/(\|\omega _{-i}^{(k)}\|_{1}\right. + \epsilon _{i}) +\left.1/(\|\omega _{-j}^{(k)}\|_{1} + \epsilon _{j})\right)\). In practice, [31] suggest ε i =1, γ=2λ/ε i , and the initial estimate Ω (0)=I p , where I p is the p-dimensional identity matrix. Note that this reweighted strategy facilitates to estimate the hub nodes by adjusting weights in the penalty term but weights are updated by solely using the observed dataset without previously known information from other literatures.

In this paper, we consider L(X,Ω)= log|Ω|−tr(S Ω), which is the same to the component in the objective function of the GLASSO. Thus, we call this procedure as the GLASSO with a reweighted strategy for the scale-free network (GLASSO-SF). As applied in [31], we stop the reweighting iteration after 5 iterations. The R package glasso version 1.7 is used to obtain the solution of (5) at each iteration with the penalty matrix \(E^{(k)} = (e_{ij}^{(k)})\), where \(e_{ij}^{(k)} = \eta _{ij}^{(k)}\) for ij and \(e_{ii}^{(k)} = 2\lambda \) for i=1,2,…,p.

Path consistency algorithm based on conditional mutual information (PCACMI)

Mutual information (MI) is a widely used measure of dependency between variables in information theory. MI even measures non-linear dependency between variables and can be considered as a generalization of the correlation. Several mutual information (MI) based methods have been developed such as ARACNE [20], CLR [42], and minet [43]. However, similar to the correlation, MI only measures pairwise dependency between two variables. Thus, it usually identifies many undirected interactions between variables. To resolve this difficulty, Zhang et al. [21] propose the information theoretic method for reconstruction of the gene regulatory networks based on the conditional mutual information (CMI).

To be specific, let H(X) and H(X,Y) be the entropy of a random variable X and the joint entropy of random variables X and Y, respectively. For two random variables X and Y, H(X) and H(X,Y) can be expressed as

$$ H(X) = E \left(-\log f_{X}(X)\right),~ H(X,Y) = E\left(-\log f_{XY}(X,Y)\right), $$

where f X (x) is the marginal probability density function (PDF) of X and f XY (x,y) is the joint PDF of X and Y. With these notations, MI is defined as

$$ \begin{array}{lll} I(X,Y) &=& E\left(- \text{log} \frac{f_{XY}(X,Y)}{f_{X}(X)f_{Y}(Y)}\right)\\ &=& H(X) + H(Y) - H(X,Y). \end{array} $$

It is known that MI measures dependency between two variables that contain both directed dependency and indirected dependency through other variables. While MI can not distinguish directed and indirected dependency, CMI can measure directed dependency between two variables by conditioning on other variables. CMI for X and Y given Z is defined as

$$ I(X,Y|Z) = H(X,Z)+ H(Y,Z) - H(Z) - H(X,Y,Z). $$

To estimate the entropies in (7), Zhang et al. [21] consider the Gaussian kernel density estimator used in [19]. Using the Gaussian kernel density estimator, MI and CMI are defined as

$$\begin{array}{*{20}l} \widehat{I}(X,Y) = \frac{1}{2} \log \frac{|C(X)|~|C(Y)|}{|C(X,Y)|},\\ \widehat{I}(X,Y|Z) = \frac{1}{2} \log \frac{|C(X,Z)|~|C(Y,Z)|}{|C(Z)|~|C(X,Y,Z)|}, \end{array} $$

where |A| is the determinant of a matrix A, C(X), C(Y) and C(Z) are the variances of X, Y and Z, respectively, and C(X,Z), C(Y,Z) and C(X,Y,Z) are the covariance matrices of (X,Z), (Y,Z) and (X,Y,Z), respectively.

To efficiently identify dependent pairs of variables, Zhang et al. [21] adopt the path consistency algorithm (PCA) in [44]. Thus, the authors called their procedure as PCA based on CMI (PCACMI). The PCACMI method sets L = 0 and calculates with L-order CMI, which is equivalent to MI if L=0. Then, PCACMI removes the pairs of variables such that the maximal CMI of two variables given L+1 adjacent variables is less than a given threshold α, where α determines whether two variables are independent or not and adjacent variables denote variables connected to the two target variables in PCACMI at the previous step. PCACMI repeats the above steps until there is no higher order connection. The MATLAB code for PCACMI is provided by [21] at the author’s website

Conditional mutual inclusive information-based network inference (CMI2NI)

Recently, Zhang et al. [22] proposed the conditional mutual inclusive information-based network inference (CMI2NI) method that improves the PCACMI method [21]. CMI2NI considers the Kullback-Leibler divergences from the joint probability density function (PDF) of target variables to the interventional PDFs removing the dependency between two variables of interest. Instead of using CMI, CMI2NI uses the conditional mutual inclusive information (CMI2) as the measure of dependency between two variables of interest given other variables. To be specific, we consider three random variables X, Y and Z. For these three random variables, the CMI2 between X and Y given Z is defined as

$$ \text{CMI2}(X,Y|Z) = \left(D_{\text{KL}}(P || P_{X \rightarrow Y}) + D_{\text{KL}}(P || P_{Y \rightarrow X}) \right)/2, $$

where D KL(f||g) is the Kullback-Leibler divergence from f to g, P is the joint PDF of X, Y and Z, and P XY is the interventional probability of X, Y and Z for removing the connection from X to Y.

With Gaussian assumption on the observed data, the CMI2 for two random variables X and Y given m-dimensional vector Z can be expressed as

$$ \begin{aligned} CMI2(X,Y|Z) &= \frac{1}{4}\left(\text{tr}(C^{-1} \Sigma) + \text{tr}(\tilde{C}^{-1} \tilde{\Sigma}) + \log C_{0}\right.\\ &\left.\qquad+ \log \tilde{C}_{0} - 2n \right), \end{aligned} $$

where Σ is the covariance matrix of (X,Y,Z T)T, \(\tilde {\Sigma }\) is the covariance matrix of (Y,X,Z T)T, Σ XZ is the covariance matrix of (X,Z T)T, Σ YZ is the covariance matrix of (Y,Z T)T, n=m+2, and C, \(\tilde {C},C_{0}\) and \(\tilde {C}_{0}\) are defined with the elements of \(\Sigma,\Sigma _{XZ},\Sigma _{YZ},\Sigma ^{-1},\Sigma _{XZ}^{-1}\) and \(\Sigma _{YZ}^{-1}\) (see Theorem 1 in [22] for details). As applied in PCACMI, CMI2NI adopts the path consistency algorithm (PCA) to efficiently calculate the CMI2 estimates. All steps of the PCA in CMI2NI are the same as one of PCACMI if we change the CMI to the CMI2. In the PCA steps of CMI2NI, two variables are regarded as independent if the corresponding CMI2 estimate is less than a given threshold α. The MATLAB code for CMI2NI is available at the author’s website

Sparse partial correlation estimation (SPACE)

In the Gaussian graphical models [45], the conditional dependencies among p variables can be represented by a graph \(\mathcal {G}=\left (V,E\right)\), where V={1,2,…,p} is a set of nodes representing p variables and E={(i,j) | ω ij ≠0, 1≤ijp} is a set of edges corresponding to the nonzero off-diagonal elements of Ω.

To describe the SPACE method, we consider linear models such that for i=1,2,…,p,

$$ \mathbf{X}_{i}=\sum\limits_{j\neq i}\beta_{ij}\mathbf{X}_{j}+\boldsymbol{\epsilon}_{i} $$

where ε i is an n-dimensional random vector from the multivariate normal distribution with mean 0 and covariance matrix (1/ω ii )I n , and I n is an identity matrix with size of n×n. Under normality, the regression coefficients β ij s can be replaced with the partial correlations ρ ijs by the relationship

$$ \beta_{ij}=-\frac{\omega_{ij}}{\omega_{ii}}=\rho^{ij}\sqrt{\frac{\omega_{jj}}{\omega_{ii}}}, $$

where \(\rho ^{ij}=\text {corr}\left (X_{i},X_{j}~|~X_{k},k\neq i,j\right)=-{\omega _{ij}}\left /{\sqrt {\omega _{ii}\omega _{jj}}}\right.\) is a partial correlation between X i and X j . Motivated by the relationship (12), Peng et al. [18] propose the SPACE method for solving the following 1-regularized problem:

$$ {\begin{aligned} \min_{\rho}\frac{1}{2}\sum\limits_{i=1}^{p}\left\{ w_{i}\sum\limits_{k=1}^{n}\left(X_{i}^{k}-\sum\limits_{j\neq i}\rho^{ij}\sqrt{\frac{\omega_{jj}}{\omega_{ii}}}X_{j}^{k}\right)^{2}\right\} +\lambda\sum\limits_{1\le i<j \le p}|\rho^{ij}|, \end{aligned}} $$

where w i is a nonnegative weight for the i-th squared error loss.

Proposed approach incorporating previously known hub information

Extended sparse partial correlation estimation (ESPACE)

In this paper, we assume that some genes (or nodes), which are referred to as hub genes (or hub nodes), regulate many other genes, and we also assume that many of these hub genes were identified from previous experiments. To incorporate information about hub nodes, we propose the extended SPACE (ESPACE) method, which extends the model space by using an additional tuning parameter α on edges connected to the given hub nodes. This additional tuning parameter can reflect the hub gene information by reducing the penalty on edges connected to hub nodes. To be specific, let \(\mathcal {H}\) be the set of hub nodes that were previously identified. The ESPACE method we propose solves

$$ {\begin{aligned} &\min_{\rho}\frac{1}{2}\sum\limits_{i=1}^{p}\left\{w_{i} \sum\limits_{k=1}^{n}\left(X_{i}^{k}-\sum\limits_{j\neq i}\rho^{ij}\sqrt{\frac{\omega_{jj}}{\omega_{ii}}}X_{j}^{k}\right)^{2}\right\}\\ &\quad+\alpha \lambda \sum\limits_{{i<j,\atop \{i \in \mathcal{H}\} \cup \{j\in \mathcal{H}\}}}|\rho^{ij}|+ \lambda \sum\limits_{ {i<j,\atop i,j\in \mathcal{H}^{c}}}|\rho^{ij}|, \end{aligned}} $$

where 0<α≤1. Note that we consider the weights w i s for the squared error loss as one in this paper. To summarize the process of the proposed method, we depict the flowchart of the ESPACE method in Fig. 1. As described in Fig. 1, the ESPACE has the prior knowledge about hub genes as an additional input, which is the novelty of the proposed method compared to the other existing methods.

Fig. 1
figure 1

Flowchart of ESPACE

Extended graphical lasso (EGLASSO)

In the Background, we mentioned the proposed procedure is applicable to other methods such as the graphical lasso. For the purpose of fair comparison and the investigation of the performance, we also applied the proposed strategy to the GLASSO, which is the GLASSO incorporating the hub gene information. We call this procedure the extended graphical lasso (EGLASSO). Similar to the ESPACE, the EGLASSO maximizes

$$ \log |\Omega| - \text{tr}(S\Omega) -\alpha \lambda \sum\limits_{{i<j,\atop \{i \in \mathcal{H}\} \cup \{j\in \mathcal{H}\}}}|\omega_{ij}| - \lambda \sum\limits_{{i<j,\atop i,j\in \mathcal{H}^{c}}}|\omega_{ij}|, $$

where λ≥0 and 0<α≤1 are two tuning parameters, S is the sample covariance matrix, tr(A) is the trace of A and \(\mathcal {H}\) is the set of hub nodes that were previously identified. Note that we can use the R package glasso version 1.7 for the EGLASSO by defining the penalty matrix corresponding to the penalty term in (15).

Active shooting algorithm for ESPACE

To solve (14), we adopt the active shooting algorithm introduced in [18]. We rewrite the problem (14) as

$$ \min_{\rho}\frac{1}{2}\left\|\mathbf{Y}-\tilde{\mathbf{X}}\boldsymbol{\rho}\right\|_{2}^{2}+\alpha \lambda \sum\limits_{{i<j,\atop \{i \in \mathcal{H}\} \cup \{j\in \mathcal{H}\}}}|\rho^{ij}|+ \lambda \sum\limits_{ {i<j,\atop i,j\in \mathcal{H}^{c}}}|\rho^{ij}|, $$

where \(\mathbf {Y}=(\mathbf {X}_{1}^{T},\mathbf {X}_{2}^{T},\ldots,\mathbf {X}_{p}^{T})^{T}\) is an n×p column vector; \(\mathbf {X}^{k,l}=\left (\mathbf {0}_{n(k-1)\times 1}^{T},\mathbf {X}_{l(k)}^{T},\mathbf {0}_{n(l-k-1)\times 1}^{T},\mathbf {X}_{k(l)}^{T},\mathbf {0}_{n(p-l)\times 1}^{T}\right)^{T}\) is an n×p column vector as well, with \(\mathbf {X}_{k(l)}=\sqrt {\frac {\omega _{kk}}{\omega _{ll}}}\mathbf {X}_{k};\)

$$ \tilde{\mathbf{X}}= \left(\begin{array}{cccccccc} \mathbf{X}^{1,2}; & \mathbf{X}^{1,3}; & \cdots; & \mathbf{X}^{1,p}; & \mathbf{X}^{2,3}; & \mathbf{X}^{2,4}; & \cdots; & \mathbf{X}^{(p-1),p} \end{array}\right), $$

and ρ=(ρ 12;ρ 13;…;ρ 1p;ρ 23;ρ 24;…;ρ (p−1)p)T. Let \(\widehat {\boldsymbol {\rho }}^{(m)}\) and \(\widehat {\omega }_{ii}^{(m)}\) be estimates of ρ and ω ii at the m-th iteration, respectively. Then, the steps of the modified algorithm are outlined below:

  • Step 1: (Initialization of \(\widehat {\omega }_{ii}\)) For i=1,2,…,p, \(\widehat {\omega }_{ii}^{(0)} = 1\) and s=0.

  • Step 2: (Initialization of \(\widehat {\boldsymbol {\rho }}\)) For 1≤i<jp and m=0,

    $$ \begin{array}{llll} \widehat{\rho}^{ij,(0)} &=& \text{sign}\left(\mathbf{Y}^{T}\mathbf{X}^{i,j}\right)\frac{\left(\left|\mathbf{Y}^{T}\mathbf{X}^{i,j}\right|-\alpha\lambda\right)_{+}}{\left(\mathbf{X}^{i,j}\right)^{T}\mathbf{X}^{i,j}} &\text{for}~\{i \in \mathcal{H}\}\cup\{ j \in \mathcal{H}\},\\ \widehat{\rho}^{ij,(0)} &=& \text{sign}\left(\mathbf{Y}^{T}\mathbf{X}^{i,j}\right)\frac{\left(\left|\mathbf{Y}^{T}\mathbf{X}^{i,j}\right|-\lambda\right)_{+}}{\left(\mathbf{X}^{i,j}\right)^{T}\mathbf{X}^{i,j}} &\text{for}~i,j\in\mathcal{H}^{c}, \end{array} $$

    where (x)+= max(x,0) and X i,js are defined in (16) with \(\widehat {\omega }_{ii}^{(s)}\).

  • Step 3: Define an active set \(\Lambda =\{(i,j)~|~\widehat {\rho }^{ij,(m)}\neq 0\}\).

  • Step 4: Iteratively update \(\widehat {\boldsymbol {\rho }}^{(m)}\) for (k,l)Λ,

    $$ \begin{array}{lll} \widehat{\rho}^{kl,(m)}&= \text{sign}\left((\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right)\frac{\left(\left|(\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right|-\alpha\lambda\right)_{+}}{(\mathbf{X}^{k,l})^{T}\mathbf{X}^{k,l}}\\ &\quad\text{for}~\{k \in \mathcal{H}\}\cup\{ l \in \mathcal{H}\}, \\ \medskip \widehat{\rho}^{kl,(m)}&= \text{sign}\left((\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right)\frac{\left(\left|(\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right|-\lambda\right)_{+}}{(\mathbf{X}^{k,l})^{T}\mathbf{X}^{k,l}}&\text{for}~k,l\in\mathcal{H}^{c}, \end{array} $$

    where \(\boldsymbol {\epsilon }'= \mathbf {Y}-{\sum \nolimits }_{(i,j)\neq (k,l)} \tilde {\rho }^{ij}\mathbf {X}^{i,j}\) and \(\tilde {\rho }^{ij}\)s are current estimates at the step for updating the (k,l)-th partial correlation.

  • Step 5: Repeat Step 4 until convergence occurs on the active set Λ.

  • Step 6: Update \(\widehat {\boldsymbol {\rho }}^{(m+1)}\) for 1≤i<jp by using the equations in Step 4. If the maximum difference between \(\widehat {\boldsymbol {\rho }}^{(m+1)}\) and \(\widehat {\boldsymbol {\rho }}^{(m)}\) is less than a pre-determined tolerance τ, then go to Step 7 with the estimates \(\widehat {\boldsymbol {\rho }}^{(m+1)}\). Otherwise, consider m=m+1 and go back to Step 3.

  • Step 7: Update \(\widehat {\omega }_{ii}^{(s+1)}\) for i=1,2,…,p,

    $$\begin{aligned} \frac{1}{\widehat{\omega}_{ii}^{(s+1)}}&= \frac{1}{n}\left\|\mathbf{X}_{i}-\sum\limits_{j\neq i}\widehat{\rho}^{ij,(m+1)}\sqrt{\frac{\widehat{\omega}_{jj}^{(s)}}{\widehat{\omega}_{ii}^{(s)}}}\mathbf{X}_{j}\right\|_{2}^{2}\\ &\quad\text{for}~i=1,2,\ldots,p. \end{aligned} $$
  • Step 8: Repeat Step 2 through Step 7 with s=s+1 until convergence occurs on \(\widehat {\omega }_{ii}\)s.

Note that the number of iterations of \(\widehat {\omega }_{ii}\)s is usually small for stabilization of the estimates of ρ. In our numerical study, the estimates of ω ii s converge within 10 iterations. Moreover, the inner products such as Y T X i,j, whose complexity is O(n p), can efficiently be computed by rewriting \(\mathbf {Y}^{T}\mathbf {X}^{i,j} = {\sum \nolimits }_{k=1}^{n} \left (\sqrt {\omega _{jj}/\omega _{ii}} + \sqrt {\omega _{jj}/\omega _{ii}}\right) X_{i}^{k} X_{j}^{k}, \) whose complexity is O(n). We implemented the R package espace, which is available from

Choice of tuning parameters

We have introduced the ESPACE method, which relaxes the penalty on edges connected to the hub genes (i.e., α<1) but uses the same penalty on edges connected to non-hub gene (i.e., α=1). When no hub genes are involved in a network, ESPACE is reduced to SPACE. For a given λ, this modification allows us to find more edges connected to the hub genes by reducing α. In practice, however, we do not know the values of λ and α. In this paper, we consider the GIC-type criterion used in [46] for the Gaussian graphical model to choose the optimal tuning parameters (λ ,α ). Let \(\widehat {\rho }_{(\lambda,\alpha)}^{ij}\) be the (i,j)-th estimate of partial correlation for given λ and α. The GIC-type criterion is defined as

$${\begin{aligned} \text{GIC}(\lambda,\alpha)&=\sum\limits_{i=1}^{p}\left\{ n\cdot\log {RSS}_{i}+\log{\log{n}}\log (p-1){\vphantom{\widehat{\rho}_{\lambda,\alpha}^{ij}}}\right.\\&\qquad\quad\times\left.\left|\left\{j:j\neq i,\widehat{\rho}_{\lambda,\alpha}^{ij}\neq0\right\}\right|\right\}, \end{aligned}} $$

where \({RSS}_{i} = \left \|\mathbf {X}_{i}-\sum \limits _{j\neq i}\widehat {\rho }_{(\lambda,\alpha)}^{ij}\mathbf {X}_{j(i)}\right \|_{2}^{2}\) and |A| denotes a cardinality of a set A. We choose the tuning parameters which minimize the GIC-type criterion,

$$(\lambda^{*},\alpha^{*})={\operatornamewithlimits{argmin}_{\lambda,\alpha}\text{GIC}(\lambda,\alpha).} $$

Simulation studies

Simulation settings

In this simulation, we consider four real protein-protein interaction (PPI) networks used in a comparative study [24], which were partially selected from the human protein reference database [47]. As mentioned earlier, genes whose degrees are greater than 7 and above the 0.95 quantile of the degree distribution are thought of as hub genes. Figure 2 shows the four PPI networks and their hub genes. Let p be the number of nodes in a network. We consider the number of samples as p/2 and p and generate samples from the multivariate normal distribution with mean 0 and covariance matrix Σ defined with \((\Sigma)_{ij} = (\Omega ^{-1})_{ij}/\sqrt {(\Omega ^{-1})_{ii}(\Omega ^{-1})_{jj}}\), where Ω is a concentration matrix corresponding to a given network structure. To generate a positive definite concentration matrix, we use the following procedure as described in [18]:

  • Step G1: For a given edge set E, we generate an initial concentration matrix \(\tilde {\Omega }=(\tilde {\omega }_{ij})_{1\le i,j\le p}\) with

    $$\tilde{\omega}_{ij}=\left\{ \begin{array}{ll} 1 & \quad i=j\\ 0 & \quad i\neq j,~(i,j)\notin E\\ \sim~Unif(D) & \quad i\neq j,~(i,j)\in E \end{array}\right., $$

    where D=[−1,−0.5][ 0.5,1].

    Fig. 2
    figure 2

    The network structures of the four simulated networks. The structure of the real protein-protein interaction networks [47] were used to construct networks of different sizes by varying the number of references required to support each connection. In the degree distribution, the 0.95 quantile is 7 (connections), so the nodes with more than 7 connections were defined as hub nodes, which are represented as black nodes in the network structure. a 52 edges among 44 nodes (3 hubs), b 103 edges among 83 nodes (3 hubs), c 290 edges among 231 nodes (8 hubs) and d 837 edges among 612 nodes (33 hubs)

  • Step G2: For positive definiteness and symmetry of the concentration matrix, we define a concentration matrix Ω=(ω ij )1≤i,jp as

    $$\Omega=\frac{1}{2}\left(A+A^{T}\right), $$

    where A=(a ij )1≤i,jp , \(a_{ij}=\tilde {\omega }_{ij}/(1.5\cdot d_{i})\) and \(d_{i}=\sum \limits _{k\neq i}|\tilde {\omega }_{ik}|\) for i=1,2,…,p.

  • Step G3: Set ω ii =1 for i=1,2,…,p and ω ij = 0.1·sign(ω ij ) if 0<|ω ij |<0.1.

With these four networks, we have conducted the numerical comparisons of the ESPACE and the SPACE methods, as well as seven other methods including the other reviewed existing methods and EGLASSO. For the purpose of fair comparison, we select the optimal model by the GIC for SPACE, ESPACE, GLASSO, GLASSO-SF, and EGLASSO. Since there is no specific rule for the model selection in the other methods, we set the level α=0.2 for GeneNet and NS, and the threshold α=0.03 for PCACMI and CMI2NI. Note that the pre-determined level α=0.2 is a default of the GeneNet package and used in [35]. The pre-determined threshold α=0.03 was used in [21, 22].

Note that all the existing methods need O(p 2) memory space to store and calculate values corresponding to the interactions between variables. We can reduce this memory consumption when the whole variables can be divided into several conditionally independent blocks by using the condition described in [16].

Sensitivity analysis on random noise in the observed data

To investigate the effect of the random noise contained in the observed data, we consider sensitivity analysis for the variance of the random noise. To be specific, suppose that a random vector X=(X 1,X 2,…,X p )T follows the multivariate normal distribution with mean 0 and covariance matrix Σ, a vector of random noise ε=(ε 1,ε 2,…,ε p )T follows the multivariate normal distribution with mean 0 and covariance matrix \(\sigma _{\epsilon }^{2} I\), and X and ε are independent, where I is the identity matrix. Furthermore, we assume that an observed random vector Z=(Z 1,Z 2,…,Z p )T such that

$$ \mathbf{Z} = \mathbf{X} + \varepsilon. $$

Thus, the covariance matrix of Z becomes \(\Sigma + \sigma _{\epsilon }^{2} I\), which may have a different conditional dependent structure to one of X.

For example, if we consider \(\sigma _{\epsilon }^{2} = 0.5\) and the following Σ and Σ Z

$$ \Sigma= \left(\begin{array}{ccc} 15/11 & -8/11 & 2/11 \\ -8/11& 16/11& -4/11 \\ 2/11& -4/11& 12/11 \\ \end{array}\right),~ \Sigma_{\mathbf{Z}}= \Sigma + \sigma_{\epsilon}^{2} I, $$

then the inverse matrices of Σ and Σ Z are calculated as

$$ {\begin{aligned} &\Sigma^{-1}= \left(\begin{array}{ccc} 1 & 0.5 & 0 \\ -0.5& 1& 0.25 \\ 0& 0.25& 1 \\ \end{array}\right)\text{and}\\ &\Sigma_{\mathbf{Z}}^{-1}= \left(\begin{array}{ccc} 0.63 &0.23 &-0.02 \\ 0.23 &0.62 & 0.12 \\ -0.02& 0.12 & 0.66 \end{array}\right), \text{respectively.} \end{aligned}} $$

Thus, we can see that Z 1 and Z 3 are conditionally dependent given Z 2 while X 1 and X 3 are conditionally independent given X 2. Moreover, the nonzero partial correlations decrease when the variance of the random noises increases. From these observations, the performance of the estimation becomes worse if the variance of the random noise increases.

In this sensitivity analysis, we consider \(\sigma _{\epsilon }^{2} = 0, 0.01, 0.1, 0.25, 0.5\) and p=231 and n=115,231 with the same network structure as the one of p=231 in Fig. 2. To focus on the proposed method, we apply the SPACE and the ESPACE methods to the 50 generated datasets containing random noise having variance \(\sigma _{\epsilon }^{2}\).

Performance measures

To investigate the gains from the extension, we use five performance measures: sensitivity (SEN), specificity (SPE), false discovery rate (FDR), mis-specification rate (MISR) and Matthews correlation coefficients (MCC). Note that the MCC, which lies between −1 and +1, has been used to measure the performance of binary classification, where +1, 0, and −1 denote a perfect classification, a random classification, and a total discordance of classification, respectively. Let ρ and \(\widehat {\rho }_{\lambda,\alpha }\) be (p(p−1)/2)-dimensional vectors of the true and estimated partial correlation, respectively. The above five measures are defined as

$${\begin{aligned} \begin{array}{l} \text{SEN} \equiv \text{TP}/(\text{TP}+\text{FN}),~~ \text{SPE} \equiv \text{TN}/(\text{TN}+\text{FP}), \\ \text{FDR} \equiv \text{FP}/(\text{TP} + \text{FP}),~~ \text{MISR} \equiv (\text{FN} + \text{FP})/\left(p(p-1)/2\right)~~\text{and}\\ \text{MCC} \equiv \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}, \end{array} \end{aligned}} $$

where \(\text {TP} = {\sum \nolimits }_{i<j} I(\rho ^{ij} \neq 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha } \neq 0)\), \(\text {FP} = {\sum \nolimits }_{i<j} I(\rho ^{ij} = 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha } \neq 0)\), \(\text {FN} = {\sum \nolimits }_{i<j} I(\rho ^{ij} \neq 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha }=0)\) and \(\text {TN} = {\sum \nolimits }_{i<j} I(\rho ^{ij} = 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha } = 0)\).

Application to Escherichia coli dataset

We applied the ESPACE method to the largest public Escherichia coli (E.coli) microarray dataset available from the Many Microbe Microarrays database (M3D) [48]. The M3D contains 907 microarrays measured under 466 experimental conditions using Affymetrix GeneChip E.coli genome arrays. Microarrays from the same experimental conditions were averaged to derive the mean expression. The data set (“E_coli_v4_Build_6” from the M3D) contains the expression levels of 4297 genes from 446 samples. In the E.coli genome, a number of studies have been conducted to identify transcriptional regulations. The RegulonDB [49] curates the largest and best-known information on the transcriptional regulation of E.coli. To combine the information from the above two databases, we focus on the 1623 genes reported in both the M3D and the RegulonDB. As mentioned before, the TFs are known to regulate many other genes in the genome and can be considered potential hubs. To incorporate the information about the potential hubs, we used a list of 180 known TF-encoding genes from the RegulonDB. The RegulonDB also provides 3811 transcriptional interactions among the 1623 genes, which were used as the gold standard to evaluate the accuracy of the constructed networks.

Application to lung cancer adenocarcinoma dataset

Lung cancer is the leading cause of death from cancer, both in the United States and worldwide; it has a 5-year survival rate of approximately 15% [50]. The progression and metastasis of lung cancer varies greatly among early stage lung cancer patients. To customize treatment plans for individual patients, it is important to identify prognostic or predictive biomarkers, which allows for more precise classification of lung cancer patients. In this study, we applied the extended SPACE method to reconstruct the gene regulatory network in lung cancer. Exploring network structures can facilitate comprehension of biological mechanisms underlying lung cancer and identification of important genes that could be potential lung cancer biomarkers. We constructed the gene network using microarray data from 442 lung cancer adenocarcinoma patients in the Lung Cancer Consortium study [51]. For detail about preprocessing this dataset, please refer to [7]. First, univariate Cox regression was used to identify the genes whose expression levels are correlated with patient survival outcomes, after adjusting for clinical factors such as study site, age, gender, and stage. The false discovery rate (FDR) was then calculated using a Beta-Uniform model [52]. By controlling the FDR to less than 10%, we identified 794 genes that were associated with the survival outcome of lung cancer patients. Among these 794 genes, 22 were found to appear among the 236 carefully curated cancer genes of the FoundationOneTM gene panel (Foundation Medicine, Inc.). Current biological knowledge indicates genes from this panel play a key role in various types of cancer. These 22 genes were then input as known hub genes to the ESPACE method.

Results and discussion

Simulation results

Comparison results for existing methods

For each network, we generated 50 datasets and reconstructed the network from each dataset using nine different network construction methods, including both the SPACE and the ESPACE methods. In addition to the five performance measures, we also measure the computation time (Time) of each method to compare the efficiency. Note that all methods are executed on R software [53] for the purpose of fair comparison. We implemented the R codes for PCACMI and CMI2NI using the authors’ MATLAB codes. The computation times are measured in CPU time (seconds) by using a desktop PC (Intel Core(TM) i7-4790K CPU (4.00 GHz) and 32 GB RAM).

Tables 1, 2, 3 and 4 report the averages and standard errors of the number of the estimated edges, the five performance measures of the estimation of the network structures and computation times with the optimal tuning parameter λ for SPACE, GLASSO, GLASSO-SF; the optimal tuning parameters α and λ for ESPACE and EGLASSO; and the pre-determined α for GeneNet, NS, PCACMI, and CMI2NI.

Table 1 The averages of the number of estimated edges, the five performance measures and the computation time (sec.) over 50 datasets
Table 2 The averages of the number of estimated edges, the five performance measures and the computation time (sec.) over 50 datasets
Table 3 The averages of the number of estimated edges, the five performance measures and the computation time (sec.) over 50 datasets
Table 4 The averages of the number of estimated edges, the five performance measures and the computation time (sec.) over 50 datasets

Overall, ESPACE has the best performance in estimating network structures in terms of the MCC and the MISR except for the case (p,n)=(83,41), where ESPACE has the second smallest FDR while the MCC and the MISR of ESPACE show the moderate performance among all methods. In the case (p,n)=(83,41), the CMI-based methods have better performance than the others in terms of the MCC and the MISR, but the CMI-based methods also have the large FDRs (≈41%) more than double of those of the other methods. As we described in the Methods Section, the MCC has been used to measure the performance of binary classification and the MISR denotes the total error rate. Thus, this comparison results show that ESPACE is favorable for the identification of edges for the networks with high-dimensional data.

In addition, we made several interesting observations from the results of the our simulation study. First, ESPACE and EGLASSO improve SPACE and GLASSO in terms of the FDR, the MISR, and the MCC for almost scenarios, respectively. The only exception is the case (p,n)=(231,115) for the ESPACE and the SPACE methods. In this case, although the FDR of ESPACE increases 2.17% compared to one of SPACE, ESPACE still improves SPACE in terms of the SEN, the MISR, and the MCC. This suggests that our proposed strategy, which incorporates the previously known hub information, can reduce the errors in estimating network structures compared to the existing method without considering known hub information. Second, GeneNet controls the FDR relatively close to the given level α while the FDRs of NS are controlled conservatively. For instance, the FDRs of GeneNet are measured between 3.94 and 22.24% and NS has the FDRs less than 3.48%. Note that GeneNet and NS control the FDR under 20% (α=0.2) in this simulation study. Third, all methods except the CMI-based methods (the PCACMI and the CMI2NI) have similar efficiency for the relatively low dimensions (p=44,83). The CMI-based methods are relatively slower than the other methods for all the scenarios except for the case (p,n)=(612,612), where GLASSO-SF is the slowest and 1.4 times slower than CMI2NI. CMI2NI is slightly slower than PCACMI for the relatively high dimensions (p=231,612). Finally, even though ESPACE is not the fastest method among the nine methods we consider, there is no overall winner and ESPACE is the third best in terms of the computation time for p=231,612 except for the case (p,n)=(612,612) where ESPACE is faster than SPACE, GLASSO-SF, PCACMI and CMI2NI.

To investigate the other property of the proposed approach, we depict barplots of the averages of degrees of known hub genes over 50 datasets for ESPACE and SPACE in Fig. 3. Figure 3 shows that ESPACE tends to find more edges connected to known hub genes than SPACE. The only exception is the case (p,n)=(612,306), where the average by the ESPACE is 0.57 less than that by SPACE. We conjecture this is simply due the difference in the number of estimated edges, which by ESPACE is 15.54 less than that of SPACE on average. This property is due to the result that the averages of α selected by the GIC in the ESPACE method lie between 0.76 and 0.97 for all the scenarios, which indicates that ESPACE has incorporated prior information about the hub genes and reduced the penalty on edges connected to known hub genes.

Fig. 3
figure 3

Plots of the averages of degrees of hub nodes over the simulated 50 datasets. Vertical lines denote 95% confidence intervals of the averages

Results of sensitivity analysis on random noise

Table 5 reports the averages of the number of the estimated edges and the five performance measures. From the results in Table 5, we can see that the performance of estimation decreases when the variance of the random noises increases in both the SPACE and the ESPACE. For a relatively small sample size (n=115), both the SPACE and the ESPACE are more sensitive to the variance \(\sigma _{\epsilon }^{2}\) compared to the case of n=231. Even though the performance of two methods decreases by similar amounts as the variance \(\sigma _{\epsilon }^{2}\) increases, the ESPACE has better performance than the SPACE in terms of the MCC and the MISR.

Table 5 The averages of the number of estimated edges and the five performance measures over 50 datasets

Comparison of the identified GRNs in Escherichia coli dataset

In this study, we compared the performance of network construction using the SPACE and the ESPACE methods for the model selected by the GIC. We report the number of estimated edges and the true positives, which are matched to the transcriptional interactions in the RegulonDB, in Table 6. The SPACE method estimated 368 edges among 524 genes, which contain 16 TF-encoding genes, and identified 16 transcriptional interactions as true positives. In comparison, the ESPACE method estimated 349 edges among 478 genes containing 29 TF-encoding genes and found 45 transcriptional interactions in the RegulonDB. The ESPACE method found more interactions than the SPACE method and increased the ratio of the number of TPs versus the number of estimated edges as 8.55%. Figure 4 shows the number of TPs vs. the number of estimated edges for various λ values with α in Table 6. The number of TPs of the ESPACE method is consistently greater than those of the SPACE method at similar sparsity. These results clearly indicate that incorporating potential hub gene information improves the accuracy of network construction.

Fig. 4
figure 4

Plot of the number of the TPs vs. the number of the estimated edges for various λs with α in Table 6

Table 6 Summary of the estimated networks using the SPACE and ESPACE methods from the E.coli dataset. We denote a set of estimated edges and a set of the interactions from the RegulonDB by \(\widehat {E}\) and T, respectively

Comparison of the identified GRNs in lung cancer adenocarcinoma dataset

We again compared the performances of network construction using the SPACE and the ESPACE methods. An overview of the networks constructed using both methods is shown in Fig. 5. The SPACE method estimated 234 edges between 114 genes and the ESPACE method found 272 edges between 132 genes. Although the numbers of estimated edges from both the SPACE and ESPACE methods are quite similar, 16.7 and 28.3% of the estimated edges in networks by the SPACE and the ESPACE are different, respectively. We identified hub genes using the criterion mentioned at the beginning of this paper. The lists of hub genes identified in both networks are reported in Table 7. Interestingly, all hub genes identified by the SPACE method were also found using ESPACE. Note that this is not the usual case. For instance, if we define a hub as a node whose degree is greater than 5, the set of hub genes identified by the SPACE is not a subset of the hub genes identified by the ESPACE. To investigate the gains of the ESPACE method, therefore, we focused on the hub genes identified only by ESPACE (AURKA, APC, CDKN3), among which, AURKA and APC are among the 22 pre-specidifed hub genes while CDKN3 is not.

Fig. 5
figure 5

Estimated networks structure using the SPACE and ESPACE methods. The nodes with more than 7 connections (the 0.95 quantile in the degree distribution) were defined as hub nodes, which are represented as black nodes in the network structure. The details of hub genes are reported in Table 7. a SPACE (114 nodes, 234 edges) and b ESPACE (132 nodes, 272 edges)

Table 7 Hub genes from the estimated graphs using the SPACE and ESPACE methods

The CDKN3 (Cyclin-Dependent Kinase Inhibitor 3) protein coded by the CDKN3 gene is a cyclin-dependent kinase inhibitor. Recent studies [54, 55] show that CDKN3 overexpression was associated with poorer survival outcomes in lung adenocarcinoma, but not in lung squamous cell carcinoma. We validated that CDKN3 is associated with the prognosis of lung adenocarcinoma patients in two independent datasets (see Fig. 6). The CDKN3 expression allowed us to separate the lung adenocarcinoma patients into high CDKN3 and low CDKN3 groups with significantly different survival outcomes: in the GSE13213 dataset [56] (n=117), hazard ratio = 2.02 (high CDKN3 vs. low CDKN3), p=0.0146; in the GSE1037 dataset [57] (n=61), hazard ratio= 3.39 (high CDKN3 vs. low CDKN3), p=0.0126. Note that we divided patients into “high” and “low” groups by their gene expression levels with the K-means clustering method.

Fig. 6
figure 6

Kaplan-Meier curves for the CDKN3 gene from GSE13213 and GSE1037 datasets. For each gene, we divide patients into two groups, “High” and “Low”, by their gene expression levels with the K-means clustering method. Red solid lines denote the “High” group and black dashed lines denote the “Low” group. a CDKN3 (GSE13213 dataset) and b CDKN3 (GSE1037 dataset)

APC (Adenomatous Polyposis Coli) is a tumor suppressor gene, and is involved in the Wnt signaling pathway as a negative regulator. It has been identified as one of the key mutation genes in lung adenocarcinoma by a comprehensive study on the somatic mutations in lung adenocarcinoma [58]. AURKA (aurora kinase A) is a protein-coding gene found to be associated with many different types of cancer. Aurora kinase inhibitors have been studied as a potential cancer treatment [59]. Using the GSE42127 dataset [7] (n=209), we found that AURKA expression can predict lung cancer patients’ response to chemotherapy. The dataset contains expression profiles and treatment information for 209 lung cancer patients from MD Anderson Cancer Center, among whom 62 received adjuvant chemotherapy (ACT group) and the remaining 147 did not (no ACT group). The AURKA gene expression allowed us to separate the 209 patients into a low AURKA group (n=104) and high AURKA group (n=105) using the median AURKA expression as a cut-off. The patients in the low AURKA group (Fig. 7 a) showed significant improvement in survival after ACT: hazard ratio = 0.289 (ACT vs. no ACT) and p value = 0.0312. The patients in the high AURKA group (Fig. 7 b), on the other hand, showed no significant survival benefit after ACT: hazard ratio = 0.679 (ACT vs. no ACT) and p value = 0.241. These results indicate that AURKA expression could potentially be a predictive biomarker for lung cancer adjuvant chemotherapy, since only patients with low AURKA expression benefit from the treatment, while those with high AURKA expression are less likely to benefit. In addition, it is possible that Aurora kinase inhibitors, which suppress the expression of AURKA genes, may synergize the effect of adjuvant chemotherary, i.e. improve the chance that a patient responds to adjuvant chemotherapy. In fact, a recent study has demonstrated that Aurora kinase inhibitors may synergize the effect of adjuvant chemotherapy in ovarian cancer, which is consistent with our results in lung cancer.

Fig. 7
figure 7

Kaplan-Meier curves for low and high groups of the AURKA gene expression in GSE42127 dataset [7]. The AURKA expression separates the 209 lung cancer patients into two groups. In the AURKA low expression group (left panel), lung cancer patients with ACT (green line) have significantly longer survival time than patients without ACT (observational group, purple line). In the AURKA high expression group (right panel), patients with ACT do not have significant survival benifit compared to patients without ACT. a lower expression group. b high expression group


We have demonstrated incorporating hub gene information in estimating network structures by extending SPACE with an additional tuning parameter. Our simulation study shows that the ESPACE method reduces errors in the construction of networks when the networks have previously-known hub nodes. Through two applications, we illustrate that the ESPACE method can improve the SPACE method by using the information about the potential hub genes. Although we adopted the GIC to select the optimal tuning parameters in this paper, the ESPACE method can directly be applied with other model selection criteria. The performance of the ESPACE method varies with the chosen criterion. However, the performance of the ESPACE method is at least comparable to the SPACE method since the ESPACE includes the SPACE as a reduced case.



Adjuvant chemotherapy


Adenomatous polyposis coli


Aurora kinase A


Cyclin-dependent kinase inhibitor 3


Conditional mutual inclusive information-based network inference

E.coli :

Escherichia coli


Extended GLASSO


Extended SPACE


False discovery rate


Generalized information criterion


Gaussian graphical model


Graphical lasso


GLASSO with reweighted strategy for scale-free network


Gene regulatory network


Many Microbe Microarrays database


Matthews correlation coefficients


Mutual information


Mis-specification rate




Neighborhood selection


path consistency algorithm based on conditional mutual information


Protein-protein interaction




Sparse partial correlation estimation




Transcription factor


  1. Friedman N. Inferring cellular networks using probabilistic graphical models. Science. 2004; 303(5659):799–805. doi:10.1126/science.109406810.1126/science.1094068.

    Article  CAS  PubMed  Google Scholar 

  2. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N. Revealing modular organization in the yeast transcriptional network. Nat Genet. 2002; 31(4):370–7. doi:10.1038/Ng94110.1038/Ng941.

    CAS  PubMed  Google Scholar 

  3. Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet. 2003; 34(2):166–76. doi:10.1038/ng1165ng116510.1038/ng1165 ng1165.

    Article  CAS  PubMed  Google Scholar 

  4. Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP. Causal protein-signaling networks derived from multiparameter single-cell data. Science. 2005; 308(5721):523–9. doi:10.1126/science.110580910.1126/science.1105809.

    Article  CAS  PubMed  Google Scholar 

  5. Zhong R, Allen JD, Xiao G, Xie Y. Ensemble-based network aggregation improves the accuracy of gene network reconstruction. PLoS ONE. 2014; 9(11):106319. doi:10.1371/journal.pone.010631910.1371/journal.pone.0106319.

    Article  Google Scholar 

  6. Akavia UD, Litvin O, Kim J, Sanchez-Garcia F, Kotliar D, Causton HC, Pochanard P, Mozes E, Garraway LA, Pe’er D. An integrated approach to uncover drivers of cancer. Cell. 2010; 143(6):1005–17. doi:10.1016/j.cell.2010.11.01310.1016/j.cell.2010.11.013.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Tang H, Xiao G, Behrens C, Schiller J, Allen J, Chow CW, Suraokar M, Corvalan A, Mao J, White MA, Wistuba I, Minna JD, Xie Y. A 12-gene set predicts survival benefits from adjuvant chemotherapy in non-small cell lung cancer patients. Clin Cancer Res. 2013; 19(6):1577–86. doi:10.1158/1078-0432.CCR-12-232110.1158/1078-0432.CCR-12-2321.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Cooper GF, Herskovits E. A bayesian method for the induction of probabilistic networks from data. Mach Learn. 1992; 9(4):309–47. doi:10.1023/A:102264940155210.1023/A:1022649401552.

    Google Scholar 

  9. Ellis B, Wong WH. Learning causal bayesian network structures from experimental data. J Am Stat Assoc. 2008; 103(482):778–89. doi:10.1198/01621450800000019310.1198/016214508000000193.

    Article  CAS  Google Scholar 

  10. Liang FM, Zhang J. Learning bayesian networks for discrete data. Comput Stat Data Anal. 2009; 53(4):865–76. doi:10.1016/j.csda.2008.10.00710.1016/j.csda.2008.10.007.

    Article  Google Scholar 

  11. Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR. Inference in bayesian networks. Nat Biotechnol. 2006; 24(1):51–3. doi:10.1038/nbt0106-5110.1038/nbt0106-51.

    Article  CAS  PubMed  Google Scholar 

  12. Sachs K, Gifford D, Jaakkola T, Sorger P, Lauffenburger DA. Bayesian network approach to cell signaling pathway modeling. Sci STKE. 2002; 2002(148):38. doi:10.1126/stke.2002.148.pe3810.1126/stke.2002.148.pe38.

    Google Scholar 

  13. Langfelder P, Horvath S. Wgcna: an r package for weighted correlation network analysis. Bmc Bioinforma. 2008; 9. doi:10.1186/1471-2105-9-55910.1186/1471-2105-9-559.

  14. Yuan M, Lin Y. Model selection and estimation in the gaussian graphical model. Biometrika. 2007; 94(1):19–35. doi:10.1093/biomet/asm01810.1093/biomet/asm018.

    Article  Google Scholar 

  15. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008; 9(3):432–41. doi:10.1093/biostatistics/kxm04510.1093/biostatistics/kxm045.

    Article  PubMed  Google Scholar 

  16. Witten DM, Friedman JH, Simon N. New insights and faster computations for the graphical lasso. J Comput Graph Stat. 2011; 20(4):892–900. doi:10.1198/jcgs.2011.11051a10.1198/jcgs.2011.11051a.

    Article  Google Scholar 

  17. Meinshausen N, Buhlmann P. High-dimensional graphs and variable selection with the lasso. Ann Stat. 2006; 34(3):1436–62. doi:10.1214/00905360600000028110.1214/009053606000000281.

    Article  Google Scholar 

  18. Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. J Am Stat Assoc. 2009; 104(486):735–46. doi:10.1198/jasa.2009.012610.1198/jasa.2009.0126.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A. Reverse engineering of regulatory networks in human b cells. Nat Genet. 2005; 37(4):382–90. doi:10.1038/ng153210.1038/ng1532.

    Article  CAS  PubMed  Google Scholar 

  20. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. Aracne: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. Bmc Bioinforma. 2006; 7. doi:10.1186/1471-2105-7-S1-S710.1186/1471-2105-7-S1-S7.

  21. Zhang X, Zhao XM, He K, Lu L, Cao Y, Liu J, Hao JK, Liu ZP, Chen L. Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics. 2012; 28(1):98–104. doi:10.1093/bioinformatics/btr62610.1093/bioinformatics/btr626.

    Article  CAS  PubMed  Google Scholar 

  22. Zhang X, Zhao J, Hao JK, Zhao XM, Chen L. Conditional mutual inclusive information enables accurate quantification of associations in gene regulatory networks. Nucleic Acids Res. 2015; 43(5):31. doi:10.1093/nar/gku131510.1093/nar/gku1315.

    Article  CAS  Google Scholar 

  23. Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D. How to infer gene networks from expression profiles. Mol Syst Biol. 2007; 3. doi:10.1038/Msb410012010.1038/Msb4100120.

  24. Allen JD, Xie Y, Chen M, Girard L, Xiao G. Comparing statistical methods for constructing large scale gene networks. PLoS ONE. 2012; 7(1):29348. doi:10.1371/journal.pone.002934810.1371/journal.pone.0029348.

    Article  Google Scholar 

  25. Pan W. Network-based multiple locus linkage analysis of expression traits. Bioinformatics. 2009; 25(11):1390–6. doi:10.1093/bioinformatics/btp17710.1093/bioinformatics/btp177.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Pan W, Xie BH, Shen XT. Incorporating predictor network in penalized regression with application to microarray data. Biometrics. 2010; 66(2):474–84. doi:10.1111/j.1541-0420.2009.01296.x10.1111/j.1541-0420.2009.01296.x.

    Article  PubMed  Google Scholar 

  27. Wei P, Pan W. Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics. 2008; 24(3):404–11. doi:10.1093/bioinformatics/btm61210.1093/bioinformatics/btm612.

    Article  PubMed  Google Scholar 

  28. Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA. Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol. 2004; 14(3):283–91. doi:10.1016/

    Article  CAS  PubMed  Google Scholar 

  29. Li JJ, Xie D. Rack1, a versatile hub in cancer. Oncogene. 2015; 34(15):1890–8. doi:10.1038/onc.2014.12710.1038/onc.2014.127.

    Article  CAS  PubMed  Google Scholar 

  30. Selvanathan SP, Graham GT, Erkizan HV, Dirksen U, Natarajan TG, Dakic A, Yu S, Liu X, Paulsen MT, Ljungman ME, Wu CH, Lawlor ER, Uren A, Toretsky JA. Oncogenic fusion protein ews-fli1 is a network hub that regulates alternative splicing. Proc Natl Acad Sci USA. 2015; 112(11):1307–16. doi:10.1073/pnas.1500536112.

    Article  Google Scholar 

  31. Liu Q, Ihler A. Learning scale free networks by reweighted L1 regularization. In: AISTATS: 2011. p. 40–48.

  32. Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hurst LD, Tyers M. Stratus not altocumulus: a new view of the yeast protein interaction network. PLoS Biol. 2006; 4(10):317. doi:10.1371/journal.pbio.0040317.

    Article  Google Scholar 

  33. Ekman D, Light S, Bjorklund AK, Elofsson A. What properties characterize the hub proteins of the protein-protein interaction network of saccharomyces cerevisiae?Genome Biol. 2006; 7(6):45. doi:10.1186/gb-2006-7-6-r45.

    Article  Google Scholar 

  34. Schafer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol. 2005; 4:32. doi:10.2202/1544-6115.1175.

    Google Scholar 

  35. Efron B. Local false discovery rates. available at. 2005. Accessed 9 Mar.

  36. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B-Methodological. 1996; 58(1):267–88.

    Google Scholar 

  37. Zhao P, Yu B. On model selection consistency of lasso. J Mach Learn Res. 2006; 7:2541–63.

    Google Scholar 

  38. Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat. 2008; 2(1):224–44. doi:10.1214/07-Aoas147.

    Article  Google Scholar 

  39. Banerjee O, El Ghaoui L, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. J Mach Learn Res. 2008; 9:485–516.

    Google Scholar 

  40. Mazumder R, Hastie T. The graphical lasso: New insights and alternatives. Electron J Stat. 2012; 6(0):2125–149. doi:10.1214/12-ejs740.

    Article  PubMed  PubMed Central  Google Scholar 

  41. Lange K, Hunter DR, Yang I. Optimization transfer using surrogate objective functions. J Comput Graph Stat. 2000; 9(1):1–20.

    Google Scholar 

  42. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. Large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007; 5(1):8. doi:10.1371/journal.pbio.0050008.

    Article  Google Scholar 

  43. Meyer PE, Lafitte F, Bontempi G. minet: A r/bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinforma. 2008; 9:461.

    Article  Google Scholar 

  44. Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search, 2nd ed. Boston: The MIT Press; 2000.

    Google Scholar 

  45. Lauritzen SL. Graphical Models. New York: Oxford University Press Inc.; 1996.

    Google Scholar 

  46. Yu D, Son W, Lim J, Xiao G. Statistical completion of a partially identified graph with applications for the estimation of gene regulatory networks. Biostatistics. 2015. doi:10.1093/biostatistics/kxv013.

  47. Prasad TSK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A. Human protein reference database 2009 update. Nucleic Acids Res. 2009; 37(suppl 1):767–72. doi:10.1093/nar/gkn892.

    Article  Google Scholar 

  48. Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, Gardner TS. Many microbe microarrays database: uniformly normalized affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008; 36(Database issue):866–70. doi:10.1093/nar/gkm815.

    Google Scholar 

  49. Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muniz-Rascado L, Garcia-Sotelo JS, Weiss V, Solano-Lira H, Martinez-Flores I, Medina-Rivera A, Salgado-Osorio G, Alquicira-Hernandez S, Alquicira-Hernandez K, Lopez-Fuentes A, Porron-Sotelo L, Huerta AM, Bonavides-Martinez C, Balderas-Martinez YI, Pannier L, Olvera M, Labastida A, Jimenez-Jacinto V, Vega-Alvarado L, Del Moral-Chavez V, Hernandez-Alvarez A, Morett E, Collado-Vides J. Regulondb v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res. 2013; 41(Database issue):203–13. doi:10.1093/nar/gks1201.

    Article  Google Scholar 

  50. Jemal A, Siegel R, Xu J, Ward E. Cancer statistics, 2010. CA Cancer J Clin. 2010; 60(5):277–300. doi:10.3322/caac.20073.

    Article  PubMed  Google Scholar 

  51. Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC, Zhu CQ, Strumpf D, Hanash S, Shepherd FA, Ding K, Seymour L, Naoki K, Pennell N, Weir B, Verhaak R, Ladd-Acosta C, Golub T, Gruidl M, Sharma A, Szoke J, Zakowski M, Rusch V, Kris M, Viale A, Motoi N, Travis W, Conley B, Seshan VE, Meyerson M, Kuick R, Dobbin KK, Lively T, Jacobson JW, Beer DG. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med. 2008; 14(8):822–7. doi:10.1038/nm.1790.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics. 2003; 19(10):1236–42. doi:10.1093/bioinformatics/btg148.

    Article  CAS  PubMed  Google Scholar 

  53. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2015. Available at

  54. Zang X, Chen M, Zhou Y, Xiao G, Xie Y, Wang X. Identifying cdkn3 gene expression as a prognostic biomarker in lung adenocarcinoma via meta-analysis. Cancer Inform. 2015; 14(Suppl 2):183–91. doi:10.4137/CIN.S17287.

    PubMed  PubMed Central  Google Scholar 

  55. Fan C, Chen L, Huang Q, Shen T, Welsh EA, Teer JK, Cai J, Cress WD, Wu J. Overexpression of major cdkn3 transcripts is associated with poor survival in lung adenocarcinoma. Br J Cancer. 2015; 113(12):1735–43. doi:10.1038/bjc.2015.378.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  56. Tomida S, Takeuchi T, Shimada Y, Arima C, Matsuo K, Mitsudomi T, Yatabe Y, Takahashi T. Relapse-related molecular signature in lung adenocarcinomas identifies patients with dismal prognosis. J Clin Oncol. 2009; 27(17):2793–9. doi:10.1200/JCO.2008.19.7053.

    Article  CAS  PubMed  Google Scholar 

  57. Jones MH, Virtanen C, Honjoh D, Miyoshi T, Satoh Y, Okumura S, Nakagawa K, Nomura H, Ishikawa Y. Two prognostically significant subtypes of high-grade lung neuroendocrine tumours independent of small-cell and large-cell neuroendocrine carcinomas identified by gene expression profiles. The Lancet. 2004; 363(9411):775–81. doi:10.1016/S0140-6736(04)15693-6.

    Article  CAS  Google Scholar 

  58. Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, Fulton L, Fulton RS, Zhang Q, Wendl MC, Lawrence MS, Larson DE, Chen K, Dooling DJ, Sabo A, Hawes AC, Shen H, Jhangiani SN, Lewis LR, Hall O, Zhu Y, Mathew T, Ren Y, Yao J, Scherer SE, Clerc K, Metcalf GA, Ng B, Milosavljevic A, Gonzalez-Garay ML, Osborne JR, Meyer R, Shi X, Tang Y, Koboldt DC, Lin L, Abbott R, Miner TL, Pohl C, Fewell G, Haipek C, Schmidt H, Dunford-Shore BH, Kraja A, Crosby SD, Sawyer CS, Vickery T, Sander S, Robinson J, Winckler W, Baldwin J, Chirieac LR, Dutt A, Fennell T, Hanna M, Johnson BE, Onofrio RC, Thomas RK, Tonon G, Weir BA, Zhao X, Ziaugra L, Zody MC, Giordano T, Orringer MB, Roth JA, Spitz MR, Wistuba II, Ozenberger B, Good PJ, Chang AC, Beer DG, Watson MA, Ladanyi M, Broderick S, Yoshizawa A, Travis WD, Pao W, Province MA, Weinstock GM, Varmus HE, Gabriel SB, Lander ES, Gibbs RA, Meyerson M, Wilson RK. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008; 455(7216):1069–75.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  59. Kollareddy M, Zheleva D, Dzubak P, Brahmkshatriya PS, Lepsik M, Hajduch M. Aurora kinase inhibitors: progress towards the clinic. Invest New Drugs. 2012; 30(6):2411–32. doi:10.1007/s10637-012-9798-6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

Download references


We gratefully thank Jessie Norris for language editing of the manuscript.


This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science, ICT and Future Planning (NRF-2015R1C1A1A02036312 and NRF-2011-0030810), and the National Institutes of Health grants (1R01CA17221, 1R01GM117597, R15GM113157).

Availability of data and materials

The Escherichia coli dataset analyzed during the current study is available in the Many Microbe Microarrays database (M3D) [48], The RegulonDB [49] dataset used for validation of the result is available at The Lung Cancer Consortium study dataset analyzed during this study is included in the published article [51]. The information of cancer genes used during this study is available from the FoundationOneTM, The proposed method ESPACE is implemented the R package “espace”, which is available from

Authors’ contributions

DY, JL and GX drafted the manuscript. DY and JL formulate the proposed model and performed simulation studies. GX performed the interpretation of the results in the application to the lung cancer adenocarcinoma dataset. GX designed preprocessing procedure in the real-data applications. FL and XW helped in the verification of the proposed model and revised the manuscript. All authors read and approved the final manuscript.

Competing interests

We have no financial or personal relationships with other people and organizations that cause conflict of interests. The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Guanghua Xiao.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yu, D., Lim, J., Wang, X. et al. Enhanced construction of gene regulatory networks using hub gene information. BMC Bioinformatics 18, 186 (2017).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: