 Methodology Article
 Open Access
 Published:
Enhanced construction of gene regulatory networks using hub gene information
BMC Bioinformatics volume 18, Article number: 186 (2017)
Abstract
Background
Gene regulatory networks reveal how genes work together to carry out their biological functions. Reconstructions of gene networks from gene expression data greatly facilitate our understanding of underlying biological mechanisms and provide new opportunities for biomarker and drug discoveries. In gene networks, a gene that has many interactions with other genes is called a hub gene, which usually plays an essential role in gene regulation and biological processes. In this study, we developed a method for reconstructing gene networks using a partial correlationbased approach that incorporates prior information about hub genes. Through simulation studies and two realdata examples, we compare the performance in estimating the network structures between the existing methods and the proposed method.
Results
In simulation studies, we show that the proposed strategy reduces errors in estimating network structures compared to the existing methods. When applied to Escherichia coli, the regulation network constructed by our proposed ESPACE method is more consistent with current biological knowledge than the SPACE method. Furthermore, application of the proposed method in lung cancer has identified hub genes whose mRNA expression predicts cancer progress and patient response to treatment.
Conclusions
We have demonstrated that incorporating hub gene information in estimating network structures can improve the performance of the existing methods.
Background
A gene regulatory network (GRN) describes interactions and regulatory relationships among genes. It provides a systematic understanding of the molecular mechanisms underlying biological processes by revealing how genes work together to form modules that carry out cell functions [1–4]. In addition, the visualization of genetic dependencies through the GRN facilitates the systematic interpretation and comprehension of analysis results from genomewide studies using highthroughput data. GRNs have proven valuable in a variety of contexts, including identifying druggable targets [5], detecting driver genes in diseases [6], and even optimizing prognostic and predictive signatures [7].
Gene expression microarrays monitor the transcription activities of thousands of genes simultaneously, which provides a great opportunity to study the “relationships” among genes on a large scale. However, challenges lie in constructing largescale GRNs from gene expression microarray data due to the small sample sizes of microarray studies and the extremely large solution space. Computational techniques and algorithms have been proposed to reconstruct GRNs from gene expression data, including probabilitybased approaches such as Bayesian networks [8–12], correlationbased approaches [13], likelihoodbased approaches [14–16], partialcorrelationbased approaches [17, 18], and informationtheorybased approaches [19–22]. The existing methods are briefly reviewed in the Methods Section. Readers can also refer to Bansal et al. [23] and Allan et al. [24] for a more detailed review of network construction methods.
The sparse partial correlation estimation (SPACE) method, proposed by Peng et al. [18], considers a penalized regression approach to estimate edges in the GRN, which utilizes the sparse feature of the GRN. Comparative studies have shown that the SPACE method performs well in estimating sparse networks with high accuracy [24]. Peng et al. [18] also showed that the method was able to identify functional relevant molecular networks. In addition, recent studies of network analysis have revealed its advantage in detecting genes or modules associated with phenotypes [25–27].
In gene networks, genes that have many interactions with other genes are defined as hub gens. Because of these interactions, hub genes usually play an important role in a biological system. For example, transcription factor (TF), a protein that binds to specific DNA sequences, can regulate a given set of genes. In humans, approximately 10% of genes in the genome code for around 2600 TFs [28]. The combinatorial human TFs account for most of the regulation activities in the human genome, especially during the development stage. As a result, the genes that code TFs, called TFencoding genes, are usually regarded as hub genes. Furthermore, in cancer research, cancer genes (oncogenes or tumor suppressor genes) take part in tumor genesis and are likely to be hub genes in the genetic networks of tumors [29, 30]. Through decades of biological studies, knowledge on important genes (such as TFs or cancer genes) has been accumulated. Our hypothesis is that incorporating prior knowledge about hub genes can improve accuracy in estimating the gene network structure. It is worth noting that there is a reweighted ℓ _{1} regularization method [31] that repeatedly estimates the structures and modifies the weights of the penalties by using the information on degrees from the previous estimation to encourage the appearance of the hubs. This method does not use the prior information obtainable from the other resources while our method uses additional information not contained in the observed dataset.
To explicitly account for the information on hub genes, we propose an extension of the SPACE method, which introduces an additional tuning parameter to open up the possibility of reducing penalization and increasing the likelihood of selecting the edges connected to such genes. We numerically show that the proposed method reduces errors in estimating network structures. Although we focus on extending the SPACE method in this paper, the idea can also be applied to penalized likelihood methods as well as to other penalized regression methods. Note that there is no rigorous definition of a hub in the context of a network; the definition of a hub varies depending on the sparsity of the network. For sparse protein networks, a hub is defined in [32] as a protein whose degree lies over the 0.95 quantile of the degree distribution or in [33] and [7] as a protein whose degree is greater than 7. In this paper, we conservatively define a hub as a node whose degree is both greater than 7 and above the 0.95 quantile of the degree distribution, because most nodes in sparse networks have relatively small degrees between 0 and 3.
In this study, we briefly introduce seven existing methods, including the SPACE and the graphical lasso, and propose the extended SPACE (ESPACE) method to incorporate the biological knowledge about important genes, i.e. network hubs. Moreover, it is worth noting that the ESPACE only incorporates the previously known biological information not contained in the observed dataset compared to the other existing methods. Through simulation studies, we show that the proposed approach reduces error in estimating the network structures compared to the seven other existing methods that we reviewed in the “Methods” section. Finally, we demonstrate the improvement of the ESPACE method compared to the SPACE method with two realdata examples.
Methods
Review of existing methods
Here, we briefly review the existing methods; the GeneNet [34], the NS [17], the GLASSO [15], the GLASSOSF [31], the PCACMI [21], the CMI2NI [22], and the SPACE [18]. Let \(X_{i}^{k}\) be the expression level of the ith gene of the kth array for i=1,2,…,p and k=1,2,…,n. Let \(\mathbf {X}_{i} = \left (X_{i}^{1}, X_{i}^{2}, \ldots, X_{i}^{n}\right)^{T}\) so that observed gene expression data can be denoted by an n×p matrix X=(X _{1};X _{2};…;X _{ p }) whose rows and columns denote arrays and genes, respectively. Suppose row vectors \(\mathbf {X}^{k}=\left (X_{1}^{k},X_{2}^{k},\ldots,X_{p}^{k}\right)\) for k=1,2,…,n are independently and identically distributed random vectors from the multivariate normal distribution with mean 0 and covariance matrix Σ. We assume that Σ is positive definite, and let Ω≡Σ ^{−1}=(ω _{ ij })_{1≤i,j≤p } be the inverse of the covariance matrix Σ, which is referred to as a concentration matrix or a precision matrix.
GeneNet
Schäfer and Strimmer [34] propose the linear shrinkage estimator for a covariance matrix and the Gaussian graphical model (GGM) selection based on the partial correlation obtained from their shrinkage estimator. With multiple testing procedure using the local false discovery rate [35], the GGM selection controls the false discovery rate under a predetermined level α. Since Schäfer and Strimmer [34] provide their GGM selection procedure in the R package GeneNet, we denote their GGM selection procedure as GeneNet in this paper. To be specific, one of the most commonly used linear shrinkage estimators S ^{∗} for the covariance matrix Σ is
where S=(s _{ ij })_{1≤i,j≤p } is the sample covariance matrix, T=diag(s _{11},s _{22},…,s _{ pp }) is the shrinkage target matrix, and \(\lambda ^{*} = {\sum \nolimits }_{i\neq j} \widehat {\text {Var}} (s_{ij}) / \left ({\sum \nolimits }_{i\neq j} s_{ij}^{2}\right)\) is the optimal shrinkage intensity. With this estimator S ^{∗}, the matrix of the partial correlations \(P = (\hat {\rho }^{ij})_{1 \le i,j \le p}\) is defined as \(\hat {\rho }^{ij} = \hat {\omega }_{ij} / \sqrt {\hat {\omega }_{ii} \hat {\omega }_{jj}}\), where \(\hat {\Omega } = (\hat {\omega }_{ij})_{1 \le i,j \le p} = \left (S^{*}\right)^{1}\).
To identify the significant edges, Schäfer and Strimmer [34] suppose the distribution of the partial correlations as the mixture
where f _{0} is the null distribution, f _{1} is the alternative distribution corresponding to the true edges, and η _{0} is the unknown mixing parameter. Using the algorithm in [35], GeneNet identifies significant edges that have the local false positive rate
smaller than the predetermined level α, where f _{0}(ρ,ν)=ρBe(ρ ^{2};0.5,(ν−1)/2), Be(x;a,b) is the density of the Beta distribution and ν is the reciprocal variance of the null ρ.
Neighborhood selection (NS)
Meinshausen and Bühlmann [17] propose the neighborhood selection (NS) method, which separately solves the lasso [36] problem and identifies edges with nonzero estimated regression coefficients for each node. Meinshausen and Bühlmann [17] prove that the NS method is asymptotically consistent in identifying the neighborhood of each node when the neighborhood stability condition is satisfied. Note that the neighborhood stability condition is related to the irrepresentable condition in linear model literature [37].
To be specific, for each node i∈V={1,2,…,p}, NS solves the following lasso problem
where \(\\mathbf {x}\_{2}^{2} = {\sum \nolimits }_{i=1}^{p} x_{i}^{2}\) and \(\\mathbf {x}\_{1} = {\sum \nolimits }_{i=1}^{p} x_{i}\) for \(\mathbf {x} \in \mathbb {R}^{p}\). With the estimate \(\hat {\beta }^{i,\lambda }\), NS identifies the neighborhood of the node i as \(N_{i}(\lambda) = \{ k~~ \hat {\beta }_{k}^{i,\lambda } \neq 0\}\), which defines an edge set \(E_{i}^{\lambda } = \{(i,j)~~ j \in N_{i}(\lambda)\}\). Since NS separately solves p lasso problems, contradictory edges may occur when we define the total edge set \(E^{\lambda } = \cup _{i=1}^{p} E_{i}^{\lambda }\), i.e., \(\hat {\beta }_{k}^{i,\lambda } \neq 0\) and \(\hat {\beta }_{i}^{k,\lambda } = 0\). To avoid these contradictory edges, NS suggests two types of edge sets E ^{λ,∧} and E ^{λ,∨} defined as follows:
Meinshausen and Bühlmann [17] mentioned these two edge sets have only small differences in their experience and the differences vanish asymptotically. Meinshausen and Bühlmann [17] also propose the choice of the tuning parameter λ _{ i }(α) for the ith node
where \(\tilde {\Phi } = 1  \Phi \) and Φ is the distribution function of the standard normal distribution. With this choice of λ _{ i }(α) for i=1,2,…,p, the probability of falsely identifying edges in the network is bounded by the level α. Note that we estimate the edge set with E ^{λ,∧} and solve the lasso problems using the R package CDLasso proposed by [38] in this paper.
Graphical lasso (GLASSO)
Friedman et al. [15] propose the graphical lasso method that estimates a sparse inverse covariance matrix Ω by maximizing the ℓ _{1} penalized loglikelihood
where S is the sample covariance matrix, tr(A) is the trace of A and ∥A∥_{1} is the ℓ _{1} norm of A for \(A \in \mathbb {R}^{p \times p}\).
To be specific, let W be the estimate of the covariance matrix Σ and consider partitioning W and S
Motivated by [39], Friedman et al. [15] show that the solution \(\hat {\Omega }\) of (1) is equivalent to the inverse of W whose partitioned entity w _{12} satisfies w _{12}=W _{11} β ^{∗}, where β ^{∗} is the solution of the lasso problem
Based on the above property, the graphical lasso sets the diagonal elements w _{ ii }=s _{ ii }+ρ and obtains the offdiagonal elements of W by repeatedly applying the following two steps:

1.
Permuting the columns and rows to locate the target elements at the position of w _{12}.

2.
Finding the solution w _{12}=W _{11} β ^{∗} by solving the lasso problem (2).
until convergence occurs. After finding W, the estimate \(\hat {\Omega }\) is obtained from the relationships \(\omega _{12} =  \hat {\beta } \hat {\omega }_{22}\) and \(\hat {\omega }_{22} = 1/(w_{22}  w_{12}^{T}\hat {\beta })\), where \(\hat {\beta } = W_{11}^{1} w_{12}\). This graphical lasso algorithm was proposed in [15] and had its computational efficiency improved in [16] and [40]. Witten et al. [16] provide the R package glasso version 1.7.
GLASSO with reweighted strategy for scalefree network (GLASSOSF)
Liu and Ihler [31] propose the reweighted ℓ _{1} regularization method to improve the performance of the estimation for the scalefree network whose degrees follows the power law distribution. Motivated by the fact that the existing methods work poorly for the scalefree networks, Liu and Ihler [31] consider changing the ℓ _{1} norm penalty in the existing methods to the power law regularization
where λ and γ are nonnegative tuning parameters, ω _{−i }={ω _{ ij }  j≠i}, \(\\omega _{i}\_{1} = {\sum \nolimits }_{j\neq i} \omega _{ij}\), and ε _{ i } is a small positive number for i=1,2,…,p. Thus, Liu and Ihler [31] consider optimizing the following objective function
where L(X,Ω) denotes the objective function of the existing method without its penalty terms, u _{ L }=1 if L is convex and u _{ L }=−1 if L is concave for Ω. Note that the choice of L is flexible. For instance, L(X,Ω) can be the loglikelihood function of Ω as in the graphical lasso or the squared loss function as in the NS and the SPACE. In this section, we suppose that L is concave for the purpose of notational simplicity.
To obtain the maximizer of f(Ω;X,λ,γ), Liu and Ihler [31] propose the iteratively reweighted ℓ _{1} regularization procedure based on the minorizationmaximization (MM) algorithm [41]. The reweighted procedure iteratively solves the following problem:
where \(\Omega ^{(k)}= \left (\omega _{ij}^{(k)}\right)\) is the estimate at the kth iteration, \(\\omega _{i}^{(k)}\_{1} = {\sum \nolimits }_{l \neq i} \omega _{il}^{(k)}\), and \(\eta _{ij}^{(k)} = \lambda \left (1/(\\omega _{i}^{(k)}\_{1}\right. + \epsilon _{i}) +\left.1/(\\omega _{j}^{(k)}\_{1} + \epsilon _{j})\right)\). In practice, [31] suggest ε _{ i }=1, γ=2λ/ε _{ i }, and the initial estimate Ω ^{(0)}=I _{ p }, where I _{ p } is the pdimensional identity matrix. Note that this reweighted strategy facilitates to estimate the hub nodes by adjusting weights in the penalty term but weights are updated by solely using the observed dataset without previously known information from other literatures.
In this paper, we consider L(X,Ω)= logΩ−tr(S Ω), which is the same to the component in the objective function of the GLASSO. Thus, we call this procedure as the GLASSO with a reweighted strategy for the scalefree network (GLASSOSF). As applied in [31], we stop the reweighting iteration after 5 iterations. The R package glasso version 1.7 is used to obtain the solution of (5) at each iteration with the penalty matrix \(E^{(k)} = (e_{ij}^{(k)})\), where \(e_{ij}^{(k)} = \eta _{ij}^{(k)}\) for i≠j and \(e_{ii}^{(k)} = 2\lambda \) for i=1,2,…,p.
Path consistency algorithm based on conditional mutual information (PCACMI)
Mutual information (MI) is a widely used measure of dependency between variables in information theory. MI even measures nonlinear dependency between variables and can be considered as a generalization of the correlation. Several mutual information (MI) based methods have been developed such as ARACNE [20], CLR [42], and minet [43]. However, similar to the correlation, MI only measures pairwise dependency between two variables. Thus, it usually identifies many undirected interactions between variables. To resolve this difficulty, Zhang et al. [21] propose the information theoretic method for reconstruction of the gene regulatory networks based on the conditional mutual information (CMI).
To be specific, let H(X) and H(X,Y) be the entropy of a random variable X and the joint entropy of random variables X and Y, respectively. For two random variables X and Y, H(X) and H(X,Y) can be expressed as
where f _{ X }(x) is the marginal probability density function (PDF) of X and f _{ XY }(x,y) is the joint PDF of X and Y. With these notations, MI is defined as
It is known that MI measures dependency between two variables that contain both directed dependency and indirected dependency through other variables. While MI can not distinguish directed and indirected dependency, CMI can measure directed dependency between two variables by conditioning on other variables. CMI for X and Y given Z is defined as
To estimate the entropies in (7), Zhang et al. [21] consider the Gaussian kernel density estimator used in [19]. Using the Gaussian kernel density estimator, MI and CMI are defined as
where A is the determinant of a matrix A, C(X), C(Y) and C(Z) are the variances of X, Y and Z, respectively, and C(X,Z), C(Y,Z) and C(X,Y,Z) are the covariance matrices of (X,Z), (Y,Z) and (X,Y,Z), respectively.
To efficiently identify dependent pairs of variables, Zhang et al. [21] adopt the path consistency algorithm (PCA) in [44]. Thus, the authors called their procedure as PCA based on CMI (PCACMI). The PCACMI method sets L = 0 and calculates with Lorder CMI, which is equivalent to MI if L=0. Then, PCACMI removes the pairs of variables such that the maximal CMI of two variables given L+1 adjacent variables is less than a given threshold α, where α determines whether two variables are independent or not and adjacent variables denote variables connected to the two target variables in PCACMI at the previous step. PCACMI repeats the above steps until there is no higher order connection. The MATLAB code for PCACMI is provided by [21] at the author’s website https://sites.google.com/site/xiujunzhangcsb/software/pcacmi.
Conditional mutual inclusive informationbased network inference (CMI2NI)
Recently, Zhang et al. [22] proposed the conditional mutual inclusive informationbased network inference (CMI2NI) method that improves the PCACMI method [21]. CMI2NI considers the KullbackLeibler divergences from the joint probability density function (PDF) of target variables to the interventional PDFs removing the dependency between two variables of interest. Instead of using CMI, CMI2NI uses the conditional mutual inclusive information (CMI2) as the measure of dependency between two variables of interest given other variables. To be specific, we consider three random variables X, Y and Z. For these three random variables, the CMI2 between X and Y given Z is defined as
where D _{KL}(fg) is the KullbackLeibler divergence from f to g, P is the joint PDF of X, Y and Z, and P _{ X→Y } is the interventional probability of X, Y and Z for removing the connection from X to Y.
With Gaussian assumption on the observed data, the CMI2 for two random variables X and Y given mdimensional vector Z can be expressed as
where Σ is the covariance matrix of (X,Y,Z ^{T})^{T}, \(\tilde {\Sigma }\) is the covariance matrix of (Y,X,Z ^{T})^{T}, Σ _{ XZ } is the covariance matrix of (X,Z ^{T})^{T}, Σ _{ YZ } is the covariance matrix of (Y,Z ^{T})^{T}, n=m+2, and C, \(\tilde {C},C_{0}\) and \(\tilde {C}_{0}\) are defined with the elements of \(\Sigma,\Sigma _{XZ},\Sigma _{YZ},\Sigma ^{1},\Sigma _{XZ}^{1}\) and \(\Sigma _{YZ}^{1}\) (see Theorem 1 in [22] for details). As applied in PCACMI, CMI2NI adopts the path consistency algorithm (PCA) to efficiently calculate the CMI2 estimates. All steps of the PCA in CMI2NI are the same as one of PCACMI if we change the CMI to the CMI2. In the PCA steps of CMI2NI, two variables are regarded as independent if the corresponding CMI2 estimate is less than a given threshold α. The MATLAB code for CMI2NI is available at the author’s website https://sites.google.com/site/xiujunzhangcsb/software/cmi2ni.
Sparse partial correlation estimation (SPACE)
In the Gaussian graphical models [45], the conditional dependencies among p variables can be represented by a graph \(\mathcal {G}=\left (V,E\right)\), where V={1,2,…,p} is a set of nodes representing p variables and E={(i,j)  ω _{ ij }≠0, 1≤i≠j≤p} is a set of edges corresponding to the nonzero offdiagonal elements of Ω.
To describe the SPACE method, we consider linear models such that for i=1,2,…,p,
where ε _{ i } is an ndimensional random vector from the multivariate normal distribution with mean 0 and covariance matrix (1/ω _{ ii })I _{ n }, and I _{ n } is an identity matrix with size of n×n. Under normality, the regression coefficients β _{ ij }s can be replaced with the partial correlations ρ ^{ij}s by the relationship
where \(\rho ^{ij}=\text {corr}\left (X_{i},X_{j}~~X_{k},k\neq i,j\right)={\omega _{ij}}\left /{\sqrt {\omega _{ii}\omega _{jj}}}\right.\) is a partial correlation between X _{ i } and X _{ j }. Motivated by the relationship (12), Peng et al. [18] propose the SPACE method for solving the following ℓ _{1}regularized problem:
where w _{ i } is a nonnegative weight for the ith squared error loss.
Proposed approach incorporating previously known hub information
Extended sparse partial correlation estimation (ESPACE)
In this paper, we assume that some genes (or nodes), which are referred to as hub genes (or hub nodes), regulate many other genes, and we also assume that many of these hub genes were identified from previous experiments. To incorporate information about hub nodes, we propose the extended SPACE (ESPACE) method, which extends the model space by using an additional tuning parameter α on edges connected to the given hub nodes. This additional tuning parameter can reflect the hub gene information by reducing the penalty on edges connected to hub nodes. To be specific, let \(\mathcal {H}\) be the set of hub nodes that were previously identified. The ESPACE method we propose solves
where 0<α≤1. Note that we consider the weights w _{ i }s for the squared error loss as one in this paper. To summarize the process of the proposed method, we depict the flowchart of the ESPACE method in Fig. 1. As described in Fig. 1, the ESPACE has the prior knowledge about hub genes as an additional input, which is the novelty of the proposed method compared to the other existing methods.
Extended graphical lasso (EGLASSO)
In the Background, we mentioned the proposed procedure is applicable to other methods such as the graphical lasso. For the purpose of fair comparison and the investigation of the performance, we also applied the proposed strategy to the GLASSO, which is the GLASSO incorporating the hub gene information. We call this procedure the extended graphical lasso (EGLASSO). Similar to the ESPACE, the EGLASSO maximizes
where λ≥0 and 0<α≤1 are two tuning parameters, S is the sample covariance matrix, tr(A) is the trace of A and \(\mathcal {H}\) is the set of hub nodes that were previously identified. Note that we can use the R package glasso version 1.7 for the EGLASSO by defining the penalty matrix corresponding to the penalty term in (15).
Active shooting algorithm for ESPACE
To solve (14), we adopt the active shooting algorithm introduced in [18]. We rewrite the problem (14) as
where \(\mathbf {Y}=(\mathbf {X}_{1}^{T},\mathbf {X}_{2}^{T},\ldots,\mathbf {X}_{p}^{T})^{T}\) is an n×p column vector; \(\mathbf {X}^{k,l}=\left (\mathbf {0}_{n(k1)\times 1}^{T},\mathbf {X}_{l(k)}^{T},\mathbf {0}_{n(lk1)\times 1}^{T},\mathbf {X}_{k(l)}^{T},\mathbf {0}_{n(pl)\times 1}^{T}\right)^{T}\) is an n×p column vector as well, with \(\mathbf {X}_{k(l)}=\sqrt {\frac {\omega _{kk}}{\omega _{ll}}}\mathbf {X}_{k};\)
and ρ=(ρ ^{12};ρ ^{13};…;ρ ^{1p};ρ ^{23};ρ ^{24};…;ρ ^{(p−1)p})^{T}. Let \(\widehat {\boldsymbol {\rho }}^{(m)}\) and \(\widehat {\omega }_{ii}^{(m)}\) be estimates of ρ and ω _{ ii } at the mth iteration, respectively. Then, the steps of the modified algorithm are outlined below:

Step 1: (Initialization of \(\widehat {\omega }_{ii}\)) For i=1,2,…,p, \(\widehat {\omega }_{ii}^{(0)} = 1\) and s=0.

Step 2: (Initialization of \(\widehat {\boldsymbol {\rho }}\)) For 1≤i<j≤p and m=0,
$$ \begin{array}{llll} \widehat{\rho}^{ij,(0)} &=& \text{sign}\left(\mathbf{Y}^{T}\mathbf{X}^{i,j}\right)\frac{\left(\left\mathbf{Y}^{T}\mathbf{X}^{i,j}\right\alpha\lambda\right)_{+}}{\left(\mathbf{X}^{i,j}\right)^{T}\mathbf{X}^{i,j}} &\text{for}~\{i \in \mathcal{H}\}\cup\{ j \in \mathcal{H}\},\\ \widehat{\rho}^{ij,(0)} &=& \text{sign}\left(\mathbf{Y}^{T}\mathbf{X}^{i,j}\right)\frac{\left(\left\mathbf{Y}^{T}\mathbf{X}^{i,j}\right\lambda\right)_{+}}{\left(\mathbf{X}^{i,j}\right)^{T}\mathbf{X}^{i,j}} &\text{for}~i,j\in\mathcal{H}^{c}, \end{array} $$where (x)_{+}= max(x,0) and X ^{i,j}s are defined in (16) with \(\widehat {\omega }_{ii}^{(s)}\).

Step 3: Define an active set \(\Lambda =\{(i,j)~~\widehat {\rho }^{ij,(m)}\neq 0\}\).

Step 4: Iteratively update \(\widehat {\boldsymbol {\rho }}^{(m)}\) for (k,l)∈Λ,
$$ \begin{array}{lll} \widehat{\rho}^{kl,(m)}&= \text{sign}\left((\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right)\frac{\left(\left(\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right\alpha\lambda\right)_{+}}{(\mathbf{X}^{k,l})^{T}\mathbf{X}^{k,l}}\\ &\quad\text{for}~\{k \in \mathcal{H}\}\cup\{ l \in \mathcal{H}\}, \\ \medskip \widehat{\rho}^{kl,(m)}&= \text{sign}\left((\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right)\frac{\left(\left(\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right\lambda\right)_{+}}{(\mathbf{X}^{k,l})^{T}\mathbf{X}^{k,l}}&\text{for}~k,l\in\mathcal{H}^{c}, \end{array} $$where \(\boldsymbol {\epsilon }'= \mathbf {Y}{\sum \nolimits }_{(i,j)\neq (k,l)} \tilde {\rho }^{ij}\mathbf {X}^{i,j}\) and \(\tilde {\rho }^{ij}\)s are current estimates at the step for updating the (k,l)th partial correlation.

Step 5: Repeat Step 4 until convergence occurs on the active set Λ.

Step 6: Update \(\widehat {\boldsymbol {\rho }}^{(m+1)}\) for 1≤i<j≤p by using the equations in Step 4. If the maximum difference between \(\widehat {\boldsymbol {\rho }}^{(m+1)}\) and \(\widehat {\boldsymbol {\rho }}^{(m)}\) is less than a predetermined tolerance τ, then go to Step 7 with the estimates \(\widehat {\boldsymbol {\rho }}^{(m+1)}\). Otherwise, consider m=m+1 and go back to Step 3.

Step 7: Update \(\widehat {\omega }_{ii}^{(s+1)}\) for i=1,2,…,p,
$$\begin{aligned} \frac{1}{\widehat{\omega}_{ii}^{(s+1)}}&= \frac{1}{n}\left\\mathbf{X}_{i}\sum\limits_{j\neq i}\widehat{\rho}^{ij,(m+1)}\sqrt{\frac{\widehat{\omega}_{jj}^{(s)}}{\widehat{\omega}_{ii}^{(s)}}}\mathbf{X}_{j}\right\_{2}^{2}\\ &\quad\text{for}~i=1,2,\ldots,p. \end{aligned} $$ 
Step 8: Repeat Step 2 through Step 7 with s=s+1 until convergence occurs on \(\widehat {\omega }_{ii}\)s.
Note that the number of iterations of \(\widehat {\omega }_{ii}\)s is usually small for stabilization of the estimates of ρ. In our numerical study, the estimates of ω _{ ii }s converge within 10 iterations. Moreover, the inner products such as Y ^{T} X ^{i,j}, whose complexity is O(n p), can efficiently be computed by rewriting \(\mathbf {Y}^{T}\mathbf {X}^{i,j} = {\sum \nolimits }_{k=1}^{n} \left (\sqrt {\omega _{jj}/\omega _{ii}} + \sqrt {\omega _{jj}/\omega _{ii}}\right) X_{i}^{k} X_{j}^{k}, \) whose complexity is O(n). We implemented the R package espace, which is available from https://sites.google.com/site/dhyeonyu/software.
Choice of tuning parameters
We have introduced the ESPACE method, which relaxes the penalty on edges connected to the hub genes (i.e., α<1) but uses the same penalty on edges connected to nonhub gene (i.e., α=1). When no hub genes are involved in a network, ESPACE is reduced to SPACE. For a given λ, this modification allows us to find more edges connected to the hub genes by reducing α. In practice, however, we do not know the values of λ and α. In this paper, we consider the GICtype criterion used in [46] for the Gaussian graphical model to choose the optimal tuning parameters (λ ^{∗},α ^{∗}). Let \(\widehat {\rho }_{(\lambda,\alpha)}^{ij}\) be the (i,j)th estimate of partial correlation for given λ and α. The GICtype criterion is defined as
where \({RSS}_{i} = \left \\mathbf {X}_{i}\sum \limits _{j\neq i}\widehat {\rho }_{(\lambda,\alpha)}^{ij}\mathbf {X}_{j(i)}\right \_{2}^{2}\) and A denotes a cardinality of a set A. We choose the tuning parameters which minimize the GICtype criterion,
Simulation studies
Simulation settings
In this simulation, we consider four real proteinprotein interaction (PPI) networks used in a comparative study [24], which were partially selected from the human protein reference database [47]. As mentioned earlier, genes whose degrees are greater than 7 and above the 0.95 quantile of the degree distribution are thought of as hub genes. Figure 2 shows the four PPI networks and their hub genes. Let p be the number of nodes in a network. We consider the number of samples as p/2 and p and generate samples from the multivariate normal distribution with mean 0 and covariance matrix Σ defined with \((\Sigma)_{ij} = (\Omega ^{1})_{ij}/\sqrt {(\Omega ^{1})_{ii}(\Omega ^{1})_{jj}}\), where Ω is a concentration matrix corresponding to a given network structure. To generate a positive definite concentration matrix, we use the following procedure as described in [18]:

Step G1: For a given edge set E, we generate an initial concentration matrix \(\tilde {\Omega }=(\tilde {\omega }_{ij})_{1\le i,j\le p}\) with
$$\tilde{\omega}_{ij}=\left\{ \begin{array}{ll} 1 & \quad i=j\\ 0 & \quad i\neq j,~(i,j)\notin E\\ \sim~Unif(D) & \quad i\neq j,~(i,j)\in E \end{array}\right., $$where D=[−1,−0.5]∪[ 0.5,1].

Step G2: For positive definiteness and symmetry of the concentration matrix, we define a concentration matrix Ω=(ω _{ ij })_{1≤i,j≤p } as
$$\Omega=\frac{1}{2}\left(A+A^{T}\right), $$where A=(a _{ ij })_{1≤i,j≤p }, \(a_{ij}=\tilde {\omega }_{ij}/(1.5\cdot d_{i})\) and \(d_{i}=\sum \limits _{k\neq i}\tilde {\omega }_{ik}\) for i=1,2,…,p.

Step G3: Set ω _{ ii }=1 for i=1,2,…,p and ω _{ ij } = 0.1·sign(ω _{ ij }) if 0<ω _{ ij }<0.1.
With these four networks, we have conducted the numerical comparisons of the ESPACE and the SPACE methods, as well as seven other methods including the other reviewed existing methods and EGLASSO. For the purpose of fair comparison, we select the optimal model by the GIC for SPACE, ESPACE, GLASSO, GLASSOSF, and EGLASSO. Since there is no specific rule for the model selection in the other methods, we set the level α=0.2 for GeneNet and NS, and the threshold α=0.03 for PCACMI and CMI2NI. Note that the predetermined level α=0.2 is a default of the GeneNet package and used in [35]. The predetermined threshold α=0.03 was used in [21, 22].
Note that all the existing methods need O(p ^{2}) memory space to store and calculate values corresponding to the interactions between variables. We can reduce this memory consumption when the whole variables can be divided into several conditionally independent blocks by using the condition described in [16].
Sensitivity analysis on random noise in the observed data
To investigate the effect of the random noise contained in the observed data, we consider sensitivity analysis for the variance of the random noise. To be specific, suppose that a random vector X=(X _{1},X _{2},…,X _{ p })^{T} follows the multivariate normal distribution with mean 0 and covariance matrix Σ, a vector of random noise ε=(ε _{1},ε _{2},…,ε _{ p })^{T} follows the multivariate normal distribution with mean 0 and covariance matrix \(\sigma _{\epsilon }^{2} I\), and X and ε are independent, where I is the identity matrix. Furthermore, we assume that an observed random vector Z=(Z _{1},Z _{2},…,Z _{ p })^{T} such that
Thus, the covariance matrix of Z becomes \(\Sigma + \sigma _{\epsilon }^{2} I\), which may have a different conditional dependent structure to one of X.
For example, if we consider \(\sigma _{\epsilon }^{2} = 0.5\) and the following Σ and Σ _{ Z }
then the inverse matrices of Σ and Σ _{ Z } are calculated as
Thus, we can see that Z _{1} and Z _{3} are conditionally dependent given Z _{2} while X _{1} and X _{3} are conditionally independent given X _{2}. Moreover, the nonzero partial correlations decrease when the variance of the random noises increases. From these observations, the performance of the estimation becomes worse if the variance of the random noise increases.
In this sensitivity analysis, we consider \(\sigma _{\epsilon }^{2} = 0, 0.01, 0.1, 0.25, 0.5\) and p=231 and n=115,231 with the same network structure as the one of p=231 in Fig. 2. To focus on the proposed method, we apply the SPACE and the ESPACE methods to the 50 generated datasets containing random noise having variance \(\sigma _{\epsilon }^{2}\).
Performance measures
To investigate the gains from the extension, we use five performance measures: sensitivity (SEN), specificity (SPE), false discovery rate (FDR), misspecification rate (MISR) and Matthews correlation coefficients (MCC). Note that the MCC, which lies between −1 and +1, has been used to measure the performance of binary classification, where +1, 0, and −1 denote a perfect classification, a random classification, and a total discordance of classification, respectively. Let ρ and \(\widehat {\rho }_{\lambda,\alpha }\) be (p(p−1)/2)dimensional vectors of the true and estimated partial correlation, respectively. The above five measures are defined as
where \(\text {TP} = {\sum \nolimits }_{i<j} I(\rho ^{ij} \neq 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha } \neq 0)\), \(\text {FP} = {\sum \nolimits }_{i<j} I(\rho ^{ij} = 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha } \neq 0)\), \(\text {FN} = {\sum \nolimits }_{i<j} I(\rho ^{ij} \neq 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha }=0)\) and \(\text {TN} = {\sum \nolimits }_{i<j} I(\rho ^{ij} = 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha } = 0)\).
Application to Escherichia coli dataset
We applied the ESPACE method to the largest public Escherichia coli (E.coli) microarray dataset available from the Many Microbe Microarrays database (M3D) [48]. The M3D contains 907 microarrays measured under 466 experimental conditions using Affymetrix GeneChip E.coli genome arrays. Microarrays from the same experimental conditions were averaged to derive the mean expression. The data set (“E_coli_v4_Build_6” from the M3D) contains the expression levels of 4297 genes from 446 samples. In the E.coli genome, a number of studies have been conducted to identify transcriptional regulations. The RegulonDB [49] curates the largest and bestknown information on the transcriptional regulation of E.coli. To combine the information from the above two databases, we focus on the 1623 genes reported in both the M3D and the RegulonDB. As mentioned before, the TFs are known to regulate many other genes in the genome and can be considered potential hubs. To incorporate the information about the potential hubs, we used a list of 180 known TFencoding genes from the RegulonDB. The RegulonDB also provides 3811 transcriptional interactions among the 1623 genes, which were used as the gold standard to evaluate the accuracy of the constructed networks.
Application to lung cancer adenocarcinoma dataset
Lung cancer is the leading cause of death from cancer, both in the United States and worldwide; it has a 5year survival rate of approximately 15% [50]. The progression and metastasis of lung cancer varies greatly among early stage lung cancer patients. To customize treatment plans for individual patients, it is important to identify prognostic or predictive biomarkers, which allows for more precise classification of lung cancer patients. In this study, we applied the extended SPACE method to reconstruct the gene regulatory network in lung cancer. Exploring network structures can facilitate comprehension of biological mechanisms underlying lung cancer and identification of important genes that could be potential lung cancer biomarkers. We constructed the gene network using microarray data from 442 lung cancer adenocarcinoma patients in the Lung Cancer Consortium study [51]. For detail about preprocessing this dataset, please refer to [7]. First, univariate Cox regression was used to identify the genes whose expression levels are correlated with patient survival outcomes, after adjusting for clinical factors such as study site, age, gender, and stage. The false discovery rate (FDR) was then calculated using a BetaUniform model [52]. By controlling the FDR to less than 10%, we identified 794 genes that were associated with the survival outcome of lung cancer patients. Among these 794 genes, 22 were found to appear among the 236 carefully curated cancer genes of the FoundationOne^{TM} gene panel (Foundation Medicine, Inc.). Current biological knowledge indicates genes from this panel play a key role in various types of cancer. These 22 genes were then input as known hub genes to the ESPACE method.
Results and discussion
Simulation results
Comparison results for existing methods
For each network, we generated 50 datasets and reconstructed the network from each dataset using nine different network construction methods, including both the SPACE and the ESPACE methods. In addition to the five performance measures, we also measure the computation time (Time) of each method to compare the efficiency. Note that all methods are executed on R software [53] for the purpose of fair comparison. We implemented the R codes for PCACMI and CMI2NI using the authors’ MATLAB codes. The computation times are measured in CPU time (seconds) by using a desktop PC (Intel Core(TM) i74790K CPU (4.00 GHz) and 32 GB RAM).
Tables 1, 2, 3 and 4 report the averages and standard errors of the number of the estimated edges, the five performance measures of the estimation of the network structures and computation times with the optimal tuning parameter λ ^{∗} for SPACE, GLASSO, GLASSOSF; the optimal tuning parameters α ^{∗} and λ ^{∗} for ESPACE and EGLASSO; and the predetermined α for GeneNet, NS, PCACMI, and CMI2NI.
Overall, ESPACE has the best performance in estimating network structures in terms of the MCC and the MISR except for the case (p,n)=(83,41), where ESPACE has the second smallest FDR while the MCC and the MISR of ESPACE show the moderate performance among all methods. In the case (p,n)=(83,41), the CMIbased methods have better performance than the others in terms of the MCC and the MISR, but the CMIbased methods also have the large FDRs (≈41%) more than double of those of the other methods. As we described in the Methods Section, the MCC has been used to measure the performance of binary classification and the MISR denotes the total error rate. Thus, this comparison results show that ESPACE is favorable for the identification of edges for the networks with highdimensional data.
In addition, we made several interesting observations from the results of the our simulation study. First, ESPACE and EGLASSO improve SPACE and GLASSO in terms of the FDR, the MISR, and the MCC for almost scenarios, respectively. The only exception is the case (p,n)=(231,115) for the ESPACE and the SPACE methods. In this case, although the FDR of ESPACE increases 2.17% compared to one of SPACE, ESPACE still improves SPACE in terms of the SEN, the MISR, and the MCC. This suggests that our proposed strategy, which incorporates the previously known hub information, can reduce the errors in estimating network structures compared to the existing method without considering known hub information. Second, GeneNet controls the FDR relatively close to the given level α while the FDRs of NS are controlled conservatively. For instance, the FDRs of GeneNet are measured between 3.94 and 22.24% and NS has the FDRs less than 3.48%. Note that GeneNet and NS control the FDR under 20% (α=0.2) in this simulation study. Third, all methods except the CMIbased methods (the PCACMI and the CMI2NI) have similar efficiency for the relatively low dimensions (p=44,83). The CMIbased methods are relatively slower than the other methods for all the scenarios except for the case (p,n)=(612,612), where GLASSOSF is the slowest and 1.4 times slower than CMI2NI. CMI2NI is slightly slower than PCACMI for the relatively high dimensions (p=231,612). Finally, even though ESPACE is not the fastest method among the nine methods we consider, there is no overall winner and ESPACE is the third best in terms of the computation time for p=231,612 except for the case (p,n)=(612,612) where ESPACE is faster than SPACE, GLASSOSF, PCACMI and CMI2NI.
To investigate the other property of the proposed approach, we depict barplots of the averages of degrees of known hub genes over 50 datasets for ESPACE and SPACE in Fig. 3. Figure 3 shows that ESPACE tends to find more edges connected to known hub genes than SPACE. The only exception is the case (p,n)=(612,306), where the average by the ESPACE is 0.57 less than that by SPACE. We conjecture this is simply due the difference in the number of estimated edges, which by ESPACE is 15.54 less than that of SPACE on average. This property is due to the result that the averages of α ^{∗} selected by the GIC in the ESPACE method lie between 0.76 and 0.97 for all the scenarios, which indicates that ESPACE has incorporated prior information about the hub genes and reduced the penalty on edges connected to known hub genes.
Results of sensitivity analysis on random noise
Table 5 reports the averages of the number of the estimated edges and the five performance measures. From the results in Table 5, we can see that the performance of estimation decreases when the variance of the random noises increases in both the SPACE and the ESPACE. For a relatively small sample size (n=115), both the SPACE and the ESPACE are more sensitive to the variance \(\sigma _{\epsilon }^{2}\) compared to the case of n=231. Even though the performance of two methods decreases by similar amounts as the variance \(\sigma _{\epsilon }^{2}\) increases, the ESPACE has better performance than the SPACE in terms of the MCC and the MISR.
Comparison of the identified GRNs in Escherichia coli dataset
In this study, we compared the performance of network construction using the SPACE and the ESPACE methods for the model selected by the GIC. We report the number of estimated edges and the true positives, which are matched to the transcriptional interactions in the RegulonDB, in Table 6. The SPACE method estimated 368 edges among 524 genes, which contain 16 TFencoding genes, and identified 16 transcriptional interactions as true positives. In comparison, the ESPACE method estimated 349 edges among 478 genes containing 29 TFencoding genes and found 45 transcriptional interactions in the RegulonDB. The ESPACE method found more interactions than the SPACE method and increased the ratio of the number of TPs versus the number of estimated edges as 8.55%. Figure 4 shows the number of TPs vs. the number of estimated edges for various λ values with α ^{∗} in Table 6. The number of TPs of the ESPACE method is consistently greater than those of the SPACE method at similar sparsity. These results clearly indicate that incorporating potential hub gene information improves the accuracy of network construction.
Comparison of the identified GRNs in lung cancer adenocarcinoma dataset
We again compared the performances of network construction using the SPACE and the ESPACE methods. An overview of the networks constructed using both methods is shown in Fig. 5. The SPACE method estimated 234 edges between 114 genes and the ESPACE method found 272 edges between 132 genes. Although the numbers of estimated edges from both the SPACE and ESPACE methods are quite similar, 16.7 and 28.3% of the estimated edges in networks by the SPACE and the ESPACE are different, respectively. We identified hub genes using the criterion mentioned at the beginning of this paper. The lists of hub genes identified in both networks are reported in Table 7. Interestingly, all hub genes identified by the SPACE method were also found using ESPACE. Note that this is not the usual case. For instance, if we define a hub as a node whose degree is greater than 5, the set of hub genes identified by the SPACE is not a subset of the hub genes identified by the ESPACE. To investigate the gains of the ESPACE method, therefore, we focused on the hub genes identified only by ESPACE (AURKA, APC, CDKN3), among which, AURKA and APC are among the 22 prespecidifed hub genes while CDKN3 is not.
The CDKN3 (CyclinDependent Kinase Inhibitor 3) protein coded by the CDKN3 gene is a cyclindependent kinase inhibitor. Recent studies [54, 55] show that CDKN3 overexpression was associated with poorer survival outcomes in lung adenocarcinoma, but not in lung squamous cell carcinoma. We validated that CDKN3 is associated with the prognosis of lung adenocarcinoma patients in two independent datasets (see Fig. 6). The CDKN3 expression allowed us to separate the lung adenocarcinoma patients into high CDKN3 and low CDKN3 groups with significantly different survival outcomes: in the GSE13213 dataset [56] (n=117), hazard ratio = 2.02 (high CDKN3 vs. low CDKN3), p=0.0146; in the GSE1037 dataset [57] (n=61), hazard ratio= 3.39 (high CDKN3 vs. low CDKN3), p=0.0126. Note that we divided patients into “high” and “low” groups by their gene expression levels with the Kmeans clustering method.
APC (Adenomatous Polyposis Coli) is a tumor suppressor gene, and is involved in the Wnt signaling pathway as a negative regulator. It has been identified as one of the key mutation genes in lung adenocarcinoma by a comprehensive study on the somatic mutations in lung adenocarcinoma [58]. AURKA (aurora kinase A) is a proteincoding gene found to be associated with many different types of cancer. Aurora kinase inhibitors have been studied as a potential cancer treatment [59]. Using the GSE42127 dataset [7] (n=209), we found that AURKA expression can predict lung cancer patients’ response to chemotherapy. The dataset contains expression profiles and treatment information for 209 lung cancer patients from MD Anderson Cancer Center, among whom 62 received adjuvant chemotherapy (ACT group) and the remaining 147 did not (no ACT group). The AURKA gene expression allowed us to separate the 209 patients into a low AURKA group (n=104) and high AURKA group (n=105) using the median AURKA expression as a cutoff. The patients in the low AURKA group (Fig. 7 a) showed significant improvement in survival after ACT: hazard ratio = 0.289 (ACT vs. no ACT) and p value = 0.0312. The patients in the high AURKA group (Fig. 7 b), on the other hand, showed no significant survival benefit after ACT: hazard ratio = 0.679 (ACT vs. no ACT) and p value = 0.241. These results indicate that AURKA expression could potentially be a predictive biomarker for lung cancer adjuvant chemotherapy, since only patients with low AURKA expression benefit from the treatment, while those with high AURKA expression are less likely to benefit. In addition, it is possible that Aurora kinase inhibitors, which suppress the expression of AURKA genes, may synergize the effect of adjuvant chemotherary, i.e. improve the chance that a patient responds to adjuvant chemotherapy. In fact, a recent study has demonstrated that Aurora kinase inhibitors may synergize the effect of adjuvant chemotherapy in ovarian cancer, which is consistent with our results in lung cancer.
Conclusions
We have demonstrated incorporating hub gene information in estimating network structures by extending SPACE with an additional tuning parameter. Our simulation study shows that the ESPACE method reduces errors in the construction of networks when the networks have previouslyknown hub nodes. Through two applications, we illustrate that the ESPACE method can improve the SPACE method by using the information about the potential hub genes. Although we adopted the GIC to select the optimal tuning parameters in this paper, the ESPACE method can directly be applied with other model selection criteria. The performance of the ESPACE method varies with the chosen criterion. However, the performance of the ESPACE method is at least comparable to the SPACE method since the ESPACE includes the SPACE as a reduced case.
Abbreviations
 ACT:

Adjuvant chemotherapy
 APC:

Adenomatous polyposis coli
 AURKA:

Aurora kinase A
 CDKN3:

Cyclindependent kinase inhibitor 3
 CMI2NI:

Conditional mutual inclusive informationbased network inference
 E.coli :

Escherichia coli
 EGLASSO:

Extended GLASSO
 ESPACE:

Extended SPACE
 FDR:

False discovery rate
 GIC:

Generalized information criterion
 GGM:

Gaussian graphical model
 GLASSO:

Graphical lasso
 GLASSOSF:

GLASSO with reweighted strategy for scalefree network
 GRN:

Gene regulatory network
 M3D:

Many Microbe Microarrays database
 MCC:

Matthews correlation coefficients
 MI:

Mutual information
 MISR:

Misspecification rate
 MM:

Minorizationmaximization
 NS:

Neighborhood selection
 PCACMI:

path consistency algorithm based on conditional mutual information
 PPI:

Proteinprotein interaction
 SEN:

Sensitivity
 SPACE:

Sparse partial correlation estimation
 SPE:

Specificity
 TF:

Transcription factor
References
 1
Friedman N. Inferring cellular networks using probabilistic graphical models. Science. 2004; 303(5659):799–805. doi:10.1126/science.109406810.1126/science.1094068.
 2
Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N. Revealing modular organization in the yeast transcriptional network. Nat Genet. 2002; 31(4):370–7. doi:10.1038/Ng94110.1038/Ng941.
 3
Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N. Module networks: identifying regulatory modules and their conditionspecific regulators from gene expression data. Nat Genet. 2003; 34(2):166–76. doi:10.1038/ng1165ng116510.1038/ng1165 ng1165.
 4
Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP. Causal proteinsignaling networks derived from multiparameter singlecell data. Science. 2005; 308(5721):523–9. doi:10.1126/science.110580910.1126/science.1105809.
 5
Zhong R, Allen JD, Xiao G, Xie Y. Ensemblebased network aggregation improves the accuracy of gene network reconstruction. PLoS ONE. 2014; 9(11):106319. doi:10.1371/journal.pone.010631910.1371/journal.pone.0106319.
 6
Akavia UD, Litvin O, Kim J, SanchezGarcia F, Kotliar D, Causton HC, Pochanard P, Mozes E, Garraway LA, Pe’er D. An integrated approach to uncover drivers of cancer. Cell. 2010; 143(6):1005–17. doi:10.1016/j.cell.2010.11.01310.1016/j.cell.2010.11.013.
 7
Tang H, Xiao G, Behrens C, Schiller J, Allen J, Chow CW, Suraokar M, Corvalan A, Mao J, White MA, Wistuba I, Minna JD, Xie Y. A 12gene set predicts survival benefits from adjuvant chemotherapy in nonsmall cell lung cancer patients. Clin Cancer Res. 2013; 19(6):1577–86. doi:10.1158/10780432.CCR12232110.1158/10780432.CCR122321.
 8
Cooper GF, Herskovits E. A bayesian method for the induction of probabilistic networks from data. Mach Learn. 1992; 9(4):309–47. doi:10.1023/A:102264940155210.1023/A:1022649401552.
 9
Ellis B, Wong WH. Learning causal bayesian network structures from experimental data. J Am Stat Assoc. 2008; 103(482):778–89. doi:10.1198/01621450800000019310.1198/016214508000000193.
 10
Liang FM, Zhang J. Learning bayesian networks for discrete data. Comput Stat Data Anal. 2009; 53(4):865–76. doi:10.1016/j.csda.2008.10.00710.1016/j.csda.2008.10.007.
 11
Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR. Inference in bayesian networks. Nat Biotechnol. 2006; 24(1):51–3. doi:10.1038/nbt01065110.1038/nbt010651.
 12
Sachs K, Gifford D, Jaakkola T, Sorger P, Lauffenburger DA. Bayesian network approach to cell signaling pathway modeling. Sci STKE. 2002; 2002(148):38. doi:10.1126/stke.2002.148.pe3810.1126/stke.2002.148.pe38.
 13
Langfelder P, Horvath S. Wgcna: an r package for weighted correlation network analysis. Bmc Bioinforma. 2008; 9. doi:10.1186/14712105955910.1186/147121059559.
 14
Yuan M, Lin Y. Model selection and estimation in the gaussian graphical model. Biometrika. 2007; 94(1):19–35. doi:10.1093/biomet/asm01810.1093/biomet/asm018.
 15
Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008; 9(3):432–41. doi:10.1093/biostatistics/kxm04510.1093/biostatistics/kxm045.
 16
Witten DM, Friedman JH, Simon N. New insights and faster computations for the graphical lasso. J Comput Graph Stat. 2011; 20(4):892–900. doi:10.1198/jcgs.2011.11051a10.1198/jcgs.2011.11051a.
 17
Meinshausen N, Buhlmann P. Highdimensional graphs and variable selection with the lasso. Ann Stat. 2006; 34(3):1436–62. doi:10.1214/00905360600000028110.1214/009053606000000281.
 18
Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. J Am Stat Assoc. 2009; 104(486):735–46. doi:10.1198/jasa.2009.012610.1198/jasa.2009.0126.
 19
Basso K, Margolin AA, Stolovitzky G, Klein U, DallaFavera R, Califano A. Reverse engineering of regulatory networks in human b cells. Nat Genet. 2005; 37(4):382–90. doi:10.1038/ng153210.1038/ng1532.
 20
Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. Aracne: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. Bmc Bioinforma. 2006; 7. doi:10.1186/147121057S1S710.1186/147121057S1S7.
 21
Zhang X, Zhao XM, He K, Lu L, Cao Y, Liu J, Hao JK, Liu ZP, Chen L. Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics. 2012; 28(1):98–104. doi:10.1093/bioinformatics/btr62610.1093/bioinformatics/btr626.
 22
Zhang X, Zhao J, Hao JK, Zhao XM, Chen L. Conditional mutual inclusive information enables accurate quantification of associations in gene regulatory networks. Nucleic Acids Res. 2015; 43(5):31. doi:10.1093/nar/gku131510.1093/nar/gku1315.
 23
Bansal M, Belcastro V, AmbesiImpiombato A, di Bernardo D. How to infer gene networks from expression profiles. Mol Syst Biol. 2007; 3. doi:10.1038/Msb410012010.1038/Msb4100120.
 24
Allen JD, Xie Y, Chen M, Girard L, Xiao G. Comparing statistical methods for constructing large scale gene networks. PLoS ONE. 2012; 7(1):29348. doi:10.1371/journal.pone.002934810.1371/journal.pone.0029348.
 25
Pan W. Networkbased multiple locus linkage analysis of expression traits. Bioinformatics. 2009; 25(11):1390–6. doi:10.1093/bioinformatics/btp17710.1093/bioinformatics/btp177.
 26
Pan W, Xie BH, Shen XT. Incorporating predictor network in penalized regression with application to microarray data. Biometrics. 2010; 66(2):474–84. doi:10.1111/j.15410420.2009.01296.x10.1111/j.15410420.2009.01296.x.
 27
Wei P, Pan W. Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics. 2008; 24(3):404–11. doi:10.1093/bioinformatics/btm61210.1093/bioinformatics/btm612.
 28
Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA. Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol. 2004; 14(3):283–91. doi:10.1016/j.sbi.2004.05.00410.1016/j.sbi.2004.05.004.
 29
Li JJ, Xie D. Rack1, a versatile hub in cancer. Oncogene. 2015; 34(15):1890–8. doi:10.1038/onc.2014.12710.1038/onc.2014.127.
 30
Selvanathan SP, Graham GT, Erkizan HV, Dirksen U, Natarajan TG, Dakic A, Yu S, Liu X, Paulsen MT, Ljungman ME, Wu CH, Lawlor ER, Uren A, Toretsky JA. Oncogenic fusion protein ewsfli1 is a network hub that regulates alternative splicing. Proc Natl Acad Sci USA. 2015; 112(11):1307–16. doi:10.1073/pnas.1500536112.
 31
Liu Q, Ihler A. Learning scale free networks by reweighted L1 regularization. In: AISTATS: 2011. p. 40–48.
 32
Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hurst LD, Tyers M. Stratus not altocumulus: a new view of the yeast protein interaction network. PLoS Biol. 2006; 4(10):317. doi:10.1371/journal.pbio.0040317.
 33
Ekman D, Light S, Bjorklund AK, Elofsson A. What properties characterize the hub proteins of the proteinprotein interaction network of saccharomyces cerevisiae?Genome Biol. 2006; 7(6):45. doi:10.1186/gb200676r45.
 34
Schafer J, Strimmer K. A shrinkage approach to largescale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol. 2005; 4:32. doi:10.2202/15446115.1175.
 35
Efron B. Local false discovery rates. available at. 2005. http://statweb.stanford.edu/~ckirby/brad/papers/2005LocalFDR.pdf. Accessed 9 Mar.
 36
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series BMethodological. 1996; 58(1):267–88.
 37
Zhao P, Yu B. On model selection consistency of lasso. J Mach Learn Res. 2006; 7:2541–63.
 38
Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat. 2008; 2(1):224–44. doi:10.1214/07Aoas147.
 39
Banerjee O, El Ghaoui L, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. J Mach Learn Res. 2008; 9:485–516.
 40
Mazumder R, Hastie T. The graphical lasso: New insights and alternatives. Electron J Stat. 2012; 6(0):2125–149. doi:10.1214/12ejs740.
 41
Lange K, Hunter DR, Yang I. Optimization transfer using surrogate objective functions. J Comput Graph Stat. 2000; 9(1):1–20.
 42
Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. Largescale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007; 5(1):8. doi:10.1371/journal.pbio.0050008.
 43
Meyer PE, Lafitte F, Bontempi G. minet: A r/bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinforma. 2008; 9:461.
 44
Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search, 2nd ed. Boston: The MIT Press; 2000.
 45
Lauritzen SL. Graphical Models. New York: Oxford University Press Inc.; 1996. http://books.google.com/books?id=mGQWkx4guhAC.
 46
Yu D, Son W, Lim J, Xiao G. Statistical completion of a partially identified graph with applications for the estimation of gene regulatory networks. Biostatistics. 2015. doi:10.1093/biostatistics/kxv013.
 47
Prasad TSK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A. Human protein reference database 2009 update. Nucleic Acids Res. 2009; 37(suppl 1):767–72. doi:10.1093/nar/gkn892.
 48
Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, Gardner TS. Many microbe microarrays database: uniformly normalized affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008; 36(Database issue):866–70. doi:10.1093/nar/gkm815.
 49
Salgado H, PeraltaGil M, GamaCastro S, SantosZavaleta A, MunizRascado L, GarciaSotelo JS, Weiss V, SolanoLira H, MartinezFlores I, MedinaRivera A, SalgadoOsorio G, AlquiciraHernandez S, AlquiciraHernandez K, LopezFuentes A, PorronSotelo L, Huerta AM, BonavidesMartinez C, BalderasMartinez YI, Pannier L, Olvera M, Labastida A, JimenezJacinto V, VegaAlvarado L, Del MoralChavez V, HernandezAlvarez A, Morett E, ColladoVides J. Regulondb v8.0: omics data sets, evolutionary conservation, regulatory phrases, crossvalidated gold standards and more. Nucleic Acids Res. 2013; 41(Database issue):203–13. doi:10.1093/nar/gks1201.
 50
Jemal A, Siegel R, Xu J, Ward E. Cancer statistics, 2010. CA Cancer J Clin. 2010; 60(5):277–300. doi:10.3322/caac.20073.
 51
Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC, Zhu CQ, Strumpf D, Hanash S, Shepherd FA, Ding K, Seymour L, Naoki K, Pennell N, Weir B, Verhaak R, LaddAcosta C, Golub T, Gruidl M, Sharma A, Szoke J, Zakowski M, Rusch V, Kris M, Viale A, Motoi N, Travis W, Conley B, Seshan VE, Meyerson M, Kuick R, Dobbin KK, Lively T, Jacobson JW, Beer DG. Gene expressionbased survival prediction in lung adenocarcinoma: a multisite, blinded validation study. Nat Med. 2008; 14(8):822–7. doi:10.1038/nm.1790.
 52
Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of pvalues. Bioinformatics. 2003; 19(10):1236–42. doi:10.1093/bioinformatics/btg148.
 53
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2015. Available at https://www.Rproject.org/.
 54
Zang X, Chen M, Zhou Y, Xiao G, Xie Y, Wang X. Identifying cdkn3 gene expression as a prognostic biomarker in lung adenocarcinoma via metaanalysis. Cancer Inform. 2015; 14(Suppl 2):183–91. doi:10.4137/CIN.S17287.
 55
Fan C, Chen L, Huang Q, Shen T, Welsh EA, Teer JK, Cai J, Cress WD, Wu J. Overexpression of major cdkn3 transcripts is associated with poor survival in lung adenocarcinoma. Br J Cancer. 2015; 113(12):1735–43. doi:10.1038/bjc.2015.378.
 56
Tomida S, Takeuchi T, Shimada Y, Arima C, Matsuo K, Mitsudomi T, Yatabe Y, Takahashi T. Relapserelated molecular signature in lung adenocarcinomas identifies patients with dismal prognosis. J Clin Oncol. 2009; 27(17):2793–9. doi:10.1200/JCO.2008.19.7053.
 57
Jones MH, Virtanen C, Honjoh D, Miyoshi T, Satoh Y, Okumura S, Nakagawa K, Nomura H, Ishikawa Y. Two prognostically significant subtypes of highgrade lung neuroendocrine tumours independent of smallcell and largecell neuroendocrine carcinomas identified by gene expression profiles. The Lancet. 2004; 363(9411):775–81. doi:10.1016/S01406736(04)156936.
 58
Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, Fulton L, Fulton RS, Zhang Q, Wendl MC, Lawrence MS, Larson DE, Chen K, Dooling DJ, Sabo A, Hawes AC, Shen H, Jhangiani SN, Lewis LR, Hall O, Zhu Y, Mathew T, Ren Y, Yao J, Scherer SE, Clerc K, Metcalf GA, Ng B, Milosavljevic A, GonzalezGaray ML, Osborne JR, Meyer R, Shi X, Tang Y, Koboldt DC, Lin L, Abbott R, Miner TL, Pohl C, Fewell G, Haipek C, Schmidt H, DunfordShore BH, Kraja A, Crosby SD, Sawyer CS, Vickery T, Sander S, Robinson J, Winckler W, Baldwin J, Chirieac LR, Dutt A, Fennell T, Hanna M, Johnson BE, Onofrio RC, Thomas RK, Tonon G, Weir BA, Zhao X, Ziaugra L, Zody MC, Giordano T, Orringer MB, Roth JA, Spitz MR, Wistuba II, Ozenberger B, Good PJ, Chang AC, Beer DG, Watson MA, Ladanyi M, Broderick S, Yoshizawa A, Travis WD, Pao W, Province MA, Weinstock GM, Varmus HE, Gabriel SB, Lander ES, Gibbs RA, Meyerson M, Wilson RK. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008; 455(7216):1069–75.
 59
Kollareddy M, Zheleva D, Dzubak P, Brahmkshatriya PS, Lepsik M, Hajduch M. Aurora kinase inhibitors: progress towards the clinic. Invest New Drugs. 2012; 30(6):2411–32. doi:10.1007/s1063701297986.
Acknowledgements
We gratefully thank Jessie Norris for language editing of the manuscript.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science, ICT and Future Planning (NRF2015R1C1A1A02036312 and NRF20110030810), and the National Institutes of Health grants (1R01CA17221, 1R01GM117597, R15GM113157).
Availability of data and materials
The Escherichia coli dataset analyzed during the current study is available in the Many Microbe Microarrays database (M3D) [48], http://m3d.mssm.edu. The RegulonDB [49] dataset used for validation of the result is available at http://regulondb.ccg.unam.mx. The Lung Cancer Consortium study dataset analyzed during this study is included in the published article [51]. The information of cancer genes used during this study is available from the FoundationOne^{TM}, http://www.foundationone.com. The proposed method ESPACE is implemented the R package “espace”, which is available from https://sites.google.com/site/dhyeonyu/software.
Authors’ contributions
DY, JL and GX drafted the manuscript. DY and JL formulate the proposed model and performed simulation studies. GX performed the interpretation of the results in the application to the lung cancer adenocarcinoma dataset. GX designed preprocessing procedure in the realdata applications. FL and XW helped in the verification of the proposed model and revised the manuscript. All authors read and approved the final manuscript.
Competing interests
We have no financial or personal relationships with other people and organizations that cause conflict of interests. The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Yu, D., Lim, J., Wang, X. et al. Enhanced construction of gene regulatory networks using hub gene information. BMC Bioinformatics 18, 186 (2017). https://doi.org/10.1186/s1285901715761
Received:
Accepted:
Published:
Keywords
 Gene regulatory network
 Hub gene
 Partial correlation
 Sparse partial correlation estimation
 Escherichia coli
 Lung cancer