Enhanced construction of gene regulatory networks using hub gene information

Background Gene regulatory networks reveal how genes work together to carry out their biological functions. Reconstructions of gene networks from gene expression data greatly facilitate our understanding of underlying biological mechanisms and provide new opportunities for biomarker and drug discoveries. In gene networks, a gene that has many interactions with other genes is called a hub gene, which usually plays an essential role in gene regulation and biological processes. In this study, we developed a method for reconstructing gene networks using a partial correlation-based approach that incorporates prior information about hub genes. Through simulation studies and two real-data examples, we compare the performance in estimating the network structures between the existing methods and the proposed method. Results In simulation studies, we show that the proposed strategy reduces errors in estimating network structures compared to the existing methods. When applied to Escherichia coli, the regulation network constructed by our proposed ESPACE method is more consistent with current biological knowledge than the SPACE method. Furthermore, application of the proposed method in lung cancer has identified hub genes whose mRNA expression predicts cancer progress and patient response to treatment. Conclusions We have demonstrated that incorporating hub gene information in estimating network structures can improve the performance of the existing methods.


Background
A gene regulatory network (GRN) describes interactions and regulatory relationships among genes. It provides a systematic understanding of the molecular mechanisms underlying biological processes by revealing how genes work together to form modules that carry out cell functions [1][2][3][4]. In addition, the visualization of genetic dependencies through the GRN facilitates the systematic interpretation and comprehension of analysis results from genome-wide studies using high-throughput data. GRNs have proven valuable in a variety of contexts, including identifying druggable targets [5], detecting driver genes in diseases [6], and even optimizing prognostic and predictive signatures [7].
The sparse partial correlation estimation (SPACE) method, proposed by Peng et al. [18], considers a penalized regression approach to estimate edges in the GRN, which utilizes the sparse feature of the GRN. Comparative studies have shown that the SPACE method performs well in estimating sparse networks with high accuracy [24]. Peng et al. [18] also showed that the method was able to identify functional relevant molecular networks. In addition, recent studies of network analysis have revealed its advantage in detecting genes or modules associated with phenotypes [25][26][27].
In gene networks, genes that have many interactions with other genes are defined as hub gens. Because of these interactions, hub genes usually play an important role in a biological system. For example, transcription factor (TF), a protein that binds to specific DNA sequences, can regulate a given set of genes. In humans, approximately 10% of genes in the genome code for around 2600 TFs [28]. The combinatorial human TFs account for most of the regulation activities in the human genome, especially during the development stage. As a result, the genes that code TFs, called TF-encoding genes, are usually regarded as hub genes. Furthermore, in cancer research, cancer genes (oncogenes or tumor suppressor genes) take part in tumor genesis and are likely to be hub genes in the genetic networks of tumors [29,30]. Through decades of biological studies, knowledge on important genes (such as TFs or cancer genes) has been accumulated. Our hypothesis is that incorporating prior knowledge about hub genes can improve accuracy in estimating the gene network structure. It is worth noting that there is a reweighted 1 regularization method [31] that repeatedly estimates the structures and modifies the weights of the penalties by using the information on degrees from the previous estimation to encourage the appearance of the hubs. This method does not use the prior information obtainable from the other resources while our method uses additional information not contained in the observed dataset.
To explicitly account for the information on hub genes, we propose an extension of the SPACE method, which introduces an additional tuning parameter to open up the possibility of reducing penalization and increasing the likelihood of selecting the edges connected to such genes. We numerically show that the proposed method reduces errors in estimating network structures. Although we focus on extending the SPACE method in this paper, the idea can also be applied to penalized likelihood methods as well as to other penalized regression methods. Note that there is no rigorous definition of a hub in the context of a network; the definition of a hub varies depending on the sparsity of the network. For sparse protein networks, a hub is defined in [32] as a protein whose degree lies over the 0.95 quantile of the degree distribution or in [33] and [7] as a protein whose degree is greater than 7. In this paper, we conservatively define a hub as a node whose degree is both greater than 7 and above the 0.95 quantile of the degree distribution, because most nodes in sparse networks have relatively small degrees between 0 and 3.
In this study, we briefly introduce seven existing methods, including the SPACE and the graphical lasso, and propose the extended SPACE (ESPACE) method to incorporate the biological knowledge about important genes, i.e. network hubs. Moreover, it is worth noting that the ESPACE only incorporates the previously known biological information not contained in the observed dataset compared to the other existing methods. Through simulation studies, we show that the proposed approach reduces error in estimating the network structures compared to the seven other existing methods that we reviewed in the "Methods" section. Finally, we demonstrate the improvement of the ESPACE method compared to the SPACE method with two real-data examples.

Review of existing methods
Here, we briefly review the existing methods; the GeneNet [34], the NS [17], the GLASSO [15], the GLASSO-SF [31], the PCACMI [21], the CMI2NI [22], and the SPACE [18]. Let X k i be the expression level of the ith gene of the kth array for i = 1, 2, . . . , p and k = 1, 2, . . . , n. Let X i = X 1 i , X 2 i , . . . , X n i T so that observed gene expression data can be denoted by an n × p matrix X = (X 1 ; X 2 ; . . . ; X p ) whose rows and columns denote arrays and genes, respectively. Suppose row vectors X k = X k 1 , X k 2 , . . . , X k p for k = 1, 2, . . . , n are independently and identically distributed random vectors from the multivariate normal distribution with mean 0 and covariance matrix . We assume that is positive definite, and let ≡ −1 = ω ij 1≤i,j≤p be the inverse of the covariance matrix , which is referred to as a concentration matrix or a precision matrix.

GeneNet
Schäfer and Strimmer [34] propose the linear shrinkage estimator for a covariance matrix and the Gaussian graphical model (GGM) selection based on the partial correlation obtained from their shrinkage estimator. With multiple testing procedure using the local false discovery rate [35], the GGM selection controls the false discovery rate under a pre-determined level α. Since Schäfer and Strimmer [34] provide their GGM selection procedure in the R package GeneNet, we denote their GGM selection procedure as GeneNet in this paper. To be specific, one of the most commonly used linear shrinkage estimators S * for the covariance matrix is where S = (s ij ) 1≤i,j≤p is the sample covariance matrix, T = diag(s 11 , s 22 , . . . , s pp ) is the shrinkage target matrix, and λ * = i =j Var(s ij )/ i =j s 2 ij is the optimal shrinkage intensity. With this estimator S * , the matrix of the partial correlations P = (ρ ij ) 1≤i,j≤p is defined aŝ To identify the significant edges, Schäfer and Strimmer [34] suppose the distribution of the partial correlations as the mixture where f 0 is the null distribution, f 1 is the alternative distribution corresponding to the true edges, and η 0 is the unknown mixing parameter. Using the algorithm in [35], GeneNet identifies significant edges that have the local false positive rate is the density of the Beta distribution and ν is the reciprocal variance of the null ρ.

Neighborhood selection (NS)
Meinshausen and Bühlmann [17] propose the neighborhood selection (NS) method, which separately solves the lasso [36] problem and identifies edges with nonzero estimated regression coefficients for each node. Meinshausen and Bühlmann [17] prove that the NS method is asymptotically consistent in identifying the neighborhood of each node when the neighborhood stability condition is satisfied. Note that the neighborhood stability condition is related to the irrepresentable condition in linear model literature [37]. To be specific, for each node i ∈ V = {1, 2, . . . , p}, NS solves the following lasso problem Since NS separately solves p lasso problems, contradictory edges may occur when we define the total edge set E λ = ∪ p i=1 E λ i , i.e.,β i,λ k = 0 and β k,λ i = 0. To avoid these contradictory edges, NS suggests two types of edge sets E λ,∧ and E λ,∨ defined as follows: Meinshausen and Bühlmann [17] mentioned these two edge sets have only small differences in their experience and the differences vanish asymptotically. Meinshausen and Bühlmann [17] also propose the choice of the tuning parameter λ i (α) for the ith node where˜ = 1 − and is the distribution function of the standard normal distribution. With this choice of λ i (α) for i = 1, 2, . . . , p, the probability of falsely identifying edges in the network is bounded by the level α. Note that we estimate the edge set with E λ,∧ and solve the lasso problems using the R package CDLasso proposed by [38] in this paper.

Graphical lasso (GLASSO)
Friedman et al. [15] propose the graphical lasso method that estimates a sparse inverse covariance matrix by maximizing the 1 penalized log-likelihood  [39], Friedman et al. [15] show that the solutionˆ of (1) is equivalent to the inverse of W whose partitioned entity w 12 satisfies w 12 = W 11 β * , where β * is the solution of the lasso problem min Based on the above property, the graphical lasso sets the diagonal elements w ii = s ii + ρ and obtains the off-diagonal elements of W by repeatedly applying the following two steps: 1. Permuting the columns and rows to locate the target elements at the position of w 12 . 2. Finding the solution w 12 = W 11 β * by solving the lasso problem (2).
until convergence occurs. After finding W, the estimatê is obtained from the relationships ω 12 = −βω 22 and This graphical lasso algorithm was proposed in [15] and had its computational efficiency improved in [16] and [40]. Witten et al. [16] provide the R package glasso version 1.7.

GLASSO with reweighted strategy for scale-free network (GLASSO-SF)
Liu and Ihler [31] propose the reweighted 1 regularization method to improve the performance of the estimation for the scale-free network whose degrees follows the power law distribution. Motivated by the fact that the existing methods work poorly for the scale-free networks, Liu and Ihler [31] consider changing the 1 norm penalty in the existing methods to the power law regularization where λ and γ are nonnegative tuning parameters, ω −i = {ω ij | j = i}, ω −i 1 = j =i |ω ij |, and i is a small positive number for i = 1, 2, . . . , p. Thus, Liu and Ihler [31] consider optimizing the following objective function where L(X, ) denotes the objective function of the existing method without its penalty terms, u L = 1 if L is convex and u L = −1 if L is concave for . Note that the choice of L is flexible. For instance, L(X, ) can be the loglikelihood function of as in the graphical lasso or the squared loss function as in the NS and the SPACE. In this section, we suppose that L is concave for the purpose of notational simplicity.
To obtain the maximizer of f ( ; X, λ, γ ), Liu and Ihler [31] propose the iteratively reweighted 1 regularization procedure based on the minorization-maximization (MM) algorithm [41]. The reweighted procedure iteratively solves the following problem: is the estimate at the kth iteration, . In practice, [31] suggest i = 1, γ = 2λ/ i , and the initial estimate (0) = I p , where I p is the p-dimensional identity matrix. Note that this reweighted strategy facilitates to estimate the hub nodes by adjusting weights in the penalty term but weights are updated by solely using the observed dataset without previously known information from other literatures.
In this paper, we consider L(X, ) = log | | − tr(S ), which is the same to the component in the objective function of the GLASSO. Thus, we call this procedure as the GLASSO with a reweighted strategy for the scale-free network (GLASSO-SF). As applied in [31], we stop the reweighting iteration after 5 iterations. The R package glasso version 1.7 is used to obtain the solution of (5) at each iteration with the penalty matrix E (k) = (e

Path consistency algorithm based on conditional mutual information (PCACMI)
Mutual information (MI) is a widely used measure of dependency between variables in information theory. MI even measures non-linear dependency between variables and can be considered as a generalization of the correlation. Several mutual information (MI) based methods have been developed such as ARACNE [20], CLR [42], and minet [43]. However, similar to the correlation, MI only measures pairwise dependency between two variables. Thus, it usually identifies many undirected interactions between variables. To resolve this difficulty, Zhang et al. [21] propose the information theoretic method for reconstruction of the gene regulatory networks based on the conditional mutual information (CMI).
To be specific, let H(X) and H(X, Y ) be the entropy of a random variable X and the joint entropy of random variables X and Y, respectively. For two random variables X and Y, H(X) and H(X, Y ) can be expressed as where f X (x) is the marginal probability density function (PDF) of X and f XY (x, y) is the joint PDF of X and Y. With these notations, MI is defined as It is known that MI measures dependency between two variables that contain both directed dependency and indirected dependency through other variables. While MI can not distinguish directed and indirected dependency, CMI can measure directed dependency between two variables by conditioning on other variables. CMI for X and Y given Z is defined as To estimate the entropies in (7), Zhang et al. [21] consider the Gaussian kernel density estimator used in [19]. Using the Gaussian kernel density estimator, MI and CMI are defined as where |A| is the determinant of a matrix A, C(X), C(Y ) and C(Z) are the variances of X, Y and Z, respectively, and C(X, Z), C(Y , Z) and C(X, Y , Z) are the covariance matrices of (X, Z), (Y , Z) and (X, Y , Z), respectively. To efficiently identify dependent pairs of variables, Zhang et al. [21] adopt the path consistency algorithm (PCA) in [44]. Thus, the authors called their procedure as PCA based on CMI (PCACMI). The PCACMI method sets L = 0 and calculates with L-order CMI, which is equivalent to MI if L = 0. Then, PCACMI removes the pairs of variables such that the maximal CMI of two variables given L + 1 adjacent variables is less than a given threshold α, where α determines whether two variables are independent or not and adjacent variables denote variables connected to the two target variables in PCACMI at the previous step. PCACMI repeats the above steps until there is no higher order connection. The MATLAB code for PCACMI is provided by [21] at the author's website https://sites.google.com/site/ xiujunzhangcsb/software/pca-cmi.

Conditional mutual inclusive information-based network inference (CMI2NI)
Recently, Zhang et al. [22] proposed the conditional mutual inclusive information-based network inference (CMI2NI) method that improves the PCACMI method [21]. CMI2NI considers the Kullback-Leibler divergences from the joint probability density function (PDF) of target variables to the interventional PDFs removing the dependency between two variables of interest. Instead of using CMI, CMI2NI uses the conditional mutual inclusive information (CMI2) as the measure of dependency between two variables of interest given other variables. To be specific, we consider three random variables X, Y and Z. For these three random variables, the CMI2 between X and Y given Z is defined as where D KL (f ||g) is the Kullback-Leibler divergence from f to g, P is the joint PDF of X, Y and Z, and P X→Y is the interventional probability of X, Y and Z for removing the connection from X to Y.
With Gaussian assumption on the observed data, the CMI2 for two random variables X and Y given mdimensional vector Z can be expressed as where is the covariance matrix of (X, Y , Z T ) T ,˜ is the covariance matrix of (Y , X, Z T ) T , XZ is the covariance matrix of (X, Z T ) T , YZ is the covariance matrix of (Y , Z T ) T , n = m + 2, and C,C, C 0 andC 0 are defined with the elements of , XZ , YZ , −1 , −1 XZ and −1 YZ (see Theorem 1 in [22] for details). As applied in PCACMI, CMI2NI adopts the path consistency algorithm (PCA) to efficiently calculate the CMI2 estimates. All steps of the PCA in CMI2NI are the same as one of PCACMI if we change the CMI to the CMI2. In the PCA steps of CMI2NI, two variables are regarded as independent if the corresponding CMI2 estimate is less than a given threshold α. The MATLAB code for CMI2NI is available at the author's website https://sites.google.com/site/ xiujunzhangcsb/software/cmi2ni.

Sparse partial correlation estimation (SPACE)
In the Gaussian graphical models [45], the conditional dependencies among p variables can be represented by a graph G = (V , E), where V = 1, 2, . . . , p is a set of nodes representing p variables and E = i, j ω ij = 0, 1 ≤ i = j ≤ p is a set of edges corresponding to the nonzero off-diagonal elements of .
To describe the SPACE method, we consider linear models such that for i = 1, 2, . . . , p, where i is an n-dimensional random vector from the multivariate normal distribution with mean 0 and covariance matrix (1/ω ii )I n , and I n is an identity matrix with size of n × n. Under normality, the regression coefficients β ij s can be replaced with the partial correlations ρ ij s by the relationship where ii ω jj is a partial correlation between X i and X j . Motivated by the relationship (12), Peng et al. [18] propose the SPACE method for solving the following 1 -regularized problem: where w i is a nonnegative weight for the i-th squared error loss.

Proposed approach incorporating previously known hub information Extended sparse partial correlation estimation (ESPACE)
In this paper, we assume that some genes (or nodes), which are referred to as hub genes (or hub nodes), regulate many other genes, and we also assume that many of these hub genes were identified from previous experiments. To incorporate information about hub nodes, we propose the extended SPACE (ESPACE) method, which extends the model space by using an additional tuning parameter α on edges connected to the given hub nodes. This additional tuning parameter can reflect the hub gene information by reducing the penalty on edges connected to hub nodes. To be specific, let H be the set of hub nodes that were previously identified. The ESPACE method we propose solves where 0 < α ≤ 1. Note that we consider the weights w i s for the squared error loss as one in this paper. To summarize the process of the proposed method, we depict the flowchart of the ESPACE method in Fig. 1. As described in Fig. 1, the ESPACE has the prior knowledge about hub genes as an additional input, which is the novelty of the proposed method compared to the other existing methods.

Extended graphical lasso (EGLASSO)
In the Background, we mentioned the proposed procedure is applicable to other methods such as the graphical lasso. For the purpose of fair comparison and the investigation of the performance, we also applied the proposed strategy to the GLASSO, which is the GLASSO incorporating the hub gene information. We call this procedure the extended graphical lasso (EGLASSO). Similar to the ESPACE, the EGLASSO maximizes (15) where λ ≥ 0 and 0 < α ≤ 1 are two tuning parameters, S is the sample covariance matrix, tr(A) is the trace of A and H is the set of hub nodes that were previously identified. Note that we can use the R package glasso version 1.7 for the EGLASSO by defining the penalty matrix corresponding to the penalty term in (15).
Step 5: Repeat Step 4 until convergence occurs on the active set .
Step 6: Update ρ (m+1) for 1 ≤ i < j ≤ p by using the equations in Step 4. If the maximum difference between ρ (m+1) and ρ (m) is less than a pre-determined tolerance τ , then go to Step 7 with the estimates ρ (m+1) . Otherwise, consider m = m + 1 and go back to Step 3.
Note that the number of iterations of ω ii s is usually small for stabilization of the estimates of ρ. In our numerical study, the estimates of ω ii s converge within 10 iterations. Moreover, the inner products such as Y T X i,j , whose complexity is O(np), can efficiently be computed by rewriting We implemented the R package espace, which is available from https://sites.google. com/site/dhyeonyu/software.

Choice of tuning parameters
We have introduced the ESPACE method, which relaxes the penalty on edges connected to the hub genes (i.e., α < 1) but uses the same penalty on edges connected to nonhub gene (i.e., α = 1). When no hub genes are involved in a network, ESPACE is reduced to SPACE. For a given λ, this modification allows us to find more edges connected to the hub genes by reducing α. In practice, however, we do not know the values of λ and α. In this paper, we consider the GIC-type criterion used in [46] for the Gaussian graphical model to choose the optimal tuning parameters (λ * , α * ). Let ρ ij (λ,α) be the (i, j)-th estimate of partial correlation for given λ and α. The GIC-type criterion is defined as and |A| denotes a cardinality of a set A. We choose the tuning parameters which minimize the GIC-type criterion,

Simulation studies Simulation settings
In this simulation, we consider four real protein-protein interaction (PPI) networks used in a comparative study [24], which were partially selected from the human protein reference database [47]. As mentioned earlier, genes whose degrees are greater than 7 and above the 0.95 quantile of the degree distribution are thought of as hub genes. Figure 2 shows the four PPI networks and their hub genes. Let p be the number of nodes in a network. We consider the number of samples as p/2 and p and a b c d Fig. 2 The network structures of the four simulated networks. The structure of the real protein-protein interaction networks [47] were used to construct networks of different sizes by varying the number of references required to support each connection. In the degree distribution, the 0.95 quantile is 7 (connections), so the nodes with more than 7 connections were defined as hub nodes, which are represented as black nodes in the network structure. a 52 edges among 44 nodes (3 hubs), b 103 edges among 83 nodes (3 hubs), c 290 edges among 231 nodes (8 hubs) and d 837 edges among 612 nodes (33 hubs) generate samples from the multivariate normal distribution with mean 0 and covariance matrix defined with where is a concentration matrix corresponding to a given network structure.
To generate a positive definite concentration matrix, we use the following procedure as described in [18]: Step G1: For a given edge set E, we generate an initial concentration matrix˜ = (ω ij ) 1≤i,j≤p with Step G2: For positive definiteness and symmetry of the concentration matrix, we define a concentration matrix = (ω ij ) 1≤i,j≤p as Step G3: With these four networks, we have conducted the numerical comparisons of the ESPACE and the SPACE methods, as well as seven other methods including the other reviewed existing methods and EGLASSO. For the purpose of fair comparison, we select the optimal model by the GIC for SPACE, ESPACE, GLASSO, GLASSO-SF, and EGLASSO. Since there is no specific rule for the model selection in the other methods, we set the level α = 0.2 for GeneNet and NS, and the threshold α = 0.03 for PCACMI and CMI2NI. Note that the pre-determined level α = 0.2 is a default of the GeneNet package and used in [35]. The pre-determined threshold α = 0.03 was used in [21,22].
Note that all the existing methods need O(p 2 ) memory space to store and calculate values corresponding to the interactions between variables. We can reduce this memory consumption when the whole variables can be divided into several conditionally independent blocks by using the condition described in [16].

Sensitivity analysis on random noise in the observed data
To investigate the effect of the random noise contained in the observed data, we consider sensitivity analysis for the variance of the random noise. To be specific, suppose that a random vector X = (X 1 , X 2 , . . . , X p ) T follows the multivariate normal distribution with mean 0 and covariance matrix , a vector of random noise ε = ( 1 , 2 , . . . , p ) T follows the multivariate normal distribution with mean 0 and covariance matrix σ 2 I, and X and ε are independent, where I is the identity matrix. Furthermore, we assume that an observed random vector Z = (Z 1 , Z 2 , . . . , Z p ) T such that Thus, the covariance matrix of Z becomes + σ 2 I, which may have a different conditional dependent structure to one of X.
Thus, we can see that Z 1 and Z 3 are conditionally dependent given Z 2 while X 1 and X 3 are conditionally independent given X 2 . Moreover, the nonzero partial correlations decrease when the variance of the random noises increases. From these observations, the performance of the estimation becomes worse if the variance of the random noise increases.
In this sensitivity analysis, we consider σ 2 = 0, 0.01, 0.1, 0.25, 0.5 and p = 231 and n = 115, 231 with the same network structure as the one of p = 231 in Fig. 2. To focus on the proposed method, we apply the SPACE and the ESPACE methods to the 50 generated datasets containing random noise having variance σ 2 .

Performance measures
To investigate the gains from the extension, we use five performance measures: sensitivity (SEN), specificity (SPE), false discovery rate (FDR), mis-specification rate (MISR) and Matthews correlation coefficients (MCC). Note that the MCC, which lies between −1 and +1, has been used to measure the performance of binary classification, where +1, 0, and −1 denote a perfect classification, a random classification, and a total discordance of classification, respectively. Let ρ and ρ λ,α be (p(p − 1)/2)dimensional vectors of the true and estimated partial

Application to Escherichia coli dataset
We applied the ESPACE method to the largest public Escherichia coli (E.coli) microarray dataset available from the Many Microbe Microarrays database (M3D) [48]. The M3D contains 907 microarrays measured under 466 experimental conditions using Affymetrix GeneChip E.coli genome arrays. Microarrays from the same experimental conditions were averaged to derive the mean expression. The data set ("E_coli_v4_Build_6" from the M3D) contains the expression levels of 4297 genes from 446 samples. In the E.coli genome, a number of studies have been conducted to identify transcriptional regulations. The RegulonDB [49] curates the largest and bestknown information on the transcriptional regulation of E.coli. To combine the information from the above two databases, we focus on the 1623 genes reported in both the M3D and the RegulonDB. As mentioned before, the TFs are known to regulate many other genes in the genome and can be considered potential hubs. To incorporate the information about the potential hubs, we used a list of 180 known TF-encoding genes from the RegulonDB. The RegulonDB also provides 3811 transcriptional interactions among the 1623 genes, which were used as the gold standard to evaluate the accuracy of the constructed networks.

Application to lung cancer adenocarcinoma dataset
Lung cancer is the leading cause of death from cancer, both in the United States and worldwide; it has a 5-year survival rate of approximately 15% [50]. The progression and metastasis of lung cancer varies greatly among early stage lung cancer patients. To customize treatment plans for individual patients, it is important to identify prognostic or predictive biomarkers, which allows for more precise classification of lung cancer patients. In this study, we applied the extended SPACE method to reconstruct the gene regulatory network in lung cancer. Exploring network structures can facilitate comprehension of biological mechanisms underlying lung cancer and identification of important genes that could be potential lung cancer biomarkers. We constructed the gene network using microarray data from 442 lung cancer adenocarcinoma patients in the Lung Cancer Consortium study [51]. For detail about preprocessing this dataset, please refer to [7]. First, univariate Cox regression was used to identify the genes whose expression levels are correlated with patient survival outcomes, after adjusting for clinical factors such as study site, age, gender, and stage. The false discovery rate (FDR) was then calculated using a Beta-Uniform model [52]. By controlling the FDR to less than 10%, we identified 794 genes that were associated with the survival outcome of lung cancer patients. Among these 794 genes, 22 were found to appear among the 236 carefully curated cancer genes of the FoundationOne TM gene panel (Foundation Medicine, Inc.). Current biological knowledge indicates genes from this panel play a key role in various types of cancer. These 22 genes were then input as known hub genes to the ESPACE method.

Comparison results for existing methods
For each network, we generated 50 datasets and reconstructed the network from each dataset using nine different network construction methods, including both the SPACE and the ESPACE methods. In addition to the five performance measures, we also measure the computation time (Time) of each method to compare the efficiency. Note that all methods are executed on R software [53] for the purpose of fair comparison. We implemented the R codes for PCACMI and CMI2NI using the authors' MAT-LAB codes. The computation times are measured in CPU time (seconds) by using a desktop PC (Intel Core(TM) i7-4790K CPU (4.00 GHz) and 32 GB RAM).
Tables 1, 2, 3 and 4 report the averages and standard errors of the number of the estimated edges, the five performance measures of the estimation of the network structures and computation times with the optimal tuning parameter λ * for SPACE, GLASSO, GLASSO-SF; the optimal tuning parameters α * and λ * for ESPACE and EGLASSO; and the pre-determined α for GeneNet, NS, PCACMI, and CMI2NI.
Overall, ESPACE has the best performance in estimating network structures in terms of the MCC and the MISR except for the case (p, n) = (83, 41), where ESPACE has the second smallest FDR while the MCC and the MISR of ESPACE show the moderate performance among all methods. In the case (p, n) = (83, 41), the CMI-based methods have better performance than the others in terms of the MCC and the MISR, but the CMI-based methods also have the large FDRs (≈ 41%) more than double of those of the other methods. As we described in the Methods Section, the MCC has been used to measure the performance of binary classification and the MISR denotes the total error rate. Thus, this comparison results show that ESPACE is favorable for the identification of edges for the networks with high-dimensional data.    In addition, we made several interesting observations from the results of the our simulation study. First, ESPACE and EGLASSO improve SPACE and GLASSO in terms of the FDR, the MISR, and the MCC for almost scenarios, respectively. The only exception is the case (p, n) = (231, 115) for the ESPACE and the SPACE methods. In this case, although the FDR of ESPACE increases 2.17% compared to one of SPACE, ESPACE still improves SPACE in terms of the SEN, the MISR, and the MCC. This suggests that our proposed strategy, which incorporates the previously known hub information, can reduce the errors in estimating network structures compared to the existing method without considering known hub information. Second, GeneNet controls the FDR relatively close to the given level α while the FDRs of NS are controlled conservatively. For instance, the FDRs of GeneNet are measured between 3.94 and 22.24% and NS has the FDRs less than 3.48%. Note that GeneNet and NS control the FDR under 20% (α = 0.2) in this simulation study. Third, all methods except the CMI-based methods (the PCACMI and the CMI2NI) have similar efficiency for the relatively low dimensions (p = 44, 83). The CMI-based methods are relatively slower than the other methods for all the scenarios except for the case (p, n) = (612, 612), where GLASSO-SF is the slowest and 1.4 times slower than CMI2NI. CMI2NI is slightly slower than PCACMI for the relatively high dimensions (p = 231, 612). Finally, even though ESPACE is not the fastest method among the nine methods we consider, there is no overall winner and ESPACE is the third best in terms of the computation time for p = 231, 612 except for the case (p, n) = (612, 612) where ESPACE is faster than SPACE, GLASSO-SF, PCACMI and CMI2NI.
To investigate the other property of the proposed approach, we depict barplots of the averages of degrees of known hub genes over 50 datasets for ESPACE and SPACE in Fig. 3. Figure 3 shows that ESPACE tends to find more edges connected to known hub genes than SPACE. The only exception is the case (p, n) = (612, 306), where the average by the ESPACE is 0.57 less than that by SPACE. We conjecture this is simply due the difference in the number of estimated edges, which by ESPACE is 15.54 less than that of SPACE on average. This property is due to the result that the averages of α * selected by the GIC in the ESPACE method lie between 0.76 and 0.97 for all the scenarios, which indicates that ESPACE has incorporated prior information about the hub genes and reduced the penalty on edges connected to known hub genes. Table 5 reports the averages of the number of the estimated edges and the five performance measures. From the results in Table 5, we can see that the performance of estimation decreases when the variance of the random noises increases in both the SPACE and the ESPACE. For a relatively small sample size (n = 115), both the SPACE and the ESPACE are more sensitive to the variance σ 2 compared to the case of n = 231. Even though the performance of two methods decreases by similar amounts as the variance σ 2 increases, the ESPACE has better performance than the SPACE in terms of the MCC and the MISR.

Comparison of the identified GRNs in Escherichia coli dataset
In this study, we compared the performance of network construction using the SPACE and the ESPACE methods   Figure 4 shows the number of TPs vs. the number of estimated edges for various λ values with α * in Table 6. The number of TPs of the ESPACE method is consistently greater than those of the SPACE method at similar sparsity. These results clearly indicate that incorporating potential hub gene information improves the accuracy of network construction.

Comparison of the identified GRNs in lung cancer adenocarcinoma dataset
We again compared the performances of network construction using the SPACE and the ESPACE methods. An overview of the networks constructed using both methods is shown in Fig. 5. The SPACE method estimated 234 edges between 114 genes and the ESPACE method found 272 edges between 132 genes. Although the numbers of estimated edges from both the SPACE and ESPACE methods are quite similar, 16.7 and 28.3% of the estimated edges in networks by the SPACE and the ESPACE are different, respectively. We identified hub genes using the criterion mentioned at the beginning of this paper. The lists of hub genes identified in both networks are reported in Table 7. Interestingly, all hub genes identified by the SPACE method were also found using ESPACE. Note that this is not the usual case. For instance, if we define a hub as a node whose degree is greater than 5, the set of hub genes identified by the SPACE is not a subset of the hub genes identified by the ESPACE. To investigate the gains of the ESPACE method, therefore, we focused on the hub genes identified only by ESPACE (AURKA, APC, CDKN3), among which, AURKA and APC are among the 22 pre-specidifed hub genes while CDKN3 is not.  Fig. 4 Plot of the number of the TPs vs. the number of the estimated edges for various λs with α * in Table 6 The CDKN3 (Cyclin-Dependent Kinase Inhibitor 3) protein coded by the CDKN3 gene is a cyclin-dependent kinase inhibitor. Recent studies [54,55] show that CDKN3 overexpression was associated with poorer survival outcomes in lung adenocarcinoma, but not in lung squamous cell carcinoma. We validated that CDKN3 is associated with the prognosis of lung adenocarcinoma patients in two independent datasets (see Fig. 6). The CDKN3 expression allowed us to separate the lung adenocarcinoma patients into high CDKN3 and low CDKN3 groups with significantly different survival outcomes: in the GSE13213 dataset [56] (n = 117), hazard ratio = 2.02 (high CDKN3 vs. low CDKN3), p=0.0146; in the GSE1037 dataset [57] (n = 61), hazard ratio= 3.39 (high CDKN3 vs. low CDKN3), p=0.0126. Note that we divided patients into "high" and "low" groups by their gene expression levels with the K-means clustering method.
APC (Adenomatous Polyposis Coli) is a tumor suppressor gene, and is involved in the Wnt signaling pathway as a negative regulator. It has been identified as one of the key mutation genes in lung adenocarcinoma by a comprehensive study on the somatic mutations in lung adenocarcinoma [58]. AURKA (aurora kinase A) is a protein-coding gene found to be associated with many different types of cancer. Aurora kinase inhibitors have been studied as a potential cancer treatment [59]. Using the GSE42127 dataset [7] (n = 209), we found that AURKA expression can predict lung cancer patients' response to chemotherapy. The dataset contains expression profiles and treatment information for 209 lung cancer patients from MD Anderson Cancer Center, among whom 62 received adjuvant chemotherapy (ACT group) and the remaining 147 did not (no ACT group). The AURKA gene expression allowed us to separate the 209 patients into a low AURKA group (n = 104) and high AURKA group Estimated networks structure using the SPACE and ESPACE methods. The nodes with more than 7 connections (the 0.95 quantile in the degree distribution) were defined as hub nodes, which are represented as black nodes in the network structure. The details of hub genes are reported in Table 7. a SPACE (114 nodes, 234 edges) and b ESPACE (132 nodes, 272 edges) (n = 105) using the median AURKA expression as a cut-off. The patients in the low AURKA group (Fig. 7a) showed significant improvement in survival after ACT: hazard ratio = 0.289 (ACT vs. no ACT) and p value = 0.0312. The patients in the high AURKA group (Fig. 7b), on the other hand, showed no significant survival benefit after ACT: hazard ratio = 0.679 (ACT vs. no ACT) and p Bold font denotes the genes only identified by the modified method value = 0.241. These results indicate that AURKA expression could potentially be a predictive biomarker for lung cancer adjuvant chemotherapy, since only patients with low AURKA expression benefit from the treatment, while those with high AURKA expression are less likely to benefit. In addition, it is possible that Aurora kinase inhibitors, which suppress the expression of AURKA genes, may synergize the effect of adjuvant chemotherary, i.e. improve the chance that a patient responds to adjuvant chemotherapy. In fact, a recent study has demonstrated that Aurora kinase inhibitors may synergize the effect of adjuvant chemotherapy in ovarian cancer, which is consistent with our results in lung cancer.

Conclusions
We have demonstrated incorporating hub gene information in estimating network structures by extending SPACE with an additional tuning parameter. Our simulation study Fig. 6 Kaplan-Meier curves for the CDKN3 gene from GSE13213 and GSE1037 datasets. For each gene, we divide patients into two groups, "High" and "Low", by their gene expression levels with the K-means clustering method. Red solid lines denote the "High" group and black dashed lines denote the "Low" group. a CDKN3 (GSE13213 dataset) and b CDKN3 (GSE1037 dataset) shows that the ESPACE method reduces errors in the construction of networks when the networks have previouslyknown hub nodes. Through two applications, we illustrate that the ESPACE method can improve the SPACE method by using the information about the potential hub genes. Although we adopted the GIC to select the optimal tuning parameters in this paper, the ESPACE method can directly be applied with other model selection criteria. The performance of the ESPACE method varies with the chosen criterion. However, the performance of the ESPACE method is at least comparable to the SPACE method since the ESPACE includes the SPACE as a reduced case.