Review of existing methods
Here, we briefly review the existing methods: the GeneNet [34], the NS [17], the GLASSO [15], the GLASSO-SF [31], the PCACMI [21], the CMI2NI [22], and the SPACE [18]. Let \(X_{i}^{k}\) be the expression level of the ith gene of the kth array for i=1,2,…,p and k=1,2,…,n. Let \(\mathbf {X}_{i} = \left (X_{i}^{1}, X_{i}^{2}, \ldots, X_{i}^{n}\right)^{T}\) so that the observed gene expression data can be denoted by an n×p matrix \(\mathbf{X}=(\mathbf{X}_{1};\mathbf{X}_{2};\ldots;\mathbf{X}_{p})\) whose rows and columns denote arrays and genes, respectively. Suppose the row vectors \(\mathbf {X}^{k}=\left (X_{1}^{k},X_{2}^{k},\ldots,X_{p}^{k}\right)\) for k=1,2,…,n are independently and identically distributed random vectors from the multivariate normal distribution with mean 0 and covariance matrix Σ. We assume that Σ is positive definite, and let \(\Omega \equiv \Sigma^{-1} = (\omega_{ij})_{1 \le i,j \le p}\) be the inverse of the covariance matrix Σ, which is referred to as a concentration matrix or a precision matrix.
GeneNet
Schäfer and Strimmer [34] propose the linear shrinkage estimator for a covariance matrix and the Gaussian graphical model (GGM) selection based on the partial correlation obtained from their shrinkage estimator. With a multiple testing procedure using the local false discovery rate [35], the GGM selection controls the false discovery rate under a pre-determined level α. Since Schäfer and Strimmer [34] provide their GGM selection procedure in the R package GeneNet, we refer to their GGM selection procedure as GeneNet in this paper. To be specific, one of the most commonly used linear shrinkage estimators \(S^{*}\) for the covariance matrix Σ is
$$ S^{*} = \lambda^{*} T + (1-\lambda^{*}) S, $$
where \(S=(s_{ij})_{1 \le i,j \le p}\) is the sample covariance matrix, \(T=\text{diag}(s_{11},s_{22},\ldots,s_{pp})\) is the shrinkage target matrix, and \(\lambda ^{*} = {\sum \nolimits }_{i\neq j} \widehat {\text {Var}} (s_{ij}) / \left ({\sum \nolimits }_{i\neq j} s_{ij}^{2}\right)\) is the optimal shrinkage intensity. With this estimator \(S^{*}\), the matrix of the partial correlations \(P = (\hat {\rho }^{ij})_{1 \le i,j \le p}\) is defined as \(\hat {\rho }^{ij} = -\hat {\omega }_{ij} / \sqrt {\hat {\omega }_{ii} \hat {\omega }_{jj}}\), where \(\hat {\Omega } = (\hat {\omega }_{ij})_{1 \le i,j \le p} = \left (S^{*}\right)^{-1}\).
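For illustration, the shrinkage step and the implied partial correlations can be sketched in a few lines of R. This is a didactic reimplementation under the notation above, not the GeneNet/corpcor code; the function name shrink_cov and the empirical estimate of \(\widehat{\text{Var}}(s_{ij})\) are our own choices.

shrink_cov <- function(x) {
  n  <- nrow(x); p <- ncol(x)
  xc <- scale(x, center = TRUE, scale = FALSE)
  S  <- cov(x)                                  # sample covariance matrix
  var_s <- matrix(0, p, p)                      # empirical estimate of Var(s_ij)
  for (i in 1:p) for (j in 1:p) {
    w <- xc[, i] * xc[, j]
    var_s[i, j] <- n / (n - 1)^3 * sum((w - mean(w))^2)
  }
  off    <- row(S) != col(S)
  lambda <- min(1, max(0, sum(var_s[off]) / sum(S[off]^2)))
  Sstar  <- lambda * diag(diag(S)) + (1 - lambda) * S   # S* = lambda*T + (1-lambda)*S
  Omega  <- solve(Sstar)
  P <- -Omega / sqrt(diag(Omega) %o% diag(Omega))       # partial correlations
  diag(P) <- 1
  list(Sstar = Sstar, pcor = P, lambda = lambda)
}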
To identify the significant edges, Schäfer and Strimmer [34] model the distribution of the partial correlations as the mixture
$$f(\rho) = \eta_{0} f_{0} (\rho,\nu) + (1-\eta_{0}) f_{1}(\rho), $$
where \(f_{0}\) is the null distribution, \(f_{1}\) is the alternative distribution corresponding to the true edges, and \(\eta_{0}\) is the unknown mixing parameter. Using the algorithm in [35], GeneNet identifies as significant the edges whose local false discovery rate
$$\text{fdr}(\rho) = \frac{\hat{\eta}_{0} f_{0} (\rho, \hat{\nu})}{\hat{f}(\rho)} $$
is smaller than the pre-determined level α, where \(f_{0}(\rho,\nu)=|\rho|\,\text{Be}(\rho^{2};0.5,(\nu-1)/2)\), Be(x;a,b) is the density of the Beta distribution, and ν is the reciprocal variance of the null ρ.
Neighborhood selection (NS)
Meinshausen and Bühlmann [17] propose the neighborhood selection (NS) method, which separately solves the lasso [36] problem and identifies edges with nonzero estimated regression coefficients for each node. Meinshausen and Bühlmann [17] prove that the NS method is asymptotically consistent in identifying the neighborhood of each node when the neighborhood stability condition is satisfied. Note that the neighborhood stability condition is related to the irrepresentable condition in linear model literature [37].
To be specific, for each node i∈V={1,2,…,p}, NS solves the following lasso problem
$$\hat{\beta}^{i,\lambda} = \operatornamewithlimits{arg\min}_{\beta\in \mathbb{R}^{p}: \beta_{i} = 0} ~\frac{1}{2} \| \mathbf{X}_{i} - \mathbf{X}\beta\|_{2}^{2} + \lambda \|\beta\|_{1}, $$
where \(\|\mathbf {x}\|_{2}^{2} = {\sum \nolimits }_{i=1}^{p} x_{i}^{2}\) and \(\|\mathbf {x}\|_{1} = {\sum \nolimits }_{i=1}^{p} |x_{i}|\) for \(\mathbf {x} \in \mathbb {R}^{p}\). With the estimate \(\hat {\beta }^{i,\lambda }\), NS identifies the neighborhood of the node i as \(N_{i}(\lambda) = \{ k~|~ \hat {\beta }_{k}^{i,\lambda } \neq 0\}\), which defines an edge set \(E_{i}^{\lambda } = \{(i,j)~|~ j \in N_{i}(\lambda)\}\). Since NS separately solves p lasso problems, contradictory edges may occur when we define the total edge set \(E^{\lambda } = \cup _{i=1}^{p} E_{i}^{\lambda }\), i.e., \(\hat {\beta }_{k}^{i,\lambda } \neq 0\) and \(\hat {\beta }_{i}^{k,\lambda } = 0\). To avoid these contradictory edges, NS suggests two types of edge sets \(E^{\lambda,\wedge}\) and \(E^{\lambda,\vee}\) defined as follows:
$$\begin{array}{*{20}l} E^{\lambda,\wedge} = \left\{(i,j)~|~ i \in N_{j}(\lambda) ~\text{and}~ j \in N_{i}(\lambda)\right\},\\ E^{\lambda,\vee} = \left\{(i,j)~|~ i \in N_{j}(\lambda) ~\text{or}~ j \in N_{i}(\lambda)\right\}. \end{array} $$
Meinshausen and Bühlmann [17] note that, in their experience, these two edge sets differ only slightly and that the differences vanish asymptotically. They also propose the following choice of the tuning parameter \(\lambda_{i}(\alpha)\) for the ith node:
$$\lambda_{i}(\alpha) = \|\mathbf{X}_{i}\|_{2} \tilde{\Phi}^{-1}\left(\frac{\alpha}{2p^{2}}\right), $$
where \(\tilde {\Phi } = 1 - \Phi \) and Φ is the distribution function of the standard normal distribution. With this choice of \(\lambda_{i}(\alpha)\) for i=1,2,…,p, the probability of falsely identifying edges in the network is bounded by the level α. In this paper, we estimate the edge set with \(E^{\lambda,\wedge}\) and solve the lasso problems using the R package CDLasso [38].
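As an illustration, the per-node lasso fits and the \(\lambda_{i}(\alpha)\) choice can be sketched with the glmnet solver. This is only a sketch under our own naming (ns_network); the paper itself uses the CDLasso package, and glmnet scales the penalty by 1/n, so \(\lambda_{i}(\alpha)\) is divided by n to match the objective above.

library(glmnet)

ns_network <- function(X, alpha = 0.2) {
  n <- nrow(X); p <- ncol(X)
  adj <- matrix(FALSE, p, p)
  for (i in 1:p) {
    lam_i <- sqrt(sum(X[, i]^2)) * qnorm(1 - alpha / (2 * p^2))  # lambda_i(alpha)
    # glmnet minimizes (1/(2n))||y - Xb||^2 + lambda ||b||_1, so divide by n
    fit  <- glmnet(X[, -i], X[, i], lambda = lam_i / n,
                   intercept = FALSE, standardize = FALSE)
    beta <- as.numeric(coef(fit))[-1]            # drop the (zero) intercept
    adj[i, (1:p)[-i]] <- beta != 0               # neighborhood N_i(lambda)
  }
  list(and = adj & t(adj), or = adj | t(adj))    # E^{lambda,AND} and E^{lambda,OR}
}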
Graphical lasso (GLASSO)
Friedman et al. [15] propose the graphical lasso method that estimates a sparse inverse covariance matrix Ω by maximizing the \(\ell_{1}\)-penalized log-likelihood
$$ l(\Omega) = \log |\Omega| - \text{tr}(S\Omega) - \lambda \|\Omega\|_{1}, $$
(1)
where S is the sample covariance matrix, tr(A) is the trace of A, and \(\|A\|_{1}\) is the \(\ell_{1}\) norm of A for \(A \in \mathbb {R}^{p \times p}\).
To be specific, let W be the estimate of the covariance matrix Σ and consider the following partitions of W, S, and Ω:
$$ W = \left(\begin{array}{cc} W_{11} & w_{12}\\ w_{12}^{T} & w_{22} \end{array}\right),~ S = \left(\begin{array}{cc} S_{11} & s_{12}\\ s_{12}^{T} & s_{22} \end{array}\right), ~ \Omega = \left(\begin{array}{cc} \Omega_{11} & \omega_{12}\\ \omega_{12}^{T} & \omega_{22} \end{array}\right) $$
Motivated by [39], Friedman et al. [15] show that the solution \(\hat {\Omega }\) of (1) is equivalent to the inverse of W whose partitioned block \(w_{12}\) satisfies \(w_{12}=W_{11}\beta^{*}\), where \(\beta^{*}\) is the solution of the lasso problem
$$ \min_{\beta} ~ \frac{1}{2} \left\| W^{1/2}_{11} \beta - W_{11}^{-1/2} s_{12} \right\|_{2}^{2} + \lambda \|\beta\|_{1}. $$
(2)
Based on the above property, the graphical lasso sets the diagonal elements \(w_{ii}=s_{ii}+\lambda\) and obtains the off-diagonal elements of W by repeatedly applying the following two steps until convergence:
-
1.
Permuting the columns and rows to locate the target elements at the position of \(w_{12}\).
-
2.
Finding the solution \(w_{12}=W_{11}\beta^{*}\) by solving the lasso problem (2).
After finding W, the estimate \(\hat {\Omega }\) is obtained from the relationships \(\omega _{12} = - \hat {\beta } \hat {\omega }_{22}\) and \(\hat {\omega }_{22} = 1/(w_{22} - w_{12}^{T}\hat {\beta })\), where \(\hat {\beta } = W_{11}^{-1} w_{12}\). This graphical lasso algorithm was proposed in [15], and its computational efficiency was improved in [16] and [40]. Witten et al. [16] provide the R package glasso version 1.7.
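Since the R package glasso is used throughout this paper, a minimal call for problem (1) might look as follows; the toy data and the value of the penalty are placeholders for illustration only.

library(glasso)

set.seed(1)
X <- matrix(rnorm(50 * 10), 50, 10)   # toy n x p data matrix
S <- cov(X)
fit <- glasso(S, rho = 0.1)           # lambda = 0.1
Omega_hat <- fit$wi                   # estimated precision matrix
adj <- (Omega_hat != 0) & (row(Omega_hat) != col(Omega_hat))  # estimated edges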
GLASSO with reweighted strategy for scale-free network (GLASSO-SF)
Liu and Ihler [31] propose the reweighted \(\ell_{1}\) regularization method to improve the performance of estimation for scale-free networks, whose degree distributions follow the power law. Motivated by the fact that the existing methods work poorly for scale-free networks, Liu and Ihler [31] consider replacing the \(\ell_{1}\) norm penalty in the existing methods with the power law regularization
$$ p_{\lambda,\gamma}(\Omega) = \lambda \sum\limits_{i=1}^{p} \log \left(\|\omega_{-i}\|_{1} + \epsilon_{i} \right) + \gamma \sum\limits_{i=1}^{p} |\omega_{ii}|, $$
(3)
where λ and γ are nonnegative tuning parameters, \(\omega_{-i}=\{\omega_{ij}~|~j\neq i\}\), \(\|\omega _{-i}\|_{1} = {\sum \nolimits }_{j\neq i} |\omega _{ij}|\), and \(\epsilon_{i}\) is a small positive number for i=1,2,…,p. Thus, Liu and Ihler [31] consider optimizing the following objective function
$$ f(\Omega; \mathbf{X}, \lambda, \gamma) = L(\mathbf{X},\Omega) + u_{L} \cdot p_{\lambda,\gamma}(\Omega), $$
(4)
where L(X,Ω) denotes the objective function of the existing method without its penalty terms, \(u_{L}=1\) if L is convex and \(u_{L}=-1\) if L is concave in Ω. Note that the choice of L is flexible. For instance, L(X,Ω) can be the log-likelihood function of Ω as in the graphical lasso or the squared loss function as in the NS and the SPACE. In this section, we suppose that L is concave for notational simplicity.
To obtain the maximizer of f(Ω;X,λ,γ), Liu and Ihler [31] propose the iteratively reweighted \(\ell_{1}\) regularization procedure based on the minorization-maximization (MM) algorithm [41]. The reweighted procedure iteratively solves the following problem:
$$ \Omega^{(k+1)} = \operatornamewithlimits{arg\max}_{\Omega}~ L(\mathbf{X}, \Omega) - \sum\limits_{i=1}^{p} \sum\limits_{j\neq i} \eta_{ij}^{(k)} |\omega_{ij}| - \gamma \sum\limits_{i=1}^{p} |\omega_{ii}|, $$
(5)
where \(\Omega ^{(k)}= \left (\omega _{ij}^{(k)}\right)\) is the estimate at the kth iteration, \(\|\omega _{-i}^{(k)}\|_{1} = {\sum \nolimits }_{l \neq i} |\omega _{il}^{(k)}|\), and \(\eta _{ij}^{(k)} = \lambda \left (1/(\|\omega _{-i}^{(k)}\|_{1} + \epsilon _{i}) + 1/(\|\omega _{-j}^{(k)}\|_{1} + \epsilon _{j})\right)\). In practice, [31] suggest \(\epsilon_{i}=1\), \(\gamma=2\lambda/\epsilon_{i}\), and the initial estimate \(\Omega^{(0)}=I_{p}\), where \(I_{p}\) is the p-dimensional identity matrix. Note that this reweighted strategy facilitates the estimation of hub nodes by adjusting the weights in the penalty term, but the weights are updated solely from the observed dataset, without previously known information from other studies.
In this paper, we consider L(X,Ω)= log|Ω|−tr(SΩ), which is identical to the corresponding component of the objective function of the GLASSO. Thus, we call this procedure the GLASSO with a reweighted strategy for the scale-free network (GLASSO-SF). As applied in [31], we stop after 5 reweighting iterations. The R package glasso version 1.7 is used to obtain the solution of (5) at each iteration with the penalty matrix \(E^{(k)} = (e_{ij}^{(k)})\), where \(e_{ij}^{(k)} = \eta _{ij}^{(k)}\) for i≠j and \(e_{ii}^{(k)} = 2\lambda \) for i=1,2,…,p.
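The reweighted iteration (5) can be sketched as follows, reusing glasso with the element-wise penalty matrix \(E^{(k)}\) described above; the function name glasso_sf is ours, and we use \(\epsilon_{i}=1\) and \(\gamma=2\lambda\) as suggested in [31].

library(glasso)

glasso_sf <- function(S, lambda, n_iter = 5, eps = 1) {
  p <- ncol(S)
  Omega <- diag(p)                                    # Omega^(0) = I_p
  for (k in 1:n_iter) {
    row_l1 <- rowSums(abs(Omega)) - abs(diag(Omega))  # ||omega_{-i}||_1
    eta <- lambda * outer(1 / (row_l1 + eps), 1 / (row_l1 + eps), "+")
    diag(eta) <- 2 * lambda                           # e_ii = gamma = 2*lambda
    Omega <- glasso(S, rho = eta)$wi                  # solve (5) with penalty E^(k)
  }
  Omega
}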
Path consistency algorithm based on conditional mutual information (PCACMI)
Mutual information (MI) is a widely used measure of dependency between variables in information theory. MI captures even non-linear dependency between variables and can be considered a generalization of correlation. Several MI-based methods have been developed, such as ARACNE [20], CLR [42], and minet [43]. However, similar to correlation, MI only measures the pairwise dependency between two variables. Thus, it usually identifies many indirect interactions between variables. To resolve this difficulty, Zhang et al. [21] propose an information-theoretic method for reconstruction of gene regulatory networks based on the conditional mutual information (CMI).
To be specific, let H(X) and H(X,Y) be the entropy of a random variable X and the joint entropy of random variables X and Y, respectively. For two random variables X and Y, H(X) and H(X,Y) can be expressed as
$$ H(X) = E \left(-\log f_{X}(X)\right),~ H(X,Y) = E\left(-\log f_{XY}(X,Y)\right), $$
where \(f_{X}(x)\) is the marginal probability density function (PDF) of X and \(f_{XY}(x,y)\) is the joint PDF of X and Y. With these notations, MI is defined as
$$ \begin{array}{lll} I(X,Y) &=& E\left(- \text{log} \frac{f_{XY}(X,Y)}{f_{X}(X)f_{Y}(Y)}\right)\\ &=& H(X) + H(Y) - H(X,Y). \end{array} $$
(6)
It is known that MI measures the dependency between two variables, which includes both the direct dependency and the indirect dependency through other variables. While MI cannot distinguish direct from indirect dependency, CMI can measure the direct dependency between two variables by conditioning on the other variables. The CMI of X and Y given Z is defined as
$$ I(X,Y|Z) = H(X,Z)+ H(Y,Z) - H(Z) - H(X,Y,Z). $$
(7)
To estimate the entropies in (7), Zhang et al. [21] consider the Gaussian kernel density estimator used in [19]. Using the Gaussian kernel density estimator, the MI and CMI estimates are given by
$$\begin{array}{*{20}l} \widehat{I}(X,Y) = \frac{1}{2} \log \frac{|C(X)|~|C(Y)|}{|C(X,Y)|},\\ \widehat{I}(X,Y|Z) = \frac{1}{2} \log \frac{|C(X,Z)|~|C(Y,Z)|}{|C(Z)|~|C(X,Y,Z)|}, \end{array} $$
(8)
where |A| is the determinant of a matrix A, C(X), C(Y) and C(Z) are the variances of X, Y and Z, respectively, and C(X,Z), C(Y,Z) and C(X,Y,Z) are the covariance matrices of (X,Z), (Y,Z) and (X,Y,Z), respectively.
To efficiently identify dependent pairs of variables, Zhang et al. [21] adopt the path consistency algorithm (PCA) in [44]; hence, the authors call their procedure PCA based on CMI (PCACMI). The PCACMI method starts with L = 0 and calculates the L-order CMI, which is equivalent to MI when L=0. Then, PCACMI removes the pairs of variables for which the maximal CMI of the two variables given L+1 adjacent variables is less than a given threshold α, where α determines whether two variables are independent and the adjacent variables are those connected to the two target variables at the previous step of PCACMI. PCACMI repeats the above steps until there is no higher-order connection. The MATLAB code for PCACMI is provided by [21] at the author’s website https://sites.google.com/site/xiujunzhangcsb/software/pca-cmi.
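For reference, the Gaussian estimates in (8) reduce to determinants of sample covariance matrices and can be computed directly in R. The helper names below are hypothetical; x and y are numeric vectors, and z is a vector or matrix of conditioning variables.

gauss_mi <- function(x, y) {
  0.5 * log(var(x) * var(y) / det(cov(cbind(x, y))))
}

gauss_cmi <- function(x, y, z) {
  z <- as.matrix(z)
  0.5 * log(det(cov(cbind(x, z))) * det(cov(cbind(y, z))) /
            (det(cov(z)) * det(cov(cbind(x, y, z)))))
}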
Conditional mutual inclusive information-based network inference (CMI2NI)
Recently, Zhang et al. [22] proposed the conditional mutual inclusive information-based network inference (CMI2NI) method that improves the PCACMI method [21]. CMI2NI considers the Kullback-Leibler divergences from the joint probability density function (PDF) of target variables to the interventional PDFs removing the dependency between two variables of interest. Instead of using CMI, CMI2NI uses the conditional mutual inclusive information (CMI2) as the measure of dependency between two variables of interest given other variables. To be specific, we consider three random variables X, Y and Z. For these three random variables, the CMI2 between X and Y given Z is defined as
$$ \text{CMI2}(X,Y|Z) = \left(D_{\text{KL}}(P || P_{X \rightarrow Y}) + D_{\text{KL}}(P || P_{Y \rightarrow X}) \right)/2, $$
(9)
where \(D_{\text{KL}}(f\|g)\) is the Kullback-Leibler divergence from f to g, P is the joint PDF of X, Y and Z, and \(P_{X \rightarrow Y}\) is the interventional probability of X, Y and Z removing the connection from X to Y.
Under a Gaussian assumption on the observed data, the CMI2 for two random variables X and Y given an m-dimensional vector Z can be expressed as
$$ \begin{aligned} CMI2(X,Y|Z) &= \frac{1}{4}\left(\text{tr}(C^{-1} \Sigma) + \text{tr}(\tilde{C}^{-1} \tilde{\Sigma}) + \log C_{0}\right.\\ &\left.\qquad+ \log \tilde{C}_{0} - 2n \right), \end{aligned} $$
(10)
where Σ is the covariance matrix of \((X,Y,Z^{T})^{T}\), \(\tilde {\Sigma }\) is the covariance matrix of \((Y,X,Z^{T})^{T}\), \(\Sigma_{XZ}\) is the covariance matrix of \((X,Z^{T})^{T}\), \(\Sigma_{YZ}\) is the covariance matrix of \((Y,Z^{T})^{T}\), n=m+2, and C, \(\tilde {C},C_{0}\) and \(\tilde {C}_{0}\) are defined from the elements of \(\Sigma,\Sigma _{XZ},\Sigma _{YZ},\Sigma ^{-1},\Sigma _{XZ}^{-1}\) and \(\Sigma _{YZ}^{-1}\) (see Theorem 1 in [22] for details). As applied in PCACMI, CMI2NI adopts the path consistency algorithm (PCA) to efficiently calculate the CMI2 estimates. All steps of the PCA in CMI2NI are the same as those of PCACMI with the CMI replaced by the CMI2. In the PCA steps of CMI2NI, two variables are regarded as independent if the corresponding CMI2 estimate is less than a given threshold α. The MATLAB code for CMI2NI is available at the author’s website https://sites.google.com/site/xiujunzhangcsb/software/cmi2ni.
Sparse partial correlation estimation (SPACE)
In the Gaussian graphical models [45], the conditional dependencies among p variables can be represented by a graph \(\mathcal {G}=\left (V,E\right)\), where V={1,2,…,p} is a set of nodes representing the p variables and \(E=\{(i,j)~|~\omega_{ij}\neq 0,~1\le i\neq j\le p\}\) is a set of edges corresponding to the nonzero off-diagonal elements of Ω.
To describe the SPACE method, we consider linear models such that for i=1,2,…,p,
$$ \mathbf{X}_{i}=\sum\limits_{j\neq i}\beta_{ij}\mathbf{X}_{j}+\boldsymbol{\epsilon}_{i} $$
(11)
where \(\boldsymbol{\epsilon}_{i}\) is an n-dimensional random vector from the multivariate normal distribution with mean 0 and covariance matrix \((1/\omega_{ii})I_{n}\), and \(I_{n}\) is the identity matrix of size n×n. Under normality, the regression coefficients \(\beta_{ij}\) can be replaced with the partial correlations \(\rho^{ij}\) by the relationship
$$ \beta_{ij}=-\frac{\omega_{ij}}{\omega_{ii}}=\rho^{ij}\sqrt{\frac{\omega_{jj}}{\omega_{ii}}}, $$
(12)
where \(\rho ^{ij}=\text {corr}\left (X_{i},X_{j}~|~X_{k},k\neq i,j\right)=-{\omega _{ij}}\left /{\sqrt {\omega _{ii}\omega _{jj}}}\right.\) is the partial correlation between \(X_{i}\) and \(X_{j}\). Motivated by the relationship (12), Peng et al. [18] propose the SPACE method for solving the following \(\ell_{1}\)-regularized problem:
$$ {\begin{aligned} \min_{\rho}\frac{1}{2}\sum\limits_{i=1}^{p}\left\{ w_{i}\sum\limits_{k=1}^{n}\left(X_{i}^{k}-\sum\limits_{j\neq i}\rho^{ij}\sqrt{\frac{\omega_{jj}}{\omega_{ii}}}X_{j}^{k}\right)^{2}\right\} +\lambda\sum\limits_{1\le i<j \le p}|\rho^{ij}|, \end{aligned}} $$
(13)
where \(w_{i}\) is a nonnegative weight for the ith squared error loss.
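For concreteness, relationship (12) maps a concentration matrix to the partial correlations and regression coefficients as follows; this is a small sketch and the helper names are ours.

pcor_from_omega <- function(Omega) {
  P <- -Omega / sqrt(diag(Omega) %o% diag(Omega))   # rho^{ij} = -omega_ij / sqrt(omega_ii * omega_jj)
  diag(P) <- 1
  P
}

beta_from_omega <- function(Omega) {
  B <- -sweep(Omega, 1, diag(Omega), "/")           # beta_ij = -omega_ij / omega_ii
  diag(B) <- 0
  B
}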
Proposed approach incorporating previously known hub information
Extended sparse partial correlation estimation (ESPACE)
In this paper, we assume that some genes (or nodes), which are referred to as hub genes (or hub nodes), regulate many other genes, and we also assume that many of these hub genes were identified from previous experiments. To incorporate information about hub nodes, we propose the extended SPACE (ESPACE) method, which extends the model space by using an additional tuning parameter α on edges connected to the given hub nodes. This additional tuning parameter can reflect the hub gene information by reducing the penalty on edges connected to hub nodes. To be specific, let \(\mathcal {H}\) be the set of hub nodes that were previously identified. The ESPACE method we propose solves
$$ {\begin{aligned} &\min_{\rho}\frac{1}{2}\sum\limits_{i=1}^{p}\left\{w_{i} \sum\limits_{k=1}^{n}\left(X_{i}^{k}-\sum\limits_{j\neq i}\rho^{ij}\sqrt{\frac{\omega_{jj}}{\omega_{ii}}}X_{j}^{k}\right)^{2}\right\}\\ &\quad+\alpha \lambda \sum\limits_{{i<j,\atop \{i \in \mathcal{H}\} \cup \{j\in \mathcal{H}\}}}|\rho^{ij}|+ \lambda \sum\limits_{ {i<j,\atop i,j\in \mathcal{H}^{c}}}|\rho^{ij}|, \end{aligned}} $$
(14)
where 0<α≤1. Note that we set the weights \(w_{i}\) for the squared error loss to one in this paper. To summarize the proposed method, we depict the flowchart of the ESPACE method in Fig. 1. As described in Fig. 1, ESPACE takes prior knowledge about hub genes as an additional input, which is the novelty of the proposed method compared to the existing methods.
Extended graphical lasso (EGLASSO)
In the Background, we mentioned that the proposed procedure is applicable to other methods such as the graphical lasso. For a fair comparison and to investigate its performance, we also apply the proposed strategy to the GLASSO, which yields a version of the GLASSO that incorporates the hub gene information. We call this procedure the extended graphical lasso (EGLASSO). Similar to the ESPACE, the EGLASSO maximizes
$$ \log |\Omega| - \text{tr}(S\Omega) -\alpha \lambda \sum\limits_{{i<j,\atop \{i \in \mathcal{H}\} \cup \{j\in \mathcal{H}\}}}|\omega_{ij}| - \lambda \sum\limits_{{i<j,\atop i,j\in \mathcal{H}^{c}}}|\omega_{ij}|, $$
(15)
where λ≥0 and 0<α≤1 are two tuning parameters, S is the sample covariance matrix, tr(A) is the trace of A and \(\mathcal {H}\) is the set of hub nodes that were previously identified. Note that we can use the R package glasso version 1.7 for the EGLASSO by defining the penalty matrix corresponding to the penalty term in (15).
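A minimal sketch of this construction is given below: the penalty matrix puts αλ on entries touching a hub gene and λ elsewhere, and the diagonal is left unpenalized, matching (15). The function name eglasso is ours and hubs denotes the integer indices of the hub genes.

library(glasso)

eglasso <- function(S, hubs, lambda, alpha) {
  p <- ncol(S)
  Pen <- matrix(lambda, p, p)
  Pen[hubs, ] <- alpha * lambda     # rows touching a hub gene
  Pen[, hubs] <- alpha * lambda     # columns touching a hub gene
  diag(Pen) <- 0                    # no penalty on the diagonal, as in (15)
  glasso(S, rho = Pen)$wi
}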
Active shooting algorithm for ESPACE
To solve (14), we adopt the active shooting algorithm introduced in [18]. We rewrite the problem (14) as
$$ \min_{\rho}\frac{1}{2}\left\|\mathbf{Y}-\tilde{\mathbf{X}}\boldsymbol{\rho}\right\|_{2}^{2}+\alpha \lambda \sum\limits_{{i<j,\atop \{i \in \mathcal{H}\} \cup \{j\in \mathcal{H}\}}}|\rho^{ij}|+ \lambda \sum\limits_{ {i<j,\atop i,j\in \mathcal{H}^{c}}}|\rho^{ij}|, $$
(16)
where \(\mathbf {Y}=(\mathbf {X}_{1}^{T},\mathbf {X}_{2}^{T},\ldots,\mathbf {X}_{p}^{T})^{T}\) is an np-dimensional column vector; \(\mathbf {X}^{k,l}=\left (\mathbf {0}_{n(k-1)\times 1}^{T},\mathbf {X}_{l(k)}^{T},\mathbf {0}_{n(l-k-1)\times 1}^{T},\mathbf {X}_{k(l)}^{T},\mathbf {0}_{n(p-l)\times 1}^{T}\right)^{T}\) is an np-dimensional column vector as well, with \(\mathbf {X}_{k(l)}=\sqrt {\frac {\omega _{kk}}{\omega _{ll}}}\mathbf {X}_{k};\)
$$ \tilde{\mathbf{X}}= \left(\begin{array}{cccccccc} \mathbf{X}^{1,2}; & \mathbf{X}^{1,3}; & \cdots; & \mathbf{X}^{1,p}; & \mathbf{X}^{2,3}; & \mathbf{X}^{2,4}; & \cdots; & \mathbf{X}^{(p-1),p} \end{array}\right), $$
and \(\boldsymbol{\rho}=(\rho^{12};\rho^{13};\ldots;\rho^{1p};\rho^{23};\rho^{24};\ldots;\rho^{(p-1)p})^{T}\). Let \(\widehat {\boldsymbol {\rho }}^{(m)}\) and \(\widehat {\omega }_{ii}^{(m)}\) be the estimates of ρ and \(\omega_{ii}\) at the mth iteration, respectively. Then, the steps of the modified algorithm are outlined below:
-
Step 1: (Initialization of \(\widehat {\omega }_{ii}\)) For i=1,2,…,p, \(\widehat {\omega }_{ii}^{(0)} = 1\) and s=0.
-
Step 2: (Initialization of \(\widehat {\boldsymbol {\rho }}\)) For 1≤i<j≤p and m=0,
$$ \begin{array}{llll} \widehat{\rho}^{ij,(0)} &=& \text{sign}\left(\mathbf{Y}^{T}\mathbf{X}^{i,j}\right)\frac{\left(\left|\mathbf{Y}^{T}\mathbf{X}^{i,j}\right|-\alpha\lambda\right)_{+}}{\left(\mathbf{X}^{i,j}\right)^{T}\mathbf{X}^{i,j}} &\text{for}~\{i \in \mathcal{H}\}\cup\{ j \in \mathcal{H}\},\\ \widehat{\rho}^{ij,(0)} &=& \text{sign}\left(\mathbf{Y}^{T}\mathbf{X}^{i,j}\right)\frac{\left(\left|\mathbf{Y}^{T}\mathbf{X}^{i,j}\right|-\lambda\right)_{+}}{\left(\mathbf{X}^{i,j}\right)^{T}\mathbf{X}^{i,j}} &\text{for}~i,j\in\mathcal{H}^{c}, \end{array} $$
where (x)+ = max(x,0) and the \(\mathbf{X}^{i,j}\)s are defined in (16) with \(\widehat {\omega }_{ii}^{(s)}\).
-
Step 3: Define an active set \(\Lambda =\{(i,j)~|~\widehat {\rho }^{ij,(m)}\neq 0\}\).
-
Step 4: Iteratively update \(\widehat {\boldsymbol {\rho }}^{(m)}\) for (k,l)∈Λ,
$$ \begin{array}{lll} \widehat{\rho}^{kl,(m)}&= \text{sign}\left((\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right)\frac{\left(\left|(\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right|-\alpha\lambda\right)_{+}}{(\mathbf{X}^{k,l})^{T}\mathbf{X}^{k,l}}\\ &\quad\text{for}~\{k \in \mathcal{H}\}\cup\{ l \in \mathcal{H}\}, \\ \medskip \widehat{\rho}^{kl,(m)}&= \text{sign}\left((\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right)\frac{\left(\left|(\mathbf{X}^{k,l})^{T}\boldsymbol{\epsilon}'\right|-\lambda\right)_{+}}{(\mathbf{X}^{k,l})^{T}\mathbf{X}^{k,l}}&\text{for}~k,l\in\mathcal{H}^{c}, \end{array} $$
where \(\boldsymbol {\epsilon }'= \mathbf {Y}-{\sum \nolimits }_{(i,j)\neq (k,l)} \tilde {\rho }^{ij}\mathbf {X}^{i,j}\) and \(\tilde {\rho }^{ij}\)s are current estimates at the step for updating the (k,l)-th partial correlation.
-
Step 5: Repeat Step 4 until convergence occurs on the active set Λ.
-
Step 6: Update \(\widehat {\boldsymbol {\rho }}^{(m+1)}\) for 1≤i<j≤p by using the equations in Step 4. If the maximum difference between \(\widehat {\boldsymbol {\rho }}^{(m+1)}\) and \(\widehat {\boldsymbol {\rho }}^{(m)}\) is less than a pre-determined tolerance τ, then go to Step 7 with the estimates \(\widehat {\boldsymbol {\rho }}^{(m+1)}\). Otherwise, consider m=m+1 and go back to Step 3.
-
Step 7: Update \(\widehat {\omega }_{ii}^{(s+1)}\) for i=1,2,…,p,
$$\begin{aligned} \frac{1}{\widehat{\omega}_{ii}^{(s+1)}}&= \frac{1}{n}\left\|\mathbf{X}_{i}-\sum\limits_{j\neq i}\widehat{\rho}^{ij,(m+1)}\sqrt{\frac{\widehat{\omega}_{jj}^{(s)}}{\widehat{\omega}_{ii}^{(s)}}}\mathbf{X}_{j}\right\|_{2}^{2}\\ &\quad\text{for}~i=1,2,\ldots,p. \end{aligned} $$
-
Step 8: Repeat Step 2 through Step 7 with s=s+1 until convergence occurs on \(\widehat {\omega }_{ii}\)s.
Note that the number of iterations over the \(\widehat {\omega }_{ii}\)s needed to stabilize the estimates of ρ is usually small; in our numerical study, the estimates of \(\omega_{ii}\) converge within 10 iterations. Moreover, the inner products such as \(\mathbf{Y}^{T}\mathbf{X}^{i,j}\), whose naive complexity is O(np), can be computed efficiently by rewriting \(\mathbf {Y}^{T}\mathbf {X}^{i,j} = {\sum \nolimits }_{k=1}^{n} \left (\sqrt {\omega _{jj}/\omega _{ii}} + \sqrt {\omega _{ii}/\omega _{jj}}\right) X_{i}^{k} X_{j}^{k}, \) whose complexity is O(n). We implemented the R package espace, which is available from https://sites.google.com/site/dhyeonyu/software.
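The coordinate updates in Steps 2, 4, and 6 share the same soft-thresholding form, sketched below; the helper names are hypothetical, Xkl denotes the column \(\mathbf{X}^{k,l}\), and eps_resid denotes the current partial residual ε′.

soft_threshold <- function(z, thr) sign(z) * pmax(abs(z) - thr, 0)

update_rho_kl <- function(Xkl, eps_resid, lambda, alpha, hub_edge) {
  thr <- if (hub_edge) alpha * lambda else lambda   # alpha*lambda for edges touching a hub
  soft_threshold(sum(Xkl * eps_resid), thr) / sum(Xkl^2)
}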
Choice of tuning parameters
We have introduced the ESPACE method, which relaxes the penalty on edges connected to the hub genes (i.e., α<1) but keeps the same penalty on edges connected only to non-hub genes (i.e., as if α=1). When no hub genes are involved in a network, ESPACE reduces to SPACE. For a given λ, this modification allows us to find more edges connected to the hub genes by reducing α. In practice, however, we do not know the values of λ and α. In this paper, we consider the GIC-type criterion used in [46] for the Gaussian graphical model to choose the optimal tuning parameters \((\lambda^{*},\alpha^{*})\). Let \(\widehat {\rho }_{(\lambda,\alpha)}^{ij}\) be the (i,j)-th estimate of the partial correlation for given λ and α. The GIC-type criterion is defined as
$${\begin{aligned} \text{GIC}(\lambda,\alpha)&=\sum\limits_{i=1}^{p}\left\{ n\cdot\log {RSS}_{i}+\log{\log{n}}\log (p-1){\vphantom{\widehat{\rho}_{\lambda,\alpha}^{ij}}}\right.\\&\qquad\quad\times\left.\left|\left\{j:j\neq i,\widehat{\rho}_{\lambda,\alpha}^{ij}\neq0\right\}\right|\right\}, \end{aligned}} $$
where \({RSS}_{i} = \left \|\mathbf {X}_{i}-\sum \limits _{j\neq i}\widehat {\rho }_{(\lambda,\alpha)}^{ij}\mathbf {X}_{j(i)}\right \|_{2}^{2}\) and |A| denotes the cardinality of a set A. We choose the tuning parameters that minimize the GIC-type criterion,
$$(\lambda^{*},\alpha^{*})={\operatornamewithlimits{argmin}_{\lambda,\alpha}\text{GIC}(\lambda,\alpha).} $$
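For completeness, the criterion can be computed directly from a fitted partial correlation matrix, as in the following sketch with hypothetical names: rho_hat is the p×p matrix of \(\widehat{\rho}_{(\lambda,\alpha)}^{ij}\) and omega_hat the vector of estimated diagonal entries \(\widehat{\omega}_{ii}\).

gic <- function(X, rho_hat, omega_hat) {
  n <- nrow(X); p <- ncol(X)
  val <- 0
  for (i in 1:p) {
    b     <- rho_hat[i, -i] * sqrt(omega_hat[-i] / omega_hat[i])   # implied beta_ij
    rss_i <- sum((X[, i] - X[, -i, drop = FALSE] %*% b)^2)
    df_i  <- sum(rho_hat[i, -i] != 0)                              # |{j : rho_hat != 0}|
    val   <- val + n * log(rss_i) + log(log(n)) * log(p - 1) * df_i
  }
  val
}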
Simulation studies
Simulation settings
In this simulation, we consider four real protein-protein interaction (PPI) networks used in a comparative study [24], which were partially selected from the human protein reference database [47]. As mentioned earlier, genes whose degrees are greater than 7 and above the 0.95 quantile of the degree distribution are regarded as hub genes. Figure 2 shows the four PPI networks and their hub genes. Let p be the number of nodes in a network. We set the number of samples to p/2 and p, and generate samples from the multivariate normal distribution with mean 0 and covariance matrix Σ defined by \((\Sigma)_{ij} = (\Omega ^{-1})_{ij}/\sqrt {(\Omega ^{-1})_{ii}(\Omega ^{-1})_{jj}}\), where Ω is a concentration matrix corresponding to a given network structure. To generate a positive definite concentration matrix, we use the following procedure described in [18] (an R sketch of these steps is given after the list):
-
Step G1: For a given edge set E, we generate an initial concentration matrix \(\tilde {\Omega }=(\tilde {\omega }_{ij})_{1\le i,j\le p}\) with
$$\tilde{\omega}_{ij}=\left\{ \begin{array}{ll} 1 & \quad i=j\\ 0 & \quad i\neq j,~(i,j)\notin E\\ \sim~Unif(D) & \quad i\neq j,~(i,j)\in E \end{array}\right., $$
where D=[−1,−0.5]∪[ 0.5,1].
-
Step G2: For positive definiteness and symmetry of the concentration matrix, we define a concentration matrix \(\Omega=(\omega_{ij})_{1\le i,j\le p}\) as
$$\Omega=\frac{1}{2}\left(A+A^{T}\right), $$
where \(A=(a_{ij})_{1\le i,j\le p}\), \(a_{ij}=\tilde {\omega }_{ij}/(1.5\cdot d_{i})\) and \(d_{i}=\sum \limits _{k\neq i}|\tilde {\omega }_{ik}|\) for i=1,2,…,p.
-
Step G3: Set \(\omega_{ii}=1\) for i=1,2,…,p and \(\omega_{ij}= 0.1\cdot\text{sign}(\omega_{ij})\) if \(0<|\omega_{ij}|<0.1\).
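The following R sketch implements Steps G1–G3 for an edge set E given as a two-column matrix of node pairs; the function name make_omega is ours.

make_omega <- function(p, E) {
  Omega_t <- diag(p)
  for (r in seq_len(nrow(E))) {                    # Step G1
    v <- sample(c(-1, 1), 1) * runif(1, 0.5, 1)    # Unif on [-1,-0.5] U [0.5,1]
    Omega_t[E[r, 1], E[r, 2]] <- v
    Omega_t[E[r, 2], E[r, 1]] <- v
  }
  d <- rowSums(abs(Omega_t)) - 1                   # d_i = sum_{k != i} |omega-tilde_ik|
  d[d == 0] <- 1                                   # guard against isolated nodes
  A <- sweep(Omega_t, 1, 1.5 * d, "/")             # a_ij = omega-tilde_ij / (1.5 * d_i)
  Omega <- (A + t(A)) / 2                          # Step G2
  diag(Omega) <- 1                                 # Step G3
  small <- Omega != 0 & abs(Omega) < 0.1 & row(Omega) != col(Omega)
  Omega[small] <- 0.1 * sign(Omega[small])
  Omega
}
# Covariance matrix with unit diagonal, as defined above:
# Sigma <- cov2cor(solve(make_omega(p, E)))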
With these four networks, we have conducted numerical comparisons of the ESPACE and the SPACE methods, as well as seven other methods, including the other reviewed existing methods and the EGLASSO. For a fair comparison, we select the optimal model by the GIC for SPACE, ESPACE, GLASSO, GLASSO-SF, and EGLASSO. Since there is no specific rule for model selection in the other methods, we set the level α=0.2 for GeneNet and NS, and the threshold α=0.03 for PCACMI and CMI2NI. Note that the pre-determined level α=0.2 is the default of the GeneNet package and was used in [35]. The pre-determined threshold α=0.03 was used in [21, 22].
Note that all the existing methods need \(O(p^{2})\) memory space to store and calculate values corresponding to the interactions between variables. This memory consumption can be reduced when the set of variables can be divided into several conditionally independent blocks, using the condition described in [16].
Sensitivity analysis on random noise in the observed data
To investigate the effect of the random noise contained in the observed data, we conduct a sensitivity analysis with respect to the variance of the random noise. To be specific, suppose that a random vector \(\mathbf{X}=(X_{1},X_{2},\ldots,X_{p})^{T}\) follows the multivariate normal distribution with mean 0 and covariance matrix Σ, a vector of random noise \(\varepsilon=(\varepsilon_{1},\varepsilon_{2},\ldots,\varepsilon_{p})^{T}\) follows the multivariate normal distribution with mean 0 and covariance matrix \(\sigma _{\epsilon }^{2} I\), and X and ε are independent, where I is the identity matrix. Furthermore, we assume that the observed random vector \(\mathbf{Z}=(Z_{1},Z_{2},\ldots,Z_{p})^{T}\) satisfies
$$ \mathbf{Z} = \mathbf{X} + \varepsilon. $$
(17)
Thus, the covariance matrix of Z becomes \(\Sigma + \sigma _{\epsilon }^{2} I\), which may have a conditional dependence structure different from that of X.
For example, if we consider \(\sigma _{\epsilon }^{2} = 0.5\) and the following Σ and \(\Sigma_{\mathbf{Z}}\)
$$ \Sigma= \left(\begin{array}{ccc} 15/11 & -8/11 & 2/11 \\ -8/11& 16/11& -4/11 \\ 2/11& -4/11& 12/11 \\ \end{array}\right),~ \Sigma_{\mathbf{Z}}= \Sigma + \sigma_{\epsilon}^{2} I, $$
(18)
then the inverse matrices of Σ and \(\Sigma_{\mathbf{Z}}\) are calculated as
$$ {\begin{aligned} &\Sigma^{-1}= \left(\begin{array}{ccc} 1 & 0.5 & 0 \\ 0.5& 1& 0.25 \\ 0& 0.25& 1 \\ \end{array}\right)\text{and}\\ &\Sigma_{\mathbf{Z}}^{-1}= \left(\begin{array}{ccc} 0.63 &0.23 &-0.02 \\ 0.23 &0.62 & 0.12 \\ -0.02& 0.12 & 0.66 \end{array}\right), \text{respectively.} \end{aligned}} $$
(19)
Thus, we can see that \(Z_{1}\) and \(Z_{3}\) are conditionally dependent given \(Z_{2}\), while \(X_{1}\) and \(X_{3}\) are conditionally independent given \(X_{2}\). Moreover, the nonzero partial correlations decrease as the variance of the random noise increases. From these observations, we expect the estimation performance to deteriorate as the variance of the random noise increases.
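The toy calculation in (18) and (19) can be reproduced directly in R:

Sigma <- matrix(c(15, -8, 2, -8, 16, -4, 2, -4, 12) / 11, nrow = 3)
solve(Sigma)                            # concentration matrix of X
round(solve(Sigma + 0.5 * diag(3)), 2)  # concentration matrix of Z = X + noise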
In this sensitivity analysis, we consider \(\sigma _{\epsilon }^{2} = 0, 0.01, 0.1, 0.25, 0.5\), p=231, and n=115, 231 with the same network structure as that of the p=231 network in Fig. 2. To focus on the proposed method, we apply the SPACE and the ESPACE methods to the 50 generated datasets containing random noise with variance \(\sigma _{\epsilon }^{2}\).
Performance measures
To investigate the gains from the extension, we use five performance measures: sensitivity (SEN), specificity (SPE), false discovery rate (FDR), mis-specification rate (MISR), and the Matthews correlation coefficient (MCC). Note that the MCC, which lies between −1 and +1, has been used to measure the performance of binary classification, where +1, 0, and −1 denote a perfect classification, a random classification, and a total discordance of classification, respectively. Let ρ and \(\widehat {\rho }_{\lambda,\alpha }\) be the (p(p−1)/2)-dimensional vectors of the true and estimated partial correlations, respectively. The above five measures are defined as
$${\begin{aligned} \begin{array}{l} \text{SEN} \equiv \text{TP}/(\text{TP}+\text{FN}),~~ \text{SPE} \equiv \text{TN}/(\text{TN}+\text{FP}), \\ \text{FDR} \equiv \text{FP}/(\text{TP} + \text{FP}),~~ \text{MISR} \equiv (\text{FN} + \text{FP})/\left(p(p-1)/2\right)~~\text{and}\\ \text{MCC} \equiv \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}, \end{array} \end{aligned}} $$
where \(\text {TP} = {\sum \nolimits }_{i<j} I(\rho ^{ij} \neq 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha } \neq 0)\), \(\text {FP} = {\sum \nolimits }_{i<j} I(\rho ^{ij} = 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha } \neq 0)\), \(\text {FN} = {\sum \nolimits }_{i<j} I(\rho ^{ij} \neq 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha }=0)\) and \(\text {TN} = {\sum \nolimits }_{i<j} I(\rho ^{ij} = 0) I(\widehat {\rho }^{ij}_{\lambda,\alpha } = 0)\).
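These measures can be computed directly from the true and estimated partial correlation vectors, as in the following sketch; the function name edge_metrics is ours.

edge_metrics <- function(rho_true, rho_hat) {
  tp <- sum(rho_true != 0 & rho_hat != 0)
  fp <- sum(rho_true == 0 & rho_hat != 0)
  fn <- sum(rho_true != 0 & rho_hat == 0)
  tn <- sum(rho_true == 0 & rho_hat == 0)
  c(SEN  = tp / (tp + fn),
    SPE  = tn / (tn + fp),
    FDR  = fp / (tp + fp),
    MISR = (fn + fp) / length(rho_true),
    MCC  = (tp * tn - fp * fn) /
           sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
}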
Application to Escherichia coli dataset
We applied the ESPACE method to the largest public Escherichia coli (E.coli) microarray dataset available from the Many Microbe Microarrays database (M3D) [48]. The M3D contains 907 microarrays measured under 466 experimental conditions using Affymetrix GeneChip E.coli genome arrays. Microarrays from the same experimental conditions were averaged to derive the mean expression. The data set (“E_coli_v4_Build_6” from the M3D) contains the expression levels of 4297 genes from 446 samples. In the E.coli genome, a number of studies have been conducted to identify transcriptional regulations. The RegulonDB [49] curates the largest and best-known information on the transcriptional regulation of E.coli. To combine the information from the above two databases, we focus on the 1623 genes reported in both the M3D and the RegulonDB. As mentioned before, the TFs are known to regulate many other genes in the genome and can be considered potential hubs. To incorporate the information about the potential hubs, we used a list of 180 known TF-encoding genes from the RegulonDB. The RegulonDB also provides 3811 transcriptional interactions among the 1623 genes, which were used as the gold standard to evaluate the accuracy of the constructed networks.
Application to lung cancer adenocarcinoma dataset
Lung cancer is the leading cause of death from cancer, both in the United States and worldwide; it has a 5-year survival rate of approximately 15% [50]. The progression and metastasis of lung cancer vary greatly among early-stage lung cancer patients. To customize treatment plans for individual patients, it is important to identify prognostic or predictive biomarkers, which allow for a more precise classification of lung cancer patients. In this study, we applied the extended SPACE method to reconstruct the gene regulatory network in lung cancer. Exploring network structures can facilitate comprehension of the biological mechanisms underlying lung cancer and the identification of important genes that could be potential lung cancer biomarkers. We constructed the gene network using microarray data from 442 lung cancer adenocarcinoma patients in the Lung Cancer Consortium study [51]. For details about the preprocessing of this dataset, please refer to [7]. First, univariate Cox regression was used to identify the genes whose expression levels are correlated with patient survival outcomes, after adjusting for clinical factors such as study site, age, gender, and stage. The false discovery rate (FDR) was then calculated using a Beta-Uniform model [52]. By controlling the FDR to less than 10%, we identified 794 genes that were associated with the survival outcome of lung cancer patients. Among these 794 genes, 22 were found among the 236 carefully curated cancer genes of the FoundationOne™ gene panel (Foundation Medicine, Inc.). Current biological knowledge indicates that genes in this panel play a key role in various types of cancer. These 22 genes were then input as known hub genes to the ESPACE method.