Gene clustering
The proposed SGLasso assumes that the cluster structure has been well defined. When clusters of genes in the same functional groups can be constructed from biological information such as GO [16], such clusters can be used directly in the analysis. However, gene group information is often only partially available, or not available at all. In this case we propose defining the cluster structure based on statistical measures [9].
We use the K-means approach in this paper. Many alternative clustering methods exist, such as hierarchical clustering, the self-organizing map, and tree-structured vector quantization, among others. For data sets with unknown structure, no single approach dominates. We use the K-means approach because it is computationally affordable and relatively robust.
We propose using the Gap statistic [32] to determine the optimal number of clusters. With the K-means approach, we first choose M, the largest number of clusters considered. Then for m = 1,..., M:

1. Generate m clusters using the K-means approach. Denote the total within-cluster sum of squares as rss(m).

2. Create a new data set by separately permuting each gene's expression measurements. Apply the K-means method to the permuted expression data, and let rss*(m) denote the resulting within-cluster sum of squares. Repeat this a number of times and compute the average ave(rss*(m)).

3. Compute the Gap statistic as gap(m) = ave(rss*(m)) - rss(m).

Choose the value of m that maximizes gap(m). We refer to [32] for detailed discussions of the Gap statistic.
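The procedure above is straightforward to script. The sketch below is a minimal illustration, assuming a genes-by-samples expression matrix `expr` so that genes are the objects being clustered; the use of scikit-learn's KMeans, the number of permutations, and all function names are our own choices rather than the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_cluster_ss(expr, m):
    """Total within-cluster sum of squares for a K-means fit with m clusters."""
    return KMeans(n_clusters=m, n_init=10, random_state=0).fit(expr).inertia_

def select_num_clusters(expr, max_clusters, n_perm=20, seed=0):
    """Return the m in 1..max_clusters that maximizes gap(m) = ave(rss*(m)) - rss(m)."""
    rng = np.random.default_rng(seed)
    gaps = {}
    for m in range(1, max_clusters + 1):
        rss_m = within_cluster_ss(expr, m)
        perm_rss = []
        for _ in range(n_perm):
            # separately permute each gene's measurements across samples
            expr_perm = np.stack([rng.permutation(row) for row in expr])
            perm_rss.append(within_cluster_ss(expr_perm, m))
        gaps[m] = np.mean(perm_rss) - rss_m
    return max(gaps, key=gaps.get)
```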
Data settings
Let Z be a length-d vector of gene expressions, and let Y be the clinical outcome of interest. Assume that n i.i.d. copies of (Y, Z) are available. We generate m gene clusters using the K-means approach, where m is chosen using the Gap statistic. We assume that the clusters have sizes p1,..., pm with p1 + ... + pm = d. We write Z = (Z1,..., Zm), where Zi contains the pi gene expressions in the ith cluster for i = 1,..., m. We assume that Y is associated with Z through a parametric or semiparametric model Y ~ φ(β Z') with a regression function φ and unknown regression coefficient β, where β = (β1,..., βm) and βi = (βi1,..., βipi) for i = 1,..., m. In this article, we study the binary classification and censored survival analysis problems because of their wide applications.
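As a small illustration of this notation, assuming an n-by-d expression matrix Z (rows are samples) and gene cluster labels from the clustering step, the covariates can be split into the blocks Z1,..., Zm as follows (names are illustrative, not from the paper):

```python
import numpy as np

def split_by_cluster(Z, labels):
    """Z: n-by-d expression matrix; labels: length-d gene cluster labels in 0..m-1.
    Returns [Z1, ..., Zm], where Zi holds the p_i genes assigned to cluster i."""
    return [Z[:, labels == i] for i in np.unique(labels)]
```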
Binary classification
For classification problems, Y is the categorical variable indicating the disease status, for example the occurrence or stage of cancer. We focus on binary classification only. Suppose that Y = 1 denotes the presence and Y = 0 the absence of disease. We assume the commonly used logistic regression model, where the logit of the conditional probability is logit(P(Y = 1|Z)) = α + β Z' and α is the unknown intercept. Based on a sample of n i.i.d. observations (Y1, Z1),..., (Yn, Zn), the maximum likelihood estimator is defined as (α̂, β̂) = argmaxα,β Rn(α, β), where

Rn(α, β) = Σj { Yj (α + β Zj') − log(1 + exp(α + β Zj')) },

with the sum over j = 1,..., n. We always keep the intercept α in the model. For simplicity, we denote Rn(α, β) as Rn(β).
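For concreteness, the log-likelihood Rn(α, β) above can be transcribed directly into code. This is a sketch with illustrative names, assuming Y is a 0/1 vector of length n and Z an n-by-d matrix (numpy arrays).

```python
import numpy as np

def logistic_loglik(alpha, beta, Y, Z):
    """Rn(alpha, beta) = sum_j [ Y_j * eta_j - log(1 + exp(eta_j)) ], eta_j = alpha + beta Z_j'."""
    eta = alpha + Z @ beta
    # np.logaddexp(0, eta) computes log(1 + exp(eta)) in a numerically stable way
    return np.sum(Y * eta - np.logaddexp(0.0, eta))
```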
Survival analysis
For right censored survival data, Y = (T, Δ), where T = min(U, V) and Δ = I(U ≤ V). Here U and V denote the event time of interest and the random censoring time, respectively. The most widely used model for right censored data is the Cox proportional hazards model [35], which assumes that the conditional hazard function is λ(u|Z) = λ0(u) exp(β Z'), where λ0 is the unknown baseline hazard function and β is the regression coefficient. Based on a sample of n i.i.d. observations Xj = (Yj, Zj), j = 1,..., n, the maximum partial likelihood estimator is defined as the value of β that maximizes the log partial likelihood

Rn(β) = Σj Δj { β Zj' − log( Σk∈rj exp(β Zk') ) },

where the outer sum is over j = 1,..., n and rj = {k : Tk ≥ Tj} is the risk set at time Tj.
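A sketch of this log partial likelihood, using the risk-set definition above (illustrative names; T holds the observed times, delta the event indicators Δ, and Z the n-by-d covariate matrix, all numpy arrays; ties are handled naively):

```python
import numpy as np

def cox_log_partial_lik(beta, T, delta, Z):
    eta = Z @ beta
    loglik = 0.0
    for j in range(len(T)):
        if delta[j] == 1:                      # only observed events contribute terms
            risk = T >= T[j]                   # r_j = {k : T_k >= T_j}
            loglik += eta[j] - np.log(np.sum(np.exp(eta[risk])))
    return loglik
```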
Supervised group Lasso
For the logistic regression and Cox proportional hazards models, the SGLasso consists of the following steps.
1. For cluster i = 1,..., m, compute β̂i, the cluster-wise Lasso estimate of βi. Specifically,

(1)  β̂i = argmax Rn(βi) subject to |βi1| + ... + |βipi| ≤ ui,

where ui is a data-dependent tuning parameter and Rn(βi) is the log-likelihood for binary classification and the log partial likelihood for Cox survival analysis, computed using only the covariates Zi in cluster i. That is, for cluster i, we use only genes within that cluster to construct the predictive model. Gene selection within that cluster is achieved with the Lasso, and sparse models are obtained when ui → 0. We propose selecting ui using V-fold cross validation [36]. We emphasize that the tuning parameters ui are selected for each cluster separately, so we allow different tuning parameters, and hence different degrees of regularization, for different clusters. This flexibility allows us to detect more subtle structures that cannot be detected by applying the Lasso method to all the genes/clusters at the same time.
2. For each cluster, the Lasso model has only a small number of nonzero coefficients. For cluster i, denote Z̃i as the reduced covariate vector composed of the covariates with nonzero estimated coefficients in the Step 1 cluster-wise model, and denote β̃i as the corresponding reduced unknown coefficient. The dimension of the gene set may be greatly reduced by Step 1: in our examples, a cluster of size ~20 may have only 2~5 genes retained in the reduced data.
3. Construct the joint predictive model under the GLasso constraint. Specifically, the GLasso estimate is defined as

argmax Rn(β̃) subject to ||β̃1|| + ... + ||β̃m|| ≤ u,

where ||·|| denotes the L2 norm and Rn(β̃) is Rn(β) with β replaced by β̃ and Z replaced by Z̃ = (Z̃1,..., Z̃m). The tuning parameter u is also chosen via V-fold cross validation. With u → 0, estimates of some components of (β̃1,..., β̃m) can be exactly zero, so that selection of important clusters is achieved. A sketch of the full two-stage procedure is given at the end of this subsection.
In our examples, the objective functions Rn are continuously differentiable and depend only on the data and the unknown regression coefficient β. Other smooth objective functions, for example the log-binomial likelihood for binary classification or the least-squares-type estimating equation for the AFT survival model [37], can also be considered. The SGLasso only requires that the expectation of the objective function has a well-separated maximum. However, for the proposed computational algorithms to work, we need to assume that the objective function is locally differentiable, i.e., it can be locally approximated by a smooth function.
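To make the two-stage structure concrete, the sketch below outlines the SGLasso steps for the binary classification case. It uses the penalized (rather than constrained) form of the cluster-wise Lasso via scikit-learn's LogisticRegressionCV, which corresponds to (1) up to a reparameterization of the tuning parameter, and leaves the final group Lasso fit as a placeholder function that could be implemented with Algorithm II below; all names are illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def sglasso_binary(Y, Z, labels, fit_group_lasso):
    """Two-stage SGLasso sketch: cluster-wise Lasso, then group Lasso on the survivors."""
    reduced_cols, cluster_of_col = [], []
    for i in np.unique(labels):
        cols = np.where(labels == i)[0]
        # Step 1: cluster-wise L1-penalised logistic fit, tuning chosen by cross-validation
        fit = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5).fit(Z[:, cols], Y)
        # Step 2: keep only the genes with nonzero estimated coefficients
        keep = cols[np.abs(fit.coef_.ravel()) > 1e-8]
        reduced_cols.extend(keep.tolist())
        cluster_of_col.extend([int(i)] * len(keep))
    # Step 3: group Lasso on the reduced covariates, with groups given by the original clusters
    Z_reduced = Z[:, reduced_cols]
    return fit_group_lasso(Y, Z_reduced, np.array(cluster_of_col)), reduced_cols
```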
Computational algorithms
Since the Lasso constraint is not differentiable, standard derivative-based maximization approaches, such as Newton-Raphson, cannot be used to obtain the Lasso estimate. In most previous studies, the maximization relies on quadratic programming (QP) or general nonlinear programming, which are computationally intensive. Moreover, quadratic programming cannot be applied directly to settings where the sample size is smaller than the number of predictors. The L1 boosting based approach proposed in [38] provides a computationally feasible solution for high dimensional cases.
Algorithm I: L1 boosting Lasso
For the ith cluster:

1. Initialize βi = 0 and s = 0.

2. With the current estimate βi = (βi1,..., βipi), compute φ(βi) = ∂Rn(βi)/∂βi. Denote the kth component of φ as φk.

3. Find the k* that minimizes min(φk(βi), −φk(βi)) over k, i.e., the component with the largest absolute gradient. If φk*(βi) = 0, stop the iteration.

4. Otherwise denote γ = −sign(φk*(βi)). Find π̂ with

(2)  π̂ = argmaxπ∈[0,1] Rn((1 − π) βi + π ui γ ek*),

where ek* has the k*th element equal to 1 and the rest equal to 0.

5. Let βik = (1 − π̂) βik for k ≠ k*, and βik* = (1 − π̂) βik* + π̂ γ ui. Let s = s + 1.

6. Repeat steps 2–5 until convergence or until a fixed number of iterations S has been reached.
The βi at convergence is the Lasso estimate. We declare convergence if the absolute value of φk*(βi) computed in step 3 is less than a pre-defined criterion, and/or if Rn(βi) exceeds a pre-defined threshold. An alternative algorithm could be based on LARS; since it is not the focus of this study, we omit discussions of other computational algorithms.
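A sketch of the gradient (L1 boosting) Lasso of [38] underlying Algorithm I is given below, written for a single cluster and for minimizing a generic smooth loss (e.g., the negative of Rn). The exact line search in (2) is replaced here by a simple grid search, and all names are illustrative.

```python
import numpy as np

def l1_boosting_lasso(loss, grad, dim, u_i, max_iter=500, tol=1e-6):
    """Gradient/L1-boosting Lasso sketch: loss and grad are callables for a smooth loss."""
    beta = np.zeros(dim)                          # start at beta = 0 (cf. step 1)
    for _ in range(max_iter):
        g = grad(beta)                            # gradient at the current estimate (cf. step 2)
        k = int(np.argmax(np.abs(g)))             # coordinate with the largest |gradient| (cf. step 3)
        if abs(g[k]) < tol:
            break                                 # gradient numerically zero: stop
        vertex = np.zeros(dim)
        vertex[k] = -np.sign(g[k]) * u_i          # vertex of the L1 ball |beta|_1 <= u_i (cf. step 4)
        grid = np.linspace(0.0, 1.0, 101)         # grid stand-in for the line search over pi
        vals = [loss((1 - p) * beta + p * vertex) for p in grid]
        pi_hat = grid[int(np.argmin(vals))]
        beta = (1 - pi_hat) * beta + pi_hat * vertex   # convex-combination update (cf. step 5)
    return beta                                   # Lasso estimate for this cluster
```

Because the update is always a convex combination of the current estimate and a vertex of the L1 ball of radius u_i, every iterate satisfies the Lasso constraint, which is what makes this boosting scheme usable when the number of predictors exceeds the sample size.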
For the GLasso, a LARS based approach is proposed in [19]. For high dimensional settings, a computationally more affordable approach is proposed in [39]. This approach shares the same spirit as the L1 boosting Lasso; both are special cases of the gradient-based constrained maximization discussed in [40]. This boosting based algorithm can be summarized as the following iterations.
Algorithm II: boosting group Lasso
1. Initialize β̃ = 0. Set Δ to be a sufficiently small positive scalar.

2. With the current estimate of β̃, calculate the gradient ∂Rn(β̃)/∂β̃.

3. Set b = −∂Rn(β̃)/∂β̃ and τ = {1,..., m}. Denote the pth component (block) of b as bp.

4. Start loop.

(a) Calculate the projection up = I(p ∈ τ) × ( ||bp|| + {u − Σq∈τ ||bq||}/|τ| ) for p = 1,..., m, where |τ| is the cardinality of τ.

(b) If up ≥ 0 for all p, then exit the loop.

(c) Else update the active set τ = {p : up > 0}.

5. End loop.

6. Obtain the new estimate β̃p = bp up/||bp|| for p = 1,..., m.

7. Repeat steps 2–6 until convergence or until a fixed number of iterations has been reached.
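A sketch of one iteration of this boosting group Lasso update is given below, with the inner function implementing the projection of the group-wise gradient norms in steps 4(a)–(c). Here `grad` returns the gradient of Rn and `groups` assigns each coefficient to its cluster; all names are illustrative, and the outer iteration and convergence check of steps 1 and 7 are omitted.

```python
import numpy as np

def project_group_norms(norms, u):
    """Project the vector of group norms onto {v >= 0, sum(v) = u} (steps 4a-4c)."""
    active = np.ones(len(norms), dtype=bool)          # tau = {1, ..., m}
    while True:
        shift = (u - norms[active].sum()) / active.sum()
        v = np.where(active, norms + shift, 0.0)
        if np.all(v >= 0):
            return v                                  # all u_p >= 0: exit the loop
        active = v > 0                                # shrink the active set and repeat

def group_lasso_boosting_step(beta, grad, groups, u):
    """One pass of steps 2-6 for the current estimate beta."""
    b = -grad(beta)                                   # step 3: negative gradient
    m = int(groups.max()) + 1
    norms = np.array([np.linalg.norm(b[groups == p]) for p in range(m)])
    u_p = project_group_norms(norms, u)               # step 4: projected group norms
    new_beta = np.zeros_like(beta)
    for p in range(m):                                # step 6: rescale each gradient block
        if norms[p] > 0:
            new_beta[groups == p] = b[groups == p] * u_p[p] / norms[p]
    return new_beta
```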
A graphical presentation
We use the following numerical example to graphically demonstrate the parameter paths of the proposed approach. For better resolution, we consider only a small study with nine covariates (genes) and three clusters. Since the proposed approach does not depend on the specific form of the objective function, we consider a simple linear regression model and use the least squares loss function.
Consider the linear model y = β1 z1 + ... + β9 z9 + ε, where β = (β1,..., β9) is the vector of regression coefficients and ε is the random error. We assume that there are three clusters: (z1, z2, z3) form cluster 1, (z4, z5, z6) form cluster 2, and the rest belong to cluster 3. We assume that all z are marginally N(0, 1) distributed; the pairwise correlation coefficients are 0.4, 0.4 and 0.2 for covariates in clusters 1, 2, and 3, respectively; and different clusters are independent. Moreover, we set β = (-1, -1, 0, -1, -1, 0, 0, 0, 0). In this simulated dataset, we have three clusters, two of which are associated with the outcome. Within each of the first two clusters, two out of three covariates contribute to the outcome.
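A sketch of this data-generating model is given below, using block-structured multivariate normals for the within-cluster correlations; the error is taken to be standard normal here, which the text does not specify, and all names are illustrative.

```python
import numpy as np

def simulate(n=100, seed=0):
    rng = np.random.default_rng(seed)

    def block(size, rho):
        # equicorrelated N(0, 1) covariates within one cluster
        cov = np.full((size, size), rho)
        np.fill_diagonal(cov, 1.0)
        return rng.multivariate_normal(np.zeros(size), cov, size=n)

    Z = np.hstack([block(3, 0.4), block(3, 0.4), block(3, 0.2)])  # clusters 1, 2, 3
    beta = np.array([-1, -1, 0, -1, -1, 0, 0, 0, 0], dtype=float)
    y = Z @ beta + rng.standard_normal(n)                         # error variance assumed 1
    return y, Z
```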
We generate 100 random data points from the above model. The regression parameters are estimated using the Lasso, GLasso and SGLasso. Tuning parameters are selected using 3-fold cross validation. In Figure 2, we show the parameter path as a function of the tuning parameter u. In the upper panels, we show the parameter paths for Lasso (left) and GLasso (right). In the lower-left panel, we show parameter paths for the first step estimates using the SGLasso. We see that the within-cluster Lasso has paths close to those in the upper-left panel. The parameter paths for the second step SGLasso (lower-right panel) are similar to those in the upper-right panel, with simpler structures due to the reduced number of covariates. The SGLasso selects (z1, z2, z4, z5, z6) with nonzero estimates, while the Lasso selects the covariates (z1,..., z6, z8), and the GLasso selects all covariates.