Dimension reduction and variable selection play a critical role in the analysis of contemporary high-dimensional data. The semi-parametric multi-index model often serves as a reasonable model for analysis of such high-dimensional data. The sliced inverse regression (SIR) method, which can be formulated as a generalized eigenvalue decomposition problem, offers a model-free estimation approach for the indices in the semi-parametric multi-index model. Obtaining sparse estimates of the eigenvectors that constitute the basis matrix that is used to construct the indices is desirable to facilitate variable selection, which in turn facilitates interpretability and model parsimony.

Results

To this end, we propose a group-Dantzig selector type formulation that induces row-sparsity to the sliced inverse regression dimension reduction vectors. Extensive simulation studies are carried out to assess the performance of the proposed method, and compare it with other state of the art methods in the literature.

Conclusion

The proposed method is shown to yield competitive estimation, prediction, and variable selection performance. Three real data applications, including a metabolomics depression study, are presented to demonstrate the method’s effectiveness in practice.

High-throughput technologies such as microarray and next-generation sequencing have been widely applied in biomedical research to monitor genome-wide DNA, RNA and epigenetic molecular activities and to detect disease-associated events or biomarkers. With decrease in experimental costs over the years, tremendous amounts of data have been generated and accumulated in public data depositories in the past two decades (e.g. Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) from NCBI, ArrayExpress from EBI, and the NIH Metabolomics Workbench data repository). Perhaps due to limitation of clinical tissue access, individual labs usually generate omics datasets with small to moderate sample sizes (e.g. \(n = 40 - 1000\)). Statistical power and reproducibility of studies using such large-p-small-n data has long been a concern. Dimension reduction (feature extraction) and variable selection (feature selection) play a crucial role for down-stream pattern recognition, classification and clustering, with such high-dimensional data. In this article, we propose dimension reduction and variable selection for semi-parametric models in the high-dimensional setting.

Let Y denote the response variable and \({{\mathbf{X}}}\) denote the p-dimensional predictor vector. Consider the semi-parametric multi-index model

$$\begin{aligned} Y = g (\mathbf{v }^{\top }_1{{\mathbf{X}}},\ldots ,\mathbf{v }^{\top }_d{{\mathbf{X}}},\varepsilon ), \end{aligned}$$

(1)

where \(\mathbf{v }_1,\ldots ,\mathbf{v }_d\) are unknown p-dimensional linearly independent column vectors, \(d \ll p\), g is an unknown link function, \(\varepsilon\) is a random error term with an arbitrary and unknown distribution, and \(\varepsilon \perp \!\!\!\perp {{\mathbf{X}}}\), where \(\perp \!\!\!\perp\) denotes statistical independence. Under model (1), the response Y depends on the p-dimensional predictor vector \({{\mathbf{X}}}\) only through the d linear combinations \(\mathbf{v }^{\top }_1{{\mathbf{X}}},\ldots ,\mathbf{v }^{\top }_d{{\mathbf{X}}}\). Consequently, (1) is often expressed as \(Y \perp \!\!\!\perp {{\mathbf{X}}}| (\mathbf{v }^{\top }_1{{\mathbf{X}}},\ldots ,\mathbf{v }^{\top }_d{{\mathbf{X}}})\). If we knew \(\mathbf{v }_1,\ldots ,\mathbf{v }_d\), since \(d \ll p\), estimation of the unknown link function g can be facilitated with the aid of any flexible nonparametric method. Therefore, the focus of dimension reduction methods via (1) is on estimating the vectors \(\mathbf{v }_1,\ldots ,\mathbf{v }_d\) without any prior knowledge or assumption on g. The subspace span\((\mathbf{v }_1,\ldots ,\mathbf{v }_d)\) is called the central subspace and denoted by \({\mathcal {S}}_{Y|{{\mathbf{X}}}}\), where d is the smallest dimension such that (1) holds.

In the past two decades, a number of sufficient dimension reduction methods have been proposed to estimate \(\mathbf{v }_1,\ldots ,\mathbf{v }_d\) under (1). The most widely used methods are perhaps the sliced inverse regression (SIR) by [1], and the sliced average variance estimation (SAVE) by [2]. Let \({\varvec{\Sigma }}= \text{cov}({{\mathbf{X}}})\), and denote \(\mathbf{V } = (\mathbf{v }_1,\ldots ,\mathbf{v }_d) \in {\mathbb {R}}^{p \times d}\). In a pioneering paper, [1] established that the centered inverse regression, \({\mathbb {E}}({{\mathbf{X}}}|Y)-{\mathbb {E}}({{\mathbf{X}}})\), is contained in the linear subspace of \({\mathbb {R}}^p\) spanned by the vectors \({\varvec{\Sigma }}\mathbf{v }_1,\ldots ,{\varvec{\Sigma }}\mathbf{v }_d\), under model (1) and the linear conditional mean assumption \(\big (\) i.e. \({\mathbb {E}}\big ({{\mathbf{X}}}|{\mathbf{V}}^{\top }{{\mathbf{X}}}\big )\) is a linear function of \(\mathbf{V }^{\top }{{\mathbf{X}}}\)\(\big )\). A direct consequence of this result is that the kernel matrix\({\mathbf{M}} = \text{cov} [ {\mathbb {E}}({{\mathbf{X}}}|Y)-{\mathbb {E}}({{\mathbf{X}}})]\) is degenerate in any direction \({\varvec{\Sigma }}\)-orthogonal to the \(\mathbf{v }_j\)’s. Therefore, the eigenvectors corresponding to the d nonzero eigenvalues of \({\varvec{\Sigma }}^{-1} \mathbf{M }\) can be used as the estimators of \(\mathbf{v }_1,\ldots ,\mathbf{v }_d\). A drawback of the sliced inverse regression is, since it only exploits the inverse first moment, \({\mathbb {E}}({{\mathbf{X}}}|Y)\), it yields degenerate directions if the model is symmetric about zero. To this end, [2] proposed the sliced average variance estimation that incorporates information in the inverse second moment. Since then, there have been a number of other proposals that exploit the moments of \({{\mathbf{X}}}|Y\). The recurring theme of these inverse regression methods is to construct a method-specific kernel matrixM degenerate in any direction \({\varvec{\Sigma }}\)-orthogonal to the \(\mathbf{v }_j\)’s that span \({\mathcal {S}}_{Y|{{\mathbf{X}}}}\).

The aforementioned inverse regression methods for sufficient dimension reduction (SDR) yield the d-dimensional sufficient predictors, \(\mathbf{v }^{\top }_1{{\mathbf{X}}},\ldots ,\mathbf{v }^{\top }_d{{\mathbf{X}}}\), that are linear combinations of all the original predictors. As a consequence, no variable selection is achieved. Hence, the results can be hard to interpret and the important variables may be difficult to identify. In addition, the estimation and prediction efficiency gain may be less than that possible with variable selection. To this end, methods that perform simultaneous dimension reduction and variable selection to construct a few sufficient predictors that are linear combinations of only the important original predictors have been proposed. An incomplete list of such methods include: the shrinkage sliced inverse regression introduced by [3]; the sparse sufficient dimension reduction method due to [4]; and the general shrinkage strategy for sparse inverse regression estimation proposed by [5]. A common limitation of these methods is that the variable selection procedure used is coordinate-dependent, in the sense that they introduce element-wise (coordinate-wise) sparsity as opposed to row-wise (predictor) sparsity on V. The element-wise sparsity approach is not desirable because we would like to deem a predictor unimportant based on its contribution across all dimension reduction vectors simultaneously. To address this problem, [6] proposed a coordinate-independent sparse estimation (CISE) method to obtain row-wise sparse estimates using the inverse regression methods. Due to the row-sparsity, CISE solutions are also orthogonal transformation invariant. That is, the estimated zero rows on V will also be estimated as zero even if the dimension reduction subspace was represented by VO, where O is any \(d \times d\) orthogonal matrix. This is an attractive property since \(\mathbf{v }_1,\ldots ,\mathbf{v }_d\) are often not unique, but the subspace spanned by these vectors is unique. However, although these methods perform variable selection, they are not applicable to the high-dimensional setting.

To this end, a number of recent papers have proposed sparse sufficient dimension reduction methods for the high-dimensional setting. These methods can be categorized into three approaches. The first category employs different types of regularizations to develop shrinkage based methods for simultaneous dimension reduction and variable selection. For instance, [7] proposed sparse ridge sliced inverse regression by introducing \(\ell _1\) and \(\ell _2\)-regularization to the least squares formulation of sliced inverse regression [8] to achieve dimension reduction and variable selection simultaneously. Yu et al. [9] showed estimation consistency by adopting the Dantzig selector to solve the generalized eigenvalue problem formulation of SIR under sparse covariance assumptions. But their estimation follows the sequential estimation approach which yields coordinate-dependent sparse estimates. Wang et al. [10] re-cast SIR into a “pseudo” sparse reduced-rank regression problem and showed consistency in central subspace estimation. By constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix, [11] introduced the Lasso-SIR method to obtain sparse estimates of the SIR dimension reduction directions. More recently, [12] proposed a convex formulation for sparse SIR, and [13] proposed the sparse minimum discrepancy approach for simultaneous dimension reduction and variable selection that incorporates SIR, with extension to SAVE and the principal fitted components (PFC; [14, 15]). The second approach for high-dimensional SDR is the sequential SDR framework proposed by [16] and [17]. This framework yields simulataneous dimension reduction and variable selection via a sequential process that allows for \(p > n\). It incorporates well-established SDR methods and has shown successful applications in high-dimensional data analysis. The third approach includes the thresholding-type procedures. The thresholding techniques have shown important applications for variable screening purposes in, for example, [18]. A promising diagonal thresholding screening SIR algorithm [19] was developed for sparse predictor covariance scenarios and the estimation consistency was established under high-dimensional setting. However, it does not yield sparse central subspace for variable selection. A concise review of the sparse sufficient dimension reduction literature can be found in [20].

Our work in this paper contributes to this body of literature by studying a convex formulation that produces simultaneous dimension reduction and variable selection. Our formulation can be interpreted as a version of a group Dantzig selector and falls under the first category of regularization based methods for high-dimensional sparse SDR. We minimize the sum of the block-\(\ell _1\)-norm of the row vectors that span the central subspace. Due to the row-sparse nature of the resulting estimator, our formulation leads to coordinate-independent sparse estimates - in the sense that the predictors selected by our method are independent of the basis matrix used to represent the central subspace. This is attractive as, often times, the central subspace is unique but the basis vectors that span it are not. Our proposed formulation is convex, and thus can be implemented using well established solvers such as the CVX toolbox in MATLAB. We provide readily available MATLAB codes for practitioners to use the proposed method.

Our work closely relates to the Lasso-SIR of [11] in the way it constructs artificial response matrix, but unlike our method, the estimated directions obtained by the Lasso-SIR [11] are not coordinate independent as the directions are estimated separately. Our work also relates to the convex formulation for sparse SIR of [12], in that both methods are based on convex optimization. However, the objective function in [12] is optimized over the \(p \times p\) projection matrix, \(\mathbf{VV} ^\top\), while our objective function is optimized over the \(p \times d\) direction matrix V, \(d \ll p\). If the number of predictors, p, is large, the method proposed in [12] is likely to be computationally more expensive than our method.

As is the case with most SDR methods, the sparse estimation method we propose in this paper can be used for regression (i.e. quantitative response) setting and classification problems (i.e. categorical response). Under model (1), [21] pointed out that linear discriminant analysis (LDA) is equivalent to SIR in the population. Cook and Yin [22] also further established the relationship between LDA and SIR, as well as quadratic discriminant analysis (QDA) and SAVE, and presented applications of (1) in discriminant analysis. These close connections also motivated us to explore the competitiveness of empirical results using our method with existing state-of-the-art generalizations of LDA for high-dimensional and multi-class classification problems.

Organization. The rest of the article is organized as follows. In “Method” section we describe the proposed method, and discuss its implementation. In “Continuous response” section we conduct extensive simulation studies to assess the performance of the proposed estimator and compare to other estimators in the literature. We apply our method to three omics datasets and demonstrate its use in practice in “Categorical response” section. We offer brief discussion in “Summary and conclusion” section.

Notations. For a vector \(\mathbf{v } \in {\mathbb {R}}^p\), we define \(\Vert \mathbf{v }\Vert _\infty = \max _{i=1,\ldots ,p}|v_i|\), \(\Vert \mathbf{v }\Vert _1 = \sum _{i=1}^p |v_i|\), and \(\Vert \mathbf{v }\Vert _2 = \sqrt{\sum _{i=1}^p v_i^2}\). For a matrix \(\mathbf{M } \in {\mathbb {R}}^{n \times p}\) we define \(\mathbf{m }_i\) to be its ith row, \(\mathbf{m }_j\) to be its jth column, \(\Vert \mathbf{M }\Vert _\infty = \max _{i=1,\ldots ,n}\Vert \mathbf{m }_i\Vert _1\), \(\Vert \mathbf{M }\Vert _{\infty ,2} = \max _{i=1,\ldots ,n}\Vert \mathbf{m }_i\Vert _2\), and \(\Vert \mathbf{M }\Vert _{\text{F}} = \sqrt{ \sum _{i=1}^n \sum _{j=1}^p m_{ij}^2}\).

Method

As stated in the background section, let \({\mathcal {S}}_{Y|{{\mathbf{X}}}}\) denote the central subspace, and let \({\varvec{\Sigma }}= \text{cov}({{\mathbf{X}}})\). Define a population seed as any matrix \({\varvec{\Delta }}\) such that \(\text{span}({\varvec{\Delta }}) \subseteq {\varvec{\Sigma }}{\mathcal {S}}_{Y|{{\mathbf{X}}}}\) and possibly \(\text{span}({\varvec{\Delta }}) = {\varvec{\Sigma }}{\mathcal {S}}_{Y|{{\mathbf{X}}}}\). Here, we assume the coverage condition, \(\text{span}({\varvec{\Delta }}) = {\varvec{\Sigma }}{\mathcal {S}}_{Y|{{\mathbf{X}}}}\), to hold. The coverage condition is commonly made in sufficient dimension reduction literature, and may be reasonable in many applications, see Cook and Ni [23] for discussion. Thus, if \({\varvec{\Sigma }}\) is invertible, a seed matrix can be used to obtain a matrix \(\mathbf{V }\) whose columns span the central subspace by setting \(\mathbf{V } = {\varvec{\Sigma }}^{-1}{\varvec{\Delta }}\). For example, for the ordinary least squares method, the \(p \times 1\) covariance vector \({\varvec{\Delta }}= \text{cov}({{\mathbf{X}}}, Y)\) is the seed vector, and the central subspace can be obtained as the span of the least squares vector \(\mathbf{V } = {\varvec{\Sigma }}^{-1}{\varvec{\Delta }}\), if \(d=1\).

Let \(\mathbf{V } \in {\mathbb {R}}^{p \times d}\) be a matrix such that span(V) = \({\mathcal {S}}_{Y|{{\mathbf{X}}}}\). In his pioneering sliced inverse regression estimation paper, under the linear conditional mean assumption (i.e. \({\mathbb {E}}({{\mathbf{X}}}|\mathbf{V }^{\top }{{\mathbf{X}}})\) is a linear function of the d-dimensional random vector \(\mathbf{V }^{\top }{{\mathbf{X}}}\)), [1] showed that

for all y. The conditioning in (2) cannot be performed in practice unless Y is discrete, and standard practice with a continuous response is first to partition Y into H slices, indexed by \(h = 1,\ldots , H\), and then average (2) over the values of Y in a slice [1]. This yields, \(\psi _h \equiv {\mathbb {E}}\{{{\mathbf{X}}}|J_h(Y)=1\}-{\mathbb {E}}({{\mathbf{X}}}) \in {\varvec{\Sigma }}{\mathcal {S}}_{Y|{{\mathbf{X}}}}, h=1,\ldots ,H,\) where \(J_h(Y)=1\) if Y is in slice h and \(J_h(Y)=0\) otherwise. When the response is categorical, H is set to be the number of categories by construction, and when the response is continuous, H must satisfy \(H \ge d\). Let \({\varvec{\Psi }}\) to be the \(p \times H\) matrix \({\varvec{\Psi }}= (\psi _1,\ldots ,\psi _H)\). Then, it follows that \({\varvec{\Psi }}\) qualifies as a seed matrix. Furthermore, it follows that the SIR kernel matrix \(\mathbf{M } = \text{cov}[{\mathbb {E}}({{\mathbf{X}}}|Y)-{\mathbb {E}}({{\mathbf{X}}})] = {\varvec{\Psi }}{\varvec{\Psi }}^{\top }\) qualifies as a seed matrix. Consequently, the sliced inverse regression estimation can be formulated as a generalized eigenvalue problem,

where \(\lambda _j\)’s are the eigenvalues of \({\varvec{\Sigma }}^{-1}\mathbf{M }\), and \(\mathbf{v }_j\)’s are the corresponding eigenvectors. The d eigenvectors, corresponding to the d nonzero eigenvalues span the central subspace.

Let \(\big \{ \mathbf{x }_i^{\top },y_i \big \}_{i=1}^n\) denote an available n iid samples, \(\widehat{\mathbf{M }}\) and \({\widehat{{\varvec{\Sigma }}}}\) be the sample estimates of M and \({\varvec{\Sigma }}\), respectively. That is, with

where \(\mathbf{v }_i\), \(i=1,\ldots ,p\), are the rows of V, \({\tilde{\Lambda }}\) is a diagonal matrix with the d nonzero eigenvalues of \({\widehat{{\varvec{\Sigma }}}}^{-1/2}\widehat{\mathbf{M }} {\widehat{{\varvec{\Sigma }}}}^{-1/2}\), \(\tilde{\mathbf{V }}\) is a \(p \times d\) matrix of the corresponding non-sparse eigenvectors, and \(\tau > 0\) is a tuning parameter that controls the row-sparsity of \({\widehat{\mathbf{V }}}\). As \(\tau\) increases, it leads to solutions \({\widehat{\mathbf{V }}} ({\tau })\) with more row-sparsity. Note again that the target basis matrix \(\mathbf{V }\) is \(p \times d\), and the \(\Vert \mathbf{v }_i\Vert _2\)’s in the objective function are defined row-wise, aggregated over the p rows corresponding to the p predictors. We could think of (4) as a group dantzig type formulation [24], where the group refers to a predictor’s contribution to the d dimension reduction vectors, and that the objective function is defined as a block-\(\ell _1\)-norm of the row vectors. The solution to (4) will not necessarily yield an orthogonal basis matrix \({\widehat{\mathbf{V }}}\). Nevertheless, we can obtain a sparse basis matrix via a Gram-Schmidt orthogonalization of the final estimate. The objective function is independent of the basis used to represent the span of \(\mathbf{V }\), since for any orthogonal matrix \(\mathbf{O }\), \(\psi (\mathbf{V }) = \psi (\mathbf{VO} )\), where \(\psi (\mathbf{V }) = \sum _{i=1}^p \Vert \mathbf{v }_i\Vert _2\), and the non-zero rows in V and VO are the same.

In the classical \(p < n\) regime, we can obtain \(\tilde{\mathbf{V }}\) by doing a singular value decomposition on \({\widehat{{\varvec{\Sigma }}}}^{-1/2}\widehat{\mathbf{M }} {\widehat{{\varvec{\Sigma }}}}^{-1/2}\). However, in the high dimensional setting, \(p > n\), it is important for the performance of our proposed method via (4) that the estimate \(\tilde{\mathbf{V }}\) be reasonable. If M is a \(p \times p\) nonnegative definite matrix with rank(M) = \(d \le p\), and \({\varvec{\Sigma }}\) is a \(p \times p\) positive definite matrix, the true V satisfies \(\mathbf{MV} = {\varvec{\Sigma }}\mathbf{V }\Lambda\). In the high dimensional setting, we assume V is s-sparse for some fixed s, and denote \(supp(\mathbf{V }) = F = \{i: \Vert \mathbf{v }_i\Vert _2 \ne 0, i = 1,\ldots ,p\}\), where \(|F| = s\) represent the number of relevant predictors. Let \(\widehat{\mathbf{M }}\) and \({\widehat{{\varvec{\Sigma }}}}\) be sample estimates of M and \({\varvec{\Sigma }}\) that preserve the same definiteness as their population counterparts. Since \(\widehat{\mathbf{M }}\tilde{\mathbf{V }} = {\widehat{{\varvec{\Sigma }}}}\tilde{\mathbf{V }}{\tilde{\Lambda }}\), we can write the constraint in (4) as \(\big \Vert {\widehat{{\varvec{\Sigma }}}}\tilde{\mathbf{V }}-{\widehat{{\varvec{\Sigma }}}}\mathbf{V } \big \Vert _\infty \le \tau .\) In the high dimensional setting, we assume that \(\tau = O\big (s\sqrt{\log p/n}\big )\), and thus \(\tilde{\mathbf{V }}\) should satisfy \(\big \Vert {\widehat{{\varvec{\Sigma }}}}\tilde{\mathbf{V }}-{\widehat{{\varvec{\Sigma }}}}\mathbf{V } \big \Vert _\infty \le \big (s\sqrt{\log p/n}\big ).\) In the next paragraph, we discuss an approach for obtaining \({\varvec{\Sigma }}^{-1}\) that yields \(\tilde{\mathbf{V }}\).

Estimation of\({\varvec{\Sigma }}^{-1}\): When p is greater than n, \({\widehat{{\varvec{\Sigma }}}}\) is no longer invertible even when \({\varvec{\Sigma }}\) is nonsingular, and it is not possible to get a reasonable estimate \({\tilde{{{\mathbf{V}}}}}\). Therefore, we need a good estimate of \({\varvec{\Sigma }}^{-1}\) for \(n < p\). Estimation of \({\varvec{\Sigma }}^{-1}\) has been extensively studied in the literature. In this work, we will simply adopt the constrained \(\ell _1\) minimization for inverse covariance matrix estimation (CLIME) method proposed by [25]. Denote \({\varvec{\Omega }}:= {\varvec{\Sigma }}^{-1}\). Given a tuning parameter \(\lambda _{1n}\), the CLIME based estimate \({\widehat{{\varvec{\Omega }}}}\) is a solution to the following optimization problem:

The above solution \({\widehat{{\varvec{\Omega }}}}\) is not symmetric in general. To obtain a symmetric estimate, the CLIME estimator \({\widehat{{\varvec{\Omega }}}}_s\) is defined as \({\widehat{{\varvec{\Omega }}}} := ({\widehat{{\varvec{\Omega }}}}_{s,k,l})\) where

In other words, we take the one with smaller magnitude between \({\widehat{{\varvec{\Omega }}}}_{k,l}\) and \({\widehat{{\varvec{\Omega }}}}_{l,k}\). The resulting estimate \({\widehat{{\varvec{\Omega }}}}_s\) is symmetric and, more importantly, positive definite with high probability. By assuming that the covariates have exponential type tails and \(\lambda _{1n} = C_1\sqrt{\log p/n}\) for some generic constant \(C_1\), [25] show that

for \(0 \le q < 1\). Note that the special case of \(q=0\) is a class of \(s_1\)-sparse matrices.

Given an estimate \({\widehat{{\varvec{\Sigma }}}}^{-1}\), we implement the optimization problem in (4) using the CVX, an efficient MATLAB package for specifying and solving convex optimization problems [26, 27].

Selection of the tuning parameter

The tuning parameter \(\tau\) in (4) controls the level of sparsity and needs to be selected. Note that when \(\tau >\Vert {\widehat{{\varvec{\Sigma }}}}^{-1/2}\widehat{\mathbf{M }}{\widehat{{\varvec{\Sigma }}}}^{-1/2} \tilde{\mathbf{V }}\Vert _\infty\), the optimization problem (4) yields a trivial solution, giving us an upper bound for \(\tau\). We choose the optimal \(\tau\) from the range \(\big (0,\Vert {\widehat{{\varvec{\Sigma }}}}^{-1/2}\widehat{\mathbf{M }}{\widehat{{\varvec{\Sigma }}}}^{-1/2} \tilde{\mathbf{V }}\Vert _\infty \big )\) using K-fold cross validation (CV). More specifically, for the categorical response case, we randomly group the observations of \(\mathbf{X }\) into K roughly equal-sized groups, denoted as \(\mathbf{X }^{1},\ldots ,\mathbf{X }^{K}\). For each \(k=1,\ldots ,K\), let \(\mathbf{X }^{-k}\) be the input data matrix leaving out \(\mathbf{X }^{k}\). Let \(\mathbf{y }^{k}\) and \(\mathbf{y }^{-k}\) be the corresponding response vectors. We apply the proposed methods on \(\mathbf{X }^{-k}\) to derive basis matrices \(\widehat{\mathbf{V }}^{-k}_{d}(\tau ), d={\textit{rank}}(\widehat{\mathbf{M }})\), and the data \(\mathbf{X }^{k}\) are then projected onto \(\widehat{\mathbf{V }}^{-k}_{d}(\tau )\) to obtain discriminant scores \(\mathbf{U }_{d}(\tau _{n})= \mathbf{X }^{k}\widehat{\mathbf{V }}^{-k}_{d}(\tau )\), and classification of \(\mathbf{X }^{k}\) is performed using the nearest centroid method to obtain predicted response \(\mathbf{y }^{k}_{pred}(\tau )\). We calculate the K-fold CV misclassification rate as

where \(n_{k}\) is the number of observations in \(\mathbf{X }^{k}\). We do this for each \(\tau\) and select the optimal tuning parameter as \(\tau _{opt}=\min _{\tau _{n}}\{CV(\tau )\}\).

For the continuous response case, we adopted the following information criteria method suggested in [9]. Define the average squared residuals as

where \(\widehat{\mathbf{V }}(\tau )\) is the estimate of V obtained from (4) with a given \(\tau\). Denote the number of nonzero rows of \(\widehat{\mathbf{V }}(\tau )\) by \(s(\tau )\). We select optimal tuning parameter as

In this section, we conduct extensive simulations to assess the performance of the proposed method and compare it with other competing methods in the literature. We consider both continuous and categorical response cases.

Continuous response

Here we simulate models with continuous response variable. We assess estimation accuracy and variable selection selection accuracy. To assess variable selection performance, we report the true positive rate (TPR) and false positive rate (FPR). TPR is the proportion of truly important variables with estimated nonzero corresponding rows, and FPR is the proportion of unimportant variables with estimated nonzero corresponding rows. A method that does well in variable selection will have a TPR close to one, and an FPR close to zero. For estimation and prediction performance, we report the correlation coefficient between the true sufficient predictor (\(\mathbf{v }^{\top }{{\mathbf{X}}})\) and estimated sufficient predictor (\(\widehat{\mathbf{v }}^{\top }{{\mathbf{X}}})\). For the two-dimensional Model (9), we report the average of these two correlation coefficients.

We simulate data using the three regression models below, adopted from [12]. We adopt the following simulation settings from [12]. We generate the predictor vector \({{\mathbf{X}}}\) from \(N(0,{\varvec{\Sigma }})\), where \({\varvec{\Sigma }}_{ij} = 0.5^{|i-j|}\) and \(\varepsilon \sim N(0,1)\). We compare the performance of our proposed method with the performance of [10, 12, 16] and [7]. A linear regression model with three active predictors:

where the central subspace is spanned by the directions \(\mathbf{v }_1 = (1,1,1,0_{p-3})^{\top }\), and \(\mathbf{v }_2 = (0,0,0,1,1,0_{p-5})^{\top }\), and \(d=2\). This model form has been used extensively in the sufficient dimension reduction literature, see for instance [1].

Summary of Simulation Results

Table 1 presents the results for Models (7)–(9). The results show that our proposed method performs very competitively against recent proposals for sparse sliced inverse regression for high-dimensional data [7, 10, 12, 16]. More specifically, we make the following observations. In the classical setting (\(n = 200\), \(p=150\)), while all the methods yield very good results, our method yields the best results, followed by [12]. In the high-dimensional setting (\(n = 100\), \(p=150\)), for the single-index models (Models (7) and (8)), our method yields the best results in terms of TPR and correlation. In terms of FPR, [10] yields the best results. However, our method also yields reasonable FPR values, comparable with the rest of the methods. Model (9) is multi-index model and the performances of all the methods are inferior to their performance in the single-index case. Nevertheless, our method still yields competitive results with the other methods, with the exception of the [16] methods that performs very well in terms of variable selection, but struggles overall in terms of the correlation. However, the reported results for [16] are the best results after considering multiple tuning parameters, unlike the other methods use data adaptive tuning parameter selection approach. In summary, our proposed method yields the best overall performance across the three Models and two settings (classical and high-dimensional setting).

Categorical response

Here we conduct simulations for categorical response. We assess estimation accuracy of the central (discriminant) subspace, prediction accuracy after dimension reduction, and variable selection accuracy. We assess estimation accuracy using the Frobenius norm of the difference between the projection matrices of the true and estimated discriminant subspaces. More specifically, let \(\widehat{\mathbf{V }}\) denote the estimate of \(\mathbf{V }\). We measure closeness of \(\widehat{\mathbf{V }}\) to \(\mathbf{V }\) by

where \(\Vert .\Vert _{\text{F}}\) denotes the Frobenius norm. Smaller values of this distance metric indicate a better estimate. We report TPR and FPR values for variable selection performance assessment. To assess the prediction performance, we report the generalization (test) misclassification rate (MSR).

We compare the performances of our method with three competing methods. The first competing method is the MGSDA by [28]. Like our proposed methods, MGSDA yields row-sparse linear discriminant analysis vectors for a multi-class classification problem. We implement MGSDA using the MGSDA package in R. The second competing method is the multi-class sparse discriminant analysis (MSDA) method by [29]. This method also imposes row-sparsity. For the binary classification case, MSDA reduces to the linear programming discriminant analysis (LPD) method by [30]. We implement MSDA using the MSDA package in R. The third competing method is the penalized linear discriminant analysis (PLDA) of [31]. Unlike the other methods mentioned above, for the multi-class setting, PLDA estimates the discriminant vectors in a sequential fashion starting with the first discriminant vector \(\mathbf{v }_1\), with subsequent \(\mathbf{v }_j\) found subject to orthogonality constraints. Sparsity is achieved by imposing the \(\ell _1\)-norm penalty to each of the vectors. Therefore, PLDA yields sparse estimates that are not necessarily coordinate-independent, and generally this method selects more predictors. We implement PLDA using the penalizedLDA package in R. The corresponding optimal tuning parameters for all the methods are chosen via fivefold cross-validation to minimize test misclassification rate. In the sequel, CISESIR, and CISELDA represent our proposed method with the SIR and LDA matrices, respectively. We simulate three models as follows.

Model 1

We simulate a three class classification problem. The input matrix \({{\mathbf{X}}}\in {\mathbb {R}}^{p \times n}=[{{\mathbf{X}}}_{1}, {{\mathbf{X}}}_{2},{{\mathbf{X}}}_{3}]\) with the true covariance matrix is

where \({\tilde{{\varvec{\Sigma }}}}\) is the covariance structure for signal variables, which we take to be \({\tilde{{\varvec{\Sigma }}}} = \rho \mathbf{J } + (1-\rho )\mathbf{I }\), I is the identity matrix and \(\mathbf{J }\) is a matrix with all entries equal to one. We set \(n_{k}=30\), for a total of 90 observations, and generate \({{\mathbf{X}}}_k\) from \(\text{N}({\varvec{\mu }}_k,{\varvec{\Sigma }})\), where we take \({\varvec{\mu }}_1 = \mathbf{0 }\), \({\varvec{\mu }}_2 = (1,\ldots ,1,0,\ldots ,0)\) with only the first ten entries nonzero, and \({\varvec{\mu }}_3 = (0,\ldots ,0,-2,\ldots ,-2,0,\ldots ,0)\) with entries 11–20 nonzero. The true discriminant vectors \(\mathbf{v }_1\) and \(\mathbf{v }_2\) are the eigenvectors of \({\varvec{\Sigma }}^{-1}\mathbf{M }\) corresponding to its two nonzero eigenvalues, where \(\mathbf{M }\) is the true between-class covariance matrix. We report results for \(p = 50, 500, 1000\). The number of signal variables in this model is \(s=20\), which is the number of nonzero rows in the discriminant space: \(\mathbf{V } = (\mathbf{v }_1, \mathbf{v }_2)\).

Model 2

In model 1 we simulated a case where the within class covariances are the same across the three classes. In this model, we consider the scenario where the classes differ not only through their means, but also their covariances. The covariance matrices for the three classes are given as follows: for class 1, the covariance matrix has the same form as in model 1 with \(\rho =0.9\); for class 2, the covariance matrix has entries \({\varvec{\Sigma }}_{ij} = 0.5^{|i-j|}\); for class 3, the covariance matrix is the identity matrix, \(\mathbf{I }_p\).

Model 3

In models 1 and 2, we simulated data from the inverse regression setup, \({{\mathbf{X}}}| Y\). In this model, we simulate data from forward regression \(Y|{{\mathbf{X}}}\). More specifically, we simulate \({{\mathbf{X}}}\sim N(\mathbf{0 },{\varvec{\Sigma }})\) and generate y using the logistic regression model:

$$\begin{aligned} Y \sim {\textit{Bernoulli}}(p) \quad {\text{where}} \quad p = \frac{e^{\mathbf{v }^{\top }{{\mathbf{X}}}}}{1 + e^{\mathbf{v }^{\top }{{\mathbf{X}}}}}. \end{aligned}$$

(11)

We keep the covariance matrix structure to be the same as in model 1, and we generate the nonzero coefficients of \(\mathbf{v }\) from U(0.8, 1). The number of nonzero coefficients are set to 10 in this example. In both models 1 and 2, the discriminant subspace of interest was two dimensional. However, in this model, it is one dimensional, the space spanned by \(\mathbf{v }\). Note also that while models 1 and 2 are a three-group problem, model 3 is a binary classification problem.

Summary of Simulation Results:The results for model 1 are reported in Table 2. For this model, we observe that the classification performances of all the methods are comparable. In terms of variable selection, we see that PLDA has the highest TPR across all settings, with MGSDA and MSDA yielding the lowest TPRs. Both our methods, CISESIR and CISELDA, yield comparable variable selection performance, with CISELDA performing slightly better in TPR and CISESIR performing slightly better in FPR. Both our methods also perform better than the competing methods when the correlation structure among the predictors is stronger \((\rho =0.9)\). More specifically, when \(\rho =0.9\), MGSDA and MSDA suffer and yield poor results, and PLDA generally selects more variables and yields higher FPR. Notice also that the performances of our methods improve with increase in p. Overall, we observe that our methods yield competitive estimation, classification, and variable selection performances, and generally yield lower FPRs. The results for model 2 are reported in Table 3. For this model, CISELDA and PLDA are the best performers in terms of classification accuracy, while MSDA, CISESIR and CISELDA are the best performers in terms of variable selection. Again, MGSDA performs the worst among all the methods. Notice that the setting for this model is such that the class-level predictor covariance matrices are different. These results show that our methods are also robust to the LDA assumption of constant within-group covariance matrix. Table 4 reports the results for model 3. The results for this model are similar to the results for model 1, confirming that the performance of our methods in binary classification problem resembles their performances in the three class problem (Table 2).

In summary, our simulation results show that our methods (CISESIR and CISELDA) yield competitive estimation, classification and variable selection performance. Our methods are among the best performing in terms of FPR in all settings. PLSD generally yields the highest TPR and FPR values because it selects more variables. This is not surprising since PLSD induces penalties to each of the dimension reduction (discriminant) vectors separately. Moreover, CISESIR and CISELDA yield the best overall variable selection performance in model 2 where the within group predictor covariance structures differ.

Applications

Depression study

Metabolomics data on major depressive disorder (MDD) were obtained from the Metabolomics Workbench (see Data Availability Statement). In the original study, human cerebrospinal fluid and plasma samples were collected from patients diagnosed with MDD and control subjects matched on age and gender, and an untargeted metabolomics profiling was conducted on these samples. There were 158 metabolites on \(n=48\) control patients and \(n=46\) patients diagnosed with depression. Our goal in this study is to apply the proposed and existing competing methods to identify metabolites that optimally discriminate patients with MDD from patients without MDD.

We pre-process the data by eliminating metabolites with coefficient of variation greater than 50%—which leaves us with 103 metabolites. As is commonly done in metabolomics data analysis, we log2 transform each feature, and normalize each feature to have mean 0 and variance 1. Then, we randomly split the dataset into two-third training set and one-third test set. A stratified sampling scheme is applied to preserve the original proportions of samples in each group. We select the optimal tuning parameter that minimizes the average misclassification rate using fivefold cross-validation on the training set. We then apply the methods with the selected optimal tuning parameters to the test set to obtain an estimate of the generalization misclassification rate. We repeat the foregoing analysis scheme 50 times and obtain average misclassification rates and number of variables selected.

The average test misclassification rates, sensitivities, and specificities, along with their standard errors, obtained from the 50 splits are reported in Table 5. We see that the average test error for the proposed methods are comparable to that of MGSDA and MSDA, but are better than that of PLDA. MGSDA identifies fewer predictors, which agrees with the simulation results where it had high specificities and low sensitivities. For differentiating MDD patients from healthy controls, the proposed methods showed high sensitivity (CISELDA: 80.37, CISESIR: 84.62) and moderate to high specificities (CISELDA: 83.20, CISESIR: 72.27). Of the competing methods, PLDA had low specificity. Of note, these sensitivies and specificities were obtained comparing the observed class from the test data with the predicted class obtained using nearest centroid, and averaging over 50 data splits.

We also investigate the metabolites identified by the proposed methods and how they may relate to depression. Here, we consider only metabolites that are selected all 50 times, a potential indication of the ability of these metabolites to contribute most to the differentiation of subjects with and without depression symptoms. When this was used, 14 metabolites (2-hydroxyglutaric acid, 2-hydroxyvaleric acid, asparagine, creatinine, dodecanol, gluconic acid, glutamine, hydroxylamine, itaconic acid, lactamide, lysine, malic acid, palmitic acid, and p-cresol) were selected by CISELDA for the separation between MDD and healthy controls, 12 metabolites (2-hydroxyglutaric acid, 2-hydroxyvaleric acid, asparagine, creatinine, dodecanol, glutamine, hydroxylamine, itaconic acid, lactamide, malic acid, palmitic acid, and p-cresol) where selected by CISESIR, 8 metabolites (asparagine creatinine, dodecanol, glutamine, lactamide, malic acid, palmitic acid, and p-cresol ) were selected by MGSDA, 10 metabolites (asparagine, creatinine, dodecanol, fructose, glutamine, hydroxylamine, lactamide, oxoproline, palmitic acid, and p-cresol) were selected by MSDA, and 90 meatabolites were selected by PLDA, of which CISELDA is a subset. Note that the 8 metabolites identified by MGSDA are subsets of CISESIR and CISESLDA. We report the log2 transformed intensity data for the metabolites identified by CISELDA for patients with depression symptoms compared to patients with no depression symptoms (right panel of Fig. 1).

Some of the metabolite biomarker candidates identified by our methods have been suggested to be depression-related compounds. For example, glutamine, which was significantly reduced for patients in our data - confirming other studies reporting that depressed patients had reduced levels of glutamine/glutamate [32], is suggested to be implicated in the pathophysiologic mechanisms of MDD [33, 34].

We also conduct pathway enrichment analaysis using MetaboAnalyst 3.3 for possible connections between these metabolites (http://www.metaboanalyst.ca/faces/ModuleView.xhtml). With a false discovery rate of 0.05, CISELDA, CISESIR, and MSDA identified the nitrogen metabolic pathway to be significantly enriched with three metabolites (Hydroxylamine, Glutamine, and Asparagine) in our list belonging to this pathway. Meanwhile, no pathway reached the FDR threshold for MGSDA. On the other hand, 8 pathways including the nitrogen pathway reached FDR threshold for the candidate metabolites identified by PLDA; this is not surprising since PLDA identified more metabolites. The nitrogen pathway plays an important role in the metabolism of nitrogen into other compounds that are essential for human survival.

RNA-seq data

In this example, we demonstrate the ability of the methods to identify features for discriminative purposes when there are more than two groups and when the number of variables is high. Advances and improvements in technology and decreasing cost of next-generation sequencing have made RNA sequencing (RNA-seq) a widely used method for gene expression studies. We used RNA-seq data on Drosophila Melanogaster (Fly) [35] downloaded from ReCount database [36]. Features with more than half their values being zero were filtered out. The remaining features with zero values were truncated at 0.5 and the data were then log-transformed. We filtered out features with low variances, resulting in \(p=12,046\) dimensions. Finally, the data were normalized to have equal medians for each sample, and mean zero and unit variance for each feature. There were four fly classes: Class 1 consisted of all embryos; Class 2 consisted of all larvae; Class 3 consisted of all white prepupae; and Class 4 consisted of all adult flies. The data set consists of a total of \(n=147\) samples. We split the data to 99 training set and 48 test set proportionately. The rest of the analysis was carried out similarly to the depression example.

In Table 6 we report the classification performance in terms of average test misclassification rates. The proposed methods are competitive achieving similar or better classification accuracy when compared to the competing methods. In terms of variable selection, it is noticeable that MGSDA is most sparse, with PLDA being least sparse. This result is consistent with the simulation results where MGSDA had high specificities and low sensitivities. Figure 2 is a visual representation of one random split of the testing data projected onto the estimated sparse discriminant subspaces. It can be observed that the classes are well separated. MSDA took too long to run for this data. Therefore, the results for this method are not reported.

Riboflavin production data

Here, we apply the proposed method on data with continuous response variable. The data concerns riboflavin (vitamin B2) production with B. subtilis. We obtained the data from [37]. Please refer to the Data Availability Study for where to download the riboflavin data. There is a single real-valued response variable, which is the logarithm of the riboflavin production rate, and \(p = 4088\) (co)variables that measure the logarithm of the expression level of 4088 genes; these gene expression profiles were normalized using the default in the R package affy [38]. The data consist of \(n = 71\) samples that were hybridized repeatedly during a fed-batch fermentation process in which different engineered strains and strains grown under different fermentation conditions were analyzed. We refer interested readers to [37] for more details.

Next, we estimate the structural dimension, which we find to be \({\widehat{d}} = 1\). Following, we randomly split the data into 50 training and 21 test samples. Then, we apply the proposed method on the training samples to obtain an estimate of the direction that span the central subspace, \(\widehat{\mathbf{v }}_1\), and project both the training and the test samples to this estimated direction to obtain the corresponding sufficient predictors, \(\mathbf{v }_1^{\top }{{\mathbf{X}}}_{\text{train}}\) and \(\mathbf{v }_1^{\top }{{\mathbf{X}}}_{\text{train}}\). Figure 3 depicts the sufficient summary plots. We also repeated the foregoing procedure 50 times and counted the number of times each gene was selected, i.e. had a corresponding non-zero estimated coefficient in \(\widehat{\mathbf{v }}_1\). We find that the following nine genes were selected in 80% of the replications: XHLA\(\_\)at, YCGO\(\_\)at, YHDX\(\_\)r\(\_\)at, YRZI\(\_\)r\(\_\)at, YTGD\(\_\)at, YCKE\(\_\)at, YXLD\(\_\)at, YCDH\(\_\)at, GAPB\(\_\)at.

Summary and conclusion

We have introduced a novel sparse estimation method for the population reduction vectors in semi-parametric multi-index models using the sliced inverse regression [1]. Unlike most existing methods in this literature that follow the sequential estimation fashion, our proposed method yields simultaneous estimation of the reduction vectors. The estimated dimension reduction matrix is row-sparse and thus leads to coordinate-independent sparse estimates, in the sense that the selected predictors are the same under any orthogonal transformation of the reduction vectors that span the subspace of interest, making it appealing for variable screening. The proposed method extends the scope of the popular sliced inverse regression for dimension reduction [1] to the high-dimensional setting. We carried out extensive simulations and applications to assess the effectiveness of the proposed method. Relative to other state of the art methods in the literature, our numerical experiments show that our proposed method is competitive in prediction performance, and generally yield smaller false positive rates (FPRs) with respect to variable selection.

The proposed method was applied to three real datasets including data from a depression study aimed at identifying metabolites that differentiate patients with major depressive disorder (MDD) symptoms from patients without MDD symptoms. Our results show that a number of metabolites including some known to be associated with major depression are enriched in the set of metabolites selected by our method.

Qian W, Ding S, Cook D. Sparse minimum discrepancy approach to sufficient dimension reduction with simultaneous variable selection in ultrahigh dimension. J Am Stat Assoc. 2019;114:1277–90.

CVX-Research: Cvx: Matlab software for disciplined convex programming, version 2.0. http://cvxr.com/cvx 2012.

Grant M, Boyd S. Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent advances in learning and control. Lecture Notes in Control and Information Sciences. Springer-Verlag Limited, pp. 95–110;2008.

Gaynanova I, Booth JG, Wells MT. Simultaneous sparse estimation of canonical vectors in the p > > n setting. J Am Stat Assoc. 2016;111:696–706.

Hasler G, van der Veen J, Tumonis T, Meyers N, Shen J, Drevets W. Reduced prefrontal glutamate/glutamine and -aminobutyric acid levels in major depression determined using proton magnetic resonance spectroscopy. Arch Gen Psychiatry. 2007;64(2):193–200. https://doi.org/10.1001/archpsyc.64.2.193.

Cotter DR, Pariante CM, Everall IP. Glial cell abnormalities in major psychiatric disorders: the evidence and implications. Brain Res Bull. 2001;55(5):585–95. https://doi.org/10.1016/S0361-9230(01)00527-5 (Neuropathology of severe mental illness: studies from the Stanley foundation neuropathology consortium).

Rajkowska G, Miguel-Hidalgo JJ, Wei J, Dilley G, Pittman SD, Meltzer HY, Overholser JC, Roth BL, Stockmeier CA. Morphometric evidence for neuronal and glial prefrontal cell pathology in major depression see accompanying editorial, in this issue. Biol Psychiat. 1999;45(9):1085–98. https://doi.org/10.1016/S0006-3223(99)00041-4.

Graveley BR, Brooks AN, Carlson J, Duff MO, Landolin JM, Yang L, Artieri CG, van Baren MJ, Boley N, Booth BW, Brown JB, Cherbas L, Davis CA, Dobin A, Li R, Lin W, Malone JH, Mattiuzzo NR, Miller D, Sturgill D, Tuch BB, Zaleski C, Zhang D, Blanchette M, Dudoit S, Eads B, Green RE, Hammonds A, Jiang L, Kapranov P, Langton L, Perrimon N, Sandler JE, Wan KH, Willingham A, Zhang Y, Zou Y, Andrews J, Bickel PJ, Brenner SE, Brent MR, Cherbas P, Gingeras TR, Hoskins RA, Kaufman TC, Oliver B, Celniker SE. The developmental transcriptome of drosophila melanogaster. Nature. 2011;471:473–9.

HH conceived of the idea, HH deduced the algorithm for CISESIR and SS deduced the algorithm for CISELDA, HH and SS designed the simulation study. SS and HH performed the real data analysis. HH and SS wrote the manuscript. All authors read and approved the final manuscript.

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.