 Research
 Open access
A nonnegative spike-and-slab lasso generalized linear stacking prediction modeling method for high-dimensional omics data
BMC Bioinformatics volume 25, Article number: 119 (2024)
Abstract
Background
High-dimensional omics data are increasingly utilized in clinical and public health research for disease risk prediction. Many sparse methods have been proposed that use prior knowledge, e.g., biological group structure information, to guide the model-building process. However, these methods are still based on a single model, often leading to overconfident inferences and inferior generalization.
Results
We proposed a novel stacking strategy based on a nonnegative spike-and-slab Lasso (nsslasso) generalized linear model (GLM) for disease risk prediction in the context of high-dimensional omics data. Briefly, we used prior biological knowledge to segment omics data into a set of sub-datasets. Each sub-model was trained separately on the features of its group via a proper base learner. The predictions of the sub-models were then ensembled by a super learner using the nsslasso GLM. The proposed method was compared to several competitors, such as the Lasso, grlasso, and gsslasso, using simulated data and two open-access breast cancer datasets. As a result, the proposed method showed robustly superior prediction performance to the optimal single-model method in high-noise simulated data and in real-world data. Furthermore, compared to the traditional stacking method, the proposed nsslasso stacking method can efficiently handle redundant sub-models and identify important ones.
Conclusions
The proposed nsslasso method demonstrated favorable predictive accuracy, stability, and biological interpretability. Additionally, the proposed method can also be used to detect new biomarkers and key group structures.
Background
Using high-dimensional omics data to build disease risk prediction models is a research hotspot in the clinical and public health fields. For example, a persistent challenge in cancer treatment is the heterogeneity of prognosis among patients [1], which is largely determined by the individual's genetic and molecular makeup [2]. Precision medicine aims to use information at the high-dimensional molecular level, together with mathematical models, to achieve more accurate diagnosis, personalized medical care, and reliable prognosis prediction [3]. However, the variable selection problem often arises in high-dimensional data [4].
An ideal model should offer interpretability, such that end-users can comprehend and utilize it effectively [5]. Sparse models, such as the Lasso and Elastic net, are considered more interpretable since they emphasize a limited number of important features that contribute most to prediction [6]. Extensions in a similar spirit include the smoothly clipped absolute deviation (SCAD) and the minimax concave penalty (MCP), introduced by [7] and [8], respectively. However, these methods model single-level information, such as gene expression data, which may ignore interactions and higher-level linkage between variables. Network-based regularization is an alternative approach, in which gene–gene interactions are utilized as extra regularization terms [9]. Besides, considering that modeling with only gene-level information yields unstable results, building a model from a higher level of prior information (e.g., biological functions involved in disease mechanisms) is preferable [10, 11]. For example, carcinogenesis is a complex biological process regulated by multiple genes in various pathways; therefore, employing such pathway information in modeling makes better use of prior biological knowledge and more closely mimics tumor progression [12,13,14,15,16]. In this light, many methods, such as group Lasso (grlasso), group SCAD (grSCAD), and composite MCP (cMCP), have been proposed that integrate group structure information into the model building procedure and can achieve sparsity at the group level or bi-level [17,18,19]. Group structure information can also be incorporated into predictive modeling through a two-step approach [11]. For instance, multi-layer group-Lasso (MLGL) sequentially applies hierarchical clustering and group Lasso to identify data-driven group structures and construct predictive models [20].
Ordered homogeneity pursuit Lasso (OHPL) first reduces the whole dataset to a set of variables that are representatives of group structures and then employs the Lasso to fit these dimension-reduced data [21]. A similar process was reported in an earlier study conducted by Chen and Wang [14, 22]. Compared to data-driven group information, group information based on biological knowledge is more robust to outlying samples [23]. Wei et al. introduced a pathway-based procedure for the integration of genomic data [16]. They used nonparametric models to fit the genes in each pathway and performed gradient descent boosting to combine the "pathway activities" additively. Zhang et al. proposed to integrate the risk scores derived from pathways using a Bayesian hierarchical Cox model for cancer survival prediction [24]. Most of the aforementioned methods are single-model-based (SMB), which may result in inferior generalization ability across different data [25], while the others employ only a naive form of ensemble learning.
Ensemble learning is a general statistical practice that considers the predictions of multiple algorithms or models simultaneously [26,27,28,29]. By leveraging the strengths of varied models, ensemble methods often yield more robust and accurate predictions than a single model [30]. Owing to these favorable properties, ensemble methods have gained increasing attention over the last two decades [31,32,33,34]. A popular ensemble learning method is "model stacking" or "stacked generalization" [26], referred to as "stacking" hereinafter. Stacking is usually a two-layer construction: in the first layer, a set of sub-models (base learners) are constructed and their predictions are harvested; in the second layer, a meta-model (super learner) is fitted by learning from the predictions of the sub-models. The predictions of the sub-models are usually generated through K-fold cross-validation (CV) to reduce overfitting [35]. Breiman stated in his study that "stacking never does worse than selecting the single best predictor" [35], and similar conclusions were drawn in van der Laan's research [36]. Recently, Gelman's team developed the theory of stacking for multimodal Bayesian posterior distributions and generalized stacking to Bayesian hierarchical stacking, in which the weights of sub-models vary as a function of the data and are inferred using Bayesian inference [37, 38]. Despite its demonstrated advantages in prediction, Bayesian inference faces the challenge of intensive computation.
In this study, we proposed a novel stacking strategy for predicting disease risk using a nonnegative spike-and-slab Lasso (nsslasso) generalized linear model (GLM) in the context of high-dimensional omics data. Precisely, we use group structure information derived from biological knowledge to segment high-dimensional omics data into several sub-datasets. After that, each sub-model is trained separately on the features of its group via a proper base learner (chosen for better prediction and shorter computation time) and a CV procedure. Then, the CV predictions of the sub-models are ensembled by the super learner using the nsslasso GLM. We propose several variants of this strategy by combining different base learners and super learners and assess their prediction performance via a simulation study. These methods are also compared with several widely used penalized methods. Without loss of generality, the proposed methods are applied to large-scale gene expression data derived from two open-access breast cancer datasets, using pathways as the group information.
The paper is organized as follows: In "Methods" section, we provide a detailed illustration of the stacking fitting procedure using the nsslasso GLM, along with the algorithm for parameter estimation using Expectation–Maximization (EM) and the cyclic coordinate descent algorithm. "Simulation study" section presents a comparison of the prediction performance of our proposed method and existing methods through a simulation study. In "Applications to real data" section, we apply the proposed methods to real-world data. Finally, "Discussion" section concludes the paper and addresses several critical issues related to our approach.
Methods
nsslasso GLM stacking model
Given a learning dataset \(D=\{({y}_{n}, {{\varvec{x}}}_{n}), n = 1, 2, ..., N\}\), consider a numerical or binary outcome variable y that can be predicted from an input vector x by fitting J predictive models {\({f}_{1}(x), {f}_{2}(x), ..., {f}_{J}(x)\)}, either based on different modeling methods (e.g., random forest, support vector machine, etc.) or using subsets of the input variables. Instead of selecting the single optimal predictive model, stacking was proposed to combine the predictions of the J models.
The general model stacking is a two-layer structure consisting of base learners and a super learner. The first layer randomly partitions the original training data \(D_{0}\) into K mutually exclusive and exhaustive subsets (folds) of (roughly) equal size. The kth fold is used as a validation set, \(V(k)\), while the remaining folds are used as a training set, \(T(k)\), \(k = 1, 2, ..., K\), to predict the outcomes in \(V(k)\). The process is repeated for each fold, resulting in predictions for all the data \(V\). For J candidate base learners, we can obtain the predictions \({V}_{j}, j=1, 2, \dots , J\) by repeating the above procedure. The whole process yields a matrix whose columns are the pooled CV predictions of the different base learners. The second layer implements a super learner to fit the CV predictions from the J base learners. The resulting coefficients are the estimated weights \({\widehat{w}}_{j}\) for the jth base learner, which are subsequently used to combine the J sub-models. Of note, the sub-models to combine are the models (here denoted \({f}_{j}(x)\)) refitted using the original data \(D_{0}\). The final prediction model for new data \(D_{1}\) is given by,
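The first-layer procedure above can be sketched in code. The paper's implementation is in R; the following is a minimal Python illustration in which a ridge-stabilized least-squares fit stands in for the Lasso/ML base learners, and all names are hypothetical:

```python
import numpy as np

def cv_predictions(X_groups, y, K=5, seed=1):
    """First stacking layer: for each feature group, fit a base learner on
    K-1 folds and predict the held-out fold, pooling the out-of-fold
    predictions into an N x J matrix Z (one column per sub-model)."""
    N, J = len(y), len(X_groups)
    Z = np.zeros((N, J))
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(N), K)
    for j, Xg in enumerate(X_groups):
        for val_idx in folds:
            train_idx = np.setdiff1d(np.arange(N), val_idx)
            Xtr, Xval = Xg[train_idx], Xg[val_idx]
            # ridge-stabilized least squares: (X'X + aI)^{-1} X'y
            a = 1e-3
            coef = np.linalg.solve(Xtr.T @ Xtr + a * np.eye(Xg.shape[1]),
                                   Xtr.T @ y[train_idx])
            Z[val_idx, j] = Xval @ coef
    return Z
```

The matrix Z returned here is what the second layer (super learner) is fitted on.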
The estimated weights \({\widehat{w}}_{j}\) are usually constrained to be nonnegative to lower the variance of prediction, while the sum-to-one constraint on \({\widehat{w}}_{j}\) proved to be generally unnecessary [35]. Optimization algorithms, such as nonnegative least squares and the limited-memory BFGS method (L-BFGS), can be used to estimate the weights [39].
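For a continuous outcome, the classic nonnegative stacking weights can be obtained with nonnegative least squares; a small Python sketch (the data here are synthetic, purely for illustration):

```python
import numpy as np
from scipy.optimize import nnls

# Z: N x J matrix of pooled CV predictions from J sub-models; y: outcomes.
rng = np.random.default_rng(1)
Z = rng.uniform(size=(100, 3))
y = 0.6 * Z[:, 0] + 0.4 * Z[:, 2] + rng.normal(scale=0.05, size=100)

w, _ = nnls(Z, y)       # minimize ||Z w - y||_2 subject to w >= 0
assert np.all(w >= 0)   # nonnegative, but no sum-to-one constraint
```

Note that, as in the text, no sum-to-one constraint is imposed on the weights.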
In the present study, we introduce a novel nsslasso GLM stacking strategy based on segmenting high-dimensional omics data. The algorithm flow is shown in Fig. 1. High-dimensional omics data are segmented into groups based on prior biological knowledge. This reduces the dimensionality from all variables to only those in a given group. We then construct predictive models based on the features in each group, which serve as the sub-models of the first layer of the stacking framework. Sparse methods (SMs), such as the Lasso, SCAD, MCP, or network penalized methods, as well as various machine learning (ML) methods, can be used to build the sub-models. In the second layer, the nsslasso GLM is used as the super learner to estimate the weights of the sub-models based on their CV predictions. Here, we treat the predictions of the sub-models as covariates in a GLM, an idea already mentioned in [36, 40]. The final model can be expressed as,
where the response variable y is assumed to follow an exponential family distribution, and \(h\) is a monotonic link function, such as the identity or sigmoid function.
Here, we use an adaptive spike-and-slab mixture prior distribution \(\psi \left(.\right)\) for the weight \({w}_{j}\) in the super learner to differentiate sub-models according to their importance in predicting outcomes [41]. Let \(\psi (.)\) follow the truncated mixture double-exponential (DE) prior distribution, assuming that the weights are restricted to be nonnegative (see Additional file 1: Fig. S1); then \({w}_{j}\) follows the nonnegative spike-and-slab prior:
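The displayed equation here did not survive extraction. From the definitions that follow (scales \(s_0 < s_1\), indicator \(\gamma_j\)), the nonnegative mixture DE prior presumably takes a form like this sketch, where each component is a DE density truncated to \(w_j \ge 0\) (i.e., an exponential density):

```latex
\psi(w_j \mid 0, s) = \frac{1}{s} \exp\!\left(-\frac{w_j}{s}\right), \quad w_j \ge 0,
\qquad
p(w_j \mid \gamma_j) = (1 - \gamma_j)\, \psi(w_j \mid 0, s_0) + \gamma_j\, \psi(w_j \mid 0, s_1).
```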
where \(s_0\) and \(s_1\) (\(s_1 > s_0 > 0\)) are the scale parameters of the spike and slab distributions, respectively. \(s_1\) applies weaker shrinkage to the pathways with strong effects and is usually given a larger value, say \(s_1 = 1\), while \(s_0\) applies stronger shrinkage to the pathways with weak effects (even compressing them to zero) and takes a smaller value, which should be selected from a set of predefined candidate values via cross-validation. \({\gamma }_{j}\) is an indicator (\({\gamma }_{j} \in \{0, 1\}\)) following a Bernoulli distribution:
where \({\theta }_{j}\) is a sub-model-specific parameter following the Beta distribution \({\theta }_{j}\sim Beta(a, b)\). This parameter can integrate external prior information; in its absence, \({\theta }_{j}\) reduces to \({\theta }_{j}\sim U(0, 1)\). Formula (3) can further be represented as:
where \({S}_{j}=(1-{\gamma }_{j}){s}_{0}+{\gamma }_{j}{s}_{1}\) is called the total scale parameter.
Algorithm and parameters estimation
The parameters in the proposed model stacking framework, including \({w}_{j}\), \({\theta }_{j}\), \({\gamma }_{j}\), and \({S}_{j}\), are estimated with the EM algorithm built on the cyclic coordinate descent algorithm, instead of a computationally intensive Bayesian sampling algorithm. This approach enables faster and more feasible fitting of high-dimensional models without compromising prediction accuracy.
In the E-step, given \({w}_{j}\) and \({\theta }_{j}\), the posterior expectation of \({\gamma }_{j}\), denoted \({p}_{j}\), can be derived from,
where \(p({\gamma }_{j}=1\mid {\theta }_{j})={\theta }_{j}\), \(p({\gamma }_{j}=0\mid {\theta }_{j})=1-{\theta }_{j}\), \(p({w}_{j}\mid {\gamma }_{j}=1,{s}_{1})=\psi \left({w}_{j}\mid 0,{s}_{1}\right)\), and \(p({w}_{j}\mid {\gamma }_{j}=0,{s}_{0})=\psi \left({w}_{j}\mid 0,{s}_{0}\right)\). Then, denote the conditional posterior expectation of \({S}_{j}^{-1}\) by \({\lambda }_{j}\); it can be derived from,
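Both conditional expectations have closed forms. A Python sketch, under the assumption that \(\psi(w\mid 0,s)\) is a DE density truncated to \(w \ge 0\) (an exponential density with scale \(s\)); the function names are hypothetical:

```python
import numpy as np

def trunc_de(w, s):
    """Density of a DE distribution truncated to w >= 0 (exponential, scale s)."""
    return np.exp(-w / s) / s

def e_step(w, theta, s0, s1):
    """E-step: posterior probability p_j that gamma_j = 1, and
    lambda_j = E[1/S_j | w_j] = p_j/s1 + (1 - p_j)/s0, the adaptive
    penalty weight used in the M-step."""
    slab = theta * trunc_de(w, s1)
    spike = (1 - theta) * trunc_de(w, s0)
    p = slab / (slab + spike)
    lam = p / s1 + (1 - p) / s0
    return p, lam
```

A large weight \(w_j\) pushes \(p_j\) toward 1 and hence \(\lambda_j\) toward \(1/s_1\) (weak shrinkage); a near-zero weight pushes \(\lambda_j\) toward \(1/s_0\) (strong shrinkage).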
In the M-step, we update (\(w\), \(\theta\)) by maximizing the log joint posterior density of these parameters,
where \(l(w)=\log p(y\mid wf(x))\) is the log joint density function of the sub-models; \(p({\theta }_{j})\) is the prior distribution of \({\theta }_{j}\), say, \({\theta }_{j}\sim Beta(a, b)\).
The estimate of \(w\) can be updated by maximizing the following penalized likelihood function,
where \({\lambda }_{j}\) is replaced by its conditional posterior expectation derived above. Notice that the term \({\sum }_{j=1}^{J}{\lambda }_{j}{w}_{j}\) serves as an adaptive L1 (Lasso) penalty, and thus the weights can be updated by maximizing \({Q}_{1}(w)\) using the cyclic coordinate descent algorithm; hence the method is called the spike-and-slab Lasso (sslasso) [41]. This step can be done with the help of the R package glmnet, restricting the estimate of \(w\) to be nonnegative (hence, nsslasso).
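Up to the prior on \(\theta\), the M-step objective is an adaptively weighted nonnegative Lasso. The paper solves it with glmnet under a nonnegativity bound; as a minimal stand-in, here is a cyclic coordinate descent for the linear (identity-link) case with the \(\lambda_j\) held fixed (a sketch, not the authors' implementation):

```python
import numpy as np

def nonneg_adaptive_lasso(Z, y, lam, n_iter=200):
    """Cyclic coordinate descent for
        (1/2N) ||y - Z w||^2 + sum_j lam_j * w_j,   w_j >= 0,
    a linear-regression stand-in for the GLM fit performed with glmnet
    and per-coefficient penalties lam_j."""
    N, J = Z.shape
    w = np.zeros(J)
    col_sq = (Z ** 2).sum(axis=0) / N
    r = y - Z @ w  # full residual, updated incrementally
    for _ in range(n_iter):
        for j in range(J):
            r_j = r + Z[:, j] * w[j]            # partial residual excluding j
            rho = Z[:, j] @ r_j / N
            w_new = max(0.0, (rho - lam[j]) / col_sq[j])  # soft-threshold at 0
            r = r_j - Z[:, j] * w_new
            w[j] = w_new
    return w
```

Because the weights are constrained to be nonnegative, the penalty \(\sum_j \lambda_j w_j\) coincides with the usual \(\sum_j \lambda_j |w_j|\), so the update is a one-sided soft-thresholding step.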
In addition, we adopt existing numerical optimization algorithms, such as L-BFGS, to obtain \(\theta\). The complete algorithm flow follows [41].
Evaluation of model performance
The present study utilizes several metrics to measure the predictive performance of a fitted GLM: (1) deviance: \(-2{\sum }_{n=1}^{N}\log p({y}_{n}\mid wf({x}_{n}))\); (2) the Brier score (BS) for binary outcomes: \(\frac{1}{N}\sum_{n=1}^{N}{({y}_{n}-{\widehat{y}}_{n})}^{2}\); (3) the area under the ROC curve (AUC); and (4) the misclassification rate for binary outcomes: \(\frac{1}{N}\sum_{n=1}^{N}I(|{y}_{n}-{\widehat{y}}_{n}|\ge 0.5)\), where \(I(\cdot)\) is the indicator function.
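For binary outcomes, metrics (1), (2), and (4) can be computed directly from the predicted probabilities; a Python sketch (AUC, metric (3), is omitted since it requires rank statistics):

```python
import numpy as np

def binary_metrics(y, p, eps=1e-12):
    """Deviance, Brier score, and misclassification rate for a binary
    outcome y in {0, 1} and predicted probabilities p = y_hat."""
    p = np.clip(p, eps, 1 - eps)                      # guard against log(0)
    deviance = -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    brier = np.mean((y - p) ** 2)
    misclass = np.mean(np.abs(y - p) >= 0.5)
    return deviance, brier, misclass
```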
Competitive statistical methods
We assess the prediction performance of the proposed approach using simulated data and two real datasets. For the stacking methods, the Lasso (glmnet) [42], SCAD, MCP (ncvreg) [43], and the network penalized method (glmgraph) [9], as well as some ML methods, including K-nearest neighbors (KNN), support vector machine (SVM), naive Bayes (NBa), and random forest (RF) (e1071) [44], are used as base learners. In addition to using nsslasso as the super learner (implemented with the package BhGLM, with the weights restricted to be nonnegative) [45], we consider two competitive super learners, namely the nonnegative Lasso (nLasso) (implemented in the package glmnet, with the weights restricted to be nonnegative) and L-BFGS (the function optim() in the package stats). We use K-fold CV with K = 5 for stacking to ensure computational feasibility [25, 39]. Several potentially competitive statistical methods are included: the Lasso, MCP, SCAD, and the network regularized method. Several group-level penalization methods are also used for comparison, such as the "all-in-all-out" methods, including overlap group Lasso (grlasso), overlap group MCP (grMCP), and overlap group SCAD (grSCAD) (grpregOverlap, which handles overlapping group structures) [46], and the "bi-level" methods, including cMCP (grpregOverlap) and the group spike-and-slab Lasso (gsslasso) GLM (BhGLM). Another "bi-level" method in grpregOverlap, the group exponential Lasso (GEL), is not used because of its poor performance in this study. All methods are executed with default parameters. All analyses are performed using R (4.1.3) on a Dell T7920 workstation (Intel Xeon Gold 5117 CPU @ 2.00 GHz, Windows 10).
Simulation study
Simulation design
The present study designs six scenarios with three gradient-distributed theoretical generalized R^2 values [47] and two sets of varied nonzero covariate coefficients (\(\beta\)) (see Table 1) to quantify the amount of information in a given dataset. For each scenario, we generate two homogeneous datasets with equal sample sizes, one for the training data \(D_{0}\) and the other for the test data \(D_{1}\). To assess the performance of the methods, we conduct 100 duplicated runs and calculate the average results for comparison. The simulations are implemented using the R package BhGLM.
Specifically, for each dataset, we generate N = 500 samples, each with a binary response \({y}_{n}\) and M = 1000 continuous covariates \({{\varvec{x}}}_{n}=\left({x}_{n,1}, {x}_{n,2}, ..., {x}_{n,1000}\right)\), for \(n=1, 2, \dots , N\). The vector \({{\varvec{x}}}_{n}\) is randomly sampled from the multivariate normal distribution, i.e., \({{\varvec{x}}}_{n}\sim N(0,\Sigma )\), where \(\Sigma \in {\mathbb{R}}^{1000\times 1000}\) is the variance–covariance matrix. We then group these covariates into 20 distinct groups, allowing for overlap between the groups (Additional file 2: Table S1). The correlation coefficient r within groups is 0.6, while the variables between groups are independent. The binary response \({y}_{n}\) is generated by dichotomizing a continuous intermediate response \({z}_{n}\), with the 50% largest values labeled "positive" (\({y}_{n} = 1\)) and the others "negative" (\({y}_{n} = 0\)). \({z}_{n}\) follows a univariate normal distribution \({z}_{n}\sim N({\mu }_{n}, {\sigma }^{2})\), where \({\mu }_{n}={\beta }_{0}+{\sum }_{m=1}^{M}{x}_{nm}{\beta }_{m}\), with \({\beta }_{0}\) set to zero in this study. \({\sigma }^{2}\) denotes the residual variance, which is determined by fixing the theoretical generalized R^2 at three levels: 0.50, 0.25, and 0.10. We set a total of eight nonzero covariate coefficients of two types: absolute values ranging from 0.7 to 1, and from 0.2 to 1.5.
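The generating process can be sketched in Python (the paper implements it with the R package BhGLM). Group sizes and overlap are simplified relative to Table S1, and the mapping from the target R^2 to \(\sigma^2\), namely \(\sigma^2 = \mathrm{Var}(\mu)(1-R^2)/R^2\), is an assumption about how the theoretical generalized R^2 is operationalized:

```python
import numpy as np

def simulate(N=500, M=1000, group_size=50, r=0.6, beta=None, R2=0.25, seed=0):
    """Sketch of the simulation design: block-correlated Gaussian covariates
    (within-group correlation r, independent across groups), a sparse linear
    predictor, residual variance set from a target R^2, and a
    median-dichotomized binary response."""
    rng = np.random.default_rng(seed)
    if beta is None:
        beta = np.zeros(M)
        beta[:8] = [0.7, 0.8, 0.9, 1.0, -0.7, -0.8, -0.9, -1.0]
    # block-structured covariates via a shared per-group factor
    X = np.empty((N, M))
    for g in range(M // group_size):
        shared = rng.normal(size=(N, 1))
        noise = rng.normal(size=(N, group_size))
        cols = slice(g * group_size, (g + 1) * group_size)
        X[:, cols] = np.sqrt(r) * shared + np.sqrt(1 - r) * noise
    mu = X @ beta
    sigma2 = np.var(mu) * (1 - R2) / R2       # residual variance from target R^2
    z = mu + rng.normal(scale=np.sqrt(sigma2), size=N)
    y = (z >= np.median(z)).astype(int)       # 50% largest -> "positive"
    return X, y
```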
Results of the simulation
Prediction performance
Tables 2 and 3 summarize the Brier score and AUC of the different methods under the six simulation scenarios. Deviance, misclassification, and running time are shown in Additional file 2: Table S2. Among the SMB methods, MCP and SCAD are representative methods that use coefficient-adaptive penalties. According to the simulation, their performance varied little from the Lasso across the six scenarios. The network method also performed similarly to the Lasso. The methods considering group structures, e.g., grlasso, grMCP, and grSCAD, did not exhibit advantages in prediction over those neglecting group structures. Only gsslasso was competitive across all scenarios.
For the stacking methods, we considered obtaining the sub-models using the SMs, including the Lasso and the network method, as well as the ML methods, including KNN, NBa, and RF, while MCP, SCAD, and SVM were not considered because of their computational cost (MCP and SCAD are only listed in Scenarios 1 and 6; SVM took more than 90 min per duplicated run). We employed nsslasso as the super learner and compared it to nLasso and L-BFGS. The computational time of nsslasso was similar to that of nLasso but much shorter than that of L-BFGS. In our study, the ML-based stacking methods had poorer predictive performance than the SMB methods, while the SMs-based stacking methods demonstrated a predictive advantage (except in Scenario 1) over the SMB methods. However, there was little difference between the SMs-based stacking methods using different super learners.
Distribution of estimated weights
We further compared the weight estimates of the SMs-based stacking methods using different super learners. Theoretically, the weights for group1, group5, and group20 should be nonzero due to the presence of relevant nonzero variables. Figure 2 shows that nsslasso consistently identified the nonzero weights across all scenarios, while L-BFGS and nLasso generally included some zero weights. Besides, L-BFGS had a narrower interval range for the nonzero weights, but it may not be suitable for handling large numbers of sub-models because it lacks sparsity.
Applications to real data
We applied the proposed approach to two real breast cancer datasets with binary outcomes and large-scale gene expression profiles. Breast cancer, a typical molecularly heterogeneous disease, is the second leading cause of cancer mortality in women [48]. For these two datasets, gene expression data were standardized using the covariates function of the BhGLM package in R. We randomly partitioned the original data into two subsets of equal sample size: one for training models and the other for evaluating model performance. The process was repeated 100 times to guard against incidental results due to the data split [49]. To ensure a balanced response, we performed a Chi-square test on the number of events between the training and test data and considered splits with \({P}_{chi-square}>0.2\) to be balanced; these were retained for further analysis. Genes were mapped to pathways using genome annotation tools. More precisely, we first mapped gene symbols to Entrez IDs using the annotate package and then mapped all genes to KEGG pathways (default parameters) using the clusterProfiler package [50]. The adjacency matrix for each pathway, used in the network regularization method, was calculated using the WGCNA package [51].
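The balance check on the random splits can be sketched as follows, with scipy's chi-square test standing in for R's chisq.test and the function name purely hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

def balanced_split(y, seed, p_threshold=0.2):
    """Randomly split the sample indices in half; keep the split only if a
    chi-square test on event counts between the halves gives P > p_threshold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train, test = idx[:half], idx[half:]
    table = [[int((y[train] == 1).sum()), int((y[train] == 0).sum())],
             [int((y[test] == 1).sum()), int((y[test] == 0).sum())]]
    p_value = chi2_contingency(table)[1]
    return (train, test) if p_value > p_threshold else None
```

Splits returning None are discarded and a new random seed is tried, mirroring the retain-if-balanced rule in the text.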
TCGA breast cancer dataset
The Cancer Genome Atlas (TCGA) project collects a variety of cancer data types, such as clinical data, transcriptome expression data, and genomic variation data. We acquired the transcriptome profiling data of Breast Invasive Carcinoma (BRCA) and the phenotype information from the "GDC Data Portal". The outcome used in the present study is the occurrence of a "new tumor event".
We included the samples that had both phenotype and expression profiles. Genes with > 50% zero expression were filtered out, and those with variance above the 20% quantile were kept. Eventually, we obtained a dataset with 960 samples and 14,068 genes. These genes were mapped to 162 pathways involving 4169 non-overlapping genes (see Additional file 2: Table S3). The study first conducted an initial screening of the 162 pathways to identify those with potential predictive value. We fitted a Lasso logistic regression to the genes in each pathway and obtained the CV predicted values and AUC. From the 129 pathways with AUC > 0.500, 19 pathways (1570 non-overlapping genes, Additional file 2: Table S3) with AUC > 0.577 were selected as candidate sub-models for subsequent analysis. For the methods that do not consider group structures, models were directly fitted using the 1570 genes.
Table 4 summarizes the prediction performance of the different methods on the TCGA BRCA dataset. gsslasso exhibited the best predictive performance among the SMB methods. The ML-based stacking methods showed poor predictive performance, while nsslasso (Lasso) and L-BFGS (Lasso) outperformed the SMB methods according to AUC. In addition, we repeated the above analysis using the 51 pathways with AUC > 0.550 (a total of 2734 non-overlapping genes) to evaluate the impact of the number of included pathways (see Additional file 2: Table S4). In general, nsslasso (Lasso) and L-BFGS (Lasso) still outperformed the other methods, although all methods experienced a decline in predictive accuracy.
For model interpretation, we applied nsslasso (Lasso) to the whole dataset, resulting in a pathway-stacking model (AUC: 0.750) that selected five pathways fitted with a limited number of genes: p53 signaling pathway (hsa04115, relative weight W = 0.084), RNA transport (hsa03013, W = 0.297), Terpenoid backbone biosynthesis (hsa00900, W = 0.215), RNA degradation (hsa03018, W = 0.232), and Arrhythmogenic right ventricular cardiomyopathy (hsa05412, W = 0.172). All five pathways were among the 19 selected pathways with AUC > 0.577. In addition, we applied nLasso (Lasso) (AUC: 0.736) and L-BFGS (Lasso) (AUC: 0.746) for reference (see Additional file 2: Table S5). nLasso (Lasso) identified seven pathways, which included all of those selected by nsslasso (Lasso). L-BFGS (Lasso) included 57 pathways with relative weight > 0.001, which makes it difficult to indicate pathway importance.
METABRIC dataset
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset encompasses over 2000 breast cancer patients with accessible clinical, gene expression, and mutation data. We acquired the gene expression data and the phenotype data of breast invasive ductal carcinoma from METABRIC. The outcome of interest is "vital status". After data preprocessing (as described in "Applications to real data"), we finally obtained a dataset with 1420 samples and 19,494 genes. These genes were mapped to 146 pathways involving 3709 non-overlapping genes (see Additional file 2: Table S6). We selected 21 pathways (Additional file 2: Table S6) with AUC > 0.669 as candidate sub-models for subsequent analysis.
Among the SMB methods, grlasso and grSCAD were two competitive methods (Table 5), although these two did not perform well when applied to the TCGA BRCA dataset. The ML-based stacking methods again showed poor performance, while the Lasso-based super learners showed performance comparable to grlasso and grSCAD. As a sensitivity analysis, we included 143 of the 146 pathways with AUC > 0.600 (a total of 3673 non-overlapping genes) (Additional file 2: Table S7). Again, the performance of all methods decreased, while the Lasso-based super learners demonstrated favorable performance.
The pathway-stacking model fitted using nsslasso (Lasso) for the METABRIC dataset identified eight pathways (AUC: 0.777): Cell cycle (hsa04110, W = 0.174), HTLV-I infection (hsa05166, W = 0.105), Calcium signaling pathway (hsa04020, W = 0.166), Protein digestion and absorption (hsa04974, W = 0.092), Adipocytokine signaling pathway (hsa04920, W = 0.102), PPAR signaling pathway (hsa03320, W = 0.086), TGF-beta signaling pathway (hsa04350, W = 0.156), and Protein processing in the endoplasmic reticulum (hsa04141, W = 0.119). nLasso (Lasso) (AUC: 0.776) covered the pathways selected by nsslasso (Lasso) (Additional file 2: Table S8). L-BFGS (Lasso) (AUC: 0.760) identified 82 pathways with relatively small weights (> 0.001).
Discussion
The present study proposed a general stacking strategy based on data segmentation and the nsslasso GLM for predicting disease risk in the context of high-dimensional data. To the best of our knowledge, this is the first paper to demonstrate the use of stacking to integrate group structure information into modeling, in which a new nonnegative spike-and-slab prior restricts the weights of the sub-models to be nonnegative. The proposed method inherits the advantages of stacking, which may account for its improved and robust generalization compared to existing methods based on a single model (e.g., Lasso, grlasso, gsslasso). Furthermore, employing nsslasso as the super learner can adaptively combine sub-models: selecting a strong sub-model while eliminating the rest with similar effects. This feature leads to reduced variance and enhanced prediction accuracy [35]. Using nsslasso is comparable to using L-BFGS in prediction, but the former has the advantage of quickly estimating the weights and identifying important sub-models.
In the simulation, the SMs-based stacking methods exhibited better predictive performance than the SMB methods, except in Scenario 1. Scenario 1 represented a situation with a high theoretical generalized R^2 (low \({\sigma }^{2}\)), in which the data noise is low. Given enough effective information, methods based on a single model can achieve fairly good prediction. Besides, stacking methods suffer from increased variance due to the random split in the CV procedure [52], which can potentially lead to the loss of valuable information. As data noise increases (much closer to real-world data) and effective information decreases, the SMs-based stacking methods performed better in both the simulation scenarios and the real-world data, because borrowing information from different models makes them more tolerant of noise.
For this study, we conducted 100 duplicated runs for every scenario. To evaluate the impact of the number of duplicated runs, we increased the simulation runs to 200 for Scenario 1 and compared the results with those obtained from 100 runs. The results of the stacking methods remained consistent across both 100 and 200 runs, while those of the SMB methods showed slight changes. This suggests that the stacking-based methods are more stable in making predictions.
In addition, the ML-based stacking methods showed poor prediction performance in both the simulated and real-world data. One possible explanation is that these ML methods are prone to overfitting the data. Such overfitted sub-models typically produce similar predictions, and stacking's performance is likely to be less favorable when the sub-models yield similar predictions. Another possible reason is that these ML methods, with their complex fitting algorithms, are generally less appropriate for data of small sample size; they require more data to fit their parameters well [25].
A noted strength of the proposed strategy is the interpretability of the resulting models. As stated in Buch's article [11], utilizing prior biological knowledge to group omics data can identify relevant functional groups. In our study, for instance, five important pathways were identified by nsslasso (Lasso) in the TCGA breast cancer data concerning the occurrence of a "new tumor event", among which the p53 signaling pathway is one of the best-known pathways closely associated with the prognosis of breast cancer [53]. Aberration of p53 results in an elevated occurrence of new tumor events, as many signals about cellular health interact with the p53 protein, ultimately determining whether the cell proceeds with the division cycle [54]. The model also identified other pathways involving various biological processes. RNA transport (weight: 0.297), RNA degradation (weight: 0.232), and Terpenoid backbone biosynthesis (weight: 0.215) contributed more to the prediction, highlighting their important roles. Meanwhile, the proposed stacking strategy carries out within-group variable selection to extract genes at the base layer. With the combined power of genes in the same pathway, one may identify patterns that are too subtle to discern at the single-gene level [11, 22]. These identified pathways and genes can serve as starting points for subsequent targeted research.
Variable screening is essential when the dimensionality of the data is extremely large [25]. The proposed method involves at least three dimension-reduction processes, which may account for the observed favorable performance. The first is the data segmentation based on prior biological information, which reduces the whole omics data to a set of sub-data. The second is the variable selection performed by sparse methods when constructing the sub-models, which reduces each sub-data to a few important predictors. The third is the selection of important sub-models by the proposed nss-lasso GLM in the second layer of stacking. This sequence of dimension reductions gradually eliminates numerous irrelevant variables, ensuring that the stacked models contain only a limited number of vital predictors.
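The first two dimension-reduction stages can be sketched as follows; the group map, the correlation screen, and the toy data are illustrative stand-ins (the actual pipeline fits sparse GLM sub-models per pathway, and the third stage fits the nss-lasso super learner on the sub-model predictions):

```python
# Illustrative sketch of stages one and two of the dimension reduction.
# A correlation cutoff stands in for the penalized variable selection
# that the paper performs inside each sub-model.

def segment(features, group_map):
    """Stage 1: split the feature table into pathway-level sub-data."""
    return {g: {f: features[f] for f in fs if f in features}
            for g, fs in group_map.items()}

def screen(sub_data, y, cutoff=0.5):
    """Stage 2 stand-in: keep features whose |correlation| with y
    exceeds `cutoff` (a sparse sub-model would do this via penalties)."""
    def corr(x):
        n = len(y)
        mx, my = sum(x) / n, sum(y) / n
        sx = sum((v - mx) ** 2 for v in x) ** 0.5
        sy = sum((v - my) ** 2 for v in y) ** 0.5
        if sx == 0 or sy == 0:
            return 0.0
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)
    return {f: x for f, x in sub_data.items() if abs(corr(x)) > cutoff}

# Toy data: g1 tracks the outcome, g2 is anti-correlated, g3 is constant
features = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [1, 1, 1, 1]}
groups = {"pathwayA": ["g1", "g3"], "pathwayB": ["g2"]}
y = [1.0, 2.0, 3.0, 4.0]
sub = segment(features, groups)
kept = {g: screen(d, y) for g, d in sub.items()}
```

Stage three then fits the nss-lasso super learner on the retained sub-models' predictions, dropping any sub-model whose stacking weight shrinks to zero.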
A notable challenge of our approach is its computational cost, primarily due to the CV process and the need to ensemble numerous sub-models. Our practice is therefore to first select the pathways with strong signals (say, the twenty pathways with the highest AUC) as candidate sub-models. Moreover, as a hierarchical Bayesian stacking method, our method can be extended to incorporate multi-level group structures, such as SNP-gene-pathway, by leveraging prior knowledge of \(\theta\). In addition, researchers can explore alternatives to the nonnegative spike-and-slab mixture DE prior used in the proposed model and investigate their theoretical and empirical properties. Finally, the proposed method is a general strategy that can be applied to other biological settings with similar multi-level structures.
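The pre-screening step mentioned above amounts to ranking candidate sub-models by their AUC and keeping the strongest. A hedged sketch, using the Mann-Whitney form of the AUC and toy scores (the function names, the scores, and `top_k = 2` are all illustrative; in our practice roughly twenty pathways would be retained):

```python
# Illustrative sketch of pre-screening candidate sub-models by AUC.

def auc(scores, labels):
    """Mann-Whitney AUC: probability that a random positive case
    outranks a random negative one (ties count one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def screen_submodels(submodel_scores, labels, top_k=2):
    """Keep the top_k sub-models ranked by AUC on held-out labels."""
    ranked = sorted(submodel_scores,
                    key=lambda name: auc(submodel_scores[name], labels),
                    reverse=True)
    return ranked[:top_k]

labels = [1, 1, 0, 0]
scores = {"p53 signaling": [0.9, 0.8, 0.2, 0.1],   # separates perfectly
          "RNA transport": [0.7, 0.4, 0.6, 0.3],   # partial separation
          "noise pathway": [0.2, 0.3, 0.8, 0.7]}   # inverted ranking
chosen = screen_submodels(scores, labels, top_k=2)
```

Only the chosen sub-models are passed to the super learner, which keeps the ensembling step tractable.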
Availability of data and materials
The main code for the proposed method is freely available on GitHub at https://github.com/JasonLnzi/ANonnegativeSpikeandslabLassoGeneralizedLinearStackingModel. The R packages and functions mentioned in the section "2.4 Competitive statistical methods" are listed in Additional file 2: Table S10. We acquired the Breast Invasive Carcinoma dataset (identifier/accession number: TCGA-BRCA) from The Cancer Genome Atlas (TCGA) database, accessible at https://portal.gdc.cancer.gov/projects/TCGA-BRCA. We obtained another breast cancer dataset from METABRIC (Molecular Taxonomy of Breast Cancer International Consortium, https://www.cbioportal.org/study/summary?id=brca_metabric) with the identifier "Breast Invasive Ductal Carcinoma".
Abbreviations
Lasso: Least absolute shrinkage and selection operator
nss-lasso: Nonnegative spike-and-slab lasso
GLM: Generalized linear model
SCAD: Smoothly clipped absolute deviation
MCP: Minimax concave penalty
grlasso: Group lasso
grSCAD: Group SCAD
cMCP: Composite MCP
MLGL: Multi-layer group-lasso
OHPL: Ordered homogeneity pursuit lasso
SMB: Single-model-based
CV: Cross-validation
EM: Expectation-maximization
L-BFGS: Limited-memory Broyden-Fletcher-Goldfarb-Shanno
SMs: Sparse methods
ML: Machine learning
DE: Double-exponential
sslasso: Spike-and-slab lasso
BS: Brier score
ROC: Receiver operating characteristic curve
AUC: Area under the ROC curve
grMCP: Overlap group MCP
GEL: Group exponential lasso
KEGG: Kyoto Encyclopedia of Genes and Genomes
TCGA: The Cancer Genome Atlas
BRCA: Breast cancer
METABRIC: Molecular Taxonomy of Breast Cancer International Consortium
References
Gupta GK, Collier AL, Lee D, et al. Perspectives on triple-negative breast cancer: current treatment strategies, unmet needs, and potential targets for future therapies. Cancers. 2020;12(9):2392.
Fisher R, Pusztai L, Swanton C. Cancer heterogeneity: implications for targeted therapeutics. Br J Cancer. 2013;108(3):479–85.
Ashley EA. Towards precision medicine. Nat Rev Genet. 2016;17(9):507–22.
Heinze G, Wallisch C, Dunkler D. Variable selection – a review and recommendations for the practicing statistician. Biometrical J. 2018;60(3):431–49.
Zhang YP, Zhang XY, Cheng YT, et al. Artificial intelligence-driven radiomics study in cancer: the role of feature engineering and modeling. Mil Med Res. 2023;10(1):22.
Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell. 2019;1(5):206–15.
Fan JQ, Li RZ. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96(456):1348–60.
Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38(2):894–942.
Li CY, Li HZ. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24(9):1175–82.
Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77(2):81–97.
Buch G, Schulz A, Schmidtmann I, et al. A systematic review and evaluation of statistical methods for group variable selection. Stat Med. 2023;42(3):331–52.
Hanahan D, Weinberg RA. The hallmarks of cancer. Cell. 2000;100(1):57–70.
Vogelstein B, Kinzler KW. Cancer genes and the pathways they control. Nat Med. 2004;10(8):789–99.
Chen X, Wang LL. Integrating biological knowledge with gene expression profiles for survival prediction of cancer. J Comput Biol. 2009;16(2):265–78.
Frohlich F, Kessler T, Weindl D, et al. Efficient parameter estimation enables the prediction of drug response using a mechanistic pan-cancer pathway model. Cell Syst. 2018;7(6):567.
Wei Z, Li HZ. Nonparametric pathway-based regression models for analysis of genomic data. Biostatistics. 2007;8(2):265–84.
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc B. 2006;68:49–67.
Wang L, Chen G, Li H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. 2007;23(12):1486–94.
Breheny P, Huang J. Penalized methods for bi-level variable selection. Stat Interface. 2009;2(3):369–80.
Grimonprez Q, Blanck S, Celisse A, et al. MLGL: an R package implementing correlated variable selection by hierarchical clustering and group-lasso. J Stat Softw. 2023;106(3):1–33.
Lin YW, Xiao N, Wang LL, et al. Ordered homogeneity pursuit lasso for group variable selection with applications to spectroscopic data. Chemometr Intell Lab. 2017;168:62–71.
Chen X, Wang L, Ishwaran H. An integrative pathway-based clinical-genomic model for cancer survival prediction. Stat Probabil Lett. 2010;80(17–18):1313–9.
Manoli T, Gretz N, Gröne HJ, et al. Group testing for pathway analysis improves comparability of different microarray datasets. Bioinformatics. 2006;22(20):2500–6.
Zhang XY, Li Y, Akinyemiju T, et al. Pathway-structured predictive model for cancer survival prediction: a two-stage approach. Genetics. 2017;205(1):89.
Phillips RV, van der Laan MJ, Lee HA, et al. Practical considerations for specifying a super learner. Int J Epidemiol. 2023;52(4):1276–85.
Wolpert D. Stacked generalization. Neural Netw. 1992;5:241–59.
Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.
Schapire RE. The strength of weak learnability. Mach Learn. 1990;5(2):197–227.
Hoeting JA, Madigan D, Raftery AE, et al. Bayesian model averaging: a tutorial. Stat Sci. 1999;14(4):382–401.
Sagi O, Rokach L. Ensemble learning: a survey. WIREs Data Min Knowl. 2018;8:4.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–232.
Clyde M, Iversen ES. Bayesian model averaging in the M-open framework. Bayesian Theory Appl. 2013;8:484–98.
Clarke B. Comparing Bayes model averaging and stacking when model approximation error cannot be ignored. J Mach Learn Res. 2004;4(4):683–712.
Breiman L. Stacked regressions. Mach Learn. 1996;24(1):49–64.
van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol. 2007;6:1.
Yao YL, Vehtari A, Simpson D, et al. Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Anal. 2018;13(3):917–1003.
Yao YL, Pirs G, Vehtari A, et al. Bayesian hierarchical stacking: some models are (somewhere) useful. Bayesian Anal. 2022;17(4):1043–71.
Naimi AI, Balzer LB. Stacked generalization: an introduction to super learning. Eur J Epidemiol. 2018;33(5):459–64.
Stiglic G, Wang F, Davey A, et al. Pediatric readmission classification using stacked regularized logistic regression models. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association; 2014.
Tang ZX, Shen YP, Zhang XY, et al. The spike-and-slab lasso generalized linear models for prediction and associated genes detection. Genetics. 2017;205(1):77.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22.
Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat. 2011;5(1):232–53.
Dimitriadou E, Hornik K, Leisch F, et al. Misc functions of the Department of Statistics (e1071). TU Wien. 2008;1:5–24.
Yi NJ, Tang ZX, Zhang XY, et al. BhGLM: Bayesian hierarchical GLMs and survival models, with applications to genomics and epidemiology. Bioinformatics. 2019;35(8):1419–21.
Zeng Y, Breheny P. Overlapping group logistic regression with applications to genetic pathway selection. Cancer Inform. 2016;15:179–87.
Nagelkerke NJD. A note on a general definition of the coefficient of determination. Biometrika. 1991;78(3):691–2.
Subhan MA, Parveen F, Shah H, et al. Recent advances with precision medicine treatment for breast cancer including triple-negative subtype. Cancers (Basel). 2023;15(8):2204.
Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med. 2000;19(4):453–73.
Yu GC, Wang LG, Han YY, et al. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–7.
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 2008;9:1–13.
Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence. 1995;2(IJCAI):1137–43.
Zhong CM, Xie ZJ, Zeng LH, et al. MIR4435-2HG is a potential pan-cancer biomarker for diagnosis and prognosis. Front Immunol. 2022;13:855078.
Giannikaki E, Kouvidou C, Tzardi M, et al. p53 protein expression in breast carcinomas. Comparative study with the wild type p53 induced proteins mdm2 and p21/waf1. Anticancer Res. 1997;17(3C):2123–7.
Acknowledgements
We thank the reviewers and the associate editor for their constructive suggestions and comments that have improved the manuscript. We acknowledge the contributions of the TCGA Research Network and the METABRIC team.
Funding
This study was supported by the National Natural Science Foundation of China [81773541 & 82373688], the Priority Academic Program Development of Jiangsu Higher Education Institutions at Soochow University, and the State Key Laboratory of Radiation Medicine and Protection [GZK1201919] to ZT. The funding body had no role in the design or conduct of this research.
Author information
Authors and Affiliations
Contributions
Study conception and design: SJJ, WS, and TZX. Data collection and cleaning: SJJ, DYF, SH, WXC, and TZX. Real data analysis and interpretation: SJJ and WS. Drafting and revising of the manuscript: SJJ, WS, and TZX. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. Supplementary figures.
Additional file 2
. Supplementary tables.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Shen, J., Wang, S., Dong, Y. et al. A nonnegative spike-and-slab lasso generalized linear stacking prediction modeling method for high-dimensional omics data. BMC Bioinformatics 25, 119 (2024). https://doi.org/10.1186/s12859-024-05741-6
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859-024-05741-6