Skip to main content
  • Research article
  • Open access
  • Published:

Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis



Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By weighting several competing statistical models suitably, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage model averaging procedure to enhance accuracy and stability in prediction for high-dimensional linear regression. First we employ a high-dimensional variable selection method such as LASSO to screen redundant predictors and construct a class of candidate models, then we apply the jackknife cross-validation to optimize model weights for averaging.


In simulation studies, the proposed technique outperforms commonly used alternative methods under high-dimensional regression setting, in terms of minimizing the mean of the squared prediction error. We apply the proposed method to a riboflavin data, the result show that such method is quite efficient in forecasting the riboflavin production rate, when there are thousands of genes and only tens of subjects.


Compared with a recent high-dimensional model averaging procedure (Ando and Li in J Am Stat Assoc 109:254–65, 2014), the proposed approach enjoys three appealing features thus has better predictive performance: (1) More suitable methods are applied for model constructing and weighting. (2) Computational flexibility is retained since each candidate model and its corresponding weight are determined in the low-dimensional setting and the quadratic programming is utilized in the cross-validation. (3) Model selection and averaging are combined in the procedure thus it makes full use of the strengths of both techniques. As a consequence, the proposed method can achieve stable and accurate predictions in high-dimensional linear models, and can greatly help practical researchers analyze genetic data in medical research.

Peer Review reports


Model averaging and model selection are two important techniques for estimation and prediction in statistical modeling. Model selection is appealing for its simplicity and interpretability thus has been attracting more attention. A systematic review on model selection can be found in Heinze, Wallisch, and Dunkler [14] or Lee, Cho, and Yu [18]. The main drawback for model selection, however, is that the uncertainty is essentially ignored once an optimal model is found. A possible consequence is that the inference based on the “best” model can be misleading—either overestimating or underestimating, due to poor representative of the real distribution of the data. In addition, different methods, criteria, and even small changes in the data can result in very different final models, thus prediction accuracy of mode selection is reduced [23].

Model averaging has been widely recognized as a solution when model uncertainty is high and interferes with the quality of the prediction. By combining a set of competing models and wisely choosing the weights, model averaging can even out the overestimation and underestimation, therefore often leads to a more reliable result than the individual prediction. With model averaging, no assumption that a true model exists needs to be made, and inferences could be made about all candidate models. Hence, this method is preferred when uncertainty arises in the model selection process [24].

It has been shown that averaging models tend to yield better predictions than single models in the classical setting where the sample size is at least one order of magnitude greater than the number of predictors [16], how well it performs in high-dimensional regression, in which the number of predictors exceeds the sample size, is still under investigation. Recently, [2] proposed a two-step model averaging with cross-validation (MCV) procedure. The method consists of two steps: the first step is to construct candidate models by marginal correlation; the second step is to find optimal model weights by delete-one cross-validation. The main feature of MCV lies on the fact that the standard constraint of the model weights summing up to 1 is relaxed to the model weights can be vary freely between 0 and 1, and it is claimed that this relaxation lowers the prediction error.

The MCV approach is one of the only few references and is the first study that removes the weight restriction, thus marks a significant step toward the development in the high-dimensional frequentist model averaging. Nonetheless, we have several concerns with this method. First, the model construction step of MCV employs the marginal correlation to measure the strength of association between predictors and the response. Hence, it can miss some significant variables that are associated with the response conditionally but not marginally. Furthermore, the marginal correlation can be misleading when there exist non-trivial correlations among the predictors [20]. As a result, week candidate models could be involved and be given fairly large weights while strong candidate models could have small weights, and predictive inference is impaired. Second, we find the arguments in the paper regarding why the restriction of total weights summing to 1 should be removed are not convincing. For example, Ando and Li remarked “consider the extreme case that the predictors are uncorrelated with each other and the noise variance is ignorable. The predictors from each model become uncorrelated with each other as well and the optimal combined predictor is the sum of all model predictors, implying that the optimal weight assignment should be (1, 1, ..., ). Thus, the total weight should be equal to M, not 1.” This statement is problematic as for this special case, each model should be assigned the weight of \(\frac{1}{M}\) if there are M of them, thus the total weight is still 1. Considering such point, more scrutiny is needed on whether relaxing the total weight constraint will lower the prediction error. Third, numerical study was conducted in that paper to demonstrate the favorable performance of MCV, but the same original data is employed for both model building and model validation, and the mean squared error (MSE) is calculated as the performance measure. A result of this model development process is that the MSE will tend to understate the inherent variability in making future predictions from the selected model [17], therefore the predictive capability of MCV needs to be re-calibrated by the collection of new data.

To address the above concerns of MCV and to further enhance the predictive ability, we develop a two-stage model averaging procedure for high-dimensional linear regression. In the first step, a high-dimensional variable selection method such as LASSO is employed to screen redundant predictors and to construct candidate models. In the second step, the jackknife cross-validation with the conventional constraint of total weights summing to 1 is applied to optimize model weights for averaging. The proposed method makes full use of the strengths of both model selection and averaging, meanwhile it is computational feasible since each candidate model and its corresponding weight are determined in the low-dimensional setting and the quadratic programming is utilized in the cross-validation. We conduct simulations and a real data example, the results illustrate that the proposed approach is efficient in forecasting and outperforms the existing predicting methodologies.

The remainder of the article is organized as follows. “Method” section describes the problem setting and an improved two-stage high-dimensional model averaging procedure is developed. We present a simulation study in “Simulations” section and provide a real data example in “Riboflavin Data Analysis” section for empirically examining the effectiveness of the proposed method. The paper ends with discussion and future research directions.


Given the dataset of n observations, a linear regression model takes the form of

$$\begin{aligned} y_i=\beta _1 x_{i1}+\cdots +\beta _p x_{ip}+\epsilon _i, \qquad i=1, 2,\ldots , n, \end{aligned}$$

where \(y_i\) is the response in the ith trial, \(x_{i1},\ldots , x_{ip}\) are the predictors, \(\beta _1,\ldots , \beta _p\) are the regression coefficients, and \(\epsilon _i\) is the error term. Here, we assume: 1) the independent error \(\epsilon _i\) has mean zero and finite variance \(\sigma ^2\); 2) the number of predictors p exceeds the sample size n; 3) only a subset of predictors have contributions in predicting the response.

Alternatively, in matrix form, model (2.1) can be written as

$$\begin{aligned} \mathbf{y }=\mathbf{X }{\varvec{\beta }}+{\varvec{\epsilon }}, \end{aligned}$$


$$\begin{aligned} \mathbf{y }& = {} \left( \begin{array}{c} {y_1}\\ \cdot \\ \cdot \\ \cdot \\ {y_n}\\ \end{array}\right) ,\quad {\varvec{\beta }}= \left( \begin{array}{c} {\beta _1}\\ \cdot \\ \cdot \\ {\beta _p}\\ \end{array} \right) , \qquad {\varvec{\epsilon }}= \left( \begin{array}{c} {\epsilon _1}\\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ {\epsilon _n} \end{array} \right) ,\\ \mathbf{X }& = {} \left( \begin{array}{c} {\mathbf{X }_1}\\ \cdot \\ \cdot \\ {\mathbf{X }_n} \end{array}\right) = \begin{pmatrix} x_{11} &{} x_{12} &{} \ldots &{} x_{1p} \\ x_{21} &{} x_{22} &{} \ldots &{} x_{2p} \\ \vdots &{} \vdots &{} &{} \vdots \\ x_{n1} &{} x_{n2} &{} \ldots &{} x_{np} \end{pmatrix}. \end{aligned}$$

For model (2.1), we will develop a two-stage model averaging method to improve the predictive performance for high-dimensional regression.

Model construction step

For the high-dimensional linear models, [2] proposed a two-stage model averaging procedure named MCV. The procedure first divides p predictors into \(K + 1\) groups by the absolute marginal correlations between all predictors and the response. Let model \(M_k\) consist of the predictors with marginal correlations falling into the kth group. The first group has the highest values, and the \((K + 1)\)th group has values closest to 0 and is then discarded. Thus the number of candidate models is K. Each model can also be written in matrix form \(\mathbf{y }=\mathbf{X }_k{\varvec{\beta }}_k+{\varvec{\epsilon }}\), for \(k=1,\ldots , K\). Given candidate models whose number of predictors is smaller than the sample size, the regression coefficients are estimated by the least-squares method \({\hat{\varvec{\beta }}}_k=(\mathbf{X }'_k\mathbf{X }_k)^{-1}\mathbf{X }'_k\mathbf{y }\) and the predicted value \({\hat{\varvec{y}}}_k=\mathbf{X }_k{\hat{\varvec{\beta }}}_k\).

As discussed in "Introduction" section, the marginal correlation could provide misleading results in sorting variables and preparing candidate models therefore reduces prediction accuracy. To prepare proper candidate models ready for averaging, in the first step of the proposed method we consider some representative measurements which can simultaneously rank and select variables.

The first competitor is Distance Correlation. It is a measure of dependence between random vectors introduced by [22]. For all distributions with finite first moments, such method provides a robust way measuring and testing dependence by correlation of distances. The distance correlation test is implemented by the energy package in R. To find candidate models, we partition all predictors into different groups by the p-values of the test. The first group has the lowest values, and the last group has largest values and then is discarded.

Two well-known variable selection approaches are also considered for comparison. Ridge Regression [15] is a way to create a parsimonious model when the number of predictors exceeds the number of observations, or when multicollinearity exists in predictors. The predictors are sorted and chosen by the method proposed by Cule et al. [7] and can be conducted by the ridge package in R. Similar to Ridge Regression, LASSO [23] also works in a similar fashion the only difference is of the penalty term. LASSO is performed by ncvreg package in R, the importance of predictors is based on the coefficients magnitude, and those whose coefficients shrunk to 0 are abandoned.

Random Forest [4] is also under consideration. It is an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees. Random Forest can be used to rank the importance of variables and to conduct variable selection in regression. The technique was described in the papers by Genuer, Poggi and Tuleau-Malot [10, 11], and is implemented in the R package VSURF.

To evaluate the prediction performance, we adopt a simulation setting from Ando and Li [2]. The detailed structure of the simulation study is given below.

Simulation 1. Set the sample size \(n=100, 200\) and the number of predictors \(p=500, 1000, 2000\). Set the number of significant predictors \(s=20, 50, 100\), and generate the coefficients \(\beta _j\) from Unif(-1, 1). The predictors \(x_{ij}, i=1, \ldots , n, j=1, \ldots , p\) are generated independently from Unif(-1, 1) as well. Finally, the error terms \(\epsilon _i\) are generated from N\((0, \sigma =0.1, 1, 2)\). Therefore, in total there are \(2*3*3*3=54\) combinations of \((n, p, s, \sigma )\).

To compare the methods in a fair manner, after candidate models are constructed in step 1, in step 2 the same delete-one cross-validation from MCV is used to optimize the model weights. In each run, we simulate a training set with n observations from the linear regression model in (2.1) for each combination. All the model building and averaging are done on the training set. We also collect an independent test data set of \(n^*=1000\) observations to compute the mean of the squared prediction error (MSPE), MSPE=\(\frac{\sum _{i=1}^{n^*} (y_i-{\hat{y}}_i)^2}{n^*}\). We repeat this process 100 times, and report the means and standard deviations of MSPE for each method.

Table 1 Simulated results for simulation 1, comparing the means and standard deviations (in the parentheses) of MSPE

Table 1 summarizes the simulation results. Several observations can be made from this table. First, the predictive performance is affected by several factors, including the sample size, number of predictors, number of significant predictors, and the distribution of the error terms. More specifically, the prediction accuracy increases as n increases, p, s, or \(\sigma\) decreases. Second, out of the 54 scenarios, marginal correlation, Distance Correlation, Ridge Regression, LASSO, Random Forest perform the best 9/11/14/19/1 times, respectively. It seems that each method has its unique merits, none of the 5 methods can universally dominate the other competitors. The overall performance of LASSO appears to be the best, Ridge Regression is a close second. To explore the methods in more detail, we find that LASSO often performs favorably when n is large, or when s and \(\sigma\) are small, yet ridge regression tends to be more stable among different cases. In summary, the simulation results show that using a shrinkage variable selection method to construct candidate models is preferable to the marginal correlation in the model construction step for model averaging.

Model weighting step

After the candidate models and their corresponding least-squares predicted values \(\{{\hat{\varvec{y}}}_1,\ldots , {\hat{\varvec{y}}}_K \}\) are obtained, the second stage of model averaging is to determine the model weights. For the MCV method, the weight vector \({\varvec{w}}\) is optimized by minimizing the delete-one cross-validation criterion

$$\begin{aligned} {\hat{\varvec{w}}}= \text {CV}({\varvec{w}})= ({\varvec{y}}-{\tilde{\varvec{y}}})'({\varvec{y}}-{\tilde{\varvec{y}}}), \end{aligned}$$

where \({\tilde{\varvec{y}}}=\sum _{k=1}^{K} w_k{\tilde{\varvec{y}}}_k\), \({\tilde{\varvec{y}}}_k=({\tilde{y}}^{(1)}_k, \ldots , {\tilde{y}}^{(n)}_k)\) be an n-dimensional vector, and \({\tilde{y}}^{(i)}_k\), is the predicted value \({\hat{y}}^{(i)}_k\) computed without the ith observation. Here, the restriction of \(\sum _{k=1}^{K} w_k=1\) is removed, and the minimization of \({\hat{\varvec{w}}}\) can be easily solved by a quadratic programming. Following the similar idea, [3] further extended model averaging to high-dimensional generalized linear models.

To compare with the relaxed cross-validation criterion in MCV, we consider three weight choice alternates in which the conventional constraint of total weights summing to 1 is applied.

AIC [1] is a consistent estimator of the Kullback-Leibler discrepancy between the distribution that generated the data and the model that approximates it. Suppose there are K models, AIC\(_k=-2\)log\((L_k)+2p_k\), where \(L_k\) is the maximized likelihood function and \(p_k\) is the number of parameters under the k-th model. [5] advocated to use the MAIC for determining the weights

$$\begin{aligned} w_k=\frac{\mathrm{exp}(-AIC_k/2)}{\sum _{k=1}^{K}\mathrm{exp}(-AIC_k/2)}, \qquad k=1, \ldots , K. \end{aligned}$$

More recently, [12] proposed a least-squares model average estimator with model weights selected by minimizing Mallows’ \(C_p\) criterion [21]. The Mallows’ criterion for the model averaging estimator (MMA) is

$$\begin{aligned} C_n({\varvec{w}})= \left( {\varvec{y}}-\mathbf{X }\varvec{\hat{\beta }}\right) '\left( {\varvec{y}}-\mathbf{X }\varvec{\hat{\beta }}\right) +2\sigma ^2k({\varvec{w}}), \end{aligned}$$

where \(k({\varvec{w}})\) is the effective number of parameters, and \(\sigma ^2\) can be estimated by a sample variance from a “large” approximating model. Hansen showed that the model average estimator that minimizes the Mallows criterion also minimizes the squared error in large samples.

In addition, [13] suggested to select the weights by minimizing a deleted-one cross-validation criterion, and the method is termed as the jackknife model averaging (JMA). The same method is adopted in MCV as expressed in (2.2). However, JMA requires \(\sum _{k=1}^{K} w_k=1\) while such conventional restriction is relaxed in MCV. This criterion is quadratic in the weights, so computation is a plausible application of quadratic programming. It is also shown that the JMA estimator is asymptotically optimal in achieving the lowest possible expected squared error.

To compare all the weight determining methods, we conduct another simulation study.

Simulation 2. We use the same setting and data generating process in Simulation 1. The same candidate models are constructed from the training data, all the methods described in this session are used to determine the model weights. An independent test data set of \(n^*=1000\) observations is generated to compute the prediction accuracy measured by the MSPE. We repeat the iteration 100 times.

Table 2 Simulated results for simulation 2, comparing the means and standard deviations (in the parentheses) of MSPE

Table 2 compares the means and standard deviations of MSPE for each method. Out of the 54 scenarios, the relaxed cross-validation (refers to CV in the table), MAIC, MMA, JMA perform the best 9/7/14/24 times. The overall performance of JMA appears to be the best, MMA performs similarly but slightly worse. More importantly, we find that in most cases (45 out of 54), relaxing the total weight constraint does not lower the prediction error. The results confirm with our concern in "Introduction" section.

The improved two-stage high-dimensional model averaging

Based on the results from the numerical experiments, we propose an improved model average (IMA) procedure, which contains the following two steps:

Stage 1: Construct candidate models by a high-dimensional penalized variable selection method (LASSO or its variants, such as SCAD [8] and ALASSO [26]).

  • Run the penalized regression for the high-dimensional data and obtain the LASSO estimate for each predictor.

  • Partition all p predictors into \(K+1\) groups by the magnitude of the estimates. The first group has the largest absolute values of LASSO estimates, and the \((K + 1)\)th group contains the predictors whose coefficients are already shrunk to 0. We drop the \((K + 1)\)th group, thus the number of candidate models is K.

  • For each model \(M_k\), \(k=1,\ldots , K\), \(\mathbf{y }=\mathbf{X }_k{\varvec{\beta }}_k+{\varvec{\epsilon }}\), and the least-squares prediction is \({\hat{\varvec{y}}}_k=\mathbf{X }_k{\hat{\varvec{\beta }}}_k\).

Stage 2: Optimize the model weights by the standard jackknife cross-validation criterion by Hansen and Racine [13].

  • Let \({\tilde{\varvec{y}}}_k=({\tilde{y}}^{(1)}_k, \ldots , {\tilde{y}}^{(n)}_k)\) be an n-dimensional vector, where \({\tilde{y}}^{(i)}_k\), is the predicted value \({\hat{y}}^{(i)}_k\) computed without the ith observation.

  • Let \({\varvec{w}}=(w_1, \ldots , w_k)\) be the weight vector for the K models, then the jackknife predictor is

    $$\begin{aligned} {\tilde{\varvec{y}}}=\sum _{k=1}^{K} w_k {\tilde{y}}_k. \end{aligned}$$
  • The optimal estimate of \({\varvec{w}}\) is obtained by minimizing the cross-validation

    $$\begin{aligned} {\hat{\varvec{w}}}= \text {CV}({\varvec{w}})= ({\varvec{y}}-{\tilde{\varvec{y}}})'({\varvec{y}}-{\tilde{\varvec{y}}}), \end{aligned}$$

    where the standard restriction \(\sum _{k=1}^{K} w_k=1\) is applied.

  • Finally, the model averaging predicted value \({\hat{\varvec{y}}}\) is expressed as

    $$\begin{aligned} {\hat{\varvec{y}}}=\sum _{k=1}^{K} {\hat{w}}_k {\hat{\varvec{y}}}_k . \end{aligned}$$


As a referee points out, ranking is an essential component of the out-performance. By ranking all the predictors based on the magnitude of the estimates, the models proposed in Stage 1 are not equally competitive. The first few models are likely to be more informative than the last few models because of the ordering of regressors. Therefore, assigning larger weights to the first few models and smaller weights to the last few models tend to improve the predictive performance. Note that [2] also ranked the predictors in the MCV method by using the marginal correlation. Moreover, we have conducted some simulation experiments, the results show that (although not shown in the article) ranking predictors into K different groups has lower prediction errors than a random assignment of non-zero predictors.


The overall performance of the proposed method depends on the choice of parameter K which is the total number of models. Determining the optimal value of K is similar to finding the best tuning parameter \(\lambda\) in penalized variable selection. For each candidate value of K, a group of models is constructed and the best model weights are determined. The optimal K is the one with the lowest cross-validation score. Note that the specification of K may vary case by case, here we propose a practical strategy for optimizing the choice of K. The number of predictors \(p_k\) in each model is set to be the same, \(p_k = n \times q\), where n is the sample size, q is a value between 0 and 1. First, we set T to be the number of predictors whose LASSO estimates are non-zero. Because the total number of predictors from K models is equal to \(K \times p_k\), it is desirable for \(K \times q \times n\) to be no greater than T. Subject to this restriction, we choose K and q whose cross-validation prediction error is the smallest.


To assess the performance of the improved model averaging (IMA) method, we compare our approach with some existing model selection and averaging methods on simulated data. These competitors include MAIC [5], LASSO [23], MCV [2], and SISSCAD (a combination of sure independence screening [9] and SCAD [8]).

In this simulation study, we follow the settings of Ando and Li [2], 6 model settings are considered.

  1. (a)

    Set the sample size \(n=50\) and the number of predictors \(p=2000\). Let the number of true predictors \(s=50\), the true predictors \(x_i\) are spaced evenly, \(i=40(j-1)+1, j=1, \ldots , 50.\) Further, the true coefficients \(\beta _j\) are generated from N(0, 0.5), and the design matrix \({\varvec{X}}\) is generated from N\((0, \Sigma = \rho ^{|i-j|})\), where \(\rho =0.6\). Finally, the error terms \(\epsilon _i\) are generated from N\((0, \sigma =0.2)\).

  2. (b)

    Under the same setting (a), n is increased to 100.

  3. (c)

    Under the same setting (a), the value of \(\rho\) is decreased to 0.

  4. (d)

    Under the same setting (a), the standard deviation of the error term \(\sigma\) follows Unif(2.1, 2.3).

  5. (e)

    Under the same setting (d), n is increased to 100.

  6. (f)

    Under the same setting (d), the value of \(\rho\) is decreased to 0.

For each setting, we generate \(n^*=1000\) observations to compute MSPE. Figures 1, 2, 3, 4, 5 and 6 show the boxplots of MSPEs after 100 simulation runs under settings (a) to (f). As we can see, increasing the sample size ((a) V.S (b) and (d) V.S. (e)), and decreasing the correlation \(\rho\) in the design matrix \({\varvec{X}}\) ((a) V.S (c) and (d) V.S. (f)), both tend to decrease the prediction error. In addition, the proposed procedure IMA yields the smallest median of MSPE in most settings, and achieves stable performance by having the short length of the boxplots. The results demonstrate that IMA performs favorably in comparison with other methods. In particular, the median of MSPE of IMA is at least 20% lower than that of MCV, we remark that the IMA method outperforms MCV, and thus significantly improves the prediction accuracy and stability of model averaging for high-dimensional linear regression.

Fig. 1
figure 1

Boxplots of MSPEs for Setting (a)

Fig. 2
figure 2

Boxplots of MSPEs for Setting (b)

Fig. 3
figure 3

Boxplots of MSPEs for Setting (c)

Fig. 4
figure 4

Boxplots of MSPEs for Setting (d)

Fig. 5
figure 5

Boxplots of MSPEs for Setting (e)

Fig. 6
figure 6

Boxplots of MSPEs for Setting (f)

Table 3 Averaged computational time (in seconds) and corresponding standard deviations (in the parentheses) for each method

Table 3 compares the computational time required for each method under the settings (a) to (f). The computations are conducted using R version 3.5.0 on the Owens clusters at the Ohio Supercomputer Center (OSC). After 100 simulation runs, the averaged time (in seconds) and corresponding standard deviations are given. Several observations can be seen from this table. First, with the fixed number of predictors p and the correlation \(\rho\), increasing the sample size n will decrease the computing time. Second, with the fixed number of predictors p and the sample size n, decreasing the correlation \(\rho\) will decrease the computing time. Third, in general, model selection methods (LASSO, SISSCAD) have less computing time than the model averaging procedures (MAIC, MCV, IMA). Last, the proposed method IMA is much faster than other model averaging competitors and is comparable to model selection methods regarding the computational cost.

Riboflavin data analysis

To further explore the practical behavior of the proposed method, we consider the riboflavin data with \(n=71\) observations in Bühlmann and Mandozzi (2014 [6]). The response variable y is the logarithm of the riboflavin production rate, the predictors are the gene expression levels for 4088 genes. The methods to be included in the comparison experiment are: IMA, MAIC, SCAD, MCV, and RMSA (another recent model averaging procedure by Lin et al. [19]. We randomly divide the data into a training set with 50 observations and a test set with the remaining 21. In each time, p = 500 or 1000 genes are chosen randomly from the totality of all genes, the process is repeated 100 times.

The results for the Riboflavin data are presented in Table 4. Similar to the results in the simulation study, the prediction errors of IMA are substantially smaller than those of the other methods, and the computing time is pretty competitive. It is also noted that RMSA performs well in this real example, while the amount of computation required for such method is considerably high. Therefore, compared to RMSA, IMA not only has slightly better performance in prediction accuracy, but also enjoys much lower computing cost.

Table 4 Results for the Riboflavin data, comparing the means and standard deviations (in the parentheses below) of MSPE, and the averaged computational time (in seconds) and corresponding standard deviations (in the parentheses below) for each method


To achieve a stable and improved prediction, we propose a novel two-stage model averaging procedure in high-dimensional linear regression models. The method uses variable selection to group predictors for model averaging and applies the standard jackknife cross-validation for optimizing weights.

Compared with the recent model averaging procedure (MCV, [2]), the proposed approach has better predictive performance, meanwhile it retains computational flexibility even for extra high-dimensional data. We conduct numerical studies including simulations and a real riboflavin data analysis, the results demonstrate that the proposed technique is quite efficient in forecasting and outperforms the existing methodologies in general.

Even though advantages of model averaging over model selection have been demonstrated in the low dimensional regression analysis, it is still unclear to a large extent when model averaging should be preferred is high-dimensional linear models. [25] proposed an index, PIE, to measure model selection instability in estimation. They suggested that: If PIE is bigger than 0.5, model averaging should be considered; if PIE is less than 0.4, model selection is likely to work better than the model combining methods. However, PIE does not function properly for high-dimensional data based on our investigation. In the future, we intend to develop a measure which can be used to guide practitioners for deciding which way to go: averaging or selection, when dealing with high-dimensional problems.

Availability of data and materials

The riboflavin data analyzed during the current study is provided in a supplementary file.


  1. Akaike H. A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika. 1979;66:237–42.

    Article  Google Scholar 

  2. Ando T, Li KC. A model-averaging approach for high-dimensional regression. J Am Stat Assoc. 2014;109:254–65.

    Article  CAS  Google Scholar 

  3. Ando T, Li KC. A weight-relaxed model averaging approach for high-dimensional generalized linear models. Annals Stat. 2017;45:2654–79.

    Article  Google Scholar 

  4. Breiman L. Random forests. Mach Learn. 2001;45:15–32.

    Google Scholar 

  5. Buckland ST, Burnham KP, Augustin NH. Model selection uncertainty: an integral part of inference. Biometrics. 1997;53:603–18.

    Article  Google Scholar 

  6. Buhlmann P, Mandozzi J. High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput Stat. 2014;29:407–30.

    Article  Google Scholar 

  7. Cule E, De Iorio M. A semi-automatic method to guide the choice of ridge parameter in ridge regression. 2012 arXiv:1205.0686v1 [stat.AP].

  8. Fan J, Li R. Variable selection via nonconcave penalized likelihood and oracle properties. J Am Stat Assoc. 2001;96:1348–60.

    Article  Google Scholar 

  9. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc B. 2008;70:849–911.

    Article  Google Scholar 

  10. Genuer R, Poggi JM, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31:2225–36.

    Article  Google Scholar 

  11. Genuer R, Poggi JM, Tuleau-Malot C. VSURF: An R Package for Variable Selection Using Random Forests. The R Journal. 2015;7:19–33.

    Article  Google Scholar 

  12. Hansen BE. Least squares model averaging. Econometrica. 2007;75:1175–89.

    Article  Google Scholar 

  13. Hansen BE, Racine J. Jackknife model averaging. J Econ. 2012;167:38–46.

    Article  Google Scholar 

  14. Heinze G, Wallisch C, Dunkler D. Variable selection: a review and recommendations for the practicing statistician. Biomet J. 2018;60:431–49.

    Article  Google Scholar 

  15. Hoerl A, Kennard R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.

    Article  Google Scholar 

  16. Hu X, Madden LV, Edwards S, Xu X. Combining models is more likely to give better predictions than single models. Phytopathology. 2015;105:1174–82.

    Article  Google Scholar 

  17. Kutner MH, Neter J, Nachtsheim CJ, Li W. Applied linear regression models. 5th ed. McGraw-Hill Irwin, Boston.

  18. Lee ER, Cho J, Yu K. A systematic review on model selection in high-dimensional regression. J Kor Stat Soc. 2019;48:1–12.

    Article  Google Scholar 

  19. Lin B, Wang Q, Zhang J, Pang Z. Stable prediction in high-dimensional linear models. Stat Comput. 2017;27:1401–12.

    Article  Google Scholar 

  20. Ma S, Li R, Tsai SL. Variable screening via quantile partial correlation. J Am Stat Assoc. 2017;112:650–63.

    Article  CAS  Google Scholar 

  21. Mallows CL. Some comments on \(C_p\). Technometrics. 1973;15:661–75.

  22. Székely G, Rizzo M, Bakirov N. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35:2769–94.

    Article  Google Scholar 

  23. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B. 1996;58:268–88.

    Google Scholar 

  24. Wang H, Zhang X, Zou G. Frequentist model averaging estimation: a review. J Syst Sci Complex. 2009;22:732–48.

    Article  Google Scholar 

  25. Yuan Z, Yang Y. Combining linear regression models: When and How? J Am Stat Assoc. 2005;100:1202–14.

    Article  CAS  Google Scholar 

  26. Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101:1418–29.

    Article  CAS  Google Scholar 

Download references


The author would like to express their appreciation to Ohio supercomputer center (OSC) for providing computing resources and technical support which help to implement the numerical studies in this manuscript.


Not applicable.

Author information

Authors and Affiliations



JP developed the method, conducted the numerical experiments, and wrote the manuscript. The author read and approved the final manuscript.

Corresponding author

Correspondence to Juming Pan.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The author declares no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Riboflavin dataset.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Pan, J. Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis. BMC Bioinformatics 22, 155 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: