Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis

Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By weighting several competing statistical models suitably, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage model averaging procedure to enhance accuracy and stability in prediction for high-dimensional linear regression. First we employ a high-dimensional variable selection method such as LASSO to screen redundant predictors and construct a class of candidate models, then we apply the jackknife cross-validation to optimize model weights for averaging. In simulation studies, the proposed technique outperforms commonly used alternative methods under high-dimensional regression setting, in terms of minimizing the mean of the squared prediction error. We apply the proposed method to a riboflavin data, the result show that such method is quite efficient in forecasting the riboflavin production rate, when there are thousands of genes and only tens of subjects. Compared with a recent high-dimensional model averaging procedure (Ando and Li in J Am Stat Assoc 109:254–65, 2014), the proposed approach enjoys three appealing features thus has better predictive performance: (1) More suitable methods are applied for model constructing and weighting. (2) Computational flexibility is retained since each candidate model and its corresponding weight are determined in the low-dimensional setting and the quadratic programming is utilized in the cross-validation. (3) Model selection and averaging are combined in the procedure thus it makes full use of the strengths of both techniques. As a consequence, the proposed method can achieve stable and accurate predictions in high-dimensional linear models, and can greatly help practical researchers analyze genetic data in medical research.


Background
Model averaging and model selection are two important techniques for estimation and prediction in statistical modeling. Model selection is appealing for its simplicity and interpretability thus has been attracting more attention. A systematic review on model selection can be found in Heinze, Wallisch, and Dunkler [14] or Lee, Cho, and Yu [18]. The main drawback for model selection, however, is that the uncertainty is essentially ignored once an optimal model is found. A possible consequence is that the inference based on the "best" model can be misleading-either overestimating or underestimating, due to poor representative of the real distribution of the data. In addition, different methods, criteria, and even small changes in the data can result in very different final models, thus prediction accuracy of mode selection is reduced [23].
Model averaging has been widely recognized as a solution when model uncertainty is high and interferes with the quality of the prediction. By combining a set of competing models and wisely choosing the weights, model averaging can even out the overestimation and underestimation, therefore often leads to a more reliable result than the individual prediction. With model averaging, no assumption that a true model exists needs to be made, and inferences could be made about all candidate models. Hence, this method is preferred when uncertainty arises in the model selection process [24].
It has been shown that averaging models tend to yield better predictions than single models in the classical setting where the sample size is at least one order of magnitude greater than the number of predictors [16], how well it performs in high-dimensional regression, in which the number of predictors exceeds the sample size, is still under investigation. Recently, [2] proposed a two-step model averaging with cross-validation (MCV) procedure. The method consists of two steps: the first step is to construct candidate models by marginal correlation; the second step is to find optimal model weights by delete-one cross-validation. The main feature of MCV lies on the fact that the standard constraint of the model weights summing up to 1 is relaxed to the model weights can be vary freely between 0 and 1, and it is claimed that this relaxation lowers the prediction error.
The MCV approach is one of the only few references and is the first study that removes the weight restriction, thus marks a significant step toward the development in the highdimensional frequentist model averaging. Nonetheless, we have several concerns with this method. First, the model construction step of MCV employs the marginal correlation to measure the strength of association between predictors and the response. Hence, it can miss some significant variables that are associated with the response conditionally but not marginally. Furthermore, the marginal correlation can be misleading when there exist non-trivial correlations among the predictors [20]. As a result, week candidate models could be involved and be given fairly large weights while strong candidate models could have small weights, and predictive inference is impaired. Second, we find the arguments in the paper regarding why the restriction of total weights summing to 1 should be removed are not convincing. For example, Ando and Li remarked "consider the extreme case that the predictors are uncorrelated with each other and the noise variance is ignorable. The predictors from each model become uncorrelated with each other as well and the optimal combined predictor is the sum of all model predictors, implying that the optimal weight assignment should be (1, 1, ..., ). Thus, the total weight should be equal to M, not 1. " This statement is problematic as for this special case, each model should be assigned the weight of 1 M if there are M of them, thus the total weight is still 1. Considering such point, more scrutiny is needed on whether relaxing the total weight constraint will lower the prediction error. Third, numerical study was conducted in that paper to demonstrate the favorable performance of MCV, but the same original data is employed for both model building and model validation, and the mean squared error (MSE) is calculated as the performance measure. A result of this model development process is that the MSE will tend to understate the inherent variability in making future predictions from the selected model [17], therefore the predictive capability of MCV needs to be recalibrated by the collection of new data.
To address the above concerns of MCV and to further enhance the predictive ability, we develop a two-stage model averaging procedure for high-dimensional linear regression. In the first step, a high-dimensional variable selection method such as LASSO is employed to screen redundant predictors and to construct candidate models. In the second step, the jackknife cross-validation with the conventional constraint of total weights summing to 1 is applied to optimize model weights for averaging. The proposed method makes full use of the strengths of both model selection and averaging, meanwhile it is computational feasible since each candidate model and its corresponding weight are determined in the low-dimensional setting and the quadratic programming is utilized in the cross-validation. We conduct simulations and a real data example, the results illustrate that the proposed approach is efficient in forecasting and outperforms the existing predicting methodologies.
The remainder of the article is organized as follows. "Method" section describes the problem setting and an improved two-stage high-dimensional model averaging procedure is developed. We present a simulation study in "Simulations" section and provide a real data example in "Riboflavin Data Analysis" section for empirically examining the effectiveness of the proposed method. The paper ends with discussion and future research directions.

Method
Given the dataset of n observations, a linear regression model takes the form of where y i is the response in the ith trial, x i1 , . . . , x ip are the predictors, β 1 , . . . , β p are the regression coefficients, and ǫ i is the error term. Here, we assume: 1) the independent error ǫ i has mean zero and finite variance σ 2 ; 2) the number of predictors p exceeds the sample size n; 3) only a subset of predictors have contributions in predicting the response.
Alternatively, in matrix form, model (2.1) can be written as where (2.1) For model (2.1), we will develop a two-stage model averaging method to improve the predictive performance for high-dimensional regression.

Model construction step
For the high-dimensional linear models, [2] proposed a two-stage model averaging procedure named MCV. The procedure first divides p predictors into K + 1 groups by the absolute marginal correlations between all predictors and the response. Let model M k consist of the predictors with marginal correlations falling into the kth group. The first group has the highest values, and the (K + 1) th group has values closest to 0 and is then discarded. Thus the number of candidate models is K. Each model can also be written in matrix form y = X k β k + ǫ , for k = 1, . . . , K . Given candidate models whose number of predictors is smaller than the sample size, the regression coefficients are estimated by the least-squares method β k = (X ′ k X k ) −1 X ′ k y and the predicted value ŷ k = X kβk . As discussed in "Introduction" section, the marginal correlation could provide misleading results in sorting variables and preparing candidate models therefore reduces prediction accuracy. To prepare proper candidate models ready for averaging, in the first step of the proposed method we consider some representative measurements which can simultaneously rank and select variables.
The first competitor is Distance Correlation. It is a measure of dependence between random vectors introduced by [22]. For all distributions with finite first moments, such method provides a robust way measuring and testing dependence by correlation of distances. The distance correlation test is implemented by the energy package in R. To find candidate models, we partition all predictors into different groups by the p-values of the test. The first group has the lowest values, and the last group has largest values and then is discarded.
Two well-known variable selection approaches are also considered for comparison. Ridge Regression [15] is a way to create a parsimonious model when the number of predictors exceeds the number of observations, or when multicollinearity exists in predictors. The predictors are sorted and chosen by the method proposed by Cule et al. [7] and can be conducted by the ridge package in R. Similar to Ridge Regression, LASSO [23] also works in a similar fashion the only difference is of the penalty term. LASSO is performed by ncvreg package in R, the importance of predictors is based on the coefficients magnitude, and those whose coefficients shrunk to 0 are abandoned.
Random Forest [4] is also under consideration. It is an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees. Random Forest can be used to rank the importance of variables and to conduct variable selection in regression. The technique was described in the papers by Genuer, Poggi and Tuleau-Malot [10,11], and is implemented in the R package VSURF.
To evaluate the prediction performance, we adopt a simulation setting from Ando and Li [2]. The detailed structure of the simulation study is given below.
To compare the methods in a fair manner, after candidate models are constructed in step 1, in step 2 the same delete-one cross-validation from MCV is used to optimize the model weights. In each run, we simulate a training set with n observations from the linear regression model in (2.1) for each combination. All the model building and averaging are done on the training set. We also collect an independent test data set of n * = 1000 observations to compute the mean of the squared prediction error (MSPE), MSPE= We repeat this process 100 times, and report the means and standard deviations of MSPE for each method. Table 1 summarizes the simulation results. Several observations can be made from this table. First, the predictive performance is affected by several factors, including the sample size, number of predictors, number of significant predictors, and the distribution of the error terms. More specifically, the prediction accuracy increases as n increases, p, s, or σ decreases. Second, out of the 54 scenarios, marginal correlation, Distance Correlation, Ridge Regression, LASSO, Random Forest perform the best 9/11/14/19/1 times, respectively. It seems that each method has its unique merits, none of the 5 methods can universally dominate the other competitors. The overall performance of LASSO appears to be the best, Ridge Regression is a close second. To explore the methods in more detail, we find that LASSO often performs favorably when n is large, or when s and σ are small, yet ridge regression tends to be more stable among different cases. In summary, the simulation results show that using a shrinkage variable selection method to construct candidate models is preferable to the marginal correlation in the model construction step for model averaging.

Model weighting step
After the candidate models and their corresponding least-squares predicted values {ŷ 1 , . . . ,ŷ K } are obtained, the second stage of model averaging is to determine the model weights. For the MCV method, the weight vector w is optimized by minimizing the delete-one cross-validation criterion where ỹ = K k=1 w kỹk , ỹ k = (ỹ (1) k , . . . ,ỹ (n) k ) be an n-dimensional vector, and ỹ (i) k , is the predicted value ŷ (i) k computed without the ith observation. Here, the restriction of K k=1 w k = 1 is removed, and the minimization of ŵ can be easily solved by a quadratic programming. Following the similar idea, [3] further extended model averaging to highdimensional generalized linear models.
To compare with the relaxed cross-validation criterion in MCV, we consider three weight choice alternates in which the conventional constraint of total weights summing to 1 is applied.
AIC [1] is a consistent estimator of the Kullback-Leibler discrepancy between the distribution that generated the data and the model that approximates it. Suppose there are K models, AIC k = −2log(L k ) + 2p k , where L k is the maximized likelihood function and p k is the number of parameters under the k-th model. [5] advocated to use the MAIC for determining the weights More recently, [12] proposed a least-squares model average estimator with model weights selected by minimizing Mallows' C p criterion [21]. The Mallows' criterion for the model averaging estimator (MMA) is where k(w) is the effective number of parameters, and σ 2 can be estimated by a sample variance from a "large" approximating model. Hansen showed that the model average estimator that minimizes the Mallows criterion also minimizes the squared error in large samples.
In addition, [13] suggested to select the weights by minimizing a deleted-one crossvalidation criterion, and the method is termed as the jackknife model averaging (JMA). The same method is adopted in MCV as expressed in (2.2). However, JMA requires K k=1 w k = 1 while such conventional restriction is relaxed in MCV. This criterion is quadratic in the weights, so computation is a plausible application of quadratic programming. It is also shown that the JMA estimator is asymptotically optimal in achieving the lowest possible expected squared error.
To compare all the weight determining methods, we conduct another simulation study. Simulation 2. We use the same setting and data generating process in Simulation 1. The same candidate models are constructed from the training data, all the methods  described in this session are used to determine the model weights. An independent test data set of n * = 1000 observations is generated to compute the prediction accuracy measured by the MSPE. We repeat the iteration 100 times. Table 2 compares the means and standard deviations of MSPE for each method. Out of the 54 scenarios, the relaxed cross-validation (refers to CV in the table), MAIC, MMA, JMA perform the best 9/7/14/24 times. The overall performance of JMA appears to be the best, MMA performs similarly but slightly worse. More importantly, we find that in most cases (45 out of 54), relaxing the total weight constraint does not lower the prediction error. The results confirm with our concern in "Introduction" section.

The improved two-stage high-dimensional model averaging
Based on the results from the numerical experiments, we propose an improved model average (IMA) procedure, which contains the following two steps: Stage 1: Construct candidate models by a high-dimensional penalized variable selection method (LASSO or its variants, such as SCAD [8] and ALASSO [26]).
• Run the penalized regression for the high-dimensional data and obtain the LASSO estimate for each predictor. • Partition all p predictors into K + 1 groups by the magnitude of the estimates. The first group has the largest absolute values of LASSO estimates, and the (K + 1) th group contains the predictors whose coefficients are already shrunk to 0. We drop the (K + 1) th group, thus the number of candidate models is K. • For each model M k , k = 1, . . . , K , y = X k β k + ǫ , and the least-squares prediction is ŷ k = X kβk .
Stage 2: Optimize the model weights by the standard jackknife cross-validation criterion by Hansen and Racine [13].
• Let ỹ k = (ỹ (1) k , . . . ,ỹ (n) k ) be an n-dimensional vector, where ỹ (i) k , is the predicted value ŷ (i) k computed without the ith observation. • Let w = (w 1 , . . . , w k ) be the weight vector for the K models, then the jackknife predictor is • The optimal estimate of w is obtained by minimizing the cross-validation where the standard restriction K k=1 w k = 1 is applied. • Finally, the model averaging predicted value ŷ is expressed as Remark As a referee points out, ranking is an essential component of the out-performance. By ranking all the predictors based on the magnitude of the estimates, the models proposed in Stage 1 are not equally competitive. The first few models are likely to be more informative than the last few models because of the ordering of regressors. Therefore, assigning larger weights to the first few models and smaller weights to the last few models tend to improve the predictive performance. Note that [2] also ranked the predictors in the MCV method by using the marginal correlation. Moreover, we have conducted some simulation experiments, the results show that (although not shown in the article) ranking predictors into K different groups has lower prediction errors than a random assignment of non-zero predictors.
Remark The overall performance of the proposed method depends on the choice of parameter K which is the total number of models. Determining the optimal value of K is similar to finding the best tuning parameter in penalized variable selection. For each candidate value of K, a group of models is constructed and the best model weights are determined. The optimal K is the one with the lowest cross-validation score. Note that the specification of K may vary case by case, here we propose a practical strategy for optimizing the choice of K. The number of predictors p k in each model is set to be the same, p k = n × q , where n is the sample size, q is a value between 0 and 1. First, we set T to be the number of predictors whose LASSO estimates are non-zero. Because the total number of predictors from K models is equal to K × p k , it is desirable for K × q × n to be no greater than T. Subject to this restriction, we choose K and q whose cross-validation prediction error is the smallest.

Simulations
To assess the performance of the improved model averaging (IMA) method, we compare our approach with some existing model selection and averaging methods on simulated data. These competitors include MAIC [5], LASSO [23], MCV [2], and SISSCAD (a combination of sure independence screening [9] and SCAD [8]).
In this simulation study, we follow the settings of Ando and Li [2], 6 model settings are considered.
Bold values indicate the smallest MSPE in each scenario  For each setting, we generate n * = 1000 observations to compute MSPE. Figures 1, 2 S. (f )), both tend to decrease the prediction error. In addition, the proposed procedure IMA yields the smallest median of MSPE in most settings, and achieves stable performance by having the short length of the boxplots. The results demonstrate that IMA performs favorably in comparison with other methods. In particular, the median of MSPE of IMA is at least 20% lower than that of MCV, we remark that the IMA method outperforms MCV, and thus significantly improves the prediction accuracy and stability of model averaging for high-dimensional linear regression. Table 3 compares the computational time required for each method under the settings (a) to (f ). The computations are conducted using R version 3.5.0 on the Owens clusters at the Ohio Supercomputer Center (OSC). After 100 simulation runs, the averaged time (in seconds) and corresponding standard deviations are given. Several observations can be seen from this table. First, with the fixed number of predictors p and the correlation ρ , increasing the sample size n will decrease the computing time. Second, with the fixed number of predictors p and the sample size n, decreasing the correlation ρ will decrease the computing time. Third, in general, model selection methods (LASSO, SISSCAD) have less computing time than the model averaging procedures (MAIC, MCV, IMA). Last, the proposed method IMA is much faster than other model averaging competitors and is comparable to model selection methods regarding the computational cost.

Riboflavin data analysis
To further explore the practical behavior of the proposed method, we consider the riboflavin data with n = 71 observations in Bühlmann and Mandozzi (2014 [6]). The response variable y is the logarithm of the riboflavin production rate, the predictors are the gene expression levels for 4088 genes. The methods to be included in the comparison experiment are: IMA, MAIC, SCAD, MCV, and RMSA (another recent model  averaging procedure by Lin et al. [19]. We randomly divide the data into a training set with 50 observations and a test set with the remaining 21. In each time, p = 500 or 1000 genes are chosen randomly from the totality of all genes, the process is repeated 100 times.
The results for the Riboflavin data are presented in Table 4. Similar to the results in the simulation study, the prediction errors of IMA are substantially smaller than those of the other methods, and the computing time is pretty competitive. It is also noted that RMSA performs well in this real example, while the amount of computation required for such method is considerably high. Therefore, compared to RMSA, IMA not only has slightly better performance in prediction accuracy, but also enjoys much lower computing cost.

Discussion
To achieve a stable and improved prediction, we propose a novel two-stage model averaging procedure in high-dimensional linear regression models. The method uses variable selection to group predictors for model averaging and applies the standard jackknife cross-validation for optimizing weights.
Compared with the recent model averaging procedure (MCV, [2]), the proposed approach has better predictive performance, meanwhile it retains computational flexibility even for extra high-dimensional data. We conduct numerical studies including simulations and a real riboflavin data analysis, the results demonstrate that the proposed technique is quite efficient in forecasting and outperforms the existing methodologies in general.
Even though advantages of model averaging over model selection have been demonstrated in the low dimensional regression analysis, it is still unclear to a large extent when model averaging should be preferred is high-dimensional linear models. [25] proposed an index, PIE, to measure model selection instability in estimation. They suggested that: If PIE is bigger than 0.5, model averaging should be considered; if PIE is less than 0.4, model selection is likely to work better than the model combining methods. However, PIE does not function properly for high-dimensional data based on our investigation. In the future, we intend to develop a measure which can be used to guide practitioners for deciding which way to go: averaging or selection, when dealing with high-dimensional problems.