AUTALASSO: an automatic adaptive LASSO for genome-wide prediction

Background Genome-wide prediction has become the method of choice in animal and plant breeding. Prediction of breeding values and phenotypes are routinely performed using large genomic data sets with number of markers on the order of several thousands to millions. The number of evaluated individuals is usually smaller which results in problems where model sparsity is of major concern. The LASSO technique has proven to be very well-suited for sparse problems often providing excellent prediction accuracy. Several computationally efficient LASSO algorithms have been developed, but optimization of hyper-parameters can be demanding. Results We have developed a novel automatic adaptive LASSO (AUTALASSO) based on the alternating direction method of multipliers (ADMM) optimization algorithm. The two major hyper-parameters of ADMM are the learning rate and the regularization factor. The learning rate is automatically tuned with line search and the regularization factor optimized using Golden section search. Results show that AUTALASSO provides superior prediction accuracy when evaluated on simulated and real bull data compared to the adaptive LASSO, LASSO and ridge regression implemented in the popular glmnet software. Conclusions The AUTALASSO provides a very flexible and computationally efficient approach to GWP, especially when it is important to obtain high prediction accuracy and genetic gain. The AUTALASSO also has the capability to perform GWAS of both additive and dominance effects with smaller prediction error than the ordinary LASSO.


Background
Genome-wide prediction (GWP) refers to the idea that regression coefficients, obtained by regressing genomic markers on phenotypic measurements in some training data, can be used to predict phenotypic values without the need to measure the phenotypes in some test data [1][2][3]. Genome-wide data often consist of several thousands, sometimes millions of markers. Since the number of individuals is usually smaller, in the range of some hundreds to a few thousands, the result is a multivariate highdimensional statistical issue that is often referred to as the p >> n problem [4,5]. The outer product of the design matrix in the ordinary least squares (OLS) estimator is not invertible in these situations and will result in a matrix *Correspondence: Patrik.Waldmann@slu.se 1 Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Box 7023, 750 07 Uppsala, Sweden Full list of author information is available at the end of the article of rank one. As a consequence, different regularization techniques have been proposed to facilitate in obtaining well-conditioned and unique solutions [6].
Regularization is a mathematical technique to impose prior information on the structure of the solution to an optimization problem. It closely resembles the task of using priors in Bayesian statistics. By formulating regression as a minimization of loss plus penalty function problem it is easy to enforce certain restrictions on the regression coefficients. Ridge regression [7] and the LASSO [8] are two popular regularization methods that only differ in the norms of their penalty functions, i.e. the 2 and 1 norms, respectively. It is well established that the LASSO usually results in better prediction accuracy than ridge regression if the predictors display low to moderate correlation between each other [4,9,10].
The 1 norm of the LASSO is special because it results in a convex minimization problem with sparse solutions (many regression coefficients are set to zero) which facilitate computations in large scale situations. A huge number of studies have been devoted to extensions of the lasso in various directions [11]. Generalized Linear Model (GLM), survival and support vector machine versions can be obtained by alternation of the loss function. Modifications of the penalty function have resulted in the elastic net, group and fused LASSO that can handle correlated, structured and time dependent predictor variables. Even more elaborate extensions include sparse additive models and the graphical LASSO [12].
Several optimization algorithms have been developed for the LASSO. The least-angle regression (LAR) algorithm borrows from forward-stepwise regression by updating an active set of regressors via a regularization path [13]. LAR provides relatively fast fitting for sparse data but requires one OLS solution per iteration and is not suitable for all loss functions. Another algorithm that has turned out to be effective is cyclic coordinate descent (CCD). CCD is also an iterative algorithm that updates the regression coefficients by choosing a single coordinate to update, and then perform a univariate minimization over this coordinate. The idea is to compute the OLS coefficients on the partial residuals and apply soft-thresholding to take care of the LASSO contribution to the penalty. CCD efficiently exploits sparsity and can handle large data sets [14].
The rise of big data problems has resulted in a resurgence of first order gradient descent based methods in mathematical optimization [15]. One problem with the LASSO is that the penalty function is nondifferentiable at zero. However, it turns out that the rediscovery of proximal operator techniques has solved this issue. The idea behind the proximal gradient is to decompose the objective function f into the sum of one convex, differentiable function g and one convex, nondifferentiable function h, and then use a proximal map of h along with the gradient of g to update in each iteration [16]. Proximal versions of the LASSO include the fast iterative soft-thresholding algorithm (FISTA) [17] and the alternating diretion method of multipliers (ADMM) [18]. These methods are fast and can handle large data, but the tuning of the learning rate and the regularization parameter can be tedious.
The purpose of this study is to introduce proximal algorithms, with a special focus on ADMM, into a GWP framework, and then to develop a general approach that automatically finds the optimal values of the learning rate and the regularization parameters of an adaptive LASSO. The learning rate determines how large steps the iterative algorithm takes towards the minimum. Too large steps leads to fewer iterations, but with a high risk to miss the minimum. Too small steps may lead to very many iterations to convergence and therefore also more computing time. The learning rate is optimized using line search with Armijo rule. The level of the regularization parameter in the LASSO regulates how many regression coefficients that are set to zero and is optimized with golden section search. Finally, the method is evaluated on simulated and real GWP data from cattle.

Simulated data
For the QTLMAS2010 data we set ADMM = 10 −4 and GSS = 10 −2 . Moreover, the initial bracketing points in the GSS procedure were set to λ a = 0.0001λ max and λ b = λ max . In this data set the last generation is used as test data and therefore only one evaluation is needed. The AUTALASSO completed in 190 s and resulted in a MSE test of 64.34 and r test of 0.676. The number of λvalues tested to convergence in the GSS algorithm was 21.
The additive effects were calculated as a = −θ λ opt,gen=0 + θ λ opt,gen=2 and the non-zero SNPs are plotted in Fig. 1. It can be seen that the major coefficients corresponds well with the anticipated QTLs, i.e. the two major additive loci and the additive part of the dominance locus. The heterozygote effects equals the dominance effects d and selected variables are plotted in Fig. 2

Real data
We used the same -values as for the QTLMAS2010 data in the analyses of the bull data. Moreover, we first ran one analysis of one fold with λ a = 0.0001λ max and λ b = λ max to get estimates that could guide to a more narrow λ interval. The consecutive folds were analysed with λ a = 0.0001λ max and λ b = 0.002λ max . The mean timing over the folds was 3020 s and the mean MSE test was 11.801 for AUTALASSO. The r test was estimated to 0.615 for the AUTALASSO. With 100 λ-values the mean timing of the glmnet ALASSO was 170 s and mean MSE test equal to 13.17. The MSE test for the LASSO and RR was 12.65 and 13.20, respectively. The corresponding estimates for r test were for the ALASSO, LASSO and RR 0.554, 0.579 and 0.554. Hence, the AUTALASSO produces the lowest prediction error and highest prediction accuracy, but is somewhat slower than glmnet. Though it should be noted that glmnet is based on the CCD algorithm and implemented in compiled Fortran which should be faster than Julia.
Of the original 716 environmental variables were 28 selected by the AUTALASSO as shown in Fig. 3. None of the 617 repeated measurement indicators were selected. The three highest positive effects were obtained for variables 672 (age), 692 (year 2007) and 691 (year 2006), respectively. Two negative effects also deviate from the others, variable 711 (semen coll. numb.15) and 689 (year 2004).
The plots of the selected regression coefficients θ λ opt for the two SNP homozygotes are provided from the authors. The additive and dominance effects were calculated in the same way as for the simulated data. Figure 4 provides a plot of the additive effects of the Fleckvieh bull data. The number of selected additive effects was 1116. The maximum additive effect was 0.223 for SNP 23194, whereas the minimum of -0.230 was found for SNP 28087. Figure 5 provides a plot of the dominance effects of the bull data.
The number of selected dominance effects was 1468. The maximum dominance effect was 0.279 for SNP 17125, whereas the minimum of -0.254 was found for SNP 26154.

Discussion
One of the major obstacles in using high density genomic marker data for prediction purposes is that the number of individuals that can be scored usually is considerably smaller than the number of markers. This situation introduces several problems. For example, when p > n the ordinary least squares estimates are not unique and will considerably over-fit the data resulting in a low prediction accuracy [4]. Other problems with big genome-wide data sets include spurious random correlations, incidental endogeneity, noise accumulation, and measurement error [19]. Regularization provides an efficient technique to constrain parameters which will lead to unique solutions and often better prediction accuracy. The most successful regularization method is the LASSO because it provides a convex optimization problem with sparse solutions [12]. Here, we have proposed an automatic adaptive LASSO that uses the ADMM algorithm for computing solutions to the objective function, line search for tuning of the learning rate and golden section search for optimization of the regularization parameter. A large number of methods have been proposed for genome-wide prediction and several papers have evaluated their predictive properties [20,21]. Regularization approaches include ridge regression, the LASSO and mixture modeling. Their Bayesian counterparts are often referred to as the "Bayesian alphabet" in the genetics literature [22]. The least angle regression version of the LASSO was exploited for GWP in [23] and found to yield better prediction accuracy than genomic BLUP and BayesA. In another study based on glmnet [14], the LASSO, the elastic net, adaptive LASSO and the adaptive elastic net all had similar accuracies but outperformed ridge regression and ridge regression BLUP [10]. In [9], an overview of both the frequentist and Bayesian LASSO is provided. They found the LASSO and adaptive LASSO to be competitive compared to several other methods in evaluations on both simulated and real data. In conclusion, the LASSO seems to be a computationally efficient method that yields good prediction accuracy on GWP data under many situations, but it would be valuable to compile all evaluations into a meta-analysis.
We have calculated both MSE test and r test as measures of model fit and found that they give somewhat contradictory results, especially for the glmnet implementations where the ordinary LASSO had both higher r test and MSE test than the ALASSO. The standard Pearson correlation coefficient is r test = σˆy test y test / σˆy test σ y test . When we decompose r test for the QTLMAS2010 data into its parts, we see that σ y test is the same at 10.83 for both methods. However, σˆy test is increased from 5.93 in the LASSO to 6.82 in the ALASSO, and σˆy test y test is increased from 43.556 in the LASSO to 49.86 in the ALASSO. The introduction of the weight factor in the ALASSO increases model complexity which results in decreased model bias, on the expense of an increased variance. However, most importantly, the MSE test is reduced. This is an example of the Bias-Variance trade-off that is fundamental in statistical learning [4]. Unfortunately, this phenomenon can complicate the interpretation of r test as a measure of model fit and it should be used with care. ADMM is a form of a proximal operator algorithm that has become very popular in mathematical optimization [15]. As described in the "Methods" section, we can see that the ADMM is closely related to the augmented Lagrangian method, but does not require a joint minimization which facilitate computations in special applications, for example the LASSO. The ADMM LASSO was first outlined in [18] and since then there have been extensions to for example the clustering LASSO [24] and the generalized LASSO [25]. We have used the least squares (Euclidean) loss function since we perform regression in this study, but it is straightforward to change to a classification loss function (for example logit or hinge loss) in the case of a binary trait.

Conclusions
The AUTALASSO provides a very flexible and computationally efficient approach to GWP, especially in situations were it is important to obtain high prediction accuracy and genetic gain. The algorithm automatically tunes the learning rate with line search and the regularization factor using Golden section search. The results show that the prediction error is smaller for the AUTALASSO when compared with the adaptive LASSO, the ordinary LASSO and ridge regression implemented in the popular glmnet package, albeit at the expense of some increased computing time. It is also shown that important SNP markers can be identified, both with underlying additive and dominance effects. The implementation in Julia is easy to modify, and we have hope that future versions of the AUTALASSO will fully capitalize on the distributed computing facilities of the ADMM algorithm.

ADMM adaptive LASSO
The LASSO was introduced by [8] and the objective function of the optimization problem consists of an ordinary squared 2 (Euclidean) norm loss function and an 1 norm penalty function where f (β) = (1/2) Xβ − y subject to β − θ = 0. ( The ADMM algorithm then is the following iterative procedure where ρ > 0 is the learning rate, λ > 0 is the regularization parameter and S λρ (ν) = (ν −λρ) + −(−ν −λρ) + is the soft-thresholding operator that depends on both ρ and λ. It is now possible to rewrite this algorithm using proximal operators (see Appendix) which is iterated to convergence determined by for tolerance parameter ADMM . The choice of ADMM can have a dramatic influence on the number of iterations to convergence and we recommend to set it between 10 −3 and 10 −4 .
In [26], it was shown that the LASSO can be inconsistent for variable selection under some situations and the adaptive LASSO was suggested in order to reduce bias and full-fill oracle properties (i.e that the estimator satisfies support and √ n-estimation consistency). The idea behind the adaptive LASSO is to use variable specific weightŝ w = 1/|β| γ , whereβ is estimated based on some pilot run and γ > 0 a tuning parameter. The OLS estimates ofβ are not defined when p > n, but univariate marginal regression (or covariance) coefficients calculated asβ = X T y/n can be used and yield good recovery properties under most situations [27]. Moreover, the adaptive lasso penalty can be seen as an approximation to the q penalties with q = 1 − γ . Hence, γ = 1 approximates the 0 penalty [12]. This formulation is a great advantage since the penalty function is still convex and the only part of the objective function that needs to be changed in the adaptive LASSO is g * (β) = λ ŵβ 1 .

Automatic tuning of the learning rate and the regularization factor
Many procedures have been proposed for tuning of the learning rate ρ in gradient decent algorithms. Since the proximal gradient is a special case of projected gradients methods we can use the fact that line search with Armijo rule can be used for optimization of the learning rate in this form of constrained optimization [15]. Given that the projected gradient starts from the current iterate of β t , moves in the direction of the negative gradient of the loss function β t − ρ∇f (β t ) and then performs a projection back to the convex constraint set that holds θ t , a line search can be formulated by iteratively decreasing ρ k until (6) no longer holds. The gradient is calculated as ∇f β t = X T Xβ t − y . A suitable choice can be to start with ρ 0 = 1 and decrease ρ k by a factor of 0.5 each kth iteration until ρ K is found. A new ρ K is then used in each tth ADMM update.
The bias-variance tradeoff in statistical learning makes it necessary to use test (or validation) data to avoid underand overfitting [4]. A common approach to find the minimum test MSE of a LASSO evaluation is to calculate the value of the regularization factor where all regression coefficients of the training data are zero, i.e. λ max = X T y ∞ , and then to evaluate the MSE test along a path of equally spaced and decreasing λ-values until an arbitrarily choosen λ min is reached [14]. Often, this amounts to fitting the LASSO for at least 100 λ-values, and the precision in finding the optimum depends on how well the path covers the shape of the test error function around the minimum.
Another option is to optimize the λ-value using a formal optimization algorithm that minimize the convex test error function. Here we propose to use the Golden section search (GSS) which is a technique for finding the minimum of a function by successively narrowing the range of values inside which the minimum is known to exist [28]. The GSS algorithm is described as follows: 1. Set two bracketing points, for example λ a = 0.001λ max and λ b = λ max . 2. Fit the ADMM LASSO with line search to the training data for λ a and λ b to obtain θ λ a and θ λ b which are used to predict the squared test errors 3. Set the golden section ratio to ϕ = 1 9. Calculate λ opt = (λ c + λ d )/2 and perform a final ADMM run to get θ λ opt and SE test,λ opt .
The value to choose for the convergence tolerance GSS can have a large effect on computing time. For a relatively good balance between precision and computing time, GSS can be set between 0.01 and 0.001. The start values of θ and β also influence computing time in each ADMM run. Hence, the θ and β fits in step 5 of the GSS algorithm are re-used in next iteration to obtain warm-starts, but the u-vector is set to 0. Warm-starts have been shown to reduce the number of iterations to convergence in ADMM considerably [29].

Simulated data
The original data was produced for the QTLMAS2010 work-shop [30]. The total number of individuals is 3226 and they are structured in a pedigree with 5 generations. The pedigree is founded by 20 individuals (5 males and 15 females), and created assuming that each female mates once and gives birth to approximately 30 progeny. Five autosomal chromosomes of length 100Mbp were simulated. A neutral coalescent model was used to simulate the SNP data. The procedure created 10,031 markers, including 263 monomorphic and 9768 biallelic SNPs. The continuous quantitative trait was created from 37 QTLs, including 9 controlled genes and 28 random genes.
The controlled QTLs included two pairs of epistatic genes with no individual effects, 3 maternally imprinted genes and two additive major genes with effects of -3 and 3. The random genes were selected from the simulated SNPs and then their effects were sampled from a truncated normal distribution, and accepted if the absolute value of the additive effect was smaller than 2. The resulting additive effects of the random genes varied between -1.98 and 1.93. Each simulated QTL was surrounded by 19-47 polymorphic SNPs (MAF >0.05) positioned within 1 Mb distance from the QTL. 364 SNPs were in moderate to high LD with the QTLs.
In addition to the original data, one dominance locus was positioned at SNP number 9212 on chromosome 5 by giving the heterozygote (1) an effect of 5.00 and the upper homozygote (2) a value of 5.01 . One overdominance locus was produced at SNP 9404 by assigning the heterozygote an effect of 5.00, and lower homozygote (0) the effect of -0.01 and upper homozygote (2)

Real data
The SNP genotype data comes from the bovine 50k Beadchip and the phenotypes measured were total number of spermatozoa in ejaculates during routine artificial insemination programs, both obtained from Austrian Fleckvieh bulls [31]. Sperm quality data were obtained from 3 Austrian ). In addition to year and station were also age of bull (13 -161 months), month of year (January -December), period between 2 successive ejaculates (0 -60 days) and collector (1 -17) used as input variables. All variables, except age of bull and period between successive ejaculates, were transformed into 0,1 indicator variables (one-hot encoding). Age of bull and period between successive ejaculates were standardized to mean 0 and SD 1, whereas the phenotype total number of spermatozoa was mean centered. The final number of environmental variables was 716, where the first 617 corresponds to bull repeated measurements indicators.
1799 Austrian Fleckvieh bulls were genotyped using the bovine SNP50 Beadchip v1 (Illumina) which contains 54001 SNPs (50k). Only autosomal SNPs that were assigned to a chromosome were used. Missing SNP data were imputed using FImpute [32] and converted to 0,1 and 2 code. SNPs with MAF smaller than 0.01 were removed resulting in a total of 38871 SNPs. These SNPs were also converted into indicator variables which produced 116613 markers. Bulls without phenotype measures were also removed. The final number of bulls was 671 with a total of 21567 phenotype measures. The data was randomly divided into 10 repeats (corresponding to 10-fold crossvalidation) of training data with 15097 observations and test data with 6470 observations.

Implementation
The full automatic adaptive LASSO (AUTALASSO) was implemented in Julia version 0.6 [33] and uses the ProximalOperators package for the proximal steps in the ADMM algorithm. The Julia code used for the QTLMAS2010 data is provided at: https://github.com/ patwa67/AUTALASSO.
The prediction accuracy of the AUTALASSO was compared with the R implementation of an adaptive LASSO (ALASSO), the ordinary LASSO (LASSO) and ridge regression (RR) using the glmnet package with default settings [14]. For comparative purposes were both MSE test and Pearson correlation coefficient (r test ) calculated. The evaluations were performed on an Intel Xeon E5 4-Core 3.7GHz with 256GB RAM. two terms. Another key property is for separable sum functions f (β, θ) = g(β) + h(θ) where splitting leads to Finally we note that there is a near relationship between proximal operators and gradient methods where when λ is small and f (β) is differentiable. In this formulation, ∇ denotes the gradient and λ is an equivalent to the learning rate of a gradient optimizer [16].

Alternating direction method of multipliers (ADMM)
The alternating direction method of multipliers (ADMM) is a simple but powerful algorithm that is appropriate for distributed convex optimization, and specifically to problems in statistics and machine learning. It is based on decomposition of a large global problem into small local coordinated subproblems to find a solution [18]. In order to understand the ADMM, it is useful to introduce the methods of dual ascent and augmented Lagrangians. For the convex optimization problem minimize β f (β) subject to Xβ = y (11) the Langrangian will be L(β, θ) = f (β) + θ T (Xβ − y) (12) and the dual function is where X is a matrix of size n × p, y is a vector of length n, θ is the dual variable and f * is the convex conjugate of f. The dual problem is to maximize g(θ). In dual ascent the dual problem is solved using gradient ascent. If the dual function is differentiable the following iterating scheme describes dual ascent where ρ > 0 is the learning rate and t is the iteration number. The first step of (14) is a β-minimization step and the second step is a dual variable update. If ρ > 0 is adequately chosen and several assumptions hold, then β t converges to an optimal point and θ t converges to an optimal dual point.
Augmented Lagrangian methods were partly developed to make the dual ascent method more robust in situations where the objective function lacks strict convexity. The augmented Lagrangian for (11) is L λ (β, θ) = f (β) + θ T (Xβ − y) + (λ/2) Xβ − y 2 2 (15) where λ is the penalty parameter. The augmented Lagrangian can be used in an iterative procedure denoted the method of multipliers β t+1 = argmin β L λ β, θ t θ t+1 = θ t + λ Xβ t+1 − y (16) which is the same as dual ascent, except for the augmented part of the Lagrangian and the use of λ as step size. The method of multipliers has better convergence properties than dual ascent.