Seagull: lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent

Background Statistical analyses of biological problems in life sciences often lead to high-dimensional linear models. To solve the corresponding system of equations, penalization approaches are often the methods of choice. They are especially useful in case of multicollinearity, which appears if the number of explanatory variables exceeds the number of observations or for some biological reason. Then, the model goodness of fit is penalized by some suitable function of interest. Prominent examples are the lasso, group lasso and sparse-group lasso. Here, we offer a fast and numerically cheap implementation of these operators via proximal gradient descent. The grid search for the penalty parameter is realized by warm starts. The step size between consecutive iterations is determined with backtracking line search. Finally, seagull -the R package presented here- produces complete regularization paths. Results Publicly available high-dimensional methylation data are used to compare seagull to the established R package SGL. The results of both packages enabled a precise prediction of biological age from DNA methylation status. But even though the results of seagull and SGL were very similar (R2 > 0.99), seagull computed the solution in a fraction of the time needed by SGL. Additionally, seagull enables the incorporation of weights for each penalized feature. Conclusions The following operators for linear regression models are available in seagull: lasso, group lasso, sparse-group lasso and Integrative LASSO with Penalty Factors (IPF-lasso). Thus, seagull is a convenient envelope of lasso variants.


Background
Linear regression is a widely used tool to explore the dependence between a response variable and explanatory variables.For example, in genome-wide association studies, counts of genetic variants along the genome are related to records of a disease or performance trait.The high throughput of modern biotechnological procedures enables studying an extremely large amount of explanatory variables (p).However, this often goes along with relatively few observations (n; p ≫ n), making the estimation of effects a challenge.As an example, high-dimensional methylation data are recently used to build regression models termed epigenetic clocks, which enable biological age to be predicted from DNA methylation status.Especially in the presence of multicollinearity, penalization methods have proved to be useful; Tikhonov, elastic net [1] and lasso [2] regularization are famous examples.
Standard approaches for epigenetic clocks employ elastic net regression, which performs well but typically results in only ~100 methylation sites with non-zero effect, limiting the potential for their genome-wide annotation and interpretation [3].To simultaneously detect non-zero effects and account for the relatedness of explanatory variables, the lasso has been modified and enhanced to the group lasso [4], the sparsegroup lasso [5] and the "Integrative LASSO with Penalty Factors" (IPF-lasso) [6].These particular modifications of the lasso assume an underlying group structure within the explanatory variables.For instance, in genome-wide association studies, a group structure can be identified from linkage and linkage disequilibrium among chromosome regions.Thus, a method that exploits this structure such as the (sparse-)group lasso has the potential to improve the accuracy of results.seagull -the R package presented herecontains implementations of the lasso variants mentioned above focusing on precision of parameter estimation and computational efficiency.
Hereinafter we briefly describe the optimization problem and all relevant input parameters of seagull.We then use public data to evaluate our package and to compare it to the established R package SGL [7].

Implementation
The R package seagull offers regularization paths for optimization problems of the form: This is also known as the sparse-group lasso [5].The first term expresses the "goodness of fit".The second and third term are penalties, both of which are multiplied with the penalty parameter λ > 0. The vector y contains n observations of the response variable.The vectors b and u represent non-penalized and penalized effects, respectively; X and Z are the corresponding design matrices.Moreover, α ∈ [0, 1] is the mixing parameter which convexly links the penalties.
In the two limiting cases of α = 1 and α = 0, the resulting objective function is the lasso [2] and the group lasso [4], respectively.However, if α is chosen to be less than 1, it is assumed that the explanatory variables have an underlying group/cluster structure (with non-overlapping groups).Groups need to be determined prior to the call of seagull, for instance, by applying a suitable cluster algorithm to the explanatory variables or by grouping them according to the source of measurement (RNA expression, SNP genotypes, etc.).Referring to this structure, the entries of u can be separated into the corresponding groups, say u (l) for group l and p l is the size of group l (L is the total number of groups).Hence: The penalty operators lasso, group lasso and sparse-group lasso are available in seagull.Furthermore, it is possible to consider weights for each explanatory variable and group.Thus, the implemented extension of the optimization problem ( 1) is: where ω F j and ω G l are positive weights for feature j and group l, respectively.The weights for groups are defined as: where the average over weights of features is taken over those features that belong to group l, i.e., ω F j ¼ 1 Hence, if all weights ω F j are set to 1, the optimization problem (2) yields problem (1).
The option of including weights can be used for any reason but it also enables the user to apply the strategy of IPF-lasso.In order to show this, we go back to optimization problem (2) with α = 1: For convenience, we assume the absence of any effects b and multiply the entire expression by 2n.Thus: where a simplification can be obtained via As a last step we assume that the entries of u are obtained from M different sources, i.e., "modalities"as called by the authors of [6].Then, we let all λ's which belong to the same modality m have the same value λ (m) .Therefore, the last term in the above expression can be written as a sum over modalities: And this immediately leads to the IPF-lasso.So in the seagull package, this particular lasso variant is implicitly included.The weights for features just need to be set accordingly, i.e., the same weight for features that belong to the same modality.
The penalty parameter λ > 0 reflects the strength of the penalization.Our package provides the opportunity to calculate a maximal value for λ (i.e., λ max ) and to perform a grid search by gradually decreasing this value down to a minimal value (i.e., λ min ).This minimum value is determined as a user-specified proportion ξ of λ max , i.e., λ min = ξλ max .The sequence of penalty parameters is then calculated on a logarithmic scale.To efficiently accelerate the corresponding grid search, we implemented warm starts.Thus, the solution of b and u for the current value of λ is used as starting point for the subsequent value of λ.Eventually, seagull provides a sequence of penalty parameters and calculates the corresponding path of solutions.
The optimization problem is solved via proximal gradient descent (PGD; e.g., [8]).PGD is an extension of gradient descent for optimization problems which contain nonsmooth parts, i.e., problems where the gradient is not available for the entire objective function.More details about this algorithm are presented in Additional file 4. As PGD is an iterative algorithm, a proper step size between consecutive iterations is crucial for convergence.This step size is determined with backtracking line search.
In the best case, an iterative algorithm such as PGD converges to the solution of the optimization problem.But typically, in the neighborhood of the solution the gain from one iteration to the next iteration decreases.Thus, a stopping criterion is implemented.Such a criterion is often based on a measurement of gain itself.In seagull, we implemented a stopping criterion which measures the gain from iteration k − 1 to k and scales it with the estimates at iteration k: We refer to ε rel as the relative accuracy, due to its definition as a ratio.
All implemented algorithms are based on the R package Rcpp 1.0.3 [9].

Data and evaluation criteria
We analyzed blood DNA methylation profiles at about 1.9 million CpG sites and its association with chronological age in mice (n = 141).The data set is publicly available and described in detail in [10].We split the data set into training (n = 75) and validation (n = 66) data, where all age classes appeared almost equally in both sets, and applied the sparse-group lasso variant of seagull 1.0.5.R scripts for processing and analyzing the data are available in the supplementary material (Additional files 1 and 2).Ready-to-use data are also available at Code Ocean (see Availability of data and materials).
We compared the outcome of seagull to that of the established R package SGL 1.3 [7].Its implementation is based on accelerated generalized gradient descent.Both packages offer regularization paths to the same optimization problem (1).Thus, the input parameters for both packages are very similar.For example, we set the mixing parameter α to 0.95, a grid of 50 values for λ, and the ratio ξ between minimal and maximal λ equal to 0.001.However, despite the similarities between seagull and SGL, the implemented convergence criteria differ due to different meanings of accuracy parameters.In the SGL package, this parameter is an upper bound for the ℓ 1 -norm of the estimates of b and u.Unless stated otherwise, we set the accuracy parameter for SGL and seagull to 10 − 4 and 10 − 6 , respectively.
We used the following criteria to evaluate the two packages: the minimum mean squared error (MSE) of predicted age based on methylation data (i.e., methylation age) and measured chronological age in the validation set along the regularization path, the squared correlation coefficient R 2 between predicted and chronological age, the number of features with an estimated effect different from zero (i.e., non-zeros), and the execution time needed to compute the entire regularization path.
Another example for the application of seagull in genome-wide association studies is given in Additional file 3. It is shown how parameters (i.e., weights) can be tuned for IPF-lasso.

Results and discussion
Figure 1a shows the model fit based on regression coefficients which led to the minimum mean squared error of chronological age in the validation set.The correlation between the chronological and the predicted age ("methylation age") was 95.8%, and 5095 non-zero effects were identified with seagull.Hence, using only the identified fraction of CpG sites enabled a precise prediction of age.As an option for regulating the sparsity, increasing the accuracy parameter of seagull by two magnitudes (10 − 6 to 10 − 4 ) increased the number of non-zero effects by one magnitude.Though the implemented convergence criteria differed between both packages, results were similar.The correlation between regression coefficients leading to the minimum mean squared error was 99.5% (Fig. 1b).The number of non-zero effects obtained with SGL was 8822.In contrast to SGL, seagull computed the solution in a fraction of the time (seagull: ~2 h; SGL: ~45 h).
Table 1 displays the impact of the accuracy parameter on evaluation criteria in detail.If the accuracy parameter of seagull was set to 10 − 6 or 10 − 8 , seagull outran SGL with respect to all evaluation criteria.The measures for R 2 and non- zeros are both based on the value of λ for which the minimum MSE of prediction was obtained.The dependence between λ and the corresponding MSE is shown in Fig. 2.
In addition to SGL, seagull enables the opportunity to introduce weights for each penalized feature.This option was recently investigated in [11], where an optimization problem similar to (2) was used to estimate effects of SNP genotypes in a flowering plant breed (Arabidopsis thaliana).In that study, weights were defined according to  the minor allele frequency (MAF) of genetic variants at each locus j, i.e., ω F j ¼ 2 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi MAF j ð1 − MAF j Þ p .Unlike seagull, the optimization problem described in [11] does not involve the incorporation of group weights other than the square root of the size of each group.

Conclusions
Here we introduced our R package seagull, which offers regularization paths for the lasso, group lasso, sparse-group lasso, and IPF-lasso for linear regression models.We compared seagull to the established R package SGL.Both packages delivered similar results in terms of mean squared error, squared correlation coefficient, and sparsity pattern.Despite these similarities, seagull computed the solution in a fraction of time that SGL required.Furthermore, only seagull offered the opportunity to incorporate weights for each penalized variable which enables further variants of the lasso such as the IPFlasso.In summary, seagull is a convenient envelope of lasso variants.

Fig. 1 a
Fig. 1 a Relationship between observed (chronological) and predicted (methylation) age.Each blue dot represents a sample in each class of observed chronological age (3mos, 4mos, etc.).Mean methylation age and error bars are displayed in black for each class of age.b Methylation age obtained with seagull vs. SGL.Blue dots represent samples; the dashed line is a regression line with slope 1

Fig. 2
Fig. 2 Path of mean squared error (MSE) of predicted age for each λ.Results of seagull and SGL are represented in blue and violet, respectively.The vertical lines mark the index in the sequence of λ's with lowest MSE in corresponding color.The respective lowest MSE of seagull and SGL were 11.79 and 12.78

Table 1
Performance evaluationAccuracy parameter refers to a package-dependent convergence parameter; R 2 is the squared correlation coefficient and MSE is the mean squared error of chronological and predicted age; Non-zero denotes the number of CpG sites with nonzero effect estimate; Time is the computational time needed to calculate the full regularization path