The R package seagull offers regularization paths for optimization problems of the form:
$$ \min_{(b,u)} \frac{1}{2n} \left\Vert y - Xb - Zu \right\Vert_2^2 + \alpha\lambda \left\Vert u \right\Vert_1 + (1-\alpha)\lambda \left\Vert u \right\Vert_{2,1}. \tag{1} $$
This is also known as the sparse-group lasso [5]. The first term expresses the “goodness of fit”. The second and third terms are penalties, both of which are multiplied by the penalty parameter λ > 0. The vector y contains the n observations of the response variable. The vectors b and u represent non-penalized and penalized effects, respectively; X and Z are the corresponding design matrices. Moreover, α ∈ [0, 1] is the mixing parameter which convexly combines the two penalties.
In the two limiting cases α = 1 and α = 0, the objective function reduces to the lasso [2] and the group lasso [4], respectively. If α is chosen to be less than 1, however, the explanatory variables are assumed to have an underlying group/cluster structure (with non-overlapping groups). Groups need to be determined prior to the call of seagull, for instance, by applying a suitable clustering algorithm to the explanatory variables or by grouping them according to the source of measurement (RNA expression, SNP genotypes, etc.). Referring to this structure, the entries of u can be separated into the corresponding groups, say \( u^{(l)} \) for group l, where \( p_l \) is the size of group l and L is the total number of groups. Hence:
$$ \left\Vert u \right\Vert_{2,1} = \sum_{l=1}^{L} \sqrt{p_l}\, \left\Vert u^{(l)} \right\Vert_2. $$
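To make the notation concrete, the following R sketch evaluates the objective in (1) for given coefficient vectors; the function name and arguments are illustrative and not part of seagull's interface. The vector `groups` assigns each entry of u to one of the L non-overlapping groups.

```r
## Illustrative sketch (not seagull's API): evaluate the objective in (1)
## for given coefficient vectors b and u. `groups` assigns each entry of u
## to one of the L non-overlapping groups.
sgl_objective <- function(y, X, Z, b, u, groups, lambda, alpha) {
  n        <- length(y)
  residual <- y - X %*% b - Z %*% u
  fit      <- sum(residual^2) / (2 * n)
  l1       <- sum(abs(u))
  ## group-wise l2 norms, each scaled by sqrt(group size) as in (1)
  l21 <- sum(sapply(split(u, groups), function(u_l) {
    sqrt(length(u_l)) * sqrt(sum(u_l^2))
  }))
  fit + alpha * lambda * l1 + (1 - alpha) * lambda * l21
}
```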
The lasso, group lasso, and sparse-group lasso penalties are all available in seagull. Furthermore, it is possible to assign weights to each explanatory variable and to each group. The implemented extension of the optimization problem (1) is thus:
$$ \min_{(b,u)} \frac{1}{2n} \left\Vert y - Xb - Zu \right\Vert_2^2 + \alpha\lambda \sum_{j=1}^{p} \omega_j^F \left| u_j \right| + (1-\alpha)\lambda \sum_{l=1}^{L} \omega_l^G \left\Vert u^{(l)} \right\Vert_2, \tag{2} $$
where \( \omega_j^F \) and \( \omega_l^G \) are positive weights for feature j and group l, respectively. The group weights are defined as:
$$ \omega_l^G = \sqrt{p_l\, \overline{\omega_j^F}}, $$
where the average of the feature weights is taken over those features that belong to group l, i.e., \( \overline{\omega_j^F} = \frac{1}{p_l} \sum_{j \in \mathrm{group}\ l} \omega_j^F \). Hence, if all weights \( \omega_j^F \) are set to 1, the optimization problem (2) reduces to problem (1).
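The relation between feature and group weights can be computed directly. The sketch below is illustrative (function name and arguments are our own) and confirms that unit feature weights yield \( \omega_l^G = \sqrt{p_l} \):

```r
## Illustrative sketch: derive group weights from feature weights.
## `weights_f` holds one positive weight per entry of u; `groups` assigns
## each entry to its group.
group_weights <- function(weights_f, groups) {
  sapply(split(weights_f, groups), function(w_l) {
    sqrt(length(w_l) * mean(w_l))  # omega_l^G = sqrt(p_l * mean(omega_j^F))
  })
}

group_weights(weights_f = rep(1, 5), groups = c(1, 1, 2, 2, 2))
## unit feature weights reduce to sqrt(p_l): approx. 1.414 and 1.732
```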
The option of including weights can be used for any purpose, but it notably enables the user to apply the strategy of the IPF-lasso [6]. To show this, we return to optimization problem (2) with α = 1:
$$ \min_{(b,u)} \frac{1}{2n} \left\Vert y - Xb - Zu \right\Vert_2^2 + \lambda \sum_{j=1}^{p} \omega_j^F \left| u_j \right|. $$
For convenience, we assume the absence of any non-penalized effects b and multiply the entire expression by 2n. Thus:
$$ \min_{u} \left\Vert y - Zu \right\Vert_2^2 + 2n\lambda \sum_{j=1}^{p} \omega_j^F \left| u_j \right|, $$
which simplifies with the substitution \( \lambda_j = 2n\lambda\omega_j^F \):
$$ \min_{u} \left\Vert y - Zu \right\Vert_2^2 + \sum_{j=1}^{p} \lambda_j \left| u_j \right|. $$
As a last step, we assume that the entries of u are obtained from M different sources, i.e., “modalities”, as they are called by the authors of [6]. We then let all \( \lambda_j \) that belong to the same modality m share the same value \( \lambda^{(m)} \). Therefore, the last term in the above expression can be written as a sum over modalities:
$$ \sum_{j=1}^{p} \lambda_j \left| u_j \right| = \sum_{m=1}^{M} \lambda^{(m)} \left\Vert u^{(m)} \right\Vert_1. $$
This immediately leads to the IPF-lasso. Hence, this particular lasso variant is implicitly included in the seagull package: the feature weights merely need to be set accordingly, i.e., the same weight for all features that belong to the same modality.
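As a hypothetical example, suppose the columns of Z combine two modalities, say 100 RNA features followed by 500 SNP genotypes, and the second modality should be penalized twice as strongly; the sizes and relative factors below are invented for illustration:

```r
## Illustrative IPF-lasso-style feature weights: every feature inherits the
## penalty factor of its modality (sizes and factors are hypothetical).
modality  <- rep(c(1, 2), times = c(100, 500))
factors   <- c(1, 2)            # one penalty factor per modality
weights_f <- factors[modality]  # identical weight within a modality
```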
The penalty parameter λ > 0 reflects the strength of the penalization. Our package provides the opportunity to calculate a maximal value \( \lambda_{\max} \) and to perform a grid search by gradually decreasing this value down to a minimal value \( \lambda_{\min} \). This minimum is determined as a user-specified proportion ξ of \( \lambda_{\max} \), i.e., \( \lambda_{\min} = \xi\lambda_{\max} \). The sequence of penalty parameters is calculated on a logarithmic scale. To accelerate the grid search, we implemented warm starts: the solution for b and u at the current value of λ is used as the starting point for the subsequent value of λ. Eventually, seagull provides a sequence of penalty parameters and calculates the corresponding path of solutions.
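A minimal sketch of such a grid, assuming for illustration 20 grid points (the function name and defaults are our own, not seagull's interface):

```r
## Illustrative logarithmic grid between lambda_max and xi * lambda_max.
lambda_grid <- function(lambda_max, xi = 0.01, n_lambda = 20) {
  exp(seq(log(lambda_max), log(xi * lambda_max), length.out = n_lambda))
}

lambda_grid(lambda_max = 10, xi = 0.01, n_lambda = 5)
## approximately: 10, 3.16, 1, 0.316, 0.1
```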
The optimization problem is solved via proximal gradient descent (PGD; e.g., [8]). PGD extends gradient descent to optimization problems that contain non-smooth parts, i.e., problems where the gradient is not available for the entire objective function. More details about this algorithm are presented in Additional file 4. As PGD is an iterative algorithm, a proper step size between consecutive iterations is crucial for convergence. This step size is determined via backtracking line search.
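To illustrate a single PGD iteration, the following R sketch treats the pure lasso case (α = 1 and no non-penalized effects b), where the proximal operator of the weighted ℓ1 penalty is soft-thresholding. Function names, arguments, and the backtracking constants are our own choices, not seagull's internal code:

```r
## Sketch of one PGD iteration for the pure lasso case (alpha = 1, no b).
## The smooth part is f(u) = ||y - Zu||_2^2 / (2n); the proximal operator
## of the l1 penalty is soft-thresholding.
soft_threshold <- function(x, kappa) sign(x) * pmax(abs(x) - kappa, 0)

pgd_step <- function(y, Z, u, lambda, t = 1, beta = 0.5) {
  n    <- length(y)
  f    <- function(v) sum((y - Z %*% v)^2) / (2 * n)
  grad <- -crossprod(Z, y - Z %*% u) / n  # gradient of the smooth part
  repeat {
    u_new <- soft_threshold(u - t * grad, t * lambda)
    diff  <- u_new - u
    ## backtracking line search: shrink t until the quadratic bound holds
    if (f(u_new) <= f(u) + sum(grad * diff) + sum(diff^2) / (2 * t)) break
    t <- beta * t
  }
  drop(u_new)
}
```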
In the best case, an iterative algorithm such as PGD converges to the solution of the optimization problem. Typically, however, the gain from one iteration to the next decreases in the neighborhood of the solution. A stopping criterion is therefore implemented, often based on a measure of this gain itself. In seagull, the stopping criterion measures the change from iteration k − 1 to k and scales it by the size of the estimates at iteration k:
$$ \frac{\left\Vert \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix}^{[k]} - \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix}^{[k-1]} \right\Vert_{\infty}}{\left\Vert \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix}^{[k]} \right\Vert_{2}} \le \varepsilon_{\mathrm{rel}}. $$
We refer to \( \varepsilon_{\mathrm{rel}} \) as the relative accuracy, due to its definition as a ratio.
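Expressed in R, assuming illustrative names and an illustrative value for \( \varepsilon_{\mathrm{rel}} \), the criterion reads:

```r
## Sketch of the stopping rule: `est_k` and `est_km1` are the stacked
## estimates (b, u) at iterations k and k - 1; the names and the value of
## eps_rel are illustrative.
converged <- function(est_k, est_km1, eps_rel = 1e-4) {
  max(abs(est_k - est_km1)) / sqrt(sum(est_k^2)) <= eps_rel
}
```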
All implemented algorithms are based on the R package Rcpp 1.0.3 [9].