ecpc: An R-package for generic co-data models for high-dimensional prediction

High-dimensional prediction considers data with more variables than samples. Generic research goals are to find the best predictor or to select variables. Results may be improved by exploiting prior information in the form of co-data, providing complementary data not on the samples, but on the variables. We consider adaptive ridge penalised generalised linear and Cox models, in which the variable-specific ridge penalties are adapted to the co-data to give a priori more weight to more important variables. The R-package ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data, however, was handled by adaptive discretisation, potentially modelling the co-data inefficiently and losing information. Here, we present an extension to the method and software for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. The co-data variable weights are then estimated with empirical Bayes moment estimation. After placing the estimation procedure in the classical regression framework, extension to generalised additive and shape constrained co-data models is straightforward. In addition, we show how ridge penalties may be transformed to elastic net penalties with the R-package squeezy. In simulation studies we first compare various co-data models for continuous co-data from the extension to the original method. Secondly, we compare variable selection performance to that of other variable selection methods. Moreover, we demonstrate use of the package in several examples throughout the paper.


Introduction
Generalised linear models (GLMs) (McCullagh and Nelder, 1989) are the cornerstone of many statistical models for prediction and variable selection purposes, modelling the relation between outcome data and observed data. When observed data are high-dimensional, with the number of variables far exceeding the number of samples, these models may be penalised to account for the high-dimensionality. Well known examples include the ridge (Hoerl and Kennard, 1970), lasso (Tibshirani, 1996) and elastic net penalty (Zou and Hastie, 2005). One of the main assumptions underlying generalised linear models is that all variables are exchangeable. In many high-dimensional settings, however, this assumption is questionable (Ignatiadis and Lolas, 2020). For example, in cancer genomics, variables may represent gene expression of genes that are grouped according to some biological function. Variables within these groups may have a similar effect, while variables from different groups have a different effect. Hence, variables are exchangeable within groups, but not between groups. To alleviate the exchangeability assumption, shared information may be modelled explicitly in the prior distribution of the variables, e.g. by introducing shared group penalties, penalising more important groups of variables relatively less (as done by van de Wiel et al. (2016)). The shared prior information may be represented in data matrices, called co-data, to distinguish the main, observed data with information on the samples from the complementary data with information on the variables. When the co-data are related to the effect sizes of variables, these data may be exploited to improve prediction and variable selection in high-dimensional data settings.
Various R-packages accommodate approaches to incorporate some form of co-data. Early methods such as grplasso (Meier et al., 2008) and gglasso (Yang and Zou, 2015) allow for categorical, or grouped, co-data, by using group lasso penalties. As these penalties are governed by one overall penalty parameter, they may not be flexible enough to model the relation between the effect sizes and grouped co-data. To increase this flexibility, other methods were developed that estimate multiple, group-specific penalty (or prior) parameters, using efficient empirical Bayes approaches. Examples include GRridge (van de Wiel et al., 2016) for group-adaptive ridge penalties (normal priors), graper (Velten and Huber, 2019) for group-adaptive spike-and-slab priors and gren (Münch et al., 2019) for group-adaptive elastic net priors. The method ecpc (van Nee et al., 2021b) presents a flexible empirical Bayes approach to extend the use of grouped co-data to various other (and potentially multiple) co-data types, such as hierarchical groups and continuous co-data, for multi-group adaptive ridge penalties. For continuous co-data, however, the normal prior variances corresponding to the ridge penalties are not modelled as a function of the continuous co-data variable, but rather as a function of groups of variables corresponding to the adaptively discretised co-data variable. When the relation between the prior variance and continuous co-data is non-constant and/or "simple", e.g. linear, the adaptive discretisation may lead to a loss of information and/or model the relation inefficiently. The package fwelnet (Tay et al., 2020) develops feature-weighted elastic net for continuous co-data specifically (there called "features of features"). Regression coefficients are estimated jointly with co-data variable weights, modelling the variable-specific elastic net penalties by a normalised, exponential function of the co-data.
For categorical co-data, fwelnet boils down to an elastic net penalty on the group level (Tay et al., 2020), governed by one overall penalty parameter. Hence, it may lack flexibility when compared to empirical Bayes methods estimating multiple penalties. The package squeezy (van Nee et al., 2021a) presents fast approximate marginal likelihood estimates for group-adaptive elastic net penalties, but is available for grouped co-data only.
Here, we present an extension of the R-package ecpc to generic co-data models, in particular for continuous co-data. First, we show how a classical linear regression model may be used to regress the (unknown) variable-specific normal prior variances on the co-data. The co-data variable weights are estimated with an empirical Bayes moment estimator, slightly modified from van Nee et al. (2021b). Then, we present how the estimation procedure may be extended straightforwardly to model the relation between the prior variances and co-data by generalised additive models (Hastie and Tibshirani, 1986) for modelling non-linear functions and by shape constrained additive models (Pya and Wood, 2015) for shape constrained functions, e.g. positive and monotonically increasing functions. In addition, we use ideas from van Nee et al. (2021a) to transform the adaptive ridge penalties to elastic net penalties using the package squeezy. Either this approach or the previously implemented posterior selection approaches (van Nee et al., 2021b) may be used for variable selection.

Getting started
The goal of this paper is to provide a stand-alone introduction to the package ecpc and to provide examples of its use. To get started, one may install the R-package ecpc from CRAN and load it by running:

R> install.packages("ecpc")
R> library("ecpc")

The main function in the package is the eponymous function ecpc(), which fits a ridge penalised generalised linear model by estimating the co-data variable weights and regression coefficients subsequently. The function outputs an object of the S3-class 'ecpc', for which the methods summary(), print(), plot(), predict() and coef() have been implemented. See the index in ?"ecpc-package" for a list of all functions, including functions for preparing and visualising co-data, or see Figure 1 for a cheat sheet of the main functions and workflow of the package.
The remainder of this paper is organised as follows: Section 2 first presents the model and the co-data models for a linear, generalised additive and shape-constrained co-data model. All types of co-data models are accompanied by short examples of how to use the package. Next, it is shown how ridge penalties may be transformed to elastic net penalties, again accompanied by a short toy example. Section 3 first compares various ways of modelling a continuous co-data variable with the originally proposed adaptive discretisation in a simulation study. Secondly, the method is compared to other methods in terms of variable selection. Section 4 then demonstrates use of the software in an application to the classification of lymph node metastasis using high-dimensional RNA expression data. Section 5 concludes.

Method
Consider response data Y ∈ R^n, observed high-dimensional data X ∈ R^{n×p} with p ≫ n, which contain information on the n samples of Y, and possibly multiple co-data matrices Z^(d) ∈ R^{p×G_d}, d = 1, .., D, which contain prior information on the p variables of X. Generally, co-data matrices may include continuous or categorical co-data. For categorical co-data, dummy variables should be provided. For categorical co-data with overlapping categories, dummy variables may be weighted accordingly to account for multiplicity (see van Nee et al. (2021b)).
We consider a generalised linear model for the response with canonical link function g(·), parameterised with regression coefficients β ∈ R^p. Furthermore, we model the regression coefficients with a normal prior, corresponding to a ridge penalty, in which the prior variance is regressed on the co-data:

E(Y_i | X_i, β) = g^{-1}(X_i β), i = 1, .., n,
β_k ~ N(0, v_k), v_k = τ²_global Σ_{d=1}^D w_d Z_k^(d) γ^(d), k = 1, .., p,   (1)

with X_i and Z_k^(d) the ith and kth row of X and Z^(d) respectively, γ^(d) ∈ R^{G_d} the co-data variable weights for co-data matrix d, w the co-data matrix weights and τ²_global a scaling factor which may improve numerical computations in practice. When the data X consist of multiple data modalities, like gene expression data, copy number data and methylation data in genomics, scaling factors specific to the data modalities may be used (Boulesteix et al., 2017; van de Wiel et al., 2021) and estimated with ecpc.
Remark. The first version of ecpc, as described in van Nee et al. (2021b), only handles (possibly overlapping) groups of variables. Co-data is then supplied in a list of group sets, in the argument groupsets. Continuous co-data variables may be handled by adaptive discretisation. The new version discussed in this paper allows for both (undiscretised) continuous and grouped co-data. In the new version of ecpc, one needs to supply co-data as a list of co-data matrices in the argument Z. The function createZforGroupset() may be used to obtain the co-data matrix with dummy variables corresponding to a group set for grouped co-data.

Details for linear co-data models
Consider a linear relation between the prior variances and the co-data: v = Zγ.
The co-data variable weights γ are estimated in van Nee et al. (2021b) with moment estimation by equating theoretical moments to empirical moments. For co-data that represent groups of variables, the empirical moments are averaged over all variables in that group, leading to a linear system of G equations and G unknowns. For co-data that do not represent groups of variables, e.g. continuous co-data, we simply form one group per variable, leading to the following linear system of p equations and G unknowns:

(C • C) Zγ = b,   (2)

with • representing the Hadamard (element-wise) product. C ∈ R^{p×p} and b ∈ R^p are derived in van Nee et al. (2021b) and given by:

C = (X^T Ŵ X + Ω̂)^{-1} X^T Ŵ X, b = β̂ • β̂ − ṽ,

with β̂ the maximum penalised likelihood estimate given an initial τ̂²_global and corresponding constant diagonal ridge penalty matrix Ω̂, with Ŵ a diagonal weight matrix used in the iterative weighted least squares algorithm to fit β̂, and with ṽ an estimate for the variance of β̂ with respect to the response data Y. Solving the linear system leads to the following least squares estimate for γ. As the prior variance has to be positive, the resulting prior variance estimate Zγ̂ is truncated at 0:

γ̂ = argmin_γ ||(C • C)Zγ − b||²₂, v̂ = (Zγ̂)₊.   (3)

In generalised linear models it is common to use a log-link for the response to enforce positivity, resulting in positive, multiplicative effects. Note that here, however, Equation (2) is the result of equating theoretical to empirical moments. Replacing b by log(b) would violate the moment equalities. Also, if we would enforce positivity instead by, for example, substituting Zγ directly by v = exp(Zγ), the moment equations would no longer be linear in γ, nor multiplicative, with Z_g the gth co-data variable. Hence, the advantage of simply post-hoc truncating Zγ̂ is that the system of equations in Equation (3) is easily solved. Section 2.3 discusses shape constrained co-data models which may be used to enforce positivity by including it as a constraint.
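As a numerical illustration of this moment-based estimate, the following base-R sketch solves a small system of the above form and truncates the resulting prior variances at zero. All names and the toy dimensions are ours, chosen for illustration; this is not the package's internal implementation:

```r
# Minimal sketch of the truncated least-squares estimate, not the ecpc
# internals: given a p x p matrix C, a p x G co-data matrix Z and a
# moment vector b, solve the least-squares problem and truncate at 0.
set.seed(1)
p <- 50; G <- 2
C <- diag(runif(p, 0.5, 1))          # toy "shrinkage" matrix
Z <- cbind(1, runif(p))              # intercept + one co-data variable
b <- as.vector((C * C) %*% Z %*% c(0.1, 0.5)) + rnorm(p, sd = 0.01)
A <- (C * C) %*% Z                   # Hadamard square of C times Z
gamma_hat <- solve(crossprod(A), crossprod(A, b))  # least squares
v_hat <- pmax(Z %*% gamma_hat, 0)    # truncate prior variances at 0
```

Note how the truncation is purely post hoc: the linear system is solved unconstrained, and only the fitted prior variances are clipped.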

Interpretation
The interpretation of the co-data weights γ (scaled by τ²_global) is similar to that in regular linear regression: when co-data variable Z_g increases by one unit, while the other co-data variables are kept fixed, then the prior variance increases by γ_g (γ_g τ²_global). Consequently, when the prior variance for the effect β_k of some variable X_k increases by γ_g (γ_g τ²_global), then the a priori expected squared effect size, E(β²_k) = v_k, increases by γ_g (γ_g τ²_global). In other words, if we were to compare the effects of two variables X_k and X_l with the same co-data values, except for one co-data variable which is one unit higher for X_k than for X_l, then we would a priori expect β²_k to be on average γ_g (γ_g τ²_global) larger than β²_l.

Short example in ecpc
For this short example and the ones below, simulate some linear response data:

R> set.seed(1)
R> p <- 300 #number of covariates
R> n <- 100 #sample size training data set
R> n2 <- 100 #sample size test data set
R> beta <- rnorm(p, mean=0, sd=0.1) #simulate effects
R> X <- matrix(rnorm(n*p, mean=0, sd=1), n, p) #simulate observed training data
R> Y <- rnorm(n, mean = X%*%beta, sd=1) #simulate response training data
R> X2 <- matrix(rnorm(n2*p, mean=0, sd=1), n2, p) #simulate observed test data
R> Y2 <- rnorm(n2, mean = X2%*%beta, sd=1) #simulate response test data

As co-data, suppose that we have two co-data variables: one informative co-data variable containing the true absolute effect sizes and one non-informative co-data variable containing random normally distributed values:

R> Z1 <- abs(beta) #informative co-data
R> Z2 <- rnorm(p, mean=0, sd=1) #random, non-informative co-data
R> Z <- cbind(Z1, Z2) #(px2)-dimensional co-data matrix

Then we fit the linear co-data model and test the fit on the test data. In addition, we set postselection=FALSE to only estimate the dense model, without selecting variables a posteriori (see Section 2.4):

R> fit <- ecpc(Y, X, Z=list(Z), X2=X2, Y2=Y2, postselection=FALSE)
[1] "Estimate global tau^2 (equiv. global ridge penalty lambda)"
[1] "Estimate co-data weights and (if included) hyperpenalties with mgcv"
[1] "Estimate regression coefficients"

Note that the co-data matrix is provided in a list, as it is also possible to provide a list of multiple co-data matrices. This will be used to explicitly distinguish linear co-data variables from smooth or constrained ones, as exemplified below. The performance of the fit on the test data may be given for both the co-data learnt model fit with ecpc() and for the co-data agnostic model fit with one global ridge penalty. Alternatively, the plot() method provides a graph of the regression coefficients and prior variances.
If the R-packages ggplot2 (Wickham, 2016) and ggpubr (Kassambara, 2020) are installed, the output looks as shown in Figure 2, else a similar plot will be made with the base R plot() function.
R> plot(fit, show="coefficients")
R> plot(fit, show="priorweights", Z=list(Z))

Lastly, the regression coefficients may be re-estimated for different prior parameters. First, the function penalties() may be used to change some prior parameters and retrieve the corresponding ridge penalties. Then, the method coef() re-estimates the regression coefficients given these penalties. For example, if one would alter the global level of regularisation by multiplying τ²_global by 2:

R> new_penalties <- penalties(fit, tauglobal = fit$tauglobal * 2, Z=list(Z))
R> new_coefficients <- coef(fit, penalties=new_penalties, X=X, Y=Y)

Note that in general, however, altering the prior parameters by hand will not be needed as these parameters are optimised by the function ecpc(). The functions above, however, may be used to conveniently skip prior parameter estimation when prior parameters are known, e.g. when results have been saved and need to be checked quickly. For example, here we just set all prior parameters to one and only estimate the regression coefficients:

R> new_penalties2 <- penalties(tauglobal = 1, sigmahat = 1, gamma = c(1,1),
+    w = 1, Z=list(Z))
R> new_coefficients2 <- coef.ecpc(penalties=new_penalties2, X=X, Y=Y)

Details for generalised additive co-data models

Generalised additive models (GAMs), originally proposed in Hastie and Tibshirani (1986), have been widely applied to model non-linear relations. Applied here, we assume that the relation between the prior variance and co-data may be modelled by a sum of smooth functions, s_1(·), .., s_G(·), of the co-data variables:

v_k = Σ_{g=1}^G s_g(Z_{k,g}), k = 1, .., p.

In practice, the smooth functions are estimated by using a basis expansion to recast the problem into a linear model (as originally proposed by, for example, Wahba (1980)).
So, for a basis expansion consisting of J_g basis functions φ_{g,j}(·), j = 1, .., J_g, for co-data variable Z_g:

s_g(Z_{k,g}) = Σ_{j=1}^{J_g} γ^GAM_{g,j} φ_{g,j}(Z_{k,g}), i.e. v = Φγ^GAM,

with Φ the combined spline basis matrix. The type and number of basis functions should in general be chosen such that they are flexible enough to approximate the underlying function well. To avoid overfitting for too many basis functions, the coefficients may be estimated by optimising the likelihood penalised by a smoothing penalty. While our software allows the user to supply any basis expansion, we focus here on the popular p-splines (see Eilers and Marx (2021) for an introduction). This approach combines flexible spline basis functions with a quadratic smoothing penalty on the differences of the spline coefficients. So, the smoothing penalty is of the form γ^{GAM,T}_g λ_g S_g γ^GAM_g, where the difference penalty matrix S_g smooths the non-linear function of the co-data variable Z_g and where λ_g is the corresponding smoothing penalty parameter. Hence, the least squares estimate for the linear co-data model in Equation (3) is extended to the following estimate for the GAM coefficients in a non-linear co-data model:

γ̂^GAM = argmin_γ ||(C • C)Φγ − b||²₂ + Σ_{g=1}^G λ_g γ^T_g S_g γ_g.   (4)

This least-squares equation is of a form also known as penalised signal regression (Marx and Eilers, 1999) and can be solved by the function gam() (or bam() for big data) of the R-package mgcv, for example. This function also provides fast and stable estimation of the penalties λ_g (Wood, 2011). Alternatively, when only one smoothing penalty matrix is provided, the smoothing penalty may be estimated by using random splits as proposed in van Nee et al. (2021b).
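The p-spline ingredients can be sketched in a few lines of base R. This is a toy construction for illustration only; in practice the package's helper functions createZforSplines() and createS() build these matrices:

```r
# Toy construction of p-spline ingredients (illustration only):
# a cubic B-spline basis Phi for a co-data variable z on [0, 1] and a
# second-order difference penalty matrix S = t(D) %*% D.
z <- seq(0, 1, length.out = 100)       # a continuous co-data variable
G <- 20                                # number of spline basis functions
deg <- 3; ndx <- G - deg; dx <- 1/ndx  # G = ndx + deg equally spaced B-splines
knots <- ((-deg):(ndx + deg)) * dx     # knots extended beyond the range
Phi <- splines::splineDesign(knots, z, ord = deg + 1, outer.ok = TRUE)
D <- diff(diag(G), differences = 2)    # second-order difference operator
S <- crossprod(D)                      # quadratic smoothing penalty matrix
```

The penalty γ^T S γ then shrinks adjacent spline coefficients towards each other, which is what allows many basis functions without overfitting.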
Remark. Note that grouped co-data may be coded as group sets or as dummies in a co-data matrix Z. The former option, however, does not allow for a generalised ridge penalty, but does allow for other penalties including the ordinary ridge and (hierarchical) lasso penalty.

Short example in ecpc
We continue with the simulated data from above. First, we use the helper functions createZforSplines() and createS() to create spline basis matrices and corresponding smoothing penalty matrices respectively. The degree of the spline functions and order of the penalty matrices are set to 3 and 2 by default, respectively. We set the number of splines to 20 for the first co-data variable and to 30 for the second in this example.
R> Z1.s <- createZforSplines(values=Z1, G=20, bdeg=3)
R> S1.Z1 <- createS(orderPen=2, G=20)
R> Z2.s <- createZforSplines(values=Z2, G=30, bdeg=3)
R> S1.Z2 <- createS(orderPen=2, G=30)

Before we fit the model, we first concatenate the two co-data matrices in a list. The variables of this list are always renamed such that the ith element is named Zi. The smoothing penalty matrices should be given in a separate argument paraPen, similar to the eponymous argument in gam(). Each element in this argument's list should match one of the names Zi, for which the corresponding smoothing matrix is given in S1 (and optionally S2, S3, et cetera for multiple smoothing matrices for one co-data matrix).
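A call combining these pieces might look as follows. The argument layout is inferred from the naming conventions described above, so treat it as a sketch rather than verbatim package output:

```r
# Sketch: fit the generalised additive co-data model. The co-data
# matrices are renamed Z1, Z2 inside the list, and paraPen couples each
# to its smoothing penalty matrix under the name S1.
fit.gam <- ecpc(Y, X, Z = list(Z1 = Z1.s, Z2 = Z2.s),
                paraPen = list(Z1 = list(S1 = S1.Z1), Z2 = list(S1 = S1.Z2)),
                X2 = X2, Y2 = Y2, postselection = FALSE)
```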

R> fit.gam$MSEecpc
[1] 2.472784

The non-linear relation between the prior variance and each co-data source may again be plotted with the plot() method, either with the co-data spline variables on the x-axis, or with the continuous co-data values on the x-axis. The corresponding output is shown in Figure 3.

Figure 3: Example output for plot() in a generalised additive co-data model with the co-data variables on the x-axis (left two plots) or continuous values on the x-axis (right two plots).
Alternatively, one may plot the non-linear relation directly. The spline variable coefficients are given in fit$gamma for both co-data matrices and have an attribute codataSource to indicate for each coefficient to which co-data matrix it belongs. The non-linear relation of one co-data matrix is then plotted as follows (output not shown):
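A direct plot can be sketched as below, using the codataSource attribute described above to pick out the coefficients belonging to the first co-data matrix. This is a sketch; the exact attribute handling in the fitted object may differ in detail:

```r
# Sketch: recover the fitted smooth for the first co-data matrix by
# multiplying its spline basis with the matching spline coefficients.
src <- attr(fit.gam$gamma, "codataSource")   # co-data matrix index per coefficient
gamma1 <- fit.gam$gamma[src == 1]            # coefficients of co-data matrix 1
v1 <- Z1.s %*% gamma1                        # fitted contribution to the prior variance
plot(Z1, v1, xlab = "Co-data value", ylab = "Estimated prior variance")
```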

Details for shape-constrained co-data models
Prior assumptions on the shape of the relation between the prior variance and co-data, such as monotonicity or convexity, may be imposed by constrained optimisation of spline coefficients (Pya and Wood, 2015). Pya and Wood (2015) develop shape-constrained p-splines to handle difficulties in optimising multiple smoothing penalties due to discontinuous gradients. Their R-package scam, however, cannot be readily used for signal regression, which differs from regular regression in that the spline basis matrix is multiplied by the known matrix (C • C). Moreover, the smoothing parameters are estimated using a generalised cross-validation (GCV) criterion, which we show below to overfit in the unconstrained case. Therefore, we fall back to the simple approach of directly constraining the spline coefficients.
We use the approach proposed in van Nee et al. (2021b) to estimate the smoothing penalties: first we estimate the smoothing penalties λ_g separately for each co-data variable Z_g using random splits of the data. As this optimisation is in one dimension only, we can use Brent's algorithm from the general purpose optimisation R-function optim(), which should be sufficient to handle discontinuous gradients. Then we estimate the spline coefficients γ_g for each co-data variable Z_g and corresponding spline basis function matrix Φ_g. In the general constrained setting, this estimate is given by subjecting the possible solution to Equation (4) to (in)equality constraints given in matrix M_(in)eq,g and vector b_(in)eq,g:

γ̂_g = argmin_{γ_g} ||(C • C)Φ_g γ_g − b||²₂ + λ_g γ^T_g S_g γ_g, subject to M_ineq,g γ_g ≥ b_ineq,g, M_eq,g γ_g = b_eq,g.   (5)

Several shapes may be imposed by choosing M_ineq and b_ineq accordingly (Pya and Wood, 2015): i) positivity may be imposed by constraining the spline coefficients to be positive; ii) a monotonically increasing (decreasing) shape may be imposed by constraining the first order differences γ_{i+1} − γ_i to be positive (negative); iii) convexity (concavity) may be imposed by constraining the second order differences γ_{i+2} − 2γ_{i+1} + γ_i to be positive (negative); iv) any combination of the shapes i)-iii) may be imposed by combining the corresponding constraints.
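The constraint matrices for shapes i)-iii) are simple (difference) matrices. With G spline coefficients they might be built as follows; the names are ours and this is not the package's internal representation:

```r
# Sketch: inequality constraint matrices M such that M %*% gamma >= 0
# imposes the shapes discussed above on G spline coefficients.
G <- 10
M_pos  <- diag(G)                          # i) positivity: gamma_i >= 0
M_mono <- diff(diag(G), differences = 1)   # ii) increasing: gamma_{i+1} - gamma_i >= 0
M_conv <- diff(diag(G), differences = 2)   # iii) convexity: second differences >= 0
M_both <- rbind(M_pos, M_mono)             # iv) positive and increasing combined
```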
Then, given the spline coefficient estimates γ̂_g, we combine the multiple co-data variables by estimating co-data source weights w = (w_1, .., w_G)^T using the same method of moments equations (van Nee et al., 2021b). For Z_w := [Φ_1 γ̂_1, .., Φ_G γ̂_G]:

(C • C) Z_w w = b.

Note that a similar equation is used when multiple co-data matrices Z^(1), .., Z^(D) are provided.
Remark. Multiple co-data matrices Z^(1), .., Z^(D) may be provided in a list to the function ecpc(), or stacked and provided in a list of one co-data matrix Z = [Z^(1), .., Z^(D)]. When the function bam() from mgcv is used, multiple smoothing parameters may be used for either representation, and are estimated jointly. Afterwards, the co-data variable weights γ are jointly estimated for all co-data matrices as well. As a result, the co-data weights w do not need to be estimated, as they are implicitly accounted for in the joint estimate of γ. In contrast, when the random splitting is used, only one smoothing parameter per co-data matrix may be estimated. Therefore, the co-data matrix weights are estimated to combine multiple co-data matrices. By default, the function ecpc() uses bam() when co-data is provided in co-data matrices and no constraints are provided. This may be changed by setting hypershrinkage="none" when no penalty for the moment estimates is used, or to hypershrinkage="ridge" when a generalised ridge penalty as in Equation (4) is used with random splits for estimating the penalty parameter. When constraints are provided, the function ecpc() automatically switches to the random splits.
Note that an intercept is excluded by default, but that it can easily be included by appending a column of ones to Z.

Transforming ridge penalties to elastic net penalties
The first version of ecpc allows for posterior selection (van Nee et al., 2021b), exemplified in Section 4. Alternatively, the obtained adaptive ridge penalties may be transformed to elastic net penalties for simultaneous estimation and variable selection, as explained here.
In the proposed model in Equation (1), the regression coefficients follow a normal prior corresponding to a ridge penalty. Now, suppose that each β_k independently follows some other prior distribution π(β_k), parameterised by a covariate-specific prior parameter λ_k and with prior mean 0 and finite variance Var(β_k) = Z_k γ = h(λ_k) for some known monotonic variance function h(·). As an example we consider the elastic net prior, corresponding to the elastic net penalty, with variable-specific elastic net penalties. Recently, it was shown that when the prior parameters are group-specific, the marginal likelihood, as function of λ_k, is approximately the same as the marginal likelihood as function of the normal prior parameters γ, as the prior distribution of the linear predictor η = Xβ is asymptotically normally distributed (van Nee et al., 2021a). This result also holds for priors with variable-specific, finite variance (Eicker, 1966). We may use this result to obtain approximate method of moments equations for other priors. Denote by β̂_R(Y) the ridge penalised maximum likelihood estimate as function of the observed response data Y. The method of moments equations are given by equating the theoretical marginal moments to the empirical moments (van Nee et al., 2021b). Using the normal approximation for the marginal likelihood, we may obtain the ridge estimates γ̂ as above to estimate the covariate-specific prior variances v̂_k = (Z_k γ̂)₊, and transform them with the variance function to obtain the covariate-specific prior parameters: λ̂_k = h^{-1}(v̂_k). This transformation could be used to transform the prior variance estimates for the generalised additive co-data model in Equation (4) and for the shape-constrained co-data model in Equation (5) too. Note, however, that the penalisation and constraints are applied to γ and not to λ. So in general, for priors other than the normal prior, the variance function is not linear, such that the additivity in GAMs is on the prior variance level and not on the transformed level; nor does the transformation of a shape-constrained function, h^{-1}(s(Z_k)), necessarily have the same shape as s(Z_k).

Short example for elastic net
We may use the R-package squeezy to transform the ridge penalties to elastic net penalties (van Nee et al., 2021a), in which we use the fit from ecpc() from above. As example, we use the elastic net parameter α = 0.3, for which we summarise the obtained elastic net penalties and regression coefficients:

R> if(!requireNamespace("squeezy")) install.packages("squeezy")
R> library("squeezy")

Simulation study
Estimation and prediction performance have been compared for several methods in van Nee et al. (2021b). Here, we focus on continuous co-data to exemplify the newly proposed co-data models. First, we perform a simulation study to compare the estimates of the prior variance and prediction performance for different co-data models proposed here and the adaptive discretisation proposed in the first version of ecpc. Secondly, we perform a simulation study to compare different variable selection methods.

Estimation and prediction performance of various co-data models
We use the same simulation set-up as in van Nee et al. (2021b) and simulate 50 training and test data sets for some true vector of regression coefficients β⁰ ∈ R³⁰⁰. Again, we consider random and informative co-data, but now continuous versions of it:

1. Random: generate standard normal co-data Z_k i.i.d. ∼ N(0, 1) for k = 1, .., p.
2. Informative: use the true regression coefficients to inform the co-data; Z_k = |β⁰_k|.

We compare the following co-data models:

i) ridge: a co-data agnostic, global ridge penalty, corresponding to the co-data intercept only model. Any co-data method should outperform this baseline method when the co-data is informative, and preferably not lose much when the co-data is not informative;
ii) linear: a linear co-data model with an intercept and one (non-)informative co-data variable;
iii) gam: a generalised additive co-data model using p-splines of degree 3 and with a difference penalty matrix of second order differences as suggested in Eilers and Marx (2021). We use 20 splines and the marginal likelihood method available in bam() from the mgcv package unless stated otherwise;
iv) scam.p: the same as the generalised additive model but with the shape constrained to be positive;
v) scam.pmi: the same as the generalised additive model but with the shape constrained to be positive and monotonically increasing;
vi) AD: adaptive discretisation of the continuous co-data as proposed in van Nee et al. (2021b).
We use a minimum of 20 variables in the smallest groups, which leads to seven hierarchical groups. Figure 5 shows the prior variance estimates for the different co-data models, with the corresponding prediction performance shown in Figure 6. As expected, the estimated prior variance is flat for random co-data and increasing for informative co-data, leading to prediction performance similar to and better than the co-data agnostic ordinary ridge, respectively. The estimates of the (constrained) generalised additive co-data models are slightly more non-linear than the linear estimate, but lead to similar prediction performance. The variance of the estimates of the constrained generalised additive models for the random co-data reflects the effect of adding constraints, e.g. the estimates vary mostly in the positive direction for the positively constrained model. The linear and (constrained) generalised additive models slightly outperform the adaptive discretisation. One advantage of the additive models using p-splines over the adaptive discretisation is that the p-splines can estimate local changes on a finer level; while the adaptive discretisation method is limited to discretisations in which each discretised group contains at least one variable (in our case, at least 20 variables per discretised group), this is not needed for p-splines, as they are penalised with a difference penalty. To illustrate, Figure A1a in Appendix A shows the generalised additive model estimates in one training data set when G = 20 or G = 50 splines are used and when the difference penalty is estimated with one of the methods provided in the R-function bam() or with the random splits as used in the first version of ecpc. Except for the generalised cross-validation criterion GCV.Cp, the estimates and corresponding prediction performance seem to be robust to the number of splines (see also Figures A2 and A1b in Appendix A). Finally, Table 1 shows the average run times of the methods. The adaptive discretisation is around 3-6 times as slow as the (constrained) additive co-data models.

Table 1: Results for simulation study based on 50 training and test sets. Average run time and standard deviation for various co-data models, smoothing parameter estimation methods (as used in bam() from mgcv or with splits in ecpc) and number of co-data variables.

Variable selection compared to other methods
We alter the simulation set-up from above for variable selection. We now set 250 regression coefficients to 0, leaving 50 non-zero coefficients. We scale the regression coefficients such that the L2-norm of the dense β^0 and the scaled, sparse β^{0,s} are the same. We use the following co-data:

1. Random: as in the simulation study above, so Z_k^{(1)} ~ N(0, 1) i.i.d. for k = 1, ..., p.
2. Informative+monotone: as in the simulation study above, but with white noise added such that the co-data are not exactly 0 for the zero coefficients: Z_k^{(2)} ~ N(|β_k^{0,s}|, σ_0^2) independently, for σ_0 a tenth of the sample standard deviation of β^{0,s}. The effect size |β_k^{0,s}|^2 is (up to some noise) a monotone, quadratic function of the co-data.
3. Informative+convex: similar to the Informative+monotone co-data, but distinguishing negative from positive effects by Z_k^{(3)}. The effect size |β_k^{0,s}|^2 is now (up to some noise) not monotone but a convex, quadratic function of the co-data.

We compare the following variable selection methods:

i) glmnet: a co-data agnostic elastic net model fitted with glmnet (Friedman et al., 2010).
ii) fwelnet: an elastic net model with continuous co-data, fitted with fwelnet (Tay et al., 2020). The elastic net penalties are a fixed, exponential function of the co-data weights.
iii) ecpc+squeezy: a GAM for the co-data fitted with ecpc, followed by a transformation of the ridge penalties to elastic net penalties by squeezy (van Nee et al., 2021a) for variable selection.

Figure 6: Simulation study based on 50 training and test sets and random co-data (left) or informative co-data (right). Boxplots of the MSE of the predictions on the test sets for various co-data models.

iv) ecpc+postselection: a GAM for the co-data, using the default option for posterior selection provided in the ecpc software.

The first three methods have one additional tuning parameter, the elastic net parameter α ∈ [0, 1], with 0 corresponding to the full model and 1 to the lasso model. The last method has one tuning parameter, the number of selected covariates (or, equivalently, the proportion of estimated zero effects), ranging from 300 to 0 (0 to 1) for the full to the sparsest model. In practice, one may choose a value for the tuning parameter by comparing predictive performance over a range of values and selecting the sparsest model that performs (nearly) optimally. Figure 7 shows the performance of the methods in variable selection and prediction error on the test data. Note that ecpc+postselection may be tuned to select sparse models up to a model that is almost empty, reaching a sensitivity and 1-precision of 0. In contrast, the models selected by the other methods still contain more variables, even in the sparsest models for tuning parameter α = 1. Besides, ecpc+squeezy and ecpc+postselection do not always select a full model, explaining why the sensitivities do not reach 1. This results from ecpc() truncating estimated negative prior variances to 0, deselecting some variables a priori. In the sparse setting, the co-data agnostic glmnet outperforms the other methods both in terms of variable selection and prediction performance when the co-data is random. This is in contrast to the dense simulation setting, in which the prediction performances of glmnet and the GAM were on par (Figure 6). For the informative+monotone co-data, fwelnet slightly outperforms ecpc+squeezy and ecpc+postselection, all outperforming glmnet.
For the informative+convex co-data, however, fwelnet is not able to flexibly adapt to the convex shape of the co-data, while the flexible GAM for the co-data in ecpc+squeezy and ecpc+postselection still adequately exploits the co-data.
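The sparse coefficient and co-data construction above can be sketched in base R. The coefficient values below are stand-ins, and the signed version of the convex co-data is our reading of "distinguishing negative from positive effects":

```r
# Sketch of the sparse simulation set-up: 250 zero and 50 non-zero
# coefficients, rescaled so the L2-norm matches the dense beta0.
set.seed(1)
p <- 300
beta0 <- rnorm(p)                               # stand-in dense coefficients
beta0s <- beta0
beta0s[sample(p, 250)] <- 0                     # sparsify: 250 zeros
beta0s <- beta0s * sqrt(sum(beta0^2) / sum(beta0s^2))  # match L2-norms

sigma0 <- sd(beta0s) / 10                       # a tenth of the sample sd
Z1 <- rnorm(p)                                  # 1. random co-data
Z2 <- rnorm(p, mean = abs(beta0s), sd = sigma0) # 2. informative + monotone
Z3 <- rnorm(p, mean = beta0s, sd = sigma0)      # 3. informative + convex
                                                #    (signed; our reading)
```

With Z2, the squared effect size is a monotone function of the co-data; with Z3, positive and negative effects map to opposite ends of the co-data scale, so the squared effect size is convex in the co-data.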

Analysis example
We demonstrate the different co-data models by applying the method to an application in classifying lymph node metastasis (coded 1) versus other types of cancer (coded 0). We use the data readily available from the R-package CoRF, providing high-dimensional RNA expression training data for p = 12838 probes and n = 133 patients, and validation data for n2 = 97 patients. First we install and load the package. Then we load the data and transform it for the analysis.

Figure 7: Simulation study for variable selection based on 50 training and test sets for various types of co-data. a) Average sensitivity and precision for several methods and various tuning parameters; b) mean squared error prediction performance on the test data. The lines indicate the pointwise average and the inner and outer shaded bands indicate the 25-75% and 5-95% quantiles respectively.

R> Ypred <- predict(Res, X2 = ValidationData)
The posterior selected variables are given in Res$betaPost, as maxsel was provided to ecpc(). Alternatively, the same posterior selection method may be performed with the function postSelect() on the fitted 'ecpc'-object Res:

R> sparseModels <- postSelect(Res, X = TrainData, Y = RespTrain, maxsel = maxSel)

A second approach for variable selection is to use squeezy() to transform the ridge penalties to elastic net penalties, with elastic net tuning parameter α. As mentioned above, in practice one may try a range of tuning parameters to choose the sparsest model with close to optimal performance. For this example, we include the lasso model (α = 1):

R> sparseModel2 <- squeezy(Y = RespTrain, X = TrainData, alpha = 1,
+                          lambdas = Res$penalties,
+                          X2 = ValidationData, Y2 = RespValidationNum)

Instead of fitting monotone and positively constrained functions for the correlation and p-values co-data, one could consider other co-data models. Figure 9 shows the results for three different settings: 1) a GAM, i.e. without constraints; 2) a SCAM with positivity constraints; 3) a SCAM with positivity and monotonicity constraints, as in the exemplified code above. The results include the dense models obtained with ecpc() and a co-data agnostic ordinary ridge model, and sparse models for a range of posterior selected variables obtained with postSelect(), the lasso model obtained with transformed penalties from squeezy() and a co-data agnostic lasso model. The estimated prior variance contribution for the correlation co-data shows large deviations near the boundaries, which increase when 50 instead of 20 splines are used (Setting 1). While p-splines have no boundary effects in regular regression models (Eilers and Marx, 1996), these effects may result from the signal regression nature of the model used in Equation (4). To dampen the boundary effects, it may be beneficial to transform the co-data values such that they spread out more evenly, e.g. using the empirical cumulative distribution function. Fitting a co-data model with positivity (and monotonicity) constraints (Settings 2 and 3) results in smoother functions than fitting without constraints. While adding the constraints stabilises posterior selection for highly sparse models, it generally does not benefit the prediction performance compared to the GAM co-data model. The GAM co-data model results in the best prediction performance among sparse models, though, in practice, the simpler lasso model may be preferred as it shows competitive performance.
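The ECDF transform suggested above is a one-liner in base R: it maps arbitrarily skewed co-data values to near-uniform values in (0, 1] while preserving their order, spreading observations evenly over the spline basis. The skewed example data below are ours:

```r
# Empirical-CDF transform of skewed co-data values, as suggested above
# to dampen boundary effects: transformed values are near-uniform ranks.
set.seed(1)
z_raw <- rexp(1000, rate = 2)    # stand-in for skewed raw co-data
z_new <- ecdf(z_raw)(z_raw)      # values in (0, 1], order preserved
```

Since the transform is monotone, a monotonicity constraint on the co-data model has the same meaning on the transformed scale as on the original one.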
The overall best prediction performance on the test data is achieved by the full, dense model when the (unconstrained) generalised additive co-data model is used with 50 splines.
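The rule used throughout for picking a tuning parameter (the sparsest model whose performance is close to the best) can be made concrete with a minimal helper; the helper name, the tolerance and the toy error values are ours:

```r
# Pick the sparsest model whose test error is within `tol` of the best.
sparsest_near_optimal <- function(n_selected, err, tol = 0.1) {
  ok <- err <= (1 + tol) * min(err)       # admissible: near-optimal error
  ord <- order(n_selected)                # candidate order: sparsest first
  ord[ok[ord]][1]                         # index of sparsest admissible model
}

n_sel <- c(10, 25, 50, 100, 300)            # toy model sizes
mse   <- c(1.40, 1.05, 1.02, 1.00, 1.01)    # toy test errors
pick  <- sparsest_near_optimal(n_sel, mse)  # selects the 25-variable model
```

With tol = 0 this reduces to picking the sparsest of the error-minimising models; a larger tolerance trades a small loss in accuracy for a sparser, more interpretable model.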

Conclusion
We presented an extension to the R-package ecpc that accommodates linear co-data models, generalised additive co-data models and shape constrained additive co-data models for the purpose of high-dimensional prediction and variable selection. These co-data models are particularly useful for continuous co-data, for which only an adaptive discretisation was available in the first version. In a simulation study, the newly proposed co-data models are shown to run faster and lead to slightly better prediction performance than the first version. Moreover, the estimated variable-specific ridge penalties may be transformed to elastic net penalties with the R-package squeezy to allow for variable selection. We showed in a simulation study that this approach and the previously proposed posterior selection approach lead to similar performance, outperforming other methods when the effect sizes are (non-exponentially) related to the co-data. We have provided several short examples and one analysis example in a cancer genomics application to demonstrate the code. Stand-alone R-scripts and other code files used for the simulations and examples may be found on https://github.com/Mirrelijn/ecpc.

Figure 9: Data analysis example: a) Estimated prior variance contributions of each co-data source, before multiplying with the co-data specific weight. Note that the p-values are shown on the log-scale in Settings 2 and 3, to clearly show the non-zero peaks at the smallest p-values; b) corresponding prediction performance on the validation set for 20 or 50 spline basis functions. The settings correspond to different co-data models: 1) no constraints; 2) positivity-constrained shape; 3) positivity- and monotonicity-constrained shape.
A Additional figures to simulation study

Figures A1 and A2 show the results for generalised additive co-data models when different smoothing parameter methods are used.

Figure A1: Simulation study based on 50 training and test sets and random co-data (left) or informative co-data (right). a) Example of estimated prior variance for various smoothing parameter estimation methods in one training data set; b) boxplots of the MSE of the predictions on the test sets for the ordinary ridge model (G = 1 co-data intercept variable) and for a generalised additive co-data model using various smoothing parameter estimation methods and G = 20 or 50 splines.

Figure A2: Simulation study based on 50 training and test sets and random co-data (left) or informative co-data (right). Estimated prior variance for the generalised additive co-data model using various smoothing parameter estimation methods. The lines indicate the pointwise median and the inner and outer shaded bands indicate the 25-75% and 5-95% quantiles respectively. Points indicate the true (β_k^0)^2.