 Software
 Open Access
 Published:
ecpc: an Rpackage for generic codata models for highdimensional prediction
BMC Bioinformatics volume 24, Article number: 172 (2023)
Abstract
Background
Highdimensional prediction considers data with more variables than samples. Generic research goals are to find the best predictor or to select variables. Results may be improved by exploiting prior information in the form of codata, providing complementary data not on the samples, but on the variables. We consider adaptive ridge penalised generalised linear and Cox models, in which the variablespecific ridge penalties are adapted to the codata to give a priori more weight to more important variables. The Rpackage ecpc originally accommodated various and possibly multiple codata sources, including categorical codata, i.e. groups of variables, and continuous codata. Continuous codata, however, were handled by adaptive discretisation, potentially inefficiently modelling and losing information. As continuous codata such as external p values or correlations often arise in practice, more generic codata models are needed.
Results
Here, we present an extension to the method and software for generic codata models, particularly for continuous codata. At the basis lies a classical linear regression model, regressing prior variance weights on the codata. Codata variables are then estimated with empirical Bayes moment estimation. After placing the estimation procedure in the classical regression framework, extension to generalised additive and shape constrained codata models is straightforward. Besides, we show how ridge penalties may be transformed to elastic net penalties. In simulation studies we first compare various codata models for continuous codata from the extension to the original method. Secondly, we compare variable selection performance to other variable selection methods. The extension is faster than the original method and shows improved prediction and variable selection performance for nonlinear codata relations. Moreover, we demonstrate use of the package in several genomics examples throughout the paper.
Conclusions
The Rpackage ecpc accommodates linear, generalised additive and shape constrained additive codata models for the purpose of improved highdimensional prediction and variable selection. The extended version of the package as presented here (version number 3.1.1 and higher) is available on (https://cran.rproject.org/web/packages/ecpc/).
Background
Generalised linear models (GLMs) [1] are the cornerstone of many statistical models for prediction and variable selection purposes, modelling the relation between outcome data and observed data. When observed data are highdimensional, with the number of variables far exceeding the number of samples, these models may be penalised to account for the highdimensionality. Well known examples include the ridge [2], lasso [3] and elastic net penalty [4]. One of the main assumptions underlying generalised linear models is that all variables are exchangeable. In many highdimensional settings, however, this assumption is questionable [5]. For example, in cancer genomics, variables may be grouped according to some biological function. Variables within these groups may have a similar effect, while variables from different groups have a different effect. Hence, variables are exchangeable within groups, but not between groups. To alleviate the exchangeability assumption, shared information may be modelled explicitly in the prior distribution of the variables, e.g. by introducing shared group penalties, penalising variables in a group similarly and penalising more important groups of variables relatively less (as done by [6]). The shared prior information may be represented in data matrices, called codata, to distinguish the main, observed data with information on the samples from the complementary data with information on the variables. In genomics, for example, the codata matrix columns may contain p values representing the strength of association between each variable and outcome from external studies, correlations between mRNA and DNA, dummy variables for chromosomes and pathway information. When the codata are related to the effect sizes of variables, these data may be exploited to improve prediction and variable selection in highdimensional data settings.
Various Rpackages accommodate approaches to incorporate some form of codata. Early methods such as grplasso [7] and gglasso [8] allow for categorical, or grouped, codata, by using group lasso penalties. As these penalties are governed by one overall penalty parameter, these types of penalties may be not flexible enough to model the relation between the effect sizes and grouped codata. To increase this flexibility, other methods were developed that estimate multiple, groupspecific penalty (or prior) parameters, using efficient empirical Bayes approaches. Examples include GRridge [6] for groupadaptive ridge penalties (normal priors), graper [9] for groupadaptive spikeandslab priors and gren [10] for groupadaptive elastic net priors. Our method ecpc [11] presents a flexible empirical Bayes approach to extend the use of grouped codata to various other (and potentially multiple) codata types, such as hierarchical groups (e.g. gene ontologies) and continuous codata, for multigroup adaptive ridge penalties. For continuous codata, however, the normal prior variances corresponding to the ridge penalties are not modelled as a function of the continuous codata variable, but rather as a function of groups of variables corresponding to the adaptively discretised codata variable. When the relation between the prior variance and continuous codata is nonconstant and/or “simple”, e.g. linear, the adaptive discretisation may lead to a loss of information and/or inefficiently model the relation. The package fwelnet [12] develops featureweighted elastic net for continuous codata specifically (there called “features of features”). Regression coefficients are estimated jointly with codata variable weights, modelling the variablespecific elastic net penalties by a normalised, exponential function of the codata. For categorical codata, fwelnet boils down to an elastic net penalty on the group level [12], governed by one overall penalty parameter. Hence, it may lack flexibility when compared to empirical Bayes methods estimating multiple penalties. The package squeezy [13] presents fast approximate marginal likelihood estimates for groupadaptive elastic net penalties, but is available for grouped codata only.
Here, we present an extension of the Rpackage ecpc to generic codata models, in particular for continuous codata such as external p values. First, we show how a classical linear regression model may be used to regress the (unknown) variablespecific normal prior variances on the codata. This provides a flexible parsimonious framework to obtain featurespecific penalties. The codata variable weights are estimated with an empirical Bayes moment estimator, slightly modified from [11]. Then, we present how the estimation procedure may be extended straightforwardly to model the relation between the prior variances and codata by generalised additive models [14] for modelling nonlinear functions and by shape constrained additive models [15], e.g. for positive and monotonically increasing functions. This extension benefits the stability and interpretation of the estimated relation between codata and the prior variances, especially when a basic linear model does not represent this relation well. Besides, we use ideas from [13] to transform the adaptive ridge penalties to elastic net penalties using the package squeezy. Either this approach or the previously implemented posterior selection approaches [11] may be used for variable selection.
Contributions
The empirical Bayes estimation method [11] is extended to the continuous case. The main contributions of this software paper, newly extending the existing Rpackage ecpc, are as follows:

Codata are provided to the main function ecpc() in the more generic format of a codata matrix (input argument Z), instead of a list of group sets (input argument groupsets). Besides dummy variables for group membership information, a codata matrix may contain continuous codata.

The empirical Bayes estimates may be additionally penalised with a generalised ridge penalty (input argument paraPen, similar to the Rpackage mgcv) and/or subjected to constraints (input argument paraCon). This may be used to model the prior variances as nonlinear and/or shapeconstrained function of the codata.

The adaptive ridge penalty estimates given by ecpc() may be transformed with squeezy() to elastic net penalties to obtain sparse regression coefficient estimates.
Implementation
The main function in the Rpackage is the eponymous function ecpc(), which fits a ridge penalised generalised linear model by estimating the codata variable weights and regression coefficients subsequently. The function outputs an object of the S3class ‘ecpc’, for which the methods summary(), print(), plot(), predict() and coef() have been implemented. See the index in ?“ecpcpackage” for a list of all functions, including functions for preparing and visualising codata, or see Fig. 1 for a cheat sheet of the main functions and workflow of the package.
Data input
The function ecpc() considers the following data. The response data \(\varvec{Y}\in {\mathbb {R}}^n\) are given in input argument Y. The observed highdimensional data \(X\in {\mathbb {R}}^{n\times p}\) with \(p\gg n\), which contain information on the n samples of \(\varvec{Y}\), are given in X. The codata of possibly multiple codata matrices \(Z^{(d)}\in {\mathbb {R}}^{p\times G_d}\), \(d=1,\ldots ,D\), which contain prior information on the p variables of X, are given in Z. Generally, codata matrices may include continuous or categorical codata. For categorical codata, dummy variables should be provided. For categorical codata with overlapping categories, dummy variables may be weighted accordingly to account for multiplicity (see [11]). The helper function createZforGroupset() may be used to create a codata matrix from a list of (overlapping) groups. Codata should not contain missing values. When the missingness is deemed uninformative, existing methods may be used to impute the missing values, e.g. by the codata variable mean. When the missingness is suspected to be informative, missing values should be set to 0 and an extra categorical codata variable (1 for missing 0 else) should be included.
Response model and codata model
Currently, ecpc() allows for a linear, logistic and Cox survival model (input argument model). Generally, the response is modelled with a generalised linear (or Cox) model with canonical link function \(g(\cdot )\), parameterised with regression coefficients \(\varvec{\beta }\in {\mathbb {R}}^p\). Furthermore, the regression coefficients follow a normal prior, with variance \(v_k, k=1,\ldots ,p,\) inversely proportional to the variablespecific ridge penalty, in which the prior variance is regressed on the codata. First, consider the linear codata model in which the prior variance is modelled as a linear function of the codata:
with \(\varvec{X}_i\) and \(\varvec{Z}_k\) the \(i^{th}\) and \(k^{th}\) row of X and Z respectively, \(\varvec{\gamma }^{(d)}\in {\mathbb {R}}^G\) the codata variable weights for codata matrix d, \(\varvec{w}\) the codata matrix weights and \(\tau _{global}^2\) a scaling factor which may improve numerical computations in practice. The linear codata model and interpretation of the prior parameters are illustrated in Fig. 2. When the data X consist of multiple data modalities, like gene expression data, copy number data and methylation data in genomics, scaling factors specific to the data modalities may be used [16, 17] and estimated with ecpc.
Codata models
The extension of ecpc implements three types of codata models, based on a linear model, generalised additive model [14] and shapeconstrained additive model [15]. The flexibility of the additive models is important in case the relation between the codata variables and effect sizes is nonlinear. A linear codata model then inadequately exploits the codata, while additive models are able to adapt to the underlying relation, as illustrated in Fig. 3. For illustration, consider one codata source with G codata variables and set the scaling parameter \(\tau _{global}^2\) to 1:
with \(Z_g\) extended to continuous codata, \(s_g\) a smooth function and \(c_g\) some shapeconstrained function, e.g. monotone or convex, both applied elementwise. Generally, the larger the variablespecific prior variance, the smaller the corresponding ridge penalty and the larger the a priori expected variable effect size.
In practice, the smooth and shapeconstrained functions are estimated by using a basis expansion to recast the problem into a (constrained) linear model (as originally proposed by, for example, [18]). So, for a basis expansion consisting of \(J_g\) basis functions \(\phi _{g,j}(\cdot )\), \(j=1,\ldots ,J_g\), for codata variable \(\varvec{Z}_g\):
with \(\Phi _g\in {\mathbb {R}}^{p\times J_g}\) the matrix of codata variable vector \(\varvec{Z}_g\in {\mathbb {R}}^p\) evaluated in all \(J_g\) basis functions. Any basis expansion may be used by supplying the corresponding basis expansion matrix \(\Phi _g\) as codata in input argument Z in ecpc().
Choice of basis expansion
The type and number of basis functions should in general be chosen such that they are flexible enough to approximate the underlying function well. To avoid overfitting for too many basis functions, the coefficients may be estimated by optimising the likelihood penalised by a smoothing penalty. While our software allows the user to supply any basis expansion, we focus here on the popular psplines (see [19] for an introduction). This approach combines flexible spline basis functions with a quadratic smoothing penalty on the differences of the spline coefficients (difference penalty matrix \(S_g\) in Equation (5)). The level of smoothness is then automatically tuned by estimation of the smoothing penalty (\(\lambda _g\) in Eq. (5)). For shapeconstrained functions, we consider a pspline basis expansion and constrain the spline coefficients [15].
The helper function createZforsplines() may be used to create the psplines expansion matrix \(\Phi _g\) corresponding to codata variable \(\varvec{Z}_g\) for input argument Z. The function createS() may be used to create the corresponding difference penalty matrix \(S_g\) for input argument paraPen. The function createCon() may be used to create constraints for functions that are positive, monotonically increasing or decreasing, convex or concave, or any combination thereof.
Model parameters estimation
Prior parameters and regression coefficients are estimated with an empirical Bayes approach, following [11]. In short, first the global scaling parameter \(\tau ^2_{global}\) is estimated, then the codata variable weights \(\varvec{\gamma }^{(d)}\) for each codata matrix d separately and then the codata weights \(\varvec{w}\). To ensure stability and identifiability of the estimates when codata variables or sources are (increasingly) correlated, the codata variable weights \(\varvec{\gamma }^{(d)}\) may be penalised (e.g. see Eq. (5)) and the codata source weights are estimated subject to the constraint \(\varvec{w}\ge 0\). After, given the prior parameter estimates, the regression coefficients \(\varvec{\beta }\) are estimated by maximising the penalised likelihood (equivalent to maximising the posterior).
Here, the empirical Bayes estimation for the codata variable weights \(\varvec{\gamma }^{(d)}\) [11] is extended for continuous codata. The codata variable weights are estimated with moment estimation by equating theoretical moments to empirical moments. For codata that represent groups of variables [11], the empirical moments are averaged over all variables in that group, leading to a linear system of G equations and G unknowns. For continuous codata, we simply form one group per variable, leading to the following linear system of p equations and G unknowns:
with \(\circ\) representing the Hadamard (elementwise) product. \(C\in {\mathbb {R}}^{p\times p}\) and \(\varvec{b}\in {\mathbb {R}}^{p}\) are derived in [11] and given by:
with \(\tilde{\varvec{\beta }}\) the maximum penalised likelihood estimate given an initial \({\tilde{\tau }}^2_{global}\) and corresponding constant diagonal ridge penalty matrix \({\tilde{\Omega }}\), with W a diagonal weight matrix used in the iterative weighted least squares algorithm to fit \(\tilde{\varvec{\beta }}\), and with \(\tilde{\varvec{v}}\) an estimate for the variance of \(\tilde{\varvec{\beta }}\) with respect to the response data Y.
Note that storing the matrix C is memorycostly as it is a \(p\times p\)dimensional matrix with p potentially tens of thousands of variables. C can, however, be written as matrix product of two smaller matrices \(C=LR\) with \(L\in {\mathbb {R}}^{p\times n}\) and \(R\in {\mathbb {R}}^{n\times p}\). Instead of computing and storing C in one go, we compute it per block of b rows to alleviate memory costs: for each block of rows \(C_{block}\) we only need to store the elements of \((C_{block}\circ C_{block})Z\in {\mathbb {R}}^{b\times G}\).
The main estimating equation boils down to solving a linear system, which is solved as is for linear codata models, penalised with a generalised ridge penalty for generalised additive models or solved under constraints plus possibly penalised with a generalised ridge penalty for shapeconstrained additive codata models. The penalisation on the level of prior parameters ensures stable estimation of the codata weights.
Linear codata model
As the prior variance has to be positive, the resulting prior variance estimate is truncated at 0 after solving the linear system from (2):
In generalised linear models it is common to use a loglink for the response to enforce positivity, resulting in positive, multiplicative effects. Note that here, however, Equation (2) is the result of equating theoretical to empirical moments. Replacing \(\varvec{b}\) by \(\log (\varvec{b})\) would violate the moment equalities. Also, if we would enforce positivity instead by, for example, substituting \(Z\varvec{\gamma }\) directly by \(\varvec{v}=\exp (Z\varvec{\gamma }')\), the moment equations would not be linear anymore in \(\varvec{\gamma }\), nor multiplicative, e.g. as \((C\circ C)\exp (Z\varvec{\gamma })\ne \exp ((C\circ C)Z\varvec{\gamma })=\prod _{g=1}^G\exp ((C\circ C)\varvec{Z}_g\gamma _g)\), with \(\varvec{Z}_g\) the \(g^{th}\) codata variable. Hence, the advantage of simply posthoc truncating \(Z\hat{\varvec{\gamma }}\) is that the system of equations in see Eq. (3) is easily solved. Alternatively, shape constrained codata models may be used to enforce positivity, as explained further on.
Generalised additive codata model
For estimating the generalised additive codata model coefficients in a nonlinear codata model, the least squares estimate in Eq. (3) is extended by penalising the coefficients with a difference penalty matrix \(S_g\) with smoothing penalty parameter \(\lambda _g\).
with \(Z_{GAM}=[\Phi _1,\ldots , \Phi _G]\) the matrix of spline basis expansions for all G codata variables and \(\varvec{\gamma }_{GAM}=(\varvec{\gamma }_{1}^T,\ldots ,\varvec{\gamma }_{G}^T)^T\) the vector of all spline coefficients. This leastsquares equation is of a form also known as penalised signal regression [20] and can be solved by the function gam() (or bam() for big data) of the Rpackage mgcv, for example. This function also provides fast and stable estimation of the penalties \(\lambda _g\) [21] for multiple codata sources and possibly multiple penalty matrices per codata source jointly. Alternatively, when only one smoothing penalty matrix is provided per codata source, the smoothing penalty and spline coefficients may be estimated per codata source separately by using random splits as proposed in [11]. Our software uses bam() to solve \(\hat{\varvec{\gamma }}\) by default and allows for random splits when only one smoothing penalty matrix is provided.
Shapeconstrained additive codata model
Prior assumptions on the shape of the relation between the prior variance and codata, such as monotonicity or convexity, may be imposed by constrained optimisation of spline coefficients [15]. The codata weight estimate is given by subjecting the possible solution of Eq. (4) to (in)equality constraints given in matrix \(M_{(in)eq,g}\) and vector \(\varvec{b}_{(in)eq,g}\):
Several shapes may be imposed by choosing \(M_{ineq}\) and \(b_{ineq}\) accordingly [15]: (i) positivity may be imposed by constraining the spline coefficients to be positive; (ii) monotonically increasing (decreasing) may be imposed by constraining the first order differences \(\gamma _{i+1}\gamma _i\) to be positive (negative); (iii) convexity (concavity) may be imposed by constraining second order differences \(\gamma _{i+2}  2\gamma _{i+1} + \gamma _i\) to be positive (negative); (iv) any combination of the shapes iiii may be imposed by combining the corresponding constraints.
In [15] shapeconstrained psplines are developed to handle difficulties in optimising multiple smoothing penalties due to discontinuous gradients. Their Rpackage scam, however, cannot be readily used for signal regression, which differs from regular regression in that the spline basis matrix is multiplied by the known matrix \((C\circ C)\). Moreover, the smoothing parameter estimates are estimated using a generalised crossvalidation (GCV) criterion, which we show below to overfit in the unconstrained case. Therefore, we rely on the simple approach of directly constraining the spline coefficients as in Equation (5).
We use the approach proposed in [11] to estimate the smoothing penalties: first we estimate the smoothing penalties \(\lambda _g\) separately for each codata variable \(\varvec{Z}_g\) using random splits of the data. As this optimisation is in one dimension only, we use Brent’s algorithm from the general purpose optimisation Rpackage optim, which should be sufficient to handle discontinuous gradients. Then we estimate the spline coefficients \(\varvec{\gamma }_g\) for each codata variable \(\varvec{Z}_g\) and corresponding spline basis function matrix \(\Phi _g\).
When at least one of the codata models is shapeconstrained, the software uses the random splits in combination with lsqlincon() from the Rpackage pracma for constrained optimisation.
Variable selection
The normal prior in Eq. (1) corresponding to adaptive ridge penalties leads to dense, i.e. nonzero, estimates for the regression coefficients \(\varvec{\beta }\). To obtain sparser solutions, the adaptive ridge penalties may be transformed to elastic net penalties by modifying results from [13], as detailed below. The ridge penalties resulting from the fit with ecpc() are transformed with squeezy() from the Rpackage squeezy, which also estimates the elastic net penalised regression coefficients using the Rpackage glmnet. Alternatively, ecpc may use posterior selection to select variables, as formerly proposed [11]. The two approaches differ in how the level of sparsity is tuned: the user may tune the number of variables for posterior selection or tune the elastic net sparsity parameter \(\alpha \in [0,1]\) when squeezy is used.
Transforming ridge penalties to elastic net penalties
In the proposed model in Eq. (1), the regression coefficients follow a normal prior corresponding to a ridge penalty. Now, suppose that each \(\beta _k\) independently follows some other prior distribution \(\pi (\beta _k)\), parameterised by variablespecific prior parameter \(\lambda _k\) and with prior mean 0 and finite variance \(\textrm{Var}(\beta _k)=Z\varvec{\gamma }=h(\lambda _k)\) for some known monotonic variance function \(h(\cdot )\):
As example we consider the elastic net prior, corresponding to the elastic net penalty, with variablespecific elastic net penalty. Recently, it was shown that when the prior parameters are groupspecific, the marginal likelihood as function of \(\lambda _k\) is approximately the same as the marginal likelihood as function of normal prior parameters \(\varvec{\gamma }\), as the prior distribution of the linear predictor \(\varvec{\eta }=X\varvec{\beta }\) is asymptotically normally distributed [13]:
This result also holds for priors with variablespecific, finite variance [22]. We may use this result to obtain approximate method of moment equations for other priors.
Denote by \(\hat{\varvec{\beta }}_R(Y)\) the ridge penalised maximum likelihood estimate as function of the observed response data \(\varvec{Y}\). The method of moments equations are given by equating the theoretical marginal moments to the empirical moments [11]:
Using the normal approximation for the marginal likelihood we obtain:
So we may obtain the ridge estimates \(\hat{\varvec{\gamma }}\) as above to estimate the variablespecific prior variances \({\hat{v}}_k=(\varvec{Z}_k\hat{\varvec{\gamma }})_+\), and transform these with the variance function to obtain the variablespecific prior parameters:
This transformation can also be used to transform the prior variance estimates for the generalised additive codata model in Equation (4) and for the shapeconstrained codata model in Eq. (5). Note, however, that the penalisation and constraints are applied to \(\varvec{\gamma }\) and not to \(\varvec{\lambda }\).
Results
We include the full analyses with results in three vignettes corresponding to the three sections below; short examples (Additional file 1), simulation study (Additional file 2) and analysis example (Additional file 3). Here we summarise the main findings.
Short examples
Use of ecpc for linear, generalised additive and shapeconstrained additive codata models is demonstrated in short examples. Besides, classspecific methods from ‘ecpc’ and transformation from ridge to elastic net penalties are illustrated.
Simulation study
Estimation and prediction performance of various codata models
The extension to ecpc proposes new codata models for modelling continuous codata in addition to the adaptive discretisation model proposed in the first version. We compare the newly proposed and former codata models and a codata agnostic ridge model in a simulation study. Figure 4 illustrates the prior variance estimates for the various codata models. Results show that all codata models lead to improved prediction performance compared to the codata agnostic ridge model when codata are informative and similar performance when codata are random. The improvement for the newly proposed codata models is slightly better than for the former, adaptive discretisation codata model, as it better estimates the relation between the prior variance and codata. Moreover, the newly proposed codata models are around 3–6 times as fast as the former adaptive discretisation.
Besides, we compare robustness of the estimates for generalised additive codata models for an increased number of splines and various methods for estimating the smoothing penalty, i.e. using random splits or any of the available methods in bam() in mgcv (“ML”, “fREML” and “GCV.Cp”). Using random splits leads to similar estimates as the methods “ML” and “fREML”, both for 20 and an increased number of 50 splines, while “GCV.Cp” leads to unstable estimates.
Variable selection compared to other methods
We compare variable selection of ecpc using posterior selection (ecpc+postselection) and elastic net penalties transformed with squeezy (ecpc+squeezy) with a codata agnostic elastic net model (glmnet [23]) and featureweighted elastic net (fwelnet [12]) in a simulation study. Results are shown in Fig. 5. Both variable selection methods implemented for ecpc show similar performance, besides differences resulting from the different type of tuning the level of sparsity. Results show that in the sparse setting, the codata agnostic model glmnet outperforms the other codata learnt methods when codata are random, in contrast to the dense setting. When codata are informative and the relation between the prior variances and codata is monotone, the codata learnt methods outperform glmnet, with fwelnet slightly outperforming ecpc. When codata are informative and the relation between the prior variances and codata is convex, ecpc outperforms fwelnet as the generalised additive codata model is able to flexibly adapt to the nonexponential relation, whereas fwelnet is not.
Computation time and memory costs
Figure 6 shows the computation time and peak memory used for various numbers of samples and variables and for the following models: ecpc with a linear codata model, generalised additive codata model (20 splines) or shape constrained additive codata model (20 splines plus positivity constraint) and glmnet for a codata agnostic ridge penalty or lasso penalty. As storing the memorycostly matrix \(C\in {\mathbb {R}}^{p\times p}\) in Eq. (5) is avoided and only blocks of rows are stored, peak memory grows subquadratically with p.
Analysis example
In [11, 13] we demonstrated the use of codata to improve standard methods like ridge and lasso for several data sets. Here, we focus on a single data set with \(n=133\) samples and \(p=12838\) variables, that includes several types of codata. We demonstrate the software on an application to the classification of lymph node metastasis from other types of cancer using highdimensional RNA expression data. Three sources of codata are available: categorical codata for known signature genes, continuous codata for ciscorrelation between RNA and copy number and continuous codata for p values from an external, similar study. More information on the data and details of the results are given in the vignette. We show results for several settings of codata models and compare performances of dense and sparse models. Figure 7 shows the results for three settings: 1) a GAM, i.e. without constraints; 2) a SCAM with positivity constraints; 3) a SCAM with positivity and monotonicity constraints. Among the sparse models, using a generalised additive codata model with 50 splines for the continuous codata variables and posterior selection leads to the best performance on independent test data, though the simpler lasso model may be preferred as it shows competitive performance. Note, however, that lasso may render a rather unstable set of selected variables [24], and that the use of codata improves this stability [11]. Overall, the dense model using a generalised additive codata model with 50 splines shows the best prediction performance.
Conclusions
We presented an extension to the Rpackage ecpc that accommodates linear codata models, generalised additive codata models and shape constrained additive codata models for the purpose of highdimensional prediction and variable selection. These codata models are particularly useful for continuous codata. The newly proposed codata models are shown to run faster and lead to slightly better prediction performance when compared to adaptive discretisation. Moreover, the estimated variablespecific ridge penalties may be transformed to elastic net penalties with the Rpackage squeezy to allow for variable selection. We showed in a simulation study that this approach and the previously proposed posterior selection approach lead to similar performance, outperforming other methods when the effect sizes are (nonexponentially) related to the codata. We have provided a vignette with several short examples to demonstrate general usage of the code (Additional file 1), a vignette to reproduce the simulation study (Additional file 2) and a vignette with an analysis example to a cancer genomics application (Additional file 3).
Availability of data and materials
All three vignettes reproducing the simulations and examples, including data may be found on https://github.com/Mirrelijn/ecpc/vignettes.
Project name: ecpc
Project home page: https://github.com/Mirrelijn/ecpc
Operating system(s): platform independent
Programming language: R
Other requirements: R \(\ge 3.5.0\)
License: GPL (\(\ge 3\))
Any restrictions to use by nonacademics: none
Abbreviations
 GLM:

Generalised linear models
 GAM:

Generalised additive model
 SCAM:

Shapeconstrained additive model
References
McCullagh P, Nelder J. Generalized linear models II. London: Chapman and Hall; 1989.
Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55–67.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol). 1996;58:267–88.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Methodol). 2005;67(2):301–20.
Ignatiadis N, Lolas P. \(\sigma\)ridge: group regularized ridge regression via empirical bayes noise level crossvalidation. 2020. arXiv preprint arXiv:2010.15817.
van de Wiel MA, Lien TG, Verlaat W, van Wieringen WN, Wilting SM. Better prediction by use of codata: adaptive groupregularized ridge regression. Stat Med. 2016;35:368–81.
Meier L, van de Geer S, Bühlmann P. The group Lasso for logistic regression. J R Stat Soc Ser B (Methodol). 2008;70(1):53–71.
Yang Y, Zou H. A fast unified algorithm for solving grouplasso penalize learning problems. Stat Comput. 2015;25(6):1129–41.
Velten B, Huber W. Adaptive penalization in highdimensional regression and classification with external covariates using variational bayes. Biostatistics. 2019. https://doi.org/10.1093/biostatistics/kxz034.kxz034.
Münch MM, Peeters CF, van der Vaart AW, van de Wiel MA. Adaptive groupregularized logistic elastic net regression. Biostatistics. 2019. https://doi.org/10.1093/biostatistics/kxz062.kxz062.
van Nee MM, Wessels LFA, van de Wiel MA. Flexible codata learning for highdimensional prediction. Stat Med. 2021;40(26):5910–25.
Tay JK, Aghaeepour N, Hastie T, Tibshirani R. Featureweighted elastic net: using “features of features” for better prediction. 2020. arXiv preprint arXiv:2006.01395.
van Nee MM, van de Brug T, van de Wiel MA. Fast marginal likelihood estimation of penalties for groupadaptive elastic net. J Computat Graph Stat 2022;1–27 (justaccepted)
Hastie T, Tibshirani R. Generalized additive models. Stat Sci. 1986;1(3):297–318.
Pya N, Wood SN. Shape constrained additive models. Stat Comput. 2015;25(3):543–59.
Boulesteix AL, De Bin R, Jiang X, Fuchs M. Ipflasso: Integrativepenalized regression with penalty factors for prediction based on multiomics data. Comput Math Methods Med. 2017;2017.
van de Wiel MA, van Nee MM, Rauschenberger A. Fast crossvalidation for multipenalty highdimensional ridge regression. J Comput Graph Stat. 2021;30(4):835–47.
Wahba G. Spline bases, regularization, and generalized cross validation for solving approximation problems with large quantities of noisy data. Madison: University of Wisconsin; 1980.
Eilers PH, Marx BD. Practical smoothing: the joys of Psplines. Cambridge: Cambridge University Press; 2021.
Marx BD, Eilers PH. Generalized linear regression on sampled signals and curves: a pspline approach. Technometrics. 1999;41(1):1–13.
Wood SN. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J R Stat Soc Ser B (Methodol). 2011;73(1):3–36.
Eicker F. A multivariate central limit theorem for random linear vector forms. Ann Math Stat. 1966;37:1825–8.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1.
Meinshausen N, Bühlmann P. Stability selection. J R Stat Soc Ser B (Stat Methodol). 2010;72(4):417–73.
Acknowledgements
The authors would like to thank Soufiane Mourragui (Netherlands Cancer Insitute) for the many worthwhile discussions.
Funding
The first author is supported by ZonMw TOP grant COMPUTE CANCER (40 008129816012).
Author information
Authors and Affiliations
Contributions
MN developed the software, performed the analyses and wrote the manuscript. LW and MW provided feedback on the method, analyses and manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
van Nee, M.M., Wessels, L.F.A. & van de Wiel, M.A. ecpc: an Rpackage for generic codata models for highdimensional prediction. BMC Bioinformatics 24, 172 (2023). https://doi.org/10.1186/s1285902305289x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902305289x
Keywords
 Highdimensional data
 Penalised generalised linear models
 Empirical Bayes
 Prior information
 R