Extensions of ℓ1 regularization increase detection specificity for cell-type specific parameters in dynamic models

Background Ordinary differential equation systems are frequently utilized to model biological systems and to infer knowledge about underlying properties. For instance, the development of drugs requires the knowledge to which extent malign cells differ from healthy ones to provide a specific treatment with least side effects. As these cell-type specific properties may stem from any part of biochemical cell processes, systematic quantitative approaches are necessary to identify the relevant potential drug targets. An ℓ1 regularization for the maximum likelihood parameter estimation proved to be successful, but falsely predicted cell-type dependent behaviour had to be corrected manually by using a Profile Likelihood approach. Results The choice of extended ℓ1 penalty functions significantly decreased the number of falsely detected cell-type specific parameters. Thus, the total accuracy of the prediction could be increased. This was tested on a realistic dynamical benchmark model used for the DREAM6 challenge. Among Elastic Net, Adaptive Lasso and a non-convex ℓq penalty, the latter one showed the best predictions whilst also requiring least computation time. All extended methods include a hyper-parameter in the regularization function. For an Erythropoietin (EPO) induced signalling pathway, the extended methods ℓq and Adaptive Lasso revealed an unpublished alternative parsimonious model when varying the respective hyper-parameters. Conclusions Using ℓq or Adaptive Lasso with an a-priori choice for the hyper-parameter can lead to a more specific and accurate result than ℓ1. Scanning different hyper-parameters can yield additional pieces of information about the system.


Background
Describing processes in biological systems by mathematical models is a key feature to understand how living organisms work [1]. This task is frequently approached by mechanistic modelling via ordinary differential equations (ODEs). Yet, a severe obstacle to make predictions based on the models consists in the high-dimensional parameter spaces that quickly arise in realistic systems. The steadily growing availability of data and the development of experimental techniques need to be accompanied by statistical *Correspondence: pascal.dolejsch@merkur.uni-freiburg.de 1 Institute of Physics, University of Freiburg, Hermann-Herder-Str. 3,79104 Freiburg, Germany Full list of author information is available at the end of the article methods that can efficiently incorporate them into models even for hundreds of parameters to estimate.
If, for example, two cell types are examined with respect to the same process, additional parameters must be incorporated to describe the second cell type. One may however assume that the cell types of interest differ only in some aspects. This assumption would allow to assign some parameters to both cell-types whilst pinpointing the biological differences between the cell-types. In addition, reducing the dimension of the parameter space eases calculations. This idea of selecting only those features that relevantly contribute to the observations can be accomplished by various approaches. The most intuitive way might be to test all possible model configurations iteratively [2,3]. As for n parameters the number of models to test is given by 2 n , this becomes infeasible even for small models.
Whenever the general estimation procedure consists of minimizing an objective function such as the negative loglikelihood, which is equivalent to maximizing the likelihood, regularizing the objective function can be regarded as the consequent extension to incorporate equalities among different cell types. Regularization generally refers to including additional information, which here means to amend the objective function by a term which is larger than zero whenever parameters differ between the two cell types. Thus the optimization tends to shrink the model by preferring parameters that are equal among the two cell types. In the context of minimizing a sum of squares, it is intuitive to add a squared function to the objective function that is minimal if the two cell-types behave equally. This can be scaled by a factor λ. Then, both functions are minimized simultaneously. This idea has been known as Ridge Regression or Tikhonov Regularization for many decades now [4][5][6][7]. By this method, the minimum of the regularized objective function however only converges asymptotically to a point where some parameters are cell-type independent with increasing penalty strength. Hence, this method does not provide an effective model shrinkage.
With Tibshirani's introduction of the Lasso, i.e. least absolute shrinkage and selection operator, [8], selecting features and estimating optimal parameters was established for linear regressions. The method relies on the 1 norm of the parameter vector. As this quantity is continuous, it eases numerical calculations. It is however not differentiable if one parameter is zero. If the tuning parameter λ is sufficiently large, it enforces a sparse solution [9]. The original Lasso procedure has been generalized: Among others, Adaptive Lasso [10] and Elastic Net [11] have been proposed as they provide better convergence properties or an algorithmic simplification when compared to the original Lasso. In the framework of linear regression, also non-convex penalty functions have been proposed, such as the q pseudo-norm of the parameter vector [12].
An adoption, the so-called 1 regularization, has been used in Systems Biology for non-linear parameter estimation in cellular signalling models [13,14]. Therefor, the parameters p [0] i of one cell-type are chosen as reference. The parameters of the second cell-type p [1] i are then expressed as the product of fold-changes [1] i and the reference value p [0] i . The penalty term only acts on the logarithmic fold-changes. This method finds relevant differences between two cell types. These particularities of one cell-type could serve as targets for drugs that shall only affect malign cells [13]. However, Steiert et al. [14] point out that the amount of properties which are falsely detected as cell-type specific can be decreased by manual supervision of the regularization outcome. Hence, it is desirable to find a penalization that requires no manual checking to find potential drug targets in a more robust way.
One major limitation to 1 regularization is the presence of linearly correlated log-transformed parameters. Consider a reaction that involves the product of two kinetic rates p 1 × p 2 or their quotient. A log-transformation can then lead to a linear functional relation between the estimated values that minimize the objective function: logp 1 = const. ± logp 2 . Both parameters may still be identifiable if they appear independently in other reactions, too. Thus, the linear correlation does not hamper the optimization process if only one cell-type is modelled. When including a second cell-type, the linear relation however translates to a linear correlation between the log-fold-change estimates: ( 1 ) Figure 1 shows an example of linearly correlated logfold-changes taken from an EPO induced JAK2/STAT5 signalling pathway [13,15]. The model will be discussed in detail below. The objective function landscape reveals a minimum. However, the surrounding confidence region is aligned to the diagonal subspace logˆ 1 = const. − logˆ 2 for the two fold-change parameters belonging to the CISHRNA turn and delay rates, respectively. If 1 penalization is applied to such parameters, the space of constant penalty also partially coincides with the diagonal subspace. The 1 penalty does not provide additional degrees of freedom apart from the linear scaling factor λ. Hence, the co-alignment of penalty and confidence region cannot be prevented, so 1 can be considered as too rigid in this case. The linear relation has to be distinguished from a structural non-identifiability of the model itself (cf. [16]). Eq. 1 appears within the introduced penalty term while the model parameters themselves remain identifiable.
Here, we propose the use of extended 1 methods-Adaptive Lasso, Elastic Net and the non-convex q penalty-to achieve a more flexible regularization technique. Each of these methods introduces an additional degree of freedom to make the set of available penalty functions less rigid than only 1 . This deformation degree of freedom determines, how much the level lines differ from those of 1 . The shapes of the level lines corresponding to these extended penalty functions for exemplary deformation parameters are depicted in Fig. 1. If the extended penalty terms are added to the objective function, the deformation parameter can be tuned by varying the additional degree of freedom, so the level lines of the penalty do not coincide any more with the alignment of the objective function. Hence, a shrinkage can be Fig. 1 Level lines of objective function and penalty terms. Upper panel: Two-dimensional Profile Likelihood for two linearly correlated log-fold-changes, taken from an EPO-induced JAK2/STAT5 signalling pathway [13,15]. The dashed lines indicate the level lines of an 1 penalty. The gray area marks the 95% confidence region. Lower level: Two-dimensional level lines of constant penalty for the 1 /Lasso penalty function and the three extended methods presented in this manuscript: Non-convex q , Adaptive Lasso and Elastic Net penalty. The extended methods can replace the 1 penalty in the upper panel, which is analyzed in the "Results" section. The level lines of 1 are aligned to the gray confidence region, where the extended methods produce a deformation in the level lines that can prevent an alignment re-established. All methods presented in this manuscript are available within the MATLAB modelling framework Data2Dynamics [17,18]. This paper aims to extend the possibilities described by Steiert et al. [14] and Merkle et al. [13]. Thus, an introduction to the already published idea of 1 regularization and of the optimization routine are given first. In a second step, the extended methods, which have not been used in a non-linear dynamic modelling setting to our knowledge, are introduced and discussed. The accuracy of the prediction of cell-type dependent parameters could be significantly increased when using Adaptive Lasso or q penalties because these methods could reduce the number of falsely detected cell-type specific parameters.

Parameter estimation
Biological processes in cells can be translated into a system of coupled ODEs with the concentrations being time-dependent functions: where x denotes the system's intrinsic state, u a possibly existing external input, p u , p x a set of parameters, p 0 the initial conditions and f a continuous function that is determined by the biological properties of the system. All quantities are considered as vector valued. The internal states x that solve Eq. 2 are usually not accessible to an experimental observer, so all measurable quantities y are mappings from the space of internal states onto the observer space with some observation parameters p y and a measurement error (t) ∼ N 0, σ 2 (p σ ) . The latter will be assumed as normally distributed although the approach is not limited to this case. The observation function g depends on the observational set-up. The states x(t) and y(t) will be considered as vector-valued. The functions f and g are known except for the parameter values, so a set of parameters which is assumed as constant in time, is necessary to completely characterize a system as described in Eqs. (2) and (3). All equations are assumed to have only positive parameters. The parameters will be estimated on a logarithmic scale, also in order to avoid numerical instabilities among different orders of magnitude of the parameters. It can be shown that observables in biological processes are usually log-normally distributed [19], so the uncertainties of their log-transform follow a Gaussian distribution.
Maximum likelihood approach Given data points y ij = y i (t j ) for M states y i and N time points t j as well as the corresponding observation function values g(x i (t j )) resulting from the ODE system (Eq. 2) and standard deviations σ ij , the negative 2-fold log-likelihood is minimized. This yields the maximum likelihood parameter estimatep ML = arg min p χ 2 ML (p). In cases of unknown σ ij , additional terms must be taken into account [20].
To optimize the likelihood χ 2 ML (Eq. 5), it is generally necessary to apply numerical methods because no analytic solutions are available.

Profile likelihood
The profile likelihood χ 2 PL (p i ) is obtained by re-optimizing the objective function for each value of p i with respect to all remaining parameters p i =j [16,21,22]: By calculating the profile likelihood for each parameter, the confidence interval CI(p) around an optimump can be determined: where q (m) α denotes the α-quantile of the χ 2 distribution with m degrees of freedom. For α = 0.95, the threshold is q (1) 0.95 = 3.84 for determining confidence intervals for one parameter.

Regularization for two cell-types
Consider two cell-types [0] and [1]. If the model describes a biological process that both of them may undergo, the ODE system (Eq. 2) does not require changes, whereas the parameter values of p can depend on the cell-type. This section recapitulates the basic denotions as described by Steiert et al. [14].
Log-fold-changes Subsequently, one cell-type will be chosen as reference, corresponding to a parameter set p [0] . For the other cell-type, only the fold-changes [1] with respect to the type of reference will be considered, which are defined as with the log-fold-change vector r := r [1] , which will be used as only the log transformations of parameters p are considered. Thus, r [1] i is zero if and only if the value of parameter p [1] i of cell-type [1] is compatible with p [0] i , associated with the type of reference [0]. This parameter may then be called cell-type independent, while the term cell-type specific refers to the opposite case, r [1] i = 0.
Penalization If both cell types are likely to share certain properties, it can be assumed that some fold-change parameters r i vanish. Hence, the model is supposed to be sparse with respect to r. To incorporate this prior knowledge, the original objective function χ 2 ML is amended by a penalty term ν(r, r * ): where the function ν only depends on fold-change parameters and has its global minimum at a target value r * . For a sparse two-cell-type model with logarithmic parameters, r * = 0 is chosen. Other values of r * might be useful in cases where assumptions other than model sparsity motivate the usage of regularization. The dimensionless tuning parameter λ determines the penalization strength. Hence if λ is chosen large enough, the penalty should enforce that r = 0 be a solution to the optimization problem. This approach differs from the original Lasso regularization as it only penalizes the fold-changes, i.e. the subset of all parameters that links cell-type [1] to the cell of reference [0].

Two-step regularization routine
All parameter estimates resulting from the optimization of a penalized objective function (Eq. 9, λ > 0) must be considered as possibly biased by the penalty term. To circumvent this problem, a two-step estimation routine is implemented in Data2Dynamics and was used throughout all presented calculations.
1 Optimize the penalized objective function (Eq. 9) to obtain the subset of zero log-fold-changes Z(λ) for a given penalty strength λ > 0. 2 Remove the regularization (λ = 0) and set those parameters belonging to Z(λ) as fixed to zero. Then re-optimize the unbiased objective function.
The penalty strength λ essentially determines how many parameters are set to zero. Therefore, several orders of magnitude are scanned for λ, and for each of them the aforementioned two-step estimation is executed. To determine the sparsest model that can still be considered as consistent with the data, which will be referred to as parsimonious model, information theory based criteria have been developed. The likelihood ratio test (LRT, [23]), the Akaike information criterion (AIC, [24]) and the Bayesian information criterion (BIC, [25]) are the most prominent ones. As AIC generally selects too large models, it is not a consistent selection criterion. Since BIC is equivalent to LRT for an adjusted threshold α, only LRT will be considered here to find the parsimonious model in order to allow for a comparison to the 1 results found by [14] when using LRT.
After performing the two-step estimation, the final objective function value depending on λ is given by From this, the LRT statistic D(λ) is defined as quantifying the objective function difference to a notregularized model (λ = 0). Given a statistical significance level α, the model shrinkage induced by a penalty with strength λ is considered compatible with the data if D(λ) does not exceed q . This is the α quantile of a χ 2 distribution with #Z(λ) degrees of freedom. The parsimonious model, which has least cell-type specific features among all models that agree with data up to the α level, is found at the optimal penalty strength The first condition in Eq. (12) is necessary to include also those penalty strengths which do not lead to shrinkage.

Extended penalty functions for regularization
The original log-likelihood is a sum over squares, so defining ν(r, r * = 0) = i r 2 i would be a consequent approach. The smoothness of ν in r * = 0 however leads to only asymptotic optimality of the sum of objective function and penalty term in r = 0 for λ → ∞ [9]. If the sum over absolute values is chosen, ν(r) = i |r i |, the model can be shrunk even for finite λ, i.e. r i = 0 for some i in the optimum of χ 2 , see Fig. 2. This is due to the non-differentiable point of ν whenever r i = 0 for some i. The effect of this so-called 1 penalty was studied both, for simulated [14] and experimental data [13].
Choosing a penalty with 0 < q < 1 still implies a non-differentiability of ν if r i = 0 for any i. Moreover, this function is not convex in the case q < 1. Although [9] postulated that this would cause severe problems for the optimization routine, [12] showed that non-convex penalty functions can lead to better results compared to 2 or 1 in linear cases. The gradient of the q penalty function (Eq. 13) reads if r i = 0. This term diverges for r i → 0, with a right-side limit of +∞ and a left-side limit of −∞. Hence there is always an (at least local) optimum of the total objective function in r i = 0, see Fig. 2, right panel. This is independent of the underlying objective function and can be considered as an artefact. To avoid that the optimizer is constrained to this point even if it is not the global optimum, a small ε ≈ 10 −10 is chosen. The optimizer then considers all r i with |r i | ≤ ε as effectively zero, while the gradient is fixed to representing the gradient value of the limiting case |r i | = ε and hence remaining finite. Two other penalty functions, which had been developed for linear problems, were tested as well. The Elastic Net was introduced by [11] to reduce the bias of predictions for elasticities 0 ≤ α ≤ 1. This penalty function is strictly convex for α > 0, but singular for r i = 0 for α < 1. It includes the special cases of 1 and 2 penalties for α = 0 or α = 1, respectively. The Adaptive Lasso: was introduced by [10] given maximum likelihood estimatesr ML i and adaptivities γ > 0. This approach was proved to asymptotically provide unbiased estimates in the linear setting, keeping a convex penalty function. Dark-gray curves depict the sum of χ 2 ML and a penalty with strength λ = 1. The filled diamonds in zero represent penalty strengths λ Z which cause the minimum to be exactly in zero. The objective function penalized with λ Z is drawn in light grey. Finite λ Z = 2 is sufficient to shift the minimum to zero. While the convex absolute-value penalty only admits one minimum, the non-convex q penalty can lead to multiple local minima as depicted for λ = 1 on the right hand panel. For most optimizations described in this manuscript, an 0.8 penalty was used. In this figure, the case of q = 0.5 is depicted to show the multiple minima more clearly. They arise, however, for any 0 < q < 1 The three methods defined above all contain one additional deformation parameter d ∈ {1 − q, γ , α}. This terminology refers to the effect that they determine how much the level lines of the penalty functions are deformed with respect to the rigid diamond of 1 , see Fig. 1. The limit d → 0 always yields the original 1 .
All gradients are not defined if r i = 0, but subdifferentials can be obtained by defining sign(0) =[ −1, 1] [26]. This leads to set-valued gradient components in singular points. The method is applicable only to convex functions, so for the q penalty, this approach holds only within the small ε neighbourhood around zero. There, the gradient modulus is constant, making the penalty effectively behave like the modulus function, hence it becomes convex.
To determine the optimal estimate (p,r) for the objective function χ 2 (p, r, r * = 0, λ) (Eq. 9) for given λ, the following criteria must be fulfilled [14]: and for each i either (20) The first and second criterion (Eqs. 20 and 21) represent the requirement of vanishing gradients in all non-singular cases. The first and third criterion (Eqs. 20 and 22) are fulfilled if the parametersp are optimal and the penalty term dominates the maximum likelihood contributions. According to subdifferential calculus, this is sufficient to obtain an optimal point [26].

Implementation
All calculations to optimize the objective functions (Eqs. 5 and 9) are performed within the MATLAB framework Data2Dynamics [17,18]. This is a freely available, stateof-the-art software package that has been used for various system biology applications [13,27,28], performing parameter estimation, uncertainty analysis and prediction calculation. It also contains a toolbox for regularizing models by 1 or 2 penalties. Elastic Net, Adaptive Lasso and q penalties for arbitrary q > 0 have been integrated, keeping the already existing structure as described by [14]. The regularization routines can be applied to all models and data types that Data2Dynamics supports. They do not pose any restrictions such as normalization or an absolute scale on the experimental data that is used for modelling. Yet, the complexity of the underlying mathematical model should be tailored to the information available in the data, c.f. [29]. Further details on the usage of the new methods are given in the main regularization routine arRegularize, which can be found at the directory arFramework3/L1/arRegularize.m within Data2Dynamics.

Application on simulated data
In a first step, a simulation study is employed to investigate in how far Elastic Net, Adaptive Lasso or non-convex q penalties can lead to an improved estimation compared to the already established 1 approach. The M1 model from the sixth Dialogue for Reverse Engineering Assessment and Methods challenge (DREAM6, [30]) was used because it provides comparability with the results of pure 1 regularization obtained by [14].

Model description
The model system is composed of six genes, for which the concentrations of protein and mRNA were incorporated. The translation rate for the synthesis of proteins was considered as proportional to the concentration of mRNA and the strengths of the ribosomal binding sites. The transcription rates were assumed to follow Hill kinetics. Since simulated data was used for this model, no units for time and concentrations were specified. For the challenge, all mRNA degradation rates were assumed to be equal to 1 inverse time unit. This results in a total number of 29 kinetic parameters with six ribosomal and six protein synthesis strengths, one degradation parameter, eight K m values and eight Hill coefficients [14,30]. The full description of the model and the parameter values is available from the DREAM6 organizers under www.synapse.org/#!Synapse:syn2841366/wiki/71372. It is also included as an example in the Data2Dynamics framework.
The model was simulated 150 times for two cell-types. The first cell-type, which was used as reference for the fold-changes of the second, corresponded to the gold standard parameter set of the DREAM6 challenge in the first place. The Hill coefficients range from one to four which represents the number of ribosomal binding sites. For the second cell-type, fold-change parameters were introduced to relate the parameter values of both cell-types. For each simulation run independently, approximately one third of the 29 dynamic parameters were chosen as cell-type specific. The respective fold-changes were then drawn randomly from the set ∈ {1/10, 1/5, 1/2, 2, 5, 10} for the non-Hill parameters. Fold-changes for Hill parameters were randomly selected as ∈ {1/4, 1/2, 2, 4} such that the parameter values of both cell-types are within the interval [ 1,4].
By additionally introducing perturbations to the kinetic rates of one gene, in total experimental 18 set-ups were available: First, each gene could be knocked out, so it would not be produced at all. Second, a mRNA degradation rate could be increased by a factor of five for one gene. Third, the mRNA synthesis rate of one gene could be doubled. For each run, it was randomly selected whether each of these additional set-ups was observed.
It was assumed that either mRNA or proteins were observed to account for only partial observability in realworld models. The mRNA observation had a probability to be selected of one third. In this case, 21 data points of each mRNA would be observed. In the remaining two thirds of the cases, two selected proteins were measured with 41 data points each. Thus it was random whether the perturbation occurred in the variables that were observed or not.
In total, the resulting 150 simulation runs differed randomly with respect to the cell-type specific parameters, the magnitude of the corresponding fold-change, the available experimental set-ups and the observable quantities. For a more detailed set-up description, the reader is referred to [14]. When applying the regularization techniques as described above, the goal was to test in how far the algorithm would be able to find the cell-type specific features of each of the 150 models that were used as input.

Results of 1 regularization
For 150 repetitions of the above mentioned random model and observation calculations, the parsimonious model was calculated by applying 1 regularization to all fold-change parameters. Therefore, the penalty strengths were scanned from λ = 10 −4 to λ = 10 6 . The parsimonious model was determined by virtue of Eq. (12). Then, the regularization result whether a fold-change was non-zero (called positive prediction) was compared to the true parameter values used to simulate the data. An overall accuracy (correct over total classifications) of (71 ± 13)% was obtained, which is in accordance with the overall result of 78% and also with the accuracies sorted by parameter type, published by [14]. A receiver operating characteristics (ROC) plot is depicted in Fig. 3a. The average sensitivity (correct over total positive classifications) of (74 ± 17)% and the average specificity (correct over total negative classifications) of (70 ± 19)% also match with the previously published values of 74% and 80%, respectively.

Results of extended penalty functions
Non-convex 0.8 penalty, Adaptive Lasso (with deformation d = 0.2) and Elastic Net (d = 0.08) were applied to the same settings as 1 . The corresponding ROC curves are shown in Fig. 3b-d together with the 1 result, visualizing that for many runs, the false positive rate was decreased to a large extent (mostly horizontal dotted lines). The point-wise absolute change was calculated for sensitivity, specificity and accuracy to compare all extended methods with 1 and Adaptive Lasso with q in addition. Box plot diagrams of are depicted in Fig. 3e-g. The mean values¯ are given in Table 1. A t-test has been performed with the null hypothesis of¯ = 0. Only changes | | > 0.1 will be considered as relevant.
The sensitivity could not be increased significantly by any of the new methods compared to 1 . In turn, both, Adaptive Lasso and q were able to significantly increase the specificity. The mean improvements of 0.16 and 0.21, respectively, are also relevant as they reduce the false positive rate from initially 30% to only 9% and 14%, shrinking the number of falsely detected cell-type dependencies to a third or at least one half compared to 1 . The diagonal, black dashed line represents the characteristic that can be expected from a random classifier. The upper left corner of each ROC curve represents the optimal classification. The thin dotted lines show the changes with respect to the same run penalized by 1 . Mean curves were omitted to focus on the differences between 1 and the extended methods. Both, Adaptive Lasso and 0.8 lead to mostly horizontal changes compared to 1 , which corresponds to an increased specificity and a constant sensitivity. e-g: Absolute run-wise changes for extended methods vs. 1 and Adaptive Lasso against q . Boxplot whiskers extend to twice the interquartile range at most. The boxplots emphasize the gain in specificity and consequently in accuracy The total accuracy is hence increased significantly by all methods, with Adaptive Lasso and q providing the largest improvements. The q penalty provides more accurate results than Adaptive Lasso, yet the improvement is considered as minor.
The influence of the deformation parameter The 1 penalty function has no degree of freedom, so it cannot  be adjusted when finding likelihoods aligned along the penalty level lines. To circumvent this, a deformation parameter d was used in q , Adaptive Lasso and Elastic Net penalties. As d does not carry a biological meaning, the choice of d can be regarded as arbitrary. Therefore, different values were tested for one configuration containing seven cell-type specific features: the protein degradation rate, two protein and one ribosomal synthesis strengths, two Hill exponents and one K d value. As deformations, adaptivities γ ∈ [ 0, 2], exponents q ∈ (0, 1] and elasticities α ∈ [ 0, 1] were applied. The results are depicted in Fig. 4. The specificity of the 1 regularization was 19/22 = 86.4%. Interestingly, q penalties yield a unique result independent of q < 1 with a specificity of 100%. Hence, 1 should not be considered as a strict mathematical limiting case for q → 1, but rather a different class of penalty function. This might be related to the fact that the penalty term becomes convex for q = 1. Adaptive Lasso in turn shows varying shrinkage. Especially for adaptivities γ > 1, the false positive detection increases although the final estimates have small absolute values. For elasticities α > 0.5, the Elastic Net provides hardly any shrinkage. Parsimonious models for some choices of deformation parameters. Upper row: Parsimonious models by scanning over penalty strengths λ for a fixed deformation parameter. Only the log-fold-change parameters are shown. Filled circles depict the value of those log-fold-changes that are non-zero after regularization. Squares denote the 1 penalty at zero deformation, while circles represent the extended functions. Kinetic rates in bold face were chosen as cell-type specific for data simulation. The colours of circles after these kinetic rate names encode the corresponding value. For q , the parsimonious model was independent of the choice of q. Elastic Net did not induce an effective model shrinkage for α > 0.5. Lower row: The total effective run times of the regularization routines as function of the deformation. These account for the entire two-step regularization routines to find the parsimonious models. All runs were executed on a 20-core server This demonstrates that the penalty is then closer to Ridge Regression than to Lasso, hence sparsity is only reached asymptotically. Some parameters were estimated to be cell-type specific for almost all deformations d (protein degradation rate, synthesis strengths of protein 2 and 3, and Hill kinetic exponent 8), all of which are true positive classifications. This shows that the flexibility introduced by the deformation d encourages an additional scan over a range of possible values to find stronger evidence for true cell-type specific properties. However, it is not necessary to scan over all admissible deformations to achieve model shrinkage. A regularized optimization can be performed with an a priori choice of for example an q penalty with q = 0.8. This avoids possible numerical instabilities for too small q, whereas it is sufficiently different from 1 , too, to exhibit the advantageous features of non-convexity.
None of the applied regularization functions was able to detect the cell-type specific behaviour related to Hill coefficient 1 and K D parameter 4 except for those Elastic Net configurations that could not effectively shrink the model size. This coincides with previous findings that Hill and K D coefficients are less frequently detected as celltype specific compared to the remaining parameters. This occurs because the identifiability of those parameters is limited if the corresponding regulator concentration does not lie around K D , and because they are easily concealed by measurement uncertainties [14].
The lower row of Fig. 4 indicates the effective time that was necessary to detect the parsimonious model for a given deformation. It displays that all scans with extended penalty function were faster than the original 1 except for four deformation values. This holds even for q , which was considered as hampering the optimizer due to its non-convexity [9]. A precise assessment of the efficiency is postponed to further research.
The influence of the q cut-off threshold The q regularization requires a threshold parameter ε to cut off the unbounded gradients for zero log-fold-changes. Setting ε = 0 can hamper the simulation if some log-fold-changes are initialized as zero when finding the parsimonious model because these parameters remain zero independent of the scanned penalty strengths. This behaviour that matches the expectations from theory disappeared for all tested positive values of ε, even for ε ∼ 10 −16 . To avoid numerical problems with rounding, the threshold was fixed to ε = 10 −10 .

Application on biological data
The 1 penalty function has previously been applied to an EPO induced JAK2/STAT5 signalling pathway [13] to find cell-type specific behaviour between healthy CFU-E cells and non-small cell lung cancer cells of type H838. This model contains two feedback loops related to the proteins CISH (see Fig. 5) and SOCS3, both induced by nuclear pSTAT5. 1141 data points are available. For a detailed description, the reader is referred to [13]. By applying an 1 penalty to the published data, 10 out of 26 kinetic parameters were found to be cell-type specific. Three of these could be removed manually as they were compatible with zero within the confidence interval. The remaining seven cell-type specific properties reproduce the published results.
Scanning over deformations d reveals a setting different from the 1 result for the CISH feedback loop as Fig. 5 reveals. According to the 1 prediction, only the turnover rate CISHRNAturn is cell-type specific, which is the rate constant for the nuclear pSTAT5 induced synthesis of CISH mRNA. Especially q for q ∈ [ 0.2, 0.35] pinpoints an alternative that considers CISHRNAbasal and CISHRNAdelay as cell-type specific. This corresponds to a basal, not npSTAT5 induced synthesis rate of CISH mRNA and to the delay chain parameter that is applicable to both types of synthesis. The Adaptive Lasso yielded the same classification for some adaptivities. The objective function values of −1197.14 for the turn parameter and −1196.73 for the basal and delay parameters as celltype specific are almost equal. The profile likelihood of the turn fold-changes, which was presented in the introduction (see Fig. 1), underlines that it would be in accordance with the available data to set this value to zero. Both CISH parameter choices are parsimonious in the sense of Eq. 12. However, the option that considers both, the basal and the delay rate as cell-type specific yields a less sparse result as there remains one parameter more to be estimated. Additional informative data would be required to find better evidence for the cell-type specific properties of the CISH feedback loop. Elastic Net regularization cannot provide informative results that go beyond 1 . Even for low elasticities the shrinking capacity is low.

Discussion
Systems biology utilizes mathematical models to understand biological processes. Frequently, mechanistic models are built up from ordinary differential equations requiring the estimation of model parameters and a statistical test of the obtained results. A model of biological systems with two cell types can be regularized to force some of the model parameters to be equal amongst the cell types. This induces a sparse system with a reduced number of parameters to estimate.
Three penalty functions for regularization were presented to extend the pre-existing 1 penalty: The Adaptive Lasso, the Elastic Net and the non-convex q penalty. All of them include one additional parameter which can be called deformation d. The limit case d = 0 represents the 1 method that has already been applied to the Systems Biology settings. When utilizing Adaptive Lasso or q , both, specificity and accuracy of the classification of celltype specific parameters could be significantly increased for a realistic toy model involving Hill kinetics. The sensitivity remained unchanged in more than 50% of all runs. The Elastic Net did not show relevant improvements while being sensitive with respect to the deformation d. The difference between q and Adaptive Lasso is significantly in favour of the former, whereas the total gain in accuracy is minor.
In theory, the dependence on d has to be assessed for each model individually. The results indicate that q is the most robust method with respect to the predicted celltype specific properties. Scanning over a range of admissible deformations allows to find alternative parsimonious models and to check the proposed models for multiple deformation types and strengths. Especially for q it does however not seem necessary to scan over all deformations to simply regularize a model. A choice of, for instance, q = 0.8 seemed viable for all models examined so far. The final parsimonious model is mostly non-unique among different deformations. Scanning all deformations can be regarded as an additional possibility to find hidden alternatives. Further statistical criteria such as the Bayesian Information Criterion might be employed to evaluate the most common final results.
The q penalty function yielded the most reliable predictions for the simulated data. It has to be taken into account nevertheless that this method encounters issues with multiple minima whenever the penalty strength is sufficiently strong. Theoretically, the penalized objective function always has a (local) optimum if any fold-change is zero. Our method to circumvent this problem is twofold: First, the diverging derivatives are cut off at a small  [13]. Dash-dotted arrows denote a transcription, solid arrows a state transition. Boxes represent the kinetic rates associated to a reaction. The delay chain τ is modelled with five intermediate steps. The turquoise rate is cell-type specific according to 1 , purple rates are cell-type specific according to some q and Adaptive Lasso regularizations. Lower panel: Scan over deformations for the JAK2/STAT5 signalling pathway with CFU-E and H838 cells. Parsimonious Model parameters related to the CISH feedback loop are depicted only. Dot colours represent the final unpenalized estimates. The numbers in the upper row indicate the total number of log-fold-changes not estimated to zero, also including those that are not depicted here. Squares denote the results obtained by the original 1 penalty. Circles indicate the extended methods. Two possible parsimonious models were found when scanning the deformation strengths: either CISHRNA basal and delay rates were cell-type specific or the corresponding turn rate threshold. This implies that the penalty is made convex around zero, so the additional minimum at zero disappears for sufficiently strong objective function gradients. Parsimonious models can be found without this cut-off, but it renders the procedure more robust against logfold-changes initialized as zero or set to zero erroneously while scanning the penalty strengths. Second, the range of penalty strengths λ that is scanned over to find the parsimonious model is started at small λ. Then, the optimizer can neglect the optimum in zero if it is not global. When increasing λ, the optimizer stays in the previously found optimum without being trapped in zero. It is hence necessary to initialize subsequent penalized optimizations at the values estimated before. The methods implemented in Data2Dynamics regularize a system accordingly. In further research the effects of concave penalty functions with a bounded gradient at zero could be of interest.
The increased specificity induced by deforming 1 penalties can be explained by symmetry breaking effects. Especially in the case of linear correlations, which frequently occur in biological models, e.g. due to linked phosphorylation and dephosphorylation rates, the additional deformation degree of freedom allows to have penalty level lines that are not aligned along the correlation. Then, the penalization can act in a way that is less influenced by the objective function to shrink the model. If no log-transformation was applied to parameters, no linear correlation would arise and 1 would not encounter problems. However, several benchmarking assessments have found log-transformations to be advantageous [31][32][33]. Thus, they should be kept and linear correlations should be handled by methods like those presented in our manuscript.
If the original objective function exhibits nonidentifiabilities for fold-changes that are linearly correlated, the 1 penalty terms are aligned to a subspace of constant objective-function, so the sum of objective function and penalty is constant on some interval, too. Then, any point inside this interval could be selected during the optimization. Symmetry breaking effects might then lead to a random selection of either end-point of the interval. This, however, is not a pitfall of the extended regularization, but of the model or the available data for structural or practical non-identifiablities, respectively. The extended methods can hence not cure non-identifiabilities. In this case, a model reformulation or additional, informative data are required to achieve sensitivity improvements.
None of the methods presented in this paper was able to increase the sensitivity of its predictions on cell-type specific properties. It could hence be doubted whether any new penalty approach that extends 1 is able to predict more true positives than original 1 and all methods described above.
Here, we focused on the comparison of two cell types, whereas the approach could easily be extended to incorporate any number of cell-types undergoing the same biochemical process. The implementation provided in Data2Dynamics can be used for this purpose. Therefore, one cell-type of reference has to be selected and foldchanges must be defined relating the reference-type to all remaining cell-types. Parameters can also be grouped as described in the Group Lasso technique [34]. The number of parameters then grows linearly with the number of cell-types. However, recent benchmarking results show that the performance depends polynomially on the number of parameters [33]. It should be examined in how far the choice of the reference cell-type affects the outcome since non-reference cells cannot be compared among each other, but only with the reference type. An examination of multiple cell-type models should be the subject of further studies.
The presented methods can also be generalized to any other field of Lasso usage which allows for a priori assumptions on some parameter values, such as zero log-fold-changes in our case. Yet, they are most valuable in parameter estimation problems of non-linear systems, and where collinearity may arise. This comprises descriptions of chemical reactions, e.g. for biochemical rate-equations, as well as deep-learning networks.

Conclusion
In summary, we demonstrated that using extended 1 methods can lead to a more specific and accurate classification of cell-type differences. The non-convex q penalty, e.g. for q = 0.8, provided best and fast predictions albeit it leads to non-convex objective functions. In particular, the application to the JAK2/STAT5 model showed that scanning the additional deformation parameter of the new methods facilitates the detection of differences in cell kinetics between a healthy and malignant cells, which go beyond what was published before based on 1 . Extended 1 methods as described in this manuscript could play a role in unraveling fundamental features that characterize for instance cancer cells, possibly leading to new therapeutic entities and treatments.