chngpt: threshold regression model estimation and inference

Background Threshold regression models are a diverse set of non-regular regression models that all depend on change points or thresholds. They provide a simple but elegant and interpretable way to model certain kinds of nonlinear relationships between the outcome and a predictor. Results The R package chngpt provides both estimation and hypothesis testing functionalities for four common variants of threshold regression models. All allow for adjustment of additional covariates not subjected to thresholding. We demonstrate the consistency of the estimating procedures and the type 1 error rates of the testing procedures by Monte Carlo studies, and illustrate their practical uses using an example from the study of immune response biomarkers in the context of Mother-To-Child-Transmission of HIV-1 viruses. Conclusion chngpt makes several unique contributions to the software for threshold regression models and will make these models more accessible to practitioners interested in modeling threshold effects. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1863-x) contains supplementary material, which is available to authorized users.


Background
Threshold regression models are a class of regression models where the predictors are associated with the outcome in a threshold-dependent way. By introducing a threshold parameter, also known as the change point, threshold regression models provide a simple but elegant and interpretable way to model certain kinds of nonlinear relationships between the outcome and a predictor. There are many applications to threshold regression models in biomedical fields. Our interests in these models come primarily from the analyses of immunological assay data in human vaccine studies, where threshold-dependent association between risk of infection and immune response biomarkers abounds [1,2].
Threshold regression models can take many forms depending on what happens at the threshold [3]. For example, Fig. 1 shows four types of threshold effects: step, hinge, segmented and 'stegmented' . The step and hinge models are two of the most basic forms of threshold effects with zero slope before the threshold; the segmented model generalizes the hinge model by allowing non-zero slope between the threshold; and the stegmented model, as the name suggests, can be viewed as the fusion of the step and segmented models.
In the generalized linear regression framework, we can write down the mean function of these four types of threshold models as follows: stegmented Here e is the threshold parameter, x is the predictor with threshold effect, z denotes additional predictors, I (x > e) = 1 when x > e and 0 otherwise, and (x − e) + denotes the hinge function, which equals x − e when x > e and 0 otherwise.
Threshold regression models are related to but distinct from change-point analysis [4], which deal with time series data and are primarily concerned with detecting structural changes along a natural axis such as time or © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. location on a chromosome. Many problems in changepoint analysis are not regression problems. In changepoint analysis regression problems, time series data are divided into regimens by change points; the relationships between the outcome and all the predictors are allowed to differ between regimens. In other words, all predictors are simultaneously thresholded in change-point analysis regression problems [5]. On the other hand, threshold regression models are fundamentally concerned with modeling nonlinearity. In this aspect, threshold regression models are more comparable to other nonlinear regression methods such as regression spline models [6].
Both threshold models and spline models are capable of modeling nonlinear relationships between the outcome and predictors. Their main differences lie in versatility and ease of interpretation. Take for example the hinge model and a natural cubic spline [7] with two degrees of freedom. Both have two degrees of freedom; in the case of the hinge model, the two relevant parameters are β 1 and e. The spline model is more versatile than the hinge model, but when both models offer reasonably good fits, the hinge model is more readily interpretable [8].
While many software programs are available for changepoint analysis and regression spline models, there are relatively few for threshold regression models. The best existing implementation is the R package segmented [5,9], which supports the hinge and segmented models and allows multiple thresholds. Our chngpt package complements the segmented package by making three unique contributions: (1) it supports all four types of threshold effects in Fig. 1, and supports models with interaction terms between predictors subjected to thresholding and predictors not subjected to thresholding [10]; (2) the search method in segmented employs a first order approximation of the non-smooth criterion function [11], while chngpt offers two alternative search methods: exact, which optimizes the exact criterion function, and smooth, which approximates the criterion function with a logistic function-based smooth function [3]. The exact method guarantees to find the globally optimal solution but can be slow when the sample size is large, while the smooth method, like segmented, is faster but may find a locally optimal solution; (3) segmented does not provide confidence intervals that account for the uncertainty of the threshold estimate, while chngpt do. The latter also includes modelrobust confidence intervals, which are designed to provide proper coverage even if the data-generating model is not truly a threshold model [8].

Estimation and confidence intervals
Estimation of threshold regression models is complicated by the fact that the models are not smooth in the threshold parameter. We consider two approaches for finding the maximum likelihood estimate of the threshold model parameters: exact and smooth. In the exact method we choose a grid of candidate change points that uniformly span the quantiles of the empirical distribution of the thresholded covariate. Given a candidate change point, the threshold regression model reduces to a regular regression model. The estimated change point is defined as the one corresponding to the reduced regression model with the highest likelihood. The advantages of the exact method are that it does not require a starting value for change point, and finds the global optimal solution; on the other hand, this method does not scale well to large datasets. As a practical measure, we set a default grid size of 500; when the sample size is larger than the grid size, we choose a subset of thresholded covariate values by uniformly sampling the ranks.
In the smooth method, we approximate the discontinuous likelihood function of a threshold regression model by approximating the step function I (x > e) with a twoparameter logistic function [ where b is chosen to be an appropriately large constant. To find the parameter estimate, we perform iterated optimization. Given an initial estimate of all model parameters, we alternatively 1. update the threshold parameter e and the coefficients β's and γ that are associated with thresholded covariate, conditional on the rest of the model parameters estimates,α, 2. update all coefficients, α, β's and γ , conditional on the estimated thresholdê.
Due to approximation of the step function, the second step above has a smooth objective function and can be performed using a wide variety of optimization methods. Our choice is a quasi-Newton method with box constraint as implemented in the R function optim. We stop the algorithm when the relative changes dip below a pre-determined tolerance level. The main advantage of the smooth approximation algorithm is that even with a large dataset, it can converge relatively quickly. On the other hand, the solution found by this algorithm is a local optimum, and its performance depends critically on the choice of starting value. To get good starting values, we perform hypothesis testing on the coefficients associated with the thresholded covariate, which will be described next, and use the threshold value corresponding to the maximal test statistic as the starting value.
There are two basic types of confidence intervals for threshold regression model parameter estimates: modelbased and model-robust. The former has the proper coverage only when the model is correctly specified, while the coverage of the latter is robust to model misspecification. There is an interesting interplay between these two types of confidence intervals and continuous (hinge and segmented) versus discontinuous (step and stegmented) threshold models. For continuous threshold models, all parameters converge at the regular n −1/2 rate whether or not the model is correctly specified, but the asymptotic variance-covariance matrices differ [8,12,13]. For discontinuous threshold models, under correct model specification, the threshold estimate converges at a fast rate of n −1 , and the regression coefficient estimates converge at their usual rates of n −1/2 [14]; under model misspecification, all parameters converge at the slower n −1/3 rate [15]. In the chngpt package, we provide implementation of modelbased intervals for all types of threshold models as well as model-robust confidence intervals for the continuous threshold models.

Hypothesis testing
For hypothesis testing, we extend the methods in [10], which dealt with the step model, to handle all four types of threshold effects. In the step and hinge models, we are interested in testing the null hypothesis β 1 = 0. In the segmented and stegmented models, the situation is more complex. We may be interested in testing all β and γ are 0, which corresponds to the null hypothesis that x has no effect on Y, or we may wish to leave γ out of the null hypothesis and test all β are 0. The latter puts the focus on threshold effect, and that is what we will focus on. Specifically, we will test β 1 = 0 for the segmented model and β 1 = β 2 = 0 for the stegmented model. When the null hypothesis involves only one parameter, either the maximum of scores or the maximum of likelihood ratios can be used. When the null hypothesis involves more than one parameter, tests based on maximum of likelihood ratios can be more powerful [10].
Consider the following equation that describes all threshold models in a uniform way: where z represents all predictors independent of e and x (e) represents all predictors dependent on e. Let p be the dimension of x(e); p = 2 in the stegmented model and p = 1 otherwise.
Let Q (e) denote the likelihood ratio statistic for comparing a threshold model and the null model conditional on a candidate change point e. Assuming there are M candidate change points, the maximum of likelihood ratios test statistics is defined as Under the standard regularity conditions, Q (e) converges to a chi-squared distribution with p degrees of free-dom under the null hypothesis. Furthermore, the statistic has the asymptotic representation

Monte Carlo studies
To validate the proposed methods and check the accuracy of the implementation, we conduct Monte Carlo experiments. We simulate data from logistic threshold regression models with true parameters listed in the Additional file 1: Table A.1. For the covariate distributions, we simulate from z ∼ N (0, σ = 1), x ∼ N (4.7, σ = 1.6). Each experiment is repeated 2,000 times to evaluate estimation performance and 10,000 times to evaluate hypothesis testing performance. The results of estimation are collected in Tables 1 and 2. Table 1 shows the relative bias of the estimated coefficients and the bias of the estimated threshold. For the step, hinge, and segmented models, the relative bias is computed by dividing the difference between the Monte Carlo mean and the true value by the true value; for the stegmented model, the Monte Carlo mean is replaced by the Monte Carlo median due to the skewness of the distribution of the parameters. The results indicate that the estimates from both grid search and smooth approximation are asymptotically unbiased. We also see that the estimated coefficients associated with the thresholded covariate x have larger finite sample biases than the    Table 2 shows that the sampling variability decreases as the sample size increases as expected.
In Tables 1 and 2 we also include the first order approximation search method that is available from the segmented package. These results show that the parameter estimates using this method have a similar profile to the estimates from the grid search and smooth approximation methods. An exception is when n=250, the estimated slope parameter associated with (x − e) + in the segmented model is more biased by the first order approximation method, 0.34 versus 0.22 for grid search. To further compare their performance, we repeat this simulation study with a reduced magnitude of the slope parameter associated with (x−e) + (β 1 = −0.51 instead of − 0.92). The results (Additional file 1: Table A. 2 and A.3) show that the distance between the parameter estimates from the first order approximation and the grid search deviates increases dramatically, while the distance between the results from the smooth approximation and the grid search remains close. Table 3 shows the type 1 error rates of hypothesis testing. We choose a moderate sample size of 250 for illustration. For the step, hinge, and segmented models, both the maximal score test and the maximal likelihood ratio test have close to nominal level type 1 error rates. For the stegmented model, the maximal likelihood ratio test type 1 error rate is slightly elevated, and as p = 2, there is not a univariate score test that is directly comparable to the other three models.
Finally, we compare the speed of grid search versus approximation methods. The model fitting is done on a Linux machine with a Intel ® Xeon ® CPU E5-2690 clocked at 2.90GHz. We examine four different samples sizes from 250 to 2000. The results from averaging over 10 simulations are shown in Table 4. For the smooth approximation method, model fitting for a dataset of 2000 rows takes less than one second, while for the exact method, model fitting for a dataset of 500 rows already takes more than one second. The performance of the first order approximation method also beats grid search but lags behind the smooth approximation method.

Real data illustrations
We illustrate the use of chngpt for fitting thresholded logistic regression models using a real data example from a study on the immunological biomarkers associated with the risk of Mother-To-Children Transmission (MTCT) of HIV-1 viruses [1]. The study was performed using stored blood samples from a cohort of U.S. non-breastfeeding, HIV-1-infected mother-infant pairs enrolled in the pre-ARV era Women and Infants Transmission Study [16]. Immunological assays were performed to measure antibody immune responses, including binding antibodies, neutralization antibodies, antibody avidity, and antibody-dependent cell-mediated cytotoxicity. The dataset includes 236 subjects, each corresponding to an infected mother; among them, 79 were transmitters and 157 are non-transmitters. For illustration, we consider the association between the transmission status and two covariates: birth, a categorical variable indicating the type of births, C-section or vaginal, and NAb_SF162LS, a continuous covariate giving the log titers of neutralization activities against a relatively easy to neutralize HIV-1 isolate named SF162LS. We fit the model using the exact search method, and a plot of the likelihood of the restricted regression model given a fixed change point versus the value of the change point is shown in Fig. 2a. Figure 2b plots the predicted risk as a function of NAb_SF162LS from the fitted model for vaginal births. For comparison, the figure also shows the predicted risk from a spline model that models the effect of NAb_SF162LS with a natural cubic spline with two degrees of freedom. Both the spline and hinge model fits suggest that the relationship between transmission and NAb_SF162LS is nonlinear; the hinge model fit further suggests that NAb_SF162LS needs to be above 7.4 (95% robust confidence interval 5.5, 8.2) before it is associated with decreased risk of MTCT.
To examine the classification performance of the hinge model, we perform Monte Carlo cross validation. We  A second example illustrating the use of chngpt for fitting thresholded linear regression models can be found in the Additional file 1: Section B.

Conclusions
We have developed an R package chngpt that supports four variants of threshold regression models that are most widely used in practice. The package implements both estimation and hypothesis testing functionalities based on a number of recent methodological advances [3,8,10].
chngpt is an open source software and can be downloaded from the Comprehensive R Archive Network. A short tutorial on how to use the package is contained in the Additional file 1: Section B.
Choosing among the four types of threshold models is an important and difficult question. We may divide this question into two parts: (i) whether a jump occurs at the threshold and (ii) whether the parameter space of the slope parameters should be restricted. The first question is especially challenging. For some processes, e.g. the occurrence of recombination events on a chromosome, it is natural to have jumps. For many others, the true underlying process may not be discontinuous; nonetheless, a discontinuous threshold model can be a useful approximation of a sudden shift in the response over a small span of the predictor values. To make this decision we recommend performing a test of the null hypothesis β 2 = 0 based on the stegmented model using a method from [15]. For the second question, take the choice between the two continuous threshold models for example. The hinge model is nested within the segmented model, which may suggest that we always use the segmented model. However, as the simulation studies in [17] show, for the same sample size the hinge model can be estimated with much greater accuracy than the segmented model. Thus, if applicable, the hinge model would be preferred over the segmented model. The decision to use the hinge model should be based on a combination of statistical evidence, e.g. fitting the segmented model and checking whether the modelrobust confidence interval ofγ includes 0, and scientific consideration, e.g. ifγ < 0 and yet, the predictor is not expected to have an inverse association with outcome given the domain knowledge, then it is more compelling to choose the hinge model when the confidence interval ofγ includes 0.