stpm: an R package for stochastic process model

Background The Stochastic Process Model (SPM) represents a general framework for modeling the joint evolution of repeatedly measured variables and time-to-event outcomes observed in longitudinal studies, i.e., SPM relates the stochastic dynamics of variables (e.g., physiological or biological measures) with the probabilities of end points (e.g., death or system failure). SPM is applicable for analyses of longitudinal data in many research areas; however, there are no publicly available software tools that implement this methodology. Results We developed an R package stpm for the SPM-methodology. The package estimates several versions of SPM currently available in the literature including discrete- and continuous-time multidimensional models and a one-dimensional model with time-dependent parameters. Also, the package provides tools for simulation and projection of individual trajectories and hazard functions. Conclusion In this paper, we present the first software implementation of the SPM-methodology by providing an R package stpm, which was verified through extensive simulation and validation studies. Future work includes further improvements of the model. Clinical and academic researchers will benefit from using the presented model and software. The R package stpm is available as open source software from the following links: https://cran.r-project.org/package=stpm(stable version) or https://github.com/izhbannikov/spm(developer version). Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1538-7) contains supplementary material, which is available to authorized users.


Background
A plethora of approaches to the joint analysis of longitudinal and time-to-event (survival) data have been developed the in last few decades, see ( [1], Ch. 11) and [2,3]. For example, joint longitudinal-survival models analyze the joint behavior of the process describing physiological variables (i.e. "longitudinal" process) and time-to-event ("survival") process, see [4][5][6][7] and R package JM [8], lcmm [9]. These models usually represent the hazard in the form of the Cox proportional hazards model [10] and dynamics of longitudinal variable as a mixed-effects model [11]. Also, random processes (e.g., Ornstein-Uhlenbeck or Gaussian processes) can be used in order to handle random fluctuations of individual measurements around the population average [12,13]. There are also extensions, such as accelerated failure time or additive hazard models [14,15] (see, e.g. R -package JM), which can be applied if the Cox proportional hazards assumption is violated.
However, an important challenge to consider in the context of bioinformatic studies is integration of biological knowledge and theories with statistical and computational methods and algorithms. One of possible approaches to integrate biological concepts and statistical models is based on the quadratic hazard models (also known alternatively as Stochastic Process Models, SPM) which were first introduced several decades ago [16][17][18]. Such models were recently modified [19,20] to incorporate several conceptual mechanisms with clear biological interpretation (such as homeostatic regulation, allostatic adaptation, stress resistance, adaptive capacity and physiological norms) relevant in the context of research on aging. Incorporation of available knowledge about regularities of aging-related changes developing in the human body into the model structure allows for addressing fundamental problems of aging dealing with age-related declines in stress resistance and adaptive capacity, changes in resilience and physiological norm, accumulation of allostatic load, etc. [20]. Importantly, these models permit evaluating all these mechanisms indirectly from longitudinal trajectories of biomarkers and data on mortality or onset of diseases when measurements of relevant variables representing the respective biological processes are not available in the data. Thus SPM provides an example of a successful attempt to analytically link biological knowledge about aging-related processes developing in the human body with changes in mortality risk using a compact and convenient mathematical framework.
The idea of SPM was first described in [16]. The theoretical background of the models with survival functions affected by stochastic covariates was also presented in [17,[21][22][23]. Later the methodology was extended in several publications, e.g. [19,[24][25][26][27][28][29]. The SPM links the dynamics of stochastic variables represented by multivariable autoregressive or stochastic differential equations with hazard rates described as quadratic functions of the state variables. The choice of the quadratic hazard function (also known as U-or J-shaped hazard function) is justified empirically based on many epidemiological observations for various biomarkers (see, e.g., [30][31][32][33][34][35]). The minimum of a quadratic function (or paraboloid in the multivariable case) is a point (or domain) in the variable state space, which corresponds to the optimal system status (e.g., the "normal" health status) with the minimal value of mortality risk at a specific time (or age). An important component of the SPM is the observational plan which characterizes how dynamic variables affecting mortality risk were measured in a longitudinal study.
We note also that such models have a much broader range of applications in many areas beyond research on aging capitalizing on their strength of using the stochastic dynamics of variables which may better describe the reality in many applications. Many publications demonstrated the high relevance of using the SPM in joint analyses of longitudinal measurements of various variables and timeto-event outcomes. However, until now, there were no publicly available software tools that implement this kind of analysis.
In this paper we present the R package, stpm, the first publicly available set of utilities which implements the SPM methodology in three specific cases covering analyses most frequently used in practice and, therefore, constituting a general framework for studying and modeling survival (censored) traits depending on random trajectories (stochastic paths) of variables.

Model forms
There are two forms of the SPM that have been developed recently stemming from the original works by Woodbury, Manton, Yashin, Stallard and colleagues in 1970-1980's: (i) discrete-time stochastic process model, assuming fixed time intervals between subsequent observations, initially developed by Woodbury, Manton et al. [16,23] and further developed by Akushevich et al. [24]; (ii) and continuous-time model, proposed in Yashin et al. [17,22] (and later extended in [19]), which can handle arbitrary time intervals. The 2007 version [19] specifies the components of the model tailored to applications in aging research which can still be used in a more general context.
In the R package stpm we implemented the models of type i and ii with time-independent coefficients, which can handle one or more variables (dimensions). In addition, we implemented a one-dimensional case (when one physiological variable/covariate is used) with timedependent coefficients of the model in [19]. Below we briefly describe the types of stochastic process models implemented in stpm.

Discrete-time SPM
The model [23,24] assumes fixed time intervals between consecutive observations. In this model, Y(t) (a k × 1 matrix of the values of covariates, where k is the number of covariates considered) and μ(t, Y(t)) (the hazard rate) have the following form: Coefficients u (a k × 1 matrix, where k is a number of covariates), R (a k × k matrix), μ 0 , b (a 1 × k matrix), Q (a k × k matrix) are assumed to be constant in the particular implementation of this model in the R package stpm.
contains normally-distributed random residuals, a k × 1 matrix. The symbol "*" denotes transpose operation. θ is a parameter to be estimated along with other parameters (u, R, μ 0 , b, Q).
These coefficients are then estimated directly from (i) linear auto-regression (Y(t + 1) = u + RY(t) + ), where Y(t) is empirically-observed for those subjects that are alive at time t and Y(t + 1) is the value of Y(t) at time t + 1; (ii) using a generalized linear model with family Binomial and link Log.

Continuous-time SPM
In the specification of the SPM described in the 2007 paper by Yashin and collegaues [19] the stochastic differential equation describing the age dynamics of a covariate is: In this equation, Y(t) (a k × 1 matrix) is the value of a particular covariate at time (age) t. f 1 (t) (a k × 1 matrix) corresponds to the long-term mean value of the stochastic process Y(t) which describes a trajectory of individual covariates influenced by different factors represented by a random Wiener process W(t). Coefficient a(t) (a k × k matrix) is a negative feedback coefficient which characterizes the rate at which the process reverts to its mean. In the area of research on aging, f 1 (t) represents the mean allostatic trajectory and a(t) represents the adaptive capacity of the organism. Coefficient b(t) (a k × 1 matrix) characterizes the strength of the random disturbances from the Wiener process W(t).
The following function μ(t, Y(t)) represents the hazard rate: here μ 0 (t) is the baseline hazard, which corresponds to the risk when Y(t) follows its optimal trajectory; f(t) (a k × 1 matrix) represents the optimal trajectory that minimizes the risk and Q(t) (a k × k matrix) models the sensitivity of the risk function to deviations from the norm.
In general, model coefficients a(t), f 1 (t), Q(t), f(t), b(t) and μ 0 (t) are time(age)-dependent. For example, the coefficient a can be assumed as (i) -0.05 (a constant, timeindependent) or (ii) a(t) = a 0 + b 0 t (time-dependent), in which a 0 and b 0 are unknown parameters to be estimated. The model can handle, in theory, any number of covariates.
In the implementation of the continuous-time SPM provided by the R package stpm, coefficients a, f 1 , f, b, μ 0 , Q are assumed to be time-independent. However, μ 0 and Q from (3) can be multiplied by e θt (by user's choice) and therefore are time-dependent: μ 0 (t) = μ 0 e θt , Q(t) = Qe θt . If not, they are assumed to be time-independent along with the other coefficients. Then the maximum likelihood method is used to estimate parameters a, f 1 , Q, f, b, μ 0 , θ as described further.

Parameter estimation procedure
The parameter estimation procedure can be found, e.g., in [19] and here we briefly summarize it. As shown in [19], the likelihood function is: Here L is the likelihood; i denotes individual, j denotes observation for respective variable. In Eq. (5), (6), (7) we suppressed i and j for brevity.μ(u) is the marginal hazard, presented in the survival function associated with the lifespan distribution P( and γ (0) are the mean and the variance/covariance matrix of the normal distribution of initial vector Y 0 = Y(t = t 0 ) and the mean and the variance/covariance matrix of this distribution at age t are given by m(t) and γ (t), respectively; Tr denotes the trace of a matrix; y i denote the measurements of the process Y (t); τ i is the lifespan (or age at censoring); δ i is a censoring indicator, m i (t) and γ i (t) satisfy (6) We use available optimization methods from package nloptr to estimate the parameters of this model. By default we use the Nelder-Mead method [36].
The coefficient conversion between continuous-and discrete-time models is as follows ("c" and "d" denote continuous-and discrete-time models respectively; note: these equations can be used if the intervals between consecutive observations of discrete-and continuous-time models are equal; it is also required that matrices a c and Q c,d be full-rank matrices): where k is the number of covariates, which is equal to the model's dimension and "*" denotes transpose operation; is a k × 1 matrix which contains the s.d.s of the corresponding residuals (residuals of a linear regression Y(t + 1) = u + RY(t) + ; s.d. is a standard deviation), I(k) is an identity k × k matrix.

Model with time-dependent coefficients
The two types of models described above assumes timeindependent coefficients, i.e. coefficients are constant and one-dimensional through the lifetime. We also implemented a model in which the coefficients are timedependent functions.

Description of the R package stpm
The general workflow of parameter estimation in the stpm R package consists of (i) data preparation and (ii) model parameter estimation. A user can potentially avoid the data preparation stage but should maintain an appropriate data format as described below and in the package user manual. The package is available as open source software from the following link: https://cran.r-project.org/package=stpm (stable version) or https://github.com/izhbannikov/spm (developer version).

Input data
The input data consists of longitudinal follow-up data that needs to be presented in the form of a dataset in commaseparated or SAS formats. The dataset is a longitudinal data file in a long format (i.e. each record represents a single observation for a subject, therefore there are multiple rows per individual). An example is presented in Table 1.

Data preparation
At the data preparation stage, the longitudinal dataset is preprocessed with the following command: Here longdat.csv is a path to the longitudinal dataset (i.e., can be csv or SAS data file). Names of specific covariates can be explicitly mentioned: d.prep <-prepare_data(x="longdat.csv" covariates=c("DBP", "BMI")) In this case we mentioned two covariates: DBP, which is diastolic blood pressure, and BMI -body mass index. Therefore, only these covariates will be "prepared" for downstream analysis. By default the first three columns of data give individual id, censoring status, times of measurements, and the values of measured covariates are provided in the rest (see Tables 1, 2 and 3).
The output of prepare_data(...) function includes a list of two datasets for modeling data with arbitrary or fixed intervals. A dataset with fixed intervals is used in the package function spm_discrete(...) which implements discrete-time model; a dataset with arbitrary time intervals is used in the package function spm_continuous(...) for continuous-time model (theoretically, there might be missing values in this data set and the algorithm can impute them). Linear interpolation is used for the former case to provide values of covariates between predetermined (empiricallyobserved) time points. Tables 2 and 3 are examples of such data sets. Those tables contain no missing values.

Model parameter estimation
At the parameter estimation stage, the stpm Rpackage offers three SPM specifications: (a) discrete-time, multi-dimensional SPM [23,24]; (b) continuous-time, multi-dimensional SPM [19]; (c) continuous-time, onedimensional SPM with time-dependent user-defined coefficients [25]. The package's central function spm(...) is used to estimate parameters from the model with different specifications and can be executed with the following command:  In this command: d.prep is a dataset (preprocessed data from function prepare_data(...)); model is a model type, the choices are: "discrete", "continuous", and "time-dependent. " For discrete and continuous model types, the output is a list with two subsets (parameters of these subsets are unambiguously related): (i) a set of estimated parameters [u, R, b, Sigma, Q, mu0, theta], see Eqs. (1); (ii) a set of estimated parameters [a, f1, Q, f, b, mu0, theta], see Eqs. (2,3).
Output for SPM with time-dependent parameters gives estimates for parameters provided in formulas, which is a list of formulas that define the time-dependent parameters. If some parameter's formulas were not explicitly indicated by a user in formulas then their defaults will be used and estimates will be given. The corresponding R-function to call this type of model is: model.par <-spm(d.prep, model="time-dependent", formulas=list(at="a1 * t+a2")) In this case the parameter formulas re-defines a. The model parameters not mentioned in the list formulas are constants (default). Initial values of parameters in formulas remaining for t=0 are estimated from the discretetime model and initial values of parameters that define time dependence (e.g., a1 in the above example) are set to zero.

Projection and simulation studies
The R package stpm also allows projection and data simulation with previously estimated or user-defined parameters. Projections are constructed for a cohort with normally distributed initial covariates. The results of the projections are (i) a dataset with individual projected values and (ii) a dataset with survival probabilities and agespecific means of state variables (covariates). An example of projection is: data.proj <-spm_projection(model.par, N=5000, ystart=80) The model.par here is a list of estimated model parameters from spm(...) function, 5000 is the number of individuals to simulate, 80 is the mean value of a covariate (in this case we have one-dimensional simulation). We present an example of simulation of 5,000 individuals: a data table and survival probabilities. library(stpm) # Starting parameters: model.par <-list(a=-0.05, f1=90, Q=2e-8, f=80, b=5, mu0=1e-5, theta=0.11) # Data simulation: data <-spm_projection(model.par, N=100, tstart=30, ystart=80, model="discrete") # Print some data: head(data$data) # Mean of covariates by age: data$stat$mean.by.age # Plot survival probabilities: plot(data$stat$srv.prob, xlab="Age", ylab="Percent survival", xlim=c(30,105)) Here we first set the model parameters: model.par <-list(a=-0.05, f1=90, Q=2e-8, f=80, b=5, mu0=1e-5,theta=0.11) Then we call a simulation function spm_projection (...) in order to simulate data (we specify a starting age of 30 (tstart=30)). We also can see mean values of covariates for each age group (with a command data$stat$mean.by.age) and plot survival curves (see Fig. 1).

Simulation strategies
Simulation is needed for verification of the estimation procedure. Below we describe simulation strategies implemented in the R package stpm. All three models described above were verified through simulation studies. To begin, a cohort of individuals at an initial time t 0 is constructed. We construct individual trajectories as the solution of Eqs. (1), (2) for discrete-and continuous-time models using initial values of covariates, and random stopping (death) times. The initial values of covariates for all individuals in the cohort are simulated through sampling from the Gaussian distribution: is a value of function f 1 at starting Fig. 1 The Kaplan-Meier estimate (along with confidence intervals) of the survival function of one simulated dataset generated by the procedure described in "Simulation strategies" section time t 0 (both user-defined) and σ 0 a standard deviation, user-defined (by default σ 0 = 1 for any covariate). Once we have the initial distribution of values of covariates for individuals, we then model trajectories in the multidimensional state space as follows: • First, the conditional probability of survival for each individual is computed using the mortality rate μ(t, Y(t)) for the interval (t, t + t): S(t|Y(t)) = e − t+ t t μ(s,Y(s))ds (for continuous-time model) and S(t|Y(t)) = e −μ(t,Y(t)) t for discrete-time model.
• Each individual in the simulated cohort is deemed to survive or not, according to the probability S(t|Y(t)).
To do that, a uniformly distributed random number r from the interval [0, 1] is generated. If r > S(t|Y(t)), the individual is assumed to have died, and the simulation of the corresponding individual trajectory stops at time t + t (the time of death). The whole process is repeated until all individuals have died or are censored.

Validation
We conducted simulations of 100 follow-up datasets with discrete intervals (1 year) between the observations, with 5,000 of subjects in each dataset separately for one and two covariates. Separately, we simulated another set of 100 follow-up datasets with arbitrary intervals between observation (for continuous-time model, for one and two covariates). Trajectory projections were performed according to the methodologies described above. Finally, we performed simulation of 100 follow-up datasets for the model with time-dependent parameters and one covariate. For this model we set the parameter f 1 = f 1a + f 1b t; other parameters were left as constants. Then we estimated all the parameters for discrete-, continuoustime and the model with time-dependent coefficients, for one and two covariates. The results are described in Discussion.

Case study: application to the Framingham Heart Study Data
Biological reasoning for the components of SPM [19] In this case study we illustrate the application of SPM [19] in the context of biological questions in studies of aging. As we noted, one of the challenges in the context of bioinformatic studies is to incorporate biological concepts into statistical models. Understandably, representing biological mechanisms relevant to functioning of such complicated systems as the human organism in the framework of a mathematical or statistical model is a tremendous task. Nevetheless, one can try to represent in the model some basic components of the system under study. The SPM (its 2007 version, [19]) represents such an attempt to incorporate basic concepts in the field of research on aging in the framework of mathematical equations.
The first equation of the SPM (see eq. 2) represents the stochastic dynamics of biomarkers. The stochastic component of the model is an important part of the aging process [37], therefore, it is natural to use stochastic processes in the models of aging. One type of process which is relevant for describing biological processes in a living organism is the so-called mean-reverting stochastic process [38]. Such a process has a tendency to move to its equilibrium state (also called a long-term mean) and it can represent homeostatic regulation in the structure of the model (which is a critical property of the living organism). In reality, organisms function in a non-optimal environment, therefore the regulatory systems push it to a different sub-optimal state, which is known as the allostatic state [39]. Representation of the mean allostatic state in the SPM is another important illustration of inclusion of biological reasoning in statistical models for research on aging. The statistical concept of a negative feedback coefficient a(t) provides one more way to include biological concepts in the model. The coefficient a(t) controls how quickly the physiological trajectory reverts to its average and modulates the adaptive response rate of an organism to the stress factors. Such factors impact the biomarker trajectories so they deviate from their normal (optimal) states. For example, one research question could be to look at the age dynamics of adaptive capacity. The phenomenon of worsening adaptive capacity with age implies that more time is required for the values of biomarkers to return to the average allostatic state for older people in comparison to the time needed for younger people.
The second equation of the SPM describes the hazard rate μ(t, Y(t)) (i.e., mortality/incidence rate) as a function of the stochastic covariates (see Eq. 3) [16,21,22]. The SPM represents the hazard as a quadtratic form: (Y(t) − f(t)) * Q(t)(Y(t) − f(t)); hence it is also called the quadratic hazard model. Such a hazard form is a convenient and useful choice with acceptable statistical properties [17,22] based on evidence that it is a quadratic function (J-or Ushape) of different covariates, see, e.g., [31][32][33]35]. The hazard rate used in the model is also a function of time (age) and it also includes a baseline hazard μ 0 which can be also time (age)-dependent (for example, Gompertz).
The parameter Q(t) controls how wide the U-shape (or a J-shape) is and can be formulated in terms of stress resistance [20,40] or "vulnerability" [41]. As discussed in these works, robustness or vulnerability are characterized by the width of the U-shape, and, therefore, if the U-shape shrinks, the organism becomes more and more susceptible to deviations of biomarkers from their "normal" states. Such decreases of stress resistance can be indirectly captured from longitudinal data by the SPM.
SPM estimates physiological or biological norms of biomarker values which correspond to minimal hazard rates at some particular time (age) [42]: f(t). This is estimated explicitly since the quadratic term contains the difference between the biomarker value and some function denoting the normal (optimal) state: if a biomarker value Y equals the function f then the quadratic part is nullified. Any other values of Y not equal to f result in larger hazard rates. The difference Y − f also indicates that it was impossible for the organism to return to the optimal state and, therefore, the organism is deregulated.

Application to blood glucose
Blood glucose (BG) has a tendency to increase with age and therefore to significantly differentiate from the normal level of BG determined among young adults. This can potentially contribute to increasing risks of death with age.
To study effects of BG on respective risks, researchers usually apply the Cox proportional hazards model. This gives one the estimates of coefficients β from which one can calculate the respective hazard ratios. Hazard ratios tell nothing about hidden and biologically interpretable components of aging processes, such as allostatic load, mean allostatic trajectory, stress resistance, adaptive capacity, and physiological norm. To see the effects of these components on mortality, which can not be captured by the Cox model, we performed analyses of repeated measurements of BG using SPM. This allows splitting the negative effects of external forces from the normal deterioration arising from the senescence process.
In this case study, we show that the level of BG which corresponds to the lowest mortality risk has a tendency to increase with age. The age-related changes in mortality-risk shape indicate the respective declines in stress-resistance which influence the level of BG. The case study results indicate that analyzing time-to-event data with SPM can substantially improve our knowledge of various factors and mechanisms which have an effect on aging-related changes in human organisms.

Data description
The Framingham Heart Study (FHS) Original Cohort was established in 1948 and has continued to the present [43]. In this study, we used the FHS data provided by the National Heart, Lung, and Blood Institute's (NHLBI) Biologic Specimen and Data Repositories Information Coordinating Center (BioLINCC) resource (https://biolincc. nhlbi.nih.gov/home/). Version 2014a was used in the analyses. The dataset of N = 5,079 individuals (2,785 females, 2,294 males; almost all subjects are White/Caucasians). The minimum individual age is 28 years and the maximum is 104 years; the average age is 60.18 years. The average observational time was 12 years (6 exams) and the average time between consecutive observations was 2 years. Missing BG observations were removed from the analysis. A histogram of BG levels is given in Additional file 1: Figure S6.

Analysis methodology
We analyzed the data with R package stpm using a one-dimensional continuous-time model with timedependent coefficients. Model parameters were in the form of linear functions and are presented below:

Results
The plots of a(t), f 1 (t), f (t) , Q(t) and μ(t) are presented in Fig. 2. Numerical values of parameter estimates along with their statistical characteristics such as s.d. and confidence intervals are presented in Additional file 1: Table S3.
From Fig. 2 we can see that the value of BG changes with age as shown in [42]. The function f 1 shows that the organism is not usually functioning in normal environment and therefore the trajectory of BG does not revert to the "norm" but rather to a different function. The increase with age in Q(t) indicates that the same deviation from the "norm" at older ages results in a larger increase in mortality risk. This means that the organism is more vulnerable to deviations of the level of BG from the normal value. We also see that the normal value is age-dependent. This indicates that the this optimal level of BG for younger individuals can actually increase the risk of death at older ages. Also, the age-dependence in a(t) shows that it takes more time for the trajectory of BG to go back to the allostatically prescribed value at older ages than it takes at younger ages. This means that the adaptive capacity of the organism (as related to adaptation to deviation of BG levels) declines with age.
As we show in this example, SPM [19] can estimate different aging-related components which eventually affect mortality though the longitudinal dynamics of physiological variables (such as BG). This provides a way to get additional insights into the processes of aging and serves as a background for further investigations using, for example, genetic analyses [26].

Results
Tables 4, 5 and 6 show simulation results for one-and two-dimensional discrete-time SPM models with timeindependent coefficients. Results for continuous-time SPM for both one-and two dimensions are provided in Table 4 Results of simulation studies for one-dimensional discrete-time model ( Additional file 1: Tables S1 and S2. All of the results show concordance with the parameter values used in simulation.
We also provide histograms of estimated parameters for all model types. Figures S1-S5 from Additional file 1 show histograms of estimated parameter for one-and twodimentional discrete-time models and the model with time-dependent coefficients.

Discussion
The Stochastic Process Model allows researchers to utilize the full potential of longitudinal data by evaluating  dynamic mechanisms of changing physiological variables with time (age), allowing the study of differences, for example, in genotype-specific hazards. Applying the Stochastic Process Model to analysis of longitudinal data can uncover influences of hidden components (adaptive capacity, allostatic load, resistance to stresses, physiological norm) of aging-related changes, which play important roles in aging-related processes but cannot be measured directly with common statistical methods. This provides researchers with a new way of analyzing longitudinal data. The specific form of the hazard of risk function should be taken into account where conducting analyses of longitudinal data using SPM. In our approach, we assume that the hazard rate (incidence rate related to changing physiological variable with age) has a U-or J-shape, which is biologically justified by empirical observations. In reality, the true form of this function is not known and, since it is impossible to estimate the true form from the data, an incorrectly assumed hazard may introduce additional bias. Additional investigation is needed in order to evaluate the effects of different forms of hazard functions on results.

Conclusion
We presented stpm -an R package that implements the Stochastic Process Model methodology. SPM can be used not only for stochastic modeling of probabilities of endpoints but in many other applied areas, e.g., life science applications including biologically based modeling. In this work, the package was validated through simulation studies. The stpm R package can be extended by including: (i) SPM with several health states [29]; (ii) SPM with hidden heterogeneity [28]; (iii) SPM with competing risks [27]; and (iv) SPM for partially observed covariates [26]. development of the package. KA, AIY, IA, ES revised the manuscript and Additional file 1 and gave comments helpful to finalize it. All authors read and approved the final version of the manuscript.