 Software
 Open access
 Published:
stpm: an R package for stochastic process model
BMC Bioinformatics volume 18, Article number: 125 (2017)
Abstract
Background
The Stochastic Process Model (SPM) represents a general framework for modeling the joint evolution of repeatedly measured variables and timetoevent outcomes observed in longitudinal studies, i.e., SPM relates the stochastic dynamics of variables (e.g., physiological or biological measures) with the probabilities of end points (e.g., death or system failure). SPM is applicable for analyses of longitudinal data in many research areas; however, there are no publicly available software tools that implement this methodology.
Results
We developed an R package stpm for the SPMmethodology. The package estimates several versions of SPM currently available in the literature including discrete and continuoustime multidimensional models and a onedimensional model with timedependent parameters. Also, the package provides tools for simulation and projection of individual trajectories and hazard functions.
Conclusion
In this paper, we present the first software implementation of the SPMmethodology by providing an R package stpm, which was verified through extensive simulation and validation studies. Future work includes further improvements of the model. Clinical and academic researchers will benefit from using the presented model and software. The R package stpm is available as open source software from the following links: https://cran.rproject.org/package=stpm(stable version) or https://github.com/izhbannikov/spm(developer version).
Background
A plethora of approaches to the joint analysis of longitudinal and timetoevent (survival) data have been developed the in last few decades, see ([1], Ch. 11) and [2, 3]. For example, joint longitudinalsurvival models analyze the joint behavior of the process describing physiological variables (i.e. “longitudinal” process) and timetoevent (“survival”) process, see [4–7] and R package JM[8], lcmm[9]. These models usually represent the hazard in the form of the Cox proportional hazards model [10] and dynamics of longitudinal variable as a mixedeffects model [11]. Also, random processes (e.g., OrnsteinUhlenbeck or Gaussian processes) can be used in order to handle random fluctuations of individual measurements around the population average [12, 13].
There are also extensions, such as accelerated failure time or additive hazard models [14, 15] (see, e.g. R  package JM), which can be applied if the Cox proportional hazards assumption is violated.
However, an important challenge to consider in the context of bioinformatic studies is integration of biological knowledge and theories with statistical and computational methods and algorithms. One of possible approaches to integrate biological concepts and statistical models is based on the quadratic hazard models (also known alternatively as Stochastic Process Models, SPM) which were first introduced several decades ago [16–18]. Such models were recently modified [19, 20] to incorporate several conceptual mechanisms with clear biological interpretation (such as homeostatic regulation, allostatic adaptation, stress resistance, adaptive capacity and physiological norms) relevant in the context of research on aging. Incorporation of available knowledge about regularities of agingrelated changes developing in the human body into the model structure allows for addressing fundamental problems of aging dealing with agerelated declines in stress resistance and adaptive capacity, changes in resilience and physiological norm, accumulation of allostatic load, etc. [20]. Importantly, these models permit evaluating all these mechanisms indirectly from longitudinal trajectories of biomarkers and data on mortality or onset of diseases when measurements of relevant variables representing the respective biological processes are not available in the data. Thus SPM provides an example of a successful attempt to analytically link biological knowledge about agingrelated processes developing in the human body with changes in mortality risk using a compact and convenient mathematical framework.
The idea of SPM was first described in [16]. The theoretical background of the models with survival functions affected by stochastic covariates was also presented in [17, 21–23]. Later the methodology was extended in several publications, e.g. [19, 24–29]. The SPM links the dynamics of stochastic variables represented by multivariable autoregressive or stochastic differential equations with hazard rates described as quadratic functions of the state variables. The choice of the quadratic hazard function (also known as U or J shaped hazard function) is justified empirically based on many epidemiological observations for various biomarkers (see, e.g., [30–35]). The minimum of a quadratic function (or paraboloid in the multivariable case) is a point (or domain) in the variable state space, which corresponds to the optimal system status (e.g., the “normal” health status) with the minimal value of mortality risk at a specific time (or age). An important component of the SPM is the observational plan which characterizes how dynamic variables affecting mortality risk were measured in a longitudinal study.
We note also that such models have a much broader range of applications in many areas beyond research on aging capitalizing on their strength of using the stochastic dynamics of variables which may better describe the reality in many applications. Many publications demonstrated the high relevance of using the SPM in joint analyses of longitudinal measurements of various variables and timetoevent outcomes. However, until now, there were no publicly available software tools that implement this kind of analysis.
In this paper we present the R package, stpm, the first publicly available set of utilities which implements the SPM methodology in three specific cases covering analyses most frequently used in practice and, therefore, constituting a general framework for studying and modeling survival (censored) traits depending on random trajectories (stochastic paths) of variables.
Implementation
Model forms
There are two forms of the SPM that have been developed recently stemming from the original works by Woodbury, Manton, Yashin, Stallard and colleagues in 1970–1980’s: (i) discretetime stochastic process model, assuming fixed time intervals between subsequent observations, initially developed by Woodbury, Manton et al. [16, 23] and further developed by Akushevich et al. [24]; (ii) and continuoustime model, proposed in Yashin et al. [17, 22] (and later extended in [19]), which can handle arbitrary time intervals. The 2007 version [19] specifies the components of the model tailored to applications in aging research which can still be used in a more general context.
In the R package stpm we implemented the models of type i and ii with timeindependent coefficients, which can handle one or more variables (dimensions). In addition, we implemented a onedimensional case (when one physiological variable/covariate is used) with timedependent coefficients of the model in [19]. Below we briefly describe the types of stochastic process models implemented in stpm.
Discretetime SPM
The model [23, 24] assumes fixed time intervals between consecutive observations. In this model, Y(t) (a k×1 matrix of the values of covariates, where k is the number of covariates considered) and μ(t,Y(t)) (the hazard rate) have the following form:
Coefficients u (a k×1 matrix, where k is a number of covariates), R (a k×k matrix), μ _{0}, b (a 1×k matrix), Q (a k×k matrix) are assumed to be constant in the particular implementation of this model in the R package stpm. ε contains normallydistributed random residuals, a k×1 matrix. The symbol “*” denotes transpose operation. θ is a parameter to be estimated along with other parameters (u, R, μ _{0}, b, Q).
These coefficients are then estimated directly from (i) linear autoregression (Y(t+1)=u+R Y(t)+ε), where Y(t) is empiricallyobserved for those subjects that are alive at time t and Y(t+1) is the value of Y(t) at time t+1; (ii) using a generalized linear model with family Binomial and link Log.
Continuoustime SPM
In the specification of the SPM described in the 2007 paper by Yashin and collegaues [19] the stochastic differential equation describing the age dynamics of a covariate is:
In this equation, Y(t) (a k×1 matrix) is the value of a particular covariate at time (age) t. f _{1}(t) (a k×1 matrix) corresponds to the longterm mean value of the stochastic process Y(t) which describes a trajectory of individual covariates influenced by different factors represented by a random Wiener process W(t). Coefficient a(t) (a k×k matrix) is a negative feedback coefficient which characterizes the rate at which the process reverts to its mean. In the area of research on aging, f _{1}(t) represents the mean allostatic trajectory and a(t) represents the adaptive capacity of the organism. Coefficient b(t) (a k×1 matrix) characterizes the strength of the random disturbances from the Wiener process W(t).
The following function μ(t,Y(t)) represents the hazard rate:
here μ _{0}(t) is the baseline hazard, which corresponds to the risk when Y(t) follows its optimal trajectory; f(t) (a k×1 matrix) represents the optimal trajectory that minimizes the risk and Q(t) (a k×k matrix) models the sensitivity of the risk function to deviations from the norm.
In general, model coefficients a(t), f _{1}(t), Q(t), f(t), b(t) and μ _{0}(t) are time(age)dependent. For example, the coefficient a can be assumed as (i) 0.05 (a constant, timeindependent) or (ii) a(t)=a _{0}+b _{0} t (timedependent), in which a _{0} and b _{0} are unknown parameters to be estimated. The model can handle, in theory, any number of covariates.
In the implementation of the continuoustime SPM provided by the R package stpm, coefficients a, f _{1}, f, b, μ _{0}, Q are assumed to be timeindependent. However, μ _{0} and Q from (3) can be multiplied by e ^{θt} (by user’s choice) and therefore are timedependent: μ _{0}(t)=μ _{0} e ^{θt}, Q(t)=Qe ^{θt}. If not, they are assumed to be timeindependent along with the other coefficients. Then the maximum likelihood method is used to estimate parameters a, f _{1}, Q, f, b, μ _{0},θ as described further.
Parameter estimation procedure
The parameter estimation procedure can be found, e.g., in [19] and here we briefly summarize it. As shown in [19], the likelihood function is:
Here L is the likelihood; i denotes individual, j denotes observation for respective variable. In Eq. (5), (6), (7) we suppressed i and j for brevity. \(\bar {\mu }(u)\) is the marginal hazard, presented in the survival function associated with the lifespan distribution \(P(T > t) = exp(\int _{0}^{t}{\bar {\mu }^{i}(u)du})\); m(0) and γ(0) are the mean and the variance/covariance matrix of the normal distribution of initial vector Y _{0}=Y(t=t _{0}) and the mean and the variance/covariance matrix of this distribution at age t are given by m(t) and γ(t), respectively; Tr denotes the trace of a matrix; \(y^{i}_{t^{i}_{0}}\), \(y^{i}_{t^{i}_{1}}\),..., \(y^{i}_{t^{i}_{n_{i}}}\) denote the measurements of the process Y(t); τ _{ i } is the lifespan (or age at censoring); δ _{ i } is a censoring indicator, m ^{i}(t) and γ ^{i}(t) satisfy (6),(7) at the intervals \([t^{i}_{0}, t^{i}_{1})\), \([t^{i}_{1}, t^{i}_{2})\),..., \([t^{i}_{n_{i}  1}, t^{i}_{n_{i}})\), \([t^{i}_{n_{i}}, \tau _{i})\) with the initial conditions \(y^{i}_{t^{i}_{0}}\), \(y^{i}_{t^{i}_{1}}\),..., \(y^{i}_{t^{i}_{n_{i}}}\); \(m^{i}\left (t^{i}_{j^{}}\right) = {\lim }_{t \uparrow t^{i}_{j}}m^{i}(t)\), and \(\gamma ^{i}(t^{i}_{j^{}}) = {\lim }_{t \uparrow t^{i}_{j}}\gamma ^{i}(t)\).
We use available optimization methods from package nloptr to estimate the parameters of this model. By default we use the Nelder–Mead method [36].
The coefficient conversion between continuous and discretetime models is as follows (“c” and “d” denote continuous and discretetime models respectively; note: these equations can be used if the intervals between consecutive observations of discrete and continuoustime models are equal; it is also required that matrices a _{ c } and Q _{ c,d } be fullrank matrices):
where k is the number of covariates, which is equal to the model’s dimension and “*” denotes transpose operation; Σ is a k×1 matrix which contains the s.d.s of the corresponding residuals (residuals of a linear regression Y(t+1)=u+R Y(t)+ε; s.d. is a standard deviation), I(k) is an identity k×k matrix.
Model with timedependent coefficients
The two types of models described above assumes timeindependent coefficients, i.e. coefficients are constant and onedimensional through the lifetime. We also implemented a model in which the coefficients are timedependent functions.
Description of the R package stpm
The general workflow of parameter estimation in the stpm R package consists of (i) data preparation and (ii) model parameter estimation. A user can potentially avoid the data preparation stage but should maintain an appropriate data format as described below and in the package user manual. The package is available as open source software from the following link: https://cran.rproject.org/package=stpm (stable version) or https://github.com/izhbannikov/spm(developer version).
Input data
The input data consists of longitudinal followup data that needs to be presented in the form of a dataset in commaseparated or SAS formats. The dataset is a longitudinal data file in a long format (i.e. each record represents a single observation for a subject, therefore there are multiple rows per individual). An example is presented in Table 1.
Data preparation
At the data preparation stage, the longitudinal dataset is preprocessed with the following command:
Here longdat.csv is a path to the longitudinal dataset (i.e., can be csv or SAS data file). Names of specific covariates can be explicitly mentioned:
In this case we mentioned two covariates: DBP, which is diastolic blood pressure, and BMI  body mass index. Therefore, only these covariates will be “prepared” for downstream analysis. By default the first three columns of data give individual id, censoring status, times of measurements, and the values of measured covariates are provided in the rest (see Tables 1, 2 and 3).
The output of prepare_data(...) function includes a list of two datasets for modeling data with arbitrary or fixed intervals. A dataset with fixed intervals is used in the package function spm_discrete(...) which implements discretetime model; a dataset with arbitrary time intervals is used in the package function spm_continuous(...) for continuoustime model (theoretically, there might be missing values in this data set and the algorithm can impute them). Linear interpolation is used for the former case to provide values of covariates between predetermined (empiricallyobserved) time points. Tables 2 and 3 are examples of such data sets. Those tables contain no missing values.
Model parameter estimation
At the parameter estimation stage, the stpm Rpackage offers three SPM specifications: (a) discretetime, multidimensional SPM [23, 24]; (b) continuoustime, multidimensional SPM [19]; (c) continuoustime, onedimensional SPM with timedependent userdefined coefficients [25]. The package’s central function spm(...) is used to estimate parameters from the model with different specifications and can be executed with the following command:
In this command: d.prep is a dataset (preprocessed data from function prepare_data(...)); model is a model type, the choices are: “discrete”, “continuous”, and “timedependent.” For discrete and continuous model types, the output is a list with two subsets (parameters of these subsets are unambiguously related): (i) a set of estimated parameters [u, R, b, Sigma, Q, mu0, theta], see Eqs. (1); (ii) a set of estimated parameters [a, f1, Q, f, b, mu0, theta], see Eqs. (2, 3).
Output for SPM with timedependent parameters gives estimates for parameters provided in formulas, which is a list of formulas that define the timedependent parameters. If some parameter’s formulas were not explicitly indicated by a user in formulas then their defaults will be used and estimates will be given. The corresponding Rfunction to call this type of model is:
In this case the parameter formulas redefines a. The model parameters not mentioned in the list formulas are constants (default). Initial values of parameters in formulas remaining for t=0 are estimated from the discretetime model and initial values of parameters that define time dependence (e.g., a1 in the above example) are set to zero.
In the toy example below we summarize the data preparation and parameter estimation stages in a typical workflow. Datasets stored in longdat.csv are simulated data of two covariates (diastolic blood pressure, DBP, and bodymass index, BMI) estimated for 100 subjects. After this example we provide descriptions of the results.
p.discr.model, p.cont.model contain parameters estimated for discretetime and continuoustime models. p.td.model contains parameters estimated for the SPM with timedependent coefficients.
Projection and simulation studies
The R package stpm also allows projection and data simulation with previously estimated or userdefined parameters. Projections are constructed for a cohort with normally distributed initial covariates. The results of the projections are (i) a dataset with individual projected values and (ii) a dataset with survival probabilities and agespecific means of state variables (covariates). An example of projection is:
The model.par here is a list of estimated model parameters from spm(...) function, 5000 is the number of individuals to simulate, 80 is the mean value of a covariate (in this case we have onedimensional simulation). We present an example of simulation of 5,000 individuals: a data table and survival probabilities.
Here we first set the model parameters:
Then we call a simulation function spm_projection(...) in order to simulate data (we specify a starting age of 30 (tstart=30)). We also can see mean values of covariates for each age group (with a command data$stat$mean.by.age) and plot survival curves (see Fig. 1).
Simulation strategies
Simulation is needed for verification of the estimation procedure. Below we describe simulation strategies implemented in the R package stpm. All three models described above were verified through simulation studies.
To begin, a cohort of individuals at an initial time t _{0} is constructed. We construct individual trajectories as the solution of Eqs. (1), (2) for discrete and continuoustime models using initial values of covariates, and random stopping (death) times. The initial values of covariates for all individuals in the cohort are simulated through sampling from the Gaussian distribution: \(\mathbf {Y}(t=t_0) \sim N(f_{1}(t=t_{0}), {\sigma }_{0}^2)\), where f _{1}(t=t _{0}) is a value of function f _{1} at starting time t _{0} (both userdefined) and σ _{0} a standard deviation, userdefined (by default σ _{0}=1 for any covariate).
Once we have the initial distribution of values of covariates for individuals, we then model trajectories in the multidimensional state space as follows:

First, the conditional probability of survival for each individual is computed using the mortality rate μ(t,Y(t)) for the interval (t, t + Δ t): \(S(t\mathbf {Y}(t)) = e^{\int _{t}^{t+\Delta t}\mu (s, \mathbf {Y}(s)) ds}\) (for continuoustime model) and S(tY(t))=e ^{−μ(t,Y(t))Δt} for discretetime model.

Each individual in the simulated cohort is deemed to survive or not, according to the probability S(tY(t)). To do that, a uniformly distributed random number r from the interval [0, 1] is generated. If r>S(tY(t)), the individual is assumed to have died, and the simulation of the corresponding individual trajectory stops at time t+Δ t (the time of death).

Next, the covariate Y(t+Δ t) for a surviving individual is modeled using Eqs. (1) for discretetime model or (2) for continuoustime model. The next observation time is modeled by adding Δ t, which is fixed for discretetime model and arbitrary (Δ t=step+unif(−0.1step,0.1step) where parameter step is fixed and userdefined; by default step=1) for continuoustime model, to the current time t.

If the age of a particular individual exceeded a maximum age tmax (userdefined, 105 by default), the individual is censored and trajectory simulation is stopped at time t+Δ t (a time of censoring). We also provided the possibility of censoring after achievement of n observations for a particular individual.
The whole process is repeated until all individuals have died or are censored.
Validation
We conducted simulations of 100 followup datasets with discrete intervals (1 year) between the observations, with 5,000 of subjects in each dataset separately for one and two covariates. Separately, we simulated another set of 100 followup datasets with arbitrary intervals between observation (for continuoustime model, for one and two covariates). Trajectory projections were performed according to the methodologies described above. Finally, we performed simulation of 100 followup datasets for the model with timedependent parameters and one covariate. For this model we set the parameter f _{1}=f _{1a }+f _{1b } t; other parameters were left as constants. Then we estimated all the parameters for discrete, continuoustime and the model with timedependent coefficients, for one and two covariates. The results are described in Discussion.
Case study: application to the Framingham Heart Study Data
Biological reasoning for the components of SPM [19]
In this case study we illustrate the application of SPM [19] in the context of biological questions in studies of aging. As we noted, one of the challenges in the context of bioinformatic studies is to incorporate biological concepts into statistical models. Understandably, representing biological mechanisms relevant to functioning of such complicated systems as the human organism in the framework of a mathematical or statistical model is a tremendous task. Nevetheless, one can try to represent in the model some basic components of the system under study. The SPM (its 2007 version, [19]) represents such an attempt to incorporate basic concepts in the field of research on aging in the framework of mathematical equations.
The first equation of the SPM (see eq. 2) represents the stochastic dynamics of biomarkers. The stochastic component of the model is an important part of the aging process [37], therefore, it is natural to use stochastic processes in the models of aging. One type of process which is relevant for describing biological processes in a living organism is the socalled meanreverting stochastic process [38]. Such a process has a tendency to move to its equilibrium state (also called a longterm mean) and it can represent homeostatic regulation in the structure of the model (which is a critical property of the living organism). In reality, organisms function in a nonoptimal environment, therefore the regulatory systems push it to a different suboptimal state, which is known as the allostatic state [39]. Representation of the mean allostatic state in the SPM is another important illustration of inclusion of biological reasoning in statistical models for research on aging. The statistical concept of a negative feedback coefficient a(t) provides one more way to include biological concepts in the model. The coefficient a(t) controls how quickly the physiological trajectory reverts to its average and modulates the adaptive response rate of an organism to the stress factors. Such factors impact the biomarker trajectories so they deviate from their normal (optimal) states. For example, one research question could be to look at the age dynamics of adaptive capacity. The phenomenon of worsening adaptive capacity with age implies that more time is required for the values of biomarkers to return to the average allostatic state for older people in comparison to the time needed for younger people.
The second equation of the SPM describes the hazard rate μ(t,Y(t)) (i.e., mortality/incidence rate) as a function of the stochastic covariates (see Eq. 3) [16, 21, 22]. The SPM represents the hazard as a quadtratic form: (Y(t)−f(t))^{∗} Q(t)(Y(t)−f(t)); hence it is also called the quadratic hazard model. Such a hazard form is a convenient and useful choice with acceptable statistical properties [17, 22] based on evidence that it is a quadratic function (J or Ushape) of different covariates, see, e.g., [31–33, 35]. The hazard rate used in the model is also a function of time (age) and it also includes a baseline hazard μ _{0} which can be also time (age)dependent (for example, Gompertz).
The parameter Q(t) controls how wide the Ushape (or a Jshape) is and can be formulated in terms of stress resistance [20, 40] or “vulnerability” [41]. As discussed in these works, robustness or vulnerability are characterized by the width of the Ushape, and, therefore, if the Ushape shrinks, the organism becomes more and more susceptible to deviations of biomarkers from their “normal” states. Such decreases of stress resistance can be indirectly captured from longitudinal data by the SPM.
SPM estimates physiological or biological norms of biomarker values which correspond to minimal hazard rates at some particular time (age) [42]: f(t). This is estimated explicitly since the quadratic term contains the difference between the biomarker value and some function denoting the normal (optimal) state: if a biomarker value Y equals the function f then the quadratic part is nullified. Any other values of Y not equal to f result in larger hazard rates. The difference Y−f also indicates that it was impossible for the organism to return to the optimal state and, therefore, the organism is deregulated.
Application to blood glucose
Blood glucose (BG) has a tendency to increase with age and therefore to significantly differentiate from the normal level of BG determined among young adults. This can potentially contribute to increasing risks of death with age.
To study effects of BG on respective risks, researchers usually apply the Cox proportional hazards model. This gives one the estimates of coefficients β from which one can calculate the respective hazard ratios. Hazard ratios tell nothing about hidden and biologically interpretable components of aging processes, such as allostatic load, mean allostatic trajectory, stress resistance, adaptive capacity, and physiological norm. To see the effects of these components on mortality, which can not be captured by the Cox model, we performed analyses of repeated measurements of BG using SPM. This allows splitting the negative effects of external forces from the normal deterioration arising from the senescence process.
In this case study, we show that the level of BG which corresponds to the lowest mortality risk has a tendency to increase with age. The agerelated changes in mortalityrisk shape indicate the respective declines in stressresistance which influence the level of BG. The case study results indicate that analyzing timetoevent data with SPM can substantially improve our knowledge of various factors and mechanisms which have an effect on agingrelated changes in human organisms.
Data description
The Framingham Heart Study (FHS) Original Cohort was established in 1948 and has continued to the present [43]. In this study, we used the FHS data provided by the National Heart, Lung, and Blood Institute’s (NHLBI) Biologic Specimen and Data Repositories Information Coordinating Center (BioLINCC) resource (https://biolincc.nhlbi.nih.gov/home/). Version 2014a was used in the analyses. The dataset of N = 5,079 individuals (2,785 females, 2,294 males; almost all subjects are White/Caucasians). The minimum individual age is 28 years and the maximum is 104 years; the average age is 60.18 years. The average observational time was 12 years (6 exams) and the average time between consecutive observations was 2 years. Missing BG observations were removed from the analysis. A histogram of BG levels is given in Additional file 1: Figure S6.
Analysis methodology
We analyzed the data with R package stpm using a onedimensional continuoustime model with timedependent coefficients. Model parameters were in the form of linear functions and are presented below:
Therefore, parameters a _{ y }, b _{ y }, \(a_{f_1}\), \(b_{f_1}\), a _{ q }, b _{ q }, a _{ f }, b _{ f }, b, μ _{0}, θ were estimated.
Results
The plots of a(t), f _{1}(t), f(t), Q(t) and μ(t) are presented in Fig. 2. Numerical values of parameter estimates along with their statistical characteristics such as s.d. and confidence intervals are presented in Additional file 1: Table S3.
From Fig. 2 we can see that the value of BG changes with age as shown in [42]. The function f _{1} shows that the organism is not usually functioning in normal environment and therefore the trajectory of BG does not revert to the “norm” but rather to a different function. The increase with age in Q(t) indicates that the same deviation from the “norm” at older ages results in a larger increase in mortality risk. This means that the organism is more vulnerable to deviations of the level of BG from the normal value. We also see that the normal value is agedependent. This indicates that the this optimal level of BG for younger individuals can actually increase the risk of death at older ages. Also, the agedependence in a(t) shows that it takes more time for the trajectory of BG to go back to the allostatically prescribed value at older ages than it takes at younger ages. This means that the adaptive capacity of the organism (as related to adaptation to deviation of BG levels) declines with age.
As we show in this example, SPM [19] can estimate different agingrelated components which eventually affect mortality though the longitudinal dynamics of physiological variables (such as BG). This provides a way to get additional insights into the processes of aging and serves as a background for further investigations using, for example, genetic analyses [26].
Results
Tables 4, 5 and 6 show simulation results for one and twodimensional discretetime SPM models with timeindependent coefficients. Results for continuoustime SPM for both one and two dimensions are provided in Additional file 1: Tables S1 and S2. All of the results show concordance with the parameter values used in simulation.
We also provide histograms of estimated parameters for all model types. Figures S1S5 from Additional file 1 show histograms of estimated parameter for one and twodimentional discretetime models and the model with timedependent coefficients.
Discussion
The Stochastic Process Model allows researchers to utilize the full potential of longitudinal data by evaluating dynamic mechanisms of changing physiological variables with time (age), allowing the study of differences, for example, in genotypespecific hazards. Applying the Stochastic Process Model to analysis of longitudinal data can uncover influences of hidden components (adaptive capacity, allostatic load, resistance to stresses, physiological norm) of agingrelated changes, which play important roles in agingrelated processes but cannot be measured directly with common statistical methods. This provides researchers with a new way of analyzing longitudinal data.
The specific form of the hazard of risk function should be taken into account where conducting analyses of longitudinal data using SPM. In our approach, we assume that the hazard rate (incidence rate related to changing physiological variable with age) has a U or J shape, which is biologically justified by empirical observations. In reality, the true form of this function is not known and, since it is impossible to estimate the true form from the data, an incorrectly assumed hazard may introduce additional bias. Additional investigation is needed in order to evaluate the effects of different forms of hazard functions on results.
Conclusion
We presented stpm  an R package that implements the Stochastic Process Model methodology. SPM can be used not only for stochastic modeling of probabilities of endpoints but in many other applied areas, e.g., life science applications including biologically based modeling. In this work, the package was validated through simulation studies. The stpm R package can be extended by including: (i) SPM with several health states [29]; (ii) SPM with hidden heterogeneity [28]; (iii) SPM with competing risks [27]; and (iv) SPM for partially observed covariates [26].
Availability and requirements
Project name: stpm Project home page: https://github.com/izhbannikov/spm/ Operating systems: Platform independent Programming language: R Other requirements: R 3.2.2 or higher + Rcpp, RcppArmadillo, mice, sas7bdat, stats, nloptr, survival, tools, knitr packages Licence: GPL licence
Abbreviations
 BG:

Blood glucose
 BMI:

Body mass index
 DBP:

Diastolic blood pressure
 JM:

Joint model
 SPM:

Stochastic process model
References
Yashin AI, Stallard E, Land KC. Biodemography of Aging. Netherlands: Springer; 2016.
Tsiatis AA, Davidian M. Joint modeling of longitudinal and timetoevent data: an overview. Statistica Sinica. 2004; 14:809–34.
Arbeev KG, Akushevich I, Kulminski AM, Ukraintseva SV, Yashin AI. Joint analyses of longitudinal and timetoevent data in research on aging: Implications for predicting health and survival. Front Public Health. 2014; 2:228.
Rizopoulos D. Joint Models for Longitudinal and Timetoevent Data : with Applications in R. New York: CRC Press; 2012.
McCrink LM, Marshall AH, Cairns KJ. Advances in joint modelling: A review of recent developments with application to the survival of end stage renal disease patients. Int Stat Rev. 2013; 81(2):249–69. doi:10.1111/insr.12018.
ProustLima C, Taylor JMG. Development and validation of a dynamic prognostic tool for prostate cancer recurrence using repeated measures of posttreatment psa: a joint modeling approach. Biostatistics. 2009; 10(3):535–49. doi:10.1093/biostatistics/kxp009.
Lawrence Gould A, Boye ME, Crowther MJ, Ibrahim JG, Quartey G, Micallef S, Bois FY. Joint modeling of survival and longitudinal nonsurvival data: current methods and issues. Stat Med. 2015; 34(14):2181–95. doi:10.1002/sim.6141.
Rizopoulos D. JM: An R package for the joint modelling of longitudinal and timetoevent data. J Stat Softw. 2010; 35(9):1–33.
ProustLima C, Philipps V, Diakite A, Liquet B. Lcmm: Extended Mixed Models Using Latent Classes and Latent Processes. 2016. R package version: 1.7.6. https://CRAN.Rproject.org/package=lcmm. Accessed 10 Feb 2017.
Cox DR. Regression models and lifetables. J R Stat Soc Series B (Methodological). 1972; 34(2):187–220.
Laird NM, Ware JH. Randomeffects models for longitudinal data. Biometrics. 1982; 38(4):963–74.
Wang Y, Taylor JMG. Jointly modeling longitudinal and event time data with application to acquired immunodeficiency syndrome. J Am Stat Assoc. 2001; 96(455):895–905.
Henderson R, Diggle P, Dobson A. Joint modelling of longitudinal measurements and event time data. Biostatistics. 2000; 1(4):465–80. doi:10.1093/biostatistics/1.4.465. http://biostatistics.oxfordjournals.org/content/1/4/465.full.pdf+html.
Tseng YK, Hsieh F, Wang JL. Joint modelling of accelerated failure time and longitudinal data. Biometrika. 2005; 92(3):587–603. doi:10.1093/biomet/92.3.587.
Song X, Huang Y. A corrected pseudoscore approach for additive hazards model with longitudinal covariates measured with error. Lifetime Data Anal. 2006; 12(1):97–110. doi:10.1007/s1098500572227.
Woodbury MA, Manton KG. A randomwalk model of human mortality and aging. Theor Popul Biol. 1977; 11(1):37–48. doi:10.1016/00405809(77)900053.
Yashin AI, Manton KG, Vaupel JW. Mortality and aging in a heterogeneous population: A stochastic process model with observed and unobserved variables. Theor Popul Biol. 1985; 27(2):154–75. doi:10.1016/00405809(85)900085.
Yashin AI. Dynamics in survival analysis: conditional Gaussian property versus CameronMartin formula In: Krylov NV, Lipster RS, Novikov AA, editors. Statistics and Control of Stochastic Processes. New York: Springer: 1985. p. 466–75.
Yashin AI, Arbeev KG, Akushevich I, Kulminski A, Akushevich L, Ukraintseva SV. Stochastic model for analysis of longitudinal data on aging and mortality. Math Biosci. 2007; 208(2):538–51. doi:10.1016/j.mbs.2006.11.006.
Yashin AI, Arbeev KG, Akushevich I, Kulminski A, Ukraintseva SV, Stallard E, Land KC. The quadratic hazard model for analyzing longitudinal data on aging, health, and the life span. Phys Life Rev. 2012; 9(2):177–88. doi:10.1016/j.plrev.2012.05.002.
Myers LE. Survival functions induced by stochastic covariate processes. J Appl Probab. 1981; 18(2):523–9.
Yashin AI, Manton KG, Stallard E. The propagation of uncertainty in human mortality processes operating in stochastic environments. Theor Popul Biol. 1989; 35(2):119–41. doi:10.1016/00405809(89)900130.
Manton KG, Stallard E, Singer B. Population forecasting projecting the future size and health status of the us elderly population. Int J Forecasting. 1992; 8(3):433–58. doi:10.1016/01692070(92)90057G.
Akushevich I, Kulminski A, Manton KG. Life tables with covariates: Dynamic model for nonlinear analysis of longitudinal data. Math Popul Stud. 2005; 12(2):51–80. doi:10.1080/08898480590932296.
Yashin AI, Arbeev KG, Kulminski A, Akushevich I, Akushevich L, Ukraintseva SV. Health decline, aging and mortality: how are they related?. Biogerontology. 2007; 8(3):291–302. doi:10.1007/s1052200690733.
Arbeev KG, Akushevich I, Kulminski AM, Arbeeva LS, Akushevich L, Ukraintseva SV, Culminskaya IV, Yashin AI. Genetic model for longitudinal studies of aging, health, and longevity and its potential application to incomplete data. J Theor Biol. 2009; 258(1):103–11. doi:10.1016/j.jtbi.2009.01.023.
Yashin AI, Manton KG, Stallard E. Dependent competing risks: a stochastic process model. J Math Biol. 1986; 24(2):119–40. doi:10.1007/BF00275995.
Yashin AI, Arbeev KG, Akushevich I, Kulminski A, Akushevich L, Ukraintseva SV. Model of hidden heterogeneity in longitudinal data. Theor Popul Biol. 2008; 73(1):1–10. doi:10.1016/j.tpb.2007.09.001.
Yashin AI, Akushevich I, Arbeev KG, Kulminski A, Ukraintseva S. Joint analysis of health histories, physiological state, and survival. Math Popul Stud. 2011; 18(4):207–33. doi:10.1080/08898480.2011.614486.
Witteman JCM, Grobbee DE, Valkenburg HA, Stijnen T, Burger H, Hofman A, van Hemert AM. Jshaped relation between change in diastolic blood pressure and progression of aortic atherosclerosis. The Lancet. 1994; 343(8896):504–7. doi:10.1016/S01406736(94)914591. Originally published as Volume 1, Issue 8896.
Allison DB, Faith MS, Heo M, Kotler DP. Hypothesis concerning the ushaped relation between body mass index and mortality. Am J Epidemiol. 1997; 146(4):339–49.
Boutitie F, Gueyffier F, Pocock S, Fagard R, Boissel JP. Jshaped relationship between blood pressure and mortality in hypertensive patients: New insights from a metaanalysis of individualpatient data. Ann Int Med. 2002; 136(6):438–48. doi:10.7326/00034819136620020319000007.
Kuzuya M, Enoki H, Iwata M, Hasegawa J, Hirakawa Y. Jshaped relationship between resting pulse rate and allcause mortality in communitydwelling older people with disabilities. J Am Geriatrics Soc. 2008; 56(2):367–8. doi:10.1111/j.15325415.2007.01512.x.
Mazza A, Zamboni S, Rizzato E, Pessina AC, Tikhonoff V, Schiavon L, Casiglia E. Serum uric acid shows a jshaped trend with coronary mortality in noninsulindependent diabetic elderly people. the cardiovascular study in the elderly (castel). Acta Diabetologica. 2007; 44(3):99–105. doi:10.1007/s0059200702493.
Okumiya K, Matsubayashi K, Wada T, Fujisawa M, Osaki Y, Doi Y, Yasuda N, Ozawa T. A ushaped association between home systolic blood pressure and fouryear mortality in communitydwelling older men. J Am Geriatrics Soc. 1999; 47(12):1415–21. doi:10.1111/j.15325415.1999.tb01559.x.
Nelder JA, Mead R. A simplex method for function minimization. Comput J. 1965; 7(4):308–13. doi:10.1093/comjnl/7.4.308.
Finch CE, Kirkwood T. Chance, Development, and Aging. New York, NY, and Oxford, UK: Oxford University Press; 2000, p. 278.
Uhlenbeck GE, Ornstein LS. On the theory of the brownian motion. Phys Rev. 1930; 36:823–41. doi:10.1103/PhysRev.36.823.
McEwen BS, Wingfield JC. The concept of allostasis in biology and biomedicine. Hormones Behav. 2003; 43(1):2–15. doi:10.1016/S0018506X(02)000247.
Yashin AI, Arbeev KG, Arbeeva LS, Wu D, Akushevich I, Kovtun M, Yashkin A, Kulminski A, Culminskaya I, Stallard E, Li M, Ukraintseva SV. How the effects of aging and stresses of life are integrated in mortality rates: insights for genetic studies of human health and longevity. Biogerontology. 2015; 17(1):89–107. doi:10.1007/s1052201595948.
Arbeev KG, Ukraintseva SV, Akushevich I, Kulminski AM, Arbeeva LS, Akushevich L, Culminskaya IV, Yashin AI. Age trajectories of physiological indices in relation to healthy life course. Mech Ageing Dev. 2011; 132(3):93–102. doi:10.1016/j.mad.2011.01.001.
Yashin AI, Ukraintseva SV, Arbeev KG, Akushevich I, Arbeeva LS, Kulminski AM. Maintaining physiological state for exceptional survival: What is the normal level of blood glucose and does it change with age?. Mech Ageing Dev. 2009; 130(9):611–8. doi:10.1016/j.mad.2009.07.004.
Mahmood SS, Daniel L, Vasan RS, Wang TJ. The framingham heart study and the epidemiology of cardiovascular disease: a historical perspective. The Lancet. 2013; 383(9921):999–1008. doi:10.1016/S01406736(13)617523.
Acknowledgments
The authors would like to thank the associate editors of this manuscript and three anonymous referees. Their comments and suggestions greatly improved the manuscript.
Funding
This work was supported by the National Institute on Aging of the National Institutes of Health (NIA/NIH) under Award Numbers P01AG043352, R01AG046860, and P30AG034424. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIA/NIH. This manuscript was prepared using FRAMCOHORT Research Materials obtained from the NHLBI Biologic Specimen and Data Repository Information Coordinating Center and does not necessarily reflect the opinions or views of the FRAMCOHORT or the NHLBI.
Availability of data and materials
Source code and examples: https://github.com/izhbannikov/spm/Simulation test scripts and data files are available at https://doi.org/10.5281/zenodo.267184.
Authors’ contributions
IYZ developed the package, performed evaluation/validation tests and wrote the manuscript and Additional file 1. KA, AIY, IA and ES contributed to the development of the package. KA, AIY, IA, ES revised the manuscript and Additional file 1 and gave comments helpful to finalize it. All authors read and approved the final version of the manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
This study was approved by Duke University Institutional Review Board (Protocol ID B0309). The study involved only secondary analysis of already existing publicly available data.
Author information
Authors and Affiliations
Corresponding author
Additional file
Additional file 1
Supplementary materials. Table S1 Results of simulation studies for onedimensional continuoustime model (5,000 subjects, 100 replications); Est.mean: estimated mean, SD: standard deviation, LW, UP: lower and upper boundaries of empirical confidence interval (95th percentile) of estimated coefficients. Table S2 Results of simulation studies for twodimensional continuoustime simulation (Var1 and Var2, 5,000 individuals, 100 replications); Est.mean: estimated mean; SD: standard deviation; LW, UP: lower and upper boundaries of empirical confidence interval (95th percentile) of estimated coefficients. Figure S1 Histograms of estimated parameters of onedimensional discretetime model. Vertical red lines show the estimated means. Blue vertical lines indicate true parameters. Figure S2 Histograms of estimated parameters of onedimensional continuoustime model. Vertical red lines show the estimated means. Blue vertical lines indicate true parameters. Figure S3 Histograms of estimated parameters of onedimensional continuoustime model with timedependent parameter f1 = f1a + f1bt; other parameters remained constant. Blue vertical lines indicate true parameters. Red vertical lines indicate estimated mean values of estimated parameters. Figure S4 Histograms of estimated parameters for discretetime twodimensional model. Blue vertical lines indicate true parameters, red lines indicate estimated parameters. Figure S5 Histograms of estimated parameters for continuous twodimensional model. Blue vertical lines indicate true parameters, red lines indicate estimated parameters. Table S3 Results of analysis Framingham Heart Study Data, Variable: blood glucose (BG); Est.mean: estimated mean; SD: standard deviation; LW, UP: lower and upper boundaries of empirical confidence interval (95th percentile) of estimated coefficients. There were 30 runs with different starting values of the model parameters. Figure S6 Histograms of Blood Glucose (BG) level extracted from FHS original cohort. (760 KB DOCX)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhbannikov, I.Y., Arbeev, K., Akushevich, I. et al. stpm: an R package for stochastic process model. BMC Bioinformatics 18, 125 (2017). https://doi.org/10.1186/s1285901715387
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901715387