Efficient estimation of grouped survival models

Background: Time- and dose-to-event phenotypes used in basic science and translational studies are commonly measured imprecisely or incompletely due to limitations of the experimental design or data collection scheme. For example, drug-induced toxicities are not reported by the actual time or dose triggering the event, but rather are inferred from the cycle or dose to which the event is attributed. This exemplifies a prevalent type of imprecise measurement called grouped failure time, where times or doses are restricted to discrete increments. Failure to appropriately account for the grouped nature of the data, when present, may lead to biased analyses.

Results: We present groupedSurv, an R package which implements a statistically rigorous and computationally efficient approach for conducting genome-wide analyses based on grouped failure time phenotypes. Our approach accommodates adjustment for baseline covariates and analysis at the variant or gene level. We illustrate the statistical properties of the approach and the computational performance of the package by simulation. We present the results of a reanalysis of a published genome-wide study to identify common germline variants associated with the risk of taxane-induced peripheral neuropathy in breast cancer patients.

Conclusions: groupedSurv enables fast and rigorous genome-wide analysis on the basis of grouped failure time phenotypes at the variant, gene or pathway level. The package is freely available under a public license through the Comprehensive R Archive Network.

Electronic supplementary material: The online version of this article (10.1186/s12859-019-2899-x) contains supplementary material, which is available to authorized users.

Now consider a pre-specified partition of [0, ∞), 0 = t_0 < t_1 < t_2 < · · · < t_r = ∞, where r is a positive integer. For subject i, T_i is either observed to belong to the interval [t_{k_i−1}, t_{k_i}) or is censored at the beginning of [t_{k_i−1}, t_{k_i}), where k_i ∈ {1, 2, . . . , r}. If T_i is observed to belong to such a finite interval, we define δ_i = 1, and otherwise define δ_i = 0. We denote the observed data vector as y_i = (k_i, δ_i, x_i, z_i). An individual subject's contribution to the grouped survival likelihood function, based on equation (4) in [1], is

L_i(y_i, β, η) = [1 − α_{k_i}^{exp(x_i′β + z_i′θ)}]^{δ_i} ∏_{j=1}^{k_i−1} α_j^{exp(x_i′β + z_i′θ)},

where α_j = exp(−∫_{t_{j−1}}^{t_j} λ(u) du) and η = (γ, θ) is the vector of nuisance parameters. The full likelihood is the product of the L_i over all study subjects. The log-likelihood function can then be written as a sum over all subjects,

l(y, β, η) = Σ_i { δ_i log[1 − exp(−exp(γ_{k_i} + x_i′β + z_i′θ))] − Σ_{j=1}^{k_i−1} exp(γ_j + x_i′β + z_i′θ) },

where, following equation (5) in [1], we substitute γ_j = log(−log(α_j)) to remove the range restrictions on the parameters.
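As a concrete illustration, a single subject's log-likelihood contribution under the γ_j = log(−log α_j) parameterization can be sketched as follows. This is a minimal Python sketch of the formula above, not the package's R implementation; the function name log_lik_i is ours.

```python
import math

def log_lik_i(k, delta, x, z, beta, gamma, theta):
    """Log-likelihood contribution of one subject under the grouped
    survival model, using gamma_j = log(-log(alpha_j)).
    k: 1-based index of the subject's interval; delta: event indicator;
    gamma: list of r interval parameters; z, theta: covariate lists."""
    lin = x * beta + sum(zj * tj for zj, tj in zip(z, theta))
    # surviving intervals 1..k-1 contributes -exp(gamma_j + linear predictor) each
    ll = -sum(math.exp(gamma[j] + lin) for j in range(k - 1))
    if delta == 1:
        # failing in interval k contributes log(1 - exp(-exp(gamma_k + lin)))
        ll += math.log1p(-math.exp(-math.exp(gamma[k - 1] + lin)))
    return ll
```

For a censored subject (δ = 0) only the survival terms remain, which matches taking the product of the α_j^{exp(x′β + z′θ)} factors on the log scale.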
Henceforth we consider the simple example in which the covariate vector of interest reduces to a single explanatory variable, x_i. The nuisance score vector is then ∂l(y_i, β, η)/∂η = (∂l/∂γ_1, . . . , ∂l/∂γ_r, ∂l/∂θ)′.

Score functions
For readability we use the substitution

c(y_i, β, η) = δ_i exp(γ_{k_i} + x_i β + z_i′θ) / [exp(exp(γ_{k_i} + x_i β + z_i′θ)) − 1],

so that the score function for β takes the form

∂l(y_i, β, η)/∂β = x_i [ c(y_i, β, η) − Σ_{j=1}^{k_i−1} exp(γ_j + x_i β + z_i′θ) ],

with analogous expressions for the components of ∂l(y_i, β, η)/∂η. The efficient score statistic for testing H_0 : β = 0 can be calculated [2] by replacing the score for β with its component orthogonal to the nuisance scores,

∂l(y_i, β, η)/∂β − I_{βη} I_{ηη}^{−1} ∂l(y_i, β, η)/∂η,

evaluated at β = 0 and the restricted maximum likelihood estimate of η.

Estimation of variance
Since there is no closed form for the expected values of I_{ββ,i}(β, η), I_{βη,i}(β, η), and I_{ηη,i}(β, η), we must estimate the variance of the score statistic. Define the averaged observed information terms

Ī_{ββ}(β, η) = (1/n) Σ_i I_{ββ,i}(β, η), Ī_{βη}(β, η) = (1/n) Σ_i I_{βη,i}(β, η), Ī_{ηη}(β, η) = (1/n) Σ_i I_{ηη,i}(β, η),

and note that as n → ∞ these converge to their expected-value counterparts. The efficient score statistic can then be written as

Σ_i [ ∂l(y_i, β, η)/∂β − Ī_{βη} Ī_{ηη}^{−1} ∂l(y_i, β, η)/∂η ] / √( n [ Ī_{ββ} − Ī_{βη} Ī_{ηη}^{−1} Ī_{ηβ} ] ),

evaluated at β = 0 and η = η̂; we refer to the denominator as the asymptotic variance of the efficient score statistic.
Alternatively, we can define and compute the efficient score statistic based on the empirical variance,

W = Σ_i u_i / √( Σ_i u_i² ), where u_i = ∂l(y_i, β, η)/∂β − Ī_{βη} Ī_{ηη}^{−1} ∂l(y_i, β, η)/∂η,

evaluated at β = 0 and η = η̂. Empirically, the asymptotic and empirical variance estimates are numerically equivalent for sufficiently large n. For numerical efficiency, the groupedSurv package implements the empirical-variance version of the efficient score statistic, i.e., W.
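The empirical-variance computation can be illustrated as follows, assuming per-subject scores for β and the nuisance parameters have already been evaluated at β = 0 and η̂. This is a Python sketch with our own function and argument names, not the package's R code.

```python
import math

def efficient_score_stat(score_beta, score_eta, I_beta_eta, I_eta_eta_inv):
    """Empirical-variance efficient score statistic W.
    score_beta[i]: subject i's score for beta (scalar);
    score_eta[i]: subject i's nuisance score vector (list of length p);
    I_beta_eta: averaged cross-information (length p);
    I_eta_eta_inv: inverse averaged nuisance information (p x p)."""
    u = []
    for sb, se in zip(score_beta, score_eta):
        # project the nuisance scores: I_beta_eta * I_eta_eta^{-1} * se
        proj = sum(I_beta_eta[a] * sum(I_eta_eta_inv[a][b] * se[b]
                                       for b in range(len(se)))
                   for a in range(len(se)))
        u.append(sb - proj)  # efficient score contribution u_i
    # W = sum(u_i) / sqrt(sum(u_i^2))
    return sum(u) / math.sqrt(sum(ui * ui for ui in u))
```

Because the denominator is assembled from the same per-subject contributions as the numerator, only one pass over the data is needed, which is the numerical-efficiency point made above.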

Incorporating family structure
We now extend the above approach (in the absence of covariates) to account for possible correlation in outcome among members of the same family. The genotypic hazard for individual i within family f of size s_f is given by equation (1) from [4]:

λ(t | x_{fi}, b_{fi}) = λ_0(t) exp(x_{fi} β + b_{fi}).

Here b ∈ R^{s_f} is a random-effect vector distributed according to N_{s_f}(0, σ² κ), where the elements of the kinship matrix, κ, encode the expected genetic relatedness of each pair of family members (e.g., 1 on the diagonal and 1/2 for offspring-parent or full-sibling pairs). Corresponding matrices can be constructed for different family compositions, e.g., offspring-parent, offspring-offspring-parent, or offspring-offspring-parent-parent.

Log-likelihood
If, conditionally on b, the censoring is independent and non-informative also of b, then, following equation (2) from [4], the likelihood for family f is given by

L_f = ∫_{R^{s_f}} [ ∏_{i=1}^{s_f} L_{fi}(b_i) ] φ(b; 0, σ² κ) db,

where L_{fi}(b_i) is the conditional likelihood contribution of family member i given b_i and φ(·; 0, σ² κ) is the N_{s_f}(0, σ² κ) density. Assuming independence across families, the full log-likelihood is given by

l = Σ_f log L_f.
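The integral over the random effect has no closed form in general; one simple way to see its structure is Monte Carlo integration over b ~ N(0, σ²κ). The Python sketch below uses our own helper names and an arbitrary σ = 0.3, and is not how the package evaluates the likelihood; it approximates L_f for a two-sibling family.

```python
import math
import random

def family_lik_mc(loglik_members, chol, sigma, n_draws=10000, seed=1):
    """Monte Carlo approximation of one family's likelihood:
    L_f = E_b[ prod_i L_i(b_i) ], b ~ N(0, sigma^2 * kappa).
    loglik_members: per-member conditional log-likelihoods as functions of b_i;
    chol: lower-triangular Cholesky factor of kappa."""
    rng = random.Random(seed)
    s = len(loglik_members)
    total = 0.0
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(s)]
        # b = sigma * chol @ z gives a draw from N(0, sigma^2 * kappa)
        b = [sigma * sum(chol[i][j] * z[j] for j in range(i + 1)) for i in range(s)]
        total += math.exp(sum(ll(bi) for ll, bi in zip(loglik_members, b)))
    return total / n_draws

# two censored siblings, each with conditional log-likelihood -exp(-0.4 + b_i)
ll = lambda b: -math.exp(-0.4 + b)
chol = [[1.0, 0.0], [0.5, math.sqrt(0.75)]]  # Cholesky factor of [[1, .5], [.5, 1]]
L_f = family_lik_mc([ll, ll], chol, sigma=0.3)
```

In practice a deterministic quadrature rule would typically replace the Monte Carlo loop, but the structure of the integrand is the same.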

Score functions
Using logarithmic differentiation, ∂ log L_f/∂β = (∂L_f/∂β)/L_f, where the derivative of L_f is obtained by differentiating under the integral sign; the derivatives with respect to the nuisance parameters follow similarly. The efficient score statistic is then calculated as stated previously.

Gene- and pathway-level statistics
Within the context of GWAS, it is often of interest to conduct the analysis at the level of a gene or pathway rather than of individual variants. A variety of methods are available for aggregating variant-level statistics [5]. The groupedSurv package offers support for such set-based analyses by giving users the option of returning the contribution of each sample i to the score statistic for each variant tested. These patient-by-variant-level results can then be used as inputs to aggregate statistics. Alternatively, the included geneStat() function accepts a user-specified R function as an argument, returning the desired aggregate statistic directly.
Some aggregate statistics incorporate a vector of constants used to up- or down-weight individual variants within the gene- or pathway-level statistic. The structure of the geneStat() argument geneSet, used to specify the variants composing the sets to be tested, also allows for specifying per-variant weights, if desired. For more information about the arguments and use of geneStat(), please see the associated function documentation or the groupedSurv vignette within the package.
In the absence of a user-specified aggregate statistic, geneStat() implements a Sequence Kernel Association Test (SKAT) [6, 7] type statistic by default. The SKAT statistic is simply a weighted sum of squares of the single-variant statistics,

Q = Σ_{v=1}^{m} w_v U_v²,

where U_v is the statistic for variant v and w_v its weight. The associated set-level P-values can then be computed either from a mixture of chi-square distributions or through permutation.
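To make the default statistic concrete, the sketch below computes Q = Σ_v w_v U_v² and a permutation P-value by randomly flipping the signs of the per-sample score contributions, one simple permutation scheme for score statistics under H_0. This is a Python sketch with our own function names, not the package's implementation.

```python
import random

def skat_type_stat(U, w):
    """Weighted sum of squared single-variant statistics: Q = sum_v w_v * U_v^2."""
    return sum(wv * uv * uv for wv, uv in zip(w, U))

def permutation_pvalue(contrib, w, n_perm=1000, seed=7):
    """Permutation P-value for Q. contrib[i][v] is sample i's contribution to
    variant v's score statistic; each permutation flips the signs of whole
    sample rows and recomputes Q. Add-one correction avoids P = 0."""
    rng = random.Random(seed)
    m = len(contrib[0])
    U_obs = [sum(row[v] for row in contrib) for v in range(m)]
    q_obs = skat_type_stat(U_obs, w)
    hits = 0
    for _ in range(n_perm):
        signs = [rng.choice((-1.0, 1.0)) for _ in contrib]
        U = [sum(s * row[v] for s, row in zip(signs, contrib)) for v in range(m)]
        hits += skat_type_stat(U, w) >= q_obs
    return (hits + 1) / (n_perm + 1)
```

The per-sample rows here correspond to the patient-by-variant-level results that groupedSurv can return for use in aggregate statistics.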

Coding CALGB 40101 grouped survival data
One way to understand how grouped survival data should be coded for groupedSurv is to consider them within the context of the likelihood function,

L_i = [1 − α_{k_i}^{exp(x_i β)}]^{δ_i} ∏_{j=1}^{k_i−1} α_j^{exp(x_i β)}.

The first term, which is raised to the power of δ_i, represents the probability of failing in interval k_i. The second term is the product of the probabilities of surviving the entirety of the previous intervals, 1 through k_i − 1. Each patient's contribution to the total likelihood is thus composed of the intervals they survived without incident and, if applicable, the interval in which the event occurred.
In CALGB 40101, the event of interest is the minimal dose needed to cause paclitaxel-induced grade 2 or higher peripheral neuropathy. Patients receive up to six cycles of paclitaxel, meaning the continuous interval of all possible doses of paclitaxel, [0, ∞), is divided by the six doses into r = 7 intervals: [0, 1), [1, 2), [2, 3), [3, 4), [4, 5), [5, 6), [6, ∞). The coding of δ_i = 0 or 1 should be independent of the treatment arm to which a patient was randomized. Given that there are only seven intervals and only two event states, it is possible to enumerate all possible codings, which we do in Tables S1 and S2.
Table S1 considers the scenarios in which a patient is observed to have had the event. Note that there should be no events observed prior to the first cycle. In the case that an event is not observed for a particular patient, that patient is censored (Table S2). Note that patients who drop out prior to the first cycle contribute no information to the likelihood.
Continuous censoring times are independently generated from a uniform distribution over the interval (0, c_max). We then specify grouped survival times, representing fixed observation times in a hypothetical study, by first selecting a maximum observed time, τ, and then dividing the interval [0, τ) into five equal subintervals. Grouped failure times and event indicators are then assigned as described in the main article. Table S3 gives the parameter values for the different simulations. In all cases, a baseline hazard of λ_0 = 1 and nuisance parameters of θ_1 = 0.2 and θ_2 = 0.2 are used. The c_max and τ parameters are adjusted based on the MAF so as to generate empirical event rates of 60%.
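Under these settings, the grouping step can be sketched as follows. This is one plausible assignment rule, assuming events are recorded by the subinterval of [0, τ) containing min(t, c) and that times beyond τ are censored at τ; the exact rules follow the main article, and the function name is ours.

```python
def group_failure_time(t, c, tau=5.0, r=5):
    """Map a continuous failure time t and censoring time c to a grouped
    interval index k in 1..r and event indicator delta, dividing [0, tau)
    into r equal subintervals. A sketch of the simulation scheme only."""
    delta = 1 if t <= min(c, tau) else 0  # event observed before censoring/tau
    obs = min(t, c, tau)
    width = tau / r
    k = min(int(obs // width) + 1, r)  # 1-based subinterval containing obs
    return k, delta
```

For example, with τ = 5 and r = 5 each subinterval has width 1, so a failure at t = 2.3 with late censoring falls in interval 3 with δ = 1, while censoring at c = 1.5 before a failure at t = 4.2 yields interval 2 with δ = 0.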
All reported results are based on B = 10,000 replicates of each set of simulation parameters, except for the comparison of performance using different numbers of CPU cores, where the results are the average of ten replicates.

Results
Empirical bias assessments for groupedSurv and the coxph() function from the survival package (using the exact likelihood method to adjust for ties) are shown in Figure S1. An event rate of 0.6 and a minor allele frequency of 0.5 were used for each of B = 10,000 simulation replicates at sample sizes of n = 500, 1,000, and 3,000. The simulations were conducted under the null hypothesis (β = 0). Our approach produces apparently unbiased estimates regardless of sample size, while the exact likelihood method for right-censored data appears to underestimate the effect size for all but the largest sample size simulated.

Figure S4: Non-parametric maximum likelihood survival function estimates and regional visualization plots. Survival and LocusZoom [8] plots for six selected SNPs.