In this section we discuss in detail how to adapt constraint-based method for temporal data analysis. First, we will briefly present Generalized Linear Mixed Models (GLMM) and Generalized Estimating Equations (GEE). Both techniques are suitable for devising conditional independence tests for temporal data with (un)balanced study designs. For a thorough comparison between GLMM and GEE see [58, 59].

### Generalised linear mixed models

Let **T**_{
i
} denote the *n*_{
i
}-dimensional vector of observed values for the target (response) variable *T* in the *i*-th subject at the different *d* time-points. We model the link of **T**_{
i
} with *p* covariates via the following equation:

$$ g\left(\mathbf{T}_{i}\right) = \mathbf{X}_{i}\pmb{\beta} + \mathbf{W}_{i}\mathbf{b}_{i} + \mathbf{e}_{i}, \ \ \ i= 1, \ldots, K. $$

(1)

The vector *β* is the (*p*+1)-dimensional vector of coefficients for the *n*_{
i
}×(*p*+1) fixed effects design matrix **X**_{
i
}, which contains the predictor variables. The vector **b**_{
i
}∼*N*_{
q
}(**0**,*Σ*) is the *q*-dimensional vector of coefficients for the *n*_{
i
}×*q* random effects matrix **W**_{
i
}, while *Σ* is the random-effects covariance matrix. The vector \(\mathbf {e}_{i} \sim N_{n_{i}}\left (\mathbf {0}, \sigma ^{2}\mathbf {I}_{n_{i}}\right)\) is the *n*_{
i
} dimensional within-group error vector which follows a spherical normal distribution with zero mean vector and fixed variance *σ*^{2}.

We used the exchangeable or compound symmetry (CS) structure on the covariance matrix *Σ*. We decided not to use a first order autoregressive covariance (AR(1)) structure as a hyper-parameter of the GLMM method, since this type of structure did not improve the performance of generalised estimating equations (presented below) and would add a high computational burden to the fitting of GLMM.

*K* stands for the number of subjects and the total sample size (number of measurements) is equal to \(N=\sum _{i=1}^{K}n_{i}\). The link function *g* connects the linear predictors on the right hand side of (1) with the distribution of the target variable. Common link functions are the identity, for normally distributed target variables, and the logit function for binomial responses.

The possibility of specifying random effects allows mixed models to adequately represent between and within-subject variability, and to model the deviates of each subject from the average behavior of the whole population. These characteristics make GLMMs particularly suitable for temporal and longitudinal data [9].

### Generalised estimating equations

Generalised Estimating Equations (GEE), developed by [60, 61], are an alternative to mixed models for modeling data with complex correlation structures. In contrast to GLMM which are subject specific, GEE contain only fixed effects and thus are population specific.

Using the notation defined in the previous section, in GEE the *p* covariates are related to the outcome as

$$ g\left(\mathbf{T}_{i}\right) = \mathbf{X}_{i} \pmb{\beta} + \mathbf{e}_{i}, \ \ \ i = 1, \ldots, K. $$

(2)

with the variance of the response variable *T* being modeled as *V**a**r*(*T*_{
ij
})=*ϕ*·*α*_{
ij
}, *j*=1…*n*_{
i
}, where *ϕ* is a common scale parameter and *α*_{
ij
}=*α*(*T*_{
ij
}) is a known variance function. We will focus on two different correlation structures for estimating *α*, the CS and the first order autoregressive AR(1):

$$ \begin{aligned} \text{CS:} \quad\text{Cor}\left(\mathbf{T}_{ij}, \mathbf{T}_{ij'}\right) & =\,\, \alpha \\ \text{AR(1):} \quad\text{Cor}\left(\mathbf{T}_{ij}, \mathbf{T}_{ij'}\right) & =\,\, a^{\left|j-j' \right|}. \end{aligned} $$

(3)

CS assumes that correlations of measurements for the same subject at different time-points are always the same, regardless of the temporal distance between them. Depending on the specific application, this might be not very realistic. In contrast, the AR(1) structure assumes that the correlation between measurements at different time points for the same subject decreases exponentially as the temporal gap between them increases.

A precise numerical estimation of *α* is critical in GEE modeling; we use the jackknife variance estimator suggested by [62], which is quite suitable for cases when the number of subjects is small (*K*≤30), as in many biological studies. The simulation studies conducted by [63] and [64] showed that the approximate jackknife estimates are in many cases in good agreement with the fully iterated ones.

### Conditional independence tests for the Temporal-longitudinal scenario

We devise two independence tests based on GLMMs (Eq. 1) and GEEs (Eq. 2) respectively. This scenario assumes the predictors and the target variable are measured at a fixed set of time-points *τ*={*τ*_{1},…,*τ*_{
m
}} in the same set of subjects. For balanced designs, all subjects are measured at all time-points, i.e, *n*_{
i
}=*n*,∀*i*. The target variable is often a gene-expression trajectory and thus, in the rest of the paper and for this scenario we assume a continuous target.

Recall that the null hypothesis *I**n**d*(*X*;*T*|**Z**) implies that *X* is not necessary for predicting *T* when **Z** is given, and thus the conditional independence tests can be thought of as a testing the significance of the coefficient of *x*. The null and full models are written as

$$ \begin{aligned} H_{0}: \mathbf{T}_{i} =\,\, \mathbf{1} a \,\,+ \mathbf{1} b_{i} \,\,+ \,\, \gamma \pmb{\tau} \,\,+ \,\, \pmb{\delta} \mathbf{Z}_{i} & \\ H_{1}: \mathbf{T}_{i} =\,\, \mathbf{1} a \,\,+ \mathbf{1} b_{i} \,\,+ \,\, \gamma \pmb{\tau} \,\, + \,\, \pmb{\delta} \mathbf{Z}_{i} &\quad + & \beta \mathbf{X}_{i} \end{aligned} $$

(4)

where **1** is a vector of 1s, *a* is the global intercept, *b*_{
i
} stands for the random intercept of the *i*-th subject, *γ*, *δ* and *β* are the coefficient of the predictors, and the generic link function *g*(.) (Eq. 1) has been substituted with the identity one.

This formulation stems from two specific modeling choices: (a) we use the vector of actual time points *τ* as a covariate, in order to model the baseline effect of the time on the trajectory of the target variable. Time becomes a linear predictor of the target. Other choices are possible, but would require more time-points that are typically not available in gene-expression data. (b) We include random intercepts, meaning we allow a different starting point for the estimated trajectory of each subject. This choice leads to **W**_{
i
}=**1**_{
n
}_{
i
},∀*i*, where **1**_{
n
} is a vector of ones of size *n*. However, we do not allow random slopes, thus assuming all subjects have the same dynamics. This choice was dictated by the need of avoiding model over-specification, especially considering the small sample size of the datasets used in the experimentation.

Pinheiro and Bates [9] suggests the use of the *F*-test for comparing the two models, where only the model, the full, under the alternative is fitted and the significance of the coefficient *β* is tested. Another possible choice would be the log-likelihood ratio test, however the *F*-test is preferable for small samples, since the type I error is better controlled with the *F* distribution.

A second test is based on the GEE model. The null and alternative models now lose the random terms:

$$ \begin{aligned} H_{0}: \mathbf{T}_{i} = \,\, \mathbf{1} a &\quad + & \gamma \pmb{\tau} &\quad + & \mathbf{Z}_{i} \pmb{\delta} & & \\ H_{1}: \mathbf{T}_{i} = \,\, \mathbf{1} a &\quad + & \gamma \pmb{\tau} &\quad + & \mathbf{Z}_{i} \pmb{\delta} &\quad + & \beta \mathbf{X}_{i} \end{aligned} $$

(5)

GEE fitting does not compute a likelihood [59] and thus, no log-likelihood ratio test can be computed. A Wald test is used instead here again and the significance of the coefficient *β* is tested. Because of the lack of likelihood computation, its effectiveness in assessing conditional independence is questionable [65]. Despite these theoretical considerations, the experimental results proved the test to be quite effective in our context.

### Conditional independence tests for the Static-longitudinal scenario

The Static-longitudinal scenario assumes longitudinal data with continuous predictors and a static target variable *T* that is either binary or multi-category. The goal is to discriminate between two or more groups on the basis of time-depending covariates. As in the Temporal-longitudinal scenario, the presence of longitudinal data requires to take into account the within-subject correlations.

We have devised a two-stage approach, partially inspired by the work of [66] and [67], for testing conditional independence in this scenario. In our approach a separate regression model is first fitted for each subject and predictor, using the time-points vector *τ* as unique covariate:

$$ \mathbf{G}_{i} = \gamma_{i0} + \gamma_{i1}\pmb{\tau}, \ \ \ i = 1, \ldots, n. $$

(6)

Here, **G**_{
i
} is the vector of measurements for subject *i* and the generic predictor variable *G*. At the end of this step we end up with a matrix *Γ* with dimensions *K*×(2·*p*), containing all coefficients derived with the *K* models specified in (6). The two nested models needed for testing conditional independence can then be specified as:

$$ \begin{aligned} H_{0}: g\left(\mathbf{T}_{i}\right) = \,\, \mathbf{1} a &\quad + & \pmb{\delta} \pmb{\Gamma}_{\mathbf{Z}} \\ H_{0}: g\left(\mathbf{T}_{i}\right) = \,\, \mathbf{1} a &\quad + & \pmb{\delta} \pmb{\Gamma}_{\mathbf{Z}} &\quad + & \pmb{\beta} \pmb{\Gamma}_{X} \\ \end{aligned} $$

(7)

where *Γ*_{
Z
} are the coefficients corresponding to the set of conditioning variables **Z** and *Γ*_{
X
} are the coefficients corresponding to the variable *X*. A logit function *g*(.) is used for linking the linear predictors to the binomial (or multinomial) outcome. The log-likelihood ratio test (calibrated with a *χ*^{2} distribution) is used to decide which of the two models is to be preferred.

### Conditional independence tests for the Temporal-distinct and Static-distinct scenarios

In these two scenarios different subjects are sampled at each time point (time-course data), and subject-specific correlation structures cannot be modeled. For the Temporal-distinct scenario, where the target variable is continuous, it is thus possible to use models (5) for assessing conditional independence. In absence of subject-specific correlation structures the GEE models reduce to standard linear models that can be compared with the standard *F*-test. A similar approach can be used for the Static-distinct scenario, where the outcome is binary or multinomial, by using a logit link function instead of the identity.

### The SES algorithm

First introduced in [10], the SES algorithm attempts to identify the set(s) of predictors (signatures) that are minimal in size and provide optimal predictive performances for a target variable *T*. The basic idea is that if ∃**Z**, s.t., *I**n**d*(*X*;*T*|**Z**), then *X* is superfluous for predicting *T*. Thus, SES repetitively applies a test of conditional independence until it identifies the predictors that are associated with *T* regardless of the conditioning set used. Under certain conditions, these variables are the neighbors of *T* in a Bayesian Network representing the data at hand [2]. An interesting characteristic of SES is that it can return multiple, statistically indistinguishable predictive signatures. As discussed in [68], limited sample size, high collinearity or intrinsic characteristics of the data may produce several signatures with the same size and predictive power. From a biological perspective, multiple equivalent signatures may arise from redundant mechanisms, for example genes performing identical tasks within the cell machinery. The SES algorithm is further explained in the Additional file 1 and in [5].

#### Equipping constraint-based methods with conditional independence test for temporal data

SES belongs to the class of constraint-based feature-selection methods [4]. This type of algorithm processes the data exclusively through tests of conditional independence that assess *I**n**d*(*X*;*T*|**Z**). *This means that in order to extend any constraint-based methods to temporal data it is sufficient to equip an appropriate test, such as the ones defined in Eqs.*(4)-(7).

### Experimentation on real data

The experimental evaluation aims at assessing the capabilities of the proposed conditional independence tests in real setting. For each scenario we identified several gene-expression datasets over which we applied the SES algorithm equipped with the conditional independence test most suitable for the data at hand. The feature subsets identified by SES were then fed to modeling methods for obtaining testable predictions.

Furthermore, in each scenario we contrasted SES against a feature selection algorithm belonging to the family of LASSO methods. This class of algorithms has proven to be well-performing in several applications, including variable selection in temporal data (see the Section regarding the literature review). Particularly, we compare against glmmLasso [27] for the Temporal-longitudinal scenario, with standard LASSO regression [51] for the Temporal-distinct scenario, and the grouped LASSO (GLASSO) for classification [54, 56] in the Static-longitudinal and Static-distinct scenarios.

We excluded from this comparative analysis approaches that a) do not scale-up to thousands of variables (e.g., Bayesian procedures), b) require a number of time points much larger than the applications taken into consideration in this work (as for functional regression, [69]), and c) in general do not have available implementations.

The configuration settings of all algorithms involved in the experimentation were optimized by following an experimentation protocol specifically devised for estimating and removing any bias in performance estimation due to over-fitting.

### Datasets

We thoroughly searched the Gene Expression Omnibus database (GEO, http://www.ncbi.nlm.nih.gov/) for datasets with temporal measurements. Keywords “longitudinal”, “time course”, “time series” and “temporal” returned nearly 1000 datasets. We only kept datasets having at least 15 measurement and at least three time points, and complete information about the design of the study generating the data. This resulted in at least 6 datasets for each scenario, except for the Static-longitudinal scenario, where we identified 4 datasets with at least 8 measurements. Detailed information on the selected datasets are available in the (Additional file 1: Tables S5 and S6).

### Modeling approaches

For the Temporal-longitudinal scenario SES was coupled with either GLMM or GEE regression, so as to mirror the conditional independence test equipped to the algorithm. The glmmLasso algorithm is used for comparison, using a model similar to (4) defined over the whole predictors matrix **X**

$$ \begin{aligned} \mathbf{T}_{i} = &\quad \mathbf{1}a + b_{i} & + \quad& \gamma \pmb{\tau} & + \quad& \mathbf{X}_{i} \pmb{\beta}\\ \end{aligned} $$

(8)

For the Static-longitudinal scenario, logistic or multinomial regression was applied on the columns of the matrix *Γ* selected by SES, depending on the outcome at hand. The grouped Lasso (GLASSO, [56]) algorithm was used for comparison. GLASSO allows to specify groups of variables that can enter the final model only altogether. Particularly, the GLASSO was applied on the whole matrix *Γ*, forcing the algorithm to either select or discard predictors in pairs, following the way columns in *Γ* correspond to the original predictors.

For Temporal-distinct and Static-distinct scenarios SES was always coupled with standard linear, logistic or multinomial regression (depending on the specific outcome), while the standard LASSO algorithm (binary outcome) and GLASSO (multinomial outcome) were used for comparison (see Additional file 1 for further details).

In all analyses SES’ hyper-parameters maximum conditioning variables size *k* and significance level *a* varied between {3,4,5} and {0.05,0.1}, respectively. The *λ* penalty values generated by the Least Angle Square (LARS) algorithm [52] were used for the LASSO models of all scenarios, apart from the temporal-longitudinal. LARS cannot be adapted to this latter scenario, and thus the range of values was separately determined for each dataset, by using all integer values between *λ*_{
min
}, the smallest value guarantying the invertibility of the Hessian matrix in each fold, and *λ*_{
max
}, the highest value after which no variable was selected.

### Experimentation protocol

We used the *m*-fold cross-validation procedure with the Tibshirani-Tibshirani (TT) bias correction [70] for model selection and performance evaluation. In the standard cross-validation protocol the available samples are partitioned in *m* folds, with approximately an equal number of samples each. Each fold is in turn held-out for testing, while the remaining data form the training set. The current modeling approach is applied several times on the training set, once for each predetermined configuration setting, and the predictive performances of the corresponding models are evaluated on the hold-out fold. The configuration with the best average performance is then used for training a final model on the whole dataset. In all experimentation *m* was set to either 4 or 5, so that to have at least two measurements in each fold. Particularly, folds correspond to one or more subjects in the Static-longitudinal scenario, and to one or more time points in the other scenarios.

The performance of the best configuration is known to be optimistically biased, and thus a correction is needed for a fair evaluation. The TT method is a general methodology for estimating and removing the optimistic cross-validation bias. If the performance’s metric is defined in terms of prediction error (the lower the error the better the performance), the bias estimation according to the TT method is the following:

$$ \hat{\text{bias}}=\frac{1}{m}\sum_{i=1}^{m}\left[e_{i}\left(\hat{\pmb{\theta}} \right) - e_{i}\left(\hat{\pmb{\theta}}_{i} \right) \right], $$

(9)

where *e*_{
i
} is the performance on fold *i*, while \(\hat {\pmb {\theta }}\) and \(\hat {\pmb {\theta }}_{i}\) are the configurations corresponding to the best average performance and to the best performance of the *i*-th fold, respectively. Signs in (9) should be interchanged if the performance metric assigns higher scores to better models.

The statistical significance of the difference between average performances is computed through permutation-based t-tests, where single performances are randomly permuted for approximating the null distribution.

All of the simulations, computations and time measurements were performed on a desktop with Intel Core i5-3470 CPU @ 3.2 processor, 4 GB RAM memory using a 64-bit R version 3.2.2.