- Methodology article
- Open Access

# tigaR: integrative significance analysis of temporal differential gene expression induced by genomic abnormalities

- Viktorian Miok
^{1, 2}, - Saskia M Wilting
^{2}, - Mark A van de Wiel
^{1, 3}, - Annelieke Jaspers
^{2}, - Paula I van Noort
^{4}, - Ruud H Brakenhoff
^{5}, - Peter JF Snijders
^{2}, - Renske DM Steenbergen
^{2}and - Wessel N van Wieringen
^{1, 3}Email author

**15**:327

https://doi.org/10.1186/1471-2105-15-327

© Miok et al.; licensee BioMed Central Ltd. 2014

**Received:**30 May 2014**Accepted:**25 September 2014**Published:**2 October 2014

## Abstract

### Background

To determine which changes in the host cell genome are crucial for cervical carcinogenesis, a longitudinal *in vitro* model system of HPV-transformed keratinocytes was profiled in a genome-wide manner. Four cell lines affected with either HPV16 or HPV18 were assayed at 8 sequential time points for gene expression (mRNA) and gene copy number (DNA) using high-resolution microarrays. Available methods for temporal differential expression analysis are not designed for integrative genomic studies.

### Results

Here, we present a method that allows for the identification of differential gene expression associated with DNA copy number changes over time. The temporal variation in gene expression is described by a generalized linear mixed model employing low-rank thin-plate splines. Model parameters are estimated with an empirical Bayes procedure, which exploits integrated nested Laplace approximation for fast computation. Iteratively, posteriors of hyperparameters and model parameters are estimated. The empirical Bayes procedure shrinks multiple dispersion-related parameters. Shrinkage leads to more stable estimates of the model parameters, better control of false positives and improvement of reproducibility. In addition, to make estimates of the DNA copy number more stable, model parameters are also estimated in a multivariate way using triplets of features, imposing a spatial prior for the copy number effect.

### Conclusion

With the proposed method for analysis of time-course multilevel molecular data, more profound insight may be gained through the identification of temporal differential expression induced by DNA copy number abnormalities. In particular, in the analysis of an integrative oncogenomics study with a time-course set-up our method finds genes previously reported to be involved in cervical carcinogenesis. Furthermore, the proposed method yields improvements in sensitivity, specificity and reproducibility compared to existing methods. Finally, the proposed method is able to handle count (RNAseq) data from time course experiments as is shown on a real data set.

## Keywords

- Integration
- Empirical Bayes
- Semi-parametric
- INLA
- High-dimensional

## Background

Cervical cancer is caused by infection with high-risk types of the human papillomavirus (HPV) followed by additional changes in the host cell genome. Insight in genes that are consistently altered over time will improve our understanding of the molecular mechanisms driving cervical carcinogenesis. These genes may provide novel biomarkers for early detection of cervical cancer as well as potential therapeutic targets. High-throughput techniques, such as microarrays and next generation sequencing, are tools for fast high-resolution genome-wide molecular profiling. Applying these techniques to measure genes at consecutive moments in time at multiple molecular levels generates a description of the occurrence of molecular abnormalities during cervical carcinogenesis.

A longitudinal *in vitro* system of four independent cell lines immortalized with either HPV16 or HPV18, previously shown to faithfully mimic cervical carcinogenesis at the (epi)genetic level [1–3], was used in this study. Cell lines were assayed for gene expression (mRNA) and gene copy number (DNA) with microarrays at consecutive moments in time, representing distinct stages of transformation. Abnormalities in DNA copy number were previously shown to directly affect expression of the genes located within these abnormalities and are believed to facilitate the identification of functionally relevant gene expression changes [4]. Integrating these two molecular levels will yield models of cancer development and progression, thereby reducing the complexity of (cervical) carcinogenesis [5]. We present a method that, in contrast to existing methods, is able to integrate DNA copy number and gene expression over time, while identifying temporal differential gene expression.

Available methods in current literature for time-course differential gene expression analysis can only be applied to a single molecular level. Since microarrays have become widely used for studying genome-wide gene expression, a range of statistical methods have been tailored for the identification of differentially expressed genes in microarray time-course experiments. Several of these methods are developed in an empirical Bayes framework [6–9]. Tai and Speed [9] use multivariate empirical Bayes statistics to rank time-course gene expression profiles. Their method is applicable to both single-condition and multiple-condition datasets and includes a variance stabilization imposing common matrix as a gene-specific variance-covariance matrix. Alternative approaches involve spline-based methods which fit a smoothed curve to the longitudinal data to use for statistical testing [10–12]. Storey et al. [11] use a population average time curve based on natural cubic splines to capture dynamics in gene expression levels and employ the F-test to identify significant genes. On the other hand, BATS [13] combines these two: it employs gene-wise functional modelling to explain temporal differential gene expression, which is casked in a hierarchical Bayesian framework.

In this article we present a method for identification of temporal differential gene expression driven by genomic abnormalities, introducing several new concepts. First, employing low-rank thin-plate splines and empirical Bayes shrinkage, identification of temporal differential gene expression is improved in terms of sensitivity, specificity and reproducibility. Second, including DNA copy number as a time-varying molecular covariate reduces residual variance and allows for the identification of genes which have variation in expression over time caused by genomic abnormalities. Genes with expression levels affected by DNA copy number aberrations have the capability to contribute to malignant cell growth [4, 14]. Identification of these genes is therefore essential for a better understanding of cancer development in general. Third, we impose a multivariate spatial prior for the DNA copy number effect to make the estimate more stable, borrowing information from neighboring features. Furthermore, by changing the link function our method can straightforwardly deal both with continuous and count data. To illustrate the wide applicability of our method we applied it to HPV-induced transformation (microarray) and head & neck cancer (RNA-seq) data.

## Methods

A method for the identification of temporal variation in gene expression (due to genomic abnormalities) from an integrative genomics study with a time-course set-up is presented. The variation in gene expression over time is described by a generalized linear mixed model employing low-rank thin-plate splines. Hyperparameters of the model are estimated from the data with an empirical Bayes procedure. With parameter estimates at hand, we describe how relevant hypotheses may be tested. The section concludes with extensions of the model and practical considerations for its application.

### Model

Consider a time-course microarray experiment where *n* cell lines are assayed repeatedly over
consecutive time points. At each time point both the DNA copy number and mRNA gene expression of each cell line are measured. Let the random variables *X*_{i,j,t} and *Y*_{i,j,t} represent the DNA copy number and the expression level, respectively, of gene *j* of cell line *i* at time point *t*. Their realizations are denoted by *x*_{i,j,t} and *y*_{i,j,t}. For both random variables the index *j* runs from *j*=1 to *p*. This is due to the application of a matching procedure [15], which matches probes from two high-throughput platforms on the basis of their genomic location.

*j*are assumed to be normally distributed: ${Y}_{i,j,t}\sim \mathcal{N}({\mu}_{i,j,t},{\sigma}_{\epsilon ,j}^{2})$, with ${\sigma}_{\epsilon ,j}^{2}$ the error variance of the expression levels of gene

*j*. The mean of this distribution is modelled as:

where *f*(·;·) and *h*(·;·) model the fixed and random effects, respectively, of the expression level of gene *j*. Both *f*(·;·) and *h*(·;·) are specified next.

where α_{i,j} is the effect of cell line *i* in gene *j* and *β*_{
j
} the DNA copy number effect on the expression levels of gene *j*.

where **Z**_{
t
}=(|*t*−*κ*_{1}|^{3},…,|*t*−*κ*_{
K
}|^{3}) and γ_{
j
}=(*γ*_{
j
},…,*γ*_{j,K})^{T}, the vector of coefficients of the spline. These coefficients are randomly distributed as ${\mathit{\gamma}}_{j}\sim \mathcal{N}({\mathbf{0}}_{K\times 1},{\mathit{\Sigma}}_{\gamma ,j})$. The *κ*_{
k
}, *κ*_{1}<*κ*_{2}< ⋯<*κ*_{
K
}, are fixed knots, equally distributed over the interval $[1,\mathcal{T}]$.

*γ*

_{j,k}are assumed independent and all stemming from the same distribution $\mathcal{N}(0,{\sigma}_{\gamma ,j}^{2})$. Within the context of thin plate splines this is achieved by assuming a particular parameterization of Σ

_{γ,j}. This requires the definition of the matrix Ω with ${\left(\mathit{\Omega}\right)}_{{k}_{1},{k}_{2}}=|{\kappa}_{{k}_{1}}-{\kappa}_{{k}_{2}}{|}^{3}$ for

*k*

_{1},

*k*

_{2}=1,…,

*K*. Furthermore, let ${\mathbf{U}}_{\omega}{\mathbf{D}}_{\omega}{\mathbf{V}}_{\omega}^{\mathrm{T}}$ be the singular value decomposition of Ω with

**U**

_{ ω }and

**V**

_{ ω }containing its left and right singular vectors as columns and diagonal matrix

**D**

_{ ω }of its singular values. It is then assumed that Σ

_{γ,j}can be written as ${\mathbf{V}}_{\omega}{\mathbf{D}}_{\omega}{\mathbf{V}}_{\omega}^{\mathrm{T}}$. Under this assumption Model (1) can be reformulated as the mixed model:

where ${\stackrel{~}{\mathbf{Z}}}_{t}={\mathbf{Z}}_{t}{\mathbf{V}}_{\omega}{\mathbf{D}}_{\omega}^{1/2}$ and ${\stackrel{~}{\mathit{\gamma}}}_{j}={\mathbf{D}}_{\omega}^{-1/2}{\mathbf{V}}_{\omega}^{\mathrm{T}}{\mathit{\gamma}}_{j}$. The independent random variables *γ*_{j,k} and *ε*_{i,j,t} of this model are both multivariately normal with mean zero and covariances $\text{Cov}({\epsilon}_{{i}_{1},j,{t}_{1}},{\epsilon}_{{i}_{2},j,{t}_{2}})={\sigma}_{\epsilon}^{2}$ only if *i*_{1}=*i*_{2} and *t*_{1}=*t*_{2} and zero otherwise, and $\text{Var}\left({\mathit{\gamma}}_{j}\right)={\sigma}_{\gamma ,j}^{2}{\mathbf{I}}_{K\times K}$. The choice of Ω results in a covariance matrix with higher covariances of random effects of neighboring knots (than those of more distant knots).

**Y**

_{∗,j,∗}matrix of measurements for gene

*j*which are first ordered by cell lines

*i*and within a cell line by time. The likelihood for gene

*j*thus is:

which is to be used in the estimation.

### Estimation

The parameters of Model (2) are estimated by means of an empirical Bayes procedure. Empirical Bayes enables us to exploit the high-dimensionality of the data by ‘borrowing information across genes’, which yields more reproducible results. Information will be shared among genes via common hyperparameters of the priors of the model parameters. Here this sharing is done only for parameters that will be subject to inference. Other parameters are considered confounders that need to be taken into account but are not of central interest. In principle, our estimation allows common hyperparameters for these confounders, but at a computational cost.

where $\mathcal{N}({\beta}_{j};0,{\tau}^{2})$ denotes the Gaussian density with parameters (0,*τ*^{2}). The point mass accommodates the proportion of genes without a DNA copy number effect. As this proportion is likely to comprise the majority of genes, it shrinks the *β*_{
j
} to zero for those genes with a gene dosage effect.

The random effect *γ*_{j,k} and error *ε*_{i,j,t} are endowed with normal priors: $\mathcal{N}({\gamma}_{j,k};0,{\eta}_{j}^{2})$ and $\mathcal{N}({\epsilon}_{i,j,t};0,{\xi}_{j}^{2})$. Shrinkage is applied on the dispersion parameters ${\eta}_{j}^{2}$ and ${\xi}_{j}^{2}$ for two reasons: *a)* more stable parameter estimates and *b)* protection against over-fitting. The reciprocals of these parameters, ${\eta}_{j}^{-2}$ and ${\xi}_{j}^{-2}$, represent precision, for each a Gamma distribution is used as a conjugate prior. The parameters *a* and *b* of these hyperprior Gamma distributions are estimated using the method of [17]. The amount of shrinkage via this prior is determined by the data. The procedure is initiated with *a* and *b* resulting in a very flat prior on the precision. This corresponds to a very narrow prior on the variance of the random effect, that is, a flat spline. Iteratively, should the data give rise to it, the prior of precision becomes more informative. As a result the variance moves away from zero, increasing the flexibility of the spline. Would one desire more shrinkage our procedure allows the employment of a hyperprior composed of a Gamma and a point mass at zero.

For the fixed cell line effect *α*_{i,j} a Gaussian prior is assumed: $\mathcal{N}({\alpha}_{i,j};{\mu}_{i,j},{\nu}_{j}^{2})$. No shrinkage (via information borrowing) is applied to *α*_{i,j} as it is considered a confounder, for which an unbiased estimate is preferred.

We temporarily assume that the cell lines are merely biological replicates and no inferential statement with respect to the cell line effect will be made in the Section ‘Head-and-neck cancer’. Hence, for the moment the prior of the cell line effect *α*_{i,j} is $\mathcal{N}({\mu}_{i,j},{\nu}_{j})$. The variance of this prior depends on index *j*: the hyperparameter of this prior is different for each gene.

Given the hyperparameters shared by all genes, the model parameters of the individual genes are estimated by the mean of their posterior distributions. These are obtained by means of integrated nested Laplace approximations (INLA) [18]. INLA yields the marginal posterior distribution of a model parameter through integration of the posterior distribution over all remaining model parameters. This can be done computationally efficient by means of Laplace approximations under the assumption of a Gaussian prior on the model parameters. The use of approximated posterior distributions (instead of the possibly more exact posterior produced by Markov chain Monte Carlo (MCMC)) is motivated computationally: the posterior distributions of model parameters of many genes need evaluation.

It remains to choose the hyperparameters. An informed choice of the hyperparameters is made through application of the empirical Bayes procedure of [17]. That is, the hyperparameters are estimated from the data rather than set prior to the analysis. Apart from a less subjective choice of the hyperprior, this has two favorable consequences. First, it yields more reproducible results (confer the Section ‘Comparison’). Second, the hyperpriors employed cause shrinkage of the model parameter estimates. In particular, for the random effect it shrinks the dispersion parameter, which controls the smoothness of the fit [19]. Hence, it constrains the flexibility of the spline and thus reduces the risk of overfitting. An illustration of this effect on the fitted spline can be found in Additional file 1, Section 1.

where *π*(·) denotes the prior of its argument (hyperparameters of the priors are suppressed for ease of notation). The hyperparameters of the priors of the α_{
j
}, *β*_{
j
}, and ${\sigma}_{\gamma ,j}^{2}$ are estimated by maximization of loss function (3). Hereto equate the derivative of (3) with respect to the hyperparameters to zero and solve. The iterative procedure of [17] yields an approximate solution to estimating equation (3). The procedure of [17] is computationally fast and allows non-parametric hyperpriors.

From expression (3) it becomes clear how the parameter estimates are shrunken. The prior of (say) *β*_{
j
} is shared by all genes. Hence, the choice of the hyperparameters affects the posterior distribution of all *β*_{
j
}. In particular, if the mass of *π*(*β*_{
j
}) is more concentrated around zero, the posterior has more probability mass close to zero. Only if the data contains enough evidence (in favor of a non-zero *β*_{
j
}) to outweigh the prior, the posterior will center around a non-zero value. The prior of *β*_{
j
} puts (via the spike at zero) more mass at zero. Moreover, the precision of the other mixture component is estimated from all genes. Under the assumption that a (vast) majority of genes do not exhibit an effect, the precision is under-estimated for the minority. This, together with the spike, yields a conservative prior leading to shrunken estimates.

Finally, we point out that the procedure described in [17] assumes covariates to be identical over features. In our setting the DNA copy number covariate varies over the features. It would be more appropriate to assume a different Dirac-Gaussian prior for each group of features that shares the same aberration pattern over the samples. These mixture priors differ in the variance of their Gaussian part. This could – in principle – be accommodated by a mixture of a point mass at zero and a lot of Gaussians, all with mean zero but different variances. However, a mixture of Gaussians with the same location but different variance may be approximated by a *t*-distribution with degrees of freedom equal to the number of Gaussians [20]. However, this approximation improves as the number of Gaussians in the mixture increases. On the other hand, a *t*-distribution with many degrees approaches a Gaussian. As there are usually many different genomic aberration patterns, we exploit this approximation and assume a common variance in our Dirac-Gaussian mixture prior. In a simulation study (Additional file 1, Section 2) we show that the *t*-distribution approximation does not differ substantially from the mixture of Gaussians with common mean but different variances.

### Hypothesis testing

From a biological point two questions are of main interest: *i)* does DNA copy number drive gene expression, and *ii)* is there differential expression over time? The former question can be answered by testing whether the DNA copy number effect *β*_{
j
} differs significantly from zero, evaluating the null hypothesis *H*_{0}:*β*_{
j
}=0 versus its alternative *H*_{0}:*β*_{
j
}≠0. The second question is addressed by testing ${H}_{0}:{\sigma}_{\gamma ,j}^{2}=0$ against the alternative ${H}_{A}:{\sigma}_{\gamma ,j}^{2}>0$. The alternative implies that there is at least one *γ*_{j,k}≠0, resulting in a non-constant spline. Additionally, one may ask whether there is a difference between the cell lines, but this question is not considered here (although it could straightforwardly be addressed within the presented framework).

where e.g. ${\widehat{\mathit{\alpha}}}_{j}^{\left({H}_{A}\right)}$ is the estimate of α_{
j
} under the alternative hypothesis *H*_{
A
}. P-values are then obtained from the asymptotic (chi-square) distribution of the likelihood ratio statistics. The degrees of freedom of this chi-square distribution are equal to the difference in the number of parameters of the models compared. Note that this test is likely to be somewhat conservative, because ${\widehat{\beta}}_{j}^{\left({H}_{A}\right)}$ shrinks towards the null domain. The Dirac-Gaussian mixture prior, however, prevents overly conservative behavior, because it better separates the null genes from the non-null ones than a simple Gaussian prior. When testing ${H}_{0}:{\sigma}_{\gamma ,j}^{2}=0$, the degrees of freedom consumed by the penalized splines are estimated by the trace of hat matrix (details: Additional file 1, Section 3). Finally, to account for multiplicity the False Discovery Rate (FDR) is controlled by means of the procedure of [21].

This hybrid testing procedure (empirical Bayesian estimation with p-value based inference) mimics Limma [22, 23]. Limma obtains a shrunken (empirical Bayes) estimate of the variance and detects differential expression using a classical t-test (after adjusting the degrees of freedom). Here too we borrow strength (information) across genes to arrive at shrunken parameter estimates, and provide classical p-values. The latter meets the wishes of medical researchers. They commonly report p-values rather than Bayes factors. In addition, multiple testing corrections are generally more rigorous in a classical setting than in a Bayesian one, because the latter typically provides FDR estimation rather than control.

### Spatial multivariate prior

DNA copy number aberrations are often not confined to a single gene but span a large region of the genome that harbors multiple genes. Consequently, neighboring genes may share the same genomic aberration signature. At the transcriptomic level this results in co-expression of these genes [24]. Put differently, the DNA copy number effect (*β*_{
j
}) of gene *j* may be correlated with that of neighboring genes (e.g. *β*_{j−1} and *β*_{j+1}). This phenomenon is not accommodated by the aforementioned prior of the *β*_{
j
}.

In this section we describe an extension of our procedure that incorporates the possible spatial correlation among the DNA copy number effects. To this end it is assumed that the *β*_{
j
} follows a first-order autoregressive (AR(1)) process along the genome: *β*_{
j
}=*ρ* *β*_{j−1}+*ε*_{
j
}. The relevant parameter of this process is simply estimated by regression of the *β*_{
j
} on the *β*_{j−1}.

*β*

_{ j }, it rests to refit the model. However, the assumption of an AR(1) process on the gene dosage effect complicates the refitting as this should now be done simultaneously for all

*p*genes. This is computationally too demanding. To approximate the joint fit the model is refitted per triplet of neighboring genes (e.g. for

*β*

_{j−1},

*β*

_{ j },

*β*

_{j+1}). For each triplet a trivariate normal prior is assumed (all other priors as before):

where the correlation structure of the covariance matrix follows an AR(1) process. For the re-estimated vector of *β*_{
j
}s only the middle one is conserved. More details are provided in Additional file 1, Section 4.

Besides doing more justice to the underlying biology the ‘spatial prior’ above reduces the variation of the DNA copy number effect. This is achieved as the assumption of an AR(1) process effectively ‘averages’ the DNA copy number effect over neighbouring genes. As such, it is also a way of borrowing information across genes.

### Practical considerations

_{ j }accordingly. Figure 1 illustrates the difference between the same and different spline models.

Within mixed Model (2) DNA copy number and time (via the spline) compete to explain the gene expression. Due to its flexibility the spline may consume variation in expression levels actually due to DNA copy number changes. Moreover, with DNA copy number changes being a more clearly delineated cause (than time in the form of a nonparametric spline), we prefer to attribute variation in expression levels to the genomic aberration. To let DNA copy number changes prevail over the spline, the design matrix **Z** of the latter is orthogonalized to the DNA copy number data. This orthogonalization does not affect the overall fit, but ensures that the spline captures only variation in expression levels that cannot be explained by DNA copy number changes. The effect of the orthogonalization is illustrated in the Section ‘HPV-induced transformation’.

To determine the optimal number of knots, we employ the deviation information criterion (DIC) [25]. The DIC is a measure for the balance between fit and complexity. For each gene the DIC of the model is estimated for different numbers of knots. The number of knots that over all genes yields the best DIC is chosen as the optimal *k* and applied to all genes in the analysis. More details are provided in Additional file 1, Section 5.

## Results and discussion

### HPV-induced transformation

The proposed method is demonstrated on data of an experiment on HPV-induced transformation. The experiment intends to faithfully mimic cervical cancer development employing a HPV-immortalized *in vitro* cell line model. Hereto two cell lines are affected with HPV16 and two with HPV18 [26]. Over time these cell lines acquire genomic and transcriptomic changes. To assess these changes, the genomic and transcriptomic characteristics of the cell lines are measured at eight time points by means of oligonucleotide microarrays. The preprocessing of the DNA copy number data comprises of median normalization and segmentation using the circular binary segmentation (CBS) method [27]. Similarly, the gene expression data are background corrected using robust multi-array average (RMA) [28] and between-array normalized by the robust quantile method. Finally, the resulting expression intensity values are transformed using the variance stabilizing transformation [29]. DNA copy number is assigned to each transcript from the expression array by the overlapPlus matching procedure of [15] which uses chromosomal location information. The final data set contains genomic and transcriptomic information on 37768 features, however in this section analysis is performed only on one chromosome which contains 2202 features.

We now turn to the identification of genes with differential expression over time. To this end only the expression data features are used, ignoring the effect of genomic aberrations in Model (2). Gene *j* exhibits temporal gene expression if ${H}_{0}:{\sigma}_{\gamma ,j}^{2}=0$ is rejected. This corresponds to a spline differing from a flat line, and thus indicating changes in expression levels over time. Temporal gene expression is identified both with a common and different spline(s) for the cell lines. In both analyses the optimal number of knots equals two (determined by the procedure described in the Section ‘Practical considerations’). Specification of the prior distribution and description of the hyperparameter estimations for ${\eta}_{j}^{2}$ can be found in the Section ‘Estimation’.

*p*-values close to but just short of the significance threshold.

**The number of significant probes identified in the analysis for temporal differential expression and copy number (CN) effect**

Same spline | Different spline | ||||
---|---|---|---|---|---|

Effect | Model | Standard | Orthogonal | Standard | Orthogonal |

Time | Splines | 417 | 583 | ||

CN+Splines | 204 | 203 | 421 | 421 | |

CN | CN+Splines | 402 | 403 | 380 | 380 |

Multivariate | 398 | 399 | 377 | 380 |

The common and different spline models employed for the identification of temporal differential expression are now extended to include DNA copy number (as originally proposed in the Section ‘Model’). As noted in the Section ‘Practical considerations’ the flexibility of the spline may consume part of the DNA copy number effect. The proposed remedy limits (via projection) the spline basis to the space orthogonal to space spanned by the DNA copy number information. To assess the potential gain of the orthogonalization each analysis, with common and different spline(s), is done with and without orthogonalization. Prior distributions are as before. The number of knots is determined as done previously (optimal number still equals two). For each analysis hyperparameters of DNA copy number and spline(s) are re-estimated by the empirical Bayes procedure.

We first discuss the number of features with differential temporal expression (given in the second row of Table 1). As in the analysis without DNA copy number, the different spline model identifies substantially more features than the model with a common spline for the cell lines. Orthogonalization of the (common) spline basis onto the DNA copy number data misses one feature in comparison to the non-orthogonalized analysis. This feature is found only with the non-orthogonalized spline basis. In the latter analysis it only passes the significance threshold by a small margin. Hence, in these data the effect of orthogonalization on the identification of temporal differential expression is limited.

*β*

_{ j }, as can be witnessed from Figure 2. Clearly, orthogonalization moves the distribution of the

*β*

_{ j }’s to the right (the positive domain, which corroborates with the biologically expected direction of the effect). Finally, on the full data set (not shown) the analysis using the orthogonalized spline basis gives a modest improvement in the number of genes significantly affected by DNA copy number.

Another striking feature of Table 1 is the difference between the number of significant features with differential temporal expression and those with a gene dosage effect: the former exceeds the latter (in case of the full model with different splines per cell line). In part this difference is explained by the flexibility of the splines. But also by the presence of many other regulators of gene expression (e.g. microRNA, methylation, transcription factors) that result in temporal differential expression are captured by the splines. On the other hand, comparing the analyses with a common spline more significant gene dosage effect than temporal differential expression is found. This is due to the fact that DNA copy numbers may strongly correlate with gene expression over time. Features with expression levels that do not consistently (over cell lines) co-vary with DNA copy number are missed by the common spline model (and not including the gene dosage effect).

*γ*

_{k,j}. In the Section ‘Estimation’ we suggest to use the Gamma distribution. However, we have also implemented a mixture of a point mass at zero and Gamma distribution, which one expects to lead to more shrinkage. Model (2) using different splines and a standard design matrix is refitted now with this mixture prior for the random spline effect. Application of our empirical Bayes procedure with the Gamma prior identified 421 features, while the Dirac-Gamma mixture prior selected 396 features. The latter 396 are all included in the former 421 features. The slight reduction in the number of selected features is of course due to the inclusion of the point mass at zero. The fit of both resulting models is almost identical for most features, but for some features with a slightly less flexible spline as in Figure 7 (Additional file 1, Section 7 illustrates effect in all four cell lines).

### Head-and-neck cancer

To illustrate the wide applicability of our framework, we present the analysis of sequencing data from a head-and-neck cancer study endowed with a time-course set-up. Oshlack et al. [32] noticed that there are currently no appropriate methods for the analysis of RNA-seq data from time-course experiments. The head-and-neck study aims to identify temporal differential expression due to overexpression of a particular microRNA. MicroRNAs are small 20-22nt non-coding RNAs that inhibit expression of their target genes. The microRNAs recognize these genes by their seed sequence that is complementary to a sequence in the 3’ untranslated region (UTR) of the target transcripts. One gene can be targeted by multiple microRNAs and one microRNA may target a multitude of genes. It is therefore not simple to find a specific relevant target gene of a microRNA. In a previous functional screen microRNAs were identified that specifically kill head and neck cancer cells, but not normal cells [33]. The respective target genes of these microRNAs were to be identified in a follow-up experiment. In this follow-up experiment cells of a squamous cell carcinoma cell line were transfected by a microRNA mimic and a control. The transfected cells were grown *in vitro* and sampled at six time points. Transcript levels of the 2×6 samples were sequenced. Data were mapped to the human genome and raw count data (reads) per gene transcript were used and not summarized per gene. Their normalization comprises rescaling by a the trimmed mean of each sample’s library (following [32]). Normalized data are rounded to the closest integer to retain the count interpretation of the data.

Model (2) cannot be directly applied to the sequencing data, as the normal distribution is often a poor approximation for the distribution of counts. The normality assumption is replaced by the (zero-inflated) negative binomial [17, 34]: ${Y}_{i,j,t}\sim \mathcal{Z}\mathcal{I}$-$\mathcal{N}\mathcal{\mathcal{B}}({\mu}_{i,j,t},{\varphi}_{\epsilon ,j})$. The mean *μ*_{i,j,t} of the counts is (after transformation by the inverse of the link function) still modeled by the right-hand side of Model (2) with assumptions on model parameters in place. Hyperparameters are then estimated via the empirical Bayes procedure previously described.

*γ*

_{ j }is the main parameter of interest and the analysis compares the model with and without time effect. The optimal number of knots (again two, for both models) is determined using the procedure described in the Section ‘Practical considerations’. Prior distributions for cell line and time effect are as in the Section ‘Estimation’. Hyperparameters are estimated for each analysis separately, but only the variance of the random time effect is shrunken via the empirical Bayes procedure. Counts of are fitted using the model with same and different splines as illustrated for one RNA-seq tag in Figure 8.

The number of tags with a significant (at the 5% FDR level) temporal variation identified equals 8416 (10951) for the common (different) spline model. As observed in the analysis of the HPV-induced transformation data, the use of a different spline leads to many more findings. Again, this is explained by the improved fit due to a more flexible model. In particular, all the tags identified with the same spline model are also found by its flexible counterpart.

### Comparison

The proposed method is compared to three well-known alternatives for significance analysis of time-course microarray data: EDGE, [11], timecourse, [9], BATS [13, 35], and a reference method. The reference method comprises a standard frequentist approach. It fits a linear mixed-effect model, while the null hypothesis is evaluated through an analysis of variance approach (anova), comparing two nested models (with and without random effects).

These competitors have not been designed for the analysis of integrative genomics studies with a time-course set-up. Hence, our method is applicable to a wider class of studies. Besides this qualitative argument, we wish to have a quantitative comparison of the methods. To this end the comparison is restricted to time-course genomics studies involving only a single molecular level. Moreover, to avoid bias of any of the methods by a particular model choice, the comparison is done on two real data sets. The first is the HPV-induced transformation data from the Section ‘HPV-induced transformation’, limited to the gene expression levels only. The other data set is included in the EGDE-package [11], where gene expression has been monitored in four individuals from a control and endotoxin-treated group. Samples have been collected at five different time points: 2, 4, 6, 9 and 24 hours after treatment. One individual lacks data for the control group at two time points (4 and 6 hours). Since the timecourse method cannot deal with missing time-points this individual is omitted from the analysis. At each time point expression levels of 800 genes are available.

We now briefly describe the other methods used in the comparison: EDGE, BATS and timecourse. For a more detailed description please refer to the corresponding references.

EDGE ([11]) captures the temporal variation in the expression levels of gene *j* by means of a p-dimensional B-spline basis. Temporal differential expression is evaluated by an F-statistic measuring the goodness-of-fit of the null hypothesis (a flat or constant spline) in comparison to the alternative hypothesis. In the comparison EDGE is used with default parameter settings.

Method timecourse ([9]) uses novel multivariate empirical Bayes statistic to rank time-course gene expression profiles. Gene expression in timecourse method is assumed to follow a multivariate normal distribution with gene-specific mean and covariance. Conjugate priors are assumed on the unknown parameters. Hyperparameters of the conjugates are estimated from the data. Timecourse yields stable variance estimates by borrowing (co)variance information across genes. The posterior distribution and test statistics are obtained in an analytic form. Genes may be ranked using either Hotelling *T*^{2} or *MB*-statistics. For the comparison we used the timecourse R-package with standard settings and Hotelling *T*^{2}-statistic, due to the balancedness of the study design (equal number of replicates per gene).

Finally, BATS ([13]) which combines characteristics from previously described methods. Similar to EDGE gene expression variation over time is modelled by a polynomial function, while imposing a hierarchical Bayesian model on the parameters (as timecourse). BATS is flexible in its choice of the prior for dispersion related parameters, it offers delta, inverse Gamma and exponential priors. Significance analysis is based on the genes’ Bayes factors, while multiplicity correction is addressed in Bayesian manner ([36]).

Sensitivity and specificity of the four aforementioned methods are compared in both data sets. Hereto knowledge of the genes with true temporal differential expression is needed. In its absence we constructed a consensus set which fulfills this role. That consensus set comprises of the features identified by all four methods. Sensitivity is then the proportion of features with temporal differential expression correctly identified as such. On the other hand, specificity is the proportion of features which are correctly identified as features without differential expression over time (hence, rightly not significant). Sensitivity and specificity of each method are assessed for various numbers of significant features.

Furthermore, we compared the reproducibility of the five methods (now including reference method). Hereto each data set was divided into two equally sized groups. We assessed how well the results of the two splits coincided. This boils down to the application of each method on both splits. The overlap in significant features for each method was determined.

## Conclusions

We presented a method for the analysis of integrative (onco) genomics studies with a time-course experimental design. The method identifies temporal differential gene expression while accounting for time-varying molecular covariates like DNA copy number changes. Simultaneously, the method assesses which of these covariates significantly contributes to temporal differential gene expression. The method employs a mixed model describing the temporal changes in gene expression in terms of DNA copy number and a (low-rank thin-plate) spline which captures additional temporal variation in the transcript levels. The method estimates the parameters of this model by means of an empirical Bayes procedure that ‘borrows information’ across genes. The empirical Bayesian procedure shrinks the parameter estimates (towards zero), thus accounting for multiplicity. This shrinkage enhances the reproducibility of the results. In a direct comparison with other methods for the identification of temporal differential expression, the proposed method proved to be a strong competitor, particularly in terms of reproducibility. In addition existing methods cannot incorporate additional genomics data. Furthermore, our method is straightforwardly applicable to count data resulting from RNA-seq experiments. Application to an integrative oncogenomics study, involving HPV-transformed cell lines, confirmed genes CADM1 and SLC25A36, known to be implicated in the development of cervical cancer. The presented methodology also identified other, novel and potentially interesting genes. These are currently under investigation and will be reported in a follow-up medical paper. Preliminary pathway analysis already showed that genes identified from this dataset by tigaR but not by the other methods were enriched for genes involved in cellular transformation.

Our ongoing research concentrates on two extensions of the proposed method. First, we are considering the inclusion of microRNA data. MicroRNAs affect expression levels post-transcriptionally. However, which microRNA targets which mRNA is only partially known. Hence, integration of temporal microRNA expression data also needs to address the problem of selecting the microRNAs targets. With the number of microRNAs known and typically measured in time-course integrative genomics studies being larger than the number of samples (# time points × # cell lines) this adds an additional layer of complexity to the problem.

The second extension comprises the integration of pathway information. This requires a multivariate formulation of the model for temporal changes in gene expression. Next to DNA copy number changes now the changes in transcript levels of other genes in the pathway may need to be included. A key challenge here is to ‘borrow information’ within and between pathways.

The methodology described in this paper is implemented in the R-package tigaR available upon request from the first author (v.miok@vumc.nl).

## Declarations

### Acknowledgements

This research was funded by a grant from the VU University Medical Center-Cancer Center Amsterdam (VUMC-CCA, project CCA2011-5-02).

## Authors’ Affiliations

## References

- Steenbergen RDM, Kramer D, Braakhuis BJM, Stern PL, Verheijen RHM, Meijer CJLM, Snijders PJF: TSLC1 gene silencing in cervical cancer cell lines and cervical neoplasia. J Natl Cancer Inst. 2004, 96: 294-305. 10.1093/jnci/djh031.View ArticlePubMedGoogle Scholar
- Wilting SM, Snijders PJF, Meijer GA, Ylstra B, Van den Ijssel PR, Snijders AM, Albertson DG, Coffa J, Schouten JP, Van de Wiel MA, Meijer CJLM, Steenbergen RDM: Increased gene copy number at chromosome 20q are frequent in both squamous cell carcinomas and adenocarcinomas of the cervix. J Pathol. 2006, 209: 220-230. 10.1002/path.1966.View ArticlePubMedGoogle Scholar
- Henken FE, Wilting SM, Overmeer RM, Van Rietschoten JG, Nygren AOH, Errami A, Schouten JP, Meijer CJLM, Snijders PJF, Steenbergen RDM: Sequential gene promoter methylation during HPV-induced cervical carcinogenesis. Br J Cancer. 2007, 97: 1457-1464. 10.1038/sj.bjc.6604055.View ArticlePubMed CentralPubMedGoogle Scholar
- Vogelstein B, Kinzler KW: Cancer genes and the pathways they control. Nat Med. 2004, 10: 789-799. 10.1038/nm1087.View ArticlePubMedGoogle Scholar
- Hanash S: Integrated global profile of cancer. Nat Rev Canc. 2004, 4: 638-644. 10.1038/nrc1414.View ArticleGoogle Scholar
- Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001, 96: 1151-1160. 10.1198/016214501753382129.View ArticleGoogle Scholar
- Lönnstedt I, Speed TP: Replicated microarray data. Statistica Sinica. 2002, 12: 31-46.Google Scholar
- Eckel JE, Gennings C, Chinchilli VM, Burgoon LD, Zacharewski TR: Empirical Bayes gene screening tool for time-course or dose-response microarray data. J Biopharm Stat. 2004, 14: 647-670. 10.1081/BIP-200025656.View ArticlePubMedGoogle Scholar
- Tai YC, Speed TP: A multivariate empirical Bayes statistics for replicated microarray time course data. Ann Stat. 2006, 34: 2387-2412. 10.1214/009053606000000759.View ArticleGoogle Scholar
- Bar-Joseph Z, Gerber GK, Lee TI, Rinaldi NJ, Yoo JY, Robert F, Gordon DB, Fraenkel E, Jaakkola TS, Young RA, Gifford DK: Computational discovery of gene modules and regulatory networks. Nat Biotechnol. 2003, 21: 1337-1324. 10.1038/nbt890.View ArticlePubMedGoogle Scholar
- Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW: Significance analysis of time course microarray experiments. Proc Natl Acad Sci U S A. 2005, 102: 12837-12842. 10.1073/pnas.0504609102.View ArticlePubMed CentralPubMedGoogle Scholar
- Hong F, Li H: Functional hierarchical models for identifying genes with different time-course expression profiles. Biometrics. 2006, 62: 534-544. 10.1111/j.1541-0420.2005.00505.x.View ArticlePubMedGoogle Scholar
- Angelini C, De Canditiis D, Mutarelli M, Pensky M: A Bayesian approach to estimation and testing in time-course microarray experiments. Stat Appl Genet Mol Biol. 2007, 6: 1-30.Google Scholar
- Bierkens M, Krijgsman O, Wilting SM, Bosch L, Jaspers A, Meijer GA, Meijer CJLM, Snijders PJF, Ylstra B, Steenbergen RDM: Focal aberrations indicate EYA2 and hsa-miR-375 as oncogene and tumor suppressor in cervical carcinogenesis. Gene Chromosome Canc. 2013, 52: 56-68. 10.1002/gcc.22006.View ArticleGoogle Scholar
- Van Wieringen WN, Unger K, Leday GGR, Krijgsman O, de Menezes R, Ylstra B, Van de Wiel MA: Matching of array CGH and gene expression microarray features for the purpose of integrative genomic analysis. BMC Bioinformatics. 2012, 13: 80-10.1186/1471-2105-13-80.View ArticlePubMed CentralPubMedGoogle Scholar
- Crainiceanu CM, Ruppet D, Wand MP: Bayesian analysis for penalized spline regression using WinBUGS. J Stat Software. 2005, 14: 1-47.View ArticleGoogle Scholar
- Van de Wiel MA, Leday GGR, Pardo L, Rue H, Van der Vaart AW, Van Wieringen WN: Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics. 2013, 14: 113-128. 10.1093/biostatistics/kxs031.View ArticlePubMedGoogle Scholar
- Rue H, Matrino S, Chopin N: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximation. J Roy Stat Soc B Stat Meth. 2009, 71: 319-392. 10.1111/j.1467-9868.2008.00700.x.View ArticleGoogle Scholar
- Ruppert D, Wand MP, Carroll RJ: Semiparametric regression during 2003–2007. Electron J Stat. 2009, 3: 1193-10.1214/09-EJS525.View ArticlePubMed CentralPubMedGoogle Scholar
- Peel D, McLachlan GJ: Robust mixture modeling using the t distribution. Stat Comput. 2000, 10: 339-348. 10.1023/A:1008981510081.View ArticleGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995, 57: 289-300.Google Scholar
- Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: 3-Google Scholar
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139-140. 10.1093/bioinformatics/btp616.View ArticlePubMed CentralPubMedGoogle Scholar
- Van Wieringen WN, Berkhof J, Van de Wiel MA: A random coefficients model for regional co-expression associated with DNA copy number. Stat Appl Genet Mol Biol. 2010, 9: 1-30.Google Scholar
- Spiegelhalter DJ, Best NG, Carlin BP, van der Liner A: Bayesian measures of model complexity and fit. J Roy Stat Soc B Stat Meth. 2002, 64: 583-639. 10.1111/1467-9868.00353.View ArticleGoogle Scholar
- Steenbergen RDM, Walboomers JMM, Meijer CJLM, Van der Raaij-Helmer EMH, Parker JN, Chow LT, Broker TR, Snijders PJF: Transition of human papillomavirus type 16 and 18 transfected human foreskin keratinocytes towards immortality: activation of telomerase and allele loss at 3p, 10p, 11q and/or 18q. Oncogene. 1996, 13: 1249-1257.PubMedGoogle Scholar
- Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004, 5: 557-572. 10.1093/biostatistics/kxh008.View ArticlePubMedGoogle Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.View ArticlePubMedGoogle Scholar
- Huber W, Von Heydebreck A, Sültmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002, 18: 96-104. 10.1093/bioinformatics/18.suppl_1.S96.View ArticleGoogle Scholar
- Overmeer RM, Henken FE, Snijders PJF, Claassen-Kramer D, Berkhof J, Helmerhorst TJM, Heideman DAM, Wilting SM, Murakami Y, Ito A, Meijer CJLM, Steenbergen RDM: Association between dense CADM1 promoter methylation and reduced protein expression in high-grade CIN and cervical SCC. J Pathol. 2008, 215: 388-397. 10.1002/path.2367.View ArticlePubMedGoogle Scholar
- Wilting SM, de Wilde J, Meijer CJL, Berkhof J, Yi Y, Van Wieringen WN, Braakhuis BJM, Meijer GA, Ylstra B, Snijders PJF, Steenbergen RDM: Integrated genomic and transcriptional profiling identifies chromosomal loci with altered gene expression in cervical cancer. Gene Chromosome Canc. 2008, 47: 890-905. 10.1002/gcc.20590.View ArticleGoogle Scholar
- Oshlack A, Robinson MD, Young MD: From RNA-seq reads to differential expression results. Genome Biol. 2010, 11: 220-10.1186/gb-2010-11-12-220.View ArticlePubMed CentralPubMedGoogle Scholar
- Lindenbergh-Van der Plas M, Martens-de Kemp SR, de Maaker M, Van Wieringen WN, Ylstra B, Agami R, Cerisoli F, Leemans CR, Braakhuis BJM, Brakenhoff RH: Identification of lethal microRNAs specific for head and neck cancer. Clin Cancer Res. 2013, 19: 5647-5657. 10.1158/1078-0432.CCR-12-2295.View ArticlePubMedGoogle Scholar
- Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007, 23: 2881-2887. 10.1093/bioinformatics/btm453.View ArticlePubMedGoogle Scholar
- Mutarelli M, Cicatiello L, Ferraro L, Grober OMV, Ravo M, Facchiano AM, Angelini C, Weisz A: Time-course analysis of genome-wide gene expression data from hormone-responisve human breast cancer cells. BMC Bioinformatics. 2008, 9: 12-10.1186/1471-2105-9-12.View ArticleGoogle Scholar
- Abramovich F, Angelini C: Bayesian maximum a posteriori multiple testing procedure. Indian J Stat. 2006, 68: 436-460.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.