Identifying significant temporal variation in time course microarray data without replicates

Background An important component of time course microarray studies is the identification of genes that demonstrate significant time-dependent variation in their expression levels. Until recently, available methods for performing such significance tests required replicates of individual time points. This paper describes a replicate-free method that was developed as part of a study of the estrous cycle in the rat mammary gland in which no replicate data were collected. Results A temporal test statistic is proposed that is based on the degree to which data are smoothed when fit by a spline function. An algorithm is presented that uses this test statistic together with a false discovery rate method to identify genes whose expression profiles exhibit significant temporal variation. The algorithm is tested on simulated data, and is compared with another recently published replicate-free method. The simulated data consist of both genes with known temporal dependencies and genes drawn from a null distribution. The proposed algorithm identifies a larger percentage of the time-dependent genes for a given false discovery rate. Use of the algorithm in a study of the estrous cycle in the rat mammary gland resulted in the identification of genes exhibiting distinct circadian variation. These results were confirmed in follow-up laboratory experiments. Conclusion The proposed algorithm provides a new approach for identifying expression profiles with significant temporal variation without relying on replicates. When compared with a recently published algorithm on simulated data, the proposed algorithm appears to identify a larger percentage of time-dependent genes for a given false discovery rate. The development of the algorithm was instrumental in revealing the presence of circadian variation in the virgin rat mammary gland during the estrous cycle.


Background
In recent years, there has been considerable interest in using gene expression microarray data to study the dynamic behavior of cells. Microarrays allow researchers to take a "snapshot" of the state of a cell by measuring the mRNA expression levels of thousands of genes simultaneously. By taking multiple such "snapshots" at different times, one gains a dynamic picture of how expression levels change over time. A review article by Bar-Joseph [1] gives an excellent summary of many of the issues involved and methods developed for analyzing time course microarray data. Early analysis techniques applied methods that were originally designed for analyzing static data. But there are important differences between static experiments and time course experiments, which have motivated the development of specialized methods for analyzing time course data. In static experiments, data are collected for a number of different experimental conditions. There may be little or no mathematical relationship between these conditions, so they are usually represented as categorical data. In such experiments, it is essential to have several replicates from each condition. In time course experiments, time is a quantitative variable, so the order of the data and the spacing between time points matters. This difference can be exploited to develop more powerful techniques for analyzing time course data. Moreover, there is no longer an inherent requirement for replicates.
Based on this insight, we designed a microarray study of the estrous cycle of the rat mammary gland in which microarray data were collected at distinct time points without replicates. As part of this study, we developed an algorithm for identifying genes with significant temporal variation that does not rely on replicates. The algorithm is based on fitting the data with B-splines, which has a smoothing effect. A test statistic was developed that measures the magnitude of this smoothing effect relative to the overall variation in the data. This test statistic is then used in a false discovery rate procedure to identify significant genes.
The method described here, which we shall refer to as Method 1, has some notable similarities with a method recently proposed in [2], which we shall call Method 2. Both methods involve fitting the data with a B-spline model and neither method requires replicates. But there are some important differences between the two methods. First, Method 2 is based upon generalizing traditional analysis of variance methodologies to time series analysis. In contrast, Method 1 is motivated by the smoothing effect resulting from fitting the data by a spline function. It is designed explicitly to identify lower frequency variation in the data, which is not affected as much by the smoothing. Another difference is the technique used to assess statistical significance. In Method 2, a bootstrapping method is used to estimate a null distribution of the test statistic for each gene. In contrast, our method estimates a null distribution by recalculating the test statistic on a permuted data set.

Algorithm description
To describe our algorithm, we assume that gene expression data were collected at T time points denoted t_1, ..., t_T, given in nondecreasing order (i.e., t_1 ≤ t_2 ≤ ... ≤ t_T). Letting G denote the number of genes, Y denotes the G × T array whose ij-th entry Y_ij is the normalized log-expression level of the ith gene at time point t_j. Y_i denotes the ith row of Y, which will be referred to as the log-expression profile of gene i.
Conceptually, we can view the data as being composed of both time-dependent and time-independent variation. Specifically, we assume the following model of the data:

Y_ij = f_i(t_j) + ε_ij,    (1)

where f_i(t) is a continuous function of time, and ε_ij represents the time-independent variability, which may be due to noise, or to other influences such as sample-to-sample heterogeneity.
The function f_i(t) summarizes the influences of many different time-dependent biological processes, such as cell cycles, circadian rhythms, developmental patterns, hormonal fluctuations, etc. Any such influences that occur at too high a frequency (relative to the sampling rate) cannot be distinguished from noise. So we assume the biologist is only interested in detecting the low frequency variation in f_i.
With this in mind, we loosely define a gene to exhibit significant temporal variation if its expression profile exhibits "significant" low frequency variation. To translate this concept into a statistical test, we fit the expression profile of each gene with a "smooth" function of time ϕ_i(t) using B-splines.

Splines
Splines are piecewise polynomial functions, which are defined with respect to a nondecreasing sequence of knots τ_1 ≤ τ_2 ≤ ... ≤ τ_{M+1}. Between two distinct consecutive knots, the spline function is a polynomial of a specified degree. The order of the spline is defined to be one greater than the degree of the polynomials. Typically, the interior knots are chosen to be distinct time points, whereas for technical reasons, the first and last knots are repeated K times, where K is the order of the spline. For data approximation purposes, a convenient method for defining splines is to define a basis for the set of piecewise polynomial functions of a given order. A popular choice of basis is the B-spline basis, which is defined using the Cox-de Boor recursion formula [3] as follows:

b_{m,1}(t) = 1 if τ_m ≤ t < τ_{m+1}, and 0 otherwise;
b_{m,k}(t) = ((t − τ_m)/(τ_{m+k−1} − τ_m)) b_{m,k−1}(t) + ((τ_{m+k} − t)/(τ_{m+k} − τ_{m+1})) b_{m+1,k−1}(t).    (2)

In this formula, k represents the order of the spline. This representation is particularly convenient for fitting splines to a set of data points. The goal is to determine for each gene i the coefficients C_i = [c_1, ..., c_M] which give the best approximation of the log-expression profile Y_i in the least-squares sense. The first step is to define the "spline collocation" matrix S, which stores the values of the basis functions evaluated at the sample time points. The entries of S are defined by S_mj = b_m(t_j). The least-squares approximation is then calculated by the equation

C_i = Y_i S^+,    (3)

where S^+ denotes the Moore-Penrose pseudoinverse of S (see [4]).
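The fitting step described above can be sketched in Python as follows. This is a minimal illustration assuming NumPy; the function names `bspline_basis` and `fit_splines` are illustrative, with basis functions 0-indexed rather than 1-indexed as in the text.

```python
import numpy as np

def bspline_basis(m, k, knots, t):
    """Cox-de Boor recursion: value of the m-th (0-indexed) B-spline
    basis function of order k at time t."""
    if k == 1:
        # Indicator of the half-open knot interval; the final nontrivial
        # interval is closed so that t equal to the last knot is covered.
        if knots[m] <= t < knots[m + 1]:
            return 1.0
        if t == knots[-1] and knots[m] < knots[m + 1] == knots[-1]:
            return 1.0
        return 0.0
    left = 0.0
    if knots[m + k - 1] > knots[m]:
        left = (t - knots[m]) / (knots[m + k - 1] - knots[m]) \
               * bspline_basis(m, k - 1, knots, t)
    right = 0.0
    if knots[m + k] > knots[m + 1]:
        right = (knots[m + k] - t) / (knots[m + k] - knots[m + 1]) \
                * bspline_basis(m + 1, k - 1, knots, t)
    return left + right

def fit_splines(Y, times, knots, k):
    """Least-squares fit C = Y S+ and smoothed profiles Z = C S, where
    S is the spline collocation matrix S[m, j] = b_m(t_j)."""
    M = len(knots) - k  # number of B-spline basis functions
    S = np.array([[bspline_basis(m, k, knots, t) for t in times]
                  for m in range(M)])
    C = Y @ np.linalg.pinv(S)  # Moore-Penrose pseudoinverse
    return C, C @ S
```

With a clamped order-2 knot sequence, the basis functions are the familiar "hat" functions, and a linear profile is reproduced exactly by the fit.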

Test statistic
The spline approximation described above smooths out rapid fluctuations in the log-expression levels of a gene, while preserving longer term trends. This observation motivates the definition of the following test statistic, which is inversely related to the magnitude of the smoothing effect:

ρ_i = Var(Z_i) / Var(Y_i),    (4)

where Z_i := [ϕ_i(t_1), ϕ_i(t_2), ..., ϕ_i(t_T)] is the vector of interpolated function values, ϕ_i(t) is the approximating spline function for gene i, and Var denotes the variance. Intuitively, a small value of ρ corresponds to a large smoothing effect, which suggests that any long-term temporal trends are small relative to the overall variation in the log-expression levels. In contrast, a large value of ρ indicates the presence of a meaningful temporal trend.
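Given the raw profile Y_i and its smoothed counterpart Z_i, the statistic reduces to a one-line variance ratio; a minimal sketch assuming NumPy:

```python
import numpy as np

def rho(Y_i, Z_i):
    """Test statistic: variance retained after smoothing, relative to the
    total variance of the raw profile. Small rho -> heavy smoothing, i.e.
    little low-frequency signal; large rho -> a trend survives smoothing."""
    return np.var(Z_i) / np.var(Y_i)
```

For example, a profile smoothed to a constant yields ρ = 0, while a profile left unchanged by the fit yields ρ = 1.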
To determine which genes demonstrate significant temporal variation, we use a false discovery rate procedure in conjunction with the test statistic ρ. To apply this procedure, it is necessary to approximate an expected distribution for ρ under the assumption that there is no temporal variation. This is accomplished by creating a permuted data set Ỹ, which is generated by reordering the columns of the original data matrix Y in such a way that columns that were originally close together become far apart in Ỹ.
To make this more precise, a permutation vector π is specified, which is simply a rearrangement of the integers 1, ..., T. Then define Ỹ_ij = Y_{i,π_j}, where π_j is the jth component of π.
The ρ statistic is calculated for each "gene" in this permuted data set. The resulting values are then sorted, yielding an estimated null distribution. The estimated distribution is used to calculate p-values for each gene as follows: for gene i, the p-value is calculated by

p_i = L_i / G,

where L_i is the number of ρ values from the permuted data set that are larger than ρ_i. Finally, a false discovery rate procedure [5] is used to choose significant genes.
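The p-value computation amounts to counting, for each observed ρ_i, the null statistics that exceed it; a vectorized sketch assuming NumPy:

```python
import numpy as np

def permutation_pvalues(rho_obs, rho_null):
    """p-value for each gene: fraction of rho values from the permuted
    data set that are larger than the observed rho (L_i / G)."""
    rho_null = np.sort(np.asarray(rho_null))
    G = len(rho_null)
    # searchsorted(..., side='right') counts null values <= each rho_i,
    # so G minus that count is L_i, the number strictly larger.
    L = G - np.searchsorted(rho_null, rho_obs, side="right")
    return L / G
```

Sorting the null statistics once makes each lookup logarithmic in G rather than linear.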

Choosing a permutation vector
The permutation vector π should be chosen so that columns that are close together in the original data matrix Y will be far apart in the permuted matrix Ỹ. Because many biological processes are cyclic, we define the distance between two indices in a way that accounts for wrapping around. Specifically, we define

d(i, j) = min(|i − j|, T − |i − j|).

A candidate permutation vector π is then assigned a score s(π) that penalizes consecutive entries π_j and π_{j+1} that are close together in this circular distance. The permutation vector π is chosen to be a local minimum of s(π), which is found using the following local search procedure. Given a permutation vector π, let π^(i,j) denote the permutation vector resulting from swapping the ith and jth entries of π.

Local search procedure
Step 1: Generate a random permutation vector π.

Step 2: If some swapped vector π^(i,j) satisfies s(π^(i,j)) < s(π), replace π by such a swap.

Step 3: Repeat Step 2 until no swap decreases the score, and return the resulting π.
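The local search can be sketched as follows. Since the exact scoring function s(π) is not reproduced in this excerpt, the `score` below uses an assumed penalty (the sum of inverse circular distances between consecutive entries); any score that decreases as consecutive entries move apart would fit the same search loop.

```python
import itertools
import random

def circ_dist(i, j, T):
    """Distance between time indices, accounting for wrap-around."""
    return min(abs(i - j), T - abs(i - j))

def score(pi):
    """Assumed scoring function: penalize consecutive entries of pi that
    were close together, circularly, in the original ordering."""
    T = len(pi)
    return sum(1.0 / circ_dist(pi[j], pi[j + 1], T) for j in range(T - 1))

def local_search(T, seed=0):
    """Step 1: random start; then repeatedly accept pairwise swaps that
    decrease the score, until pi is a local minimum."""
    rng = random.Random(seed)
    pi = list(range(T))
    rng.shuffle(pi)
    best = score(pi)
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(T), 2):
            pi[i], pi[j] = pi[j], pi[i]      # try swapping entries i, j
            s = score(pi)
            if s < best:
                best, improved = s, True
            else:
                pi[i], pi[j] = pi[j], pi[i]  # undo the swap
    return pi
```

Because every accepted swap strictly decreases the score, and only finitely many permutations exist, the loop is guaranteed to terminate at a local minimum.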

Test results
To validate our algorithm, we performed three sets of tests on simulated data.

p-value tests
In the first set of tests, we compared the p-values calculated by the two methods on 9 simulated data sets. Each data set corresponds to an underlying temporal variation with different frequency and amplitude, specified by parameters ω and α. (See (5) in Section 5.1 for details).
The results are shown in Figures 1, 2, and 3. Figure 1 corresponds to the null case, where there are no temporal dependencies in the data. The graphs in Figure 2 correspond to cases where there is a temporal dependency whose frequency is relatively low, and the graphs in Figure 3 correspond to higher frequency temporal dependencies. In each graph, the simulated genes are sorted in order of increasing p-values for Method 1.

False discovery rate tests
The second set of tests analyzed 8 simulated data sets. In each data set, half of the genes were generated with time-dependent variation, and half were generated from the null distribution. For each method, we determined for each gene the smallest false discovery rate that would result in that gene being selected. By sorting the genes according to these threshold rates, we calculated the number of true discoveries and the number of false discoveries that would result from every possible fdr rate. The results are shown in Figure 4.
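If, as is common, the Benjamini-Hochberg procedure is the fdr method in use, the "smallest fdr rate that selects a gene" is its adjusted p-value; a sketch assuming NumPy (the name `bh_qvalues` is illustrative):

```python
import numpy as np

def bh_qvalues(p):
    """Benjamini-Hochberg adjusted p-values: q[i] is the smallest false
    discovery rate at which gene i would be selected."""
    p = np.asarray(p, dtype=float)
    G = len(p)
    order = np.argsort(p)
    ranked = p[order] * G / np.arange(1, G + 1)
    # Enforce monotonicity from the largest p-value downwards.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(G)
    q[order] = np.minimum(q_sorted, 1.0)
    return q
```

Sorting genes by these thresholds then directly yields the true/false discovery counts at every possible fdr rate.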
We also plotted the number of false discoveries as a function of the specified fdr rate. An example for the case α = 2, ω = .5 is shown in Figure 5. In this figure, the dotted line corresponds to the predicted number of false discoveries for each false discovery rate. Observe that the true number of false discoveries is below this line for both algorithms. Results for the other cases are similar.

Sensitivity tests
Our third set of tests studied the sensitivity of the methods to i) the number of time points T, ii) the number of knots M + 1, and iii) the order of the spline K. Each test analyzed a data set using different values of T, M, and K. (See Section 5.1 for details). The results are reported in Table 1.

Estrous cycle study
Method 1 was used to analyze a data set collected by microarray to study the estrous cycle of the virgin rat mammary gland. After preprocessing, this data set consists of expression levels of 21044 genes at 31 different time points, spread out over the 4 day estrous cycle. The application of Method 1 to this data set identified 1893 temporally significant genes. By comparison, Method 2 (using the same splines and fdr rate) identified only 871 genes. The 1893 genes identified by Method 1 were clustered using a hierarchical clustering method to generate 20 clusters, which are displayed in Figures 6 and 7.
The choice of clustering methods resulted in most of the genes (1579) being grouped into three large clusters (2, 3, and 5). Examining the individual expression profiles of genes in these clusters shows that most of these genes exhibit relatively weak, but easily discernible, low frequency variation. However, due to the large numbers of genes in these clusters, the graphs of the clusters (particularly clusters 2 and 3) do not exhibit obvious temporal patterns. While we believe that most of the genes in these clusters are temporally significant, it is also likely that these three clusters are richer in false positives. In contrast, several smaller clusters exhibit distinct circadian patterns: in clusters such as cluster 1, expression levels increase during the day and then decrease over night. The opposite behavior appears in clusters 9 and 11, for which expression levels decrease throughout the day. The presence of these circadian variations was unexpected and potentially quite significant biologically. To provide a preliminary validation of these results, two genes were selected for further study: Per1 (period homolog 1), which appears in cluster 1, and BMal (brain-muscle-ARNT-like protein), which is in cluster 9. For these two genes, quantitative real time PCR was performed, as described in [6]. The results of this analysis, shown in Figure 8, confirm that both genes are under circadian regulation and follow the patterns predicted by the clusters in which they are found.

Frequency response
The results of the p-value tests shown in Figures 1, 2, and 3 demonstrate that the methods are sensitive to the frequency of the underlying temporal variation. Figure 1 corresponds to the null case (α = 0), where there are no temporal dependencies in the data. In this case, both methods yield essentially identical p-values, whose graphs lie on the 45 degree diagonal line. This line corresponds to the uniform distribution, which is the expected distribution of p-values arising from a null distribution. The graphs in Figure 2 correspond to cases where there is a temporal dependency whose frequency is relatively low. In these cases, the p-values of Method 1 are lower than the p-values of Method 2. As a result, Method 1 identifies a greater number of significant genes for any specified p-value. In contrast, the graphs in Figure 3 correspond to higher frequency temporal dependencies. In these cases, the p-values for Method 1 are higher, suggesting that Method 2 would be more sensitive. However, in these cases, the p-value distribution for both methods lies above the 45 degree diagonal line corresponding to the uniform distribution. Thus, when used in conjunction with a false discovery rate method, neither method would identify any of the genes as significant.
This phenomenon is surprising when viewed from the perspective of Method 2. That method was developed as a generalization of ANOVA techniques to the analysis of time series data. The test statistic is a straightforward generalization of the F statistic, and its null distribution is estimated using bootstrapping. From this perspective, one would expect that data containing time dependencies would, on average, produce lower p-values than the null distribution. But just the opposite is observed for high frequency data.
In contrast, the observed phenomenon is quite natural from the viewpoint of Method 1. Here, we are explicitly trying to measure the smoothing effect of the spline approximation on the data. When high frequency variation is present, the effect of smoothing will be large, resulting in larger p-values than one would expect from purely random data.
The impact of frequency on the calculated p-values is more exaggerated in Method 1. That is, for low frequency variation, Method 1 yields lower p-values than Method 2; and for high frequency variation, Method 1 yields higher p-values. Since neither method will detect high frequency variation, Method 1 appears to be more sensitive where it matters, i.e., in detecting low frequency variation.

Figure 3: Estimated p-values (high frequency signals).
This observation is confirmed by the results of the false discovery rate tests shown in Figures 4 and 5. Note that for any given false discovery rate, Method 1 detects a larger percentage of the time-dependent genes, with the exception of the high frequency cases, for which no genes were detected by either method.

Sensitivity tests
Figure 4: True discoveries as a function of fdr rate.

The average estimated p-values reported in Table 1 give an indication of how well each method does in identifying temporally significant genes, with smaller values indicating better performance. We make the following observations:
1. Both methods are sensitive to the total number of knots. In particular, the methods perform best when there are at least 9 knots (which divide the range of times into 8 knot-intervals). For example, for T = 49, the p-values are very small for R = 4 (which corresponds to 13 knots) and R = 6 (9 knots), but get significantly worse for R = 8 (7 knots). This makes intuitive sense because the data were generated from a function which cycles 4 times, so 9 knots corresponds to two knot-intervals per cycle. Using fewer knots would make it very difficult to fit the data with the splines.
2. On the other hand, there can also be too many knots. For large numbers of time points (for example T = 65), the average p-values are worse for R = 2. This suggests that when there are relatively few time points per knot, the splines may be overfitting the data, resulting in poorer performance.
3. Both methods are relatively insensitive to the order of the splines.

Estrous cycle study
In the estrous cycle study, it is interesting to note that all genes identified by Method 2 were also identified by Method 1. To assess whether the additional genes identified by Method 1 are biologically significant, we examined each of the 20 clusters and identified which genes in the cluster were also identified by Method 2. The results are shown in Table 2.
The genes found only by Method 1 are spread across most of the clusters (with only some of the very small clusters missing). As a general trend, Method 2 identified a larger percentage of genes for clusters with more distinct temporal patterns. For example, Method 2 identified all the genes in cluster 4, which has a very distinct pattern; whereas it identified less than half of the genes in clusters 2, 3, and 5, which exhibit far less obvious temporal patterns. This suggests that many of the additional genes found by Method 1 may be false positives. However, there are some clear exceptions to this general trend. For example, Method 2 found only 56% of the genes in cluster 6, despite the fact that the genes in this cluster have nearly identical expression profiles, and exhibit a very distinct temporal pattern. Clearly, Method 1 is identifying biologically significant genes that were missed by Method 2.

Conclusion
The algorithm presented in this paper provides a method for identifying significant temporal variation in time course gene expression data without requiring replicates. This experimental paradigm enables researchers to collect data at more time points, which often provides greater biological insight. This was evident in these experiments where the procedure allowed the discovery of significant circadian variation in a substantial number of genes.
Based on our simulation tests, when compared with the method described in [2], our method appears to be more sensitive to detecting low frequency (relative to the sampling rate) time-dependencies for any given false discovery rate. Neither method is effective at identifying higher frequency temporal trends.
Specifically, the results of the sensitivity tests indicate that to detect variation of a given frequency, at least 8 time points per cycle are needed.
The method was used in a study of the estrous cycle in rat mammary glands, revealing the presence of genes with circadian variation. The circadian patterns for two genes, Per1 and Bmal, were validated by quantitative real time PCR analysis. Additional patterns have been validated in follow-up experiments. The identification of circadian genes by our method highlights the advantage of evaluating non-replicated time course data in comparison to replicated data. Specifically, if replicated and unreplicated studies had been undertaken with the same number of subjects (animals), the circadian patterns would not have been identified in the replicated study.
In the p-value tests, we generated 9 data sets, each consisting of the log-expression values of 2000 genes generated according to the model

Y_ij = α sin(ω t_j) + ε_ij,  ε_ij ~ N(0, 1).    (5)

Here, α and ω are parameters that control the magnitude and frequency of the time-dependent variation in the data, and N(0, 1) is the Gaussian distribution with mean 0 and variance 1.

Figure 5: False discoveries as a function of fdr rate.
The null data set was generated using α = 0. The other 8 data sets were generated using all combinations of the following parameter values α = 1 or 2, and ω = .1, .25, .5 and 1.
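Such a data set can be generated as sketched below. This is an illustration assuming NumPy; the precise form of the generative equation is an assumption here (a sinusoid of amplitude α and frequency ω with a per-gene random phase, plus unit Gaussian noise, consistent with the roles of α and ω described above).

```python
import numpy as np

def simulate_genes(G, times, alpha, omega, seed=0):
    """Simulated log-expression profiles: assumed sinusoidal signal of
    amplitude alpha and frequency omega, random phase per gene, plus
    N(0, 1) noise. alpha = 0 gives the null data set."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(G, 1))
    signal = alpha * np.sin(omega * np.asarray(times)[None, :] + phase)
    return signal + rng.standard_normal((G, len(times)))
```

Setting alpha = 0 reduces the model to pure N(0, 1) noise, matching the null case.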
In the FDR tests, we generated 2500 time-dependent genes using Equation (5) and 2500 genes from the null distribution, for a combined data set of 5000 simulated genes. For each method, we determined for each gene the smallest fdr rate that would result in that gene being selected. By sorting the genes according to these threshold fdr rates, we could easily calculate the number of true discoveries and the number of false discoveries that would result from every possible fdr rate. These tests were performed for the same 8 combinations of α and ω that were used in the p-value tests.

Figure 7: Clusters 13-20.
In the sensitivity tests, we generated 500 time-dependent genes from Equation (5).

Estrous cycle study
The data analyzed in the estrous cycle study were generated as follows. Sprague-Dawley female rats were obtained from Taconic Farms (Germantown, New York, USA) as adult virgins 70 ± 3 days of age. Vaginal lavages were performed daily to identify rats with regular 4 day estrous cycles, as previously described [7]. Rats were included in the study after confirmation of at least two consecutive and regular 4-day estrous cycles. Cervical tissue, harvested and prepared for histological analysis, was used to confirm the stage of estrous at the time of euthanasia. Collecting each tissue sample required sacrificing the rat, so each microarray measured expression levels from a different rat. This resulted in significant heterogeneity between samples that was independent of time. Rats were sacrificed at approximately 1.5 hour intervals between 7 am and 7 pm over the four day estrous cycle. Blood samples were taken for hormone levels, and a mammary gland was dissected and flash frozen to be used for later mRNA extraction. Left mammary gland chains 4-6, with lymph nodes removed, were quick frozen in liquid nitrogen for biochemical and molecular analyses. In order to minimize the possible adverse impact of low-level noise on the analysis, any gene that had a consensus expression level reading of less than 10 at any time point was deleted. This procedure resulted in a data set consisting of 21044 genes at 31 time points. Raw data are deposited in the National Center for Biotechnology Information Gene Expression Omnibus (GEO series GSE12289). Prior to applying our algorithm, we performed the following normalization procedure to the expression values:

Y_ij = log X_ij − μ_i,

where X_ij is the consensus expression level for gene i at the jth time point, and μ_i is the average value of log X_ij for the ith gene.
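The filtering and normalization steps can be sketched as follows (NumPy assumed; the natural logarithm is an assumption here, since the base of the log is not stated).

```python
import numpy as np

def preprocess(X):
    """Drop any gene whose consensus expression falls below 10 at some
    time point, then normalize: Y_ij = log X_ij - mu_i, where mu_i is
    the mean of log X_ij over time for gene i."""
    keep = (X >= 10).all(axis=1)
    logX = np.log(X[keep])
    return logX - logX.mean(axis=1, keepdims=True), keep
```

Centering each gene's log profile removes per-gene baseline differences, so the subsequent analysis responds only to variation over time.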
The algorithm was applied to the normalized data Y_ij, using a 2nd order (piecewise linear) spline defined with the following knot sequence: 7, 7, 17.6, 31.6, 43, 58, 67, 79.3, 88.1, 88.1. These knots correspond to the first and last time points collected on each day.
Significant genes were identified using a false discovery rate of f = .25. The genes were clustered by applying a hierarchical clustering method to the spline coefficients calculated by equation (3). A complete linkage method was used, and the distance between two genes i and j was defined as the Euclidean distance between their coefficient vectors, ‖C_i − C_j‖. From this, 20 clusters were identified.
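This clustering step can be sketched with SciPy's hierarchical clustering routines (the function name `cluster_genes` is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_genes(C, n_clusters=20):
    """Complete-linkage hierarchical clustering of the spline coefficient
    vectors C (one row per significant gene), with Euclidean distance
    between genes; the tree is cut into n_clusters clusters."""
    tree = linkage(pdist(C, metric="euclidean"), method="complete")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

Clustering the low-dimensional coefficient vectors, rather than the raw 31-point profiles, groups genes by the shape of their smoothed temporal trends.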