# More powerful significant testing for time course gene expression data using functional principal component analysis approaches

- Shuang Wu
^{1}and - Hulin Wu
^{1}Email author

**14**:6

https://doi.org/10.1186/1471-2105-14-6

© Wu and Wu; licensee BioMed Central Ltd. 2013

**Received: **7 August 2012

**Accepted: **7 November 2012

**Published: **16 January 2013

## Abstract

### Background

One of the fundamental problems in time course gene expression data analysis is to identify genes associated with a biological process or a particular stimulus of interest, like a treatment or virus infection. Most of the existing methods for this problem are designed for data with longitudinal replicates. But in reality, many time course gene experiments have no replicates or only have a small number of independent replicates.

### Results

We focus on the case without replicates and propose a new method for identifying differentially expressed genes by incorporating the functional principal component analysis (FPCA) into a hypothesis testing framework. The data-driven eigenfunctions allow a flexible and parsimonious representation of time course gene expression trajectories, leaving more degrees of freedom for the inference compared to that using a prespecified basis. Moreover, the information of all genes is borrowed for individual gene inferences.

### Conclusion

The proposed approach turns out to be more powerful in identifying time course differentially expressed genes compared to the existing methods. The improved performance is demonstrated through simulation studies and a real data application to the *Saccharomyces cerevisiae* cell cycle data.

## Keywords

## Background

Time course microarray and RNA-seq experiments are increasingly used to study biological phenomena that evolve in a temporal fashion. Unlike the static experiment which captures only a snapshot of the gene expression, the time course experiment monitors the gene expression levels over several time points in a biological process, allowing investigators to study dynamic behaviors of the genes. One goal of such experiments is to identify genes associated with a biological process of interest or a particular stimulus, like a therapeutic treatment or virus infection. The differentially expressed genes can be defined as genes with expressions changed significantly with respect to time or across multiple conditions.

The time course gene expression data typically exhibit features such as high dimensionality, short time course, few or no replicates, missing values, large measurement errors, correlations between observations over time, etc. Many of the multivariate techniques for analyzing such data, for example, SAM [1, 2], ANOVA [3, 4] and empirical Bayes [5], suffer from serious limitations when facing missing data or non-uniformly sampled data. They also fail to account for correlations between measurements from the same gene and do not facilitate the removal of noise from the measured data. In addition, the timing information of when the measurements are taken is not utilized and the inherent temporal structure of the time course data is ignored. A hidden Markov model has been proposed by [6] and [7], where the observed gene profiles are considered to be influenced by an underlying Markov process. The computation of this model involves a large number of parameters, which can be difficult if there are no replicated data, and it can not be applied if the observation time points are distinct for different experimental groups.

Functional data analysis approaches view the expression profile of each gene as a smooth function of time, and the time course measurements are collected as discrete observations from the function that are contaminated by noisy signals [8, 9]. A key step is to create an estimate of the gene expression curve from the noisy functional data and this usually involves representing the expression curve as a linear combination of a finite number of basis functions, such as polynomials [10] and B-splines [11-13]. Another popular approach for representing functional curves is the Functional Principal Component Analysis (FPCA) [14, 15]. In the FPCA model, the basis functions are estimated from the observed data and the data-adaptive basis has the favorable property to flexibly characterize the major modes of variation in the data. So fewer number of basis functions are needed to capture the shape of gene expression patterns than that using the pre-specified basis.

Many of the existing functional methods are designed for time course data with longitudinal replicates. [13] used a functional hierarchical model and empirical Bayes techniques to determine differentially expressed genes, but the estimates of the model parameters can be very unstable if the number of replicates is small. [16] proposed a functional ANOVA model and [14] adopted the FPCA model for identifying differentially expressed genes across two conditions. In both methods, the model is fitted for one gene at a time to estimate the gene-specific group mean and the covariance structure, which is computationally intensive and may result in overfitted models for a small sample size. [15] adopted a similar approach as [13] to impose a mixture distribution on the gene-specific variation and employed an indicator to reflect whether a gene is differentially expressed. All of these methods require longitudinal replicates in data and only apply to two group comparisons.

Time course data with longitudinal replicates are costly and rather rare in reality. Many of the published time course data have no replicates or only a small number of independent replicates [17-19]. The EDGE method proposed by [12] is a comprehensive approach that is suitable for data with or without replicate, and for both single group and multiple group tests. It represented gene expression trajectories using natural cubic splines and then compared the goodness-of-fit of the model under the null hypothesis to that under the alternative hypothesis. The null distribution of test statistics was approximated by bootstrap. [20] recently extended this method under a permutation-based multiple testing framework. Specifically for time course data without replicate, [21] developed a statistics to measure the signal-to-noise ratio by comparing the energies of the smoothing convolution and differential convolution of the expression profiles. A common problem to these existing methods is that the test statistics are constructed for each gene separately. So they may not be powerful enough to identify differentially expressed genes for short time series data and data without replicates.

In this work, we propose a unified approach to model the gene profiles using the techniques of FPCA, and to identify differentially expressed genes in both single group test and multiple group test. Our methodology is motivated by the gene expression data without replicate, so we will focus on this case in this paper, although our method can also be easily adapted to accommodate data with replicates. We argue that our method can improve the power in identifying differentially expressed genes compared to existing methods. First, using the eigen-basis enables a parsimonious modeling of the gene expression curves, so we have more degrees of freedom for the inference than that using a pre-specified basis. Moreover, we propose to estimate the expression curve of a gene by borrowing strength across all the genes, which leads to a more powerful inference than that using the information of one gene only.

The remainder of the paper is organized as follows. We first describe the FPCA model for representing the gene expression curves and then elaborate a hypothesis testing method based on random permutations to identify differentially expressed genes in both single group and multiple group scenarios. The proposed method is compared with several existing methods via the analysis of the *Saccharomyces cerevisiae* cell cycle data from [17] and simulation studies. Lastly, we summarize the proposed method and discuss possible extensions.

## Methods

### FPCA for time course gene expression data

*X*(

*t*) of a single gene is a smooth function in the time interval [

*a*,

*b*], with mean function

*μ*(

*t*) = E (

*X*(

*t*)) and covariance function

*G*(

*s*,

*t*) = cov (

*X*(

*s*),

*X*(

*t*)). Under mild conditions, we can assume that

*X*possesses Karhunen-Loéve representation [22] with representation

where *ϕ*_{
l
} are sequences of orthonormal eigenfunctions with non-increasing eigenvalues *λ*_{
l
}, satisfying ∑*l* *λ*_{
l
}<*∞* and *G*(*s*,*t*)= ∑*l*=1*λ*_{
l
}*ϕ*_{
l
}(*s*)*ϕ*_{
l
}(*t*). These eigenfunctions reflect the direction of major shape deviations from the mean function and the random coefficient *ξ*_{
l
}, often referred to as the functional principal component (FPC) score, indicates how much a gene deviates from the mean function in the direction of *ϕ*_{
l
}.

where *n* is the number of genes, *J* is the number of experimental groups, and *K*_{
i
j
} is the number of sampling time points for gene *i* in the *j*-th experimental group. The noises *ε*_{
i
j
k
} are assumed to be i.i.d. random variables with mean 0 and variance *σ*^{2}. In this paper, we adopt the PACE method - principal component analysis through conditional expectation proposed by [23] to estimate ${\widehat{X}}_{\mathit{\text{ij}}}\left(t\right)$ from the observed noisy data. This approach borrows information across all the genes to predict individual expression curves and is more efficient than the gene-specific smoothing method when dealing with thousands of genes simultaneously.

*μ*(

*t*), which is not efficient, as the magnitude difference may obscure some other interesting variations. So we first subtract the mean expression of each gene from the observed data and apply FPCA to the centered data. The empirical covariances are calculated from the aggregated data of all genes as

where ${\widehat{\mu}}_{\mathit{\text{ij}}}=\sum _{k=1}^{{K}_{\mathit{\text{ij}}}}{Y}_{\mathit{\text{ijk}}}/{K}_{\mathit{\text{ij}}}$. We then obtain the estimate of the covariance function *G*(*s*,*t*) by applying two-dimensional local linear smoothers to (3). Note that the diagonal elements *C*_{
j
}(*t*_{
k
},*t*_{
k
}) should not be used in estimating *G*(*s*,*t*) because they are contaminated by the noise signal. When replicates are present, we can apply the above procedure to the averaged expression data ${\stackrel{\u0304}{Y}}_{\mathit{\text{ijk}}}=\frac{1}{{M}_{j}}\sum _{m}{Y}_{\mathit{\text{ijkm}}}$, where *M*_{
j
} is the number of replicates for group *j*.

*L*eigenfunctions to approximate ${\widehat{X}}_{\mathit{\text{ij}}}$ and

*L*can be chosen by the information criteria such as AIC, BIC, or by scree plot or fraction of variation explained (FVE), similarly as in the PCA in multivariate analysis. More details for the estimation procedure can be found in [23]. With the estimates of all the model components in hand, we can now represent the individual gene expression trajectories as

### Identifying differentially expressed genes

#### One group case

*J*=1), we are often interested in discovering genes whose expression profiles are time-dependent. We want to test whether the expression curve is constant for gene

*i*,

*i*=1,…,

*n*, i.e.

*H*

_{i0}, the constant is estimated as the sample mean ${\widehat{X}}_{i}^{0}\left(t\right)={\widehat{\mu}}_{i}=\sum _{k=1}^{{K}_{i}}{Y}_{\mathit{\text{ik}}}/{K}_{i}$, and under

*H*

_{i1}, the curve estimate ${\widehat{X}}_{i}^{1}\left(t\right)$ is obtained as (4). The test statistic is a modified

*F*-statistic, which compares the goodness-of-fit of the null model to the alternative model:

where ${\mathit{\text{RSS}}}_{i}^{0}$ and ${\mathit{\text{RSS}}}_{i}^{1}$ are the residual sum of squares under the null and the alternative models, respectively. In a typical time course gene expression data, the number of time points is the same for all genes within one experimental group. Since we use the same number of eigen-basis functions to approximate the gene expression curves, dividing the numerator and denominator of (6) by the corresponding degrees of freedom will not change the ordering of the test statistics. In case that there are missing values in the data, one could adjust (6) correspondingly.

This statistic can also be viewed as the signal-to-noise ratio of each gene. For genes with a low signal level, variance in *F*_{
i
} can be high because of small values of ${\mathit{\text{RSS}}}_{i}^{1}$. The small constant *δ* in the denominator can help stabilize the variance of *F*_{
i
}. A similar idea has been adopted in [1]. In this work, we set $\delta ={\widehat{\sigma}}^{2}$, the estimated variance of the noisy signal in (2). Since cov(*Y*_{
i
k
},*Y*_{
i
l
})=cov(*X*_{
i
}(*t*_{
i
k
}),*X*_{
i
}(*t*_{
i
l
}))+*σ*^{2}*δ*_{
k
l
}, where *δ*_{
k
l
}=1 if *k*=*l* and 0 otherwise, we can estimate *σ*^{2} by smoothing (3) with and without the diagonals *C*_{
j
}(*t*_{
k
},*t*_{
k
}). Specifically, *σ*^{2} can be estimated by the averaged difference between the local linear smoother along the diagonal of the raw covariance and a local quadratic smoother along the direction perpendicular to the diagonal. See [23] for more details.

There is a sizable literature on the asymptotic distribution of *F*_{
i
}, for example [24]. However, such methods are generally not applicable to time course gene expression data, as the number of measurements for each gene is usually very small. In order to generate the null distribution of (*F*_{1},…,*F*_{
n
}), we propose using a permutation test. Each permutation sample is generated by randomly matching the expression measurements of *n* genes with their sampling times. If there are replicates available, the expression measurements at the same time point are permuted as a group. For example, let *Y*_{
i
k
m
} be the *m*-th measurements for gene *i* at time *t*_{
k
}, *m*=1,…,*M*. The permutation samples are obtained by randomly shuffling the time index *k*. To facilitate the computing efficiency, we use the eigenfunctions obtained from the observed data for the permutation samples.

*F*-statistics of the permutation samples computed from (6), the

*p*-value for gene

*i*can be defined as

*B*is the number of permutation samples,

*I*(·) is an indicator function and ${F}_{i}^{\left(b\right)}$ is the statistic computed from the

*b*-th permutation. [12] proposed another definition of the

*p*-value by considering the permutation statistics from all genes:

This definition has the advantage that the ordering of the test statistics is preserved in the ordering of the *p*-values. We also find in our simulations that (8) leads to fewer false positives than (7). Therefore, we adopt (8) in the real data application.

When we apply the proposed procedure to identify differentially expressed genes, it is necessary to consider the multiple testing adjustment because *n* hypotheses are tested simultaneously and the number of genes *n* is usually very large. A commonly used strategy is to control the false discovery rate (FDR), which has been studied in various literature, including [2, 25] and [26]. We adopt the one proposed in [25] since it is easy to compute and widely accepted.

#### Multiple group case

*i*can be written as

The estimates ${\widehat{X}}_{\mathit{\text{ij}}}^{1}\left(t\right)$ under *H*_{i1} are obtained as (4) by using the data from group *j* only. Under *H*_{i0}, the group-free estimates ${\widehat{X}}_{i}^{0}\left(t\right)$ can be obtained using the pooled data from *J* groups. The residual sum of squares under the null and the alternative models are calculated as ${\mathit{\text{RSS}}}_{i}^{0}=\sum _{j=1}^{J}\sum _{k=1}^{{K}_{j}}({Y}_{\mathit{\text{ijk}}}-{\widehat{X}}_{i}^{0})$ and ${\mathit{\text{RSS}}}_{i}^{1}=\sum _{j=1}^{J}\sum _{k=1}^{{K}_{j}}({Y}_{\mathit{\text{ijk}}}-{\widehat{X}}_{\mathit{\text{ij}}}^{1})$, respectively, and the *F*-statistics are computed as (6), where $\delta =max\left(\underset{j}{\overset{2}{\widehat{\sigma}}}\right)$ with ${\widehat{\sigma}}_{j}^{2}$ being the estimated variance of the noisy signal for the *j*-th group.

We again use a permutation test to obtain the null distribution of (*F*_{1},…,*F*_{
n
}). Without loss of generality, we assume that the measurement times are the same among the *J* groups and at least one observation is available at each time point for each group. A permutation sample is generated by permuting the pooled gene expression data at each time point, i.e., the data {*Y*_{
i
j
k
},1≤*j*≤*J*} are randomly partitioned into *J* groups. If replicates are available, i.e., {*Y*_{
i
j
k
m
},1≤*m*≤*M*_{
j
},1≤*j*≤*J*}, the measurements are randomly partitioned into *J* groups of sizes *M*_{1},…,*M*_{
J
}. The calculation of *p*-values and the multiple testing adjustment are the same as described in the previous subsection. This approach can also be extended to situations where the measurement time points vary across different experimental groups. In this case, we can divide the time interval [*a*,*b*] into small bins so that at least one observation falls into each bin for each group. We then permute the gene expression data within each bin instead of at each time point. This extended approach is further illustrated with the yeast cell cycle data in the next section.

## Results and discussion

### Yeast cell cycle data

In this section, we applied the proposed method to *Saccharomyces cerevisiae* cell cycle gene expression data reported in [17]. The dataset includes the gene expression measurements of *n*=10928 probe sets for both wild type and cyclin mutant cells at 30 different time points. In the following presentation, we refer to the probe sets as genes. The clock time points are aligned to the corresponding lifeline positions, covering about two cell cycles in the wild type and about 1.5 cell cycles in the cyclin mutant. [17] found that 1275 genes were transcribed periodically and 835 of these periodic genes showed changes in expression behaviors in the cyclin mutant. In the following analysis, we use the numerical lifeline positions as indicators for time. The measurement times are irregular for both the wild type and the cyclin mutant, and in addition the time points are different among these two groups. We take log2 transformation of the original data.

#### Differentially expressed genes in wild type cells

Using *B*=10,000 permutations, we identified 1180 genes with *F* *D* *R*=0.01, in which 750 are shared with the list of 1275 genes identified by [17]. We also find that these 750 shared genes constitute 82% of the 500 top ranked genes in Orlando’s list. The discrepancy between genes identified by [17] and by our method may due to the differences in the selecting criteria. [17] emphasized on the periodic pattern of the gene expressions, while our method aims to maximize the ratio of the variation around the mean expression and the noise level. So genes included in Orlando’s list but missed by our method may have large noises in the observed data. On the other hand, genes identified by our method could have non-periodic patterns, for example, genes with large loadings on the first eigenfunction may exhibit linear trends.

**Identified gene numbers by the proposed method, EDGE and Sohn’s method for testing non-flat gene expressions in the wild type cells, adjusted at FDR levels of 0.01 and 0.05**

FDR = 0.01 | FDR = 0.05 | |||||
---|---|---|---|---|---|---|

Genes | Overlap | Overlap | Genes | Overlap | Overlap | |

identified | w/Orlando | w/proposed | identified | w/Orlando | w/proposed | |

Proposed | 1180 | 750 | - | 1958 | 901 | - |

EDGE | 1076 | 617 | 809 | 1705 | 783 | 1322 |

Sohn’s | 985 | 590 | 767 | 1636 | 769 | 1291 |

*F*-statistic (6) is a signal-to-noise ratio, these genes with small “signals” would still have large statistics due to small noises. In addition, the expression curves estimated by the gene-specific B-spline smoothing tend to be under-smoothed (Figure 4, last two rows), which makes the denominator of the

*F*-statistic even smaller. These genes are not identified by our method because we add a small constant in the denominator of (6) to stabilize its variance.

#### Comparing genes in wild type and cyclin mutant cells

We next apply our method to identify genes with different expression patterns in the wild type and cyclin mutant cells. We restrict this analysis to the 1275 periodic genes identified by [17]. Since the maximum lifeline position for the cyclin mutant is 243.8, we consider the gene expressions within interval [100,244], which includes 15 time points for the wild type and 22 time points for the cyclin mutant. The observation times are not equally spaced and are not the same for the two cell types.

Since the observation times are different for the wild type and the cyclin mutant, we can not use the usual permutation. Instead, we divide the lifeline domain [100,244] into small bins with length 10 each. Within each bin, there are at least one observation from each of the wild type and the cyclin mutant groups (except for [120,130], which includes no observations from either the wild type or the cyclin mutant). The observations are permuted within each bin, using the permutation strategy for data with replicates as described in Section “Multiple group case”.

*B*=10,000 permutations, we identified 883 genes for an FDR of 0.01, in which 631 are included in the 835 genes identified by [17] with changed expression patterns in the cyclin mutant. These genes are likely to be directly or indirectly regulated by Clb-CDK, and since Clb-CDK activities are known to be essential for triggering the transcriptional programme, we may not observe any periodic expression patterns in these genes for the cyclin mutant cells. For this data, ten B-spline basis functions are used in EDGE and Sohn’s methods for smoothing the expression data. The numbers of differential genes identified by EDGE and Sohn’s methods are 704 and 522 for

*F*

*D*

*R*=0.01, respectively, sharing 451 and 329 genes with those identified by [17] (Table 2).

**Identified gene numbers by the proposed method, EDGE and Sohn’s method for comparing gene expression profiles between the wild type and the cyclin mutant, adjusted at FDR levels of 0.01 and 0.05, respectively**

FDR = 0.01 | FDR = 0.05 | |||||
---|---|---|---|---|---|---|

Genes | Overlap | Overlap | Genes | Overlap | Overlap | |

identified | w/Orlando | w/proposed | identified | w/Orlando | w/proposed | |

Proposed | 883 | 631 | - | 1086 | 735 | - |

EDGE | 704 | 524 | 679 | 839 | 600 | 833 |

Sohn’s | 522 | 410 | 516 | 758 | 557 | 755 |

### Simulation studies

Simulation studies are carried out to compare the performance of our method to that of the EDGE method and the permutation-based method by [20]. We consider a single group test of non-flat gene expressions and a two group test of differential gene expressions across groups. In both cases, the data are simulated to mimic the *Saccharomyces cerevisiae* cell cycle data set.

In the single group case, the non-differential genes have model *X*_{
i
}(*t*)=0, and the differential genes have ${X}_{i}\left(t\right)=\sum _{l=1}^{3}{\xi}_{\mathit{\text{il}}}{\varphi}_{l}\left(t\right)$, where ${\varphi}_{1}\left(t\right)=-\sqrt{2/({b}_{1}-{a}_{1})}cos\left(\right.2\Pi (t-{a}_{1})/({b}_{1}-{a}_{1})\left)\right.$, ${\varphi}_{2}\left(t\right)=\sqrt{2/({b}_{1}-{a}_{1})}sin\left(\right.2\Pi (t-{a}_{1})/$ (*b*_{1}−*a*_{1})) and ${\varphi}_{3}\left(t\right)=-\sqrt{2/({b}_{1}-{a}_{1})}cos\left(\right.4\Pi (t-{a}_{1})/({b}_{1}-{a}_{1})\left)\right.$, *t*∈[*a*_{1},*b*_{1}]. The coefficients *ξ*_{
i
l
} are i.i.d. normal r.v.’s with mean 0 and variance *λ*_{
l
} with *λ*_{1}=4, *λ*_{2}=2 and *λ*_{3}=1. For each gene, the gene expression profiles are simulated at the same time points as the wild type yeast data, so there are 21 observations in [*a*_{1},*b*_{1}]=[100,305]. The noisy signal in the observed data is simulated as i.i.d. normal r.v. with mean 0 and variance *σ*^{2}=0.01.

In the two group case, the data are generated from model (2), where *J*=1,2, and the number of observations is *K*_{i1}=15 for the “wild” group and *K*_{i2}=22 for the “mutant” group, *i*=1,…,*n*. The sampling times are the same as those in the yeast cell cycle data, locating within interval [*a*_{2},*b*_{2}]=[100,244]. For the true gene expression profiles, we consider model ${X}_{\mathit{\text{ij}}}\left(t\right)=\sum _{l=1}^{2}({\xi}_{\mathit{\text{ijl}}}+{\gamma}_{\mathit{\text{ijl}}}){\psi}_{l}\left(t\right)$, where ${\psi}_{1}\left(t\right)=-\sqrt{2/({b}_{2}-{a}_{2})}cos\left(\right.2\Pi (t-{a}_{2})/({b}_{2}-{a}_{2})\left)\right.$, ${\psi}_{2}\left(t\right)=\sqrt{2/({b}_{2}-{a}_{2})}sin\left(\right.2\Pi (t-{a}_{2})/({b}_{2}-{a}_{2})\left)\right.$, *t*∈[*a*_{2},*b*_{2}]. The coefficients ${\xi}_{\mathit{\text{ijl}}}\sim \mathrm{i.i.d.}\phantom{\rule{1em}{0ex}}\mathcal{N}(0,{\lambda}_{\mathrm{\xi l}})$, ${\gamma}_{\mathit{\text{ijl}}}\sim \mathrm{i.i.d.}\phantom{\rule{1em}{0ex}}\mathcal{N}(0,{\lambda}_{\mathrm{\gamma l}})$, with (*λ*_{ξ1},*λ*_{ξ2})=(4,2) and (*λ*_{γ1},*λ*_{γ2})=(5,3). For non-differentially expressed genes, we let *γ*_{
i
j
l
}=0. The error term is generated from $\mathcal{N}(0,0.01)$.

For both cases, We generate *n*=1000 genes and the proportion of differential genes is set to be *Π*_{0}=0.05,0.2,0.5, respectively. The number of principal components in our method is selected so that the fraction of variation explained (FVE) exceeds 90%. This criterion selects the correct number of components for over 90% of time under all simulation scenarios. We tried the method proposed by [12], which fits all the top “eigen-genes”, to select the number of B-spline basis for EDGE and Sohn’s method. For over 80% of time, we ended up with selecting 19 basis for the single group case and 13 basis for the two group case, leading to severely under-smoothed gene expression curves and very few differential genes identified. We therefore manually select 6 bases for the single group case and 5 bases for the two group case for EDGE and Sohn’s methods, which seem to provide the best results when experimenting from 5 bases to 10 bases.

^{ b }’ in Tables 3 and 4, and by (8), ‘Proposed

^{ a }’ in Tables 3 and 4. We find that it is better to use (8) for computing the

*p*-values when the proportion of differential genes is small, and throughout all settings, using (8) provides smaller false positive rates. We also find that under all scenarios, the proposed method clearly outperforms the EDGE method and Sohn’s method, especially when the proportion of differential genes is small. The fact that our method has the highest sensitivity is in line with the finding that our method identified the most genes in the application of yeast cell cycle data.

**Comparison of the proposed method, EDGE and Sohn’s method for the single group test by FPR (proportion of falsely rejected hypotheses over the total number of rejected hypotheses) and sensitivity (proportion of true positives correctly identified)**

Proposed | Proposed | EDGE | Sohn’s | |||
---|---|---|---|---|---|---|

| FDR = 0.01 | FPR | 0.0016 | 0.0127 | 0.0100 | 0.0091 |

Sensitivity | 0.7606 | 0.6864 | 0.4742 | 0.5160 | ||

FDR = 0.05 | FPR | 0.0180 | 0.0541 | 0.0459 | 0.0514 | |

Sensitivity | 0.8440 | 0.7858 | 0.6378 | 0.6616 | ||

| FDR = 0.01 | FPR | 0.0003 | 0.0084 | 0.0079 | 0.0078 |

Sensitivity | 0.7716 | 0.7851 | 0.6210 | 0.6302 | ||

FDR = 0.05 | FPR | 0.0070 | 0.0409 | 0.0363 | 0.0408 | |

Sensitivity | 0.8644 | 0.8667 | 0.7652 | 0.7641 | ||

| FDR = 0.01 | FPR | 0.0002 | 0.0053 | 0.0041 | 0.0051 |

Sensitivity | 0.7731 | 0.8329 | 0.7021 | 0.7031 | ||

FDR = 0.05 | FPR | 0.0025 | 0.0251 | 0.0238 | 0.0244 | |

Sensitivity | 0.8709 | 0.9049 | 0.8285 | 0.8261 |

**Comparison of the proposed method, EDGE and Sohn’s method for the two group test by FPR (proportion of falsely rejected hypotheses over the total number of rejected hypotheses) and sensitivity (proportion of true positives correctly identified)**

Proposed | Proposed | EDGE | Sohn’s | |||
---|---|---|---|---|---|---|

| FDR = 0.01 | FPR | 0.0071 | 0.0168 | 0.0079 | 0.0055 |

Sensitivity | 0.6186 | 0.5638 | 0.4858 | 0.4706 | ||

FDR = 0.05 | FPR | 0.0432 | 0.0624 | 0.0478 | 0.0377 | |

Sensitivity | 0.7048 | 0.6810 | 0.5850 | 0.5756 | ||

| FDR = 0.01 | FPR | 0.0028 | 0.0099 | 0.0076 | 0.0032 |

Sensitivity | 0.6530 | 0.6566 | 0.5819 | 0.5153 | ||

FDR = 0.05 | FPR | 0.0276 | 0.0440 | 0.0400 | 0.0200 | |

Sensitivity | 0.7538 | 0.7579 | 0.6888 | 0.6505 | ||

| FDR = 0.01 | FPR | 0.0013 | 0.0062 | 0.0047 | 0.0011 |

Sensitivity | 0.6701 | 0.7127 | 0.6370 | 0.5364 | ||

FDR = 0.05 | FPR | 0.0134 | 0.0286 | 0.0238 | 0.0092 | |

Sensitivity | 0.7850 | 0.8089 | 0.7465 | 0.6817 |

## Conclusions

We proposed a new method for significance analysis of time course gene expression data by integrating a functional principal component method into a hypothesis testing framework. Our method can be applied to both single group and multiple group scenarios, and has shown to be more powerful in identifying temporally differentially expressed genes than the existing methods through real data application and various simulation studies. Moreover, our method is generally applicable, no matter the time course expression data have replicates or not, while most of the existing methods require replicates or even longitudinal replicates.

FPCA is a flexible nonparametric method for analyzing continuous trajectory data. The time course data are modeled through a data-based eigen-representation and the eigen-basis functions reflect the major modes of variation in the data. As illustrated in the yeast cell cycle data, the eigenfunctions often have a direct biological interpretation and offer a visual tool to assess the main directions in which the gene expression profiles vary. In addition, these eigenfunctions are orthogonal basis, so they carry information in a most efficient way and the representation of temporal trajectories can be more parsimonious than using the predetermined basis. This is particularly important in the significance analysis of time course gene expression data, because we could reserve more degrees of freedom for the inference.

In the EDGE and Sohn’s methods, the test statistic is constructed as a goodness-of-fit measure for each gene separately. Although our method uses a similar statistic, part of our statistic involves information from all genes, as the gene-specific expression curve and the variance stabilizer *δ* in the *F*-statistic (6) are estimated by borrowing strength from all the genes. This strategy can improve the power of the inference, especially for short time course data or data without replicates. In a simulation for one group test with only 11 measurements and no replicates for each gene (results not shown), our method can identify about 50% of differential genes correctly for an FDR of 0.01 and about 72% for an FDR of 0.05, but the EDGE and Sohn’s methods can hardly identify any differential genes.

Our method is also computationally fast. For the yeast cell cycle data, on a dual core processor 2.99GHz PC with 1.95 GB RAM, it took 385 seconds for the proposed method, 885 seconds for EDGE and 493 seconds for Sohn’s method to complete the one group test with *n*=10928 genes and *B*=10000 permutations/bootstraps, and 51, 176 and 189 seconds, respectively, for the two group test with *n*=1275 and *B*=10000. This ensures that our method can be applied to analyze very large genome wide data sets.

In our method, the covariance function is assumed to be the same for all genes in the same experimental group. This strategy has also been adopted by many other works in the analysis of time course gene expression data, for example [13] and [15]. A similar assumption was adopted in [6], where the within-gene correlation is implied in the presence of first-order dependence structure of the underlying Markov process and it is assumed to be identical for all genes. Although our method is presented assuming a homogeneous covariance, it can be easily extended to accommodate the heterogeneity in the covariance of gene expressions. We can first cluster the data and compute the covariance of gene expressions for each cluster, and then combine them to obtain the covariance of the mixed population [15].

When smoothing the covariance function, the bandwidth is chosen by generalized cross-validation (GCV). The overall shapes of the estimated covariance and eigenfunctions are quite stable over a range of bandwidth values in our numerical examples. The smoothing parameter may have effects on the power of the inference procedure, but the detailed investigation on this problem is beyond the scope of this paper. Intersected minds are referred to [27] and references therein for further discussions. Another related topic is to incorporate the inter-gene correlations in the multiple testing procedure, which has been discussed in [28, 29] and [30]. Our method can be applied in combination with any of these multiple testing adjustment methods. However, the effect of different multiple testing adjustments on the results of significant testing is not the emphasis of this paper and could be an interesting topic for future research.

## Declarations

### Acknowledgements

This research was partially supported by the NIH grants HHSN272201000055C, AI087135, and the University of Rochester CTSI pilot award (UL1RR024160) from the National Center for Research Resources.

## Authors’ Affiliations

## References

- Tusher V, Tibshirani R, Chu C: Significance analysis of microarrays applied to the ionizing radiation response. Proc Nat Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.PubMed CentralView ArticlePubMedGoogle Scholar
- Storey J, Tibshirani R: SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. The analysis of gene expression data: methods and software. Edited by: Parmigiani G, Garrett ES, Irizarry R, Zeger S. 2003, New York: Springer, 272-290.View ArticleGoogle Scholar
- Park T, Yi S, Lee S, Lee S, Yoo D, Ahn J, Lee Y: Statistical tests for identifying differentially expressed genes in time course microarray experiments. Bioinformatics. 2003, 19: 694-703. 10.1093/bioinformatics/btg068.View ArticlePubMedGoogle Scholar
- Smyth G: Limma: linear models for microarray data. Bioinformatics and computational biology solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer, 12837-12842.Google Scholar
- Tai Y, Speed T: A multivariate empirical Bayes statistic for replicated microarray time course data. Ann Stat. 2006, 34: 2387-2412. 10.1214/009053606000000759.View ArticleGoogle Scholar
- Yuan M, Kendziorski C: Hidden Markov Models for Microarray Time Course Data in Multiple Biological Conditions. J Am Stat Assoc. 2006, 101 (476): 1323-1332. 10.1198/016214505000000394.View ArticleGoogle Scholar
- Sun W, Wei Z: Multiple Testing for Pattern Identification, With Applications to Microarray Time-Course Experiments. J Am Stat Assocsss. 2011, 106: 73-88. 10.1198/jasa.2011.ap09587.View ArticleGoogle Scholar
- Ramsay JO, Silverman BW: Springer Series in Statistics. Functional data analysis. 2005, New York: SpringerView ArticleGoogle Scholar
- Coffey N, Hinde J: Analysing time-course microarray data using functional data analysis - A review. BMC Bioinf. 2011, 10: 23-Google Scholar
- Xu XL, Olson JM, Zhao LP: A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington’s disease transgenic model. Human Mol Genet. 2002, 11 (17): 1977-1985. 10.1093/hmg/11.17.1977.View ArticleGoogle Scholar
- Bar-Joseph Z, Gerber G, Simon I, Gifford D, Jaakkola T: Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes. Proc Nat Acad Sci USA. 2003, 100: 10146-10151. 10.1073/pnas.1732547100.PubMed CentralView ArticlePubMedGoogle Scholar
- Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW: Significance analysis of time course microarray experiments. Proc National Acad Sci USA. 2005, 102 (36): 12837-12842. 10.1073/pnas.0504609102.View ArticleGoogle Scholar
- Hong F, Li H: Functional hierarchical models for identifying genes with different time-course expression profiles. Biometrics. 2006, 62: 534-544. 10.1111/j.1541-0420.2005.00505.x.View ArticlePubMedGoogle Scholar
- Liu X, Yang M: Identifying temporally differentially expressed genes through functional principal component analysis. Biostatistics. 2009, 10: 667-679. 10.1093/biostatistics/kxp022.View ArticlePubMedGoogle Scholar
- Chen K, Wang JL: Identifying differentially expressed genes for time-course microarray data through functional data analysis. Stat Biosci. 2010, 2: 95-119. 10.1007/s12561-010-9024-z.View ArticleGoogle Scholar
- Ma P, Zhong W, Liu J: Identifying differentially expressed genes in time course microarray data. Stat Biosci. 2009, 1: 144-159. 10.1007/s12561-009-9014-1.View ArticleGoogle Scholar
- Orlando D, Lin C, Bernard A, Wang J, Socolar J, Iversen E, Hartemink A, Haase S: Global control of cell-cycle transcription by coupled CDK and network oscillators. Nature. 2008, 453: 944-947. 10.1038/nature06955.PubMed CentralView ArticlePubMedGoogle Scholar
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol Biol Cell. 1998, 9 (12): 3273-3297.PubMed CentralView ArticlePubMedGoogle Scholar
- Cicatiello L, Scafoglio C, Altucci L, Cancemi M, Natoli G, Facchiano A, Iazzetti G, Calogero R, Biglia N, De Bortoli M, Sfiligoi C, Sismondi P, Bresciani F, Weisz A: A genomic view of estrogen actions in human breast cancer cells by expression profiling of the hormone-responsive trascriptome. J Mol Endocrinol. 2004, 32: 719-775. 10.1677/jme.0.0320719.View ArticlePubMedGoogle Scholar
- Sohn I, Owzar K, George SL, Kim S, Jung S: A permutation-based multiple testing method for time-course microarray experiments. BMC Bioinf. 2009, 10: 336-10.1186/1471-2105-10-336.View ArticleGoogle Scholar
- Han X, Sung WK, Feng L: Identifying differentially expressed genes in time-course microarray experiment without replicate. J Bioinf Comput Biol. 2007, 5: 281-296. 10.1142/S0219720007002655.View ArticleGoogle Scholar
- Ash RB, Gardner MF: Topics in stochastic processes. 1975, New York: Academic Press, [Probability and Mathematical Statistics, Vol. 27]Google Scholar
- Yao F, Müller HG, Wang JL: Functional data analysis for sparse longitudinal data. J Am Stat Assoc. 2005, 100 (470): 577-590. 10.1198/016214504000001745.View ArticleGoogle Scholar
- Shen Q, Faraway J: An F test for linear models with functional responses. Statistica Sinica. 2004, 14: 1239-1257.Google Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R S Soc: Ser B (Stat Methodology). 1995, 57: 289-300.Google Scholar
- Reiner A, Yekutieli D, Benjamini Y: Identifying diffrentially expressed genes using false discovery rate controlling procedure. Bioinformatics. 2003, 19: 368-375. 10.1093/bioinformatics/btf877.View ArticlePubMedGoogle Scholar
- Hart JD: Nonparametric smoothing and lack-of-fit tests. 1997, SpringerView ArticleGoogle Scholar
- Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001, 29: 1165-1188. 10.1214/aos/1013699998.View ArticleGoogle Scholar
- Klebanov L, Yakovlev A: Detecting intergene correlation changes in microarray analysis: A new approach to gene selection. Ann Appl Stat. 2007, 1: 538-559. 10.1214/07-AOAS120.View ArticleGoogle Scholar
- Hu R, Qiu X, Glazko G: A new gene selection procedure based on the covariance distance. Bioinformatics. 2010, 23: 348-354.View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.