Identification of significant periodic genes in microarray gene expression data
- Jie Chen^{1}Email author
https://doi.org/10.1186/1471-2105-6-286
© Chen; licensee BioMed Central Ltd. 2005
Received: 28 March 2005
Accepted: 30 November 2005
Published: 30 November 2005
Abstract
Background
One frequent application of microarray experiments is in the study of monitoring gene activities in a cell during cell cycle or cell division. A new challenge for analyzing the microarray experiments is to identify genes that are statistically significantly periodically expressed during the cell cycle. Such a challenge occurs due to the large number of genes that are simultaneously measured, a moderate to small number of measurements per gene taken at different time points, and high levels of non-normal random noises inherited in the data.
Results
Based on two statistical hypothesis testing methods for identifying periodic time series, a novel statistical inference approach, the C&G procedure, is proposed to effectively screen out statistically significantly periodically expressed genes. The approach is then applied to yeast and bacterial cell cycle gene expression data sets, as well as to human fibroblasts and human cancer cell line data sets, and significantly periodically expressed genes are successfully identified.
Conclusion
The C&G procedure proposed is an effective method for identifying statistically significant periodic genes in microarray time series gene expression data.
Keywords
Background
Microarray experiments are widely used for gene profiling in different cell lines, various tissues, and conditions (normal versus cancerous). High throughput microarray technologies have made it possible to study problems that range from gene regulation and mRNA stability, to pathways for genetic diseases and the discovery of target subpopulations for drug or other therapies. One frequent application of microarray experiments is in the study of monitoring gene activities in a cell during cell cycle or cell division. A new challenge to statisticians for analyzing the microarray experiments is to identify genes that are statistically significantly periodically expressed during the cell cycle. Such a challenge occurs due to the large number of genes that are simultaneously measured, a moderate to small number of measurements per gene taken at different time points, and high levels of non-normal random noises inherited in the data (Wichert [1]). Several authors, including Spellman [2], Cho [3], Shedden and Cooper [4, 5], Whitfield [6] have noticed the presence of cyclicity or periodicity of genes in their microarray data sets and used a number of ways to identify periodically expressed genes in some available yeast and human cell cycle data sets obtained by them. There are some debates concerning the methods those authors used in finding the cyclic genes and how statistically significant those cyclic genes are. Whitfield [6] established a catalog of genes periodically expressed in the human cell cycle via a series of large-scale microarray experiments. They introduced a statistic (periodicity score) for testing the inference of a periodically expressed gene. The method introduced in Whitfield [6], however, may not be effective in identifying multiple periodically expressed genes, as it did not address the multiple comparison issue and hence inflated false discovery rate (FDR). Recently, Wichert [1] proposed to use a graphical device, average periodogram, as an exploratory method to signal the presence of possible periodic genes. They showed through extensive simulations that plotting average periodogram against frequencies reveals the presence of periodic genes in the data set if there is any. They also applied Fisher's exact G-test statistic, along with the use of FDR, on the periodogram to screen out statistically significantly periodically expressed genes.
In this paper, another test statistic, the Bartlett's exact C-test statistic, for the inference of periodic time series is introduced. By combining both the G-statistic and the C-statistic, a novel statistical inference approach, the C&G procedure, is proposed to effectively screen out statistically significantly periodically expressed. The approach is then applied to yeast and bacterial cell cycle gene expression data sets, as well as to human fibroblasts and human cancer cell line data sets, and significantly periodically expressed genes are successfully identified.
Results
For testing the null hypothesis of a signal being a normal white noise against the alternative hypothesis of a signal being periodic (see Methods section), a statistical method is to use the periodigrams of the signal (see Methods section for details) to form a test statistic and calculate the p-value of the test statistic. A small p-value, smaller than a predetermined significance level, indicates the significance of the signal being periodic rather than white noise. Fisher [7] proposed a test statistic and derived the null distribution of the Fisher's G-statistic. In the context of microarray gene expression data, the observed significance value or p-value for the hypothesis testing of the periodicity of a fixed gene g, using G-statistic as the test statistic, denoted by ${p}_{g}^{G}$, can be obtained by
where ξ_{ g }is the sample realization of the G-statistic value calculated from the Fisher's G-statistic (see equation (7) in Methods section) divided by m, and L(ξ_{ g }) is the largest integer less than 1/ξ_{ g }.
A more general setting of the hypothesis is to test whether a signal is normal white noise or not. Bartlett [8] proposed a test statistic, the C-statistic (see methods), to test for the hypotheses. According to Durbin [9], the p-value for the hypothesis testing of the periodicity of a fixed gene g using Bartlett's C-statistic as the test statistic, denoted by ${p}_{g}^{C}$, can be found by
where a_{ g }= mC_{ g }, C_{ g }is given in equation (10) of the Methods section, [a_{ g }] = INT{a_{ g }}, and n = m - 1. Suppose that a large number $G$ of genes are simultaneously observed through a microarray experiment, and each gene is measured at a relatively short period, or at sparse intervals of time (say at N time points). The researcher is interested in whether any genes are expressed in a periodic pattern of some kind. As high levels of non-normal random noise may present in the data, some visual evidence of periodic gene may be simply due to random noise; and as there are usually a large number of genes ($G$ is often from several thousands to several hundreds of thousands), there is a serious concern about the false discovery rate (FDR). Therefore, a multiple comparison approach must be employed to control the FDR level. Recently, Benjamini and Hochberg [10] introduced a practical and powerful approach to multiple testing by controlling the (FDR). This approach is especially useful for multiple hypothesis testing in microarray experiments. It is a step-down type of multiple testing procedure in combination with Bonferroni approach. In light of the p-value, ${p}_{g}^{C}$, obtained using the C-statistic, the p-value ${p}_{g}^{G}$, calculated using the G-statistic, and the multiple testing procedure controlling FDR, the following method (called "C&G Procedure") is proposed for the selection of periodic gene expressions of the same period:
Step 1: Calculate ${p}_{g}^{G}$, and ${p}_{g}^{C}$ according to equations (1) and (2), respectively, for g = 1, ..., $G$.
Step 2: Let the ordered ${p}_{g}^{C}$ values be ${p}_{(1)}^{C},{p}_{(2)}^{C},\dots ,{p}_{(G)}^{C}$ with corresponding genes ${g}_{(1)}^{C},{g}_{(2)}^{C},\dots {g}_{(G)}^{C}$; and let the ordered ${p}_{g}^{G}$ be ${p}_{(1)}^{G},{p}_{(2)}^{G},\dots ,{p}_{(G)}^{G}$ with corresponding genes ${g}_{(1)}^{G},{g}_{(2)}^{G},\dots {g}_{(G)}^{G}$.
Step 3: For a given FDR level of q, let i_{ q }be the largest i for which ${p}_{(i)}^{C}\le \frac{1}{G}q$, and let j_{ q }be the largest j for which ${p}_{(j)}^{G}\le \frac{j}{G}q$.
Step 4: The intersection set $K=\{{g}_{(1)}^{C},{g}_{(2)}^{C},\dots {g}_{({i}_{q})}^{C}\}\cap \{{g}_{(1)}^{G},{g}_{(2)}^{G},\dots {g}_{({j}_{q})}^{G}\}$ then contains all the statistically significantly periodically expressed genes (of the same period). The difference set $D=\{{g}_{(1)}^{C},{g}_{(2)}^{C},\dots {g}_{({i}_{q})}^{C}\}\backslash K$ then contains possible periodic genes with different periods, or of other patterns other than periodic.
A natural question that might come up is: What is the FDR level of the identified periodic genes contained in set K? A straightforward proof leads to the conclusion that the FDR level of the identified periodic genes contained in set K of step 4 in the C&G Procedure is at most q. In other words, by using this procedure, the FDR level is not inflated. The application of the C&G Procedure is illustrated in the following four examples.
Analysis of the bacterial cell cycle data
Analysis of the yeast cell cycle data
In the second example, the gene expression data sets from the well-known yeast Saccharomyces cerevisiae microarray experiments of Spellman [2] are analyzed for the identification of significantly periodically expressed genes. The data sets can be downloaded from the Yeast cell cycle data website [14]. These four data sets were produced by three different cell cycle synchronization techniques: alpha factor arrest (producing the "alpha" gene expression data), temperature arrest (producing "cdc15" and "cdc28" gene expression data sets), and elutriation synchronization (producing the "elution" data set). The alpha data set contains complete information on 4489 genes over 18 equally spaced time points (with a time interval of 7 minutes). Using the C&G Procedure, it is found out that the C-statistic identifies 1188 genes as significant non-white noise expressions (including possible cell-cycle regulated genes) and the G-statistic identifies 473 such genes, their intersection set contains 471 significant cell-cycle regulated genes. Therefore, we claim that there are at least 471 significant periodic genes (of the same period) at FDR level of 0.05 in the alpha experiment data, and there are additional 717 genes in set D that are possibly periodic with different periods, or of other patterns other than periodic.
Number of Significant Periodic Genes Identified by C-statistics, G-statistic, Intersection Set K, and Difference Set D
Cell type | Experiment | N | G | N _{ C } | N _{ G } | N _{ K } | N _{ D } |
---|---|---|---|---|---|---|---|
C. crescentus | bacteria | 11 | 1474 | 166 | 44 | 43 | 123 |
Yeast | alpha | 18 | 4489 | 1188 | 473 | 471 | 717 |
Yeast | cdc15 | 24 | 4381 | 1636 | 788 | 779 | 857 |
Yeast | cdc28 | 17 | 1383 | 292 | 27 | 27 | 265 |
Yeast | Elution | 14 | 5766 | 1056 | 769 | 695 | 361 |
Human fibroblasts | N2 | 12 | 7077 | 1 | 2 | 1 | 0 |
Human fibroblasts | N3 | 12 | 7077 | 2 | 0 | 0 | 2 |
Human HeLa | Score1 | 12 | 15536 | 44 | 7 | 6 | 38 |
Human HeLa | Score2 | 26 | 16287 | 1351 | 154 | 153 | 1198 |
Human HeLa | Score3 | 48 | 41508 | 9702 | 6117 | 5770 | 3932 |
Human HeLa | Score4 | 19 | 40815 | 52 | 52 | 17 | 35 |
Human HeLa | Score5 | 9 | 35871 | 5 | 1 | 0 | 1 |
Analysis of human fibroblasts data
Analysis of human cancer cell line data
Number of Periodic Genes Identified by the Original Experimenters, Wichert et al. (2004), and Chen
Cell type | Experiment | Experimenter | Wichert et al. (2004) | Chen |
---|---|---|---|---|
C. crescentus | bacteria | Laub et al. (2000), identified 553 periodic genes | 44 | 43 |
Yeast | alpha | Spellman et al. (1998), | 468 | 471 |
Yeast | cdc15 | total of 800 periodic genes | 766 | 779 |
Yeast | cdc28 | identified in all of these four | 105 | 27 |
Yeast | Elution | yeast cell cycle experiments | 193 | 695 |
Human fibroblasts | N2 | Cho et al. (2001), 700 periodic | 0 | 1 |
Human fibroblasts | N3 | genes identified in N2 and N3 | 0 | 0 |
Human HeLa | Score1 | Whitfield et al. (2002), | 0 | 6 |
Human HeLa | Score2 | total of 800+ periodic genes | 134 | 153 |
Human HeLa | Score3 | identified in these five | 6043 | 5770 |
Human HeLa | Score4 | Human Cancer cell line | 56 | 17 |
Human HeLa | Score5 | experiments | 0 | 0 |
Discussion
Regarding both of the test statistics, several points need to be addressed.
First of all, the G-statistic is testing for the significance of the maximum periodogram. When the result is significant, the message conveyed to us is that the maximum periodogram is significant with the possible cause of the underlying model being periodic. On the other hand, the C-statistic utilizes a sort of standardized cumulative periodograms, and considers all periodograms' contributions towards the periodicity of the underlying model. Therefore, these two statistics are not exactly the same. Secondly, although both G-statistic and C-statistic can be used as test statistics for searching periodicity in a time series, the G-statistic method is more specific and the C-statistic method is broader in the sense that the alternative hypothesis to the null hypothesis is rather vague. In other words, for a fixed gene g, when the p-value ${p}_{g}^{G}$ is small compared with a predetermined significance level, the conclusion that this gene is a significant periodic gene according to the G-statistic can be reached; however, when the p-value ${p}_{g}^{C}$ is small, only the claim that this gene is not a white noise (might be of periodic, periodic with different period, or of other patterns other than periodic) according to the C-statistic can be drawn. Hence, one can anticipate that the C-statistic will pick up more significant genes than the G-statistic. This is valuable, especially in expensive microarray experiments, because the biologist can use the information to possibly discover genes that are of different periods, or of other pattern which they have not encountered before. Thirdly, from the definitions of the two statistics (see Methods), we can easily establish that
G_{ g } ≥ 1,0 ≤ C_{ g } ≤ 1, and G_{ g } ≥C_{ g }. (3)
Then, the fact that G_{ g }is great than its threshold value does not necessarily imply that C_{ g }is greater than its threshold value, and vise versa. In other words, from the fact given by (3), it is clear that these two statistics are not equivalent in general; there are times, however, that both tests overlap with each other. This is not surprising because the G-statistic is constructed for testing normal white noise versus periodic function, and the C-statistic method is broader in the sense that the alternative hypothesis to the null hypothesis is rather vague. One might think that the set of periodic signals identified by the G-statistic is contained in the set of genes identified by the C-statistic. It is not necessarily true for the reasons mentioned here in this section.
Furthermore, the G-statistic method is sensitive to the departure from normality as pointed in Davis [18] and Wilks [19]. Hence, when the normality assumption on the random errors is violated, the null distribution of the G-statistic will not be true in general and the p-value in (1) could be very wrong. The C-statistic method is insensitive to the departure of normality as pointed out in Durbin [9]. The two statistics can then be served as constraints for each other in order to effectively search for true periodic genes.
Empirical power of C, G, and C&G with the ratio of amplitude of signal to noise being 1 : 1
Signal type | C | G | C&G |
---|---|---|---|
sine signal with skewed noise | 81.66% | 75.25% | 75.23% |
sine signal with normal white noise | 99.09% | 97.57% | 97.57% |
Empirical power of C, G, and C&G on stronger signals
The ratio of amplitude of signal to noise | C | G | C&G |
---|---|---|---|
9:8 | 99.78% | 99.50% | 99.50% |
10:8 | 99.97% | 99.93% | 99.93% |
11:8 | 99.99% | 99.99% | 99.99% |
12:8 | 100% | 100% | 100% |
Empirical power of C, G, and C&G on weaker signals
The ratio of amplitude of signal to noise | C | G | C&G |
---|---|---|---|
7:8 | 96.00% | 91.72% | 91.72% |
6:8 | 87.39% | 78.40% | 78.40% |
5:8 | 71.95% | 56.87% | 56.82% |
4:8 | 51.03% | 34.30% | 34.01% |
Empirical false positive rate of C, G, and C&G
noise type | C | G | C&G |
---|---|---|---|
normal | 12.3% | 6.45% | 4.29% |
uniform | 13.23% | 7.60% | 4.70% |
Chi-square | 7.32% | 2.32% | 1.36% |
Finally, as the null distributions of these two statistics are all exact distributions, they work well (as long as the underlying assumptions are met) for any sample size (small or large). This characteristic makes both tests very valuable to microarray data sets as the observations obtained for each gene is usually not large in a microarray experiment.
Conclusion
In this paper a statistical C&G Procedure is proposed for identifying significantly periodically expressed genes for a desired FDR level q. This approach uses both Bartlett's C-statistic and Fisher's G-statistic to secure the actual periodic genes existing in a microarray data set. As the searching process is also a multiple testing procedure, the FDR level is used to assure that the overall false discover rate for the whole procedure is at most α. The G-statistic does assume that the sequence is Gaussian, this may not be the case for any microarray data set. Nevertheless, a log-transformed expression data usually can satisfy the Gaussian assumption. The C-statistic is more robust towards the violation of Gaussian assumption. The advantage of the C&G Procedure thus emerges. Although the gene expression sequences in a microarray data set are usually correlated, the approach of the multiple testing with controlled FDR level does not rely on independence assumption heavily according to Benjamini and Hochberg [10]. Therefore, this C&G Procedure is a promising statistical tool for finding significantly periodically expressed genes (of the same period) in a microarray data set. Other issues, such as the analysis of data measured in unevenly spaced time intervals and the size of each sequence needed for valid statistical analysis, will be topics of future investigations in order to more effectively search for significantly periodically expressed genes in a microarray data set.
Methods
Suppose that a time series is observed and one concern is the possible periodicity of this time series. To be specific in the context of gene expressions observed at time t for any fixed gene g, we denote the time series (or gene expression observed in a time course) by Y_{ g }(t) for t = 1, ..., N and g = 1, ..., $G$. To model Y_{ g }(t) with periodicity, we can assume:
Y_{ g }(t) = f_{ g }(t) + ε_{ gt },
where f_{ g }(t) is a periodic function with a smallest positive period T_{ g }for gene g, that is f_{ g }(t + T_{ g }) = f_{ g }(t) for all t; and ε_{ gt }is a sequence of non-observable random errors with mean 0 and homogenous variance σ^{2} for all g and t. For a fixed gene g, we can specifically assume that a time series gene expression is well represented by
Y_{ g }(t) = μ + A cos (ωt) + B sin(ωt) + ε_{ gt },
where A, B, and μ (known) are constants, ω is of the form 2πk/N, for k = 0,1, ..., m, with m = (N - 1)/2 for N odd and m = N/2 for N even. Given a finite realization of the time series gene expressions y_{ g }(t) (sample values or microarray expressions obtained from the experiment), we can then view y_{ g }(t) as represented by
where ω_{ k } = 2πk/N, for k = 0,1, ..., m, , and
for k = 1, ..., m and g = 1, ..., $G$. For the testing of periodicity related hypotheses of a time series, the periodogram of gene g is denned as
for k = 1, ..., m and g = 1, ..., $G$. Under the assumption that ε_{ gt }'s are identically independently distributed normal random errors with mean 0 and homogenous variance σ^{2} (that is, Y_{ g }(t) is a normal white noise), Fisher [7] proposed a G-statistic and derived the exact null distribution of G. Suppose it is of our interest to test
H_{0}: Y_{ g }(t) = μ + ε_{ gt }, (5)
versus
H_{1}: Y_{ g }(t) = μ + A cos(ωt) + B sin (ωt) + ε_{ gt }, (6)
then for a fixed gene g, the Fisher's G-statistic is given by
For details on the G test statistic, its null distribution and the percentage points of the G test statistics, please refer to Fisher [7], Davis [18], Wilks [19], and Priestley [20].
Other test statistics for searching "hidden periodicity" in a time series have been proposed as part of spectral analysis (Fuller [21]) in the literature. For the following more general setting of hypothesis testing of
H_{0}: Y_{ g }(t) is a normal white noise, (8)
versus
H_{0}: Y_{ g }(t) is not a normal white noise, (9)
for fixed gene g, Bartlett [8] proposed to use a C-statistic as a test statistic to fulfill the task of such hypothesis testing procedure. For a fixed gene g, we obtain the C-statistic as
with
for g = 1, ..., $G$. Durbin ([9, 22]) provided the details of the null distribution of the test statistic C under the normality assumption.
According to Fisher [7], the observed significance value, or p-value ${p}_{g}^{G}$, for the hypothesis testing of the periodicity of a fixed gene g using G-statistic as the test statistic is expressed as in (1), or again
where ξ_{ g }is the sample realization of the G-statistic value calculated from (7) divided by m, and L(ξ_{ g }) is the largest integer less than 1/ξ_{ g }. Meanwhile, according to Durbin [9], the p-value, ${p}_{g}^{C}$, for the hypothesis testing of the periodicity of a fixed gene g using C-statistic as the test statistic is given in (2), or specifically,
where a_{ g }= mC_{ g }, C_{ g }is given in (10), [a_{ g }] = INT{a_{ g }}, and n = m - 1.
The C&G Procedure utilizes both of the test statistics and gives a practical way for identifying significant periodic genes in massive microarray data.
Declarations
Acknowledgements
This research is supported in part by the NSF grant DMS-0426148. Part of this work is done while the author is a visiting scientist at the Stowers Institute for Medical Research (SIMR) and is on leave from University of Missouri-Kansas City. The author thanks two anonymous referees whose comments greatly improved the manuscript.
Authors’ Affiliations
References
- Wichert S, Folianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics 2004, 20: 5–20. 10.1093/bioinformatics/btg364View ArticlePubMedGoogle Scholar
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Bostein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998, 9: 3273–3297.PubMed CentralView ArticlePubMedGoogle Scholar
- Cho RJ, Huang M, Dong H, Steinmetz L, Sapinoso L, Hampton G, Elledge SJ, Davis RW, Lockhardt DJ, Campbell MJ: Transcriptional regulation and function during the human cell cycle. Nature Genetics 2001, 27: 48–54.PubMedGoogle Scholar
- Shedden K, Cooper S: Analysis of cell-cycle-specific gene expression in human cells as determined by microarrays and double-thymidine block synchronization. Proceedings of the National Academy of Science USA 2002, 99: 4379–4384. 10.1073/pnas.062569899View ArticleGoogle Scholar
- Shedden K, Cooper S: Analysis of cell-cycle gene expression in Saccharomyces cerevisiae using microarrays and multiple synchronization methods. Nucleic Acids Research 2002, 30: 2920–2929. 10.1093/nar/gkf414PubMed CentralView ArticlePubMedGoogle Scholar
- Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Marese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of Genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell 2002, 13: 1977–2000. 10.1091/mbc.02-02-0030.PubMed CentralView ArticlePubMedGoogle Scholar
- Fisher RA: Tests of significance in Harmonic Analysis. Proceedings of the Royal Society of London, Series A 1929, 125: 54–59.View ArticleGoogle Scholar
- Bartlett MS: An introduction to Stochastic Processes with Special Reference to Methods and Applications. 2nd edition. Cambridge University Press, Cambridge; 1966.Google Scholar
- Durbin J: Tests of serial independence based on the cumulated periodogram. Bulletin of the International Statistical Institute 1967, 42: 1039–1049.Google Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society 1995, B57: 289–300.Google Scholar
- Laub MT, McAdams HH, Feldblyum T, Fraser CM, Shapiro L: Global analysis of the genetic network controlling a bacterial cell cycle. Science 2000, 290: 2144–2148. 10.1126/science.290.5499.2144View ArticlePubMedGoogle Scholar
- Bacterial cell cycle data[http://caulobacter.stanford.edu/CellCycle/DownloadData.htm]
- NCBI[http://www.ncbi.nlm.nih.gov]
- Yeast cell cycle data[http://genome-www.stanford.edu/cellcycle]
- Yeast genome website[http://www.yeastgenome.org]
- Human fibroblasts data[http://www-sequence.stanford.edu/human_cell_cycle]
- Human cancer cell line data[http://genome-www.stanford.edu/Human-CellCycle/Hela]
- Davis HT: The Analysis of Economic Time Series. Principia Press, Bloomington, Indiana; 1941.Google Scholar
- Wilks SS: Mathematical Statistics. Wiley, New York; 1962.Google Scholar
- Priestley MB: Spectral Analysis and Time Series. Academic Press: San Diego; 1981.Google Scholar
- Fuller WA: Introduction to Statistical Time Series. 2nd edition. Wiley-Interscience: New York; 1996.Google Scholar
- Durbin J: Tests for serial correlation in regression analysis based on the periodogram of least-squares residuals. Biometrika 1969, 56: 1–15.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.