Comprehensive analysis of circadian periodic pattern in plant transcriptome
© Ptitsyn; licensee BioMed Central Ltd. 2008
Published: 12 August 2008
Circadian rhythm is a crucial factor in orchestration of plant physiology, keeping it in synchrony with the daylight cycle. Previous studies have reported that up to 16% of plant transcriptome are circadially expressed.
Our studies of mammalian gene expression revealed circadian baseline oscillation in nearly 100% of genes. Here we present a comprehensive analysis of periodicity in two independent data sets. Application of the advanced algorithms and analytic approached already tested on animal data reveals oscillation in almost every gene of Arabidopsis thaliana.
This study indicates an even more pervasive role of oscillation in molecular physiology of plants than previously believed. Earlier studies have dramatically underestimated the prevalence of circadian oscillation in plant gene expression.
A timely response and preparedness in response to the changing environmental cues are essential for life in plants and animals alike. Since plants are dependent on light for photosynthesis, a natural assumption is that circadian (i.e., approximately daily) oscillation should be an even more prominent feature of the plant gene expression than in animals. Multiple studies have reported the existence and detailed mechanism of a circadian molecular clock in plants [1–5]. Based on the studies of the model plant, Arabidopsis thaliana, researchers have determined that a "substantial" part of plant transcriptome cycles follows a circadian rhythm. This estimation is based on microarray experiments that search for the genes following the circadian rhythm among the entire set of transcripts that is examined by the microarray. Early attempts to identify these genes employed two-color spotted arrays, resulting in a cumbersome experimental design, or tried to minimize expenses by increasing the time span between the sample collections. The latter produced data with a very low sampling rate, which obscured the oscillation pattern in all but a few of the least noisy genes. More recent studies [1, 4] used Affymetrix (Affymetrix Inc., Santa Clara) Arabidopsis thaliana expression arrays. These studies focused on the role of circadian oscillation in specific regulatory and signaling systems, but the entire set of data with expression profiles of more than 22,000 transcripts over two days was made available for downloading from the public databases. Although independent, both of these sets share almost identical experimental conditions and samples at the same rate of once every four hours. These features make the data easy to compare not only with one another but also to the large body of murine circadian expression data, which is also sampled every four hours over a period of two days.
In recent years, we have published a number of studies on circadian oscillation in metabolically active peripheral tissues in mice [6, 7]. We have also reanalyzed and reported the discovery of circadian oscillation in a number of independent murine data sets from public sources . The results of our analysis of murine circadian data are in sharp contradiction with previous reports. We were able to demonstrate circadian oscillation in not just a small number of the genes that are presumably linked to the circadian molecular clock but in all transcribed genes. Our most recent studies  show that circadian oscillation is traceable not only in expressed but also in genes that were previously considered silent or unexpressed. The prevailing theory reflected in the molecular biology textbooks states that 10–15% of genes cycle within a daily period and are presumably regulated by the circadian molecular clock. Molecular clocks vary significantly in details, and the genes that form the clock may be evolutionarily unrelated, but the molecular clocks of plants, mammals, and insects share the same negative feedback principle that makes oscillation self-sustaining and adjustable. We have previously established that this theory does not reflect the reality, at least in the murine model. While the basic circadian clock is active in all central and peripheral tissues, other genes show robust, noise-free oscillation, particularly those involved in supporting basic energy metabolism and not directly linked to the circadian molecular clock. Moreover, the key elements of the cell transcription machinery itself exhibit a pronounced circadian pattern in the modulation of expression of practically every gene. We now know that the entire animal transcriptome, not just a specially regulated portion, experiences circadian oscillation. However, a reasonable expectation is to find the same observation in the plant transcriptome. Plants are even more dependent on the daily change in lighting conditions. Nevertheless, the most recent studies reported only 10.4  and 16%  of "circadially regulated" genes in the plant transcriptome. This obvious contradiction demands a uniform re-analysis of the data using advanced methodology that has been tested in multiple previous studies.
Results and discussion
Overview of the analysis strategy
Independent circadian studies in plants or animals rarely use exactly the same analysis pipeline. However, comparing a representative set of studies [10–15] reveals a commonality in strategy. The most typical approach starts with the normalization and scaling of microarray experiments; then the data is filtered, and only the genes that are present at least n times throughout the complete timeline (i.e., the exact number varies) are selected for further analysis. Some studies  introduced an additional filter that selects only "actively expressed" genes, i.e., genes with an expression level that is estimated above some arbitrary threshold. This much reduced subset of transcripts is subjected to the periodicity test and, in rare cases, a panel of more than one test, followed by a false discovery rate (FDR) correction. The few genes that pass the test are further analyzed to determine the phase and the amplitude of oscillation and to visualize using profile plots and heat maps. This approach produces consistent results across a number of circadian data sets from diverse origins but also shares a common set of problems. First of all, the formulation of the null hypothesis for the statistical tests is based on the assumption of an absence of periodicity, i.e., a steady line rate of transcription for the majority of the genes. This assumption is intuitive but has no foundation in biology. Second, each gene (transcript) is tested independently. On the other hand, the authors realize that that the researchers are looking for a manifestation of the same rhythm that modulates expression of different genes and that this expression is expected to be highly correlated. Another problem common for all circadian studies is that microarray expression profiles are very expensive to generate. Additionally, even the best data sets count two consecutive circadian periods at most with samples collected every four hours. Such a low sampling rate combined with a high level of stochastic noise, which is also typical for microarray estimation of gene expression, makes testing for periodicity particularly challenging.
A series of papers that we have published since 2006 have reported new algorithms for the analysis of periodicity in gene expression, including a new statistical test for periodicity, a phase classification, an application of digital signal processing, and an analysis of same-phase groups of genes as a continuous signal [8, 9, 17, 18]. These algorithms were instrumental in the characterization of circadian expression in peripheral tissues [6, 7], the discovery of a baseline oscillation in the transcript abundance of all genes , the discovery of alternative transcripts oscillating with a phase shift , and the discovery of the extra-low expression of eukaryotic genes . However, all of our studies have been conducted with murine (circadian) and yeast (metabolic oscillation) models, which obviously does not include plants in the scope of our investigations. This study aims to rectify this shortcoming and determine if our previous findings of pervasive and persistent circadian oscillation in the murine transcriptome are also true for plants.
Our analysis starts with the same preprocessing normalizing and scaling microarray experiment in a time series. Then a provisional phase of oscillation is assigned to each gene. This is done before any selection or testing for periodicity; thus, a provisional or "most likely" phase can be potentially assigned to a non-oscillating noisy expression profile. However, assigning a phase does not introduce change in the data and thus does not preclude non-oscillating profiles from being filtered out in a later step. For further analysis, expression profiles are grouped into classes based on the provisionally assigned phase. In each group, profiles are joined into the phase continuum, which maximizes the statistical power in testing for periodicity and allows the application of digital filters . Our methods also do not attempt the impossible, i.e., aligning all noisy profiles by peaks at particular time point. We use only as many phase classes as possible by generating an artificial cosine curve with the given length of observation, which is typically two complete daily periods, and given sampling rate, which is typically one sample taken every four hours. This strategy of analysis is applicable to a wide variety of data and has been tested on multiple data sets that were produced by collaborators at the Pennington Biomedical Research Center as well as independent data obtained from the public databases and kindly provided by the respective authors. The detailed description of the algorithms that were used in each step of the analysis is given in the Materials and Methods section.
Among the circadian gene expression data that is available from the GEO (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo), only two sets have a sufficient sampling rate and use a contemporary microarray platform (i.e., Affymetrix Arabidopsis expression array). For convenience in this paper, these data sets will be named by the academic affiliation of the majority of the authors, i.e., Davis  and Warwick  data sets. Unfortunately, no single experiment measures gene expression in the natural, undisturbed form. Both examine which data we reanalyze to collect samples in a constant light. The idea behind such an experimental design was to isolate the genes that are regulated by the molecular clock by presuming that all other genes will experience no oscillation without environmental cues. A brief description of the experimental conditions producing these data sets is given in the Materials and Methods section.
Difference between phase groups in Davis and Warwick data sets.
Warwick data set
Davis data set
The data obtained in two independent studies that were conducted at different times and in opposite hemispheres of the globe are very similar in the general pattern but exhibit some differences in the observed phase and the amplitude of some genes. The experimental design, though described in different words, is almost identical. Differences may be possible outside the methods described in the published papers, but the only significant differences seem to be in the time selected for the starting point (ZT, zeitgeber time) at subjective dawn, although not large, and the technique used to quantify expression values for microarrays. The Davis data set also has lower overall intensity and twice as many genes that are deemed absent as compared to the Warwick data. The latter can possibly explain the difference in signal to noise ratio between almost identically designed experiments. In both studies, the attempt to stop oscillation in the entire transcriptome by removing environmental oscillation (i.e., light) proved futile. In animal studies, changing the lighting regime from oscillating to constant darkness or dim light creates asynchrony between feeding, sleeping, and other activity patterns. As a result, a significant number of transcripts loose synchronization and identifying the baseline oscillation becomes more challenging. From a glance, the plant transcriptome data that was collected in constant light looks like the mouse transcriptome data that was collected under normal conditions with no alternation in the environment. Unfortunately, we do not have plant data that was collected under normal lighting for comparison. However, the robust oscillation under constant light in both the Davis and Warwick data leaves little space for a change. Leveling a single rhythmic environmental factor makes little impression on the pattern of gene expression in plants.
The authors of the publications that contribute to the Davis and Warwick data sets are referencing one another's works and are aware of some discrepancies, particularly in the number of rhythmic or "circadially regulated" genes. Covington and Harmer find a prevalence of higher than 10% of circadially-regulated genes intriguing and thus possible. However, neither of these research teams allowed for the possibility of all genes being expressed in circadian rhythm. This finding undermines the results of the studies that follow the separation of a small portion of rhythmic transcripts through both the analyses of over-represented pathways and the role of the molecular clock in specific pathways. Much of the results and discussion presented in these papers are based on the intuitive but unfounded assumption that all genes are expressed in a steady line pattern. Unfortunately, in the light of knowing that all genes are oscillating in a circadian pattern, these findings will have to be revised. On the other hand, the circadian timeline data that were collected for these studies are invaluable. These data could be an endless source of discovery. However, the analysis should be considered from a different angle, i.e., not whether a particular gene, co-regulated genes, or pathways are circadially-regulated but how changing experimental conditions affects oscillating properties, such as the phase and amplitude of the genes.
Materials and methods
Circadian expression data
UC Davis data set
Col-0 ecotype seeds were stratified at 4°C for 4 days before transfer to a growth chamber (22°C). Seedlings were entrained in 12-h white light (light source was cool white fluorescence tubes)/12-h dark cycles for 7 days before being released into free-running conditions of continuous white light at 22°C. Starting at subjective dawn of day 9, tissue was harvested every 4 h over the course of the next 44 h. Following standard protocols labelled cRNA targets were prepared from total RNA and hybridized to Affymetrix Arabidopsis expression GeneChips. Expression values were estimated at CSU from the original CEL files provided by the authors using dChip-derived Model-Based Expression Index.
University of Warwick data set
Wild-type Col-0 seedlings were used for the microarray circadian time-course experiment. Seedlings were placed immediately into LD 12:12 and grown for 7 days at 22°C. At dawn on the 8th day, they were placed into constant cool white fluorescent light. Samples were taken over two circadian cycles at 4-h intervals starting from ZT26. Samples were assayed on the Affymetrix GeneChip oligonucleotide ATH1 array (Affymetrix) according to the manufacturer's instructions. Background correction and normalization and gene expression analysis of the array data were performed using the GC-RMA routine  in GeneSpring version 7.2 (Silicon Genetics). The resulting table of gene expression values was downloaded from the GEO database.
Profiles have been smoothened by a 3rd degree polynomial procedure and median-subtracted. For smoothing we use seven-point Savitzky-Golay algorithm . To take advantage of all points in the time series a single-pass smoothing has been applied in a circular manner, with the last points contributing to smoothing the starting points. For better compatibility, the same smoothing and median subtraction procedure has been applied to all data sets.
For purposes of spectral analysis, consider a series of microarray expression values for gene x with N samples of the form
Y = x0, x1, x2, xN-1
If a time series has a significant sinusoidal component with frequency ω ∈ [0, π], then the periodogram exhibits a peak at that frequency with a high probability. Conversely, if the time series is a purely random process (a.k.a "white noise"), then the plot of the periodogram against the Fourier frequencies approaches a straight line .
where n = [N/ 2] and p is the largest integer less than 1/x.
This algorithm closely follows the guidelines recommended for analysis of periodicities in time-series microarray data  with the exception that we applied a locally developed C++ code instead of R scripts.
For each time series we calculate the maximum positive R(f) among all possible phase shifts f and use tabulated 0.05 significance cutoff values for correlation coefficient. Time series that shows significant autocorrelation R(f) with the lag f corresponding to one day (6 time points) are considered circadially expressed.
High p-value exceeding the threshold, for example 0.05, means that at least 5 out of 100 random permutations of time series produce a periodogram with the same or higher peak, corresponding to a given periodicity. Low p-values indicate a significant difference between periodogram IR(ω) preserving circadian periodicity and randomly permutated periodogram IY(ω) with the same level of technical variation. This difference leads to rejection of the null-hypothesis of purely random nature of variation in the original time series Y.
We start with phase classification, assigning each gene a phase based on maximal correlation to an ideal cosine curve. This method is superior to assigning a phase by position of peaks only because it takes into account more data. Each profile is subjected to z-score transformation equalizing the variation between time points. For each profile autocorrelation with circadian lag (R c ) is calculated and all profiles are sorted first by phase then by descending order of R c . Concatenating all profiles of the same phase with equalized range of variation (amplitude) we generate a continuous stream C ph of measurements containing a clear signal on one end and stochastic noise on the other. This continuum is treated with low-pass frequency filter and polynomial smoothing. We analyze each phase fraction separately to detect the point at which circadian signal deteriorates beyond p = 0.05 significance cutoff. A window W moving along the stream is transformed to frequency domain using Discrete Fourier Transform (DFT). The resulting periodogram I w is compared a periodogram of a randomly permutated W r using Kolmogorov-Smirnov goodness of fit test. Once the point at which I w does not differ significantly from a random periodogram I wr is detected, we count all original gene expression profiles that have circadian signal above the established cutoff .
False Discovery Rate analysis
this methodology often applied to reduce the number of false-positive tests is based in the assumption of independent or mildly dependent  hypothesis testing. However, in case of testing timeline expression profiles for periodicity independence could not be assumes for a number of reasons. First, the pattern of circadian oscillation is obvious in the great majority of expression profiles as seen on heatmaps (Figure 1, for example). Second, analysis of correlation with phase shift (also used to identify phase groups) confirms high correlation of nearly all profiles to common cosine curves. Third, living cells are known to have more than one oscillator, but these oscillators are normally synchronized to the rhythm of the circadian molecular clock, active in peripheral tissues. Testing individual expression profiles for periodicity we are looking for manifestation of the same factor, hence not independent hypothesis. For these reasons FDR correction has not been applied to reduce the number of detected oscillating genes.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 9, 2008: Proceedings of the Fifth Annual MCBIOS Conference. Systems Biology: Bridging the Omics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S9
- Covington MF, Harmer SL: The circadian clock regulates auxin signaling and responses in Arabidopsis. PLoS Biol. 2007, 5 (8): e222-PubMed CentralView ArticlePubMedGoogle Scholar
- Harmer SL, Hogenesch JB, Straume M, Chang HS, Han B, Zhu T, Wang X, Kreps JA, Kay SA: Orchestrated transcription of key pathways in Arabidopsis by the circadian clock. Science. 2000, 290 (5499): 2110-2113.View ArticlePubMedGoogle Scholar
- Schaffer R, Landgraf J, Accerbi M, Simon V, Larson M, Wisman E: Microarray analysis of diurnal and circadian-regulated genes in Arabidopsis. Plant Cell. 2001, 13 (1): 113-123.PubMed CentralView ArticlePubMedGoogle Scholar
- Edwards KD, Anderson PE, Hall A, Salathia NS, Locke JC, Lynn JR, Straume M, Smith JQ, Millar AJ: FLOWERING LOCUS C mediates natural variation in the high-temperature response of the Arabidopsis circadian clock. Plant Cell. 2006, 18 (3): 639-650.PubMed CentralView ArticlePubMedGoogle Scholar
- Saad MF, Riad-Gabriel MG, Khan A, Sharma A, Michael R, Jinagouda SD, Boyadjian R, Steil GM: Diurnal and ultradian rhythmicity of plasma leptin: effects of gender and adiposity. J Clin Endocrinol Metab. 1998, 83 (2): 453-459.PubMedGoogle Scholar
- Zvonic S, Ptitsyn AA, Conrad SA, Scott LK, Floyd ZE, Kilroy G, Wu X, Goh BC, Mynatt RL, Gimble JM: Characterization of peripheral circadian clocks in adipose tissues. Diabetes. 2006, 55 (4): 962-970.View ArticlePubMedGoogle Scholar
- Zvonic S, Ptitsyn AA, Kilroy G, Wu X, Conrad SA, Scott LK, Guilak F, Pelled G, Gazit D, Gimble JM: Circadian oscillation of gene expression in murine calvarial bone. J Bone Miner Res. 2007, 22 (3): 357-365.View ArticlePubMedGoogle Scholar
- Ptitsyn AA, Zvonic S, Gimble JM: Permutation test for periodicity in short time series data. BMC Bioinformatics. 2006, 7 (Suppl 2): S10-PubMed CentralView ArticlePubMedGoogle Scholar
- Ptitsyn AA: Stochastic Resonance Reveals "Pilot Light" Expression in Mammalian Genes. PLoS ONE. 2008Google Scholar
- Hogenesch JB, Panda S, Kay S, Takahashi JS: Circadian transcriptional output in the SCN and liver of the mouse. Novartis Found Symp. 2003, 253: 171-180. discussion 152–175, 102–179, 180–173 passim.View ArticlePubMedGoogle Scholar
- Ando H, Yanagihara H, Hayashi Y, Obi Y, Tsuruoka S, Takamura T, Kaneko S, Fujimura A: Rhythmic mRNA Expression of Clock Genes and Adipocytokines in Mouse Visceral Adipose Tissue. Endocrinology. 2005, 146 (12): 5631-5636.View ArticlePubMedGoogle Scholar
- Klevecz RR, Bolen J, Forrest G, Murray DB: A genomewide oscillation in transcription gates DNA replication and cell cycle. Proc Natl Acad Sci USA. 2004, 101 (5): 1200-1205.PubMed CentralView ArticlePubMedGoogle Scholar
- Storch KF, Lipan O, Leykin I, Viswanathan N, Davis FC, Wong WH, Weitz CJ: Extensive and divergent circadian gene expression in liver and heart. Nature. 2002, 417 (6884): 78-83.View ArticlePubMedGoogle Scholar
- Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics. 2004, 20 (1): 5-20.View ArticlePubMedGoogle Scholar
- Tu BP, Kudlicki A, Rowicka M, McKnight SL: Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes. Science. 2005, 310 (5751): 1152-1158.View ArticlePubMedGoogle Scholar
- Panda S, Antoch MP, Miller BH, Su AI, Schook AB, Straume M, Schultz PG, Kay SA, Takahashi JS, Hogenesch JB: Coordinated transcription of key pathways in the mouse by the circadian clock. Cell. 2002, 109 (3): 307-320.View ArticlePubMedGoogle Scholar
- Ptitsyn AA, Zvonic S, Gimble JM: Digital signal processing reveals circadian baseline oscillation in majority of mammalian genes. PLoS Comput Biol. 2007, 3 (6): e120-PubMed CentralView ArticlePubMedGoogle Scholar
- Ptitsyn AA, Zvonic S, Conrad SA, Scott LK, Mynatt RL, Gimble JM: Circadian clocks are resounding in peripheral tissues. PLoS Comput Biol. 2006, 2 (3): e16-PubMed CentralView ArticlePubMedGoogle Scholar
- Ptitsyn AA, Gimble JM: Analysis of circadian pattern reveals tissue-specific alternative transcription in leptin signaling pathway. BMC Bioinformatics. 2007, 8 (Suppl 7): S15-PubMed CentralView ArticlePubMedGoogle Scholar
- Straume M: DNA microarray time series analysis: automated statistical assessment of circadian rhythms in gene expression patterning. Methods Enzymol. 2004, 383: 149-166.View ArticlePubMedGoogle Scholar
- Irizarry RA, Wu Z, Jaffee HA: Comparison of Affymetrix GeneChip expression measures. Bioinformatics. 2006, 22 (7): 789-794.View ArticlePubMedGoogle Scholar
- Guenther MG, Levine SS, Boyer LA, Jaenisch R, Young RA: A chromatin landmark and transcription initiation at most promoters in human cells. Cell. 2007, 130 (1): 77-88.PubMed CentralView ArticlePubMedGoogle Scholar
- Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T: KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008, D480-484. 36 DatabaseGoogle Scholar
- Savitzky A, Golay M: Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry. 1964, 36: 1627-1639.View ArticleGoogle Scholar
- Priestley MB: Spectral Analysis and Time Series. 1981, Academic Press, LondonGoogle Scholar
- Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003, 100 (16): 9440-9445.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.