Photosynthetic organisms such as cyanobacteria have been shown to employ complex diurnal regulatory patterns to prepare the organism for the light period
[1–3]. The extent, purpose and mechanism of diurnal and circadian oscillations have been reported to differ significantly among various cyanobacterial species. In particular, the reported estimates of the number of oscillating transcripts differ strongly between studies, ranging between 9-80% of protein-coding genes in microarray time series
[4–8]. Random insertion of a luciferase reporter system indicated that up to 100% of genes may be under circadian control
[1, 9]. Although microarray technology is a powerful genomic approach to quantify the expression levels of large numbers of genes simultaneously, there are technical limitations which significantly complicate the quantification and interpretation of such global transcriptional rearrangements. Here, we consider a large number of combinations of methods required in a typical microarray analysis pipeline to evaluate the impact of each step on the results. In addition to time series-specific descriptors, clustering is used due to its importance as tool for biological interpretation of microarray data.
Microarray platform-inherent technical limitations cause the resulting data to contain systematic or random technical variation in addition to the biological variation of interest
. Differences in the distribution of the measured fluorescence values are commonly attributed to variations in the quality of RNA extraction (experimental variation) and of individual arrays (technical variation). Based on assumptions about biologically plausible variation, a range of normalization methods attempt to reduce the technical variation between chips. The expected amount of change in gene expression is a crucial element in the design of normalization methods. This can be a hen-egg-problem in less well studied experimental systems, for which little or no information is available on the expected global remodeling of the transcriptional landscape. Various normalization methods have been employed in previous studies to describe the transcriptional landscape in cyanobacteria. A combination of LOESS and quantile normalization was used by Vijayan et. al[11, 12]. While spike-in standards were incorporated in these studies, normalization was performed without application of this additional information. Kucho et al. and Straub et al. employed LOWESS normalization. A modified LOWESS normalization was used in the work of Stöckel et al.. As described in more detail by Calza et al., the assumption of constant expression for traditional housekeeping genes does not hold under all conditions. Considering the high percentage of diurnally varying genes in cyanobacteria
 including central cellular processes such as translation
, an a priori definition of housekeeping genes is not possible for cyanobacteria. Correspondingly, reports employing housekeeping gene- based normalization are not known to the authors.
It is known that the application of such global normalization methods has significant impact on subsequent analyses, in particular when some of the underlying assumptions on data structure are not or only partially fulfilled
. Global normalization methods may change the set of differentially expressed genes
[14, 15] or lead to significant changes in the observed correlation between genes
[16, 17]. While cross-validation of expression measurements can be used to discover methodological problems
, the lack of diurnal expression datasets from alternative techniques, such as RNA-Seq, impedes such verification in the case of cyanobacteria. These observations raise the question, how normalization and other preprocessing steps affect commonly used descriptors for periodic expression, e.g., the number of oscillating genes (by tests of significance of oscillation) and the circadian phase of peak transcript levels
[11, 19, 20]. Such phase information is usually used to derive a temporal order of the observed processes. It is, therefore, of paramount importance to prevent systematical errors in the primary phase information.
In addition to the normalization steps, microarray analyses require a transformation which accounts for its semi-quantitative nature. Calibration methods for sequence-dependent hybridization energies and unspecific cross-hybridizations have been proposed
[10, 21, 22], but are not yet an established standard and so far only implemented for Affymetrix arrays. The interpretation of microarray data in terms of absolute mRNA copy numbers is currently not possible. Instead, data transformations are used to normalize a given transcript time series to the mean value or to the distribution of fluorescence intensities: the fold-change or l
2 mean ratio transformation (in the following: 12 m) removes the mean, while standardization (z-score transformation, in the following: std) additionally normalizes the standard deviation in order to focus on the pattern of change rather than its amplitude
. We also consider the discrete Fourier transformation (DFT) in the context of data transformation. The removal of the first DFT component results in a normalization by the expression mean in the 12 m, and an amplitude scaling serves to de-emphasize the amplitude
, comparable to std. Notably, only this transformation from time to frequency space considers explicitly the time series character of the data.
The biological interpretationof microarray data is possible only after the application of the transformation and normalization. Due to its high-dimensional nature, a standard step in the interpretation of microarray data is clustering. A variety of clustering algorithms have been proposed, making it necessary to systematically evaluate the performance on gene expression data
[25, 26]. However, due to the diversity of the data domain, a recent work concluded that the choice of a clustering algorithm might depend on the specific experiment
. In the case of time series analysis, it has been noted that most clustering algorithms do not consider the pattern of change over time, but treat each sample independently of the temporal order. An increasing number of algorithms propose solutions to this issue
[27–31], but there is no accepted standard. An interesting approach specifically designed to cluster periodic time series has recently been proposed for the analysis of respiratory oscillations in budding yeast culture
. Here, the DFT of the time-series was clustered with a model-based algorithm that uses t-distributions as a model (flowClust
). However, the impact of data transformation and normalization of time resolved microarray data, the clustering algorithm, and the similarity measure on the corresponding clustering result have not been fully described.
Several studies on various model organisms have reported that accepted standard normalization methods lead to inaccurate results under certain experimental conditions. A recent study of a human B cell line verified an increase of the global mRNA abundance per cell. This violation of a common assumption for normalization challenges the conclusions of a wide range of studies (Lovén2012). Furthermore, a work from Machné and Murray
 compares two independent measurements of the oscillatory transcriptional changes of budding yeast under continuous culture conditions. Due to the global nature of the observed oscillations, an alternative normalization scheme was employed to avoid detrimental effects of standard methods. This normalization method is included in this study. We therefore expect, that the conclusions drawn from our study are also valid for other model systems. However, cyanobacteria are a highly specific model system featuring, e.g., a small number of genes with a high fraction of diurnally oscillating genes. Therefore, we address the case of cyanobacteria with this systematic analysis of normalization methods and demonstrate how to circumvent problems while analyzing diurnal expression data.âfİ
The quantification of diurnal expression in the cyanobacterium Synechocystis sp. PCC 6803 using microarray technology poses new problems for old methods of data normalization, transformation and clustering. We compare four multi-array normalization methods and three data transformations with respect to diurnal expression oscillation strength and phase. Furthermore, we use a variety of clustering algorithms to examine the global expression landscape. The results of seven clustering algorithms are integrated to verify whether and how normalization shapes the results of downstream analyses. Our analysis demonstrates that normalization methods have significant impact on the estimated number and phases of oscillating transcripts, with major consequences for subsequent analysis and biological interpretation. We identify LOS normalization as the preferable method.