The complexity of gene expression dynamics revealed by permutation entropy
BMC Bioinformatics volume 11, Article number: 607 (2010)
High complexity is considered a hallmark of living systems. Here we investigate the complexity of temporal gene expression patterns using the concept of Permutation Entropy (PE) first introduced in dynamical systems theory. The analysis of gene expression data has so far focused primarily on the identification of differentially expressed genes, or on the elucidation of pathway and regulatory relationships. We aim to study gene expression time series data from the viewpoint of complexity.
Applying the PE complexity metric to abiotic stress response time series data in Arabidopsis thaliana, genes involved in stress response and signaling were found to be associated with the highest complexity not only under stress, but surprisingly, also under reference, non-stress conditions. Genes with house-keeping functions exhibited lower PE complexity. Compared to reference conditions, the PE of temporal gene expression patterns generally increased upon stress exposure. High-complexity genes were found to have longer upstream intergenic regions and more cis-regulatory motifs in their promoter regions indicative of a more complex regulatory apparatus needed to orchestrate their expression, and to be associated with higher correlation network connectivity degree. Arabidopsis genes also present in other plant species were observed to exhibit decreased PE complexity compared to Arabidopsis specific genes.
We show that Permutation Entropy is a simple yet robust and powerful approach to identify temporal gene expression profiles of varying complexity that is equally applicable to other types of molecular profile data.
High complexity is considered a defining feature setting living systems apart from non-living matter. Thus, investigating the emergence, maintenance, and functional significance of complexity has been a central research theme in all of biological sciences throughout history including the analysis of the complexity of cellular systems at the molecular level  and culminating in the emergence of Systems Biology that aims to develop an holistic understanding of the complex behavior of molecular and cellular systems . In the context of molecular interaction networks, for example, it was observed that eukaryotic evolution was accompanied by changes of the complexity and a fast - on an evolutionary time scale - rewiring of interactions between proteins . However, in the temporal domain, the complexity of molecular processes has not been adequately investigated yet. Dynamic phenomena such as the temporal gene expression response to external perturbations as measured in time course genome-scale microarray measurements, while constituting a major research topic, have been analyzed primarily to unravel structural relationships between different groups of genes with the aim to identify important gene sets - for example, for diagnostic purposes - via clustering [4–8] or principal component analysis (PCA) [6, 9], or to deduce regulatory transcriptional networks and modules [10–13], infer relationships between metabolic genes [14, 15] as well as to provide a basis for network modeling [16, 17]. Along with increasing numbers of experiments involving expression time series, approaches to identify dominating temporal patterns have been developed. Introduced approaches ranged from applying unbiased Singular Value Decomposition (SVD, ), utilizing the notion of patterns  and extracting gene sets that are consistent with simple up/down/unchanged patterns and successions thereof as a means to guided profile clustering , and to converting continuous level values into discrete ranks to determine the degree of randomness with regard to rank permutations .
The paucity of systematic studies of temporal gene expression complexity may in part be explained by the lack of a suitable metric that is applicable to the typically very short gene expression time series with only few time points per gene available.
The investigation of complexity has attracted considerable interest in the physical sciences and various other fields. A quantitative understanding of complexity emerging in dynamical systems is often obtained by the notion of entropy, Lyapunov exponents, or fractal dimensions . The former two quantities measure the predictability of a system by showing how sensitively it depends on the initial conditions, while fractal dimension characterizes the complex geometric property in phase space.
It is rather challenging to apply the physical concepts of complexity (e.g., entropy) to address biological complexity, especially in the context of temporal gene expression data. Specifically, the encountered obstacles include: (i) Very short time series. For example, entropy as a complexity measure is formally defined only in the asymptotic limit; i.e., very long time series data at arbitrary accuracy are needed. Finite time records require a suitable modification, e.g. ε-entropy . Still, even this modified entropy definition is not properly applicable to typical biological datasets, especially microarray measurements. Most microarray experiments include a very limited number of temporal measurements, typically no more than 10 or so, oftentimes simply because of the cost considerations. (ii) Non-stationarity. Most microarray measurements are intentionally non-stationary as they are being conducted specifically to measure the system's response to an external perturbation. Note that here only a weak non-stationary criterion is considered, where the mean and the variance might vary over time. (iii) Non-equidistant time scale. Microarray sampling schedules are intentionally designed with denser coverage at early time intervals of the process as the initial phases are, first, most interesting, and secondly, as the new steady state is reached, the rate of molecular adaptation can be expected to slow down.
In an attempt to capture complexity of temporal profiles understood as algorithmic compressibility and to identify profiles that are highly non-random, Ahnert and co-workers introduced a rank permutation-based approach . In this approach, a particular series of level data is converted to ranks and by virtue of various mapping functions that associate patterns with a single number, its likelihood of resulting from a random as opposed to a biological process can be assessed in an unbiased fashion. This method was shown to correctly identify cell cycle genes in yeast without the need to introduce any pre-conceptions on the data.
Here, we propose to use Permutation Entropy (PE) to analyze temporal profile data. PE was initially introduced as a measure to assess the complexity for time series based only on the ranks of the data, instead of a particular distance metric . PE decomposes data series (e.g. expression time series data) into elementary motifs (ordinal or order motifs capturing permutations of ranks) and associates high complexity with high and low complexity with low numbers of different motifs observed in a given temporal profile. So far, it has been applied mainly to detect dynamical transitions in both modeled and experimental data; i.e., electroencephalography (EEG) and magneto-cardiography records . Compared to the approach reported in , it has the advantage of only using a single pattern-to-value mapping function. Furthermore, we introduced the 'unchanged' pattern; i.e., differentiate between significant expression changes and those that are purely attributable to noise.
We applied the concept of PE to gene expression time series data obtained in Arabidopsis thaliana, an important model plant , exposed to different abiotic stress conditions to investigate specifically, how the complexity of temporal gene expression patterns changes upon stress exposure and whether different functional groups of genes as well as different experimental conditions are characterized by different PE properties. Secondly, we explore whether and to what degree gene-specific expression complexity is encoded in the Arabidopsis genome by examining associated gene promoter regions. Whether complexity is increasing during the course of evolution is an intriguing and much debated question . Here, we compare the obtained PE complexity measure for well-conserved (across different plant species) - and thus supposedly ancient - genes to those associated with Arabidopsis-specific, presumably young genes. Finally, we examine the introduced PE complexity measure in relation to gene expression correlation networks. Correlation networks allow studying gene expression data sets with regard to the structure of the underlying pairwise relationships with particular focus on genes involved in many interactions, so-called hub genes with a high degree of network connectivity [27, 28]. We find that high-degree genes are associated with increased PE indicative of the global stress induced restructuring of gene expression that has been observed similarly in yeast .
We demonstrate that PE is a simple, yet powerful novel concept to study the dynamics of temporal gene expression profiles and equally applicable to other types of molecular profile data.
We applied the concept of Permutation Entropy (PE) as a metric to assess the complexity of temporal gene expression profiles obtained from abiotic stress time series microarray experiments performed in Arabidopsis thaliana . The dataset comprised nine abiotic stress (including heat, cold, genotoxic, UV-B-light, osmotic, salt, wounding, drought, and oxidative stress conditions) and one common control condition. For every stress and the control condition, 7 time points were consistently available across all conditions corresponding to 0 h, 0.5 h, 1 h, 3 h, 6 h, 12 h, and 24 h after stress exposure.
Elementary three-point pattern distribution
The approach to measure complexity of temporal gene expression profiles by means of Permutation Entropy (PE) relies on converting continues expression level data into a succession of discrete order patterns of - in our case - three consecutive time points. Allowing for the no-change pattern (see Methods), 13 different three-point patterns are possible (Figure 1A). The most frequent pattern observed under control conditions and across all gene transcripts is pattern 13, the no-change pattern (Figure 1B), followed by partially unchanged patterns (patterns 7-12) and monotonic profile patterns (pattern 1 and 2). Least frequent are patterns corresponding to three distinct expression values (patterns 3-6). Notwithstanding the fact that the no-change pattern assignment depends on the set expression difference threshold between consecutive time points (d e , see Methods), a clear tendency to steady or gradually rather than abruptly changing elementary expression patterns is evident from the data.
The Permutation Entropy (PE) distribution of the Arabidopsis transcriptome under control and stress conditions
Applying our modified PE metric to all AtGenExpress abiotic stress time series datasets, the entropy values for all considered 21,797 (22k for short) genes under all 10 different conditions (nine stress and one control condition) were calculated. As an illustration, we selected the control dataset to show the general distribution of PE values across all gene transcripts (Figure 2). As only seven discrete PE values are possible given our definitions (see Methods) represented by the seven frequency bars in Figure 2, the distribution is not normal, and hence, all significance tests were performed applying the Mann-Whitney-U test (MWU-test or, as it is also known, the Wilcoxon rank sum test). The smallest PE value (PE = 0) corresponds to genes with only small fluctuation across all the time points such that gene expression levels do not change above the de threshold (no-change pattern) as illustrated for an exemplary profile surrounding the histogram or to strictly monotonic up or downward changes across all time points as such a profile also yields PE = 0. As evident from the example profiles, with increasing PE values, the associated temporal patterns become more complex; i.e., more and different patterns occur with the maximal possible PE value of 2.32. Zero is the most frequent PE value obtained for approximately half of all gene transcripts; i.e., most genes have a low expression complexity under control conditions. The maximal PE value was obtained for only about 10 percent of all transcripts. The least frequent profiles are the strictly periodic signals (third bar from the left).
Relative to control conditions, the median PE of all 22k gene transcripts monitored by the microarray increases significantly upon exposure to stress conditions (Table 1). The most dramatic increase of PE was observed under heat and UV-B light stress conditions, and smallest changes, but nonetheless leading to significantly increased PE, under cold and oxidative stress situations. Thus, when challenged by unfavorable environmental conditions, the complexity of gene expression programs in Arabidopsis increases, whereas the control condition appears to correspond to the ground state of minimal PE complexity.
Functional association analysis of high and low PE genes
We examined what biological processes and molecular functions are associated with genes of high or low PE, respectively. Moreover, we asked whether the association of particular processes and functions to high or low PE changes under different stress conditions or whether there are particular functional gene classes universally associated with either high or low PE. All gene transcripts were classified into 14 biological processes and 15 molecular function groups according to GO-slim terms (see Methods) followed by computing the mean PE value per condition for each GO gene set and applying biclustering (clustering conditions and GO terms simultaneously).
Across all conditions (including the control condition), genes involved in the general and abiotic stress response (Figure 3A) and regulatory processes (transcription factor activity, kinase activity (Figure 3B)) are universally and significantly (Table 2) associated with high PE (Figure 3A) indicating their sensitivity towards stress conditions, while genes of yet unidentified functions and process involvement and house-keeping functions such as DNA and RNA metabolism (Figure 3A) as well as general transcription processes and functions (DNA and RNA binding, Figure 3B) were found to represent a low PE cluster. Thus, their gene expression is less complex even when exposed to stress conditions. Surprisingly, even under control conditions, response to stress and transcription factor activity are significantly enriched in the high PE gene set suggesting that genes involved in these functions exhibit greater temporal gene expression complexity even under normal reference conditions (Table 2). Nonetheless, the reference, control condition was associated with relatively low PE values across most GO terms, while heat and UV-B light stress clearly stand out as being associated with high PE across all GO categories (Figures 3A and 3B). Interestingly, cold and heat stress exposure appear to fall on opposite ends of the PE spectrum with cold stress being similar to control conditions. Thus, temperature appears to either "freeze" = cold or "stir" up = heat the system reflected by PE. The conditions "salt" and "drought" stress - both water stress related - were found to cluster together based on mean PE values for the different GO terms as well as "oxidative" and "genotoxic" stress, albeit less tightly.
When investigating PE values for temporal expression profiles expressed relative to control values (see legend, Figure 4), we noted that with regard to Biological Process, "response to stress", "response to abiotic or biotic stimulus" and "other biological process" were consistently enriched among genes associated with the highest possible PE value (PE = 2.32) compared to all other genes with lower PE (Figure 4A, blue horizontal stripes), However, with regard to Molecular Function - an ontology that aims more at the precise molecular functionality rather than overall process, no individual function or activity is universally detected as being associated with high PE values upon stress exposures (Figure 4B). Thus, while the overall strategic response is similar across all conditions (stress response), the concrete realization as to which functions are modified by increasing gene expression complexity is condition specific.
PE and expression level
Evidently, genes with very low or even no appreciable expression levels across all conditions and at all time points will have low (or zero) PE values as there is no significant change either. Among those genes, genes with currently no annotated function or process involvement as judged by their GO annotation are clearly overrepresented (p < < 0.05). It is possible that those genes may be predicted incorrectly in the genome or may even not be true genes at all as frequently gene structure annotations have been generated by in silico gene predictions. Thus, including genes with very low expression level bears the risk of associating low PE values simply with genes of questionable gene structure. Therefore, in all subsequent analyses that rely on genes with correctly assigned gene structure and detectable gene expression, we discarded the 10,000 genes expressed at low levels under control conditions (see Methods) and focused on the remaining 11,797 genes with appreciable expression levels only.
Increased PE correlates with an increased upstream intergenic space and increased number of cis-elements
High PE-values resulting from an increased number of order patterns indicate that a gene's expression has truly changed significantly (according to d e ) over the course of the experiment. And not only has the expression level changed, but it has changed in different directions (successions of up, down, and unchanged regulation) as purely monotonic increases/decreases, despite being above the threshold d e , will not contribute towards increased PE values. Thus, genes with high PE have undergone differential regulation with opposing outcomes in response to the external stimulus. Naturally, this behavior provokes the question as to how this complex regulation is induced and regulated.
High entropy genes may be under a more complex cis-regulatory control, and thus should harbor more cis-regulatory motifs in their upstream promoter regions. They may also require larger upstream space to accommodate a more complex regulatory apparatus, which would be measurable as the distance to the next upstream gene. Indeed, high PE genes were observed to be associated with significantly longer upstream regions compared to low PE genes. When sorting all genes in ascending order of PE summed up over all conditions (PE_sum), the average upstream-gene-distance for the top 2,000 PE genes was 2,597 +/- 2,481 (s.d., median = 1848) compared to only 1,438 +/- 1,703 (median = 803.5), p MWU = 4.4e-86 for the bottom 2,000 PE genes. This held also true when considering only control conditions (2,000 high-PE genes: 2,394 +- 2,412 (median = 1645.5), 2,000 low-PE genes: 1,463 +/- 1,643 (median = 888.5), p MWU = 1.1e-51). As shown in Figure 5A, while the scatter based on per-gene data is appreciable, the average upstream distance shows a consistent upward trend with increasing values of PE_sum (inset of Figure 5A).
We then examined the number of cis-elements annotated within the 500 nucleotides upstream of a gene's transcription start site. High-PE genes were found to be associated with a significantly increased number of cis-elements, albeit the absolute difference was small (around 10%). The 2,000 highest PE genes possess, on average, 9.4 +/- 3.9 (median = 9) cis-elements compared to 8.4 +/- 3.6 (median = 8) found in the 2,000 lowest-PE genes, p MWU = 6.2e-16. Figure 5B shows the relationship of motif counts and PE_sum as a scatter plot. As for upstream distances (Figure 5A), the per-gene data vary substantially, but a consistent upward trend of motif counts with increasing PE_sum values is clearly discernable (inset of Figure 5B). The trend of increased cis-element motif counts with increased PE was observed similarly for unique motif counts; i.e., repeated occurrences of the same motif were not counted towards the total number of motifs.
Cis-regulatory elements associated with high and low PE genes
We determined if and which motifs are overrepresented in high- or low-PE genes. Significantly increased motif counts for particular motifs were found for both high and low PE genes (Table 3). Well known stress-response cis-regulatory motifs such as the ABRE-like and ABF binding site motifs  were found overrepresented in upstream regions of high-PE genes, while in upstream regions of low-PE genes, the TELO-box motif known to control genes involved in house-keeping functions  was identified as the most significantly overrepresented motif.
Genes with high connectivity degree in correlation networks have increased PE
The PE measure characterizes the temporal expression profiles of single genes. Next, we explored the PE of genes in relation to their respective network properties by way of comparing high to low degree genes, where degree measures the number of connections in correlation networks (see Methods), thus embedding single PE values in the context of other genes. Across all conditions, we found that highly connected genes exhibit increased PE values relative to genes with low degree (Figure 6). Thus, while simple expression profiles dominate - about half of all genes show the lowest PE value under control conditions (Figure 2), with regard to similarity of temporal profiles as judged by correlation, more complex patterns seem to be adopted synchronously by larger numbers of genes.
Conserved genes have lower PE compared to new, Arabidopsis-specific genes
For the set of 11,797 high expression level Arabidopsis gene probes (see Methods), we identified those genes that have been conserved over long evolutionary times - and thus can be assumed to be evolutionarily old - by comparing all Arabidopsis gene-encoded protein sequences to all annotated protein sequences in Physcomitrella patens, a moss, and Chlamydomonas reinhardtii, a unicellular algae. The 5,012 Arabidopsis genes also contained in Physcomitrella were found to have significantly lower mean PE when summed up over all conditions (< PEsum > = 11.1 +/- 6.0 (s.d.), median = 10.7) compared to 6,785 genes only present in Arabidopsis, and thus presumably evolutionarily young genes (< PEsum > = 11.9 +/- 6.4, median = 12.1), p MWU = 1.3E-13. Similarly, the 1,647 gene found both in Arabidopsis and Chlamydomonas are also associated with decreased PE <PEsum > = 11.0 +/- 5.8, median = 10.7 versus <PEsum > = 11.7 +/- 6.3, median = 11.6 for 10,150 Arabidopsis-specific genes, p MWU = 4.6E-05.
Typically, the notion of complexity is used in the context of entire systems, for example protein-protein interaction networks . We introduced the concept of Permutation Entropy (PE) to capture the complexity of a singular temporal expression profile. Here, complexity may be understood in the Kolmogorov Complexity  sense, namely as the quality of a profile of being simple (unchanged or monotonic) as opposed to more complex (succession of up and down) to that of highest complexity (a situation in which a profile can not be reduced to a simple generating principle such as a linear of periodic function; i.e. that of highest entropy). Specifically, we applied the PE metric to the analysis of temporal gene expression profile data. It relies on decomposing sequential expression patterns into elementary motifs (ordinal motifs) and associates high complexity with high and low complexity with low numbers of different motifs. Investigations of time series data have focused primarily on discerning differential gene expression; i.e., the discovery of genes whose expression is significantly altered in response to a perturbation by applying appropriate thresholds. As implemented here, PE also identifies patterns of differential expression - or absence thereof (no-change pattern), but in addition, examines whether the response is 'simple' (e.g. monotonic) or whether a succession of different motifs indicates more complex transcriptional changes. Thus, PE allows to easily identify those transcripts that exhibit intricate expression profiles as candidates for complex regulatory actions.
The issue of complexity of individual series of level data has been addressed before, most notably by Ahnert and co-workers, who also used a pattern-based approach associated with rank permutations . In the PE formulation of the problem applied here, a single mapping function (Eq. 1) to map series of rank numbers to single values is used and thus to associate a single quantity with each temporal profile rather than several possible mapping functions listed in . Furthermore, by introducing the no-change pattern that is supposed to capture level differences below the noise level, we specifically aimed to associate those profiles to high complexity that are characterized by an occurrence of many different up/down/unchanged patterns, whereas in , no such cutoff was introduced, such that seemingly erratic up/down patterns would be identified as random, and thus of lesser biological significance. By applying a sensible noise cutoff, we intended to identify those profiles as highly complex that are likely to have resulted from many and significant changes of the transcriptional program acting on those transcript. By contrast, the approach used in  primarily aims to identify profiles that are most non-random (as opposed to most complex as defined here using PE-complexity) and thus those profiles that are associated with a particular biological process. For example, in our PE approach cyclical patterns obtained for cell cycle series would be ranked as medium PE complexity as repetitive patterns have less than maximal PE, but as most highly non-random by the approach introduced in . In that regard, our PE formalism detects complex profiles as those with many significant changes and - to some extend - independent of the actual experiment.
As a metric, PE copes with the three problems typically encountered by other complexity measures (see Introduction). It is applicable to short, non-stationary, and non-equidistant time series data. Evidently, longer time series will, however, yield statistically more robust results on a per gene basis.
As all studies on dynamic processes, the sampling rate at which data points are taken critically influences the results and their interpretation. Ideally, the sampling rate matches the characteristic time scale of the process under investigation. The data used in this study were spaced in the minutes and hours time interval range, a scale that was found appropriate for transcriptional responses. It is clear that, had the sampling frequency been much higher or lower, the absolute PE values may not be directly comparable across different sampling frequencies. For example, at much higher rates, the unchanged pattern will be observed much more frequently. However, in relative terms, i.e. a comparison of PEs associated for a number of genes at a given sampling scheme, the ordering of genes with regard to PE can be expected to hold. Furthermore, the data used here were spaced evenly on a logarithmic time scale. By applying interpolation using cubical splines, we simulated even linear time spacing and found very similar results with regard to the overall distribution of PE values (Additional File 1, Supplementary Figure S3) and qualitative results with regard to GO functional annotations (Additional File 1, Supplementary Figure S4). Thus, the PE metric proved robust with regard to sampling scheme details.
Permutation Entropy and noise
High complexity measured by high PE indicates low predictability of future expression values based on past values of a gene expression time series. Noisy or very erratic gene expression profiles would also qualify as highly entropic and, thus, not predictable. By imposing a threshold between consecutive time points based on the observed technical and biological variation, we largely eliminated purely noise-induced expression changes and identified significant change patterns. Thus, high PE values can be assumed to be the result of significantly altered transcription programs and not by technical or biological noise. However, the appreciation of a biological role of noise in generating physiological responses is only now emerging. Recently, it was shown that noise may produce bistable positive transcriptional feedback loops  and that furthermore noisy genes have conserved coding sequences and exhibit characteristic protein-protein interaction network properties . The concept of PE may thus be a suitable metric to be used in such studies as well.
Mechanisms of gene expression complexity
As we implemented the PE concept to specifically identify patterns of significant change, the question arises as to how successions of multiple up/down/unchanged patterns for a single gene in response to a single stimulus are regulated. We found evidence that high PE genes are characterized by an increased length of their 5'-intergenic region as well as an increased density of cis-regulatory motifs. Thus, increased PE may in part be originating from larger and more complex promoter regions. This has been shown similarly in the context of multi-stress response genes signified by differential expression across different external perturbations contained in the AtGenExpress abiotic stress experiment series that was also used here . However, we took a different view by examining the complexity of temporal profiles of individual genes in a single stress experiment. As we find similar trends, cross-experiment complexity and temporal single-experiment complexity may have similar sources based on genomic, architectural properties, and cis-regulatory motifs. The role of other conceivable processes and factors involved in causing highly complex temporal expression profiles such as miRNA, epigenetic, chromosome structurally mediated regulations  remains to be established.
Complexity and evolution
Including only genes with appreciable expression levels under control conditions, we found that Arabidopsis genes that are conserved over long evolutionary distances, which can be assumed ancient, have lower PE than Arabidopsis specific genes. While the absolute difference was small (around 7%), it was statistically very significant. It appears reasonable to conclude that conserved genes serve functions that are universally needed. Indeed, among the conserved genes (both Arabidopsis vs. Chlamydomonas and vs. Physcomitrella, genes involved metabolic and protein metabolic processes, cell organization and biogenesis, and electron transport or energy pathways were overrepresented as judged by a GO-term enrichment analysis. Genes specific to Arabidopsis emerged later in evolution to potentially cope with more complex environments, and thus needed more complex gene expression regulation. In this study, complexity of the environment has essentially been mimicked by the different stresses applied. In the Arabidopsis specific gene set, genes involved in transcriptional regulatory processes were found overrepresented as well as genes with currently unknown function or process involvement. As has been emphasized , deciding the question of increasing complexity on evolutionary time scales may to a large degree depend on the definition of complexity. Here, we used PE as an entropy-based measure to capture complexity.
Permutation Entropy and the abiotic stress response in Arabidopsis
When challenged by environmental perturbations the transcriptional programs are being modified not only to change the expression levels of genes, but the adjustments is accompanied by inducing and repressing interactions leading to increase complexity of temporal expression profiles. Among the different abiotic stress conditions, exposure to heat and UV-B light stress conditions provoked the greatest changes of PE. Thus, both conditions appear to present major assaults resulting in massively change transcriptional programs. At the other end of the spectrum, cold stress - while also representing a temperature cue - resulted in relatively small PE changes. Intuitively, this could be interpreted as a general thermodynamic "freeze-up" generating responses of smaller amplitude. Whether this amplitude reduction is indeed a thermodynamic effect or a result of regulation remains to be established. In this context, it would be interesting to monitor concurrent metabolic changes as the temperature sensitive reaction rates would also reveal thermodynamic effects.
Genes involved in stress response processes and other abiotic stimuli were found to exhibit greatest PE. Surprisingly, however, even under reference conditions, genes annotated as being involved in stress response have increased PE relative to other gene functions (Table 2). Of course, it is difficult to decide what actually constitutes a stress condition. Also the reference, control condition is one particular set of environmental parameters the plant has to cope with. Growth processes and circadian as well as diurnal rhythms persist requiring dynamic changes of the underlying transcriptional program of those genes. Alternatively, it could be speculated whether those genes are naturally fluctuating more, such that they are active and not in a dormant state, ready to produce the necessary response when challenged.
We found that the overall strategy of stress response may be the same across different abiotic stresses - to cope with the stress condition (Figure 4A), the concrete tactics as to how to accomplish the stress response and which particular functional program to change, may very much be stress specific (Figure 4B).
Our analysis furthermore highlights the importance of time-resolved studies of stress response dynamics. Approached naively, a stress condition should induce or repress a gene's expression and lead to a one-time or steady response pattern. Correspondingly, previous analyses on the abiotic stress response in Arabidopsis thaliana using the same dataset as analyzed here were based on detecting differential expression at a selected, single time point . Here, we show that oftentimes stress response is signified by a complex pattern of up- and downregulation instead. Thus, when aiming to investigate stress response, single or few measurement time points may fall short of unraveling the true stress response dynamics. In addition, simple fold change considerations may be augmented by metrics that capture the complexity of response as well, as the one introduced here, the Permutation Entropy.
Although the focus here has been on the characterization of single profiles, the reduction of continuous level data to patterns or symbols has also been explored in the context of modeling dynamical systems including interactions between many interacting network constituents . The applicability of the PE metric to capture interactions may thus offer a fruitful avenue for further research.
Permutation Entropy provides a simple, yet powerful metric to capture complexity in patterns found in temporal profile data of single entities such as gene transcripts or individual metabolites. While longer time series data are preferable, even for relatively short time series with only few data points, meaningful results can be obtained. Evidently, the concept of PE lends itself to the analysis of any time series data. As generating time series is becoming increasingly common, PE may emerge as a standard quantitative approach for their analyses.
Expression data, Probe Set
Arabidopsis thaliana gene expression time series data based on the Affymetrix ATH1 microarray available as part of the AtGenExpress data resource  were obtained from The Arabidopsis Information Resource - TAIR [40, 41]. The dataset comprised nine abiotic stress (including heat, cold, genotoxic, UV-B-light, osmotic, salt, wounding, drought, and osmotic stress conditions) and one common control condition. All raw hybridization data (CEL-files) for each condition were processed and RMA-normalized  using Matlab 188.8.131.524 (R2008a) Bioinformatics Toolbox (version 3.1). Repeat hybridizations were averaged and all data log2-transformed. In total, expression data for 22,746 Arabidopsis gene probes measured at 7 time points common to all conditions and corresponding to 0 h, 0.5 h, 1 h, 3 h, 6 h, 12 h, and 24 h after stress induction were used for the analysis. Only ATH1 probes with unique gene-probe mappings and those mapping to nuclear encoded genes were used, leaving 21,797 gene probes for analysis.
Modified Permutation Entropy (PE) algorithm tailored to gene expression time series data
Permutation Entropy (PE) as introduced in  was applied to the Arabidopsis gene expression time series data investigated here. In short, PE corresponds to the entropy contained in a gene's, g, temporal expression profile based on the probabilities of possible orderings, or order patterns of time points sorted by expression value, of n subsequent data points, with n set to three, according to Eq. 1:
where p i is the estimated probability of occurrence of order pattern i in the t = 7 time series data points measured for gene g under a particular experimental condition and computed as the relative frequency of pattern i in the total of 5 consecutive three-point order patterns (t-n+1 patterns are possible in t time points). The summation is performed over all motifs actually observed in a given gene's time series. As there are N = 13 different three-point order pattern (see below), 7 discrete PE values are possible (corresponding to these discrete possible realizations 5 (all patterns identical), 4 (the same)-1 (different), and likewise: 3-2, 3-1-1, 2-2-1, 2-1-1-1, 1-1-1-1-1, resulting from 5 consecutive order patterns. Thus, the lowest possible PE value for a gene is zero - only one order pattern is realized - while the maximal possible PE value is PEmax = -5*1/5*log2(1/5) = 2.32 corresponding to five different patterns with equal probability.
Caused by technical noise and biological variability, small expression differences may not reflect an actual and significant change, but rather essentially unaltered expression levels. Therefore, deviating from the original PE computation procedure, we introduced the "no-change" pattern to avoid overly associating complexity (increased PE) with pure noise. A gene's expression level in two consecutive time points was regarded as unchanged if the respective difference was below a specified threshold, d e , set to d e = 0.35 corresponding to the determined mean absolute difference (d m ) plus one standard deviation between repeat measurements across all datasets with d m = 0.156 ± 0.19 (log2 values). By introducing the no-change pattern, N = 13 different order motifs for three consecutive time points (n = 3) are possible. Similar results were obtained for different values of d e .
Gene Ontology annotation
Gene Ontology (GO) slim term based gene annotation information was obtained from TAIR [40, 41]. All Arabidopsis genes were grouped into 14 GO process- and 15 molecular function categories according to the defined GO-slim terms.
Identification of genes with appreciable expression level
Low PE-values would also be obtained for genes not being expressed at all (or expressed at very low levels) across all time points and across the various conditions as there is no detectable change either. To eliminate this trivial result and to focus on genes whose expression level is indeed detectable, we sorted all gene probes based on their expression levels in the control sample and discarded the 10,000 genes with lowest expression rank from the analyses pertaining to motif counts, intergenic distances, and conservation. For the remaining 11,797 gene probes, no dependency of their cumulative PE (PE values summed up over all conditions) on expression rank was observed, while it increased steadily with increasing rank below the 10,000 rank cutoff (Additional File 1, Supplementary Figure S1). Thus, possible effects of absolute expression level on PE (as the trivial no expression = zero PE relationship) have been reduced by eliminating the 10,000 genes expressed at low levels under control conditions.
Cis-regulatory motif information, Distance to next up-stream gene
Cis-regulatory motif annotation information associated with all Arabidopsis genes was obtained from the Athena database  and processed as described in . Cis-regulatory motifs were considered up to 500 nucleotides upstream of a gene's annotated transcription start site. Only genes with intergenic upstream regions larger than 500 nucleotides were considered in the associated analyses. All motifs mapping to the considered regions were considered; i.e., multiply mapping motifs were not rendered unique. Motifs were allowed to overlap. Intergenic distances were obtained from TAIR [40, 41].
Identification of conserved plant genes
Well conserved ('ancient') plant genes across long evolutionary times were determined by identifying all genes shared (conserved) between Arabidopsis thaliana and Physcomitrella patens, a moss, and Chlamydomonas reinhardtii, a single-cell algae. All 30,690 Arabidopsis protein sequences obtained from TAIR [40, 41] were aligned to all 35,938 protein sequences from Physcomitrella  and all 16,709 protein sequences annotated in Chlamydomonas  (both protein sequence sets obtained from the Joint Genome Institute, JGI http://www.jgi.doe.gov/) using blastp . Pairwise alignments with greater or equal to 50% sequence identity over an alignment length of at least 100 amino acids were considered to be associated with conserved proteins, and in turn, genes. In total, 6,678 proteins/genes were identified as conserved between Arabidopsis and Physcomitrella, and 2,074 proteins shared in Chlamydomonas and Arabidopsis according to the applied parameters. Note that all reported results are obtained similarly, when a more stringent 70% sequence identity threshold was applied. Furthermore, in the analyses reported under Results, only those genes with high expression levels were included (see above).
GO-term enrichment analysis
GO-term enrichment analyses were performed using the Fisher's exact test. False Discovery Rate (FDR) multiple testing correction was applied to the obtained p-values according to Benjamini-Hochberg .
Correlation network connectivity degree
Across all experimental conditions, all temporal gene expression profiles were correlated using the Pearson correlation coefficient, r, in an all-against-all fashion. Two gene transcripts were considered connected if their respective pairwise absolute correlation coefficient, r, was greater than 0.98. The degree of a gene transcript refers to its summed up number of connections.
Adams J: The Complexity of Gene Expression, Protein Interaction, and Cell Differentiation. Nat Educ 2008., 1:
Kitano H: Systems biology: a brief overview. Science 2002, 295: 1662–1664. 10.1126/science.1069492
Beltrao P, Serrano L: Specificity and evolvability in eukaryotic protein interaction networks. PLoS Comput Biol 2007, 3: e25. 10.1371/journal.pcbi.0030025
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863
Androulakis IP, Yang E, Almon RR: Analysis of time-series gene expression data: Methods, challenges, and opportunities. Ann Rev Biomed Eng 2007, 9: 205–228. 10.1146/annurev.bioeng.9.060906.151904
Strassburg K, Walther D, Takahashi H, Kanaya S, Kopka J: Dynamic transcriptional and metabolic responses in yeast adapting to temperature stress. Omics 2010, 14: 249–259. 10.1089/omi.2009.0107
Ernst J, Nau GJ, Bar-Joseph Z: Clustering short time series gene expression data. Bioinformatics 2005, 21(Suppl 1):i159–168. 10.1093/bioinformatics/bti1022
Furlanello C, Serafini M, Merler S, Jurman G: Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics 2003, 4: 54. 10.1186/1471-2105-4-54
Menges M, Hennig L, Gruissem W, Murray JAH: Cell cycle-regulated gene expression in Arabidopsis. J Biol Chem 2002, 277: 41987–42002. 10.1074/jbc.M207570200
Chechik G, Oh E, Rando O, Weissman J, Regev A, Koller D: Activity motifs reveal principles of timing in transcriptional control of the yeast metabolic network. Nat Biotechnol 2008, 26: 1251–1259. 10.1038/nbt.1499
D'Haeseleer P, Liang S, Somogyi R: Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 2000, 16: 707–726.
Bansal M, Della Gatta G, di Bernardo D: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 2006, 22: 815–822. 10.1093/bioinformatics/btl003
Kim SY, Imoto S, Miyano S: Inferring gene networks from time series microarray data using dynamic Bayesian networks. Brief Bioinform 2003, 4: 228–235. 10.1093/bib/4.3.228
Walther D, Strassburg K, Durek P, Kopka J: Metabolic pathway relationships revealed by an integrative analysis of the transcriptional and metabolic temperature stress-response dynamics in yeast. Omics 2010, 14: 261–274. 10.1089/omi.2010.0010
Jozefczuk S, Klie S, Catchpole G, Szymanski J, Cuadros-Inostroza A, Steinhauser D, Selbig J, Willmitzer L: Metabolomic and transcriptomic stress response of Escherichia coli. Mol Syst Biol 2010, 6: 364. 10.1038/msb.2010.18
Vu TT, Vohradsky J: Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae. Nucl Acids Res 2007, 35: 279–287. 10.1093/nar/gkl1001
Kim S, Imoto S, Miyano S: Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Biosystems 2004, 75: 57–65. 10.1016/j.biosystems.2004.03.004
Holter NS, Mitra M, Maritan A, Cieplak M, Banavar JR, Fedoroff NV: Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc Natl Acad Sci USA 2000, 97: 8409–8414. 10.1073/pnas.150242097
Phan S, Famili F, Tang ZJ, Pan YL, Liu ZY, Ouyang J, Lenferink A, O'Connor MMC: A novel pattern based clustering methodology for time-series microarray data. Int J Comp Math 2007, 84: 585–597. 10.1080/00207160701203419
Ahnert SE, Willbrand K, Brown FC, Fink TM: Unbiased pattern detection in microarray data series. Bioinformatics 2006, 22: 1471–1476. 10.1093/bioinformatics/btl121
Eckmann JP, Ruelle D: Ergodic-Theory of Chaos and Strange Attractors. Rev Mod Phys 1985, 57: 617–656. 10.1103/RevModPhys.57.617
Boffetta G, Cencini M, Falcioni M, Vulpiani A: Predictability: a way to characterize complexity. Phys Rep-Rev Sect Phys Lett 2002, 356: 367–474.
Bandt C, Pompe B: Permutation Entropy: A Natural Complexity Measure for Time Series. Phys Rev Lett 2002, 88: 174102–174106. 10.1103/PhysRevLett.88.174102
Frank B, Pompe B, Schneider U, Hoyer D: Permutation entropy improves fetal behavioural state classification based on heart rate analysis from biomagnetic recordings in near term fetuses. Med Biol Eng Comp 2006, 44: 179–187. 10.1007/s11517-005-0015-z
Kaul S, Koo HL, Jenkins J, Rizzo M, Rooney T, Tallon LJ, Feldblyum T, Nierman W, Benito MI, Lin XY, et al.: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408: 796–815. 10.1038/35048692
Adami C: What is complexity? Bioessays 2002, 24: 1085–1094. 10.1002/bies.10192
Carlson MR, Zhang B, Fang Z, Mischel PS, Horvath S, Nelson SF: Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks. BMC Genomics 2006, 7: 40. 10.1186/1471-2164-7-40
Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302: 249–255. 10.1126/science.1087447
Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, Gerstein M: Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 2004, 431: 308–312. 10.1038/nature02782
Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, D'Angelo C, Bornberg-Bauer E, Kudla J, Harter K: The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J 2007, 50: 347–363. 10.1111/j.1365-313X.2007.03052.x
Yamaguchishinozaki K, Shinozaki K: A Novel Cis-Acting Element in an Arabidopsis Gene Is Involved in Responsiveness to Drought, Low-Temperature, or High-Salt Stress. Plant C 1994, 6: 251–264.
Tremousaygue D, Manevski A, Bardet C, Lescure N, Lescure B: Plant interstitial telomere motifs participate in the control of gene expression in root meristems. Plant J 1999, 20: 553–561. 10.1046/j.1365-313X.1999.00627.x
Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nat Biotechnol 2000, 18: 1257–1261. 10.1038/82360
Li M, Vitányi P: An introduction to Kolmogorov complexity and its applications. 3rd edition. Springer; 2008.
To TL, Maheshri N: Noise Can Induce Bimodality in Positive Transcriptional Feedback Loops Without Bistability. Science 2010, 327: 1142–1145. 10.1126/science.1178962
Li J, Min R, Vizeacoumar FJ, Jin K, Xin X, Zhang Z: Exploiting the determinants of stochastic gene expression in Saccharomyces cerevisiae for genome-wide prediction of expression noise. Proc Natl Acad Sci USA 2010, 107: 10472–10477. 10.1073/pnas.0914302107
Walther D, Brunnemann R, Selbig J: The regulatory code for transcriptional response diversity and its relation to genome structural properties in A. thaliana. PLoS Genet 2007, 3: e11. 10.1371/journal.pgen.0030011
Karlic R, Chung HR, Lasserre J, Vlahovicek K, Vingron M: Histone modification levels are predictive for gene expression. Proc Natl Acad Sci USA 2010, 107: 2926–2931. 10.1073/pnas.0909344107
Pigolotti S, Krishna S, Jensen MH: Symbolic dynamics of biological feedback networks. Phys Rev Lett 2009, 102: 088701. 10.1103/PhysRevLett.102.088701
Huala E, Dickerman AW, Garcia-Hernandez M, Weems D, Reiser L, LaFond F, Hanley D, Kiphart D, Zhuang MZ, Huang W, et al.: The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucl Acids Res 2001, 29: 102–105. 10.1093/nar/29.1.102
Rhee SY, Beavis W, Berardini TZ, Chen GH, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, et al.: The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucl Acids Res 2003, 31: 224–228. 10.1093/nar/gkg076
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of affymetrix GeneChip probe level data. Nucl Acids Res 2003, 31: e15. 10.1093/nar/gng015
O'Connor TR, Dyreson C, Wyrick JJ: Athena: a resource for rapid visualization and systematic analysis of Arabidopsis promoter sequences. Bioinformatics 2005, 21: 4411–4413.
Rensing SA, Lang D, Zimmer AD, Terry A, Salamov A, Shapiro H, Nishiyama T, Perroud PF, Lindquist EA, Kamisugi Y, et al.: The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 2008, 319: 64–69. 10.1126/science.1150646
Merchant SS, Prochnik SE, Vallon O, Harris EH, Karpowicz SJ, Witman GB, Terry A, Salamov A, Fritz-Laylin LK, Marechal-Drouard L, et al.: The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science 2007, 318: 245–251. 10.1126/science.1143609
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. J Mol Biol 1990, 215: 403–410.
Benjamini Y, Hochberg Y: Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J Royal Stat Soc B-Method 1995, 57: 289–300.
This work was supported in part by the German Research Foundation (SFB 555), the Potsdam Research Cluster for Georisk Analysis, Environmental Change and Sustainability (PROGRESS), and the Leibniz association (project ECONS). We wish to thank Sebastian Ahnert for helpful comments.
XS and DW designed all tests and analyses and performed all computations. YZ and JK introduced the concept of PE and contributed to the interpretation of results. VN initiated the research into the complexity of temporal gene expression profiles. XS, YZ, JK, and DW wrote the manuscript. All authors read and approved the final version of the manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Sun, X., Zou, Y., Nikiforova, V. et al. The complexity of gene expression dynamics revealed by permutation entropy. BMC Bioinformatics 11, 607 (2010). https://doi.org/10.1186/1471-2105-11-607