Correlation analysis reveals the emergence of coherence in the gene expression dynamics following system perturbation

Time course gene expression experiments are a popular means to infer co-expression. Many methods have been proposed to cluster genes or to build networks based on similarity measures of their expression dynamics. In this paper we apply a correlation based approach to network reconstruction to three datasets of time series gene expression following system perturbation: 1) Conditional, Tamoxifen dependent, activation of the cMyc proto-oncogene in rat fibroblast; 2) Genomic response to nutrition changes in D. melanogaster; 3) Patterns of gene activity as a consequence of ageing occurring over a life-span time series (25y–90y) sampled from T-cells of human donors. We show that the three datasets undergo similar transitions from an "uncorrelated" regime to a positively or negatively correlated one that is symptomatic of a shift from a "ground" or "basal" state to a "polarized" state. In addition, we show that a similar transition is conserved at the pathway level, and that this information can be used for the construction of "meta-networks" where it is possible to assess new relations among functionally distant sets of molecular functions.


Background
Time series of gene expression data from high throughput experiments have been used to infer networks of co-expressed genes. By following the changes in expression at the genomic level, it is possible to identify groups of genes with a similar expression pattern. Most of the techniques currently used in functional genomics have been adapted from machine learning and statistical inference. Some of them generate networks of genes; others simply generate clusters of genes. Examples of the latter are algorithms based on Self-Organizing Maps (SOM) [1][2][3][4], Phylogenic-Type Trees [5][6][7][8][9][10], agglomerative clustering and partitional clustering. For network determination, techniques have been developed based on differential equations [11], Bayesian networks [12], hybrid petri networks [13], Boolean regulatory networks [14,15].
Relevance networks [16,17] are a popular method for the analysis of time series of expression levels. The basic idea is to construct a network of similarity of the genes' expression patterns. Several similarity measures have been utilized, such as correlation and mutual information. This technique can then represent multiple connections between genes, and capture negative as well as positive correlations. Once the matrix containing the similarity measure for all pairs of genes has been computed, a threshold is used to define the links in the network. Network validation can be obtained by permutation testing, i.e. by randomly shuffling the time points independently for each gene.
A similar approach has been applied to metabolic networks [18,19]. The authors computed metabolite correlations to infer changes in regulation using samples from different physiological states. On the other hand, we focused on the analysis of time series of expression levels for genes that were selected for differential expression between treatment and control.
An alternative approach is offered by Graphical Gaussian Models (GGM) that use partial correlation as a measure of independence between two genes. Partial correlations are related to the inverse of the correlation matrix, and in GGMs missing edges indicate conditional independence. One of the biggest problems with GGMs is that the number of genes is usually much larger than the number of samples (e.g. time points), such that the correlation matrix is usually singular and cannot be inverted. Different approaches have been proposed to circumvent this problem: restrict the number of genes analyzed to less than the number of samples [20][21][22]; use partial correlation coefficients of limited order [23][24][25]; approach the matrix inversion as an ill-posed inverse problem through regularization methods (usually via empirical Bayes, such as variance reduction) [26,27].
Although co-expression is not a direct indication of coregulation, it is a very useful tool that can be used to interpret the effect of a perturbation in eliciting different phenotypes when combined with an ontology analysis. Here, we use a correlation based method to generate co-expres-sion networks on three different datasets and we characterize the change in its structural properties induced by the system's perturbation. We then use a correlation-based analysis to infer how the perturbation affects the system at the level of metabolic and signalling pathways.

Results
We considered three datasets of time course gene expression arrays: 1) Conditional, Tamoxifen dependent, activation of the cMyc proto-oncogene in rat fibroblast; 2) Genomic response to nutrition changes in D. melanogaster; 3) Patterns of gene activity as a consequence of ageing occurring over a life-span time series (25y-90y) sampled from T-cells of human donors.

Gene selection
Due to the heterogeneous nature and properties of the datasets, we selected the significant genes according to different criteria.
For the cMyc dataset we applied a two way ANOVA (time and treatment as variability factors) to identify genes whose expression pattern was significantly affected by cMyc activation. With this method we identified a set of 1,191 significant genes out of a total of 8,799 [28].
The D. melanogaster diet arrays provide a higher resolution dataset and genes were selected via GeneTrace, which looks for change points in the time series of the expression ratios between the two cohorts. GeneTrace identified 3,519 genes with significant change point. These results showed that physiological response to nutrient uptake involves a rapid change in transcriptional profile at a global scale, and that most of the changes are small (81% of the 3,519 ratios smaller than 1.5 fold).
For the human aging dataset we applied one way ANOVA (with donors' age as variability factor) with a P value < 0.01 that resulted in a set of 768 probesets selected out of a total of 14,688 probesets.

Correlation network analysis
When cMyc is activated by Tamoxifen, the activity profile of the probesets clearly changes into a strongly correlated regime. These findings are reflected in the histograms of the correlation coefficients for the N-control and T-treatment data sets ( Figure 1) and in the main parameters of the connectivity distributions obtained from the corresponding adjacency matrices ( Table 1). The adjacency matrix characterizing the network was obtained by considering only the correlation coefficients whose absolute value exceeded a threshold fixed between 0.95 and 0.99 (we remark that the lower threshold value was higher than the value requested for a P < 0.05 statistical significance of the correlation coefficients). The results shown in this paper were obtained for a threshold equal to 0.98, but similar results held for the [0.95-0.99] interval. These coefficients were set equal to 1, producing a symmetric adjacency matrix a lr . For each gene connectivity degree k was defined as the total number of genes it was connected to, i.e.
For the T-treatment dataset the number of coefficients close to +1 or -1 increases significantly. This finding indicates that many of the 1,191 genes, whose expression levels over time were affected by tamoxifen stimulation, became either strongly correlated or anti-correlated.   Histogram of the correlation coefficients of the gene expression time series between all the genes selected with the change point analysis in the D. melanogaster dataset over NY-control (top right) the distribution becomes bimodal and genes tend to be either strongly correlated or anti-correlated.
We used time reshuffling to test the time sequence dependence of the results. By randomly shuffling the time series for each gene separately, time relationships between expression levels are broken, but the mean and standard deviation for each gene are unaltered. Properties of the gene network that truly depend on the expression level dynamics should be significantly affected by a random shuffling in time. The bottom row of Figure 2 shows the result for the D. melanogaster data. In both cohorts this leads to a Gaussian distribution for the values of the correlation coefficients ( Fig. 2 bottom row). Analogous results have been obtained for the other two datasets ( Fig.  1 and Fig. 3).
The role of noise was taken in to account by analyzing the correlation coefficient distribution obtained from a dataset of randomly generated vectors of the same size than the experimental datasets (e.g. for the c-Myc dataset: a 5 × 1191 array of values sampled from the Standardized Gaussian Distribution [29]). The resulting distribution strictly resambles that obtained with the no Tamoxifen dataset.
It is possible to identify a subnetwork of strongly correlated/anticorrelated genes by selecting the probesets based on the the treatment factor's strength in the ANOVA analysis. This selection process can be characterized as a transition from a unimodal (most genes are uncorrelated) to a bi-modal behavior (genes are either correlated or anticorrelated). Figure 4 shows how this process occurs by decreasing the cutoff P value (P thr ) used to select significant genes in the cMyc dataset. The different panels show the histogram of the correlation coefficients between the expression values over time from the probesets in the dataset N (left column) and T (right column) for decreasing P thr . The top row includes all the probesets used in the analysis (P thr > 1); the central row corresponds to an intermediate threshold (P thr = 0.2); the bottom row corresponds to the lowest threshold (P thr = 0.05). Notice that the transition from unimodal to bimodal is only present in the T dataset, while the N dataset in not affected even when the cutoff P value is very small. Analogous results have been obtained for the other two datasets.

Pathway analysis
We identified the pathways whose genes' expression was significantly affected by the system perturbation. We performed this analysis on the D. melanogaster diet dataset and on the cMyc dataset which share the common feature Histogram of correlation coefficients of the gene expression time series between genes for the aging dataset of being generated through a direct perturbation of the system. The problem in this analysis is that different pathways in the KEGG [30] database contain different numbers of genes, and that not all of them had been measured due to limitations in the microarrays utilized. For this reason we computed the 95% confidence limits for the population proportions (p) of affected genes in each pathway (see Methods). We found that 45 pathways out of 110 included at least 20% of genes whose expression pattern changed significantly upon re-feeding (see Table 2). For the cMyc dataset, a smaller proportion of genes had been measured in each pathway. Hence, we used a lower threshold and found that 51 pathways out of 135 had more than 5% of the measured genes with significant changes (Table 3).
When we analyzed these pathways with the correlation method described earlier, we found that the difference in the correlation distributions between treated and untreated cases is qualitatively the same as the one observed in the entire pool of genes selected with the change point analysis. Figure 5 shows this effect, as well as the time course of the expression ratios for the Purine metabolism pathway and the Target of Rapamycin (TOR) pathway (notice that TOR does not compare in the Table  2, since it is not part of the KEGG database and was instead manually annotated from the literature). The same effect was observed in the cMyc dataset, as exemplified in Figure 6 for five representative pathways: (A) MAPK signaling, (B) calcium signaling, (C) focal adhesion, (D) gap junction, (E) insulin/IGF signaling.
Transition from unimodal to bimodal behavior for the cMyc data

Discussion
In all three data sets different and specific genes are affected by the perturbations but give rise to a common correlation profile (Figures 1, 2, 3). This suggests that, despite their specificity, the global changes in gene expression follow a pattern that is largely independent from the data sets. This results could be explained by assuming that temporal changes in gene expression in biological complex systems is scale-independent and characterized by changes in an initial triggering core of genes (specific to the perturbation and the cell type) followed by a propagation to other genes.
In particular, in the cMyc dataset we found that the correlation method identifies a list of genes containing many of the genes found by O'Connell et al. [31], as well as many genes that, to our knowledge, have not been identified before. This indicates the possibility that the cMyc regulatory network may be much larger than currently described. To rule out the presence of undetectable corre- lations between genes in the unperturbed state (i.e. when cMyc is not active), we performed a one-way analysis of variance on the unperturbed dataset to detect genes whose expression level changes significantly over time. Less than 3% of the genes was selected for differential expression over time, compared to more than 10% for the perturbed dataset (P < 0.05).
For the D. melanogaster dataset our analysis showed that refeeding induces a change in the expression patterns of thousands of genes. Since this analysis was performed at saturation, we estimate there are few false positives and few undetected changes. The analysis of the correlation between expression patterns reveals that many genes respond to re-feeding in a coordinated manner. In fact, upon re-feeding, the activity profile of the genes clearly changes to a stronglycorrelated/anticorrelated regime.
An extension of this method, based on the cross-correlation between genes belonging to different pathways allowed us to build a network of relationships between pathways. In Figure 7 we show changes of a five-nodes network of selected genes when cMyc is activated. The five pathways, MAPK signaling, calcium signaling (CS), focal adhesion (FA), gap junction (GJ), insulin/IGF signaling (INS/IGF), were among those selected for the large proportion of significant genes ( Table 3). The unperturbed state is characterized by a dominance of random correlations between pathways whereas after the perturbation we observed the emergence of positive and negative correlation regimes ( Figure 6).

Conclusion
Three different systems sharing the time series experimental design after a perturbation (cMyc conditional activation, refeeding after caloric restriction and effect of time in young and old donors) have been examined.
At the genomic scale, we showed that in all three datasets the correlation regime between genes selected by statistical significance analysis follow a similar behavior, which is compatible with the emergence of coherence in the gene Histogram of the correlation coefficients between all the genes for selected pathways expression dynamics following cell perturbation. We also showed that this behavior is conserved at the pathways scale for pathways that have been significantly affected by the perturbation.
In particular, our analysis revealed the emergence of links between a core set of pathways in the cMyc dataset which may play an important role for the comprehension of the early phenotypical changes following cMyc activation.
This method was successful in identifying changes in gene expression profiles related to the acute response to a perturbation both in model systems and in humans as well as in revealing the centrality and importance of selected pathways by its multiscale generalization.

Microarray datasets 1) cMyc dataset
Two gene expression data sets were extracted from the set of microarray experiments based on genetically engineered rat cell lines [31]. The first data set (N data set) contains the gene expression data for the c -myc -/-MycER cell line treated with vehicle (ethanol) only. The second data set (T data set) contains the gene expression data col-lected after the addition of tamoxifen. Binding of tamoxifen to the estrogen receptor domain elicits a conformational change that allows the fusion protein to migrate to the nucleus and act as a transcription factor. Samples were harvested at five time points after the addition of tamoxifen to the culture medium: 1, 2, 4, 8, and 16 h. The entire experiment was repeated on three separate occasions, providing three independent measurements for each gene and each time point. Expression profiling was done by using the Affymetrix platform and U34A Gene Chips.

2) D. melanogaster diet dataset
Newly enclosed virgin females were maintained in yeastfree media until 4 days old and then transferred either to media with yeast (Y-treatment) or to control media without yeast (NY-control). Samples were collected every hour for the following 12 hours. Four additional samples were collected prior to refeeding and arbitrarily designated -4, -3, -2, and -1. Synchronization of the physiological state was obtained by imposing diet restriction in third instars. Affymetric gene chips were used to measure mRNA abundance at each hour for both Y-treatment and NY-controls. A time-ordered sequence of expression ratios was then computed for each gene in the array [32].

Probeset selection 1) cMyc dataset
A full factorial ANOVA was applied to each of the 8,799 probesets to identify those that significantly changed in their expression level in time between the two conditions (data set N versus data set T). Probesets with a P value corresponding to the change-in-treatment factor < 0.05 were considered to be significantly affected by the treatment (i.e., activation of Myc by tamoxifen). A total of 1,191 genes were selected using this criterion.

2) D. melanogaster diet dataset
When considering high resolution time series, changepoint analysis [33] is a powerful tool for detecting subtle changes; it is robust to outliers and can control the overall error rate. The Tatar group has extended this technique to statistically detect changes for time-series microarray data (GeneTrace [32]). For each gene's time-series, the algorithm returns an estimated change-point (CUSUM estimator) and the two-tailed probability of this statistic from a sample of 10,000 random permutations across all of the ordered expression ratios. GeneTrace identified 3,519 genes with significant change point (significance threshold = 0.005, expected false positive inferences 70 out of 14,000).

3) Human aging dataset
One-way ANOVA was applied to each of the 14,688 probesets to identify those that significantly change their expression level in time. With a P value of 0.01, 768 genes were selected for further analysis.

Correlation analysis
The similarity measure for the expression dynamics of two genes within the same data set is given by the correlation between the two expression-level time series. For a given data set, if x lj is the expression level of a gene with label l at time j, then the similarity between two genes with labels l and r, respectively, is given by: where μ l and μ r are the averages in time of the expression levels for the two genes, and σ l and σ r are their standard deviations.
Time reshuffling was used to test the time sequence dependence of the results obtained by the two techniques. By randomly shuffling the time series for each gene separately, time relationships between expression levels are broken, but the mean and standard deviation for each gene are unaltered. Properties of the gene network that truly depend on the expression level dynamics should be significantly affected by a random shuffling in time.

Pathway analysis
We ranked pathways based on the percentage of genes selected by the significance analysis in each one of them.
To account for the different number of measured genes in different pathways, we computed the 95% confidence limits for the population proportions (p) of significant genes in each pathway. Using a relationship between the F distribution and the binomial distribution [34], the confidence interval can be computed for the binomial parameter p [35][36][37]. The lower confidence limit L 1 and the upper confidence limit L 2 become: where n is the total number of genes measured in a pathway, and X is the number of significant genes in that pathway.
We selected a 20% threshold for the Drosophila array and 5% threshold for the rat array. In fact, the Drosophila array contained 14,688 probesets corresponding to more than 90% of the fly's genome. Hence the analysis was done at saturation. The rat array used in the cMyc experiments contained 8,799 probesets corresponding to roughly 1/3 of the rat's genome. Requiring a 20% change in the cMyc experiment would result in a very small list of pathways. However, it is well known that cMyc affects the expression of a large pool of genes (directly or indirectly). Our interpretation is that the same cMyc experiment done at saturation would reflect this at the pathway level as well, i.e. many pathways would more than 20% of the genes with a significant change in expression profile over time.
Links between selected pathways were generated by computing the correlation of the gene expression time series between genes in different pathways and generating a