Phase analysis of circadian-related genes in two tissues

Background Recent circadian clock studies using gene expression microarray in two different tissues of mouse have revealed not all circadian-related genes are synchronized in phase or peak expression times across tissues in vivo. Instead, some circadian-related genes may be delayed by 4–8 hrs in peak expression in one tissue relative to the other. These interesting biological observations prompt a statistical question regarding how to distinguish the synchronized genes from genes that are systematically lagged in phase/peak expression time across two tissues. Results We propose a set of techniques from circular statistics to analyze phase angles of circadian-related genes in two tissues. We first estimate the phases of a cycling gene separately in each tissue, which are then used to estimate the paired angular difference of the phase angles of the gene in the two tissues. These differences are modeled as a mixture of two von Mises distributions which enables us to cluster genes into two groups; one group having synchronized transcripts with the same phase in the two tissues, the other containing transcripts with a discrepancy in phase between the two tissues. For each cluster of genes we assess the association of phases across the tissue types using circular-circular regression. We also develop a bootstrap methodology based on a circular-circular regression model to evaluate the improvement in fit provided by allowing two components versus a one-component von-Mises model. Conclusion We applied our proposed methodologies to the circadian-related genes common to heart and liver tissues in Storch et al. [2], and found that an estimated 80% of circadian-related transcripts common to heart and liver tissues were synchronized in phase, and the other 20% of transcripts were lagged about 8 hours in liver relative to heart. The bootstrap p-value for being one cluster is 0.063, which suggests the possibility of two clusters. Our methodologies can be extended to analyze peak expression times of circadian-related genes across more than two tissues, for example, kidney, heart, liver, and the suprachiasmatic nuclei (SCN) of the hypothalamus.


Background
Circadian rhythms (or the biologic clocks that control them) have stimulated interest in recent years due to their importance in orchestrating physiological behavior, bio-logical processes, and adaptability of biological systems to changes in environment [1][2][3]. Many circadian-related genes have been explored using high-throughput DNA microarray technology [1][2][3]. These studies also have stim-ulated efforts to apply and develop methodologies in circular/directional statistics to elucidate important characteristics of circadian gene expression and also compare their patterns of peak expression times (phase angles) across different tissue types, to help elucidate their diverse tissue-specific functions [1,2,4,5].
As periodic oscillation characterizes the expression pattern of both circadian genes and cell cycle genes, many correlation-based and Fourier-based methodologies [1][2][3] proposed for analyzing cell cycle gene expression can be directly applied to circadian gene expression analysis. However, there are some distinct differences between studies in cell cycle gene expression and circadian gene expression. First, most cell cycle gene expression patterns are based on cell cultures studied in vitro, while most circadian gene expressions are based on various tissues or organs in vivo. Consequently, circadian gene expression may be more complex or tissue/cell-specific. Second, the four phases of a cell cycle, namely, G 1 , S, G 2 , and M phases, have been well characterized through intensive research over the last thirty years, and more than 54 mammalian [6] and 104 yeast cell cycle genes have been identified [7]. In contrast, to date, less is known about circadian genes: only eight core mammalian circadian genes have been identified: Csnk1e, Cry1, Cry2, Per1, Per2, Per3, Clock, and Bmall [8]. In addition, it is not clear whether these known circadian genes and any other circadian-related genes identified from high-throughput microarrays can be assigned to a few functional phases, in analogy to the phases (G 1 , S, G 2 , M) in cell cycle. Note that many studies on cell cycle gene expression based on microarray were on same organism [6] and cell-type [7] under different experimental conditions. Therefore, we expect that a set of cell cycle genes commonly expressed in various conditions are consistent in their peak expression/ activation time [9]. However, it is an opening question whether phases or peak expression times for a set of circadian-related genes commonly expressed in multiple tissues, such as heart, liver, kidney, and SCN of the hypothalamus are in synchrony because expression of some circadian-related genes may be tissue-specific. Statistical tools for analyzing such a type of circular data cross multiple tissues need to be developed.
The phase angles estimated from cycling transcripts in Panda et al. [1] and Storch et al. [2] can be regarded as points on a circle of unit radius, which are treated as circular data in circular/directional statistics [10,11]. Circular data are commonly modeled with a von Mises distribution function on the unit circle, an analog to the normal distribution for linear data. The main feature of circular data is that it is directional and classical methods based on linear data can produce meaningless results. As an example, suppose a bird takes off in the northeast direction at an angle of 2°, while another takes off in the southeast direction at an angle 358° then their mean direction (by usual linear methods) is 180°, or due west! Means and variances and other statistical analyses must respect the directional nature of the data to avoid such nonsensical results. For example, a sum of two points on a unit circle is calculated as the sum of the two vectors, yielding a vector with certain direction and length. With this vector averaging, the two birds now have a mean direction of 0°, corresponding properly to due east. Special methods are available in the literature for describing correlation and regression between circular variables [10,11]. One needs to be cautious when analyzing circular and linear variables simultaneously. In some cases circadian phase angles have been mistreated as linear variables in linear regression of the phase angle on the period and amplitude [12]. Results so obtained may not be useful for interpretation.
The motivation for our work is based on the observation in [1][2][3] that some circadian-related genes that are expressed in two tissues are systematically lagged in peak expression time in the two tissues. Panda et al. [1] reported that many of the 28 circadian-related transcripts common to the suprachiasmatic nuclei (SCN) of the hypothalamus and liver, including Per2 and Rev-Erbβ, are delayed by 4-8 hrs in peak expression in liver relative to the SCN. Ueda et al. [3] validated in vitro that the Rev-ErbA/ ROR response element in both the SCN and liver tissues is expressed in phase with Bmal1 and in anti-phase with Per2 oscillation. These studies suggest that the coordinated temporal expression of circadian genes in-phase and antiphase in different tissues is an interesting but a complex biological phenomenon. Statistical analysis tools for studying this type of interesting biological questions arising in recent genomics studies are needed.
To address the above questions, we propose a few steps in the following sections. Given a set of circadian-related genes common to two tissues, we first fit a random-periods model [13] to the time-course expression for each gene individually in each of two tissues, to estimate its phase angles along with periods and amplitudes. The angular difference between the two phases for each gene can be represented as an angle, i.e. as a point on a circle. Using a mixture of two von Mises distributions, we cluster angular differences of the genes into two groups; genes whose expressions are synchronized (mean difference is close 0) in the two tissues and those whose expressions are different in the two tissues. The identified clusters may provide a hint on association of circadian genes specific to these tissues. We then assess the association of each set of genes common to two tissues using Down and Mardia's circular-circular regression model [14]. In addition, we propose a new circular-circular regression-based boot-strap method to assess the mixture of two homogeneous phase distributions for the two tissues. We illustrate the proposed methodologies using the heart and liver circadian-related gene expression data sets from Storch et al. [2].

Phase estimation
One characterizing feature of circadian-related genes expression is the periodically oscillating pattern. Sinusoidal functions have been used to model the circadian gene expression level [1][2][3]. We apply the "random-periods model" [13] to estimate the phase angles, period, and amplitude together for a given circadian gene expression using nonlinear least-squares regression. While there is no attenuation in circadian gene expression, the sinusoidal component of the "random-periods model" is reduced to a simple sinusoidal function K g cos(2πt/T g + φ g ), where K g , T g, , and φ g are the amplitude, period, and phase (angle) of gene g. The phase parameter φ g indicates when the expression of the g gene reaches its maximum.

A mixture of two von Mises distributions for circular paired-difference data
After estimation of activation times or phase angles for a set of circadian-related genes that expressed in two tissues or organs, we are interested in examining whether the phase angles are synchronous or not. Let and denote the estimated phase angles of a circadian-related gene g, g = 1, 2, ..., n, in the two tissues x and y, where -π≤ ≤π, -π ≤ ≤ π. Further, we model the distribution of the angular difference as a mixture of two von Mises distributions. One component will correspond to a subset of the n genes have the same phase angle in the two tissues and the other will correspond to genes having unequal phase angles in the two tissues. Thus the probability density function is given by: Here, p i is the mixing parameter, µ i is the mean direction for distribution i, κ i is the concentration parameter characterizing the variability of the estimated differences ∆ g about µ i , and I 0 (κ i )is the modified Bessel function of the first kind and order zero. We expect that one von Mises distribution has mean close to 0 radians, because it consists of a concordant subset of genes having the same phase in the two tissues, whereas the other distribution contains a set of "discordant" genes. The variation in shift characterizing genes of the second set can be measured by summing1 -cos(∆ g ) [11].
The log likelihood for the mixture of two von Mises distributions in (2) is The parameters in the vector (p 1 , κ 1 , µ 1 , κ 2 , µ 2 ) in the mixture model (3) can be estimated using the Newton-type optimization method in the Matlab optimzation toolbox.
To ensure convergence to the global solution, we use fifty random starting points. A comparison of the performance of various estimators can be found in [15]. We chose the Newton-type optimization method in the estimation due to its simplicity and flexibility of converting unconstrained searching to constrained optimization by adding constraints on the mixing parameter p 1 , i.e., 0.15 <p 1 < 0.85 or the concentration parameter κ 1 , or κ 2 , i.e., κ 1 <10 and κ 2 < 10. Upon obtaining the five estimated parameters ( ) in the mixture model (2), we statistically assign each of the ∆ g to one of the two components based on its relative likelihood. That is, gene g is assigned to cluster 1 if , otherwise to cluster 2.

Circular-circular regression
In a recent article [9] we described the notion of association between the phase angles of a set of cell-cycle genes from a pair of experiments using the circular-circular regression model of Downs and Mardia [14]. Within each cluster obtained above, we shall apply the methodology described in [9] to examine the association between the estimated phase angles of the genes in the two tissues.   Plot of ∆ g showing clusters for 48 circadian-related cycling transcripts [2].

A bootstrap test for number of clusters
To assess whether there are two clusters in the mixture of ∆ g , g = 1,..., n, in (1), in the following we propose a bootstrap methodology to test the null hypothesis that ∆ g 's are a random sample from a single von-Mises distribution against the alternative hypothesis that they are from a mixture of two independent von-Mises distributions. The proposed bootstrap procedure is described in the following steps: 1) Regress phase angles in y tissue on phase angles in x tissue using the circular-circular regression model (4) and compute the circular variance cv based on the residuals r g , g = 1, ..., n, from this single circular regression; 2) Compute for each gene g the difference ∆ g = -, where and are the estimated phase angles for gene g in tissues y and x; 3) Fit the phase differences ∆ g , g = 1, ..., n, to a two-component mixture of von-Mises model, obtaining two sepa- Residual plot of circular-circular regression for the 38 synchronized genes in heart and liver tissues [2] from cluster 1 Figure 2 Residual plot of circular-circular regression for the 38 synchronized genes in heart and liver tissues [2] from cluster 1.
rate clusters with n 1 and n 2 genes in provisional clusters 1 and 2, respectively; 4) Regress n 1 and n 2 phase angles of y 1 on x 1 and y 2 on x 2 separately for the two clusters, and obtain the residuals r cluster1 and r cluster2 from each of the two regressions; Then the bootstrap p-value is the proportion of T bootstrap that are greater than the calculated value of test statistic T.

Datasets
We apply our methodologies to 52 circadian-related cycling transcripts that are expressed in mouse heart and r g * r g bootstrap r g Residual plot of circular-circular regression for the 10 discordant genes in heart and liver tissues [2] from cluster 2 Figure 3 Residual plot of circular-circular regression for the 10 discordant genes in heart and liver tissues [2] from cluster 2.
liver tissues, identified by Storch et al. [2]. In Storch's studies, mice were entrained to a 12 hrs light/dark cycle for more than two weeks, then placed in a constant dim light for more than 42 hrs. The tissue samples were taken from sacrificed mice at 4-hour intervals for 48 hrs, or about two circadian cycles, as in the circadian studies of Panda et al. [1].
Due to the poor fit of our random-periods model [13] to the expression of four transcripts (accession numbers: AI834950, AF003348, AF043288, and AB014494), we excluded them from the list of 52 circadian-related transcripts [2] in our analysis. The estimated phase angles for the remaining 48 transcripts in both heart and liver for clusters 1 and 2 are listed in Tables 1 and 2, respectively. The angular differences (heart, denoted as y, minus liver, denoted as x) ∆ g are plotted in Figure 1.
The 48 circadian-related cycling transcripts were assigned into two clusters based on a mixture of two von Mises distributions, as described above. The first cluster contains 38 genes (Table 1) and the second cluster contains 10 genes ( Table 2). The five estimated parameters ( ) = (0.79, 1.50, -0.07, 4.56, 2.16). The distance between the two clusters in mean direction ( ) is 2.23 rads, suggesting that the ten transcripts in cluster 2 have different points of peak expression in heart and liver. This can also be seen in Figures 2 and 3. The mean direction of 38 phase angles in the cluster 1 is -0.07 rads, suggesting that the peak expression times for the 38 cycling genes in heart and liver are close to synchronized. In contrast, the peak expression times for the 10 genes in heart and liver (in cluster 2) are away from the 0 direction by 2.16 rads, with heart ahead of liver by about 8 hrs. This result suggests that these 10 discordant circadian-related genes may play different roles in the heart and liver.
We estimated the across-tissue association of the genes in each of the two clusters by regressing the phase angles in heart, denoted as y, onto those in liver, denoted as x, using the circular-circular regression model described above (5).  Figure 2 shows the residuals for all genes, i.e. the angular difference between the phase angle in heart and the prediction based on the same gene's pattern of expression in liver. The residuals for the 10 genes in the second cluster are shown in Figure 3. We can see that the activation times of these transcripts in heart and liver tissues are matched well after a 2.16 radian clockwise rotation of the phase angles in heart relative to in liver. Upon obtaining the 'slope parameters' and , we tested the null hypothesis H 0 : ω i = 0 vs. the alternative hypothesis, H a : ω i ≠ 0, i = 1, 2, using the test statistic derived by Downs and Mardia [14]. The corresponding p-values for clusters 1 and 2 were less than 5 × 10 -7 and 1.4 × 10 -4 , respectively, suggesting that the associations of the circadian activation times in heart and liver are strong in both clusters. Here we included the expression plots of the transcripts AI850638 and AI846522 in heart and liver tissues [2] and the fits of our model to the two transcripts in Figures 4 and  5, accordingly. The plots reveal that the fits of our model to the data are reasonably good, and peak times of the two transcripts in liver tissue are markedly lagged relative to heart tissue.
Based on 3000 bootstraps using the procedure outlined in section of bootstrap test, we found that the estimated pvalue for sufficiency of a one-component distribution in the phase difference of heart and liver tissues for the 48 genes was 0.063. Although this is not significant at 5% level of significance, it suggests that a two-component von-Mises distribution for the phase in heart and liver tissues better describes the relationships among the peak expression times for the studied genes.

Discussion and conclusion
Our analysis of the peak expression times (phases) for a set of 48-circadian-related genes expressed in both heart and liver tissues [2] suggests that not all of the genes are maximally expressed at the same time in the two tissues. Instead, among the 48 genes, 38 are synchronized in phase or peak expression times in heart and liver tissues, and the other 10 genes express earlier by about 2.23 rads or 8 hours in heart than in liver. Our bootstrapping test result supports, albeit weakly, the existence of two distinct subsets among the 48 genes. Although our findings are based on the single experimental dataset of Storch et al. [2], our results are similar to an earlier observation made by Panda et al. [1] that the peak expression times for some genes are not synchronized in suprachiasmatic nuclei (SCN) compared to liver. One implication of our results, giving quantitative support for the conclusion of Storch et al. [2], may be that some commonly expressed circadianrelated genes may perform different functions across different organs [1,2].
We have developed a new bootstrap method for assessing the adequacy of one versus the need for two clusters of genes in the two sets of phase angles in heart and liver tissues for the 48 genes. In particular, we evaluated the significance of the circular variances of the residuals from circular-circular regression of the phase angles in heart on in liver in one cluster vs. two clusters. In contrast, most studies on mixtures of circular variables have focused on dividing a set of data into two subsets. To the best of our knowledge, no studies have used a mixture of two compo-nents for circular datasets for testing heterogeneity of a cyclic pattern.
This work would not have been undertaken without the interesting observations by Panda et al. [1], Storch et al. [2] and Ueda et al. [3] that a few circadian-related genes expressed out of phase across two tissues of mouse. In addition, quantitatively the results of our statistical analysis depend on the approximation of sinusoidal waveform for circadian gene expression, on reasonably accurate estimation of the phase angles, on a relative large sample size of genes common to two tissues, on the approximate validity of the von Mises distribution for each cluster of differences. In previous circadian gene expression studies [1][2][3], a tissue sample was commonly taken at each 4-hr interval for two circadian cycles, i.e., 12 time points per gene expression. Although our fits to the 48 circadian gene expression in both tissues are reasonably good, as shown in Figures 4, 5, further experimental and simulation studies may be needed to understand the role of sample size and sampling frequency on phase estimation when a sinusoidal waveform is presumed for circadian rhythm. In this work, we considered the difference of two phases for a gene in two tissues as a random variable modeled by a von Mises distribution on a circle. The corresponding uncertainties are captured by the degree to which the von Mises distribution is spread out on the circle.
The 48 circadian related genes expressed in heart and liver of mouse each provide a pair of peak expression times. We have assumed that there are at most two clusters. Further experimental studies are needed for testing whether there might be more than two clusters with for 48 genes. While our method can hypothetically be extended to allow one to test the need for 3 clusters rather than for 2 clusters, the sample size of 48 genes, may not be sufficient to carry out such a test with much power. The results of the two-clusters analysis must be regarded as descriptive. Our analysis of circadian gene expression may serve to stimulate further methodological development in circular/directional statistical analysis of genes that may be expressed differently in phase in two or more tissues. The genes in each cluster would need to be scrutinized separately for further elucidation of their tissue-specific biological and physiological functions [1,2,5].
Our methodologies can be, in principle, extended to analyzing multiple circadian gene expression data sets across multiple tissues (organs), e.g., kidney, heart, liver, and SCN where investigators are interested in understanding whether there is one set of core circadian-related genes that are similar in their patterns of activation across different organs (or tissues), and others that are differently expressed in different tissues, as suggested by Reppert and Weaver [5]. Because the maximum likelihood approach Plots of log 2 expression for the transcript AI846522 in heart and liver tissues [2] Figure 5 Plots of log 2 expression for the transcript AI846522 in heart and liver tissues [2]. Data (-o-) and fitted curve of the model (-). The phase difference of the gene in the heart and liver is -3.06 rads -1.31 rads = 1.9 rads (module).
considered in this paper may become challenging due to computational complexity, further methodology development is needed in this area. For such multiple mixture problems, one may want to consider a Bayes or empirical Bayes approach. However, to the best of our knowledge, Bayes and empirical Bayes methods for mixture problems associated with circular data are not well developed. The present application provides an excellent opportunity for developing such methodology for mixture problems associated with circular data. Secondly, using the standard likelihood approach, we clustered the 48 genes into two clusters on the basis of the phase difference between the two tissues. It would be interesting and useful to derive an estimate of "reliability" of clustering for each gene. One possible approach is to perform a bootstrap by selecting a simple random sample of 48 genes from the list of 48 genes and classify them into two clusters using procedure described in this paper. This procedure could be repeated for a large number of times, say 1000. One then could estimate the proportion of times a gene was classified into one of two clusters. Unfortunately, such a procedure does not work well when the number of genes is small and their "true" cluster memberships are unknown and the numbers of genes in each cluster are highly unbalanced. Nonetheless, these important questions need to be addressed because the number of potential applications for such a procedure is ever growing.