- Research article
- Open access
- Published:

# A phase synchronization clustering algorithm for identifying interesting groups of genes from cell cycle expression data

*BMC Bioinformatics*
**volume 9**, Article number: 56 (2008)

## Abstract

### Background

The previous studies of genome-wide expression patterns show that a certain percentage of genes are cell cycle regulated. The expression data has been analyzed in a number of different ways to identify cell cycle dependent genes. In this study, we pose the hypothesis that cell cycle dependent genes are considered as oscillating systems with a rhythm, i.e. systems producing response signals with period and frequency. Therefore, we are motivated to apply the theory of multivariate phase synchronization for clustering cell cycle specific genome-wide expression data.

### Results

We propose the strategy to find groups of genes according to the specific biological process by analyzing cell cycle specific gene expression data. To evaluate the propose method, we use the modified *Kuramoto model*, which is a phase governing equation that provides the long-term dynamics of globally coupled oscillators. With this equation, we simulate two groups of expression signals, and the simulated signals from each group shares their own common rhythm. Then, the simulated expression data are mixed with randomly generated expression data to be used as input data set to the algorithm. Using these simulated expression data, it is shown that the algorithm is able to identify expression signals that are involved in the same oscillating process. We also evaluate the method with yeast cell cycle expression data. It is shown that the output clusters by the proposed algorithm include genes, which are closely associated with each other by sharing significant Gene Ontology terms of biological process and/or having relatively many known biological interactions. Therefore, the evaluation analysis indicates that the method is able to identify expression signals according to the specific biological process. Our evaluation analysis also indicates that some portion of output by the proposed algorithm is not obtainable by the traditional clustering algorithm with Euclidean distance or linear correlation.

### Conclusion

Based on the evaluation experiments, we draw the conclusion as follows: 1) Based on the theory of multivariate phase synchronization, it is feasible to find groups of genes, which have relevant biological interactions and/or significantly shared GO slim terms of biological process, using cell cycle specific gene expression signals. 2) Among all the output clusters by the proposed algorithm, the cluster with relatively large size has a tendency to include more known interactions than the one with relatively small size. 3) It is feasible to understand the cell cycle specific gene expression patterns as the phenomenon of collective synchronization. 4) The proposed algorithm is able to find prominent groups of genes, which are not obtainable by traditional clustering algorithm.

## Background

Since Hereford *et al*. [1] first discovered yeast histone mRNAs oscillate during cell division cycle, several experimental studies have identified that many genes are expressed in a cell-cycle-specific manner. These studies have motivated the study of global extent of cycle-specific gene expression. To this end, there have been a number of studies using DNA microarrays to understand whole-genome expression patterns during cell division cycle [2–8]. A particular example is flagella biogenesis in *Caulobactor*, which has four distinct and dependent waves of transcription. Laub *et al*. [3] showed that 20% of *Caulobactor* genes are cell cycle regulated, their expression level consistently having peaks when they function. Another example is the study of yeast *Saccharomyces cerevisiae* [6], in which they also discovered that between 10 and 20% of yeast genes are periodically expressed during cell division. Therefore, it suffices to say that a certain percentage of genes may have the periodicity for its oscillatory activity throughout the cell division. These cell-cycle-specific oscillatory activities can be explained by a biological phenomenon in terms of efficiency and logical order. The cell only makes the enzyme when it is needed. If the enzymes were made all the time, the cell would be inefficient in an environment devoid of the substrates of the enzymes [9].

In this study, we are motivated to apply the theory of multivariate phase synchronization to cell-cycle-specific gene expression data. Synchronization is one of the most commonly present phenomena in various fields of science [10, 11]. Generally, we understand synchronization as a complete coincidence of the states between oscillating systems due to their interactions. Rosenblum *et al*. [12] show that the phase difference of two coupled oscillating systems is bounded while the amplitude is uncorrelated and irregular. There have been numerous applications in different areas such as cardiorespiratory interaction [13–15], brain activity of Parkinsonian patients [16], EEG measurements [17–20], ecology [21], and climate systems [22]. Because our interests of this study are cellular activity during cell cycle, our interested systems are the cell cycle specific genes. Based on the theory of phase synchronization, we pose a hypothesis that expression signals from two genes could be synchronized if these two genes are biologically interacting with each other. That is, two biologically interacting genes produce oscillating expression signals with a common rhythm. Therefore, we propose the phase synchronization as a measure to identify biologically relevant interactions using cell-cycle-specific gene repression data and the cell cycle specific genes are oscillating systems, which produce gene expressions with rhythms (periodicity).

In this study, we present the effort of applying the theory of multivariate phase synchronization to find groups of cell cyclic gene expression signals according to the specific biological process, which is based on the study of Allefeld and Kurths [17]. They present a method for the multivariate analysis of statistical phase synchronization phenomena in empirical data, which is based on the theory of *synchronization cluster*. The basic idea of their analysis is to consider the oscillating systems forming a cluster in which each one contributes to the cluster in different degree. The cluster consists of a common rhythm that is a mean oscillation for all oscillating systems inside the cluster. Based on their theory, we propose an algorithm named as Phase Synchronization Clustering (PSC) algorithm, which produce the clusters of cell cycle specific genes from genome expression data set, and the genes from the same cluster are expected to be involved in the specific biological process. The PSC algorithm is evaluated with synthetic data and cell cycle specific expression data of *Saccharomyces cerevisiae* from the study of Spellman *et al*. [6], in which they analyze gene expression levels in yeast cell cultures whose cell cycle has been synchronized by various methods.

## Results and discussion

### Case study 1: *in silico* experiments

The purpose of this experiment is to show how the proposed PSC algorithm is able to identify the signals that are expressed in the same specific process. In this study, it is assumed that a certain group of gene expression levels during cell cycle can be explained as the synchronization of large ensembles of oscillators, in which each element of the ensemble interacts with all others and is driven by the mean field that is formed by all elements, provided that every members from the group play a role for a certain biological process. The driving force, or the mean field, is not predetermined, but arises from interactions within the ensemble. This force determines whether the systems synchronize, but it itself depends on their oscillation. We use the modified *Kuramoto model* [23] as a phase governing equation that gives the long-term dynamics of globally coupled oscillators

where the *φ*_{i}s are the instantaneous phase, the *ψ* are the mean phase, and the positive constant C represents the coupling strength. It should be noted that the autonomous (or natural) frequency term is excluded from the original *Kuramoto model* for this model, and the mean phase *ψ* is roughly approximated by averaging the phases of all oscillators at current time point. The original *Kuramoto model* describes a large population of coupled limit-cycle oscillators [23]. With this modified model, it is assumed that the instantaneous rate of phase change is proportional to the mean sinusoidal coupling between the mean phase and each instantaneous phase. Given a set of initial conditions and a step size, we can simulate the instantaneous phase using the following for each gene i

With given initial random instantaneous phase signals, the expression signal can be simulated and converted into real signals as*x*_{
i
}(*t*) = *real*[*A* exp[(*jφ*_{
i
}(*t*))] = *A* cos(*φ*_{
i
}(*t*)),

where A is the instantaneous amplitude and is set to 1 for all signals. Then the simulated signals are updated by adding random noise from Gaussian distribution with mean *μ* = 0 and standard deviation *ε*.

To evaluate the PSC algorithm, we generate four sets of the expression signals with four different standard deviations of random Gaussian noise (i.e. *noise level* s) *ε* = 0.1, 0.2, 0.3, 0.4. For each set, we generate two groups of 100 signals. For first group, 20 measurements of signals are simulated with the coupling strength C = 3.0, and for second group, 20 measurements with the coupling strength C = 2.0. It suffices to say that these two groups are separately involved in their own oscillating process because each group has different coupling strength. It means that each group has different driving forces or mean field for their own signals. For each data set, we generate a group of random signals with same number of genes and measurements. This random group is combined with two other groups of simulated signals. Thus, in each data set, two-thirds of signals are simulated signals and one-third of signals are random signals. Then, we randomly shuffle the locations of all expression signals for each data set. We use four different values of *cutoff* (=0.9, 0.8, 0.7, 0.6) for these four data sets. Figure 1 shows all three groups of sorted expression signals without any addition of noises, and Figure 2 displays the change of the simulated signals of first group (C = 3.0) as the *noise level* increases from 0.1 to 0.4.

As an initial step, the algorithm creates a set of clusters of which the size is equal to the number of signals in the input data set. In this case, the algorithm creates 300 initial clusters, of which all sizes are equal to one. After the final step of the algorithm, the size of each cluster will be different depending on the values of *cutoff* and *noise level* *ε*. For each non-empty cluster, the signals from the group with simulated signals are counted and labeled as true positive (TP) for each group, and the signals from the group with random signals are also counted and labeled as false positive (FP).

In Figure 3, we explore the relationship between the size of clusters and the number of TP from the two groups of simulated signals with four different values of *cutoff* and *noise level* *ε*, and it is shown that all of them have linear relationship. It is also shown that the 1^{st} and 2^{nd} largest clusters contain the most TP signals among the output clusters. Hence, only these two largest output clusters are used as the output of the algorithm, and we explore the effect of the *cutoff* and *noise level* on the performance of the algorithm with these two clusters. We then systemically compare the sensitivity (percentage of correctly identified from input expression signals) and precision (percentage of TP expression signals among the output expression signals) for different *cutoff* and *noise level* s *ε* (Figure 4, 5, 6, 7) See Table 7 for the index of each figure. To test the variability of the results, we run the algorithm 20 times for each value of *cutoff* and *noise level* *ε*. Note that the simulated expression signals are different for each run due to random generations of initial phase signals and random noise addition, and the results are also expected to have certain degree of variability.

It is shown that the more noises are included in the data set, the less the sensitivity is obtained by the method (Figure 4, 5). On the other hand, the overall precision is almost constant (i.e. = 100%) as the *noise level* *ε* increases (Figure 6, 7), i.e. the almost 100% of the output signals are TP signals. It is shown that the sensitivity are approximately 82 – 96% with *cutoff* = 0.7 for all *noise level* s. If we assume that the *noise level* *ε* is ≤0.4, the *cutoff* values to obtain the sensitivity ≥82% for both groups should be 0.7. Based on this experiment, we conclude that the *cutoff* value ≥ 0.7 should be used for the analysis of yeast expression data to evaluate the PSC method, provided that the *noise level* in yeast data is ≤0.4. This could be reasonable assumption, because it is believed that the *noise level* = 0.4 is relatively large.

### Case study 2: *α* factor-synchronized cell cycle gene expression data analysis

We evaluate the PSC algorithm with expression data from the study of Spellman *et al*. [6], in which they monitor genome-wide mRNA levels by using four different cell cycle synchronization techniques. We evaluate the PSC algorithm with the data sets by three yeast experiments (Alpha, Cdc15, and Cdc28), in which mRNA levels of 6,178 yeast ORFs are measured simultaneously over approximately two cell cycle periods. A fourth data set using elutriation-based experiment by Spellman *et al*. [6] is not used, since it only covers a single cell cycle and because most published methods were not applied to this data set. There are many missing values in Spellman *et al*. [6]'s data set – only 605 genes have no missing values in all three data sets. These missing values can lead to problems in the data analysis, because they obviously interfere with computation of any statistical test or clustering. The one of default ways to handle these missing values is to exclude all data points that have missing data in at least one of the selected genes. However, if missing data points are randomly distributed across the arrays, there could be very few "valid" data points left to be analyzed in the data sets, because of the abundance of missing values in our chosen data sets. Therefore, we replace the missing values for each gene by the mean expression levels of its gene. We perform mean imputation in the gene expression levels for the *K* = 4,201 genes, which have no more than one missing value in each expression data set. Then the expression profile is normalized to the standardized variable. Let's say we have an expression profile *z(t)*, *t* = 1, 2, ..., *n*. If the expression profile has mean *μ* and variance *σ*^{2}, then the corresponding normalized expression value

has the mean 0 and variance 1.

It is clear that each expression data set contains artifacts, which would not occur in freely growing cells, due to the treatment of the cells for the cell cycle synchronization. For example, in Cdc15 and Cdc28 experiments, cells are released from arrest by an abrupt drop in temperature, which likely results in changes in expression of e.g. heat shock genes. To avoid the artifacts due to each cell cycle synchronization treatment, three expression data sets should be combined to uncover the "correct" sets of genes using the PSC algorithm as follows. The PSC algorithm mainly consists of the estimation of the strength between a system and the cluster, *ρ*_{
aC
}in Eq. (9). For the estimation of *ρ*_{
aC
}, the algorithm requires the strength values of the bivariate synchronization between all systems within the cluster of interest to us, *ρ*_{a,b}in Eq. (4). As a step for combining three data sets, we calculate the average *ρ*_{a,b}values from the three data sets as follows

where *ρ*_{a,b}^{α}is the values of bivariate synchronization by alpha-factor data set, *ρ*_{a,b}^{cdc 15}by cdc15 data set, and *ρ*_{a,b}^{cdc 28}by cdc28 data set. It is noteworthy that this step could also reduce the noises in expression data due to the missing values.

The main purpose of the PSC algorithm is to find groups of genes according to the specific biological process using the cell-cycle-specific expression data. The previous *in silico* experiment provide the effectiveness of the PSC algorithm to reach this purpose. Based on the theory of multivariate phase synchronization [17], it is assumed that each gene from the same output cluster is closely associated by having relevant biological interactions and/or sharing significant Gene Ontology (GO) terms. That is, each cluster is related with a certain specific biological process. To evaluate the PSC algorithm, all of output clusters with different *cutoff* = 0.9, 0.8, 0.7, 0.6, 0.5 are analyzed with the *GO Term Finder*. It is a tool for searching significant GO terms, or parent GO terms, used to annotate genes in a given cluster and is available from *Saccharomyces* Genome Database [24]. The significant GO terms for each output clusters depending on *cutoff* are presented as Additional files 1, 2, 3, 4, 5 and the summary of the GO analysis is also provided in Table 1. Note that the *p-value* cutoff for significant GO terms is set to 0.05. It is found that significant number of clusters from the output have significant GO terms. It should be noted that genes within those significant clusters are evidently associated with certain specific biological processes depending on the GO terms of their own.

As another evaluation experiment, experimentally identified physical and genetic interactions between genes are mined from *BioGRID* database for each cluster from the output. Note that the *BioGRID* database is a freely accessible database of physical and genetic interactions available at [25]. The numbers of known interactions are presented for clusters only with significant GO terms in Table 2, 3, 4, 5, 6. Note that the clusters are sorted according the size of clusters for each output. It is noticed that the relatively large clusters tend to have many known biological interactions. It means that the genes within relatively large clusters are evidently interacting each other during cell division cycle by participating in the certain specific biological process depending on the significant GO terms of their own. Therefore, it suffices to say that the relatively large clusters are prominent clusters among the output clusters.

The traditional clustering algorithms focus on relationships based on similar expression profiles, identifying cluster of genes whose expression signals simultaneously rise or fall with an assumption that genes with similar expression profiles have similar biological functions. For example, Spellman *et al*. [6] identify a large number of genes (~800) as giving a cell-cycle-specific patterns of gene expression by fitting the expression profile of given gene to a sine wave, which is used as a surrogate pattern of ideal cyclicity. Then, they use the hierarchical clustering algorithm to linearly correlate the expression profile for a given gene with the expression profile of other genes, which are considered to be confirmed as certain cell-cycle-regulated genes. To this end, they cluster genes into five cell cycle phases (G1, S, S/G2, G2/M, and M/G1). On the other hand, the PSC algorithm use the theory of multivariate phase synchronization, in which the mean phase coherence in Eq. 4 are used to find closely related genes that have relevant biological interactions and/or sharing significant GO terms. Here, the PSC algorithm deal with a special case of random variable that is defined on a circular scale, such that values whose difference is an integral multiple of a certain period (i.e. 2*π*) are regarded the same, and all values are wrapped into a single period. Note that the phase difference between expression profiles (or the phase of a expression profile) is an example of circular random variables *φ*_{
i
}(i = 1, 2,...). It is noteworthy that standard (or linear) statistical measures and moments like mean and variance are not applicable, because they yield different values if the period is added to or subtracted from some values, though the physical meaning of these changed values is the same. Based on the theory of phase synchronization, it is assumed that expression signals from two genes could be synchronized if these two genes are biologically interacting with each other. That is, two biologically interacting genes produce oscillating expression signals with a common rhythm. This phenomenon is explained in terms of coincidence of frequencies defined as "phase locking" [12]. With this theory, it is possible to measure the coupling strength between genes, which describes how strong the interaction is between genes.

To compare the capabilities of the PSC algorithm over the traditional clustering algorithm, we investigate whether genes from the output clusters are linearly correlated with each other as follows. Let's suppose that we have *n* number of genes in one of output clusters. Then, there are *(n*^{2}*-n)/2* number of all possible pairs of genes in the cluster. For each pair of genes, the linear correlation coefficients can be calculated for three expression data sets (i.e. alpha, cdc15, and cdc28), and the mean value of these three linear correlation coefficients is used as the "true" linear correlation coefficient for the given pair of genes. Note that the average values are used because of artifacts due to different cell cycle arrest treatments. Then, the mean linear correlation coefficient of all possible pairs can be obtained for each cluster, and the distribution of mean linear correlation coefficients for each output cluster with *cutoff* = 0.9, 0.8, 0.7, 0.6, 0.5 are presented in Figure 8. It is observed that the overall mean linear correlation coefficients are relatively low, and some of them are significantly low enough to be considered that their clusters are randomly created based on the linear correlation. That is, there is a significant portion of output clusters that are not obtainable by the traditional clustering algorithm. As an example, let's consider the 1^{st} cluster from the output clusters with *cutoff* = 0.7, which consists of genes associated with DNA metabolism process. From all possible pairs of genes in this cluster, we present two types of pairs: 1) similar expression profiles in Figure 9a and 2) time-shifted expression profiles in Figure 9b. It should be noted that all of these presented pairs in Figure 9 are identified as having known biological interactions between genes from the *BioGRID* database [25]. It is obvious that the time-shifted expression profiles are not obtainable by traditional clustering algorithm, because these profiles have significantly low linear correlation coefficients (<0.5). It is noteworthy that these time-shifted profiles are "similar" expression profiles that are constantly time-shifted from each other, and each gene is identified as having peak levels during different cell cycle phases according to the classification by Spellman *et al*. [6]. That is, these profiles are oscillating expression signals with a common rhythm during cell division process, which can be obtainable by PSC algorithm based on the theory of multivariate phase synchronization. As an another example, let's consider a relatively smaller cluster (41^{st} cluster from the output cluster with *cutoff* = 0.7) than 1^{st} cluster at this point. The 41^{st} cluster consisted of 11 genes, which are associated with translation process resulting in the formation of proteins. There are 23 known biological interactions identified from the *BioGRID* database [25], and these interactions are presented in Figure 10 (also see Table 9). It should be noted that none of genes in this cluster are identified as cell cycle regulated by Spellman *et al*. [6], and this example also includes expression profiles with significantly low linear correlation coefficient. This is another evidence that PSC algorithm is able to find prominent groups of genes, which are not obtainable by traditional clustering algorithm. It means that the PSC algorithm is able to find prominent groups of non-cell-cycle-regulated genes, which share significant GO terms and/or have relatively many known biological interaction from the *BioGRID* database [25]. There are more examples of such output clusters that have relative many known interactions and small (or zero) number of identified as cell-cycle-regulated genes by Spellman *et al*. [6]: e.g. 1) 3^{rd}, 9^{th}, 18^{th}, 41^{st} clusters with *cutoff* = 0.7 (Table 4). 2) 5^{th}, 19^{th}, 37^{th}, 50^{th}, 169^{th} clusters with *cutoff* = 0.6 (Table 5) 3) 1^{st}, 4^{th}, 6^{th}, 7^{th}, 11^{th}, 24^{th} clusters with cutoff = 0.5 (Table 6). It should be reminded that these clusters have relatively low linear correlation. Therefore, it suffices to say that the PSC algorithm has capabilities to find many prominent groups of genes that can not be obtained by the traditional clustering algorithms.

For clarification of the significant of *cutoff* value, the *P-value* from the distribution of strength of phase synchronization between each oscillator (gene) and the cluster is calculated. It is reasonable to assume that the complete understanding of cellular process during cell division cycle in whole genome scale is not available yet. However, it is well known that the cell division process is a single continuous process. The cell division process is mainly consisted of *interphase* (G1, S, and G2 phase) and *division* (M phase). During the S phase, the DNA in the nucleus is replicated, and the M phase includes two separate processes, i.e. mitosis and cytokinesis. The G1 phase is an interval phase between the end of M phase and the beginning of DNA synthesis, and the G2 phase is an interval phase separating the end of DNA synthesis from the beginning of M phase. It is assumed that genes during a certain cell cycle have relatively fewer interactions with genes during the other cell cycle phase. Therefore, the cell division process is a single continuous process and each process is "weakly" connected with other process in the downstream of cell cycle process. Based on this point of view, it is assumed that the whole cell division process is consisted of genes that create a single interacting network with heterogeneous connectivity distribution; thus, whole genomes are considered to estimate the *P-value*. In order to estimate a *P-value* for a given strength of phase synchronization between each gene and the cluster *ρ*_{
kC
}, a set of random expression signals is generated by shuffling the expression signals at different time points by interchanging the expression signal at time points 3 and 14. Using Eq. 9, the strength values of phase synchronization between each gene and the cluster *ρ*_{
kC
}are calculated and tabulated their distribution with combined expression set of alpha-factor, cdc15 and cdc28 (Figure 11). This distribution is an approximation of true negatives for input expression signals. By integration, we could estimate a *P-value*, which is defined as the probability of obtaining a *ρ*_{
kC
}larger than the *cutoff* from the random distribution: the smaller the *P-value*, the more significant the strength value *ρ*_{
kC
}and vice versa.

For further understanding of the significance of *cutoff* values, we examine two biological processes (i.e. *mitotic cell cycle* and *protein amino acid glycosylation*) in three output clusters: 11^{th} cluster (*cutoff* = 0.7), 1^{st} cluster (*cutoff* = 0.6), and 2^{nd} cluster (*cutoff* = 0.5), and the known physical or genetic interactions from *BioGRID* database [25] are visualized for the selected clusters in Figure 12. It is reasonable to assume that genes associated with *mitotic cell cycle* are more "tightly" interacting with each other than the ones associated with *protein amino acid glycosylation*. Figure 12 shows that genes with *protein amino acid glycosylation* have relatively fewer known interactions than the ones with *mitotic cell cycle*. Therefore, it suffices to say that genes with *protein amino acid glycosylation* are "weakly" connected to the genes with *mitotic cell cycle*, and the genes with *protein amino acid glycosylation* are combined with genes with *mitotic cell cycle* as *cutoff* decreases. It means that the size of cluster increases as *cutoff* decreases, and the PSC algorithm creates relatively small clusters with significantly prominent genes with relatively larger *cutoff* value and these clusters grows in size by combining other "weakly" interacting genes as *cutoff* decreases. Therefore, it can be concluded that the larger the *cutoff*, the more portions of prominent genes in the output clusters and vices versa.

## Conclusion

PSC algorithm is mainly based on the theory of multivariate phase synchronization, and the phase synchronization could be understood as a common rhythm of oscillatory activities of systems due to their interactions with each other. We develop the strategy of identifying and categorizing cell cycle specific gene expressions according to the specific biological process, in which expression signals share a common rhythm during cell cycle. That is, PSC algorithm is efficient to find groups of genes that share same periodic variations of expression profiles, which is coincident with the length of the cell cycle. On the other hand, the traditional clustering algorithms search similar expression profiles with an assumption that genes with similar expression profiles have similar biological functions. Our evaluation analysis clearly indicates that PSC algorithm produces prominent clusters, which are not obtainable by traditional clustering algorithms.

Our evaluation analysis also shows that the PSC algorithm is able to find groups of gene, which are significantly associated with each other by sharing significant GO terms of biological process and/or relevant biological interactions. However, the algorithm does not have a capability to create a directed and weighted network of synchronization. Recently, Motter *et al*. [26] showed that the maximum synchronizability can be achieved when the network of synchronization is weighted and directed for a given degree distribution of heterogeneous connectivity. Therefore, the study for the analysis of cell cycle specific genome expression data could be further advanced by considering the directed and weighted network structure and addressing the effect that asymmetry has on the synchronizability of complex networks.

Based on the evaluation experiments, we draw the conclusion as follows: 1) Based on the theory of multivariate phase synchronization, it is feasible to find groups of genes, which have biological interactions and/or significantly shared GO slim terms of biological process, with cell cycle specific gene expression signals. 2) Among all the output clusters by PSC algorithm, the cluster with relatively larger size has a tendency to include more known interactions than the one with relatively smaller size. 3) It is feasible to understand the cell cycle specific gene expression patterns as the phenomenon of collective synchronization. 4) PSC algorithm is able to find prominent groups of genes, which are not obtainable by traditional clustering algorithms.

## Methods

### 1) Fundamental mathematical concept: multivariate synchronization

The proposed algorithm builds on the concepts of analytic signal and phase synchronization. Hence, we first explain the basic idea of analytic signal and phase synchronization [12, 29]. Then we continue to describe the basic idea of synchronization in ensembles of oscillating systems. By "oscillating systems", we mean systems that produce the response signals with period and frequency. As a first step, we convert the gene expression signal *x(t)* into analytic signal *x*_{
a
}*(t)* using Hilbert transform (HT). The analytic signal of gene expression signal x(t) is defined by*x*_{
a
}*(t)* = *x(t)* + *jx*_{
h
}(*t*)

where *j* is the imaginary unit and *x*_{
h
}*(t)* is the HT of *x(t)*

From this equation, it is noticed that the HT of *x(t)* may be considered as the convolution of the *x(t)* and 1/*πt*. Due to the properties of convolution, the Fourier transform (FT) *X*_{
h
}*(ς)* of *x*_{
h
}*(t)* is the product of the FT of *x(t)* and 1/*πt*. For physically relevant Fourier frequencies *ς* > 0, *X*_{
h
}*(ς)* = -*jX(ς)*. In other words, the HT can be considered by an ideal filter whose amplitude response is unity and phase response is a constant *π*/2 lag at all Fourier frequencies. The analytic signal can also be expressed in terms of complex polar coordinates*x*_{
a
}(*t*) = *A*(*t*)exp(*jφ*(*t*)),

where A\left(t\right)=\left|{x}_{a}\left(t\right)\right|=\sqrt{{x}^{2}\left(t\right)+{x}_{h}^{2}\left(t\right)} and *φ*(*t*) = arg{*x*_{
a
}(*t*)}. These two functions are respectively called the amplitude and instantaneous phase of the signal *x(t)*. The basic idea of the analytic signal is that the negative frequency components of the FT (or spectrum) of *x(t)* s are superfluous, due to the hermitian symmetry of such a spectrum. These can be removed without any loss of information, if an analytic signal is used instead. But note that the removal of the negative frequencies will eliminate such spectral symmetry; the inverse FT of such a one-sided spectrum will give back a complex analytic signal.

In this study, we use the phase of the analytic signal *x*_{
a
}*(t)* to detect phase synchronization between oscillating systems; i.e. the phase synchronization can be defined as locking of the phases, while the amplitudes can be quite different. Using the methods of analytic signal, it can be shown that the interaction of nonidentical oscillators can lead to a perfect locking of their phases, whereas their amplitudes remain uncorrelated [12]. The strength of phase synchronization between two signals can be measured using the mean phase coherence [20] as follows

The values of *ρ*_{a,b}are confined between 0 (no synchrony) and 1 (perfect synchrony) and this value monotonically increases with the strength of phase synchronization [18].

For multivariate oscillating systems, we use the concept of synchronization in ensembles of oscillators, in which each component interacts with all others. This can be described as *global* coupling. Let's consider an ensemble of non-identical oscillators to understand the process of collective synchronization. From the previous section, it is understood that a pair of systems can be synchronized, and it is expected that synchronization can be extended to a whole population of systems, or at least to a large portion of it. Pikovsky *et al*. [11] explained how such coupling results in synchronization in the ensemble with a drawing as shown in Figure 13. With this figure, they have described the driving inputs that come to each system from all others by one input from the whole ensemble. It means that a common force operates each system and this force is proportional to the sum of outputs of all systems in the ensemble. This force can entrain many oscillating systems if their frequencies are close. In the case of global coupling this force is not predetermined, but comes from interaction within the ensemble. To explain qualitatively this force, we consider the study of Allefeld and Kurths [17]. They described the basic idea of multivariate synchronization analysis to understand the oscillating systems as constituting a cluster, in which each component system participates in different degree. The cluster consists of a common rhythm and it is described by the dynamics of a cluster phase Φ. For each measurement, the phase of the cluster is defined as a circular weighted mean of all phases inside the cluster,

where the participation index *c*_{
m
}can be obtained as a function of the synchronization strength between a system and the cluster,

This participation index *c*_{
m
}measure both how close each system inside the cluster follow the common rhythm Φ as well as how much a system contribute to the cluster. In this definition, it is not clear which function *f* should be chosen for the relationship between the *ρ*_{m,C}and the *c*_{
m
}. They modified the idea of synchronization cluster analysis in a way that is much more generally applicable. For the strength values of the bivariate synchronization *ρ*_{a,b}can be defined as

where Δ*φ*_{
a
}= *φ*_{
a
}- Φ and Δ*φ*_{
b
}= *φ*_{
b
}- Φ. They introduce the phase of the mean field to factorize the strength of phase synchronization as follow,

where *a* ≠ *b* (*ρ*_{
aa
}= 1). This factorization makes possible to estimate the strength of the synchronization between a system and the cluster, *ρ*_{
aC
}, thus this leads to the derivation of an iteration equation for estimating *ρ*_{
kC
}as follow

with *F*_{
ik
}= 1/(1 - *ρ*_{
iC
}^{2}*ρ*_{
ik
}^{2})^{2} for *i* ≠ *k* and *F*_{
ik
}= 0 for *i* = *k*.

### 2) Phase synchronization clustering algorithm

Based on the concept of synchronization in ensembles of oscillating systems, we propose the strategy to make clusters of genes based on the theory of multivariate synchronization. There are 5 steps in this procedure. The descriptions for each step are listed as follow. Inputs to this method are the time series of expression data set and *cutoff* value for synchronization strength.

#### Step 1. Obtaining the phase vector *φ*_{
i
}

Let's say there are signals *x*_{
i
}*(t)* of the *K* systems *i* = 1, 2, ..., *K* with n number of observations *t* = 1, 2, ..., *n* of the stochastic process. In this step, the analytical signal can be approximated using Fast Fourier transform [27]. The output of this step is phase vector *φ*_{
i
}, which is defined as *φ*_{
i
}= {*φ*_{
i
}*(1)*, *φ*_{
i
}*(2)*, ..., *φ*_{
i
}*(n)*}, for 1 ≤ *i* ≤ *K*.

#### Step 2. Initialization of cluster array

First, an array for *K* number of clusters is produced. For each cluster *cluster(i)*, a phase vector *φ*_{
i
}is stored for 1 ≤ *i* ≤ *K*. The output of this step is cluster array *cluster(i)*, for 1 ≤ *i* ≤ *K*. The pseudo algorithm of this step is presented in APPENDIX A.

#### Step 3. Initial clustering

For each phase vector, this step finds how closely the phase vector follows the common rhythm for each cluster from the array. This can be measured by the synchronization strength between the phase vector and the cluster. Then the algorithm finds the cluster in which the phase vector has the highest value of the synchronization strength. If the synchronization strength between the phase vector and the selected cluster is greater or equals to the pre-defined *cutoff* value, this cluster is updated by appending the phase vector to the selected cluster. This procedure is repeated for the entire phase vectors. The output of this step is the updated cluster array. The pseudo algorithm of this step is presented in APPENDIX B.

#### Step 4. Filtering cluster

If the cluster contains no more than a system, this does not constitute as a cluster. Thus, the cluster is set to empty list. The pseudo algorithm of this step is presented in APPENDIX C.

#### Step 5. Combining clusters

Empty clusters are not considered in this step. For each non-empty cluster, the algorithm finds a cluster from the array such that these two clusters will have the most common rhythm when they are combined. If all of the synchronization strength between the combined cluster and each element are greater or equals to the *cutoff* value, these two clusters are combined. The pseudo algorithm is presented in APPENDIX D.

## Appendix A: The pseudo algorithm for *initialization of cluster array*

Input: phase vectors, *φ*_{
i
}for 1 ≤ i ≤ K.

Output: cluster array, *cluster(i)* for 1 ≤ i ≤ K.

for 1 ≤ i ≤ K

*cluster(i)* = {*φ*_{i}}

end

## Appendix B: The pseudo algorithm for *initial clustering*

Input: *cutoff* and cluster array, *cluster(i)* for 1 ≤ i ≤ K.

Output: cluster array, *cluster(i)* for 1 ≤ i ≤ K.

for 1 ≤ i ≤ K

Initialize *[SynStrength]*^{(j)}with 0, for 1 ≤ j ≤ K

for 1 ≤ j ≤ K, i ≠ j

*temp_list* = {*cluster(j)*, *φ*_{
i
}}

n = the size of *temp_list*

Compute *ρ*_{mC} using *φ*_{
m
}from *temp_list*, for 1 ≤ m ≤ n

*[SynStrength]*^{(j)}= *ρ*_{nC}

end

Find *max_SynStrength* = max{*[SynStrength]*^{(j)}, 1 ≤ j ≤ K, i ≠ j}

If *max_SynStrength* ≥ *cutoff*

*cluster(j)* = {*cluster(j)*, *φ*_{
i
}}

end

end

## Appendix C: The pseudo algorithm for *filtering cluster*

Input: cluster array, *cluster(i)* for 1 ≤ i ≤ K.

Output: cluster array, *cluster(i)* for 1 ≤ i ≤ K.

for 1 ≤ i ≤ K

if the size of *cluster(i)* equals to 1

*cluster(i)* = {}

end

end

## Appendix D: The pseudo algorithm for *combining clusters*

Input: *cutoff* and cluster arrays, *cluster(i)* for 1 ≤ i ≤ K.

Output: cluster array, *cluster(i)* for 1 ≤ i ≤ K.

for 1 ≤ i ≤ K

Initialize *[SynStrength]*^{(j)}with 0, for 1 ≤ j ≤ K

for 1 ≤ j ≤ K, i ≠ j

if *cluster(i)* and *cluster(j)* are not empty lists

*temp_list* = {*cluster(j)*, *cluster(i)*}

n = the size of *temp_list*

Compute *ρ*_{mC} using *φ*_{
m
}from *temp_list*, for 1 ≤ m ≤ n

*[SynStrength]*^{(j)}= min{*ρ*_{mC}, 1 ≤ m ≤ n}

end

end

Find *max_SynStrength* = max{*[SynStrength]*^{(j)}, 1 ≤ j ≤ K, i ≠ j}

If *max_SynStrength* ≥ *cutoff*

*cluster(j)* = {*cluster(j)*, *cluster(i)*}

*cluster(i)* = {}

end

end

## References

Hereford LM, Osley MA, Ludwig TRD, McLaughlin CS: Cell-cycle regulation of yeast histone mRNA.

*Cell*1981, 24: 367–375. 10.1016/0092-8674(81)90326-3Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW: A genome-wide transcriptional analysis of the mitotic cell cycle.

*Mol Cell*1998, 2: 65–73. 10.1016/S1097-2765(00)80114-8Laub MT, McAdams HH, Feldblyum T, Fraser CM, Shapiro L: Global analysis of the genetic network controlling a bacterial cell cycle.

*Science*2000, 290: 2144–2148. 10.1126/science.290.5499.2144Menges M, Murray JA: Synchronous Arabidopsis suspension cultures for analysis of cell-cycle gene activity.

*Plant J*2002, 30: 203–212. 10.1046/j.1365-313X.2002.01274.xMenges M, Hennig L, Gruissem W, Murray JA: Cell cycle-regulated gene expression in Arabidopsis.

*J Biol Chem*2002, 277: 41987–42002. 10.1074/jbc.M207570200Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.

*Mol Biol Cell*1998, 9: 3273–3297.van der Meijden CM, Lapointe DS, Luong MX, Peric-Hupkes D, Cho B, Stein JL, van Wijnen AJ, Stein GS: Gene profiling of cell cycle progression through S-phase reveals sequential expression of genes required for DNA replication and nucleosome assembly.

*Cancer Res*2002, 62: 3233–3243.Whitfield ML, Sherlock G, Sadanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of gene periodically expressed in the human cell cycle and their expression in tumors.

*Mol Biol Cell*2002, 13: 1977–2000. 10.1091/mbc.02-02-0030.Cooper S, Shedden K: Microarray analysis of gene expression during the cell cycle.

*Cell & Chromosome*2003, 2: 1. 10.1186/1475-9268-2-1Glass L, Mackay MC:

*From clocks to chaos.*New Jersey: Princeton University Press; 1998.Pikovsky A, Rosenblum M, Kurths J:

*Synchronization: A universal concept in nonlinear sciences.*Cambridge: Cambridge University Press; 2001.Rosenblum MG, Pikovsky AS, Kurths J: Phase Synchronization of Chaotic Oscillators.

*Phys Rev Lett*1996, 76: 1804–1807. 10.1103/PhysRevLett.76.1804Anishchenko VS, Balanov AG, Janson NB, Igosheva NB, Bordyugov GV: Entrainment between heart rate and weak noninvasive forcing.

*Int J Bifurcation and Chaos*2000, 10: 2339–2348.Schäfer C, Rosenblum MG, Kurths J, Abel HH: Heartbeat synchronized with ventilation.

*Nature*1998, 392: 239–240. 10.1038/32567Stefanovska A, Haken H, McClintock PVE, Hozic M, Bajrovic F, Ribaric S: Reversible transitions between synchronization states of the cardiorespiratory system.

*Phys Rev Lett*2000, 85: 4831–4834. 10.1103/PhysRevLett.85.4831Tass P, Rosenblum MG, Weule J, Kurths J, Pikovsky A, Volkmann J, Schnitzler A, Freund HJ: Detection of n:m phase locking from noisy data: Application to magnetoencephalography.

*Phys Rev Lett*1998, 81: 3291–3294. 10.1103/PhysRevLett.81.3291Allefeld C, Kurths J: An approach to multivariate phase synchronization analysis and its application to event-related potentials.

*Int J Bifurcation and Chaos*2004, 14: 417–426. 10.1142/S0218127404009521Bhattacharya J: Reduced degree of long-range phase synchrony in pathological human brain.

*Acta Neurobiol Exp*2001, 61: 309–318.Jerger KK, Netoff TI, Francis JT, Sauer T, Pecora L, Weinstein SL, Schiff SJ: Early seizure detection.

*J Clin Neurophysiol*2001, 18: 259–268. 10.1097/00004691-200105000-00005Mormann F, Lehnertz K, David P, Elger CE: Mean phase coherence as a measure for phase synchronization and its application to EEG of epilepsy patients.

*Physica D*2000, 144: 358–369. 10.1016/S0167-2789(00)00087-7Blasius B, Huppert A, Stone L: Complex dynamics and phase synchronization in spatially extended ecological systems.

*Nature*1999, 399: 354–359. 10.1038/20676Lunkeit F: Synchronization experiments with an atmospheric global circulation model.

*Chaos*2001, 11: 47–51. 10.1063/1.1338127Strogatz SH: From Kuramoto to Crawford: exploring the onset of synchronization in populations of coupled oscillators.

*Physica D*2000, 143: 1–20. 10.1016/S0167-2789(00)00094-4Saccharomyces Genome Database[http://www.yeastgenome.org]

BioGRID: General Repository for Interaction Datasets[http://www.thebiogrid.org]

Motter AE, Zhou C, Kurths J: Network synchronization, diffusion, and the paradox of heterogeneity.

*Physical Review E*2005, 71: 016116. 10.1103/PhysRevE.71.016116Marple SL Jr: Computing the discrete-time "analytic" signal via FFT.

*IEEE trans Signal processing*1999, 47: 2600–2603. 10.1109/78.782222

## Acknowledgements

The authors thank Professor Choung Ik Kim from Kangwon National University for helpful discussion.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Authors' contributions

CSK conceived this study, developed the proposed method, and performed the evaluation experiments. CSK, CSB, and HJT participated in the interpretation of the evaluation experiments. CSK, CSB and HJT drafted and finalized the manuscript. All authors approved the manuscript.

## Electronic supplementary material

### 12859_2007_2041_MOESM1_ESM.txt

Additional file 1: Tables of significant GO terms of biological process for the output clusters with cutoff = 0.9. This file includes 8 clusters from the PSC algorithm with cutoff = 0.9, which have significantly GO terms of biological process. These significant GO terms are obtained by a tool called *GO Term Finder*, which is available from *Saccharomyces* Genome Database [24]. (TXT 7 KB)

### 12859_2007_2041_MOESM2_ESM.txt

Additional file 2: Tables of significant GO terms of biological process for the output clusters with cutoff = 0.8. This file includes 57 clusters from the PSC algorithm with cutoff = 0.8, which have significantly GO terms of biological process. These significant GO terms are obtained by a tool called *GO Term Finder*, which is available from *Saccharomyces* Genome Database [24]. (TXT 42 KB)

### 12859_2007_2041_MOESM3_ESM.txt

Additional file 3: Tables of significant GO terms of biological process for the output clusters with cutoff = 0.7. This file includes 148 clusters from the PSC algorithm with cutoff = 0.7, which have significantly GO terms of biological process. These significant GO terms are obtained by a tool called *GO Term Finder*, which is available from *Saccharomyces* Genome Database [24]. (TXT 110 KB)

### 12859_2007_2041_MOESM4_ESM.txt

Additional file 4: Tables of significant GO terms of biological process for the output clusters with cutoff = 0.6. This file includes 151 clusters from the PSC algorithm with cutoff = 0.8, which have significantly GO terms of biological process. These significant GO terms are obtained by a tool called *GO Term Finder*, which is available from *Saccharomyces* Genome Database [24]. (TXT 124 KB)

### 12859_2007_2041_MOESM5_ESM.txt

Additional file 5: Tables of significant GO terms of biological process for the output clusters with cutoff = 0.5. This file includes 21 clusters from the PSC algorithm with cutoff = 0.8, which have significantly GO terms of biological process. These significant GO terms are obtained by a tool called *GO Term Finder*, which is available from *Saccharomyces* Genome Database [24]. (TXT 50 KB)

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Kim, C.S., Bae, C.S. & Tcha, H.J. A phase synchronization clustering algorithm for identifying interesting groups of genes from cell cycle expression data.
*BMC Bioinformatics* **9**, 56 (2008). https://doi.org/10.1186/1471-2105-9-56

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/1471-2105-9-56