- Research article
- Open Access
A phase synchronization clustering algorithm for identifying interesting groups of genes from cell cycle expression data
- Chang Sik Kim^{1},
- Cheol Soo Bae^{2} and
- Hong Joon Tcha^{3}
https://doi.org/10.1186/1471-2105-9-56
© Kim et al; licensee BioMed Central Ltd. 2008
- Received: 26 November 2007
- Accepted: 28 January 2008
- Published: 28 January 2008
Abstract
Background
Previous studies of genome-wide expression patterns show that a certain percentage of genes are cell cycle regulated. The expression data have been analyzed in a number of different ways to identify cell cycle dependent genes. In this study, we hypothesize that cell cycle dependent genes can be treated as oscillating systems with a rhythm, i.e. systems producing response signals with a period and frequency. We are therefore motivated to apply the theory of multivariate phase synchronization to cluster cell cycle specific genome-wide expression data.
Results
We propose a strategy for finding groups of genes that take part in a specific biological process by analyzing cell cycle specific gene expression data. To evaluate the proposed method, we use a modified Kuramoto model, a phase-governing equation that describes the long-term dynamics of globally coupled oscillators. With this equation, we simulate two groups of expression signals such that the signals in each group share their own common rhythm. The simulated expression data are then mixed with randomly generated expression data and used as the input data set for the algorithm. Using these simulated data, we show that the algorithm is able to identify expression signals that are involved in the same oscillating process. We also evaluate the method with yeast cell cycle expression data. The output clusters of the proposed algorithm contain genes that are closely associated with each other, in that they share significant Gene Ontology terms of biological process and/or have relatively many known biological interactions. The evaluation therefore indicates that the method is able to group expression signals according to specific biological processes. It also indicates that part of the output of the proposed algorithm cannot be obtained by traditional clustering algorithms based on Euclidean distance or linear correlation.
Conclusion
Based on the evaluation experiments, we draw the following conclusions: 1) Based on the theory of multivariate phase synchronization, it is feasible to find groups of genes that have relevant biological interactions and/or significantly shared GO slim terms of biological process, using cell cycle specific gene expression signals. 2) Among the output clusters of the proposed algorithm, larger clusters tend to include more known interactions than smaller ones. 3) It is feasible to understand cell cycle specific gene expression patterns as a phenomenon of collective synchronization. 4) The proposed algorithm is able to find prominent groups of genes that are not obtainable by traditional clustering algorithms.
Keywords
- Gene Ontology
- Phase Synchronization
- Phase Vector
- Mitotic Cell Cycle
- Specific Biological Process
Background
Since Hereford et al. [1] first discovered that yeast histone mRNAs oscillate during the cell division cycle, several experimental studies have found that many genes are expressed in a cell-cycle-specific manner. These findings have motivated the study of the global extent of cell-cycle-specific gene expression. To this end, a number of studies have used DNA microarrays to characterize whole-genome expression patterns during the cell division cycle [2–8]. A particular example is flagellar biogenesis in Caulobacter, which involves four distinct and dependent waves of transcription. Laub et al. [3] showed that 20% of Caulobacter genes are cell cycle regulated, with expression levels that consistently peak when the genes function. Another example is the study of the yeast Saccharomyces cerevisiae [6], which found that between 10 and 20% of yeast genes are periodically expressed during cell division. It therefore suffices to say that a certain percentage of genes show periodic, oscillatory activity throughout cell division. These cell-cycle-specific oscillatory activities can be explained in terms of efficiency and logical order: the cell only makes an enzyme when it is needed. If the enzymes were made all the time, the cell would be inefficient in an environment devoid of the substrates of those enzymes [9].
In this study, we are motivated to apply the theory of multivariate phase synchronization to cell-cycle-specific gene expression data. Synchronization is one of the most commonly observed phenomena in various fields of science [10, 11]. Generally, synchronization is understood as a complete coincidence of the states of oscillating systems due to their interactions. Rosenblum et al. [12] show that the phase difference of two coupled oscillating systems can remain bounded while their amplitudes are uncorrelated and irregular. There have been numerous applications in different areas such as cardiorespiratory interaction [13–15], brain activity of Parkinsonian patients [16], EEG measurements [17–20], ecology [21], and climate systems [22]. Because this study concerns cellular activity during the cell cycle, the systems of interest are the cell cycle specific genes. Based on the theory of phase synchronization, we hypothesize that the expression signals of two genes could be synchronized if the two genes interact biologically with each other. That is, two biologically interacting genes produce oscillating expression signals with a common rhythm. We therefore propose phase synchronization as a measure for identifying biologically relevant interactions from cell-cycle-specific gene expression data, treating the cell cycle specific genes as oscillating systems that produce gene expression with rhythms (periodicity).
In this study, we apply the theory of multivariate phase synchronization to find groups of cell cycle specific gene expression signals according to specific biological processes, building on the study of Allefeld and Kurths [17]. They present a method for the multivariate analysis of statistical phase synchronization phenomena in empirical data, which is based on the theory of the synchronization cluster. The basic idea of their analysis is to consider the oscillating systems as forming a cluster to which each system contributes to a different degree. The cluster has a common rhythm, which is a mean oscillation of all oscillating systems inside the cluster. Based on their theory, we propose an algorithm named the Phase Synchronization Clustering (PSC) algorithm, which produces clusters of cell cycle specific genes from a genome expression data set; the genes in the same cluster are expected to be involved in the same specific biological process. The PSC algorithm is evaluated with synthetic data and with cell cycle specific expression data of Saccharomyces cerevisiae from the study of Spellman et al. [6], who analyze gene expression levels in yeast cell cultures whose cell cycle has been synchronized by various methods.
Results and discussion
Case study 1: in silico experiments
With given initial random instantaneous phase signals, the expression signals can be simulated and converted into real signals as
$x_{i}(t)=\mathrm{Re}\left[A\exp\left(j\varphi_{i}(t)\right)\right]=A\cos\left(\varphi_{i}(t)\right),$
where A is the instantaneous amplitude, which is set to 1 for all signals. The simulated signals are then perturbed by adding random noise drawn from a Gaussian distribution with mean μ = 0 and standard deviation ε.
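The conversion from phases to noisy real signals can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the phase vectors here are simple linear phases with an assumed angular frequency `omega` and assumed offsets, standing in for the phases that the modified Kuramoto model would provide, and the function name `simulate_signals` is hypothetical.

```python
import math
import random


def simulate_signals(phases, amplitude=1.0, noise_sd=0.0, rng=None):
    """Return x_i(t) = A * cos(phi_i(t)) + Gaussian noise for each phase vector."""
    rng = rng or random.Random()
    return [
        [amplitude * math.cos(phi) + rng.gauss(0.0, noise_sd) for phi in phase_vector]
        for phase_vector in phases
    ]


# Hypothetical input: two oscillators sharing one frequency with different offsets.
n_obs = 18
omega = 2 * math.pi / 9  # assumed angular frequency, not taken from the paper
phases = [[omega * t + offset for t in range(n_obs)] for offset in (0.0, 0.5)]

clean = simulate_signals(phases)                                      # epsilon = 0
noisy = simulate_signals(phases, noise_sd=0.4, rng=random.Random(0))  # epsilon = 0.4
```

Passing a seeded `random.Random` instance keeps the noise reproducible, which makes it easier to compare runs at different noise levels ε.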
As an initial step, the algorithm creates as many clusters as there are signals in the input data set. In this case, the algorithm creates 300 initial clusters, each of size one. After the final step of the algorithm, the size of each cluster differs depending on the cutoff value and the noise level ε. For each non-empty cluster, the signals that come from a group of simulated signals are counted and labeled as true positives (TP) for that group, and the signals that come from the group of random signals are counted and labeled as false positives (FP).
The more noise is included in the data set, the lower the sensitivity obtained by the method (Figures 4, 5). On the other hand, the overall precision stays almost constant at about 100% as the noise level ε increases (Figures 6, 7), i.e. almost 100% of the output signals are TP signals. The sensitivity is approximately 82–96% with cutoff = 0.7 for all noise levels. If we assume that the noise level ε is ≤0.4, a cutoff value of 0.7 is needed to obtain a sensitivity ≥82% for both groups. Based on this experiment, we conclude that a cutoff value ≥ 0.7 should be used for the analysis of yeast expression data to evaluate the PSC method, provided that the noise level in the yeast data is ≤0.4. This could be a reasonable assumption, because a noise level of 0.4 is believed to be relatively large.
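As a small worked example of the two measures, the sketch below uses the standard definitions of sensitivity and precision, which we assume are the ones intended here since the text does not spell out the formulas. The counts are hypothetical and are not results from the paper.

```python
def sensitivity(true_positives, group_size):
    """Fraction of a simulated group's signals that the output cluster recovered."""
    return true_positives / group_size


def precision(true_positives, false_positives):
    """Fraction of the output cluster's signals that belong to the simulated group."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


# Hypothetical counts for one output cluster (not taken from the paper).
tp, fp, group_size = 90, 0, 100
sens = sensitivity(tp, group_size)
prec = precision(tp, fp)
```

With these counts, a cluster that misses ten signals of its group but contains no random signals has lower sensitivity and full precision, which is the pattern described above for increasing noise.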
Case study 2: α factor-synchronized cell cycle gene expression data analysis
has the mean 0 and variance 1.
where ρ_{a,b}^{α} is the value of bivariate synchronization obtained from the alpha-factor data set, ρ_{a,b}^{cdc15} from the cdc15 data set, and ρ_{a,b}^{cdc28} from the cdc28 data set. It is noteworthy that this step could also reduce the noise in the expression data caused by missing values.
The result from the analysis for significant GO terms according to the cutoff value.
The number of known biological interactions mined from BioGRID database [25] for each output cluster with cutoff = 0.9.
ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size |
---|---|---|---|---|---|---|---|---|---|
1 | 4 | 4 | 12 | 12 | 5 | 0 | 0 | 5 | 5 |
2 | 3 | 4 | 7 | 7 | 6 | 0 | 0 | 4 | 4 |
3 | 0 | 0 | 6 | 6 | 9 | 0 | 0 | 4 | 4 |
4 | 5 | 4 | 5 | 5 | 10 | 6 | 4 | 4 | 4 |
The number of known biological interactions mined from BioGRID database [25] for each output cluster with cutoff = 0.8.
ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | 2 | 16 | 16 | 19 | 1 | 1 | 2 | 7 | 47 | 3 | 3 | 0 | 5 | 80 | 0 | 0 | 0 | 4 |
3 | 5 | 5 | 14 | 14 | 22 | 0 | 0 | 7 | 7 | 48 | 1 | 1 | 0 | 5 | 83 | 2 | 3 | 3 | 4 |
4 | 2 | 2 | 11 | 11 | 23 | 0 | 0 | 7 | 7 | 49 | 3 | 3 | 0 | 5 | 86 | 1 | 1 | 0 | 4 |
5 | 1 | 1 | 11 | 11 | 26 | 4 | 4 | 0 | 6 | 61 | 2 | 2 | 0 | 4 | 91 | 0 | 0 | 0 | 4 |
6 | 0 | 0 | 11 | 11 | 27 | 2 | 2 | 0 | 6 | 63 | 0 | 0 | 2 | 4 | 92 | 1 | 1 | 0 | 4 |
7 | 0 | 0 | 9 | 9 | 28 | 4 | 4 | 6 | 6 | 66 | 1 | 2 | 4 | 4 | 95 | 2 | 2 | 4 | 4 |
9 | 6 | 5 | 9 | 9 | 32 | 0 | 0 | 5 | 6 | 67 | 1 | 1 | 4 | 4 | 96 | 0 | 0 | 4 | 4 |
10 | 2 | 3 | 8 | 8 | 33 | 2 | 2 | 6 | 6 | 68 | 1 | 2 | 3 | 4 | 100 | 2 | 2 | 1 | 4 |
11 | 8 | 6 | 8 | 8 | 37 | 7 | 5 | 0 | 5 | 69 | 2 | 2 | 0 | 4 | 103 | 0 | 0 | 0 | 4 |
12 | 1 | 1 | 8 | 8 | 38 | 0 | 0 | 5 | 5 | 70 | 1 | 1 | 0 | 4 | 104 | 0 | 0 | 0 | 4 |
13 | 1 | 2 | 8 | 8 | 39 | 0 | 0 | 5 | 5 | 73 | 1 | 1 | 0 | 4 | 106 | 0 | 0 | 2 | 4 |
14 | 1 | 1 | 8 | 8 | 41 | 1 | 1 | 1 | 5 | 74 | 1 | 1 | 0 | 4 | 109 | 0 | 0 | 1 | 4 |
15 | 1 | 1 | 7 | 7 | 44 | 3 | 3 | 0 | 5 | 75 | 0 | 0 | 4 | 4 | | | | | |
16 | 2 | 3 | 7 | 7 | 45 | 3 | 3 | 0 | 5 | 76 | 1 | 2 | 0 | 4 | | | | | |
18 | 2 | 2 | 6 | 7 | 46 | 0 | 0 | 0 | 5 | 79 | 2 | 2 | 0 | 4 | | | | | |
The number of known biological interactions mined from BioGRID database [25] for each output cluster with cutoff = 0.7.
ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 24 | 19 | 50 | 50 | 51 | 0 | 0 | 2 | 10 | 138 | 2 | 3 | 0 | 7 | 283 | 0 | 0 | 0 | 6 |
2 | 25 | 25 | 44 | 44 | 53 | 3 | 4 | 5 | 10 | 140 | 2 | 2 | 0 | 7 | 284 | 3 | 3 | 0 | 6 |
3 | 14 | 12 | 8 | 31 | 56 | 1 | 1 | 0 | 10 | 149 | 0 | 0 | 0 | 7 | 285 | 1 | 2 | 0 | 6 |
4 | 14 | 15 | 25 | 27 | 57 | 3 | 3 | 4 | 10 | 150 | 2 | 2 | 3 | 7 | 292 | 2 | 2 | 1 | 6 |
5 | 3 | 5 | 21 | 24 | 58 | 0 | 0 | 3 | 10 | 151 | 2 | 2 | 0 | 7 | 297 | 0 | 0 | 0 | 6 |
6 | 3 | 3 | 10 | 24 | 60 | 4 | 3 | 7 | 10 | 154 | 1 | 1 | 0 | 7 | 305 | 3 | 3 | 0 | 6 |
9 | 14 | 9 | 1 | 18 | 61 | 0 | 0 | 5 | 10 | 155 | 2 | 2 | 2 | 7 | 306 | 2 | 2 | 1 | 6 |
10 | 4 | 4 | 17 | 18 | 62 | 1 | 1 | 0 | 9 | 160 | 0 | 0 | 1 | 7 | 307 | 3 | 2 | 4 | 6 |
11 | 7 | 4 | 18 | 18 | 66 | 3 | 3 | 8 | 9 | 164 | 2 | 2 | 0 | 7 | 309 | 8 | 6 | 0 | 6 |
12 | 8 | 6 | 11 | 17 | 69 | 1 | 1 | 3 | 9 | 168 | 0 | 0 | 2 | 7 | 313 | 1 | 1 | 2 | 6 |
13 | 4 | 5 | 15 | 17 | 72 | 2 | 3 | 2 | 9 | 173 | 8 | 5 | 5 | 7 | 315 | 1 | 1 | 0 | 6 |
14 | 2 | 3 | 17 | 17 | 76 | 5 | 6 | 0 | 9 | 175 | 2 | 2 | 4 | 7 | 320 | 1 | 1 | 0 | 6 |
16 | 7 | 5 | 3 | 16 | 77 | 0 | 0 | 1 | 9 | 177 | 2 | 3 | 0 | 7 | 321 | 0 | 0 | 0 | 6 |
17 | 10 | 11 | 13 | 16 | 78 | 2 | 3 | 6 | 9 | 181 | 1 | 1 | 0 | 7 | 322 | 3 | 3 | 0 | 6 |
18 | 19 | 12 | 0 | 16 | 82 | 1 | 1 | 8 | 9 | 182 | 1 | 1 | 1 | 7 | 334 | 1 | 1 | 1 | 5 |
19 | 1 | 1 | 2 | 16 | 83 | 3 | 5 | 1 | 9 | 185 | 2 | 2 | 0 | 7 | 347 | 3 | 3 | 0 | 5 |
20 | 4 | 6 | 0 | 15 | 87 | 0 | 0 | 0 | 8 | 190 | 0 | 0 | 4 | 6 | 348 | 1 | 1 | 2 | 5 |
21 | 3 | 3 | 8 | 14 | 91 | 4 | 4 | 0 | 8 | 204 | 1 | 2 | 0 | 6 | 349 | 2 | 2 | 0 | 5 |
22 | 2 | 2 | 14 | 14 | 93 | 0 | 0 | 5 | 8 | 205 | 1 | 2 | 0 | 6 | 353 | 0 | 0 | 0 | 5 |
23 | 8 | 7 | 0 | 14 | 94 | 0 | 0 | 2 | 8 | 206 | 2 | 2 | 0 | 6 | 356 | 2 | 3 | 0 | 5 |
24 | 0 | 0 | 13 | 13 | 96 | 1 | 1 | 2 | 8 | 209 | 4 | 4 | 0 | 6 | 359 | 0 | 0 | 1 | 5 |
26 | 5 | 6 | 0 | 13 | 99 | 3 | 3 | 0 | 8 | 210 | 0 | 0 | 0 | 6 | 360 | 1 | 1 | 0 | 5 |
28 | 2 | 3 | 0 | 12 | 100 | 0 | 0 | 0 | 8 | 211 | 1 | 1 | 2 | 6 | 368 | 2 | 2 | 0 | 5 |
29 | 0 | 0 | 3 | 12 | 101 | 11 | 6 | 0 | 8 | 218 | 0 | 0 | 0 | 6 | 374 | 1 | 1 | 2 | 5 |
31 | 1 | 1 | 0 | 12 | 102 | 5 | 4 | 0 | 8 | 228 | 5 | 4 | 2 | 6 | 387 | 2 | 2 | 1 | 5 |
34 | 1 | 2 | 5 | 12 | 103 | 0 | 0 | 5 | 8 | 229 | 1 | 2 | 4 | 6 | 388 | 2 | 2 | 2 | 5 |
35 | 4 | 4 | 0 | 11 | 104 | 3 | 3 | 0 | 8 | 230 | 0 | 0 | 0 | 6 | 389 | 0 | 0 | 0 | 5 |
36 | 3 | 3 | 0 | 11 | 105 | 4 | 4 | 1 | 8 | 233 | 1 | 1 | 0 | 6 | 393 | 2 | 3 | 4 | 5 |
39 | 2 | 2 | 1 | 11 | 106 | 0 | 0 | 0 | 8 | 235 | 1 | 1 | 0 | 6 | 395 | 0 | 0 | 5 | 5 |
41 | 23 | 9 | 0 | 11 | 107 | 0 | 0 | 3 | 8 | 239 | 2 | 2 | 0 | 6 | 396 | 3 | 3 | 0 | 5 |
42 | 4 | 6 | 1 | 11 | 113 | 0 | 0 | 5 | 8 | 243 | 0 | 0 | 0 | 6 | 397 | 1 | 1 | 0 | 5 |
43 | 2 | 2 | 0 | 11 | 119 | 0 | 0 | 8 | 8 | 244 | 1 | 2 | 0 | 6 | 401 | 0 | 0 | 0 | 5 |
44 | 4 | 4 | 0 | 11 | 120 | 3 | 3 | 4 | 8 | 246 | 2 | 2 | 0 | 6 | 403 | 1 | 1 | 0 | 5 |
45 | 1 | 1 | 1 | 11 | 125 | 1 | 1 | 2 | 8 | 263 | 1 | 1 | 0 | 6 | 413 | 1 | 2 | 0 | 5 |
46 | 4 | 5 | 8 | 10 | 127 | 0 | 0 | 0 | 8 | 274 | 1 | 1 | 1 | 6 | 414 | 1 | 2 | 1 | 5 |
48 | 1 | 1 | 2 | 10 | 128 | 2 | 3 | 0 | 8 | 275 | 2 | 2 | 0 | 6 | 415 | 1 | 1 | 0 | 5 |
50 | 2 | 2 | 1 | 10 | 130 | 4 | 4 | 0 | 8 | 278 | 1 | 1 | 0 | 6 | 417 | 1 | 2 | 0 | 5 |
The number of known biological interactions mined from BioGRID database [25] for each output cluster with cutoff = 0.6.
ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 109 | 58 | 108 | 118 | 80 | 2 | 3 | 0 | 20 | 184 | 5 | 5 | 4 | 13 | 297 | 1 | 2 | 2 | 9 |
2 | 108 | 61 | 105 | 116 | 84 | 3 | 3 | 0 | 19 | 186 | 2 | 2 | 1 | 13 | 307 | 0 | 0 | 1 | 8 |
3 | 32 | 31 | 19 | 88 | 86 | 1 | 1 | 7 | 19 | 187 | 1 | 1 | 0 | 13 | 312 | 0 | 0 | 0 | 8 |
4 | 24 | 23 | 32 | 83 | 91 | 3 | 4 | 0 | 19 | 188 | 2 | 2 | 0 | 13 | 313 | 0 | 0 | 0 | 8 |
5 | 104 | 41 | 1 | 69 | 97 | 6 | 6 | 0 | 18 | 196 | 2 | 2 | 2 | 13 | 315 | 1 | 2 | 1 | 8 |
9 | 19 | 20 | 33 | 43 | 100 | 8 | 10 | 0 | 18 | 197 | 0 | 0 | 1 | 13 | 316 | 2 | 2 | 1 | 8 |
11 | 13 | 12 | 8 | 43 | 108 | 7 | 7 | 0 | 18 | 198 | 0 | 0 | 1 | 13 | 325 | 2 | 2 | 0 | 8 |
12 | 11 | 10 | 34 | 41 | 112 | 7 | 6 | 0 | 18 | 199 | 2 | 2 | 0 | 13 | 330 | 0 | 0 | 0 | 8 |
15 | 16 | 14 | 20 | 35 | 113 | 0 | 0 | 17 | 18 | 202 | 1 | 1 | 5 | 12 | 332 | 1 | 1 | 0 | 8 |
17 | 15 | 13 | 0 | 33 | 116 | 4 | 5 | 0 | 17 | 206 | 2 | 2 | 1 | 12 | 334 | 5 | 4 | 0 | 8 |
19 | 96 | 22 | 0 | 32 | 117 | 3 | 5 | 2 | 17 | 207 | 5 | 6 | 1 | 12 | 340 | 1 | 1 | 1 | 8 |
20 | 15 | 13 | 2 | 30 | 120 | 5 | 4 | 2 | 17 | 213 | 1 | 1 | 5 | 12 | 354 | 2 | 3 | 0 | 8 |
22 | 4 | 4 | 0 | 30 | 121 | 5 | 5 | 0 | 17 | 214 | 3 | 3 | 1 | 12 | 355 | 5 | 5 | 0 | 8 |
27 | 7 | 8 | 2 | 29 | 122 | 2 | 3 | 14 | 16 | 215 | 1 | 1 | 4 | 12 | 356 | 2 | 2 | 0 | 8 |
30 | 18 | 12 | 0 | 28 | 124 | 1 | 1 | 7 | 16 | 217 | 2 | 2 | 0 | 12 | 358 | 0 | 0 | 3 | 8 |
31 | 8 | 12 | 6 | 28 | 127 | 0 | 0 | 2 | 16 | 219 | 0 | 0 | 0 | 12 | 364 | 1 | 1 | 0 | 8 |
34 | 3 | 3 | 11 | 27 | 129 | 0 | 0 | 0 | 16 | 221 | 2 | 3 | 0 | 12 | 365 | 2 | 2 | 0 | 8 |
36 | 7 | 7 | 9 | 26 | 132 | 4 | 4 | 7 | 16 | 222 | 1 | 1 | 3 | 12 | 366 | 1 | 1 | 1 | 8 |
37 | 25 | 16 | 0 | 26 | 134 | 4 | 5 | 3 | 16 | 230 | 5 | 6 | 0 | 11 | 372 | 2 | 3 | 0 | 7 |
38 | 4 | 5 | 1 | 26 | 137 | 4 | 6 | 0 | 16 | 231 | 1 | 1 | 0 | 11 | 374 | 3 | 3 | 1 | 7 |
39 | 7 | 8 | 0 | 26 | 138 | 5 | 7 | 1 | 16 | 232 | 1 | 1 | 2 | 11 | 375 | 0 | 0 | 0 | 7 |
40 | 4 | 4 | 1 | 26 | 139 | 3 | 4 | 3 | 16 | 236 | 0 | 0 | 2 | 11 | 377 | 2 | 2 | 0 | 7 |
41 | 4 | 4 | 2 | 26 | 140 | 3 | 3 | 0 | 15 | 237 | 2 | 2 | 0 | 11 | 378 | 2 | 2 | 0 | 7 |
42 | 5 | 5 | 13 | 26 | 143 | 1 | 2 | 0 | 15 | 240 | 0 | 0 | 2 | 11 | 388 | 0 | 0 | 2 | 6 |
44 | 17 | 12 | 18 | 26 | 145 | 4 | 4 | 3 | 15 | 241 | 7 | 5 | 0 | 11 | 389 | 0 | 0 | 1 | 6 |
46 | 2 | 2 | 10 | 25 | 147 | 9 | 9 | 0 | 15 | 242 | 2 | 2 | 2 | 11 | 395 | 1 | 1 | 2 | 6 |
50 | 31 | 11 | 2 | 24 | 150 | 2 | 2 | 0 | 15 | 243 | 3 | 3 | 0 | 11 | 413 | 1 | 1 | 0 | 6 |
51 | 3 | 3 | 22 | 23 | 151 | 6 | 6 | 2 | 15 | 245 | 4 | 4 | 0 | 11 | 416 | 0 | 0 | 0 | 6 |
52 | 23 | 13 | 10 | 23 | 152 | 0 | 0 | 1 | 15 | 249 | 2 | 2 | 0 | 10 | 420 | 0 | 0 | 0 | 6 |
53 | 4 | 4 | 0 | 23 | 153 | 2 | 2 | 0 | 14 | 255 | 3 | 3 | 0 | 10 | 421 | 0 | 0 | 0 | 6 |
54 | 8 | 10 | 2 | 23 | 154 | 4 | 5 | 0 | 14 | 259 | 2 | 3 | 0 | 10 | 429 | 1 | 1 | 0 | 6 |
58 | 4 | 3 | 0 | 22 | 155 | 1 | 2 | 3 | 14 | 267 | 4 | 5 | 2 | 10 | 431 | 1 | 1 | 0 | 6 |
63 | 7 | 8 | 0 | 22 | 166 | 2 | 2 | 0 | 14 | 270 | 1 | 1 | 2 | 10 | 433 | 1 | 1 | 0 | 6 |
67 | 1 | 1 | 3 | 21 | 167 | 2 | 2 | 2 | 14 | 285 | 3 | 3 | 0 | 10 | 441 | 2 | 2 | 0 | 6 |
71 | 7 | 7 | 12 | 21 | 169 | 10 | 6 | 0 | 14 | 289 | 1 | 1 | 2 | 10 | 442 | 3 | 3 | 0 | 6 |
74 | 12 | 12 | 1 | 20 | 175 | 4 | 5 | 1 | 14 | 290 | 0 | 0 | 0 | 10 | 444 | 1 | 1 | 0 | 6 |
76 | 2 | 2 | 9 | 20 | 176 | 1 | 1 | 0 | 14 | 291 | 6 | 4 | 0 | 10 | 445 | 0 | 0 | 0 | 6 |
79 | 3 | 3 | 0 | 20 | 180 | 4 | 4 | 0 | 13 | 296 | 4 | 4 | 0 | 9 | | | | | |
The number of known biological interactions mined from BioGRID database [25] for each output cluster with cutoff = 0.5.
ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size | ci | n1 | n2 | n3 | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 483 | 244 | 30 | 429 | 11 | 133 | 88 | 1 | 170 | 27 | 48 | 39 | 4 | 98 | 46 | 11 | 14 | 6 | 57 |
2 | 464 | 165 | 242 | 277 | 13 | 73 | 61 | 14 | 159 | 29 | 56 | 44 | 5 | 92 | 50 | 16 | 13 | 2 | 51 |
4 | 187 | 113 | 30 | 245 | 14 | 257 | 78 | 13 | 150 | 32 | 33 | 31 | 7 | 81 | 54 | 10 | 11 | 1 | 33 |
6 | 146 | 110 | 15 | 212 | 15 | 69 | 57 | 36 | 148 | 39 | 22 | 21 | 2 | 68 | | | | | |
7 | 185 | 107 | 12 | 204 | 16 | 72 | 55 | 86 | 147 | 43 | 8 | 11 | 4 | 63 | | | | | |
8 | 63 | 54 | 19 | 184 | 24 | 80 | 59 | 1 | 116 | 45 | 8 | 11 | 5 | 58 | | | | | |
Traditional clustering algorithms focus on relationships based on similar expression profiles, identifying clusters of genes whose expression signals rise or fall simultaneously, under the assumption that genes with similar expression profiles have similar biological functions. For example, Spellman et al. [6] identify a large number of genes (~800) as showing cell-cycle-specific patterns of gene expression by fitting the expression profile of each gene to a sine wave, which is used as a surrogate pattern of ideal cyclicity. They then use a hierarchical clustering algorithm to linearly correlate the expression profile of a given gene with the expression profiles of other genes that are already confirmed to be cell-cycle regulated. In this way, they cluster genes into five cell cycle phases (G1, S, S/G2, G2/M, and M/G1). In contrast, the PSC algorithm uses the theory of multivariate phase synchronization, in which the mean phase coherence in Eq. 4 is used to find closely related genes that have relevant biological interactions and/or share significant GO terms. Here, the PSC algorithm deals with a special kind of random variable that is defined on a circular scale, such that values whose difference is an integral multiple of a certain period (i.e. 2π) are regarded as the same, and all values are wrapped into a single period. Note that the phase difference between expression profiles (or the phase of an expression profile) is an example of a circular random variable φ_{i} (i = 1, 2,...). Standard (or linear) statistical measures and moments such as the mean and variance are not applicable to such variables, because they yield different values if the period is added to or subtracted from some values, even though the physical meaning of these changed values is the same.
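The point about linear statistics on circular variables can be illustrated with a short sketch. It compares the ordinary arithmetic mean of two phases with the circular mean, computed here as the argument of the averaged unit vectors, a standard construction from circular statistics. The function name `circular_mean` and the sample phases are illustrative assumptions.

```python
import cmath
import math


def circular_mean(phases):
    """Mean direction of a set of phases: argument of the average unit vector."""
    resultant = sum(cmath.exp(1j * p) for p in phases) / len(phases)
    return cmath.phase(resultant)


# Two phases that lie close together on the circle, on either side of 0.
phases = [0.1, 2 * math.pi - 0.1]

linear = sum(phases) / len(phases)   # arithmetic mean points to the opposite side (pi)
circular = circular_mean(phases)     # circular mean stays near 0

# Adding a full period to one value changes the linear mean but not the circular one.
shifted = [0.1 + 2 * math.pi, 2 * math.pi - 0.1]
linear_shifted = sum(shifted) / len(shifted)
circular_shifted = circular_mean(shifted)
```

The arithmetic mean lands at π, the point farthest from both phases, and moves again when a period is added, whereas the circular mean is unaffected by such shifts.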
Based on the theory of phase synchronization, it is assumed that expression signals from two genes could be synchronized if these two genes are biologically interacting with each other. That is, two biologically interacting genes produce oscillating expression signals with a common rhythm. This phenomenon is explained in terms of coincidence of frequencies defined as "phase locking" [12]. With this theory, it is possible to measure the coupling strength between genes, which describes how strong the interaction is between genes.
Conclusion
The PSC algorithm is based mainly on the theory of multivariate phase synchronization, where phase synchronization can be understood as a common rhythm in the oscillatory activities of systems that arises from their interactions with each other. We develop a strategy for identifying and categorizing cell cycle specific gene expression according to specific biological processes, in which expression signals share a common rhythm during the cell cycle. That is, the PSC algorithm is suited to finding groups of genes that share the same periodic variations in their expression profiles, with a period that coincides with the length of the cell cycle. In contrast, traditional clustering algorithms search for similar expression profiles under the assumption that genes with similar expression profiles have similar biological functions. Our evaluation indicates that the PSC algorithm produces prominent clusters that are not obtainable by traditional clustering algorithms.
Our evaluation also shows that the PSC algorithm is able to find groups of genes that are significantly associated with each other by sharing significant GO terms of biological process and/or relevant biological interactions. However, the algorithm is not able to create a directed and weighted network of synchronization. Recently, Motter et al. [26] showed that, for a given degree distribution of heterogeneous connectivity, maximum synchronizability is achieved when the network of synchronization is weighted and directed. Therefore, the analysis of cell cycle specific genome expression data could be further advanced by considering a directed and weighted network structure and by addressing the effect that asymmetry has on the synchronizability of complex networks.
Based on the evaluation experiments, we draw the following conclusions: 1) Based on the theory of multivariate phase synchronization, it is feasible to find groups of genes that have biological interactions and/or significantly shared GO slim terms of biological process, using cell cycle specific gene expression signals. 2) Among the output clusters of the PSC algorithm, larger clusters tend to include more known interactions than smaller ones. 3) It is feasible to understand cell cycle specific gene expression patterns as a phenomenon of collective synchronization. 4) The PSC algorithm is able to find prominent groups of genes that are not obtainable by traditional clustering algorithms.
Methods
1) Fundamental mathematical concept: multivariate synchronization
The proposed algorithm builds on the concepts of analytic signal and phase synchronization. Hence, we first explain the basic idea of analytic signal and phase synchronization [12, 29]. Then we continue to describe the basic idea of synchronization in ensembles of oscillating systems. By "oscillating systems", we mean systems that produce response signals with a period and frequency. As a first step, we convert the gene expression signal x(t) into the analytic signal x_{a}(t) using the Hilbert transform (HT). The analytic signal of the gene expression signal x(t) is defined by
$x_{a}(t)=x(t)+j{x}_{h}(t)$
The HT of x(t), denoted x_{h}(t), may be considered as the convolution of x(t) and 1/πt. Due to the properties of convolution, the Fourier transform (FT) X_{h}(ς) of x_{h}(t) is the product of the FTs of x(t) and 1/πt. For physically relevant Fourier frequencies ς > 0, X_{h}(ς) = -jX(ς). In other words, the HT can be regarded as an ideal filter whose amplitude response is unity and whose phase response is a constant π/2 lag at all Fourier frequencies. The analytic signal can also be expressed in complex polar coordinates as
$x_{a}(t)=A(t)\exp\left(j\varphi(t)\right),$
where $A\left(t\right)=\left|{x}_{a}\left(t\right)\right|=\sqrt{{x}^{2}\left(t\right)+{x}_{h}^{2}\left(t\right)}$ and φ(t) = arg{x_{a}(t)}. These two functions are called the instantaneous amplitude and the instantaneous phase of the signal x(t), respectively. The basic idea of the analytic signal is that the negative frequency components of the FT (or spectrum) of x(t) are superfluous, due to the Hermitian symmetry of such a spectrum. These components can be removed without any loss of information if an analytic signal is used instead. Note, however, that removing the negative frequencies eliminates this spectral symmetry; the inverse FT of such a one-sided spectrum gives back a complex analytic signal.
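As a worked example of these definitions, consider a pure cosine with an assumed angular frequency ω > 0 (a symbol introduced here only for illustration). Since the HT delays every positive-frequency component by π/2, the analytic signal, amplitude and phase follow directly:

```latex
\begin{aligned}
x(t)   &= \cos(\omega t), \qquad \omega > 0,\\
x_h(t) &= \cos\!\left(\omega t - \tfrac{\pi}{2}\right) = \sin(\omega t),\\
x_a(t) &= \cos(\omega t) + j\sin(\omega t) = \exp(j\omega t),\\
A(t)   &= \sqrt{\cos^2(\omega t) + \sin^2(\omega t)} = 1,
\qquad \varphi(t) = \omega t \pmod{2\pi}.
\end{aligned}
```

In this idealized case the instantaneous phase grows linearly in time and carries the rhythm of the signal, while the amplitude is constant.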
The values of ρ_{a,b} are confined between 0 (no synchrony) and 1 (perfect synchrony), and they increase monotonically with the strength of phase synchronization [18].
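The behaviour at the two extremes can be checked with a short sketch. The defining equation for ρ_{a,b} is not reproduced in this text, so the code assumes the usual form of the mean phase coherence, namely the modulus of the time-averaged unit vector of the phase difference; the function name and the example phase vectors are hypothetical.

```python
import cmath
import math


def mean_phase_coherence(phi_a, phi_b):
    """|(1/n) * sum_t exp(j * (phi_a(t) - phi_b(t)))| for two phase vectors."""
    n = len(phi_a)
    return abs(sum(cmath.exp(1j * (a - b)) for a, b in zip(phi_a, phi_b)) / n)


n = 8
phi_a = [0.3 * t for t in range(n)]

# Constant phase difference: the two signals are phase locked.
phi_locked = [0.3 * t - 1.0 for t in range(n)]
rho_locked = mean_phase_coherence(phi_a, phi_locked)

# Phase differences spread evenly around the circle: no preferred difference.
phi_drift = [0.3 * t + 2 * math.pi * t / n for t in range(n)]
rho_drift = mean_phase_coherence(phi_a, phi_drift)
```

A constant phase difference gives a value of 1 regardless of the size of the offset, and evenly spread differences give a value of 0, matching the two limits stated above.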
with F_{ ik }= 1/(1 - ρ_{ iC }^{2}ρ_{ ik }^{2})^{2} for i ≠ k and F_{ ik }= 0 for i = k.
2) Phase synchronization clustering algorithm
Based on the concept of synchronization in ensembles of oscillating systems, we propose a strategy for forming clusters of genes using the theory of multivariate synchronization. The procedure has 5 steps, which are described below. The inputs to the method are the time series of the expression data set and a cutoff value for the synchronization strength.
Step 1. Obtaining the phase vector φ_{ i }
Suppose there are signals x_{i}(t) from K systems, i = 1, 2, ..., K, each with n observations t = 1, 2, ..., n of the stochastic process. In this step, the analytic signal is approximated using the fast Fourier transform [27]. The output of this step is the phase vector φ_{i}, which is defined as φ_{i} = {φ_{i}(1), φ_{i}(2), ..., φ_{i}(n)}, for 1 ≤ i ≤ K.
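Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the common discrete construction of the analytic signal (zero the negative-frequency bins and double the positive ones), and it uses a naive O(n²) discrete Fourier transform from the standard library in place of a fast Fourier transform, which is adequate for short expression series. In practice an FFT routine or `scipy.signal.hilbert` would normally be used. The helper names are hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(x):
    """Approximate x_a(t) by removing the negative-frequency half of the spectrum."""
    n = len(x)
    weights = [0.0] * n
    weights[0] = 1.0                      # keep the zero-frequency term
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep the Nyquist term for even n
        upper = n // 2
    else:
        upper = (n + 1) // 2
    for k in range(1, upper):
        weights[k] = 2.0                  # double the positive frequencies
    spectrum = dft(x)
    return dft([w * s for w, s in zip(weights, spectrum)], inverse=True)


def phase_vector(x):
    """Instantaneous phases phi(t) = arg(x_a(t)), wrapped into (-pi, pi]."""
    return [cmath.phase(z) for z in analytic_signal(x)]


# Hypothetical test signal: two full cosine cycles over 16 observations.
n = 16
x = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
x_a = analytic_signal(x)
phi = phase_vector(x)
```

For this test signal the recovered analytic signal is a unit-amplitude complex exponential, so the phase advances by π/4 per observation.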
Step 2. Initialization of cluster array
First, an array for K number of clusters is produced. For each cluster cluster(i), a phase vector φ_{ i }is stored for 1 ≤ i ≤ K. The output of this step is cluster array cluster(i), for 1 ≤ i ≤ K. The pseudo algorithm of this step is presented in APPENDIX A.
Step 3. Initial clustering
For each phase vector, this step determines how closely the phase vector follows the common rhythm of each cluster in the array. This is measured by the synchronization strength between the phase vector and the cluster. The algorithm then finds the cluster for which the phase vector has the highest synchronization strength. If the synchronization strength between the phase vector and the selected cluster is greater than or equal to the pre-defined cutoff value, the selected cluster is updated by appending the phase vector to it. This procedure is repeated for all phase vectors. The output of this step is the updated cluster array. The pseudo algorithm of this step is presented in APPENDIX B.
Step 4. Filtering cluster
If a cluster contains no more than one system, it does not constitute a cluster. Such a cluster is therefore set to the empty list. The pseudo algorithm of this step is presented in APPENDIX C.
Step 5. Combining clusters
Empty clusters are not considered in this step. For each non-empty cluster, the algorithm finds the cluster in the array such that the two clusters have the most common rhythm when they are combined. If the synchronization strengths between the combined cluster and each of its elements are all greater than or equal to the cutoff value, the two clusters are combined. The pseudo algorithm is presented in APPENDIX D.
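Steps 2 to 4 can be sketched in a few lines. Two assumptions should be kept in mind. First, the synchronization strength between a phase vector and a cluster is replaced here by a simple stand-in, the mean pairwise phase coherence with the cluster members, and not the cluster estimator of Allefeld and Kurths [17]. Second, the sketch follows the pseudo algorithm in Appendix B literally, so an appended phase vector also stays in its own initial cluster. Clusters store indices of phase vectors, and all function names are hypothetical.

```python
import cmath
import math


def coherence(phi_a, phi_b):
    """Pairwise mean phase coherence between two phase vectors."""
    n = len(phi_a)
    return abs(sum(cmath.exp(1j * (a - b)) for a, b in zip(phi_a, phi_b)) / n)


def strength_to_cluster(candidate, members):
    """Stand-in for the cluster synchronization strength (an assumption)."""
    return sum(coherence(candidate, m) for m in members) / len(members)


def initial_clusters(phase_vectors, cutoff):
    k = len(phase_vectors)
    clusters = [[i] for i in range(k)]        # step 2: one cluster per signal
    for i in range(k):                        # step 3: initial clustering
        best_j, best = None, 0.0
        for j in range(k):
            if j == i:
                continue
            members = [phase_vectors[m] for m in clusters[j]]
            strength = strength_to_cluster(phase_vectors[i], members)
            if strength > best:
                best_j, best = j, strength
        if best_j is not None and best >= cutoff:
            clusters[best_j].append(i)
    # step 4: a cluster with a single member is set to the empty list
    return [c if len(c) > 1 else [] for c in clusters]


# Hypothetical input: a and b share a rhythm, c runs at a different frequency.
n = 12
a = [0.7 * t for t in range(n)]
b = [0.7 * t + 0.3 for t in range(n)]
c = [0.7 * t + 2 * math.pi * t / n for t in range(n)]
clusters = initial_clusters([a, b, c], cutoff=0.9)
```

In this toy input the two phase-locked vectors end up together in two redundant clusters, which is the situation that the combining step is meant to resolve, while the third vector is left without a cluster.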
Appendix A: The pseudo algorithm for initialization of cluster array
Input: phase vectors, φ_{ i }for 1 ≤ i ≤ K.
Output: cluster array, cluster(i) for 1 ≤ i ≤ K.
for 1 ≤ i ≤ K
    cluster(i) = {φ_{i}}
end
Appendix B: The pseudo algorithm for initial clustering
Input: cutoff and cluster array, cluster(i) for 1 ≤ i ≤ K.
Output: cluster array, cluster(i) for 1 ≤ i ≤ K.
for 1 ≤ i ≤ K
    Initialize [SynStrength]^{(j)} with 0, for 1 ≤ j ≤ K
    for 1 ≤ j ≤ K, i ≠ j
        temp_list = {cluster(j), φ_{i}}
        n = the size of temp_list
        Compute ρ_{mC} using φ_{m} from temp_list, for 1 ≤ m ≤ n
        [SynStrength]^{(j)} = ρ_{nC}
    end
    Find max_SynStrength = max{[SynStrength]^{(j)}, 1 ≤ j ≤ K, i ≠ j}
    Let j* be the index j that attains max_SynStrength
    if max_SynStrength ≥ cutoff
        cluster(j*) = {cluster(j*), φ_{i}}
    end
end
Appendix C: The pseudo algorithm for filtering cluster
Input: cluster array, cluster(i) for 1 ≤ i ≤ K.
Output: cluster array, cluster(i) for 1 ≤ i ≤ K.
for 1 ≤ i ≤ K
    if the size of cluster(i) equals 1
        cluster(i) = {}
    end
end
Appendix D: The pseudo algorithm for combining clusters
Input: cutoff and cluster arrays, cluster(i) for 1 ≤ i ≤ K.
Output: cluster array, cluster(i) for 1 ≤ i ≤ K.
for 1 ≤ i ≤ K
    Initialize [SynStrength]^{(j)} with 0, for 1 ≤ j ≤ K
    for 1 ≤ j ≤ K, i ≠ j
        if cluster(i) and cluster(j) are not empty lists
            temp_list = {cluster(j), cluster(i)}
            n = the size of temp_list
            Compute ρ_{mC} using φ_{m} from temp_list, for 1 ≤ m ≤ n
            [SynStrength]^{(j)} = min{ρ_{mC}, 1 ≤ m ≤ n}
        end
    end
    Find max_SynStrength = max{[SynStrength]^{(j)}, 1 ≤ j ≤ K, i ≠ j}
    Let j* be the index j that attains max_SynStrength
    if max_SynStrength ≥ cutoff
        cluster(j*) = {cluster(j*), cluster(i)}
        cluster(i) = {}
    end
end
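The combining step can be sketched in the same spirit. As assumptions, the synchronization strength of each member with the combined cluster is again replaced by the mean pairwise phase coherence with the other members rather than the estimator of Allefeld and Kurths [17], clusters hold indices of phase vectors, and members that appear in both clusters are kept only once when the clusters are merged. The function names are hypothetical.

```python
import cmath
import math


def coherence(phi_a, phi_b):
    """Pairwise mean phase coherence between two phase vectors."""
    n = len(phi_a)
    return abs(sum(cmath.exp(1j * (a - b)) for a, b in zip(phi_a, phi_b)) / n)


def weakest_member_strength(indices, phase_vectors):
    """Minimum over members of a stand-in strength (an assumption, see lead-in)."""
    strengths = []
    for m in indices:
        others = [k for k in indices if k != m]
        if not others:
            return 0.0
        total = sum(coherence(phase_vectors[m], phase_vectors[k]) for k in others)
        strengths.append(total / len(others))
    return min(strengths)


def combine_clusters(clusters, phase_vectors, cutoff):
    clusters = [list(c) for c in clusters]    # work on a copy
    for i in range(len(clusters)):
        if not clusters[i]:
            continue                          # empty clusters are not considered
        best_j, best, best_merged = None, 0.0, None
        for j in range(len(clusters)):
            if j == i or not clusters[j]:
                continue
            merged = clusters[j] + [m for m in clusters[i] if m not in clusters[j]]
            strength = weakest_member_strength(merged, phase_vectors)
            if strength > best:
                best_j, best, best_merged = j, strength, merged
        if best_j is not None and best >= cutoff:
            clusters[best_j] = best_merged    # combine into the selected cluster
            clusters[i] = []
    return clusters


# Hypothetical input: a and b share a rhythm, c runs at a different frequency.
n = 12
a = [0.7 * t for t in range(n)]
b = [0.7 * t + 0.3 for t in range(n)]
c = [0.7 * t + 2 * math.pi * t / n for t in range(n)]
combined = combine_clusters([[0, 1], [1, 0], []], [a, b, c], cutoff=0.9)
```

Here the two redundant clusters that contain the same phase-locked pair collapse into a single cluster, and the emptied slot is left as an empty list, mirroring the last two assignments of the pseudo algorithm.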
Declarations
Acknowledgements
The authors thank Professor Choung Ik Kim from Kangwon National University for helpful discussion.
Authors’ Affiliations
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.