- Research article
- Open Access
Robust discovery of periodically expressed genes using the laplace periodogram
© Liang et al; licensee BioMed Central Ltd. 2009
- Received: 19 June 2008
- Accepted: 11 January 2009
- Published: 11 January 2009
Time-course gene expression analysis has become important in recent developments due to the increasingly available experimental data. The detection of genes that are periodically expressed is an important step which allows us to study the regulatory mechanisms associated with the cell cycle.
In this work, we present the Laplace periodogram which employs the least absolute deviation criterion to provide a more robust detection of periodic gene expression in the presence of outliers. The Laplace periodogram is shown to perform comparably to existing methods for the Sacharomyces cerevisiae and Arabidopsis time-course datasets, and to outperform existing methods when outliers are present.
Time-course gene expression data are often noisy due to the limitations of current technology, and may include outliers. These artifacts corrupt the available data and make the detection of periodicity difficult in many cases. The Laplace periodogram is shown to perform well for both data with and without the presence of outliers, and also for data that are non-uniformly sampled.
- Impulse Noise
- Impulsive Noise
- Periodic Gene
- Less Absolute Deviation
- Permute Sequence
In the past decade, time-course gene expression datasets have become increasingly available, and have enabled the study of the dynamical behaviors of gene expression and the related regulatory mechanisms, as well as the analysis of the relationships between genes and cellular processes. Of particular interests are the genes that regulate and that are being regulated in relation to the cell-division cycles. A cell-division cycle is a series of sequential steps which are repeated throughout the lifetime of an eukaryotic cell, and it consists of four distinct phases: G1 phase, S phase, G2 phase, and M phase. The cell-division cycle is regulated by a complex interaction of a set of mechanisms which include genes such as cyclins and cyclin-dependent kinases (CDKs). These genes are known to be expressed periodically with respect to the cell-division cycle . In the genome-wide time-course gene expression studies performed on Saccharomyces cerevisiae by [2, 3], a group of genes are shown to be periodically expressed with respect to the cell-division cycle. Since then, other time-course gene expression studies have been conducted on other organisms such as Plasmodium falciparum  and Schizosaccharomyces pombe , and similar experiments were also conducted on human cells [6, 7]. The detection of these periodically expressed genes is essential in the understanding of the regulatory mechanism for the cell-division cycle.
where j = , and ω is the angular frequency. Such classical periodogram has been used in various studies to detect periodicity in experimental data [3, 8–11]. Other algorithms have also been proposed for the detection of sinusoids in expression data. In , a single pulse model is proposed to model the periodic behavior in the time-series gene expression of Saccharomyces cerevisiae. In , partial least-squares regression is used to fit the data to a linear combination of sinusoids of various frequencies. A gene is then represented as a point in a multidimensional space whose coordinates are given by its coefficients for the cosine and sine functions. The strength of periodicity of the expression of a gene is determined by the distance of the point from the origin. In , periodic time-course gene expression are modeled by Y i = μ i + β i f(t i - τ i ) + ε i , where f(t) is a periodic function which is modeled by a linear combination of cubic B-spline basis. This approach allows the modeling of periodic gene expression that take forms other than sinusoids. In addition to periodicity detection algorithms, various clustering algorithms have been proposed to obtain subsets of genes that exhibit periodic behavior. These methods include support vector machine , singular value decomposition , and independent component analysis .
While the above listed algorithms have achieved varying degrees of success, they are often limited by factors such as being developed for a specific dataset, not being able to provide a ranking for genes, and the nature of the time-course gene expression. In particular, one problem that plagues the detection of periodic signals in time-course gene expression data is that the samples are typically non-uniformly spaced, which is caused by the cell arresting and measurement methods employed by the experiments. One approach to resolve non-uniform sampling is to extrapolate a continuous signal from the available samples, and obtain a set of uniformly-spaced samples of the data from the extrapolated signal. Various works have explored this option, such as linear interpolation , cubic interpolation on the log of the expression levels , and B-spline interpolation . However, errors can be introduced during interpolation, which can lead to inaccuracies in ranking the periodicity of genes. To modify the classical periodogram so that it can treat observations obtained with non-uniform spacing, it was proposed in  to fit the data to sinusoidal signals with least-squares. The asymptotic distribution of the proposed periodogram is derived in . In , it is shown that the Lomb-Scargle periodogram outperforms methods based on interpolation. The Lomb-Scargle periodogram is also used to detect periodicity or to estimate the angular frequency spectrum of non-uniformly spaced biological samples in [24, 25], and it is used to detect genes exhibiting periodic time-course gene expression in .
While using the least-squares fitting to sinusoidal functions allows the treatment of non-uniformly spaced samples, it is also well known that the least-squares method is non-robust in the presence of heavy-tailed noise and outliers due to its assumption that noise is independently and identically Gaussian distributed . It is shown in  that the performance of Lomb-Scargle periodogram degrades in the presence of heavy-tailed non-Gaussian noise. In , robust periodicity detection algorithms using three different estimators – the M-estimator with Tukey's biweight function, the least trimmed squares, and the minimum covariance determinant estimator – are proposed to detect periodic genes in the mussel species Mytilus californianus. These robust regression methods are shown to outperform the Lomb-Scargle periodogram in the presence of outliers.
In this paper, we propose the use of the Laplace periodogram  as an alternative to the Lomb-Scargle periodogram for the detection of periodic genes from time-course gene expression data. The Laplace periodogram, which we describe in the Methods Section, employs the least absolute deviation (LAD) fit to sinusoidal functions instead of the least squares used in the Lomb-Scargle periodogram. As we will show in the following sections, the Laplace periodogram achieves better performance than the Fourier-based algorithm proposed in  and the M-estimator-based algorithm proposed in  when impulsive noise is present in the Saccharomyces cerevisiae datasets of [2, 3].
Periodic gene detection without outliers
In this section, we compare the periodicity detection performance of the Fourier-score-based algorithm , M-estimator , and the Laplace periodogram using two sets of real data. First, we use the Saccharomyces cerevisiae Alpha (6075 genes), CDC15 (5673 genes), and CDC28 (6214 genes) experiments from [2, 3]. The time-course gene expression datasets of these three experiments are obtained from Cyclebase.org http://www.cyclebase.org. The expression datasets obtained have been normalized to a common scale based on the percentage of the cell division cycle where the sampling occurs. Furthermore, the magnitudes of the gene expression have also been normalized to have a standard deviation of one . Next, we compare the performance of the Fourier-score-based algorithm and the Laplace periodogram using the circadian oscillation dataset for auxin signaling in Arabidopsis.
Saccharomyces cerevisiae datasets
We use the same three sets of benchmarks described in . Set B1 contains 113 genes identified to be periodically expressed by the small scale experiments of [3, 13]. Set B2 contains genes identified in the Chromatin IP studies in , with 50 genes that are in common with the benchmark set B1 removed, leaving a total of 352 genes. Set B3 contains genes identified by MIPS  as "cell cycle and DNA processing" related genes. Genes that are also annotated as meiosis and genes that are in common with set B1 are removed, resulting in a total of 518 genes in this benchmark set. In B2 and B3, the genes selected based on their association with other genes that are associated with the cell cycle, or genes that play a role in the cell cycle. Thus, it is expected that most of the genes in B2 and B3 are not periodic, and should not be detected as periodic by the algorithms.
For each experiment, we ranked the time-course expression of the genes at the normalized cell-division-cycle frequency using the p-values of the scores computed by each of the three algorithms. Since we have very small number of samples, we estimated the p-values of the scores at the normalized cell-division-cycle frequency by a bootstrap method similar to , that is, for each gene, we fix the sampling times and permute the expression values. A total of 1000 permutations are generated, and for each permuted time series the score is computed. The p-value is simply the ratio of the permuted sequences that produced scores higher than from the original time-course expression. By permuting the expression values with respect to their sampling times, we aimed to destroy the periodicity that may exist within the time series. If periodicity exists within the time series, it is unlikely that a randomly permuted sequence will recover that periodicity, thus the magnitude of the spectrum at the tested period will be reduced and unlikely to achieve greater magnitude than the magnitude of the original time series. If on the other hand periodicity does not exist in the time series, then neither the original nor the permuted sequences will likely to have large magnitude at the test period, therefore, the possibility of the permuted sequence having higher magnitude than the original sequence is quite high. Thus the lower the p-value, meaning that only a small ratio of the permuted sequences resulted in higher magnitude than by the original sequence, the more likely that the gene is periodically expressed.
The figures plot the ratio of periodic genes as indicated by the B1, B2, and B3 benchmark sets discovered in a subset of the top scoring genes scored by the three algorithms. As the subset of top scoring genes (number of genes in the subset) increases, the ratio of benchmark periodic genes contained in these subsets also increases.
From Figures 1, 2, 3, we can see that the Fourier score method consistently produces the best performance out of the three methods. The Laplace periodogram is able to detect periodically expressed genes at detection ratio that is comparable to the Fourier score for some experiment-benchmark combinations, and less for others. In particular, the Laplace periodogram has the biggest drop-off in performance compared to the Fourier score for the combinations of CDC15-B1 and CDC28-B1. For the other combinations, the Laplace periodogram either has comparable detection performance for all values of K, or is able to bridge the performance gap as K approaches 700. For each of the combinations, the M-estimator-based method achieves the worst performance, with large drop-offs in performance from both the Fourier score and the Laplace periodogram. Also note that the results in Figures 2 and 3 meet with our expectation that most of the genes in B2 and B3 are not periodic, leading to a very low ratio of genes from those two sets being detected as periodic.
For this comparison, we use the experimental data provided by , which includes the expression of 22810 genes in Arabidopsis, 1677 of which are determined to follow a circadian fluctuation in mRNA abundance. Total of 12 samples of the time-series gene expression is taken for each gene, which covers approximately two cycles of the circadian oscillation.
Periodic gene detection in the presence of outliers
where c is the constant that determines the shape of the curve and the cutoff point above which the residual of the sample and the current fit contributes zero weighting, thus completely removing the contribution of possible outliers that exceeds a certain magnitude. However, this approach requires a value for c which may vary from dataset to dataset. The optimal value for c may depend on the magnitude of the peaks and the magnitude of the outliers. When given an unknown dataset without a benchmark set of periodically expressed genes, it is uncertain how c can be tuned to achieve the optimal result, thus demonstrates a disadvantage of the M-estimator-based method. Note that for both simulations with and without the presence of outliers, we used the default value of c which came with the program provided by the authors of .
Our simulation results have shown that the Laplace periodogram is a useful tool for detecting periodic time-course gene expression, particularly when the dataset contains outliers and when the sampling intervals are highly uneven. The Laplace periodogram achieves better performance for the Saccharomyces cerevisiae dataset when outliers are present, and achieves equal or comparable performances to the Fourier-based method for both the Saccharomyces cerevisiae and Arabidopsis datasets when no outlier.
With . In other words, is the solution to the least-squares regression of y to the regressor x t .
Note here that the magnitude at each angular frequency can be computed independent of the other frequencies, meaning that if we know exactly the periodicity that we are looking for, there is no need to compute the LAD coefficients for the entire frequency spectrum. In , the proposed robust estimators are used in the Fisher's g-statistics, which has a denominator that is computed with contribution from all the frequency components. Thus the LAD periodogram has an advantage in terms of computational complexity when the periodicity is known, by foregoing the evaluation of the magnitudes of the rest of the frequency components.
An implementation of the proposed algorithm in MATLAB can be found at http://kuoching.googlepages.com
Method for LAD approximation
where [t1, t2, ..., t N ] are the N non-uniformly sampled time instants.
In this formulation, the LAD coefficients can be easily solved using standard algorithms for solving linear programming problems. For our implementation in MATLAB, we used the LINPROG function in the Optimization Toolbox. In terms of computational time required to process the data, for experiment Alpha which consists of 6075 genes and 18 samples each, the total time to compute 1000 permutations for the p-value analysis takes approximately 24 hours on a Pentium Core 2 CPU at 2.66 GHz, which is similar to the amount of time taken by the M-estimator-based method, also implemented in MATLAB using the ROBUSTFIT function in the Statistics Toolbox.
We would like to thank Harri Läahdesmäaki for generously providing us with the source to their code for the robust regression methods, and Cyclebase.org for the datasets used in the simulations in this work.
- Alberts B, Bray D, Hopkin K, Johnson A: Essential Cell Biology. Garland Science. 2003.Google Scholar
- Cho R, Campbell M, Winzeler E, Steinmetz L, Conway A, Wodicka L, Wolfsberg T, Gabrielian A, Landsman D, Lockhart D, Davis R: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 1998, 2: 65–73.View ArticlePubMedGoogle Scholar
- Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast S. cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9: 3273–3297.PubMed CentralView ArticlePubMedGoogle Scholar
- Bozdech Z, Linas M, Pulliam B, Wong E, Zhu J, DeRisi J: The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum . PLoS Biol 2003, 1(1):1–16.View ArticleGoogle Scholar
- Rustici G, et al.: Periodic gene expression program of the fission yeast cell cycle. Nat Genet 2004, 36: 809–817.View ArticlePubMedGoogle Scholar
- Cho RJ, Huang M, Campbell M, Dong H, Steinmetz L, Sapinoso L, Hampton G, Elledge S, Davis R, Lockhart D: Transcriptional regulation and function during the human cell cycle. Nat Genet 2001, 27: 48–54.PubMedGoogle Scholar
- Whitfield ML, Sherlock G, Saldanha A, Murray J, Ball C, Alexander K, Matese J, Perou C, Hurt M, Brown P, Botstein D: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell 2002, 13(6):1977–2000.PubMed CentralView ArticlePubMedGoogle Scholar
- Liavas AP, Moustakides GV, Hanning G, Psarakis E, Husar P: A periodogram-based method for the detection of steady-state visually evoked potentials. 1998, 45(2):242–248.Google Scholar
- Straume M: DNA microarray time series analysis: automated statistical assessment of circadian rhythms in gene expression patterning. Meth Enzymol 2004, 383: 149–166.View ArticlePubMedGoogle Scholar
- Wichert S: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics 2004, 20: 5–20.View ArticlePubMedGoogle Scholar
- Chen J: Identification of significant periodic genes in microarray gene expression data. BMC Bioinformatics 2005, 6: 286.PubMed CentralView ArticlePubMedGoogle Scholar
- Chu M, Haussecker H, Zhao F: Scalable information-driven sensor querying and routing for ad hoc heterogeneous sensor network. 2002, 16(3):293–314.Google Scholar
- Johansson D, Lindgren P, Berglund A: A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription. Bioinformatics 2003, 19: 467–473.View ArticlePubMedGoogle Scholar
- Luan Y, Li H: Model-based methods for identifying periodically expressed genes based on time course microarray gene expression data. Bioinformatics 2004, 20: 332–339.View ArticlePubMedGoogle Scholar
- Brown M, Grundy WN, Lin D, Cristianini N, Sugnet C, Furey T, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Nat Acad Sciences 2000, 97: 262–267.View ArticleGoogle Scholar
- Alter O, Brown PO, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. Proc Nat Acad Sciences 2000, 97(18):10101–10106.View ArticleGoogle Scholar
- Liebermeister W: Linear modes of gene expression determined by independent component anaylsis. Bioinformatics 2002, 18: 51–60.View ArticlePubMedGoogle Scholar
- Aach J, Church GM: Aligning gene expressing time series with time warping algorithms. Bioinformatics 2001, 17(6):495–508.View ArticlePubMedGoogle Scholar
- D'haeseleer P, Wen X, Fuhrman S, Somogyi R: Linear modeling of mRNA expression levels during CNS development and injury. Proc Pacific Symp Biocomput 1999, 41–52.Google Scholar
- Bar-Joseph Z, Gerber G, Gifford DK: Continuous representations of time series gene expression data. J Comput Biol 2003, 10(3–4):241–256.Google Scholar
- Lomb NR: Least-squares frequency analysis of unequally spaced data. Astrophys Space Sci 1976, 39: 447–462.View ArticleGoogle Scholar
- Scargle JD: Statistical aspects of spectral analysis of unevenly spaced data. Astrophys J 1982, 263: 835–853.View ArticleGoogle Scholar
- Laguna P, Moody G, Mark R: Power spectral density of unevenly sampled data by least-square analysis: performance and application to heart rate signals. 1998, 45: 698–715.Google Scholar
- Dongen HV, Olofsen E, Hartevelt JV, Kruyt E: A procedure of multiple period searching in unequally spaced time-series with the Lomb-Scargle method. Biol Rhythm Res 1999, 30: 149–177.View ArticlePubMedGoogle Scholar
- Ruf T: The Lomb-Scargle periodogram in biological rhythm research: analysis of incomplete and unequally spaced time-series. Biol Rhythm Res 1999, 30: 178–201.View ArticleGoogle Scholar
- Glynn EF, Chen J, Mushegian AR: Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms. Bioinformatics 2006, 22(3):310–316.View ArticlePubMedGoogle Scholar
- Huber PJ: Robust Statistics. New York: Wiley; 1981.View ArticleGoogle Scholar
- Schimmel M: Emphasizing difficulties in the detection of rhythms with Lomb-Scargle periodograms. Biol Rhythm Res 2001, 32(3):341–345.View ArticlePubMedGoogle Scholar
- Ahdesmäki M, Lähdesmäki H, Gracey A, Shmulevich I, Yli-Harja O: Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data. BMC Bioinformatics 2007, 8: 233.PubMed CentralView ArticlePubMedGoogle Scholar
- Li T: Laplace periodogram for time series analysis. American Statistical Association 2008, 103(482):757–768.View ArticleGoogle Scholar
- de Lichtenberg U, Jensen LJ, Fausbøll A, Jensen T, Bork P, Brunak S: Comparison of computational methods for the identification of cell cycle-regulated genes. Bioinformatics 2005, 21(7):1164–1171.View ArticlePubMedGoogle Scholar
- 32. Gauthier NP, Larsen ME, Wernersson R, et al.: Cyclebase.org – a comprehensive multi-organism online database of cell-cycle experiments. Nucleic Acids Res 2008, (36 Database):D854-D859.Google Scholar
- Covington MF, Harmer SL: The circadian clock regulates auxin signaling and responses in Arabidopsis . PLoS Biology 2007, 5(8):1773–1784.View ArticleGoogle Scholar
- Simon I, Barnett J, Hannett N, Harbison C, Rinaldi N, Volkert T, Wyrick J, Zeitlinger J, Gifford D, Jaakkola T, Young R: Serial regulation of transcriptional regulators in the yeast cell cycle. Cell 2001, 106: 697–708.View ArticlePubMedGoogle Scholar
- Mewes HW, Frishman D, Güldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Münsterkötter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2002, 30(1):31–34.PubMed CentralView ArticlePubMedGoogle Scholar
- Ptitsyn AA, Zvonic S, Gimble JM: Permutation test for periodicity in short time series data. BMC Bioinformatics 2006, 7(Suppl 2):S10. doi:10.1186/1471–2105–7-S2-S10 doi:10.1186/1471-2105-7-S2-S10PubMed CentralView ArticlePubMedGoogle Scholar
- Brockwell PJ, Davis RA: Time Series: Theory and Methods. 3rd edition. New York: Springer; 1991.View ArticleGoogle Scholar
- Rosen JB, Mangasarian OL, Ritter K, (Eds): Nonlinear programming. New York: Academic Press; 1970.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.