Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data
© Ahdesmäki et al. 2007
Received: 06 October 2006
Accepted: 02 July 2007
Published: 02 July 2007
Skip to main content
© Ahdesmäki et al. 2007
Received: 06 October 2006
Accepted: 02 July 2007
Published: 02 July 2007
In practice many biological time series measurements, including gene microarrays, are conducted at time points that seem to be interesting in the biologist's opinion and not necessarily at fixed time intervals. In many circumstances we are interested in finding targets that are expressed periodically. To tackle the problems of uneven sampling and unknown type of noise in periodicity detection, we propose to use robust regression.
The aim of this paper is to develop a general framework for robust periodicity detection and review and rank different approaches by means of simulations. We also show the results for some real measurement data.
The simulation results clearly show that when the sampling of time series gets more and more uneven, the methods that assume even sampling become unusable. We find that M-estimation provides a good compromise between robustness and computational efficiency.
Since uneven sampling occurs often in biological measurements, the robust methods developed in this paper are expected to have many uses. The regression based formulation of the periodicity detection problem easily adapts to non-uniform sampling. Using robust regression helps to reject inconsistently behaving data points.
The implementations are currently available for Matlab and will be made available for the users of R as well. More information can be found in the web-supplement .
The detection of periodically behaving gene expression time series has been an area of enormous interest lately. Since more and more microarray  data is becoming available, including time series, the periodicity detection methods from other branches of science are being modified for use in gene expression studies. Periodicity detection methods can be broadly divided into generic and more specific detection rules. The generic approaches use the available statistical theory to seek strong periodic components at all the available frequencies [3–7] and use exact tests to yield significance values with multiple correction. The more specific methods try to find periodic phenomena at specific frequencies, e.g. the assumed cell cycle frequency (see e.g. [8–15])
Some of the most severe problems of processing gene expression time series data include short time series length, the presence of noise of unknown distribution, outliers (i.e. points that are clearly inconsistent with most of the other points in the data), non-uniform sampling used in performing the experiments and other non-linearities involved in the measurement technologies themselves. Outliers can be thought of as low-probability values from a mixture model where with a high probability the noise in the signal is modelled by a (Gaussian) distribution and with a low probability by another distribution whose variance is much higher than that of the first one. In earlier work we presented a robust modification  of Fisher's g-test  for finding hidden periodicities in time series data. The method performs well both under the Gaussian noise assumption and when outliers and other non-linearities are present. However, non-uniform sampling, other than the one resulting from missing values, was not considered and the aim of this paper is to evaluate different robust methods for periodicity detection that can handle non-uniform sampling. Non-uniform sampling in periodicity detection has been previously considered in . The authors use a so-called Lomb-Scargle periodogram to find the spectral estimate for a time series, not limited by non-uniform sampling, and then test whether the maximum value of the spectral estimate is significantly higher than the other values. While the method is mathematically sound and is based on an exact test, it is non-robust, as is the basic Fisher's test. The same issue applies to most of the other previously published methods. Exceptions to this are in  and in , where the authors use Bayesian detection and show that the method can handle data that is corrupted with uniform and Laplacian noise as well (besides Gaussian).
The matter of choosing the sampling scenario in a cost-effective way is discussed in  where the authors present an active learning based online algorithm for choosing the sampling strategy.
In  the authors have developed a periodicity detection method in which they fit orthogonal periodic polynomials to non-uniformly sampled data. If the periodicity of interest is not sinusoidal (e.g. narrow pulse signals), the method improves on the performance of the Lomb-Scargle periodogram, but reduces to it in the case of sinusoidal model, which is the case of interest for us. An approach based on similar ideas is presented in  where the authors use least squares fitting of wavelets which is especially suitable if we want to search for periodicity in non-uniformly sampled data with non-sinusoidal cyclic components.
In  the authors use the Lomb-Scargle periodogram for periodicity detection and show that it performs better than the combination of interpolation to uniform sampling and ordinary periodogram. They point out that there is a low-pass effect involved in interpolation that is a major problem. In  the authors use a complicated approach of neural networks for periodicity detection in non-uniformly sampled time series but use interpolation to uniform sampling first, which, according to , causes problems in the high frequency end of the spectra. In  the authors actually make use of non-uniform sampling in digital alias-free signal processing applications. However, their approach is based on the idea of being able to choose the sampling intervals, which is not the usual case in biological studies. A model similar to the one in this paper is presented in , where the authors aim to estimate a wide spectral range of frequencies of a non uniformly sampled signal. Their approach is, however, aimed more at real-time applications and longer signals than those usually present in microarray studies. Some of these methods can be thought to be improvements over the standard periodogram but non-robust when it comes to heavy tailed distributions and/or mixture models where the presence of (low probability) outliers can cause large residuals in the estimators and thus bias the results. The Bayesian approach  is a clear distinction to the aforementioned approaches and presents an opportunity to make use of prior knowledge, such as the frequency of the oscillation. It is shown in  that the Bayesian detector performs better than methods that assume a strict frequency of periodicity  in case the frequency is not exactly known a priori. Other Bayesian periodicity detectors are presented in [23–26].
In this paper we follow the general direction of Fisher's g-test together with multiple testing correction for the detection of periodic time series in multiple time series data. Several modifications are needed to take into account non-uniform sampling and unknown noise characteristics. We use several different robust regression based methods [27–30] to find the spectral estimate of a time series instead of using the basic non-robust periodogram. By using regression we can readily take non-uniform sampling into account.
After finding the spectral estimate we propose to replace the periodogram in the g-test with the robust spectral estimate. Since no analytical results for such modifications exist, we resort to permutation tests in finding the p-values. We also note that the test can be modified to yield a test for one specific chosen frequency if an a priori hypothesis is made about the frequency of the periodicity of interest.
To compare the performance of the different regression methods and some of the previously introduced novel methods [5, 7, 15] in this framework, we use simulations and show the receiver operating characteristic (ROC) figures under several noise and signal configurations and non-uniform sampling. The computational complexity of the different methods is also briefly considered.
As an application we choose one of the best performing methods and apply it to microarray data measured from the mussel Mytilus californianus. Our experiments indicate that there is no statistically significant connection between the circadian rhythm and cell cycle regulated genes that could be expected to show periodic expression.
The performance of the different introduced estimators when using the g-test framework is now verified by simulations and through the use of receiver operating characteristic curves (ROC, see e.g. ). The following simulations are first based on the assumption that we do not know the frequency of the underlying periodic signal. In addition, we iteratively subtract the fitted sinusoidals in the process of calculating the Fourier coefficients when using the robust regression based methods (as suggested in ). We then simulate time series where we have approximate a priori knowledge on the frequency of periodicity and compare the performance of the regression approach to the rank based method presented in  and the Bayesian detector presented in .
The ROC curve is a plot of sensitivity on the y-axis, versus 1-specificity on the x-axis. TP stands for true positive, FN for false negative and so on. For a perfect test the sensitivity is 1 for a specificity of 1. For any non-ideal tests the ROC curve shows the sensitivity-specificity tradeoff. The line segment from (0,0) to (1,1) is called the chance diagonal since the discrimination ability of a test is near to random if the curve is near to the diagonal.
We use the model of Equation (1) in the generation of 300 periodic time series corresponding to the alternative hypothesis and 300 nonperiodic time series corresponding to the null hypothesis. We consider the same two cases of non-uniform sampling described above. For each time series we calculate the g-statistic and order the values. We repetitively accept one more time series as periodic and since the ground truth is known, we can now tell the number of true and false positives and true and false negatives to construct the ROC curves for each of the methods. Since we assume the null distribution to be the same for all the time series we do not use permutation tests here. We chose to consider three types of noise for both of the non-uniform samplings, namely where the additive noise is Gaussian (std. 0.5,1.0), Gaussian (std. 0.75) with outliers at random locations (1,2 or 3 outliers, amplitude uniformly randomly chosen from ± (5 . . . 6)) or Laplacian (std. 1.0). The underlying periodic signal in the alternative hypothesis set always has an amplitude of (the amplitude of the sum of zero phase cosine plus sine) as in  and the frequency of periodicity is uniformly random chosen in the interval [0.05,0.45]2π.
In case we have prior knowledge on the frequency of the periodicity of interest, it is desirable to take this into account when trying to rank the time series according to periodicity. For the purpose of performance comparison we compare here the simulated ROC curves of three methods under the two presented sampling scenarios when the frequency is approximately known. We choose for the simulations the M-estimator as a representative of the regression estimators, the Bayesian approach method presented in  and the rank based method not designed for not-uniform sampling . The strength of the Bayesian method  is that it is training-free and it does not assume a single frequency for the periodicity but rather a prior distribution whose mean and scale can be tuned according to prior knowledge.
We generate four sets of 300 time series according to the null hypothesis with Gaussian noise (std. 1) and 0, 1, 2 or 3 (0 in the time series of the first set, 1 in the second and so forth) randomly placed outliers of amplitude ± (5 . . . 6) per series. For the alternative hypothesis sets (4) we generate sinusoidals of random phase, amplitude and frequency uniformly distributed in the range [0.09,0.11]2π in addition to the additive noise similar to the null hypothesis set. We then deliberately choose the frequency that is given to the periodicity detectors incorrectly as 2π 0.125 (as in ) and observe the ROC curves for the three methods. The standard deviation for the Bayesian method was set to 0.1 as in .
As we can see, there is no single best approach to periodicity detection and one must weigh the options. If we want to reject outliers and have non-uniformly sampled data, the M-estimation based regression estimator is a safe choice. If we have a reason to trust in the Gaussian noise model (or other well behaving symmetric distribution) and we have weak prior knowledge on the frequency of the cycle period, the Bayesian approach  should be the best alternative. In the case of uniform sampling and no prior knowledge on the frequency, the rank based estimator  is also a considerable option, among others.
Computational complexity. The approximate time in seconds needed for the different spectrum estimation methods to evaluate one spectral estimate. Random signals were used in the evaluation. FFT is the acronym for fast fourier transform, LTS for least trimmed squares and MCD for minimum ovariance determinant. The method Robustperiodic f fcorresponds to the method in 
We analyse here the microarray time series data set obtained from the mussel species Mytilus californianus, by measuring the gene expression over several days with non-uniform sampling. It is expected that the gene expression of some of the genes of the population would correlate with the tidal cycle or day and temperature cycle. We therefore seek the genes that are periodically expressed in the 24-hour cycle and show the annotations and time series profiles of the best ranking signals. In practice we use M-estimator based regression to fit sine and cosine signals of 24-hour cycle to the time series. To estimate the null hypothesis distribution (i.e. no periodicity present) for the estimated coefficients we use 300 random permutations of each time series.
To determine how many of the best ranked genes we should accept as periodic we use the false discovery rate at a level of 0.1. This corresponds to 33 of the best time series out of 7679. This is quite a conservative decision boundary since the visual inspection of the first few hundred time series clearly shows similar patterns as in Figure 6.
The best detected genes were mostly expressed at small amplitude. To determine if there was a biological theme to the best detected genes, a Gene Set Enrichment Analysis (GSEA)  was performed using the list of genes ranked by the significance of periodic signal against 245 gene sets which were defined by their shared participation in a specific biological process in the Gene Ontology database. These gene sets covered biological processes which are known to be under circadian regulation in mammals such as the cell cycle and metabolic pathways. This analysis revealed that none of these gene sets were statistically significantly enriched towards the top of the ranked list of periodic genes. One explanation for this finding is that biological prior knowledge may be incomplete and does not capture the sets of genes that are under circadian regulation in mussels, or that the circadian cycle is attenuated in the tidal environment and does not really contribute much to the cyclic expression of the genes.
Replicate spots of each gene (that appear on the array) were removed prior to enrichment analysis to prevent false hits.
The usual motivation for using methods based on least squares and a Gaussian model assumption, is that the theoretical background is thoroughly understood and computations are easier than in the case of robust methods. However, in reality, outlying data points are present in biological high throughput measurements such as microarrays. We have shown here that even a small number of outliers cause problems in periodicity detection if least squares methods are used. Taking into account the amount of data produced by microarray experiments, the manual replacing of these outliers is not possible and even if it was, the question of how to replace the outliers raises even more questions.
This leaves us with two reasonable approaches: we could either use a robust filter to clean the data first and then use methods based on least squares or use robust methods in the first place. If we however take a closer look at the filter cleaner proposed in , the method first evaluates a robust spectral estimate with M-estimator regression and then takes the inverse discrete Fourier transform to yield a cleaned time series. If we then evaluate the periodogram of the time series we end up with the same estimated power spectrum as we would have gotten by just using the M-estimator. This shows that in this case the two mentioned approaches are actually the same.
In addition to robustness against outliers we have shown in this paper that if sampling is not performed uniformly, it is very important to choose analysis methods that can adapt to not-uniform sampling. The regression based periodicity detection method implemented with the Tukey's biweight M-estimator was shown to have excellent performance in simulation studies and provided visually good results with real measurement data as well. Since the method can be readily implemented and is relatively fast to use, we propose it to be used as a first choice in periodicity detection with non-uniformly sampled data.
Future work includes the use of wavelets in periodicity detection. This has been previously considered in the literature in the context of spectrum estimation (e.g. ) and gene expression periodicity analysis  but it would be interesting to try to robustify the wavelet transform to reject outliers. Besides straightforward periodicity detection with wavelets, the robustify interpolation of the non-uniformly sampled time series to uniform sampling would be one reasonable approach as well. The ever growing need for data integration is also an urgent matter that must be given much attention in the future. Combining computational predictions from different data sources to increase statistical power is of utmost importance.
The assumptions about the time series model and different methods for periodicity detection are first introduced. We then perform simulations and show the ROC curves for the methods when using the Fisher's g-statistic.
where β ≥ 0, ω ∈ (0, π), n = 0,..., N - 1(∈ Z), φ ∈ (-π, π], and ε n is an i.i.d. noise sequence (distribution unknown). To test for periodicity, define the null hypothesis as H 0: β = 0, i.e., the time series consists of the noise sequence alone, y n = ε n .
To take non-uniform sampling into account we loosen the definition of Equation (1) so that n is replaced by t n . Time index t n can be any real number to allow the evaluation of the sine signal at arbitrary time points. For the practical implementation, see Appendix A.
We first briefly review Fisher's test for the detection of periodic time series. We then generalise the test for different robust spectral estimators and finally we modify the test for the detection of a specific frequency.
where q = [(N - 1)/2] and we can analytically find the p-values for the statistic under the Gaussian assumption (see e.g.  or ). Other test statistics have been proposed for periodicity detection as well. For example, Chen  used a test called Hubert's C-test to complement the Fisher's test. The C-test tests whether a time series is a Gaussian white noise sequence. However, like Fisher's test, the C-test is based on the standard periodogram. Therefore, the C-test is not robust and even if the underlying sequence is Gaussian, a few outliers (e.g. measurement errors) can lead to incorrect results.
We should also verify that the above design matrix is invertible (full column rank), which should be the case unless the measurement times t n (or frequencies) are "pathological". To elaborate, for any frequency ω it is possible to choose t n so that, e.g., the cos term in one column of A 1 is always equal to 1, making that column linearly dependent with the first column vector (vector of ones). It should, however, be unlikely that this happens in practice. Furthermore, given time instants t n , we can define the frequencies (perhaps slightly differently if needed) so that the columns are always linerly independent.
yields a power spectral estimate B(ω) that coincides (see  p. 395) with the periodogram if the time series sampling is uniform. This explicitly shows that the periodogram can be obtained directly from a least squares estimate and is hence sensitive to different anomalies in data. On the other hand, the above regression based formulation opens up the possibility of developing alternative robust periodicity detection methods (and also robust spectrum estimation, see ) that can naturally handle non-uniform sampling. Since the dimensionality of the model matrix in Equation (5) is high and robust regression estimators do not necessarily yield optimum solutions easily for high dimensional models, we propose to find the parameters one frequency at a time. Therefore, for the purposes of robust estimation, we consider the following reduced modely = X(ω)b + e, (11)
where ω is defined similarly as in Equation (1) and t n is the real-valued index.
To estimate b at the different frequencies we propose to use robust regression methods. The different robust regression estimators we chose for evaluation are reviewed in subsection The different estimators. It is suggested in  that when calculating the coefficients one frequency at a time, the fitted sinusoidal of the latest iteration should be subtracted from the signal y (i.e. leaving the residual). This residual is then used in place of y when estimating the parameter b of the next frequency, thus avoiding the problems caused by the loss of orthogonality. In addition, the order in which the frequencies should be sought is based on an initial spectral estimate, which is obtained by regression of the coefficients without any subtraction. The strategy is then to process the frequencies according to the magnitude of the coefficients, in descending order. To expand, in the case of uniform sampling, sinusoidals at the Fourier frequencies constitute an orthonormal basis. Thus in the least squares case, it is irrelevant in which order we fit the sinusoidals, since they are independent and the resulting spectral estimate will be the same . If we have non-uniform sampling or we use estimators other than least squares we can no longer use this property.
The dependence between the fitted sinusoidals and the problem of overfitting can be corrected, at least to a reasonably good degree , by removing the fitted signal and fitting to the residuals. We should also mention that the non-orthogonolity would not be a problem if all the sinusoidals were fitted simultaneously, but this is not a reasonable approach for the robust estimators (due to the problems of high dimensionality and non-convergence.
We can now use the robust power spectral estimate B(ω) in Equation (3) without the requirement of uniform time series sampling. However, if the time series sampling is not uniform, the notions of Nyquist frequency and harmonic Fourier frequencies are no longer clear. We can still approximate the sampling frequency as if the sampling was uniform, as explained in Appendix A, and thus approximate the Fourier frequencies as well. For more discussion on the subject, see . p-values can be estimated using simulated distributions or permutation tests. The use of simulated distributions is now more awkward than in the case of the rank-based method of . It was noted in  that the rank-based method is "distribution-free", which implies that it is sufficient to simulate the null hypothesis distribution for only one noise type (e.g., Gaussian) under the assumption that the noise is i.i.d. The proposed regression based methods are not distribution-free and we would have to generate the null hypothesis distribution for all the different imaginable noise types separately. Of course, independent of the method, the null hypothesis distribution must be generated separately for different time series lengths. Therefore, based on the preceding discussion we propose to use permutation tests with multiple correction method of Benjamini and Hochberg  in the same way as in  to find the significance values.
where g m is the modified g-statistic, ω c is the frequency of interest and and correspond to the coefficients of the fitted sine and cosine terms (Equation 11) at this frequency. The first problem arises with p-value computation. If we use ordinary least squares (OLS) for the regression then we can just neglect the denominator in Equation (3) since, as will be explained later, we will be using permutation tests for p-value computation and the denominator is the same for all the permutations of a single time series; thus, there is no need for the scaling. However when using the robust regression based methods there is no guarantee that the sum of the terms is a constant for different permutations. On the other hand, in many cases the effect of the omission of the scaling factor in the denominator has little effect on periodicity detection. This is also verified with good results on real data in subsection The methods in practice. Moreover, this way we do not need to compute the whole spectral estimate but just at the frequency of interest, which can now be other than one of the harmonic Fourier frequencies as well. This is a huge benefit for the use of the regression methods since the robust ones are typically computationally time-consuming. To sum up, we propose to use the following procedure for finding periodicities at a known frequency:
1. For a time series, fit the model of Equation (11) at a chosen frequency ω c
2. Evaluate Equation (13).
3. Randomly permute the original time series and for each permutation repeat steps 1 and 2.
4. To estimate the null hypothesis distribution of the modified g-statistic, compose a histogram of the population of g-statistics generated in step 3.
5. Use the histogram as a distributional estimate to get a p-value for the original test statistic computed in step 2.
6. Repeat steps 1–5 for all the time series to get the necessary p-values.
7. use multiple correction for the obtained p-values.
Although the idea of robustly fitting sinusoidals of one frequency to data is not new. we have here plausibly incorporated it for periodicity detection by modifying Fisher's test. We would like to note that exact tests are also available for some robust tests (see e.g. ). However, since these tests require some knowledge of the underlying distributions and because they are not applicable to all robust methods considered here, we propose to use the general purpose non-parametric permutation tests.
In this subsection we present the different estimators that we will use in periodicity detection. We will mainly concentrate on robust estimators but will also present for comparison purposes the non-robust Lornb-Scargle periodogram, a modification of the ordinary periodogram taking non-uniform sampling into account.
where the influence function ψ (Tukey's biweight function in this case) is the derivative of ρ. The idea of the biweight M-estimator is to give zero weighting to those data points whose residuals are large if compared to the estimated scale. Therefore, if the scale is estimated robustly as well, we expect the M-estimators to give us good performance on data with different distributional characteristics. For more information on M-estimators, see e.g. [27, 41]. In our computations we use the implementation in Matlab's Statistics Toolbox that uses iteratively reweighted least squares estimation.
The least trimmed squares regression  (LTS) is a robust regression method that was developed to have similar robustness properties as the least median of squares (LMS) regression but that would perform better under the Gaussian assumption and would be less computationally expensive.
where [N/2] + 1 ≤ h ≤ N. In practice not all subsets can be considered and only some hundreds or maybe thousands of randomly chosen subsets are processed to yield the output.
One important feature of the LTS (and the MCD, which will be introduced shortly) regression is that it can tolerate outliers in the predictor variables as well, whereas the M-estimators cannot. However, this is not critical in spectrum estimation since the measurement time points usually contain no stochasticity. The implementation (FAST-LTS) is introduced in .
The minimum covariance determinant (MCD) regression method  is a well performing robust regression method that can also handle cases where both X and y are multivariate. This method has similarities to the LTS regression in that it also considers subsets instead of complete sets in the estimation. The subset that yields the smallest covariance determinant is chosen for the estimation. For multivariate regression the model is presented asy = β T X + α + ε, (17)
where y is q-variate, X is p-variate, β is the (p × q] slope matrix, α is the q-dimensisnal intercept vector and ε is i.i.d with zero mean and with a positive definite covariance matrix.
Since the standard LS estimates of β and α can be written as functions of the estimated empirical mean vector (location) and the covariance matrix (scatter), a logical way of robustifying the regression is to replace the location vector and the scatter matrix with robust alternatives. In  the authors use the minimum covariance determinant estimator to yield the necessary robust estimators for the location and covariance matrix, thus leading to robust regression. It is shown that this is a positive-breakdown and bounded-influence method, which, if properly reweighted and iterated, has a high efficiency. The implementation of the algorithm (FAST-MCD) is introduced in .
The Lomb-Scargle periodogram (see e.g. ) can be seen as a least squares fit of unequally sampled sinusoids to the measured signal and thus an extension of the periodogram. Exact tests for periodicity detection have been developed but problems include the well known outlier sensitivity of the OLS .
It frequently occurs that besides non-uniform sampling there are some measurement points in some genes that are missing although they are present in other genes, thus reducing the quality of the data. Since we do not use an analytical null hypothesis distribution for all the time series but rather simulate the distribution for each time series separately, these missing points are handled so that we fit the sinusoidals only to the time points that are present. The presented methods thus have no further need for missing value imputation (a topic which has received considerable attention lately, see e.g. ).
To take non-uniform sampling into account, we loosen the definition of the variable n in Equation (1) so that it does not need to be integer-valued. We perform a scaling of the measurement time points in the following way: Denote the time points when the actual measurements have been made as a vectorτ= [τ 0, ..., τ N - 1] T . (18)
to correspond to the new indices, i.e. we normalise the last time point to N - 1. Note that for any uniformly sampled time series Eq. (19) yields an integer valued vector t.
To find the corresponding real time frequency for an angular frequency ω ∈ (0, π) we consider two interpretations of the ω. The first one is in Equation (2) and the second one isω = 2πf/F s , (20)
where the quotient is a constant and independent of the index n (as long as the index exists).
This work was supported in part by the Academy of Finland, project No. 213462 (Finnish Centre of Excellence program (2006 – 2011)), and R01 GM072855 grant from NIH. The support of Tampere Graduate School in Information Science and Engineering (TISE) is also gratefully acknowledged. We would like to thank Dr. Sabine Verboven and Sanne Engelen for useful tips on the use of the implementations of the LTS and MCD regression methods. We are indebted to Dr. Mats Gustafsson for providing us the implementation of the Bayesian periodicity detector and to Claes Andersson for the implementation details. Finally, we thank the anonymous reviewers for the valuable comments that greatly improved the quality of this paper.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.