Significance analysis of microarray transcript levels in time series experiments
© Di Camillo et al; licensee BioMed Central Ltd. 2007
Published: 8 March 2007
Microarray time series studies are essential to understand the dynamics of molecular events. In order to limit the analysis to those genes that change expression over time, a first necessary step is to select differentially expressed transcripts. A variety of methods have been proposed to this purpose; however, these methods are seldom applicable in practice since they require a large number of replicates, often available only for a limited number of samples. In this data-poor context, we evaluate the performance of three selection methods, using synthetic data, over a range of experimental conditions. Application to real data is also discussed.
Three methods are considered, to assess differentially expressed genes in data-poor conditions. Method 1 uses a threshold on individual samples based on a model of the experimental error. Method 2 calculates the area of the region bounded by the time series expression profiles, and considers the gene differentially expressed if the area exceeds a threshold based on a model of the experimental error. These two methods are compared to Method 3, recently proposed in the literature, which exploits splines fit to compare time series profiles. Application of the three methods to synthetic data indicates that Method 2 outperforms the other two both in Precision and Recall when short time series are analyzed, while Method 3 outperforms the other two for long time series.
These results help to address the choice of the algorithm to be used in data-poor time series expression study, depending on the length of the time series.
A crucial issue in genomic studies is the elucidation of how genes change expression and interact as a consequence of external/internal stimuli such as an illness, drug administration, hormone stimuli, etc. Microarray technology makes it possible to monitor simultaneously a large number of gene transcripts through a series of different experimental conditions. In particular, microarray time series studies are essential to understand the dynamics of biological events at the molecular level.
A first necessary step in order to limit the analysis to those genes that change expression over time is to select differentially expressed transcripts. Selection methods proposed in the literature usually deal with the comparison of static (e.g. no treatment vs treatment) rather than dynamic conditions, and are based on statistical tests [1, 2]. These methods test the significance of the differential expression gene by gene. At least two replicates for each of the conditions to be tested are necessary, but a higher number is required to have reliable results. In time series experiments, in which gene expression is monitored over time, it is necessary to test differential expression at different sampling times. ANOVA or ANOVA based procedures  have been proposed to this purpose. However, since in time series experiments replicates are often available only for a limited number of samples, ANOVA tests are seldom applicable. For this reason, differentially expressed genes in time series experiments are often selected using an empirical constant fold change threshold . This is far from ideal, since it is based on an arbitrary choice (e.g. FC = 3), which does not take into account the characteristics of the measurement error.
When the number of the replicates is not sufficient to apply traditional statistical tests, alternative methods need to be applied. Two methods based on a fit of the time series were recently proposed in the literature [5, 6]. These methods fit the time series expression profiles using respectively polynomials and splines. Comparison between time series is based respectively on model parameters and goodness of fit. Both methods are really general and do not require any replicates; however, it is not clear the role of the number of available samples on their performance.
Here we propose Methods 1 and 2 able to select differentially expressed gene profiles in data-poor conditions, based on a model of the experimental error. Their performance is investigated in comparison to method  (Method 3 in the following), based on splines fit, using synthetic time series of different length. Finally, a case study on insulin treated muscle cells is presented to better appreciate the implementation aspects of Methods 1 and 2.
Let's call xT(tk) and xC(tk) the log-expression measurements in treated (T) and control (C) cultures, available for a generic gene X at time sample tk (k = 1, ..., M, with M number of time samples). Log expression measurement are used, as in , because the signal is considered proportional to the log of the measurements, the error is considered log-additive, and the large range of expression intensities makes the log-expression practical.
The rationale adopted to label a gene X as differentially expressed in condition T vs C is described in details for methods 1 and 2 and is briefly reviewed for Method 3, since we refer to  for further details.
The deviation of expression of gene X in T and C is calculated for each sample tk as:
d(tk) = xT(tk) - xC(tk) (1)
The gene is considered differentially expressed in T vs C if |d(tk)| exceeds a threshold θd in at least one sampling time tk (k = 1, ..., M):
|d(tk)| > θd (2)
where θd is determined in correspondence to a significance level α, based on the null hypothesis distribution of d(tk). This distribution is modeled from d(tk) values calculated by Equation 1, with xT(tk) and xC(tk) measured on experimental replicates (see below), which provide a situation where genes are not differentially expressed, so that the null hypothesis is verified. Therefore, replicates of at least one time sample are necessary, to apply Method 1.
Each contribution Ak is calculated from the deviation of expression in T and C (Equation 1), as:
The gene X is considered differentially expressed in T vs C if the following inequality holds:
A > θA (5)
where θA is a threshold to be determined, in correspondence to a significance level α, based on the null hypothesis distribution of A, i.e. the distribution of A derived from experimental replicates (see below for its calculation).
For each gene X, the expression profiles C and T are fitted using natural cubic splines. The null hypothesis is that both C and T time series share the same fit, the alternative hypothesis is that the fits are not equal. As a first step of the method, the same spline function is fitted simultaneously to the two profiles C and T and the sum of squares residuals SS0 is calculated; then two different spline functions are fitted separately to time series C and T and the sum of squares residuals SS1 is computed. To assess differentially expressed genes, the goodness of model fit under the null hypothesis is compared to that under the alternative hypothesis, by calculating a statistic F as:
Gene X is considered differentially expressed if the following inequality holds:
F > θF (7)
where θF is a threshold to be determined, in correspondence to a significance level α, based on the null hypothesis distribution of F.
Null hypothesis distribution model
Both Methods 1 and 2 are based on a threshold derived from the null hypothesis distribution of d(tk) and A, respectively, obtained using available replicates. We denote as "available replicates" replicates available for a subset of time samples (therefore we assume replicates available for at least one time sample).
Let's suppose two experimental replicates are available for a generic time tk and a generic experimental condition (T or C). By assuming a log-additive error model as in , the log-expression measurement of each gene X in replicates a and b, can be expressed as:
where μ represents the actual gene expression (unknown) and εa, εb are two realizations of the error variable ε, monitoring both technical and biological variability. The indices for condition and time tk are omitted here because we refer to pair of replicates available for a generic time sample tk.
Null hypothesis distribution of variable d
To quantify the deviation of expression d(tk) under the null hypothesis, Equation 1 is applied to available replicates as:
dH0 = xa - xb = εa - εb (9)
Different distribution models (t-Student distribution, bi-exponential distribution, and mixture models of N Gaussians, N = 1, ..., 6) are used to fit the set of dH0 values obtained by applying Equation 9 to all genes and available replicates. The best model is selected based on the goodness of fit and the parameters precision, and is used as the null hypothesis distribution of d(tk) to determine θd to be used in Equation 2.
To determine the threshold in correspondence to a significance level α, Method 1 uses a model to fit the observed statistics rather than using quantiles. The reason for this choice is that the lack of a sufficient number of observations from available replicates renders the determination of appropriate thresholds difficult when low significance levels are chosen, as often the case in microarray studies. If a sufficient number of replicates is available to guarantee a good threshold setting at the desired significance level α, it may be preferable to use quantiles.
Null hypothesis distribution of variable A
At least two replicates for each time sample would be necessary to derive A distribution under the null hypothesis from the data. Since we address selection of differentially expressed genes in data-poor condition, i.e. a sufficient number of replicates is not available, a Monte Carlo procedure is used to derive the null distribution of A. First, dH0 distribution is derived from d(tk) values obtained from available replicates as described above. Then, B profiles of length M are sampled from dH0 (here we used B = 104) under the hypothesis that the error at different time samples is independent and identically distributed. Subsequently, B values of AH0 are calculated from these profiles. Finally, different distribution models (Gamma, Log-normal, Weibull) are used to fit the entire set of AH0 values and the best model is chosen based on goodness of fit and parameter precision. This model is used as the null hypothesis distribution of A to determine θA to be used in Equation 5.
As for Method 1, a model is fitted to the observed statistics rather than using quantiles to determine the threshold in correspondence to a significance level α.
Null hypothesis distribution of variable F
The null distribution of F is obtained using bootstrap. See  for details.
Intensity dependency of error
In Affymetrix chips it is well known that dH0 has an intensity dependent distribution . In particular, analysis of technical replicates of Affymetrix Human chip has shown that the standardized variable sH0 (obtained dividing dH0 by its standard deviation):
has an intensity independent distribution . Therefore, in case of data showing intensity dependency of the variable dH0, it is convenient to model sH0 distribution, as indicated in the Additional File 1, to derive the threshold θd to be used in Equation 2. Consistently, the values of d(tk) observed from the data are standardized before applying Equation 2, if Method 1 is used:
Analogously, if Method 2 is used on data showing intensity dependency of the variable dH0, AH0 and θA are derived using sH0 and the values of A observed from the data (Equation 4) are calculated using s(tk) instead of d(tk).
Once the null hypothesis distribution of d, A and F are obtained, thresholds θd, θA and θF are determined in correspondence to a significance level α. Rather than fixing it a priori, α can be optimized based on a variety of criteria aiming to control the family wise error rate (FWER) , or the false discovery rate (FDR) [10, 11] or a compromise between false positive and false negative classification . As an example, let's focus on a criterion based on the control of FDR, defined as the expected proportion of false positive classification (FP) among the number S α of genes selected as differentially expressed, using significance level α:
In case of numerous sets as for microarrays, FDR is well approximated by
Calculating FDR requires the estimate of the expected number of false positives, obtained as the product of α by the number N0 of non differentially expressed genes:
E [FP] = N0 · α (14)
N0 is unknown and is estimated using the bootstrap procedure described in .
FDR is calculated for a range of significance levels α and the significance level that guarantees the desired FDR is then used to select differentially expressed genes.
Since Method 1 applies M tests (corresponding to individual time points) for each gene, the significance level α for Method 1 is corrected by applying Šidák correction  in order to account for multiple testing.
Three different experimental conditions with a number of time samples M = 10, 30, 50 were simulated. 100 synthetic data sets were generated for each experimental condition, each consisting of 2000 profiles: 300 simulated as differentially expressed and the remaining as random noise. In both cases, the deviation of expression in T vs C was generated at each sampling time tk as standardized deviation:
s(tk) ~ N(μk, σ2) ∀k = 1, ..., M (15)
where σ2 was set equal to 1 and μk = 0 (k = 1, ..., M) for not differentially expressed genes; while, for differentially expressed genes, plausible profiles were obtained by modeling μk as dependent on μk-1 according to a first order Markov model (see Additional File 1 for details), with the only constraint of being greater than 1 (or lower than -1) for at least one time samples.
Samples k = 1 were generated twice for each gene, so as to provide replicates useful to apply Method 1 and 2. These replicates were included also in the analysis for method 3. Simulated data were used to test the performance of the methods in different experimental conditions. After the null hypothesis distributions of variables d(tk), A and F are modeled as described above, a significance level α had to be fixed to determine the confidence thresholds θd, θA and θF to be used in Equations 2, 5 and 7 respectively. We compared the performance of the three methods across the entire range of significance level α by using Precision (true positives divided by the number of selected genes) vs Recall (true positives divided by the number of differentially expressed genes), and curves of observed false positives divided by the number of selected genes (observed FDR) vs number of selected genes. Moreover, we compared the average results obtained with the 3 methods by setting α in correspondence to a desired FDR of 0.05, in terms of number of selected genes and observed false positives divided by the number of selected genes. All measurements were averaged across the 100 simulations.
Insulin case study
To better appreciate some characteristics of Methods 1 and 2 related to the experimental error modeling, the analysis of the null hypothesis distribution of the variables d(tk) and A (Equation 1 and 4) was applied to a real case study on rat muscle cells treated with insulin. The study was performed in vitro, on muscle L6 rat cell line. Cells were treated with insulin at time 0+, just after the collection of a first baseline sample at time 0; eight samples were harvested every hour during eight hours insulin stimulation. A control experiment was also performed in order to be able to distinguish between insulin effect and biological processes of different nature, which take place in the culture simultaneously to insulin induced processes (Figure 1). A total of twenty Affymetrix chips RG_U34A (monitoring 8.799 transcripts) were hybridized using four replicates of the basal sample, eight samples collected from the control culture, and eight samples collected from the treated culture. Standard Affymetrix MAS 5 software  was used for data pre-processing.
Results on simulated data for threshold setting based on desired FDR = 0.05.
# selected genes
# FP/# sel.genes
# selected genes
# FP/# sel.genes
# selected genes
# FP/# sel.genes
M = 10
M = 30
M = 50
Insulin case study
sH0 distribution parameters. Parameters of the sum of two Gaussian mixture model for the distribution of sH0.
AH0 distribution parameters. Parameters of the Gamma distribution of AH0.
Number of selected genes on real data for different threshold settings.
# selected genes
# selected genes
# selected genes
In this work we evaluated the performance of two selection methods here proposed to be applied in time series studies in data-poor conditions, i.e. when the number of available replicates does not make possible or practical the use of standard statistical methods. We also tested the two methods in comparison with a third method from the literature.
Method 1 compares samples time by time using a statistically based fold change threshold derived from a null hypothesis distribution of variable d(tk). To this purpose, replicates of at least one time sample are necessary. Since the threshold is derived based on the experimental variability, Method 1 accounts for the error characteristics, e.g. its intensity dependence; therefore, for example using Affymetrix chips, genes expressed at high intensities are not penalized with respect to genes at low intensities, which show a higher variability (Figure 4). Although Method 1 improves upon the use of a constant empirical fold change threshold, it considers time samples independently to each other, which is not a realistic assumption in time series studies. Method 2 calculates the area of the region bounded by the time series expression profiles to be compared and considers the gene differentially expressed if this area exceeds a threshold based on a model of the experimental error; therefore, besides accounting for the error distribution, it considers the entire expression profile and not single time samples. Also this method needs replicates for at least one time sample in order to derive the null hypothesis distribution. Both Methods 1 and 2 assumes as working hypothesis that the error at different time points is independent and identically distributed. This hypothesis, on our experience, is usually verified on real data (data not shown). However, for some experimental settings there may be a dependency of the error on time. In this latter case, it would be more appropriate to perform replicates at different time samples, covering the duration of the experiment, and use Method 1 with time dependent threshold settings. Methods 1 and 2 can be applied to compare time series from 2 different experimental conditions or a time series vs its baseline (e.g. time 0), by defining the deviation of expression of gene × for each sample tk (Equation 1) as the deviation between treated at time tk and the baseline. Moreover, if the sampling grid is different for the two time series, Method 2 can be easily generalized by generating AH0 distribution from B time series with appropriate sampling grid.
Method 3  uses natural cubic splines to fit time series expression data using a null hypothesis and an alternative hypothesis model, and performs a statistical test on the sum of squares residuals obtained using the two models to assess differential expression. It therefore, besides considering the entire expression profiles, implicitly considers time dependencies. Moreover, it does not need any experimental replicate, which makes it practical for a wide range of time series microarray experiments.
We tested the performance of the three methods using synthetic data to assess their validity over a range of experimental conditions, specifically the length of the analyzed time series. In the simulation, we used a Markov model to generate plausible data, accounting for the dependencies of time samples and not biased toward one of the methods. Taking for example μ k equal to an arbitrary constant for a random subset of time samples tk (k = 1, ..., M) in Equation 12, would not have accounted for time dependencies, and would have generated non realistic oscillation in the profiles, thus penalizing Method 3 which is based on a model fit, with respect to the other methods, in particular Method 1, which applies a threshold on each time sample. Moreover, simulated profiles (Figure 2) represent a variety of possible situations in time series expression, such as profiles characterized by one or few peaks, waves of different length or consistent trends along the time series.
Results on simulated data showed that Method 2 outperforms Method 1 independently from the length of the time series being analyzed, probably because, as Method 1 is based on single sample comparisons, it is particularly sensitive to random fluctuations due to the noise, thus resulting in a larger number of false positives. Method 2 constitutes an improvement with respect to Method 1 since the entire expression profile is considered simultaneously and this allows better distinguishing between consistent differences in expression profiles and random oscillations, thus resulting in a lower false negative rate. Method 3, as Method 2, considers the entire expression profile, but performs better than Method 2, only for long time series (M = 50).
Looking at the ability of the methods to classify profiles with particular characteristics, we observed that Method 1 works better than Method 2 to detect differentially expressed genes that show just one or two peaks as differentially expressed. On the opposite, Method 2 works better than Method 1 in detecting profiles characterized by waves of length greater than three samples, so as profiles that show a characteristic increasing/decreasing trend. Method 3, as Method 2, is better in detecting bumps and consistent trends in the profiles, than in detecting isolated peaks. However, it needs long time series (M = 50) to perform better than Method 2, probably because the fit are more reliable when performed on long time series; Method 3 was in fact proposed by the authors on real case studies with more than 40 samples. These results are certainly of interest to address the choice of the algorithm to be used in data-poor time series expression studies, depending on the availability of replicates and on the length of the time series.
Microarray time series studies are essential to understand the dynamics of biological molecular events. In order to limit the analysis to those genes that change expression over time, it is necessary to select differentially expressed transcripts. Due to the high cost of microarrays, experiments are often performed without replication; therefore, traditional statistical methods can't be applied. Here we evaluate the performance of two selection methods applicable in data poor conditions, based on: a statistically based threshold on individual samples; a statistically based threshold to be applied on the area of the region bounded by the time series expression profiles to be compared.
Application on a real data set on insulin regulation on muscle cells, obtained using Affymetrix chips, revealed as the error analysis performed using Methods 1 and 2 may be useful to detect error characteristics such as intensity dependencies and to properly address these feature by standardization.
We evaluated Methods 1 and 2 performance using simulated data with a different number of available samples and compared these performance with those obtained using Method 3, based on a splines fit of time series profiles. The results outlines that the two error based Methods 1 and 2 work better than Method 3 with short time series experiments, while Method 3 works better than Methods 1 and 2 with long time series experiments. These results might help to optimize the choice of the algorithm to be used in different experimental conditions.
A preliminary version of Method 2, implemented in R, is available at .
This study was supported by Ministero dell'Università e della Ricerca Scientificae Tecnologica (PRIN 2003 Italy), and by National Institute of Health, grant ROIDK41973.
This article has been published as part of BMC Bioinformatics Volume 8, Supplement 1, 2007: Italian Society of Bioinformatics (BITS): Annual Meeting 2006. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S1.
- Cui X, Churchill GA: Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003, 4: 210. 10.1186/gb-2003-4-4-210PubMed CentralView ArticlePubMedGoogle Scholar
- Tusher GT, Tibshirani R, Chu G: Significane Analysis of Microarrays Applied to the Ionizing Radiation Response. PNAS 2001, 98: 5116–5121. 10.1073/pnas.091062498PubMed CentralView ArticlePubMedGoogle Scholar
- Park T, Yi SG, Lee S, Lee SY, Yoo DH, Ahn JI, Lee YS: Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics 2003, 19: 694–703. 10.1093/bioinformatics/btg068View ArticlePubMedGoogle Scholar
- Gentile M, Latonen L, Laiho M: Cell cycle arrest and apoptosis provoked by UV radiation-induced DNA damage are transcriptionally highly divergent responses. Nucleic Acids Res 2003, 31: 4779–4790. 10.1093/nar/gkg675PubMed CentralView ArticlePubMedGoogle Scholar
- Xu XL, Olson JM, Zhao LP: A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model. Human Molecular Genetics 2002, 11(17):1977–1985. 10.1093/hmg/11.17.1977View ArticlePubMedGoogle Scholar
- Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW: Significance analysis of time course microarray experiments. PNAS 2005, 102(36):12837–12842. 10.1073/pnas.0504609102PubMed CentralView ArticlePubMedGoogle Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 2003, 4: 249–264. 10.1093/biostatistics/4.2.249View ArticlePubMedGoogle Scholar
- Tu Y, Stolovitzky G, Klein U: Quantitative Noise Analysis for gene expression microarray experiment. PNAS 2002, 99: 14031–14036. 10.1073/pnas.222164199PubMed CentralView ArticlePubMedGoogle Scholar
- Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarry experiments. Technical Report Tech. Report # 110, U.C. Berkeley Division of Biostatistics, Working Paper Series 2002.Google Scholar
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to multiple testing. J R Statist SocB 1995, 57: 289–300.Google Scholar
- Storey JD, Tibshirani R: Statistical significance for genomewide studies. PNAS 2003, 100: 9440–9445. 10.1073/pnas.1530509100PubMed CentralView ArticlePubMedGoogle Scholar
- Di Camillo B, Sanchez-Cabo F, Toffolo G, Nair SK, Trajanoski Z, Cobelli C: A quantization method based on threshold optimization for microarray short time series. BMC Bioinformatics 2005, 6(Suppl 4):S11. 10.1186/1471-2105-6-S4-S11PubMed CentralView ArticlePubMedGoogle Scholar
- Storey JD: A direct approach to false discovery rates. J R Stat Soc 2002, 3: 479–498. 10.1111/1467-9868.00346View ArticleGoogle Scholar
- Šidák Z: Rectangular confidence regions for the means of multivariate normal distributions. J Amer Statist Assoc 1967, 62: 626–633. 10.2307/2283989Google Scholar
- Affymetrix, Santa Clara, CA. Statistical Algorithm Description Document Affymetrix – NetAffx Analysis Center 2002. [http://www.affymetrix.com/analysis/index.affx]
- Kohonen T: Self-Organizing Maps. Springer; 1995.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.