A Platform for Processing Expression of Short Time Series (PESTS)
© Sinha and Markatou; licensee BioMed Central Ltd. 2011
Received: 17 June 2010
Accepted: 11 January 2011
Published: 11 January 2011
Time course microarray profiles examine the expression of genes over a time domain. They are necessary in order to determine the complete set of genes that are dynamically expressed under given conditions, and to determine the interaction between these genes. Because of cost and resource issues, most time series datasets contain fewer than 9 points, and few tools are geared towards the analysis of this type of data.
To this end, we introduce a platform for Processing Expression of Short Time Series (PESTS). It was designed with a focus on usability and interpretability of analyses for the researcher. As such, it implements several standard techniques for comparability as well as visualization functions. However, it is designed specifically for the unique methods we have developed for significance analysis, multiple test correction and clustering of short time series data. The central tenet of these methods is the use of biologically relevant features for analysis. Features summarize short gene expression profiles, inherently incorporate dependence across time, and allow for both full description of the examined curve and missing data points.
PESTS is fully generalizable to other types of time series analyses. PESTS implements novel methods as well as several standard techniques for comparability and visualization functions. These features and functionality make PESTS a valuable resource for a researcher's toolkit. PESTS is available to download for free to academic and non-profit users at http://www.mailman.columbia.edu/academic-departments/biostatistics/research-service/software-development.
A frequent goal of high-throughput biological studies in general, and microarray studies in particular, is the identification of genes that show differential expression between phenotypes (e.g. cancer vs. no cancer). Microarray experiments are used in a wide variety of studies to understand the mechanisms governing variation in complex traits, for example, in studies of treatment effects on diseases. Using microarray technology, mRNA expression data can be gathered on whole genomes or tens of thousands of unique DNA sequences at a time. These data provide a snapshot of gene activity in a particular sample at a particular time. This snapshot, or cross-sectional point of view, has dominated microarray research, and much has been published on the identification of differentially expressed genes. Taking a snapshot of the expression profile following a new condition can reveal some of the genes that are specifically expressed under the new condition. However, in order to determine the complete set of genes that are expressed under these conditions, and to determine the interaction between these genes, it is necessary to measure a time course of expression experiments. Time-dependent, or temporal, microarray profiles look at the expression of genes over a time domain, with the goal of taking a closer view at gene expression profiles to understand their characteristics. They provide an additional layer of information and an important characterization of gene function, as biological systems are predominantly developmental and dynamic.
Typical characteristics of microarray time course data are: 1) sparsity, in terms of both the number of replicates per sample and the number of time points per replicate, and 2) irregularly spaced time points. Although there have been temporal microarray studies with as many as 80 time points, almost all are much shorter. In fact, Ernst et al. (2005) found that more than 80% of all time series datasets they surveyed contained fewer than 9 points. The primary reason why short time-series datasets are so common is expense, a limiting factor for most researchers. Additionally, it can be difficult to obtain large quantities of biological material. These factors can similarly limit the number of replicates tested and drive the use of irregularly spaced time points as well.
The purpose of this paper is to introduce the Processing Expression of Short Time Series (PESTS) platform, designed for the complete analysis of short time series gene expression datasets. PESTS provides a set of methods targeted to the analysis of sparse and irregularly-spaced time course microarray expression data making minimal assumptions about the underlying process that generated the data. It is designed specifically for the unique methods we have developed for significance analysis, multiple test correction and clustering of short time series data. Although PESTS was specifically designed for short microarray time series analyses, it is generalizable to other, longer time series analyses. Together with its implementation of several standard techniques and its visualization capabilities, users may find PESTS to be a useful tool for time series data analysis with or without PESTS-specific algorithms.
Much of the work on significance analysis of time series expression experiments uses methods originally developed for static or uncorrelated data [6–8]. While biologically relevant results may be found, these methods ignore the trend or sequential nature of time courses, and so do not allow us to leverage the attributes of time course data. More recently, several algorithms have been developed [3, 9, 10] which use model-based techniques to determine significant genes, accounting for time-dependence, but these are generally more appropriate for longer time series. Non-parametric approaches have also been devised, including those in [11, 12]. Clustering techniques for time course data have gone through a similar evolution, from static techniques [6, 13] to a host of model-based techniques [4, 14, 15] to non-parametric methods targeting short time series [5, 16], with analogous advantages and pitfalls.
The fundamental principle behind the time series methods developed for PESTS is to appropriately use expression profiles and dependence across time points to determine salient genes and gain biological insight about them while accounting for sparsity in the data. Instead of using model-based techniques, which do account for time dependence but generally tend to be inappropriate in cases of sparsity, PESTS methods summarize profiles using an innovative set of features. Features summarize short gene expression profiles, inherently incorporate dependence across time, and allow for both full description of the examined curve and missing data points. They are based on the structural characteristics of the time course data and reflect a clear link with subject-matter considerations, capturing the "global picture" of an admittedly short time horizon of expression. In the case of short time series, features are used as a dimension augmentation technique. By contrast, the approach can be extended to longer time series through features that provide dimension reduction, such as autocorrelation functions, skewness and kurtosis, in addition to the descriptive features presented here. These biologically relevant features, or curve summarization measures, are then used for significance analysis or clustering. We provide brief summaries of these methods in the context of the interface description next; further information can be found in [17, 18].
In this paper, we discuss details of the PESTS platform and give brief overviews of the relevant methodologies and their evaluation. First, we give implementation details and briefly discuss data requirements for using the platform. Then we give an overview of the interface, as well as the implemented visualization tools. Lastly, we compare the platform to other available resources for both significance analysis of time course data and clustering.
The focus of this work is on time series data that is both sparse and irregularly spaced. Thus, the methods presented are implicitly tailored to these data characteristics. Here, we note our other guiding principles. First, the interface is designed for both paired and unpaired data. For significance analysis, the data must have more than one treatment, allowing for comparison. While paired data has the same number of replicates per treatment by definition, unpaired data is not required to. Furthermore, any given replicate can have measurements taken at different time points. In other words, for a given analysis, there are i = 1, ..., I treatments and r_i = 1, ..., R_i replicates for each treatment. Additionally, each replicate in each treatment has its own set of time points. In the case of paired data, for any two treatments i ≠ j, R_i = R_j, but this may or may not be the case for unpaired data. In either case, the time points of measurement need not be the same across treatments and replicates.
PESTS is implemented entirely in Java (http://www.java.com) and will work with any operating system supporting Java 6 or later. The advantages of using Java for this platform are that it is flexible and freely distributed, it provides comprehensive graphical interface capabilities, implementations are platform independent, and the use of an interface does not require the user to have expertise in any programming language, statistical or otherwise. Further, Java is well-suited to memory management tasks, which are critical in data-intensive analyses such as microarray analyses. Because of the large open-source community, many implementations of methods found in standard statistical packages were available to us for development. However, we do note some limitations in this area, so some methods were implemented from scratch, most notably the clustering algorithms. Several third party libraries were used to support the application. The Java Statistical Classes (JSC) package (http://www.jsc.nildram.co.uk/) was used for some of the standard statistical computations. Foxtrot (http://foxtrot.sourceforge.net/) was used for thread management. JFreeChart (http://www.jfree.org/jfreechart/) provided implementations for plot rendering. The JExcel API (http://jexcelapi.sourceforge.net/) was used to generate Excel spreadsheets for saving results. Lastly, EaSynth (http://www.easynth.com/) was used for the look and feel of the application.
Figure 5 shows the interface for the significance analysis method. The user first selects the treatments to compare and then chooses a feature, or data summarization measure, of the gene expression profile to use for comparison. The replicates as denoted in the covariate file are then used for comparison. As in standard statistical comparisons, the more replicates there are, the more power the analysis has, and we suggest using no fewer than 3 replicates for comparison. On the other hand, we note here that the clustering piece of the software can be executed with or without replicates. The available choices of feature are the signed Area Under the Curve (AUC), the slope between 2 time points, and a particular time point. The signed AUC is a good choice when the biological question has to do with the overall change in expression over a chosen time frame. The slope over a time period can be used to compare the rate of change, and the time point is a good choice when a particular time is known to show maximal change. The time point field allows the user to input the time points to use to calculate the feature, thereby specifying a period of interest. Time points should be entered as discussed previously. The null hypothesis of no differential expression is tested using standard statistical tests. Both parametric and non-parametric tests are listed, as well as tests for paired data if applicable. The parametric tests included are the t-test and the paired t-test, which can be used when the distribution of the selected feature is approximately normal or assumed to be approximately normal. The non-parametric tests include the Wilcoxon test, the Mann-Whitney U test, the permutation t-test and the permutation paired t-test, which do not make assumptions about the distribution of the feature and are thus less powerful. All of these tests assume independent samples. The user can select the test from the drop down box. The user can also select options for outlier removal.
We provide three methods for outlier removal. The first two approximate the variance of the selected feature and use the difference between the mean and median to find outliers. The last, Dixon's Extreme Value Test, is specific to cases where the sample size is less than 25 and can be used to find outliers in both tails.
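To make the signed AUC and slope features described above concrete, the following is a minimal sketch rather than the PESTS implementation. It assumes that expression values are measured relative to a zero baseline (for example, log ratios), that the signed AUC is computed with the trapezoidal rule so that regions below zero contribute negative area, and that the user-selected period is given by two measured time points. The function names `restrict_to_window`, `signed_auc` and `slope` are hypothetical.

```python
from typing import List, Sequence, Tuple


def restrict_to_window(times: Sequence[float], values: Sequence[float],
                       start: float, end: float) -> Tuple[List[float], List[float]]:
    """Keep only the measurements whose time lies in [start, end]."""
    kept = [(t, v) for t, v in zip(times, values) if start <= t <= end]
    return [t for t, _ in kept], [v for _, v in kept]


def signed_auc(times: Sequence[float], values: Sequence[float]) -> float:
    """Trapezoidal area under the profile (assumption: zero baseline).

    Segments below zero contribute negative area, so the result is signed.
    Irregular spacing is handled because each segment uses its own width.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) pairs")
    area = 0.0
    for k in range(1, len(times)):
        width = times[k] - times[k - 1]
        area += 0.5 * (values[k] + values[k - 1]) * width
    return area


def slope(times: Sequence[float], values: Sequence[float],
          start: float, end: float) -> float:
    """Rate of change between two measured time points."""
    i, j = list(times).index(start), list(times).index(end)
    return (values[j] - values[i]) / (times[j] - times[i])


# Hypothetical profile measured at irregularly spaced time points.
times = [0.0, 1.0, 4.0, 8.0]
values = [0.0, 1.0, -0.5, 0.5]
full_auc = signed_auc(times, values)
early_times, early_values = restrict_to_window(times, values, 0.0, 4.0)
early_auc = signed_auc(early_times, early_values)
mid_slope = slope(times, values, 1.0, 4.0)
```

Computing one such feature per replicate yields one number per replicate and treatment, which is the kind of input that a two-sample test such as the t-test or Mann-Whitney U test expects.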
Figure 8 shows the screen used to view the multiple test correction. The left panel indicates the calculated m0 (the estimated number of true null hypotheses) and the corresponding estimates for sensitivity, specificity and false discovery rate at various levels of significance. These can be used to determine an appropriate threshold of significance for a particular dataset and an estimated m0. The right panel is a graphing panel that can show the ROC plot, the p-value plot or the CDF plot.
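As an illustration of how an estimate of m0 can be turned into the quantities shown in the left panel, the sketch below uses standard plug-in estimates. These are assumptions made for the example and are not necessarily the formulas PESTS uses: at significance level alpha, the expected number of false positives is taken to be m0 × alpha, and the remaining rejections are counted as true positives. The function name `threshold_summary` and the example p-values are hypothetical.

```python
from typing import Dict, Sequence


def threshold_summary(p_values: Sequence[float], m0: float,
                      alpha: float) -> Dict[str, float]:
    """Plug-in estimates of FDR, sensitivity and specificity at level alpha.

    Assumptions (illustrative only): null p-values are uniform, so about
    m0 * alpha true nulls fall below alpha; all other rejections are
    counted as true positives.
    """
    m = len(p_values)
    rejected = sum(1 for p in p_values if p <= alpha)
    expected_fp = min(m0 * alpha, rejected)
    expected_tp = rejected - expected_fp
    m1 = m - m0  # estimated number of truly differentially expressed genes
    fdr = expected_fp / rejected if rejected else 0.0
    sensitivity = min(expected_tp / m1, 1.0) if m1 > 0 else float("nan")
    specificity = (m0 - expected_fp) / m0 if m0 > 0 else float("nan")
    return {"rejected": rejected, "fdr": fdr,
            "sensitivity": sensitivity, "specificity": specificity}


# Hypothetical p-values for ten genes, with six assumed to be true nulls.
p_values = [0.001, 0.002, 0.01, 0.03, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
summary = threshold_summary(p_values, m0=6, alpha=0.05)
```

Evaluating the same function over a grid of alpha values gives a table from which a threshold can be chosen by trading estimated sensitivity against the estimated false discovery rate.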
Finally, the user can perform cluster analysis. The clustering screen is shown in Figure 6. The left panel is used to input the gene probe ids to be clustered. The user also needs to select the treatment(s); if the data are paired, the user can cluster the difference between two treatments. The top right panel lets the user select the feature(s) to be used for clustering. As with the significance analysis, the data are clustered using features of the gene expression curve in order to account for sparsity and incorporate the dependence inherent in time course data. The current list of features is: the signed AUC, the slope, the raw expression, the maximum and minimum expressions, the time of the maximum and minimum expressions, and the steepest positive and negative slopes. Features are summarized using either the mean or median across replicates. In the sparse-data context, we use feature selection as a dimension augmentation technique to effectively and appropriately describe the curve and provide as complete a description of a time series as possible. The clustering features we use here are based on the structural characteristics of the time course data and are meant to reflect a clear link with subject-matter considerations and the questions under study. The user should select the feature(s) that are germane to their particular analysis. Again, the user identifies the time points to use for calculating the features. Lastly, the user selects the clustering algorithm (K-means or PAM), the distance metric (Euclidean or Manhattan) and the number of clusters. The question of the appropriate number of clusters can be addressed manually with our system. We suggest running the algorithm over a reasonable set of values of k and choosing as optimal the k whose clustering has the highest average silhouette.
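The suggestion to choose k by the highest average silhouette can be sketched as follows. This is a minimal illustration under stated assumptions, not the PESTS code: each gene is represented by a tuple of feature values, cluster labels for each candidate k are assumed to have been produced already (for example by K-means or PAM), and singleton clusters are given a silhouette of 0, a common convention. The names `average_silhouette` and `pick_k` and the example data are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def distance(a: Sequence[float], b: Sequence[float], metric: str) -> float:
    """Euclidean or Manhattan distance between two feature vectors."""
    if metric == "euclidean":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    raise ValueError("unknown metric: " + metric)


def average_silhouette(points: Sequence[Sequence[float]], labels: Sequence[int],
                       metric: str = "euclidean") -> float:
    """Mean silhouette width over all points for a given labelling."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(distance(points[i], points[j], metric)
                for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(distance(points[i], points[j], metric) for j in members)
                / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def pick_k(points: Sequence[Sequence[float]],
           labelings: Dict[int, Sequence[int]], metric: str = "euclidean") -> int:
    """Return the candidate k whose labelling has the highest average silhouette."""
    return max(labelings, key=lambda k: average_silhouette(points, labelings[k], metric))


# Hypothetical two-feature summaries for four genes and two candidate clusterings.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labelings = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k = pick_k(points, labelings)
```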
Results and Discussion
Few software platforms are available for short time-series data analysis, and PESTS is the only one we are aware of that performs both significance analysis and clustering.
For identifying differentially expressed genes, the available options are Significance Analysis of Microarrays (SAM), Extraction of Differential Gene Expression (EDGE), and maSigPro, which is incorporated into the Serial Expression Analysis (SEA) platform, a web-based tool for analysis. EDGE is an R-based platform which models time course data using splines and then uses model fit information to determine significance. It also uses a method for m0 estimation to improve FDR calculations. Given that this method requires model-fitting, it may be more suitable for longer time series or data sets with many replicates, which allow for accurate estimation of model parameters. Similarly, maSigPro is a two-step regression approach targeted at determining differences in time course expression over multiple treatments of the data. The reliance on model fitting with a specific functional form for the time element and a two-step regression strategy suggests limitations, similar to those met in other model-based approaches, when applied to short time series. Additionally, maSigPro does not perform m0 estimation. SAM is an R-based Excel plug-in. It is similar to PESTS in that its time series method uses features such as the signed AUC or slope across time points, and it uses the SAM test for significance. SAM also performs m0 estimation for multiple test correction. However, using PESTS, other standard tests of significance can be applied using information about the data distribution. Furthermore, the PESTS interface allows more flexibility and usability in time point selection: with any of the other platforms, a user would need to modify the input files in order to look at different periods of time. Both EDGE and SAM use asymptotic m0 estimation methods, which are useful but may not be optimal in certain datasets. Additionally, PESTS provides information about the sensitivity and specificity to aid the user in selecting a reasonable threshold for significance. It also provides methods for outlier detection and removal. Genes with outliers are removed from testing, increasing the reliability of results.
For clustering, there are several more options. Order Restricted Inference for Ordered Gene Expression data (ORIOGEN) uses user-defined candidate temporal profiles based on mean expression measurements at each time point and then assigns genes to the best-fitting pre-defined profile. This approach uses bootstrapping to assess significance for each gene, and thus requires more than a handful of (independent) replicates. Also, it uses pre-defined models which may or may not fully describe the information in the data. Cluster Analysis of Gene Expression Dynamics (CAGED) and Graphical Query Language (GQL) are also useful tools for clustering, but are better suited to longer time series. CAGED provides both an autoregressive approach and a spline linear model based approach, and GQL uses hidden Markov models to cluster the data. In the short time series framework, available platforms include the Short Time-series Expression Miner (STEM) and Analysis of Short Time-series using Rank Order preservation (ASTRO). STEM uses pre-defined profiles to cluster data based on a transformation of the gene profiles to units of change. The user inputs parameters which determine the number of units of change and the number of profiles to consider. Then, clusters are assigned significance levels using a permutation test based method, so not all genes are assigned to significant clusters. ASTRO groups together genes by first constructing a rank matrix for the time series of each gene and then grouping together genes with the same rank profile. Both methods are designed specifically for short time series, and are computational in nature. As such, they transform raw expression data to a sequence of symbols which are then used for clustering. In contrast, PESTS allows the user to select features that are biologically germane to the researcher's interests and sufficiently summarize curve information. It allows flexibility in the number and types of features selected as well as the clustering method. Finally, it provides cluster evaluation metrics which can be used to determine the clustering quality and, by extension, the most appropriate number of clusters to use.
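To illustrate what is meant above by transforming raw expression data into a sequence of symbols, the following is a deliberately simplified sketch of rank-profile grouping in the spirit of the description of ASTRO. It is not the ASTRO implementation: the tie-breaking rule (ties keep their time order) is an assumption, and the function names and gene identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rank_profile(values: Sequence[float]) -> Tuple[int, ...]:
    """Replace each expression value by its rank within the gene's profile.

    Rank 1 is the smallest value. Ties keep their time order because
    Python's sort is stable (an assumption made for this illustration).
    """
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return tuple(ranks)


def group_by_rank_profile(
        profiles: Dict[str, Sequence[float]]) -> Dict[Tuple[int, ...], List[str]]:
    """Group genes whose time series share the same rank profile."""
    groups: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for gene, values in profiles.items():
        groups[rank_profile(values)].append(gene)
    return dict(groups)


# Hypothetical three-point profiles: g1 and g2 differ greatly in magnitude
# but share the same ordering of time points, so they fall into one group.
profiles = {"g1": [0.1, 0.5, 0.3], "g2": [1.0, 9.0, 2.0], "g3": [3.0, 2.0, 1.0]}
groups = group_by_rank_profile(profiles)
```

In this example the magnitudes of g1 and g2 are discarded and only the ordering survives, which is the information that feature-based summaries such as the signed AUC or the slope retain.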
In this paper, we have introduced PESTS, a software platform for the analysis of time course data. It is designed specifically for the unique methods we have developed for significance analysis, multiple test correction and clustering of short time series data. The central tenet of these methods is the use of biologically relevant features for analysis which summarize gene expression profiles and inherently incorporate the dependence across time. It is fully generalizable to other types of time series analyses. PESTS was designed with a focus on usability and interpretability of analyses for the researcher. As such, it also implements several standard techniques for comparability, as well as visualization functions. These features and functionality make PESTS a valuable resource for a researcher's toolkit.
Availability and requirements
Project name: PESTS (Processing Expression of Short Time Series)
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 6 or higher
License: non-commercial research use license
Any restrictions to use by non-academics: license needed for commercial use
List of abbreviations
ASTRO: Analysis of Short Time-series using Rank Order preservation
CAGED: Cluster Analysis of Gene Expression Dynamics
EDGE: Extraction of Differential Gene Expression
GQL: Graphical Query Language
ORIOGEN: Order Restricted Inference for Ordered Gene Expression data
PESTS: Processing Expression of Short Time Series
STEM: Short Time-series Expression Miner
SAM: Significance Analysis of Microarrays
Anshu Sinha was supported through NLM Informatics Research Training Program (T15 LM007079-18). Dr. Markatou would like to acknowledge OBE/CBER/FDA and NSF DMS-0504957 for salary support.
1. Filho JSS, Gilmour SG, Rosa GJM: Design of microarray experiments for genetical genomic studies. Genetics 2006, 174: 945–957. doi:10.1534/genetics.106.057281
2. Ribeiro CM, Hurd H, Wu Y, Martino MEB, Jones L, Brighton B, Boucher RC, O'Neal WK: Azithromycin treatment alters gene expression in inflammatory, lipid metabolism, and cell cycle pathways in well-differentiated human airway epithelia. PLoS ONE 2009, 4(6):e5806.
3. Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW: Significance analysis of time course microarray experiments. PNAS 2005, 102(36):12837–12842. doi:10.1073/pnas.0504609102
4. Bar-Joseph Z, Gerber GK, Gifford DK, Jaakkola TS, Simon I: Continuous Representations of Time-Series Gene Expression Data. Journal of Computational Biology 2003, 10(3–4):341–356. doi:10.1089/10665270360688057
5. Ernst J, Nau GJ, Bar-Joseph Z: Clustering short time-series gene expression data. Bioinformatics 2005, 21(Suppl. 1):i159-i168. doi:10.1093/bioinformatics/bti1022
6. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. PNAS 1998, 95(25):14863–14868. doi:10.1073/pnas.95.25.14863
7. Park T, Yi SU, Lee S, Lee SY, Yoo D, Ahn J, Lee YS: Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics 2003, 19: 694–703. doi:10.1093/bioinformatics/btg068
8. Wang J, Kim S: Global analysis of dauer gene expression in Caenorhabditis elegans. Development 2003, 130: 1621–1634. doi:10.1242/dev.00363
9. Conesa A, Nueda MJ, Ferrer A, Talon M: maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments. Bioinformatics 2006, 22(9):1096–1102.
10. Hong F, Li H: Functional hierarchical models for identifying genes with different time-course expression profiles. Biometrics 2005, 62: 534–544. doi:10.1111/j.1541-0420.2005.00505.x
11. Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001, 98: 5116–5121. doi:10.1073/pnas.091062498
12. Camillo B, Toffolo G, Nair SK, Greenlund LJ, Cobelli C: Significance analysis of microarray transcript levels in time series experiments. BMC Bioinformatics 2007, 8(Suppl 1):S10. doi:10.1186/1471-2105-8-S1-S10
13. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999, 96(6):2907–2912. doi:10.1073/pnas.96.6.2907
14. Ramoni MF, Sebastiani P, Kohane IS: Cluster analysis of gene expression dynamics. Proc Natl Acad Sci USA 2002, 99: 9121–9126. doi:10.1073/pnas.132656399
15. Schliep A, Schonhuth A, Steinhoff C: Using hidden Markov models to analyze gene expression time course data. Bioinformatics 2003, 19: i264-i272. doi:10.1093/bioinformatics/btg1036
16. Tchagang AB, Bui KV, McGinnis T, Benos PV: Extracting biologically significant patterns from short time series gene expression data. BMC Bioinformatics 2009, 10: 255. doi:10.1186/1471-2105-10-255
17. Ghandhi SA, Sinha A, Markatou M, Amundson SA: Time-series clustering of gene expression in irradiated and bystander fibroblasts: an application of FBPA clustering. BMC Genomics 2011, 12(1):2.
18. Sinha A: Analyzing sparse and irregularly spaced time dependent gene expression data. Diss. Columbia University; 2010.
19. Cressie NAC: Statistics for Spatial Data. 2nd edition. Wiley, New York; 1993.
20. Dixon WJ, Massey FJ Jr: Introduction to Statistical Analysis. 4th edition. Edited by: Dixon WJ. McGraw-Hill Book Company, New York; 1983:P377-P548.
21. Bonferroni CE: Teoria statistica delle classi e calcolo delle probabilità. Volume 8. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze; 1936:3–62.
22. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser 1995, B57: 289–300.
23. Schweder T, Spjøtvoll E: Plots of p-values to evaluate many tests simultaneously. Biometrika 1982, 69: 493–502.
24. Rousseeuw PJ: Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 1987, 20: 53–65. doi:10.1016/0377-0427(87)90125-7
25. Leek J, Monsen E, Dabney A, Storey J: EDGE: extraction and analysis of differential gene expression. Bioinformatics 2006, 22: 507–508. doi:10.1093/bioinformatics/btk005
26. Conesa A, Nueda MJ, Ferrer A, Talon M: maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments. Bioinformatics 2006, 22(9):1096–1102.
27. Serial Expression Analysis [http://sea.bioinfo.cipf.es/]
28. Peddada S, Harris S, Zajd J, Harvey E: ORIOGEN: order restricted inference for ordered gene expression data. Bioinformatics 2005, 21: 3933–3934. doi:10.1093/bioinformatics/bti637
29. Ramoni M, Sebastiani P, Kohane I: Cluster analysis of gene expression dynamics. PNAS 2002, 99(14):9121–9126. doi:10.1073/pnas.132656399
30. Costa IG, Schonhuth A, Schliep A: The Graphical Query Language: a tool for analysis of gene expression time-courses. Bioinformatics 2005, 21(10):2544–2545. doi:10.1093/bioinformatics/bti311
31. Ernst J, Bar-Joseph Z: STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics 2006, 7: 191. doi:10.1186/1471-2105-7-191
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.