A Platform for Processing Expression of Short Time Series (PESTS)
 Anshu Sinha^{1} and
 Marianthi Markatou^{2}
DOI: 10.1186/1471-2105-12-13
© Sinha and Markatou; licensee BioMed Central Ltd. 2011
Received: 17 June 2010
Accepted: 11 January 2011
Published: 11 January 2011
Abstract
Background
Time course microarray profiles examine the expression of genes over a time domain. They are necessary in order to determine the complete set of genes that are dynamically expressed under given conditions, and to determine the interaction between these genes. Because of cost and resource issues, most time series datasets contain fewer than 9 points, and few of the available tools are geared towards the analysis of this type of data.
Results
To this end, we introduce a platform for Processing Expression of Short Time Series (PESTS). It was designed with a focus on usability and interpretability of analyses for the researcher. As such, it implements several standard techniques for comparability as well as visualization functions. However, it is designed specifically for the unique methods we have developed for significance analysis, multiple test correction and clustering of short time series data. The central tenet of these methods is the use of biologically relevant features for analysis. Features summarize short gene expression profiles, inherently incorporate dependence across time, and allow for both full description of the examined curve and missing data points.
Conclusions
PESTS is fully generalizable to other types of time series analyses. PESTS implements novel methods as well as several standard techniques for comparability and visualization functions. These features and functionality make PESTS a valuable resource for a researcher's toolkit. PESTS is available to download for free to academic and nonprofit users at http://www.mailman.columbia.edu/academicdepartments/biostatistics/researchservice/softwaredevelopment.
Background
A frequent goal of high-throughput biological studies in general, and of microarray studies in particular, is the identification of genes that show differential expression between phenotypes (e.g. cancer vs. no cancer). Microarray experiments are used in a wide variety of studies to understand the mechanisms governing variation in complex traits [1], for example, in studies of treatment effects on diseases [2]. Using microarray technology, mRNA expression data can be gathered on whole genomes or tens of thousands of unique DNA sequences at a time. These data provide a snapshot of gene activity in a particular sample at a particular time. This snapshot, or cross-sectional point of view, has dominated microarray research [3], and much has been published on the identification of differentially expressed genes. Taking a snapshot of the expression profile following a new condition can reveal some of the genes that are specifically expressed under the new condition. However, in order to determine the complete set of genes that are expressed under these conditions, and to determine the interaction between these genes, it is necessary to measure a time course of expression experiments [4]. Time-dependent, or temporal, microarray profiles look at the expression of genes over a time domain, with the goal of taking a closer look at gene expression profiles to understand their characteristics. They provide an additional layer of information and an important characterization of gene function, as biological systems are predominantly developmental and dynamic.
Typical characteristics of microarray time course data are: 1) sparsity, in terms of both the number of replicates per sample and the number of time points per replicate, and 2) irregularly spaced time points. Although there have been temporal microarray studies with as many as 80 time points, almost all are much shorter. In fact, Ernst et al. (2005) [5] found that more than 80% of all time series datasets they surveyed contained fewer than 9 points. The primary reason why short time-series datasets are so common is expense, a limiting factor for most researchers. Additionally, it can be difficult to obtain large quantities of biological material. These factors can similarly limit the number of replicates tested and drive the use of irregularly spaced time points.
The purpose of this paper is to introduce the Processing Expression of Short Time Series (PESTS) platform, designed for the complete analysis of short time series gene expression datasets. PESTS provides a set of methods targeted to the analysis of sparse and irregularly spaced time course microarray expression data, making minimal assumptions about the underlying process that generated the data. It is designed specifically for the unique methods we have developed for significance analysis, multiple test correction and clustering of short time series data. Although PESTS was specifically designed for short microarray time series analyses, it is generalizable to other, longer time series analyses. Given its implementation of several standard techniques and its visualization capabilities, users may find PESTS to be a useful tool for time series data analysis with or without PESTS-specific algorithms.
Much of the work on significance analysis of time series expression experiments uses methods originally developed for static or uncorrelated data [6–8]. While biologically relevant results may be found, these methods ignore the trend or sequential nature of time courses, and so do not allow us to leverage the attributes of time course data. More recently, several algorithms have been developed [3, 9, 10] which use model-based techniques to determine significant genes, accounting for time dependence, but these are generally more appropriate for longer time series. Nonparametric approaches have also been devised, including those in [11, 12]. Clustering techniques for time course data have gone through a similar evolution, from static techniques [6, 13] to a host of model-based techniques [4, 14, 15] to nonparametric methods targeting short time series [5, 16], with analogous advantages and pitfalls.
The fundamental principle behind the time series methods developed for PESTS is to appropriately use expression profiles and dependence across time points to determine salient genes and gain biological insight about them while accounting for sparsity in the data. Instead of using model-based techniques, which do account for time dependence but generally tend to be inappropriate in cases of sparsity, PESTS methods summarize profiles using an innovative set of features. Features summarize short gene expression profiles, inherently incorporate dependence across time, and allow for both full description of the examined curve and missing data points. They are based on the structural characteristics of the time course data and reflect a clear link with subject-matter considerations, capturing the "global picture" of an admittedly short time horizon of expression. In the case of short time series, features are used as a dimension augmentation technique. By contrast, the approach could be extended to longer time series through features that provide dimension reduction, such as autocorrelation functions, skewness and kurtosis, alongside the descriptive features presented here. These biologically relevant features, or curve summarization measures, are then used for significance analysis or clustering. We provide brief summaries of these methods in the context of the interface description next; further information can be found in [17, 18].
In this paper, we will discuss details of the PESTS platform as well as give brief overviews of the relevant methodologies used and evaluation. First, we give implementation details and briefly discuss data requirements for using the platform. Then we give an overview of the interface, as well as the implemented visualization tools. Lastly, we compare the platform to other available resources for both significance analysis of time course data and clustering.
Implementation
The focus of this work is on time series data that is both sparse and irregularly spaced. Thus, the methods presented are implicitly tailored to these data characteristics. Here, we note our other guiding principles. First, the interface is designed for both paired and unpaired data. For significance analysis, the data must have more than one treatment, allowing for comparison. While paired data has the same number of replicates per treatment by definition, unpaired data need not. Furthermore, any given replicate can have measurements taken at different time points. In other words, for a given analysis, there are $i = 1, \ldots, I$ treatments and $r_i = 1, \ldots, R_i$ replicates for each treatment. Additionally, there are $t_{r_i} = 1, \ldots, T_{r_i}$ time points for each replicate in each treatment. In the case of paired data, for any two treatments $i \ne j$, $R_i = R_j$, but this may or may not be the case for unpaired data. In either case, time points of measurement may not be the same, so for a treatment $i$ and replicates $r \ne s$, either $T_{r_i} = T_{s_i}$ or $T_{r_i} \ne T_{s_i}$.
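To make the notation concrete, the following is a minimal sketch of one way such a design could be held in memory. It is written in Python purely for illustration (PESTS itself is a Java application), and the names `design`, `replicate_counts`, `time_point_counts` and `can_be_paired` are hypothetical and not part of PESTS. The expression values are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical layout for one gene: treatment name -> list of replicates,
# where each replicate is a list of (time, expression) pairs.
Replicate = List[Tuple[float, float]]
Design = Dict[str, List[Replicate]]

design: Design = {
    "control": [
        [(0.0, 1.0), (0.5, 1.2), (2.0, 1.1)],              # T = 3 time points
        [(0.0, 0.9), (1.0, 1.3), (2.0, 1.0), (4.0, 1.2)],  # T = 4, different times
    ],
    "treated": [
        [(0.0, 1.1), (0.5, 2.0), (2.0, 3.1)],
        [(0.0, 1.0), (2.0, 2.8)],
    ],
}


def replicate_counts(d: Design) -> Dict[str, int]:
    """R_i, the number of replicates, for every treatment i."""
    return {name: len(reps) for name, reps in d.items()}


def time_point_counts(d: Design) -> Dict[str, List[int]]:
    """T_{r_i}, the number of time points, for every replicate r of treatment i."""
    return {name: [len(rep) for rep in reps] for name, reps in d.items()}


def can_be_paired(d: Design) -> bool:
    """A paired design needs R_i = R_j for all treatments i != j."""
    return len(set(replicate_counts(d).values())) == 1
```

Note that nothing in this layout forces replicates to share time points, which mirrors the statement that $T_{r_i}$ and $T_{s_i}$ may or may not be equal.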
PESTS is implemented entirely in Java (http://www.java.com) and will work with any operating system supporting Java 6 or later. The advantages of using Java for this platform are that it is flexible and freely distributed, it provides comprehensive graphical interface capabilities, its implementations are platform independent, and the use of a graphical interface means the user does not need expertise in any programming language, statistical or otherwise. Further, Java is well-suited to memory management tasks, which are critical in data-intensive analyses such as microarray analyses. Because of the large open-source community, many implementations of methods found in standard statistical packages were available to us for development. However, we do note some limitations in this area, so some methods were implemented from scratch, most notably the clustering algorithms. Several third-party libraries were used to support the application. The Java Statistical Classes (JSC) package (http://www.jsc.nildram.co.uk/) was used for some of the standard statistical computations. Foxtrot (http://foxtrot.sourceforge.net/) was used for thread management. JFreeChart (http://www.jfree.org/jfreechart/) provided implementations for plot rendering. The JExcel API (http://jexcelapi.sourceforge.net/) was used to generate Excel spreadsheets for saving results. Lastly, EaSynth (http://www.easynth.com/) was used for the look and feel of the application.
Figure 5 shows the interface for the significance analysis method. The user first selects the treatments to compare and then chooses a feature, or data summarization measure, of the gene expression profile to use for comparison. The replicates as denoted in the covariate file are then used for comparison. As in standard statistical comparisons, the more replicates there are, the more power the analysis has. We suggest using no fewer than 3 replicates for comparison. On the other hand, we note here that the clustering piece of the software can be executed with or without replicates. The available feature choices are the signed Area Under the Curve (AUC), the slope between 2 time points, and a particular time point. The signed AUC is a good choice when the biological question has to do with the overall change in expression over a chosen time frame. The slope over a time period can be used to compare the rate of change, and the time point is a good choice when a particular time is known to show maximal change. The time point field allows the user to input the time points to use to calculate the feature, thereby specifying a period of interest. Time points should be entered as discussed previously. The null hypothesis of no differential expression is tested using standard statistical tests. Both parametric and nonparametric tests are listed, as well as tests for paired data if applicable. The parametric tests included are the t-test and the paired t-test, which can be used when the distribution of the selected feature is approximately normal or assumed to be approximately normal. The nonparametric tests include the Wilcoxon test, the Mann-Whitney U test, the permutation t-test and the permutation paired t-test, which do not make assumptions about the distribution of the feature and can therefore be less powerful when the normality assumption does hold. All of these tests assume independent samples. The user can select the test from the drop-down box. The user can also select options for outlier removal.
We provide three methods for outlier removal. The first two [19] approximate the variance of the selected feature and use the difference between the mean and median to find outliers. The last, Dixon's Extreme Value Test [20], is specific to cases where the sample size is below 25 and can be used to find outliers in both tails.
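The feature-based comparison described above can be illustrated with a short sketch. This is not the PESTS source code (which is Java); it is a Python illustration under stated assumptions: the signed AUC is computed with the trapezoidal rule against a baseline of zero, the slope is the simple two-point slope, and the test shown is an exact permutation test on the difference in group means rather than the permutation t-test offered in the interface. All function names are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Sequence, Tuple

Profile = Sequence[Tuple[float, float]]  # (time, expression) pairs, sorted by time


def signed_auc(profile: Profile, baseline: float = 0.0) -> float:
    """Trapezoidal area between the profile and a baseline.

    Area below the baseline counts as negative, so irregular spacing
    between time points is handled by the (t1 - t0) width of each trapezoid.
    """
    area = 0.0
    for (t0, y0), (t1, y1) in zip(profile, profile[1:]):
        area += (t1 - t0) * ((y0 - baseline) + (y1 - baseline)) / 2.0
    return area


def slope(profile: Profile, t_start: float, t_end: float) -> float:
    """Rate of change between two measured time points."""
    values = dict(profile)
    return (values[t_end] - values[t_start]) / (t_end - t_start)


def permutation_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact permutation test on the difference in means.

    Assumption: the statistic is |mean(a) - mean(b)|, not a t statistic.
    Every split of the pooled feature values into groups of the original
    sizes is enumerated, which is feasible for the handful of replicates
    typical of short time series experiments.
    """
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total
```

In use, one feature value would be computed per replicate (for example `signed_auc` over the period of interest), and the per-treatment lists of feature values would then be passed to the test. With three replicates per treatment there are only 20 possible splits, so the smallest attainable two-sided p-value is 0.1, which is one practical reason for the recommendation to use at least 3 replicates.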
Figure 8 shows the screen to view the multiple test correction. The left panel indicates the calculated m_{0} and the corresponding estimates for sensitivity, specificity and false discovery rate for various levels of significance. These can be used to determine an appropriate threshold of significance for a particular dataset and an estimated m_{0}. The right panel is a graphing panel that can show the ROC plot, the p-value plot or the CDF plot.
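To show how an estimate of m_{0} can be turned into threshold-level summaries of this kind, the sketch below uses standard plug-in estimates. It assumes that m_{0} denotes the number of true null hypotheses (genes that are not differentially expressed) and that p-values of true nulls are uniformly distributed, so that about m_{0}·α nulls fall below a threshold α. These are textbook approximations written in Python for illustration. They are not claimed to be the estimators PESTS uses, and the function name is hypothetical.

```python
from typing import Dict, Sequence


def threshold_summary(p_values: Sequence[float], m0: float, alpha: float) -> Dict[str, float]:
    """Plug-in FDR, sensitivity and specificity at significance level alpha.

    Assumptions: m0 is an externally supplied estimate of the number of
    true nulls, and null p-values are uniform on [0, 1].
    """
    m = len(p_values)
    rejected = sum(1 for p in p_values if p <= alpha)
    # Expected number of true nulls rejected, capped by the rejections observed.
    expected_false_pos = min(m0 * alpha, rejected)
    expected_true_pos = rejected - expected_false_pos
    m1 = m - m0  # estimated number of truly differentially expressed genes
    fdr = expected_false_pos / rejected if rejected else 0.0
    sensitivity = min(1.0, expected_true_pos / m1) if m1 > 0 else float("nan")
    specificity = 1.0 - expected_false_pos / m0 if m0 > 0 else float("nan")
    return {
        "rejected": float(rejected),
        "fdr": fdr,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }
```

Evaluating such a function over a grid of α values gives a table of the kind a user could scan to pick a threshold that balances the false discovery rate against sensitivity.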
Finally, the user can perform cluster analysis. The clustering screen is shown in Figure 6. The left panel is used to input the gene probe ids to be clustered. The user also needs to select the treatment(s); if the data are paired, the user can cluster the difference between two treatments. The top right panel lets the user select the feature(s) to be used for clustering. As with the significance analysis, the data are clustered using features of the gene expression curve in order to account for sparsity and incorporate the dependence inherent in time course data. The current list of features is: the signed AUC, the slope, the raw expression, the maximum and minimum expressions, the time of the maximum and minimum expressions, and the steepest positive and negative slopes. Features are summarized using either the mean or median across replicates. In the sparse-data context, we use feature selection as a dimension augmentation technique to effectively and appropriately describe the curve and provide the most complete description of a time series possible. The clustering features we use here are based on the structural characteristics of the time course data and are meant to reflect a clear link with subject-matter considerations and the questions under study. The user should select the feature(s) that are germane to their particular analysis. Again, the user identifies the time points to use for calculating the features. Lastly, the user selects the clustering algorithm (K-means or PAM), the distance metric (Euclidean or Manhattan) and the number of clusters. The question of the appropriate number of clusters can be addressed manually with our system. We suggest running the algorithm over a reasonable set of values of k and choosing the optimal k as the one whose clustering has the highest average silhouette [24].
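The suggested selection of k can be sketched as follows. The code computes the average silhouette of Rousseeuw [24] for a given assignment of feature vectors to clusters, under either of the two distance metrics named above, and picks the k with the highest value. It is a Python illustration only: the cluster labels are assumed to come from some external run of K-means or PAM, and the function names are hypothetical rather than taken from PESTS.

```python
from typing import Callable, Dict, Sequence

Point = Sequence[float]
Distance = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a: Point, b: Point) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def average_silhouette(points: Sequence[Point], labels: Sequence[int],
                       distance: Distance = euclidean) -> float:
    """Mean of s(i) = (b - a) / max(a, b) over all points.

    a is the mean distance from point i to the rest of its own cluster and
    b is the smallest mean distance from i to any other cluster.
    """
    cluster_ids = set(labels)
    if len(cluster_ids) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    n = len(points)
    total = 0.0
    for i in range(n):
        # Mean distance from point i to each cluster, excluding i itself.
        mean_dist: Dict[int, float] = {}
        for c in cluster_ids:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(distance(points[i], points[j]) for j in members) / len(members)
        if labels[i] not in mean_dist:
            continue  # singleton cluster: s(i) is taken to be 0
        a = mean_dist[labels[i]]
        b = min(d for c, d in mean_dist.items() if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


def choose_k(points: Sequence[Point], labelings: Dict[int, Sequence[int]],
             distance: Distance = euclidean) -> int:
    """Return the k whose labeling has the highest average silhouette."""
    return max(labelings, key=lambda k: average_silhouette(points, labelings[k], distance))
```

Here each point would be the vector of selected features for one gene (for example signed AUC and steepest slope), summarized across replicates by the mean or median.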
Results and Discussion
There are few software platforms available for short time-series data analysis. PESTS is the only platform we are aware of that provides both significance analysis and clustering.
For identifying differentially expressed genes, the available options are Significance Analysis of Microarrays (SAM) [11], Extraction of Differential Gene Expression (EDGE) [25], and maSigPro [26], which is incorporated into the Serial Expression Analysis (SEA) [27] platform, a web-based tool for analysis. EDGE is an R-based platform which models time course data using splines and then uses model fit information to determine significance. It also uses a method for m_{0} estimation to improve FDR calculations. Given that this method requires model fitting, it may be more suitable for longer time series or data sets with many replicates, which allow for accurate estimation of model parameters. Similarly, maSigPro is a two-step regression approach targeted to determining differences in time course expression over multiple treatments of the data. The reliance on model fitting with a specific functional form for the time element and a two-step regression strategy suggests limitations, similar to those met in other model-based approaches, when applied to short time series. Additionally, maSigPro does not perform m_{0} estimation. SAM is an R-based Excel plug-in tool. It is similar to PESTS in that its time series method uses features such as the signed AUC or slope across time points, and it uses the SAM test for significance. SAM also performs m_{0} estimation for multiple test correction. However, using PESTS, other standard tests of significance can be applied using information about the data distribution. Furthermore, the PESTS interface allows more flexibility and usability in time point selection. A user would need to modify the input files in order to look at different periods of time with any of the other platforms. Both EDGE and SAM use asymptotic m_{0} estimation methods, which are useful but may not be optimal for certain datasets.
Additionally, PESTS provides information about the sensitivity and specificity to aid the user in selecting a reasonable threshold for significance. It also provides methods for outlier detection and removal. Genes with outliers are removed from testing, increasing the reliability of results.
For clustering, there are several more options. Order Restricted Inference for Ordered Gene Expression data (ORIOGEN) [28] uses user-defined candidate temporal profiles based on mean expression measurements at each time point and then assigns genes to the best-fitting predefined profile. This approach uses bootstrapping to assess significance for each gene, and thus requires more than a handful of (independent) replicates. Also, it uses predefined models which may or may not fully describe the information in the data. Cluster Analysis of Gene Expression Dynamics (CAGED) [29] and Graphical Query Language (GQL) [30] are also useful tools for clustering, but are better suited to longer time series [31]. CAGED provides both an autoregressive approach and a spline linear model based approach, and GQL uses hidden Markov models to cluster the data. In the short time series framework, available platforms include the Short Time-series Expression Miner (STEM) [31] and Analysis of Short Time-series using Rank Order preservation (ASTRO) [16]. STEM uses predefined profiles to cluster data based on a transformation of the gene profiles to units of change. The user inputs parameters which determine the number of units of change and the number of profiles to consider. Then, clusters are assigned significance levels using a method based on a permutation test, so not all genes are assigned to significant clusters. ASTRO groups together genes by first constructing a rank matrix for the time series of each gene and then grouping together genes with the same rank profile. Both methods are designed specifically for short time series, and are computational in nature. As such, they transform raw expression data to a sequence of symbols which are then used for clustering. In contrast, PESTS allows the user to select features that are biologically germane to the researcher's interests and sufficiently summarize curve information.
It allows flexibility in the number and types of features selected as well as the clustering method. Finally, it provides cluster evaluation metrics which can be used to determine the clustering quality and, by extension, the most appropriate number of clusters to use.
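To illustrate what is meant above by transforming raw expression data into a sequence of symbols, the following sketch groups genes by the rank order of their expression values across time points, in the spirit of the rank-profile grouping described for ASTRO. It is a simplified Python illustration based only on the description given here, not the ASTRO implementation. The function names are hypothetical, and ties are broken by time order, which is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rank_profile(values: Sequence[float]) -> Tuple[int, ...]:
    """Rank of each time point's expression within its own series (1 = lowest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return tuple(ranks)


def group_by_rank_profile(genes: Dict[str, Sequence[float]]) -> Dict[Tuple[int, ...], List[str]]:
    """Collect genes that share an identical rank profile."""
    groups: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for name, values in genes.items():
        groups[rank_profile(values)].append(name)
    return dict(groups)
```

Two genes with very different magnitudes of change fall into the same group whenever their ordering over time agrees. Magnitudes of this kind are exactly what feature-based summaries such as the signed AUC or the steepest slope retain.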
Conclusion
In this paper, we have introduced PESTS, a software platform for the analysis of time course data. It is designed specifically for the unique methods we have developed for significance analysis, multiple test correction and clustering of short time series data. The central tenet of these methods is the use of biologically relevant features for analysis which summarize gene expression profiles and inherently incorporate the dependence across time. It is fully generalizable to other types of time series analyses. PESTS was designed with a focus on usability and interpretability of analyses for the researcher. As such, it also implements several standard techniques for comparability, as well as visualization functions. These features and functionality make PESTS a valuable resource for a researcher's toolkit.
Availability and requirements
Project name: PESTS (Processing Expression of Short Time Series)
Project home page: http://www.mailman.columbia.edu/academicdepartments/biostatistics/researchservice/softwaredevelopment
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 6 or higher
License: noncommercial research use license
Any restrictions to use by nonacademics: license needed for commercial use
Abbreviations
ASTRO: Analysis of Short Time-series using Rank Order preservation
CAGED: Cluster Analysis of Gene Expression Dynamics
EDGE: Extraction of Differential Gene Expression
GQL: Graphical Query Language
ORIOGEN: Order Restricted Inference for Ordered Gene Expression data
PESTS: Processing Expression of Short Time Series
STEM: Short Time-series Expression Miner
SAM: Significance Analysis of Microarrays
Declarations
Acknowledgements
Anshu Sinha was supported through NLM Informatics Research Training Program (T15 LM00707918). Dr. Markatou would like to acknowledge OBE/CBER/FDA and NSF DMS0504957 for salary support.
References
1. Filho JSS, Gilmour SG, Rosa GJM: Design of microarray experiments for genetical genomic studies. Genetics 2006, 174: 945–957. 10.1534/genetics.106.057281
2. Ribeiro CM, Hurd H, Wu Y, Martino MEB, Jones L, Brighton B, Boucher RC, O'Neal WK: Azithromycin treatment alters gene expression in inflammatory, lipid metabolism, and cell cycle pathways in well-differentiated human airway epithelia. PLoS ONE 2009, 4(6):e5806.
3. Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW: Significance analysis of time course microarray experiments. PNAS 2005, 102(36):12837–12842. 10.1073/pnas.0504609102
4. Bar-Joseph Z, Gerber GK, Gifford DK, Jaakkola TS, Simon I: Continuous Representations of Time-Series Gene Expression Data. Journal of Computational Biology 2003, 10(3–4):341–356. 10.1089/10665270360688057
5. Ernst J, Nau GJ, Bar-Joseph Z: Clustering short time series gene expression data. Bioinformatics 2005, 21(Suppl. 1):i159–i168. 10.1093/bioinformatics/bti1022
6. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. PNAS 1998, 95(25):14863–14868. 10.1073/pnas.95.25.14863
7. Park T, Yi SU, Lee S, Lee SY, Yoo D, Ahn J, Lee YS: Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics 2003, 19: 694–703. 10.1093/bioinformatics/btg068
8. Wang J, Kim S: Global analysis of dauer gene expression in Caenorhabditis elegans. Development 2003, 130: 1621–1634. 10.1242/dev.00363
9. Conesa A, Nueda MJ, Ferrer A, Talon M: maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments. Bioinformatics 2006, 22(9):1096–1102.
10. Hong F, Li H: Functional hierarchical models for identifying genes with different time-course expression profiles. Biometrics 2005, 62: 534–544. 10.1111/j.1541-0420.2005.00505.x
11. Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001, 98: 5116–5121. 10.1073/pnas.091062498
12. Camillo B, Toffolo G, Nair SK, Greenlund LJ, Cobelli C: Significance analysis of microarray transcript levels in time series experiments. BMC Bioinformatics 2007, 8(Suppl 1):S10. 10.1186/1471-2105-8-S1-S10
13. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999, 96(6):2907–2912. 10.1073/pnas.96.6.2907
14. Ramoni MF, Sebastiani P, Kohane IS: Cluster analysis of gene expression dynamics. Proc Natl Acad Sci USA 2002, 99: 9121–9126. 10.1073/pnas.132656399
15. Schliep A, Schonhuth A, Steinhoff C: Using hidden Markov models to analyze gene expression time course data. Bioinformatics 2003, 19: i264–i272. 10.1093/bioinformatics/btg1036
16. Tchagang AB, Bui KV, McGinnis T, Benos PV: Extracting biologically significant patterns from short time series gene expression data. BMC Bioinformatics 2009, 10: 255. 10.1186/1471-2105-10-255
17. Ghandhi SA, Sinha A, Markatou M, Amundson SA: Time-series clustering of gene expression in irradiated and bystander fibroblasts: an application of FBPA clustering. BMC Genomics 2011, 12(1):2.
18. Sinha A: Analyzing sparse and irregularly spaced time dependent gene expression data. Diss. Columbia University; 2010.
19. Cressie NAC: Statistics for Spatial Data. 2nd edition. Wiley, New York; 1993.
20. Dixon WJ, Massey FJ Jr: Introduction to Statistical Analysis. Fourth edition. Edited by: Dixon WJ. McGraw-Hill Book Company, New York; 1983:P377–P548.
21. Bonferroni CE: "Teoria statistica delle classi e calcolo delle probabilità." Volume 8. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze; 1936:3–62.
22. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B 1995, 57: 289–300.
23. Schweder T, Spjøtvoll E: Plots of p-values to evaluate many tests simultaneously. Biometrika 1982, 69: 493–502.
24. Rousseeuw PJ: Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 1987, 20: 53–65. 10.1016/0377-0427(87)90125-7
25. Leek J, Monsen E, Dabney A, Storey J: EDGE: extraction and analysis of differential gene expression. Bioinformatics 2006, 22: 507–508. 10.1093/bioinformatics/btk005
26. Conesa A, Nueda MJ, Ferrer A, Talon M: maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments. Bioinformatics 2006, 22(9):1096–1102.
27. Serial Expression Analysis [http://sea.bioinfo.cipf.es/]
28. Peddada S, Harris S, Zajd J, Harvey E: ORIOGEN: order restricted inference for ordered gene expression data. Bioinformatics 2005, 21: 3933–3934. 10.1093/bioinformatics/bti637
29. Ramoni M, Sebastiani P, Kohane I: Cluster analysis of gene expression dynamics. PNAS 2002, 99(14):9121–9126. 10.1073/pnas.132656399
30. Costa IG, Schonhuth A, Schliep A: The Graphical Query Language: a tool for analysis of gene expression time-courses. Bioinformatics 2005, 21(10):2544–2545. 10.1093/bioinformatics/bti311
31. Ernst J, Bar-Joseph Z: STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics 2006, 7: 191. 10.1186/1471-2105-7-191
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.