Extracting biologically significant patterns from short time series gene expression data

Background Time series gene expression data analysis is used widely to study the dynamics of various cell processes. Most of the time series data available today consist of few time points only, thus making the application of standard clustering techniques difficult. Results We developed two new algorithms that are capable of extracting biological patterns from short time point series gene expression data. The two algorithms, ASTRO and MiMeSR, are inspired by the rank order preserving framework and the minimum mean squared residue approach, respectively. However, ASTRO and MiMeSR differ from previous approaches in that they take advantage of the relatively few number of time points in order to reduce the problem from NP-hard to linear. Tested on well-defined short time expression data, we found that our approaches are robust to noise, as well as to random patterns, and that they can correctly detect the temporal expression profile of relevant functional categories. Evaluation of our methods was performed using Gene Ontology (GO) annotations and chromatin immunoprecipitation (ChIP-chip) data. Conclusion Our approaches generally outperform both standard clustering algorithms and algorithms designed specifically for clustering of short time series gene expression data. Both algorithms are available at .


Background
Time series experiments have been widely used to study the dynamic behavior of the cells in a variety of biological processes, including cell proliferation [1], development [2], and response to extracellular stimuli [3,4]. Time series data can be broadly divided into two classes: the short-time series with few sampled time points (typically 3-8) and long-time series with more than 10 time points sampled. Most algorithms used to analyze time series datasets ini-tially were based on general clustering methods like hierarchical clustering [5], k-means [6], Bayesian networks [7], and self-organizing maps [8]. Although these methods are capable of revealing some biological features, they are not taking into consideration the sequential nature of the time series data. More recently, some groups suggested methodologies specifically designed for clustering time series expression data, including the use of continuous representation of expression profiles [9], hidden Markov models [10], and others [11][12][13][14]. However, algorithms such as those developed by Bar-Joseph et al. [9], De Hoon et al. [12] and Peddada et al. [13] perform better on long time series datasets where the statistical power is higher. For short time series data, which represent about 80% of the time series gene expression datasets [15], they are expected to perform less optimal due to data overfitting caused by the small number of sampled time points.
In order to avoid that, some researchers have suggested the use of predefined patterns of expression profiles (either taken directly from the data or from prior biological observations) and matching the observed data to these profiles using some cost function [15][16][17][18]. Such approaches usually identify a large number of patterns, but many of them may arise randomly from noise due to the small number of sampled time points. The algorithm proposed by Ernst et al. [15] is capable of partially correcting for this problem with the implementation of heuristics: the user is required to select a set of potential profiles that are expected to represent better the real biological nature of such data. Last but not least, almost all of the approaches mentioned above use a cost function followed by a greedy algorithm to find clusters. As we will show later, such approaches may miss some biologically significant characteristics of the data.
In this paper, we present two new algorithms, ASTRO and MiMeSR, respectively, which are specifically designed to identify biologically relevant clusters of genes from short time series data. ASTRO and MiMeSR are inspired by the order preserving framework and the minimum mean squared residue approach, respectively. Other algorithms have used the same principles in the past, but in the biclustering context [19][20][21], which makes such algorithms NP hard [21]. We demonstrate the utility of ASTRO and MiMeSR using several well-defined short time datasets. We show that our approaches are robust to noise and random patterns and they can correctly detect the temporal expression profile of relevant functional categories in linear time. Comparative analysis also showed that our approaches outperform both general clustering algorithms and algorithms designed specifically for short time series gene expression data.

Robustness to noise
To test the robustness of ASTRO and MiMeSR to noise, we generated three sets of data, 1000 rows and 3, 5, and 7 time points respectively, with five order preserving submatrix which at the same time verify the minimum mean squared residue property embedded in it (domain knowledge). Then, we added 0%, 1%, 3%, and 5% level of noise into the simulated data. We ran each algorithm several times on each set of data and plot the average of the Adjusted Rand index (Figure 1). The Adjusted Rand index values lies between 0 and 1. Larger value means higher similarity between the clustering results. If the simulated result is perfectly consistent to the domain knowledge, the index value will be 1. If a clustering is no more than a random choice, the index will be zero [22]. The results in Figure 1 show that both algorithms perform equally well on the 5 time points dataset, while ASTRO is more robust on the 3 time points datasets and MiMeSR on the 7 time points dataset.

Application on Saccharomyces cerevisiae amino acid starvation dataset
We tested the ability of ASTRO and MiMeSR to identify biologically relevant clusters from short time series data using the yeast amino acid (AA) starvation dataset [3]. Saccharomyces cerevisiae response to stress by AA starvation is measured at time points 0.5 h, 1 h, 2 h, 4 h, and 6 h and at the control (unstimulated) cells (time point 0 h). The data was filtered to remove genes with missing values and genes whose expression level did not change substantially between time points, filtering threshold ε < 2.0 for ASTRO and MiMeSR. The results show that both our approaches can correctly identify the temporal profiles of relevant functional groups. Statistical evaluation of our clusters was performed using external datasets, like the GO categories [23] and AA starvation ChIP-chip data [3,24]. Compared to general clustering algorithms (e.g., k-means) and algorithms designed specifically for clustering short time series gene expression data [17,25], our techniques were able to detect more significant patterns. Figure 2A and 2B present the plot of the most significant clusters identified in this dataset by ASTRO (Z = 10 -7 to 10 -68 ) and MiMeSR (H < 2), respectively. The minimum number of genes per cluster in both cases was set to K min = 25. In principle, one might expect that the genes in biologically relevant clusters will also participate in the same biological processes. We used the on-line yeast GO Term Finder tool http://www.yeastgenome.org to assess GO membership of the genes in the identified clusters. We found that with the exception of the clusters with many genes of unknown function, the majority of the genes in the identified clusters belong to the same GO categories. The p-values for these clusters were ranging from 10 -10 to 10 -34 for ASTRO (Table 1) and from 10 -34 to 10 -68 for MiMeSR ( Table 2.) The results also show that in general MiMeSR clusters are more homogenous than the ASTRO clusters regarding the GO pathways. For example, the percentage of the genes in the ASTRO clusters C1 and D1 that belong to the "ribosome biogenesis" category (Table 1) is smaller than MiMeSR clusters E2 and F2 (Table 2); consequently, their p-values are higher. The same is true for ASTRO cluster B1 and MiMeSR clusters D2 and G2.

Evaluation using ChIP-chip data
Another reasonable assumption is that genes that belong to the same cluster (co-expressed genes) are more likely to be regulated by the same transcription factors (co-regulated genes). We evaluated the ASTRO and MiMeSR clusters using the published AA starvation ChIP-chip dataset on 34 transcription factors [24]. Each of the 34 transcription factors in the ChIP-chip dataset was tested for target over-representation in each of the clusters using the Fisher's exact test. The results are presented on Tables 3 (ASTRO) and 4 (MiMeSR). FHL1, and SFP1 appear to have overrepresented number of target genes in clusters A1, B1, D1 (Table 3), A2, C2, D2, and G2 (Table 4), which are also the most similar in their overall expression pattern ( Figure 2A, 2B), especially in 0.5 hr and 1 hr time points. It is possible that these transcription factors act early on in the AA starvation response as it was previously suggested [26]. Consistent with our results, Jorgensen et al. [27] have found that FHL1 and SFP1 regulate many ribosome biosynthesis genes, which is the most significant GO process in cluster C1, D1, A2, C2, D2, E2, and F2 genes. Furthermore, MET31, MET32 and MET4 have been associated with regulation of sulfur metabolism genes [28], which is the most significant category for our cluster E1 genes.
Comparison of the results in Tables 3 and 4 also shows that in general ASTRO finds more clusters that are statistically significantly enriched in genes bound by transcription factors. In other words, ASTRO performs better than MiMeSR in identifying co-regulated genes.

Comparison with other methods
We compared ASTRO and MiMeSR with the popular kmeans general clustering algorithm and the recently published STEM [25] and FCV [17], both of which are designed specifically for short time series gene expression data. We used the Matlab 7.0.0 implementation of kmeans with correlation distance. We ran the k-means algorithm for 10 clusters because most of our algorithm picked the same number. FCV was implemented according to the description provided in [17]. STEM ran over the web http://www.cs.cmu.edu/~jernst/stem/ with the following parameters: maximum unit of change in model profiles between time points = 4; number of model profiles = 50. For unbiased comparison, we ran these algorithms on the same dataset of 698 genes (filtering threshold ε < 2). We selected clusters with less than 50% genes of unknown function as depicted by the GO database. Figure 3A and 3B presents the comparative evaluation of these approaches using GO and ChIP data, respectively. A particular cluster was considered to be "significant" if the p-value of the top GO category or transcription factor-gene association was smaller than the threshold. A good clustering algorithm is expected to identify sets of genes that will participate in the same biological processes (GO annotation) and/or regulated by the same transcription factors. The more homogeneous these clusters are the more significant the annotation categories will become. ASTRO identified a higher number of significant clusters than the k-means and the FCV algorithms in all p-value thresholds ( Figure 3A and 3B). Also, it performed equally well or better when compared to STEM. MiMeSR also performed equally well or better than all other algorithms with respect to the TF-gene association data; and it gave comparable results to the other algo-ASTRO and MiMeSR performance on simulated data Figure 1 ASTRO and MiMeSR performance on simulated data. Comparison of the two algorithms on simulated data with different noise levels.
rithms in the GO category analysis. In particular, we found it to be more accurate than the other algorithms in predicting the more tight clusters (low p-values), whereas the other algorithms performed better in higher p-values. This shows the potential of our algorithms in identifying more tight, biologically relevant clusters. We note, however, that a thorough comparison of different methods is impossible when dealing with noisy datasets and algorithms with different reporting thresholds.

Conclusion
In this study, we presented two new algorithms for analyzing short time series gene expression data: ASTRO that uses the order preserving matrix concept and MiMeSR that uses the minimum mean squared residue concept. Both these algorithms use linear algebra techniques to identify coherent gene clusters over all time points in linear time. This offers a significant advantage over existing methods that employ greedy approaches or heuristics, since ASTRO and  MiMeSR avoid problems related to cost functions or the choice of predefined sets of expression profiles [19,21,25]. Also, the complexity of our methods is smaller than that of existing algorithms, even when it is compared to those that use a greedy approach to speed up their running time [9][10][11][12][13][14][15]. ASTRO identifies all gene clusters with coherent expression patterns, irrespectively of the magnitude of expression change. Genes belonging in such clusters are generally expected to participate in the same biological processes (see, also, Table 1). MiMeSR identifies all gene clusters with coherent expression change both in direction as well as in magnitude. Genes belonging in such clusters are expected to be regulated by the same set of transcription factors (although different transcription factors may act at different time points). In general, MiMeSR identifies clusters that are enriched for genes belonging to the same pathways, while ASTRO identifies clusters with genes that are regulated by the same transcription factors (Tables 1, 2, 3 and Figure 3). In terms of overall statistical significance of the identified clusters, MiMeSR outperforms ASTRO in most cases (Figure 3). In a direct comparison of the two algorithms in simulated data ( Figure 1) we found that ASTRO outperforms MiMeSR when fewer data points are available, whereas MiMeSR performs better when more data points are available.
Testing ASTRO &MiMeSR in well-characterized short time series gene expression datasets showed that it is robust to noise and to random patterns, and that it can correctly predict the temporal expression profile of relevant functional categories as confirmed by statistical analysis of GO category membership over-representation and analysis of transcription factor occupancy in the promoters of the gene members of the various clusters. Our approaches are shown to outperform existing clustering algorithms, including the popular k-means, as well as FCV and STEM and they were able to distinguish between closely related but biologically distinct patterns. As expected, ASTRO finds more homogeneous clusters than MiMeSR, with respect to the percentage of genes associated with a given transcription factor. This is because it takes into consideration the magnitude of the change in gene expression, which is more closely related to the transcription factors involved.
In principle, ASTRO and MiMeSR can also be applied to long time series gene expression data (more than 10 time points) or gene expression data sampled over different conditions, but in this case the number of genes in each cluster is expected to be low. However, they can be adapted to identify local patterns, thus overcoming this problem.  The entry a nm is the expression level of gene n at time point t m or -simply-m. Given a short timeseries gene expression dataset, our goal is to identify the sets of genes with coherent behavior, i.e., genes whose expression levels increase and/or decrease coherently across the time point experiments, by minimizing the effect of noise and random patterns. The input data are pre-processed to identify and remove from the matrix all genes whose expression level remains constant across time points. We consider the expression of a gene to be constant when the difference between the minimum and the maximum value (in log-scale) is less than a positive real number, ε. ε is a user-defined parameter and it can be based on prior knowledge on the expected level of noise on a given experiment.

ASTRO (Rank Order Preserving Matrix framework)
ASTRO seeks to identify sets of genes with similar expression profiles irrespectively of their expression foldchange. Therefore, only the direction of expression change is considered and not its magnitude. This reflects the biological fact that different gene products may be required at different quantities for a given cellular response or function. Under this assumption, the problem of finding a cluster of similarly expressed genes can be caste into a problem of finding an order preserving submatrix (OPSM) of the gene expression matrix A. A submatrix C of A is order preserving (OP) if there is a permutation of its columns under which the sequence of values in every row is strictly increasing or decreasing. In other words an OPSM is a set: C = {2-tuples (I,J), I ∈ G and J ∈ T}, such that each row induces the same rank order permutation on the columns. The problem of

MET4
1/3% (7e-02) searching over all possible subsets of columns for identifying the most significant OPSM is NP hard [19]. However, taking advantage of the small number of sampled points in a short time series dataset, one may seek patterns of consistent gene expression over all time points. In such case the order will be required to be preserved in all columns (time-points) for the genes in a cluster (i.e. J = T). In fact, this assumption is now commonly used in analyzing short-time series experiments [9][10][11][12][13][14][15]. As we show below, this reduces the complexity of finding OPSMs, which offers an advantage over methods that use probabilistic models and greedy algorithms. ASTRO is guaranteed to find all OPSMs across all time points in O(NM) using linear algebra techniques.

Overview
In this part, we focus on genes with coherent behavior, i.e., genes whose expression levels increase and/or decrease coherently across all time points. ASTRO starts by filtering those genes with constant gene expression across all time points (using a threshold ε). It then proceeds by constructing the rank matrix of the time series gene expression data. Next, it identifies all distinct coherent patterns in the ranked matrix. Finally, it assigns each gene to its corresponding cluster by performing a row comparison between the set of distinct rows of the ranked matrix and the ranked matrix itself. See Additional file 1 for ASTRO pseudo codes.  3, 4, 2). The ranking is performed in increasing order. If more than two entries have the same value, the user can decide to give them the same ranking or rank them in the order they appear. In this study, we choose the former. By replacing each entry of the gene expression matrix with  their rank along the rows, we are no longer considering the expression level of a given gene per se, but its dynamics over all time points. The advantage of this method is its speed and that it avoids the use of probabilistic models, greedy algorithms, or costly column permutations. Also, one will notice that for any k > 1 rows of the ranked matrix that are similar under any permutation of the columns of the gene expression matrix; they will always belong to the same cluster. Finally, the fact that the rank is conserved under any permutation of the columns of the gene expression matrix further reduces the chance that a random pattern might be picked up in a cluster (see, also, the Statistical Significance and Complexity Analysis section).

Pattern Identification
Given a gene expression data matrix, A, the exact number of distinct OP expression profiles that can be found in the dataset (time points t = 1,..., M) is the number of distinct rows, N U , of its corresponding ranked matrix R. The set of distinct OP patterns, U, can thus be identified by considering the rank matrix R as a set of rows and identify all subsets of identical rows in it. ASTRO is guaranteed to identify the exact number of distinct OP patterns in a given matrix in O(NM).

Identification of Order Preserving Clusters
Once the exact number of distinct OP patterns has been identified, ASTRO assigns each gene to one of the N U groups by comparing each distinct row U k* of the ranked matrix to the rows R n* of the ranked matrix itself, and assign gene n to cluster G{k} each time U k* = R n* . This approach is guaranteed to identify all OP clusters of size K × M, with K min ≤ K ≤ N, and K min is the minimum number of genes in a cluster.

Statistical Significance and Complexity Analysis of ASTRO
The statistical significance of each identified cluster with K genes is assessed by computing the tail probability that a random dataset of size N × M will contain a cluster with K or more genes in it [19]. In principle, the probabilistic description of the reference random matrix would be that of the observed noise in the microarray experiment [29,30]. Since this distribution is difficult to calculate in closed form, we calculate the upper bound of this tail probability following the approach described below. Let's assume we have a dataset with M time points that are independent, identically distributed according to the uniform distribution. Then the probability that a random row (gene) supports a given cluster is equal to the number of possible column permutations or 1/M!. Since the rows are assumed to be independent, the probability of having at least K rows in the cluster is the k-tail of the (N,(1/M!)) binomial distribution, i.e.: As there are M s = M! ways to choose an OP cluster of size M, the following expression Z(M,K) is an upper bound on the probability of having a cluster of size M with K or more genes: We use this bound to estimate the significance of any given cluster of size M with K members. The best cluster is the one with the largest statistical significance, i.e., the one with the smallest Z(M,K). Therefore, as long as that upper Comparison of clustering approaches Figure 3 Comparison of clustering approaches. Comparative analysis of clustering approaches using (A) GO data and (B) amino acid starvation ChIP-chip data. The y-axis represents the percent of clusters for which the p-value of their most significant category (GO or ChIP-chip) was lower than the given threshold.
bound probability is smaller than any desired significance level, the identified cluster in the real gene expression matrix will be statistically significant. The

MiMeSR (Minimum Mean Squared Residue)
MiMeSR seeks to solve the same problem, but unlike ASTRO, it takes into consideration the magnitude of the expression change in the analysis. This is based on the (biological) assumption that if a set of genes is regulated by the same transcription factors across all time points (even if different transcription factors are active at different time points,) then the expression pattern of these genes will not only be the same in terms of direction, but also in magnitude. In other words, MiMeSR aims to identify more coherent clusters of co-regulated genes rather than simply genes with similar expression patterns under a given set of conditions. Under this hypothesis, the problem of finding a cluster of similarly expressed genes is a problem of finding submatrices of the gene expression matrix, A, with minimum mean squared residue or coherent values [20,21]. Proof.
The MiMeSR algorithm that we develop in this study uses this concept to search for submatrices with mean squared residue smaller that a given threshold, δ → 0.
Cheng and Church have shown that when the search extends over all possible subsets of columns, then the solution is NP hard [20]. As we will show below, looking for patterns consistent over all time points (i.e. J = T) reduces the algorithmic complexity to O(NMK), and produces biologically relevant results. MiMeSR uses linear algebra and arithmetic tools to solve the problem, which is advantageous over greedy algorithms or the use of heuristics that were used in the past.

Overview
MiMeSR starts by filtering those genes whose expression levels do not change significantly during the time course (threshold ε). Then, it writes the gene expression matrix A as the sum of matrix Z 1 , with constant values on columns, and Z 2 = A -Z 1 . Finally, it identifies submatrices with constant values on rows in Z 2 , which correspond to the minimum mean squared residue clusters in the gene expression matrix A. See Additional file 1 for MiMeSR pseudo codes.

Identification of minimum mean squared residue clusters
MiMeSR extracts minimum mean squared residue submatrices from the gene expression matrix using the following approach. For a given row i of matrix A, a new matrix Z 1 is constructed with constant values in the columns. All rows in Z 1 are identical to row A[i]. Then Z 2 is calculated as Z 2 = A -Z 1 . Then, MiMeSR identifies the submatrix with constant values on rows across the whole time points in Z 2 . This step is easily performed by identifying the set of rows of Z 2 such that max(Z 2 (n,:) -min(Z 2 (n,:)) <ε, with ε → 0.
The submatrices with constant values on rows in Z 2 correspond to submatrices with minimum mean squared residue (coherent values) in A. For simplicity and without loss of generality, let us consider an example of the synthetic gene expression matrix A, with coherent values cluster in it, corresponding to rows r 1 , r 3 and r 4 (Figure 4.) By subtracting from A matrix Z 1 (constructed using the first row of A, (2 4 6 3)), a new matrix Z 2 is generated whose rows r 1 , r 3 , and r 4 correspond to the submatrix with constant values on rows. Note that, the same cluster will be constructed by using any of the rows r 1 , r 3 , or Example of the minimum mean squared residue method Figure 4 Example of the minimum mean squared residue method.
after a cluster has been identified, its rows are not further considered in the construction of new Z 1 matrices. This approach is guaranteed to identify all submatrices with minimum mean squared residue across all time point experiments. Note that, since the operation Z 2 = A -Z 1 is performed using all the rows of A during each iteration, and since we are seeking for the set of rows of Z 2 such that max(Z 2 (n,:) -min(Z 2 (n,:)) < ε, MiMeSR can allow rows (genes) to belong to more than one cluster. The biological equivalent of this notion is that genes may be involved in more than one genetic pathway or to be regulated by more than one transcription factors.

Statistical Significance and Complexity Analysis of MiMeSR
For practical reasons, it is important to assess the effects of the ε parameter on the clusters that are identified by MiMeSR. This can be done by sensitivity analysis in which the parameter ε is perturbed and the results are compared.
For this analysis, it is usually sufficient to consider one or two values above and below the originally selected value of ε. Only clusters that are consistently identified by MiMeSR as ε varies should be retained for further examination. Note that the number of genes in these clusters may also change. The user therefore needs to determine a rule for dealing with genes that may be dropped from the clusters as ε changes. The most conservative approach would be to retain only the genes that remain in the clusters for all values of ε around its selected value. It can be easily shown that the overall complexity of MiMeSR is O(NMK), where K is the number of minimum mean squared residue clusters in A. Note that K corresponds to the maximum number of constant columns matrices that can be constructed using the rows of A without identifying redundant clusters.