- Methodology article
- Open Access

# Mining biological information from 3D short time-series gene expression data: the OPTricluster algorithm

- Alain B Tchagang
^{1}Email author, - Sieu Phan
^{1}, - Fazel Famili
^{1}, - Heather Shearer
^{2}, - Pierre Fobert
^{2}, - Yi Huang
^{2, 3}, - Jitao Zou
^{2}, - Daiqing Huang
^{2}, - Adrian Cutler
^{2}, - Ziying Liu
^{1}and - Youlian Pan
^{1}

**13**:54

https://doi.org/10.1186/1471-2105-13-54

© Tchagang et al; licensee BioMed Central Ltd. 2012

**Received:**21 December 2011**Accepted:**4 April 2012**Published:**4 April 2012

## Abstract

### Background

Nowadays, it is possible to collect expression levels of a set of genes from a set of biological samples during a series of time points. Such data have three dimensions: gene-sample-time (GST). Thus they are called 3D microarray gene expression data. To take advantage of the 3D data collected, and to fully understand the biological knowledge hidden in the GST data, novel subspace clustering algorithms have to be developed to effectively address the biological problem in the corresponding space.

### Results

We developed a subspace clustering algorithm called Order Preserving Triclustering (OPTricluster), for 3D short time-series data mining. OPTricluster is able to identify 3D clusters with coherent evolution from a given 3D dataset using a combinatorial approach on the sample dimension, and the order preserving (OP) concept on the time dimension. The fusion of the two methodologies allows one to study similarities and differences between samples in terms of their temporal expression profile. OPTricluster has been successfully applied to four case studies: immune response in mice infected by malaria (*Plasmodium chabaudi*), systemic acquired resistance in *Arabidopsis thaliana*, similarities and differences between inner and outer cotyledon in *Brassica napus* during seed development, and to *Brassica napus* whole seed development. These studies showed that OPTricluster is robust to noise and is able to detect the similarities and differences between biological samples.

### Conclusions

Our analysis showed that OPTricluster generally outperforms other well known clustering algorithms such as the TRICLUSTER, gTRICLUSTER and K-means; it is robust to noise and can effectively mine the biological knowledge hidden in the 3D short time-series gene expression data.

## Keywords

- Gene Expression Data
- Order Preserve
- Subspace Cluster
- Adjusted Rand Index
- Temporal Expression Profile

## Background

Clustering of co-expressed genes has been an active data mining topic and advanced in parallel with the development of microarray technology [1]. There is a vast amount of literature on clustering algorithms developed for microarray data analysis [1]. Microarray gene expression data can be classified into two categories: steady state and time-series gene expression data [2]. Time-series gene expression data are widely used to study the dynamic behaviour of various biological processes in the cell [3–5]. They can be classified into two categories (relative to the clustering algorithms design for their analysis): short time-series corresponding to 3-8 time points [6], and long time-series corresponding to more than 8 time points. Short time-series are the most abundant type of time-series data in the literature [6]. Short-time series data are usually very noisy. Algorithms that are designed to analyze either steady state data or long time-series data do not perform well on short time-series data due to their relatively small number of time points [5–7]. Hence, it is necessary to develop algorithms that can be used specifically for their analysis.

Pioneering clustering algorithms such as K-means [8], Hierarchical clustering [9], and Self Organizing Map [10] identify full space clusters. Unfortunately, in many applications, subspace clusters are more meaningful than full space clusters [11]. Biclustering algorithms were recently proposed to find subgroups of genes that exhibit similar behaviour across subsets of samples, experimental conditions, or time points [11–15]. Nowadays, it is possible to collect expression levels of a set of genes for a given set of biological samples during a series of time points. Such data have three dimensions, gene-sample-time (GST), and thus are called 3D gene expression data. To take advantage of the 3D data collected, and to fully understand the biological knowledge hidden in the GST data, we have to move beyond the full space clustering concepts and develop algorithms that can effectively address the problem in the corresponding 3 dimensions [16–18]. A 3D cluster consists of a subset of genes of similar expression profiles along a segment of time-series, in a subset of samples. This kind of coherent clusters may contain information that could be used to identify useful phenotypes, potential genes related to these phenotypes and their interaction/regulation.

Although subspace clustering algorithms are biologically more meaningful than full space clustering algorithms, the identification of full space clusters are less costly compared to subspace clusters. In fact, most subspace clustering algorithms have been shown to be NP-complete [11], thus making their identification computationally expensive. This is due to the fact that in full space clustering, one usually looks for clusters across the entire dataset at once, whereas in subspace clustering one looks for clusters across all possible subsets of the data space.

Most clustering models, including those used in subspace clustering described above, define similarity among different objects by *distances* over either all or only a subset of the dimensions [1]. However, distance functions, such as Euclidean distance, Manhattan distance, and cosine distance are not always adequate in capturing correlations among the objects. In fact, strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance functions [1]. Pioneering works on triclustering algorithm relied on graph-based approaches and similarity measures to mine triclusters [16–18]. For example, the triclustering algorithm of [16] mines the largest triclusters satisfying a constant multiplicative or additive relationship between the expression levels in a cluster. Such a strict constraint considerably limits the capability of an algorithm to identify useful patterns and may not be able to fully cope with noise when dealing with time-series gene expression data in general. Several of the algorithms designed for the analysis of long time-series - do not work well on short time-series due to over fitting [6]. Also, most pioneering triclustering algorithms described in the literature only focus on the similarities between the biological samples and do not consider differences between them.

In this paper, we developed an order preserving 3D clustering algorithm named OPTricluster, for 3D short time-series data mining. A 3D short time series corresponds to GST with 2-5 samples and 3-8 time points. In fact most of the 3D time series gene expression data in the GEO database [19] have less than 5 samples and less than 9 time points. OPTricluster is able to identify triclusters (3D clusters) with coherent evolution from a given 3D dataset using a combinatorial approach on the sample dimension, and the order preserving (OP) concept on the time dimension. We say that a matrix is order preserving if there exists a permutation of its columns such that its rows are monotonic functions [13]. The OP concept is a well established mathematical theory, and has been used by several other authors in the past to tackle the problem of gene expression data analysis [13, 14, 20, 21].

The integration of the two methodologies (combinatorial and OP) allows one to study similarities and differences between samples in terms of temporal expression profile and/or differential expression. Basically, the novel triclustering algorithm is able to mine triclusters of genes in a subset of samples, with expression level having same directions (i.e. increase, decrease and/or stay constant similarly) across the time-series experiments. OPTricluster takes into account the sequential nature of the time-series and the noisy nature of the data through the OP concept. The OP model focuses on the similarity in the relative order of the time points, rather than on the distance between the actual expression levels as in several other clustering models [7–10]. This approach is potentially more robust to the stochastic nature of the gene expression levels, and to the variation caused by the measurements procedures.

## Results

OPTricluster is applied to analyze four different 3D gene expression datasets: immune response in mice infected by malaria (*Plasmodium chabaudi*), systemic acquired resistance (SAR) in *Arabidopsis thaliana*, similarities and differences between inner and outer cotyledon in *Brassica napus* during seed development, and *Brassica napus* whole seed development. We used these four datasets to show how OPTricluster can be used to tackle a variety of problems in Computational Biology. For example, application of OPTricluster to the mouse dataset shows how it can be used to study similarities and differences between biological samples in terms of temporal expression profiles. Application to the *Arabidopsis thaliana* SAR dataset shows how it can be used not only to study similarities and differences between biological samples, but also to infer the transcription network, that is, the relationship between the transcription factors (TFs) and their target genes. Application to the *Brassica napus* cotyledon dataset shows how OPTricluster can be used to study the spatial similarities and differences between biological samples in term of temporal expression profile, whereas its application to the *Brassica napus* whole seed dataset shows how OPTricluster can be used to handle classical short time-series dataset (2 dimensional dataset).

### Implementation

OPTricluster is implemented entirely in Java and will work with any operating system supporting Java 1.6 or later. The executable OPTricluster Java package is available as Additional File 1, and its user manual as Additional File 2. Portions of the interface of OPTricluster are implemented using a third party library, the JFreeChart [22]. In a post processing step, OPTricluster also makes use of external Gene Ontology files. OPTricluster can download the Gene Ontology and gene annotation files directly from the Gene Ontology websites. In fact the GO analysis plug-in of the Gene Ontology Analysis (GOAL) [23] package that we recently developed is integrated into OPTricluster for biological evaluation of the clusters. A user of OPTricluster first specifies a tab delimited gene expression data file as input to OPTricluster. The file is uploaded and displayed as a table on the screen. The user can check their data and do other manipulations on the data once it is displayed. This includes visual check, plot and view the distribution of the input gene expression data, which may help in deciding the best ranking threshold input parameter (see Methodology Section). Next, the user selects the analysis parameters through the OPTricluster parameter input interface. Following this input phase, the OPTricluster executes and a new table will appear displaying the clustering results, where new columns are added to the initial table and they correspond to the ranking profile of each gene across experimental time points in each sample respectively. From this step, the user will either select between constant, conserved, and divergent patterns for further analysis as explained below.

Note that, at each step of the analysis described above, OPTricluster offers the possibility to the user to perform several other tasks and gain more knowledge on the results obtained at the corresponding step. These manipulations include: open the table in Excel and use the Excel capabilities for other manipulations. Plot and view the pie chart or the bar chart of the results. Merge cluster with similar profile for further analysis. Obtain the difference between patterns in different subsets of samples. Perform and visualize the gene ontology enrichment analysis of the group of genes obtained at that particular step.

### Simulation and robustness to noise

To test the robustness of OPTricluster to noise, we used the adjusted rand index (ARI). ARI has been used previously for clustering techniques comparison and robustness to noise [18]. ARI's values lie between 0 and 1, where larger value means higher similarity between the clustering results. If the experimental result is perfectly consistent with the domain knowledge (known triclusters embedded in the simulated dataset), the index value will be 1. If a clustering is no more than a random choice, the index will be 0.

### Immune responses in mice infected by malaria

The goal of this study is to examine whether immune responses to malaria (*Plasmodium chabaudi*) infection differ between the sexes and are altered by the presence of gonadal steroids. To tackle this problem, we used a 3D short time-series gene expression dataset downloaded from the Gene Expression Omnibus website [19], (accession number: GSE4324). This dataset has N = 33935 probes and it corresponds to the gene expression profiles of mice response to *Plasmodium chabaudi*[24], representing two disease states (*P. chabaudi* infected and non-infected), two genders (male and female), two protocols (intact and gonadectomized), and four time point experiments: 0, 3, 7, and 14 days after inoculation (DAI). We refer to the four biological samples used in this study as follows: intact male (*I*_{
M
} ), intact female (*I*_{
F
} ), gonadectomized male (*G*_{
M
} ), and gonadectomized female (*G*_{
F
} ). For ethical issue pertaining to this dataset, see reference [24].

After data pre-processing and normalization, we ended up with 5783 significant probes, corresponding to 5063 unique genes. The three dimensions of the data are: G (N = 5783 probes), S (M = 4 samples: *I*_{
M
} , *I*_{
F
} , *G*_{
M
} , and *G*_{
F
} ), and T (L = 4 time points: 0, 3, 7, and 14 DAI). We set the input parameters to: minimum number of genes in a cluster *I*_{
m
} *=* 1, minimum number of samples in a cluster *J*_{
m
} *=* 1, and differential expression threshold (or ranking threshold) *δ =* 0.31. With the minimum number of samples in a tricluster set to 1, the algorithm generated 2^{4}-1 = 15 combinations of samples ({*I*_{
M
}*, I*_{
F
}*, G*_{
M
}*, G*_{
F
} }, {*I*_{
M
}*, I*_{
F
}*, G*_{
M
} }, {*I*_{
M
}*, I*_{
F
}*, G*_{
F
} }, {*I*_{
M
}*, G*_{
M
}*, G*_{
F
} }, {*I*_{
F
}*, G*_{
M
}*, G*_{
F
} }, {*I*_{
M
}*, I*_{
F
} }, {*I*_{
M
}*, G*_{
M
} }, {*I*_{
M
}*, G*_{
F
} }, {*I*_{
F
}*, G*_{
M
} }, {*I*_{
F
}*, G*_{
F
} }, {*G*_{
M
}*, G*_{
F
} }, {*I*_{
M
} }, {*I*_{
F
} }, {*G*_{
M
} }, {*G*_{
F
} }).

*I*

_{ M }

*, I*

_{ F }

*, G*

_{ M }

*, G*

_{ F }}, 3516 genes are unchanged, whereas 427 (Figure 3) changed similarly in all four samples. These 427 genes are further clustered into six groups (Figure 4) with 12 or more genes. Clearly, the genes in Figure 4 have similar behaviour in the four samples and across the entire time series. Most of these 427 genes may play the role of housekeeping. In other words, they represent the set of genes that are co-expressed regardless of the experimental condition to maintain basic cellular function. Indeed, Gene Ontology (GO) analysis of these six clusters (Figure 5), showed that they are involved in similar molecular function and biological processes, such as protein and DNA binding, transcription regulation, cell cycle and basic metabolism.

*I*

_{ M }) have the highest number of genes (1778) that the expression level changed following pathogen attack. This indicates that

*I*

_{ M }is probably more vulnerable to

*Plasmodium*infection compared to the other three phenotypes. This is consistent with the phenotypical observation made in [24]. Indeed the Gene Ontology analysis shows that

*I*

_{ M }have more genes involved in the GO biological processes, such as, cell death (GO:0008219), programmed cell death (GO:0012501), apoptosis (GO:0006915), than

*I*

_{ F }(Figure 6). On the other hand, these same results showed that gonadectomy of males altered the sex-associated differences, suggesting that sex steroid hormones may modulate immune responses to infection [24].

In terms of differences between the four samples tested, our analysis identified genes that were unique to only one and a combination of two or more samples. For example we identified 251, 266, 216, and 249 genes unique to *I*_{
M
} , *I*_{
F
} , *G*_{
M
} , and *G*_{
F
} respectively. These genes may be the origin of the differences between the four samples after *Plasmodium* infection. Thus, they may represent potential targets or biomarkers to be used not only to understand the differences between the samples, but also to develop novel therapeutic means.

### Systemic acquired resistance in *Arabidopsis thaliana*

We applied OPTricluster to study similarities and differences in defence mechanism of *Arabidopsis thaliana* against pathogenesis. The goal of the study is to understand the roles of NPR1 and some of TGA family TFs during systemic acquired resistance in *Arabidopsis*. The 3D microarray data used here was obtained using Affymetrix *Arabidopsis* Genechip consisting of 22810 probes. The Columbia wild-type (W), mutant *npr1* (P), double mutant *tga1 tga4* (Z_{1}), and triple mutant *tga2 tga5 tga6* (Z_{2}) were treated with salicylic acid (SA) for 0, 1, and 8 hours. After data pre-processing and normalization, we ended up with 3945 significant genes. We set the Columbia wild-type as our baseline and took the log_{2} ratio of the mutant gene expression levels over the wild-type at respective time points.

*N × M × L*gene expression matrix, our goal is to identify the set of genes that are controlled by the TFs at a given time point, to study similarities and differences between them, and to infer a temporal transcriptional regulatory network controlling SAR in

*A. thaliana*. OPTricluster generated 2

^{4}-1 = 15 combinations of samples. Below we present some of the results obtained by OPTricluster. Figure 7 for example shows an example of divergent patterns. The expression levels of these genes are relatively unchanged (constant) in three samples and behave differently in one. For example, in the first row, significant changes are visible within WT, whereas the other three genotypes stay relatively constant (within threshold of ± 0.5) across the three time point. Clearly, since the expression level of these genes stay constant in three of the four experimental conditions and change considerably in only one of them, these genes may represent potential targets of the TFs tested in this study.

*Arabidopsis thaliana*at 0 h, 1 h, and 8 h inferred using the OPTricluster algorithm (Equation 4 in the Method Section). For example, our analysis showed that only 23, 66, and 73 genes are either down- or up-regulated by the combined action of the three sets of TFs at 0, 1, and 8 h respectively. The number of NPR1 targeted genes is less than that of TGA1 TGA4 and TGA2 TGA5 TGA6 at 0 h. But at 8 h, it is the reverse situation where the number of NPR1 targeted genes is higher than those regulated by TGA1 TGA4 and TGA2 TGA5 TGA6, respectively. This is consistent with the fact that NPR1 gene expression in the Columbia wild type was initially moderate but drastically increased at 1 hour and increased further until 8 hours after SA treatment (Additional file 3).

Gene Ontology analysis reveals that several of the genes that are regulated by the three sets of TFs (NPR1, TGA1 TGA4, and TGA2 TGA5 TGA6) at 8 h (Figure 8) are associated with response to stimulus (GO:0050896; p-value = 1.1e-08), stress (GO:0006950; p-value = 1.7e-05), abiotic stimulus (GO:0009628; p-value = 3.1e-05), biotic stimulus (GO:0009607; p-value = 1.3e-03), and defence mechanism (GO:0006952; p-value = 5.0e-02). These correspond to the fact that the plants were treated by SA, which mimic pathogen infection. They also confirm the fact that the TFs tested in this study are known to play major roles in plant defence mechanism [25]. In this application of OPTricluster, we used similarities in gene expression profiles of *Arabidopsis thaliana* with single, double or triple mutations of key transcription factors in the defence signalling network. We studied the network dynamics over a time series after treatment with salicylic acid (SA), which mimics a pathogen infection. We found that most SA-responsive genes were affected by at least one mutation and that most affected genes fit one of a few patterns of regulation. We then provided a first glimpse into the temporal pattern of the gene regulatory network during systemic acquired resistance in *Arabidopsis*.

### Similarities and differences between inner and outer cotyledon in *brassica napus* during seed development

Canola (*Brassica napus*) has two cotyledons; one embraces the other. The outer cotyledon is bigger and has higher oil content than the inner one. There exists similarity and differences in the development of and the metabolic processes between the two cotyledons. The objective of this analysis is to unravel the spatial similarity and differences in gene expression profiles between the two cotyledons in order to identify genes that play important roles in seed development and oil production. After obtaining the respective cotyledons, DNA microarray experiments were performed on each cotyledon using the DNA Combimatrix 90 K chip, developed by Plant Biotechnology Institute, NRC [26]. The output yielded two time-series data matrices *X*_{
I
} and *X*_{
O
} representing the gene expression level in the inner and outer cotyledons, respectively. The time-series has six time points, 20, 22, 24, 26, 28, and 30 days after pollination (DAP). After data pre-processing and normalization, significantly expressed genes were identified for this study. The three dimensions considered in this analysis were G (N = 3945 genes), S (M = 2 biological samples: *I* and *O*), and T (L = 6 time points: 20, 22, 24, 26, 28, and 30 DAP).

^{2}-1 = 3 combinations of samples ({

*I, O*}, {

*I*}, {

*O*}

*)*. The subset of sample {

*I, O*} yielded similar patterns between the inner and outer cotyledons, whereas the equations {

*O*}

*-*{

*I, O*}, or {

*I*}

*-*{

*I, O*} yielded patterns specific to outer or inner cotyledon, respectively. Analysis reveals 33 genes depicting the main difference between the two samples across the six time point experiments (Figure 9) and several others across subsets of the six time points (Additional file 4). Among the 33 genes, 17 and 16 were highly expressed in inner compared to outer, and outer compared to inner, respectively.

*Brassica napus* whole seed development

*Brassica napus*microarray dataset, which corresponds to the gene expression profiles during seed development. It has eight times points: 10, 15, 20, 25, 30, 35, 40, and 45 DAP. Each time point has six biological replicates. After data pre-processing and normalization, we ended up with 10865 probes, from which we took the mean of the replicates. Figure 11 show examples of patterns identified using OPTricluster.

Biological evaluation of these clusters showed that they are highly enriched under seed development; fatty acid and lipid metabolism; fatty acid and lipid biosynthesis; lipid storage, transport, and localization; oil content and biosynthesis, with *p*-values < 1.0e-03 (Additional file 5). These results demonstrate that seed development and fatty acid synthesis are highly correlated. Indeed, this is consistent with the fact that in higher plants, the biosynthesis of most fatty acids and lipids is physiologically coupled with seed development [27]. Further analyses showed that most of these patterns are highly positively or negatively correlated with the expression profile of transcription factors such as: LEC1, WRI1, FUS3, ABI3, and ABI5. This observation is consistent with the fact that the LEC1 function is partially dependent on ABI3, FUS3, and WRI1 in the regulation of fatty acid and fatty acid derived complex lipid [27–29]. It is also consistent with the fact that WRI1 plays a significant role during oil accumulation in maturing seed, that WRI1 is a prerequisite for fatty acid synthesis, and is important for normal embryo development [30]. Most of these results suggest that, the genes involved in these patterns may represent potential targets that could be used for the genetic improvement of oil production in *B. napus*.

## Discussion

In this paper, we developed a subspace clustering algorithm OPTricluster, for 3D short time-series data mining. OPTricluster is able to identify triclusters (3D clusters) with coherent evolution from a given 3D dataset using a combinatorial approach on the sample dimension, and the OP concept on the time dimension. The amalgamation of the two methodologies, combinatorial and order preserving allows one to study similarities and differences between samples in terms of temporal expression profile and/or differential expression. The combinatorial approach on the sample dimension is necessary because the goal of the algorithm is not only to study similarities between biological samples, but also to study differences between them at the single and multiple samples level. Because of this enumerative approach, it makes OPTricluster computational heavy when one is dealing with an increasing number of samples. As a result, we have restrictions for analyzing 3D short time series genes expression data. That is 3D gene expression with ~3-8 time points and ~2-5 samples. For longer 3D time series, OPTricluster will still work, but the computational complexity of the algorithm will increase exponentially with the increase in the number of time points and/or the number of samples.

Because of this combinatorial approach in the sample dimension and the OP concept in the time dimensions, the algorithm is able to mine clusters of genes in a subset of samples, with expression level having same directions across a time-series experiment. OPTricluster takes into account the sequential nature of the time-series and the noisy nature of the data through the OP concept. The results of Gene Ontology analysis and statistical analysis show that OPTricluster is robust to noise and is able to detect the similarities and differences between samples in terms of temporal expression profile of relevant functional categories. This is due to the fact that the OP model focuses on the similarity of the relative order over the time-series, rather than on the similarity of the actual expression levels based on a distance measure as in several other clustering models. This approach is potentially more robust to the stochastic nature of the gene expression values, and to the variation caused by the measurement procedure. This is clearly demonstrated in our comparison among the four clustering methods (Figure 2). The OP model has been accepted as a biologically meaningful clustering approach, capturing the general tendency of gene expressions across a subset of conditions [5, 11–14, 20, 21].

Furthermore, OPTricluster also offers the possibility to analyze the genes that the expression profiles change similarly across different samples for a given time point. This is easily done within the current version of the OPTricluster algorithm by simply swapping the time dimension with the sample dimension.

One relevant application of OPTricluster is the identification of genes that may play the role of housekeeping. This is revealed by the study of similarities among the samples. Indeed, several of the genes with same behaviour across all the samples and over the entire time series are usually not affected by phenotypes (Figure 4). Thus these genes may represent genes that are needed by all cells of the organism to fulfill basic cellular functions. As has been shown in the literature, housekeeping genes have many applications. They can be used as transcription and expression controls in laboratory experiments, they can be used to infer the set of basic cellular functions, and they are essential to characterizing normal and diseased states [31]. One has to keep in mind that in medicine, genetic diseases that are linked to housekeeping genes are more likely to affect multiple organs; in microbiology, housekeeping genes of pathogens play a role in enhancing virulence; and in evolutionary biology studies, housekeeping genes diverge more slowly than other genes [31]. Therefore, their detection may enhance the creation of new drug targets and can also be used for subspecies discrimination.

The study of differences among samples in terms of co-expression is among the most important applications of the OPTricluster algorithm. This feature does not exist in the other three algorithms tested in this study. In the mice response to *Plasmodium* infection for example, OPTricluster was able to identify a set of genes specific to each sample. This group of genes can be used as biomarkers for the differentiation between these samples, thus for the detection of diseases, and subsequently use as targets to design drugs that are specific to diseases at their subtype level.

In the *Arabidopsis* data, OPTricluster identified genes that may only be regulated by one TF, or a combination of two or more TFs. This kind of information can be used to identify transcription networks as we did in this analysis, and subsequently to get more insights between TFs and gene regulation. Figure 7 for example show that only a small number of genes in each or multiple mutant genotypes overlaps across different time points. This observation suggests that most of the genes that participate in SAR have impulse behaviour; different sets of genes are targeted by the corresponding transcription factors at different time points during a response to pathogen infection. On the other hand, these results show that the true behaviour of the underlying biological process is captured at different time points, with each time point containing a unique piece of information that should be integrated in order to get the whole picture underlying the signalling pathway during SAR. Therefore, studies, such as [32], that focus on a single time point to infer genetic information and generalize the results to describe the pathogen-host interaction, could miss out important information.

In the cotyledons dataset, OPTricluster depicted differences in the inner and the outer cotyledons in terms of direction and magnitude of the gene expression level, and thus yielded more biological information relative to the spatial distribution of gene expression at the molecular, cell, tissue, organ, or system level. The analysis of the *Brassica* whole seed dataset showed that OPTricluster can not only be used to study similarities and differences among samples in a 3D short time series, but also, can be used to tackle classical short time-series problems. In several of these studies OPTricluster offered more biological insights to the problem compared to prior computational approaches used to tackle similar problems.

## Conclusions

We developed a subspace clustering algorithm for 3D short time-series gene expression data analysis. The developed algorithm is used to identify statistically and biologically significant clusters from 3D gene expression data, to study similarities and differences between samples in terms of co-expression and/or differential expression. Biological and statistical results obtained showed that OPTricluster is robust to noise and is able to detect the temporal expression profile of relevant functional categories in terms of similarities and differences in samples. The OPTricluster is implemented in Java and it is available as Additional File 1 to this manuscript.

## Methodology

Our objective is to develop a subspace clustering algorithm that is able to cope with the sequential nature of a time-series, along with the noisy nature of a dataset. We also aim at developing an algorithm that can be used to identify biologically significant subspace clusters from a given 3D short time-series gene expression dataset. This would allow one to study similarities and differences between two or more biological samples in terms of temporal expression profile.

### Definitions

*N*genes

*G =*{

*g*

_{ 1 }

*, . . ., g*

_{ n }

*, ..., g*

_{ N }}, a set of

*M*biological samples

*S =*{

*s*

_{ 1 }

*, . . .,s*

_{ m }

*, ..., s*

_{ M }}, and a series of

*L*time points

*T =*{

*t*

_{ 1 }

*, . . ., t*

_{ l }

*, ..., t*

_{ L }}, a 3D microarray gene expression dataset or gene-sample-time (GST) dataset is a real-values

*N × M × L*matrix,

*A =*{

*a*

_{ nml }}. Each entry

*a*

_{ nml }represents the expression level of gene

*g*

_{ n }in sample

*s*

_{ m }at time

*t*

_{ l }. We shall also refer to the 3D data as a set:

*A =*{

*G, S, T*}. In the sequel, we denote the expression level of gene

*g*

_{ n }in sample

*s*

_{ m }across the time points as

*f*

_{ nm }(

*T*), which is a row vector. The expression profile of gene

*g*

_{ n }in all samples and across the time-series

*f*

_{ n }(

*s,T*), which is a 2D matrix. Thus the 3D gene expression matrix can be viewed as a set of 2D matrices in the horizontal plane: Equation 1.

A 3D cluster or tricluster is a 3D submatrix *C =* {*c*_{
ijk
} } of *A*, or a subset *C =* {*I, J, K*}, with *I ⊆ G, J ⊆ S*, and *K ⊆ T*, such that the content of *C =* {*c*_{
ijk
} } (*i*∈*I, j*∈*J*, and *k*∈*K*), verifies a desired pattern: constant, coherent values, or order preserving.

### Problem statement

Given the 3D matrix *A* as defined above and an ordering ranking threshold *δ*, find all triclusters *C =* {*I, J, K*}, with minimum number of genes *I*_{
min
} , minimum number of samples *J*_{
min
} , such that the content of each *C =* {*c*_{
ijk
} } is order preserving over a given segment of a time-series. Since we are dealing with short time-series in this study, we set *K = T*. Thus we are interested in patterns that increase, decrease, or stay constant coherently across the entire time-series experiments. Note that constant patterns and coherent values are subsets of order preserving clusters.

### OPTricluster

Furthermore, there exists a permutation of its columns (1, 4, 3, 2) or (2, 3, 4, 1) such that the values of its rows are monotonic increasing or decreasing function, respectively.

*δ*. Third, it identifies the set of distinct coherent 3D patterns in the 3D dataset. Fourth, triclusters of coherent patterns are formed by assigning genes with similar ranking along the time-dimension and across subsets of samples to the same group, then divergent patterns are identified. Finally, statistical significance and biological evaluation of the triclusters identified are performed.

#### Gene expression matrix quantization

*δ*, for each row (gene) and a given sample, OPTricluster computes

*E =*(

*b*

_{ E }

*- b*

_{0})/

*δ*, where

*b*

_{ 0 }

*= min*(

*f*

_{ nm }(

*T*)),

*b*

_{ E }

*= max*(

*f*

_{ nm }(

*T*)), and

*δ ≠ 0*. Note that

*E*is always rounded to the next integer of

*(b*

_{ E }

*- b*

_{0})/

*δ*. Then the interval [

*b*

_{0},

*b*

_{ E }] is divided into

*E*equal intervals: [

*b*

_{0},

*b*

_{ E }]

*=*[

*b*

_{0},

*b*

_{1}[U

*..*.U [

*b*

_{e- 1},

*b*

_{ e }[U ...U [

*b*

_{e- 1},

*b*

_{ E }], where

*b*

_{ e }

*= b*

_{0}+

*eδ*, and

*e =*1 to

*E*. Finally, a new expression level of the corresponding gene in the considered sample at the given time point is obtained using Equation 3.

Specifically, if the expression level *f*_{
nml
} of the corresponding gene *g*_{
n
} in the given sample *s*_{
m
} and at a given time points *t*_{
l
} falls in the interval [*b*_{e- 1}, *b*_{
e
} [, then it is quantized to the centroid *α*_{
e
} of that interval. Subsequently, each centroid *α*_{
e
} is assigned a ranking order (see below).

Note that, for a given microarray data, if the distribution of the expression levels of most genes in each sample and across the time points are densely located in a certain region, depending on the *δ* value, the quantization scheme defined above will likely put the densely related genes in the same cluster, usually the constant ones. Thus the value of *δ* should be chosen based on the following criteria: distribution of the gene expression data, the desired fold change between the expression levels of the genes across the time series, and based on the expected level of noise in the microarray data. Since the later is difficult to measure in real applications, one can also run the algorithm several times for a given value of *δ*, with perturbation values below and above it, and consider only the clusters that the content does not change a lot.

#### 3D rank expression matrix

*N × M × L*matrix,

*R =*[

*r*

_{ n }(

*s,T*)]

*=*[

*r*

_{ nml }], in which every row along the time-dimension for a given sample is a vector of the ranks of the corresponding expression values in

*A*, in an increasing or decreasing order. For example, if the expression levels of gene

*g*

_{ n }in sample

*s*

_{ m }along the time-dimension is

*f*

_{ nm }(

*T*)

*=*[

*1.5, 3, 0.5*] at

*δ*≤1.0, then, the corresponding row in the rank matrix would be

*r*

_{ nm }(

*T*)

*=*[1-3] in the increasing order. The ranking matrix of the example of an order preserving matrix defined by Equation 2 above will be:

Without lost of generalities, one can easily see from this example (Equation 4) that the rows of the ranking matrix of an order preserving matrix will always be identical. In fact, this property is exploited below for the identification of order preserving patterns from a 3D short time series gene expression matrix. Note that, if more than two entries have the same value, they are given the same ranking. For example [*0.5, 3, 0.5*] would be [*1, 2, 1*]. This approach allows the identification of constant patterns.

There are several advantages associated with this transformation. First, it avoids the use of greedy algorithms, probabilistic approaches, and exhaustive permutations along the time-dimension, and thus speeds up the computation time. Second, it is obvious that for any *k >* 1 rows *f*_{
nm
} (*T*) of similar ranking *r*_{
nm
} (*T*), under any permutation of the time points, their order is always preserved (example in Equation 4). Thus they will always belong to the same coherent tricluster. Since the rank is conserved under any permutation along the time-dimension of the 3D gene expression matrix and given that we are dealing with multiple samples at the same time, the probability that a random pattern might be picked up in a cluster is very low.

#### Identification of distinct patterns

*A*. Given the 3D gene expression matrix

*A =*[

*f*

_{ n }(

*s,T*)

*]*as defined above, with a set of genes

*G =*{

*g*

_{ 1 }

*, . . ., g*

_{ n }

*, ..., g*

_{ N }}, a set of biological samples

*S =*{

*s*

_{ 1 }

*, . . .,s*

_{ m }

*, ..., s*

_{ M }} and a series of time points

*T =*{

*t*

_{ 1 }

*, . . ., t*

_{ l }

*, ..., t*

_{ L }}, we define the biological sample space

*Ω*as the set of all possible combination of the samples, and

*Γ*their number. That is, if

*S =*{

*s*

_{ 1 }

*s*

_{ 2 }

*s*

_{ 3 }} as in Figure 12, then the biological sample space would be

*Ω =*{{

*s*

_{ 1 }

*, s*

_{ 2 }

*, s*

_{ 3 }}, {

*s*

_{ 1 }

*, s*

_{ 2 }}, {

*s*

_{ 1 }

*, s*

_{ 3 }}, {

*s*

_{ 2 }

*, s*

_{ 3 }}, {

*s*

_{ 1 }}, {

*s*

_{ 2 }}, {

*s*

_{ 3 }}} and

*Γ = 7*. Recall that

*r*

_{ n }

*(s,T)*corresponds to a row vector of the ranks of the nth gene, in sample

*s*and across the entire time series

*T*. Hence, for each combination

*Ω*

_{ i }∈

*Ω*, the exact number

*h*

_{ i }of distinct order preserving triclusters that can be found in the 3D dataset is the number of distinct 2D

*r*

_{ n }(

*Ω*

_{ i }

*,T*) matrices of its corresponding 3D ranked matrix

*R*. Thus, the set of 3D distinct order preserving patterns,

*V*, can be identified by considering

*R*as a set of 2D matrices

*r*

_{ n }(

*Ω*

_{ i }

*,T*), that is

*, R =*{

*r*

_{ 1 }(

*Ω*

_{ i }

*,T*)

*, r*

_{ 2 }(

*Ω*

_{ i }

*,T*)

*, ..., r*

_{ n }(

*Ω*

_{ i }

*,T*)

*, ..., r*

_{ N }(

*Ω*

_{ i }

*,T*)}, and identify all distinct

*r*

_{ n }(

*Ω*

_{ i }

*,T*) in it. From the above definitions, one can easily show that the exact number

*Λ*of order preserving triclusters in the 3D gene expression matrix is:

Recall that *h*_{
i
} is the number of distinct 2D *r*_{
n
} (*Ω*_{
i
}*,T*) (rank matrices)corresponding to each *Ω*_{
i
} ∈*Ω* as defined above.

#### Conserved clusters identification

Once the exact number of distinct 3D order preserving patterns has been identified, for each *Ω*_{
i
} ∈*Ω*, OPTricluster assigns each gene to one of the *h*_{
i
} groups by comparing each distinct pattern *v*_{
k
} of *V* (*V*: set of 3D distinct order preserving patterns identified from the previous step) to *r*_{
n
} (*Ω*_{
i
}*,T*), and assign gene *g*_{
n
} to the order preserving tricluster *C*{*k*} each time *r*_{
n
} (*Ω*_{
i
}*,T*) = *v*_{
k
} . This approach is guaranteed to identify all order preserving triclusters of size *I × J × K*, with *I*_{
min
} *≤ I ≤ N*, *J*_{
min
} *≤ J ≤ M*, and *K = L*, where *I*_{
min
} and *J*_{
min
} are the minimum number of genes and samples in a tricluster, respectively.

Since the goal of the OPTricluster algorithm is to study the similarities and differences between samples in terms of the expression profile of all genes, I_{min} and J_{min} should be set to 1. In this case, the algorithm will identify all the conserved clusters and perform comparison between them at the single sample and single gene level in the divergent patterns identification step, as explained below.

#### Divergent patterns identification

*D*can be easily derived from the sets of conserved ones using Equation 5.

*C*{*p*} *=* {*I*_{
p
}*, J*_{
p
}*, K*_{
p
} } and *C*{*q*} *=* {*I*_{
q
}*, J*_{
q
}*, K*_{
q
} } are two conserved triclusters (similar OP patterns). Basically, Equation 5 identifies sets of genes that are co-expressed in the subset of sample in *C*{*p*}, but split and co-expressed differently in one or more samples in *C*{*q*}. For example, if clusters *C*{*p*} and *C*{*q*} have the same OP patterns, and if *C*{*p*} *=* {{*g*_{
1
}*, g*_{
2
}*, g*_{
3
}*, g*_{
4
} }, {*s*_{
1
}*, s*_{
2
} }, {*t*_{
1
}*, t*_{
2
}*, t*_{
3
} }} and *C*{*q*} *=* {{*g*_{
1
}*, g*_{
2
} }, {*s*_{
1
}*, s*_{
2
}*, s*_{
3
} }, {*t*_{
1
}*, t*_{
2
}*, t*_{
3
} }}, then *D*_{
pq
} *=* {{*g*_{
3
}*, g*_{
4
} }, {*s*_{
3
} }, {*t*_{
1
}*, t*_{
2
}*, t*_{
3
} }}, meaning that genes {*g*_{
3
} ,*g*_{
4
} } have different behaviour in {*s*_{
3
} } compared to {*s*_{
1
} ,*s*_{
2
} }. The computational burden of this step is reduced because only triclusters with same OP or ranking patterns are compared. This is due to the fact that ranking patterns are associated with the expression profile and are unique to each cluster for a given subset of samples.

### Statistical significance and complexity analysis

*I*genes and

*J*samples is assessed by computing the tail probability that a random dataset of size

*N × M × L*will contain an order preserving tricluster with

*I*or more genes and

*J*or more samples in it. In principle, the probabilistic description of the reference

*3D*random matrix would be that of the observed noise in the microarray experiment [5, 14, 33]. Since this distribution is difficult to calculate in closed form, the upper bound of this tail probability is estimated using the same approach as in [5, 13, 14, 33]. Assuming that we have a dataset with

*L*time points that are independent and identically distributed according to the uniform distribution, the probability that a random gene-sample supports a given cluster is equal to the number of possible time points permutations or

*1/L!*. Since the genes and samples are assumed to be independent, the probability of having at least

*I*genes and

*J*samples in the cluster is the

*I*-tail of the (

*N,(1/L!))*binomial distribution, i.e.:

*Ls*=

*L!*ways to choose an OP tricluster of size

*L*, the following expression

*Z(I, J, L)*is an upper bound on the probability of having a tricluster of size

*L*with

*I*or more genes and

*J*samples:

We use this bound to estimate the significance of any given tricluster of size *L* with *I* genes and *J* samples. The best tricluster is the one with the largest statistical significance, i.e., the one with the *smallest Z(I, J, L)* Therefore, as long as that upper bound probability is smaller than any desired significance level, the identified tricluster in the real 3D gene expression matrix will be statistically significant.

The overall complexity of the triclustering algorithm is *O*(*NΓΛ*). Recall that the 3D short time-series gene expression data *A* is an *N × M × L* matrix. The 3D rank matrix can be identified within *O*(*NML*). The set of distinct 3D patterns can be identified with *O*(*NΓ*). Finally, the set of coherent conserved triclusters can be identified within *O*(*NΓΛ*). In all, the complexity of the triclustering algorithm is *O*(*NML*) *+ O*(*NΓ*) *+ O*(*NΓΛ*), which is *O*(*N*(*ML + Γ + ΓΛ*)). Note that the complexity for identifying the sets of divergent patterns from convergent ones is negligible. Since *ΓΛ* > *Γ* and *ΓΛ* > *ML*, the overall time complexity is *O*(*NΓΛ*).

## Declarations

### Acknowledgements

This work is jointly supported by Genomics and Health Initiative, Institute for Information Technology, and Plant Biotechnology Institute of the National Research Council Canada. We thank William Leung for implementation of an earlier java version of the OPTricluster core algorithm, and Alex Gawronski for implementation of graphical user interface and tools for cluster merging.

## Authors’ Affiliations

## References

- Androulakis IP, Yang E, Almon RR: Analysis of time-series gene expression data: methods, challenges, and opportunities.
*Annu Rev Biomed Eng*2007, 9: 205–228. 10.1146/annurev.bioeng.9.060906.151904PubMed CentralView ArticlePubMedGoogle Scholar - Gardner TS, Faith JJ: Reverse-engineering transcription control networks.
*Phys Life Rev*2005, 2: 65–88. 10.1016/j.plrev.2005.01.001View ArticlePubMedGoogle Scholar - Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.
*Mol Biol Cell*1998, 9(12):3273–3297.PubMed CentralView ArticlePubMedGoogle Scholar - Arbeitman MN, Furlong EE, Imam F, Johnson E, Null BH, Baker BS, Krasnow MA, Scott MP, Davis RW, White KP: Gene expression during the life cycle of Drosophila melanogaster.
*Science*2002, 297(5590):2270–2275. 10.1126/science.1072152View ArticlePubMedGoogle Scholar - Tchagang AB, Bui KV, McGinnis T, Benos PV: Extracting biologically significant patterns from short time series gene expression data.
*BMC Bioinformatics*2009, 10: 255. 10.1186/1471-2105-10-255PubMed CentralView ArticlePubMedGoogle Scholar - Ernst J, Nau GJ, Bar-Joseph Z: Clustering short time series gene expression data.
*Bioinformatics*2005, 21(Suppl 1):i159-i168. 10.1093/bioinformatics/bti1022View ArticlePubMedGoogle Scholar - Shah M, Jacques Corbeil J: A general framework for analyzing data from two short time-series microarray experiments.
*IEEE/ACM Trans Comput Biol Bioinform*2011, 8(1):14–26.View ArticlePubMedGoogle Scholar - Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture.
*Nat Genet*1999, 22(3):281–285. 10.1038/10343View ArticlePubMedGoogle Scholar - Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns.
*Proc Natl Acad Sci USA*1998, 95(25):14863–14868. 10.1073/pnas.95.25.14863PubMed CentralView ArticlePubMedGoogle Scholar - Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation.
*Proc Natl Acad Sci USA*1999, 96(6):2907–2912. 10.1073/pnas.96.6.2907PubMed CentralView ArticlePubMedGoogle Scholar - Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey.
*IEEE/ACM Trans Comput Biol Bioinform*2004, 1(1):24–45. 10.1109/TCBB.2004.2View ArticlePubMedGoogle Scholar - Tanay A, Sharan R, Shamir R: Biclustering Algorithms: A survey. In
*In Handbook of Computational Molecular Biology*.*Volume 1*. 1st edition. Edited by: Aluru S. New York: Chapman and Hall; 2005:26–27.Google Scholar - Tchagang AB, Pan Y, Famili F, Tewfik AH, Benos P: Biclustering of DNA Microarray Data: Theory, Evaluation, and Applications. In
*In Handbook of Research on Computational and Systems Biology, Interdisciplinary Applications*.*Volume 1*. 1st edition. Edited by: Liu L, Wei D, Li Y, Lei H. Hershey, Pennsylvania: IGI Publishing; 2011:148–186.View ArticleGoogle Scholar - Ben-Dor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: the order-preserving submatrix problem.
*J Comput Biol*2003, 10(3–4):373–384. 10.1089/10665270360688075View ArticlePubMedGoogle Scholar - Tchagang AB, Tewfik AH: DNA microarray data analysis: a novel biclustering algorithm approach.
*EURASIP J Appl Signal Process*2006. Article ID 59809, 12 pages Article ID 59809, 12 pagesGoogle Scholar - Zhao L, Zaki MJ: TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data.
*SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data*2005.Google Scholar - Araujo R, Trielli G, Orair G, Meira W Jr, Ferreira R, Guedes D: ParTriCluster: A Scalable Parallel Algorithm for Gene Expression Analysis. SBAC-PAD'06: Proceedings of the 18th International Symposium on Computer Architecture and High Performance Computing 2006Google Scholar
- Jiang H, Zhou S, Guan J, Zheng Y: gTRICLUSTER: a more general and effective 3 d clustering algorithm for gene sample-time microarray data.
*BioDM'06: Proceedings of the 2006 international conference on Data Mining for Biomedical Applications*2006.Google Scholar - The Gene Expression Omnibus Database[http://www.ncbi.nlm.nih.gov/geo/]
- Bleuler S, Zitzler E: Order preserving clustering over multiple time course experiments.
*EvoWorkshops'05: Proceedings of the 3rd European conference on Applications of Evolutionary Computing*2005.Google Scholar - Syeda-Mahmood T: Order-Preserving Clustering and Its Application to Gene Expression Data.
*ICPR'04: Proceedings of the 17th International Pattern Recognition Conference*2004.Google Scholar - The Java chart library[http://www.jfree.org/jfreechart/]
- Tchagang AB, Gawronski A, Bérubé H, Phan S, Famili F, Pan Y: GOAL: a software tool for assessing biological significance of genes group.
*BMC Bioinformatics*2010, 11: 229. 10.1186/1471-2105-11-229PubMed CentralView ArticlePubMedGoogle Scholar - Cernetich A, Garver LS, Jedlicka AE, Klein PW, Kumar N, Scott AL, Klein SL: Involvement of gonadal steroids and gamma interferon in sex differences in response to blood-stage malaria infection.
*Infect Immun*2006, 74(6):3190–203. 10.1128/IAI.00008-06PubMed CentralView ArticlePubMedGoogle Scholar - Jones JD, Dangl JL: The plant immune system.
*Nature*2006, 444: 323–9. 10.1038/nature05286View ArticlePubMedGoogle Scholar - The DNA Combimatrix 90 K chip[http://www.brassicagenomics.ca/90koligoarray.html]
- Ohlrogge JB: Regulation of fatty acid synthesis.
*Annu Rev Plant Physiol Plant Mol Biol*1997., 48:Google Scholar - Mu J, Tan H, Zheng Q, Fu F, Liang Y, Zhang J, Yang X, Wang T, Chong K, Wang XJ, Zuo J: LEAFY COTYLEDON1 is a key regulator of fatty acid biosynthesis in Arabidopsis.
*Plant Physiol*2008, 148: 1042–1054. 10.1104/pp.108.126342PubMed CentralView ArticlePubMedGoogle Scholar - Braybrook SA, Stone SL, Park S, Bui AQ, Le BH, Fischer RL, Goldberg RB, Harada JJ: Genes directly regulated by LEAFY COTYLEDON2 provide insight into the control of embryo maturation and somatic embryogenesis.
*PNAS*2006, 9(103):3468–3473.View ArticleGoogle Scholar - Baud S, Mendoza MS, To A, Harscoët E, Lepiniec L, Dubreucq B: WRINKLED1 specifies the regulation action of LEAFY COTYLEDON2 towards fatty acid metabolism during seed maturation in Arabidopsis.
*Plant J*2007, 50: 825–838. 10.1111/j.1365-313X.2007.03092.xView ArticlePubMedGoogle Scholar - Eisenberg E, Levanon EY: Human housekeeping genes are compact.
*Trends Genet*2003, 19(7):362–365. 10.1016/S0168-9525(03)00140-9View ArticlePubMedGoogle Scholar - Wang L, Mitra RM, Hasselmann KD, Sato M, Lenarz-Wyatt L, Cohen JD, Katagiri F, Glazebrook J: The genetic network controlling the Arabidopsis transcriptional response to Pseudomonas syringae pv. maculicola: roles of major regulators and the phytotoxin coronatine.
*Mol Plant Microbe Interact*2008, 21: 1408–20. 10.1094/MPMI-21-11-1408View ArticlePubMedGoogle Scholar - Tewfik AH, Tchagang AB, Vertatschitsch L: Parallel identification of gene biclusters with coherent evolution.
*IEEE Trans Signal Process*2006, 54(6):2408–2417.View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.