 Research Article
 Open access
 Published:
Interpolation based consensus clustering for gene expression time series
BMC Bioinformatics volumeÂ 16, ArticleÂ number:Â 117 (2015)
Abstract
Background
Unsupervised analyses such as clustering are the essential tools required to interpret timeseries expression data from microarrays. Several clustering algorithms have been developed to analyze gene expression data. Early methods such as kmeans, hierarchical clustering, and selforganizing maps are popular for their simplicity. However, because of noise and uncertainty of measurement, these common algorithms have low accuracy. Moreover, because gene expression is a temporal process, the relationship between successive time points should be considered in the analyses. In addition, biological processes are generally continuous; therefore, the datasets collected from time series experiments are often found to have an insufficient number of data points and, as a result, compensation for missing data can also be an issue.
Results
An affinity propagationbased clustering algorithm for timeseries gene expression data is proposed. The algorithm explores the relationship between genes using a slidingwindow mechanism to extract a large number of features. In addition, the timecourse datasets are resampled with spline interpolation to predict the unobserved values. Finally, a consensus process is applied to enhance the robustness of the method. Some real gene expression datasets were analyzed to demonstrate the accuracy and efficiency of the algorithm.
Conclusion
The proposed algorithm has benefitted from the use of cubic Bsplines interpolation, slidingwindow, affinity propagation, gene relativity graph, and a consensus process, and, as a result, provides both appropriate and effective clustering of timeseries gene expression data. The proposed method was tested with gene expression data from the Yeast galactose dataset, the Yeast cellcycle dataset (Y5), and the Yeast sporulation dataset, and the results illustrated the relationships between the expressed genes, which may give some insights into the biological processes involved.
Background
In the last two decades, the development of medicine and molecular biology has been significantly improved by DNA microarray technology applications. The technology allows variations in expression levels to be monitored simultaneously for thousands of genes, even in some multiple experiments in which data are collected across various time points. During distinct biological processes, highthroughput data of timeseries gene expression are recorded to explore the complex dynamics of biological systems. The expression data can reveal gene activities in conditional reactions such as cellcycle, disease progression, and response to external stimuli (e.g., drugs and stress).
Analyses of microarray data are essential in several timeseries expression experiments such as biological systems, infectious diseases, and genetic interactions [1]. Pattern recognition techniques are helpful [2] to explore and exploit highthroughput screening data from microarrays. By using these techniques, similar expression patterns can be organized into a group. In addition, phenotypic responses triggered by the production of proteins coded by the expressed mRNAs are assumed to have a causal relationship with the gene expression [3]. Therefore, numerous efficient clustering algorithms have been developed to analyze gene expression data. Some of the older methods such as kmeans, hierarchical clustering, and selforganizing maps [4] are popular for their simplicity. However, the accuracy of these common algorithms is low because of background noise caused by experimental errors and uncertainty of measurement. Gene expression is a temporal process, therefore the relationship between successive time points should be considered. Some algorithms have been designed for timeseries clustering, which involves temporal dependency as a critical factor. Modelbased clustering methods, for example, use statistical and probabilistic models to determine the characteristics of data ([59]). The main advantage of modelbased methods is their tolerance toward experimental errors, including noises and missing values from measurement. Modelbased approaches are based on probabilistic models rather than raw expression values to maximize the likelihood of timeseries expression data. However, modelbased clustering methods require profitable pretraining and, in particular, suffer from computation inefficiency while modeling the gene expression profiles for clustering. In addition, considering that a biological process is a continuous function, the datasets collected from experiments are often found to contain an insufficient number of data points. Therefore, generating these missing data is another problem that researchers need to address [10]. To solve this problem, we used cubic Bsplines [11] interpolation to compensate for missing data in timecourse datasets, which may help disclose some unobserved details.
Affinity propagation [12] is an effective unsupervised clustering scheme. Based on this concept, with some restrictions relaxed, a method called softconstraint affinity propagation (SCAP) [13] has been proposed for use in gene expression clustering.
In this paper, we combined the cubic Bsplines interpolation, affinity propagation, and consensus clustering ([1417]) to analyze the dependency on various time intervals between genes. Using a slidingwindow mechanism, clustering results obtained from affinity propagation were merged for final clustering partition with consensus clustering.
Methods
Here, the proposed algorithm based on Bsplines interpolation [10], affinity propagation [12], and consensus clustering [14] is described. The timecourse gene expression clustering problem was formulated as follows: for a set of genes G={G _{1},G _{2},â€¦,G _{ n }} where n is the number of genes, and each gene G _{ i } includes Ï„ time points for the gene expression values, the n genes are grouped into K disjoint clusters C _{1},C _{2},â€¦,C _{ K }. Based on the clustering, various groups of genes with similar expressions can be identified and organized for further analyses. The framework of the proposed algorithm is shown in Figure 1. First, the original Ï„ time points are doubled to give 2Ï„âˆ’1 points using the cubic Bsplines interpolation algorithm [11]. Second, a slidingwindow mechanism is applied to extract the possible features within the extended timecourse gene expression profiles. Then, affinity propagation is performed to form the gene cluster. Third, a voting mechanism for gene grouping is applied and construction of the generelativity graph is performed. Noted that Steps 23 are corresponding to the construction of the gene relatively graph, which is the time consuming part. Once the graph is settled, we generate 31 grouping results corresponding to 31 different thresholds Ïƒ = 0.50, 0.51, â€¦, 0.80. In Step 4, each of these grouping results is finetuned by the reclustering algorithm to reduce the size of clusters. Step 3 evaluates the SI (Silhouette Index) of each grouping and output the one with the largest SI value finally.
Spline interpolation
Cubic Bsplines were used to represent gene expression curves to obtain a continuous time formulation. After spline interpolating, the curve can be resampled to estimate expression values at any time point; for example, for the case shown in Figure 2, the splineinterpolated gene data are smoother than the original data. However, too many resample points may cause fallacy and inefficiency. In the proposed algorithm, one point in each time interval is resampled, which doubles the number of expression data points to 2Ï„âˆ’1.
Affinity propagation clustering with interval selection
Affinity propagation [12] was proposed by Frey and Dueck in the spirit of massage passing which is approximated following the concepts of beliefpropagation [18]. It chooses the viewpoint of all data points could be exemplars potentially, and takes the measure similarity between pairs of data points as input. Besides, the preference of each data point has also been taken as input to decide whether a data point should be chosen as an exemplar for larger value of preference. The preference can be used to control the number of clusters and is suggested to be the median of the input similarities for a moderate number of clusters.
The traditional clustering algorithm, such as kcenters clustering, generally begins with a set of randomly selected exemplars (i.e., centroid in clusters) and minimizes the error function for convergence iteratively. Compared to kcenters clustering, affinity propagation avoids the drawback of improper exemplars initialization which is far to a correct solution.
To take the temporal relationship and interval between time points into account, we used a slidingwindow mechanism to evaluate all subintervals as features. It should be noted here that, unlike pattern recognition in image processing where the feature selection is employed for dimensionality reduction because of the rich information available in an image, timecourses in gene expression are relatively small. Therefore, we applied the slidingwindow mechanism for informative feature selection so that all possible segments could be explored.
Affinity propagation clustering [12] is selforganized without the need to input the anticipative number of clusters K. Based on this, affinity propagation was applied to group genes in each window while discovering the relationship between genes in different time intervals in unsupervised mode.
For each gene expression of G _{ i } with 2Ï„âˆ’1 time points, the slidingwindow mechanism with size w first generates a series of time patterns, (G _{ i }(t) to G _{ i }(t+wâˆ’1))s, as depicted in Figure 3. A total of 2Ï„âˆ’w time patterns are obtained for each gene.
To measure the similarity of two genes G _{ i } and G _{ j }, the feature C o r r(G _{ i },G _{ j }), which corresponds to the subpattern (G _{ i }(t) to G _{ i }(t+wâˆ’1)), is calculated by the Pearson correlation coefficient as:
where G _{ i }(l) and G _{ j }(l) are the expression values at the lth time point of G _{ i } and G _{ j }, respectively, and \(\overline {G_{i}}\) and \(\overline {G_{j}}\) are the mean values of w expression data of the two genes.
Let CM be the correlation matrix, where entry r _{ ij } is the similarity Corr(G _{ i },G _{ j }). As suggested in [12], we choose the median of those similarities as the preference value for affinity propagation input. For each gene G _{ i }, affinity propagation is applied to assign a categorized label y _{ i } as the most similar exemplar of G _{ i }. After the affinity propagation clustering is completed, the categorized label for each gene can be obtained. Accordingly, an adjacency matrix M is formed with an entry m _{ ij } defined as:
where y _{ i } and y _{ j } are the labels of the genes G _{ i } and G _{ j }, respectively.
Generelativity graph construction
As mentioned, there are (2Ï„âˆ’w) windows; thus, (2Ï„âˆ’w) adjacency matrices M _{1},M _{2},â€¦,M _{2Ï„âˆ’w }, will be constructed to represent the temporal dynamics between genes in the slidingwindow mechanism. Next, the (2Ï„âˆ’w) adjacency matrices are summarized into one consensus matrix \({M_{w}^{c}}\) by simply merging as:
where the entry m _{ ij } in the consensus matrix \({M_{w}^{c}}\) indicates the possibility that the genes G _{ i } and G _{ j } are in the same class.
The window size w can be changed to further observe the relationship between genes. For each w, one consensus matrix is constructed for each window size, so, for l window sizes w _{1}, w _{2}, â€¦, w _{ l }, l corresponding consensus matrices \(M_{w_{1}}^{c}\), \(M_{w_{2}}^{c}\), â€¦, \(M_{w_{l}}^{c}\) are obtained. We also defined an aggregated consensus matrix R as:
where each entry in the aggregated consensus matrix R denotes the probability of two genes, G _{ i } and G _{ j }, appearing in the same class.
Afterwards, a graph is constructed to represent the relationship between genes from the aggregated consensus matrix R, called the generelativity graph P=(G,R). The vertices of the generelativity graph correspond to the genes in G, and the edges indicate the probability that two genes will eventually appear in the same class.
Graph partitioning for class discovery
The generelativity graph P can be used to investigate the relationship between genes. A relativity threshold Ïƒâ‰¥0.5 was chosen to convert the graph P into a binary graph P ^{b}:
where p _{ ij } and \(p_{\textit {ij}}^{b}\) are the edge weights between genes G _{ i } and G _{ j } of P and P ^{b}, respectively.
Next, a depthfirst search algorithm was employed to find connected components of the undirected graph P ^{b}. Suppose L connected components C _{1},C _{2},â€¦,C _{ L }, are found, then these L connected components could be considered to be the L disjoint clusters C _{1},C _{2},â€¦,C _{ L }. However, some edges weighing more than 0.5 but slightly less than the relativity threshold Ïƒ in the generelativity graph P are eliminated from the binary graph P ^{b}, which produces several connected components (i.e., clusters) with a few vertices (i.e., genes) when the depthfirst search algorithm is applied. In our experience, this is a common problem in processing affinity propagation results. Connected components with a few vertices, called subclusters, can influence the clustering results, so, in our algorithm, the current clustering results are rearranged using a refinement process to reduce the influence of noise.
To determine which cluster is minor (subcluster), a threshold parameter Ï† was used to restrict the number of genes in one cluster. That is, a cluster is denoted as subcluster if the number of genes in the cluster is less than Ï†. For each cluster C _{ p }, 1â‰¤pâ‰¤L, the number of genes is compared with the parameter Ï† to identify L ^{â€²} subclusters with \(n_{1}, n_{2}, \ldots, n_{L^{\prime }}\phantom {\dot {i}\!}\) number of genes in each.
Next, the refinement process is applied by merging each L ^{â€²} minor subclusters with Lâˆ’L ^{â€²} major clusters with more than Ï† genes.
Consider each subclusters C _{ q } with the n _{ q } number of genes. If n _{ q } is equal to one, this is a singleton cluster with one gene, say G _{ q }. Our algorithm first identifies the gene G _{ h } that has the highest relativity to G _{ q } in the generelativity graph P. Then, C _{ q } is processed using the following rule:
Rule 1:

(1.1)
If the relativity between G _{ q } and G _{ h } is below 0.5, the merging process stops and C _{ q } survives as a singleton cluster.

(1.2)
If C _{ h } is a major cluster, C _{ q } is merged into C _{ h }.

(1.3)
Otherwise (C _{ h } is a subcluster), C _{ q } and C _{ h } are merged to form a new cluster with n _{ q }+n _{ h } number of genes; and the type, subcluster or major, is checked.
When C _{ q } is not a singleton, the algorithm calculates the mean values \({G_{q}^{m}}\) of these n _{ q } gene profiles. Then, the maximum Pearson correlation coefficient is compared with the mean value \({G_{h}^{m}}\), which belongs to C _{ q } in all major clusters. Next, C _{ q } is merged using the following rule:
Rule 2:

(2.1)
If the Pearson correlation coefficient between \({G_{q}^{m}}\) and \({G_{h}^{m}}\) is below 0, the merging process is stopped.

(2.2)
Otherwise, C _{ q } and C _{ h } are merged.
The above rules are performed repeatedly until no subcluster exists and, finally, the K number of disjoint clusters C _{1},C _{2},â€¦,C _{ K } are obtained.
The proposed algorithm is listed below.
The computational complexity of the proposed algorithm
The main idea of our proposed clustering algorithm is to collect enough consensus votes so as to estimate the matching probability between each pair of genes. So we construct the gene relativity graph in Step 2. Within this step, the affinity propagation procedure will be called several times, normally the number is \(\sum _{w=l}^{u}(2w\tau)\), where \(l=\lfloor \frac {2\tau  1}{2}\rfloor \) and \(u = \lfloor \frac {3(2\tau  1)}{4}\rfloor \). Since the AP algorithm has a quadratic complexity [19]. Thus, the computational complexity of Step 2 is O(n ^{2})O(Ï„ ^{2}), where n is the number of genes and Ï„ is the number of time points. Consider Step 3. The kernel is to discover the partition (the set of connected components) using the depthfirst search for each threshold, thus the computational complexity of Step 3 is 31Ã—O(E), where E is the number of edges in the relativity graph. Since E=O(n ^{2}), the computational complexity of Step 3 is O(n ^{2}). Notice that, the optimal relativity threshold can be located (Step 33) with the help of the Silhouette index. As for the refinement step (Step 4), we iteratively apply both the Rules 1 and 2 to merge each of smallsize clusters with some of the major clusters. Let c denote the number of connected components. In each iteration we need O(c ^{2}) for merging. Thus, the computational complexity of Step 4 is O(c ^{2}) because the number of iterations is 500 in our settings. Therefore, the overall computational complexity is O(n ^{2} Ï„ ^{2})+O(c ^{2}), where n is the number of genes, Ï„ is the number of time points, and c is the number of initial clusters. Notice that, the number c is much small than n generally.
Extended version of the proposed algorithm
As mentioned, in some datasets, the collection of data points is insufficient. To alleviate the possible unsatisfactory performance of the proposed algorithm because of the lack of information of short timeseries (Ï„<10), an extended version of our algorithm was proposed.
Recall that the degree of investigation of the relationship between genes is influenced by the window size w and the number of timepoints Ï„ because the number of times affinity propagation clustering is repeated is decided by w and Ï„.
For Ï„(Ï„<10) time points, the window size is w and the number of times for affinity propagation would be (2Ï„âˆ’w). To increase the degree of investigation of the relationships between genes, combinations \(C_{w}^{\tau }\) of subfeatures selection were suggested. That is, w time points are chosen randomly from Ï„ time points as feature vectors for the affinity propagation clustering. As depicted in Figure 4, the feature is a subset instead of a segment. Because \(C_{w}^{\tau }\) is more than (2Ï„ âˆ’ w), the relationships between genes are more precise and clustering accuracy is enhanced. However, the efficiency of the algorithm will decrease because extra affinity propagations are performed. Therefore, we strongly suggest that the extended version of the algorithm is executed only when analyzing short timeseries datasets (Ï„<10).
Prediction of optimal threshold values
As discussed before, Steps 23 are corresponding to the construction of the gene relatively graph P=(G,R). The vertices of the graph correspond to the genes in G, and the edges indicate the probability that two genes may appear in the same class. And for each relativity threshold Ïƒ, a depthfirst search algorithm was employed to find connected components, which correspond to the initial clustering for Ïƒ. Thus, we adopt these 31 initial clusters to predict the optimal threshold value using the Silhouette index. And Step 4 (reclustering) is excluded in the determination because it is unstable in our experience. We mainly trust on the probability matrix R.
Results and discussions
The adjusted Rand index and the Silhouette index are adopted to judge the clustering accuracy of our algorithm are described. Then, several real gene expression datasets were used to evaluate the proposed algorithm. Finally, the parameter settings and the performance evaluation are discussed.
Measure of agreement
Adjusted Rand index
The adjusted Rand index (Hubert and Arabie [20]; Yeung and Ruzzo [21]) is a popular similarity measure of agreement between two partitions against external criteria. Using this validity index, the partition U={u _{1},u _{2},â€¦,u _{ R }} given by the clustering result and V={v _{1},v _{2},â€¦,v _{ C }} a priori classification can be compared. The adjusted Rand index (ARI) is defined as:
where n _{ ij } is the number of objects in clusters u _{ i } and v _{ j }, and n _{ iâˆ—} and n _{âˆ—j } are the number of objects in clusters u _{ i } and v _{ j }, respectively. From this definition, ARI gives the value [0, 1] to assess the degree of agreement. High values indicate that U is more similar to V; in particular, the value 1 indicates absolute agreement between the partitions U and V.
Sensitivity, specificity, Jaccard index and Minkowski measure
Schliep et al. [7] suggested the use of sensitivity and specificity as complementary to the use of ARI. Let TP denote the number of pairs in the same cluster in U and same priori classification class in V; FP denote the number of pairs in the same cluster in U and distinct class in V; FN denote distinct cluster in U and same class in V, TN denote distinct cluster in U and distinct class in V. The index sensitivity is defined as \(\frac {\#TP}{\#TP + \#FN}\) and specificity defined as \(\frac {\#TP}{\#TP + \#FP}\).
The Jaccard index JC is defined as \(\frac {\#TP}{\#TP + \#FN + \#FP}\), another agreement between U and V, while, Minkowski measure illustrates the proportion of disagreements to # T P+# F N. Noted that all indices except ARI (or RI), do not involve the term TN, because this term would dominate the other three terms in both good and bad solutions [22].
Silhouette, Dunnâ€™s and DavisBouldin index
The Silhouette index (Rousseeuw [23]) is an internal cluster validity index that is used when true class labels are unknown. With a clustering solution C, the Silhouette index is used to judge the quality and to determine the proper number of clusters within a dataset. For each data point i, the Silhouette index is defined as:
where a(i) is the average dissimilarity of i with other data points in the same cluster, and b(i) is the minimum average dissimilarity of i with other data points in other clusters. The Silhouette index s(C) is the average of the Silhouette width for all data points the value [1, 1] reflects how appropriately the data has been clustered. A high Silhouette index indicates a good clustering result, which indicates the data are classified appropriately.
The Dunnâ€™s validation index (DI) is defined as
where Î´(u _{ i },u _{ j }) defines the distance between clusters u _{ i } and u _{ j } (intercluster distance); Î”(u _{ k }) represents the intracluster distance of cluster u _{ k }, and R is the number of clusters of partition U. Likewise, the large values DI correspond to good clusters.
The DaviesBouldin index (DBI) is defined as
Unlike SI and DI, intercluster distance Î´(u _{ i },u _{ j }) is in the denominator [24], small values of DBI correspond to compactness, thus, the cluster configuration that minimizes DBI indicates a good clustering result.
Besides, some recent and effective measures ([2527]) of the reliability of the clusters can be used to evaluate the results as well.
Timeseries datasets
Yeast galactose dataset
The Yeast galactose dataset (Ideker et al. [28]) was built to study integrated genomic data and is composed of gene expression measurements for 205 genes involved in galactose use in Saccharomyces cerevisiae. The gene expression profiles were measured with four replicate assays across 20 time points (20 perturbations in the galactose pathway) and the genes have been annotated in four functional categories in the Gene Ontology (Ashburner et al. [29]) listings. For external validation, we chose the mean of four replicates at each time point to judge the clustering results against the four published functional categories.
Yeast cellcycle dataset
The Yeast cellcycle dataset (Cho et al. [30]) includes more than 6000 yeast genes and their expression levels measured during two cell cycles at 17 time points. Schliep et al. [7] used a subset of the Yeast cellcycle dataset, called Y5, and identified the peak time points of 17 time points as the 5phase of the cellcycle. The Y5 dataset consists of 384 genes that are all annotated with five phases: early G1 (G1E), late G1 (G1L), S, G2, and M. We used the standardized expression levels to enhance the performance of our algorithm and compared the results with other studies. Figure 5 and Figure 6 represent the gene expression profiles for the Yeast cellcycle dataset (Y5). Different peak time points can be seen for the five annotated phases in Figure 5.
The Yeast sporulation dataset
The Yeast sporulation dataset (Chu et al. [31]) contains the expression levels of more than 6000 genes measured during the sporulation process of budding yeast across seven time points (0, 0.5, 2, 5, 7, 9, and 11 h). Genes with missing expression values and genes that showed no significant changes in expression during the process were excluded from the experimental analyses (Bandyopadhyay et al. [32]). The final dataset consisted of 474 genes for which no annotation could be assigned. Figure 7 represents the gene expression profiles for the Yeast sporulation dataset.
Parameter settings
Here, we discuss the parameter settings used in our algorithm for the experimental evaluation. The similarities s for affinity propagation were chosen as the similarities between gene expression profiles based on the Pearson correlation. The preferences p were chosen as the medians of the similarities s. In addition, the window size w and the number of windows l are tradeoff between clustering accuracy and algorithm efficiency. Smaller window sizes could miss the dynamic of the temporal gene expression while larger ones could decrease the number of votes, making the relationships between genes difficult to determine. The parameter relativity threshold Ïƒ for graph partitioning ranged from 0.5 to 0.8 in the experimental evaluation.
Choosing proper window sizes
In our implementation we prefer to apply a large amount of windows (O(Ï„ ^{2}), Ï„ is the number of points) to collect enough consensus votes to reach the precise probability between each pair of genes. Thus, we recommend that the proper window sizes w range from \(\lfloor \frac {2\tau  1}{2}\rfloor \) to \(\lfloor \frac {3(2\tau  1)}{4}\rfloor \) where Ï„ is the number of time points for the gene expression values being analyzed, and the appropriate number of windows l is more than or equal to \(\lceil \frac {2\tau  1}{4}\rceil \). Please note that the small window size might not yield a suitable clustering because insufficient information may let the affinity propagation procedure perform unstable (that is, large variation of the number of clusters). Thus, some bad votes will be included in the aggregated consensus matrix. In our experience, especially in the case of yeast sporulation, expanding the window size is necessary.
Experimental results
For datasets which have been annotated, the clustering results are compared with the adjusted Rand index, specificity and sensitivity; otherwise the Silhouette index and DaviesBouldin index are used for validation. As mentioned before, in the implementation we adopt 31 partitions to predict the optimal threshold value using the Silhouette index. Our program are coded in MATLAB, been running on Intel Corei7(3.33 GHz) with 8GB memory using Windows 7 64bit.
The Yeast galactose dataset
Consider Figure 8. As the indication in the curve of SI, our algorithm has a maximum value of the adjusted Rand index of 0.92576 with relativity threshold 0.67 (the leftmost point with SI value 0.75451). All of the results show that the true number of clusters are captured as the four clusters in the dataset which is annotated as four GO categories. Please note that we do not perform the spline interpolation because this dataset is simple. And the range of window size is set as (10, 15), our regular setting.
We compare our results with other methods in the literatures as shown in Table 1. Four experimental evaluation, including CRF [9], CORE [33], Ng et al. [8], and Yeung et al. [5]. This dataset is quite simple thus most of the algorithms perform well, ARI > 0.925.
The Yeast cellcycle dataset (Y5)
In Figure 9(a), the ARI and SI plots are varying significantly even if we have built the aggregated consensus matrix of probabilities of pairs of two genes being in the same group through the collection of enough votes. Only two thresholds 0.67 and 0.68 let the ARI of value 0.60789. So, for this tough case, our algorithm exhaustively checks 31 partitions (corresponding to 31 thresholds, 0.50, 0.51, â€¦, 0.80) to locate the threshold with the help of SI. Here the largest SI value is 0.24416, thus we have the corresponding ARI value 0.57113. After interpolation with 33 time points, the range of window size is set as (16, 24), our regular setting.
In Figure 9(b) and 9(c), the other five measurements: sensitivity, specificity, Rand, Jaccard and Minkowski are depicted.
Compare with the results of other methods without partial learning of labeled data including kmeans, cubic splines [10], different HMM model [7], CRF [9], and Chiu et al. [34]. The contrasts between these methods are listed in Table 2.
Consider Table 2. The performance of our algorithm is superior to kmeans algorithm which needs the number of clusters as parameter for input, and also outperforms the Splines model which using spline curves to represent gene expression profiles [10] and our early works (Chiu et al. [34]). Furthermore, the performances of probabilistic sequence data models such as HMM (Hidden Markov models) [7] and CRF (conditional random fields) [9] are also inferior to ours. We also applied six different ranges of window size, (15, 23), (16, 24), â€¦, and (20, 28) to demonstrate the average performance. The corresponding ARI values (fixed by SI values as discussed in Figure 9(a)) are 0.58330, 0.57113, 0.52061, 0.51881, 0.40816 and 0.53399, the average value is 0.52267.
Even though, in contrast to the results derived from the Yeast galactose dataset, our best result of 0.57113 of the adjusted Rand index is not looking good. To take a closer look on the phenomenon, we demonstrate the clustering result with five groups derived from our algorithm in Figure 10. Due to the ambiguities of the peak time points as annotation among the five groups in the Yeast cellcycle dataset (Y5), the clustering result can not be improved to higher values. Besides, the outliers and noises of the dataset also impact on the clustering accuracy even if our algorithm applies voting mechanism for diminishing the influence of noises.
The relationship between normalized gene expression values of the genes and the time points for each cluster are indicated by cluster profile plots. The best clustering results of cluster profile plots evaluated by our proposed algorithm are shown in Figure 11 and Figure 12 (green). In Figure 11, the clusters with 33 time points are extracted from original Yeast cellcycle dataset (Y5) by using the spline interpolation. In addition, the original 17 time points of Y5 are depicted in Figure 12.
The clustering performance can be observed by the gene expression of each cluster shown in Figure 11 and Figure 12 from (a) to (e). The first (a), second (b) and fifth (e) clusters are almost perfectly separated. The mean values and the confidence intervals (95%) are plotted as red lines.
The Yeast sporulation dataset
Due to the lack of annotations on the Yeast sporulation dataset, we use two internal validity measurements, i.e., the SI and DBI to judge the clustering performance. Amongst all 31 relativity thresholds, our algorithm has a maximum value of the Silhouette index of 0.72923 as shown in Figure 13. To take a closer look at the results, a lower Ïƒ causes distinct groups combined into a larger group including all 474 genes. The effect is caused by the insufficient number of time points which impacts the number of votes for investigating the relationships between genes. Noted that there are only seven time points on the Yeast sporulation dataset. As discussed in the windowsize setting, for this Yeast sporulation dataset, we have to expand the window size to prevent some bad votes been included in the aggregated consensus matrix. The range of window size is suggested as (9, 12), not (6, 9) the regular setting.
To resolve the problem with short timeseries gene expression datasets, Ernst et al. present an algorithm to analyses and retain significant gene expression profiles (Ernst et al. [35]). In this paper, we use our extended version of proposed algorithm for choosing more combinations from interval selection as the number of times to apply affinity propagation in order to increase the number of votes for investigating the relationship between genes. Our extended algorithm has a better value of the Silhouette index 0.75976 depicted in Figure 14.
Compare with the results of other methods including SiMMTS [32], Chiu et al. [34], VGA [36], SOM [4], and CRC [37], our result shows better performance, which are shown in Table 3.
In the implementation of the extended version of our algorithm, the execution time may rapidly increase because the number of combinations grows exponentially. Our suggestion is applied this version on those datasets with small size of time points, say 10 at most.
We demonstrate the clustering result with five groups derived from our algorithm in Figure 15. The clustering performance of our proposed algorithm for the Yeast sporulation dataset with the original 7 time points in Figure 16.
The statistics of the execution time for three dataset is shown in Table 4. A large amount of features extracted using our slidingwindow mechanism can help to enhance the accuracy practically. Even though the affinity propagation is performed many times, the execution time of each dataset is appropriate. For the case of Y5 dataset, our interpolation version costs 177.851 seconds; the interpolation version of the Yeast sporulation dataset of 13 points costs 32.803 seconds; and for the Yeast galactose dataset, the execution time is 27.243 seconds. Note that the maximum number of iterations is 1000, if the AP procedure cannot converge.
Conclusion
In this paper, we present an unsupervised clustering algorithm to analyze timeseries gene expression data, which requires no prior knowledge, such as the number of clusters or the cluster exemplars (centroids). The algorithm combines affinity propagation and consensus clustering with various intervals of time points, which provided progressive robustness and accuracy by overcoming the interference from background noises and experimental errors. Besides, the interactions between genes across distinct time points were investigated by interval selection based on a slidingwindow mechanism.
Because of the efficiency of affinity propagation, the proposed algorithm provides appropriate and effective analyses of timeseries gene expression data. Based on three real gene expression datasets, our algorithm significantly outperformed other methods when the same datasets were used in the evaluation. The experimental results on the Yeast galactose dataset, the Yeast cellcycle dataset (Y5), and the Yeast sporulation dataset, confirmed that our method can successfully illustrate biological relevance between the expressed genes.
In the future development of our method, we aim to integrate the problem of absent features at some time points, which is a critical issue in bioinformatics and machine learning. The standard treatments for absent features such as Zero (missing values set to zero), Mean (set to the mean value of the feature over all the data), and kNN (set with the average value obtained from the K nearest neighbors) in timeseries data neglect the temporal dependence, causing improper results. We also aim to improve the semisupervised clustering analyses, which are currently affected by the incompleteness of the gene annotations. By combining unannotated data with the small amount of annotated data that is available, we expect to see considerable improvements in the clustering accuracy of our method.
Availability
The program is freely available for nonprofit use on request at https://vc.cs.nthu.edu.tw/tychiu/Programs/MainProgram.7z.
References
BarJoseph Z. Analyzing time series gene expression data. Bioinformatics. 2004; 20(16):2493â€“503.
de Ridder D, de Ridder J, Reinders MJT. Pattern Recognition in Bioinformatics. Brief Bioinformatics. 2013; 14(5):633â€“47.
Androulakis IP, Yang E, Almon RR. Analysis of timeseries gene expression data: methods, challenges, and opportunities. Annu Rev Biomed Eng. 2007; 9:205â€“28.
Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, et al. Interpreting patterns of gene expression with selforganizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A. 1999; 96(6):2907â€“912.
Yeung KY, Medvedovic M, Bumgarner RE. Clustering geneexpression data with repeated measurements. Genome Biol. 2003; 4:R34.
MedvedovicM, Yeung KY, Bumgarner RE. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics. 2004; 20(8):1222â€“32.
Schliep A, Costa IG, Steinhoff C, SchÃ¶nhuth A. Analyzing gene expression timecourses. IEEE/ACM Trans Comput Biol Bioinf. 2005; 2(3):179â€“93.
Ng SK, McLachlan GJ, Wang K, Jones LBT, Ng SW. A mixture model with randomeffects components for clustering correlated geneexpression profiles. Bioinformatics. 2006; 22(14):1745â€“52.
Li CT, Yuan Y, Wilson R. An unsupervised conditional random fields approach for clustering gene expression time series. Bioinformatics. 2008; 24(21):2467â€“73.
BarJoseph Z, Gerber G, Gifford DK, Jaakkola TS, Simon I. A new approach to analyzing gene expression time series data. In: Proc. Sixth Ann. Intâ€™l Conf. Computational Biology (RECOMB 02): 2002. p. 39â€“48.
Luan Y, Li H. Clustering of timecourse gene expression data using a mixedeffects model with bsplines. Bioinformatics. 2003; 19(4):474â€“82.
Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007; 315(5814):972â€“976.
Leone M, Sumedha WM. Clustering by softconstraint affinity propagation. Bioinformatics. 2007; 23(20):2708â€“15.
Monti S, Tamayo P, Mesirov J, Golub T. Consensus clustering: a resamplingbased method for class discovery and visualization of gene expression microarray data. Mach Learn. 2003; 52(12):91â€“118.
Swift S, Tucker A, Vinciotti V, Martin N, Orengo C, Liu X, Kellam P. Consensus clustering and functional interpretation of geneexpression data. Genome Biol. 2004; 5:R94.
Yu Z, Wong H, Wang H. Graph based consensus clustering for class discovery from gene expression data. Bioinformatics. 2007; 23(21):2888â€“96.
Avogadri R, Valentini G. Fuzzy ensemble clustering based on random projections for DNA microarray data analysis. Artif Intell Med. 2009; 45(2):173â€“83.
Yedidia JS, Freeman WT, Weiss Y. Constructing freeenergy approximations and generalized belief propagation algorithms. IEEE Trans Inf Theory. 2005; 51(7):2282â€“312.
Zhang X, Wang W, NÃ¸rvag K, Sebag M. KAP: Generating Specified K Clusters by Efficient Affinity Propagation. In: Proceedings of the 10th IEEE International Conference on Data Mining: 2010. p. 1187â€“92.
Hubert L, Arabie P. Comparing partitions. J Classif. 1985; 2(1):193â€“218.
Yeung KY, Ruzzo WL. Principal component analysis for clustering gene expression data. Bioinformatics. 2001; 17(9):763â€“74.
Jiang D, Tang C, Zhang A. Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data Eng. 2004; 16(11):1370â€“86.
Rousseeuw P. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987; 20(1):53â€“65.
Bolshakova N, Azuaje F. Cluster validation techniques for genome expression data. Signal Process. 2003; 83(4):825â€“33.
Bertoni A, Valentini G. Discovering multilevel structures in biomolecular data through the Bernstein inequality. BMC Bioinformatics. 2008; 9:(Suppl 2):S4.
Valentini G. Mosclust: a software library for discovering significant structures in biomolecular data. Bioinformatics. 2007; 23(3):387â€“9.
Smolkin M, Gosh D. Cluster stability scores for microarray data in cancer studies. BMC Bioinformatics. 2003; 4:36.
Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Bioinformatics. 2001; 292(5518):929â€“34.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25(1):25â€“9.
Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, et al. A genomewide transcriptional analysis of the mitotic cell cycle. Mol Cell. 1998; 2(1):65â€“73.
Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, et al. The transcriptional program of sporulation in budding yeast. Science. 1998; 282(5389):699â€“705.
Bandyopadhyay S, Mukhopadhyay A, Maulik U. An improved algorithm for clustering gene expression data. Bioinformatics. 2007; 23:2859â€“65.
Tjaden B. An approach for clustering gene expression data with error information. BMC Bioinformatics. 2006; 7:17.
Chiu TY, Hsu TC, Wang JS. Apbased consensus clustering for gene expression time series. In: IEEE International Conference on Pattern Recognition: 2010. p. 2512â€“5.
Ernst J, Nau GJ, BarJoseph Z. Clustering short time series gene expression data. Bioinformatics. 2005; 21(Supp11):i159i168.
Maulik U, Bandyopadhyay S. Fuzzy partitioning using a realcoded variablelength genetic algorithm for pixel classification. IEEE Trans Geosci Remote Sensing. 2003; 41(5):1075â€“81.
Qin ZS. Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics. 2006; 22(16):1988â€“97.
Acknowledgements
This research is based on many authorsâ€™ previous works related to gene expression clustering. Thanks a lot for their contributions on our research. And Thanks K. Thangavel for providing the Yeast Saccharomyces cerevisiae dataset for us.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authorsâ€™ contributions
TYC and JSW began this work about several years ago. The framework including slidingwindow mechanism, affinity propagation, consensus clustering was composed, discussed and implemented. The contributions of TCH and CCY are on the improvement with spline interpolation and more experiments. All authors read and approved the final manuscript.
Rights and permissions
Open Access Â This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the articleâ€™s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the articleâ€™s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Chiu, TY., Hsu, TC., Yen, CC. et al. Interpolation based consensus clustering for gene expression time series. BMC Bioinformatics 16, 117 (2015). https://doi.org/10.1186/s1285901505410
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901505410