Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data

Background Visualization of DNA microarray data in two or three dimensional spaces is an important exploratory analysis step in order to detect quality issues or to generate new hypotheses. Principal Component Analysis (PCA) is a widely used linear method to define the mapping between the high-dimensional data and its low-dimensional representation. During the last decade, many new nonlinear methods for dimension reduction have been proposed, but it is still unclear how well these methods capture the underlying structure of microarray gene expression data. In this study, we assessed the performance of the PCA approach and of six nonlinear dimension reduction methods, namely Kernel PCA, Locally Linear Embedding, Isomap, Diffusion Maps, Laplacian Eigenmaps and Maximum Variance Unfolding, in terms of visualization of microarray data. Results A systematic benchmark, consisting of Support Vector Machine classification, cluster validation and noise evaluations was applied to ten microarray and several simulated datasets. Significant differences between PCA and most of the nonlinear methods were observed in two and three dimensional target spaces. With an increasing number of dimensions and an increasing number of differentially expressed genes, all methods showed similar performance. PCA and Diffusion Maps responded less sensitive to noise than the other nonlinear methods. Conclusions Locally Linear Embedding and Isomap showed a superior performance on all datasets. In very low-dimensional representations and with few differentially expressed genes, these two methods preserve more of the underlying structure of the data than PCA, and thus are favorable alternatives for the visualization of microarray data.

Background DNA microarrays allow the measurement of transcript abundances for thousands of genes in parallel. Applications in quality assessment and interpretation of such high dimensional data by clustering [1,2] and visualization [3,4] make use of algorithms that reduce its dimension. Two and three dimensional visualizations are often a good way to get a first impression of properties or the quality of a dataset or of special patterns within the data by showing clusters such as diseased and healthy patients, revealing outliers, a high level of noise or to generate hypotheses for further experimentation [5][6][7][8]. In general, there are two different approaches to reduce a datasets' dimension.
Feature selection methods [9][10][11] compute a ranking on all genes by means of some given score and pick a gene subset based on this ranking. Feature extraction methods define a mapping between the high-dimensional input space and a low-dimensional target space of a given dimension. Both methods are used in machine learning concepts. Most classification algorithms use many or all features in a complex (nonlinear) manner whereas approaches like [12,13] are based on the relative expression of only two or three genes to overcome the "black box" character of the other classifiers. So they allow an easy traceability of the genes leading to the classification result. On the other hand, applications like the visualization of high-dimensional data may profit from extracting information from all features. This results in feature extraction methods usually being more suited for lowdimensional representations of the whole data. In the following, we refer to feature extraction methods when speaking of dimension reduction techniques.
Considering visualization, these kind of mappings are often unsupervised, because they don't use further information of the data like class labels and allow an unbiased view of the structure within the data. Supervised methods are more applicable to improve classification or regression procedures, assuming that less non-differential or noisy features are reduced after the mapping.
All features, that are related to special properties of the data or a separation into classes or clusters, often lie in a subspace of a lower (intrinsic) dimension within the original data. A 'good' dimension reduction technique should preserve most of these features and generate data with similar characteristics like the high-dimensional original. For example, classifications should work at least as well on the low-dimensional representation and clusters within the reduced data should also be found, preferably more distinct. Principal Component Analysis (PCA) is a widely used unsupervised method to define this mapping from high-to low-dimensional space. Availability of large datasets with high-dimensional data, especially in biological research (e.g. microarrays), led to many new approaches in the last years.
Other studies, that deal with the assessment of dimension reduction techniques, either compare them against the background of classification [14][15][16][17][18], and hence mainly discuss supervised methods like Partial Least Squares [19,20], Sliced Inverse Regression [21] or other Regression models [22], or come from Computer Vision and deal with text, image, video or artificial data like the Swiss Roll [23][24][25][26][27][28]. This study instead, focuses on microarray data and its two and three dimensional visualization. We compare PCA to six recent unsupervised methods to find out if and under which conditions they are able to outperform PCA. In the following sections, we describe a benchmark, consisting of classifications and cluster validations, to compare the visualization performance of seven dimension reduction techniques on ten real microarray and several simulated datasets. After some technical details in the methods section, we present and discuss all results, based on one representative dataset. Further details of the other nine datasets are available in the supplement.

Dimension Reduction
Seven unsupervised dimension reduction techniques were compared within this study: Principal Component Analysis (PCA), Kernel PCA (KPCA), Isomap (IM), Maximum Variance Unfolding (MVU), Diffusion Maps (DM), Locally Linear Embedding (LLE) and Laplacian Eigenmaps (LEM). These dimension reduction techniques can be divided into two groups: linear and nonlinear methods. While PCA belongs to the former, due to a linear combination of the input data, the other six methods were designed with respect to data lying on or near a nonlinear submanifold in the higher dimensional input space and perform a nonlinear mapping.
Given an input space ℝ D and target space ℝ d (with d <<D) let X ∈ ℝ N×D be an input dataset of N samples and D features (gene expression values) and Y ∈ ℝ N×d its lowdimensional representation. A dimension reduction technique is a mapping F: ℝ D ℝ d that optimizes a cost function ∈ : R d ℝ on the target space. This problem can often be reduced to an eigenvalue problem, whose eigenvectors will define the embedding Y .

Principal Component Analysis
Principal Component Analysis (PCA) [29,30] builds a new coordinate system by selecting those d axes w 1 , . . . , w d ∈ ℝ D , which maximize the variance in the data: arg max arg max , var w 2 , . . . , w d are chosen in the same way, but orthogonal (independent) to each other (here, C ∈ ℝ DxD denotes the covariance matrix of the data X). So, the principal components p i = Xw i explain most of the variance in the data. Before mapping the data, the samples in X were centered by subtracting their mean. Since PCA only considers the variance among samples, it works best if those features, that are relevant for class labeling, account for a large part of the variance. Sometimes, the first two or three principal components are not sufficient for a good representation of the data [26]. This can lead to a high target dimensionality and prevent a well suited visualization. Furthermore, the covariance matrix grows rapidly for high-dimensional input data. To overcome this issue, we substituted the covariance matrix by the matrix of squared Euclidean distances

Kernel PCA
To make PCA more suitable for nonlinear data, Kernel PCA (KPCA) maps the data into a higher dimensional feature space before applying the the same optimization as PCA. [32,33]. The mapping can be done implicitly by using a kernel function. The Gaussian kernel was applied in our study.

Isomap
Isomap (IM) [27,28], a nonlinear modification of Multidimensional Scaling [34], preserves the global structure of the input data in its low-dimensional representation. This is done by constructing a neighborhood graph G, weighted by shortest geodesic distances D G ∈ ℝ N×N between all k nearest neighbors. This way, Isomap captures paths along a nonlinear manifold instead of the direct Euclidean distance. The embedding into the lowdimensional space is done by selecting y 1 , . . . , y d ∈ ℝ N such that being the pairwise distance matrix of neighbors y i , y j in the target space. Previous work in [23] addressed problems in visualizing datasets consisting of several well separated clusters. Since Isomap is known to suffer from holes in the underlying manifold [14], it is suggested to modify the method by selecting k 2 nearest and k 2 farthest neighbors when constructing the graph, instead of the k nearest neighbors. Both, IM and IM(mod), will be discussed in the results section.

Maximum Variance Unfolding
Similar to Isomap, Maximum Variance Unfolding (MVU) [25,26] preserves the distances among k nearest neighbors by means of a neighborhood graph G. But it varies in considering squared Euclidean distances between two neighbored samples, instead of geodesic distances and in maximizing the Euclidean distance between all points y i , y j in the target space (to 'unfold' the data) while preserving the distances in the neighborhood graph. This leads to the optimization problem.
Based on the same concept, MVU shares some weaknesses with Isomap like suffering from erroneous connections in the graph.

Diffusion Maps
Diffusion Maps (DM) [35,36] start with building a graph G as well, but differ in weighting the edges by the Gaussian kernel function: With the rows being normalized byŴ can be seen as a Markov Matrix that defines the probability to move from one sample to another in one time step. The transition probability for t time steps, denotedˆ( ) W t , is given byŴ t . It can be used to control the local connections among neighbored samples. Here, we set it to t = 1. Diffusion Maps retain a weighted L 2 distance, the leads to stronger weighting of samples from dense areas in the graph.
Since the diffusion distance between two points is computed over all possible paths in the graph, Diffusion Maps are more robust to noise.

Locally Linear Embedding
Unlike Isomap and MVU, Locally Linear Embedding (LLE) [24,37] attempts to preserve local properties of the data. Each sample x i is represented by a linear combination of its k nearest neighbors: . The weights W ∈ ℝ N×N are estimated by minimizing the reconstruction error the low-dimensional representation that best preserves the weights in the target space is chosen.

Laplacian Eigenmaps
As well as LLE, Laplacian Eigenmaps (LEM) [38,39] are a local technique. Similar to Diffusion Maps, this method first constructs a neighborhood graph, weighted with values W (i, j) from the Gaussian kernel function. By minimizing a cost function for neighbored y i , y j (W (i, j) = 0 otherwise), the distances between the low-dimensional representations are minimized and nearby samples x i , x j are highly weighted, and thus brought closer together. This way, Laplacian Eigenmaps implicitly enforces natural clusters in the data.

Benchmark
Our benchmark ( Figure 1) is divided into three parts. First, the studied dimension reduction methods were applied to the complete dataset. The low-dimensional datasets were then assessed by two different approaches, namely classification and cluster validation. To evaluate and compare the performance of each method, the classification accuracies of Support Vector Machines [40] (with Gaussian kernel) and the compactness and distance of clusters within the low-dimensional representations were used. In the following, each step of our benchmark is described in detail.

Datasets
The methods were tested on ten published microarray datasets as well as on simulated data. Each published dataset was divided into two classes according to a binary variable corresponding to the samples' disease status, the presence of certain molecular mutations or other sample characteristics as shown in Table 1. Since microarray data is technically provided with a more or less high level of noise, we reran the benchmark on the microarray datasets combined with normally distributed noise with zero mean and an increasing variance between 0 and 0.1. Before adding noise, all data was scaled to values between 0 and 1 to overcome the varying means and standard deviations of the datasets.  The simulated data is based on a 50 sample dataset whose 10.000 gene expression values are normally distributed with zero mean and standard deviation one. The covariances of all genes are given by a block diagonal matrix with coefficients r = 0.2 within and r = 0 outside the blocks of size 50 × 50. To separate the data into two classes, between 10 and 500 genes were randomly chosen to be differentially expressed by adding a constant of 0.6 to the expression values of the first 25 samples. We generated 100 datasets for testing.
In the same manner as for the ten microarray datasets before, normally distributed noise with zero mean and an increasing variance between 0 and 0.2 was added to the simulated data. We repeated the benchmark on 50 of these noisy artificial datasets. The number of differential features was fixed to 300.

Dimension reduction
All dimension reduction techniques discussed here have one or two free parameters, that influence the embedding and the target dimension. Their determination was done by minimizing the error rate of a Support Vector Machine (SVM) within a leave-one-out cross-validation (loo-cv) schema: For N samples, the dataset was divided N times into a training and a test set. One sample was excluded for testing while the rest was taken for training. The average over all prediction accuracies gives an estimate of the SVMs' generalization error.
This procedure was repeated for every set of parameters within the following ranges: If the same loo-cv accuracies were achieved by using different parameter values for the target dimension, the lowest value was taken for reasons of a most simple representation. The same applies to the neighbor/kernel parameters.
After the loo-cv, the whole dataset was reduced in its dimension in an unsupervised manner, i.e. without consideration of class labels.

Classification
The first evidence for the quality of the different dimension reduction methods are the accuracies of a Support Vector Machine with Gaussian kernel.
The data was classified repeatedly during several randomization steps: We randomly split the dataset a hundred times into a set to train the SVM and a test set for classification, and selected the median accuracy of all runs. Within the training set, a loo-cv was performed to determine the SVM parameters. For reasons of performance, a gradient descent procedure as proposed in [41] was used to minimize the loo-cv error. Every time during randomization, the training set consisted of two thirds of the original data and the test set of the remaining samples. The only constraint was to keep the balance between the number of samples in each class. Since SVMs do not restrict the dimension of the input data, the randomization results of the low-dimensional data can be compared to the high-dimensional original data, to see if more or less significant features got lost after the embedding.

Cluster validation
To measure the distances between the labeled clusters, we used the Davis-Bouldin-Index (DB-Index) [42]: Given M clusters C i (i = 1, . . . , M ) and their centers  (52) Normal (50) Summary of all ten microarray gene expression datasets we used for testing the dimension reduction techniques. Here, we focus on the data by Wang et al., which represents best the results of the whole benchmark. Datasets 2-10 are shortly discussed in the supplement to this work. All datasets were separated into two classes according to two characteristics or the diagnosis of a disease.
is the average distance of the samples in cluster C i to reports the compactness of clusters C i , C j related to their distance, the DB-Index averages the worst cases of the clusters' separations. One might expect well separated clusters to have smaller values close to one. In our case, the DB-Index was computed for fixed target space dimensions 2,3,5, and 10.

Implementation details
The presented benchmark was implemented in Matlab 7.8.0 (R2009a). Furthermore, libsvm (version 2.89) [43] served as Support Vector Machine implementation, in conjunction with Automatic Model Selection for Kernel Methods (Apr 2005) [44]. The Dimensionality Reduction Toolbox (version 0.7 -Nov 2008) [45], Isomap package (Release 1 -Dec 2000) [46], LLE routine [45] and MVU implementation (version 1.3) [47] were used for dimension reduction. Because the Isomap and LLE routines performed best in our benchmark, we converted their Matlab implementations for the statistical programming language R [48]. The R-package 'RDRToolbox', also including a routine to compute the Davis-Bouldin-Index and our microarray gene expression data simulator, can be downloaded from [49] (see also Additional file 1).

Results and Discussion
The following sections present the results for the Wang et al. Breast Cancer dataset, which represents best the results of the whole procedure. For the sake of simplicity, the visualization example in Figure 2  A linear approach like PCA is known to recover the true structure of data lying on or near a linear subspace of the high-dimensional input space. The following results show that the structure of microarray data is often too complex to be captured well in very low dimensional target spaces in a linear manner. Nonlinear methods, in particular LLE and Isomap, preserve more information in the data than the first few principle components of a PCA are able to cover.

Classification
The results of the randomization procedure are shown in Figure 3. In case of two and three dimensions, PCA performs worst, while all nonlinear methods, except Diffusion Maps, tend to retain the underlying structure of the data better in such low-dimensional target spaces. Table 2 shows the parameters having the best loo-cv accuracies. The estimated target dimension was higher than ten in most cases. PCA and Kernel PCA result in the highest dimensions (14 and 15), while other methods like Laplacian Eigenmaps, MVU and Isomap worked best with less than ten dimensions. But classifications in two or three target dimensions often yield only slightly different accuracies. The classification accuracies on data with and without dimension reduction were often similar, even in two and three target dimensions.
While all methods perform nearly even in higher dimensions, Isomap, LLE and Laplacian Eigenmaps performed best in two and three dimensions. Only on two of ten datasets (Alizadeh et. al and Singh et. al), PCA performed as well as other nonlinear methods like Isomap in two or three dimensional target spaces (see Supplemental Figures S18/S19, S27/S28). On all ten datasets considered together (see supplement), Diffusion Maps and Laplacian Eigenmaps produce more varying results The initial publications on Isomap and MVU [25,27], covering text classification and face recognition, pointed out, that PCA might need higher dimensional target spaces than its nonlinear counterparts to lead to similar results. Since PCA only considers the variance in the data, it works best if those features, which are relevant for the class labeling, account most for the variance. Considering complex microarray data, the first two or three principal components were often not enough to cover the information necessary to sufficiently distinguish different classes within the data. This might prevent a well suited visualization, which is true to the original. LLE, Isomap and MVU, which classified best most of datasets, take advantage of overlapping local neighborhoods to create an image of the global geometry of the data. Although this approach may suffer from "holes" within the data (manifold), it proved more useful for accurate low-dimensional representations. Well

Cluster validation
The cluster distances, presented in Figure 4, confirm the above conclusions. In two and three target dimensions, PCA results in worse scores than most nonlinear methods. DM performs the worst for more than two dimensions. With increasing target space dimension all methods converge, while the DB-Index itself increases as well. Although Laplacian Eigenmaps implicitly enforce natural clusters in the data, they show only slight different scores than e.g. LLE, which clusters best on most of the datasets. Just in case of ten target dimensions, LLE's and Laplacian Eigenmaps score remarkably worser on four of our ten datasets, while the other methods, including PCA, hold steady (see Supplemental Figures S9, S12, S21, S24). While Isomap might map well separated clusters to very close points, the slight modification of regarding nearest and farthest neighbors seems to correct this behavior on three datasets (Supplemental Figures S3, S6, S24), but performs similar or (much) worse otherwise (see for example Supplemental Figures S15, S18, S27). MVU scores similar to Isomap, but fails on three other datasets in two dimensional target spaces (Supplemental Figures S6, S15, S18).
Because LLE and Isomap performed best on most of the datasets during classification and cluster validation, Figure 2 compares their two dimensional embedding of the Haferlach et al. Leukemia dataset to the first two principal components of a PCA. All three visualizations clearly show two clusters of AML patients with t(15;17) and t(8;21) respectively. But LLE and Isomap distinguish both classes best, while in the PCA embedding three more t (15;17) samples lie between samples of the other class. Since LLE and Isomap both map more samples correctly, there seems to be more information within the data, that the first two PCA components fail to preserve. On closer inspection, the common three t (15;17) outliers, that are in between or closest to t(8;21) samples in all three visualizations, are always the same samples #44 and #46 #57. Another visualization example of the Alon et al. Colon Cancer dataset with all eight dimension reduction techniques can be seen in Supplemental Figures S1 and S2.

Noise evaluation
The tests on artificially noised microarray datasets reveal, that PCA, Kernel PCA and Diffusion Maps are most robust on noisy data ( Figure 5). But the differences are less strong and the results more variable than for the classification and cluster validation without adding noise. The sensitivity to noise of all methods strongly depends on the given class labels and associated features, and thus leads to varying results between all ten datasets (see supplement). While Diffusion Maps are known to be robust to noise [14], all other nonlinear methods, especially Isomap and its modification, suffer most from unstructured data and lead to strongly varying cluster scores.

Simulated data
Since LLE and Isomap performed best in the first two tests, the classifications on simulated data refer only to these methods. In all three cases, we fixed a two dimensional target space. Figure 6 shows that the results of the loo-cv on real microarray datasets can be reproduced on simulated data. With only few differential features, LLE and Isomap already capture more of the structure of the data than PCA. It takes more than 150 (of overall 10.000) differential features for PCA to perform nearly even. Furthermore, for less than 200 differential features, the accuracies of PCA are spreading much stronger, while LLE and Isomap give more stable results. The findings for three target dimensions are similar to the two dimensional case and can be seen in the supplement (Supplemental Figure S30).
The benchmark with noisy simulated data, however, confirms the results of the noise evaluation for the ten microarray datasets. Supplemental Figures S31 and S32 show for two and three target dimensions, that PCA performs more robust than LLE and Isomap for both, classification and cluster validation, when noise within the data increases. These conclusions hold true for noisy data with a larger variance, since PCA, LLE and Isomap are invariant to multiplication of the data with a scalar.

Statistical hypothesis test
In Table 3, we compare all results by applying the Wilcoxon signed-rank test on the accuracies and cluster scores for two dimensional data representations. We tested the null hypothesis, that the median of the differences between PCA and each of the nonlinear methods is equal to zero. This way, we computed the p-values of 14 paired samples. The p-values were not adjusted for multiple testing. Isomap and LLE show the most significant results in accuracy and clustering with p-values 0.0078 and 0.0273 respectively. Diffusion Maps led to results most similar to PCA.

Runtime
The computational complexity and memory requirements for all dimension reduction methods except MVU are equal, as shown in Supplemental Table S3. However, we observed differences in runtime between the methods due to different constant factors. Table 4 lists the runtime of all seven methods in seconds for the smallest dataset (Chiaretti et al. leukemia dataset, 22 samples) and the largest dataset (Verhaak et al. leukemia dataset, 461 samples). The target dimensionality was set to two. The embeddings were computed on an AMD Opteron processor with 2 GHz.
The runtime of all nonlinear methods (Kernel PCA, Isomap, LLE, LEM, DM, MVU) depends on the number of samples. Even for relatively large microarray datasets (461 samples in this case), runtimes between 9.4 and 21.9 seconds are acceptable for visualization purposes. Only the solution of a semidefinite program in the MVU algorithm takes two hours. For all methods, the computing time for datasets with more common sample sizes (≤ 50) is less than a second.

Conclusions
Classifications on high and low-dimensional data showed, that the most significant information within microarray data can be captured quite well in very few  Figure 6 Simulated data. Leave-one-out cross-validation accuracies of the simulated datasets with increasing differential features. The red line marks the average accuracy of all 100 generated datasets within the 95% quantile. In two dimensional target spaces, LLE an IM capture more of the underlying structure of the data with much fewer significant features than PCA.   dimensions compared to the thousands of features of the original data. Our benchmark further revealed significant shortcomings of PCA in two and three dimensional target spaces and brought out two nonlinear methods, that distinguished most from PCA. Especially the performances of Locally Linear Embedding and Isomap in classification and cluster validation make them well suited alternatives to the classic, linear approach of PCA.

Additional material
Additional file 1: R-package. RDRToolbox_1.0.0.tar.gz: A package for nonlinear dimension reduction using the Isomap and LLE algorithm. It also includes a routine for computing the Davis-Bouldin-Index for cluster validation, a plotting tool and a data generator for microarray gene expression data and for the Swiss Roll dataset.
Additional file 2: Supplement. Supplement.pdf: Contains information about preprocessing of the data, a discussion of the computational complexity of each dimension reduction method, further classification, cluster validation and noise evaluation results of nine other microarray datasets and further classification and randomization results for simulated datasets.