 Methodology article
 Open access
 Published:
Multidimensional scaling for large genomic data sets
BMC Bioinformatics volumeÂ 9, ArticleÂ number:Â 179 (2008)
Abstract
Background
Multidimensional scaling (MDS) is aimed to represent high dimensional data in a low dimensional space with preservation of the similarities between data points. This reduction in dimensionality is crucial for analyzing and revealing the genuine structure hidden in the data. For noisy data, dimension reduction can effectively reduce the effect of noise on the embedded structure. For large data set, dimension reduction can effectively reduce information retrieval complexity. Thus, MDS techniques are used in many applications of data mining and gene network research. However, although there have been a number of studies that applied MDS techniques to genomics research, the number of analyzed data points was restricted by the high computational complexity of MDS. In general, a nonmetric MDS method is faster than a metric MDS, but it does not preserve the true relationships. The computational complexity of most metric MDS methods is over O(N^{2}), so that it is difficult to process a data set of a large number of genes N, such as in the case of whole genome microarray data.
Results
We developed a new rapid metric MDS method with a low computational complexity, making metric MDS applicable for large data sets. Computer simulation showed that the new method of splitandcombine MDS (SCMDS) is fast, accurate and efficient. Our empirical studies using microarray data on the yeast cell cycle showed that the performance of Kmeans in the reduced dimensional space is similar to or slightly better than that of Kmeans in the original space, but about three times faster to obtain the clustering results. Our clustering results using SCMDS are more stable than those in the original space. Hence, the proposed SCMDS is useful for analyzing whole genome data.
Conclusion
Our new method reduces the computational complexity from O(N^{3}) to O(N) when the dimension of the feature space is far less than the number of genes N, and it successfully reconstructs the low dimensional representation as does the classical MDS. Its performance depends on the grouping method and the minimal number of the intersection points between groups. Feasible methods for grouping methods are suggested; each group must contain both neighboring and far apart data points. Our method can represent high dimensional large data set in a low dimensional space not only efficiently but also effectively.
Background
Representing high dimensional data in a low dimensional space is an important task because it becomes much easier to study the information structure when the dimension is greatly reduced. The main idea of MDS techniques is to configure the coordinates of the data in the significant space such that the pairwise relationship of relocated data in a low dimensional space is similar to that in the high dimensional space of the original data. With the dimensional reduction, one can cluster the data relationships by their distribution in the low dimensional space and explore significant patterns. When the data configuration is Euclidean, MDS is similar to principle component analysis (PCA), which can remove inherent noise with its compact representation of data [1]. When the data configuration is nonlinear, MDS can be further improved to capture the imbedded manifold in data [2].
MDS techniques have been applied to many fields, e.g., pattern recognition, stock market analysis, and molecular conformational analysis. However, the computational complexity of most metric MDSs is over O(N^{2}), though some nonmetric methods can reduce the complexity to O(N/\sqrt{N})[3]. Genomics research represents a challenging application of MDS. Data from microarray experiments are typically noisy with a large number of genes, but few replicates and frequent data updates. Due to the high computational complexity, it is very difficult to apply MDS to whole genome data, such as ~6000 genes in yeast, not to mention ~23,000 genes in human.
Taguchi and Oono [4] developed a novel algorithm for nonmetric MDS analysis and applied it to analyze patterns of gene expression. However, the result of a nonmetric MDS method depends heavily on the initial configuration and nonmetric MDS only preserves the order of similarities instead of the original scale of similarities. Therefore, it remains an important issue to reduce the computational complexity for a metric MDS. In this paper, we develop a fast metric MDS method for large data sets that is suitable for data analysis and updates. Indeed, the computational time for 6000 target data points is within 30 seconds in a PC with CPU 1.67 GHz and 2G memory.
We review typical MDS techniques in the following and propose a new MDS method to solve the problem of large data sets in Section 3. We split the data into overlapping subsets, apply our MDS technique to each subset, and then recombine the configurations of the subsets into the same space. We call this method the splitandcombine MDS (SCMDS) method. The complexity of SCMDS is O(p^{2} N), where p is the dimension of the feature space, which is far smaller than the number of data points N. In Section 4, we evaluate the performance of SCMDS using simulation, apply SCMDS to the GO database, and lastly, improve the Kmeans clustering of gene expression profiles by applying SCMDS to yeast cell cycle microarray data [5].
There are many different categories of MDS techniques. For example, a distinction can be made between metric and nonmetric MDSs, between weighted and unweighted MDSs, between single matrix and multiple matrices, and between deterministic and probabilistic matrices [3, 6]. In this section, we introduce three typical MDS methods that are relevant to the present work.
Classical MDS (CMDS)
Torgerson [7] proposed the first MDS method. The distance is the Euclidian distance, and the similarity matrix is complete (with no missing data) and symmetric. The main idea was that given the Euclidean distances or the inner products among points, it is possible to construct a matrix X of Cartesian coordinates of these points in the Euclidean space. Torgerson derived matrix X from the distance (or similarity) matrix D and showed what to do when the distance matrix includes noisy data. The key is to apply the double centering operator and singular value decomposition (SVD).
Double centering is the process of subtracting the row and column means of a matrix from its elements and adding the grand mean. For example, suppose that D is a product matrix, that is, D = X^{T}X, where X is a p Ã— N matrix with each column of X being a vector in a p dimension space. We define i as a N Ã— 1 vector in which every element is one and B={(X\xe2\u02c6\u2019\frac{1}{N}X\stackrel{\xcb\u2122}{1}{\stackrel{\xcb\u2122}{1}}^{T})}^{T}(X\xe2\u02c6\u2019\frac{1}{N}X\stackrel{\xcb\u2122}{1}{\stackrel{\xcb\u2122}{1}}^{T}) as the similarity matrix with the means of the column vectors being zero. Then we have
where {\stackrel{\xc2\xaf}{D}}_{r} denotes the row means, {\stackrel{\xc2\xaf}{D}}_{c} denotes the column means and {\stackrel{\xc2\xaf}{D}}_{g} denotes the grand mean. If H=1\xe2\u02c6\u2019\frac{1}{N}{\stackrel{\xcb\u2122}{\text{1}}\stackrel{\xcb\u2122}{\text{1}}}^{T}, (1) can be simplified asB = HDH.
After performing double centering on D, one can apply SVD to B. Since B is symmetric, the decomposition is of the formB = UVU^{T}.
Hence, \sqrt{B}=X\xe2\u02c6\u2019\frac{1}{N}X{11}^{T}=U{V}^{\frac{1}{2}}, where the columns of \sqrt{B} are the coordinates of data, with the mean of the data being moved to the original point.
When D is a distance matrix, {d}_{i,j}=\sqrt{{({x}_{i}\xe2\u02c6\u2019{x}_{j})}^{T}({x}_{i}\xe2\u02c6\u2019{x}_{j})}, where x_{ i }is the configuration of the ith data point. The double centering of D^{2} is equal to 2B, provided that {\displaystyle \underset{i=1}{\overset{N}{\xe2\u02c6\u2018}}{x}_{i}=0}. Hence, the CMDS method performs double centering on D^{2}, multiplies by \xe2\u02c6\u2019\frac{1}{2}, and then performs SVD, which gives the configurations of the data. Thus, the key scheme of PCA is embedded in this CMDS method. When we want to find out the r dimensional configurations of the data in the space generated by the r principal components, we only use the first leading r spectrums and r columns of U to generate X, that is,
where V_{ r }is the r Ã— r submatrix of V and U_{ r }is a matrix of size N Ã— r.
There are many drawbacks of this method [8]. For example, missing data is not allowed and the computational complexity is O(N^{3}). Hence, this method is not suitable for massive data sets.
Chalmer's Linear Iteration Algorithm
One of the forcebased models of MDS is the spring model [9]. It considers each point of the data as a vertex in a low dimensional space, with springs connecting each vertex and the distance (or the spring length) between vertices proportional to their high dimensional distance. If d_{i, j}denotes the high dimensional distance between vertices i and j, and Î´_{i, j}denotes the low dimensional distance, then the stress between vertices i and j is proportional to d_{i,j} Î´_{i, j}. The spring model computes (N  1) forces at each vertex per iteration, and the computational complexity of this model is O(N^{2}) per iteration.
Chalmers [8] proposed a linear iteration time layout algorithm. Instead of computing all the forces at each point, he computed only constant forces in the neighborhood of each point and randomly chose another constant point that is not in the neighborhood to compute the large distance effect. Only constant points are computed at each point, so that the computational complexity is reduced to O(N) per iteration. This spring model does not find the steady state solution in general. One can only process a fixed number of iterations, say 8 or 10, as opposed to finding the converged solution. Note that the constant forces are selected in both the neighborhood and afar. Failure to incorporate one of these two forces will diminish the performance of this method [8].
Anchor Point Method
As in Chalmers' linear layout algorithm, in the anchor point MDS method [10] only a portion of data is used to reconstruct the layout for intermediate steps. The data are grouped into clusters, so that the distances between points in different clusters are less meaningful than the distances between points in the same cluster.
In this method, some points in the same cluster are chosen as anchors and others are considered as floaters. Distance information of anchors is used to construct the coarse structure of layout, and the floaters are used to update the fine structure. When a small number of K anchor points are chosen, a modified MDS procedure only computes the N Ã— K matrix. Buja et al. [10] showed that the number of anchors could not be smaller than the dimension p of the given data. Moreover, the anchors should be chosen carefully because random choices of anchors do not work [10]. This is challenging when the grouping structure is unknown.
From these two methods, we can see that the intermediate steps for calculating MDS do not need to employ all entries of the dissimilarity matrix. We can use this property to reduce the computational complexity of MDS. Another important issue is choosing the number of dimensions for layout. In a small data set, one can use the elbow test or similar methods to detect the changing shape for the decay of the spectrum of SVD to determine the layout dimension. In a large data set, one feasible approach is to use stochastic methods by crossvalidation to measure the layout dimension [11].
Methods
We first describe SCMDS using a simple case, and for convenience we use the classical MDS (CMDS) as the default MDS method to show how we can improve it. Assume that x_{ i }âˆˆ R^{p}are the coordinates of data points for i = 1,...,N and p <<N. We define d_{i,j}= x_{ i } x_{ j }_{2} as the Euclidean distance between x_{ i }and x_{ j }. N is large such that applying the CMDS technique is impractical. We split the points into two overlapping sets S_{1} and S_{2}, and the intersection of these two sets contains more than p points. The main idea is to apply MDS to each individual set to get the configuration, and to use the information of the overlapping points to combine the two sets into one. There are two problems that need to be solved: (1) how to combine two sets into one and (2) what is the sufficient condition for this solution to be equivalent to that obtained by working directly with the full set? The solutions to these two problems are proposed in the next subsections.
Combination Method
When we split the whole set of data points into two overlapping subsets of equal sizes, the combined size of the two distance matrices for the two subsets is less than that for the whole set. Assume that the configurations of these two subsets obtained from MDS are x_{i,1}and x_{i,2}and the dimensions of the two configurations are the same. We can fix the coordinates of the data points in the first set and use the overlapping data points to find an affine mapping U(Â·) + b such that for each intersection point x_{k,1}âˆˆ S_{1}, x_{k,1}= Ux_{j,2}+ b, where x_{j,2}âˆˆ S_{2} for some j. Note that the matrix U of the affine mapping is a unitary matrix, which is a volumepreserving operator. The affine mapping can be found as follows.
Assume X_{1} and X_{2} are matrices in which the columns are the two coordinates of the overlapping points obtained by applying MDS to two data sets, and {\stackrel{\xc2\xaf}{X}}_{1} and {\stackrel{\xc2\xaf}{X}}_{2} are the means of columns of X_{1} and X_{2}, respectively. In order to use the same orthogonal basis to represent these vertices, we apply QR factorization to {X}_{1}\xe2\u02c6\u2019{\stackrel{\xc2\xaf}{X}}_{1}{\stackrel{\xcb\u2122}{1}}^{T} and {X}_{2}\xe2\u02c6\u2019{\stackrel{\xc2\xaf}{X}}_{2}{\stackrel{\xcb\u2122}{1}}^{T}, so that {X}_{1}\xe2\u02c6\u2019{\stackrel{\xc2\xaf}{X}}_{1}{\stackrel{\xcb\u2122}{1}}^{T}={Q}_{1}{R}_{1} and {X}_{2}\xe2\u02c6\u2019{\stackrel{\xc2\xaf}{X}}_{2}{\stackrel{\xcb\u2122}{1}}^{T}={Q}_{2}{R}_{2}. Since these two coordinates represent the same points, the triangular matrices R_{1} and R_{2} should be identical when there is no rounding error from computing the QR factorization in X_{1} and X_{2}. The positive and negative signs of columns of Q_{ i }could be arbitrarily assigned in the computation of QR factorization. Hence, the signs of columns of Q_{ i }should be adjusted according to the corresponding diagonal elements of R_{ i }so that the signs of diagonal elements of R_{1} and R_{2} become the same.
After the signs of columns of Q_{ i }are modified, we conclude
Furthermore, we can obtain
That is, the unitary operator is U={Q}_{1}{Q}_{2}^{T} and the shifting operator is b=\xe2\u02c6\u2019{Q}_{1}{Q}_{2}^{T}{\stackrel{\xc2\xaf}{X}}_{2}+{\stackrel{\xc2\xaf}{X}}_{1}.
In practice, one problem with the X_{1} and X_{2} obtained by CMDS from a distance matrix is that some of the eigenvalues of the double centered distance matrix can be negative. When negative eigenvalues occur, the dimension of the configuration is determined by the number of positive eigenvalues. This could cause those two triangular matrices R_{1} and R_{2} to be unequal and sometimes the dimensions of the configurations of the two subsets are different. In the case of equal dimensions, we can still use equation (6) to combine the two data sets into one, but the equality in equation (6) becomes an approximation and the combination will induce computing errors. In the case that the dimension of X_{1} is not equal to that of X_{2}, for example, dim(S_{1}) = q_{1} < dim(S_{2}) = q_{2}, we project x_{i,2}into the space generated by the leading q_{1} basis of Q_{2} We then use the new projected configuration of x_{i,2}and the configuration x_{i,1}to perform the combination processing. The projection of x_{i,2}from q_{2} dimension to q_{1} dimension induces computational errors too. To avoid this error, the sample number of the overlapping region is important. This sample number must be large enough so that the derived dimension of data is greater or equal to the real data.
Sufficient Condition for Successive Combinations
In the case of a large number of data points, the data points are split into several overlapping groups such that the number of overlapping points is greater than the dimension of real data. The recombination approach is similar to the case of two overlapping subsets. For example, we split data points into K overlapping chained subsets {S_{1},...,S_{ k }}, i.e. S_{ i }âˆ© S_{i+1}â‰ âˆ…; we apply the MDS techniques to each S_{ i }; we use the configurations of S_{1} as the central reference and combine the subsets around S_{1}; we repeat this procedure until all the subsets are combined.
The minimal number of points of each overlapping region and the grouping method used will strongly affect the performance of the low dimensional layout of MDS. Firstly, if the number of the overlapping points is smaller than the real data dimension, the rank of the affine mapping will be less than the dimension of data and the affine mapping cannot transform the coordinates to the corresponding coordinates. We demonstrate this point by a simulation case in the next section. Secondly, points of each group should be chosen both in the neighborhood and beyond. This has been mentioned in [3], where the information of both the neighborhood and afar is used for the spring model. If one puts only the neighboring points into the same group, the rotation effect will hamper the performance of the low dimensional layout (see Fig. 1d later). These two conditions are sufficient to guarantee good performance in a low dimension layout. The layout of the correct approximation solution of our method is outlined (see Fig. 1b).
Computational Complexity Reduction
We now show how SCMDS reduces the computational complexity of CMDS from O(N^{3}) to O(N) when p <<N. Assume that there are N points in a data set, N_{ I }is the number of points in each intersection region, and N_{ g }is the number of points in each group. When we split N points into K overlapping groups, we have KN_{ g } (K1)N_{ I }= N, and we have K=\frac{(N\xe2\u02c6\u2019{N}_{I})}{({N}_{g}\xe2\u02c6\u2019{N}_{I}}~O(N).
For each group, we apply CMDS to compute the coordinates of the group data, which costs O({N}_{g}^{3}) computation time. At each overlapping region, we apply QR factorization to compute the affine transform, which costs O({N}_{I}^{3}) computation time. Since the lower bound of N_{ I }is p+1, we can assume that N_{ g }= Î±p for some constant Î±. Then the total computation time is about
Thus, when p <<N, the computation time of the SCMDS becomes O(p^{2}N). Moreover, if p<\sqrt{N}, the computational complexity is smaller than O(\sqrt{N}N), which is the computational complexity for the fast MDS method proposed by Morrison et al. [12]. When p is very large (say p^{2} > N), SCMDS has no advantage in computational speed, but it makes the computation feasible even when the computer memory does not afford the MDS computation. Furthermore, if we do not use CMDS as our default MDS method but use the linear time algorithm [8] instead, then we can further reduce the complexity of the SCMDS method to O(pN). This improvement makes application of the MDS to large data sets feasible. Hence, data analysis using SCMDS can guarantee better accuracy than existing nonmetric MDS methods.
Results and Discussion
Simulation experiments
We simulate a spiral in two dimensions:r = 2Î¸, 2Ï€ â‰¤ Î¸ â‰¤ 5Ï€.
First, let us construct the reference coordinates X. Discretize Î¸ into N1 intervals and let Î¸_{ i }= 2Ï€ + idÎ¸, d\mathrm{\xce\xb8}=\frac{3\mathrm{\xcf\u20ac}}{N\xe2\u02c6\u20191}. At each Î¸_{ i }, k points are constructed with noise. The following steps generate the data:
Construct two raw vectors q_{1} and q_{2},
where n_{ i }is a random variable with normal distribution n_{ i }~ N(0,1), for i = 1â‹¯kN. Then we add some p2 dimensional randomness into the coordinate matrix as follows. Let Z = âŸ¨Î±_{ j }u_{i,j}âŸ©, u_{i,j}~ N(0,1), for i = 1â‹¯kN and j = 1â‹¯p  2, where Î±_{ j }is the parameter to control the standard deviation of random variables. The final reference coordinates matrix X isX = [q_{1}, q_{2}, Z]^{T}= [x_{1}, x_{2},...,x_{ kN }],
where x_{ i }âˆˆ R^{p}. So the distance matrix is D = âŸ¨d_{i, j}âŸ© with d_{i,j}= x_{ i } x_{ j }. In this simulation, p = 17 is used.
Fig. 1a is the 2D projection obtained by applying CMDS to 2000 simulation points. Then, in the first application of SCMDS, we set the number of points in each group to N_{ g }= 200 and the number of points in each intersection region to N_{ I }= 30 such that the number of the overlapping points is greater than the real dimension 17. All points in each group are chosen randomly. In order to measure the difference between classical MDS and SCMDS, we use the STRESS (Kruskal's goodness of fit index) to compute the error between the distance matrixes. The formula of STRESS is
where d_{i,j}refers to the distance matrix of the original data and {\stackrel{\xcb\u0153}{d}}_{i,j} refers to that for the SCMDS reconstruction. In this spiral example, the STRESS of computing errors for SCMDS is only 3.93 Ã— 10^{14}, and the STRESS for CMDS is only 1.25 Ã— 10^{15}. These errors are here considered as rounding errors. Thus, SCMDS can reconstruct the configuration as does CMDS; the result of our method (Fig 1b) is similar to the 2D projection in Fig. 1a. Because this spiral example has only two essential dimensions, the 3rd to 17th dimensions are considered as random perturbations. If we reduce the number N_{ I }from 16 to 3 but keep N_{ g }= 200, we can see that the 2D SCMDS representation keeps the shape of spiral but becomes more and more blurred as N_{ I }decreases. At the same time, the STRESS increases when N_{ I }decreases, as will be shown in Fig 2. In Fig. 1c, we set N_{ g }= 200 and N_{ I }= 5 and the points in each group are also chosen randomly. The performance is now slightly worse than CMD due to dimension loss; the STRESS here is 0.0398. In Fig. 1d, we set N_{ g }= 200 and N_{ I }= 5, and group points neighboring to each other into the same group. It shows that the performance is totally bad â€“ the spiral structure is lost and the STRESS increases to 0.5232.
Figure 2 shows the correlation between STRESS and the number of points in the overlapping region N_{ I }. There are two lines. The solid line refers to points in each group that are chosen randomly, while the dash line refers to points in each group that are chosen from neighbors; all N_{ g }are fixed to 200. Points in each line are the average of 20 repeats of STRESS computed with different randomly choosing points in each group. We can see that the STRESS of each line decays as N_{ I }increases. When N_{ I }is larger than the real dimension of data, the STRESS is almost zero. When N_{ I }is smaller than the real dimension of data, the performance of solid line is better than the dash line. Without the afar information in each group, the performance is worse. That is, randomly choosing points in each group helps to reduce the error when N_{ I }is not large enough.
In the same simulation example, we observed the relationship between error and the number of groups. When N_{ I }is larger than the dimension of the data, like N_{ I }= 20, we can reconstruct the configuration of the data well. The number of groups of our grouping method does not affect the STRESS and the average STRESS is ~10^{12}. Hence, SCMDS gives accurate results if we carefully control the number of overlapping points and choose random points in each group.
Next, we compare the speeds of nonmetric MDS, CMDS and SCMDS. We create random vectors in a 20dimension space, with the sample size ranging from 50 to 2000.
In Figure 3 the black line (nonmetric MDS) and the blue line (CMDS) are produced by Matlab default scripts, mdscale.m and cmdscale.m. The red line refers to our SCMDS method. We can see that the simulation result matches our theoretical analysis.
Because of the hardware limitation to process CMDS, we use only a set of 2000 sample points as the maximal sample size to demonstrate that SCMDS performs as well as CMDS when the number of data points (N) is not very large. We give below an example of large N to show that SCMDS works well in a large data set that cannot be handled by CMDS.
Gene Correlation Map
Gene correlation maps are used to represent the correlations of genes such that genes with similar biological functions or in the same biological pathway tend to be located in the same neighborhood. It provides a prior probability in many applications of genome research by Bayesian methods. Since Affy U133A GeneChip is widely used in many studies, we used genes listed in this chip and GO descriptions on the Gene Ontology website to create the gene correlation map. In this chip, there are 22,283 genes and 2168 GO terms are used in the list of these genes. Hence, we consider each gene a binary vector with length of 2168. If the ith term of the vector is one, then this gene has the ith GO description. There are 5781 genes without any GO description so that these genes are not used to compute the correlation with the genes with a GO description. Hence, the termdocument matrix is reduced to the size of 16502 Ã— 2168.
To apply the CMDS to a distance matrix of 16502 Ã— 16502 is impossible for current hardware. So we use SCMDS to randomly separate the 16,502 genes into 6 overlapping subsets, where N_{ I }= 2200 and N_{ g }= 4500. Although the essential dimension should be smaller then 2168, we still use N_{ I }= 2200 to ensure accurate reconstruction. The QR operation and SVD operation are available for this size. In each subset, we compute the 4500 Ã— 4500 distance matrix and then compute the 3D MDS layout. Figure 4 is the 3D layout of SCMDS results on these 16,502 genes. In this figure, two genes located closely have similar GO descriptions. We use the Euclidian distance in this 3D layout to measure the relationship between genes. For example, gene probe IDs 220259_at, 220815_at, 221980_at, 201015_s_at, 209880_s_at, 219765_at, 203018_s_at, 205523_at, 209879_at, 218796_at refer to the same GO description and they have the same coordinates in Figure 4. This gene correlation map is useful for many gene selection problems. Although we could not validate this result by comparing it with CMDS, we can repeat the SCMDS procedure to see if we get consistent results among repeats. We find that the values of STRESS between different repeats are all small. This suggests that our method is stable for this large data set.
Another way to check our method is to sample a subset of genes to create the correlation map, for example, 4000 sampled genes. Then we can compare the performance between CMDS and SCMDS. We randomly sampled 4000 genes from the 16,502 genes, and then compute the distance matrix corresponding to these genes. The CMDS layout showed that there are 1106 dimensions in this gene set. Then we applied SCMDS to this distance matrix with three different pairs (N_{ I }= 2200, N_{ g }= 2500), (N_{ I }= 1110, N_{ g }= 2000) and (N_{ I }= 500, N_{ g }= 1000). We repeated SCMDS 10 times for each pair. Figure 5a is the 2D CMDS layout. The STRESS between the original data and CMDS is 1.796 Ã— 10^{15}. Figure 5b is the 2D SCMDS layout with (N_{ I }= 2200, N_{ g }= 2500), and its average STRESS compared with the original data is 0.0138. Figure 5c is the 2D SCMDS layout with (N_{ I }= 1110, N_{ g }= 2000), its average STRESS is 0.0140. Figure 5d is 2D SCMDS layout with (N_{ I }= 500, N_{ g }= 1000), and its average STRESS is 0.0282. We can see that all the 2D structures are preserved, though less well in Figure 5d where there is a substantial dimension loss.
Gene expression clustering
The goal of gene expression clustering is to subdivide a set of gene expressing profiles into clusters such that genes in the same cluster share the same or similar patterns of expression profiles. In the situation with high dimensional data, researchers tend to obtain a manifold, which is defined by the regression of the data. Because gene expression data is typically noisy, by clustering the projection of data in this manifold, which is in a lower dimensional space in comparison to the original data, a better result can be obtained. In this section, we show that the SCMDS method can successfully transform a high dimensional gene expression data set to a much lower dimension and preserve the intrinsic information of the original data. This transformation makes the clustering algorithm faster to get a converged solution. Moreover, the representation of these gene expression data in this lower dimensional space reveals a better clustering result in biological validation.
We use the Î±38synchronized yeast cell cycle dataset [5]. There were 5006 genes in the data set, and each gene has a 25 pointtimecourse expression profile. However, this expression profile contained missing values, and we input these missing values by the KNN imputation method [13]. To avoid the synchronized block effect, we remove the first two points of the time course data, so that the expression profiles for each gene are left with 23 timecourse points. We then (1) compute the pairwise dissimilarity of genes from each subset that are randomly chosen by the sufficient condition of SCMDS; (2) apply SCMDS to the subsets to reconstruct the new coordinate in the feature space; and (3) cluster genes in this feature space. Because CMDS cannot handle large samples and thus cannot be used to reduce the dimension when the sample size is large, we do not consider the clustering results in the reduced dimension space derived from CMDS. Instead, we compare the clustering results in the reduced dimension space derived from SCMDS with the results obtained in the original (nonreduced) space.
Using the standard Euclidean distance to measure the pairwise dissimilarity of genes, we process the SCMDS method such that the data points are split into 61 groups, N_{ I }= 100 and N_{ g }= 200.
Note that instead of computing the pairwise dissimilarity for every pair, we need to compute only the pairwise dissimilarity in each group. This reduces the computational complexity of all processing to O(N). We choose N_{ I }to be greater than the length of original time course data points to satisfy the sufficient condition for an accurate layout.
Using a PC with CPU 1.67 GHz, 2G memory and Matlab R14 as the testing software, we complete the analysis with the CPU time of ~46 seconds. From the degradation rate of the distribution of standard deviations of output coordinates (Figure 6) the derived from the average of 30 repeats with different randomly grouping the elements of each subgroup, there are turning points at the 4^{th}, 7^{th} and 11^{th} coordinates; after the 11th coordinate, the degradation is very small. By examining the second order difference of Figure 6, a local extremum occurs at 4^{th}, 7^{th} and 11^{th} coordinates. Because the standard deviation decreases smoothly after the 7^{th} coordinate, and because the variation of the concavity in the 7^{th} coordinate is larger than those at the 4^{th} and 11^{th} coordinates, we assume that the variables after dimension 7 are noisy and can be removed for dimension reduction. Hence, we lay out these 5006 genes on a 7dimensional space. In this space, two genes located nearby will have similar expression profiles in this dataset. Then we perform the Kmeans clustering method to compare the clustering result between the original expression data and our low dimensional layout.
Note that Kmeans usually get a local optimal solution. When we want to obtain a reliable solution, we need to repeat Kmeans several times and choose the best solution. Or we can define a function to measure the quality of Kmeans solutions, and then apply the simulated annealing to get the optimal solution. Thus, obtaining a reliable Kmeans solution will take time. Fortunately, applying Kmeans in the reduced dimension space are more stable than applying Kmeans in the high dimension space. That is, the iteration of Kmeans in the reduced dimension space converges faster than in the high dimension space. To demonstrate how dimension affects the stability of Kmeans, we repeat Kmeans on the yeast cell cycle data in the SCMDS space with different restricted dimensions, until Kmeans obtain 50 converged solutions. Figure 7 shows that the CPU times are inversely proportional to the data dimension. The time Kmeans requires in a seven dimensional space is ~1/3 of a 23 dimensional space. Hence dimension reduction accelerates the Kmeans solution. However, although applying Kmeans in a low dimension space is faster than in a high dimension space, loss of too many dimensions will distort the distance relationship between data points, thus distorting the clustering result. Hence, the dimension of reduced space should be determined carefully.
Before applying Kmeans clustering, we determine how many clusters we should use as the parameter in the Kmeans process. We search the number of clusters in the Kmeans process from 10 to 75. Since the clustering result of Kmeans depends on the initial guess of the centroids of the sets, we repeat 30 times the Kmeans process and use the Bayesian Information Criterion (BIC) score [14] to pick up the best clustering index from the mean of these 30 clustering results. In each Kmeans process, if it does not reach convergence in 100 iterations, we reset the initial values of the cluster centers and rerun the Kmeans process until the process converges. We set the upper bound for rerunning time to be 5. If the rerunning time reaches this upper bound, we choose the final cluster index by the best cluster sets from the previous iteration results.
The BIC score is computed by the following formula:
where n is the number of data points, k is the number of clusters, c_{ j }is the jth model and p is the length of data points. We assume that each cluster from the Kmeans procedure is a multivariate normal distribution. The first term of the formula is the loglikelihood and the second term is penalty. When the data fit the model well, the term of loglikelihood becomes larger. We assume that the standard deviations of the multivariate normal distributions are the same, so that there are (k+1)p parameters, k means, and one standard deviation in the p dimensional space, in these k clusters (models). If a model becomes complicated, i.e., the number of the model parameters becomes large, the BIC score will decay. In Fig. 8, the dash line is reduced by taking the average of 30 repeats and the solid line is reduced by taking the maximal value of 30 repeats. We can see that the maximal value of solid line occurs in 45 clusters and the maximal of the dash line occurs in 61 clusters. Hence, we partition 5006 genes into 61 clusters in our example.
We cluster the original data set and the reduced 7dimensional space data set from SCMDS separately, and each data set gives rise to 61 distinct gene clusters. Then we input gene names from these clusters to the MIPS Functional Catalogue Database. The outputs indicate that 13 clusters of the original clustering data set are significant to the cell cycle (pvalue is smaller than 10^{3}). In contrast, 14 clusters of the 7dimensional data set are significant to the cell cycle function. Twenty pairs of almost matched clusters, which consist of highly similar genes, are found by comparing the clusters from the two data sets. Nine clusters of the 7dimensional data set are not annotated to any function, while 13 clusters of the original clustering data set are not annotated. The details of these paired clusters are shown in Additional file 1 and file 2. There are many unpaired clusters, which contain lower proportions (the match rate < 80%) of similar genes. In Figs. 9 to 11, we show these lower matched pairs that are related to cell cycle. In Figs. 9 and 10 the biological validation shows that these clusters obtained from SCMDS are better than the corresponding clusters from the original 23dimensional space data. In Fig. 9(a), 45.5% genes are indicated to have cell cycle function in the cluster obtained in the reduced space, and its pvalue is 1.1 Ã— 10^{9}. In Fig. 9(b), 40% genes are indicated to have cell cycle function in the cluster obtained in the original space, and its pvalue is 1.4 Ã— 10^{8}. There are, however, another 37 genes that are included in the corresponding cluster in the original space. Nine of them have significant functions in DNA processing. Conversely, there are another 17 genes that are included in the corresponding cluster in the reduced space. Six of them have significant function in mitotic cell cycle and cell cycle control. In Fig. 10, 51.8% of the genes in the 32^{th} cluster are annotated to have cell cycle functions in the cluster of Kmeans in the reduced space, while the genes in the corresponding cluster in the original space, the 29^{th} cluster, 42.3% are annotated to have cell cycle function. Compare the difference between Fig. 10(a) and 10(b). Five genes included in the cluster of the original space have no significant in the cell cycle function. Conversely, 7 genes included in the cluster of the reduced space include 3 genes (YHR152w, YML128c and YOR265w) that are significant in meiosis. In Fig. 11, 45.8% of the genes in the 54 cluster of the reduced space are annotated to have cell cycle, its pvalue = 7.27 Ã— 10^{14}; 48% of the genes in the 58 cluster of the original space are annotated to cell cycle function, its pvalue = 2.03 Ã— 10^{7}. We can see that in this pair, the cluster in the reduced space is worse than that in the original space. The other comparisons between the clusters from the reduced 7dimensional space data and from the original space data are shown in Additional file 1 and file 2. Combine the above three cases and twenty pairs that are almost matched, we can see that applying Kmeans to the SCMDS reduced space performs at least as well as in the original data set. SCMDS did not distort the pairwise relationship between data points, and it speeds up the analysis with accuracy preserved.
Conclusion
Our new method reduces the computational complexity from O(N^{3}) to O(N) when the dimension of the feature space is far less than the number of genes N, and it successfully reconstructs the low dimensional representation as does the classical MDS. Its performance depends on the grouping method and the minimal number of the intersection points between groups. Feasible methods for grouping methods are suggested; each group must contain both neighboring and far apart data points. Our method can represent a high dimensional large data set in a low dimensional space not only efficiently but also effectively. This SplitandCombine method makes the parallel computation of MDS feasible. If samples increase to the level that one computer could not handle, we can split data to several subgroups, assign them to different computers to compute the MDS, and then collect the results and combine them into one. In the cell cycle example, we showed that the clustering results of dimension reduction are more stable than the results in the original space. Hence, SCMDS has overcome the curse of dimensionality in MDS.
Availability and requirements

Project name: SCMDS

Project home page: http://idv.sinica.edu.tw/jengnan/scmds/scmds.html

Operating system(s): OS Portable (Source code to work with many OS systems)

Programming language: Matlab 6.0 or higher

Licence: GNU GPL

Any restrictions to use by nonacademics: License needed. Commercial users should contact jengnan@gmail.com
References
Mardia KV, Kent JT, Bibby JM: Multivariate Analysis. Academia Press; 1979.
Tenenbaum JB, Vin de Silva , Langford JC: A global geometric framework for nonlinear dimensionality reduction. Science 2000, 290(5500):2319â€“2323.
Morrison A, Ross G, Chalmers M: Fast multidimensional scaling through sampling, springs and interpolation. Information Visualization 2003, 2: 68â€“77.
Taguchi Yh, Oono Y: Relational patterns of gene expression via nonmetric multidimensional scaling analysis. Bioinformatics 2005, 21(6):730â€“740.
Pramila T, Wu W, Miles S, Noble WS, Breeden LL: The Forkhead transcription factor Hcm1 regulates chromosome segregation genes and fills the Sphase gap in the transcriptional circuitry of the cell cycle. Genes Dev 2006, 20(16):2266â€“2278.
Borg I, Groenen P: Modern Multidimensional Scaling: Theory and Applications (Springer Series in Statistics). New York: SpringerVerlag; 1996.
Torgerson WS: Multidimensional scaling: I. theory and method. Psychometrika 1952, 17: 401â€“419.
Chalmers M: A linear iteration time layout algorithm for visualizing highdimensional data. Proceedings of the 7th conference on Visualization 1996, 127ff.
Eades P: A heuristic for graph drawing. Congressus Numerantium 1984, 42: 149â€“160.
Buja A, Swayne DF, Littman M, Dean N, Hofmann H: XGvis: Interactive data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics 2001.
Steyvers M: Multidimensional Scaling. In Encyclopedia of Cognitive Science. London: Nature Publishing Group; 2002.
Morrison A, Ross G, Chalmers M: A hybrid layout algorithm for subquadratic multidimensional scaling. Proc IEEE Information Visualization (InfoVis'02) 2002, 152â€“158.
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17(6):520â€“525.
Sherwood T, Perelman E, Hamerly G, Calder B: Automatically characterizing large scale program behavior. 10th International Conference on Architectural Support for Programming Languages and Operating Systems 2002.
Acknowledgements
We thank the three reviewers for their valuable comments. We also thank National Center for HighPerformance Computing, National Applied Research Laboratories for use of the computer facilities. This study was supported by Academia Sinica and National Science Council, Taiwan.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
WH Li was the project leader in this study and he investigated the biological issues. HHS Lu and J Tzeng designed and gave the mathematical proof of the fast algorithm of MDS. J Tzeng carried out the programming, data collection and analysis. All authors read and approved the final manuscript.
Electronic supplementary material
12859_2007_2164_MOESM1_ESM.xls
Additional file 1: Function annotation of original space clustering. This file provided the function annotation of the best Kmeans clustering result in data presented by original space. Each column refers to a cluster. If the cluster has no efficient annotated function, then the corresponding column is empty. (XLS 55 KB)
12859_2007_2164_MOESM2_ESM.xls
Additional file 2: Function annotation of reduced space clustering. This file provided the function annotation of the best Kmeans clustering result in data presented by reduced 7dimensional space. Each column refers to a cluster. If the cluster has no efficient annotated function, then the corresponding column is empty. (XLS 57 KB)
Authorsâ€™ original submitted files for images
Below are the links to the authorsâ€™ original submitted files for images.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Tzeng, J., Lu, H.HS. & Li, WH. Multidimensional scaling for large genomic data sets. BMC Bioinformatics 9, 179 (2008). https://doi.org/10.1186/147121059179
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121059179