Multidimensional scaling for large genomic data sets
 Jengnan Tzeng^{1},
 Henry HorngShing Lu^{2} and
 WenHsiung Li^{1, 3}Email author
https://doi.org/10.1186/147121059179
© Tzeng et al; licensee BioMed Central Ltd. 2008
Received: 21 June 2007
Accepted: 04 April 2008
Published: 04 April 2008
Abstract
Background
Multidimensional scaling (MDS) is aimed to represent high dimensional data in a low dimensional space with preservation of the similarities between data points. This reduction in dimensionality is crucial for analyzing and revealing the genuine structure hidden in the data. For noisy data, dimension reduction can effectively reduce the effect of noise on the embedded structure. For large data set, dimension reduction can effectively reduce information retrieval complexity. Thus, MDS techniques are used in many applications of data mining and gene network research. However, although there have been a number of studies that applied MDS techniques to genomics research, the number of analyzed data points was restricted by the high computational complexity of MDS. In general, a nonmetric MDS method is faster than a metric MDS, but it does not preserve the true relationships. The computational complexity of most metric MDS methods is over O(N^{2}), so that it is difficult to process a data set of a large number of genes N, such as in the case of whole genome microarray data.
Results
We developed a new rapid metric MDS method with a low computational complexity, making metric MDS applicable for large data sets. Computer simulation showed that the new method of splitandcombine MDS (SCMDS) is fast, accurate and efficient. Our empirical studies using microarray data on the yeast cell cycle showed that the performance of Kmeans in the reduced dimensional space is similar to or slightly better than that of Kmeans in the original space, but about three times faster to obtain the clustering results. Our clustering results using SCMDS are more stable than those in the original space. Hence, the proposed SCMDS is useful for analyzing whole genome data.
Conclusion
Our new method reduces the computational complexity from O(N^{3}) to O(N) when the dimension of the feature space is far less than the number of genes N, and it successfully reconstructs the low dimensional representation as does the classical MDS. Its performance depends on the grouping method and the minimal number of the intersection points between groups. Feasible methods for grouping methods are suggested; each group must contain both neighboring and far apart data points. Our method can represent high dimensional large data set in a low dimensional space not only efficiently but also effectively.
Background
Representing high dimensional data in a low dimensional space is an important task because it becomes much easier to study the information structure when the dimension is greatly reduced. The main idea of MDS techniques is to configure the coordinates of the data in the significant space such that the pairwise relationship of relocated data in a low dimensional space is similar to that in the high dimensional space of the original data. With the dimensional reduction, one can cluster the data relationships by their distribution in the low dimensional space and explore significant patterns. When the data configuration is Euclidean, MDS is similar to principle component analysis (PCA), which can remove inherent noise with its compact representation of data [1]. When the data configuration is nonlinear, MDS can be further improved to capture the imbedded manifold in data [2].
MDS techniques have been applied to many fields, e.g., pattern recognition, stock market analysis, and molecular conformational analysis. However, the computational complexity of most metric MDSs is over O(N^{2}), though some nonmetric methods can reduce the complexity to $O(N/\sqrt{N})$[3]. Genomics research represents a challenging application of MDS. Data from microarray experiments are typically noisy with a large number of genes, but few replicates and frequent data updates. Due to the high computational complexity, it is very difficult to apply MDS to whole genome data, such as ~6000 genes in yeast, not to mention ~23,000 genes in human.
Taguchi and Oono [4] developed a novel algorithm for nonmetric MDS analysis and applied it to analyze patterns of gene expression. However, the result of a nonmetric MDS method depends heavily on the initial configuration and nonmetric MDS only preserves the order of similarities instead of the original scale of similarities. Therefore, it remains an important issue to reduce the computational complexity for a metric MDS. In this paper, we develop a fast metric MDS method for large data sets that is suitable for data analysis and updates. Indeed, the computational time for 6000 target data points is within 30 seconds in a PC with CPU 1.67 GHz and 2G memory.
We review typical MDS techniques in the following and propose a new MDS method to solve the problem of large data sets in Section 3. We split the data into overlapping subsets, apply our MDS technique to each subset, and then recombine the configurations of the subsets into the same space. We call this method the splitandcombine MDS (SCMDS) method. The complexity of SCMDS is O(p^{2} N), where p is the dimension of the feature space, which is far smaller than the number of data points N. In Section 4, we evaluate the performance of SCMDS using simulation, apply SCMDS to the GO database, and lastly, improve the Kmeans clustering of gene expression profiles by applying SCMDS to yeast cell cycle microarray data [5].
There are many different categories of MDS techniques. For example, a distinction can be made between metric and nonmetric MDSs, between weighted and unweighted MDSs, between single matrix and multiple matrices, and between deterministic and probabilistic matrices [3, 6]. In this section, we introduce three typical MDS methods that are relevant to the present work.
Classical MDS (CMDS)
Torgerson [7] proposed the first MDS method. The distance is the Euclidian distance, and the similarity matrix is complete (with no missing data) and symmetric. The main idea was that given the Euclidean distances or the inner products among points, it is possible to construct a matrix X of Cartesian coordinates of these points in the Euclidean space. Torgerson derived matrix X from the distance (or similarity) matrix D and showed what to do when the distance matrix includes noisy data. The key is to apply the double centering operator and singular value decomposition (SVD).
where ${\overline{D}}_{r}$ denotes the row means, ${\overline{D}}_{c}$ denotes the column means and ${\overline{D}}_{g}$ denotes the grand mean. If $H=1\frac{1}{N}{\dot{\text{1}}\dot{\text{1}}}^{T}$, (1) can be simplified asB = HDH.
After performing double centering on D, one can apply SVD to B. Since B is symmetric, the decomposition is of the formB = UVU^{ T }.
Hence, $\sqrt{B}=X\frac{1}{N}X{11}^{T}=U{V}^{\frac{1}{2}}$, where the columns of $\sqrt{B}$ are the coordinates of data, with the mean of the data being moved to the original point.
where V_{ r }is the r × r submatrix of V and U_{ r }is a matrix of size N × r.
There are many drawbacks of this method [8]. For example, missing data is not allowed and the computational complexity is O(N^{3}). Hence, this method is not suitable for massive data sets.
Chalmer's Linear Iteration Algorithm
One of the forcebased models of MDS is the spring model [9]. It considers each point of the data as a vertex in a low dimensional space, with springs connecting each vertex and the distance (or the spring length) between vertices proportional to their high dimensional distance. If d_{i, j}denotes the high dimensional distance between vertices i and j, and δ_{i, j}denotes the low dimensional distance, then the stress between vertices i and j is proportional to d_{i,j} δ_{i, j}. The spring model computes (N  1) forces at each vertex per iteration, and the computational complexity of this model is O(N^{2}) per iteration.
Chalmers [8] proposed a linear iteration time layout algorithm. Instead of computing all the forces at each point, he computed only constant forces in the neighborhood of each point and randomly chose another constant point that is not in the neighborhood to compute the large distance effect. Only constant points are computed at each point, so that the computational complexity is reduced to O(N) per iteration. This spring model does not find the steady state solution in general. One can only process a fixed number of iterations, say 8 or 10, as opposed to finding the converged solution. Note that the constant forces are selected in both the neighborhood and afar. Failure to incorporate one of these two forces will diminish the performance of this method [8].
Anchor Point Method
As in Chalmers' linear layout algorithm, in the anchor point MDS method [10] only a portion of data is used to reconstruct the layout for intermediate steps. The data are grouped into clusters, so that the distances between points in different clusters are less meaningful than the distances between points in the same cluster.
In this method, some points in the same cluster are chosen as anchors and others are considered as floaters. Distance information of anchors is used to construct the coarse structure of layout, and the floaters are used to update the fine structure. When a small number of K anchor points are chosen, a modified MDS procedure only computes the N × K matrix. Buja et al. [10] showed that the number of anchors could not be smaller than the dimension p of the given data. Moreover, the anchors should be chosen carefully because random choices of anchors do not work [10]. This is challenging when the grouping structure is unknown.
From these two methods, we can see that the intermediate steps for calculating MDS do not need to employ all entries of the dissimilarity matrix. We can use this property to reduce the computational complexity of MDS. Another important issue is choosing the number of dimensions for layout. In a small data set, one can use the elbow test or similar methods to detect the changing shape for the decay of the spectrum of SVD to determine the layout dimension. In a large data set, one feasible approach is to use stochastic methods by crossvalidation to measure the layout dimension [11].
Methods
We first describe SCMDS using a simple case, and for convenience we use the classical MDS (CMDS) as the default MDS method to show how we can improve it. Assume that x_{ i }∈ R^{ p }are the coordinates of data points for i = 1,...,N and p <<N. We define d_{i,j}= x_{ i } x_{ j }_{2} as the Euclidean distance between x_{ i }and x_{ j }. N is large such that applying the CMDS technique is impractical. We split the points into two overlapping sets S_{1} and S_{2}, and the intersection of these two sets contains more than p points. The main idea is to apply MDS to each individual set to get the configuration, and to use the information of the overlapping points to combine the two sets into one. There are two problems that need to be solved: (1) how to combine two sets into one and (2) what is the sufficient condition for this solution to be equivalent to that obtained by working directly with the full set? The solutions to these two problems are proposed in the next subsections.
Combination Method
When we split the whole set of data points into two overlapping subsets of equal sizes, the combined size of the two distance matrices for the two subsets is less than that for the whole set. Assume that the configurations of these two subsets obtained from MDS are x_{i,1}and x_{i,2}and the dimensions of the two configurations are the same. We can fix the coordinates of the data points in the first set and use the overlapping data points to find an affine mapping U(·) + b such that for each intersection point x_{k,1}∈ S_{1}, x_{k,1}= Ux_{j,2}+ b, where x_{j,2}∈ S_{2} for some j. Note that the matrix U of the affine mapping is a unitary matrix, which is a volumepreserving operator. The affine mapping can be found as follows.
Assume X_{1} and X_{2} are matrices in which the columns are the two coordinates of the overlapping points obtained by applying MDS to two data sets, and ${\overline{X}}_{1}$ and ${\overline{X}}_{2}$ are the means of columns of X_{1} and X_{2}, respectively. In order to use the same orthogonal basis to represent these vertices, we apply QR factorization to ${X}_{1}{\overline{X}}_{1}{\dot{1}}^{T}$ and ${X}_{2}{\overline{X}}_{2}{\dot{1}}^{T}$, so that ${X}_{1}{\overline{X}}_{1}{\dot{1}}^{T}={Q}_{1}{R}_{1}$ and ${X}_{2}{\overline{X}}_{2}{\dot{1}}^{T}={Q}_{2}{R}_{2}$. Since these two coordinates represent the same points, the triangular matrices R_{1} and R_{2} should be identical when there is no rounding error from computing the QR factorization in X_{1} and X_{2}. The positive and negative signs of columns of Q_{ i }could be arbitrarily assigned in the computation of QR factorization. Hence, the signs of columns of Q_{ i }should be adjusted according to the corresponding diagonal elements of R_{ i }so that the signs of diagonal elements of R_{1} and R_{2} become the same.
That is, the unitary operator is $U={Q}_{1}{Q}_{2}^{T}$ and the shifting operator is $b={Q}_{1}{Q}_{2}^{T}{\overline{X}}_{2}+{\overline{X}}_{1}$.
In practice, one problem with the X_{1} and X_{2} obtained by CMDS from a distance matrix is that some of the eigenvalues of the double centered distance matrix can be negative. When negative eigenvalues occur, the dimension of the configuration is determined by the number of positive eigenvalues. This could cause those two triangular matrices R_{1} and R_{2} to be unequal and sometimes the dimensions of the configurations of the two subsets are different. In the case of equal dimensions, we can still use equation (6) to combine the two data sets into one, but the equality in equation (6) becomes an approximation and the combination will induce computing errors. In the case that the dimension of X_{1} is not equal to that of X_{2}, for example, dim(S_{1}) = q_{1} < dim(S_{2}) = q_{2}, we project x_{i,2}into the space generated by the leading q_{1} basis of Q_{2} We then use the new projected configuration of x_{i,2}and the configuration x_{i,1}to perform the combination processing. The projection of x_{i,2}from q_{2} dimension to q_{1} dimension induces computational errors too. To avoid this error, the sample number of the overlapping region is important. This sample number must be large enough so that the derived dimension of data is greater or equal to the real data.
Sufficient Condition for Successive Combinations
In the case of a large number of data points, the data points are split into several overlapping groups such that the number of overlapping points is greater than the dimension of real data. The recombination approach is similar to the case of two overlapping subsets. For example, we split data points into K overlapping chained subsets {S_{1},...,S_{ k }}, i.e. S_{ i }∩ S_{i+1}≠ ∅; we apply the MDS techniques to each S_{ i }; we use the configurations of S_{1} as the central reference and combine the subsets around S_{1}; we repeat this procedure until all the subsets are combined.
Computational Complexity Reduction
We now show how SCMDS reduces the computational complexity of CMDS from O(N^{3}) to O(N) when p <<N. Assume that there are N points in a data set, N_{ I }is the number of points in each intersection region, and N_{ g }is the number of points in each group. When we split N points into K overlapping groups, we have KN_{ g } (K1)N_{ I }= N, and we have $K=\frac{(N{N}_{I})}{({N}_{g}{N}_{I}}~O(N)$.
Thus, when p <<N, the computation time of the SCMDS becomes O(p^{2}N). Moreover, if $p<\sqrt{N}$, the computational complexity is smaller than $O(\sqrt{N}N)$, which is the computational complexity for the fast MDS method proposed by Morrison et al. [12]. When p is very large (say p^{2} > N), SCMDS has no advantage in computational speed, but it makes the computation feasible even when the computer memory does not afford the MDS computation. Furthermore, if we do not use CMDS as our default MDS method but use the linear time algorithm [8] instead, then we can further reduce the complexity of the SCMDS method to O(pN). This improvement makes application of the MDS to large data sets feasible. Hence, data analysis using SCMDS can guarantee better accuracy than existing nonmetric MDS methods.
Results and Discussion
Simulation experiments
We simulate a spiral in two dimensions:r = 2θ, 2π ≤ θ ≤ 5π.
First, let us construct the reference coordinates X. Discretize θ into N1 intervals and let θ_{ i }= 2π + idθ, $d\theta =\frac{3\pi}{N1}$. At each θ_{ i }, k points are constructed with noise. The following steps generate the data:
where n_{ i }is a random variable with normal distribution n_{ i }~ N(0,1), for i = 1⋯kN. Then we add some p2 dimensional randomness into the coordinate matrix as follows. Let Z = ⟨α_{ j }u_{i,j}⟩, u_{i,j}~ N(0,1), for i = 1⋯kN and j = 1⋯p  2, where α_{ j }is the parameter to control the standard deviation of random variables. The final reference coordinates matrix X isX = [q_{1}, q_{2}, Z]^{ T }= [x_{1}, x_{2},...,x_{ kN }],
where x_{ i }∈ R^{ p }. So the distance matrix is D = ⟨d_{i, j}⟩ with d_{i,j}= x_{ i } x_{ j }. In this simulation, p = 17 is used.
Figure 2 shows the correlation between STRESS and the number of points in the overlapping region N_{ I }. There are two lines. The solid line refers to points in each group that are chosen randomly, while the dash line refers to points in each group that are chosen from neighbors; all N_{ g }are fixed to 200. Points in each line are the average of 20 repeats of STRESS computed with different randomly choosing points in each group. We can see that the STRESS of each line decays as N_{ I }increases. When N_{ I }is larger than the real dimension of data, the STRESS is almost zero. When N_{ I }is smaller than the real dimension of data, the performance of solid line is better than the dash line. Without the afar information in each group, the performance is worse. That is, randomly choosing points in each group helps to reduce the error when N_{ I }is not large enough.
In the same simulation example, we observed the relationship between error and the number of groups. When N_{ I }is larger than the dimension of the data, like N_{ I }= 20, we can reconstruct the configuration of the data well. The number of groups of our grouping method does not affect the STRESS and the average STRESS is ~10^{12}. Hence, SCMDS gives accurate results if we carefully control the number of overlapping points and choose random points in each group.
Next, we compare the speeds of nonmetric MDS, CMDS and SCMDS. We create random vectors in a 20dimension space, with the sample size ranging from 50 to 2000.
Because of the hardware limitation to process CMDS, we use only a set of 2000 sample points as the maximal sample size to demonstrate that SCMDS performs as well as CMDS when the number of data points (N) is not very large. We give below an example of large N to show that SCMDS works well in a large data set that cannot be handled by CMDS.
Gene Correlation Map
Gene correlation maps are used to represent the correlations of genes such that genes with similar biological functions or in the same biological pathway tend to be located in the same neighborhood. It provides a prior probability in many applications of genome research by Bayesian methods. Since Affy U133A GeneChip is widely used in many studies, we used genes listed in this chip and GO descriptions on the Gene Ontology website to create the gene correlation map. In this chip, there are 22,283 genes and 2168 GO terms are used in the list of these genes. Hence, we consider each gene a binary vector with length of 2168. If the ith term of the vector is one, then this gene has the ith GO description. There are 5781 genes without any GO description so that these genes are not used to compute the correlation with the genes with a GO description. Hence, the termdocument matrix is reduced to the size of 16502 × 2168.
Gene expression clustering
The goal of gene expression clustering is to subdivide a set of gene expressing profiles into clusters such that genes in the same cluster share the same or similar patterns of expression profiles. In the situation with high dimensional data, researchers tend to obtain a manifold, which is defined by the regression of the data. Because gene expression data is typically noisy, by clustering the projection of data in this manifold, which is in a lower dimensional space in comparison to the original data, a better result can be obtained. In this section, we show that the SCMDS method can successfully transform a high dimensional gene expression data set to a much lower dimension and preserve the intrinsic information of the original data. This transformation makes the clustering algorithm faster to get a converged solution. Moreover, the representation of these gene expression data in this lower dimensional space reveals a better clustering result in biological validation.
We use the α38synchronized yeast cell cycle dataset [5]. There were 5006 genes in the data set, and each gene has a 25 pointtimecourse expression profile. However, this expression profile contained missing values, and we input these missing values by the KNN imputation method [13]. To avoid the synchronized block effect, we remove the first two points of the time course data, so that the expression profiles for each gene are left with 23 timecourse points. We then (1) compute the pairwise dissimilarity of genes from each subset that are randomly chosen by the sufficient condition of SCMDS; (2) apply SCMDS to the subsets to reconstruct the new coordinate in the feature space; and (3) cluster genes in this feature space. Because CMDS cannot handle large samples and thus cannot be used to reduce the dimension when the sample size is large, we do not consider the clustering results in the reduced dimension space derived from CMDS. Instead, we compare the clustering results in the reduced dimension space derived from SCMDS with the results obtained in the original (nonreduced) space.
Using the standard Euclidean distance to measure the pairwise dissimilarity of genes, we process the SCMDS method such that the data points are split into 61 groups, N_{ I }= 100 and N_{ g }= 200.
Note that instead of computing the pairwise dissimilarity for every pair, we need to compute only the pairwise dissimilarity in each group. This reduces the computational complexity of all processing to O(N). We choose N_{ I }to be greater than the length of original time course data points to satisfy the sufficient condition for an accurate layout.
Before applying Kmeans clustering, we determine how many clusters we should use as the parameter in the Kmeans process. We search the number of clusters in the Kmeans process from 10 to 75. Since the clustering result of Kmeans depends on the initial guess of the centroids of the sets, we repeat 30 times the Kmeans process and use the Bayesian Information Criterion (BIC) score [14] to pick up the best clustering index from the mean of these 30 clustering results. In each Kmeans process, if it does not reach convergence in 100 iterations, we reset the initial values of the cluster centers and rerun the Kmeans process until the process converges. We set the upper bound for rerunning time to be 5. If the rerunning time reaches this upper bound, we choose the final cluster index by the best cluster sets from the previous iteration results.
Conclusion
Our new method reduces the computational complexity from O(N^{3}) to O(N) when the dimension of the feature space is far less than the number of genes N, and it successfully reconstructs the low dimensional representation as does the classical MDS. Its performance depends on the grouping method and the minimal number of the intersection points between groups. Feasible methods for grouping methods are suggested; each group must contain both neighboring and far apart data points. Our method can represent a high dimensional large data set in a low dimensional space not only efficiently but also effectively. This SplitandCombine method makes the parallel computation of MDS feasible. If samples increase to the level that one computer could not handle, we can split data to several subgroups, assign them to different computers to compute the MDS, and then collect the results and combine them into one. In the cell cycle example, we showed that the clustering results of dimension reduction are more stable than the results in the original space. Hence, SCMDS has overcome the curse of dimensionality in MDS.
Availability and requirements

Project name: SCMDS

Project home page: http://idv.sinica.edu.tw/jengnan/scmds/scmds.html

Operating system(s): OS Portable (Source code to work with many OS systems)

Programming language: Matlab 6.0 or higher

Licence: GNU GPL

Any restrictions to use by nonacademics: License needed. Commercial users should contact jengnan@gmail.com
Declarations
Acknowledgements
We thank the three reviewers for their valuable comments. We also thank National Center for HighPerformance Computing, National Applied Research Laboratories for use of the computer facilities. This study was supported by Academia Sinica and National Science Council, Taiwan.
Authors’ Affiliations
References
 Mardia KV, Kent JT, Bibby JM: Multivariate Analysis. Academia Press; 1979.Google Scholar
 Tenenbaum JB, Vin de Silva , Langford JC: A global geometric framework for nonlinear dimensionality reduction. Science 2000, 290(5500):2319–2323.View ArticlePubMedGoogle Scholar
 Morrison A, Ross G, Chalmers M: Fast multidimensional scaling through sampling, springs and interpolation. Information Visualization 2003, 2: 68–77.View ArticleGoogle Scholar
 Taguchi Yh, Oono Y: Relational patterns of gene expression via nonmetric multidimensional scaling analysis. Bioinformatics 2005, 21(6):730–740.View ArticlePubMedGoogle Scholar
 Pramila T, Wu W, Miles S, Noble WS, Breeden LL: The Forkhead transcription factor Hcm1 regulates chromosome segregation genes and fills the Sphase gap in the transcriptional circuitry of the cell cycle. Genes Dev 2006, 20(16):2266–2278.PubMed CentralView ArticlePubMedGoogle Scholar
 Borg I, Groenen P: Modern Multidimensional Scaling: Theory and Applications (Springer Series in Statistics). New York: SpringerVerlag; 1996.Google Scholar
 Torgerson WS: Multidimensional scaling: I. theory and method. Psychometrika 1952, 17: 401–419.View ArticleGoogle Scholar
 Chalmers M: A linear iteration time layout algorithm for visualizing highdimensional data. Proceedings of the 7th conference on Visualization 1996, 127ff.Google Scholar
 Eades P: A heuristic for graph drawing. Congressus Numerantium 1984, 42: 149–160.Google Scholar
 Buja A, Swayne DF, Littman M, Dean N, Hofmann H: XGvis: Interactive data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics 2001.Google Scholar
 Steyvers M: Multidimensional Scaling. In Encyclopedia of Cognitive Science. London: Nature Publishing Group; 2002.Google Scholar
 Morrison A, Ross G, Chalmers M: A hybrid layout algorithm for subquadratic multidimensional scaling. Proc IEEE Information Visualization (InfoVis'02) 2002, 152–158.Google Scholar
 Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17(6):520–525.View ArticlePubMedGoogle Scholar
 Sherwood T, Perelman E, Hamerly G, Calder B: Automatically characterizing large scale program behavior. 10th International Conference on Architectural Support for Programming Languages and Operating Systems 2002.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.