caBIG™ VISDA: Modeling, visualization, and discovery for cluster analysis of genomic data

Background The main limitations of most existing clustering methods used in genomic data analysis include heuristic or random algorithm initialization, the potential of finding poor local optima, the lack of cluster number detection, an inability to incorporate prior/expert knowledge, black-box and non-adaptive designs, in addition to the curse of dimensionality and the discernment of uninformative, uninteresting cluster structure associated with confounding variables. Results In an effort to partially address these limitations, we develop the VIsual Statistical Data Analyzer (VISDA) for cluster modeling, visualization, and discovery in genomic data. VISDA performs progressive, coarse-to-fine (divisive) hierarchical clustering and visualization, supported by hierarchical mixture modeling, supervised/unsupervised informative gene selection, supervised/unsupervised data visualization, and user/prior knowledge guidance, to discover hidden clusters within complex, high-dimensional genomic data. The hierarchical visualization and clustering scheme of VISDA uses multiple local visualization subspaces (one at each node of the hierarchy) and consequent subspace data modeling to reveal both global and local cluster structures in a "divide and conquer" scenario. Multiple projection methods, each sensitive to a distinct type of clustering tendency, are used for data visualization, which increases the likelihood that cluster structures of interest are revealed. Initialization of the full dimensional model is based on first learning models with user/prior knowledge guidance on data projected into the low-dimensional visualization spaces. Model order selection for the high dimensional data is accomplished by Bayesian theoretic criteria and user justification applied via the hierarchy of low-dimensional visualization subspaces. Based on its complementary building blocks and flexible functionality, VISDA is generally applicable for gene clustering, sample clustering, and phenotype clustering (wherein phenotype labels for samples are known), albeit with minor algorithm modifications customized to each of these tasks. Conclusion VISDA achieved robust and superior clustering accuracy, compared with several benchmark clustering schemes. The model order selection scheme in VISDA was shown to be effective for high dimensional genomic data clustering. On muscular dystrophy data and muscle regeneration data, VISDA identified biologically relevant co-expressed gene clusters. VISDA also captured the pathological relationships among different phenotypes revealed at the molecular level, through phenotype clustering on muscular dystrophy data and multi-category cancer data.


Initialization sensitivity, local optimum and solution reproducibility
There are multiple local optimums of partitional clustering objective functions such as KMC and mixture modeling. Standard learning methods optimize the objective function and thus find the local optimum solution "nearest to'' the model initialization. More sophisticated methods or initialization schemes, which seek to avoid poor local optimums, exist, but require significant computation and normally do not ensure convergence to the global optimum (Rose, 1998). The quality of the local optimum may be decidedly poor, compared to that of the global optimum. This is especially true for genomic data sets, with high dimensionality and small sample size (Bishop, 1995;Duda, et al., 2001;Zhu, et al., 2008). On the other hand, methods such as conventional agglomerative Hierarchical Clustering (HC) perform a very simple, greedy optimization, which severely limits the amount of search they perform over the space of possible solutions, and thus limits the solution accuracy.

Solution reproducibility
Since clustering may help drive scientific hypotheses, it is extremely important that solutions be reproducible/robust. This refers to the ability of a clustering algorithm to identify the same (or quite similar) structure under different model initializations, in the presence of small data set perturbations or additive noise, as well as when analyzing independent data sets drawn from the same underlying distribution. KMC and mixture modeling, which are sensitive to model initialization, do not yield reproducible solutions when random model initialization is applied. However, more sophisticated initialization, e.g., applying the splitting algorithm (Hartigan, 1975) to initialize KMC cluster centers, will yield less variable KMC solutions. Likewise, robust KMC solutions can in turn be used to provide a robust initialization for mixture model fitting. On the other hand, agglomerative HC (e.g., the single linkage algorithm (Duda, et al., 2001)) is highly sensitive to the particular data sample and to the noise in the data.

Order selection: estimating the number of clusters
For hierarchical clustering, simple heuristics are typically applied to estimate the number of clusters, such as identifying the order at which a large improvement in fitting accuracy occurs. For mixture modeling, information-theoretic model selection criteria such as minimum description length (MDL) are often applied. These criteria consist of a data fitting term (mismatch between data and model) and a model complexity term. For sample clustering on genomic data, with high feature/gene dimensionality and small sample size, use of such criteria is likely to grossly underestimate the number of clusters because each gene will entail at least one free parameter per cluster, leading to a very high complexity penalty when increasing the number of clusters. This can be mitigated by front-end gene selection or embedding feature selection within clustering (Graham and Miller, 2006;Wang, et al., 2008), which reduce the gene number and, thus, model complexity, for a given number of clusters.

Confounding variables and multiple sources of cluster structure
A fundamental, flawed assumption in much clustering work is that there is a single source of clustering tendency in the data -e.g., "disease'' vs. "non-disease''. There may be other sources of clustering tendency, based either on measured or even unmeasured (latent) factors, e.g. gender, environment, measurement platform (Benito, et al., 2004), or other biological processes that are peripheral or at any rate not of direct interest in the current study (Clarke, et al., 2008). While there is some research in this area (Benito, et al., 2004), more work is needed in removing or accounting for confounding influences in genomics research (Miller, et al., 2008;Wang, et al., 2008).

Exploitation of prior knowledge and human interaction
The most fundamental limitation in clustering is the ill-posed nature of the problem. The number of clusters depends on the scale at which one views the data (Rose, 1998) and the data grouping clearly depends on the geometric or statistical assumption/definition of cluster (Duda, et al., 2001;Lange, et al., 2004). Even when there is a known parametric mixture model form, there may be no unique solution, i.e. there is the mixture identifiability problem (Duda, et al., 2001). What can break nonuniqueness and ill-posedness is to incorporate prior knowledge. Incorporating prior knowledge may also help to focus the clustering analysis on interesting data structure and remove the effect of confounding variables. Unfortunately, most clustering algorithms do not have mechanisms for exploiting prior information -exceptions include semi-supervised gene clustering methods that utilize gene annotations to form gene clusters (Brown, et al., 2000;Pan, 2006;Qu and Xu, 2004) and some other more general semi-supervised methods . Besides auxiliary database information, the user's (expert's) domain knowledge and the human gifts for pattern recognition and data visualization in low-dimensional spaces can also help to produce accurate and meaningful clustering outcomes (Chien, 1978;Zou and Nagy, 2006). For example, Bishop and Tipping developed a hierarchical data visualization and exploration scheme based on a mixture of latent variables model with human-data interaction (Bishop and Tipping, 1998;Tipping and Bishop, 1999).

1.7
Inflexibility and non-adaptivity of standard methods Y. Zhu,et al. VISDA: modeling,visualization and discovery 4 Most clustering algorithms have a standalone nature and make underlying statistical or geometric assumptions about clusters. When these assumptions are violated, the method may fail badly, and with no "backup plan'', i.e. no capability to modify the assumptions to seek a better result. A recent strategy with some ability to mitigate this is ensemble or consensus clustering (Strehl and Ghosh, 2002;Topchy, et al., 2005), wherein multiple algorithms are applied and their results then fused so as to maximize a clustering "consensus'' function. However, these methods, again being fully automated, cannot benefit from human interaction and expert knowledge, which can guide algorithm choices in a flexible, adaptive manner, to match the underlying data characteristics and the (domain-specific) clustering structure of interest. In particular, as demonstrated in this paper, clustering and data visualization can complement each other, providing a "transparent'' clustering process that enhances the user's understanding of the data and that informs clustering via choices based on expert knowledge and human intelligence.

Algorithm for gene clustering and sample clustering
Let t = {t 1 , t 2 , …, t N | t i ∈ R p , i = 1, 2, …, N} denote N p-dimensional data points to be clustered. Suppose that the hierarchical exploration proceeds to the lth level, i.e. K l clusters are detected at level l and the posterior probability of data point x i belonging to cluster k (k = 1, …, K l ) is z i, k .
PCA uses the eigenvectors associated with the largest eigenvalues of the cluster's covariance matrix as projection directions. The covariance matrix of cluster k is calculated by where μ t, k is the mean of cluster k, Σ t, k is the covariance matrix of cluster k. The subscript 't' indicates that these parameters model the data in the original data space.
PCA-PPM selects two of the eigenvectors on which the projected data distribution has the smallest kurtosis, because flat distributions or distributions with thick tails usually reveal structure information and kurtosis measures the peakedness of a distribution (Hyvärinen, et al., 2001). For any projection vector w, the kurtosis of the projected data distribution is calculated by where y i is the image of t i after projection.
(1) Apply HC on the data points that most likely belong to the cluster, i.e. the data points that have a bigger posterior probability of belonging to cluster k than to all other clusters. On the HC dendrogram, the user chooses a distance threshold to cut the cluster into sub-clusters. Very small sub-clusters are merged into their nearest larger sub-clusters.
(2) Run KMC on the data points using the sub-clusters obtained from HC as initialization. (3) Use the result of KMC to initialize an SFNM model in the visualization space, and run the EM algorithm to refine the model. The SFNM model and the corresponding EM algorithm take the form of Equation (12) and Equation (13), respectively. (4) DCA based on the weighted Fisher criterion uses the two eigenvectors associated with the largest eigenvalues of the weighted Fisher Scatter Matrix (wFSM) calculated based on the refined SFNM partition to project the cluster (Duda, et al., 2001;Loog, et al., 2001). Compared to the Fisher criterion, this weighted Fisher criterion confines the influence of class pairs that are well separated, and emphasizes on the class pairs that are overlapped. DCA based on the weighted Fisher criterion is expect to get a good total Bayesian classification accuracy, while a projection based on the Fisher criterion can be highly influenced by the well separated class pair, which is not so important for reducing the classification error. Suppose that K k sub-clusters exist after the SFNM model fitting. Let μ t, (k, j) and Σ t, (k, j) be the mean and covariance of sub-cluster j in the SFNM model. The wFSM is calculated by where r i is the sample proportion of sub-cluster i within cluster k, and erf(•) is the Gaussian error function. The two eigenvectors found in this way are not guaranteed to be orthogonal, so they need to be orthogonalized by the Gram-Schmidt process to achieve an affine projection.
LPP also works on the data points that most likely belong to the cluster. The projection directions are obtained through minimizing a compactness cost function, which is a weighted summation of the pair-wise distances between points in the projection space. The weights are assigned in a way that the distances between neighboring points have big weights while the distances between far apart points have small weights. Thus the minimization emphasizes on keeping the neighboring data points still close in the projection space and preserves the local data structure. The minimization is achieved by the generalized eigenvalue decomposition (He and Niyogi, 2004). The eigenvectors are also orthogonalized by the Gram-Schmidt process to form the projection matrix.
The APC-DCA projection follows the idea of using DCA to evaluate/confirm unsupervisedly obtained partitions, but the process is automatic. APC is applied on the data points that most likely belong to the cluster to find a partition. Being quite different from the KMC, which needs an initialization of cluster centers, APC simultaneously considers all data points as potential cluster centers. By viewing each data point as a node in a network, the affinity propagation method recursively transmits along edges of the networks real-valued messages, whose magnitude reflects the current affinity that one data point has for choosing another data point as its cluster center, until a good set of cluster centers and corresponding clusters emerges (Frey and Dueck, 2007). The messages are updated to search for minima of the cost function, which is the sum of dissimilarities between data points and their cluster centers. It was shown that the affinity propagation method finds a "neighborhood minima" solution, i.e. the obtained solution is the best solution in a particular large region around it (Frey and Dueck, 2007;Weiss and Freeman, 2001). This condition is weaker than a global minimum but stronger than a local minimum. Compared to randomly initialized KMC, APC achieves a better cost function value in a time efficient manner (based on comparing a single run of APC with 10,000 runs of randomly initialized KMC) (Frey and Dueck, 2007). In addition, APC has an automatic model selection scheme. Here, DCA is also based on the wFSM calculated by Equation (3), but specialized for the "hard" case, because APC generates a hard partition, not a soft partition like SFNM fitting. Orthogonalization of the eigenvectors by the Gram-Schmidt process is needed to achieve an affine projection.
Obviously, VISDA's projection suite framework is scalable and extensive to incorporate various existing clustering and visualization algorithms to increase the chance of revealing data structure of interest. When all the projections are made and shown to the user, the user will be asked to select one projection that he/she thinks best reveals the data structure as the final visualization. An interesting question is how effective a particular projection is. It may be agreeable that the bench mark criteria in visual exploration are very different and difficult (Nielson, 1996). As shared by Bishop and Tipping (Bishop and Tipping, 1998), we believe that in data visualization there is no objective measure of quality, and so it is difficult to quantify the merit of a particular data visualization method, and the effectiveness of such a techniques is often highly data dependent and application dependent. A possible alternative is to perform a rigorous psychological evaluation using a simple and controlled environment, or to invite domain experts to directly evaluate the efficacy of the algorithm for a specified task. For the practical use of VISDA, the user has two guidelines for selecting the best projection. One guideline is human inspection/discernment on sub-cluster separability, i.e. if in a projection, the sub-clusters are well-separated and show clear data structure, this projection is a good projection. The other guideline is domain knowledge. If the user is certain about the relationship between some samples or genes under the particular experimental condition, he/she can choose a projection that favors this relationship. For example, in gene clustering, if several genes are well studied in previous researches and are known to be co-regulated and co-expressed under the particular experimental condition that produces the data, a projection that has these genes closely located is more preferred than a projection that has these genes located far apart.
Besides the two guidelines given above, from the experience of performing the experiments in this paper, authors also got some empirical understanding of using the projection suite. When the data size is pretty big, e.g. doing gene clustering of thousands of genes, the HC-KMC-SFNM-DCA, LPP, and APC-DCA projection may be slow and may even encounter the "out of memory" error, because they need to calculate square matrices with dimensionality of the data size in the clustering process. So for very big dataset, we did not use these three projections, but only use the PCA and PCA-PPM projections. When the data size is moderate, although all projections may be used as the final visualization, the authors used HC-KMC-SFNM-DCA and APC-DCA slightly more frequently, especially HC-KMC-SFNM-DCA. A possible reason is that 8 DCA finds the projection directions with which the obtained sub-clusters show best separability. Compared with APC-DCA, HC-KMC-SFNM-DCA can handle bigger dataset than APC-DCA and runs faster. It also includes more human data interaction and lets the user control the number of sub-clusters by cutting on the dendrogram of HC. When the data size is very small and the data dimensionality is high, i.e. the sample-to-dimensionality ratio is very low, the DCA based projections may always find a projection space that the sub-clusters show good separability, or even greatly squeeze the sub-cluster, so that it looks like one point in the projection space. In such cases, we normally did not choose DCA based projections.

Cluster decomposition in visualization space
We use the two-level hierarchical SFNM model to present the relationship between the lth and the l + 1th levels of the VISDA's hierarchical exploration. The probability density function for a two-level hierarchical SFNM model is formulated as: where K k, l+1 sub-clusters exist at level l + 1 for each cluster k of level l, π k is the mixing proportion for cluster k in level l, π j|k is the mixing proportion for sub-cluster j within cluster k, g(•) is the Gaussian probability density function, and θ t, (k, j) are the associated parameters of subcluster j. When the cluster labels of the samples in level l are known, the conditional distribution is formulated as where r i, k is a binary cluster label indicator of sample i, i.e. one of {r i, 1 , …, , l i K r } is one and others are all zeros. In fact, we only have partial, probabilistic, information in the form of the posterior probability z i, k for cluster k having generated t i . Taking the expectation of the conditional log-likelihood and focusing only on cluster k, we get ( ) According to the projection invariant property of normal distributions, i.e. a Gaussian distribution is still a Gaussian distribution after a linear projection, the projected data have an expectation of conditional log-likelihood given by where x = {x 1 , x 2 , …, x N | x i ∈ R 2 , i = 1, 2, …, N} indicates all the projected data points, the subscript 'x' indicates that these parameters model the data in the visualization space. Equation (7) is a weighted log-likelihood, whose value can be maximized or locally maximized by the Expectation Maximization (EM) algorithm (Dempster, et al., 1977). The EM algorithm iteratively performs the E-step and M-step to monotonically increase the log-likelihood until convergence. The E-step calculates the expectation of the samples' sub-cluster labels conditioned on the data and the current model parameters.
where z i, (k, j) is the posterior probability of data point x i belonging to the jth sub-cluster in cluster k, μ x, (k, j) and Σ x, (k, j) are the mean and covariance matrix of sub-cluster j in cluster k. The M step updates the model parameters using formulas given by Models with different numbers of sub-clusters are initialized by the user and trained by the EM algorithm. The obtained partitions of all the models are displayed to the user as a reference for model selection. The MDL criterion is also utilized as a theoretical validation for model selection (Rissanen, 1978;Schwarz, 1978). Because the data size of the microarray gene expression dataset is usually small, the MDL model selection with Gaussian distributions has the tendency to select complex models in low dimensional space (Ridder, et al., 2005). Based on our experimental experience and reference to (Liang, et al., 1992;Ridder, et al., 2005), we use a modified formula to calculate the description length given by where K a and N k are the number of free adjustable parameters and the effective number of data points in the cluster, respectively. This modified MDL formula not only eases the trend to overestimate the sub-cluster number when the data size is small, but also is asymptotically consistent with the classical MDL formula, which is

The full dimensional model and its training algorithm
The probability density function of the SFNM model is where K l+1 is the number of clusters at level l+1 and equal to , M-Step: The E-step and the M-step run iteratively until convergence.

Algorithm extension for sample clustering
Let {g 1 , …, g N } be the expression values of a gene. The variance of {g 1 , …, g N } is calculated by The absolute difference between the minimum and maximum gene expression values across all the samples is calculated by These two criteria can be used to identify and then remove constantly expressed genes. A rank of all the genes for each of the two variation criteria can be formed.
Discrimination power analysis uses a 1-D SFNM model to fit the gene's expression values. The probability density function of the model and its corresponding EM algorithm take the form of Equation (12) and Equation (13), respectively, but in 1-D space. To determine the model order, we followed the iterative procedure in (Miller and Browning, 2003). The SFNM model is initialized with a high model order (much bigger than the true component number), with randomly chosen means and uniform variances. In each iteration, we (one-by-one) trial-delete each component and rerun the fitting algorithm. The component whose removal yields minimum description length will be permanently removed. This iterative process ends when only one component remains, and the optimum model order is then determined by comparing the description length (or modified description length if the sample size is small) values of solutions in the sequence. Once the SFNM model is trained, genes with single component mixture are removed, because they do not support any cluster structure. For the other genes, the accuracy in classifying samples to components resulting from applying the Maximum A Posteriori probability (MAP) rule (based on the samples' posterior probabilities for belonging to the components) quantifies the gene's discrimination power. Thus, a rank of genes according to their discrimination power can be constructed.
The classification accuracy of a K-component mixture has a low limit of 1/K, which decreases when K increases. However, this may not impact but actually may improve the clustering performance of VISDA due to the following reasons. (1) VISDA performs a hierarchical exploration process where the data decomposition models are initialized by data structures presented in multiple subspaces. The combinatory power of multiple subspaces helps VISDA to detect all clusters. Thus a single gene or a single subspace is not required to discern among many or all the clusters. (2) With a fewer number of components, the statistics of each component can be better estimated, which is important for sample clustering on genomic data, where the sample size is normally small. (3) With a fewer number of components in the subspace, the local decomposition problem at each node of the VISDA hierarchical exploration process is simpler to solve. (4) In many genomic research applications where gene expression values are required to be quantized or genes are analyzed in terms of states, genes are normally assumed to take very few (two or three) discrete values or states, which corresponds to down-regulation and up-regulation or low, middle, and high states (Xing and Karp, 2001).
Besides the two kinds of non-informative genes discussed above, "redundant" genes (genes that are highly correlated with other genes) provide only limited additional separability between sample clusters. However, this limited additional separability may in fact greatly improve the achievable partition accuracy (Guyon and Elisseeff, 2003). Thus, we take removal of redundant genes as an optional step. If the dimensionality of the gene space after variation filtering and discrimination power filtering can not be well handled by the clustering algorithm (i.e. if the samples-to-genes ratio is not sufficiently large), we suggest removing highly correlated genes. Here, we provide a simple scheme to remove redundant genes. In the gene list resulting from variation filtering and discrimination power analysis, keep the most discriminative gene and remove the genes that are highly correlated with it, then keep the second most discriminative gene in the remaining list and remove the genes that are highly correlated with this gene. Keep performing this procedure until no further removal can be done. The correlation between genes can be measured by the Pearson correlation coefficient or mutual information normalized by entropy. A threshold needs to be set to identify the highly correlated genes.

Algorithm extension for phenotype clustering
Suppose that the exploration process has proceeded to the lth level and level l has K l phenotype clusters. For each phenotype cluster k (k = 1, …, K l ) at level l, we do the following to visualize and decompose the cluster.

Locally discriminative gene subspace and visualization
Let t = {t (1) , …, t (Q k ) } denote the phenotypes in the cluster. Q k is the number of phenotypes in cluster k. t (q) = {t (q) i , i = 1, 2, …, N (q) } (q = 1, …, Q k ) is the set of samples in phenotype q. To achieve an effective visualization, we first use a supervised gene selection method to select top discriminative genes to form a locally discriminative gene subspace, and then project the samples from the discriminative gene subspace to a 2-D visualization space through DCA.
Let μ q and σ q be the mean and standard deviation of a gene's expression values in phenotype q. A gene's discrimination power is measured by ( ) where r q is the sample proportion of phenotype q and equal to ( ) According to the discrimination power, top discriminative genes can be selected to form a locally discriminative gene subspace. The number of selected genes is n g Q k , where n g is the number of selected genes per phenotype. We use DCA based on the Fisher criterion to find a projection matrix to project the samples from the gene subspace onto a 2-D visualization space. The projection matrix is obtained by orthogonalizing the eigenvectors associated with the largest two eigenvalues of the Fisher Scatter Matrix (FSM) that is calculated based on the phenotypic categories (Duda, et al., 2001). Here, we use the Fisher criterion not the weighted Fisher criterion (i.e. do not weight the phenotype pairs in the objective function) for the purpose of preserving original data structure by treating all phenotype pairs fair. Thus the identified TOP will not be distorted. Let μ s, i and Σ s, i be the mean and covariance matrix of phenotype i in the gene subspace. The FSM is calculated by

Cluster decomposition in the visualization space
In the visualization space, we use a class-pooled finite normal mixtures model to fit and decompose the cluster. Let x = {x (1) , …, x (q) , …, x (Q k ) } denote the projected samples from each phenotype. x (q) = {x (q) i , i = 1, 2, …, N (q) } is the set of samples in phenotype q. The probability density function for all samples from phenotype q is where cluster k at level l is decomposed into K k, l+1 sub-clusters at level l + 1, π j and θ x, j are the mixing proportion and parameters associated with sub-cluster j. The EM algorithm used to train this model is formulated as Step: where z (q), j is the posterior probability of phenotype q (all samples from phenotype q) belonging to sub-cluster j, and μ x, j and Σ x, j are the mean and covariance matrix of sub-cluster j in the visualization space. To select the "optimal" model from models with different number of subclusters, the MDL model selection criterion is also applied for theoretical validation. For the small sample size case, the description length is calculated by where L(x|θ x , π) is the log-likelihood, K a is the number of free adjustable parameters, and N is the number of samples. L(x|θ x , π), K a , and N are given by The user can override the MDL model selection by specifying the number of sub-clusters according to his/her justification.

ADDITIONAL DISCUSSION
In further research, some modifications can be made to improve VISDA. For example, 3-D visualization and model initialization can be implemented without theoretical obstacles. Nonlinear projections may also be used to visualize data cluster, but two things need to be noted.
(1) Suitable data models need to be defined in the visualization space and in the original data space.
(2) A corresponding inverse (or approximate inverse) transform of model parameters from the visualization space to the original data space must also be defined. For phenotype clustering, in the current version of VISDA, the gene subspace is formed by selecting individually discriminative genes. Because the jointly most discriminative gene set does not necessarily solely consist of the genes that are most discriminative individually (Jain, et al., 2000), an improvement based on selecting jointly discriminative genes can be made using methods such as (Xuan, et al., 2007), at the cost of increased computation time.