Misty Mountain clustering: application to fast unsupervised flow cytometry gating
- István P Sugár^{1}Email author and
- Stuart C Sealfon^{1}
DOI: 10.1186/1471-2105-11-502
© Sugár and Sealfon; licensee BioMed Central Ltd. 2010
Received: 26 August 2009
Accepted: 9 October 2010
Published: 9 October 2010
Abstract
Background
There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets, such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model based clustering requires serial clustering for all cluster numbers within a user defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming and frequently different criteria result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed such as affinity propagation are too expensive to be applied to datasets on the order of 10^{6} points that are often generated by high throughput experiments.
Results
To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and that efficiently analyzes large data sets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map to identify clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing only once the cross sections of the histogram. The overall run time for the composite steps of the algorithm increases linearly by the number of data points. The clustering of 10^{6} data points in 2D data space takes place within about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state of the art automated flow cytometry gating methods indicate that Misty Mountain provides substantial improvements in both run time and in the accuracy of cluster assignment.
Conclusions
Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise. It provides a useful, general solution for multidimensional clustering problems. We demonstrate its suitability for automated gating of flow cytometry data.
Background
Clustering is widely used for exploratory data analysis, with applications ranging from physics and biology to social sciences and psychology. In data intensive fields of biology, it is important to identify groups or clusters of data showing similar behavior. Many methods for clustering have been developed, which fall into two general categories: heuristic algorithms and model based analyses. In heuristic algorithms clustering is obtained either by optimizing a certain target function or iteratively agglomerating (or dividing) nodes to form bottom-up trees. Examples of these approaches include: K-means[1] and K-median [2] clustering, fuzzy K-means clustering [3], affinity propagation [4], spectral clustering [5, 6], QT (quality threshold) clustering [7] and density contour clustering [8]. In contrast to heuristic methods, model-based clustering methods make inferences based on probabilistic assumptions about the data distribution. Gaussian or modified Gaussian mixture models [9] use the Expectation-Maximization algorithm [10–13] to find the parameters of the distributions that are fitted to the data. Then Bayesian information criterion (BIC) [14], Akaike information criterion (AIC) [13], integrated completed likelihood (ICL) [15] or other criterion is used to select the number of clusters.
Flow cytometry (FCM) is a commonly used technique to measure the levels of expression of multiple markers, such as specific proteins, in millions of cells. FCM data is typically analyzed by an attempt at visual selection of similar groups of data in 2 dimensional projections, a process referred to as gating. The visual identification of similar groups of data points, referred to in FCM as manual gating, is error-prone, non-reproducible, non-standardized, difficult to apply to more than two dimensions, and manpower-intensive, making it a limiting aspect of the technology [16]. Despite its widespread use, FCM lacks a fast and reliable method for automated analysis to parallel its high-throughput data-generation. The development of a reliable, heuristic clustering approach suitable for large datasets would significantly improve the value of FCM experiments and would have widespread application to other data-intensive biological clustering problems.
Automated FCM gating attempts using heuristic methods, such as K-means and fuzzy K-means [1, 3, 17–20] do not provide stable results. Different initial values for the algorithm, i.e. initial locations of the cluster centers, typically result in different clustering results. Often, with a poor set of initial values, the minimization of the target function falls into a local minimum and gives an undesirable clustering result. Furthermore, these methods work best with spherical or hyperspherical shaped clusters, a distribution often not observed in FCM datasets. Several other useful clustering algorithms based on pairwise comparisons, including linkage or Pearson coefficients method [21] and the affinity propagation method [4], are computationally too expensive to be used for FCM because the size of the pairwise distance matrix increases on the order of n^{ 2 } with the number of points. Classification and regression trees [22], artificial neural networks [23] and support vector machines [24, 25] have also been used in the context of FCM analyses [26–29], but these supervised approaches require training data, which may not be available and may perform unreliably if the features of the experimental data diverge from the training set. Model-based approaches are slow, need user involvement and require assumptions about cluster distributions that limit their general utility [13, 15]. A major problem of all practical approaches for unsupervised FCM cluster analysis remains the determination of the number of clusters. The use of BIC, AIC, ICL or other criterion can make the determination of cluster number unreliable (see Additional File 1).
To overcome these limitations of the above approaches, we have developed a new density contour clustering method that is particularly suitable for FCM data. In the early 1960's Boyell and Ruston[30], working on methods for storing topological data in a manner allowing efficient reconstruction, recognized that contour lines can be represented as a tree structure. This insight led to the idea of density contour clustering by finding the largest cross section of each histogram peak [8]. Jang and Hendry[31, 32] used a density contour method for clustering galaxies, that in principle is most similar to our method. Their method is a modification of a method proposed by Cuveas et al.[33, 34]. We have developed a new, fast density contour clustering method suitable for large, multi-dimensional datasets that will be compared with Jang and Hendry's method in Additional File 1. The method is unbiased for cluster shape and does not require global optimization of a multi-variable target function like other commonly used clustering methods do. The algorithm run time increases on the order of n. According to the tests on manually gated and simulated data the method provides correct clustering with correct number of clusters.
The Misty Mountain algorithm can be understood as the computational analogy of an airplane view of histogram terrain that is initially completely immersed in misty clouds. The mist is steadily removed from the top down by the sun, progressively uncovering clusters as peaks that pierce the mist. Eventually the merging points of two peaks, the highest saddle, is revealed. From there two peaks form one instead of two holes in the mist. As the level of the mist decreases, more and more summits and saddles are revealed and evaluated to determine the number of statistically distinct peaks and their extent.
Results and Discussion
Misty Mountain algorithm
The approach is briefly described here and more extensively in Methods. The multi-dimensional data is first processed to generate a histogram containing an optimal number of bins by using Knuth's data-based optimization criterion [35]. Then cross sections of the histogram are created. The algorithm finds the largest cross section of each statistically significant histogram peak. The data points belonging to these largest cross sections define the clusters of the data set.
To realize the steps described above computationally, the algorithm uses a percolation theory based procedure[36, 37] by labeling different bin aggregates of a histogram cross section by different integers. Then the algorithm comparatively analyzes pairs of consecutive cross sections to recognize coalescing bin aggregates. Assigning clusters to the coalescing bin aggregates requires the L_{p 1}-L_{ s } and L_{p 2}-L_{ s } relative heights of the two peaks that fuse both be statistically significantly greater than random fluctuations (see Methods). L_{p 1}, L_{p 2}and L_{ s } are the heights of the fusing first and second peak and the saddle between them, respectively.
Characteristics of the clusters assigned to data in Figure 1a
Color code | L _{ p } | L _{ s } | C | f |
---|---|---|---|---|
green | 3385 | 756 | 313369 | 0.777 |
red | 10706 | 0 | 300000 | 1 |
pink | 2493 | 756 | 143539 | 0.697 |
blue | 1911 | 102 | 94930 | 0.947 |
Testing Misty Mountain algorithm
Characteristics of the clusters assigned to data in Figure 2a
Color code | L _{ p } | L _{ s } | C | f |
---|---|---|---|---|
red | 430 | 5 | 8338 | 0.988 |
blue | 529 | 5 | 804 | 0.991 |
Summary of comparing Misty Mountain with state of the art flow cytometry specific clustering methods
Data set | Manually gated 2D barcoding^{&} | Simulated 5D Gaussians | Simulated 2D non-convex | 3D rituximab | 4D GvHD | Manually gated 4D OP9 | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Misty Mountain | accuracy | sens | (%) | 100 | 100 | 100 | - | - | 100 | |||
spec | (%) | 100 | 100 | 100 | - | - | 100 | |||||
CPU | (sec) | 10 | 196 | 6 | 0.3 | 0.8 | 3.6 | |||||
FLAME | accuracy | sens | (%) | 20^{a} | 60^{b} | - | 0^{d}* | 100^{d} | - | - | - | |
spec | (%) | 33^{a} | 50^{b} | - | 0^{d}* | 100^{d} | - | - | - | |||
CPU | (sec) | 5.10^{4} | >3.10^{5} | 1.10^{4} | 10 | 360 | 1.4 · 10^{4} | |||||
flowClust | accuracy | sens | (%) | 45^{a}* | 60^{b}* | 100^{c} | 0^{c}* | 100^{d} | - | - | 60^{d}* | 60* |
spec | (%) | 60^{a}* | 55^{b}* | 100^{c} | 0^{c}* | 100^{d} | - | - | 75^{d}* | 38* | ||
CPU | (sec) | 5.10^{4} | 4.10^{4} | 7200 | 43 | 480 | 3660 | |||||
flowMerge | accuracy | sens | (%) | 25 | 100 | 0 | - | - | 80 | |||
spec | (%) | 45 | 100 | 0 | - | - | 57 | |||||
CPU | (sec) | 1.3 · 10^{5} | 1.27 · 10^{5} | 7200 | 124 | 1020 | 8400 | |||||
flowJo | accuracy | sens | (%) | 45 | - | - | - | - | - | |||
spec | (%) | 47 | - | - | - | - | - | |||||
CPU | (sec) | 1-10 | - | - | 1-10 | 1-10 | - |
As another example we analyzed one of the graft-versus-host disease (GvHD) data sets.
Characteristics of clusters assigned by Misty Mountain to the 4D GvHD data in Figure 4.
Code # | L _{ p } | L _{ s } | C | Bin # | f |
---|---|---|---|---|---|
1 | 1541 | 1033 | 1542 | 1 | 0.33 |
2 | 1115 | 1033 | 1116 | 1 | 0.074 |
4 | 230 | 25 | 1011 | 11 | 0.891 |
3 | 889 | 804 | 890 | 1 | 0.096 |
5 | 175 | 30 | 858 | 8 | 0.829 |
6 | 132 | 30 | 265 | 3 | 0.773 |
We also compared the performance of the various gating algorithms using a dataset from 4D bone-marrow derived mouse stromal cells (OP9 cells) stained with antibodies for CD45, Gr1, Mac1 and CD19. Two experts manually gated this experiment obtaining identical results. Misty Mountain gave results identical to that of the experts, unlike the other automated gating methods (Table 3 and in Additional File 1, Figures AF14-18, Table AF13-18). In order to test algorithm performance we used a variety of other experimental and simulated data sets with biologically interesting populations such as low density, overlapping and non-convex populations. Comparisons were made using simulated 2 dimensional and 5 dimensional data and additional experiments with 3 dimensional and 4 dimensional data (Additional File 1). These results all strongly support the improved accuracy and utility of the Misty Mountain algorithm relative to other state of the art methods.
Studies were done to evaluate the time complexity of the Misty Mountain algorithm. These simulations revealed that at fixed bin number the overall run time for the composite steps of the algorithm increases linearly by the number of data points. Also an increase in the run time was detected with increasing dimensionality of the data space (see Additional File 5). The number of clusters did not alter the computation time (Additional File 5).
The Misty Mountain algorithm can be applied to analyze other than FCM data when the data set is large enough to construct an adequate histogram. For example in astrophysics it can be used for unsupervised recognition of star/galaxy clusters, or in social sciences to analyze questionnaires and identify groups with common interests/opinions.
Implementation
The implementation, instruction and the input data files of all the examples analyzed in this study are available in Additional Files 6, 7 and 8.
PCA- Misty Mountain algorithm for high dimensional data
The current version of the Misty Mountain algorithm software uses direct analysis for data having up to 5 dimensions. Some flow cytometry datasets may have up to twelve or even more dimensions. One can set the critical dimension higher than 5, however the run time, the number of data points needed for an adequate histogram and the memory requirement for storing the histogram increases super linearly with increasing dimension. As another option, we have combined the Misty Mountain algorithm with principal component analysis (PCA) [43]. In order to analyze higher than 5 dimensional data, we use PCA to project the high dimensional data into a 5 dimensional subspace. The subspace is spanned by 5 eigenvectors belonging to the 5 largest eigenvalues of the covariance matrix of the data. Then Misty Mountain analysis is performed on the projected data. Finally the points of the assigned clusters are back-projected into their original position in the data space. This procedure is demonstrated on a simulated 10 dimensional data set containing points that distributed as the sum of 8 distorted-Gaussians. The parameters of the distorted-Gaussians (mean and standard deviation of the distributions) are listed in the table in Additional File 9. By using PCA, the simulated data are projected into the 5D subspace where Misty Mountain clustering is performed. The points of the assigned 8 clusters are back-projected to their original position in the 10D data space. Table in Additional File 10 lists the center coordinates of the assigned clusters. As a demonstration of correct clustering these cluster centers are very close to the means of the respective distorted-Gaussians. It is important to note that the projection of the data into the 5D subspace may bring some of the otherwise separated histogram peaks so close to each other that the number of clusters assigned by the Misty Mountain algorithm becomes less than the true value. This happens with higher frequency when the data histogram contains many, broad peaks. Finally it is important to note that the optimal choice for the critical dimension depends on the actual number of the data points, i.e. one should be able to create an adequate histogram from the data at the critical dimension.
Advantages and limitations of Misty Mountain algorithm
- 1)
Misty Mountain algorithm is unbiased for cluster shape.
- 2)
it is robust to noise,
- 3)
it is fast,
- 4)
it is unsupervised. It does not need estimation for cluster number.
- 5)
the computation time linearly increases with the number of data points
- 1)
Misty Mountain algorithm identifies two closely situated populations as one when the respective histogram has only one peak
- 2)
it identifies two populations as one when L_{ p } -L_{ s } is comparable with the standard deviation of L_{ p } -L_{ s } . (L_{ p } is the bin content at the smaller histogram peak, and L_{ s } is the bin content at the saddle between the two histogram peaks.)
- 3)
the computation time, the number of data points needed for an adequate histogram and the memory requirement for the histogram super linearly increase with the dimension of the data space
Misty Mountain provides a useful, general solution for multidimensional clustering problems. It can be easily adapted to address diverse large dataset clustering problems in computational biology. It is particularly suitable for automated gating of FCM and should improve the ability to interpret experimental data in this field.
Conclusions
In biology, measurements on a single object (such as a cell or image) are frequently represented by a point in a multi-dimensional space where the coordinates of the point refer to the measured values. With the advent of high-throughput assays, these experiments can generate datasets comprising millions of points. Clusters of points may be thought of as regions of high density separated from other such regions of low density. We describe a fast algorithm that automatically identifies clusters of data points showing similar values. The three major steps of the algorithm are: i) The multi-dimensional data is first processed to generate a histogram containing an optimal number of bins. ii) The cross sections of the histogram are created. iii) The algorithm finds the largest cross section of each statistically significant histogram peak. The data points belonging to these largest cross sections define the clusters of our data set.
While the idea of clustering by using a density histogram is old, the present implementation results in particularly fast clustering that is useful for data-intensive computational biology applications. Misty Mountain clusters 10^{6} data points in 2D data space in about 15 seconds on a standard laptop PC. The run time linearly increases with the number of data points. Unlike other commonly used clustering methods, Misty Mountain is not model-based, unsupervised and does not require global optimization of a multi-variable target function. Without making strong assumptions, this method provides fast and accurate clustering. The algorithm is general, but was motivated by the need for an unbiased automated method for analysis of flow cytometry (FCM) data.
Methods
In the previous sections we gave a qualitative description of the Misty Mountain algorithm. In order to help to understand the logic of the algorithm, we discuss its key features in detail.
Histogram Optimization
where n is the number of data, n_{ k } is the number of data in the k^{th} bin, and D is the dimension of the data space. The N that maximizes this probability is the optimal bin number along each coordinate axis. There are other optimal data based binning methods such as Wand's method [44]. We prefer using Knuth's method because its implementation is particularly easy for any dimension of data.
The LABELING routine
The LABELING routine separately analyzes each cross section of the histogram. As an example let us consider a two dimensional histogram (e.g. Figure 1b). Each cross section of the histogram is mathematically represented by an NxN square matrix where the I,J-th element of the matrix is equal 1 if the respective bin content is higher than LEVEL (the frequency at the actual cross section) otherwise it is 0. In this matrix the cross section of a peak appears as a group or aggregate of 1's. The aim of the aggregate labeling algorithm is to assign the same positive integer to the same aggregate. On the other hand different aggregates will be labeled by different integers. The NxN label matrix, NSTRA is created in three steps.
Step 1. Initialization of the label matrix.
NSTRA(I,J) = NBIN if the content of the I,J-th bin is larger than the level of the cross section, otherwise NSTRA(I,J) = 0. NBIN = NxN is the number of bins. Set the aggregate counter zero, i.e.: IL = 0.
Step 2. First scanning of the label matrix.
- a)
If NSTRA(I, J) = 0 then it remains zero
- b)
If NSTRA(I, J) = NBIN, and NSTRA(I-1, J-1), NSTRA(I-1, J), NSTRA(I-1, J+ 1) and NSTRA(I, J-1) matrix elements (if they exist) are equal to 0, then first let us increase the value of IL by 1; second change the value of NSTRA(I, J) from NBIN to NSTRA(I, J) = IL; and, finally, let the IL-th element of the ICOUNT vector equal to IL.
- c)If NSTRA(I, J) = NBIN, and any of the NSTRA(I-1, J-1), NSTRA(I-1, J), NSTRA(I-1, J+ 1) and NSTRA(I, J-1) matrix elements (if they exist) are not equal to 0 then we determine the proper aggregate label for these non-zero neighbor matrix elements by applying routine CLASSIFY (described in Step 3). Then we select the smallest of the proper labels, called JM, and we set$\begin{array}{l}ICOUNT(NSTRA(I,J))=JM\\ ICOUNT(NSTRA(I-1,J-1))=JM\\ ICOUNT(NSTRA(I-1,J))=JM\\ ICOUNT(NSTRA(I-1,J+1))=JM\\ ICOUNT(NSTRA(I,J-1))=JM\\ NSTRA(I,J)=JM\end{array}$
Step 3: Second scanning of the label matrix.
This simple procedure finds the smallest label among the labels of the aggregate where the I,J-th bin is situated. The labeling routine is similar to the one used in percolation theory [36, 37] for labeling spin clusters. The difference is that in Step 2b and 2c in spin cluster labeling, usually only two nearest neighbors: NSTRA(I-1,J) and NSTRA(I,J-1), of the I,J-th matrix element are considered. In our algorithm, we also consider two next nearest neighbor matrix elements, NSTRA(I-1,J-1) and NSTRA(I-1,J+1). By using this important modification elongated slanted aggregates are properly labeled. The FORTAN source code of the LABELING routine is able to label bin aggregates of any dimension.
The ANALYZE routine
First we give a brief description of the flowchart in Figure 7. The cross sections of the histogram are created consecutively from the highest to the lowest level, i.e. from LEVELMAX to 0. In the grey region of the flow chart aggregates emerging at LEVELMAX are handled. When a new aggregate appears at a lower level the yellow part of the flow chart is active. The cyan colored part of the flow chart is active when a single peak that emerged at a previous level belongs to the aggregate. The rest of the flow chart is active when more than one peak belongs to the aggregate. The green colored part is active when at the previous level every peak was single and no more than one of them was major peak. The red, purple and pink colored parts are active when at the previous level either not every peak was single or more than one single peak were major.
Now in the rest of this section a more detailed description of the flowchart (Figure 7) is given. At the highest level, one or more aggregates appear in the cross section. The counter of the peaks NCL is increased from zero, and the types and positions of the emerging peaks are registered in three vectors:
IIPOINT1(ICLU) - location (or characteristic position) of the emerging ICLU-th peak
IIPOINT2(ICLU) - label of the aggregate of the ICLU-th peak
IIPOINT3(ICLU) = 1 - type code of a single peak (the other two type codes are defined below). In the flow chart (in Figure 7) these steps are highlighted by grey.
- a)
There is no characteristic peak position in the aggregate labeled by ILAB.
- b)
There is one characteristic peak position in the aggregate labeled by ILAB.
- c)
There are more than one characteristic peak positions in the aggregate labeled by ILAB.
This is the most important part of the algorithm that handles the merger of major peaks and the elimination of small noisy peaks. The counter of the peak positions falling into the aggregate is denoted by IP.
The analysis of the aggregate starts by creating the ISZVEC vector. The IP-th element of the vector refers to the type of the respective peak at the previous cross section: ISZVEC(IP) = -1 when the IP-th peak has merged with other peak(s), ISZVEC(IP) = 0 when the IP-th peak was a small single peak, and ISZVEC(IP) = 1 when the peak was a single major peak. The number of 1's in ISZVEC is denoted by IHIGH, while the number of 0's and 1's is denoted by INEW.
c1) If at the previous level every peak was single - INEW = IHELP(ILAB), and no more than one of them was major peak - IHIGH ≤ 1.
In this case the highest peak, encoded by ICLUm, is retained, while all the other peaks are eliminated, i.e.: IIPOINT3(ICLUm) = 1 and IIPOINT3(ICLU) = 0 for all the remaining small peaks. In Figure 7 the respective part of the flow chart is highlighted by green.
This strategy is particularly useful at cross sections where a major peak appears. Frequently, as a first sign of a major peak, small nearby aggregates appear that merge at lower levels. This is also our usual strategy for retaining major peaks, while eliminating the frequently appearing small noisy peaks.
If LEVEL = 0 the aggregate belonging to the retained ICLUm-th peak is also copied into the final label matrix, NSTRF (see part of the flow chart highlighted by cyan).
c2) In all other cases - either not every peak was single at the previous level or more than one single peak was major - there are three options.
c21) Small and previously single peaks are eliminated, i.e. the value of the respective elements of IIPOINT3 vector change from 1 to 0. This part of the flow chart is highlighted by red.
c22) Major and previously single peaks become merged peaks, i.e. the value of the respective elements of IIPOINT3 vector change from 1 to -1. The aggregates belonging to these peaks at the previous cross section are copied into the final label matrix, NSTRF. The respective part of the flow chart is highlighted by pink.
c23) Handling of previously merged peaks is shown and highlighted by purple in the flow chart.
Major and Small Peaks of the Histogram
A histogram contains major peaks such as the four peaks in Figure 1b and small peaks that are superimposed on the major peaks. A small peak is the consequence of the fluctuation of the number of data points in the respective bins. One can observe this fluctuation of bin contents by comparing the histograms of repeated experiments.
The fluctuation of the bin content can be estimated as follows.
First we point out that the content of each bin follows binomial distribution. Let us assume that we measure n cells to create our FCM data set. The probability that the measured fluorescent intensities of a cell falls into the ε-th bin is p_{ ε } . If the measurements on different cells are statistically independent events the probability that out of n measurements the result of b measurements will fall into the ε-th bin and n-b measurements will fall out of the ε-th bin is: $P(b,n|{p}_{\epsilon})=\left(\begin{array}{c}n\\ b\end{array}\right){p}_{\epsilon}^{b}{(1-{p}_{\epsilon})}^{n-b}$
This is the binomial distribution. If the mean bin content, <b> is larger than 10 the binomial distribution can be approximated by its limit: the Poisson distribution [45]. The mean of the Poisson distribution can be estimated by the average of the contents in the actual and nearest-neighbor bins $\overline{b}$, while the standard deviation of the Poisson distribution by the square root of this average $\sqrt{\overline{b}}$.
Every time when two or more aggregates merge we have to decide if the merging aggregates belong to small and/or major peaks. A peak is considered major if: ${L}_{p}-{L}_{s}>2\sqrt{{\overline{b}}_{p}+{\overline{b}}_{s}}$ where L_{ p } and L_{ s } are the peak height and the height of the saddle between the merging peaks, respectively. On the other hand a peak is considered small if $0<{L}_{p}-{L}_{s}\le 2\sqrt{{\overline{b}}_{p}+{\overline{b}}_{s}}$. When ${\overline{b}}_{p}<10$ the Poisson approximation fails and the respective peak is always considered small.
Simulation of Data
where Δ and Δ_{1} is a normal deviates generated by the Box-Muller method [46], ${X}_{IK}^{mean}$ is the K-th coordinate of the mean of the I-th distorted-Gaussian and SD_{ IK } is the standard deviation of the I-th distorted-Gaussian along the K-th axis, while ${K}_{1}^{I}$ and ${K}_{2}^{I}$ are the first and second axes, respectively that randomly selected to the I-th distorted-Gaussian. Parameter s scales the strength of the distortion. In the case of our 2D and 10D simulations s = 0.002 and 0.004 have been used. Note that by using the above procedure one can simulate the sum of regular Gaussian distributions by setting the distortion parameter s = 0.
Declarations
Acknowledgements
We thank Profs. D. Stäuffer and B. Roysam for sending the source code of a Hoshen-Kopelman type cluster counting algorithm and spectral clustering, respectively. We also thank Prof. F. Hayot for the critical evaluation of the manuscript. We acknowledge Drs. B. Hartman and J. Seto for providing the barcoding FCM data, Dr. German Nudelman for making the program available on the web and Dr. Yongchao Ge for analyzing FCM data with flowClust and flowMerge. We are grateful for Prof. Ryan Brinkman for providing access to the GvHD flow cytometry data sets and to Prof. Hans Snoeck for providing the OP9 dataset. This work from the Program for Research in Immune Modeling and Experimentation (PRIME) was supported by contract NIH/NIAID HHSN266200500021C.
Authors’ Affiliations
References
- MacQueen JB: Some methods for classification and analysis of multivariate observations. In Proceedings of fifth Berkeley Symposium on Mathematical Statistics and Probability: 1967, Berkeley. University of California Press; 1967:281–297.Google Scholar
- Cornuejols G, Fisher ML, Nemhauser GL: Location of bank accounts to optimize float - Analytic study of exact and approximate algorithms. Management Science 1977, 23(8):789–810. 10.1287/mnsc.23.8.789View ArticleGoogle Scholar
- Rousseeuw PJ, Kaufman L, Trauwaert E: Fuzzy clustering using scatter matrices. Computational Statistics & Data Analysis 1996, 23(1):135–151.View ArticleGoogle Scholar
- Frey BJ, Dueck D: Clustering by passing messages between data points. Science 2007, 315(5814):972–976. 10.1126/science.1136800View ArticlePubMedGoogle Scholar
- Donath WE, Hoffman AJ: Lower bounds for partitioning of graphs. Ibm Journal of Research and Development 1973, 17(5):420–425. 10.1147/rd.175.0420View ArticleGoogle Scholar
- Fiedler M: Algebraic connectivity of graphs. Czechoslovak Mathematical Journal 1973, 23(2):298–305.Google Scholar
- Heyer LJ, Kruglyak S, Yooseph S: Exploring expression data: Identification and analysis of coexpressed genes. Genome Research 1999, 9(11):1106–1115. 10.1101/gr.9.11.1106View ArticlePubMedPubMed CentralGoogle Scholar
- Hartigan JA: Clustering Algorithms. New York, Wiley & Sons; 1975.Google Scholar
- Fraley C, Raftery AE: How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer Journal 1998, 41(8):578–588. 10.1093/comjnl/41.8.578View ArticleGoogle Scholar
- McLachlan GJ, Basford KE: Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker; 1988.Google Scholar
- Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via tha EM algorithm. Journal of the Royal Statistical Society B 1977, 39: 1–22.Google Scholar
- Celeux G, Govaert G: Gaussian parsimonious clustering models. Pattern Recognition 1995, 28: 781–793. 10.1016/0031-3203(94)00125-6View ArticleGoogle Scholar
- Pyne S, Hu X, Wang K, Rossin E, Lin T-I, Mailer LM, Baecher-Allan C, McLachlan GJ, Tamayo P, Hafler DA, et al.: Automated high dimensional flow cytometric data analysis. Proc Natl Acad Sci USA 2009, 106: 8519–8524. 10.1073/pnas.0903028106View ArticlePubMedPubMed CentralGoogle Scholar
- Schwarz G: Estimating the dimension of a model. Annals of Statistics 1978, 6: 461–454. 10.1214/aos/1176344136View ArticleGoogle Scholar
- Lo K, Brinkman RR, Gottardo R: Automated gating of flow cytometry data via robust model-based clustering. Cytometry 2008, 73: 321–332. 10.1002/cyto.a.20531View ArticlePubMedGoogle Scholar
- Lizard G: Flow Cytometry analyses and bioinformatics: Interest in new softwares to optimize novel technologies and to favor the emergence of innovative concepts in cell research. Cytometry Part A 2007, 71A(9):646–647. 10.1002/cyto.a.20444View ArticleGoogle Scholar
- Murphy RF: Automated identification of subpopulations in flow cytometric list mode data using cluster analysis. Cytometry Part A 1985, 6: 302–309. 10.1002/cyto.990060405View ArticleGoogle Scholar
- Bakker Schut TC, Grooth BDG, Greve J: Cluster analysis of flow cytometric list mode data on a personal computer. Cytometry Part A 1993, 14: 649–659. 10.1002/cyto.990140609View ArticleGoogle Scholar
- Demers S, Kim J, Legendre P, Legendre L: Analyzing multivariate flow cytometric data in aquatic sciences. Cytometry 1992, 13(3):291–298. 10.1002/cyto.990130311View ArticlePubMedGoogle Scholar
- Wilkins MF, Hardy SA, Boddy L, Morris CW: Comparison of five clustering algorithms to classify phytoplankton from flow cytometry data. Cytometry 2001, 44(3):210–217. 10.1002/1097-0320(20010701)44:3<210::AID-CYTO1113>3.0.CO;2-YView ArticlePubMedGoogle Scholar
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. PNAS 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863View ArticlePubMedPubMed CentralGoogle Scholar
- Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks; 1984.Google Scholar
- Boddy L, Morris CW: Artificial neural networks for pattern recognition. In Machine Learning Methods for Ecological Applications. Edited by: Fielding AH. Boston: Kluwer; 1999:37–87.View ArticleGoogle Scholar
- Scholkopf B, Smola AJ: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. Cambridge: MIT Press; 2002.Google Scholar
- Burges CJC: A Tutorial on Support Vector Machines for Pattern Recognition. Boston: Kluwer; 1998.Google Scholar
- Beckman RJ, Salzman GC, Stewart CC: Classification and regression trees for bone-marrow immunophenotyping. Cytometry 1995, 20(3):210–217. 10.1002/cyto.990200304View ArticlePubMedGoogle Scholar
- Boddy L, Morris CW, Wilkins MF, Al-Haddad L, Tarran GA, Jonker RR, Burkill PH: Identification of 72 phytoplankton species by radial basis function neural network analysis of flow cytometric data. Marine Ecology-Progress Series 2000, 195: 47–59. 10.3354/meps195047View ArticleGoogle Scholar
- Kothari R, Cualing H, Balachander T: Neural network analysis of flow cytometry immunophenotype data. Ieee Transactions on Biomedical Engineering 1996, 43(8):803–810. 10.1109/10.508551View ArticlePubMedGoogle Scholar
- Morris CW, Autret A, Boddy L: Support vector machines for identifying organisms - a comparison with strongly partitioned radial basis function networks. Ecological Modelling: 2001 2001, 57–67.Google Scholar
- Boyell RL, Ruston H: Hybrid techniques for real-time radar simulation. In The Fall Joint Computer Conference. Las Vegas, USA; 1963.Google Scholar
- Jang W: Nonparametric density estimation and clustering in astronomical sky survey. Comput Stat Data Anal 2006, 50: 760–774. 10.1016/j.csda.2004.10.001View ArticleGoogle Scholar
- Jang W, Hendry M: Cluster analysis of massive datasets in astronomy. Statistics and Computing 2007, 17: 253–262. 10.1007/s11222-007-9027-xView ArticleGoogle Scholar
- Cuevas A, Febrero M, Fraiman R: Estimating the number of clusters. Can J Stat 2000, 28: 367–382. 10.2307/3315985View ArticleGoogle Scholar
- Cuevas A, Febrero M, Fraiman R: Cluster analysis: a further approach based on density estimation. Comput Stat Data Anal 2001, 36: 441–459. 10.1016/S0167-9473(00)00052-9View ArticleGoogle Scholar
- Knuth KH: Optimal data-based binning for histograms. arXiv:physics/0605197v1 [physicsdata-an] 2006.Google Scholar
- Hoshen J, Kopelman R: Percolation and cluster distribution. 1. Cluster multiple labeling technique and critical concentration algorithm. Physical Review B 1976, 14(8):3438–3445. 10.1103/PhysRevB.14.3438View ArticleGoogle Scholar
- Stauffer D, Aharony A: Introduction to Percolation Theory. 2nd edition. London: Taylor and Francis; 1994.Google Scholar
- Tseng GC, Wong WH: Tight clustering: A resampling-based approach for identifying stable and tight patterns in data. Biometrics 2005, 61(1):10–16. 10.1111/j.0006-341X.2005.031032.xView ArticlePubMedGoogle Scholar
- Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by simulated annealing. Science 1983, 220(4598):671–680. 10.1126/science.220.4598.671View ArticlePubMedGoogle Scholar
- Krutzik PO, Nolan GP: Fluorescent cell barcoding in flow cytometery allows high-throughput drug screening and signaling profiling. Nature Methods 2006, 3: 361–368. 10.1038/nmeth872View ArticlePubMedGoogle Scholar
- Brinkman RR, Gasparetto M, Lee SJJ, Ribickas AJ, Perkins J, Janssen W, Smiley R, Smith C: High-content flow cytometry and temporal data analysis for defining a cellular signature graft-versus-host disease. Biology of Blood and Marrow Transplantation 2007, 13(6):691–700. 10.1016/j.bbmt.2007.02.002View ArticlePubMedPubMed CentralGoogle Scholar
- Lo K, Hahne F, Brinkman RR, Gottardo R: flowClust: a Bioconductor package for automated gating of flow cytometry data. Bmc Bioinformatics 2009., 10: 10.1186/1471-2105-10-145Google Scholar
- Hotelling H: Analysis of a complex of statistical variable into principal components. J Educ Psych 1933, 24: 417–441. 10.1037/h0071325View ArticleGoogle Scholar
- Wand MP: Data-based choice of histogram bin width. The American Statistician 1997, 51: 59–64. 10.2307/2684697Google Scholar
- Feller W: An Introduction to Probability Theory and Its Applications. Volume 1. New York: John Wiley and Sons; 1968.Google Scholar
- Box GEP, Muller ME: A note on the generation of random normal deviates. The Annals of Mathematical Statistics 1958, 29: 610–611. 10.1214/aoms/1177706645View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.