BayesFlow: latent modeling of flow cytometry cell populations
 Kerstin Johnsson†^{1}Email author,
 Jonas Wallin†^{2} and
 Magnus Fontes^{1, 3}
https://doi.org/10.1186/s128590150862z
© Johnsson et al. 2016
Received: 20 May 2015
Accepted: 17 December 2015
Published: 12 January 2016
The Erratum to this article has been published in BMC Bioinformatics 2016 17:149
Abstract
Background
Flow cytometry is a widespread singlecell measurement technology with a multitude of clinical and research applications. Interpretation of flow cytometry data is hard; the instrumentation is delicate and can not render absolute measurements, hence samples can only be interpreted in relation to each other while at the same time comparisons are confounded by intersample variation. Despite this, most automated flow cytometry data analysis methods either treat samples individually or ignore the variation by for example pooling the data. A key requirement for models that include multiple samples is the ability to visualize and assess inferred variation, since what could be technical variation in one setting would be different phenotypes in another.
Results
We introduce BayesFlow, a pipeline for latent modeling of flow cytometry cell populations built upon a Bayesian hierarchical model. The model systematizes variation in location as well as shape. Expert knowledge can be incorporated through informative priors and the results can be supervised through compact and comprehensive visualizations.
BayesFlow is applied to two synthetic and two real flow cytometry data sets. For the first real data set, taken from the FlowCAP I challenge, BayesFlow does not only give a gating which would place it among the top performers in FlowCAP I for this dataset, it also gives a more consistent treatment of different samples than either manual gating or other automated gating methods. The second real data set contains replicated flow cytometry measurements of samples from healthy individuals. BayesFlow gives here cell populations with clear expression patterns and small technical intradonor variation as compared to biological interdonor variation.
Conclusions
Modeling latent relations between samples through BayesFlow enables a systematic analysis of intersample variation. As opposed to other joint gating methods, effort is put at ensuring that the obtained partition of the data corresponds to actual cell populations, and the result is therefore directly biologically interpretable. BayesFlow is freely available at GitHub.
Keywords
AMS Subject Classification
Background
In a flow cytometer a number of characteristics for each individual cell in a sample of ∼ 10^{4} to ∼ 10^{6} cells are quantified as they pass through the cytometer in a fluid stream. The data that are obtained are most often summarized by grouping cells into cell populations; properties of these cell populations are used in many clinical applications—for example monitoring HIV infection and diagnosing blood cancers—and in many branches of medical research [1, 2]. Defining the cell populations based on the measured characteristics is in stateoftheart analyses still done manually by trained operators looking at twodimensional projections of the data. The importance of automated methods has risen along with an increase of the dimension of typical flow cytometry data sets due to developments in flow cytometry technology [3] and the emergence of studies with large numbers of flow cytometry samples [4]. Furthermore, manual so called gating of cell populations is a subjective process where operators have to take more or less arbitrary decisions for example when there are overlapping populations [5].
Automatic cell population identification is hard since flow cytometry measurements are not absolute, while at the same time different samples cannot be directly compared due to technical variation—especially apparent when samples are analyzed at different laboratories [5]—and intrinsic biological variation within and between subjects. Despite this, research into automated population identification methods has focused on individual or pooled flow cytometry samples, sometimes attempting to align data at first through normalization procedures [6].
Automated methods with the aim to replace manual gating must be able to treat multiple samples jointly and take variation between samples into account, while at the same time make it possible for the user to monitor that variation so that it is not too high for the application at hand. For example it needs to be decided if a shift in location of a population in a sample can be seen as technical variation and accepted or if the changed marker expression means that it is a different cell phenotype. These kinds of methods also need to be able to take prior information into account—in manual gating the experience of the operator can be necessary to define a population. We have developed BayesFlow, a method which models variation in cell population location as well as shape, can include prior information for example about cell population location, and gives a result that can be assessed in compact and comprehensive visualizations.
Partitioning the cell measurements in a sample into cell populations is essentially a clustering problem. In the context of flow cytometry data analysis clustering is called automated gating, as opposed to the manual gating performed by operators. Modelbased clustering using mixture models has been the most used approach for automated gating [7–12]. Mixture models are very well suited to describe flow cytometry data because they have a natural biological interpretation based on the cell populations. Examples of other approaches that have been used for automated gating are grid based density clustering [13], spectral clustering [14], hierarchical clustering [15, 16] and kmeans clustering [17, 18]. An evaluation of a wide range of automated gating methods was performed in the FlowCAP I challenge [19]. The discrepancy with manual gating was often quite large even for the best methods, with average Fmeasures around 0.9 for both completely automated and manually tuned methods. Large discrepancies between manual and automatically gated samples can be acceptable since the arbitrary decisions taken in manual gating means that the gates could just as well have been set another way. However, it is important that the gating is consistent between samples so that they can be compared against each other.
Joint identification of cell populations in a collection of samples can be accomplished by pooling the samples [12, 15] or matching populations identified separately in the samples [10, 20]. However, in the first approach no variation between samples is taken into account and in the second approach no information is shared between samples. Recently a third approach has been explored, where a Bayesian hierarchical model is used to share information between samples while at the same time allowing for variation. This was first utilized for flow cytometry gating by Cron et al. [21], with a hierarchical Dirichlet process model with fixed locations and shapes of cell populations. An extension of this model, also modeling variation in cell population locations has been used to create ASPIRE, a method for anomalous sample detection [22].
BayesFlow follows this third approach, but use a differently structured model than what has been used previously, favoring explicit modeling instead of implicit, parametric instead of nonparametric (or massively parametric). This follows the philosophy that mathematical models can never perfectly fit reality, thus it is important to be able to convey the constructed model and its parameters and in what ways it simplifies the data.
For example, in addition to variation in location BayesFlow explicitly models variation in cell population shape, whereas ASPIRE models shape variations implicitly by combining Gaussian components with the same shape. This means that an aberrant shape variation of a cell population in a sample can be detected in BayesFlow by examining the parameters of the model, which is not possible in ASPIRE. Perhaps more importantly, BayesFlow gives a parsimonious model which much fewer parameters—each individual parameter for the components in BayesFlow can be assessed through compact visualizations and thus undesired behaviors can be detected and corrected for by change of setup. Moreover, a restriction in ASPIRE which is avoided by BayesFlow is that the variation of component location within and between samples is connected to the shape of the components.
In BayesFlow, the cells in a sample are clustered using a multivariate Gaussian mixture model (GMM), where K components describe true and artificial cell populations and one component describes outliers. Artificial cell populations are measurements that cluster together and behave otherwise like real cell populations, but arise for example from dead cells, nonspecific binding of markers or doublets; doublets are pairs or groups of cells that pass through the flow cytometer at the same time. Measurements which are not clearly grouped but spread out over the measurement space, for example due to measurement noise, are modeled as outliers.
For each component not representing outliers its mean and covariance matrix is linked to a latent cluster which collects corresponding components across all samples. In practice this is done by assuming a normal prior for the means and an inverse Wishart prior for the covariance matrices of the components linked to a given latent cluster. The parameters of sample and latent components are jointly estimated by Markov Chain Monte Carlo (MCMC) sampling. The variation in location and shape between corresponding mixture components across samples is controlled by the priors on parameters of the latent clusters. The location of component means and shape of components can also be restricted if there is prior information supporting this. To allow for that flow cytometry data frequently have missing cell populations, we include the possibility that not all components are present in every sample.
A challenge that has to be addressed when analyzing flow cytometry data is that cell populations can be skewed and/or have heavy tails and are then not well described by a single Gaussian component [7, 10, 23]. To handle this we use multiple components to model such populations, an approach that have often been employed for flow cytometry data [9, 12, 24, 25] and has the further advantage that the number of cell populations can be automatically detected. We merge Gaussian components into super components with a procedure based on a systematic study of methods for merging mixture components [26].
Results from the MCMC sampling and subsequent merging are evaluated in a number of quality tests. This is a crucial step since what is deemed as a good clustering is application dependent. In some settings a given amount of variation in location or shape is expected from biological or technical reasons, whereas in others the same variation would indicate a different population. This also means that it is necessary for the user to choose prior parameters for their application. To simplify this process we have derived parametrizations so that the same value of the parameters gives a similar effect of the prior on data sets of different sizes.
We verified the ability of the sampling scheme to recover model parameters by fitting the model to a small threedimensional synthetic data set with 1.2 million cells in total and a large synthetic data set with in total 28 million cells in 8 dimensions. Then we applied BayesFlow to one of the datasets in the FlowCAP I challenge, the GvHD dataset, which contains samples from patients who have had organ transplants and might have early signs of graftversushost disease. We show that BayesFlow does not only give a result which has the same degree of accordance with manual gating as the best performing methods in FlowCAP I—which is much higher than what is obtained for other methods based on joint gating with Bayesian hierarchical models—it does also give a more similar treatment of different samples than manual gating and the best methods from FlowCAP I. Finally we applied BayesFlow, ASPIRE [22] and HDPGMM [21] to a data set with replicated samples from four healthy individuals. The ratio between intradonor technical variation and interdonor biological variation was similar between BayesFlow and HDPGMM, which was lower than for ASPIRE. Moreover, BayesFlow was the only of the three methods which gave cell populations with clear expression patterns.
Methods
Model
where N(Y;μ,Σ) denotes the probability density function of the normal distribution with mean μ and covariance matrix Σ evaluated at Y. To facilitate interpretation, the number K should be chosen as small as possible, given that the model pass quality requirements (described under Quality control). The last component represents outliers and its parameters μ _{ j0}=μ _{0} and Σ _{ j0}=Σ _{0} are identical across samples. Outliers are often modeled by a uniform density over the measurement space [27]; however due to the curse of dimensionality [28], this is not well behaved when we have more than a few dimensions, in which case a Gaussian should perform better. Noise coming from for example dead cells can also be captured in artificial cell populations, and can be excluded in downstream analyses based on the expression patterns.
where θ _{ k }, \(\boldsymbol {\Sigma }_{\theta _{k}}\), Ψ _{ k } and ν _{ k } are hyperparameters describing latent cluster k. These parameters describes the variability between flow cytometry samples, in contrast to μ _{ jk },Σ _{ jk } which describe the distribution of cell measurements within a sample. The normal and inverse Wishart distributions are conjugate priors to the mean and the covariance respectively of the normal distribution, enabling efficient sampling, however they are not jointly conjugate.
We call θ _{ k } and Ψ _{ k }/(ν _{ k }−d−1) the latent cluster mean and latent cluster covariance matrix respectively, since they are the a priori expected values of μ _{ jk } and Σ _{ jk }.
The parameters t _{ k } and S _{ k } define the prior belief of the locations of the latent means θ _{ k }, whereas the parameters Q _{ k } and \(n_{\theta _{k}}\) control the spread of mixture component means within a latent cluster and are hence important to control the variation across samples. A large \(n_{\theta _{k}}\) along with a small Q _{ k } forces the μ _{ jk } together; it makes large deviations between \(\boldsymbol {\Sigma }_{\theta _{k}}\) and Q _{ k } unlikely. The parameters H _{ k } and \(n_{\Psi _{k}}\) control the expected values and the variation of latent covariance matrices as well as the variation among mixture component covariance matrices in a latent cluster. If \(n_{\Psi _{k}}\) is large each Σ _{ jk } will be close to Ψ _{ k }/(ν _{ k }−d−1) for any k, since a high \(n_{\Psi _{k}}\) makes high ν _{ k } more probable.
Finally, to simplify sampling from the posterior distribution of the parameters, we add an component assignment variable x _{ ij }∈{0,1,…,K} describing which component Y _{ ij } is drawn from. To comply with (1), the a priori uncertainty of component membership is modeled by x _{ ij }∼ M u l t(π _{ j },1), where Mult denotes the multinomial distribution.
The resulting posterior distribution of all the parameters, denoted jointly by Θ, and x given the data Y is given in the Additional file 1: Section A. In Section B we describe the Markov chain Monte Carlo (MCMC) sampling scheme used to generate posteriors for our model parameters.
The computational bottleneck of the sampling scheme is the sampling of x, with a computational complexity bounded by \(\mathcal {O} (J d^{3} K\max _{j} n_{j})\). To handle high dimensions diagonal covariance matrices can be used instead, in which case the complexity is bounded by \(\mathcal {O}(J d K\max _{j} n_{j})\). However, for datasets with more than 20 dimensions the mathematical feasibility of using Gaussian mixture models without any prior dimension reduction needs to be seriously considered first, due to the curse of dimensionality [28].
Absent components
In some flow cytometry data sets not all cell populations are present in all samples. In our model this corresponds to that π _{ jk }=0 for some (j,k). However, mixture component parameters for empty clusters will still affect the mixing of the MCMC for the parameters of the latent cluster. It can also happen that if a cluster is empty that the mixture component moves and split a neighboring cluster in two. To avoid this in such data sets we extend the model by introducing a variable Z _{ j }∈{0,1}^{ K } that says which components are active in sample j. This has the further advantage that when sampling from the posterior distribution of the model we get the probability for each cluster that it is present in a sample. We impose a prior on Z _{ j } which is proportional to \(\exp \left (c_{s}\sum _{k=1}^{K}\mathbf {Z}_{j}\right) I\left (\sum _{k=1}^{K}\mathbf {Z}_{j}>0\right)\) where I denotes the indicator function and c _{ s }>0. The prior makes the model prefer fewer activated clusters so that if there is a very small cluster the likelihood will be larger if it is inactivated, which prevents spurious clusters. The strength of this prior can be adjusted to the expected size of the smallest clusters.
The changes to (1–3) required by this extension are straightforward but inference of the model becomes a bit more involved since removing components reduces the dimension of the model. To accommodate for this we have included a reversible jump step in our sampling algorithm. Details are given in the Additional file 1: Section B.
Merging latent clusters
To determine the “correct” number of clusters in a data set directly from the data is an illdefined problem, since what should be considered to be a separate cluster depends on the interpretation of the data. Nevertheless, there are many different criteria which can be used to guide the decision about the number of populations [26, 29]. We use overlap between components—measured by Bhattacharyya distance—and unimodality of the resulting super clusters—measured by Hartigan’s dip test [30]—to determine which latent clusters to merge and to indicate our confidence in the mergers.
In an evaluation of criteria for merging Gaussian components to represent more complex distributions, the Bhattacharyya distance performed well [26]. Bhattacharyya distance merges clusters according to a patternbased cluster concept as opposed to a modalitybased concept [26]. With a patternbased cluster concept a small dense cluster inside a sparse cluster—for example a well specified cell population inside a region with sparse outliers—will be considered to be different clusters. This would not be the case for the modalitybased cluster concept as long as the generating probability density is unimodal.
where \(\bar {\boldsymbol {\Sigma }} = (\boldsymbol {\Sigma }_{1} + \boldsymbol {\Sigma }_{2})/2\) [31]. In order to measure Bhattacharyya distance between mixtures of Gaussian distributions, which is necessary for deciding if super clusters should be merged with other clusters, we approximate each mixture with a Gaussian distribution. The means and the covariance matrices are estimated using a soft clustering of the data points inferred from the sampling of x _{ ij }, detailed in the Additional file 1: Section C.
However, it is not obvious how to set a threshold for d _{bhat}, since the appropriate threshold depends on the distribution of the data [26], which is unknown. Because of this we use a low soft threshold d _{1} and a high hard threshold d _{2}. Two clusters closer to each other than d _{1} are always merged, two clusters whose distance is between d _{1} and d _{2} are only merged if they fulfill an additional criterion based on Hartigan’s dip test for unimodality.
Unimodality is an appealing heuristic for defining cell populations, and it has frequently been used for automated gating [9, 12, 18]. It has two main limitations. The first one, that populations intuitively should be separate if they have very different densities—even when they overlap so that their combined distribution is unimodal—can be bypassed by combining unimodality with a patternbased merging criterion such as Bhattacharyya distance. The second one, that it is difficult to determine if a multidimensional empirical distribution is multimodal, is usually handled by considering onedimensional projections [12, 26]. This is the approach we take here, using Hartigan’s dip test of unimodality for each of the projections onto the coordinate axes where Bhattacharyya overlap is low, and for the projection onto Fisher’s discriminant coordinate. If for a proposed merger, any of these projections is found to be multimodal, the clusters are not merged. Further details of the merging procedure are given in the Additional file 1: Section C.
Quality control

Convergence of the MCMC sampler is established by viewing trace plots of sampled parameters.

To ensure that variation of the two different populations are not confused with each other, we require that the Bhattacharyya distance as well as the Euclidean distance from each sample component to its corresponding latent component should be smaller than these distances to any other latent component which does not belong to the same super cluster.

To ensure that the obtained clusters should not be divided further, Hartigan’s dip test is computed for the projections onto the coordinate axes of all super clusters. Projections which have a dip test pvalue below 0.28—the threshold for merging components (see Additional file 1: Section C)—are visualized using histograms of quantiles of the weighted data belonging to the cluster.

To ensure that the model fits the data reasonably well, samples from the posterior predictive is compared to the true data in one and twodimensional histograms.

To ensure that there are no outliers among the cluster centers, the centers for each cluster are plotted together along one dimension.

Additionally, to detect components with aberrant shapes, the eigenvectors corresponding to the largest eigenvalues, multiplied with the corresponding eigenvalues, can be viewed.
If any of the quality criteria is not met, the simulation should be rerun, either using the same or different parameters. Even if the same parameters are used a different result can be obtained due to randomness in the initialization.
Experiments
Simulated data
In order to verify that the proposed sampling scheme can find the correct model parameters, the MCMC algorithm was applied to two simulated datasets. The first dataset was threedimensional, which enables direct visual evaluation. It had four latent clusters across eighty artificial flow cytometry samples; each sample had 15,000 cells giving a total of 1.2 million cells. One of the latent clusters was present only in eight samples and another one was present in 24 samples, so that the ability to find rare cell populations was tested. Moreover, the cluster which was present in only eight samples contained only 1 % of the total number of cells, thus also the ability to find small cell populations was tested. The parameters and the algorithm used for generating the data are given in the Additional file 1: Section D.1.
The second data set was designed to test the ability to handle large data. It was eightdimensional, with eleven latent clusters and 192 artificial flow cytometry samples. Each sample had measurements of 150,000 cells, giving a total of 28 million cells. Four of the eleven clusters were missing in half of the samples. Additional file 2 contains a python script and data for regenerating this data set.
Prior parameters and initial values for the MCMC sampler are given in the Additional file 1: Section D.1. All priors were chosen to be noninformative. The outlier component was not used for inference in the small dataset, but it was used for the large dataset. The MCMC sampler ran first for a number of burnin iterations, then the posterior distribution was explored in a number of production iterations. During the production iterations, apart from sampling parameters of the model, a value of Y was also drawn, i.e. a sample from the posterior predictive. For the first synthetic data set 10,000 burnin and 100,000 production iterations were used. For the second, larger, data set we used 5,000 burnin iterations and 5,000 production iterations.
For the second data set the MCMC sampler was run on Amazon Cloud, using 192 cores. Each iteration took on average one second, so that about 2.7 h was needed in total.
Flow cytometry data
We analyze two flow cytometry data sets with BayesFlow: the data set GvHD from the FlowCAP I challenge—with four markers, 12 samples and approximately 13,000 cells per sample—and a data set obtained from the R package healthyFlowData [32] with technical replicates of PBMC samples from healthy donors—in total 20 samples with approximately 20,000 cells, also measured with four markers. In the GvHD dataset we can compare the gating obtained from BayesFlow with manual gating provided from FlowCAP as well as automated gating from a wide range of other methods. In healthyFlowData we can instead compare gating between technical replicates to see if samples are treated in a consistent manner.
BayesFlow applies three data preprocessing steps: 1) Data points with extreme values in at least one dimension (larger than 0.999 times the largest data point or smaller than 1.001 times the smallest data point) are removed. Such data points can lead to components with singular covariance matrices, and a well designed flow cytometry experiment should not have significant populations with such values. 2) The data is scaled using the 1 and 99 % percentiles q _{0.01} and q _{0.99} of the pooled data, with the same scaling for all samples, so that q _{0.01}=0 and q _{0.99}=1 for each marker for the pooled data. This is done in order to be able to set informative priors in an intuitive way. 3) Before testing which components should be merged, a very small amount of noise is added to the data (standard deviation 0.003). This is since the discreteness of the original flow cytometry measurements can lead to a striped pattern in the flow cytometry data [33] and also when it is not visible to the human eye it disturbs the dip test.
After preprocessing, parameters for the MCMC sampler were initialized by running the EM algorithm on the pooled data, followed by the initialization scheme used for the large synthetic dataset, detailed in the Additional file 1: Section D.4. We ran 16,000 burnin iterations and 4,000 production iterations of the MCMC sampler for both experiments. The burnin period consisted of five phases: In the first phase, the priors on variation in location and shape were modified to force clusters together. Before the second phase, priors parameters were set to normal again. After the second phase, components which were considered to be outliers were turned off. They were forced to stay off during a short third phase, but from the forth phase and onwards components were allowed to turn on and off. Label switching was allowed during the initial four phases in order to escape nondesired local minima, but then disallowed. The values of parameters controlling the simulation during the burnin and production period are given in Additional file 1: Table S1.
We also applied the two other joint gating methods based on Bayesian hierarchical models: ASPIRE [22] and HDPGMM [21]. For ASPIRE parameters were chosen according to the strategy recommended by Dundar et al. [22]; details are given in the Additional file 1: Section E.5. For each run we used in total 15,000 iterations, of which 14,000 were set as burn in iterations. For HDPGMM default parameters were used, with a burnin period of 3,000 iterations and a production period of 100 iterations.
We ran BayesFlow and ASPIRE on a 3.2 GHz quad core CPU. A BayesFlow run took 0.5 h for the GvHD dataset and 1.4 h for healthyFlowData. ASPIRE took in total 2.4 h for the GvHD dataset and 6.6 h for healthyFlowData per run. Four runs of ASPIRE was needed to determine the κ _{ i } parameters. HDPGMM was run on a dual core GPU. It needed 0.72 h for the GvHD dataset and approximately 1 h for the healthyFlowData dataset.
Results
Simulated data
We thus see that cluster centers and credibility intervals for latent clusters are captured well in both synthetic data sets.
Flow cytometry data
GvHD
For the analysis of the GvHD dataset we did twelve runs of BayesFlow in the informed setup described above. Seven were excluded due to confusion between populations, i.e. at least one sample component was closest to the wrong latent component; of the remaining five, one more run was excluded since it has not converged, and another two because of multimodal clusters. This leaved two runs that passed the quality control. Additional file 1: Figs. S2 and S3 show trace plots and projections of clusters with high dip test values respectively.
Accordance with manual gating for GvHD data set. For HDPGMM we also report the result when components are merged according to our merging procedure. When this procedure is applied to the results obtained by ASPIRE, no components are merged, i.e. the original result is identical to what is obtained after mergeing
Method  Fmeasure  Precision  Recall 

BayesFlow run 1  0.91 (0.86, 0.95)  0.96  0.89 
BayesFlow run 2  0.87 (0.82, 0.92)  0.95  0.84 
ASPIRE  0.67 (0.63, 0.72)  0.86  0.63 
HDPGMM  0.35 (0.30, 0.39)  0.98  0.23 
HDPGMM merged  0.60 (0.54, 0.66)  0.95  0.48 
flowMeans  0.88 (0.82, 0.93)  0.93  0.86 
SamSPECTRAL  0.87 (0.81, 0.93)  0.96  0.83 
Ensemble FlowCAP  0.88 
healthyFlowData
We did 18 runs of BayesFlow with K=25. Ten of these were excluded due to confusion between populations, moreover two runs were excluded since they had clusters with clearly multimodal distributions. For the six runs that passed the quality control, 3–6 components were turned off across all samples; they are excluded from visualizations. Additional file 1: Figs. S1, S3 and S4 show trace plots, projections of clusters with high dip test values and eigenvectors of covariance matrices respectively.
The output of BayesFlow, ASPIRE and HDPGMM can be compared in Fig. 13. The merging procedure we used for BayesFlow has been applied for both ASPIRE and HDPGMM, however for ASPIRE no components were merged by this. In BayesFlow each of the populations correspond to clear expression patterns, which is not the case for the other methods. For example the first population is clearly CD4+CD8 Tcells whereas for both ASPIRE and HDPGMM this population contains both components which are CD8 and components which are CD8+.
Discussion
From different runs of BayesFlow we can get different representations of data, as in the case of the GvHD dataset. This is because with highly overlapping populations there might be multiple models representing the data equally well. But since all samples are gated jointly in every run, the gated populations can still be compared across samples. The user might have a preference for one representation or the other though, and informative priors can be used to guide BayesFlow to a preferred representation.
BayesFlow is not aimed at discovery of rare cell populations, but it can be used together with an algorithm specifically designed for detecting rare cell populations in a sample, such as SWIFT [12], and then use informative priors to find how this population occurs across an entire set of samples, in a similar way as was done in the GvHD dataset.
How much clusters should be merged is a decision that needs to be taken by the interpreter of the data. In some settings one might want to be restrictive with merging and then use higher thresholds. In others one might want additional mergers after viewing joint onedimensional projections of the clusters.
The BayesFlow pipeline does not in itself include any compensation or any of the nonlinear transformations which are often used for flow cytometry data, such as logicle. Compensation is a linear transformation and Gaussian Mixture Models are invariant under linear transformations, so they perform equally well on uncompensated and compensated data. Nonlinear transformations such as logicle can make Gaussian populations nonGaussian, which makes inference harder. The flow cytometry data we used for the experiments had already been compensated, the healthyFlowData data set had also been transformed with an asinh transform; details are given in the Additional file 1: Section E.1.
BayesFlow finds a joint representation of an entire set of samples. In order for this representation to be reasonable there has to be sufficient correspondences between samples. Even if for a data set with very little correspondences a joint model could be obtained by using a very large number of components, it would hard to gain any insights from such a model. In such a case an entirely computational pipeline without the cell population identification step would be preferred.
BayesFlow can be computationally intensive if many runs are needed to pass the quality control. For these cases it would be desirable to complement BayesFlow e.g. with initialization methods that would allow passing the quality control more often, so that few runs in BayesFlow would be needed. Fast initialization methods and early quality checks aiming at this would therefore be of interest for the community and is something that we propose for further study.
Conclusions
In this paper we have presented a new Bayesian hierarchical model designed for joint cell population identification in many flow cytometry samples. The model captures the variability in shapes and locations of the populations between the samples and we have demonstrated its use in an exploratory as well as in a partly informed setting with some prior information. We showed that for synthetic datasets generated from the model, the parameters were recovered with high accuracy through a MCMC sampling scheme. The model was then applied to a real flow cytometry data set where a manual gating was available, and it was shown to have very high accordance with manual gating as compared to other automated gating methods, while at the same time the gating was more consistent across samples than either the manual gating or other automated gating methods. When applied to another flow cytometry data set with technical replicates of blood from healthy donors, BayesFlow gave a parsimonious representation of the data, which enables visualization and monitoring of its parameters. The obtained cell populations had clear expression patterns as opposed to the clusters obtained by ASPIRE and HDPGMM, where for example CD4+CD8 Tcells where in the same cluster as CD4+CD8+ Tcells. The population sizes obtained by BayesFlow and HDPGMM respectively had lower intradonor variation compared to interdonor variation than what was obtained from ASPIRE.
Many approaches of automated gating of multiple flow cytometry samples in parallel have been aimed at finding features of the data so that either samples can be classified into groups, e.g. cancer or normal, or they can be used to predict an outcome such as expected time to progression of disease. Features are often designed based on characteristics of cell populations, but usually not so much attention has been given to ensure that they represent actual cell populations. BayesFlow takes the opposite approach and gives a representation of the data according to cell populations, with the same cell populations across the entire set of samples (except when some populations only occurs in a subset of the samples). The advantages to this approach are among others that the result is directly biologically interpretable and that a rich output is given which can be explored in many different ways which are familiar to someone who is used to manual gating. In this way we can join the objectivity and ability to work in high dimensions and with many samples of automated gating with the flexibility in interpretation of manual gating.
Notes
Declarations
Acknowledgements
We would like to thank Bartek Rajwa and Ariful Azad for collecting and sharing the healthyFlowData data set and for providing information about the preprocessing of the data.
The first author gratefully acknowledges a travel grant from the Royal Swedish Academy of Sciences’ foundations. The second author is supported by Knut and Alice Wallenbergs stiftelse.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Shapiro HM. Practical Flow Cytometry. Hoboken, New Jersey: John Wiley & Sons; 2005.Google Scholar
 Nolan JP, Yang L. The flow of cytometry into systems biology. Brief Funct Genomics and Proteomics. 2007; 6(2):81–90.View ArticleGoogle Scholar
 O’Neill K, Aghaeepour N, Špidlen J, Brinkman R. Flow cytometry bioinformatics. PLoS Comput Biol. 2013; 9(12):1003365.View ArticleGoogle Scholar
 Chen X, Hasan M, Libri V, Urrutia A, Beitz B, Rouilly V, et al.Automated flow cytometric analysis across large numbers of samples and cell types. Clin Immunol. 2015; 157(2):249–60.View ArticlePubMedGoogle Scholar
 Welters MJ, Gouttefangeas C, Ramwadhdoebe TH, Letsch A, Ottensmeier CH, Britten CM, et al.Harmonization of the intracellular cytokine staining assay. Cancer Immunol Immunother. 2012; 61(7):967–78.View ArticlePubMedPubMed CentralGoogle Scholar
 Hahne F, Khodabakhshi AH, Bashashati A, Wong CJ, Gascoyne RD, Weng AP, et al.Perchannel basis normalization methods for flow cytometry data. Cytometry Part A. 2010; 77(2):121–31.Google Scholar
 Lo K, Brinkman RR, Gottardo R. Automated gating of flow cytometry data via robust modelbased clustering. Cytometry Part A. 2008; 73(4):321–32.View ArticleGoogle Scholar
 Boedigheimer MJ, Ferbas J. Mixture modeling approach to flow cytometry data. Cytometry Part A. 2008; 73(5):421–9.View ArticleGoogle Scholar
 Chan C, Feng F, Ottinger J, Foster D, West M, Kepler TB. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry Part A. 2008; 73(8):693–701.View ArticleGoogle Scholar
 Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, et al.Automated highdimensional flow cytometric data analysis. Proc Natl Acad Sci. 2009; 106(21):8519–524.View ArticlePubMedPubMed CentralGoogle Scholar
 Hu X, Kim H, Brennan PJ, Han B, BaecherAllan CM, De Jager PL, et al.Application of userguided automated cytometric data analysis to largescale immunoprofiling of invariant natural killer T cells. Proc Natl Acad Sci. 2013; 110(47):19030–19035.View ArticlePubMedPubMed CentralGoogle Scholar
 Naim I, Datta S, Rebhahn J, Cavenaugh JS, Mosmann TR, Sharma G. Swift scalable clustering for automated identification of rare cell populations in large, highdimensional flow cytometry datasets, part 1: Algorithm design. Cytometry Part A. 2014; 85(5):408–321.View ArticleGoogle Scholar
 Qian Y, Wei C, EunHyung Lee F, Campbell J, Halliley J, Lee JA, et al.Elucidation of seventeen human peripheral blood Bcell subsets and quantification of the tetanus response using a densitybased method for the automated identification of cell populations in multidimensional flow cytometry data. Cytometry Part B: Clinical Cytometry. 2010; 78(S1):69–82.View ArticleGoogle Scholar
 Zare H, Shooshtari P, Gupta A, Brinkman RR. Data reduction for spectral clustering to analyze high throughput flow cytometry data. BMC Bioinforma. 2010; 11:403.View ArticleGoogle Scholar
 Qiu P, Simonds EF, Bendall SC, Gibbs Jr KD, Bruggner RV, Linderman MD, et al.Extracting a cellular hierarchy from highdimensional cytometry data with spade. Nature Biotechnol. 2011; 29(10):886–91.View ArticleGoogle Scholar
 Bruggner RV, Bodenmiller B, Dill DL, Tibshirani RJ, Nolan GP. Automated identification of stratifying signatures in cellular subpopulations. Proc Natl Acad Sci. 2014; 111(26):2770–777.View ArticleGoogle Scholar
 Aghaeepour N, Nikolic R, Hoos HH, Brinkman RR. Rapid cell population identification in flow cytometry data. Cytometry Part A. 2011; 79(1):6–13.View ArticleGoogle Scholar
 Ge Y, Sealfon SC. flowPeaks: a fast unsupervised clustering for flow cytometry data via kmeans and density peak finding. Bioinforma. 2012; 28(15):2052–058.View ArticleGoogle Scholar
 Aghaeepour N, Finak G, The FlowCAP Consortium, The DREAM Consortium, Hoos H, Mosmann TR, et al.Critical assessment of automated flow cytometry data analysis techniques. Nature Methods. 2013; 10(3):228–38.View ArticlePubMedPubMed CentralGoogle Scholar
 Azad A, Khan A, Rajwa B, Pyne S, Pothen A. Classifying immunophenotypes with templates from flow cytometry. In: Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics. New York, NY, USA: ACM: 2013. p. 256.Google Scholar
 Cron A, Gouttefangeas C, Frelinger J, Lin L, Singh SK, Britten CM, et al.Hierarchical modeling for rare event detection and cell subset alignment across flow cytometry samples. PLoS Comput Biol. 2013; 9(7):1003130.View ArticleGoogle Scholar
 Dundar M, Akova F, Yerebakan HZ, Rajwa B. A nonparametric Bayesian model for joint cell clustering and cluster matching: Identification of anomalous sample phenotypes with random effects. BMC Bioinforma. 2014; 15:314.View ArticleGoogle Scholar
 FrühwirthSchnatter S, Pyne S. Bayesian inference for finite mixtures of univariate and multivariate skewnormal and skewt distributions. Biostat. 2010; 11(2):317–36.View ArticleGoogle Scholar
 Finak G, Bashashati A, Brinkman R, Gottardo R. Merging mixture components for cell population identification in flow cytometry. Advances in Bioinforma. 2009; 2009:12. http://www.hindawi.com/journals/abi/2009/247646/cta/.View ArticleGoogle Scholar
 Baudry JP, Raftery AE, Celeux G, Lo K, Gottardo R. Combining mixture components for clustering. J Comput Graph Stat. 2010; 19(2):332–353.View ArticleGoogle Scholar
 Hennig C. Methods for merging Gaussian mixture components. Adv Data Anal Class; 4(1):3–34.Google Scholar
 Fraley C, Raftery AE. How many clusters? Which clustering method? Answers via modelbased cluster analysis. Comput J. 1998; 41(8):578–88.View ArticleGoogle Scholar
 Lee JA, Verleysen M. Nonlinear Dimensionality Reduction. New York: Springer; 2007.View ArticleGoogle Scholar
 FrühwirthSchnatter S. Finite Mixture and Markov Switching Models: Modeling and Applications to Random Processes. New York: Springer; 2006. Chapter 4.Google Scholar
 Hartigan JA, Hartigan PM. The dip test of unimodality. Annal Stat. 1985; 13(1):70–84.View ArticleGoogle Scholar
 Fukunaga K. Introduction to Statistical Pattern Recognition. San Diego: Academic press; 1990.Google Scholar
 Azad A. healthyFlowData: Healthy Dataset Used by the flowMatch Package. R package version 1.2.0. 2013.Google Scholar
 Roederer M. Spectral compensation for flow cytometry: Visualization artifacts, limitations, and caveats. Cytometry. 2001; 45(3):194–205.View ArticlePubMedGoogle Scholar