Model-based clustering for flow and mass cytometry data with clinical information

Background High-dimensional flow cytometry and mass cytometry allow systemic-level characterization of more than 10 protein profiles at single-cell resolution and provide a much broader landscape in many biological applications, such as disease diagnosis and prediction of clinical outcome. When associating clinical information with cytometry data, traditional approaches require two distinct steps for identification of cell populations and statistical test to determine whether the difference between two population proportions is significant. These two-step approaches can lead to information loss and analysis bias. Results We propose a novel statistical framework, called LAMBDA (Latent Allocation Model with Bayesian Data Analysis), for simultaneous identification of unknown cell populations and discovery of associations between these populations and clinical information. LAMBDA uses specified probabilistic models designed for modeling the different distribution information for flow or mass cytometry data, respectively. We use a zero-inflated distribution for the mass cytometry data based the characteristics of the data. A simulation study confirms the usefulness of this model by evaluating the accuracy of the estimated parameters. We also demonstrate that LAMBDA can identify associations between cell populations and their clinical outcomes by analyzing real data. LAMBDA is implemented in R and is available from GitHub (https://github.com/abikoushi/lambda).

measurements from millions of cells from the subject. The recent mass cytometry systems use antibodies tagged with heavy metal isotopes which reduce signal interference due to spectral overlap and autofluorescence and enable the detection of more than 40 proteins per cell [2]. This high-dimensional cytometry data contains useful information to diagnose diseases such as leukemia [3] and HIV [4], as well as to predict clinical outcomes such as the response to cancer immune-therapies [5].
A key challenge in the analysis of high-dimensional cytometry data is to identify unknown cell populations that relate as prognostic factors to clinical outcomes of interest. Traditional analysis is done by manual gating which suffers not only from the need to detect unknown cell populations, but also from the need to ensure reproducibility [6]. This lack of reproducibility has two subjective causes. The first concerns the order in which pairs of markers are explored. This order often allows some degree of freedom in the gating process, so it might lead to the selection of different cells by alternative gating strategies. The second concerns the boundaries of the gates being used. There is considerable diversity between operators in terms of gating strategy, in which some experts gate strictly and others gate generously. The subjective nature of manual gating allows for the production of too wide a variety of results to be accurately reproducible.
As an alternative to manual gating, researchers have developed several computational methods, including Citrus [7], cydar [8], and diffcyt [9], to infer cell populations or states associated with an outcome variable in high-dimensional cytometry data. However, these methods require two steps: a first step in which cell populations are identified using a clustering algorithm, and a second step in which the summary statistics of the identified cell populations are concatenated into a clinical outcome of interest which can lead to information loss and analysis bias. Furthermore, these methods do not consider the distinctive features of the expression values of mass cytometry data as opposed to flow cytometry data. Mass cytometry data is marked by a zero-inflated distribution (Fig. 1). That is, proteins can be either 'on' or 'off, ' in which either a positive expression measure is recorded or the recorded expression is zero or negligible, and where a very high proportion of the data entries are zero. On the other hand, in flow cytometry, the boundary between 'on' and 'off ' is more ambiguous, which leads to a bimodal Gaussian distribution (Fig. 1). Therefore, lack of consideration for this distribution difference in the existing methods masks the underlying difference in cell populations and gives rise to a misleading conclusion in both basic and clinical research.
To address the aforementioned problems, we propose a new probabilistic approach for identifying unknown cell populations associated with clinical outcomes of interest which we have named LAMBDA (Latent Allocation Model with Bayesian Data Analysis). The contributions of our proposed method are summarized as follows: • Our method is a one-step procedure that directly uses cytometry data at the single cell level to simultaneously discover cell populations and to identify the associations of these populations with clinical outcomes of interest. Our model can also be used to find relationships between cell populations and a single clinical outcome as well as relationships between cell populations and multiple clinical outcomes.
• Our method is based on correctly specified probabilistic models that are designed for modeling the different distribution information of flow and mass cytometry data respectively. In the case of flow cytometry, LAMBDA assumes that the data is generated from a mixture of multivariate normal distributions, each of which represents an unknown cell population. On the other hand, in the case of mass cytometry, LAMBDA assumes that the data is generated from a mixture of zero-inflated distributions that represent censoring of expression below a substantial limit of detection. In both models, the compositions of cell populations are assumed to vary with clinical outcomes.
• We provide a simple and efficient learning procedure for the proposed model using a stochastic EM algorithm that reduces computational cost. LAMBDA is implemented in the R environment, which is available from https://github.com/abikoushi/ LAMBDA.
From here we will explain the method and its implications in detail. Figure 1 shows a conceptual view of analysis by LAMBDA. The "Methods" section details the proposed model and algorithm. The "Results" section includes an analysis of the efficiency of LAMBDA using synthetic and real data. The "Conclusion" section summarizes the data presented here and describes the possibility for future expansion of this model.

Model for flow cytometry data
Suppose that we observe the flow cytometry dataset y n ∈ R K , (n = 1, . . . , N) and clinical information x n ∈ R D . The dataset includes N cells, K markers and D-dimensional clinical information. Our goal is to identify cells populations from the data. Furthermore we seek to understand how these cell populations change depending on the clinical information. LAMBDA is a model based clustering method. Let L be the number of clusters. The data generative process of LAMBDA for flow cytometry data is defined as follows: where w n,l is l-th element of w n and D × L matrix β is the effect of clinical information. For identifiability, the first column of β is always set as zero. Here, the softmax function is defined by softmax( for vector x = (x 1 , . . . , x K ) using an element-wise exponential function. Figure 2 shows a plate diagram of this data generating process. This model is a kind of conditional Gaussian mixture model [10]. However, the details of the estimation method are not described in detail in the publication, so we will describe them here.

Parameter estimation for flow cytometry data
We find the maximum a posteriori probability (MAP) estimators, using an EM algorithm. If the latent variable w n,l is given, the complete likelihood of this model is represented by the following formula: In the E-step, we calculate where N (y|μ, ) is the density function of the multivariate Gaussian distribution with mean μ and covariance .
In the M-step, we update the parameters using: Because closed form solutions for β are unavailable, we use Newton's method to obtain estimates. We obtain estimates of β by maximizing the following equation, with respect to β: First order derivative of the function Q(β) is represented by: Second order derivative of the function Q(β) is represented by: where ⊗ denotes the Kronecker product and P n is defined as follows: Thus, we update the estimate of β using: where vec is the vec operator.

Model for mass cytometry data
In the same manner as the previous subsection, we observe a mass cytometry dataset, in this case y n ∈ R K , (n = 1, . . . , N) and clinical information x n ∈ R D . An important feature of mass cytometry data, is that a very high proportion of the data entries are zero. The ordinary mixture of the Gaussian model can not explain these zeroes. Here, LAMBDA steps in to properly assess the data. The data generative process of LAMBDA for mass cytometry data is defined as follows: Figure 3 shows a plate diagram of this data generating process.

Parameter estimation for mass cytometry data
Looking again at Eq. (11), if the latent variables w n and z n are given, the complete likelihood of this model is represented by the following formula: For mass cytometry data, we use a stochastic EM algorithm. In the E step, Monte Carlo samplesw n andz n replace missing data w n and z n .
If an arbitrary value for z n is given, we can samplew n from following categorical distribution: where the l-th element of η n is represented by the following formula: In contrast, if an arbitrary value for w n is given, we can samplez n from truncated Normal distribution. The steps of the Gibbs sampler for generatingz n are: • If y n,k > 0,z n,k = y n,k , othewise, where (y|μ, ) denotes the distribution function of a univariate Gaussian distribution with mean μ and variance and x −i is the set of all variables in x except for the i-th variable.
In the M-step, by replacing y n,k and w (i) n,l with samplesz n,k andw n,l , Eqs. (4), (5), and (10) can be used.

Model selection
In fitting the model, it is important to choose an appropriate number for L. It is well known that the number of cluster L with the lowest Bayesian information criterion (BIC) is an appropriate number. The BIC is defined as follows: where L is the likelihood and f is the number of estimated parameters. However, for mass cytometry data, it requires a high computational cost to calculate the exact likelihood in the stochastic EM algorithm. Thus, in this article, we use BIC for flow cytometry data, and elbow method for mass cytometry data to choose L. Elbow method chooses a number of clusters that adding another cluster doesn't give a better fit to the data. The goodness of fit of the model to data is evaluated by the sum of squared error (SSE). SSE is defined by following:

Simulation study
To evaluate the standard error (SE) and the bias of the estimations, we conducted simulation experiments. The bias ofθ is defined by the difference between the true value and the estimated value (E[θ ] −θ). The synthetic data was naturally produced via the data generating process given by Eq 11. We set K = 10. The μ and were randomly generated. We used a multivariate normal distribution to generate the synthetic data, and values less than 0 were replaced by 0. We estimated the parameters from 100 replicates of the experiment. We set the sample size N = 2000, the number of clusters L = 4, and the categories D = 2. One category has the mixture proportion φ 1 = (0.1, 0.2, 0.3, 0.4) , the other has the mixture proportion φ 2 = (0.25, 0.25, 0.25, 0.25) . To differentiate these two categories, we set X = (1, x), where 1 is a vector of ones. The variable x is a dummy variable to indicate the category. When estimating parameters, we set τ = 0.01, ν = K + 2, and is an identity matrix, which is equivalent to a weakly-informative prior distribution. To avoid the problem of label switching [11], the estimated parameters are rearranged as Synthetic data was analyzed by LAMBDA along with ordinaly Gaussian mixture model (GMM) which cannot incorporate explanatory variables. using R package "mclust". We estimated the clusters by GMM and calculate the mixture proportion of the estimated clusters by category and the median for each cluster was used as the estimates of μ.
The mean and SE for the estimatedμ andˆ are shown in Figs. 4 and 5, respectively. We observed that the points were arranged diagonally, indicating that the estimator of LAMBDA is unbiased. In contrast, GMM estimates often have a large bias. The mean and SE for the estimatedφ is shown in Table 1. In the case of synthetic data, the algorithm of Fig. 7 Estimatedμ for Landrigan's study. The x-axis corresponds to the markers, and the y-axis corresponds to the clusters. Red, white, and blue indicates high, middle, and low marker intensity, respectively LAMBDA uses parameters estimated with small biases and is able to produce reasonable estimates.

Results on real data
We applied LAMBDA to real world flow and mass cytometry data. When estimating parameters, we set τ = 0.01, ν = K + 2, and is an identity matrix.
For the case of flow cytometry we turn to Landrigan's study (https://community. cytobank.org/cytobank/experiments/35226), in which naive CD4+ T cells were purified and stimulated by anti-CD3 and anti-CD28 antibodies.
Five cases were tested: unstimulated, stimulated by only the anti-CD3 antibody, stimulated by both the anti-CD3 and anti-CD28 antibodies, and two cases with different dosages for the anti-CD3 antibody (0.3 μg/mL and 0.8 μg/mL). The purpose of this study is identifying the associations of cell populations with elapsed time from the stimulations start point.
Thus, we use time, dosage, anti-CD3, and anti-CD28 as the covariate X. All variables are treated as dummy variables.
It is known that stimulation of CD3 triggers activation of naive CD4+ T cells, which accompany the phosphorylation of SLP76/S6 and CD247 (pSLP76/pS6, pCD246) [12]. Fig. 8 Estimatedφ for Landrigan's study. The x-axis corresponds to the timepoints, and the y-axis corresponds to the clusters CD28 is the co-stimulatory factor that enhances and prolongs T cell activation [13]. Soon after activation, the levels of phosphorylated SLP/S6 and CD247 decrease by negative feedback. Then the cells become CD45RO+ memory T cells. BIC (Fig. 6) determined the setting of 13 clusters. Figure 7 shows theμ. Also as shown in Fig. 7, clusters 1, 3, 4, 11, and 12 are the pSLP76/pS6+ pCD247+activated naive T cells, and clusters 2, 9, and 10 are pSLP76/pS6-pCD247-CD45RO+ memory T cells. The mixture proportion is shown in Fig. 8. While the mixture proportion remains stable over time in unstimulated cases, other cases show a high proportion of activated naive T cells at 3 min and their proportion decreases at 6 and 10 min by deactivation through negative feedback. Figure 8 shows that as activated T cells decrease, the memory T cell population increases, indicating the transformation of naive T cells to memory T cells. This shows that in the case of flow cytometry data the method is able to provide a reasonable interpretation of the cell population clusters.
We also applied LAMBDA to mass cytometry data from Jin's study [14]. This data is available in FlowRepository (https://flowrepository.org/) under Repository ID: FR-FCM-ZY6C. The purpose of this study is discovering cell population related to clinical responses. In the case of mass cytometry data, for determination of the number of clusters, we used the elbow method, which is performed by plotting the SSE within each cluster against the number of clusters. In this case, the elbow method determined 14 clusters (Fig. 9). Arranged by severity, clinical responses include healthy donors (ND), partial response (PR), stable disease (SD), and progressive disease (PD). ND were used as the baseline for ICB samples. Figure 10 shows the estimated mixture proportionφ. We observed that the value of the mixture proportion for cluster 2 increases as cancer progresses from PR to PD. Figure 11 shows the estimatedμ. We denote hi and lo that the marker is high and low level expressed respectively. Cluster 2 is characterized by CD8+, T-bet lo, EOMES hi, PD1hi, and Ki67 lo. T-bet lo, EOMES hi, PD1hi, and Ki67 lo are exhaustion markers of the T cell. "Exhaustion" refers to cases where a T cell becomes dysfunctional due to the longterm induction of various co-repressive molecules such as PD-1, CTLA-4, and TIM-3. Pauken & Wherry [15] reported that the CD8+ T cells of T-bet hi and EOMES lo become T-bet lo, EOMES hi, and PD1 hi through exhaustion. The marker KI67 indicate cell mass culturing. The exhausted T cells have a low expression level in KI67. Blackburn [16] reported that the cell populations of T-bet lo, EOMES hi, and PD1 hi are not activated by blocking the PD1 / PDL-1 pathway with immune checkpoint inhibitors. LAMBDA shows that the cluster 2 cell population is high in patients who underwent a PDL1 inhibitor treatment with a poor prognosis. This finding is consistent with Pauken and Blackburn's study, showing the effectiveness of LAMBDA in interpreting high dimensional mass cytometry data in real situations.  Heatmap for Jin's study. x-axis corresponds to the markers, and the y-axis corresponds to the clusters Through this analysis we can see that LAMBDA is a method that can efficiently estimate various clusters within cell populations and identify the associations between these cell clusters and their clinical outcomes in cases of both flow and mass cytometry data.

Discussion
LAMBDA should prove useful as it is described in this paper, but there is room for future study and improvement. Recently, with the development of next generation sequencing technologies, single cell sequencing was introduced to the field of biomedical research. Sequencing the DNA provides a higher resolution of cellular differences and a better understanding of the function of an individual cell in the context of its microenvironment. In this context, our future aim is the extension of LAMBDA for application to single cell DNA data. This will allow us to understand the condition of the cell on a fundamental level, contributing to our overall understanding of biology and its processes.

Conclusion
With the development of high dimensional flow and mass cytometry data, researches have been challenged with the need to properly identify and interpret data about cell populations. To meet this challenge, we proposed a statistical framework that uses flow and mass cytometry data to discover cell clusters and the associations between individual clusters and clinical information.
As described in the "Methods" section, this model uses a stochastic EM to estimate parameters. In terms of computation, this parameter-estimation procedure offers an advantage over procedures that use an ordinary EM algorithm. This is because, in an algorithm that uses ordinary EM, the computational cost is large due to calculating the high-dimensional conditional expectation. By contrast, our procedure involves a Gibbs sampling that substitutes for this requirement, significantly reducing the computational cost.
In addition to being computationally efficient, our framework also has useful properties from the perspective of data analysis. Usual methods of clustering are not able to support the inclusion of explanatory variables. However, LAMBDA can include any explanatory variables. This property allows LAMBDA to analyze experimental results with various settings. Because of this novel feature, we expect that LAMBDA will be efficiently applied to studies that seek an association between cell populations and clinical information, advancing our ability to predict disease and predict outcomes of treatment.