Model-based cell clustering and population tracking for time-series flow cytometry data

Background Modern flow cytometry technology has enabled the simultaneous analysis of multiple cell markers at the single-cell level, and it is widely used in a broad field of research. The detection of cell populations in flow cytometry data has long been dependent on “manual gating” by visual inspection. Recently, numerous software have been developed for automatic, computationally guided detection of cell populations; however, they are not designed for time-series flow cytometry data. Time-series flow cytometry data are indispensable for investigating the dynamics of cell populations that could not be elucidated by static time-point analysis. Therefore, there is a great need for tools to systematically analyze time-series flow cytometry data. Results We propose a simple and efficient statistical framework, named CYBERTRACK (CYtometry-Based Estimation and Reasoning for TRACKing cell populations), to perform clustering and cell population tracking for time-series flow cytometry data. CYBERTRACK assumes that flow cytometry data are generated from a multivariate Gaussian mixture distribution with its mixture proportion at the current time dependent on that at a previous timepoint. Using simulation data, we evaluate the performance of CYBERTRACK when estimating parameters for a multivariate Gaussian mixture distribution, tracking time-dependent transitions of mixture proportions, and detecting change-points in the overall mixture proportion. The CYBERTRACK performance is validated using two real flow cytometry datasets, which demonstrate that the population dynamics detected by CYBERTRACK are consistent with our prior knowledge of lymphocyte behavior. Conclusions Our results indicate that CYBERTRACK offers better understandings of time-dependent cell population dynamics to cytometry users by systematically analyzing time-series flow cytometry data.

plot, but it suffers from several major limitations, including subjectivity, operator bias, difficulties in detecting unknown cell populations, and difficulties in reproducibility [4][5][6].
To overcome these limitations, several methods have been proposed for the computationally guided or automated detection of unknown cell populations by unsupervised clustering, including FlowSOM, X-shift, Pheno-Graph, Rclusterpp, and flowMeans [7]. Although these methods have been successfully applied to identify both major and rare cell populations, they are not designed for modeling and analyzing time-series data and thus cannot capture the time-dependent properties and dynamics of cell populations. For example, in clinical applications such as cancer immunotherapies, we are interested in investigating drug effects on cell populations by monitoring their dynamics throughout the treatment period [8]. Time-series flow cytometry data offer information on longitudinal cell population dynamics that could not be elucidated by conventional static time-point data. However, such research is currently limited by a lack of a systematic mathematical framework to adequately model and analyze time-series flow cytometry data.
To address this problem, we propose a new statistical framework, named CYBERTRACK (CYtometry-Based Estimation and Reasoning for TRACKing cell populations), for the automatic clustering and tracking of a mixture proportion of cell populations in time-series flow cytometry data. Our contributions are summarized as follows: • Our framework is based on the Topic Tracking Model proposed by Iwata et al., 2009, which is designed for tracking topic distribution that changes over time. We extend their model to handle time-series flow cytometry data, which is assumed to follow a multivariate Gaussian mixture distribution. • By assuming that the mixture proportion at the current time is dependent on that at a previous time, CYBERTRACK is capable of estimating the longitudinal transition of multiple cell populations and detecting the "change-point" in the overall mixture proportion. • We provide a simple and efficient learning procedure for the proposed model by using a stochastic EM algorithm, which is an alternate iteration of Gibbs sampling and maximum a posteriori (MAP) estimation of parameters. CYBERTRACK is implemented in an R environment, and the implementation is available from https://github.com/ kodaim1115/CYBERTRACK.
A conceptual view of an analysis by CYBERTRACK is shown in Fig. 1.
Our model and algorithm are described in the "Methods" section. To validate its performance and practicability, we applied CYBERTRACK to both simulation and real timeseries flow cytometry datasets for two immunological experiments.

Model
Suppose that we observe time series flow cytometry data y t,d,n ∈ R K , where t ∈ {1, ..., T} is a time index, d ∈ {1, ..., D} is a case index, n ∈ {1, ..., N t,d } is a cell index, and K is the number of markers. In the context of flow cytometry experiment, case corresponds to a study subjects (e.g. cultured cell, mouse, or patient) from which cells are collected and analyzed. N t,d represents the number of cells observed at time t for case d. The objective of this study is to perform clustering of cells and track the time-dependent transition of cluster mixture proportion π t,d,l for each case, where l ∈ {1, ..., L} is a cluster index. Our model is inspired by the Topic Tracking Model, and is an extension of the multivariate Gaussian mixture model. Topic Model is a Bayesian model which was originally designed to extract latent semantics, or "topics", from text data. Topic Tracking Model is an extension of Topic Model specialized in tracking time-varying topic distribution [9]. Although the original Topic Tracking Model assumed each word was generated from a multinomial distribution, this assumption does not apply to flow cytometry data, where each cell possess continuous values for the number of markers. Therefore, we assumed that flow cytometry data follow a multivariate Gaussian mixture distribution. Since the original Topic Tracking Model is not designed to estimate parameters for the multivariate Gaussian mixture distribution, we constructed the stochastic EM algorithm to solve this problem. Here, topics correspond to cell populations such as T cells or B cells. Figure 2 illustrates a plate diagram of our proposed model, where, z t,d,n is a latent cluster vector of length L that holds 1 for the l-th element when a cell is generated from cluster l and holds 0 otherwise. We assume that each cell is generated from a multivariate Gaussian mixture distribution with the parameter vector μ l and l , which represents the mean and the covariance matrix for cluster l, respectively. More specifically, the generative process of CYBERTRACK is defined by where z is a latent cluster of the n-th cell at time t for case d indicated by z t,d,n , μ z and z are the mean vector and covariance matrix of the latent cluster, respectively, π t,d = {π t,d,l } L l=1 is the mixture proportion vector, and α t,d represents the persistency parameter, which indicates how consistent the mixture proportion at time t is compared with that at the previous time t − 1. A smaller α t,d value indicates a larger discrepancy between the mixture proportion at time t and t − 1. Thus, timepoints with relatively small persistency parameters could be considered as "change-points" in the mixture proportion. μ = {μ l } L l=1 is the mean vectors of clusters, and

Fig. 2 Graphical model of CYBERTRACK
is the covariance matrices of clusters. τ is the hyperparameter of the μ prior distribution, and and ν are the hyperparameters of the prior distribution.

Parameter estimation
Parameter estimation in CYBERTRACK is based on the stochastic EM algorithm, which is an alternate iteration of Gibbs sampling and maximum a posteriori estimation of parameters. Suppose t is the current time, and suppose we have flow cytometry data matrix and a mixture proportion We perform the inference of latent clusters based on Gibbs sampling.
μ, and can be written as follows: The logarithm of above will be: Therefore, z t,d,n is sampled from the following categorical distribution: where η t,d,n = {η t,d,n,l } L l=1 . Suppose we have the mean of the previous mixture proportionπ t−1,d . The persistency parameter α t,d is estimated by fixed point iteration. where , where (·) is the gamma function, and N t,d,l is the number of cells assigned to cluster l at time t for case d. The mean of π t,d,l is then calculated as follows:π We substitute the E-step of the EM algorithm by Gibbs sampling, then μ and are updated in the M-step as follows: where y n,l is the n-th cell assigned to cluster l, and N l is the number of cells assigned to cluster l.

Simulation study
We conducted a simulation experiment to examine the performance of CYBERTRACK. We set K = 10, T = 5, and D = 2. The μ and was randomly generated. π were set manually so that change-points for mixture proportions exist at t = 3, 5 for case 1 and t = 2, 4 for case 2.
For the hyperparameters, we set τ = 10 −5 and ν = K + 2, and was set to an identity matrix, which is equivalent to giving weakly informative priors. With this parameter setting, we randomly generated 1000 synthetic data for each timepoint in each case (10,000 samples in total). The simulation was repeated 200 times, and different data was synthesized each time. Synthetic data was analyzed by CYBERTRACK along with ordinal Gaussian mixture model (GMM) using R package "mclust" as a baseline method.
Likelihood based criterion such as AIC or BIC tend to favor the number of clusters much larger than known distinct cell populations, making it difficult to interpret the result for flow cytometry users [10,11]. Also, since cell populations are known to form near-continuous phenotypes, fully automated determination of the number of cluster is often not very meaningful because users' cluster of interest may not be captured at that number. Instead, it is more important for users to explore some candidate cluster numbers that suits their purpose of the analysis. Therefore, we used the elbow method as partially automated, user-interactive determination method of the number of clusters. This method involves plotting the sum of squared error (SSE) within each cluster against the number of clusters. At the number of clusters with good partitioning, drop in SSE saturates, showing an elbow-like look (Additional file 1: Figure S1). It is usual to find several "elbows" in real world data, and we recommend users to try some candidate number of clusters and see whether cell populations of interest are detected.
The mean and standard deviation (sd) for the estimated μ andˆ are shown in Fig. 3.π estimated by CYBER-TRACK and GMM are shown in Fig. 4. These results show that the parameters for time-series multivariate Gaussian mixture distribution were reasonably estimated by the stochastic EM algorithm, and that CYBERTRACK successfully tracked the time-dependent transition of the mixture proportion in multiple cases with better precision compared to ordinal GMM (Fig. 4b). An important difference from the existing method is that CYBERTRACK can detect change-point by estimatingα. As shown in Fig. 5, α holds small values at t = 3, 5 for case 1 and t = 2, 4 for case 2, correctly indicating the dramatic transition of the mixture proportion at that timepoint.

Results for real data
To validate the CYBERTRACK performance for cell clustering and tracking mixture proportions of cell populations, we applied CYBERTRACK to real world flow cytometry data uploaded to Cytobank (https://www. cytobank.org/). In Landrigan's study (https://community. cytobank.org/cytobank/experiments/35226), naive CD4+ T cells were purified and stimulated using anti-CD3 and anti-CD28 antibodies. Five cases were tested: unstimulated, stimulated by only anti-CD3 antibody, and stimulated by both anti-CD3 and anti-CD28 antibodies, with two dosages tested for the anti-CD3 antibody (0.3 μg/mL and 0.8 μg/mL). It is known that the stimulation of CD3 triggers the activation of naive CD4+ T cells, which accompanies the phosphorylation of SLP76/S6 and CD247 (pSLP76/pS6, pCD246) [12]. CD28 is the co-stimulatory factor that enhances and prolongs T cell activation [13]. Soon after activation, the levels of pSLP/pS6 and pCD247 decrease owing to negative feedback. Consequently, the cells become CD45RO+ memory T cells. Fig. 3 Simulation result forμ andˆ estimation. a, Estimatedμ were plotted against true μ. Each dot represents elements ofμ and was color-coded by cluster. b, Estimatedˆ values were plotted against true . Each dot represents a element ofˆ and was color-coded by cluster; green, orange, and blue represents cluster 1, 2, and 3, respectively. Error bars represent standard deviation For Landrigan's study, the number of clusters was determined as 16 by using the elbow method (Additional file 1: Figure S2a). Histgrams of marker intensities for clusters identified by CYBERTRACK show that multivariate Gaussian distribution well approximates flowcytometry data (Additional file 1: Figure S3). Fig. 6a shows the heatmap generated from the estimatedμ; clusters 1,7,8,9,12, and 16 are the pSLP76/pS6+ pCD247+ activated naive T cells, and clusters 2, 4, 6, and 13 are pSLP76/pS6-pCD247-CD45RO+ memory T cells. The time-dependent transition of the mixture proportion is shown in Fig. 6b. While the mixture proportion remains stable over time in unstimulated cases, other cases show dynamic fluctuation, as expected. In stimulated cases, a high proportion of activated naive T cells (specifically, cluster 16 and 7 for dosages 0.3 and 0.8 μg/mL, respectively) was observed at t = 3 min. Their proportions decreased through t = 6, 10 min by the T cell's negative feedback mechanism. Figure 6b shows that as the number of activated T cells decreases, the memory T cell populations increase, indicating the transformation of naive T cells into memory T cells. This behavior was well represented byα estimates, shown in Fig. 6c; theα for stimulated cases shows small values at t = 6 and t = 10 min compared with that of unstimulated cases, indicating dynamic changes in cell population constitution at those timepoints. Interestingly, in cases stimulated with both anti-CD3 and anti-CD28 antibodies, a prominent increase of clusters with moderate levels of pSLP76/pS6 and pCD247 (cluster 14 and 1 for dosages 0.3 μg/mL and 0.8 μg/mL, respectively) was observed at t = 6. These clusters can be interpreted as cell populations that are transitioning from a highly activated state to an inactivated memory state. This is consistent with the well-known prolonged T cell activation by stimulation of CD28, thus further indicating that CYBERTRACK is capable of illustrating dynamic biological processes from time-series flow cytometry data [13]. We also applied CYBERTRACK to the data in Huang's study (https://community.cytobank.org/cytobank/experiments/5002), where cells were collected from mice whose lymph nodes were stimulated with either interleukin 7 (IL7) or interferon alpha (IFNα). It is known that IL7 and IFNα interact with their receptors on the lyphocytes' surface and activate lymphocytes through the phosphorylation of STAT family proteins (e.g., pSTAT1 and pSTAT5), which promotes the transcription of immunerelated genes [14,15].
The number of clusters was determined as 26 by using the elbow method (Additional file 1: Figure S2b), and the heatmap is shown in Fig. 7a. In Huang's study, T cells were identified by CD4 and/or TCRβ, and B cells were identified by B220. As shown in the heatmap, CYBERTRACK clustered cells into canonical cell types, which include CD4+ TCRβ+ T cells (clusters 4, 6, 9, 10, 17, and 26), CD4-TCRβ+ T cells (clusters 2, 3, 15, 24), and B220+ B cells (clusters 1, 3, 5, and 21). Clusters with extremely high levels of both B220 and TCRβ are thought to be debris; therefore, they were excluded from further interpretation. Figure 7b shows the time-dependent transition of the mixture proportion for each cell population. CYBERTRACK detected cell populations that increased over time in both cases. These cell populations include pSTAT1+ pSTAT5+ T cell (cluster 3) and pSTAT1+ B cell (cluster 5), which are typical cell populations that are known to emerge upon IL7 and IFNα stimulation. Furthermore, CYBERTRACK also illustrated cell population dynamics that differed in two cases; pSTAT5+ T cells (clusters 9 and 24) increased only when stimulated by IL7, whereas pSTAT1+ T cells (clusters 6 and 15) increased only in the IFNα-stimulated case. Although IL7 and IFNα are known to induce the phosphorylation of a variety of STAT family proteins, the result shown here may reflect the preferential upregulation of STAT5 and STAT1 by IL7 and IFNα, respectively [16,17]. The estimatedα shows that the change-points are located at t = 2, 4 min for IL7 stimulation and t = 2 min for IFNα stimulation (Fig. 7c). Furthermore, analysis by CYBERTRACK revealed that stimulation by IL7 induces more dramatic changes in cell population constitution at an early stage (until t = 4 min), as indicated by the small α values.

Discussion
The application of CYBERTRACK to simulation and real flow cytometry data has validated its performance for cell clustering and tracking mixture proportions in multiple Fig. 6 Result for CYBERTRACK analysis on Landrigan's study. a, Heatmap for Landrigan's study. Cluster number is shown on the x axis and markers are shown on the y axis. b, Mixture proportions for five cases were drawn in different graphs. Each colored line represents the mixture proportion for a different cluster. t is on the x axis and the mixture proportion is on the y axis. c, Estimatedα for Landrigan's study cases. The results of CYBERTRACK analysis using two immunological experiments were consistent with our prior knowledge, which validates CYBERTRACK's ability to analyze time-series flow cytometry data. We believe that CYBERTRACK will be a powerful tool in various fields involving the investigation of cell population dynamics. For instance, in the field of cancer immunotherapy, the longitudinal immune monitoring of patients has become increasingly important as it provides information on the impact of therapeutic treatment on certain cell populations, or in finding cell populations that can be used as prognostic markers. Furthermore, CYBERTRACK will also be useful in basic research as it can give insights into flow cytometry time-series data in an unbiased manner. Because CYBERTRACK is capable of clustering cells from different cases, it is easy for researchers to compare population dynamics in experiments with a control and several cases.
Recently, a tool called mass cytometry was introduced to the field of biomedical research. Mass spectrometrybased detection of marker genes by mass cytometry has enabled the investigation of more than 40 markers simultaneously, providing much more informative data with higher-dimensions compared with fluorescencebased conventional flow cytometry. Recent research trends in single-cell biology highly depend on mass cytometry, and it has contributed to many important discoveries [18]. One limitation of CYBERTRACK is that it is inapplicable to mass cytometry data, because the data generated by mass cytometry do not follow a Fig. 7 Result for CYBERTRACK analysis on Huang's study. a, Heatmap for Landrigan's study. Cluster number is shown on the x axis and markers are shown on the y axis. b, Mixture proportions for 26 clusters were drawn in different graphs. Black and red lines represent IL7 stimulation and IFNα stimulation, respectively. t is on the x axis and the mixture proportion is on the y axis. c, Estimatedα for Huang's study. Black and red lines represent IL7 stimulation and IFNα stimulation, respectively multivariate Gaussian distribution. Our future aim is to extend CYBERTRACK for application to time-series mass cytometry data.

Conclusion
Here, we propose a model-based cell clustering and population-tracking algorithm called CYBERTRACK. The aim of CYBERTRACK is to discover the underlying dynamics of cell populations in time-series flow cytometry data. Our model is inspired by the Topic Tracking Model [9], and we modified it for the parameter estimation of a multivariate Gaussian mixture distribution. CYBERTRACK is capable of (i) cell clustering, (ii) tracking the mixture proportion of each cell population, and (iii) detecting the change-point in the overall mixture proportion.