Skip to main content

Model-based cell clustering and population tracking for time-series flow cytometry data

Abstract

Background

Modern flow cytometry technology has enabled the simultaneous analysis of multiple cell markers at the single-cell level, and it is widely used in a broad field of research. The detection of cell populations in flow cytometry data has long been dependent on “manual gating” by visual inspection. Recently, numerous software have been developed for automatic, computationally guided detection of cell populations; however, they are not designed for time-series flow cytometry data. Time-series flow cytometry data are indispensable for investigating the dynamics of cell populations that could not be elucidated by static time-point analysis. Therefore, there is a great need for tools to systematically analyze time-series flow cytometry data.

Results

We propose a simple and efficient statistical framework, named CYBERTRACK (CYtometry-Based Estimation and Reasoning for TRACKing cell populations), to perform clustering and cell population tracking for time-series flow cytometry data. CYBERTRACK assumes that flow cytometry data are generated from a multivariate Gaussian mixture distribution with its mixture proportion at the current time dependent on that at a previous timepoint. Using simulation data, we evaluate the performance of CYBERTRACK when estimating parameters for a multivariate Gaussian mixture distribution, tracking time-dependent transitions of mixture proportions, and detecting change-points in the overall mixture proportion. The CYBERTRACK performance is validated using two real flow cytometry datasets, which demonstrate that the population dynamics detected by CYBERTRACK are consistent with our prior knowledge of lymphocyte behavior.

Conclusions

Our results indicate that CYBERTRACK offers better understandings of time-dependent cell population dynamics to cytometry users by systematically analyzing time-series flow cytometry data.

Background

Flow cytometry is a widely used technology for identifying and quantifying cellular properties and cell populations by measuring expression levels of surface and intracellular proteins at the single-cell level. Modern flow cytometers allow the simultaneous detection of nearly 20 protein markers per cell with a throughput of thousands of cells per second. The flow cytometry technique has greatly contributed to understanding the cellular biological processes and supporting clinical diagnoses in fields including immunology, cancer biology, and regenerative medicine [13].

An important challenge in the analysis of flow cytometry data is the classification of individual cells into canonical cell types, that is, subset populations such as T and B cells. The traditional approach of “manual gating” is performed by visually inspecting a two-dimensional scatter plot, but it suffers from several major limitations, including subjectivity, operator bias, difficulties in detecting unknown cell populations, and difficulties in reproducibility [46].

To overcome these limitations, several methods have been proposed for the computationally guided or automated detection of unknown cell populations by unsupervised clustering, including FlowSOM, X-shift, PhenoGraph, Rclusterpp, and flowMeans [7]. Although these methods have been successfully applied to identify both major and rare cell populations, they are not designed for modeling and analyzing time-series data and thus cannot capture the time-dependent properties and dynamics of cell populations. For example, in clinical applications such as cancer immunotherapies, we are interested in investigating drug effects on cell populations by monitoring their dynamics throughout the treatment period [8]. Time-series flow cytometry data offer information on longitudinal cell population dynamics that could not be elucidated by conventional static time-point data. However, such research is currently limited by a lack of a systematic mathematical framework to adequately model and analyze time-series flow cytometry data.

To address this problem, we propose a new statistical framework, named CYBERTRACK (CYtometry-Based Estimation and Reasoning for TRACKing cell populations), for the automatic clustering and tracking of a mixture proportion of cell populations in time-series flow cytometry data. Our contributions are summarized as follows:

  • Our framework is based on the Topic Tracking Model proposed by Iwata et al., 2009, which is designed for tracking topic distribution that changes over time. We extend their model to handle time-series flow cytometry data, which is assumed to follow a multivariate Gaussian mixture distribution.

  • By assuming that the mixture proportion at the current time is dependent on that at a previous time, CYBERTRACK is capable of estimating the longitudinal transition of multiple cell populations and detecting the “change-point” in the overall mixture proportion.

  • We provide a simple and efficient learning procedure for the proposed model by using a stochastic EM algorithm, which is an alternate iteration of Gibbs sampling and maximum a posteriori (MAP) estimation of parameters. CYBERTRACK is implemented in an R environment, and the implementation is available from https://github.com/kodaim1115/CYBERTRACK.

A conceptual view of an analysis by CYBERTRACK is shown in Fig. 1.

Fig. 1
figure1

Conceptual view of the analysis by CYBERTRACK. The aim of CYBERTRACK is to model and analyze time-series flowcytometry data to understand dynamic cell population behavior that spans certain period of time. In time-series flow cytometry analysis, cells are acquired sequentially and their expression levels of marker proteins are analyzed, giving data matrices of cells and markers for each timepoint. CYBERTRACK takes these data matrices as an input and performs i) clustering, ii) tracking cell population dynamics, and iii) detecting change-points in cell population constitution

Our model and algorithm are described in the “Methods” section. To validate its performance and practicability, we applied CYBERTRACK to both simulation and real time-series flow cytometry datasets for two immunological experiments.

Methods

Model

Suppose that we observe time series flow cytometry data \(y_{t,d,n} \in \mathbb {R}^{K}\), where t{1,...,T} is a time index, d{1,...,D} is a case index, n{1,...,Nt,d} is a cell index, and K is the number of markers. In the context of flow cytometry experiment, case corresponds to a study subjects (e.g. cultured cell, mouse, or patient) from which cells are collected and analyzed. Nt,d represents the number of cells observed at time t for case d. The objective of this study is to perform clustering of cells and track the time-dependent transition of cluster mixture proportion πt,d,l for each case, where l{1,...,L} is a cluster index. Our model is inspired by the Topic Tracking Model, and is an extension of the multivariate Gaussian mixture model. Topic Model is a Bayesian model which was originally designed to extract latent semantics, or “topics”, from text data. Topic Tracking Model is an extension of Topic Model specialized in tracking time-varying topic distribution [9]. Although the original Topic Tracking Model assumed each word was generated from a multinomial distribution, this assumption does not apply to flow cytometry data, where each cell possess continuous values for the number of markers. Therefore, we assumed that flow cytometry data follow a multivariate Gaussian mixture distribution. Since the original Topic Tracking Model is not designed to estimate parameters for the multivariate Gaussian mixture distribution, we constructed the stochastic EM algorithm to solve this problem. Here, topics correspond to cell populations such as T cells or B cells. Figure 2 illustrates a plate diagram of our proposed model, where, zt,d,n is a latent cluster vector of length L that holds 1 for the l-th element when a cell is generated from cluster l and holds 0 otherwise. We assume that each cell is generated from a multivariate Gaussian mixture distribution with the parameter vector μl and Σl, which represents the mean and the covariance matrix for cluster l, respectively. More specifically, the generative process of CYBERTRACK is defined by

$$\begin{array}{@{}rcl@{}} &&\boldsymbol{y}_{t,d,n} \mid \boldsymbol{z}_{t,d,n}\sim \text{Gaussian}(\boldsymbol{\mu}_{z},\boldsymbol{\Sigma}_{z}) \end{array} $$
(1)
Fig. 2
figure2

Graphical model of CYBERTRACK

$$\begin{array}{@{}rcl@{}} &&\boldsymbol{z}_{t,d,n}\mid\boldsymbol{\pi}_{t,d}\sim\text{Categorical}(\boldsymbol{\pi}_{t,d}) \end{array} $$
(2)
$$\begin{array}{@{}rcl@{}} &&\boldsymbol{\pi}_{t,d}\mid\alpha_{t,d},\boldsymbol{\pi}_{t-1,d}\sim\text{Dirichlet}(\alpha_{t,d}\boldsymbol{\pi}_{t-1,d}) \end{array} $$
(3)
$$\begin{array}{@{}rcl@{}} &&\boldsymbol{\mu}_{l} \mid \tau, \mathbf{\Sigma}_{l} \sim \text{Gaussian}\left(0,\tau^{-1}\mathbf{\Sigma}_{l}\right) \end{array} $$
(4)
$$\begin{array}{@{}rcl@{}} &&\mathbf{\Sigma}_{l}^{-1} \mid \nu, \mathbf{\Lambda} \sim \text{Wishart}\left(\nu, \mathbf{\Lambda}^{-1}\right) \end{array} $$
(5)

where z is a latent cluster of the n-th cell at time t for case d indicated by zt,d,n, μz and Σz are the mean vector and covariance matrix of the latent cluster, respectively, \(\boldsymbol {\pi }_{t,d} = \{\pi _{t,d,l}\}_{l=1}^{L}\) is the mixture proportion vector, and αt,d represents the persistency parameter, which indicates how consistent the mixture proportion at time t is compared with that at the previous time t−1. A smaller αt,d value indicates a larger discrepancy between the mixture proportion at time t and t−1. Thus, timepoints with relatively small persistency parameters could be considered as “change-points” in the mixture proportion. \(\boldsymbol {\mu } = \{\boldsymbol {\mu }_{l}\}_{l=1}^{L}\) is the mean vectors of clusters, and \(\mathbf {\Sigma } = \{\boldsymbol {\Sigma }_{l}\}_{l=1}^{L}\) is the covariance matrices of clusters. τ is the hyperparameter of the μ prior distribution, and Λ and ν are the hyperparameters of the Σ prior distribution.

Parameter estimation

Parameter estimation in CYBERTRACK is based on the stochastic EM algorithm, which is an alternate iteration of Gibbs sampling and maximum a posteriori estimation of parameters. Suppose t is the current time, and suppose we have flow cytometry data matrix \(\mathbf {Y}_{t}=\{\mathbf {Y}_{t,d}\}_{d=1}^{D}\) and a mixture proportion matrix \(\mathbf {\Pi }_{t} = \{\boldsymbol {\pi }_{t,d}\}_{d=1}^{D}\), where \(\mathbf {Y}_{t,d}=\{\boldsymbol {y}_{t,d,n}\}_{n=1}^{N_{t,d}}\). We perform the inference of latent clusters based on Gibbs sampling. Let \(\mathbf {Z}_{t} = \{\boldsymbol {z}_{t,d}\}_{d=1}^{D}\) be the set of latent clusters of all cases at time t, where \(\boldsymbol {z}_{t,d} = \{\boldsymbol {z}_{t,d,n}\}_{n=1}^{N_{t,d}}\). The posterior distribution of Zt given Yt, Πt, μ, and Σ can be written as follows:

$$ \begin{aligned} p(\mathbf{Z}_{t} &\mid \mathbf{Y}_{t}, \mathbf{\Pi}_{t}, \boldsymbol{\mu}, \mathbf{\Sigma})\\ &\propto p(\mathbf{Y}_{t}, \mathbf{Z}_{t}, \mathbf{\Pi}_{t}, \boldsymbol{\mu}, \mathbf{\Sigma})\\ &\propto p(\mathbf{Y}_{t} \mid \mathbf{Z}_{t}, \boldsymbol{\mu}, \mathbf{\Sigma}) p(\mathbf{Z}_{t} \mid \mathbf{\Pi}_{t})\\ &= \prod_{d} \prod_{n} p(\boldsymbol{y}_{t,d,n} \mid \boldsymbol{z}_{t,d,n}, \boldsymbol{\mu}, \mathbf{\Sigma})p(\boldsymbol{z}_{t,d,n} \mid \boldsymbol{\pi}_{t,d}). \end{aligned} $$
(6)

The logarithm of above will be:

$$\begin{array}{@{}rcl@{}} \begin{aligned} \log\{p(&\boldsymbol{y}_{t,d,n} \mid \boldsymbol{z}_{t,d,n}, \boldsymbol{\mu}, \mathbf{\Sigma})p(\boldsymbol{z}_{t,d,n} \mid \boldsymbol{\pi}_{t,d})\}\\ &= \sum_{l} z_{t,d,n,l}\{-\frac{1}{2}(\boldsymbol{y}_{t,d,n} - \boldsymbol{\mu}_{l})^{\top} \mathbf{\Sigma}_{l}^{-1}(\boldsymbol{y}_{t,d,n}- \boldsymbol{\mu}_{l})\\ &\quad+ \frac{1}{2} \log{\mid \mathbf{\Sigma}_{l}^{-1} \mid} + \log{\pi_{t,d,l}}\} + \text{const}. \end{aligned} \end{array} $$
(7)

Therefore, zt,d,n is sampled from the following categorical distribution:

$$ \boldsymbol{\tilde{z}}_{t,d,n} \sim \text{Categorical}(\boldsymbol{\eta}_{t,d,n}) $$
(8)
$$\begin{array}{@{}rcl@{}} \begin{aligned} \eta_{t,d,n,l} &\propto \exp\left\{-\frac{1}{2}(\boldsymbol{y}_{t,d,n} - \boldsymbol{\mu}_{l})^{\top} \mathbf{\Sigma}_{l}^{-1}(\boldsymbol{y}_{t,d,n}- \boldsymbol{\mu}_{l})\right.\\ &\left.+ \frac{1}{2} \log{\mid \mathbf{\Sigma}_{l}^{-1} \mid} + \log\pi_{t,d,l}\right\}\\ &\text{s.t.} \sum_{l} \eta_{t,d,n,l} = 1, \end{aligned} \end{array} $$
(9)

where \(\boldsymbol {\eta }_{t,d,n}=\{\eta _{t,d,n,l}\}_{l=1}^{L}\). Suppose we have the mean of the previous mixture proportion \(\hat {\boldsymbol {\pi }}_{t-1,d}\). The persistency parameter αt,d is estimated by fixed point iteration.

$$ \hat{\alpha}_{t,d} \leftarrow \hat{\alpha}_{t,d} \frac{\sum_{l} \hat{\pi}_{t-1,d,l} A_{t,d,l}}{\psi(N_{t,d} + \hat{\alpha}_{t,d}) - \psi(\hat{\alpha}_{t,d})}, $$
(10)

where \(A_{t,d,l} = \psi (N_{t,d,l} + \hat {\alpha }_{t,d} \hat {\pi }_{t-1,d,l}) - \psi (\hat {\alpha }_{t,d} \hat {\pi }_{t-1,d,l})\), ψ(·) is the digamma function \(\psi (x) = \frac {\partial \log {\Gamma (x)}}{\partial x}\), where Γ(·) is the gamma function, and Nt,d,l is the number of cells assigned to cluster l at time t for case d. The mean of πt,d,l is then calculated as follows:

$$ \hat{\pi}_{t,d,l} = \frac{N_{t,d,l} + \hat{\alpha}_{t,d}\hat{\pi}_{t-1,d,l}}{N_{t,d} + \hat{\alpha}_{t,d}}. $$
(11)

We substitute the E-step of the EM algorithm by Gibbs sampling, then μ and Σ are updated in the M-step as follows:

$$ \hat{\boldsymbol{\mu}}_{l} = \frac{\sum_{n} \boldsymbol{y}_{n,l}}{N_{l} + \tau} $$
(12)
$$ \begin{aligned} &\hat{\mathbf{\Sigma}}_{l} = \frac{\mathbf{\Lambda} + \sum_{n} (\boldsymbol{y}_{n,l}-\hat{\boldsymbol{\mu}}_{l})^{\top}(\boldsymbol{y}_{n,l}-\hat{\boldsymbol{\mu}}_{l}) + \tau\hat{\boldsymbol{\mu}}_{l}^{\top}\hat{\boldsymbol{\mu}}_{l}}{N_{l} + \nu - K - 1}, \end{aligned} $$
(13)

where yn,l is the n-th cell assigned to cluster l, and Nl is the number of cells assigned to cluster l.

Result

Simulation study

We conducted a simulation experiment to examine the performance of CYBERTRACK. We set K=10, T=5, and D=2. The μ and Σ was randomly generated. π were set manually so that change-points for mixture proportions exist at t=3,5 for case 1 and t=2,4 for case 2. For the hyperparameters, we set τ=10−5 and ν=K+2, and Λ was set to an identity matrix, which is equivalent to giving weakly informative priors. With this parameter setting, we randomly generated 1000 synthetic data for each timepoint in each case (10,000 samples in total). The simulation was repeated 200 times, and different data was synthesized each time. Synthetic data was analyzed by CYBERTRACK along with ordinal Gaussian mixture model (GMM) using R package “mclust” as a baseline method.

Likelihood based criterion such as AIC or BIC tend to favor the number of clusters much larger than known distinct cell populations, making it difficult to interpret the result for flow cytometry users [10, 11]. Also, since cell populations are known to form near-continuous phenotypes, fully automated determination of the number of cluster is often not very meaningful because users’ cluster of interest may not be captured at that number. Instead, it is more important for users to explore some candidate cluster numbers that suits their purpose of the analysis. Therefore, we used the elbow method as partially automated, user-interactive determination method of the number of clusters. This method involves plotting the sum of squared error (SSE) within each cluster against the number of clusters. At the number of clusters with good partitioning, drop in SSE saturates, showing an elbow-like look (Additional file 1: Figure S1). It is usual to find several ”elbows” in real world data, and we recommend users to try some candidate number of clusters and see whether cell populations of interest are detected.

The mean and standard deviation (sd) for the estimated \(\hat {\boldsymbol {\mu }}\) and \(\hat {\mathbf {\Sigma }}\) are shown in Fig. 3. \(\hat {\boldsymbol {\pi }}\) estimated by CYBERTRACK and GMM are shown in Fig. 4. These results show that the parameters for time-series multivariate Gaussian mixture distribution were reasonably estimated by the stochastic EM algorithm, and that CYBERTRACK successfully tracked the time-dependent transition of the mixture proportion in multiple cases with better precision compared to ordinal GMM (Fig. 4b). An important difference from the existing method is that CYBERTRACK can detect change-point by estimating \(\hat {\alpha }\). As shown in Fig. 5, \(\hat {\alpha }\) holds small values at t=3,5 for case 1 and t=2,4 for case 2, correctly indicating the dramatic transition of the mixture proportion at that timepoint.

Fig. 3
figure3

Simulation result for \(\hat {\boldsymbol {\mu }}\) and \(\hat {\boldsymbol {\Sigma }}\) estimation. a, Estimated \(\hat {\boldsymbol {\mu }}\) were plotted against true μ. Each dot represents elements of \(\hat {\boldsymbol {\mu }}\) and was color-coded by cluster. b, Estimated \(\hat {\boldsymbol {\Sigma }}\) values were plotted against true Σ. Each dot represents a element of \(\hat {\boldsymbol {\Sigma }}\) and was color-coded by cluster; green, orange, and blue represents cluster 1, 2, and 3, respectively. Error bars represent standard deviation

Fig. 4
figure4

Simulation result for \(\hat {\boldsymbol {\pi }}\) estimation. a, Estimated \(\hat {\boldsymbol {\pi }}\) for simulation data. Black and red lines represent the proportion mixture for case 1 and 2, respectively. Solid lines indicate the true mixture proportion and dashed lines indicate estimated proportion by CYBERTRACK and GMM. Error bars with flat bars and arrows represents CYBERTRACK and GMM, respectively. Error bars represent standard deviation. b, SSE was calculated for each repeat of simulation. SSE is smaller in CYBERTRACK, which indicates that it gave more precise estimation result on time-series mixture proportions compared to GMM. p = 0.0387, two-sided Student’s t-test

Fig. 5
figure5

Simulation result for \(\hat {\boldsymbol {\alpha }}\) estimation. Black and red lines represent proportion mixtures for cases 1 and 2, respectively. Timepoints where the alpha values decrease substantially indicate change-points in the overall mixture proportion. Error bars represent standard deviation

Results for real data

To validate the CYBERTRACK performance for cell clustering and tracking mixture proportions of cell populations, we applied CYBERTRACK to real world flow cytometry data uploaded to Cytobank (https://www.cytobank.org/). In Landrigan’s study (https://community.cytobank.org/cytobank/experiments/35226), naive CD4+ T cells were purified and stimulated using anti-CD3 and anti-CD28 antibodies. Five cases were tested: unstimulated, stimulated by only anti-CD3 antibody, and stimulated by both anti-CD3 and anti-CD28 antibodies, with two dosages tested for the anti-CD3 antibody (0.3 μg/mL and 0.8 μg/mL). It is known that the stimulation of CD3 triggers the activation of naive CD4+ T cells, which accompanies the phosphorylation of SLP76/S6 and CD247 (pSLP76/pS6, pCD246) [12]. CD28 is the co-stimulatory factor that enhances and prolongs T cell activation [13]. Soon after activation, the levels of pSLP/pS6 and pCD247 decrease owing to negative feedback. Consequently, the cells become CD45RO+ memory T cells.

For Landrigan’s study, the number of clusters was determined as 16 by using the elbow method (Additional file 1: Figure S2a). Histgrams of marker intensities for clusters identified by CYBERTRACK show that multivariate Gaussian distribution well approximates flowcytometry data (Additional file 1: Figure S3). Fig. 6a shows the heatmap generated from the estimated \(\hat {\boldsymbol {\mu }}\); clusters 1, 7, 8, 9, 12, and 16 are the pSLP76/pS6+ pCD247+ activated naive T cells, and clusters 2, 4, 6, and 13 are pSLP76/pS6- pCD247- CD45RO+ memory T cells. The time-dependent transition of the mixture proportion is shown in Fig. 6b. While the mixture proportion remains stable over time in unstimulated cases, other cases show dynamic fluctuation, as expected. In stimulated cases, a high proportion of activated naive T cells (specifically, cluster 16 and 7 for dosages 0.3 and 0.8 μg/mL, respectively) was observed at t=3 min. Their proportions decreased through t=6,10 min by the T cell’s negative feedback mechanism. Figure 6b shows that as the number of activated T cells decreases, the memory T cell populations increase, indicating the transformation of naive T cells into memory T cells. This behavior was well represented by \(\hat {\alpha }\) estimates, shown in Fig. 6c; the \(\hat {\alpha }\) for stimulated cases shows small values at t=6 and t=10 min compared with that of unstimulated cases, indicating dynamic changes in cell population constitution at those timepoints. Interestingly, in cases stimulated with both anti-CD3 and anti-CD28 antibodies, a prominent increase of clusters with moderate levels of pSLP76/pS6 and pCD247 (cluster 14 and 1 for dosages 0.3 μg/mL and 0.8 μg/mL, respectively) was observed at t=6. These clusters can be interpreted as cell populations that are transitioning from a highly activated state to an inactivated memory state. This is consistent with the well-known prolonged T cell activation by stimulation of CD28, thus further indicating that CYBERTRACK is capable of illustrating dynamic biological processes from time-series flow cytometry data [13].

Fig. 6
figure6

Result for CYBERTRACK analysis on Landrigan’s study. a, Heatmap for Landrigan’s study. Cluster number is shown on the x axis and markers are shown on the y axis. b, Mixture proportions for five cases were drawn in different graphs. Each colored line represents the mixture proportion for a different cluster. t is on the x axis and the mixture proportion is on the y axis. c, Estimated \(\hat {\alpha }\) for Landrigan’s study

We also applied CYBERTRACK to the data in Huang’s study (https://community.cytobank.org/cytobank/experiments/5002), where cells were collected from mice whose lymph nodes were stimulated with either interleukin 7 (IL7) or interferon alpha (IFN α). It is known that IL7 and IFN α interact with their receptors on the lyphocytes’ surface and activate lymphocytes through the phosphorylation of STAT family proteins (e.g., pSTAT1 and pSTAT5), which promotes the transcription of immune-related genes [14, 15].

The number of clusters was determined as 26 by using the elbow method (Additional file 1: Figure S2b), and the heatmap is shown in Fig. 7a. In Huang’s study, T cells were identified by CD4 and/or TCR β, and B cells were identified by B220. As shown in the heatmap, CYBERTRACK clustered cells into canonical cell types, which include CD4+ TCR β+ T cells (clusters 4, 6, 9, 10, 17, and 26), CD4- TCR β+ T cells (clusters 2, 3, 15, 24), and B220+ B cells (clusters 1, 3, 5, and 21). Clusters with extremely high levels of both B220 and TCR β are thought to be debris; therefore, they were excluded from further interpretation. Figure 7b shows the time-dependent transition of the mixture proportion for each cell population. CYBERTRACK detected cell populations that increased over time in both cases. These cell populations include pSTAT1+ pSTAT5+ T cell (cluster 3) and pSTAT1+ B cell (cluster 5), which are typical cell populations that are known to emerge upon IL7 and IFN α stimulation. Furthermore, CYBERTRACK also illustrated cell population dynamics that differed in two cases; pSTAT5+ T cells (clusters 9 and 24) increased only when stimulated by IL7, whereas pSTAT1+ T cells (clusters 6 and 15) increased only in the IFN α-stimulated case. Although IL7 and IFN α are known to induce the phosphorylation of a variety of STAT family proteins, the result shown here may reflect the preferential upregulation of STAT5 and STAT1 by IL7 and IFN α, respectively [16, 17]. The estimated \(\hat {\alpha }\) shows that the change-points are located at t=2,4 min for IL7 stimulation and t=2 min for IFN α stimulation (Fig. 7c). Furthermore, analysis by CYBERTRACK revealed that stimulation by IL7 induces more dramatic changes in cell population constitution at an early stage (until t=4 min), as indicated by the small \(\hat {\alpha }\) values.

Fig. 7
figure7

Result for CYBERTRACK analysis on Huang’s study. a, Heatmap for Landrigan’s study. Cluster number is shown on the x axis and markers are shown on the y axis. b, Mixture proportions for 26 clusters were drawn in different graphs. Black and red lines represent IL7 stimulation and IFN α stimulation, respectively. t is on the x axis and the mixture proportion is on the y axis. c, Estimated \(\hat {\alpha }\) for Huang’s study. Black and red lines represent IL7 stimulation and IFN α stimulation, respectively

Discussion

The application of CYBERTRACK to simulation and real flow cytometry data has validated its performance for cell clustering and tracking mixture proportions in multiple cases. The results of CYBERTRACK analysis using two immunological experiments were consistent with our prior knowledge, which validates CYBERTRACK’s ability to analyze time-series flow cytometry data. We believe that CYBERTRACK will be a powerful tool in various fields involving the investigation of cell population dynamics. For instance, in the field of cancer immunotherapy, the longitudinal immune monitoring of patients has become increasingly important as it provides information on the impact of therapeutic treatment on certain cell populations, or in finding cell populations that can be used as prognostic markers. Furthermore, CYBERTRACK will also be useful in basic research as it can give insights into flow cytometry time-series data in an unbiased manner. Because CYBERTRACK is capable of clustering cells from different cases, it is easy for researchers to compare population dynamics in experiments with a control and several cases.

Recently, a tool called mass cytometry was introduced to the field of biomedical research. Mass spectrometry-based detection of marker genes by mass cytometry has enabled the investigation of more than 40 markers simultaneously, providing much more informative data with higher-dimensions compared with fluorescence-based conventional flow cytometry. Recent research trends in single-cell biology highly depend on mass cytometry, and it has contributed to many important discoveries [18]. One limitation of CYBERTRACK is that it is inapplicable to mass cytometry data, because the data generated by mass cytometry do not follow a multivariate Gaussian distribution. Our future aim is to extend CYBERTRACK for application to time-series mass cytometry data.

Conclusion

Here, we propose a model-based cell clustering and population-tracking algorithm called CYBERTRACK. The aim of CYBERTRACK is to discover the underlying dynamics of cell populations in time-series flow cytometry data. Our model is inspired by the Topic Tracking Model [9], and we modified it for the parameter estimation of a multivariate Gaussian mixture distribution. CYBERTRACK is capable of (i) cell clustering, (ii) tracking the mixture proportion of each cell population, and (iii) detecting the change-point in the overall mixture proportion.

Availability of data and materials

Our software is available from GitHub (https://github.com/kodaim1115/CYBERTRACK). Data for the immunological experiments used in this paper are available from Cytobank (https://cytobank.org/).

Abbreviations

AIC:

Akaike’s information criterion

BIC:

Bayesian information criterion

EM:

Expectation-maximization

GMM:

Gaussian mixture model

MAP:

Maximum a posteriori

SSE:

Sum of squared error

sd:

Standard deviation

References

  1. 1

    Maeda Y, Nishikawa H, Sugiyama D, Ha D, Hamaguchi M, Saito T, Sakaguchi S, et al. Detection of self-reactive CD8+ T cells with an anergic phenotype in healthy individuals. Science. 2014; 346(6216):1536–40.

    CAS  Article  Google Scholar 

  2. 2

    Saito T, Nishikawa H, Wada H, Nagano Y, Sugiyama D, Atarashi K, Nagase H, et al. Two FOXP3+ CD4+ T cell subpopulations distinctly control the prognosis of colorectal cancers. Nat Med. 2016; 22(6):679.

    CAS  Article  Google Scholar 

  3. 3

    Tober J, Maijenburg MM, Li Y, Gao L, Hadland BK, Gao P, Speck NA, et al. Maturation of hematopoietic stem cells from prehematopoietic stem cells is accompanied by up-regulation of PD-L1. J Exp Med. 2018; 215(2):645–59.

    CAS  Article  Google Scholar 

  4. 4

    Finak G, Frelinger J, Jiang W, Newell EW, Ramey J, Davis MM, Gottardo R, et al. OpenCyto: an open source infrastructure for scalable, robust, reproducible, and automated, end-to-end flow cytometry data analysis. PLoS Comput Biol. 2014; 10(8):e1003806.

    Article  Google Scholar 

  5. 5

    Finak G, Langweiler M, Jaimes M, Malek M, Taghiyar J, Korin Y, Pontikos N, et al. Standardizing flow cytometry immunophenotyping analysis from the Human ImmunoPhenotyping Consortium. Sci Rep. 2016; 6:20686.

    CAS  Article  Google Scholar 

  6. 6

    Saeys Y, Van Gassen S, Lambrecht BN. Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nat Rev Immunol. 2016; 16(7):449.

    CAS  Article  Google Scholar 

  7. 7

    Weber LM, Robinson MD, et al. Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytom Part A. 2016; 89(12):1084–96.

    CAS  Article  Google Scholar 

  8. 8

    Kurose K, Ohue Y, Wada H, Iida S, Ishida T, Kojima T, Kakimi K, et al. Phase Ia study of FoxP3+ CD4 Treg depletion by infusion of a humanized anti-CCR4 antibody, KW-0761, in cancer patients. Clin Cancer Res. 2015; 21(19):4327–36.

    CAS  Article  Google Scholar 

  9. 9

    Iwata T, Watanabe S, Yamada T, Ueda N. Topic tracking model for analyzing consumer purchase behavior. In: Proceedings of International Joint Conference on Artificial Intelligence, IJCAI: 2009. p. 1427–32.

  10. 10

    Finak G, Bashashati A, Brinkman R, Gottardo R. Merging mixture components for cell population identification in flow cytometry. Adv Bioinform. 2009. https://doi.org/10.1155/2009/247646.

    Article  Google Scholar 

  11. 11

    Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell. 2000; 22(7):719–25.

    Article  Google Scholar 

  12. 12

    Gaud G, Lesourne R, Love PE. Regulatory mechanisms in T cell receptor signalling. Nat Rev Immunol. 2018; 18(8):485–97.

    CAS  Article  Google Scholar 

  13. 13

    Alegre ML, Frauwirth KA, Thompson CB. T-cell regulation by CD28 and CTLA-4. Nat Rev Immunol. 2001; 1(3):220.

    CAS  Article  Google Scholar 

  14. 14

    Mackall CL, Fry TJ, Gress RE. Harnessing the biology of IL-7 for therapeutic application. Nat Rev Immunol. 2011; 11(5):330.

    CAS  Article  Google Scholar 

  15. 15

    O’Brien TR. Interferon-alfa, interferon-lambda and hepatitis C. Nat Genet. 1048; 41(10).

  16. 16

    Khodarev NN, Roizman B, Weichselbaum RR. Molecular pathways: interferon/stat1 pathway: role in the tumor resistance to genotoxic stress and aggressive growth. Clin Cancer Res. 2012; 18(11):3015–21.

    CAS  Article  Google Scholar 

  17. 17

    Foxwell BM, Beadling C, Guschin D, Kerr I, Cantrell D. Interleukin-7 can induce the activation of Jak 1, Jak 3 and STAT 5 proteins in murine T cells. Eur J Immunol. 1995; 25(11):3041–6.

    CAS  Article  Google Scholar 

  18. 18

    Yao Y, Liu R, Shin MS, Trentalange M, Allore H, Nassar A, Montgomery RR, et al. CyTOF supports efficient detection of immune cell subsets from small samples. J Immunol Methods. 2014; 415:1–5.

    CAS  Article  Google Scholar 

Download references

Acknowledgements

Not applicable.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 20 Supplement 23, 2019: Proceedings of the Joint International GIW & ABACBS-2019 Conference: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-23.

Funding

This research was supported by JSPS Grant-in-Aid for Scientific Research on Innovative Areas Grant Numbers 15H05912, 18H04798, and 19H05210. It was also supported by the Japan Agency for Medical Research and Development (AMED) under the Strategic Research Program for Brain Sciences [No. JP18dm0107087], and the Japan Agency for Medical Research and Development (AMED) under Practical Research Project for Rare / Intractable Diseases [No. JP17ek0109281]. The super-computing resources were provided by Human Genome Center, University of Tokyo. Publication costs are funded by all of the funding.

Author information

Affiliations

Authors

Contributions

KM, KA, TS designed the proposed algorithm. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Teppei Shimamura.

Ethics declarations

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Minoura, K., Abe, K., Maeda, Y. et al. Model-based cell clustering and population tracking for time-series flow cytometry data. BMC Bioinformatics 20, 633 (2019). https://doi.org/10.1186/s12859-019-3294-3

Download citation

Keywords

  • Flow cytometry
  • Time-series
  • Topic model
  • Baysian inference