 Research
 Open Access
 Published:
Parametric modeling of cellular state transitions as measured with flow cytometry
BMC Bioinformatics volume 13, Article number: S5 (2012)
Abstract
Background
Gradual or sudden transitions among different states as exhibited by cell populations in a biological sample under particular conditions or stimuli can be detected and profiled by flow cytometric time course data. Often such temporal profiles contain features due to transient states that present unique modeling challenges. These could range from asymmetric nonGaussian distributions to outliers and tail subpopulations, which need to be modeled with precision and rigor.
Results
To ensure precision and rigor, we propose a parametric modeling framework StateProfiler based on finite mixtures of skew tNormal distributions that are robust against nonGaussian features caused by asymmetry and outliers in data. Further, we present in StateProfiler a new greedy EM algorithm for fast and optimal model selection. The parsimonious approach of our greedy algorithm allows us to detect the genuine dynamic variation in the key features as and when they appear in time course data. We also present a procedure to construct a wellfitted profile by merging any redundant model components in a way that minimizes change in entropy of the resulting model. This allows precise profiling of unusually shaped distributions and less wellseparated features that may appear due to cellular heterogeneity even within clonal populations.
Conclusions
By modeling flow cytometric data measured over time course and marker space with StateProfiler, specific parametric characteristics of cellular states can be identified. The parameters are then tested statistically for learning global and local patterns of spatiotemporal change. We applied StateProfiler to identify the temporal features of yeast cell cycle progression based on knockout of Sphase triggering cyclins Clb5 and Clb6, and then compared the Sphase delay phenotypes due to differential regulation of the two cyclins. We also used StateProfiler to construct the temporal profile of clonal divergence underlying lineage selection in mammalian hematopoietic progenitor cells.
Background
Flow Cytometry is among the most widely used platforms in biomedical research and clinical labs. It is used for investigation of a wide variety of biological problems at single cell level. Classical applications of flow cytometry include quantitative measurements of DNA content and cell cycle progression [1]. It is also one of the key platforms for studying dynamic cellular properties such as differentiation, proliferation and apoptosis, especially in the contexts of stem cells and cancer [2]. Such applications make flow cytometry the ideal platform for the purpose of identifying and monitoring the myriad states and functions in different specimens that vary over time under particular conditions and stimuli.
Typically, a flow sample is stained with fluorescent dyes, possibly attached to antibodies, and per cell events such as the expression of a cellsurface marker or the DNA content are measured in terms of fluorescence intensity. The distribution of these events are then plotted or modeled statistically for identification of important features in the sample. While developments in computational cytomics have produced many useful analytical methods (e.g. [3]), several important problems have not yet been addressed adequately. One such issue involves precise parametric modeling of dynamic features in temporal profiles such that the model parameters can characterize the transition of the populations in a sample through different cellular states. Often simple statistics such as population mean or size can be imprecise in the presence of unusually shaped distributions and outliers in temporal profiles. The modeling scenario could be complicated further by the adoption of different trajectories by different subpopulations. Indeed a rigorous algorithm for modeling cellular state transitions can not only automate the traditionally manual approach, which is subjective and laborintensive, but also extend it to increasingly complex and highthroughput experiments.
Many major cytometric studies have highlighted the importance of characterizing temporal profiles at single cell resolution for a variety of purposes such as cell cycle expression kinetics (e.g. [4, 5]), pharmacodynamics (e.g. [6, 7]), signaling alterations in specific subpopulations (e.g. [8, 9]), dynamics of differentiation into distinct lineages (e.g. [10, 11]), and so on. Clearly, mathematical formulation of a cellular statespace, and the transitions therein, can help us model a given collection of temporal flow cytometric profiles with the required rigor. Thereupon we can study the changes in features (say, in comparison with those in control profiles) and monitor trends in parametric detail. Precise probabilistic modeling of sample distributions at each stage can automatically reveal such dynamic features as emergence of a tail subpopulation or change in the skewness of a cluster that are statistically welldefined as well as biologically insightful [3].
Temporal profiling of cellular state transitions in flow data can, however, present unique modeling challenges. Often the transient states produce nonGaussian features such as asymmetric or trailing subpopulations owing to rush or delay in progression from one state to another [5]. Intermediate states might also produce outliers that cannot be clearly distinguished from the more distinctive states. Moreover certain metastable states may appear only inconsistently in a given time course [11]. Often the transient features appear and disappear at the tails of the more prominent distributions, and may be hard to model via automation. Thus a framework that uses robust probabilistic density functions to model time course data may be the best way to represent the underlying statespace, and reveal any sudden or gradual transition therein. In terms of the distribution of events in a flow sample, characteristics of different states may be determined by variation in size (say, percentage of cells in a peak or cluster), location (such as mean or mode) or significance (peak density) of the model components. While traditionally such changes were detected with manual or nonparametric techniques, several modelbased frameworks have recently been applied with success, e.g. [3, 12–15].
Here we present StateProfiler, a new framework based on finite mixture models of skew tNormal distributions (STNMIX) for statistical characterization of flow cytometric time course data. In particular, we present in StateProfiler a new greedy ExpectationMaximization (EM) algorithm for fitting our STNMIX model. The greedy EM algorithm starts with a minimum number of distributions (or components) and sequentially inserts a new component to the mixture until model convergence is achieved. This parsimonious approach allows us to detect the dynamic appearance (and disappearance) of transient features that are characteristic of many state transitions. In addition, intermediate states are known to produce spatial features in the form of distributions with unusual shapes or low separation, which can lead to overlapping components, and hence to an overestimated number of model components. For optimal model selection, we therefore also provide in StateProfiler a new procedure for merging skew tNormal components that are significantly overlapping in the mixture such that the change in entropy of the resulting model is minimal. Besides profiling of unusually shaped distributions and less wellseparated features, this allows StateProfiler to tackle cellular heterogeneity that exists even within clonal populations.
We applied StateProfiler to learn the temporal features of cell cycle progression in two mutant strains of budding yeast Saccharomyces cerevisiae. Based on knockout of Sphase triggering cyclins Clb5 and Clb6, we compared the Sphase delay phenotypes resulting from the differential regulation of the two cyclins. Also we used StateProfiler to construct the overall temporal profile of clonal divergence underlying lineage selection in mammalian hematopoietic progenitor EML cells. By comparing the fitted models at each time point, we observed a slow and nonmontonic convergence of clonal outlier subpopulations to a final median state.
Results and discussion
Temporal profiling with StateProfiler has several distinct advantages. First, the skew tNormal mixture fitted to the data is defined by a probability density function (pdf). This function is welldefined at any resolution and can be visualized as a smooth profile, which is, unlike kernelbased nonparametric representations, not dependent on bandwidth specification. Importantly, the pdf rigorously specifies the significance of every feature, which allows us to detect the significant ones in the profile, while ignoring the ones which are not. StateProfiler bases its optimal modeling on 3 strategies: (1) to begin with, asymmetric and heavytailed STNMIX components model the data precisely even in the presence of outliers or skewed populations, further (2) the parsimonious fitting of the model with greedy EM yields accurately estimated components, and finally, (3) any redundant components are merged into a wellfitted output profile. By design, our STNMIX model is computationally faster to fit than the skew t mixture (STMIX) model [3, 12, 16, 17] without sacrificing precision or rigor. Ho et al. [13] summarized the differences between the STMIX and STNMIX models and showed the implementation of the STNMIX model is generally much simpler and faster than that of STMIX model.
For temporal profiling, certain parameters of STNMIX model such as shape are uniquely suited to detect lagging or hastening trends in subpopulations (such as delay phenotypes in gene knockout experiments) that directly correspond to interesting cellular states and functions. Clearly this is neither possible with nonparametric representations nor using traditional parametric models based on Gaussian, t or other symmetric components [5]. Moreover, such shape or size parameters could be used to test for separability among components  i.e. to identify tendencies of subpopulations to move towards or away from each other without actually changing their mean locations. Parametric "snapshots" of such backandforth trends can shed light on the the discrete (switchlike) or continuous (spectrumlike) nature of the state transitions, leading to statistical observation of systems exhibiting multistable dynamics [10].
To illustrate some applications of StateProfiler, we analyzed two previously generated datasets for studying (a) cell division cycle and (b) cell differentiation in different species.
Cell cycle profiling
We applied StateProfiler to identify the temporal features of budding yeast cell cycle progression based on knockout of Sphase triggering cyclins Clb5 and Clb6. In late G1phase, while both Clb5 and Clb6 activate Cdc28p to promote initiation of DNA synthesis, the exact mechanisms and extents of regulating this transition from G1 to S phase are distinct for the two cyclins [4]. In particular, Clb5 knockout causes a more prominent S phase defect during cell cycle progression in yeast cells than Clb6 knockout. Since DNA replication happens in S phase, we studied the dynamics of transition from the start and end states corresponding of one and two copies of the chromosomes (respectively, G1 and G2M phases) while passing through intermediate states corresponding to S phase delay in the mutants. Interestingly, while genetic mutations are long known to produce delay phenotypes in cell cycle progression, few algorithms prior to StateProfiler could model the lag in the DNA distributions with precision.
We fitted STNMIX models to flow samples from two cell cycle time courses with 10 timepoints each in yeast cells with knockout of Clb5 (Clb5Δ) and Clb6 (Clb6Δ3P). The time courses spanned more than one cell cycle period with respect to wildtype yeast cells dividing under the same protocol. The fitted mixture models identified two or more components in every sample, which typically corresponded to the 1C and 2C peaks before and after DNA synthesis, along with subpopulations in the intermediate Sphase, thus characterizing an overall spectrum of profiles of different state transitions (Figure 1).
The smooth profiles of the noisy DNA histograms at every timepoint are constructed with StateProfiler according to optimal change in the entropy values of the fitted model (Figure 2). For example, the entropy plot (Figure 2a, b) suggests a jump in entropy (or elbow) beyond g = 2 components for Clb5Δ data at t = 25 min (blue histogram in Figure 2c). The resulting 2component profile is depicted by the orange curve in Figure 2c. The individual components involved in the model are identified and shown as black dotted curves. Their parameters could be used to detect features for purposes like sorting cells (FACS) or monitoring trends in specific subpopulations (e.g. note the lag in the left component in Figure 2).
To determine the precision of STNMIX, we computed loglikelihood maxima ${\widehat{\ell}}_{\text{max}}$, BIC values, and distances D_{ n } based on KolmogorovSmirnov (KS) test statistic, and compared in Table 1 with the same for four competing 2component mixture models (of normal, t, skew normal, and skew t) known from the literature [3, 18]. According to BIC, the optimal selection of the STNMIX model with equal dfs is evident (e.g. the 2component model at t = 25). As seen from Dn, we also conclude that STNMIX achieves the most precise modeling in terms of both the count and asymmetry of components in the given data. Further, we used the models for objective comparison of profiles both within and across timecourses. We computed the Gap statistic [19] as a measure of dispersion of cellular events between the two extreme states corresponding to the 1C and 2C peaks or clusters. Tested against a reference distribution of data with no clustering, the Gap statistics support the biological observation of Jackson et al. [4] that the Clb5 mutant shows more pronounced Sphase delay phenotypes than the Clb6 mutant and hence has less wellseparated components in midcell cycle (e.g. t = 25). The contrast between the samples in terms of cells showing a slower state transition from 1C to 2C may be observed in Table 2 for different timepoints. Finally, we observe the gradual variation in the key features at each successive timepoint to gain insights into the differential regulation of the Sphase by the cyclins Clb5 and Clb6 (Figure 3).
Cell differentiation profiling
Another key area in which flow data are extremely insightful about different state transitions is cell differentiation. In recent years, many important advances in biology have been made by studying the modes and mechanisms of differentiation especially in the context of stem cells and cancer. Stem cell differentiation has also been studied for their clinical applications such as in the field of regenerative medicine. An excellent review of the field is given in a recent text edited by Krishan et al. [2]. Over the course of differentiation, the profiles of expression of various markers  including those indicating stemness or commitment to a lineage  vary according to transitions of populations through unstable, metastable and eventually stable states. Often measurable phenotypic diversity appears due to celltocell variability even within clonal populations, which are manifest and can be studied as outlier events or asymmetric or tail subpopulations. Sometimes these features are transient and peripheral, and could be hard to distinguish via automation. Accurate modeling of dynamic flow profiles is thus essential to identify or monitor transitional features as and when they appear (or disappear) for objective temporal characterization of the statespace components involved in differentiation.
In the present study, we analyzed clonal populations of EML cells, a multipotent mouse haematopoietic cell line that can differentiate into myeloid, erythroid, and other lineages. In a recent study, Chang et al. [11] measured the expression levels of the stem cell marker Sca1 in different subpopulations of EML cells as time course data. They observed that celltocell heterogeneity in this clonal progenitor population gave rise to Sca1 outlier cells  cells that exhibit very high or low Sca1 expression  and possessed distinct gene expression patterns. The heterogeneity could not be attributed to measurement noise or cellcycledependent cell size variation. Eventually, however, each of these distinct Sca1 subpopulations' profiles became similar to that of the median cells, thus revealing an attractor state. Yet it was noted [11] that the divergence lasted long enough to allow different propensities for either subpopulation, i.e. low and high Sca1, to enter into a transient state that primes them for either the erythroid or the myeloid lineage, as captured by their differential expression of lineagespecific transcription factors.
For precise characterization of the dynamics by which population heterogeneity arose in this clonal population via outliers and subsided ultimately, cells with the lowest, middle and highest levels of Sca1 expression were isolated by [11] using fluorescenceactivated cell sorting (FACS). We call these subsets Sca1^{low}, Sca1^{mid}, and Sca1^{high}. Following FCAS, the sorted cells were immediately stripped of the staining antibody and cultured in standard growth medium. Subsequently, Sca1 fluorescence intensity were measured individually for each of the 3 subpopulations as time course data. Similar measurements were made for an original clonal population of EML cells for comparison (we call it Sca1^{all}).
We applied the StateProfiler framework to model the flow profiles for 14point time course data for each of the 4 populations. Often finite mixtures of Gaussians are used for modeling the theoretical subpopulation structure in such profiles [11, 20]. However, using Gaussian components, precise modeling in the presence of outliers due to celltocell heterogeneity is particularly difficult for clonal populations. This is because an optimal model must be able to accommodate such heterogeneity without requiring extra components, but Gaussian components with sharp tails are hardly robust against outliers. It leads to suboptimal models with spurious subpopulations, which makes their biological interpretation difficult.
StateProfiler addressed the modeling problem in two ways. First, its skew tNormal components are robust to outliers and asymmetry in the distributions. This helps in modeling transitional features even if they lead to unusually shaped or heavy tailed distributions. Second, even if redundant subpopulations were identified, the new merging procedure in StateProfiler can reconstruct any significantly overlapping components in a statistically optimal fashion, i.e. to produce a combined profile by causing minimal change in entropy of the model pre and postreconstruction.
The dual advantages of the StateProfiler modeling algorithm allowed us to compute highly accurate profiles of Sca1 expression in the time course datasets for the three sorted and the unsorted EML cells. The steps of the merging procedure through which an optimal structure for the model is "stitched" together are illustrated with an example in Figure 4. Finally, we compared the divergence of the 3 sorted subsets from the corresponding unsorted population using KullbackLeibler distances between the probability density functions specifying their profiles. A visual comparison of the profiles is shown in Figure 5. The trend of decreasing divergence, as the 3 sorted profiles become similar to the unsorted profile with progression of time, is shown in Figure 6.
StateProfiler's parametric characterization can reveal various features and trends of interest in terms of specific parameters. For instance, we observe that by 3 days, both Sca1^{mid} and Sca1^{high} have already started to resemble the unsorted population, and by 6 days, they actually have their own low Sca1 tails. Another trend of possible interest is the slow but continuous fluctuation in the proportion of low Sca1 outliers in the unsorted population. Finally, it appears that the eventual stable state when the 3 profiles finally coincide is reached at a point of time much later than 9 days, as suggested by [11], and takes probably double that time (432 h). In the mean time, as we see in Figure 6, the states might continue to drift closer and apart as in a dynamical system exhibiting multistable behaviour. If indeed the departure from the average state has biological functionality in the priming of cell fate commitment, then a nonmonotonic, delayed restoration of the underlying molecular mechanisms may be justified by having more than a few cells with random fluctuation and call for further investigation.
Conclusions
In this study, we described StateProfiler, a framework to construct temporal profiles with flow data, which can facilitate parametric modeling of cellular state transitions Towards this, we presented 3 key features of the framework. First, we described a finite mixture of skew tNormal distributions. Second, we presented a new greedy EM algorithm for fast and optimal model selection. The parsimonious approach of our greedy algorithm allows us to detect the variation in the features as and when they appear and disappear at different points of time thereby offering a parametric characterization of the overall nature of state transition. Third, we designed a mixture merging procedure for ensuring robust estimation of the fitted profile. The code implementing the framework is available from the authors upon request. Indeed the proposed framework is effective, general and may be applied to other similar domains.
Methods and materials
Mixtures of skew Studenttnormal distributions
We describe the skew tNormal mixture model (STNMIX) of StateProfiler. To simplify notation, we let ϕ(.) and Φ(.) denote the probability density function (pdf) and the cumulative distribution function (cdf) of the standard normal distribution, respectively. Let
denote the pdf of the t distribution with location ξ, scale σ^{2} and degrees of freedom (df) v, and t(xv ) simply for the case when ξ = 0 and σ = 1; and let Γ(α,β) be the gamma distribution with density g(xα, β) ∝ x^{α1}exp{βx}. We start by defining the STN distribution and then note further properties.
As introduced by Gómez et al. [21], a random variable Y is said to follow the STN with location parameter ξ ∈ ℝ, scale parameter σ^{2} ∈ (0, ∞), skewness parameter λ ∈ ℝ and degrees of freedom v ∈ (0,∞) it is has the density
We shall write Y~ STN(ξ,σ^{2},λ,v) if Y has the density of (1).
Ho et al. [13] give following hierarchical representation of STN to establish an EMtype algorithm [22].
where T N(µ, σ^{2}; (a, b)) represents the truncated normal distribution for N(µ, σ^{2}) lying within the truncated interval (a, b).
Consider n independent random variables Y_{1},..., Y_{ n }, which are taken from a mixture of STN distributions. The pdf of a gcomponent STNMIX model is
Where w_{ i }'s are mixing proportions which are constrained to be positive and ${\sum}_{i=1}^{g}{w}_{i}=1,\psi \left({y}_{j}{\theta}_{i}\right)$is the STN density defined in (1) and Θ_{ g }= (w_{1},..., w_{g 1}, θ_{1},..., θ_{ g }) represents all unknown parameters. Note that the component vector θ_{ i }consists of $\left({\xi}_{i},{\sigma}_{i}^{2},{\lambda}_{i},{v}_{i}\right).$
Based on (2), a practical ECM/ECME algorithm [23, 24] proceeds are described by Ho et al. [13] as follows:
Estep: Given ${\mathrm{\Theta}}_{g}={\widehat{\mathrm{\Theta}}}_{g}^{\left(h\right)},$compute following $\widehat{{z}_{ij}^{\left(h\right)}}$, ${\widehat{\tau}}_{ij}^{\left(h\right)}$, $\widehat{{\kappa}_{ij}^{\left(h\right)}}$ and ${\widehat{{\gamma}_{1}}}_{\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}ij}^{\left(h\right)}$ for i = 1,..., g and j = 1,..., n.
where ${\widehat{u}}_{ij}^{\left(h\right)}=\left({y}_{j}{\widehat{\xi}}_{i}^{\left(h\right)}\right)/{\widehat{\sigma}}_{i}^{\left(h\right)}.$
CMstep: Update the estimation by
where ${\widehat{n}}_{i}^{\left(h\right)}={\sum}_{j=1}^{n}{\u1e91}_{ij}^{\left(h\right)}$, ${\widehat{b}}_{1i}^{\left(h\right)}={\sum}_{j=1}^{n}{\u1e91}_{ij}^{\left(h\right)}\phantom{\rule{2.77695pt}{0ex}}{\widehat{\tau}}_{ij}^{\left(h\right)}{y}_{j}$, ${\widehat{b}}_{2i}^{\left(h\right)}={\sum}_{j=1}^{n}{\u1e91}_{ij}^{\left(h\right)}{y}_{j}$, ${\widehat{b}}_{3i}^{\left(h\right)}={\sum}_{j=1}^{n}{\u1e91}_{ij}^{\left(h\right)}\phantom{\rule{2.77695pt}{0ex}}{\widehat{\gamma}}_{1ij}^{\left(h\right)}$, ${\widehat{b}}_{4i}^{\left(h\right)}={\sum}_{j=1}^{n}{\u1e91}_{ij}^{\left(h\right)}\left({\widehat{\kappa}}_{ij}^{\left(h\right)}{\widehat{\tau}}_{ij}^{\left(h\right)}\right)/{\widehat{n}}_{i}^{\left(h\right)}$and ${\widehat{{u}_{ij}}}^{\left(h+1\right)}=\left({y}_{j}\widehat{{\xi}_{i}^{\left(h+1\right)}}\right)/\widehat{{\sigma}_{i}^{\left(h+1\right)}}.$
If the dfs are assumed to be identical, say v_{ 1 } = ⋯ = v_{ g } = v, we could update ${\widehat{v}}^{\left(h\right)}$by
The Estep and CM/CMLsteps are alternately repeated until a suitable convergence rule is satisfied, e.g., the Aitken acceleration based stopping criterion ${\ell}^{\left(h+1\right)}{\ell}_{\infty}^{\left(h+1\right)}<\epsilon ,$where ${\ell}^{\left(h+1\right)}$is the observed loglikelihood evaluated at ${\widehat{\mathrm{\Theta}}}_{g}^{\left(h\right)},{\ell}_{\infty}^{\left(h+1\right)}$is the asymptotic estimate of the loglikelihood at iteration h + 1 (see [18]; Chap. 4.9) and ε is the desired tolerance. For numerical analyses in this paper, a default value of ε = 10^{6} was used to terminate the iterations.
Greedy learning for STN mixtures
In this section, we present a new greedy version of the EM algorithm to determine the optimum number of components in the fitting of STNMIX models. The greedy EM approach was first introduced by Vlassis and Likas [25]. The fundamental concept of the greedy EM algorithm is to start from a minimum number of components and sequentially insert a new component to the mixture until convergence is achieved. The stopping criterion can be a prespecified maximum number of components or a prespecified convergence tolerance.
Suppose a new component $\psi \left({y}_{j}{\theta}_{g+1}\right)$ is added to a gcomponent $f\left({y}_{j}{\mathrm{\Theta}}_{g}\right)$. The resulting mixture takes the form of
where 0 < a < 1 and ${\mathrm{\Theta}}_{g+1}=\left({\mathrm{\Theta}}_{g},\phantom{\rule{2.77695pt}{0ex}}a,\phantom{\rule{2.77695pt}{0ex}}{\theta}_{g+1}\right)$ with ${\theta}_{g+1}$being the added parameters $\left({\xi}_{g+1},\phantom{\rule{2.77695pt}{0ex}}{\sigma}_{g+1}^{2},\phantom{\rule{2.77695pt}{0ex}}{\lambda}_{g+1},\phantom{\rule{2.77695pt}{0ex}}{v}_{g+1}\right).$ Given an old mixture $f\left({y}_{j}{\widehat{\mathrm{\Theta}}}_{g}\right)$, the weight a and ${\theta}_{g+1}$are optimally chosen to maximize the new loglikelihood
To find the optimal solution in (4), we start by performing a local search with for the newly inserted component. This gives rise to the following partial EM steps where $\stackrel{\u0303}{\theta}$denotes and the partial ML estimates of θ. For notational simplicity, the subscript (g + 1) is suppressed below in the Partial Estep.
Partial Estep: Calculating the conditional expectation of latent variables at the k th iteration, this yields
Where ${\u0169}_{j}^{\left(k\right)}=\left({y}_{j}{\stackrel{\u0303}{\xi}}^{\left(k\right)}\right)/{\stackrel{\u0303}{\sigma}}^{\left(k\right)}.$
Partial Mstep: Updating the new parameters in (a, θ_{g+1}), we get
Where ${\u0169}_{j}^{\left(k+1\right)}=\left({y}_{j}{\stackrel{\u0303}{\xi}}^{\left(k+1\right)}\right)/{\stackrel{\u0303}{\sigma}}^{\left(k+1\right)}$, ${\stackrel{\u0303}{b}}_{1}^{\left(k\right)}={\sum}_{j=1}^{n}{\stackrel{\u0303}{z}}_{j}^{\left(k\right)}{\stackrel{\u0303}{\tau}}_{j}^{\left(k\right)}{y}_{j}$, ${\stackrel{\u0303}{b}}_{2}^{\left(k\right)}={\sum}_{j=1}^{n}{\stackrel{\u0303}{z}}_{j}^{\left(k\right)}{y}_{j}$, ${\stackrel{\u0303}{b}}_{3}^{\left(k\right)}={\sum}_{j=1}^{n}{\stackrel{\u0303}{z}}_{j}^{\left(k\right)}{\stackrel{\u0303}{\gamma}}_{1j}^{\left(k\right)}$, and ${\stackrel{\u0303}{b}}_{4}^{\left(k\right)}={\sum}_{j=1}^{n}{\stackrel{\u0303}{z}}_{j}^{\left(k\right)}\left({\stackrel{\u0303}{\kappa}}_{j}^{\left(k\right)}{\stackrel{\u0303}{\tau}}_{j}^{\left(k\right)}\right)/{\sum}_{j=1}^{n}{\stackrel{\u0303}{z}}_{j}^{\left(k\right)}$.
The above partial EM steps constitute a fast and simple procedure to locally seek for the maximum of ${\mathcal{L}}_{g+1}$. To our experience, this local search scheme is very sensitive the initialization of a and ξ_{g+1}. Similar to Vlassis and Likas [25], we provided a global search strategy for extracting proper parameter initialization for a and ${\xi}_{g+1}^{\left(0\right)}$. By a secondorder Taylor expansion for ${\mathcal{L}}_{g+1}$, we obtain the following approximation:
where ${\dot{\mathcal{L}}}_{g+1}({a}_{0})$ and ${\ddot{\mathcal{L}}}_{g+1}({a}_{0})$ are the first and second derivatives of ${\mathcal{L}}_{g+1}$ evaluated at a = a_{0}. It can be deduced from (5) that a local maximum of ${\mathcal{L}}_{g+1}$ around a_{0} = 0.5 is given by
with
So the the optimal value of a can be calculated as
Following the suggestion of Li and Barron [26], one may set $\widehat{a}=0.5$ for g = 1 and $\widehat{a}=2/(g+1)$ for g ≥ 2 as a default recommendation when the estimated value (7) fall outside the range of (0, 1).
In our global search, a convenience choice of ${\stackrel{\u0303}{\sigma}}_{g+1}^{2\left(0\right)}$ is n^{1/5} times half of the sample variance ${s}_{y}^{2}$ whereas ${\stackrel{\u0303}{\lambda}}_{g+1}^{2\left(0\right)}$ and ${\stackrel{\u0303}{{\nu}_{g+1}}}^{\text{2}\left(\text{0}\right)}$ are always fixed at 0 and 10, respectively. For the initial choice of ξ_{g+1}, we search over the 5th, 10th, 15th, ⋯ 95th quantiles of y and set ${\stackrel{\u0303}{\xi}}_{g+1}^{\left(0\right)}$ to the one that maximizes (6).
The implementation of the greedy EM algorithm is summarized below.

1.
Start with g = 1 and compute the ML estimates of the singlecomponent STNMIX model via the ECME algorithm.

2.
If g > 1, estimate Θ_{ g } via the EMtype algorithms.

3.
Perform a global search to find a proper initialization of a and ξ_{g+1}.

4.
Apply the partial EMsteps until convergence. For instance, ${\widehat{\mathcal{L}}}_{g+1}^{\left(k\right)}/{\widehat{\mathcal{L}}}_{g+1}^{\left(k1\right)}1\phantom{\rule{2.77695pt}{0ex}}<1{0}^{6}$.

5.
If ${\widehat{\mathcal{L}}}_{g+1}\le {\widehat{\mathcal{L}}}_{g}+m$ then terminate, where m > 0 is a penalty term. Otherwise allocate the new component to the model and go to 2. Set g = g + 1.
Given r candidates (we have 19 quantiles of sample), the time complexity of our greedy EM algorithm is O(ngr). If overall sample was considered as candidates in the global search, then the running time is similar to Vlassis and Likas [25].
Merging mixture algorithm
The greedy EM algorithm provides a convenient method for automatically selecting a number of components for a mixture model under reasonable assumptions (such as convexity of components). Yet if data have certain spatial features due to distributions with unusual shapes or low separation [8], it can lead to overlapping components, and hence to overestimation in the number of components in spite of the parsimonious approach. To augment our greedy algorithm for obtaining a robust estimate of the number of components, we extend the merging mixture approach of Baudry et al. [27] to skew tNormal components. While merging techniques have been applied in the past to symmetric distributions [27, 28], designing a procedure for asymmetric distributions obviates any need for spurious components that may be required for the sole purpose of modeling asymmetry, and thus avoids redundant merging.
The basic idea behind the procedure is to use the maximum merged entropy to iteratively combine two possibly overlapping clusters, until the result of combination belong a single cluster (see implementation in [28]). The steps of the merging algorithm in StateProfiler are described below.

1.
Calculate the mean entropy of maximum estimation for g components as
$$\text{Ent}\left(g\right)=\sum _{j=1}^{n}\sum _{i=1}^{g}{\u1e91}_{ij}\text{log}{\u1e91}_{ij}\ge 0,$$
where ${\u1e91}_{ij}$ denotes the posterior probability given Θ_{ g } fix at ${\widehat{\mathrm{\Theta}}}_{g}$.

2.
Two clusters l and l' to be combined are those maximizing the criterion:
$$\begin{array}{c}\left(\sum _{j=1}^{n}\left\{{\u1e91}_{il}\text{log}{\u1e91}_{il}+{\u1e91}_{i{l}^{\prime}}\text{log}{\u1e91}_{i{l}^{\prime}}\right\}\right.\\ \left(\right)close=")">+\sum _{j=1}^{n}\left({\u1e91}_{il}+{\u1e91}_{i{l}^{\prime}}\right)\text{log}\left({\u1e91}_{il}+{\u1e91}_{i{l}^{\prime}}\right)\end{array}$$
among all possible pairs of clusters (l, l').

3.
Obtain the merged entropy
$$\begin{array}{c}\text{Ent}\left(g1\right)=\sum _{j=1}^{n}\left\{\sum _{i\ne l,{l}^{\prime}}{\u1e91}_{ij}\text{log}{\u1e91}_{ij}\right.\\ \phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\left(\right)close="\}">+{\u1e91}_{i,l\cup {l}^{\prime}}\text{log}{\u1e91}_{i,l\cup {l}^{\prime}}& ,\end{array}$$
where ${\u1e91}_{i,l\cup {l}^{\prime}}={\u1e91}_{il}+{\u1e91}_{i{l}^{\prime}}$ is the posterior probability of the new cluster l ∪ l'.

4.
Update $\widehat{{z}_{j}}$ consists of the unmerged and merged posterior probabilities.

5.
Set g = g  1 and go to 2. Repeat until g = 1.

6.
A solution of number of components can be identified (i) a sudden jump or "elbow" in a plot of the entropy of clustering versus the number of clusters, or (ii) peaks in a plot of the number of clusters versus the difference in entropy.
Data and experiments
For details of the yeast cell cycle experiments and timecourse data analyzed by StateProfiler, see [4]. For details of EML cell differentiation data, see [11].
References
 1.
Darzynkiewicz Z, Crissman H, Jacobberger JW: Cytometry of the cell cycle: cycling through history. Cytometry A 2004, 58: 21–32.
 2.
Krishan A, Krishnamurthy H, Totey S: Applications of Flow Cytometry in Stem Cell Research and Tissue Regeneration. John Wiley & Sons Inc; 2010.
 3.
Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, BaecherAllan C, McLachlan GJ, Tamayo P, Hafler DA, Jager PLD, Mesirov JP: Automated highdimensional flow cytometric data analysis. Proc Natl Acad Sci USA 2009, 106: 8519–8524. 10.1073/pnas.0903028106
 4.
Jackson LP, Reed SI, Haase SB: Distinct mechanisms control the stability of the related Sphase cyclins Clb5 and Clb6. Mol Cell Biol 2006, 26: 2456–2466. 10.1128/MCB.26.6.24562466.2006
 5.
Niu W, Li Z, Zhan W, Iyer VR, Marcotte EM: Mechanisms of cell cycle control revealed by a systematic and quantitative overexpression screen in S. cerevisiae. PLoS Genet 2008, 4: e1000120. Doi:10.1371/journal.pgen.1000120 Doi:10.1371/journal.pgen.1000120 10.1371/journal.pgen.1000120
 6.
Hedley DW, Chow S, Goolsby C, Shankey TV: Pharmacodynamic monitoring of moleculartargeted agents in the peripheral blood of leukemia patients using flow cytometry. Toxicol Pathol 2008, 36: 133–139. 10.1177/0192623307310952
 7.
Krishan A, Hamelik RM: Flow cytometric monitoring of fluorescent drug retention and efflux. Methods Mol Med 2005, 111: 149–166.
 8.
Kotecha N, Flores NJ, Irish JM, Simonds EF, Sakai DS, Archambeault S, DiazFlores E, Coram M, Shannon KM, Nolan GP, Loh ML: Singlecell profiling identifies aberrant STAT5 activation in myeloid malignancies with specific clinical and biologic correlates. Cancer Cell 2008, 14(4):335–343. 10.1016/j.ccr.2008.08.014
 9.
Irish JM, Myklebust JH, Alizadeh AA, Houot R, Sharman JP, Czerwinski DK, Nolan GP, Levy R: Bcell signaling networks reveal a negative prognostic human lymphoma cell subset that emerges during tumor progression. Proc Natl Acad Sci USA 2010, 107: 12747–12754. 10.1073/pnas.1002057107
 10.
Chang HH, Oh PY, Ingber DE, Huang S: Multistable and multistep dynamics in neutrophil differentiation. BMC Cell Biol 2006, 7: 11. 10.1186/14712121711
 11.
Chang HH, Hemberg M, Barahona M, Ingber DE, Huang S: Transcriptomewide noise controls lineage choice in mammalian progenitor cells. Nature 2008, 453(7194):544–547. 10.1038/nature06965
 12.
FrühwirthSchnatter S, Pyne S: Bayesian inference for finite mixtures of univariate and multivariate skewnormal and skew t distributions. Biostatistics 2010, 11: 317–336. 10.1093/biostatistics/kxp062
 13.
Ho HJ, Pyne S, Lin TI: Maximum likelihood inference for mixtures of skew Student t normal distributions through practical EMtype algorithms. Stat Comput 2012, 22: 287–299. 10.1007/s1122201092259
 14.
Lin TI, Lee JC, Yen SY: Finite mixture modelling using the skew normal distribution. Stat Sinica 2007, 17: 909–927.
 15.
Rossin E, Lin TI, Ho HJ, Mentzer SJ, Pyne S: A framework for analytical characterization of monoclonal antibodies based on reactivity profiles in different tissues. Bioinformatics 2011, 27(19):2746–2753. 10.1093/bioinformatics/btr468
 16.
Lin TI, Lee JC, Hsieh WJ: Robust mixture modeling using the skew t distribution. Stat Comput 2007, 17: 81–92. 10.1007/s1122200690058
 17.
Lin TI: Robust mixture modeling using multivariate skew t distributions. Stat Comput 2010, 20: 343–356. 10.1007/s1122200991289
 18.
McLachlan GJ, Krishnan T: The EM algorithm and extensions. John Wiley & Sons Inc; 2008.
 19.
Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B Methodol 2001, 63: 411–423. 10.1111/14679868.00293
 20.
Song C, Phenix H, Abedi V, Scott M, Ingalls BP, Kaern M, Perkins TJ: Estimating the stochastic bifurcation structure of cellular networks. PLoS Comput Biol 2010, 6(3):e1000699. 10.1371/journal.pcbi.1000699
 21.
Gómez HW, Venegas O, Bolfarine H: Skewsymmetric distributions generated by the distribution function of the normal distribution. Environmetrics 2007, 18: 395–407. 10.1002/env.817
 22.
Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B Methodol 1977, 39: 1–38.
 23.
Meng XL, Rubin DB: Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 1993, 80: 267–278. 10.1093/biomet/80.2.267
 24.
Liu CH, Rubin DB: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika 1994, 81: 633–648. 10.1093/biomet/81.4.633
 25.
Vlassis N, Likas A: A greedy EM algorithm for Gaussian mixture learning. Neural Process Lett 2002, 15: 77–87. 10.1023/A:1013844811137
 26.
Li JQ, Barron AR: Mixture Density Estimation. In Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29  December 4, 1999. Edited by: Solla SA, Leen TK, Müller KR. The MIT Press; 1999:279–285.
 27.
Baudry JP, Raftery AE, Celeux G, Lo K, Gottardo R: Combining Mixture Components for Clustering. J Comput Graph Stat 2010, 9: 332–353.
 28.
Finak G, Bashashati A, Brinkman R, Gottardo R: Merging mixture components for cell population identification in flow cytometry. Adv Bioinformatics 2009, 2009: Article ID 247646. Doi:10.1155/2009/247646 Doi:10.1155/2009/247646
Acknowledgements
TIL was partially supported by National Science Council of Taiwan (Grant NO. NSC992118M005001MY2).
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 5, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S5.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
HJH and TIL codeveloped the statistical methods and performed data analysis. SP conceived the project, designed the approach, and analyzed the results. All authors contributed to the development of the methodology and to writing the manuscript. HJH and TIL contributed equally and are the first authors as well as listed in alphabetical order.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Ho, H.J., Lin, T.I., Chang, H.H. et al. Parametric modeling of cellular state transitions as measured with flow cytometry. BMC Bioinformatics 13, S5 (2012). https://doi.org/10.1186/1471210513S5S5
Published:
Keywords
 Probability Density Function
 Finite Mixture
 Clonal Population
 Parsimonious Approach
 Merging Procedure