Volume 13 Supplement 5

## Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics

# Parametric modeling of cellular state transitions as measured with flow cytometry

- Hsiu J Ho
^{1}, - Tsung I Lin
^{1, 2}, - Hannah H Chang
^{3, 4, 5}, - Steven B Haase
^{6}, - Sui Huang
^{7}and - Saumyadipta Pyne
^{8, 9}Email author

**13(Suppl 5)**:S5

**DOI: **10.1186/1471-2105-13-S5-S5

© Ho et al.; licensee BioMed Central Ltd. 2012

**Published: **12 April 2012

## Abstract

### Background

Gradual or sudden transitions among different states as exhibited by cell populations in a biological sample under particular conditions or stimuli can be detected and profiled by flow cytometric time course data. Often such temporal profiles contain features due to transient states that present unique modeling challenges. These could range from asymmetric non-Gaussian distributions to outliers and tail subpopulations, which need to be modeled with precision and rigor.

### Results

To ensure precision and rigor, we propose a parametric modeling framework StateProfiler based on finite mixtures of skew *t*-Normal distributions that are robust against non-Gaussian features caused by asymmetry and outliers in data. Further, we present in StateProfiler a new greedy EM algorithm for fast and optimal model selection. The parsimonious approach of our greedy algorithm allows us to detect the genuine dynamic variation in the key features as and when they appear in time course data. We also present a procedure to construct a well-fitted profile by merging any redundant model components in a way that minimizes change in entropy of the resulting model. This allows precise profiling of unusually shaped distributions and less well-separated features that may appear due to cellular heterogeneity even within clonal populations.

### Conclusions

By modeling flow cytometric data measured over time course and marker space with StateProfiler, specific parametric characteristics of cellular states can be identified. The parameters are then tested statistically for learning global and local patterns of spatio-temporal change. We applied StateProfiler to identify the temporal features of yeast cell cycle progression based on knockout of S-phase triggering cyclins Clb5 and Clb6, and then compared the S-phase delay phenotypes due to differential regulation of the two cyclins. We also used StateProfiler to construct the temporal profile of clonal divergence underlying lineage selection in mammalian hematopoietic progenitor cells.

## Background

Flow Cytometry is among the most widely used platforms in biomedical research and clinical labs. It is used for investigation of a wide variety of biological problems at single cell level. Classical applications of flow cytometry include quantitative measurements of DNA content and cell cycle progression [1]. It is also one of the key platforms for studying dynamic cellular properties such as differentiation, proliferation and apoptosis, especially in the contexts of stem cells and cancer [2]. Such applications make flow cytometry the ideal platform for the purpose of identifying and monitoring the myriad states and functions in different specimens that vary over time under particular conditions and stimuli.

Typically, a flow sample is stained with fluorescent dyes, possibly attached to antibodies, and per cell events such as the expression of a cell-surface marker or the DNA content are measured in terms of fluorescence intensity. The distribution of these events are then plotted or modeled statistically for identification of important features in the sample. While developments in computational cytomics have produced many useful analytical methods (e.g. [3]), several important problems have not yet been addressed adequately. One such issue involves precise parametric modeling of dynamic features in temporal profiles such that the model parameters can characterize the transition of the populations in a sample through different cellular states. Often simple statistics such as population mean or size can be imprecise in the presence of unusually shaped distributions and outliers in temporal profiles. The modeling scenario could be complicated further by the adoption of different trajectories by different subpopulations. Indeed a rigorous algorithm for modeling cellular state transitions can not only automate the traditionally manual approach, which is subjective and labor-intensive, but also extend it to increasingly complex and high-throughput experiments.

Many major cytometric studies have highlighted the importance of characterizing temporal profiles at single cell resolution for a variety of purposes such as cell cycle expression kinetics (e.g. [4, 5]), pharmacodynamics (e.g. [6, 7]), signaling alterations in specific subpopulations (e.g. [8, 9]), dynamics of differentiation into distinct lineages (e.g. [10, 11]), and so on. Clearly, mathematical formulation of a cellular state-space, and the transitions therein, can help us model a given collection of temporal flow cytometric profiles with the required rigor. Thereupon we can study the changes in features (say, in comparison with those in control profiles) and monitor trends in parametric detail. Precise probabilistic modeling of sample distributions at each stage can automatically reveal such dynamic features as emergence of a tail subpopulation or change in the skewness of a cluster that are statistically well-defined as well as biologically insightful [3].

Temporal profiling of cellular state transitions in flow data can, however, present unique modeling challenges. Often the transient states produce non-Gaussian features such as asymmetric or trailing subpopulations owing to rush or delay in progression from one state to another [5]. Intermediate states might also produce outliers that cannot be clearly distinguished from the more distinctive states. Moreover certain metastable states may appear only inconsistently in a given time course [11]. Often the transient features appear and disappear at the tails of the more prominent distributions, and may be hard to model via automation. Thus a framework that uses robust probabilistic density functions to model time course data may be the best way to represent the underlying state-space, and reveal any sudden or gradual transition therein. In terms of the distribution of events in a flow sample, characteristics of different states may be determined by variation in size (say, percentage of cells in a peak or cluster), location (such as mean or mode) or significance (peak density) of the model components. While traditionally such changes were detected with manual or non-parametric techniques, several model-based frameworks have recently been applied with success, e.g. [3, 12–15].

Here we present StateProfiler, a new framework based on finite mixture models of skew *t*-Normal distributions (STNMIX) for statistical characterization of flow cytometric time course data. In particular, we present in StateProfiler a new greedy Expectation-Maximization (EM) algorithm for fitting our STNMIX model. The greedy EM algorithm starts with a minimum number of distributions (or *components*) and sequentially inserts a new component to the mixture until model convergence is achieved. This parsimonious approach allows us to detect the dynamic appearance (and disappearance) of transient features that are characteristic of many state transitions. In addition, intermediate states are known to produce spatial features in the form of distributions with unusual shapes or low separation, which can lead to overlapping components, and hence to an overestimated number of model components. For optimal model selection, we therefore also provide in StateProfiler a new procedure for merging skew *t*-Normal components that are significantly overlapping in the mixture such that the change in entropy of the resulting model is minimal. Besides profiling of unusually shaped distributions and less well-separated features, this allows StateProfiler to tackle cellular heterogeneity that exists even within clonal populations.

We applied StateProfiler to learn the temporal features of cell cycle progression in two mutant strains of budding yeast *Saccharomyces cerevisiae*. Based on knockout of S-phase triggering cyclins Clb5 and Clb6, we compared the S-phase delay phenotypes resulting from the differential regulation of the two cyclins. Also we used StateProfiler to construct the overall temporal profile of clonal divergence underlying lineage selection in mammalian hematopoietic progenitor EML cells. By comparing the fitted models at each time point, we observed a slow and non-montonic convergence of clonal outlier subpopulations to a final median state.

## Results and discussion

Temporal profiling with StateProfiler has several distinct advantages. First, the skew *t*-Normal mixture fitted to the data is defined by a probability density function (pdf). This function is well-defined at any resolution and can be visualized as a smooth profile, which is, unlike kernel-based non-parametric representations, not dependent on bandwidth specification. Importantly, the pdf rigorously specifies the significance of every feature, which allows us to detect the significant ones in the profile, while ignoring the ones which are not. StateProfiler bases its optimal modeling on 3 strategies: (1) to begin with, asymmetric and heavy-tailed STNMIX components model the data precisely even in the presence of outliers or skewed populations, further (2) the parsimonious fitting of the model with greedy EM yields accurately estimated components, and finally, (3) any redundant components are merged into a well-fitted output profile. By design, our STNMIX model is computationally faster to fit than the skew *t* mixture (STMIX) model [3, 12, 16, 17] without sacrificing precision or rigor. Ho *et al*. [13] summarized the differences between the STMIX and STNMIX models and showed the implementation of the STNMIX model is generally much simpler and faster than that of STMIX model.

For temporal profiling, certain parameters of STNMIX model such as shape are uniquely suited to detect lagging or hastening trends in subpopulations (such as delay phenotypes in gene knockout experiments) that directly correspond to interesting cellular states and functions. Clearly this is neither possible with non-parametric representations nor using traditional parametric models based on Gaussian, *t* or other symmetric components [5]. Moreover, such shape or size parameters could be used to test for separability among components - i.e. to identify tendencies of subpopulations to move towards or away from each other without actually changing their mean locations. Parametric "snapshots" of such back-and-forth trends can shed light on the the discrete (switch-like) or continuous (spectrum-like) nature of the state transitions, leading to statistical observation of systems exhibiting multistable dynamics [10].

To illustrate some applications of StateProfiler, we analyzed two previously generated datasets for studying (a) cell division cycle and (b) cell differentiation in different species.

### Cell cycle profiling

We applied StateProfiler to identify the temporal features of budding yeast cell cycle progression based on knockout of S-phase triggering cyclins Clb5 and Clb6. In late G1-phase, while both Clb5 and Clb6 activate Cdc28p to promote initiation of DNA synthesis, the exact mechanisms and extents of regulating this transition from G1 to S phase are distinct for the two cyclins [4]. In particular, Clb5 knockout causes a more prominent S phase defect during cell cycle progression in yeast cells than Clb6 knockout. Since DNA replication happens in S phase, we studied the dynamics of transition from the start and end states corresponding of one and two copies of the chromosomes (respectively, G1 and G2-M phases) while passing through intermediate states corresponding to S phase delay in the mutants. Interestingly, while genetic mutations are long known to produce delay phenotypes in cell cycle progression, few algorithms prior to StateProfiler could model the lag in the DNA distributions with precision.

*g*= 2 components for Clb5Δ data at

*t*= 25 min (blue histogram in Figure 2c). The resulting 2-component profile is depicted by the orange curve in Figure 2c. The individual components involved in the model are identified and shown as black dotted curves. Their parameters could be used to detect features for purposes like sorting cells (FACS) or monitoring trends in specific subpopulations (e.g. note the lag in the left component in Figure 2).

*D*

_{ n }based on Kolmogorov-Smirnov (K-S) test statistic, and compared in Table 1 with the same for four competing 2-component mixture models (of normal,

*t*, skew normal, and skew

*t*) known from the literature [3, 18]. According to BIC, the optimal selection of the STNMIX model with equal dfs is evident (e.g. the 2-component model at

*t*= 25). As seen from

*Dn*, we also conclude that STNMIX achieves the most precise modeling in terms of both the count and asymmetry of components in the given data. Further, we used the models for objective comparison of profiles both within and across time-courses. We computed the Gap statistic [19] as a measure of dispersion of cellular events between the two extreme states corresponding to the 1C and 2C peaks or clusters. Tested against a reference distribution of data with no clustering, the Gap statistics support the biological observation of Jackson

*et al*. [4] that the Clb5 mutant shows more pronounced S-phase delay phenotypes than the Clb6 mutant and hence has less well-separated components in mid-cell cycle (e.g.

*t*= 25). The contrast between the samples in terms of cells showing a slower state transition from 1C to 2C may be observed in Table 2 for different time-points. Finally, we observe the gradual variation in the key features at each successive time-point to gain insights into the differential regulation of the S-phase by the cyclins Clb5 and Clb6 (Figure 3).

Details of competing models for Clb5 data

t | Criterion | NMIX | TMIX | SNMIX | STMIX | STNMIX |
---|---|---|---|---|---|---|

0 | ${\widehat{\ell}}_{\text{max}}$ | 2539.78 | 2647.35 | 2682.44 | 2771.12 | 2759.40 |

BIC | -5033.60 | -5239.55 | -5300.53 | -5468.69 | -5445.25 | |

| 0.0413 | 0.0262 | 0.0292 | 0.0164 | 0.0185 | |

10 | ${\widehat{\ell}}_{\text{max}}$ | 1201.11 | 1224.87 | 1357.82 | 1405.31 | 1406.80 |

BIC | -2356.27 | -2394.60 | -2651.32 | -2737.09 | -2740.08 | |

| 0.0424 | 0.0284 | 0.0312 | 0.0214 | 0.0190 | |

20 | ${\widehat{\ell}}_{\text{max}}$ | -5463.40 | -5462.64 | -4869.75 | -4792.67 | -4791.62 |

BIC | 10972.72 | 10980.37 | 9803.79 | 9658.80 | 9656.69 | |

| 0.0758 | 0.0715 | 0.0251 | 0.0264 | 0.0266 | |

25 | ${\widehat{\ell}}_{\text{max}}$ | -7040.90 | -6981.07 | -6992.27 | -6918.17 | -6916.52 |

BIC | 14127.73 | 14017.26 | 14048.84 | 13909.82 | 13906.53 | |

| 0.0147 | 0.0145 | 0.0155 | 0.0077 | 0.0075 | |

30 | ${\widehat{\ell}}_{\text{max}}$ | -7251.45 | -7226.16 | -7228.28 | -7203.05 | -7201.55 |

BIC | 14548.76 | 14507.34 | 14520.76 | 14479.46 | 14476.46 | |

| 0.0218 | 0.0175 | 0.0143 | 0.0129 | 0.0110 | |

35 | ${\widehat{\ell}}_{\text{max}}$ | -6413.58 | -6412.58 | -6374.92 | -6320.38 | -6334.20 |

BIC | 12872.96 | 12880.12 | 12813.96 | 12714.04 | 12741.69 | |

| 0.0196 | 0.0230 | 0.0136 | 0.0117 | 0.0152 | |

40 | ${\widehat{\ell}}_{\text{max}}$ | -4626.43 | -4625.80 | -4546.10 | -4429.22 | -4461.12 |

BIC | 9298.55 | 9306.44 | 9156.18 | 8931.56 | 8995.37 | |

| 0.0338 | 0.0306 | 0.0170 | 0.0123 | 0.0184 | |

50 | ${\widehat{\ell}}_{\text{max}}$ | -1286.86 | -1121.26 | -1286.53 | -1093.80 | -1086.34 |

BIC | 2619.40 | 2297.35 | 2637.03 | 2260.70 | 2245.79 | |

| 0.0222 | 0.0145 | 0.0218 | 0.0139 | 0.0132 | |

60 | ${\widehat{\ell}}_{\text{max}}$ | -2016.29 | -1596.75 | -1835.23 | -1573.97 | -1568.82 |

BIC | 4078.18 | 3248.21 | 3734.30 | 3220.89 | 3210.59 | |

| 0.0540 | 0.0203 | 0.0339 | 0.0172 | 0.0131 | |

75 | $\widehat{{\ell}_{\text{max}}}$ | -8146.57 | -7810.70 | -7772.60 | -7770.74 | -7769.87 |

BIC | 16393.86 | 15731.29 | 15682.55 | 15688.00 | 15686.25 | |

| 0.0219 | 0.0119 | 0.0079 | 0.0101 | 0.1104 |

Measuring dispersion of events at each time point

Time | Gap1 | Gap2 | SE1 | SE2 |
---|---|---|---|---|

0 | 0.689 | -0.170 | 0.016 | 0.016 |

10 | 0.436 | -0.335 | 0.016 | 0.019 |

20 | 0.022 | -1.245 | 0.012 | 0.016 |

25 | 0.203 | -0.789 | 0.013 | 0.018 |

30 | -0.338 | -0.164 | 0.016 | 0.015 |

35 | -0.439 | -0.223 | 0.013 | 0.014 |

40 | -0.371 | -0.403 | 0.015 | 0.015 |

50 | 0.281 | 0.233 | 0.015 | 0.015 |

60 | 0.510 | 0.096 | 0.016 | 0.014 |

75 | -1.550 | 0.100 | 0.013 | 0.014 |

### Cell differentiation profiling

Another key area in which flow data are extremely insightful about different state transitions is cell differentiation. In recent years, many important advances in biology have been made by studying the modes and mechanisms of differentiation especially in the context of stem cells and cancer. Stem cell differentiation has also been studied for their clinical applications such as in the field of regenerative medicine. An excellent review of the field is given in a recent text edited by Krishan *et al*. [2]. Over the course of differentiation, the profiles of expression of various markers - including those indicating stemness or commitment to a lineage - vary according to transitions of populations through unstable, metastable and eventually stable states. Often measurable phenotypic diversity appears due to cell-to-cell variability even within clonal populations, which are manifest and can be studied as outlier events or asymmetric or tail subpopulations. Sometimes these features are transient and peripheral, and could be hard to distinguish via automation. Accurate modeling of dynamic flow profiles is thus essential to identify or monitor transitional features as and when they appear (or disappear) for objective temporal characterization of the state-space components involved in differentiation.

In the present study, we analyzed clonal populations of EML cells, a multipotent mouse haematopoietic cell line that can differentiate into myeloid, erythroid, and other lineages. In a recent study, Chang *et al*. [11] measured the expression levels of the stem cell marker Sca-1 in different subpopulations of EML cells as time course data. They observed that cell-to-cell heterogeneity in this clonal progenitor population gave rise to Sca-1 outlier cells - cells that exhibit very high or low Sca-1 expression - and possessed distinct gene expression patterns. The heterogeneity could not be attributed to measurement noise or cell-cycle-dependent cell size variation. Eventually, however, each of these distinct Sca-1 subpopulations' profiles became similar to that of the median cells, thus revealing an attractor state. Yet it was noted [11] that the divergence lasted long enough to allow different propensities for either subpopulation, i.e. low and high Sca-1, to enter into a transient state that primes them for either the erythroid or the myeloid lineage, as captured by their differential expression of lineage-specific transcription factors.

For precise characterization of the dynamics by which population heterogeneity arose in this clonal population via outliers and subsided ultimately, cells with the lowest, middle and highest levels of Sca-1 expression were isolated by [11] using fluorescence-activated cell sorting (FACS). We call these subsets Sca-1^{low}, Sca-1^{mid}, and Sca-1^{high}. Following FCAS, the sorted cells were immediately stripped of the staining antibody and cultured in standard growth medium. Subsequently, Sca-1 fluorescence intensity were measured individually for each of the 3 subpopulations as time course data. Similar measurements were made for an original clonal population of EML cells for comparison (we call it Sca-1^{all}).

We applied the StateProfiler framework to model the flow profiles for 14-point time course data for each of the 4 populations. Often finite mixtures of Gaussians are used for modeling the theoretical subpopulation structure in such profiles [11, 20]. However, using Gaussian components, precise modeling in the presence of outliers due to cell-to-cell heterogeneity is particularly difficult for clonal populations. This is because an optimal model must be able to accommodate such heterogeneity without requiring extra components, but Gaussian components with sharp tails are hardly robust against outliers. It leads to sub-optimal models with spurious subpopulations, which makes their biological interpretation difficult.

StateProfiler addressed the modeling problem in two ways. First, its skew *t*-Normal components are robust to outliers and asymmetry in the distributions. This helps in modeling transitional features even if they lead to unusually shaped or heavy tailed distributions. Second, even if redundant subpopulations were identified, the new merging procedure in StateProfiler can re-construct any significantly overlapping components in a statistically optimal fashion, i.e. to produce a combined profile by causing minimal change in entropy of the model pre- and post-reconstruction.

StateProfiler's parametric characterization can reveal various features and trends of interest in terms of specific parameters. For instance, we observe that by 3 days, both Sca-1^{mid} and Sca-1^{high} have already started to resemble the unsorted population, and by 6 days, they actually have their own low Sca-1 tails. Another trend of possible interest is the slow but continuous fluctuation in the proportion of low Sca-1 outliers in the unsorted population. Finally, it appears that the eventual stable state when the 3 profiles finally coincide is reached at a point of time much later than 9 days, as suggested by [11], and takes probably double that time (432 h). In the mean time, as we see in Figure 6, the states might continue to drift closer and apart as in a dynamical system exhibiting multistable behaviour. If indeed the departure from the average state has biological functionality in the priming of cell fate commitment, then a non-monotonic, delayed restoration of the underlying molecular mechanisms may be justified by having more than a few cells with random fluctuation and call for further investigation.

## Conclusions

In this study, we described StateProfiler, a framework to construct temporal profiles with flow data, which can facilitate parametric modeling of cellular state transitions Towards this, we presented 3 key features of the framework. First, we described a finite mixture of skew *t*-Normal distributions. Second, we presented a new greedy EM algorithm for fast and optimal model selection. The parsimonious approach of our greedy algorithm allows us to detect the variation in the features as and when they appear and disappear at different points of time thereby offering a parametric characterization of the overall nature of state transition. Third, we designed a mixture merging procedure for ensuring robust estimation of the fitted profile. The code implementing the framework is available from the authors upon request. Indeed the proposed framework is effective, general and may be applied to other similar domains.

## Methods and materials

### Mixtures of skew Student-*t*-normal distributions

*t*-Normal mixture model (STNMIX) of StateProfiler. To simplify notation, we let

*ϕ*(.) and Φ(.) denote the probability density function (pdf) and the cumulative distribution function (cdf) of the standard normal distribution, respectively. Let

denote the pdf of the *t* distribution with location *ξ*, scale *σ*^{2} and degrees of freedom (df) *v*, and *t*(*x|v* ) simply for the case when *ξ* = 0 and *σ* = 1; and let Γ(*α,β*) be the gamma distribution with density *g*(*x|α, β*) ∝ *x*^{α-1}exp{-*βx*}. We start by defining the STN distribution and then note further properties.

*et al*. [21], a random variable

*Y*is said to follow the STN with location parameter

*ξ*∈ ℝ, scale parameter

*σ*

^{2}∈ (0, ∞), skewness parameter

*λ*∈ ℝ and degrees of freedom

*v*∈ (0,∞) it is has the density

We shall write *Y~ STN*(*ξ*,*σ*^{2},*λ*,*v*) if *Y* has the density of (1).

*et al*. [13] give following hierarchical representation of STN to establish an EM-type algorithm [22].

where *T N*(*µ, σ*^{2}; (*a, b*)) represents the truncated normal distribution for *N*(*µ, σ*^{2}) lying within the truncated interval (*a, b*).

*n*independent random variables

*Y*

_{1}

*,..., Y*

_{ n }, which are taken from a mixture of STN distributions. The pdf of a

*g*-component STNMIX model is

Where *w*_{
i
}'s are mixing proportions which are constrained to be positive and ${\sum}_{i=1}^{g}{w}_{i}=1,\psi \left({y}_{j}|{\theta}_{i}\right)$is the STN density defined in (1) and Θ_{
g
}= (*w*_{1},..., *w*_{g- 1}, *θ*_{1},..., *θ*_{
g
}) represents all unknown parameters. Note that the component vector *θ*_{
i
}consists of $\left({\xi}_{i},{\sigma}_{i}^{2},{\lambda}_{i},{v}_{i}\right).$

Based on (2), a practical ECM/ECME algorithm [23, 24] proceeds are described by Ho *et al*. [13] as follows:

**E-step:**Given ${\mathrm{\Theta}}_{g}={\widehat{\mathrm{\Theta}}}_{g}^{\left(h\right)},$compute following $\widehat{{z}_{ij}^{\left(h\right)}}$, ${\widehat{\tau}}_{ij}^{\left(h\right)}$, $\widehat{{\kappa}_{ij}^{\left(h\right)}}$ and ${\widehat{{\gamma}_{1}}}_{\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}ij}^{\left(h\right)}$ for

*i*= 1,...,

*g*and

*j*= 1,...,

*n*.

where ${\widehat{u}}_{ij}^{\left(h\right)}=\left({y}_{j}-{\widehat{\xi}}_{i}^{\left(h\right)}\right)/{\widehat{\sigma}}_{i}^{\left(h\right)}.$

**CM-step:**Update the estimation by

where ${\widehat{n}}_{i}^{\left(h\right)}={\sum}_{j=1}^{n}{\u1e91}_{ij}^{\left(h\right)}$, ${\widehat{b}}_{1i}^{\left(h\right)}={\sum}_{j=1}^{n}{\u1e91}_{ij}^{\left(h\right)}\phantom{\rule{2.77695pt}{0ex}}{\widehat{\tau}}_{ij}^{\left(h\right)}{y}_{j}$, ${\widehat{b}}_{2i}^{\left(h\right)}={\sum}_{j=1}^{n}{\u1e91}_{ij}^{\left(h\right)}{y}_{j}$, ${\widehat{b}}_{3i}^{\left(h\right)}={\sum}_{j=1}^{n}{\u1e91}_{ij}^{\left(h\right)}\phantom{\rule{2.77695pt}{0ex}}{\widehat{\gamma}}_{1ij}^{\left(h\right)}$, ${\widehat{b}}_{4i}^{\left(h\right)}={\sum}_{j=1}^{n}{\u1e91}_{ij}^{\left(h\right)}\left({\widehat{\kappa}}_{ij}^{\left(h\right)}-{\widehat{\tau}}_{ij}^{\left(h\right)}\right)/{\widehat{n}}_{i}^{\left(h\right)}$and ${\widehat{{u}_{ij}}}^{\left(h+1\right)}=\left({y}_{j}-\widehat{{\xi}_{i}^{\left(h+1\right)}}\right)/\widehat{{\sigma}_{i}^{\left(h+1\right)}}.$

*v*

_{ 1 }= ⋯ =

*v*

_{ g }=

*v*, we could update ${\widehat{v}}^{\left(h\right)}$by

The E-step and CM/CML-steps are alternately repeated until a suitable convergence rule is satisfied, e.g., the Aitken acceleration based stopping criterion $|{\ell}^{\left(h+1\right)}-{\ell}_{\infty}^{\left(h+1\right)}|<\epsilon ,$where ${\ell}^{\left(h+1\right)}$is the observed log-likelihood evaluated at ${\widehat{\mathrm{\Theta}}}_{g}^{\left(h\right)},{\ell}_{\infty}^{\left(h+1\right)}$is the asymptotic estimate of the log-likelihood at iteration *h* + 1 (see [18]; Chap. 4.9) and *ε* is the desired tolerance. For numerical analyses in this paper, a default value of *ε* = 10^{-6} was used to terminate the iterations.

### Greedy learning for STN mixtures

In this section, we present a new greedy version of the EM algorithm to determine the optimum number of components in the fitting of STNMIX models. The greedy EM approach was first introduced by Vlassis and Likas [25]. The fundamental concept of the greedy EM algorithm is to start from a minimum number of components and sequentially insert a new component to the mixture until convergence is achieved. The stopping criterion can be a pre-specified maximum number of components or a pre-specified convergence tolerance.

*g*-component $f\left({y}_{j}|{\mathrm{\Theta}}_{g}\right)$. The resulting mixture takes the form of

*< a <*1 and ${\mathrm{\Theta}}_{g+1}=\left({\mathrm{\Theta}}_{g},\phantom{\rule{2.77695pt}{0ex}}a,\phantom{\rule{2.77695pt}{0ex}}{\theta}_{g+1}\right)$ with ${\theta}_{g+1}$being the added parameters $\left({\xi}_{g+1},\phantom{\rule{2.77695pt}{0ex}}{\sigma}_{g+1}^{2},\phantom{\rule{2.77695pt}{0ex}}{\lambda}_{g+1},\phantom{\rule{2.77695pt}{0ex}}{v}_{g+1}\right).$ Given an old mixture $f\left({y}_{j}|{\widehat{\mathrm{\Theta}}}_{g}\right)$, the weight

*a*and ${\theta}_{g+1}$are optimally chosen to maximize the new log-likelihood

To find the optimal solution in (4), we start by performing a local search with for the newly inserted component. This gives rise to the following partial EM steps where $\stackrel{\u0303}{\theta}$denotes and the partial ML estimates of *θ*. For notational simplicity, the subscript (*g* + 1) is suppressed below in the Partial E-step.

**Partial E-step:**Calculating the conditional expectation of latent variables at the

*k*th iteration, this yields

Where ${\u0169}_{j}^{\left(k\right)}=\left({y}_{j}-{\stackrel{\u0303}{\xi}}^{\left(k\right)}\right)/{\stackrel{\u0303}{\sigma}}^{\left(k\right)}.$

**Partial M-step:**Updating the new parameters in (

*a*, θ

_{g+1}), we get

Where ${\u0169}_{j}^{\left(k+1\right)}=\left({y}_{j}-{\stackrel{\u0303}{\xi}}^{\left(k+1\right)}\right)/{\stackrel{\u0303}{\sigma}}^{\left(k+1\right)}$, ${\stackrel{\u0303}{b}}_{1}^{\left(k\right)}={\sum}_{j=1}^{n}{\stackrel{\u0303}{z}}_{j}^{\left(k\right)}{\stackrel{\u0303}{\tau}}_{j}^{\left(k\right)}{y}_{j}$, ${\stackrel{\u0303}{b}}_{2}^{\left(k\right)}={\sum}_{j=1}^{n}{\stackrel{\u0303}{z}}_{j}^{\left(k\right)}{y}_{j}$, ${\stackrel{\u0303}{b}}_{3}^{\left(k\right)}={\sum}_{j=1}^{n}{\stackrel{\u0303}{z}}_{j}^{\left(k\right)}{\stackrel{\u0303}{\gamma}}_{1j}^{\left(k\right)}$, and ${\stackrel{\u0303}{b}}_{4}^{\left(k\right)}={\sum}_{j=1}^{n}{\stackrel{\u0303}{z}}_{j}^{\left(k\right)}\left({\stackrel{\u0303}{\kappa}}_{j}^{\left(k\right)}-{\stackrel{\u0303}{\tau}}_{j}^{\left(k\right)}\right)/{\sum}_{j=1}^{n}{\stackrel{\u0303}{z}}_{j}^{\left(k\right)}$.

*a*and

*ξ*

_{g+1}. Similar to Vlassis and Likas [25], we provided a global search strategy for extracting proper parameter initialization for

*a*and ${\xi}_{g+1}^{\left(0\right)}$. By a second-order Taylor expansion for ${\mathcal{L}}_{g+1}$, we obtain the following approximation:

*a*=

*a*

_{0}. It can be deduced from (5) that a local maximum of ${\mathcal{L}}_{g+1}$ around

*a*

_{0}= 0.5 is given by

*a*can be calculated as

Following the suggestion of Li and Barron [26], one may set $\widehat{a}=0.5$ for *g* = 1 and $\widehat{a}=2/(g+1)$ for *g* ≥ 2 as a default recommendation when the estimated value (7) fall outside the range of (0, 1).

In our global search, a convenience choice of ${\stackrel{\u0303}{\sigma}}_{g+1}^{2\left(0\right)}$ is *n*^{-1/5} times half of the sample variance ${s}_{y}^{2}$ whereas ${\stackrel{\u0303}{\lambda}}_{g+1}^{2\left(0\right)}$ and ${\stackrel{\u0303}{{\nu}_{g+1}}}^{\text{2}\left(\text{0}\right)}$ are always fixed at 0 and 10, respectively. For the initial choice of *ξ*_{g+1}, we search over the 5th, 10th, 15th, ⋯ 95th quantiles of y and set ${\stackrel{\u0303}{\xi}}_{g+1}^{\left(0\right)}$ to the one that maximizes (6).

- 1.
Start with

*g*= 1 and compute the ML estimates of the single-component STNMIX model via the ECME algorithm. - 2.
If

*g >*1, estimate**Θ**_{ g }via the EM-type algorithms. - 3.
Perform a global search to find a proper initialization of

*a*and*ξ*_{g+1}. - 4.
Apply the partial EM-steps until convergence. For instance, $|{\widehat{\mathcal{L}}}_{g+1}^{\left(k\right)}/{\widehat{\mathcal{L}}}_{g+1}^{\left(k-1\right)}-1|\phantom{\rule{2.77695pt}{0ex}}<1{0}^{-6}$.

- 5.
If ${\widehat{\mathcal{L}}}_{g+1}\le {\widehat{\mathcal{L}}}_{g}+m$ then terminate, where

*m >*0 is a penalty term. Otherwise allocate the new component to the model and go to 2. Set*g*=*g*+ 1.

Given *r* candidates (we have 19 quantiles of sample), the time complexity of our greedy EM algorithm is *O*(*ngr*). If overall sample was considered as candidates in the global search, then the running time is similar to Vlassis and Likas [25].

### Merging mixture algorithm

The greedy EM algorithm provides a convenient method for automatically selecting a number of components for a mixture model under reasonable assumptions (such as convexity of components). Yet if data have certain spatial features due to distributions with unusual shapes or low separation [8], it can lead to overlapping components, and hence to overestimation in the number of components in spite of the parsimonious approach. To augment our greedy algorithm for obtaining a robust estimate of the number of components, we extend the merging mixture approach of Baudry *et al*. [27] to skew *t*-Normal components. While merging techniques have been applied in the past to symmetric distributions [27, 28], designing a procedure for asymmetric distributions obviates any need for spurious components that may be required for the sole purpose of modeling asymmetry, and thus avoids redundant merging.

- 1.Calculate the mean entropy of maximum estimation for
*g*components as$\text{Ent}\left(g\right)=-\sum _{j=1}^{n}\sum _{i=1}^{g}{\u1e91}_{ij}\text{log}{\u1e91}_{ij}\ge 0,$

**Θ**

_{ g }fix at ${\widehat{\mathrm{\Theta}}}_{g}$.

- 2.Two clusters
*l*and*l'*to be combined are those maximizing the criterion:$\begin{array}{c}\left(-\sum _{j=1}^{n}\left\{{\u1e91}_{il}\text{log}{\u1e91}_{il}+{\u1e91}_{i{l}^{\prime}}\text{log}{\u1e91}_{i{l}^{\prime}}\right\}\right.\\ \left(\right)close=")">+\sum _{j=1}^{n}\left({\u1e91}_{il}+{\u1e91}_{i{l}^{\prime}}\right)\text{log}\left({\u1e91}_{il}+{\u1e91}_{i{l}^{\prime}}\right)\end{array}$

*l, l'*).

- 3.Obtain the merged entropy$\begin{array}{c}\text{Ent}\left(g-1\right)=-\sum _{j=1}^{n}\left\{\sum _{i\ne l,{l}^{\prime}}{\u1e91}_{ij}\text{log}{\u1e91}_{ij}\right.\\ \phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\left(\right)close="\}">+{\u1e91}_{i,l\cup {l}^{\prime}}\text{log}{\u1e91}_{i,l\cup {l}^{\prime}}& ,\end{array}$

*l*∪

*l'*.

- 4.
Update $\widehat{{z}_{j}}$ consists of the unmerged and merged posterior probabilities.

- 5.
Set

*g*=*g*- 1 and go to 2. Repeat until*g*= 1. - 6.
A solution of number of components can be identified (i) a sudden jump or "elbow" in a plot of the entropy of clustering versus the number of clusters, or (ii) peaks in a plot of the number of clusters versus the difference in entropy.

### Data and experiments

For details of the yeast cell cycle experiments and timecourse data analyzed by StateProfiler, see [4]. For details of EML cell differentiation data, see [11].

## Declarations

### Acknowledgements

TIL was partially supported by National Science Council of Taiwan (Grant NO. NSC99-2118-M-005-001-MY2).

This article has been published as part of *BMC Bioinformatics* Volume 13 Supplement 5, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S5.

## Authors’ Affiliations

## References

- Darzynkiewicz Z, Crissman H, Jacobberger JW:
**Cytometry of the cell cycle: cycling through history.***Cytometry A*2004,**58:**21–32.View ArticlePubMed - Krishan A, Krishnamurthy H, Totey S:
*Applications of Flow Cytometry in Stem Cell Research and Tissue Regeneration*. John Wiley & Sons Inc; 2010.View Article - Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, Baecher-Allan C, McLachlan GJ, Tamayo P, Hafler DA, Jager PLD, Mesirov JP:
**Automated high-dimensional flow cytometric data analysis.***Proc Natl Acad Sci USA*2009,**106:**8519–8524. 10.1073/pnas.0903028106PubMed CentralView ArticlePubMed - Jackson LP, Reed SI, Haase SB:
**Distinct mechanisms control the stability of the related S-phase cyclins Clb5 and Clb6.***Mol Cell Biol*2006,**26:**2456–2466. 10.1128/MCB.26.6.2456-2466.2006PubMed CentralView ArticlePubMed - Niu W, Li Z, Zhan W, Iyer VR, Marcotte EM:
**Mechanisms of cell cycle control revealed by a systematic and quantitative overexpression screen in S. cerevisiae.***PLoS Genet*2008,**4:**e1000120. Doi:10.1371/journal.pgen.1000120 Doi:10.1371/journal.pgen.1000120 10.1371/journal.pgen.1000120PubMed CentralView ArticlePubMed - Hedley DW, Chow S, Goolsby C, Shankey TV:
**Pharmacodynamic monitoring of molecular-targeted agents in the peripheral blood of leukemia patients using flow cytometry.***Toxicol Pathol*2008,**36:**133–139. 10.1177/0192623307310952View ArticlePubMed - Krishan A, Hamelik RM:
**Flow cytometric monitoring of fluorescent drug retention and efflux.***Methods Mol Med*2005,**111:**149–166.PubMed - Kotecha N, Flores NJ, Irish JM, Simonds EF, Sakai DS, Archambeault S, Diaz-Flores E, Coram M, Shannon KM, Nolan GP, Loh ML:
**Single-cell profiling identifies aberrant STAT5 activation in myeloid malignancies with specific clinical and biologic correlates.***Cancer Cell*2008,**14**(4):335–343. 10.1016/j.ccr.2008.08.014PubMed CentralView ArticlePubMed - Irish JM, Myklebust JH, Alizadeh AA, Houot R, Sharman JP, Czerwinski DK, Nolan GP, Levy R:
**B-cell signaling networks reveal a negative prognostic human lymphoma cell subset that emerges during tumor progression.***Proc Natl Acad Sci USA*2010,**107:**12747–12754. 10.1073/pnas.1002057107PubMed CentralView ArticlePubMed - Chang HH, Oh PY, Ingber DE, Huang S:
**Multistable and multistep dynamics in neutrophil differentiation.***BMC Cell Biol*2006,**7:**11. 10.1186/1471-2121-7-11PubMed CentralView ArticlePubMed - Chang HH, Hemberg M, Barahona M, Ingber DE, Huang S:
**Transcriptome-wide noise controls lineage choice in mammalian progenitor cells.***Nature*2008,**453**(7194):544–547. 10.1038/nature06965View ArticlePubMed - Frühwirth-Schnatter S, Pyne S:
**Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-**t**distributions.***Biostatistics*2010,**11:**317–336. 10.1093/biostatistics/kxp062View ArticlePubMed - Ho HJ, Pyne S, Lin TI:
**Maximum likelihood inference for mixtures of skew Student-**t**-normal distributions through practical EM-type algorithms.***Stat Comput*2012,**22:**287–299. 10.1007/s11222-010-9225-9View Article - Lin TI, Lee JC, Yen SY:
**Finite mixture modelling using the skew normal distribution.***Stat Sinica*2007,**17:**909–927. - Rossin E, Lin TI, Ho HJ, Mentzer SJ, Pyne S:
**A framework for analytical characterization of monoclonal antibodies based on reactivity profiles in different tissues.***Bioinformatics*2011,**27**(19):2746–2753. 10.1093/bioinformatics/btr468PubMed CentralView ArticlePubMed - Lin TI, Lee JC, Hsieh WJ:
**Robust mixture modeling using the skew**t**distribution.***Stat Comput*2007,**17:**81–92. 10.1007/s11222-006-9005-8View Article - Lin TI:
**Robust mixture modeling using multivariate skew**t**distributions.***Stat Comput*2010,**20:**343–356. 10.1007/s11222-009-9128-9View Article - McLachlan GJ, Krishnan T:
*The EM algorithm and extensions*. John Wiley & Sons Inc; 2008.View Article - Tibshirani R, Walther G, Hastie T:
**Estimating the number of clusters in a data set via the gap statistic.***J R Stat Soc Ser B Methodol*2001,**63:**411–423. 10.1111/1467-9868.00293View Article - Song C, Phenix H, Abedi V, Scott M, Ingalls BP, Kaern M, Perkins TJ:
**Estimating the stochastic bifurcation structure of cellular networks.***PLoS Comput Biol*2010,**6**(3):e1000699. 10.1371/journal.pcbi.1000699PubMed CentralView ArticlePubMed - Gómez HW, Venegas O, Bolfarine H:
**Skew-symmetric distributions generated by the distribution function of the normal distribution.***Environmetrics*2007,**18:**395–407. 10.1002/env.817View Article - Dempster AP, Laird NM, Rubin DB:
**Maximum likelihood from incomplete data via the EM algorithm (with discussion).***J R Stat Soc Ser B Methodol*1977,**39:**1–38. - Meng XL, Rubin DB:
**Maximum likelihood estimation via the ECM algorithm: A general framework.***Biometrika*1993,**80:**267–278. 10.1093/biomet/80.2.267View Article - Liu CH, Rubin DB:
**The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence.***Biometrika*1994,**81:**633–648. 10.1093/biomet/81.4.633View Article - Vlassis N, Likas A:
**A greedy EM algorithm for Gaussian mixture learning.***Neural Process Lett*2002,**15:**77–87. 10.1023/A:1013844811137View Article - Li JQ, Barron AR:
**Mixture Density Estimation.**In*Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999*. Edited by: Solla SA, Leen TK, Müller KR. The MIT Press; 1999:279–285. - Baudry JP, Raftery AE, Celeux G, Lo K, Gottardo R:
**Combining Mixture Components for Clustering.***J Comput Graph Stat*2010,**9:**332–353.PubMed CentralView ArticlePubMed - Finak G, Bashashati A, Brinkman R, Gottardo R:
**Merging mixture components for cell population identification in flow cytometry.***Adv Bioinformatics*2009,**2009:**Article ID 247646. Doi:10.1155/2009/247646 Doi:10.1155/2009/247646View Article

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.