 Research Article
 Open Access
 Published:
flowVS: channelspecific variance stabilization in flow cytometry
BMC Bioinformatics volume 17, Article number: 291 (2016)
Abstract
Background
Comparing phenotypes of heterogeneous cell populations from multiple biological conditions is at the heart of scientific discovery based on flow cytometry (FC). When the biological signal is measured by the average expression of a biomarker, standard statistical methods require that variance be approximately stabilized in populations to be compared. Since the mean and variance of a cell population are often correlated in fluorescencebased FC measurements, a preprocessing step is needed to stabilize the withinpopulation variances.
Results
We present a variancestabilization algorithm, called flowVS, that removes the meanvariance correlations from cell populations identified in each fluorescence channel. flowVS transforms each channel from all samples of a data set by the inverse hyperbolic sine (asinh) transformation. For each channel, the parameters of the transformation are optimally selected by Bartlett’s likelihoodratio test so that the populations attain homogeneous variances. The optimum parameters are then used to transform the corresponding channels in every sample. flowVS is therefore an explicit variancestabilization method that stabilizes withinpopulation variances in each channel by evaluating the homoskedasticity of clusters with a likelihoodratio test.
With two publicly available datasets, we show that flowVS removes the meanvariance dependence from raw FC data and makes the withinpopulation variance relatively homogeneous. We demonstrate that alternative transformation techniques such as flowTrans, flowScape, logicle, and FCSTrans might not stabilize variance. Besides flow cytometry, flowVS can also be applied to stabilize variance in microarray data. With a publicly available data set we demonstrate that flowVS performs as well as the VSN software, a stateoftheart approach developed for microarrays.
Conclusions
The homogeneity of variance in cell populations across FC samples is desirable when extracting features uniformly and comparing cell populations with different levels of marker expressions. The newly developed flowVS algorithm solves the variancestabilization problem in FC and microarrays by optimally transforming data with the help of Bartlett’s likelihoodratio test. On two publicly available FC datasets, flowVS stabilizes withinpopulation variances more evenly than the available transformation and normalization techniques. flowVSbased variance stabilization can help in performing comparison and alignment of phenotypically identical cell populations across different samples. flowVS and the datasets used in this paper are publicly available in Bioconductor.
Background
We describe an algorithm that transforms a collection of flow cytometry (FC) samples in order to stabilize the variance within cell populations in each fluorescence channel for the entire collection of samples. This transformation enables cell populations (clusters of cells with similar phenotypes) with homogeneous variances to be easily compared with each other by standard statistical methods. Betweenpopulation comparisons are important in detecting changes in populations across biological conditions, which might help us to diagnose diseases, develop new drugs, and understand the immune system in general [1–5]. Hence, our variancestabilization algorithm could play a supporting role in automating biological discovery based on flow cytometry and similar imaging technologies.
FC technology measures morphology (from light scattering) and the expression of multiple biomarkers (from fluorescence emission of fluorophores attached to antibodies) at the singlecell level. An FC sample consists of hundreds of thousands or more of such singlecell measurements, and a study could consist of thousands of samples from different individuals at different time points under different experimental conditions [6, 7].
Variance inhomogeneity is an inherent problem in fluorescencebased FC measurements and can be an obstacle both for manual data analysis performed by qualified cytometry operators and for automated multisample comparisons, which typically rely on an intermediate step of cell clustering using a plethora of approaches from modified kmeans to nonparametric Bayesian methodologies [8, 9]. The origin of the problem is the physics of fluorescence signal formation and the detection processes that monotonically increase the variance of the fluorescence signal with the average signal intensity [10, 11]. For example, Fig. 1 demonstrates how the variances of cell populations increase with their mean fluorescence intensities (MFIs) in a set of FC samples collected from several healthy individuals. Owing to such signalvariance dependence, a cell population with higher levels of marker expressions (i.e., higher fluorescence emission) has higher variance than another population with relatively low levels of marker expressions (i.e., low fluorescence emission). This inhomogeneity of withinpopulation variance creates problems in extracting features uniformly and comparing cell populations with different levels of marker expressions.
In order to demonstrate the flowVS results we evaluate the pre and postprocessing cluster homogeneity, and quantify the improvement offered by our approach. We report the results using a simple measure of effect size, rather than through a hypothesistesting framework. As an example, consider the population registration problem in which corresponding cell clusters from multiple FC samples are identified based on the average levels of markers expressed by the clusters [3, 12–14]. The clusters of cells representing the same immunophenotype identified in multiple samples are represented by a hypothetical metacluster (a biological generalization of the particular immunophenotype, observed across multiple samples). The existence of metaclusters is typically assessed by hemopathologists or other skilled FC operators on the basis of their experience and knowledge of previous examples of normal and aberrant immunophenotypes. The biological hypothesis behind assigning a cluster to a metacluster can be formulated as “all clusters in a metacluster represent the same cell type (immunophenotype).” However, translating this hypothesis to a null stating “all clusters in a metacluster have equal mean” and using a traditional hypothesistesting framework accompanied by pvalues may not be appropriate. First, we know that such a null hypothesis is unrealistic: biological variability, technical variability of blood or bonemarrow sample measurements, and random effects associated with the biochemistry of antibody binding will certainly produce clusters of differing means. Second, a hypothesistesting framework addresses only the question of whether the clusters have the same location, but it is not designed to measure the magnitude of the difference or lack of homogeneity within a postulated metacluster. Finally, the pvalues are affected by both cluster size and metacluster homogeneity. Thus, the pvalues obtained would not be comparable for various metaclusters or different clusters within metaclusters.
Variance stabilization (VS) is a process for dissociating data variability from mean signals [15–18]. Other fluorescencebased technologies such as the microarrays stabilize variance by data transformation [18–21]. However, unlike microarray data, explicit VS is not usually performed in FC data analysis. Traditionally, FC data are transformed with nonlinear functions to project cell populations with normally distributed clusters – a choice that usually simplifies subsequent visual analysis [22–27]. Recently, Finak et al. [27] used the maximumlikelihood approach to explicitly satisfy normality of the cell populations. Ray et al. [28] transformed each channel with the asinh function whose parameters are optimally selected by the JarqueBera test of normality (a goodnessoffit test of whether sample data have the skewness and kurtosis matching a normal distribution). While these transformations approximately normalize FC data, they might not stabilize variance, as may be seen in Figs. 6 and 7.
The VS problem in FC, however, cannot be solved directly by applying mature VS techniques from the microarray literature. In microarrays, each gene is measured multiple times (possibly under multiple conditions) and the betweensample variance for each gene is stabilized with respect to the average expression of the gene across samples. By contrast, variance is defined by withinpopulation celltocell variation in FC, and this withinpopulation variance is stabilized with respect to the average expression of markers within each population. These contrasting objectives prevent us from applying VS methods from microarray literature directly to flow cytometry.
We address the need for explicit VS in FC with a maximum likelihood (ML)based method, called flowVS, which is built on top of a commonly used inverse hyperbolic sine (asinh) transformation. The choice of asinh function is motivated by its success as a variance stabilizer for microarray data [18, 21]. flowVS stabilizes the withinpopulation variances separately for each fluorescence channel z across a collection of N samples. After transforming z by asinh(z/c), where c is a normalization cofactor, flowVS identifies onedimensional clusters (density peaks) in the transformed channel. Assume that a total of m 1D clusters are identified from N samples with the ith cluster having variance \({\sigma ^{2}_{i}}\). Then the asinh(z/c) transformation works as a variance stabilizer if the variances of the 1D clusters are approximately equal, i.e., \({\sigma ^{2}_{1}} \sim {\sigma ^{2}_{2}}\sim \ldots \sim {\sigma ^{2}_{m}}\). To evaluate the homogeneity of variance (also known as homoskedasticity), we use Bartlett’s likelihoodratio test [29]. From a wide range of cofactors, our algorithm selects one that minimizes Bartlett’s test statistic, resulting in a transformation with the best possible VS. Note that, in contrast to other transformation approaches, our algorithm applies the same transformation to corresponding channels in every sample. flowVS is therefore an explicit VS method that stabilizes withinpopulation variances in each channel by evaluating the homoskedasticity of clusters with a likelihoodratio test.
Using a healthysubject data set from Purdue and publicly available immune tolerance network (ITN) data, we demonstrate that flowVS removes the meanvariance dependence from raw FC data and makes the withinpopulation variance relatively homogeneous. We demonstrate that alternative transformation techniques might not stabilize variance. Variance homogeneity is especially useful to build metaclusters from a collection of phenotypically similar cell populations across samples [3, 27, 30, 31]. Previous studies (Hahne et al. [32], for example) shifted the distribution of each fluorescence channel to ensure homogeneity in metaclusters, but such shifting might hide useful biological signals present in the MFIs of cell populations. By contrast, we can build homogeneous metaclusters from variancestabilized populations without removing the differences in their MFIs. Hence, flowVS could provide additional flexibility in processing and analyzing a large collection of FC samples.
Related work
VS has been a widely studied topic in applied statistics for its central role in making heteroskedastic data easily tractable by standard methods. Heteroskedasticity appears in various data sets mostly because the data follow a distribution with correlated mean and variance, e.g., Poisson or Gamma; there are many more examples, but these two are relevant for fluorescence. For wellknown distribution families, VS is usually performed by transforming data with an analytically chosen function f. For example, \(f(z)=\sqrt {z+3/8}\) works as a good (asymptotic) stabilizer for a random variable z following the Poisson distribution [33]. Variance stabilizers for several wellknown distribution families are described in [33, 34]. For unknown distributions, heuristic and datadriven stabilizers are often used [15–17].
However, traditional transformations are often inadequate for lowcount (photonlimited) signals [18, 35] because of unknown error patterns in fluorescence data. Past work developed ad hoc VS schemes for different types of fluorescence data. For example, in microarrays, the VS problem has been addressed by various nonlinear transformations [18–21]. Most notably, the widely used approach by Huber et al. [18] uses an asinh transformation whose parameters are selected by a maximumlikelihood estimation.
For FC data, researchers have used various nonlinear transformations, such as the logarithm, hyperlog, generalized BoxCox, and biexponential (e.g., logicle and generalized arcsinh) functions [22–27]. In past work, parameters of these transformations were adjusted in a datadriven manner to maximize the likelihood (flowTrans by Finak et al. [27]), to satisfy the normality (flowScape by Ray et al. [28]), and to comply with simulations (FCSTrans by Qian et al. [36]). flowTrans estimates transformation parameters for each sample by maximizing the likelihood of data’s being generated by a multivariatenormal distribution on the transformed scale. flowScape optimizes the normalization factor of asinh transformation by the JarqueBera test of normality. FCSTrans selects the parameters of the linear, logarithm, and logicle transformations with an extensive set of simulations. However, normalizing data may not necessarily stabilize its variance, e.g., for a Poisson variable z, \(\sqrt {z+3/8}\) is an approximate variancestabilizer, whereas z ^{2/3} is a normalizer [16]. Therefore, we consider an approach built upon the wellknown asinh transformation and estimate transformation parameters for explicitly stabilizing withinpopulation variations.
Methods
Motivation
Consider two representative samples from the ITN data set taken from the flowStats package in Bioconductor. We gate the samples for lymphocytes, transform using an asinh transformation (with cofactor set to 1), and plot Tcell subpopulations and distribution of CD8 marker in Fig. 2. Across these two samples, Tcell subpopulations have different proportions and MFIs. For example, in Subfig. 2a and 2c, CD8^{+} Tcell subpopulation is 35.4 % of total Tcells with MFI 5.31. By contrast, CD8^{+} Tcell subpopulation is 28.7 % of total T cells with MFI 6.23 in Subfig. 2b and 2d. Should we consider the differences between CD8^{+} populations in these two samples to be biologically significant? The answer to this question depends on our assumption about the data. If we assume that a cell type either expresses or does not express a biomarker and that the biological information lies only in the proportion of positively and negatively expressed cells, then MFI does not bear meaningful information other than defining positive and negative cells. In this case, we could consider the differences in MFIs across cell populations of the same type as technical variations and eliminate them by aligning cell populations described by Hahne et al. [32] and Finak et al. [37]. However, past work has shown that both cell proportion and MFI can possess biological information [2, 3, 38]. Hence, aligning cell populations to a common MFI might remove meaningful biological signal from data. In the latter case, we want to compare MFIs of cell populations to evaluate whether they are statistically different. A common statistical approach to compare average expressions of cell population is to use a statistical test in an ANOVA model that explicitly requires that variance be approximately stabilized in populations. Hence, VS is necessary to detect statistically meaningful changes across populations from different samples.
The goal of VS in flow cytometry
The aim of VS in FC is to make withinpopulation variances of different cell populations approximately equal and thereby independent of the average marker expressed by populations. Recall that the expression of a marker is measured by the intensity of light at a particular channel of fluorescence. VS therefore stabilizes the withinpopulation fluorescence variance and makes it independent of the MFIs of the cell populations. In this paper, we refer to fluorescence channels more frequently because the nature of fluorescence emissions – not the protein expressions – dominates the meanvariance relationship in FC data. We do not stabilize variance on the scatter channels because, as pointed out by Finak et al. [27], there are few benefits to transforming forward and sidescatter channels.
Channelspecific variance stabilization
We assume that correlations among fluorescence channels due to the overlap of spectra are removed by spectral unmixing before we transform data. Even though the expression of biomarkers can still be correlated [24], we do not incorporate such correlations in VS because the nature of such correlation is difficult to model. Therefore, we assume that compensated fluorescence channels are independent and stabilize variance on each channel separately.
Selecting an optimal transformation for FC data is a nontrivial problem because the accurate error model of FC data is often unknown. In previous work, researchers have successfully used a number of functions to transform FC data, such as logarithm, asinh, BoxCox, logicle, etc. [24, 27, 28]. In our flowVS algorithm, we decided to use the asinh function to transform FC data. This choice of asinh function is motivated by its success in FC data visualization and normalization [27, 28] and in stabilizing variance in fluorescence readouts from microarray data [18, 21]. Stabilizing variance with other transformations can be performed using the same flowVS framework but is not discussed here.
To transform a fluorescence channel z, we use the asinh transformation with a single parameter c:
In this transformation, c is called the normalization cofactor, whose value is optimally selected to stabilize withinpopulation variance in channel z. Note that in a more general form asinh transformation is expressed with three parameters, a∗asinh(b+z/c), where in addition to the cofactor c, a denotes a scaling after transformation, and b denotes a translation before transformation. We set a=1 because scaling after transformation does not affect downstream analysis and set b=0 to avoid shifting cell populations. Hence, we are left with a single parameter c whose value is estimated in order to stabilize the variance.
The flowVS algorithm
Assume that we have a collection of N FC samples. Then the objective of the flowVS algorithm is to transform each sample such that the withinpopulation variance is stabilized in each fluorescence channel across N samples. Here, we describe the algorithm for a single channel z; the process can be applied independently to other channels. First, we discuss the process of evaluating homoskedasticity of a transformed channel for a selected cofactor c by computing Bartlett’s likelihoodratio test. Then, we elaborate the process of selecting an optimum cofactor that would stabilize variance when used with asinh transformation.
Steps to compute Bartlett’s statistic on channel z for a selected cofactor c

Step1: Transforming channel z in each sample. Let z _{ j } be a vector denoting channel z in the jth sample, where 1≤j≤N. We transform z _{ j } by the asinh function: \(z^{\prime }_{j} = \operatorname {asinh} (z_{j}/c)\), where \(z^{\prime }_{j}\) is the transformed channel.

Step2: Detecting 1D density peaks (1D clusters). We estimate the density of \(z^{\prime }_{j}\) by a kernel density estimation method (the density function of stats package in R). The peaks in the density of \(z^{\prime }_{j}\) are identified as regions of high local density and significant curvature (also called landmarks in [32]). We identify highdensity regions in \(z^{\prime }_{j}\) by the curv1Filter function of the flowCore package [39] in Bioconductor. The boundaries of density peaks are identified by detecting minima between two adjacent density peaks. Here, a density peak represents a 1D cluster of cells. Let P _{ j } be the collection of all density peaks identified in \(z^{\prime }_{j}\).

Step3: Collecting density peaks from all samples. Let P be the set of density peaks collected from all samples, i.e., P=∪_{1≤j≤N } P _{ j }. Let P contain a total of m density peaks where the ith peak contains n _{ i } cells with mean μ _{ i } and variance \({\sigma ^{2}_{i}}\).

Step4: Computing Bartlett’s test statistic. Let \(n=\sum _{1\leq i \leq m} n_{i}\) be the total number of cells in P and \({\sigma ^{2}_{p}}\) be the pooled variance of m density peaks. Then we compute Bartlett’s statistic as follows:
$$ B(c) = \frac{(nm)\: \ln\left({\sigma_{p}^{2}}\right)  \sum_{i=1}^{m} (n_{i}1)\: \ln\left({\sigma_{i}^{2}}\right)}{1 + \frac{1}{3(m1)} \left(\sum_{i=1}^{m} \frac{1}{n_{i}1}  \frac{1}{nm}\right)}. $$(2)This statistic B(c) is specific to the cofactor c used to transform the data and measures the degree of homogeneity across all 1D clusters in the transformed channel z ^{′}.
Finding a cofactor for optimum VS
The optimum variancestabilizing cofactor c ^{∗} is a cofactor giving the minimum value of Bartlett’s statistic:
Minimizing Eq. 2 is a nontrivial optimization problem because Bartlett’s test statistic B(c) depends indirectly on the cofactor c and is not differentiable with respect to c. This prevents us from applying optimization methods from the gradientdescent family. Therefore, we employ a piecewise minimization without derivatives [40]. Let c _{ low } and c _{ high } be the lowest and highest possible values of the cofactor on a logarithmic scale. By default, we set c _{ low }=−2 and c _{ high }=10, i.e., the lowest and highest values of cofactor is exp(−2)∼0.135 and exp(10)∼22026, respectively. Users can also supply these extreme values. We assume that the optimum cofactor lies in the range [ exp(c _{ low }), exp(c _{ high })]. Then the optimization procedure works as follows:

(a) We divide the interval [ c _{ low }, c _{ high }] into k=(c _{ high }−c _{ low }) equal regions where the ith region is defined by the interval [c _{ i },c _{ i+1}] and c _{ i+1}−c _{ i }=1.

(b) For the ith interval, we look for a cofactor in the range [ exp(c _{ i }), exp(c _{ i+1})] with minimum Bartlett’s statistic. For each cofactor, we compute the Bartlett’s statistic with the steps described in Sec. 3. For faster convergence, we call the optimize function from the stats package in R, which uses a combination of golden section search and successive parabolic interpolation [41]. Interested readers might see the R documentation for a detailed description of the function. Let \(c^{*}_{i}\) be the optimum cofactor in the ith interval with the associated Bartlett’s statistic \(B(c^{*}_{i})\).

(c) We identify the overall optimum cofactor c ^{∗} as follows:
$$ c^{*} = \arg\!\min_{i=1}^{k} B(c^{*}_{i}). $$(4)
Equation 4 provides an approximate solution to Eq. 3. Since we divided the search space into smaller intervals, the probability of having multiple local optima in an interval is small. Hence, the procedure described above is expected to return a variance stabilizing cofactor. After we obtain the optimum cofactor c ^{∗}, channel z in each sample is transformed by asinh(z _{ j }/c ^{∗}) and used in subsequent analysis.
Results
Data sets
We demonstrate the use of flowVS and other related methods by using a healthysubject data set from Purdue University (HD) and publicly available immune tolerance network (ITN) data. The original HD data set consists of 65 samples from five healthy individuals who donated blood on different days [42]. Here, for simplicity, we used a smaller subset of the HD data set consisting of 12 samples from three healthy individuals, “A”, “C”, and “D”. From each individual, we keep samples from two (randomly selected) days and two technical replicates from each day. Each HD sample was stained using labeled antibodies against CD45, CD3, CD4, CD8, and CD19 protein markers. In this paper, an HD sample “C_4_2” means that it is collected on day 4 from individual “C” and it is the second replicate on that day. The healthy data set is part of our Bioconductor package flowVS. The ITN data set is collected from 15 patients. It includes 3 patient groups with 5 samples each. Each sample was stained using labeled antibodies against CD3, CD4, CD8, CD69 and HLADr. The ITN data set is available in the flowStats package in Bioconductor. We selected these data sets because they are available in standard R packages. Hence, the results presented here can be easily reproduced.
We identify lymphocytes in each sample of the HD and ITN datasets by using a twostep gating shown in Fig. 3. In this paper, we perform data transformation only on lymphocytes.
Stabilizing variance in the HD dataset
At first, flowVs identifies an optimum cofactor for asinh transformation for each fluorescence channel of the HD dataset. This process is performed by identifying density peaks in each channel and minimizing Bartlett’s statistic, as described in Section 3. The top row in Fig. 4 shows Bartlett’s statistic computed from density peaks of all samples after the data are transformed by asinh transformation with different cofactors. The range of values of the cofactor is selected by the automated algorithm described in Section 3. An optimum variancestabilizing cofactor is obtained where Bartlett’s statistic is a minimum. From Fig. 4, the variancestabilizing cofactors for different markers are: (a) 17,956 for CD45, (b) 5,685 for CD3, (c) 6,317 for CD4, (d) 4,937 for CD8, and (e) 5,976 for CD19. For every channel except CD45, we obtain a clear global minimum, denoting the existence of a unique variancestabilizing cofactor with respect to Bartlett’s test. For CD45, we observe a sharp decrease in Bartlett’s statistic at cofactor 17,000. Since CD45 is a common leukocyte marker, it is always expressed on lymphocytes – the subset of cells that we preselected for this study. Hence, most cells are CD45+ in our preprocessed samples, which might produce a nonconvex relationship between Bartlett’s statistic and cofactors. For the same reason, the value of Bartlett’s statistic at the optimum cofactor for CD45 is the smallest (less than 700) compared to the minimum value of Bartlett’s statistic achieved in other channels. Note that the minimum Bartlett’s statistic denotes the degree to which we are able to stabilize the withinpopulation variance of a channel considering the betweensample variations.
We transform each sample of the HD data set by the asinh function with the variancestabilizing cofactors and plot the density of the transformed channels in the bottom row of Fig. 4. In each channel, we observe that density peaks (a.k.a. onedimensional clusters) have approximately equal width across all samples, which visually confirms the homogeneity of withinpopulation variances in onedimensional clusters. When both positive and negative peaks (i.e., clusters with high or low marker expression) are present in a channel, e.g., CD3, CD4, and CD8, their variances are also approximately stabilized. Note that the density peaks may not be well aligned owing to the betweensubject variations. Aligning density peaks across samples is not an objective of flowVS, because such shifting of density might potentially eclipse biological signals present in the mean expressions of a cell populations. When necessary, data normalization can be performed after variance stabilization, as was done by Hahne et al. [32] and Finak et al. [37].
Stabilizing variance in the ITN dataset
We stabilize variance in each channel of the ITN dataset and show the results in Fig. 5. Similar to the HD data set, the top row shows Bartlett’s statistic computed from density peaks of all samples of the ITN data set after each channel is transformed by asinh transformation with different cofactor for each one. From Fig. 5, the variance stabilizing cofactors for different markers are: (a) 3.66 for CD3, (b) 25.1 for CD4, (c).75 for CD8, and (d) 0.2 for CD69. The curves showing the relationship between Bartlett’s statistic and cofactors might have multiple local minima. Nevertheless, a clear global minimum is obtained for channels in the ITN dataset. We note that the variancestabilizing cofactors for the HD data set are an order of magnitude greater than those of the ITN data set. For example, the variancestabilizing cofactor for the CD channel is 6317 in the former data set, whereas for the same channel, variance is stabilized at a cofactor of 25.1 in the latter. The primary contributing factor behind this difference is the maximum range of values in each channel. The maximum value of a fluorescence channel is 10,000 for the ITN data set, and 1,048,575 for the HD data set. Hence, variance is stabilized at higher cofactor values for the channels in the HD data set.
After identifying the optimum cofactors for each channel, we transform each sample of the ITN data set by asinh functions with the variancestabilizing cofactors and plot the density of the transformed channels in the bottom row of Fig. 5. Similar to the HD data set (Fig. 5), the density peaks have approximately equal variance across all samples, thus confirming the homogeneity of withinpopulation variances in onedimensional clusters.
Comparing flowVS with other transformation methods
We compare flowVS with three automated methods developed for transforming FC data: (a) flowTrans (b) logicle (flowCore), and (c) FCSTrans. We selected these three methods because they automatically select parameters for different transformations. As discussed earlier, flowTrans estimates the parameters of different transformations (e.g., asinh, biexponential, linlog, and BoxCox) by maximizing the likelihood of data’s being generated from normal distributions [27]. In this paper, we chose the results of flowTrans with asinh transformation because it generated relatively better segregation of populations than the other options and is directly comparable to flowVS that also uses the asinh transformation. We generate our results by calling the flowTrans function of the Bioconductor package flowTrans. Next, we select the logicle transformation implemented in the flowCore package in Bioconductor. To estimate the parameters of logicle transformation, we use the estimateLogicle function of the flowCore package. Finally, FCSTrans also uses the logicle transformation. We obtained the R source code of FCSTrans from http://sourceforge.net/projects/immportflock/files/FCSTrans.
The top row of Fig. 6 shows the densities of the transformed CD4 channel of the HD data set after the samples are transformed by four methods using their optimum parameters. From visual inspection, we observe that the logicle and FCSTrans stabilize variance of the CD4^{+} and CD4^{} populations separately. However, these two methods do not stabilize variances across CD4^{+} and CD4^{} populations. flowTrans fails to converge for six samples (all samples from subject D and day 3 samples from subject C) and uses default cofactor=1 for these samples. Hence, the peaks transformed by flowTrans are on different scales, and they are hard to compare against each other. By contrast, flowVS stabilizes variance across all peaks of CD4 channels, including CD4^{+} and CD4^{} populations. Furthermore, flowVS selects a single cofactor for a channel across all samples in a data set, whereas flowTrans selects different parameters for different samples. Thus, populations are more comparable after data are transformed by flowVS.
Next, we quantitatively compare the stability of variance across multiple transformation methods. This comparison, however, can not be performed on the actual transformed data because different transformations convert data to different scales. Hence, we convert each transformed channel z to [0,1] scale by rescaling each element z _{ i } with the following equation (z _{ i }− min{z})/(max{z}− min{z}). For each transformation, we identify the density peaks in the converted CD4 channel and plot standard deviations of density peaks against their ranks of MFI in the bottom row of Fig. 6. Here we use rank of the means, instead of actual means, to distribute the points evenly along the xaxis. We observe that all four transformations are able to eliminate the systematic dependence of variance on mean, which is typically observed in untransformed fluorescence data, such as in Fig. 1. Therefore, these transformations have inherent ability to stabilize variance, mostly owing to the properties of the underlying asinh and logicle transformations. However, flowVS is able to stabilize variance more evenly than other transformations, as can be seen in the bottom right plot in Fig. 6.
The comparison of different transformations on the ITN data set is shown in Fig. 7. As before, the top row shows the densities of the transformed CD4 channel and the bottom row plots the standard deviations of the density peaks. As with the HD data set, flowVS stabilizes variance more evenly than other methods.
Normality of the variancestabilized clusters
Bartlett’s test assumes that the cell populations are normally distributed and is sensitive to departures from normality. Density peaks (1D cell populations) in data sets that we have studied approximately follow normal distributions. This normality assumption is typical for many FC data sets as well. Hence, a VS approach based on Bartlett’s test is expected to work well for most FC data sets. For example, in Fig. 8, we show the normality of cell populations in a representative sample of the HD data set with quantilequantile plots (QQ plots) [43] of eight 1D clusters. In each QQ plot, the distribution of a 1D cluster is compared with the standard normal distribution by plotting their quantiles against each other. If a cluster is normally distributed (i.e., linearly related to the standard normal distribution), the points in the QQ plot lie approximately on a straight line. We observe that all eight QQ plots in Fig. 8 show linearity in their central parts, except for small deviations at the ends, indicating that the 1D clusters approximately follow normal distributions with heavier tails. Therefore, flowVS based on Bartlett’s statistic works well for this data.
However, if cell populations deviate significantly from normality, we could use other likelihood ratio statistic that is less sensitive to departures from normality, such as Levene’s [44] or the BrownForsythe statistic [45]. In our experiments, we found Bartlett’s approach working significantly better than Levene’s, and therefore, did not show results of the latter method.
Impact of variance stabilization in comparing cell populations
We now briefly demonstrate the impact of variance stabilization on the homogeneity of metaclusters (i.e., groups of phenotypically concordant clusters). In metacluster homogeneity evaluation, the underlying assumption is that all clusters in a metacluster represent the same cellular immunophenotype. As mentioned before, the hypothesistesting framework may not be appropriate for the described problem, since a null hypothesis claiming that all clusters in a metacluster have equal mean is essentially always false. Moreover, when the number of cells (sampling units of the test) increases, the power of a statistical test such as a ttest or an Ftest increases too. Consequently, a statistical test would inevitably detect small (i.e., statistically significant, but biologically irrelevant) differences between clusters. For example, performing a ttest with CD4^{+} cell clusters from the first and the second samples of the ITN data set, we observe pvalues less than 10^{−10} for all transformations despite the fact that the tested cell populations are biologically identical (in an immunophenotypic sense).
Therefore, we use an effect size measure rather than pvalues to illustrate the impact of the proposed algorithm on metacluster homogeneity. We employ the ratio of betweencluster variation (\({\sigma _{b}^{2}}\)) to withincluster variation (\({\sigma _{w}^{2}}\)) [42]. Consider a set of k clusters where the ith cluster containing n _{ i } cells has mean μ _{ i } and variance \({\sigma ^{2}_{i}}\). If N is the total number of cells in all clusters and μ is the combined mean then the ratio of \({\sigma _{b}^{2}}\) and \({\sigma _{w}^{2}}\) is computed as follows:
Unlike the Ftest for comparing multiple clusters, the above ratio does not depend on the sample size, and it is constructed so that the increasing homogeneity of a metacluster results in a progressively smaller value of the ratio. Table 1 shows the \(\frac {{\sigma ^{2}_{b}}}{{\sigma _{w}^{2}}}\) values of CD4^{+} metacluster and a biologically erroneous metacluster grouping CD4^{+} and CD4^{} cells. The values are computed after the data are transformed by four transformations considered in this paper. In this example, the flowVS transformation gives the best homogeneity within the CD4^{+} metacluster, and an increased heterogeneity of the CD4^{+}/CD4^{} clusters mixture. Thus application of flowVS not only results in the highest homogeneity of a set of known phenotypically identical cell clusters, but also provides the best discrimination between homogeneous and heterogeneous collection of clusters. The result demonstrates that the flowVSbased variance stabilization can help in performing comparison and alignment of phenotypically identical cell populations across different samples.
Application to microarray data
The VS approach based on optimizing Bartlett’s statistic can also be used to stabilize variance in microarray data. However, the initial steps of flowVS need to be adapted for microarrays. Assume that the expression of m genes are measured from N samples in a microarray experiment. After transforming the data by the asinh function, the mean μ _{ i } and variance \({\sigma ^{2}_{i}}\) of the i ^{th} gene g _{ i } are computed from the expressions of g _{ i } in all samples. flowVS then stabilizes the variances of the genes by transforming data using the asinh function with an optimum choice of cofactor. Unlike FC, a single cofactor is selected for all genes in microarrays.
We have applied the modified flowVS to the publicly available kidney microarray data provided by Huber et al. [18]. The kidney data report the expression of 8704 genes from two neighboring parts of a kidney tumor, using cDNA microarray technology. For different values of the cofactor, flowVS transforms the kidney data with the asinh function and identifies the optimum cofactor by minimizing Bartlett’s statistic. Figure 9 shows that a minimum value of Bartlett’s statistic is obtained when the cofactor is set to exp(6) (∼400). The optimum cofactor is then used with the asinh function to transform the kidney data.
We compare the VS performance of flowVS with two software packages, VSN by Huber et al. [18] and DDHFm by Motakis et al. [46]. Similar to flowVS, VSN uses an asinh transformation whose parameters are optimized by maximizing a likelihood function [18]. DDHFm applies a datadriven HaarFisz transformation (HFT)[46, 47] to stabilize the variance. Both VSN and DDHFm are developed for stabilizing variance in microarray data and can not be applied to FC.
In Subfig. 10, we plot the mean and standard deviation of every gene before transforming the kidney data and after transforming it by flowVS, VSN, and DDHFm. In this figure, we have applied a loess regression to obtain smooth average curves. We observe in Subfig. 10 that the standard deviation of the untransformed kidney data increases monotonically with the mean. Both VSN and flowVs approximately stabilize the variance across all genes in this data. However, the HaarFisz transformation achieves good VS properties only for genes with higher levels of expression.
To take a closer look at the transformed data by flowVS and VSN, we plot the variances of genes against the ranks of their means in Fig. 11. These figures are generated by the meanSdPlot function from the VSN package. Here, the ranks of means distribute the data evenly along the xaxis and thus make it easy to visualize the homogeneity of variances. We also show the running median estimator of standard deviation by the red lines. Both VSN and flowVS remove the meanvariance dependence because the red lines are approximately horizontal for both transformations. Hence, flowVS performs at least as well as a stateoftheart approach developed for microarray data.
Conclusions
We describe a variancestabilization framework, flowVS, that removes the meanvariance correlations observed in cell populations from FC samples. This framework transforms each fluorescence channel by the asinh function whose normalization cofactor is optimally selected by Bartlett’s likelihoodratio test. Variance homogeneity (homoskedasticity) is a desirable property for comparing populations across conditions, building metaclusters from phenotypically similar populations, and analyzing metaclusters in an ANOVA model. However, unlike the earlier approach by Hahne et al. [32], flowVS does not artificially shift populations to align them in the marker space. By stabilizing the variances, flowVS homogenizes similar cell populations and establishes the foundation of biologically meaningful metaclusters and templates.
flowVS is built on several assumptions that limit our approach. First, flowVS stabilizes variance separately in each channel. Thus it might be unable to stabilize covariances across multiple channels when they are correlated. Second, flowVS identifies 1D density peaks and evaluates the homogeneity of populations by the likelihoodratio test. Therefore, this algorithm might not perform well when density peaks are not easily identifiable. Third, flowVS stabilizes variance more accurately when a number of samples are simultaneously passed to the algorithm. Hence, this approach is not suitable for normalizing a single sample or stabilizing variances of sequentially arriving samples. Finally, Bartlett’s test used in flowVS assumes that the deviation from normality is relatively modest. If data deviate significantly from normality, other likelihood ratio tests can be employed, such as Levene’s test [44] or the BrownForsythe test [45].
flowVS operates as an independent module in the FC data analysis pipeline. It does not depend on the preprocessing algorithms applied before VS nor on the postanalysis methods such as matching, metaclustering, and classification. Hence, flowVs is capable of working with most automated clustering and metaclustering algorithms developed for flow cytometry.
Abbreviations
asinh, inverse hyperbolic sine; FC, flow cytometry; ITN, immune tolerance network; MFI, mean fluorescence intensity; VS, variance stabilization.
References
Peters JM, Ansari MQ. Multiparameter flow cytometry in the diagnosis and management of acute leukemia. Arch Pathol Lab Med. 2011; 135(1):44–54.
Seder RA, Darrah PA, Roederer M. Tcell quality in memory and protection: implications for vaccine design. Nat Rev Immunol. 2008; 8(4):247–58.
Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, BaecherAllan C, McLachlan GJ, Tamayo P, Hafler DA, et al. Automated highdimensional flow cytometric data analysis. Proc Natl Acad Sci. 2009; 106(21):8519–524.
Perfetto SP, Chattopadhyay PK, Roederer M. Seventeencolour flow cytometry: unravelling the immune system. Nat Rev Immunol. 2004; 4(8):648–55.
Azad A, Rajwa B, Pothen A. Immunophenotypes of acute myeloid leukemia from flow cytometry data using templates. 2014. http://arxiv.org/abs/1403.6358.
Aghaeepour N, Finak G, Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH, et al. Critical assessment of automated flow cytometry data analysis techniques. Nat Methods. 2013; 10(3):228–38.
Shapiro HM. Practical Flow Cytometry. Hoboken: Wiley; 2005.
Aghaeepour N, Nikolic R, Hoos HH, Brinkman RR. Rapid cell population identification in flow cytometry data. Cytometry Part A. 2011; 79(1):6–13.
Dundar M, Akova F, Yerebakan HZ, Rajwa B. A nonparametric Bayesian model for joint cell clustering and cluster matching: Identification of anomalous sample phenotypes with random effects. BMC Bioinforma. 2014; 15(1):314.
Snow C. Flow cytometer electronics. Cytometry Part A. 2004; 57(2):63–9.
Novo D, Grégori G, Rajwa B. Generalized unmixing model for multispectral flow cytometry utilizing nonsquare compensation matrices. Cytometry Part A. 2013; 83(5):508–20.
Lee SX, McLachlan GJ, Pyne S. Modeling of intersample variation in flow cytometric data with the joint clustering and matching procedure. Cytometry Part A. 2016; 89A:30–43.
Azad A, Langguth J, Fang Y, Qi A, Pothen A. Identifying rare cell populations in comparative flow cytometry. Lect Notes Comput Sci. 2010; 6293:162–75.
Azad A, Pothen A. Multithreaded algorithms for matching in graphs with application to data analysis in flow cytometry. In: IEEE Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW). IEEE: 2012. p. 2494–497.
Bartlett M. The square root transformation in analysis of variance. Suppl J R Stat Soc. 1936; 3(1):68–78.
Efron B. Transformation theory: How normal is a family of distributions?. Ann Stat. 1982; 10(2):323–39.
Tibshirani R. Estimating transformations for regression via additivity and variance stabilization. J Am Stat Assoc. 1988; 83(402):394–405.
Huber W, Von Heydebreck A, Sültmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002; 18(suppl 1):96–104.
Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995; 270(5235):467–70.
Chen Y, Dougherty ER, Bittner ML. Ratiobased decisions and the quantitative analysis of cDNA microarray images. J Biomed Opt. 1997; 2(4):364–74.
Durbin BP, Hardin JS, Hawkins DM, Rocke DM. A variancestabilizing transformation for geneexpression microarray data. Bioinformatics. 2002; 18(suppl 1):105–10.
Lo K, Brinkman RR, Gottardo R. Automated gating of flow cytometry data via robust modelbased clustering. Cytometry Part A. 2008; 73(4):321–32.
Bagwell CB. Hyperlog – A flexible loglike transform for negative, zero, and positive valued data. Cytometry Part A. 2005; 64(1):34–42.
Parks DR, Roederer M, Moore WA. A new logicle display method avoids deceptive effects of logarithmic scaling for low signals and compensated data. Cytometry Part A. 2006; 69(6):541–51.
Novo D, Wood J. Flow cytometry histograms: Transformations, resolution, and display. Cytometry Part A. 2008; 73(8):685–92.
Dvorak JA, Banks SM. Modified BoxCox transform for modulating the dynamic range of flow cytometry data. Cytometry. 2005; 10(6):811–3.
Finak G, Perez JM, Weng A, Gottardo R. Optimizing transformations for automated, high throughput analysis of flow cytometry data. BMC Bioinforma. 2010; 11(1):546.
Ray S, Pyne S. A computational framework to emulate the human perspective in flow cytometric data analysis. PLoS ONE. 2012; 7(5):35693.
Bartlett MS. Properties of sufficiency and statistical tests. Proc R Soc Lond Ser A Math Phys Sci. 1937; 160(901):268–82.
Azad A, Pyne S, Pothen A. Matching phosphorylation response patterns of antigenreceptorstimulated T cells via flow cytometry. BMC Bioinforma. 2012; 13(Suppl 2):10.
Azad A, Khan A, Rajwa B, Pyne S, Pothen A. Classifying immunophenotypes with templates from flow cytometry. In: Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics (ACM BCB). ACM: 2013. p. 256.
Hahne F, Khodabakhshi AH, Bashashati A, Wong CJ, Gascoyne RD, Weng AP, SeyfertMargolis V, Bourcier K, Asare A, Lumley T, et al. Perchannel basis normalization methods for flow cytometry data. Cytometry Part A. 2010; 77(2):121–31.
Anscombe FJ. The transformation of Poisson, binomial and negativebinomial data. Biometrika. 1948; 35(3/4):246–54.
BarLev SK, Enis P. On the classical choice of variance stabilizing transformations and an application for a Poisson variate. Biometrika. 1988; 75(4):803–4.
Zhang B, Fadili JM, Starck JL. Wavelets, ridgelets, and curvelets for Poisson noise removal. IEEE Trans Image Process. 2008; 17(7):1093–1108.
Qian Y, Liu Y, Campbell J, Thomson E, Kong YM, Scheuermann RH. FCSTrans: An open source software system for fcs file conversion and data transformation. Cytometry Part A. 2012; 81(5):353–6.
Finak G, Jiang W, Krouse K, Wei C, Sanz I, Phippard D, Asare A, Rosa SC, Self S, Gottardo R. Highthroughput flow cytometry data normalization for clinical trials. Cytometry Part A. 2014; 85(3):277–86.
Maier LM, Anderson DE, De Jager PL, Wicker LS, Hafler DA. Allelic variant in CTLA4 alters T cell phosphorylation patterns. Proc Natl Acad Sci. 2007; 104(47):18607.
Hahne F, LeMeur N, Brinkman RR, Ellis B, Haaland P, Sarkar D, Spidlen J, Strain E, Gentleman R. flowCore: a Bioconductor package for high throughput flow cytometry. BMC Bioinforma. 2009; 10(1):106.
Brent RP. Algorithms for Minimization Without Derivatives. Chicago: Courier Corporation; 2013.
Kiefer J. Sequential minimax search for a maximum. Proc Am Math Soc. 1953; 4(3):502–6.
Azad A. An algorithmic pipeline for analyzing multiparametric flow cytometry data. PhD thesis, Purdue University. 2014. http://arxiv.org/abs/1501.03461.
Wilk MB, Gnanadesikan R. Probability plotting methods for the analysis of data. Biometrika. 1968; 55(1):1–17.
Levene H. Robust tests for equality of variances. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. 1960; 2:278.
Brown MB, Forsythe AB. Robust tests for the equality of variances. J Am Stat Assoc. 1974; 69(346):364–7.
Motakis E, Nason GP, Fryzlewicz P, Rutter G. Variance stabilization and normalization for onecolor microarray data using a datadriven multiscale approach. Bioinformatics. 2006; 22(20):2547–553.
Fryzlewicz P, Delouille V. A datadriven HaarFisz transform for multiscale variance stabilization. In: IEEE/SP 13th Workshop on Statistical Signal Processing. IEEE: 2005. p. 539–44.
Acknowledgements
This work is supported in part by the US Department of Energy (DEFG0213ER26135 and DEAC0205CH11231), the National Science Foundation (CCF1218916), the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under Grant Number 5R21EB015707, and an IBM Fellowship.
Availability
flowVS is available as a free package in Bioconductor (http://bioconductor.org/packages/release/bioc/html/flowVS.html). The data sets used in the paper are publicly available in flowVS and flowStats packages in Bioconductor. Results presented in the paper are reproducible and reported in flowVS package vignette.
Authors’ contributions
AA, BR, and AP participated in the discussions to conceive the study, designed the algorithms and experiments, and were involved in writing the manuscript. Additionally, AA implemented the algorithms and created the flowVS software. BR created the Purdue healthysubject data set. All three authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Ethics approval and consent to participate
All data analyzed in this study is in the public domain and have been analyzed in previous studies.
Consent for publication
Not applicable.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Azad, A., Rajwa, B. & Pothen, A. flowVS: channelspecific variance stabilization in flow cytometry. BMC Bioinformatics 17, 291 (2016). https://doi.org/10.1186/s1285901610839
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901610839
Keywords
 Variance stabilization
 Flow cytometry
 Bartlett’s test
 Microarrays