flowVS: channelspecific variance stabilization in flow cytometry
 Ariful Azad^{1}Email author,
 Bartek Rajwa^{2} and
 Alex Pothen^{3}
https://doi.org/10.1186/s1285901610839
© Azad et al. 2016
Received: 6 August 2015
Accepted: 14 May 2016
Published: 28 July 2016
Abstract
Background
Comparing phenotypes of heterogeneous cell populations from multiple biological conditions is at the heart of scientific discovery based on flow cytometry (FC). When the biological signal is measured by the average expression of a biomarker, standard statistical methods require that variance be approximately stabilized in populations to be compared. Since the mean and variance of a cell population are often correlated in fluorescencebased FC measurements, a preprocessing step is needed to stabilize the withinpopulation variances.
Results
We present a variancestabilization algorithm, called flowVS, that removes the meanvariance correlations from cell populations identified in each fluorescence channel. flowVS transforms each channel from all samples of a data set by the inverse hyperbolic sine (asinh) transformation. For each channel, the parameters of the transformation are optimally selected by Bartlett’s likelihoodratio test so that the populations attain homogeneous variances. The optimum parameters are then used to transform the corresponding channels in every sample. flowVS is therefore an explicit variancestabilization method that stabilizes withinpopulation variances in each channel by evaluating the homoskedasticity of clusters with a likelihoodratio test.
With two publicly available datasets, we show that flowVS removes the meanvariance dependence from raw FC data and makes the withinpopulation variance relatively homogeneous. We demonstrate that alternative transformation techniques such as flowTrans, flowScape, logicle, and FCSTrans might not stabilize variance. Besides flow cytometry, flowVS can also be applied to stabilize variance in microarray data. With a publicly available data set we demonstrate that flowVS performs as well as the VSN software, a stateoftheart approach developed for microarrays.
Conclusions
The homogeneity of variance in cell populations across FC samples is desirable when extracting features uniformly and comparing cell populations with different levels of marker expressions. The newly developed flowVS algorithm solves the variancestabilization problem in FC and microarrays by optimally transforming data with the help of Bartlett’s likelihoodratio test. On two publicly available FC datasets, flowVS stabilizes withinpopulation variances more evenly than the available transformation and normalization techniques. flowVSbased variance stabilization can help in performing comparison and alignment of phenotypically identical cell populations across different samples. flowVS and the datasets used in this paper are publicly available in Bioconductor.
Keywords
Background
We describe an algorithm that transforms a collection of flow cytometry (FC) samples in order to stabilize the variance within cell populations in each fluorescence channel for the entire collection of samples. This transformation enables cell populations (clusters of cells with similar phenotypes) with homogeneous variances to be easily compared with each other by standard statistical methods. Betweenpopulation comparisons are important in detecting changes in populations across biological conditions, which might help us to diagnose diseases, develop new drugs, and understand the immune system in general [1–5]. Hence, our variancestabilization algorithm could play a supporting role in automating biological discovery based on flow cytometry and similar imaging technologies.
FC technology measures morphology (from light scattering) and the expression of multiple biomarkers (from fluorescence emission of fluorophores attached to antibodies) at the singlecell level. An FC sample consists of hundreds of thousands or more of such singlecell measurements, and a study could consist of thousands of samples from different individuals at different time points under different experimental conditions [6, 7].
In order to demonstrate the flowVS results we evaluate the pre and postprocessing cluster homogeneity, and quantify the improvement offered by our approach. We report the results using a simple measure of effect size, rather than through a hypothesistesting framework. As an example, consider the population registration problem in which corresponding cell clusters from multiple FC samples are identified based on the average levels of markers expressed by the clusters [3, 12–14]. The clusters of cells representing the same immunophenotype identified in multiple samples are represented by a hypothetical metacluster (a biological generalization of the particular immunophenotype, observed across multiple samples). The existence of metaclusters is typically assessed by hemopathologists or other skilled FC operators on the basis of their experience and knowledge of previous examples of normal and aberrant immunophenotypes. The biological hypothesis behind assigning a cluster to a metacluster can be formulated as “all clusters in a metacluster represent the same cell type (immunophenotype).” However, translating this hypothesis to a null stating “all clusters in a metacluster have equal mean” and using a traditional hypothesistesting framework accompanied by pvalues may not be appropriate. First, we know that such a null hypothesis is unrealistic: biological variability, technical variability of blood or bonemarrow sample measurements, and random effects associated with the biochemistry of antibody binding will certainly produce clusters of differing means. Second, a hypothesistesting framework addresses only the question of whether the clusters have the same location, but it is not designed to measure the magnitude of the difference or lack of homogeneity within a postulated metacluster. Finally, the pvalues are affected by both cluster size and metacluster homogeneity. Thus, the pvalues obtained would not be comparable for various metaclusters or different clusters within metaclusters.
Variance stabilization (VS) is a process for dissociating data variability from mean signals [15–18]. Other fluorescencebased technologies such as the microarrays stabilize variance by data transformation [18–21]. However, unlike microarray data, explicit VS is not usually performed in FC data analysis. Traditionally, FC data are transformed with nonlinear functions to project cell populations with normally distributed clusters – a choice that usually simplifies subsequent visual analysis [22–27]. Recently, Finak et al. [27] used the maximumlikelihood approach to explicitly satisfy normality of the cell populations. Ray et al. [28] transformed each channel with the asinh function whose parameters are optimally selected by the JarqueBera test of normality (a goodnessoffit test of whether sample data have the skewness and kurtosis matching a normal distribution). While these transformations approximately normalize FC data, they might not stabilize variance, as may be seen in Figs. 6 and 7.
The VS problem in FC, however, cannot be solved directly by applying mature VS techniques from the microarray literature. In microarrays, each gene is measured multiple times (possibly under multiple conditions) and the betweensample variance for each gene is stabilized with respect to the average expression of the gene across samples. By contrast, variance is defined by withinpopulation celltocell variation in FC, and this withinpopulation variance is stabilized with respect to the average expression of markers within each population. These contrasting objectives prevent us from applying VS methods from microarray literature directly to flow cytometry.
We address the need for explicit VS in FC with a maximum likelihood (ML)based method, called flowVS, which is built on top of a commonly used inverse hyperbolic sine (asinh) transformation. The choice of asinh function is motivated by its success as a variance stabilizer for microarray data [18, 21]. flowVS stabilizes the withinpopulation variances separately for each fluorescence channel z across a collection of N samples. After transforming z by asinh(z/c), where c is a normalization cofactor, flowVS identifies onedimensional clusters (density peaks) in the transformed channel. Assume that a total of m 1D clusters are identified from N samples with the ith cluster having variance \({\sigma ^{2}_{i}}\). Then the asinh(z/c) transformation works as a variance stabilizer if the variances of the 1D clusters are approximately equal, i.e., \({\sigma ^{2}_{1}} \sim {\sigma ^{2}_{2}}\sim \ldots \sim {\sigma ^{2}_{m}}\). To evaluate the homogeneity of variance (also known as homoskedasticity), we use Bartlett’s likelihoodratio test [29]. From a wide range of cofactors, our algorithm selects one that minimizes Bartlett’s test statistic, resulting in a transformation with the best possible VS. Note that, in contrast to other transformation approaches, our algorithm applies the same transformation to corresponding channels in every sample. flowVS is therefore an explicit VS method that stabilizes withinpopulation variances in each channel by evaluating the homoskedasticity of clusters with a likelihoodratio test.
Using a healthysubject data set from Purdue and publicly available immune tolerance network (ITN) data, we demonstrate that flowVS removes the meanvariance dependence from raw FC data and makes the withinpopulation variance relatively homogeneous. We demonstrate that alternative transformation techniques might not stabilize variance. Variance homogeneity is especially useful to build metaclusters from a collection of phenotypically similar cell populations across samples [3, 27, 30, 31]. Previous studies (Hahne et al. [32], for example) shifted the distribution of each fluorescence channel to ensure homogeneity in metaclusters, but such shifting might hide useful biological signals present in the MFIs of cell populations. By contrast, we can build homogeneous metaclusters from variancestabilized populations without removing the differences in their MFIs. Hence, flowVS could provide additional flexibility in processing and analyzing a large collection of FC samples.
Related work
VS has been a widely studied topic in applied statistics for its central role in making heteroskedastic data easily tractable by standard methods. Heteroskedasticity appears in various data sets mostly because the data follow a distribution with correlated mean and variance, e.g., Poisson or Gamma; there are many more examples, but these two are relevant for fluorescence. For wellknown distribution families, VS is usually performed by transforming data with an analytically chosen function f. For example, \(f(z)=\sqrt {z+3/8}\) works as a good (asymptotic) stabilizer for a random variable z following the Poisson distribution [33]. Variance stabilizers for several wellknown distribution families are described in [33, 34]. For unknown distributions, heuristic and datadriven stabilizers are often used [15–17].
However, traditional transformations are often inadequate for lowcount (photonlimited) signals [18, 35] because of unknown error patterns in fluorescence data. Past work developed ad hoc VS schemes for different types of fluorescence data. For example, in microarrays, the VS problem has been addressed by various nonlinear transformations [18–21]. Most notably, the widely used approach by Huber et al. [18] uses an asinh transformation whose parameters are selected by a maximumlikelihood estimation.
For FC data, researchers have used various nonlinear transformations, such as the logarithm, hyperlog, generalized BoxCox, and biexponential (e.g., logicle and generalized arcsinh) functions [22–27]. In past work, parameters of these transformations were adjusted in a datadriven manner to maximize the likelihood (flowTrans by Finak et al. [27]), to satisfy the normality (flowScape by Ray et al. [28]), and to comply with simulations (FCSTrans by Qian et al. [36]). flowTrans estimates transformation parameters for each sample by maximizing the likelihood of data’s being generated by a multivariatenormal distribution on the transformed scale. flowScape optimizes the normalization factor of asinh transformation by the JarqueBera test of normality. FCSTrans selects the parameters of the linear, logarithm, and logicle transformations with an extensive set of simulations. However, normalizing data may not necessarily stabilize its variance, e.g., for a Poisson variable z, \(\sqrt {z+3/8}\) is an approximate variancestabilizer, whereas z ^{2/3} is a normalizer [16]. Therefore, we consider an approach built upon the wellknown asinh transformation and estimate transformation parameters for explicitly stabilizing withinpopulation variations.
Methods
Motivation
The goal of VS in flow cytometry
The aim of VS in FC is to make withinpopulation variances of different cell populations approximately equal and thereby independent of the average marker expressed by populations. Recall that the expression of a marker is measured by the intensity of light at a particular channel of fluorescence. VS therefore stabilizes the withinpopulation fluorescence variance and makes it independent of the MFIs of the cell populations. In this paper, we refer to fluorescence channels more frequently because the nature of fluorescence emissions – not the protein expressions – dominates the meanvariance relationship in FC data. We do not stabilize variance on the scatter channels because, as pointed out by Finak et al. [27], there are few benefits to transforming forward and sidescatter channels.
Channelspecific variance stabilization
We assume that correlations among fluorescence channels due to the overlap of spectra are removed by spectral unmixing before we transform data. Even though the expression of biomarkers can still be correlated [24], we do not incorporate such correlations in VS because the nature of such correlation is difficult to model. Therefore, we assume that compensated fluorescence channels are independent and stabilize variance on each channel separately.
Selecting an optimal transformation for FC data is a nontrivial problem because the accurate error model of FC data is often unknown. In previous work, researchers have successfully used a number of functions to transform FC data, such as logarithm, asinh, BoxCox, logicle, etc. [24, 27, 28]. In our flowVS algorithm, we decided to use the asinh function to transform FC data. This choice of asinh function is motivated by its success in FC data visualization and normalization [27, 28] and in stabilizing variance in fluorescence readouts from microarray data [18, 21]. Stabilizing variance with other transformations can be performed using the same flowVS framework but is not discussed here.
In this transformation, c is called the normalization cofactor, whose value is optimally selected to stabilize withinpopulation variance in channel z. Note that in a more general form asinh transformation is expressed with three parameters, a∗asinh(b+z/c), where in addition to the cofactor c, a denotes a scaling after transformation, and b denotes a translation before transformation. We set a=1 because scaling after transformation does not affect downstream analysis and set b=0 to avoid shifting cell populations. Hence, we are left with a single parameter c whose value is estimated in order to stabilize the variance.
The flowVS algorithm
Assume that we have a collection of N FC samples. Then the objective of the flowVS algorithm is to transform each sample such that the withinpopulation variance is stabilized in each fluorescence channel across N samples. Here, we describe the algorithm for a single channel z; the process can be applied independently to other channels. First, we discuss the process of evaluating homoskedasticity of a transformed channel for a selected cofactor c by computing Bartlett’s likelihoodratio test. Then, we elaborate the process of selecting an optimum cofactor that would stabilize variance when used with asinh transformation.
Steps to compute Bartlett’s statistic on channel z for a selected cofactor c

Step1: Transforming channel z in each sample. Let z _{ j } be a vector denoting channel z in the jth sample, where 1≤j≤N. We transform z _{ j } by the asinh function: \(z^{\prime }_{j} = \operatorname {asinh} (z_{j}/c)\), where \(z^{\prime }_{j}\) is the transformed channel.

Step2: Detecting 1D density peaks (1D clusters). We estimate the density of \(z^{\prime }_{j}\) by a kernel density estimation method (the density function of stats package in R). The peaks in the density of \(z^{\prime }_{j}\) are identified as regions of high local density and significant curvature (also called landmarks in [32]). We identify highdensity regions in \(z^{\prime }_{j}\) by the curv1Filter function of the flowCore package [39] in Bioconductor. The boundaries of density peaks are identified by detecting minima between two adjacent density peaks. Here, a density peak represents a 1D cluster of cells. Let P _{ j } be the collection of all density peaks identified in \(z^{\prime }_{j}\).

Step3: Collecting density peaks from all samples. Let P be the set of density peaks collected from all samples, i.e., P=∪_{1≤j≤N } P _{ j }. Let P contain a total of m density peaks where the ith peak contains n _{ i } cells with mean μ _{ i } and variance \({\sigma ^{2}_{i}}\).

Step4: Computing Bartlett’s test statistic. Let \(n=\sum _{1\leq i \leq m} n_{i}\) be the total number of cells in P and \({\sigma ^{2}_{p}}\) be the pooled variance of m density peaks. Then we compute Bartlett’s statistic as follows:$$ B(c) = \frac{(nm)\: \ln\left({\sigma_{p}^{2}}\right)  \sum_{i=1}^{m} (n_{i}1)\: \ln\left({\sigma_{i}^{2}}\right)}{1 + \frac{1}{3(m1)} \left(\sum_{i=1}^{m} \frac{1}{n_{i}1}  \frac{1}{nm}\right)}. $$(2)
This statistic B(c) is specific to the cofactor c used to transform the data and measures the degree of homogeneity across all 1D clusters in the transformed channel z ^{′}.
Finding a cofactor for optimum VS

(a) We divide the interval [ c _{ low }, c _{ high }] into k=(c _{ high }−c _{ low }) equal regions where the ith region is defined by the interval [c _{ i },c _{ i+1}] and c _{ i+1}−c _{ i }=1.

(b) For the ith interval, we look for a cofactor in the range [ exp(c _{ i }), exp(c _{ i+1})] with minimum Bartlett’s statistic. For each cofactor, we compute the Bartlett’s statistic with the steps described in Sec. 3. For faster convergence, we call the optimize function from the stats package in R, which uses a combination of golden section search and successive parabolic interpolation [41]. Interested readers might see the R documentation for a detailed description of the function. Let \(c^{*}_{i}\) be the optimum cofactor in the ith interval with the associated Bartlett’s statistic \(B(c^{*}_{i})\).

(c) We identify the overall optimum cofactor c ^{∗} as follows:$$ c^{*} = \arg\!\min_{i=1}^{k} B(c^{*}_{i}). $$(4)
Equation 4 provides an approximate solution to Eq. 3. Since we divided the search space into smaller intervals, the probability of having multiple local optima in an interval is small. Hence, the procedure described above is expected to return a variance stabilizing cofactor. After we obtain the optimum cofactor c ^{∗}, channel z in each sample is transformed by asinh(z _{ j }/c ^{∗}) and used in subsequent analysis.
Results
Data sets
We demonstrate the use of flowVS and other related methods by using a healthysubject data set from Purdue University (HD) and publicly available immune tolerance network (ITN) data. The original HD data set consists of 65 samples from five healthy individuals who donated blood on different days [42]. Here, for simplicity, we used a smaller subset of the HD data set consisting of 12 samples from three healthy individuals, “A”, “C”, and “D”. From each individual, we keep samples from two (randomly selected) days and two technical replicates from each day. Each HD sample was stained using labeled antibodies against CD45, CD3, CD4, CD8, and CD19 protein markers. In this paper, an HD sample “C_4_2” means that it is collected on day 4 from individual “C” and it is the second replicate on that day. The healthy data set is part of our Bioconductor package flowVS. The ITN data set is collected from 15 patients. It includes 3 patient groups with 5 samples each. Each sample was stained using labeled antibodies against CD3, CD4, CD8, CD69 and HLADr. The ITN data set is available in the flowStats package in Bioconductor. We selected these data sets because they are available in standard R packages. Hence, the results presented here can be easily reproduced.
Stabilizing variance in the HD dataset
We transform each sample of the HD data set by the asinh function with the variancestabilizing cofactors and plot the density of the transformed channels in the bottom row of Fig. 4. In each channel, we observe that density peaks (a.k.a. onedimensional clusters) have approximately equal width across all samples, which visually confirms the homogeneity of withinpopulation variances in onedimensional clusters. When both positive and negative peaks (i.e., clusters with high or low marker expression) are present in a channel, e.g., CD3, CD4, and CD8, their variances are also approximately stabilized. Note that the density peaks may not be well aligned owing to the betweensubject variations. Aligning density peaks across samples is not an objective of flowVS, because such shifting of density might potentially eclipse biological signals present in the mean expressions of a cell populations. When necessary, data normalization can be performed after variance stabilization, as was done by Hahne et al. [32] and Finak et al. [37].
Stabilizing variance in the ITN dataset
After identifying the optimum cofactors for each channel, we transform each sample of the ITN data set by asinh functions with the variancestabilizing cofactors and plot the density of the transformed channels in the bottom row of Fig. 5. Similar to the HD data set (Fig. 5), the density peaks have approximately equal variance across all samples, thus confirming the homogeneity of withinpopulation variances in onedimensional clusters.
Comparing flowVS with other transformation methods
We compare flowVS with three automated methods developed for transforming FC data: (a) flowTrans (b) logicle (flowCore), and (c) FCSTrans. We selected these three methods because they automatically select parameters for different transformations. As discussed earlier, flowTrans estimates the parameters of different transformations (e.g., asinh, biexponential, linlog, and BoxCox) by maximizing the likelihood of data’s being generated from normal distributions [27]. In this paper, we chose the results of flowTrans with asinh transformation because it generated relatively better segregation of populations than the other options and is directly comparable to flowVS that also uses the asinh transformation. We generate our results by calling the flowTrans function of the Bioconductor package flowTrans. Next, we select the logicle transformation implemented in the flowCore package in Bioconductor. To estimate the parameters of logicle transformation, we use the estimateLogicle function of the flowCore package. Finally, FCSTrans also uses the logicle transformation. We obtained the R source code of FCSTrans from http://sourceforge.net/projects/immportflock/files/FCSTrans.
Next, we quantitatively compare the stability of variance across multiple transformation methods. This comparison, however, can not be performed on the actual transformed data because different transformations convert data to different scales. Hence, we convert each transformed channel z to [0,1] scale by rescaling each element z _{ i } with the following equation (z _{ i }− min{z})/(max{z}− min{z}). For each transformation, we identify the density peaks in the converted CD4 channel and plot standard deviations of density peaks against their ranks of MFI in the bottom row of Fig. 6. Here we use rank of the means, instead of actual means, to distribute the points evenly along the xaxis. We observe that all four transformations are able to eliminate the systematic dependence of variance on mean, which is typically observed in untransformed fluorescence data, such as in Fig. 1. Therefore, these transformations have inherent ability to stabilize variance, mostly owing to the properties of the underlying asinh and logicle transformations. However, flowVS is able to stabilize variance more evenly than other transformations, as can be seen in the bottom right plot in Fig. 6.
Normality of the variancestabilized clusters
However, if cell populations deviate significantly from normality, we could use other likelihood ratio statistic that is less sensitive to departures from normality, such as Levene’s [44] or the BrownForsythe statistic [45]. In our experiments, we found Bartlett’s approach working significantly better than Levene’s, and therefore, did not show results of the latter method.
Impact of variance stabilization in comparing cell populations
We now briefly demonstrate the impact of variance stabilization on the homogeneity of metaclusters (i.e., groups of phenotypically concordant clusters). In metacluster homogeneity evaluation, the underlying assumption is that all clusters in a metacluster represent the same cellular immunophenotype. As mentioned before, the hypothesistesting framework may not be appropriate for the described problem, since a null hypothesis claiming that all clusters in a metacluster have equal mean is essentially always false. Moreover, when the number of cells (sampling units of the test) increases, the power of a statistical test such as a ttest or an Ftest increases too. Consequently, a statistical test would inevitably detect small (i.e., statistically significant, but biologically irrelevant) differences between clusters. For example, performing a ttest with CD4^{+} cell clusters from the first and the second samples of the ITN data set, we observe pvalues less than 10^{−10} for all transformations despite the fact that the tested cell populations are biologically identical (in an immunophenotypic sense).
The ratio of betweencluster to withincluster variations (a measure of effect size of metacluster homogeneity defined in Eq. 5) after four transformations
Cell populations  dataset  flowTrans  logicle  FCSTrans  flowVS 

CD4^{+} metacluster  HD  2.8  13.70  13.42  .87 
ITN  1.3  1.55  1.54  .61  
erronous CD4^{}/CD4^{+} metacluster  HD  24.40  21.63  12.64  36.77 
ITN  11.68  11.30  10.38  26.47 
Application to microarray data
The VS approach based on optimizing Bartlett’s statistic can also be used to stabilize variance in microarray data. However, the initial steps of flowVS need to be adapted for microarrays. Assume that the expression of m genes are measured from N samples in a microarray experiment. After transforming the data by the asinh function, the mean μ _{ i } and variance \({\sigma ^{2}_{i}}\) of the i ^{ t h } gene g _{ i } are computed from the expressions of g _{ i } in all samples. flowVS then stabilizes the variances of the genes by transforming data using the asinh function with an optimum choice of cofactor. Unlike FC, a single cofactor is selected for all genes in microarrays.
We compare the VS performance of flowVS with two software packages, VSN by Huber et al. [18] and DDHFm by Motakis et al. [46]. Similar to flowVS, VSN uses an asinh transformation whose parameters are optimized by maximizing a likelihood function [18]. DDHFm applies a datadriven HaarFisz transformation (HFT)[46, 47] to stabilize the variance. Both VSN and DDHFm are developed for stabilizing variance in microarray data and can not be applied to FC.
Conclusions
We describe a variancestabilization framework, flowVS, that removes the meanvariance correlations observed in cell populations from FC samples. This framework transforms each fluorescence channel by the asinh function whose normalization cofactor is optimally selected by Bartlett’s likelihoodratio test. Variance homogeneity (homoskedasticity) is a desirable property for comparing populations across conditions, building metaclusters from phenotypically similar populations, and analyzing metaclusters in an ANOVA model. However, unlike the earlier approach by Hahne et al. [32], flowVS does not artificially shift populations to align them in the marker space. By stabilizing the variances, flowVS homogenizes similar cell populations and establishes the foundation of biologically meaningful metaclusters and templates.
flowVS is built on several assumptions that limit our approach. First, flowVS stabilizes variance separately in each channel. Thus it might be unable to stabilize covariances across multiple channels when they are correlated. Second, flowVS identifies 1D density peaks and evaluates the homogeneity of populations by the likelihoodratio test. Therefore, this algorithm might not perform well when density peaks are not easily identifiable. Third, flowVS stabilizes variance more accurately when a number of samples are simultaneously passed to the algorithm. Hence, this approach is not suitable for normalizing a single sample or stabilizing variances of sequentially arriving samples. Finally, Bartlett’s test used in flowVS assumes that the deviation from normality is relatively modest. If data deviate significantly from normality, other likelihood ratio tests can be employed, such as Levene’s test [44] or the BrownForsythe test [45].
flowVS operates as an independent module in the FC data analysis pipeline. It does not depend on the preprocessing algorithms applied before VS nor on the postanalysis methods such as matching, metaclustering, and classification. Hence, flowVs is capable of working with most automated clustering and metaclustering algorithms developed for flow cytometry.
Abbreviations
asinh, inverse hyperbolic sine; FC, flow cytometry; ITN, immune tolerance network; MFI, mean fluorescence intensity; VS, variance stabilization.
Declarations
Acknowledgements
This work is supported in part by the US Department of Energy (DEFG0213ER26135 and DEAC0205CH11231), the National Science Foundation (CCF1218916), the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under Grant Number 5R21EB015707, and an IBM Fellowship.
Availability
flowVS is available as a free package in Bioconductor (http://bioconductor.org/packages/release/bioc/html/flowVS.html). The data sets used in the paper are publicly available in flowVS and flowStats packages in Bioconductor. Results presented in the paper are reproducible and reported in flowVS package vignette.
Authors’ contributions
AA, BR, and AP participated in the discussions to conceive the study, designed the algorithms and experiments, and were involved in writing the manuscript. Additionally, AA implemented the algorithms and created the flowVS software. BR created the Purdue healthysubject data set. All three authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Ethics approval and consent to participate
All data analyzed in this study is in the public domain and have been analyzed in previous studies.
Consent for publication
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Peters JM, Ansari MQ. Multiparameter flow cytometry in the diagnosis and management of acute leukemia. Arch Pathol Lab Med. 2011; 135(1):44–54.PubMedGoogle Scholar
 Seder RA, Darrah PA, Roederer M. Tcell quality in memory and protection: implications for vaccine design. Nat Rev Immunol. 2008; 8(4):247–58.View ArticlePubMedGoogle Scholar
 Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, BaecherAllan C, McLachlan GJ, Tamayo P, Hafler DA, et al. Automated highdimensional flow cytometric data analysis. Proc Natl Acad Sci. 2009; 106(21):8519–524.View ArticlePubMedPubMed CentralGoogle Scholar
 Perfetto SP, Chattopadhyay PK, Roederer M. Seventeencolour flow cytometry: unravelling the immune system. Nat Rev Immunol. 2004; 4(8):648–55.View ArticlePubMedGoogle Scholar
 Azad A, Rajwa B, Pothen A. Immunophenotypes of acute myeloid leukemia from flow cytometry data using templates. 2014. http://arxiv.org/abs/1403.6358.
 Aghaeepour N, Finak G, Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH, et al. Critical assessment of automated flow cytometry data analysis techniques. Nat Methods. 2013; 10(3):228–38.View ArticlePubMedPubMed CentralGoogle Scholar
 Shapiro HM. Practical Flow Cytometry. Hoboken: Wiley; 2005.Google Scholar
 Aghaeepour N, Nikolic R, Hoos HH, Brinkman RR. Rapid cell population identification in flow cytometry data. Cytometry Part A. 2011; 79(1):6–13.View ArticleGoogle Scholar
 Dundar M, Akova F, Yerebakan HZ, Rajwa B. A nonparametric Bayesian model for joint cell clustering and cluster matching: Identification of anomalous sample phenotypes with random effects. BMC Bioinforma. 2014; 15(1):314.View ArticleGoogle Scholar
 Snow C. Flow cytometer electronics. Cytometry Part A. 2004; 57(2):63–9.View ArticleGoogle Scholar
 Novo D, Grégori G, Rajwa B. Generalized unmixing model for multispectral flow cytometry utilizing nonsquare compensation matrices. Cytometry Part A. 2013; 83(5):508–20.View ArticleGoogle Scholar
 Lee SX, McLachlan GJ, Pyne S. Modeling of intersample variation in flow cytometric data with the joint clustering and matching procedure. Cytometry Part A. 2016; 89A:30–43.View ArticleGoogle Scholar
 Azad A, Langguth J, Fang Y, Qi A, Pothen A. Identifying rare cell populations in comparative flow cytometry. Lect Notes Comput Sci. 2010; 6293:162–75.View ArticleGoogle Scholar
 Azad A, Pothen A. Multithreaded algorithms for matching in graphs with application to data analysis in flow cytometry. In: IEEE Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW). IEEE: 2012. p. 2494–497.Google Scholar
 Bartlett M. The square root transformation in analysis of variance. Suppl J R Stat Soc. 1936; 3(1):68–78.View ArticleGoogle Scholar
 Efron B. Transformation theory: How normal is a family of distributions?. Ann Stat. 1982; 10(2):323–39.View ArticleGoogle Scholar
 Tibshirani R. Estimating transformations for regression via additivity and variance stabilization. J Am Stat Assoc. 1988; 83(402):394–405.View ArticleGoogle Scholar
 Huber W, Von Heydebreck A, Sültmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002; 18(suppl 1):96–104.View ArticleGoogle Scholar
 Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995; 270(5235):467–70.View ArticlePubMedGoogle Scholar
 Chen Y, Dougherty ER, Bittner ML. Ratiobased decisions and the quantitative analysis of cDNA microarray images. J Biomed Opt. 1997; 2(4):364–74.View ArticlePubMedGoogle Scholar
 Durbin BP, Hardin JS, Hawkins DM, Rocke DM. A variancestabilizing transformation for geneexpression microarray data. Bioinformatics. 2002; 18(suppl 1):105–10.View ArticleGoogle Scholar
 Lo K, Brinkman RR, Gottardo R. Automated gating of flow cytometry data via robust modelbased clustering. Cytometry Part A. 2008; 73(4):321–32.View ArticleGoogle Scholar
 Bagwell CB. Hyperlog – A flexible loglike transform for negative, zero, and positive valued data. Cytometry Part A. 2005; 64(1):34–42.View ArticleGoogle Scholar
 Parks DR, Roederer M, Moore WA. A new logicle display method avoids deceptive effects of logarithmic scaling for low signals and compensated data. Cytometry Part A. 2006; 69(6):541–51.View ArticleGoogle Scholar
 Novo D, Wood J. Flow cytometry histograms: Transformations, resolution, and display. Cytometry Part A. 2008; 73(8):685–92.View ArticleGoogle Scholar
 Dvorak JA, Banks SM. Modified BoxCox transform for modulating the dynamic range of flow cytometry data. Cytometry. 2005; 10(6):811–3.View ArticleGoogle Scholar
 Finak G, Perez JM, Weng A, Gottardo R. Optimizing transformations for automated, high throughput analysis of flow cytometry data. BMC Bioinforma. 2010; 11(1):546.View ArticleGoogle Scholar
 Ray S, Pyne S. A computational framework to emulate the human perspective in flow cytometric data analysis. PLoS ONE. 2012; 7(5):35693.View ArticleGoogle Scholar
 Bartlett MS. Properties of sufficiency and statistical tests. Proc R Soc Lond Ser A Math Phys Sci. 1937; 160(901):268–82.View ArticleGoogle Scholar
 Azad A, Pyne S, Pothen A. Matching phosphorylation response patterns of antigenreceptorstimulated T cells via flow cytometry. BMC Bioinforma. 2012; 13(Suppl 2):10.Google Scholar
 Azad A, Khan A, Rajwa B, Pyne S, Pothen A. Classifying immunophenotypes with templates from flow cytometry. In: Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics (ACM BCB). ACM: 2013. p. 256.Google Scholar
 Hahne F, Khodabakhshi AH, Bashashati A, Wong CJ, Gascoyne RD, Weng AP, SeyfertMargolis V, Bourcier K, Asare A, Lumley T, et al. Perchannel basis normalization methods for flow cytometry data. Cytometry Part A. 2010; 77(2):121–31.Google Scholar
 Anscombe FJ. The transformation of Poisson, binomial and negativebinomial data. Biometrika. 1948; 35(3/4):246–54.View ArticleGoogle Scholar
 BarLev SK, Enis P. On the classical choice of variance stabilizing transformations and an application for a Poisson variate. Biometrika. 1988; 75(4):803–4.View ArticleGoogle Scholar
 Zhang B, Fadili JM, Starck JL. Wavelets, ridgelets, and curvelets for Poisson noise removal. IEEE Trans Image Process. 2008; 17(7):1093–1108.View ArticlePubMedGoogle Scholar
 Qian Y, Liu Y, Campbell J, Thomson E, Kong YM, Scheuermann RH. FCSTrans: An open source software system for fcs file conversion and data transformation. Cytometry Part A. 2012; 81(5):353–6.View ArticleGoogle Scholar
 Finak G, Jiang W, Krouse K, Wei C, Sanz I, Phippard D, Asare A, Rosa SC, Self S, Gottardo R. Highthroughput flow cytometry data normalization for clinical trials. Cytometry Part A. 2014; 85(3):277–86.View ArticleGoogle Scholar
 Maier LM, Anderson DE, De Jager PL, Wicker LS, Hafler DA. Allelic variant in CTLA4 alters T cell phosphorylation patterns. Proc Natl Acad Sci. 2007; 104(47):18607.View ArticlePubMedPubMed CentralGoogle Scholar
 Hahne F, LeMeur N, Brinkman RR, Ellis B, Haaland P, Sarkar D, Spidlen J, Strain E, Gentleman R. flowCore: a Bioconductor package for high throughput flow cytometry. BMC Bioinforma. 2009; 10(1):106.View ArticleGoogle Scholar
 Brent RP. Algorithms for Minimization Without Derivatives. Chicago: Courier Corporation; 2013.Google Scholar
 Kiefer J. Sequential minimax search for a maximum. Proc Am Math Soc. 1953; 4(3):502–6.View ArticleGoogle Scholar
 Azad A. An algorithmic pipeline for analyzing multiparametric flow cytometry data. PhD thesis, Purdue University. 2014. http://arxiv.org/abs/1501.03461.
 Wilk MB, Gnanadesikan R. Probability plotting methods for the analysis of data. Biometrika. 1968; 55(1):1–17.PubMedGoogle Scholar
 Levene H. Robust tests for equality of variances. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. 1960; 2:278.Google Scholar
 Brown MB, Forsythe AB. Robust tests for the equality of variances. J Am Stat Assoc. 1974; 69(346):364–7.View ArticleGoogle Scholar
 Motakis E, Nason GP, Fryzlewicz P, Rutter G. Variance stabilization and normalization for onecolor microarray data using a datadriven multiscale approach. Bioinformatics. 2006; 22(20):2547–553.View ArticlePubMedGoogle Scholar
 Fryzlewicz P, Delouille V. A datadriven HaarFisz transform for multiscale variance stabilization. In: IEEE/SP 13th Workshop on Statistical Signal Processing. IEEE: 2005. p. 539–44.Google Scholar