flowDiv is implemented in the R language and is structured in 19 stages of processing and 11 stages of oriented decision (Fig. 1). Here we describe the rationale behind each stage in detail.

### Data read

The first step of the pipeline consists of reading and parsing preprocessed (i.e. compensated, normalized or transformed) [18] FCS data. Input may be structured either as FlowJo^{®} workspaces or, equivalently, as GatingSet R objects.

This process is a wrapper for some flowWorkspace [19] and flowCore [20] subroutines. It is intended to reduce the complexity of the overall analysis by reducing the number of required software programs to two at most. This allows a manageable and more reproducible execution of the assay.

### Gate selection

Once imported, the next action consists of the extraction of user-defined regions of interest, the gates.

Gates are regions defined by their channels and respective borders (limits) that must be provided to the algorithm. While borders are internally and automatically parsed, information about which channels to use must be defined empirically by the analyst.

This is one of the key steps of the algorithm, as it expands the data analysis to higher dimensions, allowing more than two channels to be set per analysis.

### Range definitions

For any selected channel, a histogram is generated with equal numbers of bins. First, the channel ranges and bin width must be outlined.

The ranges within which channels will be binned can be defined either by the relative maximum and minimum values of the pooled set of channels (dynamic ranges), or by setting absolute limits for each channel separately (fixed ranges).

Fixed ranges define static limits for the histograms, producing a global model for comparative analyses between different runs of the algorithm. Dynamic ranges, on the other hand, mean that only the limits spanned by the data are considered in the binning process, maximizing the information gain in the analysis.

### Normalization

To fit specific scenarios where the data include any control standards (e.g., beads) but are acquired under different protocol guidelines – namely for scenarios where the operator accounts for changes in the data while controlling for the variance – we provide an approach to set the data to a common perspective through a translational transformation of the data (termed, in our pipeline, normalization).

Formally, in each vector *v*=(*a*_{1},*a*_{2},...,*a*_{n}), representing the channels features of a particular cytogram, we apply a transformation *T*, such as:

$$ T(v) = \left(a_{1}+\triangle b_{1}, a_{2}+\triangle b_{2},\ldots,a_{n}+\triangle b_{n}\right) $$

(1)

Where *b*=(△*b*_{1},△*b*_{2},…,△*b*_{n}) represents the displacement coordinates for each point. Here, *b* is the vector of the difference computed between the mean bead values of each channel and a grand mean, calculated from the pooled mean bead values for each channel of all cytograms in the set, such as:

$$ \triangle b_{ij} = \frac{\sum_{1}^{j} \overline{w}_{ij}}{n}-\overline{w}_{ij} $$

(2)

Where \(\overline {w}_{ij}\) is the representation of the arithmetic mean of bead values from channel *i* of cytogram *j*, and *n* corresponds to the absolute number of samples (cytograms).

Following translation, flowDiv runs a variance stabilization of the data based on the approach implemented by Azada et al. (2015) in the flowVS package [21].Briefly, these steps proceed to an inverse hyperbolic sine (asinh) transformation of data with the form:

$$ T(v_{i}) = asinh (v_{i}/c_{i}) $$

(3)

Where *c*_{i} equals a normalization factor, calculated for each channel *i* individually [21].

### Binning

After the ranges are defined and the data centralized, the algorithm proceeds to data binning: here, the analyst will be asked how many bins should be used in the histogram construction.

In view of the innate high variability of natural environments, it is not reasonable to define a basic number of bins that represent any kind of data. Binning should be changeable, according to the nature of the data at hand. To deal with this, we have implemented a subroutine for inferring the optimum number of bins, which is based on the Freedman-Diaconis rule [22]:

$$ bins_{ij} = \left\lceil {\frac{max(x_{ij}) - min(x_{ij})}{2 \cdot IQR(x_{ij}) \cdot n_{j}^{\frac{-1}{3}}} } \right\rceil $$

(4)

Where *b**i**n**s*_{ij} represents the ceiling number of bins for channel *i* of sample *j*; *n* is the number of observations for the sample *j*; *IQR* stands for interquartile range and *x*_{ij} is the channel vector *i* of sample *j*.

The optimum number of bins, *b**i**n**s*_{b}, is calculated simply from the arithmetic mean of all suggested bins pooled, as follows:

$$ bins_{b} = \frac{\sum_{1}^{i} \sum_{1}^{j} bins_{ij}}{max(i)\cdot max(j)} $$

(5)

### Contingency tables

The binning process results in the creation of common, mutually exclusive, exhaustive and ordered classes (bins), which are then cross-tabulated and used to construct an *n*-dimensional contingency table *S* in the form:

$$ S = \{ x_{i_{k}} \mid i = 1,2,\ldots, m \quad \text{and} \quad k=1,2,\ldots,n \} $$

(6)

Where \(x_{i_{k}}\) corresponds to the number of counts for bin *i* of channel *k*.

### Vectorization

Each *n*-dimensional contingency table is further linearly transformed to column vectors, in a process known as vectorization, creating a one-to-one correspondence between elements of the multidimensional space and elements of its transformed form, as follows:

$$ V_{j}=vec(S_{j}) = \{ x_{1_{1}},\ldots,x_{1_{2}},\ldots,x_{i_{k}} \} $$

(7)

The rationale behind this step is to make the data more manageable for subsequent manipulation, by reducing the data dimensionality while keeping the information unchanged.

### Volume correction

In some circumstances, environmental samples are previously diluted before running a flow cytometer experiment: such dilutions may occur as a direct consequence of stain, fixative or beads addition, or as a requirement to keep event counting within a protocol-specified range [2].

All of these situations must be appropriately considered in the final calculations, in order to correctly determine the real frequency of any targeted event. In our pipeline, we deal with dilution bias by applying a user-defined correction factor to each individual sample, such as:

$$ F = W \cdot D_{cf} $$

(8)

Where *W* is an *n*x*j* matrix composed of all column vectors *V*_{j}, and *D*_{cf} is a diagonal matrix in which element *d*_{ij} corresponds to the ratio between the minimum true volume passed (i.e., the real volume analyzed, considered after correcting for dilutions of any nature) of all samples pooled and the true volume passed for sample *j*. The minimum value is chosen to downweight any background noise generated in relatively long runs.

### Diversity analysis

After vectorization, each cytogram is further used to derive three measures of biological diversity: *α*-diversity, species evenness, and *β*-diversity.

To make these steps as feasible and adjustable as possible, we take advantage of another important suite of tools available in the vegan package [23] to provide a wide range of *α* and *β* indices for calculation. By incorporating vegan::diversity() and vegan::betadiver() functions in its workflow, flowDiv allows analysts to manage, in addition to one evenness index (Pielou’s index), three different indices of *α* diversity (Shannon-Weaver, Simpson and inverse Simpson) and 24 indices of *β* diversity, as reviewed by Koleff et al. (2003)[24].

#### Nestedness and turnover

Some of the available *β* indices have particularly useful properties for FCM data analysis, as is the case for Bray-Curtis [25] semimetrics. Besides being an appropriate index for raw count data, it can also be partitioned into two very informative complementary components, nestedness and turnover.

In an abstract sense, nestedness and turnover correspond, respectively, to AND and XOR relationships between two sets of bins (e.g., Baselga, 2009 [26]). In the present context, these two components serve as convenient proxies to detail how the differences in cytograms might be partitioned between bin superposition (nestedness) or bin differential counting (turnover).

Because of their clear utility, both indices are also incorporated in our pipeline, as a wrapper of the betapart:bray.part() function, and are automatically called when the Bray-Curtis dissimilarity is chosen.

### Transformations

To accommodate other ecologically meaningful distance measures (see [27] and [23] for details), we have also incorporated another optional step, transformation. Internally, this process is simply a wrapper for the decostand{vegan} function.

### Ordination analysis, clusterization and mapping

Once *β*-diversity indices are acquired, the next step consists of an ordination and biplot of the results (cytograms and bins) to help in further investigations of the contributions of bins to the observed differences. Since Non-Metric Multidimensional Scaling (nMDS) has the convenient property of accommodating any (dis)similarity measure handled by flowDiv [28], we applied this technique in our pipeline.

For the purpose of keeping track of broader regions of the contingency tables while allowing further inspection of plots using traditional visual approaches, flowDiv proceeds to the clusterization of the bin ordination scores to generate a single masking image, which is further applied onto each cytogram individually. This step provides a novel and straightforward way of visually interpreting the bin ordination directly in cytograms.

For clusterization, we use the *K*-means clustering method. Briefly, the goal of *K*-means clustering is to partition *n* observations into *k* mutually exclusive clusters. More formally, *K*-means aims to minimize a squared error function *J*, such as:

$$ \underset{c}{\arg\min} \ J =\underset{c}{\arg\min} \ \sum_{i=1}^{k} \sum_{j=1}^{n} \lVert x_{ji} - \mu_{i} \rVert_{2}^{2} $$

(9)

Where ∥*x*_{ij}−*μ*_{i}∥_{2} is the Euclidean distance between a data point *x*_{j}, belonging to cluster *i*, and the cluster center *μ*_{i}. In the flowDiv context, the set of observations *x*=(*x*_{1},*x*_{2},...,*x*_{n}) represents the set of 2-dimensional real vectors, defined by each of the *n* bin ordination scores obtained in the previous step.

#### Choice of *K*

Determining the ideal number of clusters, *K*, is not a trivial task unless analysts can make some reasonable practical assumptions about the optimum number of clusters. For other situations, a data-driven process should be used, and considering these explicitly, we adopted the Calinski-Harabasz [29] criterion to guide our definition of the best number of clusters. The Calinski-Harabasz criterion, *C*, is defined as:

$$ C = \frac{n-K}{K-1} \cdot \frac{BG_{SS}}{WG_{SS}} $$

(10)

In the formula, *n* is the number of bins, *K* is the number of clusters, *W**G*_{SS} is the sum of squares within the clusters, and *B**G*_{SS} is the sum of squares between the clusters.

flowDiv tests *K* iteratively within a pragmatically defined range, from one to ten clusters, and the lowest *C* is set as a suggestion of the appropriate number of clusters.