stochprofML: stochastic profiling using maximum likelihood estimation in R

Background: Tissues are often heterogeneous in their single-cell molecular expression, and this can govern the regulation of cell fate. For the understanding of development and disease, it is important to quantify heterogeneity in a given tissue.

Results: We present the R package stochprofML which uses the maximum likelihood principle to parameterize heterogeneity from the cumulative expression of small random pools of cells. We evaluate the algorithm's performance in simulation studies and present further application opportunities.

Conclusion: Stochastic profiling offsets the necessary demixing of pooled samples with savings in experimental cost and effort and with less measurement error. It offers possibilities for parameterizing heterogeneity, estimating underlying pool compositions and detecting differences between cell populations across samples.

Supplementary information: The online version contains supplementary material available at 10.1186/s12859-021-03970-7.


Introduction: Stochastic Profiling
Tissues are built of cells, which carry their genetic information on segments of DNA, so-called genes. These genes can lead to the generation of messenger RNA (mRNA), which transports the genetic information and induces the production of proteins. Such mRNA molecules and proteins are modes of expression by which a cell reflects the presence, kind and activity of its genes. In this paper, we consider such gene expression in terms of quantities of mRNA molecules.
Gene expression is stochastic. It can differ significantly between, e.g., types of cells or tissues, and between individuals. In that case, one refers to differential gene expression. In particular, cells can be differentially expressed between healthy and sick tissue samples from the same origin. Moreover, cells can differ even within a small tissue sample, e.g. within a tumour that consists of several mutated cell populations. Mathematically, we regard two populations to be different if their mRNA counts follow different probability distributions. If there is more than one population in a tissue, we call it heterogeneous. The expression of such tissues is often described by mixture models. Detecting and parameterising heterogeneities is of utmost importance for understanding development and disease.
The amount of mRNA molecules of a gene in a tissue sample can be assessed by various techniques such as microarray measurements (Kurimoto 2006; Tietjen et al. 2003) or sequencing (Sandberg 2014; Ziegenhain et al. 2017). Measurements of single cells yield the highest possible resolution. They are best suited for identification and description of heterogeneity in large and error-free datasets. In practice, however, single-cell data often comes along with high cost, effort and technical noise (Grün et al. 2014). Instead of considering single-cell data, we analyze the cumulative gene expression of small pools of randomly selected cells (Janes et al. 2010). The pool size should be large enough to substantially reduce measurement error and cost, and at the same time small enough such that heterogeneity is still identifiable.
We developed the algorithm stochprofML to infer single-cell regulatory states from such pools (Bajikar et al. 2014). In contrast to previously existing methods, we neither require a priori knowledge about the mixing weights (such as Shen-Orr et al. 2010) nor about expression profiles (such as Erkkilä et al. 2010); unlike most bulk deconvolution methods, such as CIBERSORT (Newman et al. 2015), stochprofML does not need so-called signature matrices for the populations in order to infer population fractions.
In Bajikar et al. (2014), we applied stochprofML to measurements from human breast epithelial cells and revealed the functional relevance of the heterogeneous expression of a particular gene. In a second study, we applied the algorithm to clonal tumor spheroids of colorectal cancer (Tirier et al. 2019). Here, a single tumor cell was cultured, and after several rounds of replication, each resulting spheroid was imaged and sequenced. However, pool sizes differed between tissue samples as each spheroid contained a different number of cells ranging from less than ten to nearly 200 cells. Therefore, we extended stochprofML to be able to handle pools of different sizes.
In this work, we explain the statistical reasoning and R implementation of stochprofML (Amrhein and Fuchs 2020). In Section 2, we derive the statistical model. After a first description of the nomenclature, we introduce basic statistical descriptions of continuous univariate single-cell gene expression. The complexity of the model is increased step by step: First, we account for cell-to-cell heterogeneity through the use of mixture distributions. Then, we extend the modeling from single-cell to small-pool measurements by introducing convolutions of statistical distributions. Finally, we calculate the likelihood and present ways for model selection. Section 3 shows how the R package can be used to generate data and to infer model parameters. This is followed by simulation studies in Section 4, investigating the influence of pool sizes, differences in parameter settings and uncertainty about cell counts on the resulting parameter inference. In Section 5, we elaborate the interpretation of inferred heterogeneity. Section 6 concludes the work.

Models and software
The stochprofML algorithm aims at maximum likelihood estimation of the model parameters. Hence, we derive the likelihood functions of the parameters and show details of the estimation and its implementation. The new elements of the most recent version of the algorithm are introduced along the way.
From each tissue sample i ∈ {1, . . . , k}, we collect a pool of a known number of cells. The cells are either indexed by j ∈ {1, . . . , n} if the cell pool size is the same in all measurements, or, as possible in the latest implementation, by j_i ∈ {1, . . . , n_i} in case cell pool sizes vary between measurements. In the latter, more general case, the cell numbers are variable over the k cell pools and summarized by n = (n_1, . . . , n_k). From each sample, the gene expression of m genes is measured, indexed by g ∈ {1, . . . , m}. We assume that each cell stems from one out of T cell populations, indexed by h ∈ {1, . . . , T}. If T > 1 in the set of all cells of interest, the tissue is called heterogeneous. The notation is illustrated in Figure 1, whose table shows the index notation of (tissue) samples, single cells, populations and genes as well as observed and latent measurements.
Biologically, the different cell populations correspond to different regulatory states or, especially in the context of cancer, to different (sub-)clones. For example, there might be two populations within a considered tissue: one occupying a basal regulatory state, where the expression of genes is at a low level, and one in a second regulatory state, where genes are expressed at a higher level.
As described in Section 1, there are various technologies to measure gene expression. In case of microarray techniques (as done in previous applications of stochastic profiling, see Janes et al. 2010 and Bajikar et al. 2014), the measured gene expression is typically described in terms of continuous probability distributions. Conditioned on the cell population, stochprofML provides two choices for the distribution of the expression of one gene:

Lognormal distribution. The two parameters defining a univariate lognormal distribution LN(µ, σ²) are called log-mean µ ∈ R and log-standard deviation σ > 0. These are the mean and the standard deviation of the normally distributed random variable log(X), the natural logarithm of X. The probability density function (PDF) of X is given by

$$f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\log x - \mu)^2}{2\sigma^2}\right) \quad \text{for } x > 0.$$

A random variable X ∼ LN(µ, σ²) has expectation and variance

$$\operatorname{E}(X) = \exp\left(\mu + \frac{\sigma^2}{2}\right) \quad\text{and}\quad \operatorname{Var}(X) = \exp\left(2\mu + \sigma^2\right)\left(\exp(\sigma^2) - 1\right). \qquad (1)$$

Exponential distribution. An exponential distribution EXP(λ) is defined by the rate parameter λ > 0. The PDF is given by

$$f(x) = \lambda \exp(-\lambda x) \quad \text{for } x \geq 0.$$

A random variable X ∼ EXP(λ) has expectation and variance

$$\operatorname{E}(X) = \frac{1}{\lambda} \quad\text{and}\quad \operatorname{Var}(X) = \frac{1}{\lambda^2}.$$
In general, the lognormal distribution is an appropriate description of continuous gene expression. With its two parameters, it is more flexible than the exponential distribution. However, the lognormal distribution cannot model zero gene expression. In case of zeros in the data, one can either modify the data by adding a very small value such as 0.0001 or use the exponential distribution to model this kind of expression.
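To illustrate the two distribution choices, the following minimal Python sketch (illustrative only, not part of the R package; all parameter values are hypothetical) draws single-cell expression values from a lognormal and an exponential distribution and compares the sample means to the theoretical moments above:

```python
import math
import random

random.seed(1)

# Illustrative parameter values (hypothetical, not fitted to any dataset):
mu, sigma, lam = 2.0, 0.2, 0.5

# Single-cell expression under the two distribution choices.
lognormal_draws = [random.lognormvariate(mu, sigma) for _ in range(10_000)]
exponential_draws = [random.expovariate(lam) for _ in range(10_000)]

# Theoretical moments: E(X) = exp(mu + sigma^2 / 2) for the lognormal,
# E(X) = 1 / lam for the exponential.
print(round(sum(lognormal_draws) / 10_000, 2), round(math.exp(mu + sigma**2 / 2), 2))
print(round(sum(exponential_draws) / 10_000, 2), round(1 / lam, 2))

# A lognormal sample never contains exact zeros; zeros in real data would
# have to be shifted by a tiny offset (e.g. + 0.0001) before modeling.
print(min(lognormal_draws) > 0.0)
```

The last line illustrates why zeros require special treatment under the lognormal model.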
In case of T cell populations, we describe the expression of one gene by a stochastic mixture model. Let $(p_1, \dots, p_T)$ with $p_1 + \dots + p_T = 1$ denote the fractions of populations in the overall set of cells. stochprofML offers the following three mixture models:

Lognormal-lognormal (LN-LN):
Each population h is represented by a lognormal distribution with population-specific log-mean µ_h (different for each population h) and identical σ for all T populations. The single-cell expression X that originates from such a mixture of populations then follows $\mathrm{LN}(\mu_h, \sigma^2)$ with probability $p_h$ for $h \in \{1, \dots, T\}$.

Relaxed lognormal-lognormal (rLN-LN):
This model is similar to the LN-LN model, but each population h is represented by a lognormal distribution with a different parameter set (µ_h, σ_h). The single-cell expression X follows $\mathrm{LN}(\mu_h, \sigma_h^2)$ with probability $p_h$ for $h \in \{1, \dots, T\}$.

Exponential-lognormal (EXP-LN):
Here, one population is represented by an exponential distribution with parameter λ, and all remaining T − 1 populations are modeled by lognormal distributions analogously to LN-LN, i.e. with population-specific parameters µ_h and identical σ. The single-cell expression X then follows $\mathrm{EXP}(\lambda)$ with probability $p_1$ and $\mathrm{LN}(\mu_h, \sigma^2)$ with probability $p_h$ for $h \in \{2, \dots, T\}$.

The LN-LN model is a special case of the rLN-LN model. It assumes identical σ across all populations. Biologically, this assumption is motivated by the fact that, for the lognormal distribution, identical σ leads to an identical coefficient of variation, $\mathrm{CV}(X) = \sqrt{\operatorname{Var}(X)}/\operatorname{E}(X) = \sqrt{\exp(\sigma^2) - 1}$, even for different values of µ. In other words, the linear relationship between the mean expression and the standard deviation is maintained across cell populations in the LN-LN model. The appropriateness of the different mixture models can be discussed both biologically and in terms of statistical model choice (see Section 2.5).
Within one set of genes under consideration, we assume that the same type of model (LN-LN, rLN-LN, EXP-LN) is appropriate for all genes. The parameter values, however, may differ.
In case of T cell populations, we describe the single-cell gene expression $X^{(g)}$ for gene g by a mixture distribution with PDF

$$f^{(g)}\big(x\,\big|\,\theta^{(g)}, p\big) = \sum_{h=1}^{T} p_h\, f_h\big(x\,\big|\,\theta_h^{(g)}\big),$$

where $f_h$ with $h \in \{1, \dots, T\}$ represents the PDF of population h that can be either lognormal or exponential, and $\theta^{(g)} = \{\theta_1^{(g)}, \dots, \theta_T^{(g)}\}$ are the (not necessarily disjoint) distribution parameters of the T populations for gene g.
Example: Mixture of two populations - Part 1. We exemplify the two-population case. Here, the PDF of the mixture distribution for gene g reads

$$f^{(g)}(x) = p\, f_1^{(g)}(x) + (1-p)\, f_2^{(g)}(x),$$

where p is the probability of the first population. The univariate distributions $f_1^{(g)}$ and $f_2^{(g)}$ depend on the chosen model: In the LN-LN model, there are four unknown parameters: p, µ_1, µ_2 and σ². In the rLN-LN model, there are five unknown parameters: p, µ_1, µ_2, σ_1² and σ_2². In the EXP-LN model, there are four unknown parameters: p, λ, µ and σ².
Note that although each lognormal population has its individual σ, these σ-values remain identical across genes.
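The two-population LN-LN mixture density above can be sketched in a few lines of Python (an illustration with hypothetical parameter values, not the package's R implementation):

```python
import math

def dlnorm(x, mu, sigma):
    """Lognormal density with log-mean mu and log-standard deviation sigma."""
    if x <= 0:
        return 0.0
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma**2)) / (x * sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, p, mu1, mu2, sigma):
    """Two-population LN-LN single-cell density p*LN(mu1, s^2) + (1-p)*LN(mu2, s^2)."""
    return p * dlnorm(x, mu1, sigma) + (1 - p) * dlnorm(x, mu2, sigma)

# Hypothetical parameter values for illustration:
p, mu1, mu2, sigma = 0.2, 2.0, 0.0, 0.2

# Sanity check: the mixture density integrates to one (crude Riemann sum on (0, 50)).
step = 0.01
total = sum(mixture_pdf(i * step, p, mu1, mu2, sigma) * step for i in range(1, 5000))
print(round(total, 2))
```

The density has one mode per population (here near exp(2) and exp(0)), which is exactly the heterogeneity signal that the pooled model below must recover.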

Small-pool models of heterogeneous gene expression
stochprofML is tailored to analyze gene expression measurements of small pools of cells, beyond the analysis of standard single-cell gene expression data. In other words, the single-cell gene expression $X^{(g)}_{ij_i}$ described in Section 2.1 is assumed latent. Instead, we consider observations

$$Y_i^{(g)} = \sum_{j_i=1}^{n_i} X^{(g)}_{ij_i} \qquad (2)$$

for i = 1, . . . , k, which represent the overall gene expression of the ith cell pool for gene g. In the first version of stochprofML, pools had to be of equal size n, i.e. for each measurement $Y_i^{(g)}$ one had to extract the same number of cells from each tissue sample. This was a restrictive assumption from the experimental point of view. The recent extension of stochprofML allows each cell pool i to contain a different number n_i of cells (see also Figures 1 and 2).
The algorithm aims to estimate the single-cell population parameters despite the fact that measurements are available only in convoluted form. To that end, we derive (in Section 2.3) the likelihood function of the parameters in the convolution model (2), where we assume the gene expression of the single cells to be independent within a tissue sample. For better readability, we suppress for now the superscript (g) and introduce it again in Section 2.3.
The derivation of the distribution of $Y_i$ is described in Appendix A. The corresponding PDF $f_{n_i}(y_i\,|\,\theta, p)$ of an observation $y_i$, which represents the overall gene expression from sample i (consisting of $n_i$ cells), is given by

$$f_{n_i}(y_i\,|\,\theta, p) = \sum_{\ell_1 + \dots + \ell_T = n_i} \binom{n_i}{\ell_1, \ell_2, \dots, \ell_T}\, p_1^{\ell_1} p_2^{\ell_2} \cdots p_T^{\ell_T}\, f_{(\ell_1, \ell_2, \dots, \ell_T)}(y_i\,|\,\theta). \qquad (3)$$

[Figure: Stochastic profiling can be performed either on measurements of (A) homogeneous pool size of n cells or of (B) different pool sizes given by the cell number vector n. In both cases, the stochprofML algorithm estimates the parameters for the specified number of populations from pooled data, leading to inferred single-cell distributions for each population. Appendix B describes how this density is visualized.]
Here, $f_{(\ell_1, \ell_2, \dots, \ell_T)}(y_i\,|\,\theta)$ with $\ell_1, \dots, \ell_T \in \mathbb{N}_0$ and $\ell_1 + \dots + \ell_T = n_i$ describes the PDF of a pool of $n_i$ cells with known composition of the single populations, i.e. it is known that there are $\ell_1$ cells from population 1, $\ell_2$ cells from population 2 etc. The term $\binom{n_i}{\ell_1, \ell_2, \dots, \ell_T} p_1^{\ell_1} p_2^{\ell_2} \cdots p_T^{\ell_T}$ represents the multinomial probability of obtaining exactly this composition $(\ell_1, \dots, \ell_T)$, using the multinomial coefficient. Equation (3) sums up over all possible compositions $(\ell_1, \dots, \ell_T)$ with $\ell_1, \dots, \ell_T \in \mathbb{N}_0$ and $\ell_1 + \dots + \ell_T = n_i$. Taken together, $f_{n_i}(y_i\,|\,\theta, p)$ determines the PDF of $y_i$ with respect to each possible combination of $n_i$ cells of T populations.
Thus, the calculation of $f_{n_i}(y_i\,|\,\theta, p)$ requires knowledge of $f_{(\ell_1, \ell_2, \dots, \ell_T)}(y_i\,|\,\theta)$. The derivation of this PDF depends on the choice of the single-cell model (LN-LN, rLN-LN, or EXP-LN; see Section 2.1) that was made for $X_{ij_i}$.
LN-LN and rLN-LN:
Here, $f_{(\ell_1, \ell_2, \dots, \ell_T)}(y_i\,|\,\theta)$ is the density of a sum $Y_i = X_{i1} + \dots + X_{in_i}$, which is here the convolution of T sub-convolutions: a convolution of $\ell_1$ times $\mathrm{LN}(\mu_1, \sigma^2)$, plus a convolution of $\ell_2$ times $\mathrm{LN}(\mu_2, \sigma^2)$, and so on, up to a convolution of $\ell_T$ times $\mathrm{LN}(\mu_T, \sigma^2)$ (with population-specific $\sigma_h$ in the rLN-LN model).
There is no analytically explicit form for the convolution of lognormal random variables. Hence, $f_{(\ell_1, \dots, \ell_h, \dots, \ell_T)}$ is approximated using the method by Fenton (1960). That is, the distribution of the sum $A_1 + \dots + A_m$ of independent random variables $A_j \sim \mathrm{LN}(\mu_j, \sigma_j^2)$ is approximated by the distribution of a single lognormally distributed random variable $B \sim \mathrm{LN}(\mu_B, \sigma_B^2)$ with the same mean and variance as the sum. According to Equation (1), that means that $\mu_B$ and $\sigma_B$ are chosen such that the following equations are fulfilled:

$$\exp\Big(\mu_B + \tfrac{\sigma_B^2}{2}\Big) = \sum_{j=1}^{m} \exp\Big(\mu_j + \tfrac{\sigma_j^2}{2}\Big) \quad\text{and}\quad \exp\big(2\mu_B + \sigma_B^2\big)\big(\exp(\sigma_B^2) - 1\big) = \sum_{j=1}^{m} \exp\big(2\mu_j + \sigma_j^2\big)\big(\exp(\sigma_j^2) - 1\big).$$

That is achieved by choosing

$$\sigma_B^2 = \log\Bigg(1 + \frac{\sum_{j=1}^{m} \exp\big(2\mu_j + \sigma_j^2\big)\big(\exp(\sigma_j^2) - 1\big)}{\Big(\sum_{j=1}^{m} \exp\big(\mu_j + \tfrac{\sigma_j^2}{2}\big)\Big)^{2}}\Bigg) \quad\text{and}\quad \mu_B = \log\Bigg(\sum_{j=1}^{m} \exp\Big(\mu_j + \tfrac{\sigma_j^2}{2}\Big)\Bigg) - \frac{\sigma_B^2}{2}.$$

This approximation is implemented in the function d.sum.of.lognormals(). The overall PDF is computed through d.sum.of.mixtures.LNLN().
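The moment matching behind Fenton's approximation can be sketched as follows (an illustrative Python translation with a hypothetical pool composition and parameter values; the package itself implements this in R in d.sum.of.lognormals()):

```python
import math
import random

def fenton_params(mus, sigmas):
    """Moment-matched LN(mu_B, sigma_B^2) for a sum of independent lognormals:
    the approximating lognormal has the same mean and variance as the sum."""
    mean = sum(math.exp(m + s**2 / 2) for m, s in zip(mus, sigmas))
    var = sum(math.exp(2 * m + s**2) * (math.exp(s**2) - 1) for m, s in zip(mus, sigmas))
    sigma_b2 = math.log(1 + var / mean**2)
    return math.log(mean) - sigma_b2 / 2, math.sqrt(sigma_b2)

# Hypothetical pool: 3 cells from LN(2, 0.2^2) and 2 cells from LN(0, 0.2^2).
mus = [2.0] * 3 + [0.0] * 2
sigmas = [0.2] * 5
mu_b, sigma_b = fenton_params(mus, sigmas)

# The approximating lognormal reproduces the exact mean of the sum ...
approx_mean = math.exp(mu_b + sigma_b**2 / 2)
# ... which a Monte Carlo simulation of the true convolution confirms.
random.seed(2)
sims = [sum(random.lognormvariate(m, s) for m, s in zip(mus, sigmas)) for _ in range(20_000)]
print(round(approx_mean, 2), round(sum(sims) / len(sims), 2))
```

The approximation matches the first two moments by construction; higher moments of the true convolution are only matched approximately.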

EXP-LN:
Here, $f_{(\ell_1, \ell_2, \dots, \ell_T)}(y_i\,|\,\theta)$ is the density of a sum $Y_i = X_{i1} + \dots + X_{in_i}$ of $n_i$ independent random variables, of which some follow the exponential and the others lognormal distributions. The sum of independent exponentially distributed random variables with equal intensity parameter follows an Erlang distribution (Feldman and Valdez-Flores 2010), which is a gamma distribution with integer-valued shape parameter that represents the number of exponentially distributed summands. Thus, the PDF for the EXP-LN mixture model is the convolution of one Erlang (or gamma) distribution (namely the sum of all exponentially distributed summands) and one lognormal distribution (namely the sum of all lognormally distributed summands, again using the approximation method by Fenton 1960). The PDF of this convolution is not known in analytically explicit form but is expressed in terms of an integral that is solved numerically through the function lognormal.exp.convolution(). The overall PDF of the EXP-LN model is implemented in d.sum.of.mixtures.EXPLN().
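The numerical convolution can be sketched as follows (an illustrative Python stand-in for lognormal.exp.convolution(), using a simple trapezoidal rule and hypothetical parameter values; the lognormal part is treated as a single LN(µ, σ²) summand, as after the Fenton collapse):

```python
import math

def erlang_pdf(x, k, lam):
    """Erlang(k, lam) density: distribution of a sum of k iid EXP(lam) variables."""
    if x <= 0:
        return 0.0
    return lam**k * x ** (k - 1) * math.exp(-lam * x) / math.factorial(k - 1)

def lognorm_pdf(x, mu, sigma):
    if x <= 0:
        return 0.0
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma**2)) / (x * sigma * math.sqrt(2 * math.pi))

def expln_conv_pdf(y, k, lam, mu, sigma, steps=1000):
    """Density of (Erlang part) + (lognormal part), computed with a trapezoidal
    rule on the convolution integral f(y) = int_0^y erlang(s) lognormal(y - s) ds."""
    h = y / steps
    vals = [erlang_pdf(i * h, k, lam) * lognorm_pdf(y - i * h, mu, sigma) for i in range(steps + 1)]
    return h * (vals[0] / 2 + sum(vals[1:-1]) + vals[-1] / 2)

# Hypothetical composition: 4 exponential cells (lam = 0.5) plus one lognormal summand.
dens = [expln_conv_pdf(y, 4, 0.5, 0.0, 0.2) for y in (5.0, 9.0, 40.0)]
print(all(d > 0 for d in dens))
```

The density is largest near the mean of the sum (about 9 here) and decays in the tail, as one would expect of a proper convolution density.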
Example: Mixture of two populations - Part 2. In this example of the two-population model, let each observation consist of the same number of n = 10 cells. Then $Y_i$ is a 10-fold convolution, and the PDF (3) simplifies to

$$f_{10}(y_i\,|\,\theta, p) = \sum_{\ell=0}^{10} \binom{10}{\ell}\, p^{\ell} (1-p)^{10-\ell}\, f_{(\ell, 10-\ell)}(y_i\,|\,\theta), \qquad (4)$$

where $f_{(\ell, 10-\ell)}$ is the PDF of the sum $Y_i = X_{i1} + \dots + X_{i10}$ of ten independent random variables, of which $\ell$ stem from the first and $10 - \ell$ from the second population. This PDF depends on the particular chosen model: In the LN-LN and rLN-LN models, $f_{(\ell, 10-\ell)}(y_i\,|\,\theta)$ is approximated via Fenton's method by the density of a single lognormal distribution. In the EXP-LN model, $f_{(\ell, 10-\ell)}(y_i\,|\,\theta) = f^{\text{EXP-LN}}_{(\ell, 10-\ell)}(y_i\,|\,\lambda, \mu, \sigma^2)$ is the PDF of a sum of $\ell$ EXP(λ)-distributed and $10 - \ell$ LN(µ, σ²)-distributed random variables.
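For the LN-LN case, Equation (4) can be sketched by combining the binomial composition weights with the Fenton approximation (illustrative Python, not the package's R code; the parameter values are those of the running example):

```python
import math

def dlnorm(x, mu, sigma):
    if x <= 0:
        return 0.0
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma**2)) / (x * sigma * math.sqrt(2 * math.pi))

def fenton(mus_sigmas):
    """Moment-matched lognormal for a sum of independent lognormals."""
    mean = sum(math.exp(m + s**2 / 2) for m, s in mus_sigmas)
    var = sum(math.exp(2 * m + s**2) * (math.exp(s**2) - 1) for m, s in mus_sigmas)
    s2 = math.log(1 + var / mean**2)
    return math.log(mean) - s2 / 2, math.sqrt(s2)

def pool_pdf_lnln(y, n, p, mu1, mu2, sigma):
    """PDF of one pooled LN-LN observation: binomial mixture over the
    compositions (l cells from population 1, n - l from population 2)."""
    total = 0.0
    for l in range(n + 1):
        weight = math.comb(n, l) * p**l * (1 - p) ** (n - l)
        mu_b, s_b = fenton([(mu1, sigma)] * l + [(mu2, sigma)] * (n - l))
        total += weight * dlnorm(y, mu_b, s_b)
    return total

# Running-example values: p = 0.2, mu1 = 2, mu2 = 0, sigma = 0.2, pool size 10.
density_at_mean = pool_pdf_lnln(23.2, 10, 0.2, 2.0, 0.0, 0.2)
print(density_at_mean > 0.0)
```

Each of the eleven terms corresponds to one possible pool composition, weighted by its binomial probability.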

Likelihood function
Overall, after re-introducing the superscript (g) for measurements of genes g = 1, . . . , m, we obtain the PDF $f_{n_i}(y_i^{(g)}\,|\,\theta^{(g)}, p)$ with model-specific choice of $f_{(\ell_1, \ell_2, \dots, \ell_T)}$. While n = (n_1, . . . , n_k) is considered known, we aim to infer the unknown model parameters $\theta = \{\theta^{(1)}, \dots, \theta^{(m)}, p\}$ by maximum likelihood estimation. Assuming independent observations $y = \{y_i^{(g)} : i = 1, \dots, k;\; g = 1, \dots, m\}$, the likelihood of the model parameters is the product of the individual densities. Consequently, the log-likelihood function of the model parameters reads

$$\ell(\theta\,|\,y) = \sum_{g=1}^{m} \sum_{i=1}^{k} \log f_{n_i}\big(y_i^{(g)}\,\big|\,\theta^{(g)}, p\big). \qquad (6)$$

Example: Mixture of two populations - Part 3. Returning to the two-population example with 10-cell pools, the log-likelihood for k = 100 tissue samples and m = 5 genes is given by

$$\ell(\theta\,|\,y) = \sum_{g=1}^{5} \sum_{i=1}^{100} \log f_{10}\big(y_i^{(g)}\,\big|\,\theta^{(g)}, p\big),$$

where $f_{10}\big(y_i^{(g)}\,|\,\theta^{(g)}, p\big)$ is given by Equation (4).
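The additive structure of the log-likelihood is easy to sketch (illustrative Python; a single-lognormal density serves as a toy stand-in for the pooled-mixture PDF $f_{n_i}$, so that the sketch stays self-contained):

```python
import math
import random

def log_likelihood(density, observations, theta):
    """l(theta | y) = sum_i log f(y_i | theta) for independent observations."""
    return sum(math.log(density(y, theta)) for y in observations)

# Toy stand-in density (one lognormal); in the full model, `density`
# would be the pooled-mixture PDF f_{n_i}(y_i | theta, p).
def dlnorm(y, theta):
    mu, sigma = theta
    return math.exp(-(math.log(y) - mu) ** 2 / (2 * sigma**2)) / (y * sigma * math.sqrt(2 * math.pi))

random.seed(3)
data = [random.lognormvariate(1.0, 0.3) for _ in range(200)]

# The log-likelihood is larger at the true parameters than at distant ones:
print(log_likelihood(dlnorm, data, (1.0, 0.3)) > log_likelihood(dlnorm, data, (3.0, 0.3)))
```

Maximizing this sum over θ is exactly the task addressed in the next section.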

Maximum likelihood estimation
The stochprofML algorithm aims to infer the unknown model parameters using maximum likelihood estimation. As input, we expect an m × k data matrix of pooled gene expression, the known cell numbers n, the assumed number of populations T and the choice of the single-cell distribution (LN-LN, rLN-LN, EXP-LN). Based on this input, the algorithm aims to find parameter values of $\theta = \{\theta^{(1)}, \dots, \theta^{(m)}, p\}$ that maximize $\ell(\theta\,|\,y)$ as given by Equation (6). This section describes practical aspects of the optimization procedure.
Example: Mixture of two populations - Part 4. Several challenges occur during parameter estimation. We explain them on the two-population LN-LN example: First, we aim to ensure parameter identifiability. This is achieved for the two-population LN-LN model by constraining the parameters to fulfil either p ≤ 0.5 or µ_1 > µ_2. Otherwise, the two combinations (p, µ_1, µ_2, σ) and (1 − p, µ_2, µ_1, σ) would yield identical values of the likelihood function and could cause computational problems. For our implementation, we preferred the second possibility, i.e. µ_1 > µ_2. The alternative, i.e. requiring p ≤ 0.5, led to switching between µ_1 and µ_2 in case of p ≈ 0.5. As a second measure, we implement unconstrained rather than constrained optimization: Instead of estimating (p, µ_1, µ_2, σ) under the constraints p ∈ [0, 1], µ_1 > µ_2 and σ > 0, the parameters are transformed to (logit(p), µ_1, µ_2, log(σ)), and an unconstrained optimization method is used. This is substantially faster.
The aforementioned transformations are likewise employed for all other models (rLN-LN and EXP-LN) and population numbers. In particular, σ and λ are log-transformed, and the lognormal populations are ordered according to the log-means µ h of the first gene in the gene list. The population probabilities are transformed to R such that they still sum up to one after back-transformation. For details, see Appendix C.
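One standard choice of such transformations can be sketched as follows (illustrative Python; the log-ratio mapping for the probabilities is one common construction, while the package's exact transform is described in Appendix C):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def probs_to_reals(p):
    """Map T probabilities summing to one to T-1 unconstrained reals via
    log-ratios against the last component (one standard choice)."""
    return [math.log(p_h / p[-1]) for p_h in p[:-1]]

def reals_to_probs(x):
    """Back-transformation: the result is again a probability vector."""
    expx = [math.exp(v) for v in x] + [1.0]
    s = sum(expx)
    return [v / s for v in expx]

# Round trip: constrained -> unconstrained -> constrained.
p = [0.2, 0.3, 0.5]
print([round(v, 6) for v in reals_to_probs(probs_to_reals(p))])  # [0.2, 0.3, 0.5]
```

After such transformations, any off-the-shelf unconstrained optimizer can be applied, and the back-transformed values automatically satisfy the original constraints.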
The log-likelihood function is multimodal. Thus, a single application of some gradient-based optimization method does not suffice to find a global maximum. Instead, two approaches are combined which are alternately executed: First, a grid search is performed, where the log-likelihood function is computed at randomly drawn parameter values. This way, high likelihood regions can be identified with low computational cost. In the second step, the Nelder-Mead algorithm (Nelder and Mead 1965) is repeatedly executed. Its starting values are randomly drawn from the high likelihood regions found before. The following grid search then again explores the regions around the obtained local maxima, and so on.
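The grid-search stage of this alternating scheme can be sketched as follows (illustrative Python with a toy bimodal objective standing in for the multimodal negative log-likelihood; the local refinement step would use a Nelder-Mead routine, e.g. optim(..., method = "Nelder-Mead") in R):

```python
import random

def grid_search(objective, sampler, n_draws=500, keep=10):
    """Stage 1: evaluate the objective at randomly drawn parameter vectors and
    keep the best draws as starting regions for local optimization."""
    draws = [sampler() for _ in range(n_draws)]
    return sorted(draws, key=objective)[:keep]

# Toy objective with two local minima, standing in for the negative
# log-likelihood; the global minimum sits at x = 1.
def neg_loglik(theta):
    (x,) = theta
    return min((x - 1.0) ** 2, 0.5 + (x + 2.0) ** 2)

random.seed(4)
starts = grid_search(neg_loglik, lambda: [random.uniform(-5.0, 5.0)])

# Each start would next be refined with a local Nelder-Mead run, and the next
# grid search would then explore the neighbourhoods of the resulting optima.
print(round(starts[0][0], 1))
```

The cheap global stage prevents the expensive local stage from being wasted on low-likelihood regions.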
If a dataset contains gene expression for m genes and we assume T populations, there are at minimum T(m + 1) parameters which one seeks to estimate, depending on the model framework. This is computationally difficult because the number of modes of the log-likelihood function increases with the number of parameters. The performance of the numerical optimization crucially depends on the quality of the starting values, and a large number of restarts is required. When analyzing a large gene cluster, it is advantageous to start by considering small subclusters and to use the derived estimates as initial guesses for larger clusters. This is implemented in the function stochprof.loop() (parameter subgroups) and demonstrated in analyze.toycluster().
Approximate marginal 95% confidence intervals for the parameter estimates are obtained as follows: We numerically compute the Hessian matrix of the negative log-likelihood function on the unrestricted parameter space and evaluate it at the (transformed) maximum likelihood estimator. Denote by $d_i$ the ith diagonal element of the inverse of this matrix. Then the confidence bounds for the ith transformed parameter $\theta_i$ are

$$\hat\theta_i \pm 1.96\,\sqrt{d_i}.$$

We obtain the respective marginal confidence intervals for the original parameters by back-transformation of the above bounds. This approximation is especially appropriate in the two-population example for the parameters p and σ when conditioning on µ_1 and µ_2. In this case, in practice, the profile likelihood is seemingly unimodal.
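The construction of the intervals can be sketched on a toy model (illustrative Python; a one-dimensional normal likelihood with known variance stands in for the transformed stochprofML likelihood, so the Hessian is scalar and its exact value, the sample size n, is known):

```python
import math

def neg_loglik(mu, data, sigma=1.0):
    """Negative log-likelihood of N(mu, sigma^2) in mu; a toy stand-in for the
    transformed stochprofML log-likelihood."""
    return sum((x - mu) ** 2 / (2 * sigma**2) + 0.5 * math.log(2 * math.pi * sigma**2) for x in data)

def second_derivative(f, x, h=1e-4):
    """Hessian by central finite differences (scalar case)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

data = [0.3, -0.5, 1.2, 0.1, -0.8, 0.9, 0.4, -0.2, 0.6, -0.1]
mu_hat = sum(data) / len(data)  # maximum likelihood estimate of mu

# d = diagonal element of the inverse Hessian (here the Hessian is scalar);
# for N(mu, 1) the Hessian equals n, so d = 1/n.
d = 1.0 / second_derivative(lambda m: neg_loglik(m, data), mu_hat)
ci = (mu_hat - 1.96 * math.sqrt(d), mu_hat + 1.96 * math.sqrt(d))
print(round(ci[0], 3), round(ci[1], 3))
```

In the multivariate case, the same recipe applies with the full numerical Hessian and its matrix inverse.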

Model choice
By increasing the number T of populations, we can describe the observed data more precisely, but this comes at the cost of potential overfitting. For example, a three-population LN-LN model may lead to a larger likelihood at the maximum likelihood estimator than a two-population LN-LN model on the same dataset. However, the difference may be small, and the additional third population may not lead to a gain of knowledge: The estimated population probability $\hat p_3$ may be tiny, or the log-means of the second and third population, $\hat\mu_2$ and $\hat\mu_3$, might hardly be distinguishable from each other.
To objectively find a trade-off between necessary complexity and sufficient interpretability, we employ the Bayesian information criterion (BIC, Schwarz 1978):

$$\mathrm{BIC} = -2\,\ell(\hat\theta\,|\,y) + \dim(\theta)\,\log(k),$$

where $\hat\theta$ is the maximum likelihood estimate of the respective model, dim(θ) the number of parameters and k the size of the dataset. From the statistics perspective, the model with the smallest BIC is considered most appropriate among all considered models.
In practice, it is required to estimate all models of interest separately with the stochprofML algorithm, e. g. the LN-LN model with one, two and three populations, and/or the respective rLN-LN and EXP-LN models. The BIC values are returned by the function stochprof.loop().
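The comparison can be sketched as follows (illustrative Python; the maximized log-likelihood values are invented for illustration and do not stem from any real fit):

```python
import math

def bic(max_loglik, n_params, n_obs):
    """Bayesian information criterion: -2 * l(theta_hat) + dim(theta) * log(k)."""
    return -2.0 * max_loglik + n_params * math.log(n_obs)

# Hypothetical maximized log-likelihoods of one-, two- and three-population
# LN-LN fits to k = 50 samples of one gene (values invented for illustration):
fits = {
    "LN-LN, T=1": bic(-120.0, 2, 50),  # mu, sigma
    "LN-LN, T=2": bic(-101.0, 4, 50),  # p, mu1, mu2, sigma
    "LN-LN, T=3": bic(-100.2, 6, 50),  # p1, p2, mu1, mu2, mu3, sigma
}
best = min(fits, key=fits.get)
print(best)  # LN-LN, T=2: the slightly better T=3 fit does not justify two extra parameters
```

The penalty term dim(θ) log(k) is what prevents the ever-increasing likelihood of larger T from automatically winning.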

Usage of stochprofML
This section illustrates the usage of the stochprofML package for simulation and parameter estimation. There are two ways to use the stochprofML package: (i) The two interactive functions stochasticProfilingData() and stochasticProfilingML() provide low-level access to synthetic data generation and maximum likelihood parameter estimation without requiring advanced programming knowledge. Working as question-and-answer functions, they guide the user through entering the relevant input: the data (or a file name), the number of cells per sample, the number of genes, etc. An example of the use of the interactive functions can be found in Appendix E. (ii) The direct usage of the package's R functions allows more flexibility and is illustrated in the following.

[Figure 3: Histogram of the simulated gene, i.e. of the sum of mixtures of lognormals.]

Next, we show how the parameters used above can be back-inferred from the generated dataset using maximum likelihood estimation. When the fitting is done, pressing <enter> causes R to show plots of the estimation process (see Figure 4) and displays the results in the following form.

Simulation studies
This section demonstrates the performance of the estimation depending on pool sizes (Section 4.1), true parameter values (Section 4.2) and in case of uncertainty about pool sizes (Section 4.3). All scripts used in these studies can be found in our open GitHub repository https://github.com/fuchslab/Stochastic_Profiling_in_R.

Simulation study on optimal pool size
Stochastic profiling, i.e. the analysis of small-pool gene expression measurements, is a compromise between the analysis of single cells and the consideration of large bulks: Single-cell information is most immediate, but a fixed number k of samples will only cover k cells. In pools of cells, on the other hand, information is convoluted, but k pools of size n cover n times as much material. An obvious question is the optimal pool size n. The answer is not available in analytically closed form. We hence study this question empirically.
For this simulation study, first, we generate synthetic data for different pool sizes with identical parameter values and settings. Then, we re-infer the model parameters using the stochprofML algorithm. This is repeated 1,000 times for each choice of pool size, enabling us to study the algorithm's performance by simple summary statistics of the replicates.
The fixed settings are as follows: We use the two-population LN-LN model to generate data for one gene with p_1 = 0.2, µ_1 = 2, µ_2 = 0 and σ = 0.2. For each pool size, we simulate k = 50 observations. The pool sizes are chosen in nine different ways: In seven cases, pool sizes are identical for each sample, namely n ∈ {1, 2, 5, 10, 15, 20, 50}. In two additional cases, pool sizes are mixed, i.e. each of the k samples within one dataset represents a pool of different size n_i ∈ {1, 2, 5, 10} or n_i ∈ {10, 15, 20, 50}. Figure 5 summarizes the point estimates of the 1,000 datasets for each of the nine pool size settings. It seems that (for this particular choice of model parameter values) parameter estimation works reliably for pool sizes up to ten cells, with the smallest variance for pool sizes between one and five cells. This also applies to the mixture of small pool sizes. For cell numbers larger than ten, the range of estimated values becomes considerably larger, but without obvious bias, which also applies to the mixture of the larger pool sizes. Appendix F shows repetitions of this study for different choices of population parameters. The results there confirm the observations made here.
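The data-generating step of this study can be sketched as follows (illustrative Python, not the simulation scripts from the repository; it uses the parameter values stated above):

```python
import random

def simulate_pooled_dataset(k, pool_sizes, p, mu1, mu2, sigma, rng):
    """k pooled observations from the two-population LN-LN model; pool_sizes is
    an int (equal pools) or a length-k list (mixed pools)."""
    if isinstance(pool_sizes, int):
        pool_sizes = [pool_sizes] * k
    dataset = []
    for n_i in pool_sizes:
        cells = [rng.lognormvariate(mu1 if rng.random() < p else mu2, sigma) for _ in range(n_i)]
        dataset.append(sum(cells))  # only the pooled sum is observed
    return dataset

rng = random.Random(5)
# Settings of this study: p1 = 0.2, mu1 = 2, mu2 = 0, sigma = 0.2, k = 50.
for n in (1, 2, 5, 10, 15, 20, 50):
    y = simulate_pooled_dataset(50, n, 0.2, 2.0, 0.0, 0.2, rng)
    print(n, round(sum(y) / len(y), 1))

# A mixed-pool-size dataset, as in the two additional settings:
y_mixed = simulate_pooled_dataset(50, [1, 2, 5, 10] * 12 + [1, 2], 0.2, 2.0, 0.0, 0.2, rng)
```

Re-inferring (p, µ_1, µ_2, σ) from such datasets, 1,000 times per setting, yields the summary statistics shown in Figure 5.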
Figure 5 suggests n = 5 or varying small pool sizes as ideal choices since their estimates show smaller variance than those for the other pool sizes. This simulation study, however, has been performed in an idealized in silico setting: We did not include any measurement noise. In practice, it is well known that single cells suffer more from such noise than samples with many cells. The ideal choice of pool size may hence be larger in practice. The underlying data-generating model obviously influences the ability of the maximum likelihood estimator to re-infer the true parameter values: Values of p_1 close to 0.5, small differences between µ_1 and µ_2 and large σ blur the data and complicate parameter inference in practice.

Simulation study on impact of parameter values
In the simulation study of this section, we investigate how sensitive parameter inference is to the true parameter values and which scenarios can realistically be identified.
We use the same datasets as in the previous simulation study: The parameter choices from Section 4.1 are considered as the standard and compared to those from Appendix F. In detail, p_1 is reduced from 0.2 to 0.1 in one setting and increased to 0.4 in another, µ_2 is increased from 0 to 1, and σ is increased from 0.2 to 0.5; µ_1 is kept fixed at 2 in all settings. As before, we consider 1,000 datasets for every parameter setting and compare the resulting estimates to the true values. This was done for all pool sizes considered in Section 4.1, but here we only comment on the results of the 10-cell pools and refer to Appendix F for all other pool size settings. Figure 6 shows the results of the study. In each row of the plot, we compare the estimates for the datasets that were simulated with the standard parameters to the estimates for the datasets that were simulated with one of the parameters changed. Even if only one parameter is changed, all parameters are estimated. Each violin accumulates the estimates from 1,000 datasets. For easier comparison, each of the twelve tiles shows the standard setting as a turquoise violin, i.e. those are repeated in each row. When the parameter values are changed, they can still be estimated without obvious additional bias, but accuracy decreases for increasing p, decreasing µ_1 − µ_2 and increasing σ (with few exceptions). Results for other pool sizes (see Appendix F) show that these observations transfer to other pool sizes with some additions: Larger pool sizes infer parameters more accurately if p is smaller. In the setting with an increased first population (p = 40%), µ_1 can be inferred better if the dataset consists of smaller pools. For larger pools, the estimation of µ_1 and µ_2 works comparably well after increasing µ_2. In general, the estimation of σ is the most difficult: already in pools of ten cells with increased µ_2, σ is underestimated, and this worsens in larger pools.

Simulation study on the uncertainty of pool sizes
One key assumption of the stochprofML algorithm is that the exact number of cells in each cell pool is known. In Janes et al. (2010), accordingly, ten cells were randomly taken from each sample by experimental design. However, different experimental protocols may not reveal the exact cell number: In Tirier et al. (2019), for example, tissue samples were taken as whole cancer spheroids. Here, the cell numbers were experimentally unknown but estimated using light sheet microscopy and 3D image analysis. Since the stochprofML algorithm requires the pool sizes as input parameter, some estimate has to be passed to it. It is intuitively obvious that the better the prior knowledge about the cell pool sizes, the better the final model parameter estimate. In this simulation study, we investigate the consequences of misspecification.
In a first simulation study, we reuse the 1,000 synthetic 10-cell datasets from Section 4.1. Each of these contains 50 10-cell samples, simulated with underlying model parameters p = 0.2, µ_1 = 2, µ_2 = 0 and σ = 0.2. As before, we re-infer the population parameters using the stochprofML algorithm. This time, however, we use varying pool sizes from 5 to 15 as input parameters of the algorithm, which is a misspecification except for the true value 10. The resulting parameter estimates (empirical median and 2.5%-/97.5%-quantiles across the 1,000 datasets) are depicted in Figure 7. Estimates are optimal or at least among the best in terms of empirical bias and variance when using the correct pool size. With increasing assumed cell number, the estimates of p decrease, i.e. the fraction of cells from the higher expressed population is assumed to be smaller. This is a reasonable consequence of overestimating n, because in this case the surplus cells are assigned to the second population with lower (or even close-to-zero) expression. Consequently, the estimates of µ_2 decrease at the same time.

In a second simulation study, we use the two settings with mixed cell pool sizes as introduced in Section 4.1. One setting comprises cell pools with rather small cell numbers (single-, 2-, 5- and 10-cell samples), the other one pools with larger cell numbers (10-, 15-, 20- and 50-cell samples). For each of the two scenarios, we generate one dataset with 50 samples. We denote the true 50-dimensional pool size vectors by n_small and n_large and employ these vectors for re-estimating the model parameters p, µ_1, µ_2 and σ. Then, we estimate the parameters for the same two datasets another 1,000 times, but this time using perturbed pool size vectors as input to the algorithm, introducing artificial misspecification.
These 50-dimensional pool size vectors are generated as follows: For each component, we draw a Poisson-distributed random variable with intensity parameter equal to the respective component of the true vector n_small or n_large. Zeros are set to one, the minimum pool size. Figure 8 shows these 2 × 1,000 parameter estimates in comparison to the true parameter values and to the estimates obtained with the true size vectors n_small and n_large as input. The violins of the estimates for the smaller cell pools (based on n_small) indicate that the estimates of p and µ_1 are fairly accurate, but the estimates of µ_2 have large variance, and σ is overestimated in all 1,000 runs. This is plausible as population 1 (the one with higher log-mean gene expression) is on average present in only 20% of the cells; even when the pool sizes are misspecified, the cells of population 1 are still detectable since this population is responsible for most of the gene expression. Consequently, all remaining cells are assigned to population 2, which has lower or almost no expression. If the pool size is assumed too low, this second population will be estimated to have a higher average expression; if it is assumed too large, the second population will be estimated to have a lower expression. This leads to a broader distribution and thus an overestimation of σ.
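The perturbation scheme can be sketched as follows (illustrative Python; the pool-size vector is shortened to 20 components for brevity, and a small Knuth sampler is used because Python's random module has no built-in Poisson draw):

```python
import math
import random

def knuth_poisson(lam, rng):
    """Poisson draw via Knuth's multiplication method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def perturb_pool_sizes(true_sizes, rng):
    """Component-wise Poisson perturbation of a pool-size vector, with the
    intensity set to the true size; zeros are raised to the minimum pool size one."""
    return [max(1, knuth_poisson(n, rng)) for n in true_sizes]

rng = random.Random(6)
n_small = [1, 2, 5, 10] * 5  # shortened stand-in for the 50-dimensional vector
perturbed = perturb_pool_sizes(n_small, rng)
print(len(perturbed), min(perturbed))
```

Each of the 1,000 re-estimations uses a freshly perturbed vector of this kind as the algorithm's pool-size input.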
The results for the larger cell pools (based on n_large) show a similar pattern. In this case, however, the impact of misspecification is less visible, as confirmed by additional simulations in Appendix F. For large cell pools, the averaging effect across cells is strong anyway and in that sense more robust to misspecification. In the study here, the σ parameter is often even better estimated with a misspecified pool size vector than with the true one.
Taken together, stochprofML can be used even if exact pool sizes are unknown. In that case, the numbers should be approximated as well as possible.
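The Poisson perturbation of the pool size vectors used in the second study can be sketched in a few lines of R; `perturb_pools` is a hypothetical helper name for illustration, not a function of the stochprofML package:

```r
# Perturb a known pool size vector to mimic uncertain cell counts:
# each entry is replaced by a Poisson draw with the true size as
# intensity; zeros are set to one, the minimum possible pool size.
perturb_pools <- function(n_true) {
  n_pert <- rpois(length(n_true), lambda = n_true)
  pmax(n_pert, 1)
}

set.seed(1)
n_small <- rep(c(1, 2, 5, 10), length.out = 50)  # small-pool setting
n_pert  <- perturb_pools(n_small)
```

Repeating this perturbation 1,000 times and re-running the estimation with each perturbed vector yields the violins of Figure 8.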

Interpretation of estimated heterogeneity
We investigate what we can learn from the parameter estimates about the heterogeneous populations (Section 5.1) and about sample compositions (Section 5.2).

Comparison of inferred populations
The stochprofML algorithm estimates the parameterized single-cell distributions assumed to underlie the samples; as described in Section 2.5, we can select the most appropriate number of cell populations using the BIC. Assume we have performed this estimation for samples from two different groups, cases and controls. In practice, one may then want to know whether the inferred single-cell populations are substantially different between the two groups, e.g. in case the estimated log-means µ̂_cases and µ̂_controls are close to each other. A related question is whether the difference is biologically relevant.
We hence seek a method that can judge statistical significance and potentially reject the null hypothesis that two single-cell populations are the same, while at the same time allowing an interpretation of similarity. Direct application of Kolmogorov-Smirnov or likelihood-ratio tests to the observed data is impossible here since the single-cell data is unobserved: we only measure the overall gene expression of pools of cells. Calculating the Kullback-Leibler divergence of the two distributions would be possible; however, it is not target-oriented for our application, where we seek an interpretable measure of similarity between two population densities.
For our purposes, we use a simple and intuitive measure of similarity: the overlap of two PDFs, that is, the intersection of the areas under both PDF curves,

OVL(f, g) = ∫ min{f(x), g(x)} dx,

for two continuous one-dimensional PDFs f and g (see also Pastore and Calcagnì 2019). The overlap lies between zero and one, with zero indicating maximum dissimilarity and one implying (almost sure) equality. In our case, we are particularly interested in the overlap of two lognormal PDFs:

OVL_LN_LN <- function(mu_1, mu_2, sigma_1, sigma_2) {
  f1 <- function(x) { dlnorm(x, meanlog = mu_1, sdlog = sigma_1) }
  f2 <- function(x) { dlnorm(x, meanlog = mu_2, sdlog = sigma_2) }
  f3 <- function(x) { pmin(f1(x), f2(x)) }
  integrate(f3, lower = 0, upper = Inf, abs.tol = 0)$value
}

Figure 9 shows examples of such overlaps. Here, the overlap ranges from 12% for two quite different distributions to 86% for two seemingly similar distributions. The question is where to draw a cutoff, that is, at what point we decide to label two distributions as different. Current literature considers two cases: either the parametric case (e.g. Inman and Bradley 1989), where both distributions are given by their distribution families and parameter values; or the
nonparametric case (e.g. Pastore and Calcagnì 2019), where observations (but no theoretical distributions) are available for the two populations. Our application constitutes a third case: on the one hand, we want to compare two parametric distributions, but the model parameters are only given as estimates based on (potentially small) datasets, thus they are uncertain; on the other hand, we do not directly observe the single-cell gene expression but only the pooled one.

[Figure 10 caption: Histograms of simulated overlap values. The overlap of the original estimates from Figure 9D, i.e. of LN(µ̂_1,cases = 2.10, σ̂²_cases = 0.19²) and LN(µ̂_1,controls = 2.03, σ̂²_controls = 0.20²), is marked in turquoise. The light grey bars of the histogram indicate values below the empirical 5%-quantile. If the original overlap falls into this range, we reject the null hypothesis that both distributions are identical.]
To address this issue, we suggest to again take into account the original data that led to the estimated parametric PDFs. As an example, assume that we consider two sets of pooled gene expression, one for a group of cases and one for a group of controls. In both groups, pooled gene expression is available as 10-cell measurements, but the two groups differ in sample size: say, the cases contain 50 samples and the controls 100. We assume the LN-LN model with two populations and estimate the mixture and population parameters using the stochprofML algorithm separately for each group, leading to estimates p̂_cases, µ̂_1,cases, µ̂_2,cases, σ̂_cases and p̂_controls, µ̂_1,controls, µ̂_2,controls, σ̂_controls. We now aim to assess whether the first populations in both groups have identical characteristics, i.e. whether LN(µ̂_1,cases, σ̂²_cases) and LN(µ̂_1,controls, σ̂²_controls) are the same. Figure 9 displays the single-cell PDFs of the first population and their overlaps for various values of the estimates. For example, in Figure 9D, the orange curve shows the single-cell PDF of population 1 inferred from the cases, yielding LN(µ̂_1,cases = 2.10, σ̂²_cases = 0.19²), and the blue one shows the inferred single-cell PDF of population 1 from the controls, LN(µ̂_1,controls = 2.03, σ̂²_controls = 0.20²). The overlap of these two inferred PDFs equals 86%. We now aim to test the null hypothesis that these populations are the same versus the alternative hypothesis that they are different. We perform a sampling-based test: taking into account the inferred population probabilities p̂_cases and p̂_controls and the number of samples and cells in the data, we can estimate the number of cells on which the estimates θ̂_cases and θ̂_controls relied. The larger this cell number, the less uncertainty is expected about the estimated population distributions LN(µ̂_1,cases, σ̂²_cases) and LN(µ̂_1,controls, σ̂²_controls) (neglecting the impact of pool sizes).
In our example, let p̂_cases = 12%. Then approximately 12% of the 500 cells from the cases group (50 × 10-cell samples) belonged to population 1, that is 60 cells. For p̂_controls = 20%, 200 cells were expected to be from the first population (that is, 20% of the 1,000 cells coming from the 100 × 10-cell measurements for the controls). In our procedure, we compare parameter estimates that are based on the respective numbers of single cells, i.e. 60 cells for cases and 200 cells for controls. We perform the following steps:

• Calculate OVL_original, the overlap of the PDFs of LN(µ̂_1,cases = 2.10, σ̂²_cases = 0.19²) and LN(µ̂_1,controls = 2.03, σ̂²_controls = 0.20²).

• Under the null hypothesis, the two distributions are identical. We approximate the parameters of this identical distribution as µ̂_1,mean = (µ̂_1,cases + µ̂_1,controls)/2 and σ̂_mean = (σ̂_cases + σ̂_controls)/2.
• Repeat N = 1,000 times:
  - Draw dataset A of size 60 from LN(µ̂_1,mean, σ̂²_mean).
  - Draw dataset B of size 200 from LN(µ̂_1,mean, σ̂²_mean).
  - Estimate the log-mean and log-sd for these two datasets using the method of maximum likelihood, yielding µ̂_A, σ̂_A, µ̂_B and σ̂_B.
  - Calculate the overlap of the PDFs of LN(µ̂_A, σ̂²_A) and LN(µ̂_B, σ̂²_B).

• Sort the N overlap values and select the empirical 5% quantile OVL_0.05.
• Compare the overlap from the original data to this quantile:
  - If OVL_original ≤ OVL_0.05, the null hypothesis that both populations are the same can be rejected.
  - If OVL_original > OVL_0.05, the null hypothesis cannot be rejected.
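The steps above can be sketched in R. `overlap_test` is a hypothetical wrapper written for illustration (the package does not export such a function); the lognormal parameters are fitted by maximum likelihood on the log scale, and the cell numbers 60 and 200 are those derived in the example:

```r
# Overlap of two lognormal PDFs, as defined above.
ovl_lnln <- function(mu_1, sigma_1, mu_2, sigma_2) {
  f <- function(x) pmin(dlnorm(x, mu_1, sigma_1), dlnorm(x, mu_2, sigma_2))
  integrate(f, lower = 0, upper = Inf, abs.tol = 0)$value
}

# Sampling-based test of H0: both populations follow the same lognormal.
# n_A, n_B: estimated numbers of cells behind each population estimate.
overlap_test <- function(mu_A, sigma_A, n_A, mu_B, sigma_B, n_B, N = 1000) {
  ovl_original <- ovl_lnln(mu_A, sigma_A, mu_B, sigma_B)
  mu_0    <- (mu_A + mu_B) / 2        # pooled parameters under H0
  sigma_0 <- (sigma_A + sigma_B) / 2
  ovl_sim <- replicate(N, {
    a <- rlnorm(n_A, mu_0, sigma_0)
    b <- rlnorm(n_B, mu_0, sigma_0)
    # ML estimates of log-mean and log-sd for each simulated dataset:
    est <- function(x) c(mean(log(x)),
                         sqrt(mean((log(x) - mean(log(x)))^2)))
    ea <- est(a); eb <- est(b)
    ovl_lnln(ea[1], ea[2], eb[1], eb[2])
  })
  ovl_crit <- unname(quantile(ovl_sim, 0.05))  # empirical 5% quantile
  list(ovl = ovl_original, crit = ovl_crit, reject = ovl_original <= ovl_crit)
}

set.seed(42)
res <- overlap_test(2.10, 0.19, 60, 2.03, 0.20, 200)
```

With the estimates of the example, `res$ovl` reproduces the roughly 86% overlap discussed above, and `res$reject` gives the test decision for one run of the stochastic procedure.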
This procedure is related to the idea of the parametric bootstrap, with the difference that our original data is on the n-cell level while the parametrically simulated data is on the single-cell level.
The left panel of Figure 10 shows one outcome of the above-described procedure (i.e., the stochastic, sampling-based algorithm was run once) with the above-specified values of the parameter estimates. Here, OVL_original lies in the critical range, such that we reject the null hypothesis that the gene expression of the populations in question stems from the same lognormal distribution; we thus assume a difference here. The right panel of Figure 10 demonstrates the importance of taking into account the number of cells on which the original estimates were based: here, we show one outcome of the above-described steps, but this time we assume that for the control group there were only 30 10-cell samples (i.e. 300 cells in total). With the same population fraction as before (p̂_controls = 20%), the datasets B now contain only 60 cells. Here, the value OVL_original does not fall into the critical range, and therefore we would not reject the null hypothesis that the two populations of interest are the same.
When testing for heterogeneity for several genes simultaneously, multiple testing issues should be taken into account. Note, however, that genes will in general not be independent of each other.

Prediction of sample compositions
The stochprofML algorithm estimates the parameters of the mixture model, i.e., in the case of at least two populations, the probability for each cell within a pool to fall into the specific populations. It does not reveal the individual pool compositions. In some applications, however, exactly this information is of particular interest. Here, we present how one can infer likely population compositions of a particular cell pool.
This is done in a two-step approach via conditional prediction: first, one estimates the model parameters from the observed pooled gene expression, i.e. one obtains an estimate θ̂ of θ. Then, one assumes that θ equals θ̂ and derives the most probable population composition by maximizing the conditional probability of a specific composition given the pooled gene expression (for calculations, see Appendix D).
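A minimal sketch of this second step for the two-population LN-LN model is given below. `composition_pmf` is a hypothetical helper, not part of stochprofML, and it approximates the density of a sum of lognormal cells by a single moment-matched lognormal (a Fenton-Wilkinson-type assumption); the package's own numerics may differ:

```r
# Conditional PMF of the pool composition given pooled expression y,
# two-population LN-LN model with shared log-sd sigma.
composition_pmf <- function(y, n, p, mu, sigma) {
  dens_sum <- function(k) {  # approx. density of k pop-1 + (n-k) pop-2 cells
    m <- k * exp(mu[1] + sigma^2 / 2) + (n - k) * exp(mu[2] + sigma^2 / 2)
    v <- (exp(sigma^2) - 1) *
         (k * exp(2 * mu[1] + sigma^2) + (n - k) * exp(2 * mu[2] + sigma^2))
    s2 <- log(1 + v / m^2)  # matched log-variance (Fenton-Wilkinson)
    dlnorm(y, meanlog = log(m) - s2 / 2, sdlog = sqrt(s2))
  }
  k <- 0:n
  w <- dbinom(k, n, p) * vapply(k, dens_sum, numeric(1))  # Bayes numerator
  setNames(w / sum(w), paste0(k, "x pop1"))               # normalize
}

# Example: a 5-cell pool with low pooled expression is almost certainly
# composed of population-2 cells only.
pmf <- composition_pmf(y = 5.1, n = 5, p = 0.2, mu = c(2, 0), sigma = 0.2)
```

The predicted composition is the argmax of `pmf`; the conditional expectation `sum((0:5) * pmf)` gives the expected number of population-1 cells.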
We evaluate this procedure via a simulation study. As before, we simulate data using the stochprofML package. In particular, we use the LN-LN model with two populations with parameters p = (0.2, 0.8), µ = (2, 0) and σ = 0.2. Each simulated measurement shall contain the pooled expression of n = 5 cells, and we sample k = 100 such measurements. We store the original true cell pool compositions from the data simulation step in order to later compare the composition predictions to the ground truth.
Having generated the synthetic data, we apply stochprofML to estimate the model parameters p, µ and σ. Figure 11 shows a histogram of one simulated dataset along with the PDF of the true population mixture and the PDF of the estimated population mixture (that is, the LN-LN model with parameters p̂ = (0.14, 0.86), µ̂ = (2.04, 0) and σ̂ = 0.20).
Next, we calculate the conditional probability mass function (PMF; see Appendix D for details) for each possible population composition conditioned on the particular pooled gene expression measurement. Figure 12A and Table 1 show results for the first six (out of 100) pooled measurements.
In particular, Figure 12A displays the conditional PMF of all possible compositions (i.e. k cells from population 1 and 5 − k cells from population 2 for k ∈ {0, 1, . . . , 5}). Blue bars stand for these probabilities when θ̂ is used as the model parameter value; orange stands for the hypothetical case where the true value θ is known and used. These two scenarios are in good agreement with each other.
We regard the most likely sample composition to be the one that maximizes the conditional PMF (maximum likelihood principle). The true composition (ground truth) is marked with a black box around the blue and orange bars. We observe in Figure 12A that the composition is inferred correctly, and mostly unambiguously, in all six cases. Only for the fifth measurement is there visible probability mass on a composition other than the true one. In fact, it is the only pool (out of the six considered ones) with two cells from the first population. As an alternative to the maximum likelihood estimator, one can also consider the expected composition, that is, the mean number of cells in the first population weighted by the conditional probabilities, or confidence intervals for this number. The respective estimates for the first six measurements of the dataset are shown in Table 1. The results are consistent with the interpretation of Figure 12A.
Certainly, the precision of the prediction depends on the employed pool sizes, the underlying true model parameters and how reliably these were inferred during the first step. We showed in Section 4 that larger cell pools lead to less precise parameter inference. Hence, we repeat the prediction of sample compositions on another dataset, this time based on 10-cell pools; all other parameters remain unchanged. The resulting conditional probabilities are depicted in Figure 12B. Since p = 0.2, one expects on average two cells from the first population in each 10-cell pool. As in the previous 5-cell case, most predictions show a clear pattern. However, the probability masses are spread more widely. Measurements 3 and 4 exemplify that almost identical gene expression measurements (y = 19.69 and y = 19.79) can arise from different underlying pool compositions (two cells from population 1 in measurement 3 vs. three cells from population 1 in measurement 4). For more similar population parameters, the estimation will get worse, which will then propagate to the pool composition prediction. In such cases, one may use additional parallel measurements of other genes that separate the populations better by their different expression profiles, while the pool composition stays the same across genes.

Summary and Discussion
With the stochprofML package, we provide an environment to profile gene expression measurements obtained from small pools of cells. This is advantageous, for example, if measurement noise is high in single-cell libraries, if budget or time are limited, or if one prefers to avoid the stress which is put on the cells during separation. The latest implementation even allows combining information from different pool sizes, in particular, to simultaneously analyze single-cell and n-cell data.

[Table 1 caption: Rows: estimation of cell numbers is based on conditional probabilities that use either the estimated model parameters (rows 1 and 2, corresponding to blue bars in Figure 12A) or the true values (rows 3 and 4, orange bars). Within each of these two choices, one can consider the mean number of cells from population 1 as determined by the conditional probabilities (rows 1 and 3) or the MLE that maximizes the conditional probabilities (rows 2 and 4, first value), including a 95% confidence interval that covers at least 95% of the conditional probability mass (rows 2 and 4, in parentheses). The last row shows the true pool composition. The last column shows for each estimator how many of the 100 cell numbers were inferred correctly (defined as follows: rounded mean is exact match; MLE is exact match; CI includes correct number).]
We demonstrated the usage and performance of the stochprofML algorithm in various examples and simulation studies. These have been performed in an idealized in silico environment. This should be kept in mind when incorporating the results into experimental planning and analysis. The interpretation of heterogeneity will be informative if based on a good model estimate.
The assumption of independent expression across genes within the same tissue sample is a simplification of nature that leads to less complex parameter estimation. The optimal pool size with respect to bias and variance of the corresponding parameter estimators will depend on unknown properties such as numbers of populations and their characteristics, and also on the relationship between the pool size and the amount of technical measurement noise. The latter aspect has been excluded from the studies here but further supports the application of stochastic profiling.

Computational details
We

A. PDF of n-cell measurements of T cell populations
Equation (3) in Section 2.2 displays the PDF of the overall gene expression of a cell pool of size n i , where each single cell from the pool can originate from any of T cell populations. We derive this PDF here. To make it easier to follow the lengthy calculations, we build up the formulas in four steps: We start with the simplest case of 2-cell measurements in the presence of two populations. Then, we continue with 2-cell samples and three populations. Next, we increase the cell number to n and finally raise the population number to T .

PDF of 2-cell measurements of two populations (n = 2, T = 2)
First, we derive the PDF of a measurement y of a 2-cell pool, i.e. of Y = X_1 + X_2. Assume we know that two cell populations are present in the tissue, and each of them is described by an individual distribution. In this section, we denote the univariate population distributions by D_h, h = 1, . . . , T = 2. In Section 2, they are replaced by the distributions that were presented in Section 2.1: LN(µ, σ²) or EXP(λ) with population-specific parameter values. For now, we consider

X_j ∼ D_1 with probability p_1 and X_j ∼ D_2 with probability p_2

for j = 1, 2 with p_2 = 1 − p_1. Hence, each X_j has the single-cell PDF f_X = p_1 f_{D_1} + p_2 f_{D_2}. To determine the distribution of Y, we use the convolution of the single-cell PDFs, which are the same functions f_X for both X_1 and X_2:

f_Y(y) = ∫_0^y f_X(x_1) f_X(y − x_1) dx_1
       = p_1² ∫_0^y f_{D_1}(x_1) f_{D_1}(y − x_1) dx_1 + 2 p_1 p_2 ∫_0^y f_{D_1}(x_1) f_{D_2}(y − x_1) dx_1 + p_2² ∫_0^y f_{D_2}(x_1) f_{D_2}(y − x_1) dx_1.

Each of these integrals ∫_0^y f_{D_i}(x_1) f_{D_j}(y − x_1) dx_1 is the PDF of a random variable Z_1 + Z_2 evaluated at y, where Z_1 ∼ D_i and Z_2 ∼ D_j are independent. This holds for both i = j and i ≠ j. We denote this density by f_{i,j}. All together, we get

f_Y(y) = p_1² f_{1,1}(y) + 2 p_1 p_2 f_{1,2}(y) + p_2² f_{2,2}(y).

An alternative formulation is

f_Y(y) = Σ_{ℓ_1=0}^{2} (2 choose ℓ_1) p_1^{ℓ_1} p_2^{ℓ_2} f_{(ℓ_1,ℓ_2)}(y),

where ℓ_1 and ℓ_2 = 2 − ℓ_1 indicate how often a cell of population 1 and 2 is present in the pool. The two PDFs f_{(ℓ_1,ℓ_2)} and f_{i,j} are directly connected: f_{(ℓ_1,ℓ_2)} considers how often populations 1 and 2 are represented, and f_{i,j} denotes which populations are present. For example, f_{(1,1)}(y) = f_{1,2}(y) and f_{(0,2)}(y) = f_{2,2}(y).
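The 2-cell mixture PDF above can be checked numerically. The following is a sketch assuming two lognormal populations with illustrative parameter values; `f_ij` implements the convolution integral directly, and a Monte Carlo simulation of Y = X_1 + X_2 serves as reference:

```r
# Check f_Y = p1^2 f_{1,1} + 2 p1 p2 f_{1,2} + p2^2 f_{2,2} against
# direct simulation of a 2-cell pool with two lognormal populations.
p1 <- 0.2; p2 <- 1 - p1
mu <- c(2, 0); sigma <- 0.2

# f_{i,j}: density of Z_1 + Z_2 with Z_1 ~ LN(mu[i], sigma^2) and
# Z_2 ~ LN(mu[j], sigma^2), via the convolution integral on (0, y).
f_ij <- function(y, i, j) {
  integrate(function(x) dlnorm(x, mu[i], sigma) * dlnorm(y - x, mu[j], sigma),
            lower = 0, upper = y)$value
}
f_Y <- function(y) {
  p1^2 * f_ij(y, 1, 1) + 2 * p1 * p2 * f_ij(y, 1, 2) + p2^2 * f_ij(y, 2, 2)
}

# Monte Carlo reference: draw population labels, then single-cell values.
set.seed(7)
lab <- matrix(sample(1:2, 2e5 * 2, replace = TRUE, prob = c(p1, p2)), ncol = 2)
Y   <- rowSums(matrix(rlnorm(length(lab), mu[lab], sigma), ncol = 2))
```

The cumulative probability P(Y ≤ 5) obtained by integrating `f_Y` agrees with the empirical fraction `mean(Y <= 5)` up to Monte Carlo error.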

PDF of 2-cell measurements of three populations (n = 2, T = 3)
Next, we derive the PDF of a measurement y of a 2-cell pool, i.e. of Y = X_1 + X_2, where we now assume three cell populations to be present in the tissue. Again, each of them is described by an individual distribution D_h for h = 1, . . . , T = 3:

X_j ∼ D_h with probability p_h, h = 1, 2, 3,

for j = 1, 2, where p_1, p_2 ∈ [0, 1], p_1 + p_2 ≤ 1 and p_3 = 1 − p_1 − p_2. Hence, the PDF of each X_j is

f_X = p_1 f_{D_1} + p_2 f_{D_2} + p_3 f_{D_3}.

To determine the distribution of Y = X_1 + X_2, we again use the convolution of the single-cell PDFs:

f_Y(y) = ∫_0^y f_X(x_1) f_X(y − x_1) dx_1.

Once more, we make use of the fact that ∫_0^y f_{D_i}(x_1) f_{D_j}(y − x_1) dx_1 is the PDF of the sum Z_1 + Z_2 of two independent random variables, where Z_1 ∼ D_i and Z_2 ∼ D_j (now with i, j ∈ {1, 2, 3}). As before, we denote this density by f_{i,j}. Overall, we obtain

f_Y(y) = Σ_{ℓ_1 + ℓ_2 + ℓ_3 = 2} (2 choose ℓ_1, ℓ_2, ℓ_3) p_1^{ℓ_1} p_2^{ℓ_2} p_3^{ℓ_3} f_{(ℓ_1, ℓ_2, ℓ_3)}(y),

where ℓ_1, ℓ_2 and ℓ_3 = 2 − ℓ_1 − ℓ_2 indicate how often cells of populations 1, 2 and 3 are present in the pool. Again, f_{(ℓ_1, ℓ_2, 2−ℓ_1−ℓ_2)}(y) is connected to f_{i,j}. For example, f_{(0,1,1)}(y) = f_{2,3}(y) and f_{(2,0,0)}(y) = f_{1,1}(y).

PDF of n-cell measurements of three populations (n arbitrary, T = 3)
Next, we suppose that we measure pools of n cells originating from three cell populations. Let Y = X_1 + . . . + X_n. Then Equation (9) turns into

f_Y(y) = Σ_{ℓ_1=0}^{n} Σ_{ℓ_2=0}^{n−ℓ_1} (n choose ℓ_1) (n−ℓ_1 choose ℓ_2) p_1^{ℓ_1} p_2^{ℓ_2} p_3^{ℓ_3} f_{(ℓ_1, ℓ_2, ℓ_3)}(y),

where p_3 = 1 − p_1 − p_2 and ℓ_3 = n − ℓ_1 − ℓ_2.

PDF of n-cell measurements of T populations (n and T arbitrary)
Finally, we extend Equation (10) to the most general case, where n-cell pools are measured from a tissue that consists of T cell populations. Here, we obtain

f_Y(y) = Σ_{ℓ_1=0}^{n} Σ_{ℓ_2=0}^{n−ℓ_1} ⋯ Σ_{ℓ_{T−1}=0}^{n−ℓ_1−…−ℓ_{T−2}} (n choose ℓ_1) (n−ℓ_1 choose ℓ_2) ⋯ (n−ℓ_1−…−ℓ_{T−2} choose ℓ_{T−1}) p_1^{ℓ_1} ⋯ p_T^{ℓ_T} f_{(ℓ_1,…,ℓ_T)}(y),

with ℓ_T = n − ℓ_1 − … − ℓ_{T−1}. The binomial coefficients together form the multinomial coefficient

(n choose ℓ_1, …, ℓ_T) = n! / (ℓ_1! ⋯ ℓ_T!).

Taken together, this leads to the final PDF (3) from Section 2.2:

f_Y(y) = Σ_{ℓ_1 + … + ℓ_T = n} (n choose ℓ_1, …, ℓ_T) p_1^{ℓ_1} ⋯ p_T^{ℓ_T} f_{(ℓ_1,…,ℓ_T)}(y).

The terms (n choose ℓ_1, …, ℓ_T) p_1^{ℓ_1} ⋯ p_T^{ℓ_T} are probabilities arising from the multinomial distribution and can be seen as multinomial weights of the densities f_{(ℓ_1,…,ℓ_T)}(y).
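The multinomial weights in the final PDF can be enumerated explicitly. The following sketch (hypothetical helper names, all probabilities assumed strictly positive) lists all compositions (ℓ_1, …, ℓ_T) with ℓ_1 + … + ℓ_T = n and verifies that their weights sum to one:

```r
# Enumerate all compositions (l_1, ..., l_T) with l_1 + ... + l_T = n.
compositions <- function(n, n_pop) {
  if (n_pop == 1) return(matrix(n, ncol = 1))
  do.call(rbind, lapply(0:n, function(l1)
    cbind(l1, compositions(n - l1, n_pop - 1))))
}

# Multinomial weight choose(n; l) * p_1^l_1 * ... * p_T^l_T per composition,
# computed on the log scale for numerical stability (requires all p > 0).
multinom_weights <- function(n, p) {
  ell <- compositions(n, length(p))
  w <- apply(ell, 1, function(l)
    exp(lfactorial(n) - sum(lfactorial(l)) + sum(l * log(p))))
  list(ell = ell, w = w)
}

mw <- multinom_weights(n = 5, p = c(0.2, 0.3, 0.5))
```

For n = 5 cells and T = 3 populations there are choose(5 + 3 − 1, 3 − 1) = 21 compositions, and the 21 weights sum to one, as the multinomial distribution requires.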

B. PDF of pooled gene expression for mixed pool size vectors
When estimating a gene expression model from data, one may want to verify whether the estimated model adequately describes the data. In Figure 11, we did this by comparing the estimated PDF to the histogram of the data and to the true PDF: the orange curve was known since we treated the case of synthetic data. For the blue curve, we first estimated the model parameters and then plugged these into the general model PDF. In case of a uniform pool size across all measurements, this procedure is straightforward. For a vector of pool sizes, i.e. a mix of e.g. 1-cell, 2-cell and 10-cell data, the PDF (see e.g. Figure 2B) is less obvious. We calculate this function as follows:

• For each cell number contained in the n-vector, calculate the PDF of the respective pool size and plug in the parameter estimates.
• Calculate the weighted sum of these PDFs -weighted according to the times the respective pool size occurs in the n-vector.
The resulting PDF approximates the distribution of a sample whose observations are based on the pool sizes of the considered n-vector. Note that while this PDF describes a mixture distribution with randomly drawn pool sizes (according to the weights used), in our applications we assume the pool sizes to be known for each measurement.
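These two steps can be sketched as follows. `mixed_pool_pdf` is a hypothetical helper for illustration; `pdf_for_size(y, n)` stands in for the model PDF of an n-cell pool with plugged-in parameter estimates, here replaced by a placeholder single-population lognormal approximation (moment matching), not the package's exact n-cell PDF:

```r
# Weighted sum of per-pool-size PDFs, weighted by how often each pool
# size occurs in the n-vector.
mixed_pool_pdf <- function(y, n_vec, pdf_for_size) {
  sizes <- sort(unique(n_vec))
  w <- as.numeric(table(factor(n_vec, levels = sizes))) / length(n_vec)
  rowSums(sapply(seq_along(sizes),
                 function(i) w[i] * pdf_for_size(y, sizes[i])))
}

# Placeholder per-size PDF: sum of n i.i.d. LN(mu, sigma^2) cells,
# approximated by a moment-matched lognormal (an assumption).
pdf_ln_sum <- function(y, n, mu = 0, sigma = 0.2) {
  m  <- n * exp(mu + sigma^2 / 2)
  v  <- n * (exp(sigma^2) - 1) * exp(2 * mu + sigma^2)
  s2 <- log(1 + v / m^2)
  dlnorm(y, meanlog = log(m) - s2 / 2, sdlog = sqrt(s2))
}

n_vec <- c(rep(1, 20), rep(2, 10), rep(10, 20))  # mixed pool sizes
y     <- seq(0.1, 20, by = 0.1)
dens  <- mixed_pool_pdf(y, n_vec, pdf_ln_sum)
```

Summing `dens` over the grid confirms that the weighted mixture still integrates to approximately one.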

C. Transformation of population probabilities
As described in Section 2.4, we transform the model parameters before optimization of the likelihood function such that no constraints of the parameter space have to be accounted for.
Here, we provide details about the transformation of the population probabilities.
In case of two populations, there is only one parameter p ∈ [0, 1], which determines the probabilities p and 1 − p of populations 1 and 2. We transform p to

w = logit(p) = log(p / (1 − p))

and later back-transform this via

p = expit(w) = exp(w) / (1 + exp(w)).

The advantage of w as compared to p is its unrestricted range R instead of [0, 1].
In the case of T populations, we consider the cumulative probabilities p̃_h = p_1 + … + p_h for h = 1, . . . , T (so that p̃_T = 1) and transform w_h = logit(p̃_h / p̃_{h+1}) for h = 1, . . . , T − 1. For the back-transformations, we start at h = T − 1 and calculate

p̃_h = expit(w_h) · p̃_{h+1}

in reverse order. We set p̃_T = 1 to ensure that the probabilities sum up to one. Additionally, one has p̃_h ≤ p̃_{h+1} as expit(w_h) ∈ [0, 1] for all h ∈ {1, . . . , T − 1}. Obviously, p_1 = p̃_1, and the remaining population probabilities are given by

p_h = p̃_h − p̃_{h−1} for h = 2, . . . , T.

The (back-)transformations are implemented in transform.par() and backtransform.par().
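A short sketch of this cumulative logit (back-)transformation, mirroring transform.par() and backtransform.par() in spirit but not necessarily in exact implementation:

```r
expit <- function(w) 1 / (1 + exp(-w))
logit <- function(q) log(q / (1 - q))

# p -> w: w_h = logit(ptilde_h / ptilde_{h+1}) for the cumulative sums
# ptilde_h = p_1 + ... + p_h (with ptilde_T = 1).
transform_p <- function(p) {
  pt <- cumsum(p)
  logit(pt[-length(p)] / pt[-1])
}

# w -> p: compute ptilde in reverse order starting from ptilde_T = 1,
# then recover p via successive differences.
backtransform_p <- function(w) {
  n_pop <- length(w) + 1
  pt <- numeric(n_pop); pt[n_pop] <- 1
  for (h in (n_pop - 1):1) pt[h] <- expit(w[h]) * pt[h + 1]
  c(pt[1], diff(pt))
}

p <- c(0.2, 0.3, 0.5)
w <- transform_p(p)   # unconstrained representation, length T - 1
```

The round trip `backtransform_p(transform_p(p))` recovers p exactly, and for T = 2 the transformation reduces to the plain logit of p shown above.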

D. Derivation of sample composition probabilities
We describe how to predict the population composition of a cell pool, as applied in Section 5.2.
A key formula here is the conditional probability of a cell composition given the measured gene expression, which we derive here. We use the following notations and assumptions:

• The overall gene expression of a cell pool is denoted by Y and assumed to be a continuous random variable with PDF f_Y(y).
• L = (L_1, . . . , L_T) denotes the specific cell population combination, i.e. L_i is the number of cells of population i for all i = 1, . . . , T, within a pool of L_1 + . . . + L_T cells. L is a discrete random vector with PMF P(L = ℓ).
• f_{Y|L=ℓ}(y) is the conditional PDF of the overall gene expression in a cell pool whose composition is known to equal ℓ. For shorter notation, this was referred to as f_{(ℓ_1, ℓ_2, …, ℓ_T)}(y_i | θ) in Section 2.2.
• In turn, P(L = ℓ | Y = y) is the conditional PMF of the cell pool composition given the pooled gene expression measurement Y = y.
We use Bayes' theorem to derive the latter PMF:

P(L = ℓ | Y = y) = f_{Y|L=ℓ}(y) P(L = ℓ) / Σ_{j ∈ J} f_{Y|L=j}(y) P(L = j),   (11)

where J is the set of all possible compositions of the cell pool, i.e. the set of all vectors (j_1, . . . , j_T) with j_i ∈ N_0 and j_1 + . . . + j_T = n.
The terms in Equation (11) are determined by the mixture model: f_{Y|L=ℓ}(y) is the density f_{(ℓ_1, …, ℓ_T)}(y | θ) from Section 2.2, and the prior probability of a composition is multinomial,

P(L = ℓ) = (n choose ℓ_1, ℓ_2, …, ℓ_T) p_1^{ℓ_1} ⋯ p_T^{ℓ_T},

where (n choose ℓ_1, ℓ_2, …, ℓ_T) = n! / (ℓ_1! ℓ_2! ⋯ ℓ_T!) is the multinomial coefficient. With this, the conditional PMF of the cell pool composition given the pooled gene expression measurement Y reads:

P(L = ℓ | Y = y) = (n choose ℓ_1, …, ℓ_T) p_1^{ℓ_1} ⋯ p_T^{ℓ_T} f_{(ℓ_1, …, ℓ_T)}(y | θ) / Σ_{j ∈ J} (n choose j_1, …, j_T) p_1^{j_1} ⋯ p_T^{j_T} f_{(j_1, …, j_T)}(y | θ).

E. Interactive Functions
As indicated in Section 3, we show an example of the interactive functions for data generation, stochasticProfilingData(), and parameter estimation, stochasticProfilingML().
Synthetic data generation: stochasticProfilingData()

R> library("stochprofML")
R> set.seed(10)
R> stochprofML::stochasticProfilingData()

This function generates synthetic data from the stochastic profiling model. In the following, you are asked to specify all settings. By pressing <Return>, you choose the default option.

R> 0.03
---------
Would you like to write the generated dataset to a file? (Be careful not to overwrite any existing file!) Please type yes or no.

R> yes
Please enter a valid path and filename, either a full path, e.g. D:/Users/lisa.amrhein/Desktop/mydata.txt or just a file name, e.g. mydata.txt. The current directory is D:/Users/lisa.amrhein/Desktop.

R> test.txt
Hit <Return> to see next plot: The full dataset has been written to test.txt. It is also stored in the .Last.value variable.

R> <Return>
Parameter estimation: stochasticProfilingML()

---------
The file should contain a data matrix with one dimension standing for genes and the other one for observations. Fields have to be separated by tabs or white spaces, but not by commas. If necessary, please delete the commas in the text file using the replace all function of your text editor.
Please enter a valid path and filename, either a full path, e.g. D:/Users/lisa.amrhein/Desktop/mydata.txt or just a file name, e.g. mydata.txt. The current directory is D:/Users/lisa.amrhein/Desktop.

R> test.txt
Does the file contain column names? Please enter yes or no.

R> yes
Does the file contain row names? Please enter yes or no.

R> no
Do the columns stand for different genes or different observations? 1: genes 2: observations.

R> 1
This is the head of the dataset (columns contain different genes): [dataset preview omitted]. If the matrix does not look correct to you, there must have been an error in the answers above. In this case, please quit by pressing escape and call stochasticProfilingML() again.

F. Details on Simulation Studies
The general procedure of the simulation studies shown in Section 4 is to first generate synthetic datasets with predefined population parameters and frequencies using r.sum.of.mixtures(). Thereby, datasets with either fixed or varying pool sizes are generated, i.e. the number of cells contained in one pool is fixed or varies from cell pool to cell pool within a dataset. Next, we assume that we do not know the predefined model parameters and estimate them using stochprof.loop(). Then we compare the parameter estimates in different ways, e.g. how they are influenced by increasing cell numbers or how their variance differs when the dataset was generated with different population parameters.
Here, we give an overview of the different model parameter settings and pool sizes used in data generation: we use datasets with fixed pool sizes that contain single cells, 2 cells, 5 cells, 10 cells, 15 cells, 20 cells or 50 cells. Additionally, we chose two types of datasets with varying pool sizes: the first contains small cell pools with 1, 2, 5 and 10 cells, the second contains larger cell pools with 10, 15, 20 and 50 cells. Thus, in total we have nine different cell pool settings that we use for data generation.
In all simulation studies, we use the LN-LN model with five different parameter settings, given in Table 2. While the first set is considered to be the default, each of the other parameter sets differs from it in one of the population parameters.
Taken together, for each of the nine cell pool settings and each of the five parameter settings, 1,000 datasets are generated using r.sum.of.mixtures.LNLN(), so that in total we have 5 × 9 × 1,000 = 45,000 simulated datasets.

Impact of pool sizes
In the first simulation study (Section 4.1), we investigate how parameter estimation is influenced by increasing cell numbers within the cell pools. The results for parameter set 1 are displayed in the main part of the manuscript. Here, we show the corresponding results for the remaining parameter sets.

In the second parameter setting, the fraction of the first population was reduced to 10% as compared to the first parameter setting. The results are shown in Figure 13. They are similar to the results of the first parameter set in Figure 5. For set 2, however, single cells lead to a large variance of the estimates, supposedly due to the small sample size of 50 in combination with the small probability (10%) of the first population: we can only expect five single cells of the first population to be measured on average. In some datasets, this will be too low to estimate the parameters of the first population and/or their proportion satisfactorily. Consequently, the violins of the single-cell estimates show a higher variance, especially for the estimates of the parameters of the first population.
In the third parameter setting, the fraction of the first population was increased to 40%. The resulting estimates are shown in Figure 14. In this setting, both populations are similarly frequent; hence, it seems plausible that the single-cell estimates show similar variability as, for example, the 2-cell estimates. The estimates of the mixed pools with the lower cell numbers are as accurate as the ones for single-cell and 2-cell data. From a pool size of five cells on, the estimates vary strongly. Apparently, low cell numbers are advisable if a tissue is not dominated by one cell population.
In the fourth parameter setting, µ_2 is increased to 1 and is thus larger than in the first parameter setting; the two populations are more similar. The resulting estimates are shown in Figure 15. Starting from a pool size of 10 cells, the variance of the estimates does not seem to increase any more. The estimates for the mixed pools with larger cell numbers can sometimes not distinguish the populations; therefore, the violin of p is bi-modal. We draw the same conclusion as for two populations with similar frequencies: more similar populations should be investigated in pools with lower cell numbers because their individual expression profiles are blurred for small pool sizes already.
Finally, we investigate the effect of different pool sizes in the fifth parameter set, where the log-sd σ of both populations is increased to 0.5. The resulting estimates of the model parameters are shown in Figure 16. With an increase of σ, both populations have broader distributions. There appears to be an increase in the variance of the estimates between the 5-cell and the 10-cell measurements. Increasing cell numbers in the pools mainly influences the estimate of σ, which is increasingly underestimated.

[Figure 13 caption: results for parameter set 2 (see Table 2). Left: accumulated parameter fits of the single-cell, 2-cell, 5-cell, 10-cell and mixed single-, 2-, 5- and 10-cell pools. Right: results of the 10-cell pools repeated (turquoise violins), next to those of the larger pool sizes, namely 15-, 20- and 50-cell pools and their mixture. Each violin is based on 1,000 parameter estimates; the true parameter values are marked in orange. Figures 14-16: as Figure 13, but for parameter sets 3, 4 and 5 (see Table 2).]

Impact of parameter values
In Section 4.2, we investigate the influence of the model parameter values on the estimation performance while fixing the pool size. In the main part of the manuscript, we presented results for 10-cell pools (see Figure 6). Here, corresponding analyses for the remaining eight cell pool settings (n ∈ {1, 2, 5, 15, 20, 50} and the two mixtures) are shown.
Results for single-cell and 2-cell pools look alike (Figures 17 and 18). As discussed before, the variance of the estimates becomes large for a small value of p in combination with small pool sizes. For both single-cell and 2-cell data, varying µ_2 does not affect the estimation accuracy, whereas a larger value of σ leads to a higher variance of all parameter estimates except for p.
In contrast to this, the 5-cell data results in a different pattern (Figure 19): compared to the estimates from the standard setting, the estimates show a larger variance. The mixture of small cell pools (Figure 20), however, leads to similar results as the pure 2-cell datasets. Figure 21 displays the results for the 15-cell data. For most parameter combinations, the variance of the estimates does not change dramatically. The most accurate estimates are achieved for small p, the least accurate ones for large σ, in which case σ is underestimated. The same holds true for the 20- and 50-cell datasets (Figures 22 and 23), with even larger variance. For the mixture of large cell pools (Figure 24), estimation performance is comparable to that of the pure 50-cell measurements.