scDC: single cell differential composition analysis

Background Differences in cell-type composition across subjects and conditions often carry biological significance. Recent advancements in single cell sequencing technologies enable cell-types to be identified at the single cell level, and as a result, cell-type composition of tissues can now be studied in exquisite detail. However, a number of challenges remain with cell-type composition analysis – none of the existing methods can identify cell-type perfectly and variability related to cell sampling exists in any single cell experiment. This necessitates the development of method for estimating uncertainty in cell-type composition. Results We developed a novel single cell differential composition (scDC) analysis method that performs differential cell-type composition analysis via bootstrap resampling. scDC captures the uncertainty associated with cell-type proportions of each subject via bias-corrected and accelerated bootstrap confidence intervals. We assessed the performance of our method using a number of simulated datasets and synthetic datasets curated from publicly available single cell datasets. In simulated datasets, scDC correctly recovered the true cell-type proportions. In synthetic datasets, the cell-type compositions returned by scDC were highly concordant with reference cell-type compositions from the original data. Since the majority of datasets tested in this study have only 2 to 5 subjects per condition, the addition of confidence intervals enabled better comparisons of compositional differences between subjects and across conditions. Conclusions scDC is a novel statistical method for performing differential cell-type composition analysis for scRNA-seq data. It uses bootstrap resampling to estimate the standard errors associated with cell-type proportion estimates and performs significance testing through GLM and GLMM models. We have made this method available to the scientific community as part of the scdney package (Single Cell Data Integrative Analysis) R package, available from https://github.com/SydneyBioX/scdney.

methods for cell-type identification and none of them guarantees perfect identification accuracy. A systematic evaluation of 14 common clustering algorithms revealed differences in performance and clustering result [5,6]. A further evaluation on similarity metrics in clustering algorithms illustrates further inconsistency in results due to the different metrics used for measuring differences in cell transcriptomes [7]. In addition, as the compositions sum to one, this transforms the data from the Euclidean space into a simplex space. Canonical statistical tests such as t-test may not be reliable, as these are not designed for relative proportion data [8]. These issues all highlight the difficulty and uncertainty associated with cell-type identification. However, there is a current lack of methodological research on cell-type compositional analysis for single cell data. Whilst some research papers draw observations on compositional differences of cell-types between conditions [1,3,9], such observations are not accompanied by a measure of uncertainty associated with the estimates. In contrast to the lack of such studies in the field of single cell analysis, compositional analysis has been an active and ongoing area in microbiome research, due to the compositional nature of microbiome data. Dirichlet -multinomial models have been used extensively for microbiome research, in which Dirichlet Monte Carlo procedures are used to obtain posterior distributions of sample proportions [10][11][12][13]. However, due to the small sample size of scRNAseq data, fitting a Dirichlet multinomial is not currently feasible.
In this paper, we present single cell Differential Composition (scDC), a method for estimating uncertainty in cell-type proportions via bootstrapping. Our method provides a bias-corrected estimate of cell-type proportions confidence intervals, and by performing clustering within each bootstrap iteration it also captures the uncertainty associated with cell-type identification. In scRNA-seq experiments with small sample sizes, such estimates of cell-type proportion uncertainties would be useful in determining the significance of cell-type compositional differences between conditions. In the future where multi-sample comparative scRNAseq experiments are likely to become common place, our method offers a statistically robust way to propagate uncertainty in downstream analysis. We demonstrate this using two synthetic datasets derived from published experimental data. Finally, we implemented scDC as part of the scdney R package which is freely available from https://github.com/SydneyBioX/scdney.

Overview of single cell differential composition analysis (scDC)
We developed scDC (workflow shown in Fig. 1), a novel approach based on bootstrap resampling, to perform differential cell-type composition analysis. Let Y ij be a matrix of cell counts from i = 1, . . . , C cell-types and j = 1, . . . , R subjects. The first step of scDC involves stratified bootstrap resampling where we sampled with replacement N j cells from the cell count matrix Y ij for subject j. Stratification by subject is important because it ensures unbiased bootstrap estimates and captures subject to subject variability. Let Y * bj be the b th bootstrap sample for the j th subject. Subsequently, for each bootstrap iteration, we combined resampled cell counts Y * b1 , . . . , Y * bR for all R subjects.
The next step involved cell-type identification of each bootstrap sample using clustering. Here, we performed PCA dimension reduction followed by k-means clustering (using Pearson correlation as distance metric [7]). For each bootstrap sample, this generated cell-type proportions p * b1 , . . . p * bC which we later used to calculate bootstrap standard errors (SE) for each of the cell-type proportions p 1 , . . . , p C . In parallel, for experiments with two conditions A and B, to assess changes in cell-type compositions between conditions, we modelled the data using a log-linear model accounting for cell-types, conditions, subjects and interaction effects between cell-types and conditions. A log-linear model was fitted to each bootstrap sample Y * b and the results were pooled using Rubin's pooled estimate (see "Methods" section) [14]. This log-linear model can also be fitted as a mixed effect model by giving each subject a random effect and both options are available in our scDC framework.

Stratified bootstrap procedure provides good estimation of sampling error for cell-type proportions
First, we examined the validity of our stratified bootstrap procedure using simulation data (Table 1). In Fig.2a each point represents the sampling standard error and the bootstrap standard error associated with a particular subject. The figure clearly shows a high concordance between bootstrap standard error and sampling standard error. Therefore, bootstrap resampling should provide a reasonable estimate of uncertainty for cell-type proportion.

Bias-corrected and accelerated (BCa) bootstrap confidence interval for single subject cell-type proportions
We examined the effectiveness of various approaches to estimate confidence intervals (CI) associated with celltype proportions at the single subject level. Cell-type proportions from a single cell experiment can be modeled using a multinomial distribution with C proportion parameters p 1 , . . . , p C . The simple standard error (SE) measure, commonly used to illustrate error bars, assumes symmetric CIs and therefore is not an appropriate representation of the CI of cell-type proportions. Overview of scDC workflow. This illustrates the main components of the scDC procedure. The key functions have been included on top of each sub-figure, where relevant. a Single cell data is collected and analysed. Publicly available experimental data is obtained. Simulated data is generated using R package PowSimR. b This corresponds to step 1 and 2 of the 4-step procedure. Resampled data is generated using stratified bootstrap and then clustered using clustering algorithm. Each cluster is matched to cell-type using reference cell labels in the original dataset. This step is repeated n times. c In step 3 and 4, cell count output from each bootstrap is fitted using GLM. The coefficient estimates from each individual GLM model are pooled using Rubin's rules and tested for significance. d User can extract the overall estimates of statistics. e Each bootstrap re-sampling gives an estimated distribution of cell-types composition for each patient. f the result can be visualised graphically The SE estimate is especially problematic for rare cell-types, as it can generate CI estimates with endpoints that extend outside [ 0, 1]. Existing approaches to estimate CIs for multinomial proportions require large expected cell counts [15]. For rare cell-types, cell counts may not be adequate to ensure the appropriate overall coverage probability.
To address this issue, we considered three methods to estimate CI of cell-type proportions: (1) the model-based approach of May and Johnson (2000) [16,17], which constructs simultaneous CIs for multinomial proportions; (2) the bootstrap percentile method, and (3) the bootstrap BCa method [18]. Figure 2b shows CI estimates using  these three methods on four sets of simulated data. Comparing simulation data Sim1 with 500 cells per replicate , Sim2 with 100 cells per replicate and Sim3 with 50 cells per replicate, it is clear that the width of all three types of confidence intervals increases with decreasing number of cells. The width of BCa CI is similar to multinomial and percentile in all cases, but is significantly shorter when there was presence of rare cell-types and small number of cells per replicate, as shown by the cell-type n4. This suggests that BCa provides better estimates for rare cell-types.

scDC correctly estimates the cell-type proportions and corresponding standard error in simulation data
Using a series of simulation data, we evaluated whether scDC can accurately recover true cell-type proportions and the corresponding standard errors within each condition and subject. For simulated data SimP1 and SimP2, we simulated the situation where biological replicates are available; for simulated data SimP3 and SimP4, we also modelled the subject to subject variability as random effects. The detailed simulation settings are listed in Table 2 and described in detail in "Methods" section. Figure 3a shows that in all four simulations, scDC correctly recovers the true cell-type proportions; Fig. 3b shows the corresponding bootstrap confidence intervals calculated using the BCa method. Visually, the confidence intervals overlap between conditions where cell-type proportions differ by 10%. When we used GLMs to test for significance, we found that there is significant interaction effect for cell-type 2 in simulation data SimP1 (p = 0.01; Table 3) but not for cell-types 3 and 4. This shows concordance with the true underlying model. We also examined scDC's performance on estimating cell-types proportions in datasets where there are highly  Fig. 3 scDC on simulated dataset. scDC has been applied to four simulated datasets. a Each dot represents a pair of values containing the mean proportion estimate calculated from scDC (x-axis) and the true proportion from the simulation data (y-axis). Across all four datasets, all dots lie on the diagonal line, even when the proportion of a cell-type is as low as 5%. This demonstrates that scDC is highly accurate in estimating cell-type composition. b Each simulated dataset contains two conditions, represented suing colours red and blue. The bar chart shows proportion estimates of each cell-type n1 to n4 for each subject in the dataset. The horizontal black line represents the BCa confidence interval associated with the proportion estimate correlated cell-types and rare cell-types, as these situations are often observed in real experimental datasets. Using eight simulated datasets SimS1 to SimS8, we demonstrated that scDC is highly accurate in recovering rare cell-types proportion as low as 1% and is only slightly affected when there is high degree of correlation between two cell-types (Additional file 1: Figure S1). The detailed simulation settings are listed in Additional file 1: Table S2) Lastly, we simulated a dataset composed of 4000 cells and 10 cell-types with 3 rare cell-types, replicating the size and the complexity of real experimental dataset. scDC result shown in Additional file 1: Figure S2 clearly revealed that scDC accurately recovered true cell-types proportion. In addition, we observed that BCa produced the tightest CI interval around the density distribution of proportion estimates for the three rare cell-types, n1, n2 and n3. In contrast, both percentile and multinomial CI produced much wider CI in some cases. For the other seven cell-types, the three CI estimates were similar. This is consistent with our findings in the previous subsection.

Impact of specifying incorrect number of cell-types
In practice, the true number of cell-types are often unknown. To investigate the impact on scDC result when the number of cell-types is incorrectly determined, we simulated a dataset containing 500 cells and 4 cell-types and evaluated scDC with the number of cell-types specified as 2, 3, 4 and 5. Figure 4 clearly shows that in all cases when the input number of cell-types is incorrect, the resulting confidence intervals are very wide. Some of the confidence intervals appeared to be out of place compared to the density distribution. Only when scDC was ran with number of cell-types being 4 did the confidence intervals form reasonably CI estimate as seen from close bands surrounding the density distributions.

Application of scDC reveals differential cell-type proportions in two synthetic datasets
To demonstrate the applicability of our method on data containing realistic variability from single cell experiments, we examined two synthetic datasets constructed from two sets of real experimental data -Pancreas [1] and Neuronal [3] (see "Methods" section).
In the Pancreas dataset that examines subjects with type two diabetes (T2D) and healthy subjects, we observed subject to subject variation. Despite the overall proportion of beta cells being lower with subjects with T2D compared with healthy controls, our analysis confirms the original finding that 1 of the 4 subjects has a higher beta cell value. The boxplot for this patient has a small interquartile range, suggesting that there is high confidence that the high value observed with this subject is not due to an error or random chance, for example, in the clustering procedure. The overall boxplot (Fig. 5d-f ) summarising the average cell proportion across subjects further shows that the interquartile range of the beta cell proportion of  normal subjects is not overlapping with the interquartile range of the beta cell proportion of T2D subjects. This confirms that, despite one of the T2D subjects being an outlier, the overall beta cell proportion of T2D subjects is lower than for normal subjects. Inspecting output from random effect and fixed effect model revealed a difference in subject variability and the pooled fixed effect model also suggests the beta cell-type and subject interaction effect is significant (p = 0.02). This again highlights that in the beta cell, there is a lower proportion in T2D compared to healthy and such difference in proportion isn't in the alpha cell-type. The details of the GLM and GLMM results are presented in Table 4.
The Neuronal dataset contains samples from inbred mice at different developmental time points. With the neuronal dataset, we examined the mesoderm, neural crest, neural crest neurons, and progenitor cell-types. Here, the subject to subject variability are substantially smaller (Fig. 5a-c). Application of scDC reveals a clear trend in the relative proportion of these four cell-types. In Beta, ductal are the cell-type effect. T2D is abbreviation of type two diabetes effect. Subject ID followed by underscore of T2D represents type two diabetes subject. Subject ID without underscore of T2D represents normal subject particular, there is a clear increase in the relative proportion of progenitor, neural crest and neural crest neurons, and a decrease in the relative proportion of mesoderm. The GLM and GLMM analysis show a significant interaction effect between cell-types and time points (p < 0.001).
The details of the GLM and GLMM results are presented in Additional file 1: Table S1. The original paper examined the change in progenitor compared to neurons and stated that the percentage of progenitor increased as the percentage of neurons decreased. Our findings suggest that, by excluding neurons from analysis, there is a visible increase in the percentage of progenitor cells over time. This finding is worth further investigation, as progenitor cells is a key player in neurogenesis which gives rise to neuronal cells.

Discussion
We present a novel statistical framework, scDC, to assess cell-type composition differences between conditions of interest. This involves developing approaches to estimate standard error of cell-type proportions estimates in scRNA-seq data via bootstrapping and adapting a missing values framework to facilitate Wald testing from bootstrap log-linear model and generalized linear mixed model analyses. By applying scDC to simulated data, we demonstrate that our method can accurately recover the correct confidence interval of cell-type proportions in an unbiased manner. Furthermore, we show in two synthetic datasets that cell-type composition differences can be accurately determined following the scDC procedure. All confidence interval estimation procedures discussed in this paper can be extended to directly calculate the confidence interval associated with the difference in i th celltype proportions. This provides an alternative approach to the CI constructed based on SE pooled that will be symmetric by default. These various CI estimation procedures generate different CIs with different coverage probabilities and interval lengths. The ability to calculate the correct coverage probability with the smallest interval length is an important component of power calculations as it leads to a reduction of sample size needed to achieve detection of differences between two cell-types.
Comparisons between the two datasets confirm the widely accepted phenomena that human data contain much higher subject to subject variability than mouse data. The lower mouse to mouse variability could be due to two factors. First, cell-types in mouse are easier to distinguish, resulting in greater stability in clustering process. Second, inbred mice might exhibit much less variability than humans. Our scDC method provides and encourages the visualization of SE associated with celltype proportions estimates at different level. For example, in the Pancreas dataset, despite there being only one T2D subject with a higher beta cell proportion, there is still a significant interaction effect between T2D and beta cell proportions when taking account of all subject. This opens up a potential problem of hierarchical compositional analysis where one identifies the proportion of population within T2D that demonstrates a higher beta-cell proportion versus the proportion of population within T2D that has a lower beta-cell proportion, and what sample size of subject is required to achieve this.
As with all bootstrap approximations, we are limited by the initial sampling procedure. This is more complicated in data with samples from different human subjects, where the cell-type proportions in each sample vary greatly. In addition, comparing Sim1 and Sim3 from Fig.2b, it is clear that the width of the confidence interval increases greatly as the number of cells in dataset decrease. Similar situation is likely to be observed in experimental datasets, where the number of cells obtained from each sample may differ, particularly after quality control. scDC utilizes stratified bootstrap sampling, stratifying based on each subject. Without stratified bootstrap sampling, we will over-estimate the SE -this is a natural consequence of large subject to subject variability. It is not appropriate to assume that all subjects share the same cell-type proportions and be from the same multinomial distribution. In the stratified bootstrap, we assume that each subject has a different cell-type proportion and draws from a different multinomial distribution.
Finally, we implemented scDC as part of the package scdney for use by the scientific community. This package performs stratified bootstrap, builds GLM models and provides visualizations of the bootstrap results.

Conclusions
In this paper, we present scDC, a novel statistical framework for performing differential composition analysis in scRNA-seq experiments. scDC utilizes bootstrap resampling to estimate the standard errors in cell-type proportions, and enables significance testing from bootstrap GLM and GLMM analysis. We applied scDC to both simulation and synthetic data sets and showed that it can accurately recover true cell-type proportions and estimate the variance in cell-type proportions. By pooling estimates from GLM analyses in each bootstrap resample, we also showed that we can perform significance testing of composition differences across subjects and conditions. scDC is implemented as an R package and can be freely downloaded from our GitHub repository. Our implementation supports direct input from sc-RNA expression matrix.

Simulated data generation
The R package powsimR was used to simulate the singlecell RNA-seq data [19]. We selected 10% of genes to be differentially expressed between two cell-types, with fold changes sampled from a multivariate normal distribution. Gene specific distributional parameters are estimated by a dataset from mouse embryonic stem cells [20].
Simulation with one condition To validate the bootstrapping estimation, we simulated a population of 5000 cells consisting of 4 cells types. Four different simulations were set up, with three simulations consisting of 4 cell-types with proportion of 10%, 20% , 30% and 40% (Simulated data Sim1, Sim2, Sim3), and one simulation with proportion 5%, 25%, 30% and 40% (Simulated data Sim4). Simulated data Sim1 has 10 replicates, with 500 cells in each replicates. Four replicates were randomly chosen and bootstrapped. Simulated data Sim2 has 50 replicates in total, with 100 cells in each replicate. Five replicates were randomly chosen and bootstrapped. Simulated data Sim3 has 100 replicates in total, with 50 cells in each replicate. 20 replicates were chosen and bootstrapped. Simulated data Sim4 has 50 replicates in total, with 100 cells in each replicate. We chose 20 replicates to be bootstrapped (Table 1).
Simulation with two condition To validate our scDC procedure in-silico, we simulated a population of 6000 cells of four cell-types, with two conditions with different cell-type proportions. Each condition is divided into three subsets, which can be considered as the subpopulation of three subjects. Then we randomly sampled 200 cells from each subpopulation to create a simulated dataset with 1200 cells. Simulated data SimP1 and SimP2 represents the situation when the biological replicates are available. Here we sample from the same multinomial distribution with true cell-type proportions given in Table 2. Simulated data SimP3 and SimP4 represents the situation where the variability between subjects follows a N(0, σ 2 ) distribution, where σ = 0.02. The detailed cell-type composition settings of each simulation are listed in Table 2.
Simulation with rare cell-types and correlated cell-types We simulated four datasets that have different degree of correlation between cell-types and four datasets that have rare cell-types of different proportions. For each dataset, we first simulated a population of 2400 cells, then we randomly sampled 1200 cells to create the final dataset. Each dataset is made up of two conditions with different celltype proportions and three replicates in each condition. Variability between subjects are introduced by modelling a N(0, σ 2 ) distribution, where σ = 0.02. The detailed celltype composition settings of each simulation are listed in Additional file 1: Table S1.
Simulation of complex dataset To test scDC on a dataset with complexity similar to the characteristics of experimental dataset, we simulated one dataset with 8000 cells and 10 cell-types and randomly selected 4000 cells to create the final dataset. The 10 celltypes contains 3 rare cell-types with proportion as low as 1%. This dataset is composed of 2 conditions and 4 replicate subjects in each condition. Variability between subjects are introduced by modelling a N(0, σ 2 ) distribution, where σ = 0.02. Detailed cell-type composition settings for this simulation are listed in Additional file 1: Table S2.

Synthetic data
We generated two synthetic datasets from two publicly available datasets [1,3]. All data were first normalized by TPM (transcripts per million) and log-transformed. Any samples with a total of less than 50 cells of interest or with less than 10 cells of any cell-types of interest were excluded.
Synthetic Pancreas Dataset The Pancreas dataset contains 2,209 single cells taken from the pancreas of six healthy and four type 2 diabetic human cadaveric donors (E-MTAB-5061) [1]. Alpha, beta and ductal cells were selected to create the synthetic dataset. Cell labels were provided for all cells by the authors.
Synthetic Neuronal Dataset The Neuronal dataset contains 21,465 single cells taken from neural tubes of mouse embryo at five developmental time points from e9.5 to e13.5 [3]. Raw data is made available by the author on their GitHub repository https:// github.com/juliendelile/MouseSpinalCordAtlas. The celltypes mesoderm, neural crest, neural crest neurons and progenitor were selected to create the synthetic dataset. Cell-type labels were provided by the original authors through assessing the expression of a list of curated genes.

Bootstrap confidence intervals
Let n ij be the number of cells of cell-type i and subject j, i = 1, . . . , C, j = 1, . . . , R; N j be the number of cells within the subject j; and N be the total number of cells in the dataset. For a typical subject j, the ith cell-type proportion is defined as p ij and calculated by The percentile method This is an intuitive method for estimating the bootstrap confidence interval by first estimating the empirical distribution of p ij . Letp * ij(α) represent the 100α-th percentile of B bootstrap replications,p * ij (1) ,p * ij (2) , . . . ,p * ij(B) , witĥ where B = 10000 by default. Then the percentile interval of 100(1 − α)% for p ij is estimated by The BCa method To account for skewness in the confidence interval, we used the Bias-corrected and accelerated (BCa) confidence intervals proposed by Efron (1987) [21]. The BCa method uses percentiles of bootstrap distribution described above, but depends on an acceleration parameterâ and a biascorrection factorẑ 0 . The 100(1 − α)% BCa interval ofp * ij can be calculated by where ; .
Here (·) is the cumulative distribution function of standard normal distribution, and z (α) indicates the 100α-th percentile point of a standard normal distribution.
This method estimates two adjustment factors. First, the bias-correction factor which estimates the proportion of the bootstrap replication less than the original estimate. This is calculated bŷ where −1 is the inverse cumulative distribution function of standard normal andp ij is the original estimate of proportion, derived from Step 2 of clustering in the scDC procedure.
The second is the acceleration factor which quantifies the rate of change of the standard error ofp ij with respect to the original estimate p ij . This is calculated using a jackknife approach as follows, wherep ij(n) is the estimate of p ij holding out cell n in the data, andp ij(·) = 1 N N n=1p ij(n) . We modifies the implementation of function bcanon in bootstrap package to estimate BCa intervals for cell-type proportions of each subject [22].
Note that, for data with only one subject, we can still estimate the CIs with j = 1.

Single cell differential composition analysis (scDC)
scDC is a 4-step procedure that tests for differences in cell-type composition between conditions.
Step 1: Stratified bootstrap This is used to generate B bootstrap samples by sampling with replacement and stratified by subjects. We denote each bootstrap sample data of subject j as Y * bj . All genes with zero variance in each of the B bootstrap sample Y * bj are removed as they are not informative for cell-type determination.
Step 2: Cell-type identification We estimate the cell-type identity of each cell by clustering each bootstrap resample Y * bj and estimate the bootstrap cell-type proportionsp * b1 ,p * b2 , . . . ,p * bC for all C cell-types via clustering The clustering first performed dimension reduction using PCA (R package stats [23]) to the first 10 principal component, follow by k-means clustering with Pearson correlation as the similarity measure using function scClust in R package scdney [7]. The number of clusters, K, was initially set to be the number of cell-types C in the data. Cluster identity was assigned based on the majority cell-types in a cluster.
In cases where the clustering result do not recover all the cell-types, clustering was re-run with the number of clusters K increased by one. We repeated the clustering procedure until each cell-type has been assigned to at least one cluster. We then recorded the number of cells in each cell-type labelled cluster that belongs to each subject or replicate j.
Step 3: Pooled estimates from B complete linear model analyses.
For each bootstrap resample Y * bj , a Generalized Linear Model was fitted to assess the significance associated with each predictor variable. The R package lme4 was used to fit the model [24]. Two types of models were consideredthe fixed effect model (GLM) and the mixed effect model (GLMM) by treating the subject as a random effect. For the fixed effect model (GLM) the following R code glm(cell_count~cell_type + condition + subject + cell_type:condition, data, family = poisson(link=log)) was used.
This generated B bootstrap coefficient estimateŝ β * 1 ,β * 2 , . . .β * B from the GLM or GLMM model. B = 100 by default if user does not require any confidence interval to be estimated, otherwise B = 10000 by default. The pooled model estimate based on Rubin's rules [14] is whereβ * b is the coefficients estimates from the GLM model fitted with the b th bootstrap resample. The corresponding estimated pooled standard error is with V within and V between denotes the within and between imputation variance respectively. These are calculated by with SE * 2 b represents the sum of the squared Standard Error (SE), estimated in each bootstrap dataset b. and We used the pool function in the R package MICE [25].
Step 4: Significance testing For significance testing of any of the pooled parameters of interest such as the interaction effect between cell-types and conditions, the univariate Wald test [14,25] is used and the test statistics is defined as: whereβ * i is the i th element of the pooled coefficient vector β * .
Additional file 1: Supplementary Tables and Figures. Table S1. Simulation cases with correlated and rare cell-types. Table S2. Simulation of complex dataset. Table S3. Summary from Pooled GLM on neuronal dataset. Figure S1. scDC on correlated and rare cell-types. Figure S2. scDC on complex dataset. Figure S3. Runtime of scDC with varying number of cells and cell-types.