Skip to main content

scDC: single cell differential composition analysis

Abstract

Background

Differences in cell-type composition across subjects and conditions often carry biological significance. Recent advancements in single cell sequencing technologies enable cell-types to be identified at the single cell level, and as a result, cell-type composition of tissues can now be studied in exquisite detail. However, a number of challenges remain with cell-type composition analysis – none of the existing methods can identify cell-type perfectly and variability related to cell sampling exists in any single cell experiment. This necessitates the development of method for estimating uncertainty in cell-type composition.

Results

We developed a novel single cell differential composition (scDC) analysis method that performs differential cell-type composition analysis via bootstrap resampling. scDC captures the uncertainty associated with cell-type proportions of each subject via bias-corrected and accelerated bootstrap confidence intervals. We assessed the performance of our method using a number of simulated datasets and synthetic datasets curated from publicly available single cell datasets. In simulated datasets, scDC correctly recovered the true cell-type proportions. In synthetic datasets, the cell-type compositions returned by scDC were highly concordant with reference cell-type compositions from the original data. Since the majority of datasets tested in this study have only 2 to 5 subjects per condition, the addition of confidence intervals enabled better comparisons of compositional differences between subjects and across conditions.

Conclusions

scDC is a novel statistical method for performing differential cell-type composition analysis for scRNA-seq data. It uses bootstrap resampling to estimate the standard errors associated with cell-type proportion estimates and performs significance testing through GLM and GLMM models. We have made this method available to the scientific community as part of the scdney package (Single Cell Data Integrative Analysis) R package, available from https://github.com/SydneyBioX/scdney.

Background

Tissues are composed of many heterogeneous cell-types and differences in cell-type proportions often carry biological significance. For example, cell-type compositional differences can shed light on disease mechanisms [1], immune response in cancer [2] and developmental processes [3]. In recent years, single cell RNA sequencing (scRNA-seq) has become an increasingly popular technology as it allows the cell-type identity of each individual cell sequenced to be inferred. This has enabled generation of insights through comparison of cell-type composition between samples.

There are many challenges associated with estimating cell-type composition. First, there could be technical biases for each cell-type. Certain cell-types may be more prone to damage under existing scRNA-seq protocols [4], leading to bias and perhaps a larger variance in cell-type proportions. Second, there are many proposed methods for cell-type identification and none of them guarantees perfect identification accuracy. A systematic evaluation of 14 common clustering algorithms revealed differences in performance and clustering result [5, 6]. A further evaluation on similarity metrics in clustering algorithms illustrates further inconsistency in results due to the different metrics used for measuring differences in cell transcriptomes [7]. In addition, as the compositions sum to one, this transforms the data from the Euclidean space into a simplex space. Canonical statistical tests such as t-test may not be reliable, as these are not designed for relative proportion data [8]. These issues all highlight the difficulty and uncertainty associated with cell-type identification. However, there is a current lack of methodological research on cell-type compositional analysis for single cell data. Whilst some research papers draw observations on compositional differences of cell-types between conditions [1, 3, 9], such observations are not accompanied by a measure of uncertainty associated with the estimates. In contrast to the lack of such studies in the field of single cell analysis, compositional analysis has been an active and ongoing area in microbiome research, due to the compositional nature of microbiome data. Dirichlet – multinomial models have been used extensively for microbiome research, in which Dirichlet Monte Carlo procedures are used to obtain posterior distributions of sample proportions [1013]. However, due to the small sample size of scRNA-seq data, fitting a Dirichlet multinomial is not currently feasible.

In this paper, we present single cell Differential Composition (scDC), a method for estimating uncertainty in cell-type proportions via bootstrapping. Our method provides a bias-corrected estimate of cell-type proportions confidence intervals, and by performing clustering within each bootstrap iteration it also captures the uncertainty associated with cell-type identification. In scRNA-seq experiments with small sample sizes, such estimates of cell-type proportion uncertainties would be useful in determining the significance of cell-type compositional differences between conditions. In the future where multi-sample comparative scRNA-seq experiments are likely to become common place, our method offers a statistically robust way to propagate uncertainty in downstream analysis. We demonstrate this using two synthetic datasets derived from published experimental data. Finally, we implemented scDC as part of the scdney R package which is freely available from https://github.com/SydneyBioX/scdney.

Results

Overview of single cell differential composition analysis (scDC)

We developed scDC (workflow shown in Fig. 1), a novel approach based on bootstrap resampling, to perform differential cell-type composition analysis. Let Yij be a matrix of cell counts from i=1,…,C cell-types and j=1,…,R subjects. The first step of scDC involves stratified bootstrap resampling where we sampled with replacement Nj cells from the cell count matrix Yij for subject j. Stratification by subject is important because it ensures unbiased bootstrap estimates and captures subject to subject variability. Let \(Y^{*}_{bj}\) be the bth bootstrap sample for the jth subject. Subsequently, for each bootstrap iteration, we combined resampled cell counts \(Y^{*}_{b1},\ldots,Y^{*}_{bR}\) for all R subjects.

Fig. 1
figure 1

Overview of scDC workflow. This illustrates the main components of the scDC procedure. The key functions have been included on top of each sub-figure, where relevant. a Single cell data is collected and analysed. Publicly available experimental data is obtained. Simulated data is generated using R package PowSimR. b This corresponds to step 1 and 2 of the 4-step procedure. Resampled data is generated using stratified bootstrap and then clustered using clustering algorithm. Each cluster is matched to cell-type using reference cell labels in the original dataset. This step is repeated n times. c In step 3 and 4, cell count output from each bootstrap is fitted using GLM. The coefficient estimates from each individual GLM model are pooled using Rubin’s rules and tested for significance. d User can extract the overall estimates of statistics. e Each bootstrap re-sampling gives an estimated distribution of cell-types composition for each patient. f the result can be visualised graphically

The next step involved cell-type identification of each bootstrap sample using clustering. Here, we performed PCA dimension reduction followed by k-means clustering (using Pearson correlation as distance metric [7]). For each bootstrap sample, this generated cell-type proportions \(p^{*}_{b1},\ldots p^{*}_{bC}\) which we later used to calculate bootstrap standard errors (SE) for each of the cell-type proportions p1,…,pC. In parallel, for experiments with two conditions A and B, to assess changes in cell-type compositions between conditions, we modelled the data using a log-linear model accounting for cell-types, conditions, subjects and interaction effects between cell-types and conditions. A log-linear model was fitted to each bootstrap sample \(Y^{*}_{b}\) and the results were pooled using Rubin’s pooled estimate (see “Methods” section) [14]. This log-linear model can also be fitted as a mixed effect model by giving each subject a random effect and both options are available in our scDC framework.

Stratified bootstrap procedure provides good estimation of sampling error for cell-type proportions

First, we examined the validity of our stratified bootstrap procedure using simulation data (Table 1). In Fig.2a each point represents the sampling standard error and the bootstrap standard error associated with a particular subject. The figure clearly shows a high concordance between bootstrap standard error and sampling standard error. Therefore, bootstrap resampling should provide a reasonable estimate of uncertainty for cell-type proportion.

Fig. 2
figure 2

Stratified bootstrap provides good estimate of standard error. a Using four simulated datasets, stratified bootstrap was performed to estimate cell-type proportions in each bootstrap. Each dot on the figure represents a pair of value containing the mean standard error of bootstrap (x-axis) and standard error from sampling (y-axis). b Each horizontal bar shows confidence interval associated with proportion estimate

Table 1 Simulation cases with one condition

Bias-corrected and accelerated (BCa) bootstrap confidence interval for single subject cell-type proportions

We examined the effectiveness of various approaches to estimate confidence intervals (CI) associated with cell-type proportions at the single subject level. Cell-type proportions from a single cell experiment can be modeled using a multinomial distribution with C proportion parameters p1,…,pC. The simple standard error (SE) measure, commonly used to illustrate error bars, assumes symmetric CIs and therefore is not an appropriate representation of the CI of cell-type proportions. The SE estimate is especially problematic for rare cell-types, as it can generate CI estimates with endpoints that extend outside [0,1]. Existing approaches to estimate CIs for multinomial proportions require large expected cell counts [15]. For rare cell-types, cell counts may not be adequate to ensure the appropriate overall coverage probability.

To address this issue, we considered three methods to estimate CI of cell-type proportions: (1) the model-based approach of May and Johnson (2000) [16, 17], which constructs simultaneous CIs for multinomial proportions; (2) the bootstrap percentile method, and (3) the bootstrap BCa method [18]. Figure 2b shows CI estimates using these three methods on four sets of simulated data. Comparing simulation data Sim1 with 500 cells per replicate, Sim2 with 100 cells per replicate and Sim3 with 50 cells per replicate, it is clear that the width of all three types of confidence intervals increases with decreasing number of cells. The width of BCa CI is similar to multinomial and percentile in all cases, but is significantly shorter when there was presence of rare cell-types and small number of cells per replicate, as shown by the cell-type n4. This suggests that BCa provides better estimates for rare cell-types.

scDC correctly estimates the cell-type proportions and corresponding standard error in simulation data

Using a series of simulation data, we evaluated whether scDC can accurately recover true cell-type proportions and the corresponding standard errors within each condition and subject. For simulated data SimP1 and SimP2, we simulated the situation where biological replicates are available; for simulated data SimP3 and SimP4, we also modelled the subject to subject variability as random effects. The detailed simulation settings are listed in Table 2 and described in detail in “Methods” section. Figure 3a shows that in all four simulations, scDC correctly recovers the true cell-type proportions; Fig. 3b shows the corresponding bootstrap confidence intervals calculated using the BCa method. Visually, the confidence intervals overlap between conditions where cell-type proportions differ by 10%. When we used GLMs to test for significance, we found that there is significant interaction effect for cell-type 2 in simulation data SimP1 (p=0.01; Table 3) but not for cell-types 3 and 4. This shows concordance with the true underlying model.

Fig. 3
figure 3

scDC on simulated dataset. scDC has been applied to four simulated datasets. a Each dot represents a pair of values containing the mean proportion estimate calculated from scDC (x-axis) and the true proportion from the simulation data (y-axis). Across all four datasets, all dots lie on the diagonal line, even when the proportion of a cell-type is as low as 5%. This demonstrates that scDC is highly accurate in estimating cell-type composition. b Each simulated dataset contains two conditions, represented suing colours red and blue. The bar chart shows proportion estimates of each cell-type n1 to n4 for each subject in the dataset. The horizontal black line represents the BCa confidence interval associated with the proportion estimate

Table 2 Simulation cases with two conditions
Table 3 Summary of pooled estimates from fitting GLMM to bootstrap samples from simulation data SimP1

We also examined scDC’s performance on estimating cell-types proportions in datasets where there are highly correlated cell-types and rare cell-types, as these situations are often observed in real experimental datasets. Using eight simulated datasets SimS1 to SimS8, we demonstrated that scDC is highly accurate in recovering rare cell-types proportion as low as 1% and is only slightly affected when there is high degree of correlation between two cell-types (Additional file 1: Figure S1). The detailed simulation settings are listed in Additional file 1: Table S2)

Lastly, we simulated a dataset composed of 4000 cells and 10 cell-types with 3 rare cell-types, replicating the size and the complexity of real experimental dataset. scDC result shown in Additional file 1: Figure S2 clearly revealed that scDC accurately recovered true cell-types proportion. In addition, we observed that BCa produced the tightest CI interval around the density distribution of proportion estimates for the three rare cell-types, n1, n2 and n3. In contrast, both percentile and multinomial CI produced much wider CI in some cases. For the other seven cell-types, the three CI estimates were similar. This is consistent with our findings in the previous subsection.

Impact of specifying incorrect number of cell-types

In practice, the true number of cell-types are often unknown. To investigate the impact on scDC result when the number of cell-types is incorrectly determined, we simulated a dataset containing 500 cells and 4 cell-types and evaluated scDC with the number of cell-types specified as 2, 3, 4 and 5. Figure 4 clearly shows that in all cases when the input number of cell-types is incorrect, the resulting confidence intervals are very wide. Some of the confidence intervals appeared to be out of place compared to the density distribution. Only when scDC was ran with number of cell-types being 4 did the confidence intervals form reasonably CI estimate as seen from close bands surrounding the density distributions.

Fig. 4
figure 4

Impact of incorrectly specified numbers of cell-types. We simulated a dataset containing 4 cell-types and ran scDC with number of cell-types specified to be 2, 3, 4 and 5. a shows the width of the three types of confidence intervals. b shows the density distribution of proportion estimates

Application of scDC reveals differential cell-type proportions in two synthetic datasets

To demonstrate the applicability of our method on data containing realistic variability from single cell experiments, we examined two synthetic datasets constructed from two sets of real experimental data — Pancreas [1] and Neuronal [3] (see “Methods” section).

In the Pancreas dataset that examines subjects with type two diabetes (T2D) and healthy subjects, we observed subject to subject variation. Despite the overall proportion of beta cells being lower with subjects with T2D compared with healthy controls, our analysis confirms the original finding that 1 of the 4 subjects has a higher beta cell value. The boxplot for this patient has a small interquartile range, suggesting that there is high confidence that the high value observed with this subject is not due to an error or random chance, for example, in the clustering procedure. The overall boxplot (Fig. 5d-f) summarising the average cell proportion across subjects further shows that the interquartile range of the beta cell proportion of normal subjects is not overlapping with the interquartile range of the beta cell proportion of T2D subjects. This confirms that, despite one of the T2D subjects being an outlier, the overall beta cell proportion of T2D subjects is lower than for normal subjects. Inspecting output from random effect and fixed effect model revealed a difference in subject variability and the pooled fixed effect model also suggests the beta cell-type and subject interaction effect is significant (p=0.02). This again highlights that in the beta cell, there is a lower proportion in T2D compared to healthy and such difference in proportion isn’t in the alpha cell-type. The details of the GLM and GLMM results are presented in Table 4.

Fig. 5
figure 5

scDC on synthetic dataset. scDC has been applied on two different synthetic datasets, a-c shows result on neuronal dataset and d-f shows result on pancreas dataset. a Subjects are samples taken at different developmental time point. Each time point has two or three samples, as indicated by the labels on the x-axis. Boxplot represents 100 cell proportion values obtained from performing scDC. The diamond symbol in each boxplot represents the reference cell proportion value calculated from the original dataset. b Each boxplot is drawn by taking the mean of subjects at each time point. Thus wider boxplot indicates greater subject to subject variability. c This plots the mean cell composition of each subject from the 100 cell proportion values obtained from scDC and compares across the developmental time point. A trend in proportional change of the four cell-types is visible. d Subjects are samples taken from normal subjects and type 2 diabetes subjects. e Averaging across subjects for each cell-type reveals a clear difference in beta cell proportions between normal and type two diabetes subjects. f Compared to mouse data (c), human data exhibit much greater between subject variability

Table 4 Summary of pooled estimates from fitting GLM and GLMM to bootstrap samples from Pancreas dataset

The Neuronal dataset contains samples from inbred mice at different developmental time points. With the neuronal dataset, we examined the mesoderm, neural crest, neural crest neurons, and progenitor cell-types. Here, the subject to subject variability are substantially smaller (Fig. 5a-c). Application of scDC reveals a clear trend in the relative proportion of these four cell-types. In particular, there is a clear increase in the relative proportion of progenitor, neural crest and neural crest neurons, and a decrease in the relative proportion of mesoderm. The GLM and GLMM analysis show a significant interaction effect between cell-types and time points (p<0.001). The details of the GLM and GLMM results are presented in Additional file 1: Table S1. The original paper examined the change in progenitor compared to neurons and stated that the percentage of progenitor increased as the percentage of neurons decreased. Our findings suggest that, by excluding neurons from analysis, there is a visible increase in the percentage of progenitor cells over time. This finding is worth further investigation, as progenitor cells is a key player in neurogenesis which gives rise to neuronal cells.

Discussion

We present a novel statistical framework, scDC, to assess cell-type composition differences between conditions of interest. This involves developing approaches to estimate standard error of cell-type proportions estimates in scRNA-seq data via bootstrapping and adapting a missing values framework to facilitate Wald testing from bootstrap log-linear model and generalized linear mixed model analyses. By applying scDC to simulated data, we demonstrate that our method can accurately recover the correct confidence interval of cell-type proportions in an unbiased manner. Furthermore, we show in two synthetic datasets that cell-type composition differences can be accurately determined following the scDC procedure.

All confidence interval estimation procedures discussed in this paper can be extended to directly calculate the confidence interval associated with the difference in ith cell-type proportions. This provides an alternative approach to the CI constructed based on SEpooled that will be symmetric by default. These various CI estimation procedures generate different CIs with different coverage probabilities and interval lengths. The ability to calculate the correct coverage probability with the smallest interval length is an important component of power calculations as it leads to a reduction of sample size needed to achieve detection of differences between two cell-types.

Comparisons between the two datasets confirm the widely accepted phenomena that human data contain much higher subject to subject variability than mouse data. The lower mouse to mouse variability could be due to two factors. First, cell-types in mouse are easier to distinguish, resulting in greater stability in clustering process. Second, inbred mice might exhibit much less variability than humans. Our scDC method provides and encourages the visualization of SE associated with cell-type proportions estimates at different level. For example, in the Pancreas dataset, despite there being only one T2D subject with a higher beta cell proportion, there is still a significant interaction effect between T2D and beta cell proportions when taking account of all subject. This opens up a potential problem of hierarchical compositional analysis where one identifies the proportion of population within T2D that demonstrates a higher beta-cell proportion versus the proportion of population within T2D that has a lower beta-cell proportion, and what sample size of subject is required to achieve this.

As with all bootstrap approximations, we are limited by the initial sampling procedure. This is more complicated in data with samples from different human subjects, where the cell-type proportions in each sample vary greatly. In addition, comparing Sim1 and Sim3 from Fig.2b, it is clear that the width of the confidence interval increases greatly as the number of cells in dataset decrease. Similar situation is likely to be observed in experimental datasets, where the number of cells obtained from each sample may differ, particularly after quality control. scDC utilizes stratified bootstrap sampling, stratifying based on each subject. Without stratified bootstrap sampling, we will over-estimate the SE - this is a natural consequence of large subject to subject variability. It is not appropriate to assume that all subjects share the same cell-type proportions and be from the same multinomial distribution. In the stratified bootstrap, we assume that each subject has a different cell-type proportion and draws from a different multinomial distribution.

Finally, we implemented scDC as part of the package scdney for use by the scientific community. This package performs stratified bootstrap, builds GLM models and provides visualizations of the bootstrap results.

Conclusions

In this paper, we present scDC, a novel statistical framework for performing differential composition analysis in scRNA-seq experiments. scDC utilizes bootstrap resampling to estimate the standard errors in cell-type proportions, and enables significance testing from bootstrap GLM and GLMM analysis. We applied scDC to both simulation and synthetic data sets and showed that it can accurately recover true cell-type proportions and estimate the variance in cell-type proportions. By pooling estimates from GLM analyses in each bootstrap resample, we also showed that we can perform significance testing of composition differences across subjects and conditions. scDC is implemented as an R package and can be freely downloaded from our GitHub repository. Our implementation supports direct input from sc-RNA expression matrix.

Methods

Simulated data generation

The R package powsimR was used to simulate the single-cell RNA-seq data [19]. We selected 10% of genes to be differentially expressed between two cell-types, with fold changes sampled from a multivariate normal distribution. Gene specific distributional parameters are estimated by a dataset from mouse embryonic stem cells [20].

Simulation with one conditionTo validate the bootstrapping estimation, we simulated a population of 5000 cells consisting of 4 cells types. Four different simulations were set up, with three simulations consisting of 4 cell-types with proportion of 10%, 20%, 30% and 40% (Simulated data Sim1, Sim2, Sim3), and one simulation with proportion 5%, 25%, 30% and 40% (Simulated data Sim4). Simulated data Sim1 has 10 replicates, with 500 cells in each replicates. Four replicates were randomly chosen and bootstrapped. Simulated data Sim2 has 50 replicates in total, with 100 cells in each replicate. Five replicates were randomly chosen and bootstrapped. Simulated data Sim3 has 100 replicates in total, with 50 cells in each replicate. 20 replicates were chosen and bootstrapped. Simulated data Sim4 has 50 replicates in total, with 100 cells in each replicate. We chose 20 replicates to be bootstrapped (Table 1).

Simulation with two conditionTo validate our scDC procedure in-silico, we simulated a population of 6000 cells of four cell-types, with two conditions with different cell-type proportions. Each condition is divided into three subsets, which can be considered as the subpopulation of three subjects. Then we randomly sampled 200 cells from each subpopulation to create a simulated dataset with 1200 cells. Simulated data SimP1 and SimP2 represents the situation when the biological replicates are available. Here we sample from the same multinomial distribution with true cell-type proportions given in Table 2. Simulated data SimP3 and SimP4 represents the situation where the variability between subjects follows a N(0,σ2) distribution, where σ=0.02. The detailed cell-type composition settings of each simulation are listed in Table 2.

Simulation with rare cell-types and correlated cell-typesWe simulated four datasets that have different degree of correlation between cell-types and four datasets that have rare cell-types of different proportions. For each dataset, we first simulated a population of 2400 cells, then we randomly sampled 1200 cells to create the final dataset. Each dataset is made up of two conditions with different cell-type proportions and three replicates in each condition. Variability between subjects are introduced by modelling a N(0,σ2) distribution, where σ=0.02. The detailed cell-type composition settings of each simulation are listed in Additional file 1: Table S1.

Simulation of complex datasetTo test scDC on a dataset with complexity similar to the characteristics of experimental dataset, we simulated one dataset with 8000 cells and 10 cell-types and randomly selected 4000 cells to create the final dataset. The 10 cell-types contains 3 rare cell-types with proportion as low as 1%. This dataset is composed of 2 conditions and 4 replicate subjects in each condition. Variability between subjects are introduced by modelling a N(0,σ2) distribution, where σ=0.02. Detailed cell-type composition settings for this simulation are listed in Additional file 1: Table S2.

Synthetic data

We generated two synthetic datasets from two publicly available datasets [1, 3]. All data were first normalized by TPM (transcripts per million) and log-transformed. Any samples with a total of less than 50 cells of interest or with less than 10 cells of any cell-types of interest were excluded.

Synthetic Pancreas DatasetThe Pancreas dataset contains 2,209 single cells taken from the pancreas of six healthy and four type 2 diabetic human cadaveric donors (E-MTAB-5061) [1]. Alpha, beta and ductal cells were selected to create the synthetic dataset. Cell labels were provided for all cells by the authors.

Synthetic Neuronal DatasetThe Neuronal dataset contains 21,465 single cells taken from neural tubes of mouse embryo at five developmental time points from e9.5 to e13.5 [3]. Raw data is made available by the author on their GitHub repository https://github.com/juliendelile/MouseSpinalCordAtlas. The cell-types mesoderm, neural crest, neural crest neurons and progenitor were selected to create the synthetic dataset. Cell-type labels were provided by the original authors through assessing the expression of a list of curated genes.

Bootstrap confidence intervals

Let nij be the number of cells of cell-type i and subject j, i=1,…,C,j=1,…,R; Nj be the number of cells within the subject j; and N be the total number of cells in the dataset. For a typical subject j, the ith cell-type proportion is defined as pij and calculated by

$$p_{ij} = \frac{n_{ij}}{\sum_{i} n_{ij}} = \frac{n_{ij}}{N_{j}}.$$

The percentile methodThis is an intuitive method for estimating the bootstrap confidence interval by first estimating the empirical distribution of pij. Let \(\hat {p}_{ij(\alpha)}^{*}\) represent the 100α-th percentile of B bootstrap replications, \(\hat {p}_{ij(1)}^{*}\), \(\hat {p}_{ij(2)}^{*},\ldots,\hat {p}_{ij(B)}^{*}\), with

$$\hat{p}_{ij(1)}^{*} \leq \hat{p}_{ij(2)}^{*}\leq \hat{p}_{ij(3)}^{*} \leq... \leq \hat{p}_{ij(B)}^{*},$$

where B=10000 by default. Then the percentile interval of 100(1−α)% for pij is estimated by

$$\left[\hat{p}_{ij(B\alpha/2)}^{*}, \hat{p}_{ij(B(1-\alpha/2))}^{*}\right].$$

The BCa methodTo account for skewness in the confidence interval, we used the Bias-corrected and accelerated (BCa) confidence intervals proposed by Efron (1987) [21]. The BCa method uses percentiles of bootstrap distribution described above, but depends on an acceleration parameter \(\hat {a}\) and a bias-correction factor \(\hat {z}_{0}\). The 100(1−α)% BCa interval of \(\hat {p}^{*}_{ij}\) can be calculated by

$$\left[\hat{p}_{ij(\alpha_{1})}^{*}, \hat{p}_{ij(\alpha_{2})}^{*}\right],$$

where

$$\alpha_{1} = \Phi\left(\hat{z}_{0} + \frac{\hat{z}_{0} + z_{(\alpha)}}{1-\hat{a}(\hat{z}_{0} + z_{(\alpha)})} \right);$$
$$\alpha_{2} = \Phi\left(\hat{z}_{0} + \frac{\hat{z}_{0} + z_{(1-\alpha)}}{1-\hat{a}(\hat{z}_{0} + z_{(1-\alpha)})} \right).$$

Here Φ(·) is the cumulative distribution function of standard normal distribution, and z(α) indicates the 100α-th percentile point of a standard normal distribution.

This method estimates two adjustment factors. First, the bias-correction factor which estimates the proportion of the bootstrap replication less than the original estimate. This is calculated by

$$\hat{z}_{0} = \Phi^{-1}(\#\{\hat{p}_{ij(b)}^{*}<\hat{p}_{ij}\}/B),$$

where Φ−1 is the inverse cumulative distribution function of standard normal and \(\hat {p}_{ij}\) is the original estimate of proportion, derived from Step 2 of clustering in the scDC procedure.

The second is the acceleration factor which quantifies the rate of change of the standard error of \(\hat {p}_{ij}\) with respect to the original estimate pij. This is calculated using a jackknife approach as follows,

$$\hat{a} = \frac{\sum_{n=1}^{N} (\hat{p}_{ij(\cdot)}-\hat{p}_{ij(n)})^{3}}{6\left\{\sum_{n=1}^{N} (\hat{p}_{ij(\cdot)}-\hat{p}_{ij(n)})^{2}\right\}^{3/2}},$$

where \(\hat {p}_{ij(n)}\) is the estimate of pij holding out cell n in the data, and \(\hat {p}_{ij(\cdot)} = \frac {1}{N}\sum _{n=1}^{N} \hat {p}_{ij(n)}\). We modifies the implementation of function bcanon in bootstrap package to estimate BCa intervals for cell-type proportions of each subject [22].

Note that, for data with only one subject, we can still estimate the CIs with j=1.

Single cell differential composition analysis (scDC)

scDC is a 4-step procedure that tests for differences in cell-type composition between conditions.

Step 1: Stratified bootstrapThis is used to generate B bootstrap samples by sampling with replacement and stratified by subjects. We denote each bootstrap sample data of subject j as \(Y^{*}_{bj}\). All genes with zero variance in each of the B bootstrap sample \(Y^{*}_{bj}\) are removed as they are not informative for cell-type determination.

Step 2: Cell-type identificationWe estimate the cell-type identity of each cell by clustering each bootstrap resample \(Y^{*}_{bj}\) and estimate the bootstrap cell-type proportions \(\hat {p}^{*}_{b1}, \hat {p}^{*}_{b2}, \ldots, \hat {p}^{*}_{bC}\) for all C cell-types via clustering The clustering first performed dimension reduction using PCA (R package stats [23]) to the first 10 principal component, follow by k-means clustering with Pearson correlation as the similarity measure using function scClust in R package scdney [7]. The number of clusters, K, was initially set to be the number of cell-types C in the data. Cluster identity was assigned based on the majority cell-types in a cluster.

In cases where the clustering result do not recover all the cell-types, clustering was re-run with the number of clusters K increased by one. We repeated the clustering procedure until each cell-type has been assigned to at least one cluster. We then recorded the number of cells in each cell-type labelled cluster that belongs to each subject or replicate j.

Step 3: Pooled estimates from B complete linear modelanalyses.For each bootstrap resample \(Y^{*}_{bj}\), a Generalized Linear Model was fitted to assess the significance associated with each predictor variable. The R package lme4 was used to fit the model [24]. Two types of models were considered — the fixed effect model (GLM) and the mixed effect model (GLMM) by treating the subject as a random effect. For the fixed effect model (GLM) the following R code

was used.

For the mixed effect model (GLMM), the following R code

was used.

This generated B bootstrap coefficient estimates \(\hat {\boldsymbol {\beta }}^{*}_{1}, \hat {\boldsymbol {\beta }}^{*}_{2}, \ldots \hat {\boldsymbol {\beta }}^{*}_{B}\) from the GLM or GLMM model. B=100 by default if user does not require any confidence interval to be estimated, otherwise B=10000 by default. The pooled model estimate based on Rubin’s rules [14] is

$$\bar{\boldsymbol{\beta}}^{*} = \frac {1}{B} \sum_{b=1}^{B}{\hat{{\boldsymbol{\beta}}}_{b}^{*}}, $$

where \({\hat {{\boldsymbol {\beta }}}_{b}^{*}} \) is the coefficients estimates from the GLM model fitted with the bth bootstrap resample. The corresponding estimated pooled standard error is

$${SE}_{pooled}=\sqrt {{V}_{within}+{V}_{between}+\frac {{V}_{between}}{B}},$$

with Vwithin and Vbetween denotes the within and between imputation variance respectively. These are calculated by

$${V}_{within} = \frac {1}{B} \sum_{b=1}^{B}{{SE}_{b}^{*2}},$$

with \({SE}_{b}^{*2}\) represents the sum of the squared Standard Error (SE), estimated in each bootstrap dataset b. and

$${V}_{between} = \sqrt {\frac {\sum_{b=1}^{B}({{\hat{\boldsymbol{\beta}}^{*}_{b}}-\bar{\boldsymbol{\beta}}^{*})^{2}}}{B-1}}.$$

We used the pool function in the R package MICE [25].

Step 4: Significance testingFor significance testing of any of the pooled parameters of interest such as the interaction effect between cell-types and conditions, the univariate Wald test [14, 25] is used and the test statistics is defined as:

$$W_{pooled} = \frac{\bar{{\beta}}^{*2}_{i}}{SE_{pooled}}, $$

where \(\bar {{\beta }}^{*}_{i}\) is the ith element of the pooled coefficient vector \(\bar {\boldsymbol {\beta }}^{*}\).

Availability of data and materials

The method scDC is implemented in R package scdney, available from https://github.com/SydneyBioX/scdney. Data used in this paper are available through the following link.

The human pancreas dataset is available at http://sandberg.cmb.ki.se/pancreas/. Date accessed 4 September 2019.

The mouse neuronal dataset can be found at http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-7320. Date accessed 4 September 2019.

Abbreviations

BCa:

Bias-corrected and accelerated

CI:

Confidence interval

GLM:

Generalized linear model

GLMM:

Generalized linear mixed model

scDC:

Single cell differential composition

scRNA-seq:

Single cell RNA sequencing

SE:

Standard error

T2D:

Type two diabetes

References

  1. Segerstolpe A, Palasantza A, Eliasson P, Andersson EM, Andreasson AC, Sun X, Picelli S, Sabirsh A, Clausen M, Bjursell MK, Smith DM, Kasper M, Ammala C, Sandberg R. Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes. Cell Metab. 2016; 24(4):593–607.

    Article  CAS  Google Scholar 

  2. Ali HR, Chlon L, Pharoah PD, Markowetz F, Caldas C. Patterns of immune infiltration in breast cancer and their clinical implications: a gene-expression-based retrospective study. PLoS Med. 2016; 13(12):1002194.

    Article  Google Scholar 

  3. Delile J, Rayon T, Melchionda M, Edwards A, Briscoe J, Sagner A. Single cell transcriptomics reveals spatial and temporal dynamics of gene expression in the developing mouse spinal cord. Development. 2019. https://doi.org/10.1242/dev.173807.

    Article  CAS  Google Scholar 

  4. Ilicic T, Kim JK, Kolodziejczyk AA, Bagger FO, McCarthy DJ, Marioni JC, Teichmann SA. Classification of low quality cells from single-cell rna-seq data. Genome Biol. 2016; 17(1):29. https://doi.org/10.1186/s13059-016-0888-1.

    Article  Google Scholar 

  5. Duò A, D. Robinson M, Soneson C. A systematic performance evaluation of clustering methods for single-cell rna-seq data. F1000Research. 2018; 7:1141. https://doi.org/10.12688/f1000research.15666.2.

    Article  Google Scholar 

  6. Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in r for medium-sized 10x genomics single-cell rna-sequencing data. F1000Research. 2018; 7:1297.

    Article  Google Scholar 

  7. Kim T, Chen IR, Lin Y, Wang AY-Y, Yang JYH, Yang P. Impact of similarity metrics on single-cell rna-seq data clustering. Brief Bioinform. 2018. https://doi.org/10.1093/bib/bby076.

  8. Aitchison J. The single principle of compositional data analysis, continuing fallacies, confusions and misunderstandings and some suggested remedies. Keynote address, CODAWORK 2008. 2019. https://core.ac.uk/download/pdf/132548276.pdf. Accessed 20 Nov 2019.

  9. Shih AJ, Menzin A, Whyte J, Lovecchio J, Liew A, Khalili H, Bhuiya T, Gregersen PK, Lee AT. Identification of grade and origin specific cell populations in serous epithelial ovarian cancer by single cell rna-seq. PLoS ONE. 2018; 13(11):1–17. https://doi.org/10.1371/journal.pone.0206785.

    Article  Google Scholar 

  10. La Rosa PS, Brooks JP, Deych E, Boone EL, Edwards DJ, Wang Q, Sodergren E, Weinstock G, Shannon WD. Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS ONE. 2012; 7(12):1–13. https://doi.org/10.1371/journal.pone.0052078.

    Article  Google Scholar 

  11. Chen J, Li H. Variable selection for sparse dirichlet-multinomial regression with an application to microbiome data analysis. The Ann Appl Stat. 2013; 7(1):418–42. https://doi.org/10.1214/12-AOAS592.

    Article  Google Scholar 

  12. Bian G, Gloor GB, Gong A, Jia C, Zhang W, Hu J, Zhang H, Zhang Y, Zhou Z, Zhang J, Burton JP, Reid G, Xiao Y, Zeng Q, Yang K, Li J. The gut microbiota of healthy aged chinese is similar to that of the healthy young. mSphere. 2017; 2(5). https://doi.org/10.1128/mSphere.00327-17. http://arxiv.org/abs/https://msphere.asm.org/content/2/5/e00327-17.full.pdf.

  13. Rivera-Pinto J, Egozcue JJ, Pawlowsky-Glahn V, Paredes R, Noguera-Julian M, Calle ML. Balances: a new perspective for microbiome analysis. mSystems. 2018; 3(4). https://doi.org/10.1128/mSystems.00053-18. http://arxiv.org/abs/https://msystems.asm.org/content/3/4/e00053-18.full.pdf.

  14. Toutenburg H. Rubin, d.b.: Multiple imputation for nonresponse in surveys. Stat Pap. 1990; 31(1):180. https://doi.org/10.1007/BF02924688.

    Article  Google Scholar 

  15. Quesenberry CP, Hurst DC. Large sample simultaneous confidence intervals for multinomial proportions. Technometrics. 1964; 6(2):191–5. https://doi.org/10.1080/00401706.1964.10490163. http://arxiv.org/abs/https://www.tandfonline.com/doi/pdf/10.1080/00401706.1964.10490163.

    Article  Google Scholar 

  16. May WL, Johnson WD, et al. Constructing two-sided simultaneous confidence intervals for multinomial proportions for small counts in a large number of cells. J Stat Softw. 2000; 5(6):1–24.

    Article  Google Scholar 

  17. Sison CP, Glaz J. Simultaneous confidence intervals and sample size determination for multinomial proportions. J Am Stat Assoc. 1995; 90(429):366–9.

    Article  Google Scholar 

  18. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability, vol. 57. Boca Raton: Chapman & Hall/CRC; 1993.

    Google Scholar 

  19. Vieth B, Ziegenhain C, Parekh S, Enard W, Hellmann I. powsimR: power analysis for bulk and single cell RNA-seq experiments. Bioinformatics (Oxford, England). 2017. https://doi.org/10.1093/bioinformatics/btx435.

    Article  CAS  Google Scholar 

  20. Kolodziejczyk AA, Kim JK, Tsang JCH, Ilicic T, Henriksson J, Natarajan KN, Tuck AC, Gao X, Bühler M, Liu P, Marioni JC, Teichmann SA. Single Cell RNA-Sequencing of Pluripotent States Unlocks Modular Transcriptional Variation. Cell Stem Cell. 2015. https://doi.org/10.1016/j.stem.2015.09.011.

    Article  CAS  Google Scholar 

  21. Efron B. Better bootstrap confidence intervals. J Am Stat Assoc. 1987; 82(397):171–85. https://doi.org/10.1080/01621459.1987.10478410. https://amstat.tandfonline.com/doi/pdf/10.1080/01621459.1987.10478410.

    Article  Google Scholar 

  22. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Vol. 57; 1993. https://doi.org/10.1111/1467-9639.00050.

    Article  Google Scholar 

  23. R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2018. R Foundation for Statistical Computing.

    Google Scholar 

  24. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. J Stat Softw. 2015; 67(1):1–48. https://doi.org/10.18637/jss.v067.i01.

    Article  Google Scholar 

  25. van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in r. J Stat Softw. 2011; 45(3):1–67.

    Google Scholar 

Download references

Acknowledgements

The authors thank all their colleagues, particularly at The University of Sydney, School of Mathematics and Statistics, for their support and intellectual engagement. In particular to Weichang Yu for discussion in multinomial simulation.

About this supplement

This article has been published as part of BMC Bioinformatics, Volume 20 Supplement 19, 2019: 18th International Conference on Bioinformatics. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-19.

Funding

Publication of this supplement is funded by the following sources for each author. Australian Research Council Discovery Project grant (DP170100654) to JYHY, JTO, PY; Discovery Early Career Researcher Award (DE170100759) to PY; Australia NHMRC Career Developmental Fellowship (APP1111338) to JYHY and the Judith and David Coffey Life Lab at the Charles Perkins Centre, The University of Sydney to YL. Research Training Program Tuition Fee Offset and Stipend Scholarship to YL. The funding source had no role in the study design; in the collection, analysis, and interpretation of data, in the writing of the manuscript, and in the decision to submit the manuscript for publication.

Author information

Authors and Affiliations

Authors

Contributions

JYHY and KKL conceived the study with input from PY and JTO; YC and YL led the method development and data analysis with input from JTO, KKL, PY and JYHY; JYHY, KKL, YL and YC interpreted the results with input from PY; YL and YC processed and curated the data and implemented the R package. YC, JYHY, PY, YL, and KKL wrote the manuscript with input from JTO; All authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Kitty K. Lo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1

Supplementary Tables and Figures. Table S1. Simulation cases with correlated and rare cell-types. Table S2. Simulation of complex dataset. Table S3. Summary from Pooled GLM on neuronal dataset. Figure S1. scDC on correlated and rare cell-types. Figure S2. scDC on complex dataset. Figure S3. Runtime of scDC with varying number of cells and cell-types.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Cao, Y., Lin, Y., Ormerod, J.T. et al. scDC: single cell differential composition analysis. BMC Bioinformatics 20 (Suppl 19), 721 (2019). https://doi.org/10.1186/s12859-019-3211-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-019-3211-9

Keywords