Sensei: how many samples to tell a change in cell type abundance?

Cellular heterogeneity underlies cancer evolution and metastasis. Advances in single-cell technologies such as single-cell RNA sequencing and mass cytometry have enabled interrogation of cell type-specific expression profiles and abundance across heterogeneous cancer samples obtained from clinical trials and preclinical studies. However, challenges remain in determining the sample sizes needed for ascertaining changes in cell type abundances in a controlled study. To address this statistical challenge, we have developed a new approach, named Sensei, to determine the number of samples and the number of cells that are required to ascertain such changes between two groups of samples in single-cell studies. Sensei expands the t-test and models the cell abundances using a beta-binomial distribution. We evaluate the mathematical accuracy of Sensei and provide practical guidelines on over 20 cell types in over 30 cancer types based on knowledge acquired from The Cancer Genome Atlas (TCGA) and prior single-cell studies. We provide a web application to enable user-friendly study design via https://kchen-lab.github.io/sensei/table_beta.html. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04526-5.

Changes in the abundance of specific immune cell types within the tumor microenvironment (TME) over time reflect the evolution of cancer across the successive stages of premalignancy, invasion, local recurrence and distant metastatic spread [5,11,12]. Differences in TME composition also reflect different tumor subtypes associated with different coevolving immune responses, and thus two of the hallmarks of cancer: evasion of immune detection and tumor-promoting inflammation [13]. This information is therefore critical for understanding the role of the immune system during cancer evolution and metastasis, and for developing immune interception strategies for both cancer prevention and treatment [2].
For example, the intestinal mucosa is populated by intra-epithelial lymphocytes and mucosa-associated lymphoid tissue. Proportions of T cells may vary in mucosa specimens obtained from healthy individuals at average risk for colon cancer development (general population) compared to individuals at high risk as a consequence of genetic predisposition due to an inherited condition such as Lynch syndrome. Lynch syndrome is the most frequent hereditary syndrome predisposing to the development of colorectal cancer and is secondary to the presence of germline mutations in one of the DNA mismatch-repair (MMR) genes. The deficiency of this mechanism leads to the accumulation of hundreds of point mutations and insertion-deletion loops (indels) that generate hypermutant neoplastic lesions [14]. These mutations give rise to antigenic peptides (also known as neoantigens) that are recognized by the immune system, thus leading to an activation of different immune cell populations. Therefore, studying changes in immune cell proportions at single-cell resolution could help understand the immune response triggered at the intestinal level, thus helping to envision strategies to enhance it to prevent cancer or to decrease it to treat conditions such as inflammatory bowel disease [15]. This type of study would require multi-color flow cytometry [16] and also intersects with microbiome datasets [17], but it can now be accomplished with much higher accuracy owing to the rise of single-cell RNA sequencing (scRNA-seq) and single-cell ATAC (assay for transposase-accessible chromatin) sequencing (scATAC-seq) [18,19]. To observe and confirm cell type differences, samples from multiple research participants need to be collected and sequenced; thus, accurately estimating an adequate sample size is critical for the feasibility and success of these types of studies, given the current high cost of these technologies.
On the other hand, an insufficient number of samples can lead to a false-negative result [20].
Various sources of variability can complicate the ascertainment of cell type abundance. Sample preparation and single-cell sequencing reactions can introduce undesirable technical biases and variations [21]. For example, cell types that are hard to harvest intact, such as neurons and adipocytes, may be disproportionately underrepresented. As for single-cell profiling, scRNA-seq can introduce dropouts of lowly expressed genes, low total gene counts per cell, and high bias for 3' coverage [5], while scATAC-seq can be confounded by sampling efficiency, resulting in highly sparse profiles [22]. Furthermore, mass cytometry brings its own challenges, as it is susceptible to oxidization and signal spillover [23]. All of these factors often lead to uncertainty in cell typing and, therefore, need to be properly accounted for in sample size estimation before the experiments are performed. Moreover, the choice of platform depends on the number of cells that can be assayed, which ranges vastly from 100 to 10,000 [5], and on the fact that on many occasions few cells remain after quality control. In general, a limited number of cells leads to underrepresentation of cell types and drift in their proportions. Therefore, a method that considers these factors is urgently needed.
However, it is challenging to capture the effects of these factors in a mathematical model. Several approaches have utilized statistical models to estimate the number of cells that are required for a single-cell study. "Howmanycells" (https://satijalab.org/howmanycells) uses negative binomial distributions to estimate how many cells must be assayed in total to ensure sufficient representation of a given cell type, assuming that the numbers of cells in different cell types are mutually independent. However, if the proportion of one cell type rises, the proportions of other cell types must fall. Accordingly, SCOPIT [24] uses a Dirichlet-multinomial distribution to add negative correlations between cell types. Nevertheless, the authors of SCOPIT have verified that calculations based on the independence assumption are very similar to those of SCOPIT, off by a maximum of one cell [24]. Further improvement in modeling is possible, but it will likely result in nonanalytical solutions. Also, validating the accuracy of more sophisticated models will be unrealistic, as it requires datasets providing impractical and, most of the time, unfeasible numbers of technical replicates.
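The cell-number question addressed by these earlier tools can be illustrated with a minimal Python sketch. It is not the code behind "Howmanycells" or SCOPIT; it only assumes, as described above, that each assayed cell independently belongs to the cell type of interest with probability p. The function name min_cells and the values p = 0.05, k = 10 and 95% confidence are hypothetical choices for illustration.

```python
from scipy.stats import binom, nbinom


def min_cells(p, k, confidence=0.95):
    """Smallest total cell count n with P(at least k cells of the type) >= confidence.

    Assumes every assayed cell independently belongs to the type with probability p.
    """
    n = k
    # nbinom.cdf(n - k, k, p): probability that the k-th cell of the type
    # has already been observed once n cells in total have been drawn.
    while nbinom.cdf(n - k, k, p) < confidence:
        n += 1
    return n


# Hypothetical example: a cell type at 5% abundance, at least 10 cells wanted.
n = min_cells(p=0.05, k=10)
# The same event written as a binomial count: at least k successes in n draws.
print(n, binom.sf(10 - 1, n, 0.05))
```

The negative binomial and binomial expressions describe the same event, so either can be used; neither says anything about how many biological samples are needed, which is the question Sensei addresses.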
Most importantly, those previous approaches were designed to estimate the number of cells in a single biological sample, not the number of biological samples that are required to ascertain changes in cell type abundance across biological conditions, which is a very different goal. For biological sample size estimation, the legacy sample size estimation approach for the t-test (Methods) does not factor in the variance introduced by an insufficient number of cells. Thus, the estimation can be over-optimistic, especially for rare cell types.
Here, we present a new approach, Sensei, to provide accurate estimation of the sample size (or, equivalently, statistical power or false negative rate) for a variety of single-cell studies. Sensei takes into consideration both the number of samples and the number of cells within a unified mathematical framework and accounts for the abovementioned variabilities. We validate the accuracy of Sensei using multiple datasets and demonstrate Sensei's utility in a wide range of study settings that can impact broadly on both cancer prevention and treatment. We have also developed an online web application making Sensei accessible for clinical and basic science researchers during a study design.

Sensei
The framework of Sensei to model a controlled clinical study is illustrated in Fig. 1. The study design includes a control group and a case group of participants of certain sizes (Fig. 1a). The proportion of a cell type, T cell as an example hereafter, in a specific tissue varies among participants. While a level of difference is expected between the means of the T cell proportions in the two groups, within-group variances blur it, thus making a statistical test necessary for ascertainment. Because a proportion falls between 0 and 1, Sensei uses a beta distribution to model the true proportion of T cells in each group, which parametrizes the difference between groups and the variance among participants within each group (Fig. 1b). For studies involving matched pairs of specimens, e.g., autologous samples from one group of participants, additional statistical power can be acquired from modeling the positive correlation of proportions of each cell type between pairs of samples.

Fig. 1 Framework of Sensei. a-f show side by side how Sensei (right) models a controlled clinical study (left). a A controlled study involves a control group and a case group for ascertaining the difference in the proportions of T cells between the two groups. b Sensei models the true biological between-group difference and within-group variance using beta distributions. Correlation is also modeled for a matched-pairs study design. c A biopsy is extracted from each participant and assayed by a single-cell technology. Cell types are identified in silico. d Sensei models technical variations introduced by limited cell number using a binomial distribution (with other technical variations already accounted for in b). e The t-test is performed to identify statistically significant differences. f Sensei infers the distribution of the t-statistic and calculates the false negative (type II error) rates. g A sample input for Sensei. Required are sample sizes, cell numbers, estimated proportions of the cell type, and the false positive (type I error) rate for the t-test. h A sample output of Sensei, corresponding to (g). Tabulated are false negative rates for each feasible sample size

From each participant, a biopsy of a tissue of interest is extracted, dissociated, and assayed using one of the single-cell profiling protocols. The single-cell profile is analyzed in silico and the cells are clustered and classified into cell types (Fig. 1c). Two types of technical variations are introduced in this step. Firstly, a major source of variation is limited cell number, especially for rare cell types, which reduces the statistical power of a study. To model it, we assume that profiled cells are chosen randomly from the population, i.e., all cells in the tissue of interest, which is consistent with SCOPIT [24] and "Howmanycells".
Because the total number of cells in the tissue (population) is typically larger than that assayed in a single-cell experiment by several orders of magnitude, the number of sampled cells from a specific cell type would closely follow a binomial distribution, given its true proportion in the population (Fig. 1d). Secondly, sample preparation, sequencing, clustering, and classification also raise uncertainty, which is highly complex and may not be modeled analytically. Precise modeling would require exhaustive quantification of a specific protocol, which is not readily available. Thus, we factor such variances into the beta distributions (Fig. 1b) mentioned above, which is consistent with the empirical understanding in the field [25]. The conjugacy of the beta distribution and the binomial distribution facilitates such modeling, allowing for efficient computation. Also factored in is the correlation between paired samples, if applicable.
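The two-layer model described above (a beta distribution for the true proportion, then a binomial distribution for the cells actually assayed) can be sketched in a few lines of Python. This is an illustrative simulation, not Sensei's implementation: the helper beta_params, the method-of-moments parametrization, and the numbers (mean 0.03, standard deviation 0.015, 1,000 cells) are assumptions chosen for the example. The sketch checks numerically that the variance of the observed proportion is the biological variance plus a cell-sampling term that shrinks as more cells are assayed.

```python
import numpy as np


def beta_params(mean, sd):
    """Method-of-moments beta parameters (a, b) for a given mean and standard deviation."""
    nu = mean * (1 - mean) / sd**2 - 1  # equals a + b; must be positive
    return mean * nu, (1 - mean) * nu


rng = np.random.default_rng(0)
mu, sd, n_cells, n_samples = 0.03, 0.015, 1000, 200_000  # illustrative values

a, b = beta_params(mu, sd)
true_prop = rng.beta(a, b, size=n_samples)   # between-participant variation
counts = rng.binomial(n_cells, true_prop)    # random sampling of cells
observed = counts / n_cells                  # proportion seen in the experiment

# Law of total variance: Var(observed) = Var(p) + E[p(1 - p)] / N
expected_var = sd**2 + (mu * (1 - mu) - sd**2) / n_cells
print(observed.var(), expected_var)
```

The second term of expected_var is what an approach that ignores cell number leaves out; it matters most when the cell type is rare or few cells are assayed.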
After cell types are identified, assuming that the proportions are approximately normally distributed, the t-test, one of the most widely used statistical tests [26,27], can be applied to ascertain the between-group difference. Along with other parametric tests, the t-test is widely used in differential abundance testing to date [1,28,29]. Indeed, the observed skewness and kurtosis of cell type proportions validate the assumption of normality ("Methods" and Additional file 1: Supplementary note 1) [30] and justify the use of the t-test. The t-statistic is calculated and compared with a critical value corresponding to a significance level (also referred to as the false positive rate or type I error rate, with 0.05 and 0.01 being the typical choices) (Fig. 1e). Sensei estimates the false negative (type II error) rate by inferring the distribution of the t-statistic and calculating the probability of it failing to reach the critical value (Fig. 1f). The correlation of samples in the paired test (Fig. 1b) is also accounted for.
Sensei is implemented as a web application powered by JavaScript, and as a Python package. Required as input are the sample sizes, cell numbers, estimated cell proportions, false positive rate and the type of t-test (Fig. 1g and Additional file 1: Supplementary note 2). Hover-over help information is provided throughout the web application to help a user easily understand the purpose of the parameters and set them reasonably. The output is a table of false negative rates for various sample sizes, from which researchers can identify feasible study designs (Fig. 1h). A screenshot of the full webpage is shown in Additional file 1: Figure S1. Mathematical modeling is detailed in Methods.

Validation of Sensei
Because Sensei's analytical solution includes necessary approximations ("Methods", Eqs. 7 and 10), we performed a simulation experiment to validate that Sensei accurately estimates the sample size for ideal beta-binomial distributions. We simulated 10,000 datasets using the beta-binomial model that Sensei aims to approximate ("Methods"). We set sample sizes $M_0, M_1 = 5 \sim 12$, cell numbers per sample $N_0, N_1 = 1{,}000$, mean proportions $\mu_0 = 0.03$, $\mu_1 = 0.05$, and variances $\sigma_0 = 0.015$, $\sigma_1 = 0.01$ for control and case samples, respectively. We performed a one-sided unpaired t-test with a significance level of $\alpha = 0.05$ on each dataset and counted the number of negative results to determine the false negative rate. We then used Sensei to estimate the false negative rates with the same parameters. For comparison, we also applied to the same data the legacy t-test approach, which makes predictions assuming a normal distribution instead of the beta-binomial distribution. As shown in Fig. 2a, the estimation error of Sensei against the simulated ground truth (7.9% on average) is much smaller than that of the legacy approach (38.2%). The latter tends to be over-optimistic, because it does not account for insufficiency in cell number, which has relatively large effects on such a rare cell type.
Because real tissue data may not follow exactly the distributions assumed in the simulation, to further assess the accuracy of Sensei, we evaluated it on a breast cancer dataset, which contains 144 tumor samples and 46 juxta-tumoral samples [31]. The proportions of T cells $p_{ij}$ are available as ground truth for each sample, with an average of 56% in the tumor samples and 42% in the juxta-tumoral samples (p-value = $6.6 \times 10^{-6}$, two-sided t-test). We considered the tumor samples as the case group and the juxta-tumoral samples as the control group and assumed that a study plans to involve 12 to 20 participants per group to ascertain a change in T cell abundance. For each combination of sample sizes of both groups, we obtained the estimation from Sensei and the legacy approach and compared them with a simulated dataset generated according to the original data (Fig. 2b, Methods). A very high degree of consistency can be observed between the "Sensei" and the "Simulation" results (Fig. 2b). For 100 cells per sample, Sensei more than halved the average error of the "Legacy" approach (2.5% vs 6.6%). Because T cells are relatively abundant (Additional file 1: Figure S2), the improvement shrinks when more cells are collected (4.8% vs 5.7% for 384 cells and 3.8% vs 4.1% for 1000 cells). The improvement is expected to be larger for rare cell types. The result further validated the accuracy of Sensei in assessing immune cell abundance in breast cancer samples, which do not strictly follow the assumed distributions (Additional file 1: Figure S2).
With Sensei validated, we comprehensively examined datasets from current large-scale cancer genomic studies that have over 30 cancer samples [2]. We applied Sensei to estimate how many samples are required to detect compositional changes in over 20 cell types in a particular cancer type. Our results can be utilized as a guideline for designing preclinical studies and clinical trials in a variety of settings.

Fig. 2 Results of simulation studies. a Comparison of false negative rates (y-axis) known from simulation against those estimated by Sensei and by the legacy approach, using datasets sampled from a beta-binomial distribution. The number of samples in the case group is indicated on the x-axis and that in the control group by different colors. Markers correspond to results from different approaches. The average error is the mean absolute relative difference between the estimation and the simulation. b Comparison of false negative rates calculated by Sensei and the legacy approach with those generated by simulation on the proportions of T cells in tumor and juxta-tumoral samples in a breast cancer study
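The simulated ground truth in this experiment can be approximated with a short Monte Carlo sketch in Python. It is not the authors' simulation code. It assumes that the σ values above are standard deviations, that the beta parameters are obtained by the method of moments, and that the unpaired test is Welch's t-test; the function names are hypothetical.

```python
import numpy as np
from scipy import stats


def beta_params(mean, sd):
    """Method-of-moments beta parameters for a given mean and standard deviation."""
    nu = mean * (1 - mean) / sd**2 - 1
    return mean * nu, (1 - mean) * nu


def simulate_group(rng, mean, sd, n_samples, n_cells, n_sims):
    """Observed proportions, shape (n_sims, n_samples), under the beta-binomial model."""
    a, b = beta_params(mean, sd)
    p = rng.beta(a, b, size=(n_sims, n_samples))
    return rng.binomial(n_cells, p) / n_cells


def false_negative_rate(m0, m1, n_cells=1000, mu0=0.03, mu1=0.05,
                        sd0=0.015, sd1=0.01, alpha=0.05, n_sims=10_000, seed=0):
    """Fraction of simulated studies in which the one-sided test is not significant."""
    rng = np.random.default_rng(seed)
    control = simulate_group(rng, mu0, sd0, m0, n_cells, n_sims)
    case = simulate_group(rng, mu1, sd1, m1, n_cells, n_sims)
    # One test per simulated dataset (row); alternative: case mean > control mean.
    result = stats.ttest_ind(case, control, axis=1, equal_var=False,
                             alternative="greater")
    return float(np.mean(result.pvalue >= alpha))


for m in (5, 8, 12):
    print(m, false_negative_rate(m, m))
```

Each call plays the role of one point in Fig. 2a: an empirical false negative rate for a given pair of sample sizes, against which an analytical estimate can be compared.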

Tumor microenvironment of unpaired cancer samples
Changes in tumor clonal fractions have been widely used to track cancer evolution dynamics [32][33][34]. Equally important are changes in immune cell abundance in the TME [35]. In many studies, case and control samples are collected from different groups of patients. Thorsson et al. [2] deconvolved bulk RNA-seq data from TCGA (Additional file 1: Figure S3) using CIBERSORT and obtained the proportions of 22 immune cell types in 11,373 samples. The immune cells can be further grouped into 6 major types (T cells, B cells, NK cells, macrophages, dendritic cells, and mast cells). We obtained the sample mean, standard deviation, and confidence intervals of the proportion of each cell type in each cancer type (Methods, Additional file 1: Figures S4-S6). Based on these inputs, Sensei inferred the sample sizes for ascertaining the difference between normal tissues and primary tumors in each cancer type using a one-tailed unpaired t-test at a significance level of 0.05 with at least 80% power (Fig. 3a, Additional file 1: Figure S7a,b). Although Sensei can suggest unequal numbers of cases and controls, we assumed that sample sizes are equal for both groups without loss of generality. The result shows that a sample size of 20 in each group is adequate to ascertain the differences of many cell types in many cancer types using current single-cell technologies, including but not limited to T cells in kidney chromophobe (KICH), kidney renal clear cell carcinoma (KIRC), rectum adenocarcinoma (READ), and thyroid carcinoma (THCA), and B cells in colon adenocarcinoma (COAD), esophageal carcinoma (ESCA), KIRC, kidney renal papillary cell carcinoma (KIRP), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), and READ (Fig. 3a).
Incidentally, a CyTOF study of liver hepatocellular carcinoma (LIHC) is available, involving 12 tumor samples and 7 normal tissue samples [36]. Sensei estimated a power of 75% for identifying an increase in regulatory T cells (Tregs) using the study sample size. Indeed, the study successfully detected an increase in Tregs at a statistical significance level between 0.01 and 0.05.
Similarly, we calculated the sample size needed for studying cancer progression from primary tumors to recurrent tumors, since differences in the TME may indicate cancer metastasis and treatment resistance [37]. We used a dataset from a study assessing 13 samples of glioblastoma multiforme (GBM) and 18 of low grade glioma (LGG). Unlike in tumor versus normal studies, the difference between recurrent and primary tumors is generally more subtle (Additional file 1: Figures S4-S6), and thus requires more samples to ascertain (Fig. 3b, Additional file 1: Figure S7c,d). Our results show that, compared with the primary tumors, a change in monocyte proportion in the recurrent tumors may be detected in LGG with a modest sample size of 34 (Fig. 3b). This is relevant, as previous studies have detected a significant decrease in monocyte proportions over malignant transformation of glioma [38]. A decrease in neutrophil proportion, which is known to be negatively correlated with glioma grade [39], also requires relatively modest sample sizes to detect. For GBM, Sensei predicts that a study design with 80% power needs at least 37 samples per group for dendritic cells, and more for other cells (Fig. 3b). A recent pivotal single-cell study found that 13 primary and 3 recurrent GBM samples are likely insufficient to ascertain changes in immune cell types [40]. Consistently, Sensei predicts a power of only 33% for dendritic cells, 9% for T cells, and even less for other cell types in such a setting. It should be noted that the data for recurrent tumors are limited in TCGA. Thus, more pilot experiments may be advised for designing related studies.
Cancer heterogeneity is driven by both genetics and epidemiology. Pan-cancer studies that categorize tumors based on shared genetic and/or epidemiological features are often performed [41,42]. For example, patients with Lynch syndrome or inflammatory bowel disease often develop colorectal cancers displaying high-level microsatellite instability (MSI-H), while sporadic tumors more frequently display microsatellite stability (MSS). Molecular subtyping based on microsatellite instability is used not only in colorectal cancers, but also in other cancers such as endometrial and stomach tumors. Multiple clinical studies have shown that immune checkpoint-blockade therapy is more effective on MSI-H cancers, potentially because of a higher T cell infiltration rate compared to MSS cancers [43,44]. To extrapolate those findings to a wider variety of cancer types, it is important to have a study design that can ensure the ascertainment of immune cell abundance.

Fig. 3 a Estimated sample size for detecting a statistically significant difference between normal tissue and primary tumor using a one-sided Welch's t-test at a significance level of 0.05 with 80% power (the same below). Estimations for the unpaired test and the paired test are shown in blue and yellow, respectively. Estimations are for infinite (the legacy approach, left end of a whisker), 1,000 (left bar), 384 (right bar, may overlap with the left one), and 100 (right end of a whisker) cells. Fewer cells per sample would require more samples to ascertain an effect. The estimated sample size is for each of the two groups in a controlled study, not jointly. For a matched-pairs study, it is the same as the number of participants. Sample sizes larger than 200 are omitted. The direction of change in cell type abundance is shown by an arrow. An up arrow indicates a higher abundance in primary tumor compared with normal tissue, and vice versa. b Estimated sample size for detecting a statistically significant difference between primary tumor and recurrent tumor for low grade glioma (LGG) and glioblastoma multiforme (GBM) patients. An up arrow indicates a higher abundance in recurrent tumor compared with primary tumor, and vice versa. c Estimated sample size for detecting a statistically significant difference in each immune cell type between microsatellite instability-high (MSI-H) and microsatellite stable (MSS) tumor samples in uterine corpus endometrial carcinoma (UCEC), colon adenocarcinoma (COAD), and stomach adenocarcinoma (STAD). An up arrow indicates a higher abundance in MSI-H tumor compared with MSS tumor, and vice versa. d Estimated sample size for detecting a statistically significant difference between pre- and post-treatment samples from metastatic melanoma patients. An up arrow indicates a higher abundance in post-treatment tumor compared with pre-treatment tumor, and vice versa. Source data for generating this figure is included in Additional file 2
As an example, we selected a set of MSI-H and MSS tumor samples in TCGA produced by Hause et al. [45]. In this dataset, MSI-H tumors comprise approximately 30% of uterine corpus endometrial carcinoma (UCEC), 20% of colon adenocarcinoma (COAD), 20% of stomach adenocarcinoma (STAD), and much lower fractions of other cancer types (Additional file 1: Figure S8a). Using the cell type abundance deconvolved from the bulk RNA expression data [2] and the microsatellite instability labels obtained from genomic testing [45], we summarized the immune cell type abundance for the three cancer types (Additional file 1: Figure S8b). For a one-tailed unpaired Welch's t-test at a significance level of 0.05 with 80% power, the sample sizes estimated by Sensei are summarized in Fig. 3c and Additional file 1: Figure S7e,f. Testing for the difference in the proportion of activated memory CD4 T cells between MSI-H and MSS STAD requires the smallest sample sizes: 29, 26, and 25 samples in each group with 100, 384, and 1000 cells per sample, respectively (Additional file 1: Figure S7e).
The legacy t-test approach also estimated 25 samples. It is no coincidence that this is the same as what Sensei estimated for 1000 cells, because the legacy approach effectively assumes that an infinite number of cells is sequenced. Thus, the result suggests that a sample of 1000 cells is enough in the sense that the variance introduced by cell number is negligible compared with the within-group variance. On the other hand, having only 100 or 384 cells may compromise the statistical power. In fact, to detect the difference in NK cells in COAD, even 1000 cells result in a sample size of 59 (Fig. 3c), which is higher than the 56 obtained via the legacy approach.
Overall, there is a trade-off between the number of cells per sample and the sample size. A few other cell types, including CD8 T cells in UCEC (Additional file 1: Figure S7e) and activated NK cells in COAD (Additional file 1: Figure S7f ), also require fewer than 40 samples per group. On the other hand, many cell types would require more than 200 samples per group (not shown in the figure). Those cases either correspond to a very small difference, or associate with very large variance (Additional file 1: Figure S8b). Caution must be exercised when designing experiments under those conditions.
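The trade-off between cells per sample and sample size can be made explicit with a textbook normal-approximation formula. This is a rough sketch and not Sensei's formula (which is given in Methods): it borrows the symbols M, N, μ and σ from the validation section, treats σ as a within-group standard deviation, assumes equal group sizes M, and replaces the t-distribution with standard normal quantiles z.

```latex
% Approximate samples per group for a one-sided test at level \alpha with power 1-\beta
M \;\approx\; \frac{\left(z_{1-\alpha} + z_{1-\beta}\right)^{2}}{(\mu_1 - \mu_0)^{2}}
\left[\, \sigma_0^{2} + \sigma_1^{2}
 + \frac{\mu_0(1-\mu_0) - \sigma_0^{2}}{N_0}
 + \frac{\mu_1(1-\mu_1) - \sigma_1^{2}}{N_1} \right]
```

The first two terms in the bracket are the biological within-group variances; the last two are the binomial cell-sampling variances. As $N_0, N_1 \to \infty$ the last two terms vanish and the expression reduces to the legacy estimate, while small cell numbers or a small difference $\mu_1 - \mu_0$ inflate the required M.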

Tumor microenvironment of paired cancer samples
Paired studies involve the use of autologous samples from the same patients and can reveal more pathologically relevant changes in cell type abundance. They are an ideal way of assessing differences in the TME not only between primary and metastatic/recurrent tumors, but also between primary tumors and adjacent normal samples [46,47]. Cell type abundances are available for 717 patients from matched normal and primary tumor samples, and 36 patients from matched primary and recurrent tumor samples (Additional file 1: Figure S9a,b). For each cell type, we estimated correlations of the cell type abundances between paired samples in each cancer type (Additional file 1: Figure S9c,d). We then calculated the sample sizes under the paired test settings using Sensei.
The result for paired normal and primary tumor samples is largely consistent with that of the unpaired test (Fig. 3a and Additional file 1: Figure S7a), while considerably smaller sample sizes are predicted for some cases. A salient example is dendritic cells in liver hepatocellular carcinoma (LIHC), for which as few as 35 samples are needed, compared with 53 in the unpaired test. Similarly, the difference in naïve B cells in BRCA can be revealed by 33 samples, instead of 47 in the unpaired test (Additional file 1: Figure S7a). However, for some cell types, larger sample sizes are required because there are negative correlations between paired samples in the data (Additional file 1: Figure S9c,d). Those may be technical artifacts introduced by experimental and analytical variances, as we found no statistically significant negative correlations (see 95% confidence intervals for the correlations shown in Additional file 1: Figure S9c,d and explained in "Methods"). In practice, the expected correlation can always be adjusted based on accurate prior knowledge.
Similarly, we obtained results for paired primary and recurrent tumor samples for low grade glioma (LGG) and glioblastoma multiforme (GBM) (Fig. 3b and Additional file 1: Figure S7c,d), based on prior cell type abundances estimated from 14 LGG patients and 6 GBM patients. The result is largely consistent between the paired test and the unpaired test, with some salient differences. For example, 30 samples per group are needed to ascertain the difference in activated NK cells in LGG, compared with 42 for the unpaired test (Additional file 1: Figure S7d). For GBM, the sample size needed for follicular helper T cells decreased to 32 from 116 (Additional file 1: Figure S7c). It is important to be able to examine these cell types at modest sample sizes, as they have reportedly been linked to malignant transformation of glioma and associated with prognosis [38].
Paired tests are often utilized to assess the safety and efficacy of a treatment. We examined a dataset containing 48 tumor samples from 32 metastatic melanoma patients treated with anti-PD-1 therapy, anti-CTLA4 therapy, and their combinations, among which paired pre- and post-treatment samples are available for 11 of the patients [48]. Across the immune cell types, exhausted lymphocytes increase the most and memory T cells decrease the most in abundance after treatment (Additional file 1: Figure S10a). We calculated the pre- and post-treatment correlations for each immune cell type. We found a strong correlation (0.79) for exhausted lymphocytes yet a weak correlation (0.06) for memory T cells (Additional file 1: Figure S10b). Based on these input parameters, Sensei infers that at least 35 and 124 samples are needed to ascertain increases in exhausted lymphocytes under paired and unpaired designs, respectively (Fig. 3d). On the other hand, ascertaining decreases in memory T cells requires 33 and 34 samples, respectively. This exemplifies that researchers may benefit substantially from a matched-pairs study design when there is a clear positive correlation for a cell type of interest between paired samples (Additional file 1: Figure S10b), which can often be the case, as paired samples are most likely derived from the same genetic background and under similar physiological conditions.
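Why correlation helps can be seen from the variance of a within-pair difference. The following is a generic worked equation under simplifying assumptions (cell-sampling noise ignored, $\sigma_0$ and $\sigma_1$ denoting the pre- and post-treatment standard deviations, $\rho$ their correlation); it is not taken from Sensei's Methods.

```latex
% Variance of the within-pair difference d = p_1 - p_0
\operatorname{Var}(d) \;=\; \sigma_0^{2} + \sigma_1^{2} - 2\rho\,\sigma_0\sigma_1 ,
\qquad\text{so for } \sigma_0 = \sigma_1 = \sigma:\quad
\operatorname{Var}(d) = 2\sigma^{2}\,(1-\rho).
```

An unpaired comparison corresponds to $\rho = 0$, so with equal variances the variance entering the test shrinks by a factor of roughly $1-\rho$ under pairing: $1 - 0.79 = 0.21$ for a correlation of 0.79, but only $1 - 0.06 = 0.94$ for a correlation of 0.06. The same expression shows that a negative $\rho$ increases the variance, which is why negative correlations lead to larger paired sample sizes. The exact sample sizes reported by Sensei also depend on cell number, unequal variances and the t-distribution, so they do not follow this ratio exactly.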

Peripheral blood mononuclear cells of Kawasaki disease patients
Kawasaki disease (KD) is a rare condition of blood vessel inflammation. Although it is largely self-limiting, coronary artery aneurysms as a sequela were reported in more than 20% of untreated cases [49]. While the etiology of KD is not yet clear, excessive immune response, manifested as an abrupt change in cell type abundance, is believed to be an important factor [50]. To study the progress and treatment of KD, Wang et al. performed single-cell RNA sequencing on peripheral blood mononuclear cells (PBMCs) from six patients and three healthy controls, as well as flow cytometry on 16 patients and 20 healthy controls [29]. Patient samples were collected before and after high-dose intravenous immunoglobulin (IVIG) treatment, a standard treatment for KD. The authors mentioned that some comparisons did not return significant changes because of the relatively small sample size.
Here, we use Sensei to perform power analysis of the choice of sample sizes in this study. We use the scRNA-seq data, which have smaller sample sizes, to estimate the mean and standard deviation of cell type abundances. The abovementioned flow cytometry sample sizes are used in the estimation. The number of cells per sample is set at 1000, an estimation in the original publication, and the significance level is set at 0.05. Paired test is chosen for the comparison of before-and after-treatment samples and unpaired test is chosen otherwise. In the results, all comparisons with > 90% estimated power resulted in significant p-values (< 0.05), indicating that the estimation is consistent with the actual experimental outcome (Table 1).
It is worth noting that because pilot studies have small sample sizes, they may not fully represent the distribution of the data. For example, although a low power of 18.6% is estimated for CD8 T cells when comparing after- and before-treatment samples, a significant p-value was returned. This discrepancy is likely the result of an outlier (patient 1 as shown in Fig. 2 in [49]), the removal of which results in a more favorable estimated power of 58.2%. Thus, it is equally important to leverage biological insights to scrutinize the data when setting the parameters.

Designing a precancer clinical trial for cancer prevention: colorectal mucosa of Lynch syndrome patients
Effective eradication of cancer relies not only on treatment but also on prevention [51]. The AACR White Paper for cancer prevention [52] calls for the acquisition of more longitudinal data from pre-cancer samples to facilitate the modeling of progression and regression of pre-cancerous lesions. Sensei can be of great use in designing such studies. We used Sensei to design a randomized, placebo-controlled clinical trial involving patients diagnosed with Lynch syndrome, as a continuation of a pilot study [53]. The objective of the study is to evaluate whether the experimental intervention leads to recruitment and/or activation of immune cells in the colorectal mucosa. Participants will be randomized to receive placebo or the experimental drug for a total of 12 months. After the treatment period, colorectal tissue samples will be collected. The percentage of immune cells within the mucosa will be measured by scRNA-seq to determine whether the mean percentage of immune cells differs significantly between the experimental treatment arm and the placebo arm. Based on in silico deconvolution of bulk RNA-seq data from untreated colorectal mucosa, we estimated that the immune cell population is approximately 18.6% at baseline, with a standard deviation of around 5%. We hypothesized that the population will increase by 10 percentage points to 28.6% in the experimental arm and that the standard deviation will remain the same. Based on this information, Sensei estimated that, for a one-sided t-test, 6 samples in each group are needed to yield a false negative rate β = 0.062 ≤ 0.1 if 1,000 cells are collected from each specimen (Table 2). Furthermore, if as many as 5,000 cells are collected from each specimen, then 5 samples in each group will be enough to achieve β = 0.1, i.e., 90% power. It should be noted that this experiment compares pre-cancerous tissues in a placebo-controlled study, which is different from comparing tumor with normal tissues in TCGA. Thus, the estimated sample size differs from that of COAD in Fig. 3a. Sensei can be broadly utilized in clinical trial design, as estimates of the prior parameters are often available from preclinical or pilot studies.

Discussion
Changes in cell type composition underlie cancer evolution and metastasis. Ascertainment of such changes is critical for understanding the coevolution of a tumor and its microenvironment during carcinogenesis and in response to treatments. Single-cell assays have become viable ways to measure cell type proportions in each biological sample. What has been lacking, however, is a reliable, comprehensive, easy-to-use tool that estimates the number of samples required for ascertaining changes in cell type proportions between two groups of participants. Unlike tools that are designed to estimate the number of cells for ascertaining cell type proportions in a single sample, Sensei is the first tool, to the best of our knowledge, designed to estimate the number of samples for ascertaining changes in cell type proportions while accounting for the limited capacities of current single-cell platforms. Although necessary approximations are made, the estimation is accurate, as indicated by its consistency with the results of computer simulation and of patient data from real experiments. Results from previous single-cell studies are also broadly consistent with Sensei's predictions [40]. Sensei runs in seconds on the front end without requiring any connection to backend servers, providing versatile, secure utilities for researchers with limited resources.
Sensei's estimation is based on several assumptions about existing single-cell profiling protocols. First, the profiled cells are assumed to be chosen at random from the tissue of interest, which leads to the assumption of a binomial distribution. An experimental validation of this assumption would require a large number of technical replicates profiled from the same biological sample, which is neither currently available nor practically viable. Notably, the same assumption is adopted by SCOPIT [24] and "Howmanycells" and appears to be widely accepted. The beta distribution conveniently models cell type proportions among participants and greatly facilitates efficient computation via beta-binomial conjugacy. Further, the beta distribution is uniquely determined by a mean and a standard deviation, which are widely accessible from preclinical studies. Like the beta distribution, its multidimensional extension, the Dirichlet distribution, has been used in similar contexts [24]. More realistic modeling is possible should more prior knowledge about biological variance, technical noise, and experimental biases become available, although an analytical solution may not exist. The power estimation would then be based on sampling, which requires substantially more computational resources. Such knowledge may not become available until experimental protocols are standardized and large single-cell atlases are completed [52,54,55].
In rare cases where the assumptions are violated, researchers should be able to observe a large skew in the distribution during data analysis. In those cases, new single-cell-aware power estimation methods based on the non-parametric Wilcoxon rank-sum test might be more advisable [27,30]. Sensei also assumes that the ascertainment bias within individual studies is consistent and well controlled across experiments, i.e., that it applies equally to all study samples, and that there are no significant batch effects or that batch effects have been alleviated by other systematic approaches. If severe batch effects are expected across samples, stratified sampling, stratified tests [56], and corresponding power estimation methods [57] should be used.
It is worth noting that single-cell sequencing technologies may be biased against certain cell types, such as those that are oversized or hard to digest. In addition, certain contexts, such as liquid biopsy for solid tumors, may carry larger biases. Although these factors do not change the direction of the change in cell abundance, the observed mean and variance of abundance can differ from what was expected. Thus, if any of these factors apply, it is advisable either to perform a pilot single-cell sequencing study to determine the discrepancy or to adjust the model parameters empirically in order to obtain a more reliable estimation. For cells that are hard to classify precisely, such as subclones in a tumor and cell states in a continuous developmental process, it is advisable to enlarge the expected standard deviation to obtain a more conservative estimate of the power (and, in turn, a larger sample size).
The effect of the total number of cells in each sample, $N_i$, on the false negative rate is generally minimal when the number of cells is greater than 1,000. Only for rare cell types (< 5% proportion) does further increasing $N_i$ become necessary to ensure statistical power. Our model assumes that $N_i$ is the same for all samples in group $i$, which is a reasonable simplification because the number of cells generated by an assay is usually consistent in a systematic study and small differences in $N_i$ have little effect on the results. It should also be noted that the standard deviation $\sigma_i$, an input of Sensei, does not explicitly separate biological variance among participants from technical variance introduced by assays. These variances often coexist and can hardly be separated cleanly. Thus, it is pragmatic to use the total variance learned from existing or preliminary studies. Sensei's estimation is based on Welch's version of the t-test, which handles two groups with different variances in addition to the case covered by the standard Student's t-test. Overall, our current model allows for a closed-form representation of the statistical power, which is essential to a lightweight web-based application providing a fast and sufficiently accurate estimation of sample sizes.
For convenience of presentation, we showed results based on an equal number of samples in the case and control groups, even for the unpaired test. This is not a limitation of Sensei. In practice, the number of samples is allowed to differ between the two groups for unpaired studies. It should be noted that decreasing the number of normal tissue samples usually has a less pronounced effect on the statistical power. Generally, the group with less variance requires fewer samples. Researchers may use our online application to choose the best combination of sample sizes for the two groups.
Sensei implements additional variants of the t-test, including the smallest effect size of interest test and the equivalence test ("Methods"), to support different kinds of studies. We have shown that the t-test is appropriate for most cell types, based on the TCGA data. We have also shown that the correlations of cell type proportions between paired samples are positive for many cell types, which empowers the paired test. We also provide a guideline for setting the parameters, including the mean, variance, and correlation, in Sensei. We expect that Sensei, with the rich information we summarized from various datasets including normal/tumor, primary/metastasis/recurrent tumor, and pre-/post-treatment data, will meet the demand of many projects that are being planned, such as those in the Human Tumor Atlas Network [58], the Pre-Cancer Atlas [52,55], and clinical trials. Similar single-cell studies are on the rise. For example, even for colorectal carcinoma, where a relatively large cohort of data has been collected, more samples of colorectal adenoma are still needed to study the recruitment of immune cells throughout the lesion and to find interventions that intercept premalignancy and prevent cancer [51]. In turn, data collected from these projects will inform Sensei and allow it to provide more realistic estimates.

Conclusions
This study reports a user-friendly web application for estimating sample size and statistical power in studies that apply single-cell profiling technologies to compare cell composition across samples. Both the number of participants and the number of cells per sample are taken into consideration. With an emphasis on cancer evolution, our results provide a guideline for designing studies to ascertain changes in cell type abundance among normal/tumor, primary/metastasis/recurrent tumor, and pre-/post-treatment conditions. We expect that Sensei will have applications in different single-cell studies involving differential abundance analysis. The web application can be accessed at https://kchen-lab.github.io/sensei/table_beta.html [59].

Beta-binomial modeling of Sensei
We assume that the study design includes $M_0$ and $M_1$ participants in the control and case groups, respectively. For each group, $N_i$ ($i = 0, 1$) single cells are collected in each sample. The mean $\mu_i$ and the standard deviation $\sigma_i$ (as a result of biological variation) describe the proportion of the cell type of interest in each group. The significance level (false positive rate, normally 0.05 or 0.01) $\alpha$ should be assigned based on the expectation of the study, to calculate the false negative rate $\beta$ or, equivalently, the statistical power $(1 - \beta)$. The input parameters required to execute our method are summarized in Table 3.
We assume that, in the tissue to be studied, the true proportion of the cell type of interest is $p_i$. For the $j$th participant in group $i$, we denote by $A_{ij}$ the total number of such cells, which is a random variable with the following conditional distribution:
$$A_{ij} \mid p_i \sim \mathrm{Binomial}(N_i, p_i). \tag{1}$$
Because $p_i$ is largely unknown in real cases, we model $p_i$ using the conjugate prior of the binomial distribution:
$$p_i \sim \mathrm{Beta}(a_i, b_i). \tag{2}$$
Therefore, the cell number $A_{ij}$ has the beta-binomial distribution:
$$A_{ij} \sim \mathrm{BetaBinomial}(N_i, a_i, b_i). \tag{3}$$
It is worth mentioning that the beta-binomial distribution has been applied to modeling in compositional analysis [25,60]. It is also a simplified version of the Dirichlet-multinomial distribution used in sample size calculation [24,61]. The parameters $a_i$ and $b_i$ can be reparametrized from the user-defined mean $\mu_i$ and standard deviation $\sigma_i$. Formally,
$$a_i = \mu_i\left(\frac{\mu_i(1-\mu_i)}{\sigma_i^2} - 1\right), \qquad b_i = (1-\mu_i)\left(\frac{\mu_i(1-\mu_i)}{\sigma_i^2} - 1\right). \tag{4}$$
Practically, we require the resulting $a_i$ and $b_i$ to both be greater than 1 so that the beta distribution is unimodal. Using the properties of the beta-binomial distribution, we get
$$E[A_{ij}] = \frac{N_i a_i}{a_i+b_i}, \qquad V[A_{ij}] = \frac{N_i a_i b_i (a_i+b_i+N_i)}{(a_i+b_i)^2 (a_i+b_i+1)}. \tag{5}$$
The corresponding cell type proportion is defined as $p_{ij} = A_{ij}/N_i$, which follows a scaled beta-binomial distribution. Thus,
$$E[p_{ij}] = \frac{a_i}{a_i+b_i}, \qquad V[p_{ij}] = \frac{a_i b_i (a_i+b_i+N_i)}{N_i (a_i+b_i)^2 (a_i+b_i+1)}. \tag{6}$$
We now assume that the scaled beta-binomial distribution can be approximated by a normal distribution,
$$p_{ij} \sim \mathcal{N}\left(E[p_{ij}],\, V[p_{ij}]\right). \tag{7}$$
The approximation is justified by the fact that the L1 distance between the scaled beta-binomial distribution and Eq. (7) is sufficiently small, especially for large $N$ and small $\sigma$ (Additional file 1: Figure S11a). We experimented on a few examples. For $\mu = 0.3$, $\sigma = 0.2$, the underlying beta distribution is asymmetric, with its mass concentrated on the left, and deviates from a normal distribution. This results in a slightly imprecise, but still largely acceptable, normal approximation (Additional file 1: Figure S11b). For $\mu = 0.5$, $\sigma = 0.1$, the beta distribution itself is already close to a normal distribution, and the generated $p_{ij}$ can be approximated very well by a normal distribution (Additional file 1: Figure S11b).
The normality can also be illustrated by the skewness and excess kurtosis of beta-binomial distributions (Additional file 1: Figure S12). In this paragraph, $N$ denotes the number of samples and $n$ the number of cells per sample. When $N$ samples are collected, the skewness of their mean is further divided by $\sqrt{N}$, and the excess kurtosis is divided by $N$. Thus, for an abundance of 0.1% to 50% and a coefficient of variation of 0.01 to 0.3, the skewness and excess kurtosis always indicate sufficient normality (i.e., both smaller than 0.5) for $N \ge 3$ with a reasonable number of cells per sample (e.g., $n \ge 300$ for 1% abundance). We further benchmarked Sensei against simulations with the t-test and the Wilcoxon rank-sum test and observed little difference among all three (Additional file 1: Figure S13). The shrinkage of skewness and kurtosis is guaranteed when the central limit theorem (CLT) holds, i.e., when the samples are independent. Dependency may arise in scenarios such as multicenter trials with strong batch effects. Users should take caution in such cases and apply domain knowledge to judge the data.
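As an illustration of the reparametrization described above, the following minimal Python sketch converts a user-supplied mean and standard deviation into beta shape parameters by moment matching, checks the unimodality requirement, and computes the mean and variance of the observed proportion under the beta-binomial model. The function names (`beta_params`, `is_unimodal`, `observed_proportion_moments`) are hypothetical and are not taken from the Sensei code base; the formulas are the standard beta and beta-binomial moments.

```python
def beta_params(mu, sigma):
    """Beta shape parameters (a, b) whose mean is mu and standard deviation is sigma."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    if sigma <= 0.0 or sigma ** 2 >= mu * (1.0 - mu):
        raise ValueError("sigma must satisfy 0 < sigma^2 < mu * (1 - mu)")
    common = mu * (1.0 - mu) / sigma ** 2 - 1.0  # this quantity equals a + b
    return mu * common, (1.0 - mu) * common


def is_unimodal(mu, sigma):
    """True when both shape parameters exceed 1, as required in the text."""
    a, b = beta_params(mu, sigma)
    return a > 1.0 and b > 1.0


def observed_proportion_moments(mu, sigma, n_cells):
    """Mean and variance of A / n_cells when A is beta-binomial(n_cells, a, b)."""
    a, b = beta_params(mu, sigma)
    s = a + b
    mean = a / s
    var = a * b * (s + n_cells) / (n_cells * s ** 2 * (s + 1.0))
    return mean, var
```

The variance returned by `observed_proportion_moments` equals the between-participant variance multiplied by `(a + b + n_cells) / n_cells`, so it shrinks toward `sigma ** 2` as more cells are profiled per sample.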
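For a two-sided test, the null hypothesis is formulated as
$$H_0:\; E[p_{0j}] = E[p_{1j}], \tag{8}$$
where $p_{ij}$ denotes the cell proportion in sample $j$ from group $i$. For a one-sided test, the "$=$" is substituted by "$<$" or "$>$". Thus, for a t-test allowing different variances in the two groups [62], the t-value in Welch's t-test follows a noncentral t-distribution, i.e.,
$$t = \frac{\hat{\mu}_1 - \hat{\mu}_0}{\sqrt{\hat{\sigma}_1^2 / M_1 + \hat{\sigma}_0^2 / M_0}}, \tag{9}$$
where $\hat{\mu}_i$ and $\hat{\sigma}_i$ are the sample mean and sample standard deviation of $p_{ij}$ in group $i$, which are random variables. The distribution of $t$ can be approximated by
$$t \sim t_\nu + \frac{E[p_1] - E[p_0]}{\sqrt{V[p_1]/M_1 + V[p_0]/M_0}}, \tag{10}$$
where the second term is a constant. The degrees of freedom, $\nu$, are calculated as
$$\nu = \frac{\left(V[p_1]/M_1 + V[p_0]/M_0\right)^2}{\dfrac{(V[p_1]/M_1)^2}{M_1 - 1} + \dfrac{(V[p_0]/M_0)^2}{M_0 - 1}}, \tag{11}$$
which reduces to $(M_0 + M_1 - 2)$, the same as in Student's t-test, when $V[p_0] = V[p_1]$ and $M_0 = M_1$ [62]. Thus, the false negative rate can be calculated as
$$\beta = T_\nu\left(t^* - \frac{E[p_1] - E[p_0]}{\sqrt{V[p_1]/M_1 + V[p_0]/M_0}}\right), \tag{12}$$
where $T_\nu$ is the CDF of Student's t-distribution, and $t^* = t_{1-\alpha/2,\nu}$, as $2P[t \ge t^*] < \alpha$, for a two-sided test [63], or $t^* = t_{1-\alpha,\nu}$ for a one-sided test.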
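The unpaired calculation can be sketched in a few lines of Python. This is a hedged illustration and not the Sensei implementation: it assumes that the false negative rate is obtained by evaluating the Student's t CDF at the critical value minus the constant shift, with Welch-Satterthwaite degrees of freedom, and it may therefore not reproduce the numbers reported by the web application. To stay dependency-free, the t CDF is computed by Simpson's rule and the quantile by bisection; a statistics library would normally be used instead. All function names are hypothetical.

```python
import math


def t_pdf(x, nu):
    """Density of Student's t-distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_cdf(x, nu, steps=2000):
    """Student's t CDF via Simpson's rule on [0, |x|]; steps must be even."""
    if x == 0.0:
        return 0.5
    h = abs(x) / steps
    total = t_pdf(0.0, nu) + t_pdf(abs(x), nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    area = total * h / 3.0
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(q, nu):
    """Quantile of Student's t-distribution found by bisection on the CDF."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def proportion_variance(mu, sigma, n_cells):
    """Variance of the observed proportion under the beta-binomial model."""
    s = mu * (1 - mu) / sigma ** 2 - 1  # a + b of the matching beta distribution
    return sigma ** 2 * (s + n_cells) / n_cells


def false_negative_rate(mu0, sigma0, m0, n0, mu1, sigma1, m1, n1,
                        alpha=0.05, two_sided=False):
    """Approximate beta for an unpaired Welch t-test (assumed shifted-t form)."""
    if m0 < 2 or m1 < 2:
        raise ValueError("each group needs at least two samples")
    v0 = proportion_variance(mu0, sigma0, n0) / m0
    v1 = proportion_variance(mu1, sigma1, n1) / m1
    # The absolute difference is used, i.e. the test is assumed to point in
    # the direction of the true change.
    shift = abs(mu1 - mu0) / math.sqrt(v0 + v1)
    nu = (v0 + v1) ** 2 / (v0 ** 2 / (m0 - 1) + v1 ** 2 / (m1 - 1))
    q = 1 - alpha / 2 if two_sided else 1 - alpha
    return t_cdf(t_quantile(q, nu) - shift, nu)
```

A call such as `false_negative_rate(0.10, 0.04, 8, 1000, 0.15, 0.04, 8, 1000)` (illustrative inputs, not taken from the study) returns the approximate false negative rate for eight samples per group; scanning over `m0` and `m1` until the value drops below a target such as 0.1 gives a sample size estimate.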

Paired test
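Paired samples are usually collected from normal and malignant tissues, or from primary and recurrent/metastatic tumors. Longitudinal data from one patient, such as pre-treatment and post-treatment samples, also form paired samples. In such cases, a paired test can exploit the correlation between paired samples to improve the statistical power. Sensei provides functionality to help design studies with paired samples. Compared with the unpaired test, we naturally require the sample sizes $M_0$ and $M_1$ to be the same (denoted as $M$) and require one more parameter, $\rho = \mathrm{corr}(p_0, p_1)$, the correlation of the true proportions of cells between the two conditions in the paired study. Note that the cell numbers $A_0$ and $A_1$ depend solely on $p_0$ and $p_1$, respectively. Thus, they are conditionally independent given $p_0$ and $p_1$. Consequently, we can use the law of total covariance to derive
$$\mathrm{cov}(A_0, A_1) = E\left[\mathrm{cov}(A_0, A_1 \mid p_0, p_1)\right] + \mathrm{cov}\left(E[A_0 \mid p_0],\, E[A_1 \mid p_1]\right) = N_0 N_1\, \mathrm{cov}(p_0, p_1). \tag{13}$$
Thus, under the normal approximation, the cell numbers and the observed proportions $p_{0j} = A_{0j}/N_0$ and $p_{1j} = A_{1j}/N_1$ are distributed as
$$\begin{pmatrix} A_0 \\ A_1 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} E[A_0] \\ E[A_1] \end{pmatrix}, \begin{pmatrix} V[A_0] & N_0 N_1\, \mathrm{cov}(p_0, p_1) \\ N_0 N_1\, \mathrm{cov}(p_0, p_1) & V[A_1] \end{pmatrix}\right), \qquad \begin{pmatrix} p_{0j} \\ p_{1j} \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} E[p_0] \\ E[p_1] \end{pmatrix}, \begin{pmatrix} V[p_0] & \mathrm{cov}(p_0, p_1) \\ \mathrm{cov}(p_0, p_1) & V[p_1] \end{pmatrix}\right),$$
where $\mathrm{cov}(p_0, p_1) = \rho \sqrt{\dfrac{a_0 b_0}{(a_0+b_0)^2 (a_0+b_0+1)} \cdot \dfrac{a_1 b_1}{(a_1+b_1)^2 (a_1+b_1+1)}}$. $E[\cdot]$ and $V[\cdot]$ remain the same as those in the unpaired test. Note that the correlation of the observed proportions is $\sqrt{\dfrac{N_0 N_1}{(a_0+b_0+N_0)(a_1+b_1+N_1)}}\, \rho$, which approaches $\rho$ when the numbers of cells, $N_0$ and $N_1$, are large. The difference between a pair of samples is
$$\Delta p_j = p_{1j} - p_{0j} \sim \mathcal{N}\left(E[p_1] - E[p_0],\; V[p_0] + V[p_1] - 2\,\mathrm{cov}(p_0, p_1)\right).$$
Thus, the paired t-statistic can be calculated as
$$t = \frac{\hat{\mu}_\Delta}{\hat{\sigma}_\Delta / \sqrt{M}},$$
where $\hat{\mu}_\Delta$ and $\hat{\sigma}_\Delta$ are the sample mean and sample standard deviation of $\Delta p_j$. Thus, $t$ satisfies
$$t \sim t_\nu + \frac{E[p_1] - E[p_0]}{\sqrt{\left(V[p_0] + V[p_1] - 2\,\mathrm{cov}(p_0, p_1)\right)/M}}.$$
It can be observed that the constant shift in the t-statistic is the same as in the unpaired test when the covariance is zero, and even smaller should the covariance be negative. In other words, the paired test needs a positive correlation to gain statistical power. Also note that the paired t-test does not assume equal variances. Finally, the false negative rate is
$$\beta = T_\nu\left(t^* - \frac{E[p_1] - E[p_0]}{\sqrt{\left(V[p_0] + V[p_1] - 2\,\mathrm{cov}(p_0, p_1)\right)/M}}\right),$$
where $t^* = t_{1-\alpha/2,\nu}$ for a two-sided test, or $t^* = t_{1-\alpha,\nu}$ for a one-sided test, and $\nu = M - 1$.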
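To make the role of the correlation concrete, the sketch below computes two quantities under the beta-binomial model: the correlation of the observed proportions, which is attenuated relative to the correlation of the true proportions when few cells are profiled, and the constant shift of the paired t-statistic. It assumes that the covariance of the observed proportions equals the covariance of the true proportions (law of total covariance) and that the variance of a paired difference is the sum of the two variances minus twice that covariance. The function names and the input values in the usage note are hypothetical.

```python
import math


def beta_sum(mu, sigma):
    """a + b of the beta distribution with mean mu and standard deviation sigma."""
    return mu * (1 - mu) / sigma ** 2 - 1


def observed_variance(mu, sigma, n_cells):
    """Variance of the observed proportion under the beta-binomial model."""
    return sigma ** 2 * (beta_sum(mu, sigma) + n_cells) / n_cells


def observed_correlation(rho, mu0, sigma0, n0, mu1, sigma1, n1):
    """Correlation of observed proportions given the true-proportion correlation rho."""
    s0, s1 = beta_sum(mu0, sigma0), beta_sum(mu1, sigma1)
    return rho * math.sqrt(n0 * n1 / ((s0 + n0) * (s1 + n1)))


def paired_shift(rho, mu0, sigma0, n0, mu1, sigma1, n1, m):
    """Constant shift of the paired t-statistic for m matched pairs."""
    cov = rho * sigma0 * sigma1  # covariance of the true proportions
    var_diff = (observed_variance(mu0, sigma0, n0)
                + observed_variance(mu1, sigma1, n1) - 2 * cov)
    return (mu1 - mu0) / math.sqrt(var_diff / m)
```

For example, comparing `paired_shift(0.79, 0.2, 0.08, 1000, 0.3, 0.1, 1000, 10)` with the same call at `rho = 0.06` shows how a strong positive correlation enlarges the shift, and hence the power, for a fixed number of pairs.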

Legacy sample size estimation
Legacy sample size estimation refers to the sample size estimated using the mean, variance, and correlation without the beta-binomial modeling in Eqs. (5) and (13). Consequently, the effect of the number of cells is not accounted for; the estimation effectively assumes an infinite number of cells.

Smallest effect size of interest and two one-sided t-test for equivalence
Being scientifically significant is usually different from being statistically significant. For example, when enough samples are collected, even a 0.01% change in the proportion of a cell type can be statistically significant. However, the difference may be too small to induce any actual effect, and thus is rarely considered biologically interesting (i.e., not scientifically significant). The smallest effect size of interest (SESOI) is a way to build a threshold of scientific significance into a statistical test [64]. Instead of performing a t-test between the experimental group and the control group directly, the test first shifts the control group by the SESOI, the level considered biologically interesting, by adding or subtracting a constant. The SESOI can also be used in the opposite direction, to conclude with statistical significance that the change in cell type abundance does not exceed the SESOI. We provide sample size estimation for the t-test with SESOI in Sensei.
If two t-tests with SESOI find that the difference lies, with statistical significance, within a range that is considered biologically negligible, the proportion can be claimed to be effectively unchanged. This approach is formally called the two one-sided t-tests (TOST) procedure for equivalence [64]. Sensei can also estimate the sample size for TOST.
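For concreteness, the hypotheses behind these two variants can be written as follows. This is the generic SESOI/TOST formulation rather than a transcription of Sensei's implementation, and the symbol $\Delta > 0$ for the smallest effect size of interest is introduced here only for illustration.

```latex
% Delta > 0: smallest effect size of interest (symbol introduced for illustration).
% One-sided test with SESOI: is the increase larger than Delta?
H_0:\; E[p_1] - E[p_0] \le \Delta
\quad \text{vs.} \quad
H_1:\; E[p_1] - E[p_0] > \Delta

% TOST for equivalence: both one-sided null hypotheses must be rejected
% in order to conclude that |E[p_1] - E[p_0]| < Delta.
H_0^{-}:\; E[p_1] - E[p_0] \le -\Delta,
\qquad
H_0^{+}:\; E[p_1] - E[p_0] \ge \Delta
```

Under this formulation, shifting the control group by $\Delta$ replaces the difference $E[p_1] - E[p_0]$ in the power calculation with $E[p_1] - E[p_0] - \Delta$, so a larger SESOI requires more samples to reach the same power.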

Mean, variance, correlation, and their confidence intervals
The correlation and its confidence interval are obtained by standard methods [65]: for cell type proportions in matched pairs $\{(p_{0i}, p_{1i})\}$, $i = 1, \ldots, n$, the sample correlation coefficient and its $(1 - \alpha)$ confidence limits are computed following [65], with $\nu = n - 1$. The sample mean $\bar{p}$, the sample standard deviation $s$, and their 95% confidence intervals $[p_L, p_U]$ and $[s_L, s_U]$ are obtained by standard methods for the sample mean and sample standard deviation of a group $\{p_i\}$, $i = 1, \ldots, n$. Sensei may use $\bar{p}$ and $s$ as input directly because they are the maximum likelihood estimates of parameters of a beta distribution. The confidence intervals may help evaluate the reliability of the prior knowledge. Note that the confidence limits may fall outside $[0, 1]$ in some cases, in which case we truncate them to 0 or 1. As a side note, a complementary log-log transform may be used to confine the limits, but it also skews the values and complicates interpretation. The bootstrap may also be used to construct the confidence interval. For unmatched pairs, see Additional file 1: Supplementary note 3.

Simulation study based on T cell abundance in breast cancer data
The breast cancer dataset contains 144 tumor samples and 46 juxta-tumoral samples [31]. The proportions of T cells were available as ground truth for each sample, with an average of 56% in the tumor samples and 42% in the juxta-tumoral samples. We considered the tumor samples as the experimental group and the juxta-tumoral samples as the control group. Because the proportions of T cells are significantly different (p-value = $6.6 \times 10^{-6}$, two-sided t-test) between the two groups, we assume that a true difference exists. We use the mean and standard deviation calculated from these samples as the input of Sensei. To validate Sensei's accuracy, we randomly drew $M_0$ and $M_1$ samples from the juxta-tumoral and tumor samples, respectively. If we were to perform single-cell assays on these samples, we would observe $A_{ij}$ T cells in each sample, according to a binomial distribution parameterized by $N_i$ and $p_{ij}$ ($i = 0, 1$). The binomial distribution is a reasonable assumption since a tissue sample often contains millions of cells, which is several orders of magnitude higher than $N_i$. We then perform a one-tailed unpaired t-test between the set $\{A_{0j}\}$ and the set $\{A_{1j}\}$ at $\alpha = 0.05$, and record a true positive when the test is positive and a false negative otherwise. We estimate the false negative rate by repeating the above process 1,000 times for each combination of $M_0$ and $M_1$.
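The simulation loop can be sketched as follows. Because the per-sample ground-truth proportions are not reproduced here, this sketch departs from the procedure above in labeled ways: true proportions are drawn from beta distributions with the reported group means (0.42 and 0.56) and a hypothetical standard deviation of 0.15, rather than resampled from the real samples; the test is applied to observed proportions, which is equivalent to testing counts when every sample has the same number of cells; and the design is fixed at six samples per group so that a hard-coded critical value $t_{0.95,10} \approx 1.812$ can be used. All names and default values are assumptions.

```python
import math
import random

M = 6           # samples per group (fixed so that the degrees of freedom are 10)
T_CRIT = 1.812  # t_{0.95, 10}: one-sided critical value at alpha = 0.05


def beta_params(mu, sigma):
    """Beta shape parameters matching a given mean and standard deviation."""
    common = mu * (1.0 - mu) / sigma ** 2 - 1.0
    return mu * common, (1.0 - mu) * common


def observed_proportions(rng, a, b, n_cells):
    """Draw M true proportions, then binomial cell counts, and return A / n_cells."""
    props = []
    for _ in range(M):
        p = rng.betavariate(a, b)
        count = sum(rng.random() < p for _ in range(n_cells))
        props.append(count / n_cells)
    return props


def t_statistic(control, case):
    """Unpaired t-statistic; with equal group sizes the pooled and Welch forms agree."""
    mean0, mean1 = sum(control) / M, sum(case) / M
    var0 = sum((x - mean0) ** 2 for x in control) / (M - 1)
    var1 = sum((x - mean1) ** 2 for x in case) / (M - 1)
    se = math.sqrt(var0 / M + var1 / M)
    return (mean1 - mean0) / se if se > 0 else 0.0  # degenerate draw counts as a miss


def simulate_fnr(mu0, mu1, sigma, n_cells=300, n_rep=300, seed=1):
    """Fraction of simulated studies in which the one-sided test is not significant."""
    rng = random.Random(seed)
    a0, b0 = beta_params(mu0, sigma)
    a1, b1 = beta_params(mu1, sigma)
    misses = 0
    for _ in range(n_rep):
        control = observed_proportions(rng, a0, b0, n_cells)
        case = observed_proportions(rng, a1, b1, n_cells)
        if t_statistic(control, case) <= T_CRIT:
            misses += 1
    return misses / n_rep
```

Calling `simulate_fnr(0.42, 0.56, 0.15)` gives a Monte Carlo estimate of the false negative rate that can be compared against an analytical estimate for the same inputs; increasing `n_rep` toward the 1,000 repetitions used in the study reduces the simulation noise at the cost of run time.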