- Open Access
A computational method to aid the design and analysis of single cell RNA-seq experiments for cell type identification
BMC Bioinformatics volume 20, Article number: 275 (2019)
The advent of single cell RNA sequencing (scRNA-seq) enabled researchers to study transcriptomic activity within individual cells and identify inherent cell types in the sample. Although numerous computational tools have been developed to analyze single cell transcriptomes, there are no published studies and analytical packages available to guide experimental design and to devise suitable analysis procedure for cell type identification.
We have developed an empirical methodology to address this important gap in single cell experimental design and analysis into an easy-to-use tool called SCEED (Single Cell Empirical Experimental Design and analysis). With SCEED, user can choose a variety of combinations of tools for analysis, conduct performance analysis of analytical procedures and choose the best procedure, and estimate sample size (number of cells to be profiled) required for a given analytical procedure at varying levels of cell type rarity and other experimental parameters. Using SCEED, we examined 3 single cell algorithms using 48 simulated single cell datasets that were generated for varying number of cell types and their proportions, number of genes expressed per cell, number of marker genes and their fold change, and number of single cells successfully profiled in the experiment.
Based on our study, we found that when marker genes are expressed at fold change of 4 or more, either Seurat or SIMLR algorithm can be used to analyze single cell dataset for any number of single cells isolated (minimum 1000 single cells were tested). However, when marker genes are expected to be only up to fold change of 2, choice of the single cell algorithm is dependent on the number of single cells isolated and rarity of cell types to be identified. In conclusion, our work allows the assessment of various single cell methods and also aids in the design of single cell experiments.
The greater precision afforded by single cell sequencing has increased the scope of the average sequencing study. Unlike conventional bulk sequencing methods that profile aggregate of hundreds of thousands of cells, the single cell sequencing methods made it possible to isolate a single cell and perform different types of omics profiling including genomics, transcriptomics, epigenomics and proteomics . One prominent technique that measures gene expression at single-cell level is single cell mRNA sequencing (scRNA-seq) [1, 2]. It, unlike bulk sequencing, unmasks the fundamental, widespread heterogeneity in gene expression among cells in a tissue or cells considered to be of same type based on canonical markers [3, 4]. Hence, rather than simply examining differential expression between two samples, we can identify the cell types and expressed genes within each cell type as a first step before differential expression analysis [4, 5]. Not only does this first step provide valuable insights into the transcriptomic profiles of individual cell types and states, but it also provides a deeper context for the subsequent differential expression analysis.
However, the effectiveness of cell type identification is a multi-step process which led to the explosion of new single cell software applications, referred to as a “cottage industry” . According to Awesome Single Cell (https://github.com/seandavi/awesome-single-cell), a site that compiles a list of new single cell analysis methods, 118 methods have recently been created for analyzing single cell sequencing data (normalization, dimensionality reduction, clustering and differential expression), including plethora of methods required for cell type identification.
Hence, it is necessary to comparatively assess the different tool combinations (aka pipelines) to determine which is the best at cell type identification. Comparative analyses have been published on sequencing [7, 8], normalization  and clustering [10, 11]. Yet, there has not been a comprehensive study, assessing whole pipelines and addressing broader issues of experimental design in cell type identification.
We developed a computational method to address this important gap. We developed an easy to use tool as an R-package SCEED (Single Cell Experimental Design and Analysis). The package has functionality to simulate scRNA-seq data with user provided statistical characteristics: total number of cells, genes, cell type proportions, marker genes and fold change (fC) of marker genes. The simulated dataset with known cell types can be analyzed using published cell-type identification algorithms by incorporating them in SCEED. Systematic comparison of the results of the analysis pipeline to the known true labels using F1score (for details see methods) that provide the ability to identify the optimal single cell analysis procedures for the dataset and will also help to identify the number of cells required for adequate power for the detection of the cell-types.
The schematic of SCEED is shown in Fig. 1. Each step in SCEED is described below.
Our procedure to simulate a single cell dataset is shown in Fig. 2. In step 1, gene by cell expression matrix is simulated using Splatter package , which simulates m cell types of given rarity/prevalence with n cells. In step 2, each cell type will express specific number of marker genes g with specific fold change levels fC. The mean expression level of each marker gene gi in group k was simulated by taking the product of a group-specific fold change level (sampled from a negative binomial distribution with shape = fCi and rate = 1) and the mean expression level of gi in all cells that are not part of k. For each cell in k, the final expression level of marker gene gi was the product of the simulated mean of gi and a library size that was simulated using Splatter . The remaining steps are stated in Fig. 2.
Single cell analysis steps
A standard single cell analysis procedure includes data normalization, dimensionality reduction and clustering . Normalization is a crucial step for any single cell analysis that adjusts for unwanted technical or biological variations that may otherwise affect the gene expression analysis. With larger datasets like single cell, dimensionality reduction is also an important step that transforms data into lower dimensional space, allowing significant reduction in data complexity and also makes data visualization easier. Finally, single cells with similar transcriptome profiles are clustered together to deduce putative (sub)populations, aka cell types.
Incorporation of single cell methods into SCEED package
SCEED package allow users to add any single cell analysis package of interest into its pipeline using function “sceed_AlgorithmName” for example sceed_seurat. In the current implementation of SCEED, Kmeans, SIMLR and Seurat (details in results section) are available. Although we have added only three single cell algorithms, SCEED package is completely flexible and any number of single-cell algorithms can be added for testing as per user’s requirements.
The performance of an analysis procedure is assessed by computing F1score of a cluster. F1score is a balancing measures of recall (sensitivity) and precision of cell classification. Higher F1score shows better performance of the algorithm tested. User can choose F1score threshold suitable to annotate the clusters for cell types and hence best single cell analysis algorithm as well as sample size.
We used SCEED to test 3 popularly known single cell algorithms for cell type identification: Kmeans, SEURAT and SIMLR. For Kmeans clustering approach, k was set equal to the number of cell types simulated. For Seurat and SIMLR algorithms, default parameters mentioned by the authors were used. In Seurat, while using “FindClusters” function, k.param was set to the number of cell types simulated. We generated 27 datasets of varying choices of parameters.
Generating simulated single-cell datasets
In a single cell experiment, discovering rare cell populations is of utmost importance. Stressing on the rarity of cell populations, we simulated single cell datasets where five cell types were partitioned into unequal proportions such that one of them has low proportion or representing rare population, ranging from 2 to 10%. For instance, we defined a single-cell category having 5 cell types in proportions of 0.1, 0.2, 0.2, 0.2 and 0.3. In each cell type, 50 genes were simulated as marker genes that were either 2-, 4- or 8-fold upregulated when compared to rest of the cell types. For the same proportions of cell types while keeping the other parameters same, we simulated single cell data sets of 2000 or 3000 cells. More details of each dataset are shown in Table 1. In summary, we created 27 simulated single-cell datasets. Notably, in SCEED package, all these parameters (such as number of cell types, single cells per cell type, genes per cell, marker genes per cell type and fold change cutoffs) can be adjusted as per user’s requirements.
Testing the performance of single-cell algorithms and estimation of sample size required
All these datasets were analyzed using three single cell algorithms, Kmeans, Seurat and SIMLR and tested for their performance using F1score. At lowest fold change (fC) of 2 of marker genes, irrespective of number of single cells collected, Seurat provided the best performance in F1score for rarity of 0.1. However, for fC of 2, we may need at least 1000 cells to achieve F1score of > 0.9. As fC increases, the other algorithms also offered increased performance, Additional files 1 and 2. Next, we compared these algorithms to detect even rarer cell type, with a proportion of 0.05 (the cell type proportions are 0.05, 0.35, 0.2, 0.2 and 0.2), Fig. 3. At fC = 2, Seurat reached the F1 score of 0.93 but only when number of single cells > = 2000. In line with previous observation, the other algorithms also showed increased performance with increase in fC at 0.05 proportion. However, when we reduced the rarer cell type proportion further down to 0.02, SIMLR outperformed the remaining two algorithms with F1 score of 0.69 for number of single cells > = 1000. Separately, we also estimated the minimum sample size required at a given F1 score. For instance, SIMLR could attain F1score > 0.7 for proportions of 0.1 and 0.05 for sample size (number of single cells) of 1000 while Seurat required sample sizes of 1000 and 2000 for cell proportions of 0.1 and 0.05 respectively. We see similar results for fC of 4 and 8, see Additional files 1 and 2.
Discussion and conclusion
We proposed SCEED method as an easy-to-use package to help the researchers in designing a single cell experiment (estimate the number of cells required to identify novel cell types) and optimal analysis procedure. The package takes into account all technical and biological parameters that characterize typical single cell RNA-seq data. Using SCEED package, we simulated 27 single cell datasets that account for varying sample sizes, rarity of cell types and fold change of expression of marker genes. Such a simulation is significant. For example, when researchers are planning to analyze cell types similar to beta cells from islets of Langerhans in the pancreas where marker genes such as insulin are expressed in far greater concentrations than rest of the genes. In contrast, they are interested in identifying sub classes of established cell types where marker genes are expressed in moderately higher concentrations than the rest of the genes. Using SCEED package, researchers can generate simulated datasets that bear statistical properties similar to that of the expected data and test various single cell algorithms. Our package not only suggests the best method among the tested algorithms but also suggest the number of cells required to achieve the required results. As single cell transcriptome analysis field is rapidly growing field, SCEED package facilitates easily adding more single cell algorithms for testing.
In our study, we have compared the performance of three popularly used single cell algorithms. Though our simulations are limited, our study clearly shows that even popularly used algorithms do not perform best over ranges of cell population rarity and fold change in expression of marker genes. Based on these results, we demonstrated that SCEED package fills an important gap in the single cell analysis field. However, we need to conduct extensive study to identify optimal analysis procedures for a variety of experimental settings and statistical properties of data. Such a study needs to account not only for the 3 parameters we tested up on, it needs to account for the variation in the other statistical parameters (can be selected in SCEED package) and addressing the experimental designs of scRNA-seq experiments.
Macaulay IC, Ponting CP, Voet T. Single-cell multiomics: multiple measurements from single cells. Trends Genet. 2017;33(2):155–68.
Kanter I, Kalisky T. Single cell transcriptomics: methods and applications. Front Oncol. 2015;5:53.
Papalexi E, Satija R: Single-cell RNA sequencing to explore immune cell heterogeneity. Nat Rev Immunol 2018;18(1):35–45.
Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014;344(6190):1396–401.
Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009;6(5):377–82.
Perkel JM. Single-cell sequencing made simple. Nature. 2017;547(7661):125–6.
Ziegenhain C, Vieth B, Parekh S, Reinius B, Guillaumet-Adkins A, Smets M, Leonhardt H, Heyn H, Hellmann I, Enard W. Comparative analysis of single-cell RNA sequencing methods. Mol Cell. 2017;65(4):631–643 e634.
Wu AR, Neff NF, Kalisky T, Dalerba P, Treutlein B, Rothenberg ME, Mburu FM, Mantalas GL, Sim S, Clarke MF, et al. Quantitative assessment of single-cell RNA-sequencing methods. Nat Methods. 2014;11(1):41–6.
Vallejos CA, Risso D, Scialdone A, Dudoit S, Marioni JC. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat Methods. 2017;14(6):565–71.
Zurauskiene J, Yau C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics. 2016;17:140.
Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017;14(5):483–6.
Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 2017;18(1):174.
Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 2016;5:2122.
We thank Kyung In Kim for his valuable inputs for the manuscript.
Research reported in this publication was partially supported by the Cancer Center Support Grant from the National Cancer Institute of the NIH under Award Number P30CA034196 (to J.G. and R.K.M.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The work has also been funded by National Cancer Institute of the NIH for Summer Student Fellowship under award number R25 CA174584 (to D.A.). Publication costs are supported by The Jackson Laboratory.
Availability of data and materials
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 11, 2019: Selected articles from the 7th IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2017): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-11.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Performance of different single cell algorithms at different cell proportions. F1score was calculated at cell rarity proportions of 0.02, 0.05 and 0.1 containing 4 foldchange upregulated marker genes for 1000, 2000 and 3000 single cells datasets. X-axis represents the cell rarity proportions while y-axis represents F1score. (PDF 5 kb)
Performance of different single cell algorithms at different cell proportions. F1score was calculated at cell rarity proportions of 0.02, 0.05 and 0.1 containing 8 foldchange upregulated marker genes for 1000, 2000 and 3000 single cells datasets. X-axis represents the cell rarity proportions while y-axis represents F1score. (PDF 5 kb)