A computational method to aid the design and analysis of single cell RNA-seq experiments for cell type identification

Background The advent of single cell RNA sequencing (scRNA-seq) enabled researchers to study transcriptomic activity within individual cells and identify inherent cell types in the sample. Although numerous computational tools have been developed to analyze single cell transcriptomes, there are no published studies and analytical packages available to guide experimental design and to devise suitable analysis procedure for cell type identification. Results We have developed an empirical methodology to address this important gap in single cell experimental design and analysis into an easy-to-use tool called SCEED (Single Cell Empirical Experimental Design and analysis). With SCEED, user can choose a variety of combinations of tools for analysis, conduct performance analysis of analytical procedures and choose the best procedure, and estimate sample size (number of cells to be profiled) required for a given analytical procedure at varying levels of cell type rarity and other experimental parameters. Using SCEED, we examined 3 single cell algorithms using 48 simulated single cell datasets that were generated for varying number of cell types and their proportions, number of genes expressed per cell, number of marker genes and their fold change, and number of single cells successfully profiled in the experiment. Conclusions Based on our study, we found that when marker genes are expressed at fold change of 4 or more, either Seurat or SIMLR algorithm can be used to analyze single cell dataset for any number of single cells isolated (minimum 1000 single cells were tested). However, when marker genes are expected to be only up to fold change of 2, choice of the single cell algorithm is dependent on the number of single cells isolated and rarity of cell types to be identified. In conclusion, our work allows the assessment of various single cell methods and also aids in the design of single cell experiments. Electronic supplementary material The online version of this article (10.1186/s12859-019-2817-2) contains supplementary material, which is available to authorized users.


Background
The greater precision afforded by single cell sequencing has increased the scope of the average sequencing study. Unlike conventional bulk sequencing methods that profile aggregate of hundreds of thousands of cells, the single cell sequencing methods made it possible to isolate a single cell and perform different types of omics profiling including genomics, transcriptomics, epigenomics and proteomics [1]. One prominent technique that measures gene expression at single-cell level is single cell mRNA sequencing (scRNA-seq) [1,2]. It, unlike bulk sequencing, unmasks the fundamental, widespread heterogeneity in gene expression among cells in a tissue or cells considered to be of same type based on canonical markers [3,4]. Hence, rather than simply examining differential expression between two samples, we can identify the cell types and expressed genes within each cell type as a first step before differential expression analysis © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
* Correspondence: krishna.karuturi@jax.org; joshy.george@jax.org † Douglas Abrams and Parveen Kumar contributed equally to this work. 2 The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA Full list of author information is available at the end of the article [4,5]. Not only does this first step provide valuable insights into the transcriptomic profiles of individual cell types and states, but it also provides a deeper context for the subsequent differential expression analysis.
However, the effectiveness of cell type identification is a multi-step process which led to the explosion of new single cell software applications, referred to as a "cottage industry" [6]. According to Awesome Single Cell (https:// github.com/seandavi/awesome-single-cell), a site that compiles a list of new single cell analysis methods, 118 methods have recently been created for analyzing single cell sequencing data (normalization, dimensionality reduction, clustering and differential expression), including plethora of methods required for cell type identification.
Hence, it is necessary to comparatively assess the different tool combinations (aka pipelines) to determine which is the best at cell type identification. Comparative analyses have been published on sequencing [7,8], normalization [9] and clustering [10,11]. Yet, there has not been a comprehensive study, assessing whole pipelines and addressing broader issues of experimental design in cell type identification.
We developed a computational method to address this important gap. We developed an easy to use tool as an R-package SCEED (Single Cell Experimental Design and Analysis). The package has functionality to simulate scRNA-seq data with user provided statistical characteristics: total number of cells, genes, cell type proportions, marker genes and fold change (fC) of marker genes. The simulated dataset with known cell types can be analyzed using published cell-type identification algorithms by incorporating them in SCEED. Systematic comparison of the results of the analysis pipeline to the known true labels using F1score (for details see methods) that provide the ability to identify the optimal single cell analysis procedures for the dataset and will also help to identify the number of cells required for adequate power for the detection of the cell-types.

Methods
The schematic of SCEED is shown in Fig. 1. Each step in SCEED is described below.

Data simulation
Our procedure to simulate a single cell dataset is shown in Fig. 2. In step 1, gene by cell expression matrix is simulated using Splatter package [12], which simulates m cell types of given rarity/prevalence with n cells. In step 2, each cell type will express specific number of marker genes g with specific fold change levels fC. The mean expression level of each marker gene g i in group k was simulated by taking the product of a group-specific fold change level (sampled from a negative binomial distribution with shape = fC i and rate = 1) and the mean expression level of g i in all cells that are not part of k. For each cell in k, the final expression level of marker gene g i was the product of the simulated mean of g i and a library size that was simulated using Splatter [12]. The remaining steps are stated in Fig. 2.

Single cell analysis steps
A standard single cell analysis procedure includes data normalization, dimensionality reduction and clustering Fig. 1 Schematic representation of SCEED pipeline. (Left to right) First a simulated dataset is generated using SCEED "generateDataset" function with input parameters mentioned under "Data simulation". Next, the simulated dataset is analyzed using different single cell analysis procedures. To test the performance of each single-cell algorithm, F1score which is a measure of test's accuracy is computed. Finally, based on the F1score cutoff chosen by user, the best analysis procedure and the number of cells required to perform the single cell experiment are selected [13]. Normalization is a crucial step for any single cell analysis that adjusts for unwanted technical or biological variations that may otherwise affect the gene expression analysis. With larger datasets like single cell, dimensionality reduction is also an important step that transforms data into lower dimensional space, allowing significant reduction in data complexity and also makes data visualization easier. Finally, single cells with similar transcriptome profiles are clustered together to deduce putative (sub)populations, aka cell types.

Incorporation of single cell methods into SCEED package
SCEED package allow users to add any single cell analysis package of interest into its pipeline using function "sceed_AlgorithmName" for example sceed_seurat. In the current implementation of SCEED, Kmeans, SIMLR and Seurat (details in results section) are available. Although we have added only three single cell algorithms, SCEED package is completely flexible and any number of single-cell algorithms can be added for testing as per user's requirements.

Performance assessment
The performance of an analysis procedure is assessed by computing F1score of a cluster. F1score is a balancing measures of recall (sensitivity) and precision of cell classification. Higher F1score shows better performance of the algorithm tested. User can choose F1score threshold suitable to annotate the clusters for cell types and hence best single cell analysis algorithm as well as sample size.

Results
We used SCEED to test 3 popularly known single cell algorithms for cell type identification: Kmeans, SEURAT and SIMLR. For Kmeans clustering approach, k was set equal to the number of cell types simulated. For Seurat and SIMLR algorithms, default parameters mentioned by the authors were used. In Seurat, while using "FindClusters" function, k.param was set to the number of cell types simulated. We generated 27 datasets of varying choices of parameters.

Generating simulated single-cell datasets
In a single cell experiment, discovering rare cell populations is of utmost importance. Stressing on the rarity of cell populations, we simulated single cell datasets where five cell types were partitioned into unequal proportions such that one of them has low proportion or representing rare population, ranging from 2 to 10%. For instance, we defined a single-cell category having 5 cell types in proportions of 0.1, 0.2, 0.2, 0.2 and 0.3. In each cell type,  Table 1. In summary, we created 27 simulated single-cell datasets. Notably, in SCEED package, all these parameters (such as number of cell types, single cells per cell type, genes per cell, marker genes per cell type and fold change cutoffs) can be adjusted as per user's requirements.
Testing the performance of single-cell algorithms and estimation of sample size required All these datasets were analyzed using three single cell algorithms, Kmeans, Seurat and SIMLR and tested for their performance using F1score. At lowest fold change (fC) of 2 of marker genes, irrespective of number of single cells collected, Seurat provided the best performance in F1score for rarity of 0.1. However, for fC of 2, we may need at least 1000 cells to achieve F1score of > 0.9. As fC increases, the other algorithms also offered increased performance, Additional files 1 and 2. Next, we

Discussion and conclusion
We proposed SCEED method as an easy-to-use package to help the researchers in designing a single cell experiment (estimate the number of cells required to identify novel cell types) and optimal analysis procedure. The package takes into account all technical and biological parameters that characterize typical single cell RNA-seq data. Using SCEED package, we simulated 27 single cell datasets that account for varying sample sizes, rarity of cell types and fold change of expression of marker genes. Such a simulation is significant. For example, when researchers are planning to analyze cell types similar to beta cells from islets of Langerhans in the pancreas where marker genes such as insulin are expressed in far greater concentrations than rest of the genes. In contrast, they are interested in identifying sub classes of established cell types where marker genes are expressed in moderately higher concentrations than the rest of the genes. Using SCEED package, researchers can generate simulated datasets that bear statistical properties similar to that of the expected data and test various single cell algorithms. Our package not only suggests the best method among the tested algorithms but also suggest the number of cells required to achieve the required results. As single cell transcriptome analysis field is rapidly growing field, SCEED package facilitates easily adding more single cell algorithms for testing. In our study, we have compared the performance of three popularly used single cell algorithms. Though our simulations are limited, our study clearly shows that even popularly used algorithms do not perform best over ranges of cell population rarity and fold change in expression of marker genes. Based on these results, we demonstrated that SCEED package fills an important gap in the single cell analysis field. However, we need to conduct extensive study to identify optimal analysis procedures for a variety of experimental settings and statistical properties of data. Such a study needs to account not only for the 3 parameters we tested up on, it needs to account for the variation in the other statistical parameters (can be selected in SCEED package) and addressing the experimental designs of scRNA-seq experiments.

Additional files
Additional file 1: Performance of different single cell algorithms at different cell proportions. F1score was calculated at cell rarity proportions of 0.02, 0.05 and 0.1 containing 4 foldchange upregulated marker genes for 1000, 2000 and 3000 single cells datasets. X-axis represents the cell rarity proportions while y-axis represents F1score. (PDF 5 kb) Additional file 2: Performance of different single cell algorithms at different cell proportions. F1score was calculated at cell rarity proportions of 0.02, 0.05 and 0.1 containing 8 foldchange upregulated marker genes for 1000, 2000 and 3000 single cells datasets. X-axis represents the cell rarity proportions while y-axis represents F1score. (PDF 5 kb)