Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data

Background The analysis of single-cell RNA sequencing (scRNAseq) data plays an important role in understanding the intrinsic and extrinsic cellular processes in biological and biomedical research. One significant effort in this area is the detection of differentially expressed (DE) genes. scRNAseq data, however, are highly heterogeneous and have a large number of zero counts, which introduces challenges in detecting DE genes. Addressing these challenges requires employing new approaches beyond the conventional ones, which are based on a nonzero difference in average expression. Several methods have been developed for differential gene expression analysis of scRNAseq data. To provide guidance on choosing an appropriate tool or developing a new one, it is necessary to evaluate and compare the performance of differential gene expression analysis methods for scRNAseq data. Results In this study, we conducted a comprehensive evaluation of the performance of eleven differential gene expression analysis software tools, which are designed for scRNAseq data or can be applied to them. We used simulated and real data to evaluate the accuracy and precision of detection. Using simulated data, we investigated the effect of sample size on the detection accuracy of the tools. Using real data, we examined the agreement among the tools in identifying DE genes, the run time of the tools, and the biological relevance of the detected DE genes. Conclusions In general, agreement among the tools in calling DE genes is not high. There is a trade-off between true-positive rates and the precision of calling DE genes. Methods with higher true positive rates tend to show low precision due to their introducing false positives, whereas methods with high precision show low true positive rates due to identifying few DE genes. We observed that current methods designed for scRNAseq data do not tend to show better performance compared to methods designed for bulk RNAseq data. Data multimodality and abundance of zero read counts are the main characteristics of scRNAseq data, which play important roles in the performance of differential gene expression analysis methods and need to be considered in terms of the development of new methods. Electronic supplementary material The online version of this article (10.1186/s12859-019-2599-6) contains supplementary material, which is available to authorized users.


Background
Next generation sequencing (NGS) [1] technologies greatly promote research in genome-wide mRNA expression data. Compared with microarray technologies, NGS provides higher resolution data and more precise measurement of levels of transcripts for studying gene expression. Through downstream analysis of RNA sequencing (RNAseq) data, gene expression levels reveal the variability between different samples. Typically, in RNAseq data analysis, the expression value of a gene from one sample represents the mean of all expression values of the bulk population of cells. Although it is common to use expression values on such a bulk scale in certain situations [2][3][4], it is not sufficient to employ bulk RNAseq data for other biological research that involves, for example, studying circulating tumor cells [5] and stem cells. Consequently, analyzing gene expression values on the single-cell scale provides deep insight into the interplay between intrinsic cellular processes and stochastic gene expression in biological and biomedical research [6][7][8][9]. For example, single-cell data analysis is important in cancer studies, as differential gene expression analysis between different cells can help to uncover driver genes [10].
Tools developed for differential gene expression analysis on bulk RNAseq data, such as DESeq [11] and edgeR [12], can be applied to single-cell data [11][12][13][14][15][16][17][18][19][20]. Single-cell RNAseq (scRNAseq) data, however, have different characteristics from those of bulk RNAseq data that require the use of a new differential expression analysis definition, beyond the conventional definition of a nonzero difference in average expression. In scRNAseq data, due to the tiny number and low capture efficiency of RNA molecules in single cells [6], many transcripts tend to be missed during the reverse transcription. As a result, we may observe that some transcripts are highly expressed in one cell but are missed in another cell. This phenomenon is defined as a "drop-out" event [21]. Recent studies have shown that gene expression in a single cell is a stochastic process and that gene expression values in different cells are heterogeneous [22,23], which results in multimodality in expression values in different cells. For example, cells from the same brain tissue or the same tumor [24] pose huge heterogeneity from cell to cell [24][25][26][27][28]. Even though they are from the same tissue, these cells are different in regard to cell types, biological functions, and response to drugs. Therefore, unlike bulk RNAseq data, scRNAseq data tend to exhibit an abundance of zero counts, a complicated distribution, and huge heterogeneity. Examples of distributions of scRNAseq expression values between two conditions are shown in Fig. 1. Consequently, the heterogeneity within and between cell populations manifests major challenges to the differential gene expression analysis in scRNAseq data.
To address the challenges of multimodal expression values and/or drop-out events, new strategies and models [21,[29][30][31][32][33][34][35][36][37] have been proposed for scRNAseq Fig. 1 Distributions of gene expression values of total 92 cells in two groups (ES and MEF) using real data show that scRNAseq data exhibit a different types of multimodality (DU, DP, DM, and DB) and b large amounts of zero counts. X axis represents log-transformed expression values. To clearly show the multimodality of scRNAseq data, zero counts are removed from the distribution plots in (a) data. Single-cell differential expression (SCDE) [21] and model-based analysis of single-cell transcriptomics (MAST) [29] use a two-part joint model to address zero counts; one part corresponds to the normal observed genes, and the other corresponds to the drop-out events. Monocle2 [38] is updated from the previous Monocle [32] and employs census counts rather than normalized transcript counts as input to better normalize the counts and eliminate variability in single-cell experiments. A recent approach, termed scDD [39], considers four different modality situations for gene expression value distributions within and across biological conditions. DEsingle employs a zero-inflated negative binomial (ZINB) regression model to estimate the proportion of the real and drop-out zeros and classifies the differentially expressed (DE) genes into three categories. Recently, nonparametric methods, SigEMD [37], EMDomics [31], and D3E [33], have been proposed for differential gene expression analysis of heterogeneous data. Without modeling the distributions of gene expression values and estimating their parameters, these methods identify DE genes by employing a distance metric between the distributions of genes in two conditions. A few studies have compared differential expression analysis methods for scRNAseq data. Jaakkola et al. [40] compared five statistical analysis methods for scRNAseq data, three of which are for bulk RNAseq data analysis. Miao et al. [41] evaluated 14 differential expression analysis tools, three of which are newly developed for scRNAseq data and 11 of which are old methods for bulk RNAseq data. A recent comparison study [42] assessed six differential expression analysis tools, four of which were developed for scRNAseq and two of which were designed for bulk RNAseq. In this study, we consider all differential gene expression analysis tools that have been developed for scRNAseq data as of October 2018 (SCDE [21], MAST [29], scDD [39], D3E [33], Monocle2 [38], SINCERA [34], DEsingle [36], and SigEMD [37]). We also consider differential gene expression analysis tools that are designed for heterogeneous expression data (EMDomics [31]) and are commonly used for bulk RNAseq data (edgeR [4], DESeq2 [43]).
The goal of this study is to reveal the limitations of the current tools and to provide insight and guidance in regard to choosing a tool or developing a new one. In this work, we discuss the computational methods used by these tools and comprehensively evaluate and compare the performance of the tools in terms of sensitivity, false discover rate, and precision. We use both simulated and real data to evaluate the performance of the above-noted tools. To generate more realistic simulated data, we model both multimodality and drop-out events in simulated data. Using gold standard DE genes in both simulated and real data, we evaluate the accuracy of detecting true DE genes. In addition, we investigate the agreement among the methods in identifying significantly DE genes. We also evaluate the effect of sample size on the performance of the tools, using simulated data, and compare the runtimes of the tools, using real data. Finally, we perform gene-set enrichment and pathway analysis to evaluate the biological functional relevance of the DE genes identified by each tool.

Methods
As of October 2018, we have identified eight software tools for differential expression analysis of scRNAseq data, which are designed specifically for such data [21,29,30,33,34,[36][37][38] (SCDE, MAST, scDD, D3E, Mon-ocle2, SINCERA, DEsingle, and SigEMD). We also considered tools designed for bulk RNAseq data that are widely used [4,43] (edgeR, and DESeq2) or can apply to multimodal data [31] (EMDomics). The general characteristics of the eleven tools are provided in Table 1. MAST, scDD, EMDomics, Monocle2, SINCERA, and SigEMD use normalized TPM/FPKM expression values as input, while SCDE, D3E, and DEsingle use read counts obtained from RSEM as input. D3E runs on Python, while all other methods are developed as an R package. In the following sections, we provide the details of the tools.
Differential gene expression analysis methods for scRNAseq data Single-cell differential expression (SCDE) SCDE [21] utilizes a mixture probabilistic model for gene expression values. The observed read counts of genes are modeled as a mixture of drop-out events by a Poisson distribution and amplification components by a negative binomial (NB) distribution: where e is the expected expression value in cells when the gene is amplified, and λ 0 is always set to 0.1. The posterior probability of a gene expressed at level x in cell c based on observed r c and the fitted model Ω c is calculated by: where p d is the probability of a drop-out event in cell c for a gene expressed at an average level x, and p poisson (x| r c ) and p NB (x|r c ) are the probabilities of observing expression value r c in the cases of drop-out (Poisson) and successful amplification (NB) of a gene expressed at level x in cell c, respectively. Then, after the bootstrap step, the posterior probability of a gene expressed at level x in a subpopulation of cells S is determined as an expected value: where B is the bootstrap samples of S. Based on the posterior probabilities of gene expression values in cells S and G, p S (x) and p G (x), SCDE uses a fold expression difference f in gene g for the differential expression analysis between subgroups S and G, which is determined as: where X is the expression range of the gene g. An empirical p-value is determined to test the differential expression.

Model-based analysis of single-cell transcriptomics (MAST)
MAST [29] proposes a two-part generalized linear model for differential expression analysis of scRNAseq data. One part models the rate of expression level, using logistic regression: where Z = [Z ig ] indicates whether gene g is expressed in cell i. The other part models the positive expression mean, using a Gaussian linear model: where Y = [y ig ] is the expression level of gene g in cell i observed Z ig = 1. The cellular detection rate (CDR) for each cell, defined as CDR i = (1/N)∑ g = 1 Z ig (N is the total number of genes), is introduced as a column in the design matrix X i of the logistic regression model and the Gaussian linear model. For the differential expression analysis, a test with asymptotic chi-square null distribution is utilized, and a false discovery rate (FDR) adjustment control [44] is used to decide whether a gene is differentially expressed.

Bayesian modeling framework (scDD)
scDD [39] employs a Bayesian modeling framework to identify genes with differential distributions and to classify them into four situations: 1-differential unimodal (DU), 2-differential modality (DM), 3-differential proportion (DP), and 4-both DM and DU (DB), as shown in Additional file 1: Figure S1. The DU situation is one in which each distribution is unimodal but the distributions across the two conditions have different means.
The DP situation involves genes with expression values that are bimodally distributed. The bimodal distribution of gene expression values in each condition has two modes with different proportions, but the two modes across the two conditions are the same. DM and DB situations both include genes whose expression values follow a unimodal distribution in one condition but a bimodal distribution in the other condition. The difference is that, in the DM situation, one of the modes of the bimodal distribution is equal to the mode of the unimodal distribution, whereas in the DB situation, there is no common mode across the two distributions. Let Y g be the expression value of gene g in a collection of cells. The non-zero expression values of gene g are modeled as a conjugate Dirichlet process mixture (DPM) model of normals, and the zero expression values of gene g are modeled using logistic regression as a separate distributional component: For detecting the DE genes, a Bayes factor for gene g is determined as: is the predictive distribution of the observed expression value from gene g under a given hypothesis, M DD denotes the differential distribution hypothesis, and M ED denotes the equivalent distribution hypothesis that ignores conditions. As there is no solution for the Bayes factor BF g , a closed form is calculated to present the evidence of whether a gene is differentially expressed: where Z g is the vector of the mean and the variance for gene g, and C 1 and C 2 represent the two conditions.

EMDomics
EMDomics [31], a nonparametric method based on Earth Mover's Distance (EMD), is proposed to reflect the overall difference between two normalized distributions by computing the EMD score for each gene and determining the estimation of FDRs. Suppose P = {(p 1, w p1 ) ,(p 2, w p2 )…(p m, w pm )} and Q = {(q 1, w q1 ),(q 2, w q2 )… (q n, w qn )} are two signatures, where p i and q j are the centers of each histogram bin, and w pi and w qj are the weights of each histogram bin. The COST is defined as the summation of the multiplication of flow f ij and the distance d ij : where d ij is the Euclidean distance between p i and q j , and f ij is the amount of weight that need to be moved between p i and q j . An optimization algorithm is used to find a flow F = [f ij ] between p i and q j to minimize the COST. After that, the EMD score is calculated as the normalized minimum COST.
based on the permutations of FDRs, is introduced to describe the significance of the score for each gene.

Monocle2
Monocle2 [38] is an updated version of Monocle [32], a computational method used for cell type identification, differential expression analysis, and cell ordering. Monocle applies a generalized additive model, which is a generalized linear method with linear predictors that depend on some smoothing functions. The model relates a univariate response variable Y, which belongs to the exponential family, to some predictor variables, as follows: where h is the link function, such as identity or log function, Y is the gene expression level, x i is the predictor variable that expresses the cell categorical label, and f i is a nonparametric function, such as cubic splines or some other smoothing functions. Specifically, the gene expression level Y is modeled using a Tobit model: where Y * is a latent variable that corresponds to predictor x, and λ is the detection threshold. For identifying DE genes, we use an approximate chi-square (χ 2 ) likelihood ratio test. In Monocle2, a census algorithm is used to estimate the relative transcript counts, which leads to an improvement of the accuracy compared with using the normalized read counts, such as TPM values.
Discrete distributional differential expression (D3E) D3E [33] consists of four steps: 1-data filtering and normalization, 2-comparing distributions of gene expression values for DE genes analysis, 3-fitting a Poisson-Beta model, and 4-calculating the changes in parameters between paired samples for each gene. For the normalization, D3E uses the same algorithm as used by DESeq2 [11] and filters genes that are not expressed in any cell. Then, the non-parametric Cramer-von Mises test or the Kolmogorov-Smirnov test is used to compare the expression values' distributions of each gene for identifying the DE genes. Alternatively, a parametric method, the likelihood ratio test, can be utilized after fitting a Poisson-Beta model: where n is the number of transcripts of a particular gene, α is the rate of promoter activation, β is the rate of promoter inactivation, γ is the rate of transcription when the promoter is in the active state, λ is the transcript degradation rate, and x is the auxiliary variable. The parameters α, β, and γ can be estimated by moments matching or Bayesian inference method, but λ should be known and assumed to be constant.

SINCERA
SINCERA [34] is a computational pipeline for single cell downstream analysis that enables pre-processing, normalization, cell type identification, differential expression analysis, gene signature prediction, and key transcription factors identification. SINCERA calculates the p-value for each gene from two groups based on a statistical test to identify the DE genes. It provides two methods: one-tailed Welch's t-test for genes, assuming they are from two independent normal distributions, and the Wilcoxon rank sum test for small sample sizes. Last, the FDRs are adjusted, using the Benjamini and Hochberg method [44].
edgeR edgeR [4] is a negative binomial model-based method to determine DE genes. It uses a weighted trimmed mean of the log expression ratios to normalize the sequencing depth and gene length between the samples. Then, the expression data are used to fit a negative binomial model, whereby the mean μ and variance ν have a relationship of ν = μ + αμ 2 , and α is the dispersion factor. To estimate the dispersion factor, edgeR combines a common dispersion across all the genes, estimated by a likelihood function, and a gene-specific dispersion, estimated by the empirical Bayes method. Last, an exact test with FDR control is used to determine DE genes.

DESeq2
DESeq2 [43] is an advanced version of DESeq [11], which is also based on the negative binomial distribution. Compared with the DESeq, which uses a fixed normalization factor, the new version of DESeq2 allows the use of a gene-specific shrinkage estimation for dispersions. When estimating the dispersion, DESeq2 uses all of the genes with a similar average expression. The fold-change estimation is also employed to avoid identifying genes with small average expression values.

DEsingle
DEsingle [36] utilizes a ZINB regression model to estimate the proportion of the real and drop-out zeros in the observed expression data. The expression values of each gene in each population of cells are estimated by a ZINB model. The probability mass function (PMF) of the ZINB model for read counts of gene g in a group of cells is: where θ is the proportion of constant zeros of gene g in the group of cells, I(n = 0) is an indicator function, f NB is the PMF of the NB distribution, r is the size parameter and p is the probability parameter of the NB distribution. By testing the parameters (θ, r, and p) of two ZINB models for the two different groups of cells, the method can classify the DE genes into three categories: 1-different expression status (DEs), 2-differential expression abundance (DEa), and 3-general differential expression (DEg). DEs represents genes that they show significant different proportion of cells with real zeros in different groups (i.e. θs are significantly different) but the expression of these genes in the remaining cells show no significance (i.e. r, and p show no significance). DEa represents genes that they show no significance in the proportion of real zeros, but show significant differential expression in remaining cells. DEg represents genes that they not only have significant difference in the proportion of real zeros, but also significantly expressed differentially in the remaining cells.

SigEMD
SigEMD [37] employs logistic regression to identify the genes that their zero counts significantly affect the distribution of expression values; and employs Lasso regression to impute the zero counts of the identified genes. Then, for these identified genes, SigEMD employs EMD, similar to EMDomics, for differential analysis of expression values' distributions including the zero values; while for the remaining genes, it employs EMD for differential analysis of expression values' distributions ignoring the zero values. The regression model and data imputation declines the impact of large amounts of zero counts, and EMD enhances the sensitivity of detecting DE genes from multimodal scRNAseq data.

Datasets
In this work, we used both simulated and real data to evaluate the performance of the differential expression analysis tools.

Simulated data
As we do not know exactly the true DE genes in real single-cell data, we used simulated data to compute the sensitivities and specificities of the eleven methods. Data heterogeneity (multimodality) and sparsity (large number of zero counts), which are the main characteristics of scRNAseq data, are modeled in simulated data. First, we generated 10 datasets, including simulated read counts in the form of log-transformed counts, across a two-condition problem by employing a simulation function from the scDD package [30] in R programing language [45]. For each condition, there were 75 single cells with 20,000 genes in each cell. Among the total 20,000 genes, 2000 genes were simulated with differential distributions, and 18,000 genes were simulated as non-DE genes. The 2000 DE genes were equally divided into four groups, corresponding to the DU, DP, DM, and DB scenarios (Additional file 1: Figure S1). Examples of these four situations from the real data are shown in Fig. 1a. From the 18,000 non-DE genes, 9000 genes were generated, using a unimodal NB distribution (EE scenario), and the other 9000 genes were simulated using a bimodal distribution (EP scenario). All of the non-DE genes had the same mode across the two conditions. Then, we simulated drop-out events by introducing large numbers of zero counts. To introduce zero counts, first, we built the cumulative distribution function (CDF) of the percentage of zeros of each gene, using the real data, F X (x). Then, in the simulated data for each gene, we randomly selected c (c~F X (x)) cells from the total cells for half of the genes in each scenario and forced their expression values to zero (10,000 genes in total). Thus, the CDF of the percentage of zeros of each gene is similar between the simulated and real data (Additional file 1: Figure S2). This way, the distribution of the total counts in the simulated data is more similar to real data, which enables us to assess the true positives (TPs) and false positives (FPs) more accurately.

Real data
We used the real scRNAseq dataset provided by Islam et al. [46] as the positive control dataset to compute TP rates. The datasets consist of 22,928 genes from 48 mouse embryonic stem cells and 44 mouse embryonic fibroblasts. The count matrix is available in the Gene Expression Omnibus (GEO) database with Accession No. GSE29087.
To assess TPs, we used the already-published top 1000 DE genes that are validated through qRT-PCR experiments [47] as a gold standard gene set [21,40,42].
We also used the dataset from Grün et al. [48] as the negative control dataset to assess FPs. We retrieved 80 pool-and-split samples that were obtained under the same condition from the GEO database with Accession No. GSE54695. By employing random sampling from the 80 samples, we generated 10 datasets to obtain the statistical characteristics of the results. For each generated dataset, we randomly selected 40 out of the 80 cells as one group and considered the remaining 40 cells as the other group [42]. Because all of the samples are under the same condition, there should be no DE genes in these 10 datasets.
In the preprocessing of the real datasets, we filtered out genes that are not expressed in all cells (zero read counts across all cells), and we used log-transformed transcript per millions (TPM) values as the input.

Results from simulated data
We used simulated data to compute true sensitivities and precision of the tools for detecting DE genes. The receiver operating characteristic (ROC) curves, using the simulated data, are shown in Fig. 2. As can be seen in the figure, the tools show comparable areas-under-the-curve (AUC) values.
The average true positive rates (TPRs, sensitivities), false positive rates (FPRs), precision, accuracy, and F1 score of the tools under the adjusted p-value of 0.05 are given in Table 2. We defined TPs as the truly called DE genes, and FPs as the genes that were called significant but were not true DE genes. Similarly, true negatives (TNs) were defined as genes that were not true DE and were not called significant, and false negatives (FNs) Fig. 2 ROC curves for the eleven differential gene expression analysis tools using simulated data were defined as genes that were true DE but were not called significant. We computed TPRs as the number of TPs over the 2000 ground-truth DE genes, FPRs as the number of FPs genes over the 18,000 genes that are not differentially expressed, precision as the number of TPs over all of the detected DE genes, and accuracy as the sum of TPs and TNs over all of the 20,000 genes.
As seen in Table 2, Monocle2 identified the greatest number of true DE genes but also introduced the greatest number of false DE genes, which results in a low identification accuracy, at 0.824. The nonparametric methods, EMDomics and D3E, identified more true DE genes compared to parametric methods (2465.8 and 1683.4 true DE genes, respectively). They also, however, introduced many FPs, resulting in lower accuracies (0.91 and 0.929, respectively) than did parametric methods. In contrast, tools with higher precisions, larger than 0.9 (MAST, SCDE, edgeR, and SINCERA), introduce lower numbers of FPs but identify lower numbers of TPs. Interestingly, F1 scores show that DESeq2 and edgeR, which are designed for traditional bulk RNAseq data, do not show poor performance compared to the tools that are designed for scRNAseq data. DEsingle and SigEMD performed the best in terms of accuracy and F1 score since they identified high TPs and did not introduce many FPs.
A bar plot of true detection rates of the eleven tools under the four scenarios for DE genes (i.e., DU, DM, DP, and DB) and the two scenarios for non-DE genes (i.e., EP and EE), are shown in Fig. 3. As shown in the figure, all of the methods could achieve a TPR near to or larger than 0.5 for the DU and DM scenarios, where there is no multimodality (DU scenario) or the level of multimodality is low (DM scenario). For scenarios with a high level of multimodality (DP and DB), however, some of the tools, except EMDomics, Monocle2, DESeq2, D3E, DEsingle, and SigEMD, perform poorly. In the DP scenario, only EMDomics and Monocle2 exhibited TPRs larger than 0.5, and SCDE fails for this multimodal scenario. Similarly, for the DB scenario, Monocle2, DESeq2, and DEsingle have a TPR larger than 0.5; however, MAST and SINCERA completely fail. SigEMD exhibited a TPR around 0.5 for both DP and DB scenarios. DEsingle performed the best for the DB scenario but exhibited a low TPR for the DP scenario. We showed the TPRs and true negative rates, using the simulated data with and without large numbers of zeros separately in Additional file 1: Figures S3 and S4. All of the tools have a better performance for the four scenarios when there are not large numbers of zero counts. We also showed the ROC curve for the data with and without large numbers of zeros in Additional file 1: Figures S5 and S6.
It is important to notice that, even though simulated data contain multimodality and zero counts, they cannot capture the real multimodality and zero count behaviors of real data. Therefore, as seen in the following, we evaluated the detection accuracy of detecting DE genes, using real data.

Results from positive control real data
We used the positive control real dataset to evaluate the accuracy of the identification of DE genes. We employed the validated 1000 genes as a gold standard gene set. We defined true detected DE genes as DE genes that are called by the tools and are among the 1000 gold standard DE genes. The number of detected DE genes and the number of true detected DE genes over the 1000 gold standard genes (defined as sensitivity) for each tool, using an FDR or adjusted p-value of 0.05, are given in Table 3.
The tools can be ranked in three levels based on their sensitivities: Monocle2, EMDomics, SINCERA, D3E, and DEsingle rank in the first level, with sensitivities more than 0.7; edgeR, DESeq2, and SigEMD rank in the second level, with sensitivities between 0.4 and 0.7; and SCDE, scDD, and MAST rank in the third level with sensitivities below 0.4. The methods that show better sensitivities, however, also called more than 7000 genes as significantly DE genes. In Fig. 4, the blue bars show the intersection between the gold standard genes and the DE genes called by the methods (true detected DE genes), whereas the yellow bars show the number of significantly DE genes that are not among the gold standard genes. We need to note that we do not have all of the true positive DE genes for the positive control dataset. The 1000 gold standard genes are a subset of DE genes from the dataset that are validated through qRT-PCR experiments [47]. In addition, the datasets that we used in this study have been generated under similar conditions as those of the positive control datasets; however, they are not from the same assay and experiment. Therefore, the results we present here provide information about sensitivities to some degree.

Results in negative control real data
Because all of the real true DE genes in the positive control real dataset are unknown, we can test only the TPs, using the 1000 gold standard genes but not the FPs. To validate the FPs, we applied the methods to 10 datasets with two groups, randomly sampled from the negative control real dataset. Because cells in the two groups are from the same condition, we expect the methods to not identify any DE gene. Using an FDR or adjusted p-value of 0.05, MAST, SCDE, edgeR, and SINCERA did not call any gene as a DE gene, as we expected, whereas DEsingle, scDD, DESeq2, SigEMD, D3E, EMDomics, and Monocle2 identified 4, 5, 19, 50, 160, 733, and 917 significantly DE genes, respectively, out of 7277 genes in   Table 4. EMDomics and Monocle2, which show the best sensitivities, using the positive control datasets, introduce the most FPs.

Agreement among the methods in identifying DE genes
In general, agreement among all of the tools is very low. Considering the top 1000 DE genes detected by the eleven tools in the positive control real data, there are only 92 common DE genes across all of the tools. Of these 92 DE genes, only 41 intersect with the gold standard 1000 DE genes. We investigated how much the tools agreed with each other on identifying DE genes by examining the number of identified DE genes that were common across a pair of tools, which we called common DE genes. First, we ranked genes by their adjusted p-values or FDRs, and then we selected the top 1000 DE genes. We defined pairwise agreement as the number of common DE genes identified by a pair of tools. The numbers of common DE genes between pairs of tools are between 770 and 1753 for simulated data (Additional file 1: Figure S7), and 142 and 856 for real data (Fig. 5). We observed that the methods do not have high pairwise agreement in either the simulated data or the real data.
In addition, we used significantly DE genes under a p-value or FDR threshold of 0.05 to investigate the pairwise agreement among the tools. The pairwise agreement varies from 432 to 7934 for the real data (Fig. 6) and from 444.8 to 1878 for the simulated data (Additional file 1: Figure S8). In the real data, MAST identified fewer significantly DE genes under the 0.05 cut-off adjusted p-value, but the majority of its significantly DE genes overlapped with the significantly DE genes from other tools.

Effect of sample size
We investigated the effect of sample size on detecting DE genes in terms of TPR, FPR, precision, and accuracy, using the simulated data. Precision was defined as TP/ (TP + FP) and accuracy as (TP + TN)/(TP + TN + FP + FN). We generated eight cases: 10 cells, 30 cells, 50 cells, 75 cells, 100 cells, 200 cells, 300 cells, and 400 cells for each condition. We noticed that the number of identified DE genes and the TPRs of detection under a default FDR or adjusted p-value (< 0.05) tend to increase when the sample size increases from 10 to 400 (Fig. 7) for all tools.
The results show that sample size is very important, as the tools' precision increases significantly by increasing the sample size from 10 to 75. The FPRs tend to be steady when the sample size is > 75, except for DEsingle. DEsingle works well for a large number of zero counts in a larger dataset. These results also show that Monocle2, EMDomics, DESeq2, DEsingle, and SigEMD can achieve TPRs near 100% by increasing the sample size, while the other methods cannot. Monocle2, EMDomics, DESeq2, and D3E, however, introduce FPs (FPR > 0.05%), whereas FPRs for other methods are very low (close to zero). All of the tools similarly perform poorly for a sample size of < 30. When the sample size exceeded 75 in each condition, the tools achieved better accuracy in detection.

Enrichment analysis of real data
To examine whether the identified DE genes are meaningful to biological processes, we conducted gene set enrichment analysis through the "Investigate Gene Sets" function of the web-based GSEA software tool (http:// www.broadinstitute.org/gsea/msigdb/annotate.js). We investigated the KEGG GENES database (KEGG; contains 186 gene sets) from the Molecular Signatures Database (MSigDB) for the gene set enrichment analysis (FDR threshold of 0.05). We used the same number of identified DE genes (top n = 300 genes) of each tool as the input for KEGG pathway enrichment analysis. The results are shown in Table 5. We observed that the 300 top-ranked DE genes identified by nonparametric methods (EMDomics and D3E) were enriched for more KEGG pathways compared to other methods. We also used a box plot to compare the FDRs of the top 10 most significant gene sets enriched by the top-ranked DE genes from the tools (Additional file 1: Figure S9). It can be observed that pathways enriched by the top-ranked DE genes from edgeR and Monocle2 have the highest strength. The 10 top-ranked KEGG pathways for the eleven tools are listed in Additional file 1: Tables S1 to S11.
We also used DAVID (https://david.ncifcrf.gov/summary.jsp) for the Gene Ontology Process enrichment analysis of the 300 top-ranked DE genes identified by   Table 5. Top DE genes identified by EMDomics, D3E, Monocle, and DESeq2 are enriched in more KEGG pathways and/or GO terms compared to those of other tools.
Finally, although the quantitative values of terms recovered from gene set enrichment analysis is informative with regard to the relative statistical power of calling biologically meaningful genes of these tools, very different gene lists can result in very similar quantitative performance values. To perform a qualitative assessment of the biological relevance of the differentially expressed gene lists recovered by each tool, we ranked the performance of each tool in recovering stem cell-relevant GO terms from the 300 top-ranked DE genes. Each gene list was subjected to gene set enrichment against the Biological Process portion of the Gene Ontology Process, and all significant enriched terms were recovered. The results of the Gene Ontology Process enrichment analysis of the 300 top-ranked DE genes and the list of the 300 top-ranked genes for each tool are given in Additional file 2. Significant GO terms with their negative log transform of their q-values for each tool are given in Additional file 3. To consolidate closely related processes recovered in this step, we subjected each list of GO terms to word and phrase significance analysis, using world cloud analysis, whereby negative log transform q-values are considered as frequencies in this analysis. The phrase significance of  each tool, in the form of word clouds, is shown in Additional file 1: Figures S10-S20, and the word significance, in the form of word clouds, is shown in Additional file 1: Figures S21-S31. In these plots, the font size represents the significance of the word/phrase. This provides a readily interpretable visualization of the biologically relevant GO terms. Several stem cell biologists were then asked to rank the performance of each algorithm in terms of its ability to recover the GO terms most relevant to the experiment that provides the real dataset used in this study. Each algorithm was scored on a 1-3 scale, with 3 as the best recovery of biologically relevant terms and phrases; then, the scores for terms and phrases were added to give an overall performance score from 2 to 6 ( Table 6). As expected, many of these tools recovered, at high significance, several terms strongly related to stem cell biology, including development, differentiation, morphogenesis, multicellular, and adhesion as well as many others. Interestingly, scDD and SCDE failed to recover stem cell-relevant terms at high significance. Instead, these approaches appeared to yield terms and phrases related to cellular housekeeping processes. Monocle2 and MAST performed the best at recovering stem cell-relevant terms. Following them, EMDomics, DESeq2, D3E, DEsingle, SigEMD, edgeR, SINCERA all performed well. This result strongly suggests not only that the methods used for identifying DE genes may yield non-overlapping and quantitatively different gene sets but that some methods are much better at extracting biologically relevant gene sets from the data.

Runtimes
We compared the runtimes of the eleven tools (Table 7). Except for D3E, which was implemented in Python, all of the tools were implemented in R ( Table 1). The runtime was computed using a personal computer, iMac with 3.1GHz CPU and up to 8 gigabytes of memory. The average runtime (of 10 times) of each tool, using the positive control dataset, is shown in Table 7. SINCERA has the lowest time cost because it employs a simple t-test. edgeR has the lowest time cost among the model-based and nonparametric methods. MAST, Monocle2, and DESeq2 run fast (less than 5 min), as MAST and Monocle2 use linear regression methods, and DESeq2 uses a binomial model for identifying DE genes. scDD takes longer, as it needs time to classify DE genes into different modalities. The nonparametric method, SigEMD, EMDomics and D3E, take more time compared to the model-based methods because they need to compute the distance between two distributions for each gene. We note that D3E had two running modes: It takes about 40 min when running under the simple mode and about 30 h when running under the more accurate mode.

Discussion
As shown in Fig. 1, scRNAseq expression data are multimodal, with a high number of zero counts that make differential expression analysis challenging. In this study, we conducted a comprehensive evaluation of the performance of eleven software tools for single cell differential gene expression analysis: SCDE, MAST, scDD, EMDomics, D3E, Monocle2, SINCERA, edgeR, DESeq2, DEsingle, and SigEMD. Using simulated data and real scRNAseq data, we compared the accuracy of the tools in identifying DE genes, agreement among the tools in detecting DE genes, and time consumption of the tools. We also examined the enrichment of the identified DE genes by running pathway analysis and GO analysis for the real data.

Detection accuracy
In general, the eleven methods behave differently in terms of calling true significantly DE genes. The tools that show higher sensitivity also show lower precision. Among all of the tools, DEsingle and SigEMD, which are designed for the scRNAseq, tend to show a better trade-off between TPRs and precision.  All of the tools perform well when there is no multimodality or low levels of multimodality. They all also perform better when the sparsity (zero counts) is less. For data with a high level of multimodality, methods that consider the behavior of each individual gene, such as DESeq2, EMDomics, Monocle2, DEsingle, and SigEMD, show better TPRs. This is because EMDomics and SigEMD use a nonparametric method to compute the distance between two distributions and can capture the multimodality; DEsingle models dropout events well by using a zero inflated negative model to estimate the proportion of real and drop-out zeros in the expression value; Mon-ocle2 uses a census algorithm to estimate the relative transcript counts for each gene instead of using normalized read counts, such as TPM values; and DESeq2 uses a gene-specific shrinkage estimation for the dispersions parameter to fit a negative binomial model to the read counts. If the level of multimodality is low, however, SCDE, MAST, and edgeR can provide higher precision.

Agreement among the methods
The overall agreement in terms of finding DE genes among all of the tools is low. We used the top 1000 DE genes identified by the eleven tools (ranked by p-values) and significantly DE genes with a significant threshold of 0.05 to identify the common DE genes across the tools and between pairs of tools. The DE genes identified by DESeq2, EMDomics, D3E, Monocle2, SINCERA, DEsingle, and SigEMD show higher pairwise agreement, whereas the model-based methods, SCDE and scDD, show less pairwise agreement within other tools. No single tool is clearly superior for identifying DE genes, using single cell sequencing datasets. The tools use different methods with different strengths and limitations for calling DE genes. The sequencing data also are very noisy. The methods treat zero counts, multimodality, and noise differently, resulting in low agreement among them. Some tools work well when the drop-out event is not significant and some, when data multimodality is not significant. For instance, scDD aims at characterizing different patterns of differential distributions; however, handling a large number of zero counts in the expression values is a challenging task for this tool.

Sample size effect
All of the tools perform better when there are more samples in each condition. TPRs improve significantly by increasing sample size from 10 to 75, but they slow down for sample sizes greater than 100; and for sample sizes of 300 and larger, there are almost no changes in TPRs and FPRs. Monocle2, EMDomics, DESeq2, DEsingle, and SigEMD can achieve a TPR close to 100% by increasing the sample size. DEsingle works well for a larger number of zero counts or small number of samples. When the number of zero counts is low and the number of samples is large, its model cannot capture the dropout event well.

Enrichment analysis
As expected, top-ranked DE genes of many of these tools are enriched for GO terms strongly related to stem cell biology. scDD and SCDE, however, failed to recover stem cell-relevant terms at high significance. Instead, they appeared to yield GO terms related to cellular housekeeping processes. This result suggests that model-based single cell DE analysis methods that do not consider multimodality do not perform well in extracting biologically relevant gene sets from the data.

Conclusion
In conclusion, the identification of DE genes, using scRNAseq data, remains challenging. Tools developed for scRNAseq data focus on handling zero counts or multimodality but not both. In general, the methods that can capture multimodality (non-parametric methods), perform better than do the model-based methods designed for handling zero counts. However, a model-based method that can model the drop-out events well, can perform better in terms of true positive and false positive. We observed that methods developed specifically for scRNAseq data do not show significantly better performance compared to the methods designed for bulk RNAseq data; and methods that consider behavior of each individual gene (not all genes) in calling DE genes outperform the other tools. The lack of agreement in finding DE genes by these tools and their limitations in detecting true DE genes and biologically relevant gene sets indicate the need for developing more precise methods for differential expression analysis of scRNAseq data. Multimodality, heterogeneity, and sparsity (many zero counts) are the main characteristics of scRNAseq data that all need to be addressed when developing new methods.