baredSC: Bayesian approach to retrieve expression distribution of single-cell data

Background The number of studies using single-cell RNA sequencing (scRNA-seq) is constantly growing. This powerful technique provides a sampling of the whole transcriptome of a cell. However, sparsity of the data can be a major hurdle when studying the distribution of the expression of a specific gene or the correlation between the expressions of two genes. Results We show that the main technical noise associated with these scRNA-seq experiments is due to the sampling, i.e., Poisson noise. We present a new tool named baredSC, for Bayesian Approach to Retrieve Expression Distribution of Single-Cell data, which infers the intrinsic expression distribution in scRNA-seq data using a Gaussian mixture model. baredSC can be used to obtain the distribution in one dimension for individual genes and in two dimensions for pairs of genes, in particular to estimate the correlation in the two genes’ expressions. We apply baredSC to simulated scRNA-seq data and show that the algorithm is able to uncover the expression distribution used to simulate the data, even in multi-modal cases with very sparse data. We also apply baredSC to two real biological data sets. First, we use it to measure the anti-correlation between Hoxd13 and Hoxa11, two genes with known genetic interaction in embryonic limb. Then, we study the expression of Pitx1 in embryonic hindlimb, for which a trimodal distribution has been identified through flow cytometry. While other methods to analyze scRNA-seq are too sensitive to sampling noise, baredSC reveals this trimodal distribution. Conclusion baredSC is a powerful tool which aims at retrieving the expression distribution of few genes of interest from scRNA-seq data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04507-8.


ERCC show an increased variance
When scRNA-seq experiments are performed, some manufacturers recommend to spike-in ERCC (External RNA Control Consortium). These are synthetic RNA molecules of various length and various GC contents which can be added to the RNA preparation and sequenced along with the mRNA. They can thus be used as controls to normalize the data. The control experiments that we analyze here all include ERCC. In Fig. S1, we plot the normalized variance estimator as a function of average expression for real and ERCC genes. We notice that the ERCC genes have an increased variance, and are sometimes expressed at a much higher level than regular genes. In order to avoid contaminating our results with this peculiar behavior of ERCC genes, we exclude them from our analysis.

Sampling noise introduces strong additional variability
In order to highlight how sampling noise introduces variability, we simulated expression values for 2361 cells using a distribution composed of two Gaussians in Seurat scale (log(1 + 10 4 λ g )). The first Gaussian had a mean of 0.75 and a scale of 0.25 and the second one had a mean of 2 and a scale of 0.2. We display on Fig. S2 this distribution as a blue line as well as the histogram of the simulated values in light blue. We observe that they are very close. We then simulate the sampling noise using this simulated expression and N i values found in an existing scRNA-seq dataset (see Methods: Simulation of data). This simulation gives a number of count for this gene (k i,g ) for each cell. We then plot the histogram of the normalized expression (log(1 + 10 4 (k i,g /N i,g )) in dark red. We see the appearance of a proportion of cell with no detected expression as well as a general increased variability compared to the original distribution. To summarize this data, we also use a kernel density estimate in red (see Methods: Data representation for more details). This representation is used in Fig. 2, 3 and 7C, Fig. S3 and S7C. The Fig. S2 also shows that the overdispersion observed in the 'Density from data' curve compared to the 'Original distribution' is not due to oversmoothing of the kernel density estimate.

Fig. S1
: ERCC exhibit an increased variance compare to mRNA coding for genes. In each data set, control data set (A) or real single-cell experiment (B), the normalized estimator of variance is plotted as a function of the estimated mean expression. Each dot is a gene. The dots are colored by gene type, spike-in ERCC (External RNA Control Consortium) are depicted in blue while mRNA are depicted in red. In blue is displayed the distribution of the original data. In light blue is the histogram of the random expression values simulated from the distribution. In dark red is displayed the histogram of the simulated sampling (count for each cell, k i , g) normalized by the number of total count (N i ). In red is displayed the density from data like in Fig. 2 and 3.

Comparison between baredSC and Sanity
As presented in the main text, baredSC and Sanity share common features but they highly differ on objectives and versatility. We report in Table S1 the differences between the two approaches making a distinction between baredSC used with a single Gaussian or with any number of Gaussians. We compared more precisely the results obtained by the three methods (Sanity, baredSC with a single Gaussian, baredSC with up to four Gaussians) on the five simulations used in Fig. 3. Indeed, using Sanity outputs we can represent the data in three different ways. First, Sanity assume that the distribution of expression can be described by its mean and variance. Among Sanity outputs, one can get the expected value of the mean and an estimation of its error as well as the expected value of the variance. Using these values, we can deduce the inferred PDF approximated by a single Gaussian. The main result of Sanity is the matrix of the log transcription quotients (LTQs) which are the expected value for each cell and the error on these values. Using baredSC, we can also evaluate these expected values and error (see Methods: Posterior per cell).
While the implementation and the priors differ between Sanity and baredSC with a single Gaussian, they give very similar results (Fig. S3). Interestingly, the second and third rows highlight how the representations using either the expected value for each cell or these expected values and their error bars affect the quality of the distribution when it is not a single Gaussian. Each column corresponds to a simulation. For each of them, the simulated distribution is plotted in blue in the first row (Inferred expression distribution) and its characteristics are written above (N for normal followed by mean and scale values in log as well as the proportion of cells with no expression). In the first row, the values obtained after Poisson simulation are summarized by the red curve (Density from data). The mean PDF obtained using baredSC with up to 4 Gaussians is depicted in dark green and the dark green area shows the quantiles at 16-84%. Similarly the results obtained using baredSC with a single Gaussian are plotted in light green. The results obtained using Sanity (using the estimation of the average and the variance of the distribution) are in purple. For the 3 models, the expected value was computed for each cell and the density of these values are plotted in the second row. Finally, the posterior using the expected value and the error for each cell is displayed on the last row.   (2361), we also run baredSC when only a smaller number of cells was provided (Fig. S5). With 1561 cells, the results are very close to the results obtained for the 2361 cells. The only exception is the case of each gene expressed in half independent (Fig. S5C) where it detects a significant but very modest correlation (0.09). For the smaller numbers of cells (300 and 500), it globally performs very well especially in simulations with multi-modal distributions. In Fig. S5A where the expression is very low, the algorithm fails to detect most of the (anti-)correlations but the average PDF is close to the simulated one. In this range of expression, a high number of cell is necessary to infer a good correlation of the distribution.

Comparison between baredSC and Sanity in two dimensions
Sanity evaluates the expected value of the log transcription quotients (LTQs) for each cell and each gene. Therefore, we can compare the results obtained by Sanity in 2 dimensions with baredSC results, especially for the evaluation of the correlation coefficient. We simulated values similarly to what has been done for Fig. 5 except that the normal distributions were simulated in the log scale (Fig. S6).
We can see that in the first two simulations where the original distribution is a single Gaussian with a weak (anti-)correlation, the correlation coefficient estimated by Sanity using the expected values (fourth row) is less than half the real correlation coefficient. This is in contrast with baredSC results (third row) which give better estimation. When using the posterior distribution from Sanity (last row), the (anti-)correlation is nearly undetected. In the third simulation where the original distribution is a single Gaussian without any correlation (independent), the posterior from Sanity gives very good results.
When two populations were simulated (last four simulations), we can distinguish the first two where the two Gaussians are well separated from the last two where they are very close. In the first case, Sanity outputs allow the identification of the two populations but the difference in the scales of the Gaussians cannot be distinguished. In these cases the correlation coefficients are reduced by 30% compared to the coefficients in the original distribution. In the last two cases, the Sanity results do not allow to separate the two populations and the correlation coefficients are about half the value in the original distribution.
In conclusion, Sanity can help to detect correlation or anti-correlation between genes but the correlation coefficient may be underestimated. In contrast, baredSC allows to retrieve very close confidence interval for the correlation coefficient and the inferred PDF is close to the original one.

Analysis of distribution of Pitx1 expression in log axis
In baredSC, we implemented two scales. The scale used in the Seurat and Scanpy package which is log(1 + N t X i,g ) (with N t = 10 4 in the Seurat package) and the regular log scale (log(X i,g )). In all main figures of this article except Fig. 3, we used the Seurat scale. This choice is motivated by the decreased degeneracy of models in Seurat scale for cells with no count compared to the log scale. However, the log scale may be more convenient in some cases.
In particular, we use it in Fig. 3, Fig. S3, S6 and Fig. S7 to compare to other techniques which use the log scale. In the case of Pitx1 expression, we have fluorescence levels from flow cytometry which are conventionally represented in log scale (Fig. S7A)  ( Fig. S7B). For FL Pitx1 +/+ (in black), where Pitx1 is not expressed, we observe that the distribution is concentrated on the low values. However, we observe a larger confidence interval compared to the distribution obtained in the Seurat scale (Fig. 5D). This is due to the degeneracy of the model at low value of expression: having no count is similarly probable for expression values of -13 in log scale (2e-6 which is much smaller than 1/N i ) or below. Thus, one should consider the low expression range as cells that do not express Pitx1. As explained in the article, in this range of expression, it is difficult to compare both techniques as the flow cytometry has a high level of background. At higher level of expression (right part), the PDF of HL Pitx1 Pen-/Penand HL Pitx1 +/+ have a highly similar shape between the flow cytometry ( Fig. S7A) and the baredSC results (Fig. S7B). These results confirm what we observed in Fig. 5 that the fluorescence level is highly correlated to the level of expression of Pitx1. We then compared these results to the density plot of normalized expression (Fig. S7C). For this representation of the normalized counts, the expression of cells with no expression was artificially put to the minimal value. We see that the PDF of cells where Pitx1 is detected is close to a Gaussian for both HL samples. We also run Sanity to visualize the posterior distribution given by this algorithm (Fig. S7D). We see that the tissue which does not express Pitx1 (FL Pitx1 +/+ in black) is described as a large Gaussian, suggesting that a fraction of cells expresses Pitx1 at low level. The PDF of samples which express Pitx1 (HL in blue and red) is unimodal. This show that the posterior distribution from Sanity is not able to detect the trimodal expression of Pitx1 in HL samples.