 Research article
 Open access
 Published:
An informationtheoretic approach to single cell sequencing analysis
BMC Bioinformatics volume 24, Article number: 311 (2023)
Abstract
Background
Singlecell sequencing (scSeq) experiments are producing increasingly large data sets. However, large data sets do not necessarily contain large amounts of information.
Results
Here, we formally quantify the information obtained from a scSeq experiment and show that it corresponds to an intuitive notion of gene expression heterogeneity. We demonstrate a natural relation between our notion of heterogeneity and that of cell type, decomposing heterogeneity into that component attributable to differential expression between cell types (intercluster heterogeneity) and that remaining (intracluster heterogeneity). We test our definition of heterogeneity as the objective function of a clustering algorithm, and show that it is a useful descriptor for gene expression patterns associated with different cell types.
Conclusions
Thus, our definition of gene heterogeneity leads to a biologically meaningful notion of cell type, as groups of cells that are statistically equivalent with respect to their patterns of gene expression. Our measure of heterogeneity, and its decomposition into inter and intracluster, is nonparametric, intrinsic, unbiased, and requires no additional assumptions about expression patterns. Based on this theory, we develop an efficient method for the automatic unsupervised clustering of cells from scSeq data, and provide an R package implementation.
Background
Advances in singlecell sequencing (scSeq) technologies have enabled us to profile thousands of cells in a single experiment [43]. In combination with advances in unsupervised analysis methods, particularly specialised clustering algorithms and dimensionality reduction techniques, these technologies have allowed us to dissect cellular identities in unprecedented detail and discover novel, functionally important, cell types [48]. The goal of most scSeq studies (except those focused on methodology development) is to extract biological information, often concerning the mix of cell types present in the tissue sample, from the data obtained. Yet, data is not the same as information; and large, complex, data sets do not necessarily convey useful or usable information. Notably, current singlecell profiling technologies typically produce noisy data for numerous technical reasons, including low capture rate, sparsity due to shallow sequencing, and batch effects [17, 19]. Consequently, the relationship between biological information and scSeq data is complex and incompletely understood. There is, therefore, a need for formal, quantitative methods to evaluate this relationship.
To address this challenge, we propose an informationtheoretic framework that quantifies the amount of information contained in a scSeq data set, and leads to a natural definition of gene expression heterogeneity. Our measure of gene expression heterogeneity decomposes into that which is explained by a given grouping of cells—a proposed clustering into cell types, for example—and that which remains unexplained.
Our framework differs from other approaches to heterogeneity decomposition (e.g. [14, 27]) by formally quantifying the information, in terms of gene expression heterogeneity, gained from a scSeq experiment concerning the mix of cell types present. The resulting measure of heterogeneity, and its decomposition into inter and intracluster heterogeneity, is nonparametric, intrinsic, unbiased, and requires no a priori assumptions about gene expression patterns.
Our approach is mathematically precise, biologically intuitive and computationally simple to implement, enabling a practitioner to quickly assess the information content of a scSeq data set, in terms of gene expression heterogeneity, identify highly informative genes, and determine the extent to which observed patterns of gene expression variability are explained by the presence of a mixture of cell types in a cellular population. Furthermore, we provide an efficient unsupervised clustering algorithm of cells from scSeq data based on heterogeneity and an implementation as an R package.
Results
Highthroughput singlecell analysis methods, such as singlecell sequencing, typically view cells as the objects of study and seek to compare cell identities with each other [4, 12, 14, 20, 24, 25, 35, 47]. However, this cellcentric view is less well suited to quantifying gene expression heterogeneity, which is concerned with patterns of variation that arise from the mixing of cell types within a population, and may vary from gene to gene. For instance, variance, the standard measure of heterogeneity in the cellcentric view, may not be informative in multimodal distributions [39].
In the context of scSeq, heterogeneity results from differential expression between cell types. In the simplest case, where one cell type expresses a gene highly and another lowly, such differential expression introduces bimodality in the cellcentric expression distribution. In a bimodal distribution, the population mean is no longer characteristic of either of the two subpopulations, making variance, as a measure of the spread about the mean, also misleading (see Smith and MacArthur [39]).
To meet these challenges, we introduce a novel genecentric probabilistic view that seeks to more formally specify what is meant by gene expression homogeneity and heterogeneity. We show that this genecentric view is better suited to quantifying gene expression heterogeneity in the context of a multimodal population, and formalises the biological intuition of expression homogeneity and heterogeneity.
Quantifying gene expression heterogeneity
Consider the expression pattern of a single gene g of interest in a population of N distinct cells. Assume that in total M transcripts of g are identified in the cell population (i.e. across all N cells profiled). Note that M represents the observed transcript count, which may differ from the true count due to technical artefacts. Now consider the stochastic process of assigning the M identified transcripts of gene g to the N cells profiled. Intuitively, the population is homogeneous with respect to expression of g if all the cells are statistically the same with respect to its expression. Mathematically, this means that the M transcripts of g will be assigned to the N cells independently and equiprobably—i.e. each transcript will be assigned to each cell with probability 1/N. Conversely, if the population is heterogeneous with respect to expression of g (that is, it consists of a mix of cell types, each expressing the gene differently), then transcripts will not be assigned uniformly, but rather will be assigned preferentially to distinct subsets of cells. Heterogeneity in experimentally observed patterns of expression can, therefore, be assessed in terms of deviation from this hypothetical homogeneous null model.
The Kullback–Leibler divergence (KLD, also known as the relative entropy) is a measure of the information encoded in the difference between an observed and null distribution [22]. In the scSeq context, the KLD of an experimentally observed gene expression distribution from the homogeneous null model, described above, measures the amount of information that is lost by assuming that the gene is homogeneously expressed in the sequenced cell population. We will denote this quantity I(g) and refer to it as the heterogeneity of g (see Fig. 1 for a schematic). Formally,
where \(x_i^g\) is the fraction of transcripts of gene g expressed by cell i, for each \(1 \le i \le N\), and H(g) is the entropy of the expression of g in the population (see “Methods” Section for full details). For technical reasons, explained in the “Methods” Section, we will use natural logarithms in all calculations; I(g) therefore has units of nats.
Intuitively, if the cell population is unstructured with respect to the expression of g (i.e. if the cells are approximately interchangeable with respect to the expression of g) then the assumption of homogeneity is correct and \(I(g) \approx 0\). Conversely, the theoretical maximum for I(g) is \(\log (N)\), which is achieved when \(H(g)= 0\) and all transcripts of the gene are assigned to the same cell (see Fig. 1). Note that: (1) we do not need to know, or model, the particular expression distribution of g in the population, so no a priori assumptions about expression patterns are required to calculate I(g); (2) I(g) is agnostic concerning missing readings or counts so long as they are distributed uniformly at random; (3) since it quantifies the deviation from the homogeneous null model, I(g) measures the information obtained from the experiment concerning the expression of g.
In practice, I(g) is associated with cellular diversity: the more distinct cell subpopulations present in a sample, and the more those subpopulations differ from one another with respect to their expression of g, the larger I(g) will be. Thus, I(g) is a parsimonious measure of expression heterogeneity that makes minimal assumptions concerning expression patterns and, therefore, imposes minimal technical requirements on data collection methodology or quality. As such, it can be used as the basis for numerous aspects of the scSeq pipeline, including feature selection and unsupervised clustering.
To validate these uses, we considered a series of singlecell RNAsequencing benchmarking data sets: Svensson, a technical control [42]; Tian, a mixture of three cancerous cell lines [46]; Zheng, a set of FACS separated peripheral blood mononuclear cells; Stumpf, a sampling of cells from mouse bone marrow [41]; and the Tabula Muris, a mouse cell atlas with cells from twelve organs [44]. Fig. 2a–e shows plots of I(g) for each gene profiled in these experiments. These plots confirm the intuition that the more distinct types present in a sample, and the more those cell types differ from one another with respect to their gene expression patterns, the greater the population heterogeneity will be, as measured by I(g).
These results indicate that heterogeneity (as defined here) can be used for feature selection. For example, heterogeneously expressed genes can be identified by 1) ordering them in descending ordering by I(g), and selecting the top n (where n is userdetermined) or 2) finding those genes for which I(g) greater than some userdefined threshold (note that \(0 \le I(g) \le \log N\), with \(I(g) \approx 0.7\) indicating the gene g is expressed in approximately half of cells).
As a measure for use in feature selection, I(g) identifies gene sets largely distinct from those based on Highly Variable Gene selection (the mean overlap of selected genes—for listed data sets excluding Svensson—was 0.36, with the number of genes selected determined by scran, based on an false discovery rate threshold of 0.05). For instance, I(g) identifies Hbbbs, a marker of erythrocyte maturation, as informative in the Stumpf et al. [41] data set (high I(g) value), whereas scran does not, instead attributing the observed variance to technical sources (see Additional file 1: Fig. S1)[27]. Critically, informationtheoretic heterogeneity is more mathematically precise and computationally simpler to implement than variancebased approaches, making no distributional assumptions (e.g. use of the negative binomial model) and having no free parameters to fit. Furthermore, it may be easily modified to account for the presence of multiple homogeneous cell subpopulations, and thereby act as a measure of cluster quality.
Intercluster heterogeneity
Let S be a discrete clustering of a population of cells—i.e. an assignment of the cells to a finite set of C nonintersecting subpopulations (also known as clusters). In the “Methods” Section, we show that I(g) can be decomposed into two parts: one part that quantifies the extent to which each of the subpopulations defined by S deviate from homogeneity, which we call the intracluster heterogeneity and write as \(h_S(g)\); and one part that quantifies how differently the gene is expressed, in aggregate, between subpopulations, which we call the intercluster heterogeneity and write as \(H_S(g)\). Namely, we show that
for any proposed clustering S. Thus, the information obtained from an experiment concerning the expression of a gene g can be explicitly related to both local (within cluster) and global (between cluster) patterns of variation, for any proposed clustering. Full mathematical details of this decomposition, including formulae for \(H_S(g)\) and \(h_S(g)\), are provided in the “Methods” Section, and an illustration is given in Fig. 3. We also provide an R package for its calculation (see “Methods” Section).
This decomposition can be used to assess cluster quality. For a proposed cluster to meaningfully represent a cell type, it must be associated with differential expression of some subset of marker genes. Because \(H_S(g)\) measures the extent to which the expression of gene g deviates from the homogeneous null model it is a simple measure of the extent to which g is differentially expressed between the clusters defined by S. We therefore expect that biologically meaningful clustering based on a marker gene g will result in a high value for \(H_S(g)\) and (by necessity) a low value for \(h_S(g)\).
This expectation is confirmed for the Tian, Zheng and Stumpf data sets for which we possess highconfidence a priori cell type annotations, derived either experimentally (for the Tian and Zheng data) or from expert annotation of a computational clustering (for the Stumpf data). In the set of genes most likely to be differentially expressed between clusters in these data sets (taken to be the top 500 genes by I(g) in each case) the majority are associated with substantially higher intercluster heterogeneity than expected by chance (i.e., in randomly permuted clusters). Indeed, in Tian 97.0%, in Zheng 94.4% and in Stumpf 99.6% of the tested genes had significantly higher intercluster heterogeneity than expected (onesided exact test with \(10^4\) permutations, \(p < 0.05\), false discovery rate corrected for 500 trials; see Fig. 2f–h for illustration) [2, 8].
These results indicate that by simple informationtheoretic reasoning, we are able to quantify the amount of information in the expression pattern of a single gene that is explained by a given clustering. However, a key strength of highthroughput singlecell profiling methods is that they allow the simultaneous profiling of thousands of genes in large cell populations.
Our informationtheoretic reasoning may be extended to a highthroughput singlecell profiling experiment by assuming that each gene is an independent source of information and making use of the fact that information from independent sources is additive [37]. Thus, we can determine the total information explained by a given clustering S by evaluating the sum \(H_S = \sum _g H_S(g)\) over all genes profiled. The total information explained by S is a simple, easily computed measure for cluster evaluation that favours grouping of cells into homogeneous (with respect to gene expression) subpopulations and is maximised at \(\sum _g I(g)\) if and only if the proposed clustering divides the population into indivisible perfectly homogeneous subpopulations and thereby accounts for all the heterogeneity contained in the scSeq data set. If so, then S is the maximum entropy partition of the cell population into distinct classes, and may therefore be considered as the most parsimonious way of assigning cell identities.
To illustrate the association between total information explained, \(H_S\), and cluster quality, we again considered the established annotations of the Tian, Zheng and Stumpf data sets. As expected, in all cases the observed value of \(H_S\) significantly exceeds that of all random label permutations (see Fig. 2i indicating that \(H_S\) is strongly associated with cluster quality and that these benchmark cell annotations are associated with homogeneous cell subpopulations. Notably, this association holds true independently of the method used to annotate cell identities (annotations were derived from genotypic information for the Tian data, surface protein expression for the Zheng data, and unsupervised clustering for the Stumpf data), indicating that \(H_S\) provides a methodology agnostic, parameterfree, means to quantify cluster quality.
Unsupervised clustering
Building on this empirical association, we investigate the clustering of cells that maximises \(H_S\) (and necessarily minimises \(h_S\)) as a reasonable approximation to (homogeneous) cell type. That is, this is the cell clustering that explains the most of the gene expression heterogeneity as intercluster variability.
We implement \(H_S\)maximisation as the objective function of an unsupervised clustering method, which we call scEC (singlecell Entropic Clustering). We show that our method is comparable to the stateoftheart Louvain method, and in doing so, validate the underlying definition of heterogeneity and homogeneous cell type.
Namely, we compare the clustering returned by scEC and the established annotations for the Tian, Zheng and Stumpf data sets. The resulting scEC clusterings agreed strongly with the established cell annotations, achieving adjusted Rand Indices (ARI) of 0.99 for the Tian annotations, 0.87 for the Zheng annotations, and 0.69 for the Stumpf annotations (ARIis a measure of similarity of a proposed clustering to a known clustering, Rand [33]). To benchmark these results we also repeated the analysis using the Louvain method, a leading singlecell clustering algorithm, which achieved ARIs of 0.99, 0.99 and 0.35 for the Tian, Zheng, and Stumpf data sets respectively [3, 9, 26, 40].
The scEC objective function, \(H_S\), is naturally built by summation of genelevel measurements (see “Methods” Section), so can be used to directly identify the key drivers of a given clustering without relying on posthoc differential gene expression testing. To demonstrate this feature, we compared the intercluster heterogeneity values (i.e., the values \(H_S(g)\) for each of the genes profiled) for clusters identified by scEC and by established annotations for the Tian & Stumpf data sets (see Fig. 4).
We found that the values of \(H_S(g)\) associated with scEC clusters are strongly positively correlated with those associated with established annotations for both the Zheng and Stumpf data sets (Pearson’s correlation coefficient of 0.94 for Zheng and 0.99 for Stumpf; see Fig. 4). Moreover, the key marker genes in each data set were found to have significantly different expression levels across clusters, as measured by \(H_S(g)\) (onesided permutation test on clustering labels, \(10^4\) permutations, \(p < 0.05\), false discovery rate corrected; marker genes for Zheng: CD14, CD4, CD8 and NCAM1, Zheng et al. [52]; and Stumpf: Cd34, Kitl, Spi1, Gata1 and Pax5, Stumpf et al. [41]).
This concordance indicates that scEC is generally able to recapitulate known cell clusters in an unbiased and parameter free way. However, despite this concordance, there were differences between scEC clusters and established annotations. To investigate these differences further we selected those genes for which the value of \(H_S(g)\) differed substantially (\(\Delta H_S(g) = H_{scEC}(g)  H_{Known}(g) > 0.1 \text { nats}\)) between scEC clusters and established annotations in each data set for further investigation.
In the Zheng data, we found that those genes with expression heterogeneity better explained by the scEC clustering were enriched for cellcycle related genes (gene ontology enrichment, \(p = 1.33 \times 10^{8}\), false discovery rate corrected). Conversely, those genes with expression heterogeneity better explained by the established annotation were enriched for genes involved in the immune response (\(p = 1.49 \times 10^{3}\), false discovery rate corrected) [1, 2, 10].
In the Stumpf data set, we found that those genes with expression heterogeneity better explained by the scEC clustering were enriched for genes involved in erythroblast differentiation and homeostasis (\(p = 1.80 \times 10^{2}\) and \(p\text {value} = 6.17 \times 10^{4}\) respectively, false discovery rate corrected). Conversely, those genes with expression heterogeneity better explained by the established annotation were enriched for genes involved in the immune response (\(p = 2.57 \times 10^{18}\), false discovery rate corrected). In this example, scEC generally identifies additional erythrocyte subpopulations while merging the cell subtypes of the neutrophil lineage (see Fig. 2j for illustration and Additional file 1: Table S1 for contingency table of cluster assignments).
Collectively, these differences reflect the preference of scEC for cellular annotations driven by biological processes involving larger tranches of genes. Depending on circumstance, this preference may be a benefit or a drawback. It is generally beneficial because it ensures that scEC annotations are robust to outliers (i.e. anomalous expression patterns of individual genes cannot easily distort scEC cluster assignments). Conversely, in circumstances in which key cellular functions are determined by a small number of genes, scEC may fail to disambiguate important cell populations from similar cell types. However, given the simplicity of the scEC methodology this issue can this issue can be relatively easily addressed, by adjusting the prior, as follows. In the standard formulation of scEC each gene contributes equally to the objective function \(H_S = \sum _g H_S(g)\). This corresponds to assuming a uniform prior. However, alternatives may be easily considered that take into account different sources of prior knowledge. In general, each gene can contribute in a weighted way to the overall intercluster heterogeneity. In this case the objective becomes \(H_S = \sum _g w(g) H_S(g)\), where w(g) is the weight of gene g. These weights can encode known biology—for example, by upweighting known drivers of cellular identities, downweighting genes associated with housekeeping processes and/or the cell cycle, or by weighting genes in inverse proportion to the size (i.e. the number of associated genes) of their leading ontology term.
While incorporating such priors into scEC may be beneficial, defining them is a challenge and potentially introduces a source of bias into the resulting clustering and so should be approached with care. An alternative approach—which encodes prior information in a dataled, rather than algorithmic way—is to use a reference data set, such as a cell atlas, to benchmark cell annotations (these atlases can serve as reference transcriptomes, in analogy to reference genomes).
Doing so leads to a semisupervised version of scEC, wherein we cluster a novel test data set based on the known clustering of a given reference data set. We detail the mathematics of semisupervised clustering in the “Methods” Section, but, in short, the extension to the semisupervised setting requires no additional mathematical machinery. Such mathematical simplicity affirms information theory as providing a mathematically precise and biological intuitive framework for cellular clustering.
As an example, we cluster the Tian data set based on a distinct data set containing the same three cell lines (plus two additional cell lines) [46]. The semisupervised clustering strongly agreed with the established cell annotation, achieving an ARI of 0.90. Notably, the ARI was reduced compared with the unsupervised version due to the assignment of \(\sim 6\%\) of cells to cell types present only in the reference data set, a disadvantage of semisupervised (and supervised) methods.
Benchmarking and imputation
We benchmark scEC and stateoftheart methods against a series of data sets consisting of cells sampled from three or five distinct cancerous cell lines and sequenced using various technologies (CELSeq2, Sortseq, 10x, Dropseq) [15, 28, 31, 46]. We compare each method’s clustering against ground truth for each data set (by adjusted rand index), finding that scEC performs subpar to the best of existing methods [13, 21, 27, 34, 40]. Investigating the genewise contributions via \(H_S(g)\) reveals that scECderived clusterings diverge from the ground truth due to genes expressed in small numbers of cells, e.g., SPINK6 (difference in \(H_{scEC}\) from ground truth of 0.23 nats; expressed in \(1.6\%\) of cells in the fivecell line, 10X data set; see Fig. 4c).
To protect against low cellcount genes, we adopt a datadiffusion step, imputing the expression value of each cell as a weighted average of itself and a small subset of cells with similar expression patterns (i.e., diffusion of observed expression values over a shared nearestneighbours graph derived from the top 500 genes by I(g), see “Methods” Section) [49]. The smoothing layer shares information between genes, lessening scEC’s sensitivity to the expression pattern of individual genes; this raises the performance of scEC clustering to be on par with the stateoftheart (see Fig. 5a).
We further benchmark scEC (with and without the imputation step) as a feature selection tool, evaluating the ability of I(g) to a priori identify differentially expressed genes, assuming differential expression to be the ground truth of feature selection (differentially expressed genes are identified per data set via Wilcoxon Rank Sum testing, selecting those genes with an FDRadjusted \(p\text {value}< 0.05\) in at least one cluster; Stuart et al. [40]). We find that, without smoothing, scEC is comparable to the stateoftheart; with smoothing, scEC notably improves, substantially bettering the stateoftheart in selecting differentially expressed genes from the benchmark data sets (see Fig. 5b). Notably, relatively few genes remain heterogeneous after the imputation step, so the imputed I(g) identifies a much smaller number of genes than other methods, where an arbitrary cutoff in selected gene number is required.
Conclusion
Traditional unsupervised clustering methods for single cell data are based on the, often implicit, assumption that cell types correspond to regions of high probability density in the joint gene expression distribution [11]. Although useful, this is not a specifically biological assumption (similar assumptions form the basis of numerous clustering algorithms in other different disciplines) and does not have a clear biological basis. Indeed, this assumption does not naturally accord with biological intuition that distinct cell populations should be ‘homogeneous’ and cells of the same type should be functionally interchangeable. Here, we have taken a different approach that seeks to encode biological intuition about population homogeneity in a mathematical formulation, drawing on tools from information theory.
We have proposed a formal measure of gene expression heterogeneity and shown that this measure captures biologically relevant heterogeneity arising from differential expression between established cell types. We formalised the additive decomposition of heterogeneity with respect to a given grouping of cells (in “Methods” Section), and tested the association between high intercluster heterogeneity and grouping by cell type. Finally, we used this measure as a basis of a biologicallymotivated clustering procedure, which we call singlecell entropic clustering (scEC), which identifies homogeneous cell types. The scEC method is free of tunable hyperparameters and its performance is comparable to a stateoftheart clustering algorithm on a series of benchmark scSeq data sets, suggesting that this underlying biological basis is justified.
While our method represents a mathematically rigorous definition of heterogeneity, the stochasticity inherent in scSeq data does limit our resolution. For instance, nearly every gene is associated with some nonzero level of heterogeneity and, over thousands of genes, even low levels of stochasticallyinduced heterogeneity will accumulate (a version of the socalled ‘curse of dimensionality’). This limits our ability to directly interpret the genomewide heterogeneity (e.g. the total heterogeneity of the technical control Svensson exceeds that of Tian, both absolute and per gene, although the trend reverses when excluding those genes with fewer than 100 total transcripts).
Aside from these technical limitations, our approach demonstrates the benefit of a biologically grounded, mathematically formal approach to understanding scSeq data. As these data sets grow in complexity and size, there will be a greater need for biological interpretability and mathematical rigour in analysis. We suggest information theory as the appropriate mathematical language for describing gene expression and its heterogeneity. Indeed, we anticipate that information theory will become an established part of the quantitative biologist’s toolkit.
Methods
Data collection
Count matrices for each data set were downloaded from their respective repositories (see Availability of Data and Materials). For the Zheng and Tabula Muris data sets, the count matrices of the various cell types/tissues were concatenated using the Matrix (v1.32) package in R (v4.0.3) [32].
Data preprocessing
For each data set, those genes with less than 100 total transcripts in total (i.e. across all cells) were excluded from further analysis. For the Zheng data set, FACS identities were taken from the repository metadata, with the CD4+ Helper Tcells, CD4+/CD25+ Regulatory Tcells, CD4+/CD45RA+/CD25 Naïve T cells, CD4+/CD45RO+ Memory Tcells, CD8+ Cytotoxic T cells and CD8+/CD45RA+ Naïve Cytotoxic Tcells merged into the single identity of Tcells.
Normalisation
Since scSeq data is noisy, rather than working with the experimentally observed proportions of transcripts assigned to each cell, we instead adopt a Bayesian approach to estimate the cellular frequencies using the James–Stein estimator, as explained below, and use this estimator in all subsequent calculations.
We write \(m^g_i\) for the number of transcripts of gene g associated with cell i in a population of N cells. Let \(\sum _{i=1}^N m^g_i = M^g\) be the total number of transcripts of gene g, and \(p^g_i = m^g_i/M^g\) the fraction of transcripts of gene g expressed by cell i, for each \(1 \le i \le N\). Thus, \(\sum _{i=1}^{N}p^g_i=1\) and we can think of \(p^g_i\) as the probability that a randomly selected transcript of gene g is associated with cell i. To take undersampling into account, we adjust the observed probabilities \(p_i=p^g_i\) as follows (let us drop the superscript g for simplicity). We define \(X=X^g\) as the discrete random variable on the set \(\{ 1,2, \ldots N\}\) with probabilities
where \(\lambda ^\prime\) is the shrinkage intensity given by
which is the James–Stein estimator, with the uniform distribution as the shrinkage target [16]. Shrinkage approaches deal with undersampling and so are able to correct for the substantial sparsity observed in singlecell RNAsequencing data. The shrinkage probabilities \(x_i\) given in Eq. 3 are a compromise between the observed probabilities \(p_i\), which are unbiased but with a high variance, and the target uniform probabilities 1/N, which are biased but with low variance [6, 16].
Feature selection
The Shannon entropy of X is, by definition,
By convention, we assume that \(0 \cdot \log 0 = 0\). For the logarithm, we take base e for ease in differentiation (see later) rather than the usual base 2 (recall that \(\log _e(x)=\log _2(e)\log _2(x)\) for all \(x>0\)), and thus H(X), and I(g) below, are measured in nats.
The Shannon entropy is a measure of the uncertainty in the outcomes of the random variable X. It has a minimum value of zero, when \(x_i=1\) for some i (e.g. if the gene g is expressed in only one cell of the population) and has a maximum value of \(\log (N)\), when \(x_i=1/N\) for all i (e.g. if the gene is uniformly expressed in the cell population). The entropy may therefore be considered as a measure of the homogeneity of expression of the gene g in the cell population profiled (note that the association of high entropy with homogeneity is counter to the usual intuition: it occurs because we are concerned with the entropy of the generative process of assigning transcripts to cells, rather than transcript count distribution in the cell population, as is more usual. A high entropy transcript assignment process gives rise to a sharp transcript distribution, hence its association with homogeneity). By contrast, the quantity \(\log (N)H(X)\) also ranges between zero and \(\log (N)\), yet is minimised when the gene is homogeneously expressed and so is a simple measure of expression heterogeneity, which we will denote I(g). We can rewrite this as
The Kullback–Leibler divergence (KLD), or relative entropy, is a measure of the information encoded in the difference between two probability distributions [22]. The relative entropy of a discrete probability distribution \(p_1, \ldots , p_N\) from a discrete probability distribution \(q_1, \ldots , q_N\) is, by definition,
with the provision that \(q_i = 0\) implies \(p_i =0\), and the convention that \(0 \cdot \log (\frac{0}{0})= 0\). From this definition and Eq. 6, it is clear that our measure of heterogeneity is simply the relative entropy of the observed expression distribution from the uniform distribution U. Thus,
where U denotes to the uniform distribution on the set \(\{ 1,2, \ldots N\}\) with probabilities \(u_i=1/N\) for all \(1\le i \le N\).
Clusterlevel heterogeneity
A crucial property of the relative entropy is that it is additively decomposable with respect to arbitrary groupings [38]. Informally, this means that if we have a clustering of the cells into disjoint groups, then I(g) can be reconstructed from intercluster and intracluster heterogeneities. Next, we formalise this additive decomposition and give a selfcontained derivation (cf. Theil [45]).
Consider a clustering \(S=\{S_1,\ldots ,S_C\}\) that unambiguously assigns each cell in the sample into one of C nonintersecting subpopulations \(S_1,\ldots ,S_C\) of sizes \(N_1, \ldots , N_C\). Note that \(\sum _{k=1}^C N_k = N\), the total number of cells. Let \(y_k\) be the fraction of transcripts associated with cells in subpopulation \(S_k\), adjusted by the shrinkage estimator. That is,
with \(x_i\) defined by Eq. 3. This gives another discrete random variable Y with probability distribution \(y_1,\ldots ,y_C\) on the set \(\{1,2,\ldots C\}\). For each \(k=1,\ldots , C\), we can also assess the heterogeneity of the subpopulation \(S_k\) by considering the random variable \(Z_k\) with probability distribution \(z_i = x_i/y_k\) on the set \(i \in S_k\). Note that
so \(Z_k\) is a random variable on the set (cluster) \(S_k\).
We may rewrite I(g) in terms of Y and \(Z_k\), as follows:
Expression A may be rewritten as:
This is the relative entropy of Y from the uniform distribution \(U_group\) in which \(p_k = N_k/N\) for \(k=1,\ldots ,C\). Since \(y_k\) is the proportion of transcripts assigned to the cluster \(S_k\), it measures the deviation from the assumption that the clusters are homogeneous in their expression of g (i.e. each cluster expresses g at the same level). Since it is a measure of the extent to which the population deviates from homogeneity between clusters, we will term this contribution the intercluster heterogeneity of g with respect to the clustering S, denoted \(H_S\). Informally, it is a measure of the extent to which the gene g is differentially expressed between clusters.
Expression B may be rewritten as:
This is the weighted sum of the relative entropies of the empirical distributions \(Z_k\) (i.e. the observed gene expression distribution in group \(S_k\)) from the uniform distribution \(U_k\) on \(S_k\) (in which \(p_i = 1/N_k\) for each \(i \in S_k\)). It is the deviation from the assumption that the population consists of a mixture of homogeneous subpopulations according to the clustering S (where the expectation is taken with respect to the probability measure provided by Y). Since it is a measure of the expected extent to which the proposed subpopulations deviate from homogeneity within clusters, we will term this contribution the intracluster heterogeneity of g with respect to S, denoted \(h_S(g)\).
Taken together, these results show that I(g) can be decomposed into two welldefined parts that encode local and global properties of the expression distribution of g with respect to a given clustering \(S=\{S_1,\ldots ,S_C\}\):
The relative entropy is always nonnegative and hence so are \(H_S(g)\) and \(h_S(g)\) for any S. Thus, both quantities range from zero to I(g). If S places one cell in each cluster (i.e. \(C=N\)) then \(Z_k=U_k\) for all k and thus \(h_S(g) = 0\). Conversely, if S places all cells in one cluster (i.e. \(C=1\)) then \(Y=U_S\) and thus \(H_S(g) = 0\). In this case, the population heterogeneity is equivalent to the intracluster heterogeneity of the trivial clustering.
Unsupervised clustering
The above measures of heterogeneity can be easily extended from one gene to many. In this case, the homogeneous null model is obtained by assuming that each gene is expressed homogeneously and independently. Because entropy from independent sources is additive [7], the total heterogeneity of a singlecell RNAsequencing data set under a given clustering is given by the sum:
Here \(H_S=\sum _g H_S(g)\) and \(h_S=\sum _g h_S(g)\) represent the total intercluster and intracluster heterogeneity of the data with respect to a clustering \(S=\{S_1,\ldots ,S_C\}\).
Naturally, we want to identify a clustering (e.g. an assignment of cells to types) that produces maximally homogeneous subgroups and thereby explains most of the expression heterogeneity for most genes in terms of intercluster heterogeneity. Mathematically, this means we want to find a clustering S that maximises \(H_S\) or, equivalently, minimises \(h_S\). A brute force approach is not feasible, as the number of partitions of a set with n elements into two or more subsets grows exponentially with n.
To approach the problem we therefore converted the problem from one of assigning discrete identities (i.e., a hard clustering problem) to one of assigning continuous identities (i.e., a fuzzy clustering problem). The advantage of this approach is that it defines the problem in terms of a continuous, differentiable objective function which can be approached with more efficient optimisation routines.
Fuzzy clustering
We adopt a fuzzy conception of clustering in which cells are assigned to C (possibly) overlapping, fuzzy clusters, \(S_1, \ldots , S_C\). Each cell i has C corresponding membership functions \(\mu _{ik} \in [0,1]\) for \(k=1,2,\ldots ,C\), which assess the probability that cell i belongs to cluster \(S_k\). Thus,
Our informationtheoretic framework can be adapted to this fuzzy setting. Population heterogeneity, I(g), is independent of the chosen clustering, so it remains unaffected by the adoption of fuzzy clusters, i.e. the total gene expression heterogeneity is unaffected by the choice of discrete or fuzzy cluster memberships. For the calculations of intercluster and intracluster heterogeneities, we extend the discrete random variables \(Y^{g}\) and \(Z^{g}_k\) to the fuzzy setting, where \(Y^{g}\) now measures the expression distribution of the gene g across the C fuzzy clusters, and \(Z^{g}_k\) measures the expression distribution of the gene g within fuzzy cluster \(S_k\), as follows.
We define the discrete random variable \(Y^{g}\) on the set of fuzzy clusters \(S = \{S_1,\ldots ,S_C\}\) with probabilities \(y_k^{g}\) given by
We also define, for each \(S_k \in S\), a discrete random variable \(Z_k^{g}\) on the set of cells, \(i = 1,\ldots ,N\), with probabilities \(z^{g}_{ik}\) given by,
We can rewrite I(g) (which is independent of the clustering) in terms of \(Y^g\) and \(Z^g_k\) (which depend on the clustering) as
where,
and
Note that Eq. 33 is the entropy of the random variable \(Y^g\), however \(H(Z^g_k)\) is not quite the entropy of \(Z^g_k\) (there is a missing \(1/\mu _{ik}\) factor inside the logarithm).
From Eq. 32, we can obtain a second decomposition,
where
Here H(g) and h(g) represent the intercluster heterogeneity and intracluster heterogeneity of gene g with respect to the fuzzy clustering S. Equation 36 is also the KLD of the distribution of \(Y^g\) from the uniform distribution \(U_group\) given by \(u_k=N_k/N\) for \(k=1,2,\ldots ,C\) (note that \(\sum _{k=1}^C u_k = 1\)). Similarly, Eq. 37 is the KLD of the distribution of \(Z^g\) from the uniform distribution \(U_k\) given by \(p_i=\mu _{ik}/N_k\) for \(i=1,2,\ldots ,N\).
Either \(H_S = \sum _g H_S(g)\) or \(h_S = \sum _g h_S(g)\) can serve as our objective function, f, for optimisation: optimising either quantity simultaneously identifies the clustering with the greatest heterogeneity between clusters (i.e. the greatest differential gene expression between clusters) and the least heterogeneity within clusters (i.e. in which the cells within each cluster are closest to being interchangeable with respect to their patterns of gene expression).
Choosing intercluster heterogeneity \(H_S\), we solve the optimisation problem
where
using the limitedmemory, box constrained BFGS (LBFGSB) optimisation algorithm from the Python3 (v3.8.2) package SciPy (v1.5.3) [5, 50, 51, 53].
The LBFGSB algorithm is an efficient nonlinear local optimisation method developed for solving large, dense problems such as this [53] but its speed and the quality of the solution produced depends on the availability of an explicit formulation of the gradient of the objective function. Therefore, we derive the required gradient by partially differentiating the total intercluster heterogeneity, \(H_S\), with respect to fuzzy cluster membership, i.e. the \(N \cdot C\) membership functions \(\mu _{rs}\) (\(1\le r\le N, 1 \le s \le C\)).
Beginning by differentiating individual elements of f with respect to \(\mu _{rs}\), from Eqs. 30 and 38, we have
Using the product, chain and quotient rules, we can differentiate the gsummand, call it \(f^g \left( = H_S(g)\right)\), of the objective function f:
The membership functions \(\mu _{rs}\) are not independent, since \(\sum _{k=1}^C \mu _{ik} = 1\) for all i, by Eq. 29. To incorporate this constraint, and the constraint \(\mu _{ik} \ge 0\) for all i, k, we introduce variables \(w_{ik}\) given by
and make the objective function f a function of the \(w_{ik}\). By the chain rule,
Determining \(\frac{\partial \mu _{rs}}{\partial w_{ij}}\), let us write \(M_i = \sum _{l=1}^C e^{w_{il}}\); then, Eq. 50 can be written as \(\mu _{ik}=e^{w_{ik}}/M_i\). By the quotient rule,
Numerical optimisation
The objective function \(H_S\), as formulated in Eq. 40, was optimised with respect to cluster memberships, \(\mu _{ik}\) using the LBFGSB optimisation algorithm from the Python3 (v3.8.2) package SciPy (v1.5.3) [5, 50, 51], with random initial starting points (initial cluster memberships were defined by random sampling of a uniform distribution centred on zero). The gradient function, as formulated in Eq. 51, was supplied to the optimisation routine.
Because initial conditions are randomly generated, and the LBFGSB algorithm only identifies local optima, the specific clustering found by the algorithm may depend on the choice of initial vector [5, 53]. To make the implementation robust a multistart process is adopted, in which the optimisation is repeated multiple times with different initial vectors, and the clustering S with the greatest intercluster heterogeneity \(H_S\) is chosen.
Permutation testing
Annotations for the Tian, Zheng and Stumpf data were taken from the respective repositories’ metadata. Random clusterings were generated by randomly permuting the established annotations 10,000 times each. Comparison of true and shuffled annotations was formulated as an exact oneside hypothesis test, generating exact pvalues [8]. We control the false discovery rate arising from multiple testing for each data set using the R function p.adjust [2].
Clustering comparison
We compare our clustering results to two standard clustering methods: Seurat (v3) clustering (Louvain community detection), and UMAP projection, both with default parameters, with the exception of the resolution, as described in Stuart et al. [40] [3, 29]. The Adjusted Rand Index (ARI) was calculated for each data set using the associated function in the R package mclust (v5.4.7) [33, 36].
Semisupervised classification
Consider a pair of count matrices, \(T = \{T_{ref}, T_{test}\}\), where one count matrix, which we term the reference, \(T_{ref}\), has a known cluster structure (and therefore each cell has a unique, discrete cell type classification). The goal of reference mapping is to classify cells in the test matrix, \(T_{test}\), based on cellular classifications of the reference matrix.
We normalise each count matrix separately as described in Normalisation, using the James–Steintype estimator so that \(\sum _{i \in T_k} x_i^g = 1\) [16]. We assume that the same set of genes are profiled in \(T_{ref}\) and \(T_{test}\) and derive the normalised combined count matrix, \(X_{mix}\), via the weighted concatenation:
where \(N_{ref}\) is the number of cells in the reference data set, \(N_{test}\) is the number of cells in the unclassified data set, and N is the total number of cells across both data sets.
As in the unsupervised setting, we cluster the combined data set, \(X_{mix}\), by maximising \(H_S\) of the combined data set with respect to a fuzzy clustering S, see Eq. 56. However, unlike in the unsupervised setting, a subset of cellular cluster memberships are known a priori. Let \(R = R_1,\dots ,R_C\) be the discrete clustering of the reference data set, with cluster sizes \(n_1,\ldots ,n_C\), where \(\sum _{k=1}^C n_k = N_{ref}\). As the cluster identities of cells originating from the reference data set are fixed, each cluster in the reference data set constitutes a subset of a cluster in the mixed data set, \(R_k \in S_k\), where \(S=S_1,\ldots ,S_C\) is the clustering of the mixed data set. Note that number of clusters in the reference and in the combined data set is the same, i.e. C.
Based on the comixing of known and unknown cellular identities, \(y_k^g\) and \(N_k\) of the combined data set \(X_{mix}\) are given by
Based on these formulations of \(y_k^g\) and \(N_k\), the definition of \(H_S\) follows as before,
Similarly, the derivative of \(H_S\) follows as before, except that the membership function, \(\mu _{rs}\), is only defined with respect to cells of the test data set, \(1 \le r \le N_{test}\) & \(1 \le i \le N_{test}\),
Numerical optimisation follows as in the unsupervised setting, using the LBFGSB algorithm, except that the known cellular identities of the reference mean that only a single run of the algorithm is required, i.e. no multiple starts.
Imputation
We rely on a cellbycell adjacency matrix for imputation, encoding which cells have similar gene expression profiles. Specifically, we use the sharednearestneighbours matrix produced by Seurat, using default settings [40].
We begin with the expression matrix X, keeping only the top G genes by I(g) (default is 500). Then, briefly, the Seurat method follows the following steps: data is scaled and normalised using a variance stabilising transform; the transformed data then undergoes principal components, and a cellcell Euclidean distance matrix is calculated on the first 10 principal components. Next, a knearestneighbours graph is constructed, where k is by default 20; a shared nearestneighbours graph is constructed by taking the Jaccard Index of the overlap in neighbourhoods of each pair of cells [18].
From this adjacency matrix, A, we compute the imputed data via a onehop smoothing operator:
where Q is the rownormalised stochastic matrix, i.e. \(\sum _j Q_{ij} = 1 \; \forall \; i \in 1,\ldots ,N\), \(X^g=(X^g_j)\) is the gcolumn of X, and \(\widetilde{X}^g=(\widetilde{X}^g_i)\) are the imputed values for gene g [30]. The imputed expression value of gene g in cell i is a weighted mean of observed expression values in cell i and its neighbourhood in A. This is an example of a data diffusion imputation, first applied to Scseq data in Van Dijk et al. [49].
The vector \(\widetilde{X}^g\) can then be used in place of \(X^g\) in any of the above described work. In the main text, we select the top 500 genes by I(g) based on \(\widetilde{X}^g\), before clustering on \(\widetilde{X}^g\). Accordingly, use of the imputed expression matrix requires two rounds of feature selection: first in the construction of the adjacency matrix, A, then for the use of \(\widetilde{X}^g\) in clustering.
For the benchmarking of the imputation, we recapitulated the set of data sets and methods tested in Tian et al. [46]. However, we were unable to rerun RCA [23].
Availability of data and materials
The Svensson data was downloaded as file “svensson_chromium_control.h5a” from https://data.caltech.edu/records/1264 The Tian data sets were downloaded from https://github.com/LuyiTian/sc_mixology The Zheng data was downloaded as set of files “Gene / cell matrix (filtered)” in section “Single Cell 3’ Paper: Zheng et al. 2017” from https://support.10xgenomics.com/singlecellgeneexpression/datasets The Stumpf data was downloaded as file “RData.zip” from http://dx.doi.org/10.17632/csvm3kpkxd.1 The Tabula Muris data was downloaded as set of files “Singlecell RNAseq data from microfluidic emulsion (v2)” from https://tabulamuris.ds.czbiohub.org
Code availability
An R package for the implementation of the described methods is available on github https://github.com/mjcasy/scEC
Abbreviations
 ARI:

Adjusted rand index
 KLD:

Kullback–Leibler divergence
 scEC:

Singlecell entropic clustering
 scSeq:

Singlecell sequencing
References
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nate Genet. 2000;25(1):25–9.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc: Ser B (Methodol). 1995;57(1):289–300.
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech: Theory Exp. 2008;2008(10):P10008.
Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, Baying B, Benes V, Teichmann SA, Marioni JC, et al. Accounting for technical noise in singlecell rnaseq experiments. Nat Meth. 2013;10(11):1093–5.
Byrd RH, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput. 1995;16(5):1190–208.
Chan TE, Stumpf MP, Babtie AC. Gene regulatory network inference from singlecell data using multivariate information measures. Cell Syst. 2017;5(3):251–67.
Cover TM, Thomas JA. Elements of information theory. Wiley; 2012.
Fisher R. Statistical methods for research workers. Gyan Books; 2017.
Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in r for mediumsized 10x genomics singlecell RNAsequencing data. F1000Research. 2018;7:1297.
Consortium Gene Ontology. The gene ontology resource: enriching a gold mine. Nucl Acids Res. 2021;49(D1):D325–34.
Greulich P, Smith R, MacArthur BD. The physics of cell fate. In: Levine H, Jolly MK, Kulkarni P, Nanjundiah V, editors. Phenotypic switching. NewYork: Academic Press; 2020. p. 189–206.
Grün D, Kester L. Van, Oudenaarden A. Validation of noise models for singlecell transcriptomics. Nat Meth. 2014;11(6):637–40.
Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H. Van, Oudenaarden A. Singlecell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525(7568):251–5.
Hafemeister C, Satija R. Normalization and variance stabilization of singlecell RNAseq data using regularized negative binomial regression. Genome Biol. 2019;20(1):1–15.
Hashimshony T, Senderovich N, Avital G, Klochendler A, De Leeuw Y, Anavy L, Gennert D, Li S, Livak KJ, RozenblattRosen O, et al. Celseq2: sensitive highlymultiplexed singlecell RNAseq. Genome Biol. 2016;17:1–7.
Hausser J, Strimmer K. Entropy inference and the James–Stein estimator, with application to nonlinear gene association networks. J Mach Learn Res. 2009;10(7):1469–84.
Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in singlecell RNAsequencing experiments. Biostatistics. 2018;19(4):562–78.
Jaccard P. The distribution of the flora in the alpine zone. 1. NewPhytol. 1912;11(2):37–50.
Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to singlecell differential expression analysis. Nat Meth. 2014;11(7):740–2.
Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of singlecell RNAseq data. Nat Rev Genet. 2019;20(5):273–82.
Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. Sc3: consensus clustering of singlecell RNAseq data. Nat Meth. 2017;14(5):483–6.
Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79–86.
Li H, Courtois ET, Sengupta D, Tan Y, Chen KH, Goh JJL, Kong SL, Chua C, Hon LK, Tan WS, et al. Reference component analysis of singlecell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet. 2017;49(5):708–18.
Liu B, Li C, Li Z, Wang D, Ren X, Zhang Z. An entropybased metric for assessing the purity of single cell populations. Nat Commun. 2020;11(1):1–13.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with deseq2. Genome Biol. 2014;15(12):1–21.
Luecken MD, Theis FJ. Current best practices in singlecell RNAseq analysis: a tutorial. Mol Syst Biol. 2019;15(6): e8746.
Lun AT, McCarthy DJ, Marioni JC. A stepbystep workflow for lowlevel analysis of singlecell RNAseq data with bioconductor. F1000Research. 2016;5:2122.
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–14.
McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426. 2018
Ortega A. Introduction to graph signal processing. Cambridge University Press; 2022.
Peterman N, Levine E. Sortseq under the hood: implications of design choices on largescale characterization of sequencefunction relations. BMC Genom. 2016;17:1–17.
R Core Team R: a language and environment for statistical computing, R foundation for statistical computing, Vienna. https://www.Rproject.org/, 2020
Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66(336):846–50.
Risso D, Purvis L, Fletcher RB, Das D, Ngai J, Dudoit S, Purdom E. Clusterexperiment and rsec: a bioconductor package and framework for clustering of singlecell and other large gene expression datasets. PLoS Comput Biol. 2018;14(9): e1006378.
Robinson MD, McCarthy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
Scrucca L, Fop M, Murphy TB, Raftery AE. mclust 5: clustering, classification and density estimation using gaussian finite mixture models. R J. 2016;8(1):289.
Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27(3):379–423.
Shorrocks AF. The class of additively decomposable inequality measures. Econom: J Econom Soc. 1980;48(3):613–25.
Smith RC, MacArthur BD. Informationtheoretic approaches to understanding stem cell variability. Curr Stem Cell Rep. 2017;3(3):225–31.
Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM III, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive integration of singlecell data. Cell. 2019;177(7):1888–902.
Stumpf PS, Du X, Imanishi H, Kunisaki Y, Semba Y, Noble T, Smith RC, RoseZerili M, West JJ, Oreffo RO, et al. Transfer learning efficiently maps bone marrow cell types from mouse to human using singlecell RNA sequencing. Commun Biol. 2020;3(1):1–11.
Svensson V, Natarajan KN, Ly LH, Miragaia RJ, Labalette C, Macaulay IC, Cvejic A, Teichmann SA. Power analysis of singlecell RNAsequencing experiments. Nat Meth. 2017;14(4):381.
Svensson V, VentoTormo R, Teichmann SA. Exponential scaling of singlecell RNAseq in the past decade. Nat Protoc. 2018;13(4):599–604.
Consortium Tabula Muris. Singlecell transcriptomics of 20 mouse organs creates a tabula muris. Nature. 2018;562(7727):367.
Theil H. Economics and Information, theory studies in mathematical and managerial economics. NorthHolland Publishing Company; 1967.
Tian L, Dong X, Freytag S, Lê Cao KA, Su S, JalalAbadi A, AmannZalcenstein D, Weber TS, Seidi A, Jabbari JS, et al. Benchmarking single cell RNAsequencing analysis pipelines using mixture control experiments. Nat Meth. 2019;16(6):479–87.
Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for singlecell RNAseq based on a multinomial model. Genome Biol. 2019;20(1):1–16.
Trapnell C. Defining cell types and states with singlecell genomics. Genome Res. 2015;25(10):1491–8.
Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, et al. Recovering gene interactions from singlecell data using data diffusion. Cell. 2018;174(3):716–29.
Van Rossum G, Drake FL. Python 3 reference manual. Scotts Valley: CreateSpace; 2009.
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nat Meth. 2020;17(3):261–72.
Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8(1):1–12.
Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: LBFGSB: Fortran subroutines for largescale boundconstrained optimization. ACM Trans Math Softw (TOMS). 1997;23(4):550–60.
Acknowledgements
We thank P. Stumpf, J. Egan and B. KitchingMorley for helpful discussions.
Funding
RSG and BDM were supported by The Alan Turing Institute under the EPSRC Grant EP/N510129/1. RSG was supported by the Alan Turing Institute—Roche strategic partnership under the project ‘Structured missingness in heterogeneous data’.
Author information
Authors and Affiliations
Contributions
Conceptualization, MJC, RSG, BDM; Software, MJC; Investigation, MJC, JF, RJSG and BDM; Writing—original draft, MJC, RSG and BDM; Writing—review and editing, MJC, JF, RSG and BDM; Visualization, MJC; Supervision, RSG and BDM. All authors have seen and approved the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. Two additional figures and one additional table: Expression of Hbbbs in the Stumpf data set (Supplementary Figure 1); Feature selection benchmarking of scEC and scECimpute (Supplementary Figure 2); and Contingency table between cell annotations provided in the Stumpf data set and scEC labels.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Casey, M.J., Fliege, J., SánchezGarcía, R.J. et al. An informationtheoretic approach to single cell sequencing analysis. BMC Bioinformatics 24, 311 (2023). https://doi.org/10.1186/s12859023054248
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859023054248