 Research
 Open access
 Published:
Assessing transcriptomic heterogeneity of singlecell RNASeq data by bulklevel gene expression data
BMC Bioinformatics volumeÂ 25, ArticleÂ number:Â 209 (2024)
Abstract
Background
Singlecell RNA sequencing (scRNASeq) data illuminate transcriptomic heterogeneity but also possess a high level of noise, abundant missing entries and sometimes inadequate or no cell type annotations at all. Bulklevel gene expression data lack direct information of cell population composition but are more robust and complete and often better annotated. We propose a modeling framework to integrate bulklevel and singlecell RNASeq data to address the deficiencies and leverage the mutual strengths of each type of data and enable a more comprehensive inference of their transcriptomic heterogeneity. Contrary to the standard approaches of factorizing the bulklevel data with one algorithm and (for some methods) treating singlecell RNASeq data as references to decompose bulklevel data, we employed multiple deconvolution algorithms to factorize the bulklevel data, constructed the probabilistic graphical models of celllevel gene expressions from the decomposition outcomes, and compared the loglikelihood scores of these models in singlecell data. We term this framework backward deconvolution as inference operates from coarsegrained bulklevel data to finegrained singlecell data. As the abundant missing entries in scRNASeq data have a significant effect on loglikelihood scores, we also developed a criterion for inclusion or exclusion of zero entries in loglikelihood score computation.
Results
We selected nine deconvolution algorithms and validated backward deconvolution in five datasets. In the insilico mixtures of mouse scRNASeq data, the loglikelihood scores of the deconvolution algorithms were strongly anticorrelated with their errors of mixture coefficients and cell type specific gene expression signatures. In the true bulklevel mouse data, the sample mixture coefficients were unknown but the loglikelihood scores were strongly correlated with accuracy rates of inferred cell types. In the data of autism spectrum disorder (ASD) and normal controls, we found that ASD brains possessed higher fractions of astrocytes and lower fractions of NRGNexpressing neurons than normal controls. In datasets of breast cancer and lowgrade gliomas (LGG), we compared the loglikelihood scores of three simple hypotheses about the gene expression patterns of the cell types underlying the tumor subtypes. The model that tumors of each subtype were dominated by one cell type persistently outperformed an alternative model that each cell type had elevated expression in one gene group and tumors were mixtures of those cell types. Superiority of the former model is also supported by comparing the real breast cancer scRNASeq clusters with those generated by simulated scRNASeq data.
Conclusions
The results indicate that backward deconvolution serves as a sensible model selection tool for deconvolution algorithms and facilitates discerning hypotheses about cell type compositions underlying heterogeneous specimens such as tumors.
Introduction
Transcriptomic heterogeneity is probed by RNA sequencing data at bulk and singlecell levels. Each type of data has its merits and shortcomings. Bulklevel RNASeq data fail to directly disclose subpopulation variability in a heterogeneous sample, but are less vulnerable to measurement noise and missing values and often better annotated. Singlecell RNASeq (scRNASeq) data directly manifest cellular heterogeneity, but also suffer from high measurement noise and dropouts, and sometimes lack proper annotations. Integrating bulklevel and singlecell RNASeq data by leveraging their complementary merits can address these deficiencies and provide a more comprehensive understanding of the heterogeneous cell types and their compositions in samples.
A standard approach for unveiling heterogeneity from bulklevel data is deconvolution. Denote an \(n\times m\) matrix \(E\) the expression data of \(n\) genes and \(m\) bulk samples. Assume each bulk sample comprises cells from \(k\) types, and the expression data of a sample is the weighted sum of the expression profiles of the \(k\) cell types. Under these assumptions \(E\) is approximately factorized into the product of two matrices:
\(Q\) is an \(n\times k\) signature matrix denoting the expression profiles of \(n\) genes in \(k\) cell types, and \(P\) is a \(k\times m\) mixture coefficient matrix denoting the proportions of \(k\) cell types in \(m\) samples, where the \(P\) entries in each column are nonnegative and sum to 1, and the \(Q\) entries are nonnegative as well.
Numerous deconvolution algorithms have been proposed (see reviews [1, 2]), and they fall into two general categories. Complete deconvolution methods simultaneously solve \(Q\) and \(P\) from \(E\) by imposing various constraints on the inferred matrices [3, 4]. Incomplete deconvolution methods take one matrix either as given or inferred from external sources and optimize the other matrix. Very few incomplete deconvolution methods fix \(P\) and optimize \(Q\) [5], and the majority of the methods fix \(Q\) and optimize \(P\) [6,7,8,9,10,11]. \(Q\) is either explicitly given as an input [12], constructed from cell type specific marker genes [13], derived from the reference expression profiles at bulklevel [7] or singlecell RNASeq data [14,15,16,17,18]. Several benchmark studies also extensively compared multiple deconvolution methods in a wide range of experimental datasets and performance settings [1, 19, 20].
Despite diversity and richness of these methods, integration of bulklevel and singlecell data (if undertaken) is solely achieved by utilizing the singlecell data as a reference to infer the composition of the bulklevel data.
Inference from singlecell to bulklevel data directly matches the goal of deconvolution as the signature matrix \(Q\) can be directly derived from the singlecell data. Hence we term this inference direction forward deconvolution. However, forward deconvolution may be infeasible or misleading as singlecell data are sparse, noisy and sometimes unannotated. A prominent example is cancer transcriptomics data. Tumor subtypes have distinct expression patterns in their bulklevel RNASeq data [21,22,23]. Yet in the singlecell data the cancer cell types are often unannotated and the data quality is substantially inferior, as indicated in prior studies [24, 25] and our analysis on the data of breast cancer and low grade gliomas in the Results section. Therefore, prior deconvolution methods using singlecell RNASeq data as a reference are not applicable in certain contexts.
To fix these caveats of forward (singlecell \(\to\) bulklevel data) deconvolution, we propose a backward deconvolution framework to integrate bulklevel and singlecell RNASeq data and simultaneously infer (1) signature expression profiles of cell types, (2) mixture coefficients of cell types in each bulk sample, (3) relations between expression profiles of bulklevel sample subtypes and cell types, (4) cell type assignments in singlecell RNASeq data. Backward deconvolution employs several forward deconvolution algorithms to the bulklevel data and derives the probabilistic graphical models of singlecell gene expressions from the deconvolution results. It then evaluates the loglikelihood scores of these models in the singlecell data and selects the best model according to its loglikelihood score.
Representation of scRNASeq data as a probabilistic graphical model has been proposed since the early stage of singlecell technology development. The most common approach is to borrow topic models or Latent Dirichlet Allocation (LDA) in text analysis [26] to the scRNASeq data. LDA models word distributions per topic and topic distributions per document with two nested Dirichlet distributions. There is a direct correspondence from documents, words and latent topics in text analysis to cells, genes and cell functions in scRNASeq data. LDA is now widely used in dimension reduction [27, 28] and clustering [29] of scRNASeq data alone. A more relevant approach unifies bulklevel and singlecell RNASeq data with a more general probabilistic graphical model (URSM [10]). Our work shares a common spirit of a hierarchical probabilistic representation of the data generation process but substantially differs from them in several important aspects. Most LDA studies on gene expression data apply to singlecell RNASeq data only and fail to integrate both singlecell and bulklevel data. Although URSM tackles integration of both types of data, the graphical model is based on one set of particular assumptions about the data. In contrast, backward deconvolution directly tackles bulklevel and singlecell data integration and allows multiple modeling assumptions encoded by different forward deconvolution algorithms. These features are unique in our approach.
We justified backward deconvolution by selecting nine deconvolution algorithms and applying the framework to five singlecell and bulklevel datasets: (1) the scRNASeq data of mouse gene expressions and its insilico mixtures as the virtual bulklevel data, (2) the true bulklevel and singlecell RNASeq data of mouse gene expressions, (3) the bulklevel and singlecell RNASeq data of the brains of ASD patients and normal controls, (4) the breast cancer bulklevel and singlecell RNASeq data, (5) the lowgrade gliomas bulklevel and singlecell RNASeq data. In the mouse datasets with cell type annotations, the loglikelihood scores wereÂ aligned with several common performance metrics such as the accuracy rate of predicted cell type assignments, similarity between true and inferred signature matrices, and similarity between true and inferred mixture coefficients matrices. In the ASD data, backward deconvolution outcomes indicated that ASD brains possessed higher fractions of astrocytes and lower fractions of NRGNexpressing neurons. In the cancer datasets with no cell type annotations, we compared three simple hypotheses about cell type expression patterns and found the model that cancer cells of each subtype were dominated by one cell type was superior to other models. The results indicate that backward deconvolution (1) is a sensible model selection tool for deconvolution algorithms and (2) facilitates discerning hypotheses about cell type compositions underlying heterogeneous specimens.
Materials and methods
Overview of the backward deconvolution framework
The objective of backward deconvolution is to simultaneously infer the gene expression signatures of the underlying cell types and their compositions in selected sample types from both bulklevel and singlecell RNASeq data. The outcome of a standard deconvolution algorithm (forward deconvolution) is a decomposition of bulklevel data as in Eq.Â 1. Our method converts the decomposition outcome into a probabilistic graphical model for singlecell gene expressions, and applies the model to fit singlecell data. We term this method backward deconvolution as inference is undertaken from bulklevel to singlecell data. Rather than incurring one algorithm to perform decomposition, backward deconvolution compares several forward deconvolution algorithms on their goodness of fit to the singlecell data and selects the best one. Therefore, it should be viewed as a framework of constructing and selecting models from multiple deconvolution algorithms.
FigureÂ 1A illustrates the backward deconvolution framework. The inputs include (1) bulklevel RNASeq data of samples labeled with subtypes, (2) singlecell RNASeq data of samples from the same bulklevel subtypes (but not necessarily from the same specimens of the bulklevel RNASeq data), (3) a subset of marker genes pertaining to the sample subtypes, and group labels of the marker genes whose expression profiles distinguish sample subtypes, (4) a number of forward deconvolution algorithms. The outputs include (1) transcription signatures of gene markers for each cell type, (2) mixture coefficients of cell types in each bulk sample, (3) the probabilistic graphical model of the deconvolution algorithm that best fits the singlecell data, (4) cell type assignments in scRNASeq data. Briefly, it consists of the following steps. First, we generate a reference signature matrix and a reference signature distribution from singlecell or bulklevel RNASeq data. A reference signature matrix comprises the mean marker gene expressions of each cell type. A reference signature distribution specifies the expression distribution of each marker gene group in each cell type. Second, we employ multiple forward deconvolution algorithms to factorize the bulklevel data (Eq.Â 1). Third, from the outcome of each deconvolution algorithm we construct a probabilistic graphical model for singlecell gene expressions. Finally, we evaluate the marginal loglikelihood scores of all models in the singlecell data and select the best model based on the loglikelihood scores. In each model we also assign each cell to a cell type according to its posterior probabilities of cell types given the scRNASeq data.
Fitting singlecell gene expression data with probabilistic graphical models constructed from forward deconvolution outcomes
The major highlight contrasting backward deconvolution with canonical forward deconvolution approaches is to employ the deconvolution outcomes of the bulklevel data to fit the singlecell gene expression data. Here we give a brief preview of this approach and provide details of each step in subsequent sections. We assume both singlecell and bulklevel RNASeq data are generated from the same process which can be represented by a probabilistic graphical model with a hierarchical structure. Samples in the data are drawn from several subtypes (e.g., different tissue types or cancer subtypes). Each sample constitutes cells belonging to several cell types where the cell type composition depends on the sample subtype. Cells of each type possess a specific expression signature of selected marker genes. Furthermore, the marker genes are categorized into several groups where members of each gene group possess similar expression patterns across cell types. Singlecell RNASeq data are noisy measurements of the marker gene expressions of individual representative cells from the process. Bulklevel RNASeq data are measurements of the marker gene expressions of mixtures of the representative cells.
More precisely, denote \(t, s, \pi , \gamma , x\) random variables of sample identities, sample subtypes, cell types, gene group labels, and individual gene expressions, respectively. The probabilistic graphical model constitutes two families of parameters: \(P\left(\pi s\right)\)â€™s specify the conditional probabilities of cell types given sample subtypes, and \(P(x\) \(\gamma ,\pi )\)â€™s specify the conditional probabilities of marker gene expressions given gene group labels and cell types. A complete model should also include prior probabilities \(P(s)\) and \(P(\gamma )\). These priors are discarded here as the sample subtypes are determined by the sample identities \(t\) and the gene group labels are determined by gene identities. All these variables have subscript indices denoting individual genes, cells or bulk samples. The joint likelihood of observing the scRNASeq data becomes:
where indices \(i, j, l\) are over samples, cells and genes respectively. Evaluation of the joint likelihood resembles the sampling process and can be concisely represented by a plate notation illustrated in Fig.Â 1B. Multiplications of the terms over the three indices (\(l, j, i\)) are represented as nested boxes inside out. The term \(P\left({s}_{i}{t}_{i}\right)\) is deterministic as the subtype of each sample is unique and known.
Forward deconvolution can be viewed as inference of these model parameters from the bulklevel (and singlecell) RNASeq data. Complete methods infer \(Q\) and \(P\) from the bulklevel data, and we can derive \(P(x\) \(\gamma ,\pi )\) and \(P\left(\pi s\right)\) accordingly. Incomplete methods infer \(P\) from the bulklevel data and reference singlecell data, and we can derive \(P\left(\pi s\right)\) according to \(P\) and directly construct \(P(x\) \(\gamma ,\pi )\) from the reference singlecell data. Once these parameters are decided, we plug them into Eq.Â 2 and evaluate the likelihood score of the scRNASeq data.
In some applications cell type labels are unobserved. For instance, in tumor data typically normal cells are annotated but cancer cells are not since there are few standard ways to delineate cancer cell subtypes. To cope with this scenario we evaluate the marginal likelihood function over possible cell type labels:
The log (marginal) likelihood score quantifies the goodness of fit of a deconvolution model to the scRNASeq data but does not take model complexity into account. We add a regularization term to the loglikelihood value and evaluate the Bayesian Information Criterion (BIC) score [30]:
\(N\) is the number of cells in the scRNASeq data and \(D\) is the degree of freedom in the model. \(D\) is determined by the number of independent entries in \(P\left(\pi s\right)\) and \(P(x\) \(\gamma ,\pi )\).
The backward deconvolution framework applies several existing forward deconvolution algorithms to the bulklevel RNASeq data and uses the BIC scores (Eq.Â 4) to measure the goodness of the inferred parameters to fit the scRNASeq data. Despite the counterintuitive inference direction from bulklevel to singlecell data, this approach has several advantages. The framework can be treated as an ensemble learning method if we want to deconvolve bulklevel data by combining multiple forward deconvolution algorithms, or a model selection criterion if we want to compare the performance of multiple deconvolution methods. It also offers a more robust way to integrate both bulklevel and singlecell expression data since it allows multiple assumptions about the relations between bulklevel and singlecell data and can tolerate noisy and missing entries as well as lack of annotations in singlecell data.
Generating reference signature matrices and distributions from the singlecell data
A reference signature matrix specifies the mean expression value of each marker gene in each cell type. A reference distribution specifies the expression distribution of each marker gene group in each cell type. Both are derived from singlecell or bulklevel data. In this section, we describe the procedures of deriving these quantities from the singlecell data if the scRNASeq data have cell type annotations and reliable quality.
For incomplete deconvolution algorithms, the reference signature matrix \({Q}_{ref}\) (\(Q\) in Eq.Â 1) is directly obtained from the scRNASeq data. Denote \(X\) an \(n\times {N}_{c}\) matrix of scRNASeq data with \(n\) genes and \({N}_{c}\) cells, and \(\tau\) a \(1\times {N}_{c}\) vector of \(k\) cell type annotations of the \({N}_{c}\) cells. The reference signature matrix \({Q}_{ref}\) is an \(n\times k\) matrix where \({Q}_{ref}\left(i,j\right)=\frac{1}{{N}_{j}}\sum_{\{l:\tau \left(l\right)=j\}}{X}_{il}\) is the average expression of gene \(i\) over type \(j\) cells in \(X\) (\({N}_{j}\) is the number of type \(j\) cells).
\({Q}_{ref}\) collapses the expression levels of a gene over multiple cells of the same cell type into one mean value. A more precise quantification of singlecell gene expression values is to infer their distributions. To simplify the model of singlecell gene expressions, we make two explicit assumptions. First, marker genes in the singlecell data are subdivided into groups. Second, the normalized expressions of marker genes in the same group are drawn from the same distribution. Both assumptions are valid as marker genes are selected according to the criteria that they are expressed in specific sample types or cell types. These assumptions enable us to estimate the expression value distributions of a small number of gene groups rather than all the marker genes separately. The reference signature distribution \({GP}_{ref}\) is an \({n}_{g}\times k\times I\) tensor for \({n}_{g}\) gene groups, \(k\) cell types and \(I\) intervals of gene expression values. \({GP}_{ref}(i,j,l)\) specifies the probability that the expression values of gene group \(i\) in cell type \(j\) fall in the \({l}^{th}\) interval. It is inadequate to directly estimate \({GP}_{ref}\) from \(X\) because marker genes in the same group may have quite different scales of expression levels. To make the expression levels of all marker genes comparable, we ranktransformed the expression values of each gene and normalized the ranks into cumulative distribution function (cdf) values. In the normalized scRNASeq data matrix \({X}_{cdf}\), all entries take values in \([\text{0,1}]\) and the orders of entry values in each row are preserved from \(X\). \({GP}_{ref}(i,j,:)\) specifies the probability mass function of normalized expression values of gene group \(i\) and cell type \(j\) in the interval \([\text{0,1}]\). We subdivided \([\text{0,1}]\) into \(I\) intervals, identified genes \({\Lambda }_{i}\) belonging to group \(i\) and cells \({S}_{j}\) belonging to type \(j\), and collected the corresponding \({X}_{cdf}\) entries \({X}_{cdf}({\Lambda }_{i}, {S}_{j})\). \({GP}_{ref}(i,j,:)\) was obtained from \({X}_{cdf}({\Lambda }_{i}, {S}_{j})\) by kernel density estimation. To avoid minus infinity values of loglikelihood scores in the subsequent steps, all entries in \({GP}_{ref}\) need to be positive. We replaced zero entries in \({GP}_{ref}(i,j,:)\) with a small value \(\epsilon\) and renormalized \({GP}_{ref}(i,j,:)\) to make them sum to 1 (\(\epsilon =0.001\) in our analysis).
Constructing reference signature matrices and distributions from the bulklevel data
When the scRNASeq data are absent, unannotated or of poor quality (such as the breast cancer and LGG data used in the present study), we have to construct the reference signature matrices and distributions from the bulklevel data alone. Complete deconvolution algorithms infer the signature matrix \(Q\) and the mixture coefficients \(P\) from the bulklevel data \(E\) (Eq.Â 1). The reference signature matrix \({Q}_{ref}\) is the inferred signature matrix \(Q\), yet the reference signature distributions \({GP}_{ref}\) cannot be obtained from the deconvolution outcomes. For incomplete deconvolution methods, both \({Q}_{ref}\) and \({GP}_{ref}\) need to be constructed from the bulklevel data before launching deconvolution. To construct \({Q}_{ref}\) and \({GP}_{ref}\) for incomplete methods we have to impose stronger hypotheses about the relations between the expression patterns of bulklevel sample subtypes and cell types. Below we describe the procedure of constructing the reference signature matrix \({Q}_{ref}\) and distribution \({GP}_{ref}\) of three simple models from the bulklevel RNASeq data of breast cancer or lowgrade glioma.
Any model specifying the relations between sample subtypes and cell types has to calculate each entry in \({Q}_{ref}\) and \({GP}_{ref}\) from a subset of samples in the bulklevel data. To establish the bases of all possible models we derived three quantities from the bulklevel data: (1) partition of the normalized bulklevel data into a grid of gene groups and sample subtypes, (2) the mean expression value of each grid component, (3) quantization of the grid component mean expression matrix. Denote an \(n\times m\) matrix \(E\) the bulklevel RNASeq data of \(n\) genes and \(m\) samples. The first step is to normalize \(E\) into a matrix \({E}_{cdf}\) by evaluating the cdf values of each row of \(E\). This step is identical to the construction of \({X}_{cdf}\) from \(X\). The second step is to partition \({E}_{cdf}\) into a grid of \({n}_{g}\) gene groups (row partition) and \({n}_{s}\) sample subtypes (column partition). Denote a \(1\times n\) vector \(\Gamma\) the gene group labels and a \(1\times m\) vector \(\Sigma\) the sample subtype labels, and \({\Lambda }_{i}\equiv \{g:{\Gamma }_{g}=i\}\) and \({\text{S}}_{j}\equiv \{s:{\Sigma }_{s}=j\}\) the gene group \(i\) members and sample subtype \(j\) members respectively. Each component \({E}_{cdf}({\Lambda }_{i},{\text{S}}_{j})\) of the grid contains the entries of \({E}_{cdf}\) belonging to gene group \(i\) and sample subtype \(j\). The breast cancer bulklevel data comprises three gene groups and four sample subtypes (basal, Her2enriched, luminal A and luminal B), and the LGG data comprises three gene groups and three sample subtypes (Idh1 mutation with chromosome 1p/19q codeletion, Idh1 mutation without the codeletion, and wild type). The third step is to construct a gridlevel expression data by taking the average of \({E}_{cdf}\) entries in each grid component: denote \(G\) an \({n}_{g}\times {n}_{s}\) matrix and \({G}_{ij}\equiv \frac{1}{{\Lambda }_{i}{S}_{j}}{\sum }_{a\in {\Lambda }_{i},b\in {S}_{j}}{E}_{cdf}(a,b)\). The fourth step is to quantize \(G\) to an \({n}_{g}\times {n}_{s}\) matrix \(C\) of trinary values indicating whether each grid component is up/down regulated or neither. For each gene group \(1\le i\le {n}_{g}\), we extracted the \({E}_{cdf}\) entries belonging to gene group \(i\): \({E}_{cdf}\left({\Lambda }_{i},:\right)\equiv \left\{{E}_{cdf}\left(a,b\right):a\in {\Lambda }_{i}, 1\le b\le m\right\}\), and calculated their mean \({\mu }_{{g}_{i}}\) and standard deviation \({\sigma }_{{g}_{i}}\). \({C}_{ij}=+1\) if \({G}_{ij}\ge {\mu }_{{g}_{i}}+{\sigma }_{{g}_{i}}\), \({C}_{ij}=1\) if \({G}_{ij}\le {\mu }_{{g}_{i}}{\sigma }_{{g}_{i}}\), and \({C}_{ij}=0\) otherwise. This quantization specifies whether the mean value of each grid component significantly deviates from the global mean value of the same gene group in positive or negative directions.
We then constructed the three models about the gene expression patterns underlying the unobserved cell types. These models are illustrative examples of simple hypotheses about the relations between sample subtypes and cell types but by no means an exhaustive list of possible gene expression patterns of cell types. \({M}_{1}\) stipulates that tumors of each subtype is dominated by one unique cell type, hence the gene expression profiles of cell types resemble the bulklevel data. For breast cancer data, \({M}_{1}\) assumes that the bulk samples of each subtype (basallike, Her2enriched, luminal A, luminal B) are dominated by one cell type. Hence the reference signature matrix and distribution of a cell type are estimated from the bulklevel data of the corresponding cancer subtype. The bulklevel expression data (METABRIC) are treated as the reference scRNASeq data to construct \({Q}_{ref}\) and \({GP}_{ref}\) (Fig.Â 4A, left panel). Entries in the signature matrix \({Q}_{ref}\) are the average of the corresponding entries of (gene, sample subtype) combinations in \(E\): \({Q}_{ref}\left(i,j\right)=\frac{1}{ {\text{S}}_{j}}{\sum }_{b\in {\text{S}}_{j}}E(i,b)\). Entries in \({GP}_{ref}\) are estimated by the corresponding entries of (gene group, sample subtype) combinations in \({E}_{cdf}\). For gene group \(i\) and sample subtype \(j\), we extracted the \({E}_{cdf}\) entries \({E}_{cdf}({\Lambda }_{i},{S}_{j})\equiv \left\{{E}_{cdf}\left(a,b\right):a\in {\Lambda }_{i}, b\in {S}_{j}\right\}\). We then applied kernel density estimation to \({E}_{cdf}({\Lambda }_{i},{S}_{j})\) and assessed the probability mass function \({p}_{M}(x)\) on intervals \(0:\frac{1}{I}:1\). \({p}_{M}\) is a \(1\times I\) vector and \({p}_{M}(x)\equiv \text{Pr}(\frac{x1}{I}\le y\le \frac{x}{I})\) for \(y\in {E}_{cdf}({\Lambda }_{i},{S}_{j})\). To avoid zero probability values in \({GP}_{ref}\), we replaced zero values in \({p}_{M}\) with a small but nonzero value \(\epsilon\) (\(\epsilon =0.001\) in our analysis) and renormalized \({p}_{M}\) to make the components sum to 1. Finally, we substituted the vector \({p}_{M}\) in \({GP}_{ref}(i,j,:)\). This estimation treats the bulklevel data as the singlecell data and equates sample subtypes and cell types. Hence the assumption of \({M}_{1}\) holds.
\({M}_{2}\) stipulates that each cell type has high expressions in one gene group and low expressions in other gene groups. Therefore, we identified the bulklevel grid components corresponding to up or down regulation and assigned them to proper positions to assess \({Q}_{ref}\) and \({GP}_{ref}\). For breast cancer, \({M}_{2}\) assumes that there are three cell types. Each cell type has high expression values of one marker gene group and low expressions of the other two marker gene groups, and the three marker gene groups are enriched with cell cycle control, immune response, and estrogen response respectively. We then quantized the grid of bulklevel expression data of (gene groups, sample subtypes) into trinary values. To estimate the reference signature matrix and distribution of a marker gene in a gene group (for instance, cell cycle control) in the corresponding cell type (the cell type with high expressions of cell cycle control genes), we solicited the sample subtypes with high expressions of the marker gene group (basallike, Her2enriched, and luminal B) and estimated \({Q}_{ref}\) and \({GP}_{ref}\) from the selected bulk samples (Fig.Â 4A, left panel). Similarly, \({Q}_{ref}\) and \({GP}_{ref}\) of a cell cycle gene in a cell type with low expression of the cell cycle gene group are estimated from the bulk samples with low expressions of the cell cycle gene group (luminal A samples). For a gene group \(i\), we identified two subsets of sample subtypes where the quantized grids \(C\) hadâ€‰+â€‰1 and âˆ’â€‰1 values: \({\text{H}}_{i}^{+}\equiv \{j:C\left(i,j\right)=1\}\), \({\text{H}}_{i}^{}\equiv \{j:C\left(i,j\right)=1\}\). We then identified the samples whose subtypes belonged to \({\text{H}}_{i}^{+}\) and \({\text{H}}_{i}^{}\) respectively: \({S}_{i}^{+}\equiv \left\{s:{\Sigma }_{s}\in {\text{H}}_{i}^{+}\right\}, {S}_{i}^{}\equiv \left\{s:{\Sigma }_{s}\in {\text{H}}_{i}^{}\right\}.\) For each gene \(g\in {\Lambda }_{i}\) and sample subtype \(j\), \({Q}_{ref}\left(g,j\right)=\frac{1}{{S}_{j}^{+}}{\sum }_{\{s\in {S}_{j}^{+}\}}E(g,s)\) if \(j=i\), and \({Q}_{ref}\left(g,j\right)=\frac{1}{{S}_{j}^{}}{\sum }_{\{s\in {S}_{j}^{}\}}E(g,s)\) if \(j\ne i\). In other words, \({Q}_{ref}\left(g,j\right)\) is the average of gene \(g\) expressions over the upregulated entries for cell type \(i\) and the average of gene \(g\) expressions over the downregulated entries for other cell types. Similarly, the distribution \({GP}_{ref}(i,j=i,:)\) was estimated from the entries of gene group \(i\) (\({\Lambda }_{i}\)) and the upregulated samples \({S}_{i}^{+}\): \({E}_{cdf}\left({\Lambda }_{i},{S}_{i}^{+}\right):\{{E}_{cdf}\left(a,b\right):a\in {\Lambda }_{i}, b\in {S}_{i}^{+}\}\), and \({GP}_{ref}(i,j\ne i,:)\) was estimated from \({\Lambda }_{i}\) and the downregulated samples \({S}_{i}^{}\): \({E}_{cdf}\left({\Lambda }_{i},{S}_{i}^{}\right):\{{E}_{cdf}\left(a,b\right):a\in {\Lambda }_{i}, b\in {S}_{i}^{}\}\). Therefore, \({GP}_{ref}(i,j=i,:)\) assigns high probability mass to high expression values and \({GP}_{ref}(i,j\ne i,:)\) assigns high probability mass to low expression values, which meets the assumption of \({M}_{2}\).
\({M}_{3}\) serves as a negative control of \({M}_{1}\) as the two models have the same number of cell types and \({M}_{3}\) is obtained from \({M}_{1}\) by rearranging rows and columns to maximize the difference. In breast cancer data, we permuted entries in each row of \(G\) independently and exhausted all \({24}^{3}=13824\) permutations. Each permutation \(\psi\) induced a matrix \({G}_{\psi }\). We then exhausted all \(24\) column permutations of \({G}_{\psi }\) andÂ found the best alignment with \(G\). The resulting grid matrix \(\widehat{G}\) yields the maxâ€“min \({L}_{2}\)norm difference from \(G\):
\(\psi\) denotes a combination of independent permutations of entries in each row, \(\phi\) denotes a column permutation, and \(\phi \circ \psi\) denotes a composition of independent row entry permutations followed by a column permutation. The optimal permutation \(\widehat{\phi }\circ \widehat{\psi }\) assigns a grid component in \(G\) to another grid component in \(\widehat{G}\). Therefore, we redefined \({S}_{j}\) for sample subtype \(j\) and \({S}_{i}^{+}\) and \({S}_{i}^{}\) for gene group \(i\) according to \(\widehat{\phi }\circ \widehat{\psi }\) and recalculated \({Q}_{ref}\) and \({GP}_{ref}\) following the procedure for constructing \({M}_{1}\).
In LGG data, we found that \(\widehat{G}\) yielded one column of low expressions for all gene groups. This column (cell type) can fit many cells with sparse nonzero entries, hence will distort the loglikelihood scores and make \({M}_{3}\) more favorable. To circumvent this distortion, we manually reassigned grids in \(G\) to form \(\widehat{G}\) such that each cell type consisted of at least one upregulated gene group and one downregulated gene group. The \({M}_{3}\) signature matrix of the LGG data is displayed in Supplementary file 3: Figure S3A.
Deconvolving bulklevel gene expression data
A (forward) deconvolution algorithm factorizes a bulklevel gene expression matrix \(E\) into the product of the cell type signature matrix \(Q\) and the sample mixture coefficient matrix \(P\) (Eq.Â 1). Incomplete algorithms require \(Q\) as explicitly given or derived from an external source. We used \({Q}_{ref}\) generated from Sect.Â "Generating reference signature matrices and distributions from the singlecell data" or "Constructing reference signature matrices and distributions from the bulklevel data". Complete algorithms return both \(Q\) and \(P\). Here we selected nine deconvolution algorithms: DeconRNASeq [12], lsfit [31], DWLS [14], NMF [3], two versions of deconf (original and fast) [32, 33], bMIND [34], RADs [35], and Scaden [36]. Scaden was a supervised deep learning algorithm that required labels of scRNASeq data, hence was not applicable for our cancer datasets due to lack of cancer cell type annotations. The first and last three are incomplete methods and the remaining three are complete methods. An R package CellMix [33] includes deconf and lsfit implementations, and the remaining algorithms have their own R or Python packages. The complete methods differ in their respective cost functions (e.g. Euclidean distance between the target matrix and the NMF estimate in deconf original and Kullbackâ€“Leibler divergence in Brunetâ€™s NMF), their algorithms (multiplicative or least squares based), stopping criteria, and the ways the nonnegativity and scaling constraints are enforced onto the signature and the mixture coefficient matrices. Thus, two algorithms (Brunetâ€™s NMF and deconf) represent two significantly different methods while the two versions of deconf represent the class of similar methods. All the incomplete methods are variations of nonnegative least squares optimization algorithm and have different approaches to estimating cell proportions from signatures. This combination of both similar and dissimilar methods can offer additional insights into the performance of the backward deconvolution framework.
Constructing probabilistic graphical models of singlecell gene expressions
As mentioned in Sect.Â "Fitting singlecell gene expression data with probabilistic graphical models constructed from forward deconvolution outcomes", the scRNASeq data generation process is represented as a probabilistic graphical model with two families of conditional probabilities \(P\left(\pi s\right)\) and \(P(x\) \(\gamma ,\pi )\). We propose a procedure to construct \(P\left(\pi s\right)\) and \(P(x\) \(\gamma ,\pi )\) from the forward deconvolution outcome based on the following assumptions: (1) expressions from the same gene group and cell type are drawn from the same underlying distribution, (2) samples of the same subtype possess similar cell type compositions, (3) each reference or inferred signature vector (a column in \(Q\)) can be viewed as the expression profile of a virtual bulk sample; after rescaling the reference signature value of each gene is in the expression value range of the same gene in the bulklevel data, (4) the expression patterns in the bulklevel and singlecell RNASeq data are preserved after conversion into cdf values. Assumptions 1 and 2 simplify the models by collapsing the parameters pertaining to members of a gene group and a sample subtype. They are sensible if adequate marker genes of each cell type or subtype are selected from data. Assumption 3 ensures we can reasonably infer \(P(x\) \(\gamma ,\pi )\) by comparing rescaled \(Q\) values with bulklevel data values. It is sensible since bulklevel samples typically have small variations in the \({L}_{2}\)norms of their expression profiles. Assumption 4 ensures we can employ the models inferred from the bulklevel data to calculate the loglikelihood scores of singlecell data even though the two datasets may have very different scales. It is valid since the expression patterns are largely preserved after rank transformation [37].
\(P\left(\pi s\right)\) is directly estimated from the mixture coefficient matrix \(P\). Denote \({S}_{i}\) the bulk samples belonging to subtype \(i\), then \(P\left(\pi =js=i\right)=\frac{{\sum }_{l\in {S}_{i}}{P}_{jl}}{{\sum }_{{j}{\prime}=1k,l\in {S}_{i}}{P}_{j{\prime}l}}\).
Complete algorithms report signature matrices \(Q\), and incomplete algorithms take the reference signature matrices as inputs (\({Q=Q}_{ref}\)).
The signature matrix \(Q\) reports the average expression value of each gene in each cell type. \(P(x\) \(\gamma ,\pi )\) reports the distribution of expression values conditioned on a gene group and cell type. Direct estimation of \(P(x\) \(\gamma ,\pi )\) from \(Q\) is not stable due to the small number of entries to assess cdf values. Each gene in \(Q\) has only a small number (the number of cell types) of expression values. Hence the rank transform of rows in \(Q\) gives a very crude quantization of signature matrix values. To mitigate this problem, we calculated the cdf values of \(Q\) entries in terms of a much larger pool \(E\) (the bulklevel RNASeq data) rather than \(Q\) itself. In other words, the cdf value of an entry \({Q}_{ij}\) was calculated by comparing \({Q}_{ij}\) with the entries in the \({i}^{\text{th}}\) row of \(E\), rather than the entries in the \({i}^{\text{th}}\) row of \(Q\). The ranktransformed matrix of \(Q\) was used to estimate \(P(x\) \(\gamma ,\pi )\). However, entries in \(Q\) and \(E\) are not necessarily comparable since \(Q\) may be derived from the singlecell RNASeq data which often has a very different scale than \(E\). We adopted assumption (3) in Sect.Â "Constructing probabilistic graphical models of singlecell gene expressions" to rescale each \(Q\) column by the median column norm of \(E\), and ranktransformed the rescaled \(Q\) into cdf values. The following procedure was executed.

1.
Rescaled each column of \(Q\) separately to make the signature matrix entries have comparable values as the bulk expression data \(E\). Calculated the \({L}_{2}\)norm of each column in \(E\) and denoted them \(Z=\{{z}_{1},\cdots ,{z}_{m}\}\), and the \({L}_{2}\)norm of each column in \(Q\) and denoted them \(Y=\{{y}_{1},\cdots ,{y}_{k}\}\). Denoted the median of \(Z\) as \(\overline{z }\). Rescaled column \(j\) of \(Q\) by multiplying it by a factor \({r}_{j}=\frac{\overline{z}}{{y }_{j}}\): \({\widehat{Q}}_{*,j}={r}_{j}\cdot {Q}_{*,j}\). Each column of the rescaled signature matrix \(\widehat{Q}\) had an identical \({L}_{2}\)norm \(\overline{z }\) as median \({L}_{2}\)norm over all bulklevel samples, implying that the cell type gene expression profiles were comparable to the bulk sample gene expression profiles after rescaling.

2.
Normalized the rescaled signature matrix into cdf values. For each entry \({\widehat{Q}}_{ij}\) in \(\widehat{Q}\) (gene \(i\) and cell type \(j\)), found row \(i\) in \(E\) (expressions of gene \(i\) in all bulk samples) and denoted it as \({E}_{i,*}\). Calculated the cdf value of \({\widehat{Q}}_{ij}\) in \({E}_{i,*}\) as the fraction of \({E}_{i,*}\) entries with values \(\le {\widehat{Q}}_{ij}\), and denoted its value as \({W}_{ij}\). Repeated the same procedure for all genes and prototypes and completed the signature cdf value matrix \(W\). \(W\) entry values range in \([\text{0,1}]\) and have dimension \(n\times k\).

3.
Estimated \(P(x\) \(\gamma ,\pi )\) from \(W\). Each column in \(W\) denoted the normalized expression profile of a cell type. For each \(\gamma =i\) and \(\pi =j\), solicited the \(W\) entries for gene group \(i\) and cell type \(j\). \(P(x\) \(\gamma ,\pi )\) was the density estimate of the selected entries. Similar to prior procedures, we subdivided \([\text{0,1}]\) into \(I\) equal intervals \(0:\frac{1}{I}:1\) and applied kernel density estimation to calculate the probability value of each interval. We also replaced zeros in \(P(x\) \(\gamma ,\pi )\) with a small value \(\epsilon\) and renormalized the entries to make them legitimate conditional probabilities.
The input \({Q}_{ref}\) of each incomplete algorithm is often accompanied with a reference signature distribution \({GP}_{ref}\). We also use \({GP}_{ref}(\gamma ,\pi ,:)\) as a more precise model of \(P(x\) \(\gamma ,\pi )\). Consequently, each incomplete algorithm reports two types of \(P(x\) \(\gamma ,\pi )\): \({P}_{Q}(x\) \(\gamma ,\pi )\) derived from \(Q\) and \({P}_{{GP}_{ref}}(x\) \(\gamma ,\pi )\) derived from \({GP}_{ref}\), while each complete algorithm reports only \({P}_{Q}(x\) \(\gamma ,\pi )\).
Comparing the deconvolution algorithms in fitting scRNASeq data
Once the parameters of \(P\left(\pi s\right)\) and \(P(x\) \(\gamma ,\pi )\) were inferred from the deconvolution outcomes of bulklevel RNASeq data, we substituted them into Eqs.Â 2â€“4 to evaluate the joint likelihood, marginal likelihood, and BIC score of the scRNASeq data respectively. For each incomplete method we had two sets of estimated \(P(x\) \(\gamma ,\pi )\) tables (\({P}_{Q}(x\) \(\gamma ,\pi )\) and \({P}_{{GP}_{ref}}(x\) \(\gamma ,\pi )\)) to evaluate the likelihood and BIC scores.
The degree of freedom \(D\) in the BIC score (Eq.Â 4) is determined by the number of independent entries in \(P\left(\pi s\right)\) and \(P(x\) \(\gamma ,\pi )\). For instance, in breast cancer data there are 4 sample subtypes (\(s\) values), 3 gene groups (\(\gamma\) values), and 10 intervals of gene expression values (\(x\) values). The two models described in Sect.Â "Constructing reference signature matrices and distributions from the bulklevel data" have 4 and 3 cell types (\(\pi\) values) respectively. Therefore, model 1 has \(4\times 3+3\times 4\times 9=120\) free parameters, and model 2 has \(4\times 2+3\times 3\times 9=89\) free parameters. Complexity penalty is not needed when all models in a dataset have the same degree of freedom.
Singlecell RNASeq data is filled with many zero entries. Abundant zero entries implicate their dominant contribution to the BIC scores. We propose a simple criterion for inclusion or exclusion of zero entries in loglikelihood score computation according to the concentration or depletion of zero entries in the grids of (gene group, sample subtype) combinations. The procedure and decision for zero entry inclusion/exclusion of the datasets are reported in Sect.Â "Handling zero entries in calculating loglikelihood scores of the scRNASeq data".
For incomplete methods in annotated scRNASeq data, backward deconvolution uses the singlecell data to both infer the models and evaluate the loglikelihood scores. To avoid double usage of the singlecell data, we split scRNASeq data into the sets for constructing the reference signature matrix and distribution (training data) and evaluating the loglikelihood score of the model (test data). In our experiments we assigned the same number of cells to the training and test data.
We selected the deconvolution model that yielded the highest BIC score on the singlecell data and reported \(Q, P, P\left(\pi s\right), P(x\gamma ,\pi )\), and the cell type assignments in the singlecell data. To infer the type of a cellÂ \(j\), we calculated the posterior likelihood score conditioned on each possible cell type:
where \(s\left(j\right)\) denotes the sample subtype of cell \(j\), index \(l\) is over all genes, and \({x}_{jl}\) denotes the normalized expression of gene \(l\) in cell \(j\). Cell \(j\) is assigned to the cell type of the highest posterior likelihood.
Handling zero entries in calculating loglikelihood scores of the scRNASeq data
Correctly handling the missing values in the scRNASeq data is crucial to ensure the reliability of downstream analyses. There are two general approaches handling dropout (zero) entries in the scRNASeq data. The first approach imputes the dropout entries. Numerous imputation techniques based on clustering, deep learning algorithms, or fitting various statistical models underlying the observed expression values have been proposed [38,39,40]. The imputation approach infers zero entries with information of nonzero entries hence suffers from two shortcomings: the imputed values are based on specific assumptions of the data, and the information of zero entries is discarded. The second approach treats missing values as informative biological signals, hence uses them for inferring relevant information for downstream analyses. It was shown that the distribution of dropout entries can be used for cell type identification and trajectory inference [41, 42], feature selection tasks [43], data projections [44], and others. In line with the second approach, we proposed a simple criterion for including or excluding zero entries in computing the loglikelihood scores of a scRNASeq dataset. Intuitively, if zero entries are strongly enriched or depleted in specific (gene group, cell type) combinations, then they likely reflect low or high values in the gene expression signatures hence should be included in loglikelihood calculation. Conversely, if zero entries are not strongly enriched in specific (gene group, cell type) combinations, then they likely reflect random noise due to dropouts hence should be discarded. This intuitive criterion is translated into the following procedure. First, the scRNASeq data was subdivided into a grid of gene groups and cell types. If cell types were not annotated, then cells were subdivided by the subtypes of their bulk samples. Second, in each grid component we counted the number of zero entries. Third, we randomly permuted cells in the data 10,000 times and counted the number of zero entries in each grid component and each random permutation. Fourth, in each grid we calculated the mean and standard deviation of zero entry counts over 10,000 random permutations. Fifth, we counted the grid components whose zero entry counts were above or below the confidence interval of six standard deviations from the mean in the permuted data. Sixth, if the number of grid components with enriched or depleted zero entries constituted a substantial fraction of the total number of grid components (0.25 in our study), then we included zero entries in computing loglikelihood scores. Otherwise we excluded zero entries in computing loglikelihood scores.
Supplementary file 4: Table S1 reports the numbers of grids with enriched or depleted zero entries in each scRNASeq dataset. All but the two breast cancer datasets contain more than 25% of grids with enriched or depleted zero entries. Hence we excluded zero entries in loglikelihood score computation for the two breast cancer datasets and included zero entries for all other datasets.
Selecting marker genes and cells from four singlecell and bulklevel datasets
We validated the backward deconvolution framework in five transcriptomic datasets: (1) the mouse scRNASeq data of 5 cell types and artificial mixtures of these data as the virtual bulklevel data, (2) the mouse bulklevel and singlecell RNASeq data of 4 tissue types and the 9 constituting cell types, (3) one bulklevel and one singlecell RNASeq datasets of human brain regions from Autism Spectrum Disorder (ASD) subjects and normal controls, (4) one bulklevel and two singlecell RNASeq datasets of breast cancer, (5) one bulklevel and one singlecell RNASeq datasets of lowgrade gliomas (LGG). Below we describe the procedure of selecting marker genes of each dataset.
Insilico mixture of mouse scRNASeq data
We generated bulk level insilico mixtures from singlecell RNASeq data of a mouse gene expression database [45]. We selected cells from five different cell typesâ€”oligodendrocytes, T cells, lung endothelial cells, hepatocytes, and fibroblast cellsâ€”along with the genes that were differentially expressed in those cell types, which we denoted as marker genes. These cell types were chosen primarily because their numbers of cells were relatively abundant and their expression patterns were relatively distinct. A marker gene of a cell type meets two criteria: (1) it has nonzero expressions in at least 75% of the cells of the target cell type and in at least 60 cells of each of the remaining cell types, and (2) it has a pvalueâ€‰<â€‰0.05 for the onetailed unpaired ttest between the target cell type and each of the remaining cell types. In total, we selected 226, 163, 92, 84, and 78 marker genes (for a total of 643) and 713, 375, 324, 196, and 1,082 cells (for a total of 2,690) for each cell type, respectively.
Mouse bulklevel and singlecell RNASeq data
The bulklevel and singlecell RNASeq data from the Tabula Muris Senis database were used [46]. We selected four tissue types: fat, heart, limb, and liver, and solicited cells from the same tissue types in the mouse singlecell RNASeq data. The four tissue types constituted nine cell types: endothelial cells, fibroblast/mesenchymal cells, epithelial cells, immune cells, smooth muscle cells, skeletal cells, endocardial cells, Schwann cells and hepatocyte cells. The diverse family of immune cells (e.g., T cells, B cells, macrophages, etc.) was collapsed into one cell type in order to reduce the number of cell types and hence simplify deconvolution. Fibroblast and mesenchymal cells were also collapsed into one cell type as they had very similar expression profiles.
For each gene, we extracted the cells of each type and calculated the mean and standard deviation of their scRNASeq data in the nine cell types. The signaltonoise ratio between cell types \(i\) and \(j\) was the difference of their mean values \({m}_{i}\) and \({m}_{j}\) normalized by their standard deviations \({\sigma }_{i}\) and \({\sigma }_{j}\): \({SNR}_{ij}=\frac{2({m}_{i}{m}_{j})}{({\sigma }_{i}+{\sigma }_{j})}.\) A gene was selected as a marker gene of cell type \(i\) if its \({SNR}_{ij}\ge 0.5\) for all \(j\ne i\). There were totally 4439 marker genes and 33,777 cells in the singlecell RNASeq.
ASD bulklevel and singlecell RNASeq data
The bulklevel and singlecell RNASeq data of brain regions from ASD patients and normal controls were from distinct sources. The scRNASeq data [47] comprises 104,559 cells from ASD and normal subjects of two brain regions: anterior cingulate cortex (ACC) and prefrontal cortex (PFC). Together these cells belong to 15 simplified cell types: INPV, INSST, INSV2C, INVIP, L2/3, L4, L5/6, L5/6CC, NeuNRGN, Neumat, Microglia, Astrocytes (AST), Endothelial, Oligodendrocyte Progenitor Cells (OPC), and mature Oligodendrocytes. Two types of astrocytes were combined into one astrocyte cell type, and two types of NeuNRGN cells were combined into one NeuNRGN cell type. The first 10 cell types are neuron cells and the remaining 5 are supporting cells. The bulklevel RNASeq data [48] comprises 104 samples from ASD and normal subjects of three brain regions: Brodmann areas 10 (decision making), 19 (vision processing) and 44 (motor aspect of speech).
We identified the marker genes with differential expressions in each of the 15 cell types in the scRNASeq data. Velmeshev et al. [47] reported the differentially expressed genes between diagnostic phenotypes (ASD vs. normal), brain regions and individuals. We collected these genes as the candidates for the marker genes. For each candidate gene, we calculated the mean expression values over the cells of each type and sorted the cell types accordingly. The overwhelming score of a gene was the mean expression value of the top cell type minus the sum of mean expression values of all other cell types. We selected the genes whose overwhelming scores \(\ge 0\). If a cell type had less than 20 marker genes according to this condition, we then relaxed the criterion and selected the topranking genes according to their overwhelming scores. 500 marker genes were selected accordingly.
The brain regions covered in the singlecell and bulklevel data are not identical. To make likelihood evaluation feasible we have to establish a mapping between bulklevel sample subtypes and singlecell sample subtypes. There are four sample subtypes in the scRNASeq data (two phenotypes multiply two brain regions) and six sample subtypes in the bulklevel data (two phenotypes multiply three brain regions). We exhausted all 6 possible assignments from three brain areas in the bulklevel data to two brain areas in the singlecell data. For each possible assignment, we evaluated the \({\mathcal{L}}_{2}\) scores of 6 incomplete methods, and identified the assignment which gave the highest \({\mathcal{L}}_{2}\) scores in most of the methods. The assignment ACC \(\to\) BA19, PFC \(\to\) BA44 was chosen accordingly.
Breast cancer bulklevel and singlecell RNASeq data
We extended the PAM50 genes [49] with a procedure described in Tiong et al. 2022 [50]. The PAM50 genes were divided into three groups by kmeans clustering on the METABRIC gene expression data. The three groups were enriched in cellular functions related to cell cycle control, immune responses, and estrogen receptors respectively. To extend the PAM50 gene list, we calculated correlation coefficients of other genes in TCGABRCA and METABRIC data with the PAM50 genes, and averaged the correlation coefficients over the members of each PAM50 group. Candidate genes were sorted by the maximum of grouplevel average correlation coefficients in each dataset (TCGABRCA and METABRIC) separately. A total of 200 genes were selected (intersection of genes sorted by the maximum grouplevel average correlation coefficients from both datasets), and further filtered down to 127 genes after selecting for genes havingâ€‰>â€‰70% valid entry in the single cell data.
The breast cancer scRNASeq datasets contain noncancer cells and cancer cells with sparse valid (nonzero) entries. We cleaned the data by considering only cancer cells with \(\ge \frac{1}{3}\) of the valid entries among the marker genes since the proportions of normal cells in singlecell RNASeq data were unlikely those from tumors. There were 281 and 12,019 selected cancer cells in the two singlecell RNASeq datasets respectively. In addition, Supplementary file 4: Table S1 indicates that the breast cancer scRNASeq datasets have more disperse distributions of zero entries. Therefore, zero entries were discarded when evaluating the loglikelihood scores. Penalty terms of model complexity in Eq.Â 4 were added since the models had different numbers of parameters.
To provide more direct evidence supporting superiority of \({M}_{1}\) to fit the breast cancer scRNASeq data, we compared the clustering outcomes of an independent breast cancer scRNASeq dataset with those of two virtual scRNASeq datasets simulated from \({M}_{1}\) and \({M}_{2}\). We downloaded the GSE161529 data comprising 305,157 tumor cells from 45 patients [51]. Due to hardware limitations, a total of 50,000 cells (cells with the most nonzero entries) from 33 patients were initially selected for our analysis. Seurat preprocessing pipeline [52] further filtered out the cells to the final number of 14,004 cells, which were subsequently used for clustering and tSNE visualization. The virtual scRNASeq data of 40 patients (10 patients for each subtype, 1000 cells for each patient) were generated by sampling from the probabilistic graphical models (Eq.Â 2). \(P(x\) \(\gamma ,\pi )\)â€™s were inferred from the \({M}_{1}\) or \({M}_{2}\) signature matrix, and \(P(\pi \) \(s)\)â€™s were estimated from the METABRIC bulklevel data by DWLS. The simulated cells were also clustered using Seurat.
Lowgrade glioma bulklevel and singlecell RNASeq data
Marker genes for glioma were derived based on our previous observation [50, 53] that 3 LGG subtypes: IDHmutant with chromosome 1p19q codeletion (CoDel), IDHmutant with no chromosome 1p19q codeletion (NoCoDel), and IDH wildtype (WT) possess highly expressed genes in three gene groups. Corresponding MSigDB gene sets of the enriched cellular functions were tested for subtypespecific differential expression (ttest) in TCGA LGG gene expression data, further narrowed down to genes havingâ€‰>â€‰70% valid entry in the single cell data, resulting in a final list of 61 genes.
Introducing variations to artificial mixtures bulklevel data
One dataset in our analysis is insilico mixture of mouse scRNASeq data: the virtual bulklevel data was computationally generated by sampling and mixing the experimental singlecell data. A straightforward process of generating the insilico bulklevel data is to fix the mixture coefficients and exert no additional noise beyond the experimental singlecell data. However, since the mouse scRNASeq data was relatively clean, nearly all deconvolution algorithms we picked successfully decomposed the virtual data generated from this process. To test the capacity of the deconvolution algorithms, we introduced two types of variations from the aforementioned artificial mixtures: fluctuations of the mixture coefficients vector and additive noise to the virtual bulk expression data. Denote \(\overrightarrow{\mu }\equiv ({\mu }_{1},\cdots ,{\mu }_{5})\) the ideal mixture coefficient vector of a sample subtype. Each vector has five components indicating the mixture coefficient of five cell types. We created variations of the mixture coefficient vectors from \(\overrightarrow{\mu }\) by sampling the mixture coefficient vectors \(\overrightarrow{\lambda }\) from a Dirichlet distribution:
Parameters \(\overrightarrow{\mu }\) and \(\eta\) specify the mean mixture coefficient vector and scale of concentration around the mean, and \(Z\left(\overrightarrow{\mu },\eta \right)\) is a normalization constant. A large \(\eta\) makes \(\overrightarrow{\lambda }\) narrowly concentrated around the mean, and a small \(\eta\) makes \(\overrightarrow{\lambda }\) disperse in a wide range.
After \(\overrightarrow{\lambda }\) was sampled from Eq.Â 7 we sampled \((\left[{\lambda }_{1}N\right],\cdots ,\left[{\lambda }_{5}N\right])\) cells from the scRNASeq data with replacement and calculated the average expression profile of the marker genes over the sampled cells. Denote \(\overrightarrow{x}\) this average expression profile vector. We then introduced nonnegative noise to the virtual bulk sample by augmenting \(\overrightarrow{x}\) with a random value sampled from a gamma distribution. The noisy expression of gene \(i\) in the virtual bulk sample is:
\({\epsilon }_{i}\) was sampled from a gamma distribution with shape parameter \(\alpha =0.1{x}_{i}\), rate parameter \(\beta =0.1\) and mean \(\frac{\alpha }{\beta }={x}_{i}\), and the noise \({\epsilon }_{i}\) was diminished by a factor \(\delta\). \(f(i,\overrightarrow{\mu })\) is an indicator function of the condition that gene \(i\) is a marker gene of the dominant cell type(s) of the ideal mixture coefficients vector \(\overrightarrow{\mu }\). This noise is more challenging for deconvolution algorithms as it lifts the expression values of nontarget cell types only and hence blurs the differential expressions between target and nontarget cell types. The level of noise was controlled by the diminishing factor \(\delta\). We considered two \(\eta\) values 50, 3 and two \(\delta\) values 0, 0.7, hence generated 4 sets of artificially mixed bulk data. The data of \(\eta =50, \delta =0\) has no fluctuation of mixture coefficients vectors and no artificial noise of expression data, hence should be accurately deconvolved by most reasonable algorithms. The data of \(\eta =3, \delta =0.7\) has the highest levels of mixture coefficient fluctuation and gene expression noise, hence will yield different results from different deconvolution algorithms.
Results
We validated backward deconvolution using nine forward deconvolution algorithms (Sect.Â "Deconvolving bulklevel gene expression data") and four singlecell and bulklevel transcriptomic datasets (Sect.Â "Selecting marker genes and cells from four singlecell and bulklevel datasets"). In the two datasets of mouse tissues and the dataset of ASD brains, the cell types were annotated and the accuracy rates of cell type assignments were calculated. Consequently, we demonstrated validity of backward deconvolution by showing that the loglikelihood scores of nine deconvolution algorithms were correlated with several existing accuracy metrics. In the two datasets of human tumors, only normal cells were annotated with cell types (stromal cells, T cells, etc.), but cancer cells were not further annotated with refined cell types. Since our analysis focused on cancer cells, the accuracy rates of cell type assignments were inaccessible. Therefore, we adopted both indirect and direct approaches to validate backward deconvolution in cancer data. We proposed three simple hypotheses specifying the relation between tumor subtypes (which were annotated) and cancer cell types (which were not annotated), employed these hypotheses to construct the reference signature matrices and distributions, and compared their BIC scores of nine deconvolution algorithms. Intriguingly, one hypothesis was persistently superior across the deconvolution algorithms and datasets. Furthermore, we clustered and visualized an independent breast cancer scRNASeq dataset and compared the clustering results with those of two virtual datasets simulated from two models. The virtual data simulated from the superior model of the indirect approach also better resembled the real scRNASeq data compared to an alternative model. The results didnâ€™t truly substantiate the favorable hypothesis but at least indicated it better fit the singlecell data in our analysis.
Insilico mixture of mouse scRNASeq data
We downloaded the singlecell RNASeq data of the Tabula Muris database [45], selected 2690 cells from five cell typesâ€”oligodendrocytes, T cells, lung endothelial cells, hepatocytes, and fibroblast cellsâ€”and 643 marker genes which were differentially expressed in one cell type (see Sect.Â "Selecting marker genes and cells from four singlecell and bulklevel datasets" for the criteria of selecting marker genes and cells). We first constructed virtual bulklevel data by sampling and mixing the scRNASeq data in silico. Because both cell type annotations in the singlecell data and mixture coefficients of bulklevel data were available, we could directly relate accuracies in these two aspects with the loglikelihood scores derived from backward deconvolution.
We considered 16 mixture coefficient vectors (subtypes) of the 5 cell types (Supplementary file 1: Figure S1A, topleft panel). Five vectors had one dominant cell type (90%) and equal proportions for other cell types. Ten vectors had two dominant cell types (45% each). One vector had an equal proportion of each cell type. Each subtype constituted 100 bulk samples, and each sample was mixed from \(N=1000\) randomly selected cells of the scRNASeq data. We introduced two types of variations to the artificial mixtures: fluctuations of the mixture coefficient vector (parametrized by scale of concentration \(\eta\)) and additive noise to the virtual bulk expression data (parametrized by noise diminishing factor \(\delta\)). The procedure of introducing variations to artificial mixtures is described in Sect.Â "Introducing variations to artificial mixtures bulklevel data". We considered two \(\eta\) values 50, 3 and two \(\delta\) values 0, 0.7, hence generated 4 sets of artificially mixed bulk data.
We calculated the loglikelihood scores for nine deconvolution algorithms. For each incomplete method, we calculated two loglikelihood scores of the models derived from \(Q\) (\({\mathcal{L}}_{1}\)) and \({GP}_{ref}\) (\({\mathcal{L}}_{2}\)). For each complete method, we calculated \({\mathcal{L}}_{1}\) only.
Table 1 reports the loglikelihood scores for nine deconvolution algorithms on the four insilico mixture datasets. We also report the sums of \({L}_{2}\)norms of the differences between the true and inferred mixture coefficient vectors, and those between the conditional probabilities \(P(x\) \(\gamma ,\pi )\) constructed from the reference signature distribution (\({P}_{{GP}_{ref}}(x\) \(\gamma ,\pi )\)) and from the signature matrix (\({P}_{Q}(x\) \(\gamma ,\pi )\)). These two metrics reflect deviations from the ground truth in two aspects of decomposition (mixture coefficients and signature matrices/distributions).
Two observations are salient. First, all \({\mathcal{L}}_{2}\) scores are considerably higher than all \({\mathcal{L}}_{1}\) scores, indicating superiority of \({GP}_{ref}\) to \(Q\) in capturing the distribution of expression patterns. Second, for \({\mathcal{L}}_{1}\) scores incomplete methods are generally superior to two complete methods. However, deconf_original is a complete method but has the best \({\mathcal{L}}_{1}\) in three datasets (50,0), (3,0), (3,0.7), and NMF is a complete method but has the best \({\mathcal{L}}_{1}\) in one dataset (50,0.7). The six incomplete methods have similar \({\mathcal{L}}_{1}\) scores which are lower than the best complete method but higher than the remaining two complete methods.
Since the loglikelihood scores were jointly determined by both mixture coefficients and signature matrices (or distributions), it may be misleading to directly correlate the loglikelihood scores with each aspect of decomposition. It is more sensible to correlate the \({\mathcal{L}}_{1}\) (and \({\mathcal{L}}_{2}\)) scores with the differences of mixture coefficient vectors among the six incomplete methods, and with the differences of \(P(x\) \(\gamma ,\pi )\) terms among the three complete methods, since the incomplete methods have identical \(P(x\) \(\gamma ,\pi )\) terms. \({\mathcal{L}}_{1}\) scores are strongly anticorrelated with \(P(x\) \(\gamma ,\pi )\) errors among complete methods. In contrast, correlations of the log likelihood scores and the mixture coefficient vector errors among incomplete methods are less clear. By removing the outlier values of bMIND, \({\mathcal{L}}_{2}\) scores are strongly anticorrelated with the mixture coefficient vector errors. Yet \({\mathcal{L}}_{1}\) scores are positively correlated with the mixture coefficient vector errors in the two datasets with \(\delta =0.7\), indicating that \({\mathcal{L}}_{1}\) scores are sensitive to the additive noise.
Supplementary file 1: Figure S1AB displays the mixture coefficients of the ground truth and six incomplete methods in two virtual bulk datasets ((50,0), (3,0.7)). In both datasets, DWLS has among the top three \({\mathcal{L}}_{1}\) scores and the lowest deviation from the true mixture coefficients. Supplementary file 1: S1CD displays the signature matrices of three complete methods in the same virtual datasets. In both datasets, deconf_original has the highest \({\mathcal{L}}_{1}\) scores and the lowest deviation from the true \(P(x\) \(\gamma ,\pi )\) distribution among the complete deconvolution methods.
Mouse bulklevel and singlecell RNASeq data
Beyond insilico mixtures, we also assessed how backward deconvolution gauged the performance of deconvolution algorithms on true mouse bulklevel data. We downloaded and processed the mouse bulklevel and singlecell RNASeq data from Tabula Muris Senis [46], selected 4439 marker genes, 164 bulk samples from four tissue types (fat, heart, limb, and liver), and 33,777 cells from nine cell types (endothelial cells, fibroblast/mesenchymal cells, epithelial cells, immune cells, smooth muscle cells, skeletal muscle cells, endocardial cells, Schwann cells, and hepatocyte cells). The procedure of selecting and processing the data is reported in Sect.Â "Selecting marker genes and cells from four singlecell and bulklevel datasets".
Table 2 reports the loglikelihood scores for nine deconvolution algorithms on the subset of the mouse scRNASeq data. Similar to the insilico mixtures, each incomplete method has a superior \({\mathcal{L}}_{2}\) than \({\mathcal{L}}_{1}\). However, unlike insilico mixtures the six incomplete methods yield superior \({\mathcal{L}}_{1}\) scores than most of three complete deconvolution algorithms (with one exception that NMF (a complete method) is superior to lsfit (an incomplete method)). Scaden, DWLS and RADs are the best in terms of both \({\mathcal{L}}_{1}\) and \({\mathcal{L}}_{2}\) scores.
The sample mixture coefficients in the true bulklevel data are unknown. Instead of comparing inferred mixture coefficients with ground truth, we predicted cell types in the test scRNASeq data according to each model (Eq.Â 6) and calculated their accuracy rates. Intriguingly, the accuracy rates for both \(Q\) and \({GP}_{ref}\) models are highly correlated with \({\mathcal{L}}_{1}\) and \({\mathcal{L}}_{2}\). Moreover, Scaden, DWLS and RADs possess the highest \({\mathcal{L}}_{1}\) and \({\mathcal{L}}_{2}\) as well as the best accuracy rates for both \(Q\) and \({GP}_{ref}\) models. FigureÂ 2 visualizes the bulklevel and singlecell data and the true and predicted cell types from each model. The \({GP}_{ref}\) models of Scaden, DWLS and RADs offer near 83% accurate predictions on cell types, indicating that these three methods are indeed the best algorithms in this dataset.
The strong correlations between the loglikelihood scores and various error metrics inspired us to examine the relations of these quantities in a semiformal way. Assume the scRNASeq data were generated by the aforementioned model. Denote \({\theta }^{*}\equiv ({P}_{{\theta }^{*}}\left(\pi s\right),{P}_{{\theta }^{*}}\left(x\gamma ,\pi \right))\) the true parameter values of the model, and \(\theta \equiv ({P}_{\theta }\left(\pi s\right),{P}_{\theta }\left(x\gamma ,\pi \right))\) the undetermined parameter values which are formulated as random variables in a Bayesian framework. Assume an infinite amount of data were generated from \({\theta }^{*}\), then the likelihood score asymptotically approximates the KL divergence between \({\theta }^{*}\) and \(\theta\):
We then derive the approximation error of the bulklevel data deconvolution. EquationÂ 1 gives the true deconvolution outcomes. Suppose \(\widehat{Q}\) and \(\widehat{P}\) are the approximated signature and mixture coefficients matrices respectively, then the reconstructed bulklevel gene expression data becomes:
Each bulklevel expression vector is the average of a collection of singlecell expression vectors sampled from \(P\left({\varvec{x}}{\theta }^{*}\right)\). If \(P\left({\varvec{x}}{\theta }^{*}\right)\) is given, then \(Q\) is obtained from the mean sampled from \(P(x\gamma ,\pi )\), and \(P\) is \(P(\pi s)\). Hence \(Q\cdot P\) is the mean of the expression data sampled from \(P\left({\varvec{x}}{\theta }^{*}\right)\). The approximation error then becomes the variance of the expression values.
If the true distribution \(P\left({\varvec{x}}{\theta }^{*}\right)\) is not given, but instead the parameters \(\theta\) are estimated from finite data, then the approximation error becomes:
The first term indicates the variance from the true distribution, and the second term indicates the bias between the true and estimated distributions.
Finally, the errors on \(\widehat{Q}\) and \(\widehat{P}\) are approximated as:
For incomplete methods, in a relaxed condition \({P}_{{\theta }^{*}}\left(x\gamma ,\pi \right)\) is given from the reference signature matrix or distribution. There is a close relation between the KL divergence (Eq.Â 9) and square error (Eq.Â 14) of the mixture coefficients pertaining to \(\theta\) and \({\theta }^{*}\). For complete methods such as NMF, \({D}_{KL}\left({\theta }^{*}\parallel \theta \right)\) has the composite effect of both \(Q\) and \(P\). Hence the relation between the KL divergence and each of the approximation error terms (Eqs.Â 13 and 14) is less obvious.
Bulklevel and singlecell brain RNASeq data of autism spectrum disorder (ASD) subjects
The two datasets in Sects.Â "Insilico mixture of mouse scRNASeq data""Mouse bulklevel and singlecell RNASeq data" are considered as simple problems for deconvolution as the cells and bulk samples from the selected tissue types possess quite distinct gene expression patterns among the selected marker genes. To justify the utility of backward deconvolution in more challenging scenarios, we downloaded and processed bulklevel and singlecell RNASeq data of brains of distinct regions from ASD patients and normal subjects. This is a more challenging dataset as the differences of cell type composition and gene expression signatures are much subtler between samples of different brain regions or diagnosis than between samples of different organs. Here we employed backward deconvolution to demonstrate that human brains exhibited different cell type compositions between ASD patients and normal controls as well as between distinct brain regions.
FigureÂ 3 displays the ASD data of 500 selected marker genes. Cells belonging to the four sample subtypes (controlACC, controlPFC, ASDACC, ASDPFC) comprise 15 cell types including various types of neuron cells and supporting cells such as microglia and astrocytes. The four sample subtypes possess different cell type compositions in scRNASeq data by visual inspection (Fig.Â 3A), yet these differences can be attributed to biases in sampling cells from the prepared tissues. The bulklevel data look even less informative than the scRNASeq data (Fig.Â 3B). To answer the aforementioned question (whether brains possess different cell type compositions between phenotypes and/or brain regions), we applied the backward deconvolution framework to the dataset and compared the inferred mixture coefficients of the topranking forward deconvolution methods.
Table 3A reports the loglikelihood scores of nine deconvolution algorithms. Similar to the results on other datasets, the incomplete methods have superior likelihood scores than the complete methods, \({\mathcal{L}}_{2}\) scores are superior to \({\mathcal{L}}_{1}\) scores, and the cell type prediction accuracy rates are highly correlated with the log likelihood scores. The three best methods according to both \({\mathcal{L}}_{2}\) and cell type prediction accuracy rates are Scaden, RAD, and DWLS.
We then examined the inferred \(P(\pi s)\) tables of the three best methods (TableÂ 3B). Intriguingly, for two of the top three methods (Scaden and DWLS) ASD samples have higher fractions of astrocytes than control samples conditioned on the brain regions. In other words, \(P\left(\pi =\text{astrocyte}s=\text{ASD}, \text{ACC}\right)>P\left(\pi =\text{astrocyte}s=\text{control},\text{ ACC}\right)\), and \(P\left(\pi =\text{astrocyte}s=\text{ASD}, \text{PFC}\right)>P\left(\pi =\text{astrocyte}s=\text{control}, \text{PFC}\right)\). In addition, ASD samples have lower fractions of NRGNneurons than control samples conditioned on both brain regions. Finally, ACC samples have higher fractions of oligodendrocytes than PFC samples conditioned on the phenotypes (ASD or control).
Breast cancer bulklevel and singlecell RNASeq data
Breast cancers are classified into four subtypes [54]: basal, Her2enriched, luminal A and luminal B. We expanded the wellknown PAM50 genes [49] to 127 genes and categorized them into three gene groups (Sect.Â "Selecting marker genes and cells from four singlecell and bulklevel datasets" and [50]). The bulklevel samples of the four subtypes possess distinct combinatorial expression patterns on the three gene groups (Fig.Â 4A, left panel). However, it is unclear whether the combinatorial expression pattern of each PAM50 subtype is dominated by one cell type or attributed to a mixture of several common cell types.
To answer this question, we employed backward deconvolution to one bulklevel breast cancer RNASeq dataset (METABRIC, [55]) and two singlecell RNASeq datasets (GSE75688, [56]; GSE176078, [57]), and focused on cancer cells in the two scRNASeq datasets. We termed GSE75688 and GSE176078 small and large datasets as they comprised 281 and 12,019 cancer cells respectively after processing (Sect.Â "Selecting marker genes and cells from four singlecell and bulklevel datasets"). For each incomplete method, we proposed three simple hypotheses about the expression patterns of the cell types underlying the PAM50 subtypes. Hypothesis 1 (\({M}_{1}\)) stipulates that the tumors of each subtype are dominated by one cell type (Fig.Â 5A, panel 1). \({M}_{2}\) stipulates that tumors of each subtype are mixtures of three cancer cell types which have elevated expressions in one gene group each (Fig.Â 5A, panel 2). \({M}_{3}\) serves as a negative control of \({M}_{1}\) by rearranging the sample subtypes and gene groups from the bulklevel data to maximize the difference from \({M}_{1}\) (Fig.Â 5A, panel 3). We checked whether certain models consistently outperformed other models in the two scRNASeq datasets.
FigureÂ 4A displays marker gene expressions on three datasets. Combinatorial expression patterns of PAM50 subtypes are salient in METABRIC and deteriorated in singlecell datasets. FigureÂ 4B displays the tSNE projections of samples in the three datasets. METABRIC samples of the four subtypes are clearly separated, but cells in the two singlecell datasets are clustered primarily by patient identities (annotations not shown) rather than PAM50 subtypes.
Table 4 reports the 21 \({\mathcal{L}}_{1}\) scores and 15 \({\mathcal{L}}_{2}\) scores on breast cancer data, and Fig.Â 5 visualizes the signature matrices and inferred mixture coefficients of the bulklevel data. Similar to the mouse data, each incomplete method has a superior \({\mathcal{L}}_{2}\) than \({\mathcal{L}}_{1}\) in all three datasets. BIC scores on bulklevel data serve as sanity check as the true model of the expression patterns of sample subtypes (\({M}_{1}\)) is given. Indeed, for each incomplete method both \({\mathcal{L}}_{1}\) and \({\mathcal{L}}_{2}\) scores follow the order \(M_{1} > M_{2} > M_{3}\), and for two complete methods the model of \(K = 4\) outperforms the model of \(K = 3\), but their scores are inferior to those of incomplete methods. A complete method NMF is the only anomaly, as it has the \({\mathcal{L}}_{1}\) score comparable to the \({\mathcal{L}}_{1}\) scores of the best incomplete methods, and the model of \(K = 3\) is superior to the model of \(K = 4\).
The \({\mathcal{L}}_{1}\) scores on scRNASeq datasets are largely congruent with expectation. \(M_{1}\) has superior scores than \(M_{2}\) and \(M_{3}\) for all incomplete methods, and complete methods are generally inferior to the \(M_{1}\) scores of incomplete methods (except NMF). The \({\mathcal{L}}_{2}\) scores on scRNASeq datasets are also compatible with expectation. On the small scRNASeq data, the \({\mathcal{L}}_{2}\) scores of each incomplete method follow the order \(M_{1} > M_{3} > M_{2}\). On the large scRNASeq data, they follow the order \(M_{1} > M_{2} > M_{3}\).
The superior log likelihood scores of \(M_{1}\) offer indirect evidence supporting the strength of \(M_{1}\) to fit the data. To provide direct evidence supporting the strength of \(M_{1}\), we found another independent breast cancer scRNASeq dataset [57], clustered the cells and annotated their PAM50 subtypes, and then compared the clustering outcomes with those of two virtual scRNASeq datasets simulated from \(M_{1}\) and \(M_{2}\). FigureÂ 6 visualizes the clustering outcomes of the real breast cancer scRNASeq data (Fig.Â 6A) and those of the two virtual datasets simulated from \(M_{1}\) and \(M_{2}\) (Fig.Â 6B). Supplementary file 5: Table S2 reports the confusion tables of clustering outcomes of the real breast cancer scRNASeq data (Table S2A) and those of the \(M_{1}\) and \(M_{2}\) simulated data (Table S2B and S2C). Intriguingly, the \(M_{1}\) data resembles the real data more closely than the \(M_{2}\) data. In the real scRNASeq data, cells are clustered primarily by their PAM50 subtypes. This clustering pattern is nearly reproduced in the \(M_{1}\) data. In contrast, in the \(M_{2}\) data cells are clustered by the three hidden cell types rather than their PAM50 labels. Consequently, the model \(M_{1}\) better describes the breast cancer scRNASeq data than an alternative model \(M_{2}\).
Lowgrade glioma bulklevel and singlecell RNASeq data
Lowgrade glioma (LGG) patients in The Cancer Genome Atlas (TCGA) data were classified into three subtypes according to the mutation states of Idh1 gene and chromosome 1p/19q codeletion [58]: Idh1 mutation with or without codeletion and wild type. We identified 61 marker genes and labeled them to three gene groups enriched with neuron development, cell cycle, and immune response, respectively.
Supplementary file 2: Figure S2A displays expressions of the marker genes on LGG bulklevel [58] and singlecell [59] datasets, and Supplementary file 2: Figure S2B displays the tSNE projections of samples in the two datasets. Similar to the breast cancer data, it is difficult to discern the underlying cell types from the gene expression visualization and tSNE projections alone.
We considered the same three hypotheses \(M_{1}  M_{3}\) and checked which hypothesis better fit the scRNASeq data. Supplementary file 6: Table S3 reports the 18 \({\mathcal{L}}_{1}\) scores and 15 \({\mathcal{L}}_{2}\) scores on one bulklevel and one singlecell LGG data, and Supplementary file 3: Figure S3 visualizes the signature matrices and inferred mixture coefficients of the bulklevel data. Similar to Sects.Â "Insilico mixture of mouse scRNASeq data""Breast cancer bulklevel and singlecell RNASeq data", the \({\mathcal{L}}_{2}\) scores are higher than the \({\mathcal{L}}_{1}\) scores of all models in both datasets. \(M_{2}\) has the best \({\mathcal{L}}_{1}\) for each incomplete method, and the three complete methods have superior \({\mathcal{L}}_{1}\) than incomplete methods. In contrast, in the bulklevel data the \({\mathcal{L}}_{2}\) scores of incomplete methods follow the order \(M_{1} > M_{2} > M_{3}\). In the singlecell data the \({\mathcal{L}}_{2}\) scores of incomplete methods follow the order \(M_{1} > M_{3} > M_{2}\). Consequently, the \(M_{1}\) models derived from \(GP_{ref}\) yield the best BIC scores.
Discussion
We propose a backward deconvolution framework to infer cell type gene expression signatures and compositions by integrating both bulklevel and singlecell RNASeq data. It has several unique benefits. First, it compares and selects a decomposition model from multiple candidates rather than sticking to one particular decovolution algorithm and/or hypothesis. Second, it handles the scRNASeq data with highlevel noise, abundant zero entries, and no cell type annotations by constructing the reference signature matrix and distribution from bulklevel data with stronger hypotheses. Third, the loglikelihood scores provide a common metric for the joint effect of signature matrices (or distributions) and mixture coefficients in fitting the scRNASeq data. Fourth, the loglikelihood scores can be evaluated without knowing bulk sample mixture coefficients or singlecell annotations, hence can be applied in a wider range of datasets.
Several important discoveries are drawn from the analysis of five datasets. First, there is no universally superior deconvolution algorithm over all datasets, as each dataset has its best performing algorithm. Nevertheless, overall three incomplete deconvolution algorithmsâ€”DWLS, RADs and Scadenâ€”tend to be superior to other algorithms in most datasets. Second, in the mouse data where the single cell annotations and/or bulk sample mixtures are provided, the loglikelihood scores of nine deconvolution methods are largely compatible with the deviations of mixture coefficients, gene expression conditional probabilities, or cell type assignments from the ground truth. Third, in the human brain data ASD samples tend to possess higher fractions of astrocytes and lower fractions of NRGNexpressing neurons than control samples. The first observation was reported in the study of the ASD scRNASeq data [47], and both observations were manifested in both bulklevel and singlecell data. Fourth, in the cancer data with no singlecell annotations and abundant zero entries, the model that tumors of each subtype are dominated by one cell type (\(M_{1}\)) outperforms an alternative model that each cell type possesses elevated expressions on one gene group and low expressions on the remaining gene groups (\(M_{2}\)). Moreover, in an independent breast cancer scRNASeq dataset, cells were clustered primarily by their sample subtypes (PAM50 subtypes). By comparing with the scRNASeq data simulated by the two hypothetical models, we found that the clustering patterns of the real data resembled \(M_{1}\) the most. The results are not definitive since we have not tested \(M_{1}\) against many alternative models. Nevertheless, superiority of \(M_{1}\) to \(M_{2}\) has supporting evidence from prior studies. Tumors of the four breast cancer subtypes have similar expression profiles as the cell types in normal breast epithelium. It is thus widely hypothesized that the four breast cancer subtypes may arise from distinct normal cell types [60] or mutations or genetic rearrangements occurring in different populations of stem cells and progenitor cells [61]. Tumors of the three LGG subtypes are likely derived from the subclones arising from Idh1 mutations and chromosome 1p/19q codeletion events [62]. Therefore, cancer cells of a tumor subtype likely inherit the expression signatures from their tissues of origin or founder cells, and are relatively homogeneous. Heterogeneity is present primarily in the interactions between cancer cells and different types of normal cells such as multiple families of immune cells, stromal cells, fibroblasts, and others [63]. By contrast, even though a tumor may comprise multiple subclones, cancer cells of these subclones are likely derived from the same cell type. Thus the cancer cells from the same tumor subtypes may share the common expression patterns on the marker genes. This postulation by no means claims that cancer cells are homogeneous. Rather, we think homogeneity/heterogeneity is relative to the examined features (gene expressions). Expressions of cancer cells from multiple subclones are likely heterogeneous in the genes involved in the molecular alterations segregating these subclones (sequence mutations, copy number variations, structural variations, etc.), but homogeneous among the marker genes selected from bulklevel data analysis. Fifth, all the \({\mathcal{L}}_{2}\) scores are superior to all the \({\mathcal{L}}_{1}\) scores, and \({\mathcal{L}}_{2}\) often better matches anticipation than \({\mathcal{L}}_{1}\). This suggests that \(Q\) is less reliable to estimate \(P(x\gamma ,\pi )\) compared to \(GP_{ref}\). \(Q\) collapses the entries of each gene in the cells of each type into one number by taking an average, but \(GP_{ref}\) retains the entries of all the cells of each type. Hence the latter estimates \(P(x\gamma ,\pi )\) from far more entries than the former and is more accurate.
The analysis of each dataset possesses some customized procedures. Most of these procedures pertain to selection of marker genes, gene groups, sample subtypes and cell types in the data. These procedures facilitate deconvolution operations and make the results more interpretable, but are strictly speaking not part of the backward deconvolution framework. These variables are treated as given in the framework. Users interested in applying the backward deconvolution programs into their data can ignore our customized procedures and directly provide sample subtypes, cell types, and gene group labels of their data.
Several open problems remain in the present study. When scRNASeq data have poor quality or no annotations, the models underlying signature matrices and distributions are manually constructed from the bulklevel data. Manual construction is preferable currently as we aim to compare a few simple and interpretable hypotheses about cancer cell type heterogeneity. However, in the long run it is desirable to have an algorithm capable of generating simple and sensible hypotheses for backward deconvolution. Although the loglikelihood scores combine the joint effects of mixture coefficients and cell type specific gene expression patterns, the downside is that these two effects are entangled. Better statistical methods are required to disentangle the contributions of the two factors. In the current formulation of marginal likelihood function (Eq.Â 3), the effect of \(P(x\) \(\gamma ,\pi )\) often outweighs that of \(P(\pi \) \(s)\) because in each cell the former term multiplies over all marker genes yet the latter term appears only once. Hence differences in estimated mixture coefficients are likely overwhelmed by estimated cell type specific gene expression distributions. Similar problems arise in topic models of natural language processing, and several techniques have been proposed to correct the asymmetric contributions [64, 65]. We plan to adopt some of these methods in the future development of backward deconvolution. Albeit we proposed a probabilistic graphical model in generating the scRNASeq data, we did not adopt a fully Bayesian approach to evaluate the likelihood scores. Instead of integrating over the possible conditional probability \(P\left( {\pi {}s} \right)\) and \(P(x\) \(\gamma ,\pi )\) values, we estimated their values from the bulk data deconvolution outcomes and plugged the estimated values into the likelihood function. A fully Bayesian approach is conceivable if we introduce proper prior distributions of \(P\left( {\pi {}s} \right)\) and \(P(x\) \(\gamma ,\pi )\) and employ standard Bayesian inference methods to evaluate the marginal likelihood scores over both cell type values and parameter values.
Availability of data and materials
All data generated or analyzed during this study are included in this published article and its supplementary information files.
Availability of source codes
We have implemented the backward deconvolution algorithm in R, and deposited the source codes and their description on github.com/chyeang/backwarddeconvolution/.
References
Avila Cobos F, AlquiciraHernandez J, Powell JE, Mestdagh P, De Preter K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat Commun. 2020;11(1):5650.
Avila Cobos F, Vandesompele J, Mestdagh P, De Preter K. Computational deconvolution of transcriptomics data from mixed cell populations. Bioinformatics. 2018;34(11):1969â€“79.
Brunet JP, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A. 2004;101(12):4164â€“9.
Zaitsev K, Bambouskova M, Swain A, Artyomov MN. Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures. Nat Commun. 2019;10(1):2209.
Jaakkola MK, Elo LL. Computational deconvolution to estimate cell typespecific gene expression from bulk data. NAR Genom Bioinform. 2021;3(1):lqaa110.
Mohammadi S, Zuckerman N, Goldsmith A, Grama A. A critical survey of deconvolution methods for separating cell types in complex tissues. Proc IEEE. 2017;105(2):340â€“66.
Chiu YJ, Hsieh YH, Huang YH. Improved cell composition deconvolution method of bulk gene expression profiles to quantify subsets of immune cells. BMC Med Genomics. 2019;12(Suppl 8):169.
Wang N, Hoffman EP, Chen L, Chen L, Zhang Z, Liu C, Yu G, Herrington DM, Clarke R, Wang Y. Mathematical modelling of transcriptional heterogeneity identifies novel markers and subpopulations in complex tissues. Sci Rep. 2016;6:18909.
Roy S, Lane T, Allen C, Aragon AD, WernerWashburne M. A hiddenstate Markov model for cell population deconvolution. J Comput Biol. 2006;13(10):1749â€“74.
Zhu L, Lei J, Devlin B, Roeder K. A unified statistical framework for single cell and bulk RNA sequencing data. Ann Appl Stat. 2018;12(1):609â€“32.
Zinovyev A, Kairov U, Karpenyuk T, Ramanculov E. Blind source separation methods for deconvolution of complex signals in cancer biology. Biochem Biophys Res Commun. 2013;430(3):1182â€“7.
Gong T, Szustakowski JD. DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNASeq data. Bioinformatics. 2013;29(8):1083â€“5.
Zhong Y, Wan YW, Pang K, Chow LM, Liu Z. Digital sorting of complex tissues for cell typespecific gene expression profiles. BMC Bioinformatics. 2013;14:89.
Tsoucas D, Dong R, Chen H, Zhu Q, Guo G, Yuan GC. Accurate estimation of celltype composition from gene expression data. Nat Commun. 2019;10(1):2975.
Wang X, Park J, Susztak K, Zhang NR, Li M. Bulk tissue cell type deconvolution with multisubject singlecell expression reference. Nat Commun. 2019;10(1):380.
Jew B, Alvarez M, Rahmani E, Miao Z, Ko A, Garske KM, Sul JH, Pietilainen KH, Pajukanta P, Halperin E. Accurate estimation of cell composition in bulk expression through robust integration of singlecell information. Nat Commun. 2020;11(1):1971.
Dong M, Thennavan A, Urrutia E, Li Y, Perou CM, Zou F, Jiang Y. SCDC: bulk gene expression deconvolution by multiple singlecell RNA sequencing references. Brief Bioinform. 2021;22(1):416â€“27.
ErdmannPham DD, Fischer J, Hong J, Song YS. Likelihoodbased deconvolution of bulk gene expression data using singlecell references. Genome Res. 2021;31(10):1794â€“806.
Jin H, Liu Z. A benchmark for RNAseq deconvolution analysis under dynamic testing environments. Genome Biol. 2021;22(1):102.
Sutton GJ, Poppe D, Simmons RK, Walsh K, Nawaz U, Lister R, GagnonBartsch JA, Voineagu I. Comprehensive evaluation of deconvolution methods for human brain gene expression. Nat Commun. 2022;13(1):1358.
Cancer Genome Atlas Research N. Integrated genomic analyses of ovarian carcinoma. Nature. 2011;474(7353):609â€“15.
Cancer Genome Atlas N. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61â€“70.
Cancer Genome Atlas Research N. Comprehensive molecular characterization of gastric adenocarcinoma. Nature. 2014;513(7517):202â€“9.
Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of singlecell RNAseq data. Nat Rev Genet. 2019;20(5):273â€“82.
Christensen E, Luo P, Turinsky A, Husic M, Mahalanabis A, Naidas A, DiazMejia JJ, Brudno M, Pugh T, Ramani A, et al. Evaluation of singlecell RNAseq labelling algorithms using cancer datasets. Brief Bioinform. 2023;24(1):bbac561.
Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3(4â€“5):993â€“1022.
duVerle DA, Yotsukura S, Nomura S, Aburatani H, Tsuda K. Cell Tree: an R/bioconductor package to infer the hierarchical structure of cell populations from singlecell RNAseq data. BMC Bioinformatics. 2016;17(1):363.
Wu XT, Wu H, Wu ZJ. Penalized latent dirichlet allocation model in singlecell RNA sequencing. Stat Biosci. 2021;13(3):543â€“62.
Yang Q, Xu Z, Zhou W, Wang P, Jiang Q, Juan L. An interpretable singlecell RNA sequencing data clustering method based on latent Dirichlet allocation. Brief Bioinform. 2023;24(4):199.
Schwarz G. Estimating dimension of a model. Ann Stat. 1978;6(2):461â€“4.
Abbas AR, Wolslegel K, Seshasayee D, Modrusan Z, Clark HF. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS ONE. 2009;4(7): e6098.
Repsilber D, Kern S, Telaar A, Walzl G, Black GF, Selbig J, Parida SK, Kaufmann SH, Jacobsen M. Biomarker discovery in heterogeneous tissue samples taking the insilico deconfounding approach. BMC Bioinformatics. 2010;11:27.
Gaujoux R, Seoighe C. Cell Mix: a comprehensive toolbox for gene expression deconvolution. Bioinformatics. 2013;29(17):2211â€“2.
Wang J, Roeder K, Devlin B. Bayesian estimation of cell typespecific gene expression with prior derived from singlecell data. Genome Res. 2021;31(10):1807â€“18.
Lei H, Guo XA, Tao Y, Ding K, Fu X, Oesterreich S, Lee AV, Schwartz R. Semideconvolution of bulk and singlecell RNAseq data with application to metastatic progression in breast cancer. Bioinformatics. 2022;38(Suppl 1):i386â€“94.
Menden K, Marouf M, Oller S, Dalmia A, Magruder DS, Kloiber K, Heutink P, Bonn S. Deep learningbased cell composition analysis from tissue expression profiles. Sci Adv. 2020;6(30):eaba2619.
Zwiener I, Frisch B, Binder H. Transforming RNASeq data to improve the performance of prognostic gene signatures. PLoS ONE. 2014;9(1): e85150.
Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of singlecell RNAsequencing imputation methods. Genome Biol. 2020;21(1):218.
Xu J, Cui L, Zhuang J, Meng Y, Bing P, He B, Tian G, Kwok Pui C, Wu T, Wang B, et al. Evaluating the performance of dropout imputation and clustering methods for singlecell RNA sequencing data. Comput Biol Med. 2022;146: 105697.
Cheng Y, Ma X, Yuan L, Sun Z, Wang P. Evaluating imputation methods for singlecell RNAseq data. BMC Bioinformatics. 2023;24(1):302.
Qiu P. Embracing the dropouts in singlecell RNAseq analysis. Nat Commun. 2020;11(1):1169.
Li R, Quon G. scBFA: modeling detection patterns to mitigate technical noise in largescale singlecell genomics data. Genome Biol. 2019;20(1):193.
Andrews TS, Hemberg M. M3Drop: dropoutbased feature selection for scRNASeq. Bioinformatics. 2019;35(16):2865â€“7.
Kiselev VY, Yiu A, Hemberg M. scmap: projection of singlecell RNAseq data across data sets. Nat Methods. 2018;15(5):359â€“62.
Tabula Muris Consortium. Singlecell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 2018;562(7727):367â€“72.
Tabula Muris C. A singlecell transcriptomic atlas characterizes ageing tissues in the mouse. Nature. 2020;583(7817):590â€“5.
Velmeshev D, Schirmer L, Jung D, Haeussler M, Perez Y, Mayer S, Bhaduri A, Goyal N, Rowitch DH, Kriegstein AR. Singlecell genomics identifies cell typespecific molecular changes in autism. Science. 2019;364(6441):685â€“9.
Gupta S, Ellis SE, Ashar FN, Moes A, Bader JS, Zhan J, West AB, Arking DE. Transcriptome analysis reveals dysregulation of innate immune response genes and neuronal activitydependent genes in autism. Nat Commun. 2014;5:5748.
Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27(8):1160â€“7.
Tiong KL, Lin YW, Yeang CH. Characterization of gene cluster heterogeneity in singlecell transcriptomic data within and across cancer types. Biol Open. 2022;11(6):59256.
Pal B, Chen Y, Vaillant F, Capaldo BD, Joyce R, Song X, Bryant VL, Penington JS, Di Stefano L, Tubau Ribera N, et al. A singlecell RNA expression atlas of normal, preneoplastic and tumorigenic states in the human breast. EMBO J. 2021;40(11): e107333.
Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive integration of singlecell data. Cell. 2019;177(7):1888â€“902.
Tiong KL, Sintupisut N, Lin MC, Cheng CH, Woolston A, Lin CH, Ho M, Lin YW, Padakanti S, Yeang CH. An integrated analysis of the cancer genome atlas data discovers a hierarchical association structure across thirty three cancer types. PLOS Digit Health. 2022;1(12): e0000151.
Nguyen PL, Taghian AG, Katz MS, Niemierko A, Abi Raad RF, Boon WL, Bellon JR, Wong JS, Smith BL, Harris JR. Breast cancer subtype approximated by estrogen receptor, progesterone receptor, and HER2 is associated with local and distant recurrence after breastconserving therapy. J Clin Oncol. 2008;26(14):2373â€“8.
Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346â€“52.
Chung W, Eum HH, Lee HO, Lee KM, Lee HB, Kim KT, Ryu HS, Kim S, Lee JE, Park YH, et al. Singlecell RNAseq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat Commun. 2017;8:15081.
Wu SZ, AlEryani G, Roden DL, Junankar S, Harvey K, Andersson A, Thennavan A, Wang C, Torpy JR, Bartonicek N, et al. A singlecell and spatially resolved atlas of human breast cancers. Nat Genet. 2021;53(9):1334â€“47.
Cancer Genome Atlas Research N, Brat DJ, Verhaak RG, Aldape KD, Yung WK, Salama SR, Cooper LA, Rheinbay E, Miller CR, Vitucci M, et al. Comprehensive, integrative genomic analysis of diffuse lowergrade gliomas. N Engl J Med. 2015;372(26):2481â€“98.
Chaligne R, Gaiti F, Silverbush D, Schiffman JS, Weisman HR, Kluegel L, Gritsch S, Deochand SD, Gonzalez Castro LN, Richman AR, et al. Epigenetic encoding, heritability and plasticity of glioma transcriptional cell states. Nat Genet. 2021;53(10):1469â€“79.
Skibinski A, Kuperwasser C. The origin of breast tumor heterogeneity. Oncogene. 2015;34(42):5309â€“16.
Sims AH, Howell A, Howell SJ, Clarke RB. Origins of breast cancer subtypes and therapeutic implications. Nat Clin Pract Oncol. 2007;4(9):516â€“25.
Kayabolen A, Yilmaz E, BagciOnder T. IDH mutations in glioma: doubleedged sword in clinical applications? Biomedicines. 2021;9(7):799.
Kim IS, Zhang XH. One microenvironment does not fit all: heterogeneity beyond cancer cells. Cancer Metastasis Rev. 2016;35(4):601â€“29.
Raj A, Stephens M, Pritchard JK. fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics. 2014;197(2):573â€“89.
Teh Y, Newman D, Welling M (2006) A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. Adv Neural Inf Process Syst 19
Acknowledgements
We thank Vahid Golderzahi for providing feedback about the manuscript.
Funding
The study is partially supported by the National Science & Technology Council, Taiwan, Grant Numbers 110â€“2118M001003MY2, 112â€“2118M001007MY2.
Author information
Authors and Affiliations
Contributions
KLT implemented the backward deconvolution framework, processed and analyzed the data, and cowrote the manuscript. DL implemented the backward deconvolution framework, performed analysis on insilico and true mouse RNASeq data, and cowrote the manuscript. CHY conceived the project, formulated the models, and cowrote the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This study uses the data from the public domain, thus does not require ethics approval and consent from participants.
Consent to publish
Not applicalbe.
Competing interests
There is no competing interest in this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12859_2024_5825_MOESM1_ESM.tiff
Supplementary Material 1. Figure S1: (A) Mixture coefficients of the ground truth (top panel) and those inferred from six incomplete deconvolution methods for the artificial mixture data with Î·, Î´ = (50,0), (B) Mixture coefficients for the artificial mixture data with Î·, Î´ = (3,0.7), (C) Signature matrices inferred from three complete deconvolution methods for the artificial mixture data with Î·, Î´ = (50,0), (D) Signature matrices for the artificial mixture data with Î·, Î´ = (3,0.7).
12859_2024_5825_MOESM2_ESM.tiff
Supplementary Material 2. Figure S2: (A) Visualization of gene expression LGG data on the bulk level (TCGA) and singlecell level (GSE151506), (B) tSNE visualization of the same two expression datasets.
12859_2024_5825_MOESM3_ESM.tif
Supplementary Material 3. Figure S3: Deconvolution results on LGG data. (A) Signature matrices of three hypothetical models (M_{1}M_{2}.) for incomplete deconvolution methods and those inferred from three complete deconvolution methods, (B)Mixture coefficients inferred from three complete deconvolution methods (deconf_original, deconf_fast, and NMF) and five incomplete deconvolution methods.
12859_2024_5825_MOESM5_ESM.xlsx
Supplementary Material 5. Table S2: (A) The confusion table of the clustering outcomes on the breast cancer scRNASeq dataset GSE161529. Rows indicate clusters of cells, and three columns indicate the PAM50 subtypes of the samples encompassing the cells, (B) The confusion table of the virtual data simulated by M_{1}, rows and columns correspond to cell types (Ï€) and sample subtypes (s) respectively, (C) The confusion table of the virtual data simulated by M1.
12859_2024_5825_MOESM6_ESM.xlsx
Supplementary Material 6. Table S3: The loglikelihood scores 18 probabilistic graphical models on LGG bulklevel and singlecell RNASeq data.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Tiong, KL., Luzhbin, D. & Yeang, CH. Assessing transcriptomic heterogeneity of singlecell RNASeq data by bulklevel gene expression data. BMC Bioinformatics 25, 209 (2024). https://doi.org/10.1186/s12859024058253
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859024058253