 Research
 Open Access
 Published:
VPAC: Variational projection for accurate clustering of singlecell transcriptomic data
BMC Bioinformatics volume 20, Article number: 0 (2019)
Abstract
Background
Singlecell RNAsequencing (scRNAseq) technologies have advanced rapidly in recent years and enabled the quantitative characterization at a microscopic resolution. With the exponential growth of the number of cells profiled in individual scRNAseq experiments, the demand for identifying putative cell types from the data has become a great challenge that appeals for novel computational methods. Although a variety of algorithms have recently been proposed for singlecell clustering, such limitations as low accuracy, inferior robustness, and inadequate stability greatly impede the scope of applications of these methods.
Results
We propose a novel modelbased algorithm, named VPAC, for accurate clustering of singlecell transcriptomic data through variational projection, which assumes that singlecell samples follow a Gaussian mixture distribution in a latent space. Through comprehensive validation experiments, we demonstrate that VPAC can not only be applied to datasets of discrete counts and normalized continuous data, but also scale up well to various data dimensionality, different dataset size and different data sparsity. We further illustrate the ability of VPAC to detect genes with strong unique signatures of a specific cell type, which may shed light on the studies in system biology. We have released a userfriendly python package of VPAC in Github (https://github.com/ShengquanChen/VPAC). Users can directly import our VPAC class and conduct clustering without tedious installation of dependency packages.
Conclusions
VPAC enables highly accurate clustering of singlecell transcriptomic data via a statistical model. We expect to see wide applications of our method to not only transcriptome studies for fully understanding the cell identity and functionality, but also the clustering of more general data.
Background
Singlecell RNAsequencing (scRNAseq) has emerged as a revolutionary tool to reveal previously unknown heterogeneity and functional diversity at a microscopic resolution. Since the first protocols were published in 2009 [1], a massive expansion in method development has derived scRNAseq technologies with distinct advantages and applicability [2]. For example, Smartseq2 [3] and MARSseq [4] is preferable when quantifying transcriptomes of fewer cells, while Dropseq [5] is preferable when quantifying transcriptomes of large numbers of cells with low sequencing depth [6]. Advances in scRNAseq technology have resulted in a wealth of studies aiming to reveal new cell types [7, 8], assess tissue composition [4, 9, 10], identify gene regulatory mechanisms [11, 12], investigate cell development or lineage processes [13,14,15], and many others. With the exponential growth of the number of cells profiled in individual scRNAseq experiments, there is a demand for novel analysis methods for this new type of transcriptomic data, which has not only much greater scale of datasets than that of bulk experiments but also various challenges unique to the singlecell context [16].
A key advantage of scRNAseq is that it can be used to identify putative cell types using unsupervised clustering, which is essential to fully understand the cell identity and functionality. A variety of algorithms have recently been proposed for singlecell clustering. For example, CellTree produces tree structures outlining the hierarchical relationship between singlecell samples using a novel statistical approach based on document analysis techniques [17]. Other statistical approaches based on Dirichlet mixture model are shown to be well suited for single cell clustering, especially for data as unique molecular identifiers (UMI) matrix. For example, DIMMSC models UMI count data and characterizes variations across different cell clusters via a Dirichlet mixture prior [18]. ParaDPMM further improves the clustering quality by introducing a Dirichlet process prior to automatically infers the number of clusters from the dataset, and a splitmerge mechanism is adopted to improve convergence and optimality of the result [19]. Given the high dimensionality of singlecell data, clustering directly on the original dimension may affect the performance due to the intrinsic noise of singlecell data, and usually demands tedious preprocessing such feature selection, which may restrict the robustness of clustering. A simple idea is to use traditional methods to reduce dimensions and then cluster. However, there are too many alternative combinations that make it difficult for us to make a choice. Methods combining dimension reduction with classic clustering such as tDistributed Stochastic Neighbor Embedding (tSNE) with Kmeans [8], and principal component analysis (PCA) with hierarchical clustering [20] for singlecell clustering are proposed. Combining PCA with Kmeans, a consensus clustering approach is proposed to achieve high accuracy and robustness [21]. CIDR further uses an implicit imputation approach to alleviate the impact of dropouts in scRNAseq data in a principled manner and identifies putative cell types using hierarchical clustering [22]. Other recent works are also proposed to cluster high dimensional singlecell data with gene regulatory networks [23], or ranking on shared nearest neighbors (SNN) [13].
However, there are still limitations in singlecell clustering to be addressed. First, even the stateoftheart methods have achieved encouraging performance, the clustering quality can still be significantly improved for challenging tasks as shown in the Results Section. Second, most methods are designed for one of the continuous data and discrete UMI counts. There is demand for methods that can be applied to singlecell data created with different scRNAseq technologies, such as discrete counts created with UMI based techniques and continuous data normalized to transcripts per million mapped reads (TPM)—if reads are only generated from one end of the transcript, or fragments per kilobase per million mapped reads (FPKM)—if reads cover the entire transcript [24]. Third, a method that can be generally applied to data of various sizes or dimensions is desirable. Nevertheless, most proposed methods cannot scale up well with various data dimensionality and with different dataset size. Last but equally important, most public packages cannot be easily installed due to system environment or software version problems. In addition, some of them require other dependency packages, and thus further inconveniences the use of these packages. Therefore, there is a demand for an easytoinstall and easytouse package.
Motivated by the above understanding, we propose in this paper a modelbased algorithm named VPAC (Variational Projection for Accurate Clustering) of singlecell transcriptomic data. VPAC is a novel extension of the framework of probabilistic principal components analysis (PPCA) [25], which has been shown to be effective in dimensionality reduction for scRNAseq data [26]. VPAC projects singlecell samples to a latent space, where the samples are constrained to follow a Gaussian mixture distribution [27]. With a coordinate ascent variational inference (CAVI) algorithm, VPAC can implement parameter estimation efficiently and steadily. Using five scRNAseq datasets, we show that our model is not only superior to existing methods in the clustering of singlecell transcriptomic data, but also able to be applied to singlecell data of both discrete counts and normalized continuous data. Through comprehensive experiments, we further show the robustness of our model for various data dimensionality, different dataset size and different data sparsity, and the ability of our model to detect genes with strong unique signatures of a specific cell type.
Methods
The statistical model of VPAC
Let N be the number of singlecell samples, D the number of genes, and M the desired number of clusters. We assume that the gene expression vector x_{n} of cell n is generated from a projection of a latent Ldimensional vector z_{n} (L ≪ D). We use n = 1, ..., N to index over samples, i = 1, ..., L to index over latent dimensions, and j = 1, …, M to index over clusters in all derivations below. The distribution of x_{n} could be a complex highdimensional distribution. We assume x_{n} follows a multivariate Gaussian distribution given z_{n}, while the latent vector z_{n} follows a Gaussian mixture distribution. An intuitive understanding of this assumption is that, when projecting, constraining samples to follow a Gaussian mixture distribution in the latent space contributes to the accurate clustering. As a generative model, VPAC independently draw each sample x_{i} through the following process.
In the Gauss mixture distribution (1), the binary latent variable s_{jn}, which gives discrete distributions governed by ρ, describes which component in the mixture gives rise to the latent vector z_{n}, i.e., if z_{n} is generated from component j then s_{jn} = 1, and s_{jn} = 0 otherwise. We choose a Dirichlet distribution for the prior of ρ
We complete the specification of Gaussian mixture distribution in Eq. (1) by introducing Gaussian and Wishart conjugate priors over the means m and precisions T, as
where γ is a small and fixed parameter chosen to give a broad prior to m, while the degrees of freedom υ and scale matrix V give a broad prior to T.
In order to endow VPAC the ability to automatically determine the appropriate dimensionality for the latent space to avoid discrete model selection, we introduce a hierarchical prior P(W α) over the projection matrix W in Eq. (2), as
where each item of the Ldimensional vector α controls the corresponding column of the matrix W by playing a role as the precision of a Gaussian distribution. We again introduce broad priors over the parameters α,μ and τ to complete the specification of VPAC
The graphical model representation of VPAC is shown in Fig. 1. The broad priors introduced above are obtained by setting a_{α} = b_{α} = a_{τ} = b_{τ} = β = γ = 10^{−3}. The initial parameters δ of the Dirichlet distribution are set as \( \frac{1}{M} \). The joint distribution of all of the variables is given by
Variational inference of parameters
We use variational methods to find a lower bound on P(X) because it is analytically intractable to directly evaluate P(X). With θ denoting the set of all parameters and latent variables in VPAC, we introduce an approximating distribution Q(θ) of the true posterior distribution. The log marginal likelihood is then given by
The function \( \mathcal{L}(Q) \) is the evidence lower bound (ELBO) on the true log marginal likelihood. The goal is to find a suitable Q(θ) to maximize the ELBO or minimize the KullbackLeibler divergence between \( \mathcal{L}(Q) \) and the true log marginal likelihood
We assume the variational approximation is meanfield, i.e., \( Q\left(\boldsymbol{\theta} \right)=\prod \limits_t{Q}_t\left({\theta}_t\right) \). The corresponding factors are
Because the proposed model is conjugate, we can derive a coordinate ascent variational inference (CAVI) algorithm to update the variational parameters
We implement VPAC in Python using common packages for data analysis (including Numpy, Scikitlearn and Scipy), without other dependency packages. Users can directly import our VPAC class and conduct clustering. Algorithm 1 gives a pseudocode description of VPAC.
Data collection
We collected a dataset of three T cell types (CD4+/CD25+ regulatory T cells, CD4+/CD45RA+/CD25 naive T cells and CD8+/CD45RA+ naïve cytotoxic T cells) from 10X Genomics [28]. The dataset measures the expression of 32,738 genes in 32,695 cells, which were enriched from fresh peripheral blood mononuclear cells (PBMCs) and sequenced by Illumina NextSeq 500 instrument. Clustering on this dataset was claimed as a challenging task by Z.Sun et al. [18] and T.Duan et al. [19]. We used this dataset to demonstrate the ability of our method to scale up well with various data dimensionality and with different dataset size on datasets from UMI based techniques. We also downloaded three preprocessed datasets of different scales used by ParaDPMM [19], which is the stateoftheart method for clustering discrete UMI counts, to fairly evaluate the performance of our model.
A dataset of 561 cells derived from seven cell lines (A549, H1437, HCT116, IMR90, K562, GM12878, and H1) was downloaded from the NCBI Gene Expression Omnibus via accession GSE81861 [29]. Sequenced using the 101bp pairedend protocol on the Illumina HiSeq 2000 platform, the dataset measures the expression of 55,186 DNA regions, and the expression was quantified as FPKM values. We used this dataset to demonstrate the robustness of our model for FPKMnormalized data of various dimensions. From the Single Cell Portal, we downloaded a platebased scRNAseq dataset of 24,649 genes in 27,998 cells from 12 clusters. The dataset was sequenced using a modified Smartseq2 protocol, and the expression levels of genes were quantified as TPM values [30]. With this dataset, we further demonstrated the robustness of our model for TPMnormalized datasets of different size.
In order to illustrate the superior ability of our model to account for the sparsity of scRNAseq data, we collected a dataset of 301 cells captured from 11 cell types using microfluidics. This dataset is publicly available from the NCBI Sequence Read Archive under accession SRP041736, and the quantification of gene expression levels in TPM for all 23,730 genes in all samples was performed [31]. We also collected a Smartseq2 sequenced dataset of 742 dendritic cells from six clusters with two batches from the Single Cell Portal to illustrate the application of our model [32, 33].
Assessment of performance
We used adjusted rand index (ARI) [34] and normalized mutual information (NMI) [35] to assess the clustering performance, namely, the similarity between the clustering results and known cell types provided by their original references. Suppose T is the known cell types, C the predicted clustering results, N the total number of singlecell samples, x_{i} the number of samples clustered to the ith cluster of C, y_{j} the number of samples belong to the jth cell type of T, and n_{ij} the number of overlapping samples between the ith cluster and the jth cell type. ARI is computed as
where ()denotes a binomial coefficient. By denoting the entropy of C and T as H(C) and H(T), respectively, and the mutual information between them as MI(C,T), NMI can be computed as
Results
VPAC accurately clusters scRNAseq data
To verify the clustering performance of VPAC, we conducted each experiment five times and computed two widely used metrics, ARI and NMI. We compared the performance of our method with four baseline methods (using their default parameters), including Para_DPMM, the stateoftheart method for clustering discrete UMI counts [19], pcaReduce, another method combining dimension reduction with clustering [20], and other two widely used methods (Seurat [13] and SC3 [21]).
In order to fairly evaluate the performance of our model, we compared the performance of VPAC with other four baseline methods on the three T cell datasets of different scales provided by T.Duan et al. [19]. The small scale dataset (SSet) consists of 1200 cells with the 1000 top variable genes, the medium scale dataset (MSet) consists of 3000 cells with the 3000 top variable genes, and the large scale dataset (LSet) consists of 6000 cells with the 5000 top variable genes. As shown in Table 1, our method consistently outperforms the four baseline methods for a large margin on all the three datasets. In more detail, pcaReduce and Seurat fail in the clustering task, while SC3 achieves satisfactory performance in the first two datasets. However, the performance of SC3 decreases dramatically when the number of cells exceeds about 3 thousand, which will be further demonstrated in the following sections. Compared to Para_DPMM, our method achieves average 11.3% improvement on ARI and 8.1% improvement on NMI even the clustering on these datasets was claimed as a challenging task [18, 19]. In addition, our method consistently achieves stable performance, which demonstrates its higher robustness than the baseline methods. In terms of efficiency, the performance of VPAC is moderate compared with discriminant methods, which have an inherent advantage of high speed. Taking the LSet as an example, VPAC cost 241 s, Para_DPMM cost 1000 s, pcaReduce cost 37 s, Seurat cost 66 s and SC3 cost 653 s on average. It is worth noting that the default minimal training time of Para_DPMM is 1000 s, which slightly improves the performance of Para_DPMM shown in the original research.
VPAC scales up well with various data dimensionality
To illustrate the robustness of our model for various data dimensionality, we selected different numbers of top variable genes (features) based on their standard deviations across the cell transcriptome profiles. We first used the dataset of 32,695 cells enriched from PBMCs, which is the full version of the three preprocessed datasets, to demonstrate the superior performance of VPAC for discrete UMI counts of various dimensionality. As illustrated in Fig. 2a, our method significantly outperforms the four baseline methods and achieves consistent performance across datasets of different numbers of features. Note that pcaReduce raised an error when training the dataset with 30 k features even the memory is sufficient. SC3 cannot be competent for clustering the dataset of 32,695 cells.
We further evaluated the robustness of our model for FPKMnormalized data using a dataset of 561 cells derived from seven cell lines. As shown in Fig. 2b, VPAC consistently outperforms the four baseline methods again. Different from the performance for discrete UMI counts, all the performance of pcaReduce, Seurat and SC3 are superior to that of Para_DPMM. SC3 achieves the second best performance, but, as other three baseline methods, the performance is obviously unstable, while VPAC achieves consistent performance across datasets of different numbers of features. The results demonstrate that VPAC can scale up well with various data dimensionality whether the scRNAseq data is discrete or continuous.
VPAC scales up well with different dataset size
Different scRNAseq technologies can be used to quantify transcriptomes for different numbers of cells. In order to demonstrate the robustness of our model for different dataset size, we randomly selected different proportions of singlecell samples from the whole cell population. Besides, we selected the top 90% variable genes (features) to avoid errors raised by pcaReduce. We again used the cells enriched from PBMCs to illustrate the superior performance of VPAC for discrete UMI counts of different dataset size. As illustrated in Fig. 3a, our method significantly outperforms the four baseline methods and achieves superior performance across datasets of different numbers of samples. As the number of samples increased, the performance of Para_DPMM deteriorates, while that of SC3 decreases dramatically since the number of cells exceeds about 3 thousand, which confirms the previous results.
We further illustrated the robustness of our model for TPMnormalized data using a dataset of 24,649 cells from 12 clusters. As shown in Fig. 3b, VPAC is still superior to the baseline methods and achieves more stable performance across datasets of different numbers of samples. The ARI scores of VPAC are higher than that of Seurat for a large margin, even that Seurat achieves comparable NMI scores. The performance of SC3 again deteriorates as the number of samples increased. The performance of pcaReduce is still very unstable even outperforms Para_DPMM. The above results demonstrate that VPAC can also scale up well with different dataset size whether the scRNAseq data is discrete or continuous.
VPAC scales up well with different data sparsity
Singlecell gene expression data suffer from the high sparsity problem because they contain an abundance of dropout events that lead to zero expression measurements. In order to illustrate the superior ability of our model to account for the sparsity of scRNAseq data, we randomly zeroed different proportions of nonzero items five times on a TPMnormalized dataset of 301 cells with about 2.7 × 10^{5} reads per cell [31]. As illustrated in Fig. 4, VPAC is superior to the baseline methods and achieves stable performance. Onesided pairedsample Wilcoxon signed rank test consistently suggest that our metty hod achieves higher ARI scores (pvalues = 5.05e5 for pcaReduce and 4.98e5 for SC3) and NMI scores than a baseline method (pvalues = 8.60e4 for pcaReduce and 2.25e5 for SC3). The performance of VPAC, pcaReduce and SC3 obviously deteriorate when we set 80% of the nonzero expression data to zero. Seurat achieves relatively poor performance, while Para_DPMM again fails on clustering the TPMnormalized dataset. The results demonstrate that our model has the ability to account for the sparsity of scRNAseq data, and thus benefits the clustering tasks with the exponential growth of the Dropseq based singlecell transcriptomic data.
Applications of VPAC
To demonstrate potential applications of VPAC, we collected a dataset of 742 dendritic cells from six clusters [32, 33]. By setting the number of clusters to 6 and the dimensionality of latent space to 10, VPAC achieved an ARI of 0.876 and an NMI of 0.898. Actually, the items of α with large values in VPAC will result in ‘switching off’ the corresponding column of the matrix W, and thus provide guidance for determining the appropriate dimensionality for the latent space. As shown in Fig. 5, the visualization of the projection matrix W (the left one) indicates there are four principal directions in the latent space, while the visualization of the projection matrix obtained by classical PCA (the right one) cannot effectively reveal the number of principal directions. Therefore, by setting the dimensionality of latent space to 5 (in practice we have found that one more dimensionality, which may contain some additional information, tends to give better results), VPAC achieved an ARI of 0.875 and an NMI of 0.903, which are approximately equal to the results with the dimensionality of latent space setting to 10, which means the only one parameter we should determine for VPAC is the number of clusters.
We used the tSNE algorithm to visualize the dimensionreduced data in the latent space of VPAC by projecting the data into a twodimensional space so that certain hidden structures can be depicted intuitively. Note that tSNE is a visualization tool, and it is not intended to be used for clustering scRNAseq data [18]. As shown in Fig. 6a, VPAC accurately clustered different cell types accounting for cells from different batches. It is worth noting that VPAC intended to cluster DC2 and DC3 cell types together while inferring a small cluster (in the bottom) containing some individual cells. According to the original research for this dataset, it is reasonable to cluster DC2 and DC3 together because both of them correspond to new subdivisions of the CD1C/BDCA1^{+}cDC2, while DC1 corresponds to the crosspresenting CD141/BDCA3^{+} cDC1, DC4 corresponds to CD1C^{−}CD141^{−}CD11C^{+} DC, DC5 is a unique DC subtype, AS DCs, and DC6 corresponds to the interferonproducing pDC [32]. With this understanding, we set the number of clusters to 5. VPAC successfully clustered the DC2/3 cluster and achieved an ARI of 0.892 and an NMI of 0.943. In addition, as illustrated in Fig. 6b, VPAC accounted for batch effects even there are still very few samples wrongly clustered, which means that our method is robust to scRNAseq data from different batches.
We further conducted gene coexpression network analysis to identify which genes have a tendency to show a coordinated expression pattern of the DC2/3 cluster inferred by VPAC. By calculating the Pearson correlation coefficient (PCC) of each gene pairs, we plotted a coexpression network with a threshold 0.4 of PCC [36]. As illustrated in Fig. 7, the three genes, namely S100A9, S100A8 and CD14, with the highest degree of connectivity are expected to be drivers required for signaling pathways of essential functions. It is worth noting that, in the original research, these three genes were claimed as acute and chronic inflammatory genes that play as a strong unique signature to distinguish the DC2/3 cluster, which means our method has the ability to reveal the genes with strong unique signatures of a specific cell type.
It is worth noting that VPAC successfully clustered the novel DC5 samples detected by the original research. In order to study the potential mechanism implied in VPAC, we further analyzed the samples in the latent space inferred by VPAC. As shown in Fig. 8, the fifth items in latent vectors of the DC5 cluster are much larger than that of other clusters, which means that the fifth direction of the latent space can effectively distinguish the DC5 cluster. We sorted the absolute values of the fifth column in projection matrix W, which are the absolute weights of linear combinations of genes, and found that the two genes with the highest absolute weights (0.290 and 0.288) are AXL and SIGLEC6, which have much higher absolute weight than the successive gene CX3CR1 with a weight of 0.225. Interestingly, AXL and SIGLEC6 serve as unique markers for the population emerged from the unbiased cluster analysis (cluster DC5) according to the original literature, which further demonstrates that our method has the inherent ability to detect genes with strong unique signatures of a specific cell type while accurately clustering.
Discussion
scRNASeq technologies have advanced rapidly in recent years and enable the quantitative characterization of cell types based on transcriptome profiles. VPAC is proposed to identify putative cell types using unsupervised clustering. The superiority of our method over other baseline methods, such as Para_DPMM and pcaReduce, is mainly attributed to the variational projection while constraining singlecell samples to follow a Gaussian mixture distribution in the latent space. In addition, with the hierarchical prior over the projection matrix, VPAC can automatically determine the appropriate dimensionality for the latent space to avoid discrete model selection, which means the only one parameter we should determine for VPAC is the number of clusters. The comprehensive experiments demonstrate the generality and robustness of our model.
Our model can certainly be improved in some aspects. First, a more complex nonlinear projection can be introduced to better model the scRNAseq data. Second, more constraints or assumptions can be included in our model considering the characteristics of scRNAseq data, such as the high sparsity. Recent studies have shown the introduction of zeroinflated assumption can effectively model the dropout evens of singlecell data [26, 37], which may also improve the performance of our model. Third, our model can be extended to incorporate other types of functional genomics data such as chromatin accessibility. For example, in the literature, the method of coupled nonnegative matrix factorizations performs clustering by the integrative analysis of scRNAseq and singlecell ATACsequencing data [38]. Finally, the performance and efficiency of the model may be further improved by parameters inference using stochastic optimization and deep neural networks, which have been shown to be effective in statistical models for singlecell data analysis [39, 40].
Conclusions
We have proposed a modelbased algorithm, named VPAC, for accurate clustering of singlecell transcriptomic data through variational projection. Benefitting from the variational projection while constraining singlecell samples to follow a Gaussian mixture distribution in the latent space, VPAC is superior to existing methods in the clustering of datasets of discrete counts, normalized continuous data, various data dimensionality, different dataset size and different data sparsity. We have further demonstrated the ability of VPAC to detect genes with strong unique signatures of a specific cell type, which may shed light on the studies in system biology. Eventually, with the explosive growth of scRNAseq data, we expect that such a statistical approach will provide us with superior performance and be widely applicable.
Abbreviations
 ARI:

Adjusted rand index
 CAVI:

Coordinate ascent variational inference
 ELBO:

Evidence lower bound
 FPKM:

Fragments per kilobase per million mapped reads
 NMI:

Normalized mutual information
 PBMCs:

Peripheral blood mononuclear cells
 PCA:

Principal component analysis
 PCC:

Pearson correlation coefficient
 PPCA:

Probabilistic principal components analysis
 scRNAseq:

Singlecell RNAsequencing
 SNN:

Shared nearest neighbors
 TPM:

Transcripts per million mapped reads
 tSNE:

tDistributed Stochastic Neighbor Embedding
 UMI:

Unique molecular identifiers
References
Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A, et al. mRNASeq wholetranscriptome analysis of a single cell. Nat Methods. 2009;6(5):377–82.
Hedlund E, Deng Q. Singlecell RNA sequencing: technical advancements and biological applications. Mol Asp Med. 2018;59:36–46.
Picelli S, Bjorklund AK, Faridani OR, Sagasser S, Winberg G, Sandberg R. Smartseq2 for sensitive fulllength transcriptome profiling in single cells. Nat Methods. 2013;10(11):1096–8.
Jaitin DA, Kenigsberg E, KerenShaul H, Elefant N, Paul F, Zaretsky I, Mildner A, Cohen N, Jung S, Tanay A, et al. Massively parallel singlecell RNAseq for markerfree decomposition of tissues into cell types. Science. 2014;343(6172):776–9.
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly parallel genomewide expression profiling of individual cells using Nanoliter droplets. Cell. 2015;161(5):1202–14.
Ziegenhain C, Vieth B, Parekh S, Reinius B, GuillaumetAdkins A, Smets M, Leonhardt H, Heyn H, Hellmann I, Enard W. Comparative analysis of singlecell RNA sequencing methods. Mol Cell. 2017;65(4):631–643 e634.
Segerstolpe A, Palasantza A, Eliasson P, Andersson EM, Andreasson AC, Sun X, Picelli S, Sabirsh A, Clausen M, Bjursell MK, et al. Singlecell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 2016;24(4):593–607.
Grun D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H, van Oudenaarden A. Singlecell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525(7568):251–5.
Zeisel A, MunozManchado AB, Codeluppi S, Lonnerberg P, La Manno G, Jureus A, Marques S, Munguba H, He L, Betsholtz C, et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by singlecell RNAseq. Science. 2015;347(6226):1138–42.
Scialdone A, Natarajan KN, Saraiva LR, Proserpio V, Teichmann SA, Stegle O, Marioni JC, Buettner F. Computational assignment of cellcycle stage from singlecell transcriptome data. Methods. 2015;85:54–61.
Jaitin DA, Weiner A, Yofe I, LaraAstiaso D, KerenShaul H, David E, Salame TM, Tanay A, van Oudenaarden A, Amit I. Dissecting immune circuits by linking CRISPRpooled screens with singlecell RNASeq. Cell. 2016;167(7):1883–1896 e1815.
Xue Z, Huang K, Cai C, Cai L, Jiang CY, Feng Y, Liu Z, Zeng Q, Cheng L, Sun YE, et al. Genetic programs in human and mouse early embryos revealed by singlecell RNA sequencing. Nature. 2013;500(7464):593–7.
Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of singlecell gene expression data. Nat Biotechnol. 2015;33(5):495–502.
Marco E, Karp RL, Guo G, Robson P, Hart AH, Trippa L, Yuan GC. Bifurcation analysis of singlecell gene expression data reveals epigenetic landscape. Proc Natl Acad Sci U S A. 2014;111(52):E5643–50.
Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH, Desai TJ, Krasnow MA, Quake SR. Reconstructing lineage hierarchies of the distal lung epithelium using singlecell RNAseq. Nature. 2014;509(7500):371–5.
Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in singlecell transcriptomics. Nat Rev Genet. 2015;16(3):133–45.
duVerle DA, Yotsukura S, Nomura S, Aburatani H, Tsuda K. CellTree: an R/bioconductor package to infer the hierarchical structure of cell populations from singlecell RNAseq data. BMC Bioinformatics. 2016;17(1):363.
Sun Z, Wang T, Deng K, Wang XF, Lafyatis R, Ding Y, Hu M, Chen W. DIMMSC: a Dirichlet mixture model for clustering dropletbased single cell transcriptomic data. Bioinformatics. 2018;34(1):139–46.
Duan T, Pinto JP, Xie X. Parallel clustering of single cell transcriptomic data with Splitmerge sampling on Dirichlet process mixtures. Bioinformatics. 2019;35(6):953–61.
Zurauskiene J, Yau C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics. 2016;17:140.
Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. SC3: consensus clustering of singlecell RNAseq data. Nat Methods. 2017;14(5):483–6.
Lin P, Troup M, Ho JW. CIDR: ultrafast and accurate clustering through imputation for singlecell RNAseq data. Genome Biol. 2017;18(1):59.
Aibar S, GonzalezBlas CB, Moerman T, HuynhThu VA, Imrichova H, Hulselmans G, Rambow F, Marine JC, Geurts P, Aerts J, et al. SCENIC: singlecell regulatory network inference and clustering. Nat Methods. 2017;14(11):1083–6.
Grun D, van Oudenaarden A. Design and analysis of singlecell sequencing experiments. Cell. 2015;163(4):799–810.
Tipping ME, CMJJotRSSSB B. Probabilistic principal component analysis. J R Stat Soc. 1999;61(3):611–22.
Pierson E, Yau CZIFA. Dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biol. 2015;16:241.
Corduneanu A, Bishop CM. Variational Bayesian model selection for mixture distributions. In: Artificial intelligence and statistics. Waltham: Morgan Kaufmann; 2001. p. 27–34.
Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049.
Li H, Courtois ET, Sengupta D, Tan Y, Chen KH, Goh JJL, Kong SL, Chua C, Hon LK, Tan WS, et al. Reference component analysis of singlecell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet. 2017;49(5):708–18.
Biton M, Haber AL, Rogel N, Burgin G, Beyaz S, Schnell A, Ashenberg O, Su CW, Smillie C, Shekhar K, et al. T Helper Cell Cytokines Modulate Intestinal Stem Cell Renewal and Differentiation. Cell. 2018;175(5):1307–20 e1322.
Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH, Li N, Szpankowski L, Fowler B, Chen P, et al. Lowcoverage singlecell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol. 2014;32(10):1053–8.
Villani AC, Satija R, Reynolds G, Sarkizova S, Shekhar K, Fletcher J, Griesbeck M, Butler A, Zheng S, Lazo S, et al. Singlecell RNAseq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science. 2017;356:6335.
Hua K, Zhang X. A case study on the detailed reproducibility of a human cell atlas project; 2018. p. 467993.
Hubert L, PJJoc A. Comparing partitions. J Classif. 1985;2(1):193–218.
Strehl A, JJJomlr G. Cluster ensemblesa knowledge reuse framework for combining multiple partitions; 2002. p. 583–617.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
Ferreira PF, Carvalho AM, Vinga S. Scalable probabilistic matrix factorization for singlecell RNAseq analysis. bioRxiv. 2018;496810. https://www.biorxiv.org/content/10.1101/496810v1.
Duren Z, Chen X, Zamanighomi M, Zeng W, Satpathy AT, Chang HY, Wang Y, Wong WH. Integrative analysis of singlecell genomics data by coupled nonnegative matrix factorizations. Proc Natl Acad Sci U S A. 2018;115(30):7723–8.
Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for singlecell transcriptomics. Nat Methods. 2018;15(12):1053–8.
Ding J, Condon A, Shah SP. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat Commun. 2018;9(1):2002.
Acknowledgements
We thank Yong Wang for his helpful discussions. Rui Jiang is a RONG professor at the Institute for Data Science, Tsinghua University.
Funding
Publication costs are funded by the National Key Research and Development Program of China (No. 2018YFC0910404), the National Natural Science Foundation of China (Nos. 61873141, 61721003, 61573207, 71871019 and 71471016), and the TsinghuaFuzhou Institute for Data Technology.
Availability of data and materials
The datasets supporting the conclusions of this article are publicly available from 10xgenomics (https://support.10xgenomics.com/singlecellgeneexpression/datasets), the NCBI Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/), the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/), and the Single Cell Portal (https://portals.broadinstitute.org/single_cell).
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 7, 2019: Selected papers from the 12th International Conference on Computational Systems Biology (ISB 2018). The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement7.
Author information
Authors and Affiliations
Contributions
RJ designed the research. SC and KH designed and implemented the models. SC and HC collected data and analyzed the results. SC, KH, HC and RJ wrote the manuscript. All authors read and confirmed the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Chen, S., Hua, K., Cui, H. et al. VPAC: Variational projection for accurate clustering of singlecell transcriptomic data. BMC Bioinformatics 20, 0 (2019). https://doi.org/10.1186/s1285901927424
Published:
DOI: https://doi.org/10.1186/s1285901927424
Keywords
 Singlecell
 scRNAseq
 Clustering
 Variational projection
 Dimensionality reduction
 Multiscale
 Dropout
 Cell subtypes