 Methodology article
 Open access
 Published:
A robust nonlinear lowdimensional manifold for single cell RNAseq data
BMC Bioinformatics volumeÂ 21, ArticleÂ number:Â 324 (2020)
Abstract
Background
Modern developments in singlecell sequencing technologies enable broad insights into cellular state. Singlecell RNA sequencing (scRNAseq) can be used to explore cell types, states, and developmental trajectories to broaden our understanding of cellular heterogeneity in tissues and organs. Analysis of these sparse, highdimensional experimental results requires dimension reduction. Several methods have been developed to estimate lowdimensional embeddings for filtered and normalized singlecell data. However, methods have yet to be developed for unfiltered and unnormalized count data that estimate uncertainty in the lowdimensional space. We present a nonlinear latent variable model with robust, heavytailed error and adaptive kernel learning to estimate lowdimensional nonlinear structure in scRNAseq data.
Results
Gene expression in a single cell is modeled as a noisy draw from a Gaussian process in high dimensions from lowdimensional latent positions. This model is called the Gaussian process latent variable model (GPLVM). We model residual errors with a heavytailed Studentâ€™s tdistribution to estimate a manifold that is robust to technical and biological noise found in normalized scRNAseq data. We compare our approach to common dimension reduction tools across a diverse set of scRNAseq data sets to highlight our modelâ€™s ability to enable important downstream tasks such as clustering, inferring cell developmental trajectories, and visualizing high throughput experiments on available experimental data.
Conclusion
We show that our adaptive robust statistical approach to estimate a nonlinear manifold is well suited for raw, unfiltered gene counts from highthroughput sequencing technologies for visualization, exploration, and uncertainty estimation of cell states.
Background
Highthroughput singlecell RNA sequencing (scRNAseq) is a powerful tool for cataloguing cell types and cell states, and for investigating changes in expression over cell developmental trajectories. Dropletbased methods encapsulate individual cells with unique barcode tags that are ligated to cellular RNA fragments [1]. Sequenced reads are mapped to both a gene and a cell, creating a highdimensional cellbygene count matrix with hundreds to millions of cells and twenty thousand genes per human cell. These cellbygene count matrices contain a substantial proportion of zeros because of lowcoverage sequencing per cell (i.e., dropout). The count matrices also contain substantial variance and observation outliers due to both technical and biological sources of noise [2]. Furthermore, the lowdimensional space representing the transcriptional relationships among cells is not a simple, smooth space, but a complex, nonlinear, and sometimes nonsmooth space due to cellular differentiation and cell cycle processes.
Computational tools for analyzing scRNAseq results generally require initial dimension reduction to a lowerdimensional manifold capturing gene expression patterns for regularization and computational efficiency. Dimension reduction techniques are an essential precursor to noise reduction [2, 3], subpopulation identification [4, 5], visualization [6, 7], pseudotemporal ordering of development stages [8â€“10], and imputation [11]. Lowerdimensional mappings also provide convenient visualizations that lead to hypothesis generation, and inform analytic methods and future experiments [12, 13].
Linear dimension reduction techniques are commonly used as a first step to downstream analyses. Principal component analysis [14] (PCA) â€“ the projection of a highdimensional space onto orthogonal bases that capture the directions of greatest variance â€“ is the first step of several scRNAseq analysis packages such as PAGODA [15] and Waterfall [16]. ZeroInflated Factor Analysis [4] (ZIFA) extends the factor analysis [17] paradigm of a linear map onto lowdimensional latent dimensions to allow dropouts modeled by Bernoulli random variables to account for the excess of zero counts in scRNAseq data. Independent component analysis [18] (ICA), which assumes nonGaussian observations, and canonical correlation analysis [19] (CCA), which allows for multiple observation types, have also been used as dimension reduction techniques for studying cell developmental trajectories [9] and for experimental batch correction [20].
More sophisticated models eschew the linearity assumption to capture the rich nonlinear structure in the data. The tdistributed Stochastic Neighbors Embedding (tSNE) [7] is a popular visualization tool. TSNE computes the similarity between two points in highdimensional space with respect to a Gaussian kernel distance metric, and estimates a lowerdimensional mapping with similarity with respect to a Studentâ€™s tdistribution metric that minimizes the KullbackLeibler divergence between the similarity distributions in high and low dimensions. The Gaussian kernel in tSNE includes a perplexity parameter that controls the decay rate of similarity across the distance between cells. The uniform manifold approximation and projection for dimension reduction [21] (UMAP), which is similar to tSNE but better preserves highdimensional distances in the loudimentioned space, has also become popular for visualization. Diffusion maps, used in packages such as Destiny [22], are another tool for nonlinear lowdimensional mapping that perform linear decomposition on a kernel similarity matrix of highdimensional observations. SAUCIE [6] implements an autoencoder, or a deep neural network, that compresses data with the goal of creating an optimal reconstruction from the compressed representation in order to execute several singlecell tasks. Similarly, scVI [23] uses deep neural networks to create a probabilistic representation in latent space for batch correction, visualization, clustering, and differential expression. One Bayesian probabilistic technique is the Gaussian process latent variable model [24, 25] (GPLVM), used by scLVM [2] for noise reduction, and GPfates [10] and GrandPrix [8] for pseudotemporal ordering. The GPLVM models observations (i.e., cells) as draws from a Gaussian process map of lowerdimensional latent variables.
While current methods for dimension reduction have been successful with early sequencing experiments and filtered expression data, they are limited in their capacity to accurately represent and inform analyses of raw, highthroughput sequencing experiments. Linear methods such as PCA and ZIFA are illsuited for capturing highly nonlinear biological processes across developmental phase, and many implementations scale poorly with increased sample size. Current nonlinear methods are highly sensitive to parameter choices, including perplexity for tSNE, kernel variables for diffusion maps, and network architecture for VAEs. Latent dimensions of tSNE have no global structure, making embedded positions difficult to interpret and leading to uninformative mappings beyond two dimensions. Downstream analyses of tSNE results are hindered by an inability to map the lowdimensional projections back to observation space [26]. VAEs, like most neural networks, require tens to hundreds of thousands of cells for accurate estimation, which may not be available in smaller experiments. Furthermore, VAEs are often challenging to fit in practice, with posterior collapse being a common failure mode in statistical inference [27]. These current methods, particularly those using the GPLVM, are sensitive to outliers. In particular, they often work only with filtered, normalized data, and they incorporate prior information to facilitate the latent mapping.
Robust statistical models are a natural solution to capturing heavytailed, sparse log count data. We introduce the tdistributed Gaussian process latent variable model (tGPLVM) for learning a lowdimensional, nonlinear embedding of unfiltered count data. To address each of the challenges in scRNAseq analyses, we introduce three features to the basic GPLVM: 1) a robust Studentâ€™s tdistribution noise model; 2) a weighted sum of nonsmooth covariance kernel functions with parameters estimated from the data; 3) sparse kernel structure. The heavy tailed Studentâ€™s tdistribution improves robustness to outliers, previously demonstrated in Gaussian process regression [28, 29]. MatÃ©rn kernels have been successfully used in singlecell time series models to capture nonsmooth trajectories [8, 30, 31]. We use a weighted sum of three MatÃ©rn kernels (1/2, 3/2, and 5/2) and a Gaussian kernel. We estimate the kernel parameters during inference to allow for a broad range of highdimensional geometries. The sparse kernel structure allows us to effectively reduce the number latent dimensions based on the estimated complexity of the data. Furthermore, our implementation of tGPLVM accepts sparse inputs produced from highthroughput experimental cellbygene count matrices, is among the best methods in terms of computational speed, and scales to over a million cells.
We demonstrate tGPLVMâ€™s ability to estimate informative manifolds from noisy, raw singlecell unique molecular identifier (UMI) count matrices and highlight its applicability to multiple downstream tasks. While still sparse, UMIs do not show zeroinflation and require less processing or normalization than traditional count data [32]. We show improved celltype identification via clustering on the estimated latent space using a data set of cerebral cortex cells labeled with estimated cell type [33]. We find that the tGPLVM manifold can learn pseudotemporal ordering from a batch of Plasmodiuminfected mouse cells sequenced across time post exposure [10]. Finally, we demonstrate that tGPLVM can be used on unprocessed, unnormalized count data from recent highthroughput sequencing methods [1] and can be used to explore gene expression across cell states.
Results
The tdistributed Gaussian process latent variable model (tGPLVM) is a nonlinear latent variable model that captures highdimensional observations in a lowdimensional nonlinear latent space. Expression of each of P genes across N cells is modeled as a draw from a Gaussian process distribution with the covariance matrix a function of the lowdimensional, latent positions. The observation error is modeled with a heavytailed Studentâ€™s tdistribution with four degrees of freedom to robustly account for variance in the count data due to technical and biological noise relative to a normally distributed error model [34, 35]. Here, we used a weighted sum of MatÃ©rn 1/2, MatÃ©rn 3/2, MatÃ©rn 5/2, and squared exponential (SE) kernel functions to model possibly nonsmooth manifolds.
Studentâ€™s tdistributed error improves manifold learning
To evaluate the impact of using a Studentâ€™s tdistributed error model instead of a Gaussian, we compared the accuracy of estimated latent representations of simulated data. Data from a branching trajectory in two dimensions are projected into tendimensional space by a SE kernel Gaussian process. We added noise to the simulation with either Gaussian or gamma distributions, then fit using either tdistributed or Gaussian error GPLVMs. We evaluated the goodnessoffit by comparing the 2Wasserstein distance between normalized pairwise distance matrices of the simulated and estimated latent positions. For simulations with normally distributed error, tdistributed and normal error models are about equal in their ability for reconstruct the latent space (Fig. 1). As the variance of the Gaussian error increases past 50% of the maximum noiseless value, however, the tdistributed model outperforms a normally distributed error model. When the noise is a skewed gamma distribution, using the tdistributed error is uniformly better than the normal distribution at estimating the latent space. When the datagenerating covariance is different from the model covariance, heavytailed noise models deal well with model mismatch. These results demonstrate that a robust Studentâ€™s tdistributed error model improves estimation under different possible noise conditions.
Nonparametric manifold learning improves celltype identification.
We evaluated the ability of tGPLVM and commonlyused singlecell dimension reduction methods to distinguish distinct cell types. tGPLVM, PCA, ZIFA, tSNE, UMAP, and scVI were used to map cells labeled with their inferred cell type from the Pollen data [33] to latent spaces varying from two to nine dimensions. The data consist of log _{10}(1+x) normalized counts of 249 cells from 11 distinct populations from a Fluidigm C1 system. With more than three latent dimensions, tGPLVM produced clusters that best corresponded to the actual celltype labels of the six methods (Fig. 2). With three dimensions, tGPVLM is second only to scVI in performance. At nine latent dimensions, the models perform about equally, indicating that most of the information in the data has been captured.
Including MatÃ©rn kernels in tGPLVM improves celltype separation in the latent space as measured by normalized mutual information (NMI) and adjusted rand score (ARS) (Fig. 3). Inclusion of MatÃ©rn kernels also reduces the uncertainty of posterior estimates of the latent embedding as measured by the average scale parameter of the latent position (Fig. 4). We find that the clustering efficacy remains when initialization is changed from a PCA approximation to random initial embeddings (Supplemental Figure 1). These results suggest that a robust Bayesian nonparametric manifold is superior to current linear dimension reduction algorithms and equal to scVI for identifying and visualizing distinct cell types captured by scRNAseq experiments.
Nonparametric manifolds to reconstruct development time scales without prior information
Next, we test the flexibility of tGPLVM applied to continuous cellular developmental trajectories. We fit latent mappings for log count data of 408 mouse Th1 and Tfh cells sequenced over seven days on a Fluidigm C1 after infection with Plasmodium [10]. Visually, we find that the latent mapping from tGPLVM represents the temporal relationships accurately, with most cells positioned among cells from the same or adjacent time points. We build a minimum spanning tree on the latent mappings to infer developmental trajectories. For a twodimensional mapping, only tGPLVM accurately spans the first time point (day 0) to the final time point (day 7). PCA, ZIFA, and scVI find endpoints of the tree in days 2 or 4. tSNE is able to separate cells based on time but does not accurately reconstruct the ordering and is sensitive to outliers (Fig. 5). UMAPâ€™s representation does not represent the time progression as well as other methods. We recognize that these trees are not a complete method for determining developmental trajectories, and that PCA, ZIFA, scVI, and tGPLVM are all capable of finding reasonable embeddings. The results suggest that tGPLVM can capture developmental trajectories and distances in unlabeled settings better than other visualization tools and as well as any other singlecell dimension reduction technique.
Nonparametric manifold learning improves visualization of raw count data and captures cell state
Next, we tested tGPLVMâ€™s performance on unfiltered count data. tGPLVM and comparison models were fit on âˆ¼10,000 CD34+ peripheral blood mononuclear cells (PBMCs) sequenced on a highthroughput parallel 10x system [1]. Each model was able to find three distinct regions based on expression patterns (Fig. 6). PCA is dominated by total counts, with cells with more reads moving further away in latent space, and more frequent cell types dominating the space [36]. While all methods place highexpression cells together, nonlinear methods like tGPLVM and scVI disperse lowcount cells across the space more than linear methods (Fig. 6). We see visually that tGPLVM identifies three distinct regions while providing continuity between cells, so that gene expression is not partitioned into distinct groups in the latent space. CD34 is a marker for hematopoeitic stem cells [37], which differentiate into myeloid and lymphoid cells. From tGPLVM, we can observe this separation from different expression patterns in progenitor cells across dimensions. Dimension three correlates with myeloid cells, demonstrated visually by marker TYROBP [38] (Pearsonâ€™s r=0.647; Fig. 6), in addition to correlations with macrophageassociated genes [39, 40] S100A4 (Pearsonâ€™s r=0.623) and S100A6 (Pearsonâ€™s r=0.665). In comparison to the traditional GPLVM, tGPLVM collects cells expressing TYROBP more compactly (Fig. 6). Dimension two correlates to lymphoid cells, visualized by marker LTB [41] (Pearsonâ€™s r=0.306; Fig. 6), and further supported by correlation with lymphocyte specific protein1 LSP1 (Pearsonsâ€™s r=0.481). Dimension one corresponds to general cellular functions, with strong correlation with mitochondrial activity genes COX5A (Pearsonâ€™s r=0.587) and STOML2 (Pearsonâ€™s r=0.474), and shown with CLTA, an endocytosismediating gene [42] (Pearsonâ€™s r=0.461; Fig. 6). These distinct expression patterns reflect the broadly different immune cellular functions into which hematopoietic stem cells may develop. Gradients of expression levels projected onto tGPLVM embeddings may be used to further interrogate changes in cell states from different experiments.
tGPLVM scales to a million cells
Finally, we evaluate the ability of tGPLVM and related methods to fit embeddings for unfiltered, unnormalized, high throughput scRNAseq data. Models with two latent dimensions were fit on subsamples from 100 to 1 million cells from the 10x 1 million mouse brain cell data [1]. tGPLVM, PCA, and scVI are the only methods that can fit one million cells in a computationally tractable way (Fig. 7). ZIFA is slower than tGPLVM by an order of magnitude consistently across sample sizes. Since ZIFA requires a dense input, its input cell count matrix is limited in our framework to approximately 100,000 cells. While tSNEâ€™s implementation can take as input a sparse matrix format, it does not converge beyond 10^{4} samples. Similarly, UMAP is not able to fit more than 10^{5} cells.
To check that the embedding has biological significance, we again used Pearsonâ€™s correlation to identify genes whose expression is correlated with latent dimensions. We find that latent dimension one corresponds to increased expression of genes associated with the circulatory system and hemoglobin, such as HBBBS (Pearsonâ€™s r=0.320) and HBAA1 (Pearsonâ€™s r=0.316; Fig. 8b). Dimension two correlates with genes such as TUBA1A (Pearsonâ€™s r=0.474) and FEZ1 (Pearsonâ€™s r=0.427) that are associated with neural cells (Fig. 8a). While scVI and PCA also find similar separation, visually tGPLVM provides a clear gradient between the patterns (Fig. 8). The ability of tGPLVM to scale to highthroughput data and capture global structure from unnormalized count matrices makes it a powerful method for analyzing upcoming largescale singlecell experiments.
Discussion
Our results show that tGPLVM is flexible to cell type, cell development, and cell perturbation experiments and finds informationrich mappings from filtered and processed data as well as unfiltered raw count data. tGPLVM scales to the size of a million cells as produced by the latest singlecell sequencing systems. Despite the sparsity, these data are complex and require more than two factors to capture variation; we did not use the ARD kernel parameters to remove dimensions for any of our experiments. However, the embedded dimensions in our experiments were able to capture informative representations of these complex data. As we increased the number of latent dimensions, at some point the ARD prior would remove unnecessary dimensions for being redundant.
We expect that this general statistical model of dimension reduction may be built upon for more sophisticated, nonparametric approaches for a variety of singlecell tasks from normalization and imputation to cell type identification and pseudotime inference. The optimal method for modeling space count data is an open area of research. Alternative models could explore the use of Poisson, negative binomial, or multinomial emission models. While the tdistribution is robust to different distributions, there may be other transformations of the count data that better stabilize variance or reduce heteroskedasticity.
Conclusion
We present a Bayesian nonparametric model for robust nonlinear manifold estimation in scRNAseq settings. tGPLVM captures transcriptional signals in singlecell data using a robust Studentâ€™s tdistribution noise model and integrating adaptive kernel structure in settings with no a priori information about clusters or sequencing order. We hope that this robust manifold estimation can be used for other types of data with noisy outliers and sparse features.
Methods
The tdistribution Gaussian process latent variable model (tGPLVM).
The tGPLVM assumes that samples in highdimensional space are noisy observations of a Gaussian process of lowerdimensional latent features. Let \(Y \in \mathbb {R}^{N\times P}\) represent N observations in a highdimensional space of dimension P, and let \(X \in \mathbb {R}^{N \times Q}\) represent the same observations in a lowerdimensional space Qâ‰ªP. Each sample x_{n} in nâˆˆ{1,2,...,N} is assumed to be drawn from a Q dimensional multivariate normal distribution with identity variance:
Noiseless observations of each of the P highdimensional features across N samples, f_{p}(X), are draws from a zeromean Gaussian process of x across a weighted sum of M kernels:
where K_{NN} represents the NÃ—N covariance matrix defined by the kernel function, k(x,x^{â€²}). In the traditional GPLVM, observations y_{n,p} are noisy realization of a normal distribution with mean f_{n,p} and variance Ï„^{2}:
For tGPLVM, each observation y_{n,p} is drawn from a heavytailed Studentâ€™s tdistribution with a set degrees of freedom Î½ and featurespecific variance \(\tau _{p}^{2}\) [28]:
where we use f_{n,p} to represent the nth component of the N dimensional vector f_{p}(X). We set Î½=4 for expression data based on previous work with Gaussian process regression with tdistributed error [28, 29].
The kernel that we use is a flexible sum of an automatic relevance determination (ARD) squared exponential kernel and three different MatÃ©rn ARD kernels, each with hyperparameters scales Ïƒ_{m} and length scales â„“_{m,q}. Each ARD dimensionspecific length scale, l_{k,q}, indicates the distance of that latent dimension over which points are similar. While nonparametric covariance estimation techniques may be more flexible [43, 44], in this formulation, kernel parameters â€“ and associated kernel shape â€“ are estimated using maximum likelihood methods simultaneously with latent variable estimation. Letting r represent the length scaleweighted distance in latent space, the kernels are defined as:
We use black box variational inference (BBVI) [45] (see Appendix A) to estimate the posterior distribution for the tGPLVM. We adapt the variational distributions from prior work [46]. Inference is implemented in Python using Edward [47, 48]. Edward optimizes the hyperparameters to maximize the conditional likelihood p(xz,Î»). The use of BBVI means there is no computational cost to changing the error model and kernels. However, the switch from exact gradientbased methods of traditional VI for GPLVMs [46] to approximations via BBVI may have an accuracy cost.
To scale to large data, minibatches of cells and genes are used to approximate gradients at each step. Genes (i.e., features) are sampled in proportion to the percentage of cells in which they are expressed to efficiently approximate the covariance matrix calculated during inference, inspired by previous random matrix algorithms for approximate matrix multiplication [49]. Cells (i.e., samples) are sampled uniformly in every batch. Inference was performed on Microsoft Azure High Performance Computing cores.
Simulation experiments
We simulated tendimensional observations from a ground truth twodimensional branching manifold using Gaussian processes with different kernels and error models. Latent embeddings were estimated for tGPLVM with both composite kernels and normal kernels. We also compared models with a normal error and Studentâ€™s tdistributed error with degrees of freedom between 1 and 10. Each test (simulation and inference) was repeated one hundred times. We tested two different noise models with smooth RBF kernels: normallydistributed error with mean zero and variance between 0.1 and 1 times the maximum value of noiseless data, and skewed noise with a gammadistributed error with scale 1 and shape between 0.1 and 1 times the maximum value of the noiseless data.
To compare the estimated manifolds to ground truth, we used the Wasserstein2 distance between the rownormalized Euclidean distance matrices between all observations, with a cost matrix equal to the pairwise distance in the ground truth. Intuitively, this penalizes large distances between points that were nearby in the true latent space.
Single cell RNAseq data
We chose four data sets to evaluate tGPLVMâ€™s applicability to identify cell type, state, and developmental trajectory, and scalability to experiments with large numbers of cells. The Pollen data [33], which were used to evaluate clustering, consist of 11 distinct mouse neural and blood cell populations across 249 cells sequenced on a Fluidigm C1 systems. Pollen is a dense matrix because of high read depth, with about 80% nonzero values. The counts, Y, are log normalized as Y^{â€²}= log10(1+Y). Inference of development trajectories was evaluated on the data used to develop the method GPfates from Lonnberg [10]. These data include 408 T helper cells sequenced over 7 days following Plasmodium infection on a Fluidigm C1 system. The Lonnberg data are provided as TPM measurements. The data are sparse and normalized by log2(1+Y). Cell state was explored on batch of CD34+ peripheral blood mononuclear cells [1] (PBMCs). About 10,000 cells were captured with 10x Cell Ranger sequencing technology. These sparse data were also normalized as log2(1+Y). Finally, to ensure scalability to the most recent experimental data sets, we fit the model to 1 million mice brain cells sequenced on a 10x Cell Ranger [1]. We normalized the mice brain cells as log2(1+Y).
Identifying cell types with kmeans clustering
Clustering for celltype identification was evaluated on the Pollen [33] data. tGPLVM and comparison methods were used to fit latent mappings between 2 and 9 dimensions. To perform clustering for each of the estimated latent manifolds, we used kmeans clustering with the number of clusters k equal to the number of different cell type or cell state labels in the existing data. Clustering with kmeans was repeated ten times on the mean of the maximum a posterior estimate of the latent position and evaluated against true labels using normalized mutual information (NMI) and adjusted rand score (ARS). Mutual information measures the amount of information contained about one random variable (the true labels) in another random variable (the inferred labels). NMI normalizes mutual information by the geometric mean of the entropy of both labels to a scale of zero â€“ no mutual information â€“ to one â€“ the same distribution [50]. ARS is a measure of the proportion of shared members between pairs of true and estimated clusters [51]. Zero inflated factor analysis (ZIFA) [4], tSNE [7] (perplexity set to default 30), UMAP [21] (default settings), scVI [23], and PCA [14] were tested as comparison methods. To evaluate the robust adaptations of the tGPLVM model, we fit tGPLVM with only a SE kernel or SE and MatÃ©rn 1/2 kernel, as well as a tGPLVM with normally distributed error.
Trajectory building with minimum spanning trees
tGPLVM was used to fit a two dimensional map for the Lonnberg developmental data [10]. The minimum spanning tree was found on the Euclidean distance matrix of the posterior means of the lowdimensional embedding, and compared to sequencing time to verify correct ordering. The same analysis was performed with ZIFA [4], tSNE [7] (perplexity set to default 30), UMAP [21], scVI [23], and PCA [14].
Visualization of sparse, raw count matrices
tGPLVM was used to fit a threedimensional mapping for the two 10x data sets, CD34+ cells and mice brain cells. Pearson correlation between latent position posterior mean and expression counts was used to identify genes associated with latent dimensions. ZIFA [4], tSNE [7] (perplexity set to default 30), UMAP [21], scVI [23] and PCA (truncated SVD [52]) were also fit to the CD34+ cell data.
Scaling inference to highthroughput experiments
Computational times were recorded for samples of 100, 1,000, 10,000, 100,000 and 1,000,000 cells from the 10x 1 million mouse brains cells [1] to a latent embedding using tGPLVM and comparison methods. Experiments were run on a Standard H16m (16 VCPUs, 224 GB memory) Azure high performance computing Unix system. tGPLVM was run for 100 passes through the data. Each minibatch contained the minimum of 2500 cells or the number of cells, and 250 genes. ZIFA, tSNE, UMAP, scVI, and PCA (using truncated SVD) were fit until convergence or failure.
Availability of data and materials
The pollen data and implementation is available in the tGPLVM repository (https://github.com/architverma1/tGPLVM). Links to repositories for the remaining data is also available in the tGPVLM repository.
Abbreviations
 GPLVM:

Gaussian Process Latent Variable Model
 tGPVLM:

Studentâ€™
 s tdistributed Gaussian Process Latent Variable Model; PCA:

Principal Component Analysis
 tSNE:

tdistributed Stochastic Neighbor Embedding
 ZIFA:

Zero Inflated Factor Analysis UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
 scVI:

Single Cell Variational Inference
 NMI:

Normalized Mutual Information
 ARS:

Adjusted Rand Score
References
Zheng GXY, et al.Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017; 8:14049.
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of celltocell heterogeneity in singlecell RNAsequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015; 33:155â€“60.
Eraslan G, Simon LM, Mircea M, Mueller NS, Theis FJ. Singlecell RNAseq denoising using a deep count autoencoder. Nat Commun. 2019; 10(1):1â€“14.
Pierson E, Yau C. ZIFA: Dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biol. 2015; 16(1):241.
Haghverdi L, BÃ¼ttner M, Wolf FA, Buettner F, Theis FJ. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods. 2016; 13(10):845â€“8.
Amodio M, van Dijk D, Srinivasan K, Chen WS, Mohsen H, Moon KR, Campbell A, Zhao Y, Wang X, Venkataswamy Ma. Exploring singlecell data with deep multitasking neural networks. Nature methods. 2019;:1â€“7. Nature Publishing Group.
Van Der Maaten LJP, Hinton GE. Visualizing highdimensional data using tSNE. J Mach Learn Res. 2008; 9:2579â€“605. http://arxiv.org/abs/1307.1662.
Ahmed S, Rattray M, Boukouvalas A. GrandPrix: Scaling up the Bayesian GPLVM for singlecell data. Bioinformatics. 2018; 533:47â€“54.
Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014; 32:381.
LÃ¶nnberg T, et al.Singlecell RNAseq and computational analysis using temporal mixture modeling resolves TH1/TFH fate bifurcation in malaria. Sci Immunol. 2017; 2(9):2192.
Li WV, Li JJ. An accurate and robust imputation method scImpute for singlecell RNAseq data. Nat Commun. 2018; 9(1):997.
Dumitrascu B, Feng K, Engelhardt BE. GTTS: Experimental design for maximizing cell type discovery in singlecell data. bioRxiv. 2018:386540. Cold Spring Harbor Laboratory.
Dumitrascu B, Villar S, Mixon DG, Engelhardt BE. Optimal marker gene selection for cell type discrimination in single cell analyses. BioRxiv. 2019:599654. Cold Spring Harbor Laboratory.
Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933; 24(6):417.
Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, Herman JL, Kaper F, Fan JB, Zhang K, Chun J, Kharchenko PV. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods. 2016; 13(13):241â€“4.
Shin J, Berg DA, Zhu Y, Shin JY, Song J, Bonaguidi MA, Enikolopov G, Nauen DW, Christian KM, Ming GL, Song H. Singlecell RNAseq with waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell. 2015; 17(3):360â€“72.
Harman H. H.Modern Factor Analysis, 3rd edn. Chicago: Univ. of Chicago Press; 1960.
Comon P. Independent component analysis, a new concept?. Sig Process. 1994; 36(3):287â€“314. http://arxiv.org/abs/arXiv:1011.1669v3.
Hotelling H. Relations between two sets of variates. Biometrika. 1936; 28(34):321â€“77.
Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating singlecell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018; 36:411â€“20.
McInnes L, Healy J, Melville J. u map: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018.
Angerer P, Haghverdi L, BÃ¼ttner M, Theis FJ, Marr C, Buettner F. Destiny: Diffusion maps for largescale singlecell data in R. Bioinformatics. 2016; 32(8):1241â€“3.
Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for singlecell transcriptomics. Nat Methods. 2018; 15(12):1053â€“8.
Titsias M, Lawrence N. Bayesian Gaussian process latent variable model. Artif Intell. 2010; 9:844â€“51. http://arxiv.org/abs/1309.6835.
Lawrence N.Probabilistic nonlinear principal component analysis with Gaussian process latent variable models. J Mach Learn Res. 2005; 6:1783â€“816.
Wattenberg M, ViÃ©gas F, Johnson I. How to use tSne effectively. Distill. 2016; 1(10):2.
Lucas J, Tucker G, Grosse RB, Norouzi M. Donâ€™t blame the Elbo! a linear Vae perspective on posterior collapse. In: Advances in Neural Information Processing Systems: 2019. https://arxiv.org/abs/1911.02469.
Tang Q, Niu L, Wang Y, Dai T, An W, Cai J, Xia ST. Studentt process regression with Studentt likelihood. IJCAI Int Joint Conf Artif Intell. 2017; 12:2822â€“8. http://arxiv.org/abs/1106.4431.
Vanhatalo J, JylÃ¤nki P, Vehtari A. Gaussian process regression with Studentt likelihood: 2009. p. 1910â€“18.
Reid JE, Wernisch L. Pseudotime estimation: Deconfounding single cell time series. Bioinformatics. 2016; 32(19):2973â€“80.
Guttorp P, Gneiting T. Studies in the history of probability and statistics XLIX on the MatÃ©rn correlation family. Biometrika. 2006; 93(4):989â€“95.
Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for singlecell RNAseq based on a multinomial model. Genome Biol. 2019; 20(1):1â€“16.
Pollen AA, et al.Lowcoverage singlecell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol. 2014; 32:1053.
Oâ€™Hagan A. On outlier rejection phenomena in Bayes inference. J R Stat Soc Ser B Methodol. 1979; 41(3):358â€“67.
Oâ€™Hagan A. Modelling with heavy tails. 1988:345â€“359.
Engelhardt BE, Stephens M. Analysis of population structure: A unifying framework and novel methods based on sparse factor analysis. PLoS Genet. 2010; 6(9).
Sidney LE, Branch MJ, Dunphy SE, Dua HS, Hopkinson A. Concise review: Evidence for CD34 as a common marker for diverse progenitors. Stem Cells. 2014; 32(6):1380â€“9.
Tomasello E, Vivier E. KARAP/DAP12/TYROBP: Three names and a multiplicity of biological functions. Eur J Immunol. 2005; 35(6):1670â€“7.
Donato R, Cannon BR, Sorci G, Riuzzi F, Hsu K, Weber DJ, Geczy CL. Functions of S100 Proteins. Curr Mol Med. 2013; 13(1):24â€“57.
Xia C, Braunstein Z, Toomey AC, Zhong J, Rao X. S100 proteins as an important regulator of macrophage inflammation. Front Immunol. 2018; 8(JAN):1â€“11.
Browning JL, Ngamek A, Lawton P, DeMarinis J, Tizard R, Chow EP, Hesslon C, Oâ€™BrineGreco B, Foley S, Ware CF. Lymphotoxin B, a novel member of the TNF family that forms a heteromeric complexs with lymphotoxing on the cell surface. Cell. 1993; 72:847â€“56.
Stelzer G, et al.The GeneCards suite: From gene data mining to disease genome sequence analyses. Curr Protocol Bioinformatics. 2016; 1(June):1â€“30113033.
BjÃ¸rnstad ON, Falck W. Nonparametric spatial covariance functions: Estimation and testing. Environ Ecol Stat. 2001; 8(1):53â€“70.
Handcock MS, Stein ML. A Bayesian analysis of kriging. Technometrics. 1993; 35(4):403â€“10.
Ranganath R, Gerrish S, Blei DM. Black Box Variational Inference. Int Conf Artif Intell Stat (AISTATS). 2013; 33. http://arxiv.org/abs/1401.0118.
Damianou AC, Titsias MK, Lawrence ND. Variational inference for latent variables and uncertain inputs in Gaussian processes. J Mach Learn Res. 2016; 17:1â€“62.
Tran D, Kucukelbir A, Dieng AB, Rudolph M, Liang D, Blei DM. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787. 2016.
Tran D, Kucukelbir a, Dieng AB, Rudolph M, Liang D, Blei DM. Edward: A library for probabilistic modeling, inference, and criticism: 2016.
Drineas P, Kannan R, Mahoney MW. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM J Comput. 2006; 36(1):132â€“57.
Strehl A, Ghosh J. Cluster ensembles  A knowledge reuse framework for combining multiple partitions. J Mach Learn Res. 2003; 3(3):583â€“617.
Hubert L, Arabie P. Comparing partitions. J Classif. 1985; 2(1):193â€“218.
Halko N, Martinsson PG, Tropp JA. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. arXiv preprint arXiv:0909.4061. 2009:1â€“74.
Acknowledgements
We gratefully acknowledge conversations with Dana Peâ€™er, Anqi Wu, and members of the Engelhardt Research Group, and Ariel Gewirtz and Bianca Dumitrascu in particular, for critical insights into this work.
Funding
Funded for computational resources were provided by CZI AWD1005664, CZI AWD1005667, NIH R01 HL133218, NIH U01 HG007900, a Sloan Faculty Fellowship, and an NSF CAREER AWD1005627. Funding sources did not play a role in any part of the study.
Author information
Authors and Affiliations
Contributions
AV and BEE conceived the model and contributed to writing the manuscript. AV implemented and applied software for fitting the model. Both authors have read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
BEE is currently on a leave of absence from Princeton University at Genomics plc. BEE is on the Scientific Advisory Boards of Celsius Therapeutics and Freenome.
Additional information
Publisherâ€™s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
A PDF file with two appendices. Appendix A describes the variational distributions for inference. Appendix B lists the sources of the various data sets and implementations of comparison methods.
Additional file 2
Supplemental figure 1â€“sensitivity of clustering results to initialization. The normalized mutual information and adjusted rand score when comparing clusters learned from KMeans applied to tGPLVM fit to true cluster labels when tGPLVM is initialized with PCA or random Gaussian noise.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the articleâ€™s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the articleâ€™s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Verma, A., Engelhardt, B.E. A robust nonlinear lowdimensional manifold for single cell RNAseq data. BMC Bioinformatics 21, 324 (2020). https://doi.org/10.1186/s1285902003625z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902003625z