 Software
 Open Access
 Published:
FastProject: a tool for lowdimensional analysis of singlecell RNASeq data
BMC Bioinformatics volume 17, Article number: 315 (2016)
Abstract
Background
A key challenge in the emerging field of singlecell RNASeq is to characterize phenotypic diversity between cells and visualize this information in an informative manner. A common technique when dealing with highdimensional data is to project the data to 2 or 3 dimensions for visualization. However, there are a variety of methods to achieve this result and once projected, it can be difficult to ascribe biological significance to the observed features. Additionally, when analyzing singlecell data, the relationship between cells can be obscured by technical confounders such as variable gene capture rates.
Results
To aid in the analysis and interpretation of singlecell RNASeq data, we have developed FastProject, a software tool which analyzes a gene expression matrix and produces a dynamic output report in which twodimensional projections of the data can be explored. Annotated gene sets (referred to as gene ‘signatures’) are incorporated so that features in the projections can be understood in relation to the biological processes they might represent. FastProject provides a novel method of scoring each cell against a gene signature so as to minimize the effect of missed transcripts as well as a method to rank signatureprojection pairings so that meaningful associations can be quickly identified. Additionally, FastProject is written with a modular architecture and designed to serve as a platform for incorporating and comparing new projection methods and gene selection algorithms.
Conclusions
Here we present FastProject, a software package for twodimensional visualization of single cell data, which utilizes a plethora of projection methods and provides a way to systematically investigate the biological relevance of these low dimensional representations by incorporating domain knowledge.
Background
In an analogous manner to the maturation of RNASeq methodologies, singlecell RNASeq (scRNASeq) is now in its infancy and requires new computational tools to realize its full potential for dissecting and understanding the functional meaning of celltocell heterogeneity [1, 2]. Visualization methods provide an effective strategy for inspecting and characterizing the phenotypic diversity between cells. In a typical scenario, the analysis begins with a matrix of expression levels of thousands of genes in hundreds of cells. An appealing way to make sense out of this immense data is to project it into a two dimensional scatterplot, where each cell is represented by a single point. While such representations provide an easy way to see obvious stratification of cells into a taxonomy of discrete types, they can also provide more nuanced views of gradual transitions, reflecting for instance, developmental processes [3], physical locations [4], or the cell cycle [5]. Indeed, supplementing these twodimensional views with additional, phenotypical information (e.g. the expression level of marker genes) can be used to provide the correct context, and make the observed diversity between cells interpretable [6, 7]).
There are three main challenges in making effective use of such visualizations for scRNASeq data. The first challenge is selecting an appropriate method for dimensionality reduction (projection) among candidates such as principal component analysis (PCA) [1, 2], independent component analysis (ICA) [3] or various nonlinear methods such as tdistributed stochastic neighbor embedding (tSNE) [8], each of which may highlight different aspects of the data. Once a projection is created, a second challenge is to interpret its biological significance, namely which cellular phenotypes [7] or processes [9] are most responsible for the resulting arrangement of cells. Lastly, scRNASeq can be difficult to correctly interpret due to technical confounders such as differences in gene capture rates [10, 11]. Performing functional interpretation on the input data without accounting for this effect may lead to incorrect interpretation of the biological meaning of celltocell heterogeneity.
Introducing FastProject
To address these issues, we have developed FastProject, a software tool for the visualization and interpretation of scRNASeq data. FastProject allows the user to explore a gene expression matrix using a plethora of twodimensional visualization methods. To interpret these twodimensional plots, we use the concept of biological signatures  sets of genes that represent either a collection of genes with a common associated function (e.g. Glycolysis) or a dichotomy between two cellular states of interest (e.g. naïve vs. memory CD8 Tcells)[7]. We evaluate the extent to which these phenotypic dichotomies are reflected in the projections (i.e. to what extent do neighboring cells in the projection have similar values for the genes included in the signature), and highlight the significant projectionsignature pairs. This analysis is made possible in single cells by modeling the probability that a missed transcript was actually expressed in the cell, and using these probabilities when evaluating signature scores on samples to minimize the effect of variable capture rates between cells. As a source for biological signatures, we use publicly available datasets that consist of comparative information from hundreds of studies (e.g. MSigDB [12]), which can be supplemented by the user to include any other gene signatures of interest. Through this automated analysis, FastProject therefore provides a systematic view of the main axes of variation in the data, along with their possible biological meaning.
Compared with currentlyavailable software, FastProject is unique in the combination of methods it employs. Visualization software such as viSNE [6], although designed for cytometry data, allows for the integrated visualization of transcript levels overlaid on a twodimensional tSNE projection. However, viSNE does not incorporate the use of gene sets or provide a method to systematically search for biological variation within a twodimensional projection. Another method, PAGODA [13], makes use of gene sets, but does so in the context of a clustered heatmap of expression data, not a 2dimensional embedding.
It is important to clarify upfront that FastProject is not a normalization tool. Indeed, it has been observed by us and others that without proper normalization scRNASeq data can be heavily confounded by technical factors such as library depth and complexity [7]. We strongly advocate the use of scRNASeq normalization tools (e.g. based on [5] or [7]) prior to any downstream analysis, and we assume that the data has been normalized prior to input to FastProject. Nevertheless, since scRNASeq measurements tend to be characterized by strong zeroinflation, we conduct our biological signature analysis while aiming to minimize the effect of gene dropout events (i.e. false negatives, described in [10, 14]).
When running FastProject, processing is done upfront, producing dozens of projection maps (using different algorithms and parameterizations) and their functional annotation, which can be inspected through an interactive HTML report. On a typical data set, consisting of around 1000 cells, processing time is typically between 10 and 30 min (see Additional file 1 for benchmarking results). Because processing is performed upfront, the user can quickly switch between different projection maps in the output report as well as share the results with colleagues who would not themselves need to install FastProject. Importantly, FastProject has been written to be easily extensible so that it may serve as a general platform for deploying and evaluating new gene filtering schemes, falsenegative estimation methods, or projection techniques. Instructions for developers on how to augment FastProject are detailed in the FastProject wiki hosted at https://github.com/YosefLab/FastProject/wiki.
Implementation
Overview
The steps involved in the FastProject processing pipeline are depicted in Fig. 1.
The software starts with an evaluation of false negative rates, later used to downweight the effect of dropouts on the biological signature analysis. It then selects sufficiently variable genes for further analysis using increasingly stringent criteria. With these genes in hand, FastProject uses 11 different projection methods (summarized below) to calculate two dimensional coordinate for each cell. Based on a userprovided database of gene signatures it then computes a score for every cell/signature pair and uses a randomization test to identify statistically significant projectionsignature associations. Importantly, the confounding effect of missed transcripts is mitigated by estimating the probability that each undetected gene was actually expressed in the cell, and downweighting the contributions of these measurements in downstream analysis (similar to [7]). Altogether, FastProject outputs 76 possible projections (a combination of choosing different gene filtering criteria, whether or not the data was reduced to significant PCs prior to projection, and the final projection method) along with their functional annotations, which can be interactively inspected through a userfriendly HTML report (Figs. 1 b and 3). The results are also provided in the form of text files (including signature scores, projection coordinates etc.), which can be used for downstream analysis.
Falsenegative estimates
To account for expressed transcripts that are not detected (false negatives) due to the limitations in sensitivity [1, 10, 11, 15], an initial step in the processing pipeline is to evaluate these detection rates so that the subsequent analysis can downweight the contribution of less reliably measured transcripts. More specifically, it has been observed that the detection probability for a transcript varies with the transcripts relative abundance measurement as well as the total amount of RNA in the sample [10, 14]. To characterize this, we calculate a false negative rate curve which estimates the probability of detection based on a gene’s abundance. A separate curve is estimated for each cell individually so that varying levels of library quality and cell integrity can be accounted for.
To perform this estimate, our procedure utilizes a set of human housekeeping genes [16] under the assumption that these housekeeping genes are constitutively expressed in all cells and only missed due to technical errors. Importantly, as the appropriate set of constitutively expressed genes may differ from study to study and between organisms, FastProject can accept a userdefined housekeeping list.
Our estimation of false negative rates is built on and extends upon our previous work [7]. For each housekeeping gene, we estimate its mean expression level by taking the average of nonzero measurements for the gene. We then use the estimated means to group the genes into 30 quantiles, and denote the mean of genes in quantile 1≤q≤30 as μ _{ q }. For each cell j and quantile q, we then compute F _{ jq } as the proportion of genes from q that are not detected by j. Based on our assumption of constitutive expression, we treat F _{ jq } as an empirical estimate to the dropout rates (i.e. probability that a gene is not detected, given that it is expressed). We use the F _{ jq } values to fit a sigmoid function \(\hat {F}_{j}(\cdot)\) that describes the observed dropout rates as a function the genes’ average expression when detected (Fig. 1 a):
where the estimated parameters α _{ j } and β _{ j } minimize the residual sum of squares (RSS) term:
Applying the fitted function globally for all genes, we estimate the conditional probability for a dropout event for gene i in cell j as: \(P(\text {Gene i is not detected in cell j} \mid \text {Gene i is expressed by cell j})=\hat {F}_{j}(\mu _{i})\) where μ _{ i } is the average of gene i’s expression level when detected. Finally, we estimate the prior probabilities for detection and expression of each gene in order to evaluate weights for the missed transcripts, P(Gene i is not expressed by cell j∣Gene i isnot detected in cell j), as described in the Methods section. These probability estimates are used further in the pipeline to reduce the effect of missed transcripts when evaluating signature scores and generating projections. As a first step in validating the efficacy of this approach, a set of simulations were performed (Additional file 2, procedure detailed in the “Methods” section) in which FastProject’s weighting procedure was able to distinguish between true negatives (representing unexpressed genes) and false negatives (representing technical dropouts).
The fitted sigmoids \(\hat {F}_{j}(\cdot)\) can also be used to provide an overall evaluation of the abundance of false negatives in each cell j by taking the area under the curve. This in turn provides a way to identify and exclude poor quality samples, assuming that such samples have higher dropout rates. Such quality control filter (on cells) is available in FastProject and can be enabled when running the pipeline (it is turned off by default). With this option enabled, samples that score 1.6 median absolute deviation (MAD) units worse than the median quality score are removed prior to calculating signatures and lowdimensional projections.
Generating 2dimensional projections
Gene filtering
Selecting an informative set of genes is necessary for obtaining biologically meaningful patterns of variability between cells. To this end, FastProject applies a strict prefiltering step that discards genes detected in less than a threshold number of cells. The default threshold is 20 % of the input cells, however this is configurable as an input option. Subsequently, FastProject produces projections that derive from all prefiltered genes as well as subsets thereof, calculated using either of two filtering schemes. The first filter option selects bimodal genes, a characteristic which in the past has been useful in isolating biological variation (e.g. cytokines related to Th17 differentiation [7] or antiviral and inflammatory response genes in stimulated dendritic cells [17]). In FastProject bimodal genes are selected using the Hartigan’s Dip Test (p<0.05). The second filter option selects highlyvariable genes, based on their Fanofactor (σ ^{2}/μ where μ is the average expression and σ the standard deviation across all cells). An increase in the Fano factor past a value of 1 indicates a departure from a Poisson statistic, which is the steady state distribution expected of constitutively expressed genes [18]. To select candidates with high Fanofactor, similar to the procedure in [19], genes are first stratified into 30 quantiles according to μ, and within each quantile genes are retained if their Fanofactor is more than 2 MAD above the quantile’s median Fanofactor.
Projection methods
The variety of methods available for the task of dimensionality reduction each come with strengths and weaknesses. For instance, projections of scRNASeq data based on PCA [7], provide an appealing guarantee about the preservation of variation, and makes the contribution of individual genes readily interpretable [6]. However, the underlying assumptions of PCA may not necessarily be supported by single cell data. In particular, PCA is a linear transformation, which may not be able to accurately capture nonlinear relationships in the data (i.e. if the data is embedded within a nonlinear lowdimensional manifold). Additionally, PCA posits that the projection axes should be uncorrelated, which again may not necessarily result in the most informative representation. The same criticisms apply for other linear methods such as ICA [3]. A complementary, and commonly used approach (tSNE [8]) uses a nonlinear projection that aims to preserve the structure in the data locally. However, while this method aims to ensure that points that are close in the high dimensional space will be close (with high probability) in the low dimensionality embedding, more global relations are not directly interpretable from the results. To avoid the issue of choosing upfront between these different options, FastProject uses these methods plus additional nonlinear projection methods, including: ISOMAP [20] (using four nearest neighbors when defining the adjacency graph), PCA with a radial basis function (RBF) kernel (with kernel coefficient of \(\frac {1}{\text {Number of genes}}\)), MultiDimensional Scaling (MDS), and Spectral Embedding [21] (with Graph laplacian formed using knearest neighbors with \(k=\frac {\text {Number of cells}}{10}\)). For the linear PCA we consider three combinations of principal components (1st and 2nd, 1st and 3rd, 2nd and 3rd); and for tSNE we use two perplexity values (10 and 30), totaling in 11 projection methods overall. All methods are run as implemented in the ScikitLearn package for Python [22]. After each projection, the resulting sets of 2dimensional coordinates are meancentered and scaled such that the average r ^{2}=[x c o o r d i n a t e]^{2}+[y c o o r d i n a t e]^{2} equals 1. This standardization is performed so that signatureprojection scores (defined below) are more comparable between projections.
Incorporating falsenegative estimates into projections
Nonlinear methods such as tSNE have been shown to effectively combine with PCA. In this approach, PCA is performed first to reduce highdimensional data to an intermediate number of dimensions, and this intermediate representation is further reduced to 2 or 3 dimensions by the nonlinear method. The procedure is effective, assuming that the initial PCA operation preserves the main structure of the data while discarding unnecessary components.
FastProject makes use of this procedure to allow for a general method of incorporating weights (derived from falsenegative estimates) into nonlinear projections. Prior to a nonlinear projection, a weightedPCA is performed by using a weighted covariance matrix (similar to [7]). Entries in the weighted covariance matrix are calculated as:
where X _{ ij } is the logtransformed expression of gene i in cell j and w represents the matrix of weights of equivalent size, defined as:
To select the number of components to retain, following weightedPCA, the software employs a randomization scheme, as described in [23], to select the top principal components with statistically significant (p<0.05) contribution to the overall variance (retaining at least 5 components when less than 5 are significant). All the subsequent nonlinear projections are evaluated on this reduceddimension matrix. To provide a way for evaluating the effects of this reduction, FastProject also runs each nonlinear projection method on the original, nonreduced matrix, allowing the user to switch between these representations in the output report.
Signaturebased analysis
To interpret the biological meaning of the organization of cells in the resulting two dimensional maps, FastProject incorporates domain knowledge through the input of gene “signatures” [7]. The signatures can reflect a comparative analysis between two conditions of interest and consist of a set of genes, each of which is labeled as either “upregulated” or “downregulated” in that comparison. Signatures, such as these, are based off of annotations of cell states obtained from public databases (e.g. the Immsig collection [12]), or provided by the user. For each signature, a score is computed against each cell by aggregating over the weighted expression level of its genes. Specifically, for signature S and cell j, the respective signature score R _{ s }(j) is calculated as:
Where \(X^{\prime }_{ij}\) is the standardized (Znormalized across all cells) log expression level of gene i in cell j,w _{ ij } is the estimated FalseNegative weight (defined above), and s i g n _{ S }(i)=−1 for genes in the "downregulated" set and +1 otherwise. Notably, signatures can also be undirected, in which case the sign value is set to +1 for all member genes.
Projections vs. signatures
A signatureprojection consistency score is calculated to evaluate how well each projection reflects the phenotypic variation that is captured by each signature. To this end, for each pair of signature, s, and projection, p, we compute a signature consistency score representing the extent to which neighboring points in the projection have similar signature scores. This is done by calculating for each cell j, an estimated signature rank \(\hat {r}_{sp}(j)\) based on its location in the two dimensional plot:
where Δ _{ jk } is the Euclidean distance between cells j and k in the projection, α defines an effective neighborhood size (set to 0.33 by default), and r _{ s }(j) is the rank of the signature score for cell j (i.e. a rank transformation of the quantitative signature scores R _{ S }). The signatureprojection consistency is then determined by the respective goodness of fit:
In this way, each signature/projection pairing is scored based on how similar signature scores are for samples located nearby in the projection space. To identify significantly consistent signature/projection pairs, we use signatures of randomly selected genes to create empirical background distributions of signature scores. We compare the consistency scores computed for the original signatures with those of the random ones (for example, see Fig. 2 a), obtaining Pvalues using a Ztest and correcting for multiple hypothesis testing using the Benjamini Hochberg procedure. We observed that the distribution parameters of scores generated by random signatures varied with the number of genes in the signature. To account for this, separate distributions are generated for different signature sizes (10, 20, 50, 100, and 200 genes) and when assessing the significance of a signature score, the score is compared against the background distribution with the most similar number of genes. In the output, we report all the signatures that had a significant match (F D R<0.05) with at least one projection.
Results
Software
FastProject has been implemented as a command line Python package. As an input the software receives: (1) an expression matrix in a tabdelimited format (genes in rows, cells in columns). (2) a set of gene signatures, using the standard GMT format. Such sets of directed signatures are publicly available from various databases, such as MsigDB [12] (e.g. including signaling effects of genetic and chemical perturbations, cell cycle signatures, and comparison of cell types) and NetPath [24] (transcriptional effects of signaling cascades). Undirected gene sets are naturally more abundant, and can be drawn from resources such as Gene Ontology [25], KEGG [26, 27], and MSigDB [12] (note that the latter includes much of the information in the former two). Importantly, the user can also upload his/her own signatures that reflect a phenotype of interest. For instance, in the example below, we study glioblastoma cells, and use signatures derived from microarray experiments, which define different tumor subtypes. FastProject also accepts precomputed signatures, namely, an association of cells with precomputed values. These can be categorical (e.g. annotations of clusters, computed by a different tool); or numerical (e.g. additional meta data associated with cells; e.g. levels of a handful of proteins, or technical factors such as library complexity; see Methods section for description of how precomputed signatures are analyzed).
FastProject will calculate projections, signature scores, and their associations, covering all the options above (totaling in 76 different projections; Fig. 1 b). To examine the extent to which the projections are affected by zero values, FastProject treats the sample quality score (defined above), and the percentage of genes with zero expression as precomputed signatures and evaluates their association with each projection method. The output is provided as an HTML file (Fig. 3), where projections, signatures and their associations can be inspected interactively. Additionally, a data export feature embedded in the HTML report allows the generation of tabdelimited text files that depict the output projection coordinates, signature scores, and other relevant information. The source code, running, examples and user manual are bundled with this manuscript submission and also hosted at https://github.com/YosefLab/FastProject.
Extending FastProject
FastProject has been designed using a modular architecture so that new projection methods or criteria to filter genes can easily be added to the pipeline. Since dimensionality reduction is still an active research area, this allows new methods to easily be compared against more traditional approaches. For example, the recently proposed ZIFA algorithm [28] can be added by appending the following lines to Projections.py:
This is documented in the software wiki, hosted with the project repository at https://github.com/YosefLab/FastProject/wiki.
Proof of concept
We applied FastProject to a recently published data sets of tumor cells from five glioblastoma patients [2]. The analyzed data, consisting of 430 single cells with mRNA abundance in units of transcripts per million (TPM) and normalized as described in [2], was downloaded from the Gene Expression Omnibus, accession number GSE57872. We applied FastProject on this data, using a large collection of both “signed” (i.e. up and down regulated genes) and “unsigned” signatures from MSig DB (the C2 (curated), H (Hallmark), and C7 (Immune signatures) collections) and NetPath [24].
As a first check, we explored the distributions of signature consistency scores obtained for the original vs. random signatures, and compared the results to an application of FastProject on randomized datasets, where the entries in each column (Cell) were shuffled (i.e. maintaining the percentage of zeroes in every cell; Fig. 2 a). As expected, we see pronounced differences between the original input signatures and the randomized ones when FastProject is applied on the nonshuffled data, and these differences disappear when we apply FastProject on the randomized data. As a second test, we evaluated the extent by which the signature consistency scores are biased by the abundance of zerovalues in the analyzed cells. As expected, when we do not downweigh the potential false negative entries, the signature consistency scores highly correlate with the amount of zeroes in each cell; namely the analysis primarily reflect what might be a result of technical dropouts. Conversely, downweighing the false negatives reduces this bias (Fig. 2 b). We repeated this procedure using a second dataset of scRNASeq data from mouse dendritic cells responding to antigen stimulation [1], obtaining similar results (Additional files 3 and 4).
Examining the output report of FastProject (Fig. 3 a), we first observe that the various projection methods correctly stratify the cells according to their respective donors (Fig. 3 b). More importantly, FastProject automatically picks up on several of the most important features in this data, providing a proof of concept for its utility as an unbiased analysis tool. Specifically, the two dimensional position of the cells is highly consistent with their scoring according to an epithelial to mesenchymal transition signature, which is a strong marker of poor prognosis [29, 30]. The two dimensional positions are also associated with signatures of other key pathways altered in cancer, including immune and hypoxic responses as detailed below. While these observations were made using a general database of signatures (MSigDB), we supplemented our analysis with casespecific signatures  in this case gene signatures from TCGA that are predictive of Glioblastoma subtypes [30] to provide further support. As expected we see high level of concordance between the TCGAderived scores of the Mesenchymal tumor subtype and the epithelial to mesenchymal transition signatures from MSigDB. We also see the mirror image of the cell ranking when we consider the TCGAderived signature of the Proneural tumor subtype of glioblastoma.
In addition to the HTML report, FastProject outputs all the cellsignature scores in textual format. Taking advantage of this feature, we were able to more closely inspect the relationship between the different pathways that were highly correlated with the two dimensional positions and reveal new associations in the data. Considering the top ranked signatures (p<10^{−15} for at least one projection method), and filtering overlapping signatures (Jaccard coefficient >30 %; 63 signatures remaining), we observe a clear pattern of signature clusters (Fig. 4). Interestingly, the mesenchymal signature is positively correlated with the expression of coagulation/complement genes (whose expression in the glioblastoma cells studied here was already verified by [2]). Both signatures are also correlated with the TNF α signaling, which supports previous findings concerning the role of this pathway in mesenchymal emergence [31]. On the other hand, the mesenchymal signature is negatively correlated with a hypoxia signature. While hypoxic regions are characteristic in solid tumors [32], this inverse correlation is surprising and possibly aligned with the upregulation of angiogenetic markers in mesenchymal glioblastoma tumors [29]. Finally, we see a strong negative correlation with a signature of response to the PPAR γ agonist rosiglitazone, which aligns with previous observations of beneficial effects PPAR γ agonists have in glioblastoma [33–35]. In addition to the interdonor variation, FastProject’s visualization also highlights potentially important variation within a tumor. Indeed, the cells from patient MGH31 (Fig. 3 b, purple) are clearly divided in accordance to the two programs mentioned above  with cells with low mesenchymal and high hypoxic score on one side and the mirror image on the other.
The glioblastoma case study underscores the utility of FastPtroject as a tool for scRNASeq data exploration. Starting from a normalized input matrix of gene expression in single cells, and a generic set of signatures, it clearly highlighted some of the main sources for phenotypic variation between cells, and the relation between these sources. Such insights provide an important first step in working with data sets an immense and as complex as the one presented here.
Conclusions
FastProject is a flexible, comprehensive, and automated pipeline that combines multiple techniques for the analysis of scRNASeq. It provides a first glance on the main axes of variation in the data (captured by projections of interest) and their biological meaning (the biological signatures that may explain the organization of cells within the projections). The tool was designed with a flexible API in mind, with the aim of establishing a general platform that will be used by the scRNASeq research community for deployment and evaluation of future methods, such as normalization, batch correction, removal of undesired effects (cell cycle, dropouts), gene/cell filtering, and dimensionality reduction.
Methods
Falsenegative estimates
Let P(E _{ ij }∣N o t D _{ ij }) denote the probability that a gene i is expressed in cell j conditioned that it is not detected (i.e. probability for a false negative). To estimate this probability we first estimate the priors for the detection, P(D _{ ij }), and expression, P(E _{ ij }), events:
For P(D _{ ij }) we use the percentage of cells that detect gene i (e x p r e s s i o n>0), which we denote as \(\bar {D_{i}}\). For P(E _{ ij }), one approach would be to use:
where μ _{ i } is the average of gene i’s log expression when detected and \(\hat {F}_{j}(\mu _{i})\) (defined in Eq. 1) represents P(N o t D _{ ij }∣E _{ ij }) for each cell. However, in order to get a more robust estimation, we use the population mean of the conditional probability, taking: \(P(E_{ij}) \approx \frac {\bar {D_{i}}}{\frac {1}{N} {\sum \nolimits }_{k=1}^{N} P(D_{ik} \mid E_{ik})}\) where N is the number of cells in the dataset. Combining these terms, the full expression is:
This value is then complemented to obtain P(N o t E _{ ij }N o t D _{ ij }) (the true negative probability), which is used to weight missed transcripts. Notably, on occasions where the detection rate in a cell is higher than the prior estimate, this formulation results in a negative value. We therefore restrict the estimate to the range [ 0,1] by applying a clipping operation.
Simulating falsenegatives
To evaluate our approach, we first applied it on simulated data. Count values for each gene were either set to zero (with probability P _{ i }) or drawn from a negative binomial distribution parameterized by the mean, μ _{ i }, and the dispersion, \(\gamma _{i} = ({\sigma ^{2}_{i}}  \mu _{i})/{\mu _{i}^{2}}\). P _{ i } was set to 1.0 for the first 500 genes (considered to be housekeeping genes, and therefore constitutively active) and varied uniformly from 0 to 1 for the remaining genes. μ _{ i } was generated from an exponential distribution with m e a n=75 and γ was generated using a uniform distribution from 0 to 4 (reasonable distributions and ranges for μ and γ found by measuring these parameters using nonzero values from the Shalek et al. [1] singlecell data). The generated counts were transformed using l o g(x+1). In this way, an expression matrix of 3000 genes across 2000 cells was simulated.
After generating ‘true’ expression data, sigmoidal dropout curves (as defined in Eq. 1) were generated for each cell by drawing α and β from uniform distributions (α∈[ 2,6];β∈[ 1,3]). Using these curves, ‘measured’ expression data was generated by droppingout nonzero measurements using each cell’s dropout characteristic and each gene’s nonzero mean. Using FastProject’s weightestimation procedure, weights were estimated using the ‘measured’ data matrix. For comparison purposes, weights were also generated using the likelihood, P(D e t e c t i o n∣E x p r e s s i o n), as described in Gaublomme et al. [7]. The results, detailed in Additional file 2, demonstrate that FastProject’s weighting system produces different weight distributions for true negatives (unexpressed genes) as opposed to false negatives (droppedout genes), assigning higher weights to the true negatives, as would be expected in an effective weighting procedure.
Clustering
For each of the projections, the FastProject HTML report includes a simple clustering analysis using the cells’ positioning in the respective two dimensional map. Specifically, samples are clustered based on Euclidean distance in the twodimensional space using kmeans with different k values. These clusters are used when rendering a heatmap below the projection showing the (per cluster) average zscore of expression for each gene in the signature.
Evaluating the consistency of projections and categorical precomputed signatures
A different method is used to evaluate the significance of signature/projection pairings when operating on precomputed signatures. For numerical precomputed signatures, the assigned values are shuffled among cells and for each iteration of this procedure, the signature/projection consistency score is evaluated. Pvalues are then assessed by comparing the unshuffled score against this distribution using a Ztest as above. For factor signatures, it is necessary to calculated the consistency score in a different manner. To this end, for each pair of signature s and projection p we compute a signature consistency score representing the extent to which neighboring points in the projection are assigned to the same category. First, we evaluate a neighborhoodconsistency score for every cell:
where
In this way, \(\hat {L}_{sp}(i)\) is closer to 1 if most of sample i’s neighbors have the same label. The signatureprojection consistency is then determined by a measure of the overall consistency:
These consistency scores are compared against a distribution of scores calculated from shuffled label assignments to assess significance using a Ztest as above.
Availability and requirements
Project Name: FastProject.Software Repository: https://github.com/yoseflab/fastproject.Software Manual: https://github.com/YosefLab/FastProject/wiki.Archived Versions: https://pypi.python.org/pypi/FastProject.Operating Systems: PlatformIndependent.Programming language: Python.License: Free for Educational, Research, and NotForProfit purposes. Results from the analysis in this article can by found at https://github.com/YosefLab/FastProject/wiki/ExampleOutputReports.
Abbreviations
 EM:

Expectation maximization
 FNC:

Falsenegative curve
 HTML:

Hypertext markup language
 ICA:

Independent component analysis
 LPS:

Lipopolysaccharide
 MAD:

Median absolute deviation
 MDS:

Multidimensional scaling
 MsigDB:

Molecular signatures database
 PCA:

Principal component analysis
 RBF:

Radial basis function
 scRNAseq:

Single cell RNAseq
 TPM:

Transcripts per million
 tSNE:

tDistributed stochastic neighbor embedding
References
Shalek AK, Satija R, Shuga J, Trombetta JJ, Gennert D, Lu D, Chen P, Gertner RS, Gaublomme JT, Yosef N, Schwartz S, Fowler B, Weaver S, Wang J, Wang X, Ding R, Raychowdhury R, Friedman N, Hacohen N, Park H, May AP, Regev A. Singlecell RNAseq reveals dynamic paracrine control of cellular variation. Nature. 2014; 509(7505):363–9. doi:10.1038/nature13437.
Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, Louis DN, RozenblattRosen O, Suvà ML, Regev A, Bernstein BE. Singlecell RNAseq highlights intratumoral heterogeneity in primary glioblastoma. Science (New York). 2014; 344(6190):1396–401. doi:10.1126/science.1254257.
Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014; 32(4):381–6. doi:10.1038/nbt.2859.
Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of singlecell gene expression data. Nat Biotechnol. 2015; 33(5):495–502. doi:10.1038/nbt.3192.
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of celltocell heterogeneity in singlecell RNAsequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015; 33(2):155–60. doi:10.1038/nbt.3102.
Amir EaD, Davis KL, Tadmor MD, Simonds EF, Levine JH, Bendall SC, Shenfeld DK, Krishnaswamy S, Nolan GP, Pe’er D. viSNE enables visualization of high dimensional singlecell data and reveals phenotypic heterogeneity of leukemia. Nat Biotechnol. 2013; 31(6):545–2. doi:10.1038/nbt.2594.
Gaublomme J, Yosef N, Lee Y, Gertner R, Yang L, Wu C, Pandolfi P, Mak T, Satija R, Shalek A, Kuchroo V, Park H, Regev A. SingleCell Genomics Unveils Critical Regulators of Th17 Cell Pathogenicity. Cell. 2015; 163(6):1400–1412. doi:10.1016/j.cell.2015.11.009.
van der Maaten L, Hinton G. Visualizing Data using tSNE. J Mach Learn Res. 2008; 9:2579–605.
Kafri R, Levy J, Ginzberg MB, Oh S, Lahav G, Kirschner MW. Dynamics extracted from fixed cells reveal feedback linking cell growth to cell cycle. Nature. 2013; 494(7438):480–3. doi:10.1038/nature11897.
Ramsköld D, Luo S, Wang YC, Li R, Deng Q, Faridani OR, Daniels GA, Khrebtukova I, Loring JF, Laurent LC, Schroth GP, Sandberg R. Fulllength mRNASeq from singlecell levels of RNA and individual circulating tumor cells. Nat Biotechnol. 2012; 30(8):777–82. doi:10.1038/nbt.2282.
Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, Baying B, Benes V, Teichmann SA, Marioni JC, Heisler MG. Accounting for technical noise in singlecell RNAseq experiments. Nat Methods. 2013; 10(11):1093–5. doi:10.1038/nmeth.2645.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci U S A. 2005; 102(43):15545–50. doi:10.1073/pnas.0506580102.
Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, Herman JL, Kaper F, Fan JB, Zhang K, Chun J, Kharchenko PV. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods. 2016; 13(3):241–4. doi:10.1038/nmeth.3734.
Marinov GK, Williams BA, McCue K, Schroth GP, Gertz J, Myers RM, Wold BJ. From singlecell to cellpool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res. 2014; 24(3):496–510. doi:10.1101/gr.161034.113.
Deng Q, Ramsköld D, Reinius B, Sandberg R. Singlecell RNAseq reveals dynamic, random monoallelic gene expression in mammalian cells. Science (New York). 2014; 343(6167):193–6. doi:10.1126/science.1245316.
Eisenberg E, Levanon EY. Human housekeeping genes are compact. Trends Genet (TIG). 2003; 19(7):362–5. doi:10.1016/S01689525(03)001409.
Shalek AK, Satija R, Adiconis X, Gertner RS, Gaublomme JT, Raychowdhury R, Schwartz S, Yosef N, Malboeuf C, Lu D, Trombetta JJ, Gennert D, Gnirke A, Goren A, Hacohen N, Levin JZ, Park H, Regev A. Singlecell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature. 2013; 498(7453):236–40. doi:10.1038/nature12172.
Munsky B, Neuert G, van Oudenaarden A. Using gene expression noise to understand gene regulation. Science (New York). 2012; 336(6078):183–7. doi:10.1126/science.1216379.
Macosko E, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas A, Kamitaki N, Martersteck E, Trombetta J, Weitz D, Sanes J, Shalek A, Regev A, McCarroll S. Highly Parallel Genomewide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell. 2015; 161(5):1202–14. doi:10.1016/j.cell.2015.05.002.
Tenenbaum JB, de Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science (New York). 2000; 290(5500):2319–323. doi:10.1126/science.290.5500.2319.
Campbell K, Ponting CP, Webber C. Laplacian eigenmaps and principal curves for high resolution pseudotemporal ordering of singlecell RNAseq profiles. Technical report. 2015. doi:10.1101/027219. http://dx.doi.org/10.1101/027219.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikitlearn: Machine Learning in {P}ython. J Mach Learn Res. 2011; 12:2825–830.
Buja A, Eyuboglu N. Remarks on Parallel Analysis. Multivar Behav Res. 1992; 27(4):509–40.
Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GSS, Venugopal AK, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Gollapudi SK, Tattikota SG, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob HKC, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra YL, Rahiman BA, Prasad TSK, Lin JX, Houtman JCD, Desiderio S, Renauld JC, Constantinescu SN, Ohara O, Hirano T, Kubo M, Singh S, Khatri P, Draghici S, Bader GD, Sander C, Leonard WJ, Pandey A. NetPath: a public resource of curated signal transduction pathways. Genome Biol. 2010; 11(1):3. doi:10.1186/gb2010111r3.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000; 25(1):25–9. doi:10.1038/75556.
Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000; 28(1):27–30. doi:10.1093/nar/28.1.27.
Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 2014; 42(Database issue):199–205. doi:10.1093/nar/gkt1076.
Pierson E, Yau C. ZIFA: Dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biol. 2015; 16(1):241. doi:10.1186/s130590150805z.
Phillips HS, Kharbanda S, Chen R, Forrest WF, Soriano RH, Wu TD, Misra A, Nigro JM, Colman H, Soroceanu L, Williams PM, Modrusan Z, Feuerstein BG, Aldape K. Molecular subclasses of highgrade glioma predict prognosis, delineate a pattern of disease progression, and resemble stages in neurogenesis. Cancer Cell. 2006; 9(3):157–73. doi:10.1016/j.ccr.2006.02.019.
Verhaak RGW, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O’Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010; 17(1):98–110. doi:10.1016/j.ccr.2009.12.020.
Bhat KPL, Balasubramaniyan V, Vaillant B, Ezhilarasan R, Hummelink K, Hollingsworth F, Wani K, Heathcock L, James JD, Goodman LD, Conroy S, Long L, Lelic N, Wang S, Gumin J, Raj D, Kodama Y, Raghunathan A, Olar A, Joshi K, Pelloski CE, Heimberger A, Kim SH, Cahill DP, Rao G, Den Dunnen WFA, Boddeke HWGM, Phillips HS, Nakano I, Lang FF, Colman H, Sulman EP, Aldape K. Mesenchymal differentiation mediated by NF κB promotes radiation resistance in glioblastoma. Cancer Cell. 2013; 24(3):331–46. doi:10.1016/j.ccr.2013.08.001.
Brown JM, Wilson WR. Exploiting tumour hypoxia in cancer treatment. Nat Rev Cancer. 2004; 4(6):437–7. doi:10.1038/nrc1367.
Panigrahy D, Singer S, Shen LQ, Butterfield CE, Freedman DA, Chen EJ, Moses MA, Kilroy S, Duensing S, Fletcher C, Fletcher JA, Hlatky L, Hahnfeldt P, Folkman J, Kaipainen A. PPARgamma ligands inhibit primary tumor growth and metastasis by inhibiting angiogenesis. J Clin Investig. 2002; 110(7):923–32. doi:10.1172/JCI15634.
Ellis HP, Kurian KM. Biological Rationale for the Use of PPAR γ Agonists in Glioblastoma. Front Oncol. 2014; 4:52. doi:10.3389/fonc.2014.00052.
Grommes C, Conway DS, Alshekhlee A, BarnholtzSloan JS. Inverse association of PPAR γ agonists use and high grade glioma development. J NeuroOncol. 2010; 100(2):233–9. doi:10.1007/s110600100185x.
Amit I, Garber M, Chevrier N, Leite AP, Donner Y, Eisenhaure T, Guttman M, Grenier JK, Li W, Zuk O, Schubert LA, Birditt B, Shay T, Goren A, Zhang X, Smith Z, Deering R, McDonald RC, Cabili M, Bernstein BE, Rinn JL, Meissner A, Root DE, Hacohen N, Regev A. Unbiased reconstruction of a mammalian transcriptional network mediating pathogen responses. Science (New York). 2009; 326(5950):257–63. doi:10.1126/science.1179050.
Acknowledgments
Not applicable.
Funding
DD was funded by a National Institutes of Health NRSA Trainee appointment on grant number T32 and by the California Research Alliance by BASF (CARA). NY was supported by NIH grants U01MH105979 and U01HG007910.
Authors’ contributions
NY and DD conceived and wrote the manuscript. DD implemented the software and analyzed the data. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Authors and Affiliations
Corresponding author
Additional files
Additional file 1
Benchmarking FastProject. Memory and runtime are measured as the number of genes and cells vary. When varying the number of genes, 200 cells were used, and when varying the number of cells, an input matrix of 2000 genes was used. “Lean” mode is an option that may be selected at runtime which disables some projection methods that were found to scale poorly with the number of samples (MDS, Spectral Embedding) as well as removing extra filter steps (HDT, Fano Factor). Trials run using a compute cluster with 40 Intel Xeon E52690 processors at 3 GHz, but intentionally limited to either 4 or 20 threads during the runtime tests. (PNG 188 kb)
Additional file 2
Simulation of FalseNegative Weight Estimation Procedure. A) A simulation to test FastProject’s ability to distinguish genes that are biologically inactive (True Negatives) from genes dropped out due to technical artifacts (False Negatives). B) The KolmogorovSmirnov (KS) statistic is used to distinguish between distributions of true and false negatives. Here it can be seen that the true negatives tend to be assigned higher weights. This is contrasted with the weighting scheme used in a previous study (Gaublomme et al. 2015 [7]) in which the true and false negatives are not as differentiated. CE) The KS statistic’s for either scheme, as in (B), as simulation parameters are varied. C) Tests varying the parameter which controls the mean of the exponential distribution from which μ is drawn. D) Falsenegative curves are estimated using a mixture of housekeeping genes and nonhousekeeping genes. It can be seen that the choice of good housekeeping genes is beneficial, but not critical. E) The steepness of the generated falsenegative curves, α, is varied. (PNG 258 kb)
Additional file 3
Behavior of Signature Scores: Alternate Data Set. Behavior of signature scores calculated from the LPSstimulated dendritic cells of Shalek et al. 2014 [1]. A) Distribution of Signature/Projection consistency scores across four different types of signatures, Signed (signed immunological signatures from MSigDB), Unsigned (various unsigned hallmark and pathway signatures from MSigDB), Random Signed (signed signatures with randomly selected genes), and Random Unsigned (unsigned signatures with randomly selected genes). Lower panel shows distributions from the same signatures, run on data in which gene expression levels have been shuffled within each cell. B) Distribution of the Pearson’s correlation coefficient between signature scores and a confounding variable  the proportion of undetected genes in a sample. Upper plot shows correlations when signature are calculated by simply taking the unweighted average of log expression level for genes in the signature. Lower panel shows the effect of using the weighted method presented here. (PNG 152 kb)
Additional file 4
SignatureProjection Pairings. An example of signature/projection pairings from FastProject run on a subset of cells from Shalek et al. 2014 [1]. Analysis was run on bone marrow dendritic cells, stimulated with LPS and sequenced after 1–6 h. For comparison purposes, the same projection (ISOMAP on genes selected by the Fano Factor filter) is used for each plot. A) The arrangement of cells in the ISOMAP projection can be seen to largely agree with the time of sequencing, post stimulation. B) Cells at later times score higher on the HALLMARK_INFLAMMATORY signature from the MSIGDB collection. CE) Temporal signatures generated from Amit et al. 2009 [36]. Signatures generated by taking the top 200 up and downregulated genes, between adjacent time points, and removing genes with less than a 2fold difference. Changes in these signature scores can be seen to largely correspond with the temporal labels in (A). All signature/projection pairings, AE, were found to be significant by FastProject (p≤0.05). (PNG 339 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
DeTomaso, D., Yosef, N. FastProject: a tool for lowdimensional analysis of singlecell RNASeq data. BMC Bioinformatics 17, 315 (2016). https://doi.org/10.1186/s1285901611765
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901611765
Keywords
 SingleCell
 RNASeq
 Dimensionality reduction