SCOUP: a probabilistic model based on the Ornstein–Uhlenbeck process to analyze single-cell expression data during differentiation
- Hirotaka Matsumoto^{1, 2}Email author and
- Hisanori Kiryu^{2}
https://doi.org/10.1186/s12859-016-1109-3
© Matsumoto and Kiryu. 2016
Received: 19 March 2016
Accepted: 2 June 2016
Published: 8 June 2016
Abstract
Background
Single-cell technologies make it possible to quantify the comprehensive states of individual cells, and have the power to shed light on cellular differentiation in particular. Although several methods have been developed to fully analyze the single-cell expression data, there is still room for improvement in the analysis of differentiation.
Results
In this paper, we propose a novel method SCOUP to elucidate differentiation process. Unlike previous dimension reduction-based approaches, SCOUP describes the dynamics of gene expression throughout differentiation directly, including the degree of differentiation of a cell (in pseudo-time) and cell fate. SCOUP is superior to previous methods with respect to pseudo-time estimation, especially for single-cell RNA-seq. SCOUP also successfully estimates cell lineage more accurately than previous method, especially for cells at an early stage of bifurcation. In addition, SCOUP can be applied to various downstream analyses. As an example, we propose a novel correlation calculation method for elucidating regulatory relationships among genes. We apply this method to a single-cell RNA-seq data and detect a candidate of key regulator for differentiation and clusters in a correlation network which are not detected with conventional correlation analysis.
Conclusions
We develop a stochastic process-based method SCOUP to analyze single-cell expression data throughout differentiation. SCOUP can estimate pseudo-time and cell lineage more accurately than previous methods. We also propose a novel correlation calculation method based on SCOUP. SCOUP is a promising approach for further single-cell analysis and available at https://github.com/hmatsu1226/SCOUP.
Keywords
Background
Conventional analyses of bulk cells, such as bulk transcriptome analyses, are based on the averaged data of an ensemble of cells and cannot reveal the states of individual cells. Therefore, such analyses cannot distinguish cell types due to the effect of averaging across all cells in a sample, unless each cell lineage is divided in advance by using prior knowledge, such as marker genes. Additionally, bulk transcriptome during differentiation is usually the ensemble of the cells of different degrees of differentiation and information regarding changes in cellular state is smeared. Accordingly, the accurate investigation for gene expression dynamics and regulatory relationships among genes during differentiation are difficult.
With the advent of single-cell technologies, such as single-cell RNA-seq, quantification of the comprehensive states of individual cells is possible [1]. Using single-cell technologies, investigations of cellular states and its transition processes, such as the classification and identification of cell types [2–4], reconstruction of cell lineages [5, 6], and embryonic development [7, 8], have made remarkable progress. Single-cell data is also useful for elucidating cell fate decision mechanisms of multi-lineage differentiation from a single progenitor cell type [9, 10]. Thus, single-cell technologies have the power to shed light on differentiation in particular [11, 12].
To fully analyze the single-cell expression data during differentiation, novel computational methods are necessary [11, 13]. First, ordering of the cells based on expression data so that the order represents the trajectory of differentiation is necessary to investigate gene expression dynamics and regulatory mechanisms. Although experimental time can be used for ordering cells, even cells derived from the same time-point can exhibit different degrees of differentiation [14]. Moreover, computational ordering method is often useful to reconstruct the differentiation process from in vivo snap-shot data, which contains cells at distinct stages of differentiation [5]. Second, estimating the lineage of the cells is necessary to investigate multi-lineage differentiation. Although the expression of marker genes will be useful to classify cell lineages, the prior knowledge of marker genes is limited. Therefore, a lineage estimation method without prior knowledge is necessary to fully analyze the mechanisms of cell fate decisions.
Wanderlust [15] is a pioneering study for ordering cells based on expression data. It uses N-dimensional space composed by N marker genes and constructs the I-nearest neighbor graph in the space, and then reconstructs the differentiation path based on the graph. The degree of differentiation of a cell (in pseudo-time) is defined by the position on the path. Although Wanderlust is a promising method for reconstructing the differentiation path, it will not work when prior knowledge of marker genes is not given. Therefore, several methods that do not require the prior knowledge of marker genes have been developed to order cells [14, 16, 17]. These methods use dimension reduction techniques, such as principal component analysis (PCA), and reconstruct the differentiation path in reduced space using several approaches, such as minimum spanning tree (MST) and principal curves. Each cell is projected onto the reconstructed path and pseudo-time is defined by the projected position on the path. To estimate cell lineage from expression data, a few methods, which use the same framework, have been developed. Monocle [14], a dimension reduction-based approach, estimates the lineage of each cell by estimating multiple paths in reduced space and assigning each cell to one of the paths. These approaches are powerful tools to reconstruct the differentiation process without prior knowledge, and the development of such computational methods will help reveal the mechanisms of differentiation in conjunction with the advancement of single-cell technologies.
However, pseudo-time estimation and cell lineage estimation based on dimension reduction have several problems. For example, interpreting the biological meaning of the path in reduced space is difficult. Additionally, the position in reduced space is affected by noise and gene expression that is irrelevant to differentiation, and the results can therefore change significantly in a subsequent analysis. Moreover, deterministic approaches, such as applications of MST in reduced space, cannot quantify the subtle differences among cells and are inadequate to estimate the lineages of cells at an early stage of bifurcation, which are important for analyzing cell fate decisions. Hence, we developed another approach based on stochastic processes.
In this research, we developed a novel method SCOUP (a probabilistic model to analyze Single-Cell expression data during differentiation with Ornstein–Uhlenbeck Process). SCOUP describes the dynamics of gene expression throughout differentiation directly, including pseudo-time and cell fate of individual cells. SCOUP is based on the Ornstein–Uhlenbeck (OU) process, which represents a variable moving toward an attractor with Brownian motion. In the case of differentiation, an attractor is regarded as a stable expression pattern of a gene after differentiation, and hence, an OU process is appropriate to describe expression dynamics throughout differentiation. Because OU processes suppose only a single attractor and cannot represent multi-lineage differentiation, we expand the typical OU process into a mixture OU process by representing the cell fate of each cell and lineage-specific expression patterns with latent values and different attractors, respectively. We compared the accuracy of pseudo-time estimates from SCOUP with those of previous methods using time-series scqPCR and scRNA-seq, and SCOUP was superior to previous methods in almost all conditions. We also evaluated the cell lineage estimation using scqPCR data in which cells exhibit multi-lineage differentiation. SCOUP successfully estimated cell lineage more accurately than Monocle, especially for cells at an early stage of bifurcation. In addition, SCOUP represents each gene expression dynamic directly and can be applied to various downstream analyses. As an example, we developed a novel correlation calculation method for elucidating regulatory relationships among genes. We normalized data based on the optimized parameters in our model, which assumes the conditional independency among genes, and calculated correlations within normalized data, and this method detected covariance that cannot be explained by the model alone. We applied this method to scRNA-seq data and detected a candidate of key regulator for differentiation and clusters in a correlation network which were not detected with conventional correlation analysis.
We proposed a novel theoretical and computational method SCOUP to analyze single-cell data. The theoretical basis of SCOUP will be useful not only for pseudo-time and cell lineage estimation, but also for various biological analyses such as gene regulatory network inference. In particular, SCOUP can represent continuous-time stochastic dynamics and is suited for analyzing time-series data. As the number of single-cell data with high temporal resolution is increasing, computational methods for analyzing such data are becoming more important. Thus, SCOUP is a promising approach for further single-cell analysis and bioinformatics method development.
Methods
Ornstein-Uhlenbeck process
where α, θ, σ, and W _{ t } denote the strength of relaxation toward the attractor, the value of the attractor, the strength of noise, and “white noise,” respectively. If the initial value is given by X _{0}, the value at time t (X _{ t }) satisfies the following normal distribution:
In the process of cellular differentiation, a cell changes from one cell type to another, and its expression pattern changes from a specific pattern to a different specific pattern. Moreover, each single cell exhibits different degrees of differentiation, and therefore a continuous-time model is necessary to represent single-cell expression dynamics. With the OU process, we can describe such dynamics by considering that X _{0} and θ are the expression patterns of progenitor cells and differentiated cells, respectively. In addition, other parameters α and σ can be regarded as the speed of expression change and level of noise, respectively. Thus, the OU process is suitable for modeling gene expression dynamics throughout differentiation. In this research, we extended the OU process for single-cell expression data and developed a parameter optimization method.
OU process for single lineage differentiation
Although optimization of these initial parameters is possible, a fully differentiated state may be regarded as an initial state and pseudo-time may be inferred in the reverse order of differentiation. In this way, deciding the direction of differentiation without the knowledge of initial condition is difficult. Moreover, the expression data of progenitor cells are available in many experimental studies. Therefore, we assume that μ _{0g } and \(\sigma _{0g}^{2}\) are known in this research.
Sufficient statistic for OU processes
Accordingly, we can calculate the log-likelihood by using the following statistics \(\sum _{s=1}^{N-1}{X_{s}^{2}}\), \(\sum _{s=0}^{N-1}X_{s}X_{s+1}\), and \(\sum _{s=1}^{N-1}X_{s}\).
The expected values of the above statistics are sufficient for parameter optimization. The posterior probability P(X _{1}…X _{ N−1}|X _{ N },X _{0}) is regarded as the multivariate normal distribution, and the expectation of X _{ s } and \({X_{s}^{2}}\) can be calculated from the mean and variance–covariance matrix of the multivariate normal distribution. However, the expansion of the posterior probability gives only the (N−1)×(N−1) precision matrix, and we must therefore calculate the inverse of the matrix to obtain the variance–covariance matrix. Although we cannot use numerical methods to solve the inverse of the precision matrix because we consider N as the limit for infinite, we can solve for the inverse matrix analytically by using the tridiagonal property of the precision matrix [20]. By hand calculation, we showed that the expected values of these statistics were able to be solved analytically. For example, the expected value of one of the statistics is as follows:
The detailed calculation is described in the Additional file 1.
EM algorithm
where X _{ c g1:N−1}=(X _{ c g1},X _{ c g2},…,X _{ c g N−1}).
where Z ^{ α } is explained in the Additional file 1. The pseudo-time variable t _{ c } cannot be optimized analytically, and we therefore solve t _{ c } to satisfy \(d\mathcal {Q}/{dt}_{c}=0\) by Newton’s method.
In cases, X _{ c g0} is also unobserved, so we must calculate the expected value of X _{ c g0}. As such, we calculate the expected values, including the expected value of X _{ c g0} and \(X_{cg0}^{2}\), in the E step and optimize parameters with the above equation in the M step. The detailed optimization process and calculation are described in the Additional file 1.
We validated our parameter optimization method with simulation data and confirmed that SCOUP succeeded to optimize parameters so that the marginal likelihood was maximized (see Additional file 1).
Mixture OU process for multi-lineage differentiation
where π _{ k } is the probability of lineage k. This mixture model describes the multi-lineage case that each lineage diverges from the common initial distribution (one-step bifurcation model). This mixture model is a basic model for describing bifurcation and will be a useful method to analyze several bifurcation processes. Even in the cases that progenitor cells differentiate into different lineages through multi-step bifurcation, we can use the same model to represent multi-step processes by combining the one-step bifurcation models. However, the OU process with multi-step bifurcation becomes mathematically difficult and we leave it for future work.
by solving d Q/d π _{ k }=0. Other parameters are optimized likewise using the single lineage model. Accordingly, we calculate the expected values of variables, such as γ _{ ck } and S _{ c g0}, in the E step and update parameters in the M step.
The lineage of a cell is estimated from the expectation of the latent value of a cell (γ _{ c }). SCOUP can quantify the certainty of the estimated lineage of a cell from the the value of γ _{ c }.
Initialization of time parameter
Our method might converge to undesirable local optima if T is initialized randomly. For example, estimated pseudo-time might be inferred in the reverse order of differentiation. To avoid undesirable local optima, rough initialization of T is effective. Although experimental time will be useful for initialization, such data are not always available. For example, experimental time does not exist for expression data of an in vivo snap-shot sample [5]. Therefore, an initialization method that does not depend on experimental time is necessary. Here, we explain our initialization method based on a dimension reduction approach.
we developed dimension reduction approach for pseudo-time initialization, called SP (pseudo-time calculation based on Shortest Path from the root cell in the MST). Firstly, we added the mean of the initial distribution (μ∈{μ _{ g0}|g=1…G}) to expression data and regarded it as an initial point for the pseudo-time calculation. Next, we performed PCA, constructed MSTs in the reduced space, searched for the shortest path from an initial point using Prim’s algorithm, and regarded the weight of the shortest path as the pseudo-time. In this paper, we set the dimensionality of the PCA to two and used this pseudo-time for the initialization of our method.
Dimension reduction approach
In this section, we explain the previous pseudo-time estimation methods based on a dimension reduction approach.
Monocle [14] constructs a MST in reduced space, searches for the longest path in the MST, and estimates pseudo-time along the longest path. We added the mean of the initial distribution data and regarded it as an initial point for the pseudo-time calculation. We used all genes in a dataset as marker genes and the other parameters of Monocle were set to default values, unless otherwise specified.
TSCAN [17] performs model-based clustering in reduced space, connects clusters, and estimates pseudo-time by projecting cells onto the connected path. Although TSCAN can infer an order of clusters, it cannot regard a point as an initial point. Therefore, we compared the accuracy of outputted pseudo-time with reversed pseudo-time and defined the pseudo-time of TSCAN as the superior one. Because TSCAN failed to output pseudo-time of partial cells when we set a high number of clusters, we set the number of clusters to three in this research.
In this paper, we compared the performance of SCOUP with those of above dimension reduction-based methods in addition to SP. Although Wanderlust is also a useful method to estimate pseudo-time and cell lineage, we exclude it from comparison. This is because we consider the condition that the prior knowledge of marker genes is not given and Wanderlust is designed not for single-cell qPCR and RNA-Seq but for mass and flow cytometry data.
Correlation between genes
We also proposed a novel correlation function between two genes. Although we assume the conditional independence among genes to represent gene dynamics, we can detect the regulatory relationship between genes by calculating the covariance. Our correlation function quantifies the covariance between genes that is not explained by our model.
For time-series data, a ordinal correlation coefficient will be high even if two variables only have similar time-trend. For example, any two independent genes that are upregulated in accordance with differentiation exhibit a high correlation. In the case of the detection of interactions between genes, it is most appropriate to remove the influence of time-trend. To remove this trend effect, the expression data at a specific experimental time point is often used to calculate the correlation. However, this approach is insufficient to remove the time effect resulting from the difference between the experiment time and the progression of cells. Accordingly, the trend effect is best removed by using cells within a specific pseudo-time span for calculation. Although this analysis will remove the trend effect, the number of cells that are used for the calculation decreases owing to the limit of the span of pseudo-time and precise calculation will therefore be difficult.
Several methods have been developed to calculate correlation while removing the confounding effects. For example, scLVM [2] revealed hidden subpopulations from single-cell RNA-seq data by removing the effects, such as cell cycle. In this research, we developed a novel correlation function based on our probabilistic model to remove the effect of time-trend. As described in the section on “OU process for single lineage differetiatiation” and the Additional file 1, the probabilistic distribution of the expression of a gene g at time t (X _{ tg }) is described as follows:
We calculated the correlation coefficient for the above standardized values. This correlation function can detect gene pairs that exhibit interactions that are unexplained by the model, which assume the conditional independence among genes.
The above standardization assumes a single normal distribution and is not suitable for multi-lineage model. However, maxk γ _{ ck } of most cells, which we analyzed, were about 1.0, and hence, most cells would be assigned to one of the lineage. Therefore, the standardization will be effective by assigning a cell to a relevant lineage. In addition, correlation of each lineage will be calculated by dividing cells into each lineage in advance.
Dataset
Single-cell qPCR for single-lineage differentiation
We used the time-series single-cell qPCR dataset produced by Kouno’s group [21] from THP-1 human myeloid monocytic leukemia cells differentiating into macrophages. They investigated the expression of 45 transcription factors by 120 single cells at each eight time point (0, 1, 6, 12, 24, 48, 72, and 96 h) after phorbol myristate acetate stimulation.
To evaluate the estimated pseudo-time in many conditions, we constructed a dataset, (Kouno’s data (1)) follows. We added noise to raw expression data as described below to investigate the effect of noise in pseudo-time estimation. We added noise to raw expression data E _{ cg } by adding \(\bar E_{g}\times U_{R}[0,\epsilon ]\), where \(\bar E_{g}\) is the mean expression of a gene and U _{ R }[0,ε] is a uniform random number from 0 to ε. We produced 20 replicates for each ε (noise level), and validated the pseudo-time of each method for each noise level.
We also constructed another dataset, (Kouno’s data (2)), to validate lineage estimation by adding 45 pseudogenes that exhibit various expression patterns among lineages. We initially selected 60 cells randomly from 120 cells at a given time point. The expression \(E_{cg'}\phantom {\dot {i}\!}\) of a pseudogene g ^{′} by the selected cells is equal to raw expression (\(\phantom {\dot {i}\!}E_{cg'} = E_{cg}\)). For the remaining cells, we inverted the raw expression in relation to the initial mean (\(\phantom {\dot {i}\!}E_{cg'} = -2E_{cg} + \mu _{0g}\)). We also added noise as mentioned above in regard to Kouno’s data (1). Because Monocle cannot accept negative values, we incremented the values by a minimum of 1 to make the expression positive.
where C _{0} is the set of 0-h cells and |C _{0}| is the number of 0-h cells.
Single cell qPCR for bifurcation
To validate the lineage estimation in real data, we used a dataset produced by Moignard’s group [22]. They investigated the single-cell qPCR results for 46 transcription factors throughout hematopoietic development from embryonic day (E) 7.0 to E8.5 in mouse embryos. These data include a lineage bifurcation between E7.75 and E8.25; at this time, head fold (HF) cells differentiate into putative blood and endothelial populations, which are distinguished as either GFP^{+} cells (4SG) or Flk1^{+}GFP^{−} cells (4SFG^{−}). We used the expression profiles of HF, 4SG, and 4SFG^{−} and investigated whether SCOUP and Monocle can classify 4SG and 4SFG^{−} using only their expression profiles. We randomly selected 1000 cells because Monocle did not seem to work correctly for a large number of cells and this procedures left 364 HF cells, 360 4SG cells, and 276 4SFG^{−} cells. The initial distribution was calculated from HF cells in the same way as Kouno’s data.
Single-cell RNA-seq for single-lineage differentiation
We also investigated the stimulation time-series single-cell RNA-seq dataset (at 0, 1, 4, and 6 h) for primary mouse bone-marrow-derived dendritic cells that was produced by Shalek’s group [23]. This dataset contains data for three different time series corresponding to each of the different stimulation methods: lipopolysaccharide (LPS), viral-like double-stranded RNA (PIC), and synthetic mimic of a bacterial lipopeptides (PAM). First, we converted transcripts per million (TPM) to log(TPM+1) and defined this value as gene expression. Next, we removed outlier cells so that each cell in the dataset contained more than 4000 genes with detectable levels of expression; this left 281 LPS cells, 224 PAM cells, and 159 PIC cells. Third, we calculated the absolute difference in mean gene expression between the 1-h cells and 6-h cells for each stimulation. We extracted the top 1000 genes in descending order of this difference for each stimulation and used these genes for pseudo-time estimation. We also added unstimulated cells (outlier cells were removed through a procedure like that described above, leaving 85 cells) to the LPS, PAM, and PIC data and regarded these cells as 0-h data. The initial distribution was calculated from unstimulated cells in the same way as Kouno’s data.
Accuracy measure
Pseudo-time evaluation
where \(t_{c}^{\text {(e)}}\) and t _{ c } are respectively the experimental time and pseudo-time of cell c. I(t _{ i }<t _{ j }) is an indicator function that takes the value 1 if the conditional expression is true.
Lineage evaluation
We evaluated the performance of lineage estimation by SCOUP and Monocle by comparing the cell lineage annotation of each cell. The annotation of a cell from simulation data is obvious and that of Moignard’s data is given by 4SG or 4SFG^{−} in accordance with GFP^{+} or Flk1^{+}GFP^{−}. SCOUP estimates a cell lineage based on the expectation of the posterior probability of cell fate (γ _{ ck }). We classified cells into one of two lineages on the basis of whether γ _{ ck } exceeded a threshold. We calculated the precision and recall for each threshold and calculated the area under the curve (AUC) value. Monocle also can estimate cell lineage by setting the parameter num_paths to 2, thereby outputting the state of a cell as either state1 (pre-bifurcation), state2 (one lineage), or state3 (another lineage). Monocle is a deterministic method and cannot distinguish subtle differences. Therefore, we regard that state1, state2, and state3 belong to one lineage with probabilities 0.5, 1.0, and 0.0, respectively. We calculated the AUC value for Monocle in the same way.
Results and discussion
Validation of parameter optimization
We validated our parameter optimization method with simulation data. We generated simulation data from the normal distribution based on the OU process by varying the parameters. The number of genes and cells are set to 500 and 100, respectively.
Next, we investigated that the log-likelihood of optimized parameters was higher than those of varied parameters. Figure 2 c is the example of the log-likelihood curve with respect to time parameter of a cell (t _{ c }), and the value of optimized t _{ c } is drawn with x-mark. The log-likelihood of the optimized t _{ c } was located in the top of the log-likelihood curve. We also verified that the optimized parameters were located in the top of the log-likelihood surface in regards to other parameters (the details are described in the Additional file 1). Thus, SCOUP can optimize the parameters correctly.
Validation of pseudo-time estimation
In this section, we compared the accuracy of the pseudo-time of each method: SCOUP, our method; SP, pseudo-time estimation based on shortest path in the MST in reduced space; Monocle, dimension reduction-based method that reconstruct differentiation path by the longest path in the MST; TSCAN, dimension reduction-based method that reconstruct differentiation path by running model-based clustering and connecting clusters. For pseudo-time evaluation, we used Kouno’s data (1) and the Shalek’s data.
We also investigated the effect of the number of dimensions of reduced space for pseudo-time estimation in Monocle. We set the argument of Monocle m a x_c o m p o n e n t s, which corresponds to the number of dimensions, to 2 and 3 and denote Monocle analyses with each configuration as Monocle(2) and Monocle(3), respectively. Across all conditions, Monocle(3) was inferior to Monocle(2). This is because the third dimension of reduced space represents something unrelated to differentiation. Without prior knowledge, it is difficult to set a proper number of dimensions, and pseudo-time can be erroneous under an improper number of dimensions. Although SCOUP is based on a dimension reduction approach in the process of pseudo-time initialization, we verified that the pseudo-time estimated from different numbers of dimensions (i.e., 2 and 3) converged to almost same value in this dataset (r ^{2}=0.94 for ε=0.0). Even if the estimated pseudo-times of SCOUP differ, we can infer appropriate pseudo-times by selecting the model with the highest likelihood.
PIS for each method applied to Shalek’s data
LPS | PIC | PAM | |
---|---|---|---|
SCOUP | 0.03 | 0.12 | 0.12 |
SP | 0.14 | 0.32 | 0.17 |
Monocle(2) | NA | 0.38 | NA |
Monocle(3) | 0.18 | 0.45 | 0.32 |
TSCAN | 0.17 | 0.27 | 0.24 |
The PISs of PIC and PAM were higher than those of LPS. This will be because the numbers of PIC and PAM cells were lower than that of LPS. It is difficult to reconstruct differentiation trajectories from a small number of samples. In particular, it is important to obtain cells distributed evenly throughout the differentiation process in order to reconstruct trajectories with high accuracy.
In summary, SCOUP estimated pseudo-time with high accuracy, especially for RNA-seq data. Moreover, SCOUP successfully identified the initial state which was difficult to be detected with dimension reduction-based approaches. In addition, SCOUP is based on a probabilistic model, and hence can evaluate proper pseudo-time by using likelihood. Thus, SCOUP has advantages over dimension reduction-based methods in pseudo-time estimation.
Validation of cell lineage estimate
In this section, we evaluate the accuracy of cell lineage estimation from single-cell expression data containing lineage bifurcation.
Mean AUC values for cell lineage estimates using each method for Kouno’s data (2)
ε=0.0 | ε=0.5 | ε=1.0 | |
---|---|---|---|
SCOUP | 0.99 | 0.99 | 0.99 |
Monocle | 0.98 | 0.97 | 0.95 |
The results described above show that SCOUP is superior to Monocle with respect to cell lineage estimation for both simulated and real data. SCOUP can capture subtle differences in cells immediately after bifurcation and will be a powerful method for investigations of cell fate decision mechanisms.
We also investigated cell lineage estimation with Gaussian mixture model (GMM) implemented in mclust package [24]. The AUC values for mclust were inferior to those of SCOUP, and mclust was not able to estimate cell lineage correctly for cells at an early stage of bifurcation (see Additional file 1 for AUC values and PCA plots of mclust). This is because mclust does not have time parameters in the model and will work well only for cells whose expression pattern has sufficiently changed after bifurcation. Moreover, GMM fitted to the position in which large number of cells exist for Moignard’s data. Therefore, GMM is inadequate to estimate the path of bifurcation in the condition that cells are unevenly distributed. Thus, it is important to take time parameters into account to estimate the path of differentiation and cell lineage.
Clustering genes
The number of top 5000 genes, top 1000 genes in each group. The total number are not equal to 5000 and 1000 because the response curves for a few genes could not be calculated
Group | ||||||
---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | |
total | 867 | 403 | 958 | 1354 | 778 | 599 |
top-1000 gene | 260 | 81 | 177 | 291 | 76 | 111 |
The top three GO terms for each group. The third column shows the negative logarithm of the Bonferroni-adjusted p-value
Group | GO term | −log10(p) |
---|---|---|
1 | Immune response | 22.9 |
Defense response | 11.4 | |
Response to wounding | 7.0 | |
2 | Antigen processing and presentation | 5.5 |
Immune response | 3.8 | |
Antigen processing and presentation of exogenous antigen | 3.3 | |
3 | Generation of precursor metabolites and energy | 5.1 |
Protein localization | 4.8 | |
Establishment of protein localization | 3.2 | |
4 | Cell cycle | 9.6 |
Cell division | 7.9 | |
Ribonucleoprotein complex biogenesis | 7.7 | |
5 | Translation | 6.7 |
M phase of mitotic cell cycle | 3.2 | |
Cell cycle | 2.9 | |
6 | Generation of precursor metabolites and energy | 11.5 |
Protein transport | 5.6 | |
Establishment of protein localization | 5.5 |
Correlation analysis
In this research, we propose a novel correlation analysis by using standardization based on SCOUP to detect covariance that cannot explained by the model that assumes the conditional independence among genes alone, and investigated the regulatory relationships among genes using correlations within raw expression data or standardized expression data. Hereafter, we refer to the correlations within raw data and standardized data as C _{Raw} and C _{Std}, respectively. We first investigated whether the target genes of a transcription factor (TF) can be predicted under the assumption that the expression of a TF and its target genes are highly correlated. The list of TFs and their target genes was downloaded from the Integrated Transcription Factor Platform (ITFP) [29], a database containing 71 TFs and 648 pairs of TFs and target genes in the top 1000 genes. We calculated the C _{Raw} and C _{Std} values between 71 TFs and the remaining 929 genes and extracted from the top 1000 positively correlated pairs of TFs and genes according to each correlation method. The top 1000 C _{Raw} and C _{Std} values contained correlations of 24 and 27 annotated pairs, respectively (see Additional file 1 for the list of detected annotated pairs), and the probabilities of capturing these annotated pairs by random sampling are p<6.2×10^{−5} and p<2.8×10^{−6}, respectively. This suggests that target genes of a specific TF can be predicted from a correlation analysis of single-cell expression data.
Only three annotated pairs were common between the 24 C _{Raw} correlation values and the 27 C _{Std} correlation values, which indicates that different regulatory relationships were detected when analyzing raw and standardized expression data. Analysis of standardized expression data revealed correlations that were not explained by the model that assumes the conditional independence among genes, whereas raw expression data analysis revealed correlations produced from similar expression patterns during differentiation. Thus, our novel correlation analysis method can deliver new insights that are not detected by conventional correlation methods.
Next, we aimed to detect a key regulator of each group by using the two correlation methods. We downloaded the candidates of key regulator TFs and their related genes from the Riken Transcription Factor Database (TFdb) [30] and FANTOM5 SSTAR [31] as well as TF data from ITFP. In this analysis, 117 genes of the annotated TFs and their related genes were contained in top 1000 gene and were considered as key regulator candidates. We calculated the C _{Raw} (and C _{Std}) values between each candidate and genes in a group, and calculated the average C _{Raw} (C _{Std}) value of the candidate for the group. We denote these values as \(\overline {C}_{\text {Raw}}(i,j)\) and \(\overline {C}_{\text {Std}}(i,j)\), where i is the index of a candidate and j is the index of a group. We assumed the key regulator of the group is highly correlated with genes in the group and investigated to detect the key regulators by extracting the candidates of high \(\overline {C}_{\text {Raw}}(i,j)\) or \(\overline {C}_{\text {Std}}(i,j)\). There were few differences between \(\overline {C}_{\text {Raw}}(i,j)\) and \(\overline {C}_{\text {Std}}(i,j)\) for groups 3 and 6 because our standardization was inadequate to deal with the transient patterns found in these groups. The difference between \(\overline {C}_{\text {Raw}}(i,1)\) and \(\overline {C}_{\text {Std}}(i,1)\) was largest among all groups, and therefore we focused on group 1 hereafter.
The top three transcription factors and their related genes for group 1
Rank | Gene symbol | C _{Raw} | Rank | Gene symbol | C _{Std} | |
---|---|---|---|---|---|---|
5 | Ifit1 | 0.46 | 313 | Sqstm1 | 0.076 | |
6 | Ifi205 | 0.44 | 45 | Ifih1 | 0.071 | |
17 | Ifi204 | 0.43 | 5 | Ifit1 | 0.071 |
Conclusions
The advancement of single-cell technologies will enable the elucidation of many biological processes, such as differentiation. The development of a novel computational method is necessary to fully analyze single-cell data. We developed a novel method, SCOUP, to analyze single-cell expression data for differentiation. Unlike previous methods, which use dimension reduction approaches and reconstruct differentiation trajectories in reduced space, SCOUP describes gene expression dynamics during differentiation directly, including pseudo-time and cell fate. We evaluated pseudo-time using SCOUP and previous methods based on the consistency between pseudo-time and experimental time and showed that the SCOUP results were superior to those of other methods for almost all conditions. We also compared the accuracy of cell lineage estimation using SCOUP and Monocle, and showed that SCOUP can estimate cell lineages with high accuracy, even for the cells at an early stage of bifurcation. SCOUP is based on a probabilistic model and can be extended to many applications. In this research, we developed a novel correlation analysis method based on SCOUP. It calculates the covariance that cannot be explained by a model, which assumes the conditional independence among genes, alone. We applied this method to scRNA-seq, and detected the candidate of key regulator of differentiation and the clusters in the correlation network which were not detected with conventional correlation analysis. In future work, we plan to extend our model to consider transient expression patterns complicated cell lineage pattern. In addition, we will develop a multivariate OU process to estimate gene regulatory networks more directly.
Abbreviations
EM, expectation-maximization; GMM, gaussian mixture model; GO, gene ontology; LPS, lipopolysaccharide; MST, minimum spanning tree; OU, Ornstein–Uhlenbeck; PAM, synthetic mimic of a bacterial lipopeptides; PCA, principal component analysis; PIS, pseudo-time inconsistency score; PIC, viral-like double-stranded RNA
Declarations
Acknowledgements
The authors thank their research group colleagues for assistance in this study. The authors thank Risa Kawaguchi and Tsukasa Fukunaga for critically reading the manuscript. Computations were performed using the supercomputing facilities at the Human Genome Center, University of Tokyo. (http://sc.hgc.jp/shirokane.html).
Funding
This study was supported by a Grant-in-Aid for JSPS Fellows (256670) and JSPS KAKENHI Grant Number 15H01465, 25134701, 80415778.
Availability of data and material
Our software SCOUP can be downloaded from https://github.com/hmatsu1226/SCOUP.
Authors’ contributions
HM designed the probabilistic model, implemented the software, performed the analyses, and wrote the paper. HK contributed to develop the model, designed the experiments. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Kolodziejczyk AA, Kim JK, Svensson V, Marioni JC, Teichmann SA. The technology and biology of single-cell RNA sequencing. Mol Cell. 2015; 58(4):610–20.View ArticlePubMedGoogle Scholar
- Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015; 33(2):155–60.View ArticlePubMedGoogle Scholar
- Grun D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H, van Oudenaarden A. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015; 525(7568):251–5.View ArticlePubMedGoogle Scholar
- Zeisel A, Munoz-Manchado AB, Codeluppi S, Lonnerberg P, La Manno G, Jureus A, Marques S, Munguba H, He L, Betsholtz C, Rolny C, Castelo-Branco G, Hjerling-Leffler J, Linnarsson S. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015; 347(6226):1138–42.View ArticlePubMedGoogle Scholar
- Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH, Desai TJ, Krasnow MA, Quake SR. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature. 2014; 509(7500):371–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Burns JC, Kelly MC, Hoa M, Morell RJ, Kelley MW. Single-cell RNA-Seq resolves cellular complexity in sensory organs from the neonatal inner ear. Nat Commun. 2015; 6:8557.View ArticlePubMedPubMed CentralGoogle Scholar
- Xue Z, Huang K, Cai C, Cai L, Jiang CY, Feng Y, Liu Z, Zeng Q, Cheng L, Sun YE, Liu JY, Horvath S, Fan G. Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing. Nature. 2013; 500(7464):593–7.View ArticlePubMedGoogle Scholar
- Yan L, Yang M, Guo H, Yang L, Wu J, Li R, Liu P, Lian Y, Zheng X, Yan J, Huang J, Li M, Wu X, Wen L, Lao K, Li R, Qiao J, Tang F. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol. 2013; 20(9):1131–9.View ArticlePubMedGoogle Scholar
- Guo G, Huss M, Tong GQ, Wang C, Li Sun L, Clarke ND, Robson P. Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst. Dev Cell. 2010; 18(4):675–85.View ArticlePubMedGoogle Scholar
- Moignard V, Gottgens B. Transcriptional mechanisms of cell fate decisions revealed by single cell expression profiling. Bioessays. 2014; 36(4):419–26.View ArticlePubMedPubMed CentralGoogle Scholar
- Trapnell C. Defining cell types and states with single-cell genomics. Genome Res. 2015; 25(10):1491–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Semrau S, van Oudenaarden A. Studying lineage decision-making in vitro: emerging concepts and novel tools. Annu Rev Cell Dev Biol. 2015; 31:317–45.View ArticlePubMedGoogle Scholar
- Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015; 16(3):133–45.View ArticlePubMedGoogle Scholar
- Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014; 32(4):381–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Bendall SC, Davis KL, Amir EL-AD, Tadmor MD, Simonds EF, Chen TJ, Shenfeld DK, Nolan GP, Pe’er D. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell. 2014; 157(3):714–25.View ArticlePubMedPubMed CentralGoogle Scholar
- Marco E, Karp RL, Guo G, Robson P, Hart AH, Trippa L, Yuan GC. Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc Natl Acad Sci U S A. 2014; 111(52):5643–50.View ArticleGoogle Scholar
- Ji Z, Ji H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. Advance Access published May 13, 2016. doi:http://dx.doi.org/10.1093/nar/gkw430.
- Cressler CE, Butler MA, King AA. Detecting adaptive evolution in phylogenetic comparative analysis using the Ornstein-Uhlenbeck model. Syst Biol. 2015; 64(6):953–68.View ArticlePubMedGoogle Scholar
- Kiryu H. Sufficient statistics and expectation maximization algorithms in phylogenetic tree models. Bioinformatics. 2011; 27(17):2346–53.View ArticlePubMedGoogle Scholar
- Hu GY, O’Connell RF. Analytical inversion of symmetric tridiagonal matrices. J Phys A. 1996; 29(7):1511–3.View ArticleGoogle Scholar
- Kouno T, de Hoon M, Mar JC, Tomaru Y, Kawano M, Carninci P, Suzuki H, Hayashizaki Y, Shin JW. Temporal dynamics and transcriptional control using single-cell gene expression analysis. Genome Biol. 2013; 14(10):118.View ArticleGoogle Scholar
- Moignard V, Woodhouse S, Haghverdi L, Lilly AJ, Tanaka Y, Wilkinson AC, Buettner F, Macaulay IC, Jawaid W, Diamanti E, Nishikawa S, Piterman N, Kouskoff V, Theis FJ, Fisher J, Gottgens B. Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nat Biotechnol. 2015; 33(3):269–76.View ArticlePubMedPubMed CentralGoogle Scholar
- Shalek AK, Satija R, Shuga J, Trombetta JJ, Gennert D, Lu D, Chen P, Gertner RS, Gaublomme JT, Yosef N, Schwartz S, Fowler B, Weaver S, Wang J, Wang X, Ding R, Raychowdhury R, Friedman N, Hacohen N, Park H, May AP, Regev A. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature. 2014; 510(7505):363–9.PubMedPubMed CentralGoogle Scholar
- Fraley C, Raftery AE, Murphy TB, Scrucca L. Mclust Version 4 for R: Normal Mixture Modeling for. Model-Based Clustering, Classification, and Density Estimation Technical Report No. 597. 2012.Google Scholar
- Huang DAW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009; 4(1):44–57.View ArticleGoogle Scholar
- Huang DAW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009; 37(1):1–13.View ArticleGoogle Scholar
- Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, Slichter CK, Miller HW, McElrath MJ, Prlic M, Linsley PS, Gottardo R. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015; 16:278.View ArticlePubMedPubMed CentralGoogle Scholar
- Watts C, West MA, Zaru R. TLR signalling regulated antigen presentation in dendritic cells. Curr Opin Immunol. 2010; 22(1):124–30.View ArticlePubMedGoogle Scholar
- Zheng G, Tu K, Yang Q, Xiong Y, Wei C, Xie L, Zhu Y, Li Y. ITFP: an integrated platform of mammalian transcription factors. Bioinformatics. 2008; 24(20):2416–7.View ArticlePubMedGoogle Scholar
- Kanamori M, Konno H, Osato N, Kawai J, Hayashizaki Y, Suzuki H. A genome-wide and nonredundant mouse transcription factor database. Biochem Biophys Res Commun. 2004; 322(3):787–93.View ArticlePubMedGoogle Scholar
- Lizio M, Harshbarger J, Shimoji H, Severin J, Kasukawa T, Sahin S, Abugessaisa I, Fukuda S, Hori F, Ishikawa-Kato S, Mungall CJ, Arner E, Baillie JK, Bertin N, Bono H, de Hoon M, Diehl AD, Dimont E, Freeman TC, Fujieda K, Hide W, Kaliyaperumal R, Katayama T, Lassmann T, Meehan TF, Nishikata K, Ono H, Rehli M, Sandelin A, Schultes EA, ’t Hoen PA, Tatum Z, Thompson M, Toyoda T, Wright DW, Daub CO, Itoh M, Carninci P, Hayashizaki Y, Forrest AR, Kawaji H. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 2015; 16:22.View ArticlePubMedPubMed CentralGoogle Scholar
- Yu HB, Kielczewska A, Rozek A, Takenaka S, Li Y, Thorson L, Hancock RE, Guarna MM, North JR, Foster LJ, Donini O, Finlay BB. Sequestosome-1/p62 is the key intracellular target of innate defense regulator peptide. J Biol Chem. 2009; 284(52):36007–11.View ArticlePubMedPubMed CentralGoogle Scholar
- Esche C, Stellato C, Beck LA. Chemokines: key players in innate and adaptive immunity. J Invest Dermatol. 2005; 125(4):615–28.View ArticlePubMedGoogle Scholar
- Zlotnik A, Yoshie O, Nomiyama H. The chemokine and chemokine receptor superfamilies and their molecular evolution. Genome Biol. 2006; 7(12):243.View ArticlePubMedPubMed CentralGoogle Scholar
- Bieche I, Chavey C, Andrieu C, Busson M, Vacher S, Le Corre L, Guinebretiere JM, Burlinchon S, Lidereau R, Lazennec G. CXC chemokines located in the 4q21 region are up-regulated in breast cancer. Endocr Relat Cancer. 2007; 14(4):1039–52.View ArticlePubMedGoogle Scholar