SCOUP: a probabilistic model based on the Ornstein–Uhlenbeck process to analyze single-cell expression data during differentiation

Background Single-cell technologies make it possible to quantify the comprehensive states of individual cells, and have the power to shed light on cellular differentiation in particular. Although several methods have been developed to fully analyze the single-cell expression data, there is still room for improvement in the analysis of differentiation. Results In this paper, we propose a novel method SCOUP to elucidate differentiation process. Unlike previous dimension reduction-based approaches, SCOUP describes the dynamics of gene expression throughout differentiation directly, including the degree of differentiation of a cell (in pseudo-time) and cell fate. SCOUP is superior to previous methods with respect to pseudo-time estimation, especially for single-cell RNA-seq. SCOUP also successfully estimates cell lineage more accurately than previous method, especially for cells at an early stage of bifurcation. In addition, SCOUP can be applied to various downstream analyses. As an example, we propose a novel correlation calculation method for elucidating regulatory relationships among genes. We apply this method to a single-cell RNA-seq data and detect a candidate of key regulator for differentiation and clusters in a correlation network which are not detected with conventional correlation analysis. Conclusions We develop a stochastic process-based method SCOUP to analyze single-cell expression data throughout differentiation. SCOUP can estimate pseudo-time and cell lineage more accurately than previous methods. We also propose a novel correlation calculation method based on SCOUP. SCOUP is a promising approach for further single-cell analysis and available at https://github.com/hmatsu1226/SCOUP. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1109-3) contains supplementary material, which is available to authorized users.


Limit of a discrete time OU process
In this section, we consider limit of a discrete time OU process. Because both of genes and cells are supposed to be independent in SCOUP, we forget the index of gene and cell, and consider a general OU process in this section. We represent observed value as E and initial value as S, and X = {X s |s = 0, ..., N } is a path such that X N = E and X 0 = S. As mentioned in the main text, a OU process can be regarded as limit of a discrete time OU process: P ou (X s |X s−1 , α, σ 2 , t/N ) P (X|α, σ 2 , t) = N ∏ s=1 P ou (X s |X s−1 , α, σ 2 , t/N )P (X 0 ), where the interval of integration is the all paths which satisfies X 0 = S and X N = E. Hereafter, we assume X 0 is given and re-define X as X ∈ {X s |s = 1, ..., N } for simplification. The calculation of the case that X 0 is unobserved is given in the after section. In this case, the complete likelihood is given by P (X|S, α, σ, t) = N ∏ s=1 P ou (X s |X s−1 , α, σ 2 , t/N ).

Transformation into the multivariate normal distribution
In this section, we transform the product of the transition probability P ou (X s |X s−1 , α, σ 2 , t/N ) as the multivariate normal distribution. In the case of OU process, the transition probability is calculated with the normal distribution as follows: The complete likelihood P (X|S, α, σ 2 , t) is equal to the following multivariate normal distribution: where X −N ∈ {X s |s = 1, ..., N − 1} and Λ is (N − 1) × (N − 1) matrix and µ is (N − 1) dimension vector and satisfy following equations.
From above equation, µ j can be calculated as follows:

Derivation of mean vector and variance-covariance matrix
As mentioned in the next section, the expectation of X s , X s X s+1 , and X 2 s are necessary to optimize parameters. These expectations can be calculated from the mean vector and variance-covariance matrix of the multivariate normal distribution. Because we consider a limit of a discrete time OU process, we cannot use numerical calculation and have to solve analytically. In this section, we derivate the mean vector and the variance-covariance matrix.
Firstly, we derivate the variance-covariance matrix. To simplify this, we define Λ ′ so that Λ = −V −1 BΛ ′ . We also define following variable: and Λ ′ is represented as follows: It is shown that the inversion of symmetric tridiagonal matrix can be calculated analytically [1]. By using this, we can derive the inversion of Λ ′ and Λ as follows: Next, we substitute Λ −1 and derive the mean µ j .
Here, we use following equation to calculate above formula.
The sum of the above equations becomes By using following equation, the sum can be described as follows: Thus, the mean becomes

The complete log-likelihood
The complete log-likelihood of X given X 0 is given by and Q function is given by

Derivation of the sufficient statistic
The likelihood function depends on X only through X 2 s , X s X s+1 , and X s , these are the sufficient statistic for the model. In this section, we derive the these statistic analytically.

Derivation of F ss
Firstly, we solve the expectation of X 2 s .
The first member is The second member is

Derivation of F ss+1
Secondly, we calculate the following equation: The first member is The remainder is

Derivation of F s
Lastly, we calculate F s .

Derivation of Q function
As mentioned in the above section, Q function is In the following section, we calculate the Q function.

Derivation of 2F
Firstly, we calculate the member which is related to Secondary, we calculate the member which is related to Fourthly, we calculate the member which is related to θ(X 0 + X N − 2θ).
Lastly, we calculate the member which is related to θ 2 .

Derivation of
. In this calculation, we disregard the 1/N order terms.
At first, we calculate the member related to F ss and F ss+1 . The member related to Λ is calculated as follows: The member related to ( .
The member related to θ( The member related to θ * (X 0 + X N − 2θ) is And the member related to θ 2 , θ * θ, θ * 2 is Therefore, becomes as follows:

Q function
The standardization term is approximated as follows: By using the calculation results so far, the Q function is described as follows:
The highest order term results in σ * 2 = σ 2 which means σ 2 will not change. Therefore, we optimize in regards to second highest term.
When we regard above σ 2 as σ * 2 , dQ/dσ * 2 will be zero up to second highest order if σ 2 is converged sufficiently.

Optimization of t
We solve dQ/dt * = 0. Because t * is related to α and σ, we consider α where So, dQ/dt * is described as follows: ) .
The highest order term results in t * = t and we consider second highest term. In the case that dQ/dt * = 0 for second highest order, the following equation consists.
Above equation becomes as follows: Unfortunately, t cannot be solved analytically and we use Newton's method.
We optimize t iteratively by using above equation.

Mixture OU process for multi-lineage differentiation
We denote the number of cell, gene, and lineage by C, G, and K, respectively. We also denote the index of cell, gene, and lineage by c, g, and k, respectively. We assume each lineage has different attractor θ gk and the likelihood is given by where π k is the probability of lineage k.
With the latent value Z c which is 1 of K representation and indicates the cell fate, P (X, Z|E, S, Θ, T ) and P (Z|E, S, Θ, T ) are given by So, γ ck , which is the expectation of posterior probability of Z ck is represented as follows: .
To avoid overfitting, we added pseudo-count and re-defined γ ck as follows: Hereafter, we denote ln P (X cg |S cg , Z ck = 1) by l gck , and l gck is described as follows: And the Q function of l gck (Q gck ) is Thus, the complete Q function is

Parameter optimization
The optimization equation is derived by solving the differentiation of the Q function likewise the parameter optimization of single gene, cell, and lineage model.

Optimization of t c
We optimize t c so that it satisfies following equation.
We used Newton's method as follows:

Expected value of S cg
Thus far, we assume S cg (X cg0 ) is given. However, it is unobserved practically and we have to calculate expected value of S gc .
Therefore, the expected values related to X cg0 are given by , and we just substitute the above E[X cg0 ] and E[X 2 cg0 ] into X cg0 and X 2 cg0 , respectively, for parameter optimization.

The marginal log-likelihood
S gc is not observed and we have to marginalize over S gc to calculate the marginal log-likelihood, and it is described as follows: Therefore, the log-likelihood of E is .

A procedure of parameter optimization
In this research, we used following parameter optimization procedure to avoid sub-optimal solutions. Firstly, we initialized pseudo-time t c based on dimension reduction approach and α g , σ 2 g , θ g by using the model of K = 1.

2.
Calculate minimum spanning tree with Prim's algorithm on D dimensional latent space and calculate the shortest path from root (µ 0 ) to a cell c and define the standardized of the total weight of the shortest path as the initial value of t c (In this research, we set D = 2 unless we refer).
Secondary, we optimize parameter of mixture model. Because parameters fell into a sub-optimal solution which shows wrong order of cells when we optimized all parameter simultaneously, we optimized parameters except t c at first.
1. Initialize θ gk with θ g calculated by above procedure.
2. Initialize the expectation of a latent value of a cell (γ ck ) randomly and calculate other statistic, and optimize α g , σ 2 g , and θ gk with M-step.
3. Run E-step to calculate γ ck and other statistic.
5. Return to step 3 until the number of iterations reach m 1 (In this research, we set m 1 = 1, 000.).
Lastly, we optimize all parameters.
1. Run E-step 2. Run M-step. We optimize t c at first, and optimized other parameters after that.

Stop the parameter optimization if |e
|, and |t * c − t c | are under ϵ (In this research, we set ϵ = 0.01). We used these values to check convergence because these are meaningful values and α g and σ 2 g can change together so that the likelihood does not change (see next section). 5. Return to step 1.
α g can be very large and small which is meaningless to estimate accurately and we set lower and upper bounds to α min = 0.1, α max = 100 and set α g = α min (α max ) if α * g is under (over) α min (α max ). σ 2 g can be significantly small and we set the lower bound similar way (σ 2 min = 0.1). We also set the bounds of pseudo-time so that the lower bound is t min = 0.001 and the upper bound is t max = 2.0. For pseudo-time (t c ) optimization, we stop the Newton's method if |t cn+1 − t cn | is lower than ϵ or the number of iteration reach 100. The solution of Newton's method t c can be incorrect value and we set the new parameter of t c to the time whose likelihoods are highest in the three case: t c is old time; t c is the solution of Newton's method; t c = t min ; and t c = t max .

Validation of parameter optimization method
In this section, we validated the parameter optimization method with simulation data. We generated simulation data with C = 100, G = 500, and K = 1, and each parameter was sampled so that t c = U R [0, 1], α g = U R [0.1, 10], σ 2 g = U R [0.1, 100], and θ g = U R [−5, 5], where U R [a, b] is a uniform random number from a to b. All of the initial distributions of the gene were N (X cg0 |0, 1.0) and X cg0 was sampled from the distribution. To complete the scale of the parameters, we set t max = 1.0. With above conditions, we sampled the expression data from the normal distribution based on OU process, and applied this simulation data for SCOUP.
Firstly, we compared the values of estimated parameters with those of true parameters (Figure 1(A,B,C,D)). The values of estimated time is highly correlated with true values (r 2 = 0.94). The difference between estimated time and true time becomes large for large t c . This is because the distribution of OU-process becomes stationary distribution for t sufficiently large, and the fluctuation of the value of optimized t c will be large for such condition. The values of estimated θ g is also highly correlated with true values (r 2 = 0.96). But estimated θ g of a few genes are different from true values significantly. This will be because the influence of θ g on the distribution is significantly small when α g ≃ 0, and hence, the value of θ g is unstable in such condition. The values of estimated α g and σ 2 g are highly correlated with true values (r 2 is 0.79 and 0.77, respectively), but estimated α g and σ 2 g of some genes are significantly larger than true values. This is because that the distribution of different α g and σ 2 g will be almost equal for the gene of θ g ≃ µ 0g as long as the balance between α g and σ 2 g are kept, and the estimated absolute values will be unstable. Then, we investigated the value of mean (e −αgt θ g ) and variance (σ 2 g (1 − e −2αgt )/2α g ) of OU process at time t = 1 (Figure 1(E,F)). The values of estimated mean and variance are highly correlated with those of true mean and variance (0.99 and 0.94, respectively), and hence, SCOUP succeed to reconstruct the original probabilistic distribution with high accuracy.
Next, we investigated that the log-likelihood of optimized parameters is higher than those of varied parameters. Figure 2 is the example of the log-likelihood curve with respect to time parameter of a cell (t c ), and the value of optimized t c is drawn with x-mark. The log-likelihood of the optimized t c is located in the top of the log-likelihood curve. As shown in Figure 2(C), the log-likelihood are almost equal 0.5 < t c because the distribution will almost be equal to the stationary distribution for large t c . Figure 3 is the example of the log-likelihood surface with respect to α g and θ g of a gene. The log-likelihood of the optimized α g and θ g are located in the top of the log-likelihood surface. Figure 4 is also the example of the log-likelihood surface with respect to σ 2 g and θ g of a gene. The log-likelihood of the optimized σ 2 g and θ g are located in the top of the log-likelihood surface. Figure 5 is also the example of the log-likelihood surface with respect to θ g and σ 2 g of a gene. Although the log-likelihood of the optimized α g and σ 2 g are located in the top of the log-likelihood surface, the log-likelihood are almost equal for σ 2 g ≃ 20 × α g regarding a gene ( Figure 5(C)). This is because that the distribution of different α g and σ 2 g will be almost equal in some conditions as mentioned in the above paragraph.
Although the values of optimized parameters have potential to be unstable in some conditions, the mean and variance of OU process are stable and the log-likelihood of optimized parameters are highest. Therefore, we conclude that SCOUP succeed to optimize parameters.     g of a gene. The color of a pixel represents the log-likelihood and black represents the highest log-likelihood. The optimized (α g , σ 2 g ) is indicated with x-max.

Automatic reduction of unrelated gene in pseudo-time estimation
The expression data of gene, which is unrelated to differentiation, can fluctuate the pseudo-time estimation in the previous methods. In contrast, SCOUP estimates pseudo-time while automatically reducing the influence of unrelated gene. This is because the probabilistic distribution of such unrelated gene will be constant in relation to time parameter and such gene will not affect optimization of time parameter.
In this section, we verified that the unrelated genes actually do not affect pseudo-time estimation of SCOUP. We defined the gene log-likelihood for various time as follows: where each parameter is optimized and ∆t is the fluctuation of time. However, (t c −∆t) is set to the boundary value if it exceeds t min or t max . We used original Kouno's data with 45 pseudo-gene for validation. The expression of pseudo-gene were constructed by shuffling raw expression data among 960 cells for each gene. Figure 6 shows the median and quartile of l(E g , T ) − l(E g , T + ∆t) for original gene and pseudo-gene for various ∆t. In contrast to the original gene, whose gene log-likelihood is maximum at ∆t = 0.0, the gene log-likelihood of pseudo-gene are approximately constant. Thus, the pseudo-gene does not affect the pseudo-time optimization because the gene log-likelihood is not affected by the time parameter.

Computational complexity
The computational complexity of SCOUP is O(IKCG), where I is the number of optimization iterations. Therefore, the runtime of SCOUP increases linearly with the number of cells and genes. The runtimes for original Kouno's data, Moignard's data (1,000 cells), and Shalek's data (LPS) about 8 min, 26 min, and 91 min, respectively ( Table 1). The computations were performed on my MacBookPro equipped with 2.8 GHz Intel Core i5 processors and 16 GB RAM. Thus, SCOUP works well with short runtimes with laptop computers.

Estimation of the number of lineages (K)
We investigated whether SCOUP could estimate the number of lineages (K) in the case that the genuine number of lineage was unknown. We applied SCOUP for Kouno's data(2) (ϵ = 1.0) in the condition that K were set to either 1, 2, 3, or 4. Figure 7A shows the box plot of the log-likelihood for each condition (K = 1, 2, 3, 4). In comparison with the difference between the log-likelihood of K = 1 and that of K = 2, the log-likelihood of K = 2, 3, 4 are close. Therefore, the model with K = 2 is approximately sufficient to describe the data. In addition, the probability of a lineage k (π k ) can be used for K estimation. Figure 7B is the box plot of the remaining π k after removing two dominant lineage for K = 3, 4 model. In both conditions, many π k are under 0.1. Thus, the probability of a lineage is fitted to relatively small when K is set to more than the genuine number of lineages, and we can infer the appropriate number of lineages.

Cell lineage estimation with Gaussian mixture model
We estimated cell lineage with Gaussian mixture model (GMM) implemented in mclust package [2]. The AUC values of mclust for Kouno's data(2) (ϵ = 0.0) and Moignard's data are 0.86 and 0.96, respectively, and both of them are inferior to those of SCOUP (0.99 and 1.0). Figure 8 and 9 show cells of Kouno's data and Moignard's data in the space of first two PCs and the colors of cells indicate the genuine cell lineage (left), the lineage estimated using SCOUP (middle), and the lineage using mclust (right). GMM cannot estimate cell lineage correctly for cells at an early stage of bifurcation because GMM does not count the time axis and will work well only for cells whose expression pattern changes sufficiently after bifurcation. Moreover, mclust seems to overfit the 4SG cells around (-6, 0) in the PCA space, and failed to distinguish a portion of 4SG cells for Moignard's data (Figure 9). This is because GMM will fit to the position in which large number of cells exist, and GMM cannot estimate the path of bifurcation in the condition that cells are unevenly distributed . Therefore, it is important to count time axis to analyze expression for cells during bifurcation.  The color for the SCOUP analysis is defined by γ c0 ; black, 0.5; red, 0.0; and blue, 1.0. The color for the mclust analysis is defined by expectation of latent values; black, 0.5; red, 0.0; and blue, 1.0. We determined the color of each state so that they consist among each plot. 9 Annotated pairs in the top 1,000 C Raw and C Std values As described in the main manuscript, we investigated whether the target genes of a transcription factor (TF) can be predicted under the assumption that the expression of a TF and its target genes are highly correlated. The top 1,000 C Raw and C Std values contained correlations of There are 24 and 27 annotated pairs in the top 1,000 C Raw and C Std values, respectively ( Table 2). Only three annotated pairs (UHRF1, RRM2), (MCM5, RRM2), and (MCM4, RRM2), were common between the 24 C Raw correlation values and the 27 C Std correlation values.