 Research Article
 Open Access
 Published:
SwitchFinder – a novel method and query facility for discovering dynamic gene expression patterns
BMC Bioinformatics volume 17, Article number: 532 (2016)
Abstract
Background
Biological systems and processes are highly dynamic. To gain insights into their functioning timeresolved measurements are necessary. Timeresolved gene expression data captures temporal behaviour of the genes genomewide under various biological conditions: in response to stimuli, during cell cycle, differentiation or developmental programs. Dissecting dynamic gene expression patterns from this data may shed light on the functioning of the gene regulatory system. The present approach facilitates this discovery. The fundamental idea behind it is the following: there are changepoints (switches) in the gene behaviour separating intervals of increasing and decreasing activity, whereas the intervals may have different durations. Elucidating the switchpoints is important for the identification of biologically meanigfull features and patterns of the gene dynamics.
Results
We developed a statistical method, called SwitchFinder, for the analysis of timeseries data, in particular gene expression data, based on a changepoint model. Fitting the model to the gene expression timecourses indicates switchpoints between increasing and decreasing activities of each gene. Two types of the model  based on linear and on generalized logistic function  were used to capture the data between the switchpoints. Model inference was facilitated with the Bayesian methodology using Markov chain Monte Carlo (MCMC) technique Gibbs sampling. Further on, we introduced features of the switchpoints: growth, decay, spike and cleft, which reflect important dynamic aspects. With this, the gene expression profiles are represented in a qualitative manner  as sets of the dynamic features at their onsettimes. We developed a Web application of the approach, enabling to put queries to the gene expression timecourses and to deduce groups of genes with common dynamic patterns.
SwitchFinder was applied to our original data  the gene expression timeseries measured in neuroblastoma cell line upon treatment with alltrans retinoic acid (ATRA). The analysis revealed eight patterns of the gene expression responses to ATRA, indicating the induction of the BMP, WNT, Notch, FGF and NTRKreceptor signaling pathways involved in cell differentiation, as well as the repression of the cellcycle related genes.
Conclusions
SwitchFinder is a novel approach to the analysis of biological timeseries data, supporting inference and interactive exploration of its inherent dynamic patterns, hence facilitating biological discovery process. SwitchFinder is freely available at https://newbioinformatics.eu/switchfinder.
Background
Timeresolved measurements are performed to study the dynamics of biological processes e.g. the dynamics of gene expression in response to treatments, upon induction of a transcription factor, during cell cycle or embryonic development. The temporal response patterns may shed light on coordination and regulation of the genes, aiding the inference of gene regulatory networks. Several methods for the analysis of the timecourse gene expression data were developed, reviewed in [1], however, major challenges remain. The time courses are mainly short, hindering the inference of complex models with many parameters. The Markov modelbased methods [2, 3] rely on the assumption that the underlying process is a) Markovian and b) stationary: a) the state of a gene at each timepoint depends only on the state of the system at the previous timepoint and b) the probability of a transition from one timepoint to the next is constant for all timepoints. The biological relevance of these assumptions is questionable. The gene regulatory circuits permanently rewire – the genes switch between different regimes of activity, whereas the durations of the regimes may have different length. In fact, these are the turning points of gene behaviour that have biological relevance and are important to elucidate. The gene expression data is likewise sampled at the irregularly spaced timepoints with a hope to capture real biological events. The sparse irregular sampling generates spiky, sawtoothed data, presenting a difficulty for smooth interpolations. To overcome this, in [4] the use of piecewise constant functions was advocated.
The most common purpose of the timeresolved gene expression data analysis was to derive groups of genes with similar dynamical responses. Modelbased clustering [5] executes simultaneously two tasks: fitting a model to gene expression profiles and grouping the genes based on the parameters of the fitted models. However, relations between genes across time may have only a fragmentary character like e.g. immediateearly responses to stimulation. Modeling the expression profile of an individual gene might be more appropriate. Even at the risk of overfitting, this has an advantage of capturing unique features of the gene temporal behaviour. In [6], a mathematical model of response dynamics  the impulse model  was proposed for fitting the individual gene profile. The model contains seven biologically relevant parameters, emphasizing important aspects of the gene dynamics e.g. point of induction. In [7], the model was used in an integrative clusteringmodeling approach.
In the present approach, called SwitchFinder, a timeseries model is proposed that explicitly assumes the existence of the switchpoints (switches) between intervals of increasing and decreasing activities, which are interpolated with linear or generalized logistic function. Fitting the model to the timeresolved gene expression data implies the prediction of the switchpoints of individual genes.
Our approach has origin in the changepoint modelling, that has been widely applied in engineering, ecology, economics and finance [8–14]. The fundamental idea is: the model is characterized by a number of discrete regimes, within which different model parameters apply. The model switches from one regime to another and the characteristics of the observations change according to the particular regime. Assessing the locations of the changepoints (called in the literature switchpoints, breakpoints, structural breaks or thresholds) may give valuable insights into the modelled process. Various approaches to the changepoint problem for models with different assumptions were proposed. To mention are diagnostic methods based on testing with e.g. Schwartz’s Bayesian Information Criterion (BIC) [15], iterative fitting procedures for segmented regressions [16, 17], nonparametric smoothers [11] and dynamic programming algorithms [18–20]. Bayesian approach to multiple changepoints problem dates back to [21] and was further elaborated by e.g. [10, 22, 23]. In [24], the product partition model was used in the Bayesian framework, see also [25]. In [26], the multiple changepoints model was formulated in terms of a latent discretestate variable indicating the regimes and evolving as a discrete time, discretestate Markov process governed by a transition probability matrix. The model was estimated with Markov chain Monte Carlo (MCMC) sampling. Bayesian methodology is valuable for the inference of the changepoint models, since it treats the changepoints locations as parameters to be estimated in the same framework as the other model parameters. A MCMC technique Gibbs sampling proved to be especially attractive for the Bayesian inference [27].
The central interest of the present work was the inference of the switchpoints indicating changes between the regimes of the gene activity. Our model represents a series of switchpoints (peaks and troughs), joined by lines or logistic curves. We developed a Gibbs sampling procedure for the Bayesian inference of the model.
The switchpoints elucidated by the analysis may indicate an onset of features like Growth or Decay, introduced here to capture substantial dynamic properties of the gene behaviour. Knowing onsettimes of the dynamic features enables to represent the gene profiles in a qualitative manner. This is utilized in our approach to perform partitioning of the genes into groups with common dynamic patterns. The present approach inspires to put queries to the gene set like for example: which genes have peaks/troughs of their activity at certain time points? Which genes exhibit growth or decay at the particular onsettimes? The Web application of the approach provides the query interface and grants to a human expert a possibility to query the timeresolved data, facilitating the biological discovery process.
The present approach decouples the two tasks – statistically fitting individual genetic profiles and grouping of them. We do not regard the gene data set as multivariate timeseries data, as was done in [28]. The rationale behind this is that an individual gene dynamics is not only governed by the system dynamics, but by some external factors remaining “behind the scene” (e.g. chromatin modifications, posttranscriptional modifications, protein degradation). In [28], the authors segmented multivariate biological timeseries with the help of the fused LASSO regression, based on the assumption that the data at each timepoint (response) is explained with the preceding timepoints (regressors), except at the breakpoints, for which the preceding timepoints have negligible explanatory power. The authors then clustered the gene profiles in each segment. To the contrary, we first seek to detect the changepoints in each gene behaviour, then the changepoints can be used to get insights into the dependency structure of the system.
In the next section, we introduce the model and its Bayesian formulation.
Methods
The model
Figure 1, A illustrates the model for an exemplary gene expression profile with T=14 measurements. The model contains N=5 switches at timepoints 1, 6, 9, 12, 14 (switch locations) of the following types: trough, peak, trough, peak and trough. The switches separate intervals of increasing and decreasing activities of the gene, called regimes. The model assumes that the data within the regimes is interpolated with linear functions. The goal of the present method is to infer the most probable timepoints of switches between the regimes while fitting the model to the timeseries data.
Let r be the regime index: r=1,…,N−1. We denote the locations of the switches with L _{ r } (Ndimensional vector) and the yvalues at these locations (switch heights) with H _{ r }. The model assumes that the data values at timepoints between the switches are determined by the linear interpolation. Figure 1,b displays one time interval with timepoints t∈{L _{ r },…,L _{ r+1}}. The interpolated value at the timepoint t is denoted by y _{ t }. Due to the linearity property, the following proportion is valid: \(\frac {tL_{r} }{L_{r+1} L_{r}} =\frac {y_{t}H_{r}}{H_{r+1}H_{r}}\). Solving this equation for y _{ t }, while denoting with \(LF:=\frac {L_{r+1} t}{L_{r+1}L_{r}}\) (linear factor), we get:
If Y=(y _{ t })_{ t=1,…,T } is the data, the set of Eq. (1) for all intervals r specifies a linear regression model with the Ndimensional vector of parameters H=(H _{ r })_{ r=1,…N }. So the model underlying our approach is specified as: Y=X·H+e, where X is the (T×N)dimensional design matrix, defined with the help of the linear factors for all t and all r (see the matrix in Fig. 1,c for the model in Fig. 1,a). Vector e is the error term, which can be written as: e=σ·e _{ t }, e _{ t }∼N(0,1), where σ is the standard deviation of the error term. The parameters of the model to be estimated in course of the model inference are: locations of the switches L _{ r }, r=2,…,N−1, the heights of the switches H _{ r },r=1,…,N and σ. (For simplicity of the modelling, the first and the last timepoints of the timeseries are always labelled as switches).
If switch locations L _{ r } are known, the linear regression model is specified and can be fitted to the data Y by the Ordinary Least Squares (OLS) method. Then, the parameters of the model (i.e. the switch heights) can be determined by: H=(X ^{T} X)^{−1} X ^{T} Y. The fitted values under the model are calculated by: Y _{ fitted }=X·H.
In the following, for the sake of simplicity, we use a common notation for the linear regression model: β instead of H. Let the linear regression model be formulated as follows:
The Ndimensional vector of regression coefficients β and the standard deviation σ are parameters to be estimated.
Model inference
Probabilistic inference of the model (estimation of the switch locations and the parameters β and σ) was facilitated by the Bayesian methodology. Within a Bayesian framework, inference about parameters of a model, θ, is made based on its posterior distribution given the data, p(θY), using the proportionality: p(θY)∝L(θY)p(θ), where L(θY) is the likelihood function and p(θ) is the prior distribution of the parameters. Since the direct Bayesian inference of the present model is infeasible, the Markov chain Monte Carlo (MCMC) technique Gibbs sampling presents an attractive possibility. Gibbs sampling reduces a problem of sampling from a complex posterior distribution to a series of more tractable subtasks of sampling from simpler, lowerdimensional distributions, simulations from which can be done using standard functions [29, 30]. Namely, Gibbs sampling iteratively generates samples from full conditional posterior distributions as outlined below.
Suppose the model has k parameters θ=(θ _{1},…,θ _{ k }). Given an arbitrary set of starting values \((\theta _{2}^{(0)},\ldots,\theta _{k}^{(0)})\), consider the following steps:
Steps 1 through k are repeated J times, where J is the number of iterations, to obtain the samples \((\theta _{1}^{(j)},\theta _{2}^{(j)},\ldots,\theta _{k}^{(j)}),\; j=1,\ldots,J\). The distribution \(p(\theta _{i} \left \theta _{1}^{(j)} \right.,\ldots,\theta _{i1}^{(j)},\theta _{i+1}^{(j1)},\ldots,\theta _{k}^{(j1)},Y)\) is called the full conditional posterior distribution. If J is large enough, after some L, the Gibbs sampler has converged [29]. Then the joint and marginal distributions of θ _{1},…,θ _{ k } can be approximated by the empirical distributions of the simulated values. E.g. the mean of the marginal distribution of θ _{ i } may be calculated by:
In the following, we derive the conditional posterior distributions of β and σ ^{2}.
Conditional distribution of β, given σ ^{2}.
Assume σ ^{2} is known. We prescribe a multivariate normal distribution for the parameter β. Let the prior distribution of β is given by:
βσ ^{2}∼N(β _{0},Σ _{0}), where the vector β _{0} and the covariance matrix Σ _{0} are known. The prior density can be written as:
Because of the assumption of normality in (2), the likelihood function is given by:
Combining the prior density and the likelihood function, the posterior distribution of β, conditional on σ ^{2}, is specified by the following normal distribution (see [31], Chapter 7; [32]):
In case of an uninformative prior i.e. when β _{0} is the vector of nulls and Σ _{0} contains big values, the Bayesian estimate of the probability distribution of β is analogous to the distribution of the best linear unbiased estimator of β obtained by the OLS method. Namely, the unbiased estimator of β is a normally distributed random variable [31]:
So, we can use β _{ fitted }=(X ^{T} X)^{−1} X ^{T} Y and Σ=σ ^{2}(X ^{T} X)^{−1} as the mean and the covariance matrix for sampling the values of β.
If the mean vector μ and the covariance matrix Σ of the multivariate normal distribution are known, a commonly used method for generating values from this distribution is the following. Identify matrix A, which is the Cholesky decomposition i.e. AA ^{T}=Σ, then the sample value is calculated as: μ+AE, where E is an Ndimensional vector of standard normal variables sampled from N(0,1).
While sampling β, rejection sampling was used to ensure the validity of the new model: only models with alternating troughs and peaks and nondegenerate (i.e. with each data point as switch or with a regime having low amplitude) are admissible.
Conditional distribution of σ ^{2}, given β
Assume β is known. The usual specification for the distribution of σ ^{2} is the inverted Gamma distribution (because this is the natural conjugate prior for normal likelihood). So, \(\frac {1}{\sigma ^{2} }\) should be Gammadistributed. Let the prior distribution of \(\frac {1}{\sigma ^{2} }\) has the form: \(\frac {1}{\sigma ^{2}} \left \beta \right. \sim \Gamma \left (\frac {\nu _{0} }{2},\frac {\delta _{0} }{2} \right)\), where ν _{0} and δ _{0} are known, so
The likelihood function is given by (3). Multiplying the prior density and the likelihood gives the following posterior density:
that is also of a Gamma form, suggesting the following posterior distribution of \(\frac {1}{\sigma ^{2} }\):
It can be shown that in case of uninformative priors (ν _{0}=0, δ _{0}=1) this distribution is analogous to the distribution of the unbiased estimator of σ ^{2}, determined by the OLS method. If \(\sigma _{unbiased}^{2} \)is the unbiased estimator of σ ^{2}, then it is distributed as (see [31]):
where M is the number of regressors in the model (here, M=1), χ ^{2} is the chisquared distribution. So, we can use \(\frac {\sigma _{fitted}^{2} }{T1} \chi ^{2} \left (T1\right)\) for sampling the values for σ ^{2}, where σ _{ fitted } is calculated from data.
Sampling switch locations, given all the other information
While sampling a location for a switch r, we assume that the locations of the previous and the subsequent switches are known, so the possible choices lie in the interval i∈{L _{ r−1}+1,…,L _{ r+1}−1} representing a finite number of possibilities. Figure 1,d illustrates the approach. For each possible value i, by Bayes theorem, the posterior probability of the switch taking the particular location is the following: \(p\left (L_{r} =i\left Y\right. \right)=\frac {L\left (Y\left L_{r} =i\right. \right)p\left (L_{r} =i\right)}{P(Y)} \),where p(L _{ r }=i) is the prior probability, L(YL _{ r }=i) is the likelihood of data, given the particular location. It can be written: \(P(Y)=\sum _{j}L\left (Y\left L_{r} =j\right. \right)p\left (L_{r} =j\right) \). If we assume the uninformative prior, the probabilities p(L _{ r }=j) are the same for all ji.e. p(L _{ r }=j)=p(L _{ r }=i). Thus, the following formula for the calculation of probabilities of the possible switch locations results:
The likelihood of data, given the particular location, can be calculated as the product of the probabilities of making the error \(\left (e_{t} \right)_{t\in \left \{L_{r1} +1,\ldots,L_{r+1} 1\right \}} \), where each error is calculated as: \(e_{t} =\frac {Y^{t} Y_{fitted}^{t} }{\sigma } \). Note that the error is standard normal distributed: e _{ t }∼N(0,1), so we can use the R function pnorm(e _{ t })to obtain the individual probabilities (the number of probabilities is L _{ r+1}−L _{ r−1}−1).
Once we have p(L _{ r }=iY) i.e. the probabilities of each possible location given all the other information (let denote them with probs), we can sample an integer value with these probabilities by the Roulette selection method (Fig. 1,d). I.e. a random value rand is generated from the uniform distribution and \(i := 1+max\left \{m\left \Sigma _{i=1}^{m}probs_{i} < rand \right.\right \}\) will be taken as the sampled value for the switch location.
The workflow of the algorithm in Fig. 2 represents the repeated sampling of the model parameters in course of the MCMC iterations. In each run, the algorithm first allocates the switchpoints and then fits the model, providing necessary quantities for the sampling of new values for the model parameters. Only switch locations that generate valid models are accepted.
The number of the switchpoints, with which the MCMC procedure is initialized, is calculated with the exploratory nonparametric technique LOESS [33], originally LOWESS (LOcally Weighted Scatterplot smoother). It is a method for fitting a smooth curve between two variables. The procedure performs weighted polynomial regression for only a subset of observations i.e the fitting at point t is weighted toward the data nearest to t. The distance to t, that is considered near to it, is controlled by the parameter span. When span is less than 1, it represents the proportion of the total data included within each subset. More details can be found in the description of the R function loess, used in this work. The polynomial for the regression equations here was quadratic (degree 2). LOESS fits a nonlinear smoothing curve to the data, helping to reveal structural patterns in it. We use the fitting data to calculate local minima and maxima along the curve suggesting the number of the switchpoints. Higher values of the span produce smoother curves, hence, the number of the switchpoints decreases. Setting for the span is found in an iterative procedure. Starting with the small span 0.1, a curve is fitted to the data while increasing the span by a small amount (0.05) until none of the local minima and maxima are located immediately adjacent. The last number of the minima and maxima (added with 2 for the first and the last timepoints) yields the number of the switchpoints.
We call the presented model Model_Lin to distinguish it from the Model_Logit described in the next section.
Modeling with the generalized logistic function (Model_Logit)
Sometimes the increasing/decreasing activity of a gene exhibits a saturated behaviour, stabilizing with time. To model this, the generalised logistic function was used. We assume that in each time interval {L _{ trough },L _{ peak }}between two switch points, which are a trough and a peak, the fitted data lies on a logistic (sigmoid) curve and is calculated as follows:
where H _{ trough } and H _{ peak } are switch heights, prop _{ t } is the proportional location of the timepoint t with respect to the trough and is calculated as \(prop_{t} :=\frac {tL_{trough}}{L_{peak} L_{trough}}\). FL is the generalized logistic function defined as (see [34]):
which is the solution of the differential (Richard’s growth equation: \(\frac {\partial y}{\partial t} =ay\left [1\left (\frac {y}{K} \right)^{b} \right ]\) with initial condition y(t _{0})=y _{0}, where \(Q=1+\left (\frac {K}{y_{0}} \right)^{b} \).
The parameter b allows the shape of the sigmoid curve to vary flexibly. K is the maximum observable value of y, in our case K=1. In the present work, we used the following parameterization (y _{0}=0.001 is a small value):
B plays a role of the growth rate. Note that the linear transformation of a logistic curve in Eq. (4) is also a logistic curve. With this transformation the lower and upper asymptotic heights of the logistic curve FL (0 and 1) are moved to be the trough and the peak values, respectively. Equation (4) can be rewritten as:
Y _{ fitted }(t)=(1−FL(prop _{ t }))·H _{ trough }+FL(prop _{ t })·H _{ peak }. Then
represents a linear regression model (see Additional file 1, supplementary text). Fitting the model to the data Y (R function lm) facilitates calculation of the switch heights, analogously to the Model_Lin described above. The generalized logistic transformation of the proportional location of each timepoint between the neighbouring trough and peak allows for flexible modelling of the gene expression increase/decrease within time intervals of different length. Sampling of the logistic function parameters B and κ was executed with the help of bootstrapping ([35]) as follows. First, when switch locations and heights are known, Eq. (4) is rewritten as: \(\frac {Y(t)H_{{trough}} }{H_{{peak}} H_{{trough}}} ={FL} \left ({prop}_{t} \right)\). Denoting the lefthand side with propy _{ t } yields:
Thus, the parameters B and κ are estimated with the nonlinear least squares method (R function nls)  by fitting the nonlinear function FL to the data x=prop _{ t }, y=propy _{ t }. So, for the current model in each MCMC iteration, the design matrix is constructed and the linear regression model (5) is fitted to the data Y to calculate the switch heights H. Then Y _{ fitted } is calculated by (4). Further on, the residuals \(E^{t}=Y^{t}Y_{fitted}^{t}\) are calculated and the bootstrap samples of the residuals \(E_{b}=\left \{{E_{b}^{1}},\ldots,{E_{b}^{T}}\right \}\) are used to calculate the bootstrapped values Y _{ b }=Y _{ fitted }+E _{ b }, which are then fitted by (6) to obtain the samples of the parameters B and κ. Apart from that, the workflow of the MCMCbased inference of the Model_Logit is analogous to that of the Model_Lin depicted in Fig. 2. Oneregime models, presenting just logistic increase or decrease, are termed here Logit_Up and Logit_Down.
SwitchFinder as Web Application
We developed a Web application of the method SwitchFinder, which provides the userinterface for uploading the timeseries data, executing the algorithm and performing queries to the results of the data analysis, thus maintaining a feedbackloop between generation and interpretation of the results. We propose the concept of features, assigned to the inferred switchpoints, which capture meaningfull properties of the timeseries. The basic features peaks and troughs are the switchpoints of the genetic activities deduced by the method directly. The queries are supported: which genes have peaks/troughs at the given time points?. Hence, early, middle and late responses can be elucidated. The user can input thresholds on values of the peaks and troughs to select stronger effects and focus on fewer genes. Further features – Growth and Decay – designate those troughs and peaks that represent onsets of significant growth or decay of the gene activity as defined with the help of the slope (see Fig. 3). By increasing thresholds for the slopes, stronger effects can be selected. The nextlevel features  spikes and clefts  are defined based on the previouslevel features using three switchpoints (Fig. 3). The query result i.e. the set of genes, which exhibit the given features at the given timepoints, is downloadable as the list of genes or the plot (see Fig. 4). Single queries can be logically combined. The default values suggested by the Web application for the thresholds of the slopes are computed as 25%quantile of the distribution of the slopes over the data. With the default thresholds, the application computes features and represents each gene in form of a qualitative profile – as the set of features with their respective onsettimes (locations). For a location, its highestlevel feature is stored. The profiles that were fitted with a oneregime logistic model additionally obtain features LogitUp or LogitDown. Grouping of the genes is executed by kmeans clustering of the qualitative profiles using the Jaccard similarity [36]. Jaccard measure is especially appropriate for the calculation of similarity between two sets containing different numbers of elements. The concepts of features and qualitative profiles help to reveal groups of genes, organized around remarkable properties of the dynamic behaviour. The suggested grouping is only a platform for further investigations and exploration of the data set. By querying the data set and grouping, meaningful patterns of the dynamic gene expression can be deduced.
Results and discussion
Application of SwitchFinder to simulated data
To test robustness of the algorithm SwitchFinder, especially with respect to short timeseries data, we generated 10 data sets, each containing 2500 synthetic gene expression profiles of the length T=7. The simulation scheme for a data set was the following. The profiles were generated with standard deviation sigma=0.2 from the following models: a) Logit_Up (500 samples) and Logit_Down (500 samples) using 10 different combinations of the parameters κ and B: (0.4,20), (0.5,15), (0.5,20), (1.5,20), (2,8), (8,5), (10,8), (20,5), (20,10), (20,18) including extreme values that challenge the fitting procedure; b) models Model_Lin with one internal switch point of the type peak located at t=2/4/6 (600 cases) and of the type trough located at t=2/4/5 (600 cases); c) models Model_Lin with two internal switch points of the types (peak, trough) located at timepoints t=2,5 (200 cases) and of the types (trough, peak) located at t=2,6 (200 cases); d) model Model_Logit with parameters κ=20,B=10 and one internal switch at t=5 (100 cases). The parameters (heights) of the models were simulated to obtain realistic gene expression values as commonly produced by Agilent technology: sampled from lognormal distribution (meanlog=2, sdlog=0.3) and truncated to the interval (0,20). The scheme produces biologically realistic data sets with rich dynamic responses. Table 1 demonstrates the results of the application of SwitchFinder to 10 artificial data sets. The goodnessoffit of a model fitted to a gene expression profile was assessed with the residual standard deviation (RSD). The descriptive statistics of the RSDs for each data set is displayed. The statistics are very stable across the data sets. For the data cases originating from Model_Lin and Model_Logit models, precision and recall were calculated (Prec:=TP/(TP+FP), Recall:=TP/(TP+FN)), to evaluate the accuracy of the prediction of the switchpoints. The values were stably good. A small number of functions mismatches occurred (e.g. when data generated from oneregime logistic model was fitted with Model_Lin by the algorithm).
Application of SwitchFinder to human cell cycle data (long timeseries)
To verify that the algorithm is suitable for long timeseries, we applied it to the gene expression data from [37] measured at 48 timepoints during cell division cycle in human cancer cell line HeLa. We used the profiles of 66 known cell cycle regulated genes measured upon release of double thymidine block till 46 hours. The fitting results can be observed in Additional file 2. The mean of the residual standard deviations was 0.14 (sd=0.06). Figure 5 demonstrates examples of the fitted profiles.
We sorted the genes by the timepoints of their first peaks over the time, peaks with expression around 0 were neglected. The ordering revealed a clear picture of the cyclic activity of the genes and a good separation of G1/S and G2/M cell cycle phases (see supplementary Figure 5 in Additional file 1). Note that the present analysis did not use the assumption of periodicity of the gene expression, which was explicitely introduced into the analysis by [37]. Thus, the use of SwitchFinder allows for explicit temporal ordering of biological events like gene activity peaks. The reconstructed temporal order of the gene activities during the cell cycle demonstrated e.g. that the genes SLBP, MCM6, MSH2, NUCKS for their activation need earlier signals.
Application of SwitchFinder to data from neuroblastoma cell line treated with ATRA (short timeseries)
Neuroblastoma is an embryonal tumor arising from the neural crest precursors of the peripheral nervous system. It is supposed that a mechanism underlying this malignancy is the block of cell differentiation, which promotes maintenance of cell stemness and cell proliferation [38]. Differentiation therapies attempt to rescue the suppressed function i.e. to induce differentiation of neuroblastoma cells [39]. The aim of the present application of SwitchFinder was to identify genes involved in neuroblastoma differentiation and to study their expression patterns over time.
We applied the approach to our original data: the gene expression timeseries measured at 1, 6, 12, 24, 48, 96, 144 hours (T=7) in neuroblastoma cell line BE(2)C after treatment with the differentiation agent alltrans retinoic acid (ATRA). BE(2)C (ECACC 95011817, ATCC CRL2268, [40]) is a clone of the SKNBE(2) neuroblastoma cell line established in 1972 (ECACC 95011815, ATCC CRL2271). The Agilent whole genome 4x44K microarray raw data was backgroundcorrected and quantilenormalized (R package limma, [41]). A probe was selected for further analysis if the standard deviation of its expression profile was greater than 0.5 (to exclude probes with insufficient dynamics) and the gene expression in the respective nontreated control was stable. In total, 4422 probes (genes) were selected and fitted by SwitchFinder, from them 3787 probes were assigned to 8 groups representing meaningful dynamic patterns (Fig. 6, Additional files 3: Additional files AH).
Eight groups of genes, delineated by the analysis, reflect the timeresolved transcriptional response of neuroblastoma genes to the treatment with ATRA. Four groups comprise the activated genes, which were induced: immediately (Fig. 6 a, 883 probes), after 12 hours (B, 869 probes), after 24 hours (C, 184 probes) and after 48 hours (D, 149 probes). The group G (82 probes) summarizes genes with spiked behaviour, mostly at 12 or 48 hrs. Three groups comprise genes repressed by ATRA: the group E (1080 probes) with declining gene expression pattern, the group F (437 probes) with a cyclic decrease and the group H (107 probes) with genes having clefts at different timepoints, mostly at 24 hrs. Each group/pattern is characterized by one or more features e.g. onset of Growth at the first timepoint for the group A. However, the patterns were delineated not solely by the featuresbased clustering, but also by some additional considerations. Many genes from the group F were fitted with the model Logit_Down as the genes from E, however, their declining cyclic pattern was further discerned by the additional condition: if the expression value at 24 hrs. was lower than at the neighbouring timepoints. Further on, the Logit_Up model was a good fit for many activated genes. However, to elucidate the time of induction more precisely, we sorted the profiles by decreasing k and decreasing B (parameters of the logistic model), thus obtaining the temporal ordering of the genes starting from steep, early responses via Sformed (bended at 12, 24 hrs.) to convex, late responses.
The functional annotation of the gene groups was executed with the program DAVID [42–44]. Table 2 displays important genes from the group A of immediately induced genes, together with their gene ontology annotations. In suppl. Tables S1S8 (Additional file 4), the genes and their functional annotations are presented for each group. Our results indicate that the transcriptional response of neuroblastoma cells to the treatment with ATRA is the timeresolved realization of the BMP, Wnt, Notch and FGF signalling, as well as of the Gprotein coupled and neurotrophin TRK (NTRK) receptor signalling. This coincides with the gene regulatory programs during differentiation of the neural crest (NC) cells in course of the development of the sympathetic nervous system [45].
The groups of immediate and early (12 hrs.) responses are very rich on transcription factors involved in determination of cell fates and regulation of embryonic development: HOXDgenes, SOX4/8/9, FOXC1, FOXO1A, BMP4, TLE3, TLX2 etc., see Table 2. As expected, early induced were the genes involved in retinoic acid metabolism and signalling: RARA, RBP1, RDH10, SP100, CRABP2, CYP26A1 and RDH12.
The gene SNAI2, playing a role in the epithelialtomesenchymal transition (EMT), is overexpressed (group A) accompanied with the downregulation of adherence junction genes like cadherins CDH4/7/22, claudin CLDN11, cingulin CGN, catenins CTNNA1/2, as well as of tight and gap junction genes TJP1 and GJA5 [46]. Interestingly, the gene SNAI1, initially overexpressed as compared to the control, decays under the influence of ATRA. The mesenchymal markers were induced immediately or at 12 hours: fibronectin FN1, fibronectin receptors ITGB1/3/8, FNDC4/5, vitronectin VTN and vimentin VIM. Cell polarity regulator PPARD and a member of crumbs complex, CRB1, belong to the group A. The metalloproteinases MMP2/11/15 and ADAM19/22/23, which facilitate degradation of the extracellular matrix, were active immediately or at 12 hours. Thus, the results indicate a contribution of ATRA to the migratory phenotype of neuroblastoma cells.
Induced immediately were the receptors NTRK1 and NGFR  regulators of the nerve growth factor signalling known to be responsible for the maturation of the peripheral nervous systems through regulation of proliferation, differentiation and survival of neurons [47]. Activated early were the genes responsible for the axon guidance, axonogenesis, neuron projection, neurite outgrowth etc., which participate in the ephrin, semaphorin, plexin and Roundabout signalling: EFNB2, EFNA2/4, EPHA2, EPHB3, SEMA4C/6C, PLXNA2/4A, SLIT2, SLITRK6. Interestingly, the semaphorin SEMA6A, known to control cell migration, was repressed, although its receptor PLXNA2 was activated after 12 hrs. Previously, SEMA6A was found upregulated in undifferentiated embryonic stem (ES) cells [48]. Further observation: the neuropilin signalling (NRP1 and NRP2, group E) was repressed, together with the ephrin ligand EFNA1. In general, a complex spatiotemporal expression of guidance molecules and genes involved in neuron migration was observed. Vast transcriptional changes were induced by ATRA at genes involved in cytoskeleton organization, cell polarization and immune processes. E.g. the chemokine receptor CXCR4 was induced at 24 hrs. It represents a positive cue for the migration of the NC cells (its ligand CXCL12 was active after 48 hours). We suppose that canonical Wnt signalling is repressed or delayed upon treatment with ATRA, with noncanonical Wnt signalling taking place: PPARD/G and TLE3 were induced, TCF7 and TCF19 were repressed, DACT3 (antagonist of betacatenin) and further genes annotated with negative regulation of canonical Wnt signaling pathway were induced: ANKRD6, DKK1/2, SFRP1. The gene WNT11 was activated lately (group D).
A clearly observable effect of ATRAtreatment on NB cells is the repression of genes involved in cell cycle regulation, particularly in G1/S and G2/M transitions of mitotic cell cycle, in cell proliferation, DNA metabolic process, DNA damage response, DNA repair signalling: MYCN, AURKA/B, BIRC5, CDC2/6, CENPF, PCNA, PLK1/4 etc. (Group F). Furthermore, genes responsible for negative regulation of cell proliferation e.g. CDKN1A were active at 12 or 24 hrs. Notably, the gene ALK, an important unfavourable prognostic marker in neuroblastoma, was repressed.
To summarize, our study documented a powerful transcriptional effect of ATRA on NB cells. A complex gene regulatory machinery controls the two properties of neural crest cells: ability to extensively migrate and differentiate into numerous derivatives and to maintain multipotency [45]. The role of retinoic acid hereto even in normal organism is still not well understood. In neuroblastoma, the normal properties interfere with the abnormal EMT and migratory characteristics, acquired by tumor cells due to the genomic lesions in several developmental and guidance molecules genes [49]. We suppose that the dynamics of gene expression in neuroblastoma neurogenesis is influenced by the genetic aberrations inherent to this malignancy.
Conclusions
Identifying dynamic patterns under various biological conditions is crucial for the understanding of a biological system. The patterns reflect the coordination, coregulation and control of the system components. Identifying temporal changes and patterns of gene expression is important for the inference of gene regulatory networks. We developed a method SwitchFinder for the analysis of timeresolved data, applicable to the gene expression data. The changepoint model at the core of the method represents a series of the switchpoints between regimes of increasing and decreasing activities, captured by linear or generalized logistic functions. SwitchFinder fits the model to the geneexpression profiles, inferring the switchpoints inherent to the gene dynamics. The method exploits Bayesian model inference with the MCMC technique Gibbs sampling. To note, the method is suitable for long, as well as for short timeseries.
The advantage of the present approach is the inference of biologically justified and interpretable features of the genetic activity, as well as the possibility of their subjective exploration by researchers with different goals and background knowledge, in different biological scenarios. The Web application of the approach provides the user interface for querying the gene timeseries. A flexibility is given to the user to adjust the selection criteria for restricting the results to substantial dynamic phenomena. Actively guiding the data analysis is valuable for biologists, as opposite to an automatic, unsupervised application of a statistical/bioinformatics method. Some features of the data might be designated as important by an expert subjectively  beyond those obtained by statistical learning based on statistical characteristics. The features, in a next level of abstraction, can constitute further features or patterns. Such a qualitative approach should overcome overfitting and lead towards biologically meaningful results.
The featuresbased clustering is preferrable than clustering methods based on distance measures like Euclidian distance or correlation. The latter ignore the dynamic nature of the temporal data and overlook single data points, which represent important changes in the gene behaviour associated with the events of the gene regulatory control.
To mention, the present method is independent of the quantitative expression levels of different genes. It would not miss a relation between the genes with different abundance, but with the same qualitative pattern.
Previously, a platform PESTS was created, making the analysis of some statistical features of the gene profiles accessible via the user interface [50]. Qualitative representation of the gene expression profiles was performed in [51] by the Trend Temporal Abstraction, which transforms the timeseries into series of intervals with increasing, decreasing or steady trends. However, the dominant points, defining the intervals, were determined by the approximation of the data curves based on thresholds chosen by the user. This makes the algorithm sensitive to noise. The method concentrated on shapes of the gene profiles, rather than on proper timing of the dynamic events. To emphasize, the statistical inference of the prominent timepoints in the temporal profiles is of advantage. The Temporal Abstraction clustering was implemented in the software TimeClust [52], together with other clustering methods. Also, Hvidsten et al. [53] performed qualitative representations of the gene expression profiles in terms of templates (increasing, decreasing and constant) over subintervals. The authors used such descriptions as attributes in the roughsets based classification system to relate genes to biological processes. Parameter values for the identification of the templates were chosen experimentally with a purpose to maximize the performance of the whole system. Unlike previous approaches, our qualitative descriptions of the geneexpression profiles are based on the labelling of the switchpoints – not of the intervals. With this, gene profiles with activity intervals starting at the same timepoint, but having different durations would still have a chance to be assigned to the same group. This might be important for the elucidation of those gene regulatory events like e.g. when the gene group is controlled by a transcription factor. Furthermore, we are able to define higherorder qualitative features of the switchpoints like spikes and clefts.
The present method combines quantitative and qualitative characteristics: statistically inferred timing of dynamic events and the qualitative dynamic features.
The approach offers a great flexibility in the induction of biological knowledge from timeseries data: the user may explore the gene set by clustering (unsupervised) or interactively (supervised) by putting queries and experimenting with the qualitative features of particular timepoints.
The results of the method provide a platform for studying temporal relations like e.g. time delays with the goal to deduce dependencies between the genes. Modelling cellular dynamic responses on the level of pathways and networks can be considered as possible extensions of the approach. Our next goal is to adapt the SwitchFinder to the analysis of RNAseq timeseries.
Abbreviations
 ATRA:

alltrans retinoic acid
 BIC:

Bayesian Information Criterion
 EMT:

epithelialtomesenchymal transition
 ES:

embryonic stem
 LASSO:

least absolute shrinkage and selection operator
 LOESS:

LOWESS, locally weighted scatterplot smoothing
 MCMC:

Markov chain Monte Carlo
 NC:

neural crest
 OLS:

ordinary least squares
 RNAseq:

RNA sequencing
References
BarJoseph Z, Gitter A, Simon I. Studying and modelling dynamic biological processes using timeseries gene expression data. Nat Rev Genet. 2012; 13(8):552–64.
Schliep A, Schönhuth A, Steinhoff C. Using hidden markov models to analyze gene expression time course data. Bioinformatics. 2003; 19(suppl 1):255–63.
Schilling R, Costa IG, Schliep A. pgql: A probabilistic graphical query language for gene expression time courses. BioData Min. 2011; 4(1):1.
Hafemeister C, Costa IG, Schönhuth A, Schliep A. Classifying short gene expression timecourses with bayesian estimation of piecewise constant functions. Bioinformatics. 2011; 27(7):946–52.
Ramoni MF, Sebastiani P, Kohane IS. Cluster analysis of gene expression dynamics. Proc Natl Acad Sci. 2002; 99(14):9121–126.
Chechik G, Koller D. Timing of gene expression responses to environmental changes. J Comput Biol. 2009; 16(2):279–90.
Sivriver J, Habib N, Friedman N. An integrative clustering and modeling algorithm for dynamical gene expression data. Bioinformatics. 2011; 27(13):392–400.
Hamilton JD. Analysis of time series subject to changes in regime. J Econ. 1990; 45(1):39–70.
Hamilton JD. Estimation, inference and forecasting of time series subject to changes in regime. Handbook of statistics, Vol. 11, NorthHolland. 1993;:231–60.
Carlin BP, Gelfand AE, Smith AF. Hierarchical bayesian analysis of changepoint problems. Applied statistics. 1992;:389–405.
Sonderegger DL, Wang H, Clements WH, Noon BR. Using sizer to detect thresholds in ecological data. Front Ecol Environ. 2008; 7(4):190–5.
Wang J, Zivot E. A bayesian time series model of multiple structural changes in level, trend, and variance. J Bus Econ Stat. 2000; 18(3):374–86.
Koop GM, Potter S. Forecasting and estimating multiple changepoint models with an unknown number of change points, Federal Reserve Bank of New York, Staff Reports. 2004.
Chen CW, Chan JS, Gerlach R, Hsieh WY. A comparison of estimators for regression models with change points. Stat Comput. 2011; 21(3):395–414.
Yao YC. Estimating the number of changepoints via schwarz’criterion. Stat Probab Lett. 1988; 6(3):181–9.
Muggeo VM. Estimating regression models with unknown breakpoints. Stat Med. 2003; 22(19):3055–071.
Muggeo V. Segmented: An R package to fit regression models with brokenline relationships. R News, 8/1. 2008;:20–25.
Bai J, Perron P. Computation and analysis of multiple structural change models. J Appl Econ. 2003; 18(1):1–22.
Kleiber C, Hornik K, Leisch F, Zeileis A. strucchange: An r package for testing for structural change in linear regression models. J Stat Softw. 2002; 7(2):1–38.
Zeileis A, Kleiber C, Krämer W, Hornik K. Testing and dating of structural changes in practice. Comput Stat Data Anal. 2003; 44(1):109–23.
Chernoff H, Zacks S. Estimating the current mean of a normal distribution which is subjected to changes in time. Ann Math Stat. 1964; 35(3):999–1018.
Lai TL, Xing H. A simple bayesian approach to multiple changepoints. Statistica Sinica. 2011;:539–569.
Stephens D. Bayesian retrospective multiplechangepoint identification. Applied Statistics. 1994;:159–178.
Barry D, Hartigan JA. A bayesian analysis for change point problems. J Am Stat Assoc. 1993; 88(421):309–19.
Erdman C, Emerson JW, et al. bcp: an r package for performing a bayesian analysis of change point problems. J Stat Softw. 2007; 23(3):1–13.
Chib S. Estimation and comparison of multiple changepoint models. Journal of econometrics. 1998; 86(2):221–41.
Albert JH, Chib S. Bayes inference via gibbs sampling of autoregressive time series subject to markov mean and variance shifts. J Bus Econ Stat. 1993; 11(1):1–15.
Omranian N, MuellerRoeber B, Nikoloski Z. Segmentation of biological multivariate timeseries data. Scientific reports. 2015;5.
Geman S, Geman D. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on Pattern Analysis and Machine Intelligence. 1984; (6):721–741.
Casella G, George EI. Explaining the gibbs sampler. Am Stat. 1992; 46(3):167–74.
Kim CJ, Nelson CR. Statespace models with regime switching. The MIT Press. 1999.
Judge GG, Hill RC, Griths W, Lutkepohl H, Lee TC. Introduction to the theory and practice of econometrics. New York, John Wiley and Sons (Wiley Series in Probability and Mathematical Statistics). 1982.
Cleveland WS, Grosse E, Shyu WM. Local regression models In: Chambers JM, Hastie TJ, editors. Statistical Models in S, Chapter 8. New York: Chapman & Hall: 1993. p. 309–376.
Richards F. A flexible growth function for empirical use. J Exp Bot. 1959; 10(2):290–301.
Efron B, Tibshirani RJ. An Introduction to the Bootstrap: CRC press; 1994.
Choi SS, Cha SH, Tappert CC. A survey of binary similarity and distance measures. J Syst Cybern Inform. 2010; 8(1):43–8.
Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, et al. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell. 2002; 13(6):1977–2000.
Tee A, Marshall GM, Liu PY, Liu T. In: Shimada H, (ed).Neuroblastoma: A Malignancy Due to Cell Differentiation Block: Neuroblastoma  Present and Future, InTech Open Access Publisher; 2012.
Frumm SM, Fan ZP, Ross KN, Duvall JR, Gupta S, VerPlank L, Suh BC, Holson E, Wagner FF, Smith WB, et al. Selective hdac1/hdac2 inhibitors induce neuroblastoma differentiation. Chem Biol. 2013; 20(5):713–25.
Ciccarone V, Spengler BA, Meyers MB, Biedler JL, Ross RA. Phenotypic diversification in human neuroblastoma cells: expression of distinct neural crest lineages. Cancer Res. 1989; 49(1):219–25.
Team RC. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2016. R Foundation for Statistical Computing. https://www.Rproject.org.
Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using david bioinformatics resources. Nat Protoc. 2009; 4(1):44–57.
Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009; 37(1):1–13.
DAVID: DAVID Bioinformatics Resources. https://david.ncifcrf.gov. Accessed: 2015111.
SimõesCosta M, Bronner ME. Establishing neural crest identity: a gene regulatory recipe. Development. 2015; 142(2):242–57.
Powell DR, Blasky AJ, Britt SG, Artinger KB. Riding the crest of the wave: parallels between the neural crest and cancer in epithelialtomesenchymal transition and migration. Wiley Interdiscip Rev Syst Biol Med. 2013; 5(4):511–22.
Schor NF. Neuroblastoma as a neurobiological disease. J NeuroOncol. 1999; 41(2):159–66.
Katoh M, Katoh M. Comparative integromics on noncanonical wnt or planar cell polarity signaling molecules: transcriptional mechanism of ptk7 in colorectal cancer and that of sema6a in undifferentiated es cells. Int J Mol Med. 2007; 20(3):405–9.
Pugh TJ, Morozova O, Attiyeh EF, Asgharzadeh S, Wei JS, Auclair D, Carter SL, Cibulskis K, Hanna M, Kiezun A, et al. The genetic landscape of highrisk neuroblastoma. Nat Genet. 2013; 45(3):279–84.
Sinha A, Markatou M. A platform for processing expression of short time series (pests). BMC Bioinforma. 2011; 12(1):1.
Sacchi L, Bellazzi R, Larizza C, Magni P, Curk T, Petrovic U, Zupan B. Taclustering: Cluster analysis of gene expression profiles through temporal abstractions. Int J Med Inform. 2005; 74(7):505–17.
Magni P, Ferrazzi F, Sacchi L, Bellazzi R. Timeclust: a clustering tool for gene expression time series. Bioinformatics. 2008; 24(3):430–2.
Hvidsten TR, Lægreid A, Komorowski J. Learning rulebased models of biological process from gene expression time profiles using gene ontology. Bioinformatics. 2003; 19(9):1116–1123.
Acknowledgements
We thank Olaf Cremers for the production of the video tutorial for the SwitchFinder Web application.
Funding
This research was funded by the European Union’s Seventh Framework Programme, HEALTH, FP7/2007–2013 under grant agreement No: 259348, the ASSET Consortium.
Availability of data and materials
The link https://newbioinformatics.eu/switchfinder provides access to the Web Application of the approach SwitchFinder presented here. The human cell cycle data is used as demo dataset and downloadable. The gene expression timeseries data from neuroblastoma cell line treated with ATRA is also downloadable from this resource.
Authors’ contributions
SB and CP developed the mehod; SB, DS, JZ and ABC developed the Web application, DS implemented it; FW provided the gene expression timeseries data on neuroblastoma cell lines; SB applied the method to all the data and drafed the manuscript; all authors approved the final version of the manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
BE(2)C is an established cell line, the work with it does not need formal ethics approval and complies with national regulations.
Author information
Authors and Affiliations
Corresponding author
Additional files
Additional file 1
This file contains supplementary text and figures with additional explanations to the Model_Logit, as well as the supplementary Figure 5 with the heatmap, displaying the profiles of the cell cycle regulated genes ordered based on the results of the SwitchFinder. (PDF 64 kb)
Additional file 2
Results of the application of SwitchFinder to the cell cycle regulated genes. The gene expression profiles are plotted in black, the fitted lines  in blue, the fitted logistic curves  in red. The switchpoints are depicted as dots. (PDF 106 kb)
Additional file 3
Additional file AH. These files contain the fitting results for the genes from the groups AH, deduced by SwitchFinder, which represent eight dynamic patterns of the gene expression response to ATRA in neuroblastoma cell line. (ZIP 2457 kb)
Additional file 4
Tables S1S8. Demonstrating genes from the groups AH and their functional annotations. (PDF 63 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Bulashevska, S., Priest, C., Speicher, D. et al. SwitchFinder – a novel method and query facility for discovering dynamic gene expression patterns. BMC Bioinformatics 17, 532 (2016). https://doi.org/10.1186/s1285901613910
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901613910
Keywords
 Timeseries analysis
 Dynamic patterns of gene expression
 Changepoint problem
 Changepoint modeling
 Bayesian modeling
 MCMC
 Gibbs sampling
 Neuroblastoma
 ATRAinduced differentiation