 Methodology Article
 Open access
 Published:
Statistical significance approximation in local trend analysis of highthroughput timeseries data using the theory of Markov chains
BMC Bioinformatics volume 16, Article number: 301 (2015)
Abstract
Background
Local trend (i.e. shape) analysis of time series data reveals cochanging patterns in dynamics of biological systems. However, slow permutation procedures to evaluate the statistical significance of local trend scores have limited its applications to highthroughput time series data analysis, e.g., data from the next generation sequencing technology based studies.
Results
By extending the theories for the tail probability of the range of sum of Markovian random variables, we propose formulae for approximating the statistical significance of local trend scores. Using simulations and real data, we show that the approximate pvalue is close to that obtained using a large number of permutations (starting at time points >20 with no delay and >30 with delay of at most three time steps) in that the nonzero decimals of the pvalues obtained by the approximation and the permutations are mostly the same when the approximate pvalue is less than 0.05. In addition, the approximate pvalue is slightly larger than that based on permutations making hypothesis testing based on the approximate pvalue conservative. The approximation enables efficient calculation of pvalues for pairwise local trend analysis, making large scale allversusall comparisons possible. We also propose a hybrid approach by integrating the approximation and permutations to obtain accurate pvalues for significantly associated pairs. We further demonstrate its use with the analysis of the Polymouth Marine Laboratory (PML) microbial community time series from highthroughput sequencing data and found interesting organism cooccurrence dynamic patterns.
Availability
The software tool is integrated into the eLSA software package that now provides accelerated local trend and similarity analysis pipelines for time series data. The package is freely available from the eLSA website: http://bitbucket.org/charade/elsa.
Background
Time series data are important resources to explore the dynamics of biological systems, where the factors of interest could be genes in gene regulation studies, or organisms and/or environmental factors in ecological studies. Identifying reliable association patterns between these factors could further our understanding of the functionality and interaction of biological systems [1, 2]. When the actual associations are active only within certain time subintervals or the responses lag the stimulants [3, 4], ordinary correlation based analysis methods (i.e. Pearson’s and Spearman’s correlation) considering the expression/abundance profiles across the entire time span may fail to recover these local and potentially timedelayed association patterns. Fortunately, a wealth of computational methods had been developed to overcome such difficulties, such as local similarity analysis [3, 5, 6] and local trend (shape) analysis [4, 7]. Those methods complement ordinary analytical approaches and have important applications in gene profile clustering, regulatory network construction, cooccurrence pattern identification and many other areas [3–9]. For instance, Qian et al. [3] proposed a local similarity based measure to identify local and potential timedelayed associations between gene expression profiles. This local similarity analysis technique is further extended and successfully applied to microbial ecology time series studies [5, 6, 10, 11].
In local similarity analysis, local indicates the two factors are only associated within some time subinterval, and timedelayed indicates there is a time shift in the associated profiles. The strength of the local association is measured by the local similarity (LS) score. For time series data of two factors with normalized levels X _{1},X _{2},⋯,X _{ n } and Y _{1},Y _{2},⋯,Y _{ n }, the LS score is defined as the maximized absolute value of summation \(S = \sum _{k = 0}^{l1} X_{i+k}Y_{j+k}\), where I=[i,i+l−1] and J=[j,j+l−1] correspond to the intervals maximizing the summation – to be determined by the SmithWaterman dynamic programming algorithm [12]. By definition, LS score is proportional to the Pearson’s correlation coefficients (PCC) of the aligned parts of the two series. Its statistical significance can be evaluated by a large number of permutations [3, 6] or using the approximation recently proposed by Xia et al. [13].
While local similarity analysis bases its similarity measure on the similarity of the profile or abundance levels of the factors, others suggested that the similarity of increasing, stabilizing or decreasing trends along the time line can also be strong indicators of associations and developed methods based on this alternative measure. Ji and Tan [7] explored this idea by transforming the changing trend of gene expression profiles of n consecutive time points into a n−1 time point series corresponding to the status of {d e c r e a s e, n o c h a n g e, i n c r e a s e} in expression levels. All possible local associations of a specific length of time span were analyzed by an exhaustive search algorithm to find clusters of genes with significant locally similar expression profiles. Later, He and Zeng [14] renovated the analysis using a dynamic programming algorithm and employed a permutation approach to evaluate the statistical significance for the local trend scores. The techniques used by He and Zeng [14] were similar to those used in local similarity analysis except that the original time series data were first transformed to changing trends series. We will thus refer to the local similarity analysis techniques performed on the transformed changing trends series as the local trend (a.k.a. shape) analysis (LTA) and its corresponding similarity measure as the local trend (LT) score.
Local trend analysis has since been extended and applied to a wide range of biological applications, such as genegene association networks [15–17], genemetabolite networks [18], and transcription factor networks [19–21]. However, one of the major limitations common to local trend analysis is the time consuming permutation procedure used to evaluate the statistical significance (pvalue) of the LT score. While in practice false discovery rate (FDR or qvalue) [22] is used to mitigate the multiple comparison problem, still, fast and efficient approximation for the statistical significance of the LT score is urgently needed to estimate the pvalue. In addition, Madeira et al. [8] first transformed gene expression data into trends for each gene and developed linear time algorithms to find maximal biclusters. Recently, Goncalves and Madeira [9] extended the biclustering algorithms to allow for time delays [8]. These developments are highly significant by considering groups of genes simultaneously instead of gene pairs. However, the statistical issues related to maximal clusters of gene groups are beyond the scope of this study.
Recently, progress has been made to develop efficient statistical significance approximations for local similarity analysis [13, 23]. We notice that by extending the method proposed in Xia et al. [13], it is also possible to obtain pvalues of local trend scores more efficiently. In this paper we will describe an extension of Xia et al.’s [13] method to local trend analysis, including the mathematical modelling, algorithm implementation and computational validation with simulations and real data applications. In the Methods section, we first formally introduce the concept of local trend analysis and bring in useful results from related works. We then describe the method to model the transformed trend series using the Markov chain theory in both two and three letter alphabet cases. We also propose an approximation formula and numerical computation methods for the statistical significance of LT score based on these models. In the Results and Discussion section, we validate and show the efficiency of our new approach using simulated and real datasets and analyze a real microbial ecological time series dataset from the next generation sequencing (NGS) of marine samples collected near the Polymouth Marine Laboratory (PML) by Gilbert et al. [24].
The major difference between this paper and Xia et al. [13] is the study of statistical significance of local trend score here while Xia et al. [13] studied the statistical significance of local similarity scores based on the original time series data. After transformation of the original time series data to trends, the trend variables are highly dependent even if the original data are independent, making the evaluation of statistical significance of LT score challenging. New approximation results on the tail probabilities of the sums of Markov random variables need to be employed to derive an approximate formula to calculate the statistical significance of LT scores.
Methods
The local trend analysis
The first step in local trend analysis is to discretize the factor profile into a changing trend alphabet Σ – a set of symbols of interest, which represent distinctive changing trend states [8]. Typically either two letter alphabet Σ={D,U} or simply Σ={−1,1} for trenddown and trendup states [25, 26], or three letter alphabet Σ={D,N,U} or simply Σ={−1,0,1} for trenddown, nochange and trendup states [7, 27] are used. Discretization into larger size alphabet is possible but seldom used in practice. For given time series X _{1},X _{2},⋯,X _{ n }, We transform the ndimensional vector X to a n−1 dimensional trend vector \({d_{i}^{X}}, i = 1, 2, \cdots, n1\), by the following rules. When X _{ i }≠0,
where t≥0 is a threshold value for declaring changing trends. When X _{ i }=0, \({d_{i}^{X}}\) is defined as:
The trend series generating process and the dependency between X _{ i }’s and \({d^{X}_{i}}\)’s are depicted in Fig. 1. These rules were formalized in Ji and Tan [7] and Madeira et al. [8].
Based on this data transformation, the subsequent algorithms and statistics of local trend analysis closely follow that for local similarity analysis [3, 5, 6]. That is, for a pair of transformed trend series \({d_{1}^{X}}, {d_{2}^{X}}, \cdots, d_{n1}^{X}\) and \({d_{1}^{Y}}, {d_{2}^{Y}}, \cdots, d_{n1}^{Y}\), the SmithWaterman dynamic programming algorithm [6, 12] is used to find the interval pair I=[i,i+l−1] and J=[j,j+l−1] of the same length l with i−j≤D such that the absolute value of \(S = \sum _{k = 0}^{l1} d^{X}_{i+k}d^{Y}_{j+k}\) is maximized, which we refer to as local trend (LT) score with maximum time delay D, where D is a predefined parameter. Statistical significance for LT score corresponds to the probability of observing such a score or larger under the null hypothesis that the two factors X and Y are not associated. It was used to be approximated by permuting one of the time series data many times and calculating the fraction of times that the LT score for the permuted data is higher than that for the original data [3, 14]. With the permutation approach, the observations for the samples at the different time points are assumed independent under the null model.
Approximate statistical significance for local trend analysis
The permutation procedures described above to approximate the statistical significance for local trend analysis have several drawbacks. First, the calculated pvalues have substantial inherent variability associated with the randomness in permutation unless the number of permutations is very large. Second, the procedure is computationally expensive– the computational time scales linearly with the inverse of the required pvalue precision, which is prohibitive for allversusall pairwise analysis of highthroughput datasets.
In fact, the asymptotic theories for the tail distribution of the range of partial sum of zeromean independent, identically distributed (i.i.d.) and first order Markov chain exist [28–30] and can be applied here to calculate pvalues under the null model. Formulae for fast and efficient approximation of the statistical significance for aligning two i.i.d. zeromean sequences had been obtained and successfully applied to local similarity analysis previously [13]. In contrast, in local trend analysis, even if the original series X _{ i }’s are considered independent, the transformed trend series \({d_{i}^{X}}, i = 1, 2, \cdots, n1\) are not independent because, for any consecutive pair \({d_{i}^{X}}\) and \(d_{i+1}^{X}\), they both depend on X _{ i } (as shown in Fig. 1). They are not even a Markov chain of any order. In order to use the theory in [28–30] to approximate the statistical significance of LT scores, we make several simplifying assumptions.
The first assumption is that the time series data X _{ i },i=1,2,⋯,n and Y _{ i },i=1,2,⋯,n are exchangeable in that any order of the sample is equally likely. Time series data generally do not follow the exchangeability assumption and usually follow some trends. In particular, the value at a particular time may depend on the value at a previous time point. One way to overcome this complexity is to regress the value at time t+1 with respect to the value at the previous time point t and use the resulting residue for the follow up analysis. In the following of the paper, we assume that such transformations have been carried out and the exchangeability assumption as in most studies in the literature holds.
Secondly, we naively assume the first order Markov chain model for \({d_{i}^{X}}, i = 1, 2, \cdots \). As stated above, this assumption is obviously incorrect. We make this assumption for the convenience of using the theory in [28–30]. We also assume that the product of a pair of independent trend series \({d^{X}_{i}} {d^{Y}_{i}}\) follows a first order Markov chain, i.e.,
Under the assumption that X and Y have supports in an interval, \({d^{X}_{i}} {d^{Y}_{i}}\) is irreducible and aperiodic so that the theories for Markov random variables in [28–30] can be adapted.
Thirdly, we make the simplifying assumption that the LT scores for different time delays are independent when we do local trend analysis allowing time delays. Since the LT scores for different delays are all calculated based on the same values of X ^{′}s and Y ^{′}s, this independent assumption is violated. We make this assumption purely for computational convenience.
We note the lack of mathematical rigor for approximating the pvalue in this study. Therefore, the approaches presented in this paper can only be regarded as heuristic and should not be regarded as rigorous mathematical approximations. We show the usefulness of our approximation by comparing the approximate pvalue with that obtained from a large number of permutations. They are close in the sense that the first nozero decimals of the pvalues from the approximation and the permutations are mostly the same. The simulations also show that the approximate pvalue is slightly larger than that obtained through permutations. Due to the conservativeness of the approximate pvalue, hypothesis testing for associated pairs of factors based on the approximate pvalue may have lower power compared to that based on more accurate pvalues. We recommend a hybrid approach to combine approximation with permutations to obtain associated pairs of factors without lowering the power. The conservative nature of the approximate pvalue allows us to first calculate the approximate pvalues for all pairs of factors and then use permutations to obtain the more accurate pvalues only for factor pairs with approximate pvalue less than a loose threshold. This practice significantly saves computational time as most factor pairs have relatively large approximate pvalues. Future studies on more accurate approximation of statistical significance for LT scores based on rigorous mathematical theory are needed.
Using the theory of BachelierWiener processes, Feller [28] studied the approximate distribution of the range R _{ n } of the partial sum of n i.i.d. random variables \(\{Z_{i}\}_{i=1}^{n}\) with mean 0 and variance σ. Daudin et al. [29] studied the distribution of the maximum partial sum of either i.i.d. random variables or an irreducible aperiodic first order Markov chain taken values on a finite subset of the real line. Let φ be the stationary distribution of the Markov chain Z _{ i }, i=1,2,⋯, E _{ φ }(Z _{1})=0 and
Based on these results, it can be shown
where R _{ n } is the range of partial sums of Z _{1},Z _{2},⋯,Z _{ n }. We will use this equation to approximate the statistical significance of local trend score. For local trend analysis with no time delays, we let \(Z_{i} = {d_{i}^{X}} {d_{i}^{Y}}\) and approximate Z _{ i } by a first order Markov chain. Then the statistical significance of LT score without time delays (D=0) can be approximated using equation (5). With time delay of at most D, using a similar argument as in [13] and assuming that the LT scores for different delays are independent, we can approximate the statistical significance (pvalue) of a LT score with delay at most D by
The Markov chain model: two letter alphabet case
We first propose a Markov chain model for local trend analysis with the relatively simple two letter alphabet case (i.e. t=0), for which an exact solution for σ is available. Consider X _{1},X _{2},⋯,X _{ n } as continuous random variables such that the probability of taking a fixed value to be 0. By order statistics, we have \(P\left [({d_{i}^{X}}, d_{i+1}^{X}) = (1,1)\right ] = P[({d_{i}^{X}}, d_{i+1}^{X}) = (1,1)] = 1/6\) and \(P\left [\left ({d_{i}^{X}}, d_{i+1}^{X}\right) = \left (1,1\right)\right ] = P\left [\left ({d_{i}^{X}}, d_{i+1}^{X}\right)=\left (1,1\right)\right ] = 1/3\) and \(P({d_{i}^{X}} = 1) = P({d_{i}^{X}} = 1) = 1/2\) if the X _{ i }’s are exchangeable. Assuming that \({d_{i}^{X}}\)’s form a first order Markov chain, we can solve for the transition matrix
Then it can be shown by spectral expansion [31] that
For k≥1, we have \(P\left ({d_{1}^{X}} d_{k+1}^{X} = 1\right) = \left (1 + (1)^{k}/3^{k}\right)/2\) and \(P\left ({d_{1}^{X}} d_{k+1}^{X} = 1\right) = \left (1  (1)^{k}/3^{k}\right)/2\). Thus, \(E\left ({d_{1}^{X}} d_{k+1}^{X}\right) = \left (1\right)^{k}/3^{k}\). In local trend analysis, we compare \({d_{1}^{X}}, {d_{2}^{X}}, \cdots, d_{n1}^{X}\) with \({d_{1}^{Y}}, {d_{2}^{Y}}, \cdots, d_{n1}^{Y}\). Therefore, we have \(\sigma _{d^{X}d^{Y}}^{2} = E\left (\left ({d_{1}^{X}}\right)^{2}\right) E\left (\left ({d_{1}^{Y}}\right)^{2}\right)+ 2 \sum _{k = 1}^{\infty } E\left ({d_{1}^{X}} d_{k+1}^{X}\right) E\left ({d_{1}^{Y}} d_{k+1}^{Y}\right) = 1 + 2 \sum _{k = 1}^{\infty } 1/3^{2k} = 1 + 1/4 = 1.25\). When D=0, LT score for local trend analysis is the range of partial sum \(\sum _{i}{d_{i}^{X}} {d_{i}^{Y}}\). Following the result presented in equation (4), we obtain the approximate formula for local trend score pvalue in the two letter alphabet case (i.e. t=0):
where the function \({\mathcal L}_{D}\) is defined in equation (6) and s _{ D } is the LT score with delay at most D.
The Markov chain model: the three letter alphabet case
We next show the Markov chain modeling for local trend analysis with the three letter alphabet case (i.e. t>0) – allowing a more flexible description of state changes. In this case, the transition matrix T(t) is a function of the threshold value t. However, a closed form formula for T(t) is not readily available for general zeromean i.i.d. random variable X _{ i }’s. Instead, we have to use Monte Carlo strategy to numerically approximate T(t) for a given threshold value t.
To do the Monte Carlo simulation, we first generate a series of i.i.d. standard normal random values X _{1},…,X _{ N } for N large and use rules in equations (1) and (2) to transform the series into trend series \({d_{1}^{X}}, \ldots, d_{N1}^{X}\). We approximate the transition probability from a to b by T(t)_{ a,b }=C _{ a,b }/C _{ a }, a,b=−1,0,1, where C _{ a,b } is the number of pairs such that \(\left ({d^{X}_{i}}, d^{X}_{i+1}\right) = (a, b), ~i = 1, 2, \cdots, N1\) and C _{ a } is the number of pairs such that \({d^{X}_{i}} = a\). In this study, we let N=10000. Because all the rows of T(t) sum to 1, using the symmetry condition, we have T(t)_{1,1}=T(t)_{−1,−1}=b, T(t)_{1,−1}=T(t)_{−1,1}=c, T(t)_{0,1}=T(t)_{0,−1}=d, T(t)_{1,0}=T(t)_{−1,0}=1−b−c and T(t)_{0,0}=1−2d and therefore T(t) is of the following form:
Any row of the infinity power of T(t), T ^{∞}(t), converge to the stationary distribution φ. So we only need to estimate b,c,d to obtain T(t) and φ, reducing the number of parameters to be estimated to three. Though numerical, this Monte Carlo approach is very fast and accurate given today’s computational power.
With T(t) known, its eigenvalues \(\{\lambda _{i}(t)\}_{i=1}^{3}\), right column eigenvectors \(\{r_{i}(t)\}_{i=1}^{3}\) and left column eigenvectors \(\{l_{i}(t)\}_{i=1}^{3}\) are readily solvable. To be concise, we simply omit the dependence on t in notation and denote λ(t), r(t), l(t) and T(t) in shorthand by λ, r, l and T. The property of transition matrix of aperiodic and irreducible Markov chain guarantees λ _{1}=1, r _{1}=1 (a three dimensional vector of all 1s) and φ=l _{1}. Using spectral expansion, we can expand the kth power of T, T ^{k}, as:
where the individual entry \(T^{k}_{u,v}=\sum _{i=1}^{3}{\lambda _{i}^{k}} r_{i,u}l_{i,v}\). Actually carrying out the expansion, we obtain:
and let k→∞, we have the stationary distribution:
Subsequently, we have
Similarly,
The symmetry of states 1 and 1 ensures φ _{1}=φ _{−1} in the stationary distribution. Thus, using equation (4) we can compute \(\sigma _{d^{X}d^{Y}}(t)\phantom {\dot {i}\!}\) as following:
Since equation (10) can be numerically calculated based on the Monte Carlo estimates of b,c,d, we can calculate \(\sigma _{d^{X}d^{Y}}(t)\phantom {\dot {i}\!}\) and then plug it into \({\mathcal L}_{D}\) as defined in equation (6) to obtain:
which is the final formula for approximating the pvalues of LT scores in the three letter alphabet case.
We compare the approximate pvalues calculated using equation (6) and the pvalue using simulations. We then apply our method to analyze three real datasets. The first one is a microarray gene expression dataset of yeast cell division cycles (referred to as ‘CDC’), synchronized by the cdc15 gene from Spellman et al. [32]. The second one is a human microbiota dataset from one male (M3) and one female (F4) sampled daily at three body sites (feces, mouth and palms) for 15 months (M3) and for 6 months (F4) from the motion picture of human microbiome paper by Caporaso et al. (referred to as ‘MPH’) [33]. The third one is a microbial ecological time series data from recent NGS of marine microbial community samples collected from sites close to the Polymouth Marine Laboratory (PML) [24] (referred to as ‘PML’). We apply local trend analysis (with t=0 and t=0.5) to analyze the first two datasets and compared the approximate and permutation pvalues. We are the first to analyze the third dataset using local trend analysis and found interesting results.
Results and Discussion
Simulation Studies
Monte Carlo estimates of the transition probabilities
In deriving the approximate statistical significance, i.e. pvalues, for local trend analysis, we make simplifying assumptions to use Markov chain modeling on \({d_{i}^{X}}\) and \({d_{i}^{Y}}\). However, the validity and accuracy of the approximations have to be evaluated. Thus, we first study whether the transition probabilities estimated based on simulated time series data are close to those approximated using the Markov chain theory. We demonstrate this when X _{ i }’s and Y i′s are i.i.d. standard normal random variables, because in most common applications, raw biological experimental series data are normalized before pairwise comparisons. We use 10,000 Monte Carlo randomly generated X _{ i } and Y _{ i }’s, transform them with the thresholds t=0 and t=0.5 and estimate the parameters b, c, and d in the probability transition matrix. Meanwhile the transition matrix of the Markov chain is still solvable by integration using the Mathematica software.
In Table 1, for all the thresholds studied, the numerical integration results are very close to that learned from the randomly generated series. For example, when t=0, the estimated parameters are (b=0.3342,c=0.6658) while the Markov chain theory yields (b=0.3333,c=0.6667). When t=0.5, the estimates are (b=0.2313,c=0.6083,d=0.4034) and the Markov chain theory numerical results are (b=0.2311,c=0.6088,d=0.4043). With 10,000 simulations, the estimates differ at the third or fourth decimals and the Monte Carlo calculation is done within only seconds. With a larger number of simulations, the precision can be even better. Therefore our Markov chain modeling approximates the state transition probabilities and stationary distribution efficiently and accurately.
Approximating the tail probability of the LT score using equation (6)
The approximate pvalue for the local trend score given in the Methods section is only applicable when the pvalue is small and the number of time points is large. Therefore, we first study the range of applicability of our approximation formulae. For the two alphabet case, we precalculate \(\sigma _{d^{X}d^{Y}}=\sqrt {1.25}\) for t=0. Table 2 gives the approximate tail probability (pvalue) based on equation (7) (2nd column) and the simulated probability \(P(LT(0)/\sqrt {1.25n} \geq x)\) (3rd to 9th columns) for different numbers of time points when D=0. It can be seen that the approximate tail probability is close to the simulated probability when the approximate pvalue is less than 0.05 starting from n=20 time points in the sense the first nozero decimal of the approximate pvalue is mostly the same as that of the simulated pvalue. In general, the approximate tail probability is slightly larger than the simulated values when D=0 (see Table 2). Similar results were observed for D=1,2,3 (see Tables 3, 4 and 5). Thus, it will be slightly conservative in declaring significant associations if we use the approximate tail distribution to calculate the pvalue. However, for relatively small value of x, the approximate tail probability can be much larger than the simulated tail probability. Since we are mostly interested in significant associations with small pvalues, we do not consider this as a problem. On the other hand, since the approximate pvalue is larger than the true pvalue, the test based on the approximate pvalue is conservative and the power of the test can be lower than the power based on the true pvalue, which can be approximated by simulations.
In many studies, investigators calculate the pvalues by permuting the time series many times. Next, we compare the permutational and approximate pvalues by simulations. For the simulated data in the last paragraph, we calculate the pvalues using both the permutation approach (P _{ perm }) and the approximate formulae (P _{ theo }) for exactly the same pair of time series data. We do 1000 permutations for each pair and the maximum resolution (precision) of P _{ perm } is 0.001.
Figure 2 shows the comparison between (P _{ perm }) and (P _{ theo }). We find at D=0, starting from n=20 to 30, points in scatter plots become concentrated on the diagonal line (where P _{ perm }= P _{ theo }) and they become more aligned to the diagonal as n increases. This indicates an increasing rate of agreement between the approximate and permutation pvalues as a good approximation. The same is true with D=1,2,3 as the approximation become significantly closer to the permutation results when n increases and starting from n=30 to 40. However, as the time delay increases, the approximation becomes increasingly less accurate. In summary, we see that if we are interested in statistical significance at a given type I error threshold, the approximation provides results comparable to that from permutations starting from n=30 although the theoretical pvalue is slightly conservative.
The simulation results for t=0.5 are presented in Additional file 1.
The CDC dataset
The CDC dataset consists of the expression profiles of 6,177 genes at 24 time points. It is extremely time consuming to approximate the pvalues for local trend analysis for all the gene pairs using permutations. Thus, we only randomly select 25 genes and estimate the pvalue for each of the 300 gene pairs by permuting the original data 1000 times. We then compare P _{ theo } from our approximation to P _{ perm } from the permutation approach, as shown in Fig. 3. For both t=0 and t=0.5 and D=0,1,2,3, it can be seen from the figure that P _{ theo } is highly positively correlated with P _{ perm }, but P _{ theo } is slightly higher than P _{ perm } indicating that it is conservative when we declare statistical significance using P _{ theo }.
For all the situations considered, among the gene pairs with P _{ perm }≤0.05, over half of them are declared as significant by P _{ theo }. For the t=0 case, none of P _{ theo } is less than 0.05 when P _{ perm }>0.05. With D=0, we have 29 (10 %) out of 300 found significant while 260 (87 %) nonsignificant by both approaches, and in total 289 (97 %) are in agreement. Among the gene pairs with P _{ perm }>0.05, none of them are significant using P _{ theo }. Among the gene pairs declared as significant by P _{ perm }, about 29/40 (73 %) are declared as significant by P _{ theo }. Similarly, with D=1,2,3, there are 286 (95 %), 284 (95 %) and 285 (95 %) pvalue pairs in agreement with both P _{ perm } and P _{ theo }, respectively.
For t=0.5, with D=0, we have 260 (87 %) out of 300 found to be nonsignificant by both approximation and permutations. Among the remaining, 28(9 %) are found significant by both methods, and in total 288 (96 %) are in agreement. The results are similar with D=1,2,3, with 284 (95 %), 286 (95 %) and 278 (93 %) in agreement, respectively. Moreover, alltoall pairwise analysis of the whole CDC dataset with D=3 and permutation 1000 times cannot be completed in 100 hours on a “Dell, PE1950, Xeon E5420, 2.5GHz, 12010MB RAM” computing node, while, using the approximate approach, it can be finished within two hours on the same computing node.
The MPH dataset
The MPH dataset was collected from two healthy subjects, one male (M3) and one female (F4), both were sampled daily at three body sites (gut (feces), mouth, and skin (left and right palms)) 130, 133 and 135 days, respectively [33]. There are 335, 1295 and 373 unique operational taxonomic units (OTU) from feces, palm and tongue sites of ‘F4’ and ‘M3’, respectively. In order to feasibly finish computational time of the permutation approach, we select 40 abundant OTUs from the right palm of ‘F4’ of the MPH dataset. We present approximate and permutation pvalue comparison for local trend analysis in Fig. 4. The figure shows that the approximate pvalue is close to that from the permutations when t=0. However, the approximate pvalues are generally much larger than that based on permutations. One potential explanation is the sparsity of the data due to the large number of OTUs.
We choose typeI error threshold to be 0.05. For t=0, the results show good agreement. With D=0, we have 482 (62 %) and 263 (34 %) out of 780 found nonsignificant and significant, respectively, by both methods. In total 745 (96 %) are in agreement. Among the 33 (4 %) OTU pairs with discordant significance by P _{ theo } and P _{ perm }, all of them are significant by P _{ perm } but nonsignificant by P _{ theo }, which is more conservative. The results are similar with D=1,2,3, with 744 (95 %), 743 (95 %) and 732 (94 %) in concordance, respectively, and about 3–4 % incidences significant by P _{ perm } but nonsignificant by P _{ theo }.
For t=0.5 and D=0, we have 489 (63 %) out of 780 found nonsignificant and 188 (24 %) significant by both methods. In total, 677 (87 %) are in agreement. All of the discordant 103 (13 %) pairs are significant by P _{ perm } but nonsignificant by P _{ theo }. The results are similar with D=1,2,3, where 676 (88 %), 676 (88 %) and 685 (88 %) are in concordance, respectively. There are about 12–13 % associations significant by P _{ perm } but nonsignificant by P _{ theo }, showing that P _{ theo } is more conservative.
The PML dataset
Gilbert et al. [24] studied the microbial community composition change using highresolution 16S rRNA tag NGS sequencing of samples taken monthly over 6 years at a temperate marine coastal site off Plymouth Marine Laboratory (PML), Plymouth, UK (total 72 time points). They identified a total of 8,794 different bacterioplankton OTUs and environmental factors, and their presence are most common, abundant and variable across all the samples. As a proofofconcept analysis, we select 73 abundant OTUs, including 15 environment factors. The taxonomic level to which the OTUs could be identified was Phylum and Class. The raw read counts data were first normalized by percentile and Zscore transformation and then converted into trend series of {−1,0,1} with t=0. We then apply the local trend analysis to the trend series and analyze the results below.
In total 77 (81.9 %) of 94 associated OTU pairs (P<0.05, Q<0.05) are bacteria to bacteria. In those 77 OTU pairs, there are 54 OTU pairs belonging to Proteobacteria, which includes: Alphaproteobacteria, Betaproteobacteria and Gamaproteobacteria; nine pairs of OTUs belong to Bacteroidetes, Cyanobacteria and Verrucomicrobia. While the remaining 14 pairs of OTUs show the intergroup association between Proteobacteria and other bacteria. For example: PMLba1 (Alphaproteobacteria031) and PMLba7 (Alphaproteobacteria032) have the highest positive LT score from time point 1 (L T=0.830986, D=0 with P<1e ^{−16}, Q=0.000007). Their abundance level time series and trend series are shown in Fig. 5(a1) and Fig. 5(a2), respectively. PMLba8 (Gammaproteobacteria0341) and PMLba25 (Bacteroidetes0326) with a LT score 0.56338 (D=0 with P=0.000598, Q=0.022149 in approximation), whose abundance level time series and trend series are shown in Fig. 5(b1) and Fig. 5(b2), respectively, have similar trends from the 8th time point and onward.
Through studying the associations between environment factors and bacteria OTUs, we find that day length (DX1) and temperature are the main factors associated with bacteria among the 15 environment factors we select. PMLba55 (Gammaproteobacteria03170) and temperature are associated with a LT score of 0.5774 from the 1st time point (D=0 with P=0.0004, Q=0.0176). The abundance levels and trend series of PMLba55 and temperature are shown in Fig. 6(a1) and Fig. 6(a2), respectively. PMLba7 (Alphaproteobacteria032) and day length are associated with a LT score of 0.5633 starting from the 2nd time point (D=1 with P=0.0006, Q=0.0221). The abundance levels and trend series and day length are shown in Fig. 6(b1) and Fig. 6(b2), respectively.
Overall, we show that the majority of positive associated OTU pairs are within the same phylum, while there are some associations between different phylum. Finally, we used Cytoscape to create a network from the selected PML data as shown in Fig. 7. The hubs with large number of associations are Alphaproteobacteria (PMLba1, PMLba11, PMLba26, etc.). However, the environment factors are not directly associated with Alphaproteobacteria. Most bacteria association are synchronized and delays can only be found between environment factors and PMLba7. In addition, all the associations among bacteria trend series are positive. The eLSA software package is used for analyzing the relationships between environment factors and bacteria and generating the interaction network.
Conclusions
Many breakthroughs in highthroughput experimental technologies have made possible very large scale timeresolved omics studies (proteomics, transcriptomics, metagenomics) possible, tracking hundreds, thousands, or even tens of thousands of molecules simultaneously. Timeseries data generated from these studies provide an invaluable resource to investigate the changing dynamics of biological systems. To make full use of huge size datasets, accurate and efficient statistical and computational methods are urgently needed in all levels of analysis, from accurate estimation of abundance and expression levels, to pairwise association and network analysis.
In this paper, we provide asymptotic formulae to approximate the statistical significance of local trend scores used in local trend analysis for time series data. From our simulations and real data analysis, P _{ theo } is more conservative than P _{ perm }– a property particularly needed in many biological applications that are prone to false positive calls, such as microarray analysis [22]. However, the power of detecting the association can be low using the approximate pvalues. If more accurate pvalues for significant associations are desired, we suggest a “hybrid" approach: first use a relatively loose threshold on the fast approximate pvalues and obtain a relatively small set of associated pairs and then slow permutation approaches are used only for this set of associated pairs to obtain more accurate pvalues. This will significantly reduce the computational time yet maintain the power.
An important reason for us to embrace the approximation is its computation efficiency. As shown in Xia et al. [5], for a given typeI error, α, the time complexity of computing P _{ perm } is O(D M N/α), where D is the delay limit, N is the sample number and M is the replicate number. With P _{ theo }, before any pairwise comparison, we may compute and store (LT score, pvalue) pairs into a hashing table. Then, for each comparison, it only costs constant time O(1) to read out P _{ theo } and is independent of D, M, N and α, a strongly desired feature in large scale analysis.
For instance, in metagenomics, after short read assignment and abundance estimation [34, 35], profiles of thousands of microbial OTUs are present. Before this work, pairwise local trend analysis with this number of factors was hardly tractable using permutation procedures, if not impossible. Parallel computation and hardware acceleration or additional preclustering and filtering approaches are required, increasing the difficulty of analysis. With the new method, researchers can quickly compute the statistical significance for all OTU pairs on desktop computers, allowing onthefly association network mining and analysis. Finally, We have implemented the new method in the eLSA package [5], which now provides a highthroughput pipeline for local trend analysis.
Availability of data and materials
The eLSA software package that implements the local trend analysis and theoretical approximation is freely available for academic use from the website: http://bitbucket.org/charade/elsa. The eLSA package is a standard Python and C++ extension module that requires a Python distribution and a C++ compiling environment to install. eLSA has been extensively tested running on Ubuntu Linux machines (see the README file coming with the software for details).
The ‘CDC’, ‘MPH’ and ‘PML’ datasets are all publicly available in the supplementary of their publications [24, 32, 33]. No ethics approval was required for the study and no informed consent was required for the study, because the study involves no human and animal subjects and the study is not generating new human data. The human microbiome data analyzed in the study was published in Caporaso et al. [33] and publicly available.
Abbreviations
 CDC:

Cell division cycle (Dataset)
 eLSA:

Extended local similarity analysis
 FDR:

False discovery rate
 LS:

Local similarity
 LT:

Local trend
 LTA:

Local trend analysis
 MPH:

Motion picture of human (Dataset)
 NGS:

Next generation sequencing
 OTU:

Operational taxonomic unit
 PCC:

Pearson’s correlation coefficient
 PML:

Polymouth Marine Laboratory (Dataset)
References
BarJoseph Z. Analyzing time series gene expression data. Bioinforma. 2004; 20(16):2493–503.
Androulakis IP, Yang E, Almon RR. Analysis of timeseries gene expression data: methods, challenges, and opportunities. Annu Rev Biomed Eng. 2007; 9:205–28.
Qian J, DolledFilhart M, Lin J, Yu H, Gerstein M. Beyond synexpression relationships: local clustering of timeshifted and inverted gene expression profiles identifies new, biologically relevant interactions. J Mol Biol. 2001; 314(5):1053–66.
Balasubramaniyan R, Hullermeier E, Weskamp N, Kamper J. Clustering of gene expression data using a local shapebased similarity measure. Bioinforma. 2005; 21(7):1069–77.
Xia LC, Steele JA, Cram JA, Cardon ZG, Simmons SL, Vallino JJ, et al.Extended local similarity analysis (elsa) of microbial community and other time series data with replicates. BMC Syst Biol. 2011; 5(Suppl 2):15.
Ruan Q, Dutta D, Schwalbach MS, Steele JA, Fuhrman JA, Sun F. Local similarity analysis reveals unique associations among marine bacterioplankton species and environmental factors. Bioinforma. 2006; 22(20):2532–8.
Ji L, Tan KL. Identifying timelagged gene clusters using gene expression data. Bioinforma. 2005; 21(4):509–16.
Madeira SC, Teixeira MC, SaCorreia I, Oliveira AL. Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE/ACM Trans Comput Biol Bioinforma. 2010; 7(1):153–65.
Goncalves J, Madeira S. Latebiclustering: Efficient heuristic algorithm for timelagged bicluster identification. IEEE/ACM Trans Comput Biol Bioinforma. 2014; 11(5):801–813.
Steele JA, Countway PD, Xia L, Vigil PD, Beman JM, Kim DY, et al.Marine bacterial, archaeal and protistan association networks reveal ecological linkages. ISME J. 2011; 5(9):1414–25.
Cram JA, Xia LC, Needham DM, Sachdeva R, Sun F, Fuhrman JA. Crossdepth analysis of marine bacterial networks suggests downward propagation of temporal changes. ISME J. 2015. http://www.nature.com/ismej/journal/vaop/ncurrent/abs/ismej201576a.html.
Waterman MS. Introduction to Computational Biology: Maps, Sequences and Genomes. London, UK: Chapman & Hall/CRC; 1995.
Xia LC, Ai DM, Cram J, Fuhrman JA, Sun FZ. Efficient statistical significance approximation for local similarity analysis of highthroughput time series data. Bioinforma. 2013; 29(2):230–237.
He F, Zeng AP. In search of functional association from timeseries microarray data based on the change trend and level of gene expression. BMC Bioinforma. 2006; 7:69.
He F, Chen H, ProbstKepper M, Geffers R, Eifes S, Del Sol A, et al.Plau inferred from a correlation network is critical for suppressor function of regulatory t cells. Mole Syst Biol. 2012; 8:624.
Goncalves JP, Aires RS, Francisco AP, Madeira SC. Regulatory snapshots: integrative mining of regulatory modules from expression time series and regulatory networks. PloS ONE. 2012; 7(5):35977.
Nam H, Lee K, Lee D. Identification of temporal association rules from timeseries microarray data sets. BMC Bioinforma. 2009; 10 Suppl 3:6.
Takahashi H, Morioka R, Ito R, Oshima T, AltafUlAmin M, Ogasawara N, et al.Dynamics of timelagged genetometabolite networks of escherichia coli elucidated by integrative omics approach. Omics : A J Integr Biol. 2011; 15(1–2):15–23.
Wang YC, Lan CY, Hsieh WP, Murillo LA, Agabian N, Chen BS. Global screening of potential candida albicans biofilmrelated transcription factors via network comparison. BMC Bioinforma. 2010; 11:53.
Liu Y, Jiang B, Zhang X. Geneset analysis identifies master transcription factors in developmental courses. Genomics. 2009; 94(1):1–10.
Wu WS, Li WH. Systematic identification of yeast cell cycle transcription factors using multiple data sources. BMC Bioinforma. 2008; 9:522.
Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. False discovery rate, sensitivity and sample size for microarray studies. Bioinforma. 2005; 21(13):3017–24.
Durno WE, Hanson NW, Konwar KM, Hallam SJ. Expanding the boundaries of local similarity analysis. BMC Genomics. 2013; 14 Suppl 1:3.
Gilbert JA, Steele JA, Caporaso JG, Steinbruck L, Reeder J, Temperton B, et al.Defining seasonal marine microbial community dynamics. ISME J. 2011; 6(2):298–308.
Kwon AT, Hoos HH, Ng R. Inference of transcriptional regulation relationships from gene expression data. Bioinforma. 2003; 19(8):905–12.
Erdal S, Ozturk O, Armbruster D, Ferhatosmanoglu H, Ray WC. A time series analysis of microarray data. In: Proceedings. Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE). IEEE: 2004. p. 366–375.
Ji L, Tan KL. Mining gene expression data for positive and negative coregulated gene clusters. Bioinforma. 2004; 20(16):2711–8.
Feller W. The asymptotic distribution of the range of sums of independent random variables. Ann Math Stat. 1951; 22(3):427–432.
Daudin JJ, Etienne MP, Vallois P. Asymptotic behavior of the local score of independent and identically distributed random sequences. Stoch Proc Appl. 2003; 107(1):1–28.
Etienne MP, Vallois P. Approximation of the distribution of the supremum of a centered random walk. application to the local score. Methodol Comput Appl. 2004; 6(3):255–275.
Ewens WJ, Grant GR. Statistical Methods in Bioinformatics: an Introduction. New York, USA: Springer; 2004.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, et al.Comprehensive identification of cell cycleregulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998; 9(12):3273–97.
Caporaso JG, Lauber CL, Costello EK, BergLyons D, Gonzalez A, Stombaugh J, et al.Moving pictures of the human microbiome. Genome Biol. 2011; 12(5):50.
Xia LC, Cram JA, Chen T, Fuhrman JA, Sun F. Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS ONE. 2011; 6(12):27992.
He PA, Xia L. Oligonucleotide profiling for discriminating bacteria in bacterial communities. Comb Chem High T Scr. 2007; 10(4):247–255.
Acknowledgements
LCX is supported by National Institute of Health 2R01HG00613704. DMA and XYL are supported by National Natural Science Foundation of China (61370131 and 61300074). JAC, JAF and FZS are supported by the US NSF OCE1136818 and grant GBMF3779 from the Gordon and Betty Moore Foundation Marine Microbiology Initiative. We thank the anonymous reviewers and the editor for their excellent comments that help us significantly improve the paper.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
LCX, JAF and FS designed this study. LCX, DA and JAC prepared the methods and datasets. LCX implemented the algorithm. LCX, DA and XL analyzed the datasets. LCX wrote the manuscript and all authors revised and approved the final manuscript.
Additional file
Additional file 1
Simulation results for t =0.5. (1004 Kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Xia, L.C., Ai, D., Cram, J.A. et al. Statistical significance approximation in local trend analysis of highthroughput timeseries data using the theory of Markov chains. BMC Bioinformatics 16, 301 (2015). https://doi.org/10.1186/s1285901507328
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901507328