 Methodology article
 Open Access
 Published:
Statistical significance approximation for local similarity analysis of dependent time series data
BMC Bioinformatics volume 20, Article number: 53 (2019)
Abstract
Background
Local similarity analysis (LSA) of time series data has been extensively used to investigate the dynamics of biological systems in a wide range of environments. Recently, a theoretical method was proposed to approximately calculate the statistical significance of local similarity (LS) scores. However, the method assumes that the time series data are independent identically distributed, which can be violated in many problems.
Results
In this paper, we develop a novel approach to accurately approximate statistical significance of LSA for dependent time series data using nonparametric kernel estimated longrun variance. We also investigate an alternative method for LSA statistical significance approximation by computing the local similarity score of the residuals based on a predefined statistical model. We show by simulations that both methods have controllable type I errors for dependent time series, while other approaches for statistical significance can be grossly oversized. We apply both methods to human and marine microbial datasets, where most of possible significant associations are captured and false positives are efficiently controlled.
Conclusions
Our methods provide fast and effective approaches for evaluating statistical significance of dependent time series data with controllable type I error. They can be applied to a variety of time series data to reveal inherent relationships among the different factors.
Background
Next generation sequencing (NGS) technologies have made it possible to generate a large amount of time series data in both genomics and metagenomics. An important question in time series data analysis is the identification of associated factors, where the factors can be genes in gene expression analysis or operational taxonomic units (OTUs) in metagenomic studies. Specifically, the abundance series of OTUs are used to investigate the temporal variation of microbial communities in longitudinal studies [1]. Most commonly used approaches for identifying associated factors are to calculate the Pearson correlation coefficients (PCC) or Spearman correlation coefficients (SPCC) among the factors and to identify the significantly associated pairs of factors. However, it was observed in previous studies that factors can be associated in a subset of time intervals (local) and maybe there are timedelays between the factors. PCC and SPCC may fail to identify such local associations with/without timedelays.
Several methods have been developed to understand such associations and have been applied to analyze gene expression profiles [2–4], regulatory network construction [5], cooccurrence patterns in microbial communities [6–9] and many other fields [10, 11]. For example, Qian et al. [2] proposed a local similarity method to identify potential local and timeshift relationships between gene expression data. Ji and Tan [4] suggested a similar procedure that switched gene expression profiles into distinctive changing trend states and calculated the local similarity of the new time series. Ruan et al. [7] investigated local relationships among microbial organisms and environment factors in the San Pedro Channel in the North Pacific Ocean and visualized the graphical structure of significant local similarity associations. Xia et al. [11] extended this method to investigate the replicated time series data and obtained confidence interval of LSA by bootstrap. In these studies, permutation test was used to evaluate statistical significance of the local similarity score, which is timeconsuming if a large number of factors are considered.
To overcome the computational issues of permutation test, several research groups developed theoretical approaches to approximate the statistical significance of LSA [12–14]. However, both permutation test and the theoretical approximations require the assumption that the time series are independent identical distributed (i.i.d.), which can be violated in most time series data.
In this study, we develop two new methods, referred to as datadriven LSA (DDLSA) and LSA for residues (LSAres), to more accurately approximate the statistical significance of LSA. DDLSA employs longrun covariance (described below) of stationary time series through nonparametric kernel estimate to evaluate statistical significance of the original LSA, while LSAres uses the residuals from a predefined model as a substitute for the original series to calculate the statistical significance, similar to the idea of local trend analysis [14]. We investigate the size and power of different approaches and show the validity of our methods using simulations. Further, we apply these methods to analyze human microbiome and marine microbial communities from different highthroughput experiments and compare the identified associated factors using our newly developed methods and those from previous theoretical approximations of LSA scores.
Methods
In this section, we first present an outline of the definition of LSA as given in [2, 7] and the theoretical approximation of statistical significance of the LSA score in [12]. Second, we present our new data driven LSA (DDLSA) approach for evaluating statistical significance of LSA for dependent time series data. For easy reading, the details of the methods are given as additional information. Third, we present the simulation strategies to evaluate the size and power of the different approaches. Fourth, we describe the human and marine metagenomic data used to demonstrate the applications of our new approaches.
Outline of LSA and theoretical approximation of statistical significance
Consider two time series X_{t} and Y_{t},t=1,⋯,n, with mean 0. The local similarity analysis [2, 7] was developed to find intervals of the same length from each sequence to maximize the similarity between the two time series. In practice, biologists are only interested in a relatively small number of delays. Therefore, it is required that the starting positions of the intervals differ by at most D, a parameter set by the practitioners. A dynamic programming algorithm was developed to calculate the largest similarity score, referred to as local similarity (LS) score. The idea was very similar to local sequence alignment in molecular sequence analysis [15]. In these early studies, statistical significance of the LS score was evaluated using permutations. Particularly, one of the time series data was fixed and the other one was permutated many times, and the resulting LS score was obtained using the dynamic programming algorithm. The pvalue was approximated by the fraction of times the LS score of the permuted data is larger than the LS score of the actual data.
There are several drawbacks to permutation test for approximating the statistical significance of the LS score. On the one hand, permutation test requires that data is independent at different time points. However, in practical problems, this assumption is usually violated and time series data may depend on the values of the previous time points. On the other hand, the permutation procedure is timeconsuming, especially when the pvalue precision is small, as the time complexity is inversely proportional to the pvalue precision. When the number of factors is large, all pairwise analysis of highthroughput data is computationally challenging. Therefore, fast and efficient methods to obtain statistical significance approximation of LS score are needed.
Xia et al. [12] and Durno et al. [13] independently developed theoretical approximations for the pvalue. Let s_{D} be the LS score with maximum delay of D between X_{t} and Y_{t}. Xia et al. [12] approximated the pvalue by \(\mathcal {L}_{D}\left (s_{D}/(\sigma \sqrt {n})\right) \), where
and n is the number of time points. If both X_{t} and Y_{t} are i.i.d, σ^{2}=var(X_{t}Y_{t}). If both X_{t} and Y_{t} are first order Markov chains (such as DNA sequences in the identification of CpG islands [16]), \(\sigma ^{2} = E_{\phi }\left (Z_{1}^{2}\right) + 2\sum _{k=1}^{\infty }E_{\phi }(Z_{1}Z_{k+1})\), with Z_{t}=X_{t}Y_{t}. Details on these approximations are given in Additional file 1.
Statistical significance of LS score for dependent time series
Time series data in general depend on each other and cannot be best modelled by Markov chains. Moreover, it is challenging to obtain σ^{2} defined above for Markov models. Therefore, we provide a data driven approach for evaluating the statistical significance of LS score for dependent time series data.
Assume X_{t} and Y_{t} are weakly stationary time series with mean 0. Here a time series X_{t} is weakly stationary if EX_{t}^{2}<∞, E(X_{t}) is a constant (independent of t) and Cov(X_{t},X_{t+k}) depends only on time delay k. Under the null hypothesis H_{0} that the two time series are not associated, Z_{t}=X_{t}Y_{t} is also weakly stationary with mean 0. Using similar arguments as in [12], we can show that the pvalue can again be approximated by \(\mathcal {L}_{D}\left (s_{D}/(\omega \sqrt {n})\right) \), where the function \(\mathcal {L}_{D}\) is given in Eq. 1 and \(\omega ={\lim }_{n \rightarrow \infty }\sqrt {var\left (\sum _{i=1}^{n}Z_{i}\right)/n}\) is referred to as the longrun variance. The details of theoretical derivations are given in the Additional file 1.
The estimate of ω plays a crucial role in deriving the statistical significance of LS score and has an enormous impact on the validity of local similarity analysis for dependent data. Following Andrew [17], we used an autoregressive (AR)(1) plugin data dependent method to estimate the longrun variance. The autoregressive model specifies that the current value depends linearly on its own previous values.
Let \(\hat {\gamma }_{z}(k)\) be the sample autocovariance function of Z_{t}, defined as:
where \(\bar {Z}=\frac {1}{n}\sum _{i=1}^{n}Z_{i}\) is the mean of Z_{t}. Under the null hypothesis H_{0}, we can approximate \(\hat {\gamma }_{z}(k)\) by \(\hat {\gamma }_{x}(k) \hat {\gamma }_{y}(k)\) if the means of X_{t} and Y_{t} are zero, where \(\hat {\gamma _{x}}(k)\), \(\hat {\gamma _{y}}(k)\) and \(\hat {\gamma _{z}}(k)\) are the sample autocovariance functions of X_{t}, Y_{t} and Z_{t}, respectively. We can estimate ω by
where b_{ω} is the bandwidth parameter \(b_{\omega }=\left \lfloor 1.1447(\hat {\tau }n)^{1/3}\right \rfloor \) [17],
In summary, given time series X_{t} and Y_{t}, we first calculate their LS score s_{D} using the dynamic programming algorithm in [7]. We then estimate the longrun variance using Eq. 3. Finally, the statistical significance of the LS score for dependent data can be approximated as \(\mathcal {L}_{D}(s_{D}/(\hat {\omega }_{n}\sqrt {n}))\). Since we estimate the longrun variance from real data, we refer to the new method as data driven LSA (DDLSA).
Local similarity analysis based on residuals
We also modified the original theoretical approximation of statistical significance of LS score [12] by considering the residuals of the original time series. First we suppose that time series data are generated from a predefined model, such as autoregressive (AR) model or autoregressive moving average (ARMA) model. We then use the residuals from the model as the substitution of the original data, since the correlation of data may come from the relevance of the residuals. Because of the independent property of the residuals, the statistical significance of LS score of residuals can be obtained from the approximate theoretical distribution of LSA for i.i.d. time series (Eq. 1). We refer to this method as LSAres.
Simulation studies
We evaluated the size and power of six different methods for determining the statistical significance of associations between factors in time series data. The six methods are described as follows.

1
PCC. Pearson correlation coefficient (PCC) is widely used to identify correlation between random variables. If the random variables X_{t} and Y_{t} are from a bivariate normal distribution and their PCC is r, the statistic \(t=r\sqrt {(n2)/(1r^{2})}\) has a Student’s tdistribution with degrees of freedom n−2 under the null hypothesis H_{0}.

2
SRCC. Spearman rank correlation coefficient (SRCC, r_{s}) between X_{t} and Y_{t} is defined as Pearson correlation coefficient between the rank values of those two variables. We can test for the significance of r_{s} using \(t=r_{s}\sqrt {(n2)/\left (1r_{s}^{2}\right)}\), which follows approximately a Student tdistribution with degrees of freedom n−2.

3
Theoretical LSA (TLSA). We used the procedures in [12] to calculate the pvalue of the LS score between X_{t} and Y_{t}.

4
Permutation test. We fixed one time series Y_{t} and reshuffled X_{t} for N=1000 times. Assuming that \(X_{t}^{(k)}, k=1,\cdots,N\) were the permutations of X_{t}, we computed the LS score between \(X_{t}^{(k)}\) and Y_{t}, denoted as \(s_{D}^{(k)}\). Then the pvalue was approximated by the fraction of times that \(s_{D}^{(k)}\) are at least as high as s_{D}, the LS score between X_{t} and Y_{t}.

5
LSAres. We adopted the AR or ARMA models to obtain the residuals of data, and calculated the statistical significance of the residuals through TLSA, which was regarded as the significance between X_{t} and Y_{t}.

6
DDLSA. In DDLSA, the time series data need to be centered first. Specifically, time series data X_{t},t=1,2,⋯,n are centered as \(\tilde {X}_{t} = X_{t}  \bar {X}_{t}\), where \(\bar {X}_{t} = \frac {1}{n}\sum _{t=1}^{n}X_{t}\) is the sample mean of X_{t}. \(\tilde {Y}_{t}\) is defined analogously. We utilized \(\mathcal {L}_{D}\left (s_{D}/\left (\hat {\omega }_{n}\sqrt {n}\right)\right)\) to calculate the approximate statistical significance of \(\tilde {X}_{t}\) and \(\tilde {Y}_{t}\) and took it as the significance between X_{t} and Y_{t}.
Comparison of the empirical size of different approaches
We investigated whether pvalues obtained from these statistics were close to the significance level which is the probability rejecting the null hypothesis, given that it were true. Here we used three different null models to compare the size of the six approaches for calculating the statistical significance of the LS score:

(1)
The AR(1) model:
$$ \begin{aligned} & X_{t}=\rho_{1}X_{t1}+\varepsilon_{t}^{X} \\ & Y_{t}=\rho_{2}Y_{t1}+\varepsilon_{t}^{Y} \end{aligned} $$(5) 
(2)
The ARMA(1,1) model:
$$ \begin{aligned} & X_{t}=\rho_{1}X_{t1}+\varepsilon_{t}^{X}+0.5~\varepsilon_{t1}^{X} \\ & Y_{t}=\rho_{2}Y_{t1}+\varepsilon_{t}^{Y}+0.5~\varepsilon_{t1}^{Y} \end{aligned} $$(6) 
(3)
The ARMA(1,1)TAR(1) model:
$$ \begin{aligned} & X_{t}=\rho_{1}X_{t1}+\varepsilon_{t}^{X}+0.5~\varepsilon_{t1}^{X} \\ & Y_{t}=\left\{\begin{array}{ll} \rho_{2}Y_{t1}+\varepsilon_{t}^{Y}, Y_{t1}\leq 1\\ 0.5~Y_{t1}+\varepsilon_{t}^{Y}, Y_{t1}> 1 \end{array}\right. \end{aligned} $$(7)
where 0<ρ_{1},ρ_{2}<1, \(\varepsilon _{t}^{X}\) and \(\varepsilon _{t}^{Y}\) are independent standard normal random variables. All these models were stationary. For each model, we first generated X_{0} and Y_{0} from the standard normal distribution. Then we generated (X_{t},Y_{t}), t=2,⋯,100+n from these models. Finally, we discarded the first 100 samples and took the others as the true X_{t} and Y_{t}. The procedure can guarantee the stationarity of the time series generated from these models.
Comparison of the empirical power of different approaches
Next we investigated the power of the six methods for detecting the association between the factors under two alternative models that the factors are associated. Our objective is to identify the most powerful method for detecting the associations between the factors.
The local AR model We studied a model that the two factors are only associated in a subinterval:
where \(\varepsilon _{1}^{X},\varepsilon _{1}^{Y}\sim N(0,1),\varepsilon _{t}^{X}\sim N\left (0,1\rho _{1}^{2}\right), \varepsilon _{t}^{Y}\sim N\left (0,1\rho _{2}^{2}\right), t=2,\cdots,n\) and they are independent. For simplicity and symmetry, we generated time series data that were correlated within the middle interval of length np as follows, where p is the fraction of the time interval that the two time series were correlated (shown in Fig. 1). We first generated X_{t} using Eq. 8. Second, let \(Y_{t}=\frac {1}{\sqrt {1+\sigma ^{2}}}(X_{t}+\xi _{t})\) in the middle np time points of the entire series where ξ_{t}∼N(0,σ^{2}),σ^{2}=(1−ρ^{2})/ρ^{2}. In the remaining n−np time points, Y_{t} were generated by the AR(1) model (Eq. 8) with ρ_{2}=ρ_{1}/(1+σ^{2}). We generated the time series data this way so that X_{t} followed a stationary AR(1) model, Y_{t} approximately followed a stationary AR(1) model, and X_{t} and Y_{t} were correlated in the middle np time points with correlation coefficient ρ.
The bivariate AR model We also investigated another model, referred to as the bivariate AR(1) model, that was used in [18] (Chapter 7, page 290).
where \(\varepsilon _{1}^{X}, \varepsilon _{1}^{Y}\sim N(0,1), \varepsilon _{t}^{X}\sim N\left (0,1\rho _{1}^{2}\right), \varepsilon _{t}^{Y}\sim N\left (0,1\rho _{2}^{2}\right), t=2,\cdots,n\) and the noise terms have correlation coefficients:
The variances of both X_{t} and Y_{t} are 1 and cor(X_{t},Y_{t})=ρ. Similarly as above, we generated locally associated time series data. In the middle np time points, we generated (X_{t},Y_{t}) using Eq. 9. In the remaining n−np time points, we generated (X_{t},Y_{t}) by the independent bivariate AR(1) model with ρ=0.
Applications to a human and a marine microbiome data sets
We applied DDLSA and LSAres to analyze a human and a marine microbiome time series data sets. The Moving Pictures of the Human Microbiome (MPHM) data was collected from two healthy subjects, one male (‘M3’) and one female (‘F4’). Both individuals were sampled daily at three body sites: gut (feces), mouth(tongue), and skin (left and right palms) [19]. The data set consists of 130, 135 and 133 daily samples from ‘F4’, and 332, 372 and 357 samples from ‘M3’. There are 335, 373 and 1295 operational taxonomic units (OTUs) from feces, tongue and palm (both left and right) sites of ‘F4’ and ‘M3’, where the taxonomic level is Genus. We selected 41 ‘core’ OTUs that were observed in at least 60% samples from the tongue of ‘F4’ and analyzed their relationships.
The PML data set is one of the longest microbial time series consisting of monthly samples taken over 6 years at a temperate marine coastal site off Plymouth, UK [20]. These samples were sequenced using highresolution 16S rRNA tag NGS sequencing. A total of 155 bacterial OTUs were identified with the taxonomic level of Order. Among them, we chose 62 abundant OTUs that were present in at least 50% of the time points, and 13 environment factors to analyze their association network. We filled the missing values in the environment data using linear interpolation.
Results and discussion
DDLSA and LSAres have controlled type I error rates and other approaches do not
We investigated the effects of the autoregressive coefficients ρ_{1} and ρ_{2} and the number of time points n on the type I error rates of the six methods for evaluating statistical significance under the AR(1) (Eq. 5), ARMA(1,1) (Eq. 6) and ARMA(1,1)TAR(1) (Eq. 7) models. We chose six different pairs of autoregressive coefficients from 0.5 to 0.8 and the number of time points n from 100 to 1000. The results are shown in Tables 1, 2 and 3 for the three models, respectively. For TLSA, Permutation test, LSAres and DDLSA, we set the maximum time delay D=0 for simplicity. For LSAres, we needed to specify the generative models for X_{t} and Y_{t}. For given data, the generative models are most likely unknown. We used AR or ARMA models as generative models and denoted the resulting methods as LSAres(AR) and LSAres(ARMA), respectively. Throughout the simulations, we let the prespecified error rate to be 0.05.
Table 1 shows that, except for the case of ρ_{1}=0,ρ_{2}=0, the empirical type I error rates of PCC, SRCC, TLSA and the permutation approaches are all larger than the prespecified type I error. When ρ_{1}=0,ρ_{2}=0, the empirical type I error rates of PCC, SRCC, TLSA and the permutation approaches are well controlled, which is reasonable as the time series are independent bivariate normally distributed. Further, the empirical type I error of TLSA is somewhat smaller than the significance level of 0.05 indicating that TLSA is conservative, consistent with findings in [12]. The results of LSAres and DDLSA are similar to that of TLSA. When ρ_{1}≠0 and/or ρ_{2}≠0, the PCC, SRCC, TLSA and the permutation approaches are not valid in the sense that their empirical type I error rates are much higher than the prespecified type I error. On the other hand, both DDLSA and LSAres control the type I errors reasonably well under all the simulated scenarios. Their type I error approaches the significance level as the number of time points increases. The performances of LSAres(AR) and LSAres(ARMA) are similar.
Tables 2 and 3 show the similar results for ARMA(1,1) and ARMA(1,1)TAR(1) models, respectively. Under the ARMA(1,1) and ARMA(1,1)TAR(1) models with ρ_{1}=−0.5,ρ_{2}=−0.5, X_{t} are i.i.d.. Therefore, the type I error rates of PCC, SRCC, TLSA and permutation approaches are well controlled. However, the empirical type I error rates are much larger than the prespecified type I error rate of 0.05 under all the other parameter settings. On the other hand, the type I error rates of LSAres and DDLSA are well controlled under all situations. Further, the type I error rates of both LSAres(AR) and LSAres(ARMA) are well controlled indicating that LSAres is applicable even when the generative model is misspecified.
Finally, the simulation results for time delay D≠0 are presented in the Additional file 2: Table S1S3.
Comparing the power of LSAres and DDLSA
Since PCC, SRCC, permutation and TLSA could not control type I error, we only investigated the power of LSAres and DDLSA. In the local AR model, we let ρ_{1}=0.5, ρ=0.3,0.4,0.5, p from 0.2 to 1, and the number of time points n from 20 to 300. Figure 2 shows the power of DDLSA and LSAres as a function of the number of time points. The power of both LSAres and DDLSA increases with the number of time points n, percentage of correlation p, and serial correlation ρ. In particular, when the two time series are associated in 60% of the time interval (p=0.6) with correlation (ρ=0.5), the power of DDLSA is greater than 0.9 when the number of time points n is at least 100. Under the AR model, the power of DDLSA is higher than that of LSAres. Although we only show the results for ρ_{1}=0.5 and time lag D=0, the results from other simulations with different autocorrelation parameters and time delays are similar to the result shown here. The simulation results under the local AR model with time delay D>0 are shown in Additional file 3: Fig. S1S3.
Similar to the simulations under the local AR model, we also investigated the power of DDLSA and LSAres with different parameters under the bivariate AR model and the results are shown in Fig. 3. However, the power of LSAres is higher than that of DDLSA, different from the local AR model. Overall, LSAres in testing local association is more useful than DDLSA if we know that the time series come from the predefined model, such as the ARMA model. The simulated results for the power of DDLSA and LSAres under the bivariate AR(1) model with time delay D>0 are shown in Additional file 3: Fig. S4S6.
Significantly associated OTU pairs from the MPHM data set
We analyzed the relationships among 41 OTUs that were observed in at least 60% of the tongue samples of individual ‘F4’. First, we found 21 significant autocorrelated OTUs among 41 OTUs using the BoxLjung test [21] under the null hypothesis H_{0}:ρ(k)=0 at the 5% significance level, where ρ(k) is the autocorrelation function for lag k. Figure 4 shows two autocorrelated OTUs. The firstorder autocorrelation of Neisseria is 0.61 (Pvalue =1.96×10^{−12}) indicating high autocorrelation. Although Clostridiales had relatively low autocorrelation (0.21), the hypothesis of no autocorrelation can still be rejected (Pvalue = 0.0148).
Second, we identified significantly locally associated OTU pairs with both pvalue and false discovery rate (FDR) below 0.05 and compared the performance of TLSA, DDLSA and LSAres with time delay up to 3. For LSAres, the residuals were found based on the ARMA(p,q) model and the orders were selected based on the AIC criterion. In our study, we used FDR or Qvalue to adjust for multiple hypothesis testing using the qvalue package in R [22]. Restricting the pvalue P≤0.05 and qvalue Q≤0.05, 317 pairs of significant associations are found among all 820 OTU pairs by TLSA, 189 by DDLSA, and 224 by LSAres, respectively (Table 4). Among the associations found by TLSA, 143 (∼ 45%) are not significant by DDLSA, and 111 (∼ 35%) are not significant by LSAres (Fig. 5). Such associations identified by TLSA but not by DDLSA or LSAres may be false positives caused by the autocorrelation of the raw data. If we combine associated pairs from DDLSA and LSAres, i.e. we define significant pairs as those found significant by either DDLSA or LSAres, 239 (∼ 89%) pairs out of 270 in total found by DDLSA or LSAres are also significant by TLSA. This finding is interesting, and it suggests that the combination of DDLSA and LSAres exhibits better performance than each alone. Note that DDLSA also finds some associations missed by LSAres and vice versa. For instance, DDLSA finds 189 and LSAres finds 224 significant associations but only 143 are found by both LSAres and DDLSA. Therefore, either DDLSA or LSAres is not a substitute but a complementary approach to the other one. For a comprehensive analysis of a data set, one should apply both approaches. Table 4 shows the results with more strict criteria of P≤0.01 and Q≤0.01.
We carefully investigated one of the OTU pairs identified by TLSA but not by DDLSA and LSAres: Leptotrichia and Kingella (Fig. 6). The association is significant by TLSA within a time interval of length 129 starting from the first time point with 3 days delay where Leptotrichia precedes Kingella (Pvalue = 0.003 and Qvalue = 0.007 by TLSA), while not significant by DDLSA (Pvalue = 0.16, Qvalue = 0.38) and LSAres (Pvalue = 0.50, Qvalue = 0.55). The autocorrelograms of the two OTUs show that both of them have the strong autocorrelation, where TLSA can’t control the type I error. However, DDLSA and LSAres work well in this situation.
In addition, we investigated if these sitespecific significant associations are shared across the two individuals. Sørensen index Q_{s} [23] was used to evaluate the similarity between significant associations of the two samples from ‘F4’ and ‘M3’. We considered only the common OTUs in the two samples. The two individuals shared 40 and 41 OTUs in the feces and tongue samples, respectively. Let S1 and S2 be the sets of significant associations between common OTUs of the two samples. The Sorensen index is defined as \(\frac {2S_{1} \cap S_{2}}{S_{1}+S_{2}}\), where S_{1}∩S_{2} is the intersection of S_{1} and S_{2} and · is the number of OTU pairs in a set. Using LSAres, we identified 91 (Q_{s}=0.35) and 177 (Q_{s}=0.55) shared significant associations in the feces and tongue samples ‘F4’ and ‘M3’, respectively. Using DDLSA, the corresponding numbers are 61 (Q_{s}=0.32) and 122 (Q_{s}=0.46).
Significantly associated OTU pairs from the PML data set
The seasonality of particular OTUs is obvious in their abundance profiles and autocorrelograms as shown in [20]. The stronger the seasonal periodicity, the more closely the autocorrelogram approaches a cyclical function. For example, there are significant seasonal cycles in the autocorrelograms of Verrucomicrobiales and Alphaproteobacteria (Fig. 7), and their periods are similar (about 1 year). Therefore, the abundance profiles of bacteria are possibly similar at the same time point of every year. However, the abundance may be somewhat different in some years. For example, both Verrucomicrobiales and Alphaproteobacteria are more abundant in the third year. In addition, a total of 33 out of 75 factors are significant autocorrelated based on the BoxLjung test at the 5% significance level, including 9 environment factors and 24 OTUs. We applied TLSA, DDLSA and LSAres to obtain significant associations of these 75 factors and Table 4 shows the number of identified significant associations.
Among 2550 pairwise associations of all 75 factors, 761, 371 and 98 pairs were found significant with time delay 3 with both Pvalue and Qvalue ≤ 0.05 by TLSA, DDLSA and LSAres, respectively. The relatively large number of significant associations identified by TLSA contain a large fraction of false positives since the dependency of the time series is not considered. The DDLSA and LSAres reduce the number of significant associations resulting from the factors’ autocorrelation. Figure 8 shows the Venn diagram illustrating the relationship of the sets of significant associations using the three approaches. There are 61 pairs found by all three methods. All the 98 associations found by LSAres are also significant by TLSA. This could be due to the periodicity of OTUs that makes ARMA model unsuitable for this dataset. We note that 486 (∼ 64%) out of 761 significant pairs by TLSA is nonsignificant by DDLSA, indicating that the autocorrelation of OTUs may lead to many false positives in the TLSA test. On the other hand, 275 out of the 371 (74%) significant associations found by DDLSA are also found by TLSA indicating high agreement with TLSA. If we combine the significant associations found by DDLSA and LSAres, 312 (∼ 76%) pairs out of 408 in total found by DDLSA or LSAres are also significant by TLSA. This result displays that the combination of DDLSA and LSAres exhibits better performance than each alone. In addition, the majority of associated OTU pairs found by TLSA and DDLSA are between the Proteobacteria, Actinobacteria and Verrucomicrobia phylum members, while those found by LSAres are between Proteobacteria, Verrucomicrobia and Gemmatimonadetes phylum members.
Conclusions
The rapid development of highthroughput sequencing technology generates massive amounts of sequencing data effectively and economically. These developments make large scale human metagenomics studies in a wide range of environment possible. A variety of time series data from these studies brings great opportunities for statistical methods to gain insight into the temporal and spatial dynamics of biological systems. Therefore, for obtaining more accurate and efficient results, it’s necessary to consider the specific property of time series in these studies, such as autocorrelation.
In this paper, we developed a theoretical statistical significance approximation of local similarity score for dependent time series data, which substitutes longrun variance based on nonparametric kernel estimate for sample variance. Moreover, we developed another method to approximate the statistical significance by using raw data’s residuals from a predefined model. We considered different dependent time series models to evaluate the type I error and power of our methods compared with others, i.e. original TLSA, permutation test, PCC and SRCC. Results from our simulations showed that our methods can control type I error reasonably, but the other four approaches cannot. Through simulations, we showed that DDLSA performs better than LSAres for the local AR model, but LSAres works better than DDLSA in the bivariate AR model. Therefore, these two methods complement each other under different correlation scenarios. Using the MPHM and PML datasets, we demonstrated that DDLSA and LSAres reduced the redundant associations efficiently and captured the most possible relationships among OTUs in metagenomics studies of microbial communities. In addition, to obtain more complete sets of significant associations, we suggested to integrate the results from DDLSA and LSAres—apply DDLSA and LSAres to the data set simultaneously and combine the significant associations identified by at least one method as the final significant associations. This will reduce false negatives effectively.
However, one drawback of LSAres is the determination of the data generative model. If we presume data from a more complicated model, residuals from this model may seem like normally distributed but may lose too much information about the original data. We have to make a tradeoff between employing complicated models and preserving useful information. In the paper, we investigated the impact on type I error by considering AR and ARMA models as alternative models and both of them work well. In the future, we will continue to study the influence of model misspecification.
We applied DDLSA and LSAres to time series data in microbial communities. In fact, they can be used in any type of data with the same length, such as medical (EEG or MEG signals), climate (temperature, solar irradiance, river runoff or rainfall) and economic (stock price) time series data. The timedelay associations of EEG time series play an important role in discovering new information about the activity of brain [24]. Climate time series often exhibit positive serial dependence [18]. Potentially local and time delayed associations are widespread in climate data, but it will increase the number of false positives if we use TLSA to calculate the statistical significance of their LS scores, while DDLSA and LSAres can overcame this problem.
Abbreviations
 AR:

Autoregressive model
 ARMA:

Autoregressive moving average model
 DDLSA:

Datadriven local similarity analysis
 EEG:

Electroencephalogram
 FDR:

False discovery rate
 i.i.d.:

independent identical distributed
 LS:

Local similarity
 LSA:

Local similarity analysis
 LSAres:

Local similarity analysis based on residuals
 MEG:

Magnetoencephalogram
 MPHM:

Moving pictures of the human microbiome (Dataset)
 OTU:

Operational taxonomic units
 PCC:

Pearson’s correlation coefficient
 PML:

Polymouth marine laboratory (Dataset)
 SRCC:

Spearman’s rank correlation coefficient
 TAR:

Threshold autoregressive model
 TLSA:

Theoretical significance of local similarity analysis
References
Faust K, Lahti LM, Gonze D, Vos WMD, Raes J. Metagenomics meets time series analysis: unraveling microbial community dynamics. Curr Opin Microbiol. 2015; 25(12):56–66.
Qian J, DolledFilhart M, Lin J, Yu H, Gerstein M. Beyond synexpression relationships: local clustering of timeshifted and inverted gene expression profiles identifies new, biologically relevant interactions. J Mol Biol. 2001; 314(5):1053–66.
Balasubramaniyan R, Hüllermeier E, Weskamp N, Kämper J. Clustering of gene expression data using a local shapebased similarity measure. Bioinformatics. 2005; 21(7):1069–77.
Ji L, Tan K. Mining gene expression data for positive and negative coregulated gene clusters. Bioinformatics. 2004; 20(16):2711–8.
Madeira SC, Teixeira MC, SaCorreia I, Oliveira AL. Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE/ACM Trans Comput Biol Bioinform. 2010; 7(1):153–65.
Beman JM, Steele JA, Fuhrman JA. Cooccurrence patterns for abundant marine archaeal and bacterial lineages in the deep chlorophyll maximum of coastal california. ISME J. 2011; 5(7):1077–85.
Ruan Q, Dutta D, Schwalbach MS, Steele JA, Fuhrman JA, Sun F. Local similarity analysis reveals unique associations among marine bacterioplankton species and environmental factors. Bioinformatics. 2006; 22(20):2532–8.
Cram JA, Xia LC, Needham DM, Sachdeva R, Sun F, Fuhrman JA. Crossdepth analysis of marine bacterial networks suggests downward propagation of temporal changes. ISME J. 2015; 9(12):2573–86.
Steele JA, Countway PD, Xia L, Vigil PD, Beman JM, Kim DY, Chow CT, Sachdeva R, Jones AC, Schwalbach MS, Rose JM, Hewson I, Patel A, Sun F, Caron DA, Fuhrman JA. Marine bacterial, archaeal and protistan association networks reveal ecological linkages. ISME J. 2011; 5(9):1414–25.
Gonçalves JP, Madeira SC. Latebiclustering: Efficient heuristic algorithm for timelagged bicluster identification. IEEE/ACM Trans Comput Biol Bioinform. 2014; 11(5):801–13.
Xia LC, Steele JA, Cram JA, Cardon ZG, Simmons SL, Vallino JJ, Fuhrman JA, Sun F. Extended local similarity analysis (eLSA) of microbial community and other time series data with replicates. BMC Syst Biol. 2011; 5(Suppl 2):15.
Xia LC, Ai D, Cram JA, Fuhrman JA, Sun F. Efficient statistical significance approximation for local similarity analysis of highthroughput time series data. Bioinformatics. 2013; 29(2):230–7.
Durno WE, Hanson NW, Konwar KM, Hallam SJ. Expanding the boundaries of local similarity analysis. BMC Genom. 2013; 14(Suppl 1):3.
Xia LC, Ai D, Cram JA, Liang X, Fuhrman JA, Sun F. Statistical significance approximation in local trend analysis of highthroughput time series data using the theory of Markov chains. BMC Bioinformatics. 2015; 16:301.
Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981; 147(1):195–7.
Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press; 1998.
Andrews DWK. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica. 1991; 59(3):817–58.
Mudelsee M. Climate Time Series Analysis: Classical Statistical and Bootstrap Methods. Atmospheric and Oceanographic Sciences Library. Dordrecht: Springer; 2010.
Caporaso JG, Lauber CL, Costello EK, Lyons DBL, Gonzalez A, Stombaugh J, Knights D, Gajer P, Ravel J, Fierer N, Gordon J, Knight R. Moving pictures of the human microbiome. Genome Biol. 2011; 12(5):50.
Gilbert JA, Steele JA, Caporaso JG, Steinbrück L, Reeder J, Temperton B, Huse S, McHardy AC, Knight R, Joint I, Somerfield P, Fuhrman JA, Field D. Defining seasonal marine microbial community dynamics. ISME J. 2012; 6(2):298–308.
Ljung GM, Box GEP. On a measure of lack of fit in time series models. Biometrika. 1978; 65(2):297–303.
Storey JD, Bass AJ, Dabney A, Robinson D. qvalue: Qvalue estimation for false discovery rate control. R package version 2.12.0. 2015. http://github.com/jdstorey/qvalue.
Sørensen TA. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on danish commons. Biol Skr. 1948; 5:1–34.
Pijn JP, da Silva FL. Propagation of electrical activity: nonlinear associations and time delays between eeg signals In: Zschocke S, Speckmann EJ, editors. Basic Mechanisms of the EEG. Brain Dynamics. Boston: Birkhäuser: 1993. p. 41–61.
Acknowledgments
The authors thank the Quantitative and Computational Biology Program at the University of Southern California for providing computational resources.
Funding
The research was supported by the National Natural Science Foundation of China Grants (11371227, 61432010, 11626247) and US National Science Foundation Grant (DMS1518001).
Availability of data and materials
The ’MPHM’ and ’PML’ datasets used during the current study are publicly available in the supplementary of their publications [19, 20]. The DDLSA R source code is available at https://github.com/BlueStamford/DDLSA.
Author information
Authors and Affiliations
Contributions
YL and FS conceived and designed the project. FZ performed the analysis. FZ drafted the manuscript and YL and FS finalized the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1
Appendix. Theoretical approximation of LSA statistical significance for i.i.d. or Markov time series and derivation of the asymptotic distribution of the LS score statistics. (PDF 259 kb)
Additional file 2
Table S1S3. Type I errors of TLSA, LSAres and DDLSA tests under the AR(1), ARMA(1,1) and ARMA(1,1)TAR(1) models with time delay D≠0, respectively. (PDF 551 kb)
Additional file 3
Figure S1S6. Power of LSAres and DDLSA for local AR model and bivariate AR model with different time delays(D). (PDF 1115 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhang, F., Sun, F. & Luan, Y. Statistical significance approximation for local similarity analysis of dependent time series data. BMC Bioinformatics 20, 53 (2019). https://doi.org/10.1186/s128590192595x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128590192595x
Keywords
 Datadriven local similarity analysis
 Longrun variance
 Nonparametric kernel estimate
 Statistical significance