Skip to main content

Predicting viral exposure response from modeling the changes of co-expression networks using time series gene expression data

Abstract

Background

Deciphering the relationship between clinical responses and gene expression profiles may shed light on the mechanisms underlying diseases. Most existing literature has focused on exploring such relationship from cross-sectional gene expression data. It is likely that the dynamic nature of time-series gene expression data is more informative in predicting clinical response and revealing the physiological process of disease development. However, it remains challenging to extract useful dynamic information from time-series gene expression data.

Results

We propose a statistical framework built on considering co-expression network changes across time from time series gene expression data. It first detects change point for co-expression networks and then employs a Bayesian multiple kernel learning method to predict exposure response. There are two main novelties in our method: the use of change point detection to characterize the co-expression network dynamics, and the use of kernel function to measure the similarity between subjects. Our algorithm allows exposure response prediction using dynamic network information across a collection of informative gene sets. Through parameter estimations, our model has clear biological interpretations. The performance of our method on the simulated data under different scenarios demonstrates that the proposed algorithm has better explanatory power and classification accuracy than commonly used machine learning algorithms. The application of our method to time series gene expression profiles measured in peripheral blood from a group of subjects with respiratory viral exposure shows that our method can predict exposure response at early stage (within 24 h) and the informative gene sets are enriched for pathways related to respiratory and influenza virus infection.

Conclusions

The biological hypothesis in this paper is that the dynamic changes of the biological system are related to the clinical response. Our results suggest that when the relationship between the clinical response and a single gene or a gene set is not significant, we may benefit from studying the relationships among genes in gene sets that may lead to novel biological insights.

Background

In genomics studies, time-series gene expression data [13] often need to be processed and analyzed. In 2016, DREAM CHALLENGES released an open challenge called ‘Respiratory Viral DREAM Challenge: Discovering dynamic molecular signatures in response to virus exposure’ (https://www.synapse.org/#!Synapse:syn5647810/wiki/399108). The aim was to develop early predictors of susceptibility and contagiousness based on expression profiles collected prior to and at early time points following viral exposure. Some work reported the differences of transcriptomics [46] in the host response between symptomatic and asymptomatic subjects exposure to respiratory viruses. Additionally, as what were done by most participants (https://www.synapse.org/#!Synapse:syn5647810/wiki/402364), some common machine learning algorithms [7] can be used if we treat the challenge as a prediction problem. The challenge results [see Additional file 1 for parts of the challenge results] demonstrate that the prediction performance significantly depends on the participants’ models. However, we need to average the time series data across time or only use cross-sectional data at a time to perform ensemble learning, and the dynamic information of the time series data is lost in these approaches. Moreover, in the early stage of infection (within 24 h), there is little separation of the trajectories of genes among subjects with different clinical responses. Previous studies [8, 9] also showed that the individual responses after exposure to respiratory virus are influenced not only by the baseline immune status of the host but also by the dynamics of the early host immune response immediately following exposure. If we only consider a single gene, there is no distinct pattern in both cross-sectional and dynamic data. It is difficult to differentiate between positive and negative groups by gene expression levels at early stage. In this paper, we resort to gene sets analysis to correlate exposure response with dynamic gene expression patterns in gene sets. To consider multiple genes, some methods have been proposed to infer the relationship between genes. For example, the Dynamic Bayesian Network (DBN) was used to establish the dynamic regulatory network [10]. We note that a number of groups have studied time-varying dynamic Bayesian networks (TV-DBN) to model the varying network structures and reveal the dynamics of biological systems [11, 12]. The dynamic mixed membership stochastic block model (dMMSB) helps to infer the biological functions of genes through modeling the dynamic tomography of networks [13]. The review of differential network biology [14] advocated that differential network mapping at large scales may provide a deeper understanding of complex biological phenomena. The work [15] analyzed multiple differential co-expression networks based on time-course RNA-Seq data. Through Multiple Differential Modules (M-DMs), they found that dynamic modules are associated with the development of heart failure. These results in the literature suggest that considering the dynamics of networks may help us to better understand disease onset and progression. However, how to extract useful dynamic information from time-series gene expression data to build predictive model remains a challenging problem.

To study the relationship between viral exposure response and time-series gene expression data, we hypothesize that the changes (i.e. dynamics) of the relationship between genes in gene sets may be informative about viral exposure response, and propose a statistical framework to characterize and integrate dynamic information for response prediction where the model parameters have clear biological interpretations. The main innovations of the paper are: Firstly, we use spectral norm to extract information of the difference between two networks. Secondly, we model the changes of dynamic co-expression networks based on the graph-based change point detection method. Thirdly, we measure the similarity between two subjects by the relationship between gene trajectories.

The rest of the paper is organized as follows: In the “Results” Section, we evaluate the performance of our method using both simulated and real data. The results include data description and preprocessing, preliminary analysis and inference results. This is followed by the “Discussion” and “Conclusions” Sections. The “Methods” Section first introduces the notations, then describes the statistical models and inference procedure proposed in this manuscript.

Results

Simulations

In this section we assess the performance of the proposed algorithm on the data simulated as follows. For simplicity, we fixed the number of genes G=80. The sample size N and total time points T took values from the sets {20,50,100} and {40,80,150}, respectively. In the main text, we show the evaluation results under the case {N=100,T=40}. For the other cases, the results are provided in the Supplementary Materials [see Additional file 1]. We partition these 80 genes into four gene sets indexed by O1, O2, O3 and O4, respectively, with each gene set containing 20 genes. To model the time series data, we assume an AR(1) model for the mean expression levels, i.e.

$$\boldsymbol{\mu}_{t}=0.5\boldsymbol{\mu}_{t-1}+\boldsymbol{\varepsilon}_{t},~~\boldsymbol{\mu}_{0}=\boldsymbol{0},~~\boldsymbol{\varepsilon}_{t}\sim \mathcal{N}\left(\boldsymbol{0},\Sigma_{0.1}\right),~~t=1,...,T, $$

where Σ0.1 is the diagonal matrix with 0.1 as the diagonal element. In our model, the algorithm is based on the relationship between the response label and the change of the dynamic structure. As described in the “Models” section, we assume that under the null hypothesis, the covariance matrix of the simulated data is Σ0 across the time points and under the alternative hypothesis, the covariance matrix is Σ0 up to some time point after which it changes to Σ1. We assume \(\Sigma =\mathbbm {I}+ \rho \cdot \mathbbm {1}- diag(\rho)\) where ρ is a constant and 1 is the matrix of 1. For the null hypothesis, ρ=0 and we consider different scenarios for the alternative hypothesis when ρ takes value from set {0.1,0.3,0.5,0.7,0.9}. The time-series gene expression data are simulated for 50 subjects labelled ‘ +1’ through the model,

  • \(\boldsymbol {x}_{iO_{1}t_{1}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{0}\right),~~i=1,...,50,~t_{1}=1,...,15\),

    \(\boldsymbol {x}_{iO_{1}t_{2}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{1}\right),~~i=1,...,50,~t_{2}=16,...,40\),

  • \(\boldsymbol {x}_{iO_{2}t} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{0}\right),~~i=1,...,50,~t=1,...,40\),

  • \(\boldsymbol {x}_{iO_{3}t_{1}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{1}\right),~~i=1,...,50,~t_{1}=1,...,25\),

    \(\boldsymbol {x}_{iO_{3}t_{2}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{0}\right),~~i=1,...,50,~t_{2}=26,...,40\),

  • \(\boldsymbol {x}_{iO_{4}t} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{1}\right),~~i=1,...,50,~t=1,...,40\).

For 50 subjects labelled ‘ −1’, the data are generated by

  • \(\boldsymbol {y}_{jO_{1}t_{1}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{0}\right),~~j=51,...,100,~t_{1}=1,...,15\),

    \(\boldsymbol {y}_{jO_{1}t_{2}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{1}\right),~~j=51,...,100,~t_{2}=16,...,40\),

  • \(\boldsymbol {y}_{jO_{2}t_{1}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{0}\right),~~j=51,...,100,~t_{1}=1,...,20\),

    \(\boldsymbol {y}_{jO_{2}t_{2}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{1}\right),~~j=51,...,100,~t_{2}=21,...,40\),

  • \(\boldsymbol {y}_{jO_{3}t_{1}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{1}\right),~~j=51,...,100,~t_{1}=1,...,25\),

    \(\boldsymbol {y}_{jO_{3}t_{2}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{0}\right),~~j=51,...,100,~t_{2}=26,...,40\),

  • \(\boldsymbol {y}_{jO_{4}t_{1}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{1}\right),~~j=51,...,100,~t_{1}=1,...,20\),

    \(\boldsymbol {y}_{jO_{4}t_{2}} \sim \mathcal {N}\left (\boldsymbol {\mu }_{t}, \Sigma _{0}\right),~~j=51,...,100,~t_{2}=21,...,40\),

where \(\boldsymbol {x}_{iO_{p}t}=\left \{x_{igt}: g\in O_{p}\right \}\), \(\boldsymbol {y}_{iO_{p}t}=\left \{y_{igt}:g\in O_{p}\right \}\) and p{1,2,3,4}. Under our simulation models, we know that the first and third gene sets have changes in both the positive and negative groups, and the changes happen at time points 15 and 25, respectively. For the second and fourth gene sets, the positive group has no change point and the negative group has changes at the 20th time point. Therefore, the second and fourth gene sets are informative about the response label. We compared the proposed algorithm with commonly used machine learning algorithms, including Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) [7]. Note that LR is with lasso penalty. The prior probabilities of class membership in LDA use the class proportions in the training set. The kernel function in SVM is the radial basis function, \(\exp \left (-\frac {|u|^{2}}{G}\right)\). The number of neighbors in KNN is set to 3. We perform 100 simulations, and for each simulation we randomly select 70% subjects as the training set, and the remaining as the test set. We evaluate the performance of the proposed algorithm from three aspects: change point detection, parameter inference, and prediction accuracy, respectively.

Change point detection and parameter inference The results under different scenarios are shown in Table 1. We have 4 gene sets indexed 1, 2, 3, and 4, respectively and the inferred parameters of these 4 gene sets are b1, b2, b3, and b4, respectively. As discussed in the simulation models, the subject label is the result of different change points in gene sets 2 and 4. For comparability, the absolute value of parameter b is denoted by |b|. When ρ is greater than 0.3, |b2| and |b4| are the largest in the 4 parameters which is consistent with the model structure. So when the difference between Σ0 and Σ1 is large enough, our method can identify the gene sets which contribute more to the response label. In Table 1, ‘CHP’ represents the average value over 100 replications for the estimation of change-point position, with the standard deviation in the parentheses. ‘P-value’ is the average p-value over 100 replications using graph-based change point detection method. ‘CHP(%)’ represents the proportion of times the change point is precisely detected in 100 simulations. When ρ is less than 0.1, the structure difference between Σ0 and Σ1 is small, and the detected change point may not be statistically significant. When ρ is greater than 0.5, there is more than 90% chance to detect the change point.

Table 1 Results under different scenarios

Prediction accuracy We average the time series data across time points as the input before they are analyzed by LR, LDA, SVM and KNN. The average ROC curves over 100 simulations of the classification results for each algorithm are shown in Fig. 1, where ‘FPR’ represents false positive rate and ‘TPR’ represents true positive rate. We can see that there is more advantage of our method with an increasing value of ρ. The average AUC values are summarized in Table 2. The proposed algorithm has the highest average AUC value of 100 simulations when ρ is greater than 0.5. Moreover, the ‘AUC’ row of Table 1 shows the classification performance for the test set. We can see that the value of AUC increases with the increase of ρ, which is consistent with our model hypothesis, as it is easier to infer the labels with a larger ρ.

Fig. 1
figure1

ROC curves. When ρ takes different values from {0.1,0.3,0.5,0.7,0.9}, average ROC curves over 100 replications of different algorithms are shown. a ρ=0.1. When the difference between Σ0 and Σ1 is small, the positive and negative groups are difficult to distinguish. The performances of all algorithms are similar. b ρ=0.3. The results are similar to that of ρ=0.1. c ρ=0.5. Our algorithm is slightly better than the others. d ρ=0.7. The proposed algorithm outperforms the other algorithms. e ρ=0.9. The performance of the proposed algorithm is substantially better than the others

Table 2 AUC of different algorithms under different scenarios

As described in the “Methods” section, our algorithm requires the given gene sets as input. So the performance of our algorithm may be affected by the way of grouping genes. We evaluated the performance of our algorithm in different ways of grouping genes. The AUC and prediction accuracy may depend on the grouping method, where a higher enrichment of signals in the pre-defined gene sets will lead to better performance as expected. More details are provided in the Supplementary Materials [see Additional file 1].

Real data analysis

Data description and preprocessing

In this section, we evaluate the performance of our proposed method through real data analysis. Some challenge results related to this paper are provided in the Supplementary Materials [see Additional file 1]. The complete results can be found at the URL (https://www.synapse.org/#!Synapse:syn5647810/wiki/402364) (note that only the registered users can log into the website). The time-series gene expression data for this challenge were collected from healthy volunteers exposed to a respiratory virus within a controlled experimental setting where some became ill and others did not despite the same exposure. Data were derived from seven viral challenge experiments in which volunteers were exposed to one of four different respiratory viruses (Influenza H1N1, Influenza H3N2, Respiratory Syncytial Virus, or Rhinovirus) in order to find gene expression profiling signatures of susceptibility. Peripheral blood gene expression profiling was made at 55 time points ranging between -30 h (pre-exposure) and 672 h (post-exposure). The released data include 125 subjects from seven study centers with time-series gene expression data for 22,277 probes in peripheral blood for each subject, with a total of 2371 samples. Additionally, clinical information was also available, such as age, gender, and the time of samples measured. To reduce noise, we removed 7 subjects who were injected interfering viruses, and removed probes corresponding to multiple genes, and averaged the multiple probes corresponding to the same gene. We considered a total of 12,532 genes. Therefore, we have N=118, G=12,532, and T=55 for this data set. There are 68 subjects with positive labels and 50 subjects with negative labels. The overall data can be visualized by the heat map as shown in Fig. 2. We can see that the genes can be grouped into distinct clusters, while samples can not be clustered according to response. Moreover, different study centers are clustered together, suggesting possible batch effects. In the following analysis, we normalized the gene expression data according to each time point to remove batch effects.

Fig. 2
figure2

Heat map of the baseline gene expression data across subjects. Genes are clearly classified into different groups. The data seem to suggest two groups with subtle differences, but different viral response groups are not clearly separated with these two groups. On the other hand, centers DEE3 H1N1 and Rhinovirus UVA are in the same group, whereas the remaining centers are in the other group suggesting center/batch effects

Preliminary analysis

A number of studies [16, 17] reported the differences in the host response between symptomatic and asymptomatic subjects challenged with respiratory viruses. For simplicity, we call the symptomatic response group the positive group and the asymptomatic response group the negative group. Firstly, we analyzed the cross-sectional data and performed differential expression analysis at a single time point. No significant difference in single gene expression level was found between the positive and negative groups before 40 h. We further investigated the relationship between gene trajectories and responses. Some papers [9, 17, 18] reported that OAS1, IFI44L, IRF7 and CCR1 may be associated with the response. From the expression trajectories of OAS1, IFI44L, IRF7 and CCR1 shown in Fig. 3, we can see that the changes of expression variances for OAS1, IFI44L, IRF7 occur at latter stage and expression variance of CCR1 does not change over time. That is as a single dynamic time series for these related genes, the positive and negative groups have significant differences after 45 h, however they do not exhibit differences at the early stage (within 24 h). Therefore, if we only consider a single gene, there is no distinct pattern in both cross-sectional and dynamic data. Therefore, it is difficult to differentiate between positive and negative groups by gene expression levels at early stage. On the other hand, the paper [19] shows that viral shedding increases sharply between 0.5 and 1 day (within 24 h) after exposure and consistently peaks on day 2. We resort to gene sets analysis to correlate exposure response with dynamic gene expression patterns in gene sets.

Fig. 3
figure3

Time series plots of four distinct genes. “MP” and “NP” represent the means of the positive and negative groups, respectively. “sd” means the standard deviations. a For gene CCR1, the positive and negative groups are difficult to distinguish. On average, the negative group is below the positive group in the middle stage. b For gene IFI44L, the positive group has an obvious upward pattern at about the 50th hour. However it is difficult to distinguish the positive and negative groups in early stage (within 24 h). c For gene IRF7, the pattern is similar to that of gene IFI44L. d For gene OAS1, at the 50th hour or so, the positive and negative groups can be better distinguished. However there is little separation of the trajectories of these genes for the positive and negative groups in the early stage

Firstly, we selected the gene sets that may be related to viral exposure responses. We consider “SYMPTOMATIC-SC2” as the response label which is a binary variable indicating post-exposure maximum symptom score greater than six and then screen out differentially expressed genes from each cross-sectional gene expression data at 55 time points, even if it is not significant. This led to 55 gene sets. Secondly, for each gene set, we represent it as an undirected weighted network and the weight is given by gene expression similarity, where we used the Pearson correlation coefficient of two genes to define their similarity. That is the function h in the “Models” section is Pearson correlation coefficient. We have tried a number of definitions of similarity and Pearson correlation had better performance overall. For each gene set, we obtained 55 time-dependent networks. And we detected change points for these networks using the method introduced in the “Models” section. After change point detection, all gene sets are sorted according to the time of change point. Thirdly, we set up multiple kernel prediction model based on the gene sets in which the relationships among genes change at early stage. Each gene set is integrated into a kernel.

Results

We randomly selected 70% subjects as the training set and the remaining as the test set. The training set contained 83 subjects (35 subjects with negative label and 48 subjects with positive label), 12,532 genes, and up to 55 time points. We want to test the biological hypothesis that the dynamic networks with early change point contribute more to the response label. Figure 4 shows the prediction performance of the model for the test set when we added the gene sets in the order of the detected change point time. It can be seen from Fig. 4 that at the early stage, with more gene sets included, AUC increased. However, after more than 12 gene sets were included, AUC started to decrease, which indicates that an increasing number of gene sets does not lead to an increase of prediction accuracy. This is consistent with our hypothesis that networks that change in the early stages are associated with the response label. Moreover, the curve in Fig. 4 has a turning point at the 35th gene set when AUC starts to increase again, suggesting that those unchanged gene sets may also have information on exposure response. This may be because those unchanged gene sets are markers of the asymptomatic group, which is consistent with the stable performance of the negative group in Fig. 3. Next, we investigated the learning parameter vector b. In terms of 55 gene sets, we consider those gene sets among the top 12 in which the relationships among genes change at early stage. The inferred parameters are summarized in Table 3.

Fig. 4
figure4

AUC. The value of AUC as a function of the number of genes considered. Gene sets are added to the Bayesian multiple kernel model one by one according to the order of the time when the change point occurs. It can be seen that the AUC increases at the beginning, and has the highest value when the 12th gene set is included. It then oscillates downward as more gene sets are added until the 35th gene set, when AUC starts to increase again

Table 3 Learning parameters for gene sets $

The results show that the 44th, 2nd, 34th and 35th gene sets contribute more to the response than the other gene sets. By enrichment analysis for these four gene sets, we can identify pathways related to viruses as shown in Fig. 5. It can be seen that the top pathways are associated with viruses. Finally, we visualize the gene sets associated with response in Fig. 6. Take the positive group as an example. In Fig. 6 each red line represents the change over time of the systematic feature of the gene expression values collected from randomly selecting 80% subjects from the positive group. That is to say, every point on the red line corresponds to the spectral norm of the corresponding matrix of the co-expression network constructed by the genes from the 35th gene set at a certain time t. The result shows that there is a clear difference between the positive and negative groups. More importantly, at the early stage, it is very difficult to distinguish the positive and negative groups from the trajectory of a single gene, as shown in Fig. 3. However that is more obvious in Fig. 6, which substantiates our hypothesis that the label of response is related to the dynamic nature of the changing of a system (gene set) but not a single gene. We visualized the co-expression networks of the 35th gene set at time points 0, 12, 24, 48, 96 and 146, respectively [see Additional file 1]. It seems that the connections of gene modules became closer after the samples were exposed to the virus.

Discussion

In this paper, we adopt a screening approach to find potential gene sets which may be related to response. For this screening step, we do not consider multiple testing when we detect change points of the dynamic networks. We further identify the gene sets related to the response through the proposed Bayesian model. The screening step can be considered as a variable selection step where no response information is used. In addition, when there is no simple relationship between the clinical response and a single gene or a gene set (therefore it is challenging to have statistically significant results for marginal analysis), a model that studies the changes of the relationships among genes in gene sets may offer novel biological insights.

Fig. 5
figure5

Enrichment analysis for the top five gene sets. We show the top 10 pathways enriched for genes in these five gene sets

Fig. 6
figure6

System feature. Dynamic plots of the relationship of genes in the positive and negative groups, respectively. In the early stage (within 24 h), there is a distinction between the positive and negative groups compared with Fig. 3. After the 40th hours, the positive group rises suddenly

Conclusions

We have proposed a novel approach of modeling time-series gene expression data for inferring an individual’s response to viral exposure. The biological hypothesis in this paper is that the dynamic changes of the system are related to the clinical response. Compared with previous time series analysis methods, we showed that change point detection for dynamic networks may be informative for the relationship between the clinical response and dynamic nature of the system (gene sets). Joint consideration of multiple kernels based on gene sets with dynamic network structures not only can predict an individual’s clinical response, but can also help elucidate the biological pathways involved. The effectiveness of the proposed method was demonstrated through the analyses of both simulated and real data.

In this paper, we construct the co-expression networks for the gene sets at each time point separately using Pearson correlation. We note that other methods may be used. For example, we can construct networks incorporating some prior knowledge such as regulatory network at each time point to improve network robustness. Some model-based methods such as TV-DBN [11] can be used to construct dynamic networks. Network reconstruction [12] incorporating the temporal nature of the data may help improve the performance of our model. On the other hand, the selection of matrix similarity may influence the change point detection for the networks. It is worth studying the different methods of change-point estimation for networks in the future. Additionally, we considered the case where the response variable is binary. If the response variable is continuous, we will consider a continuous response in the Bayesian model. In real data analysis, we used Pearson correlation coefficient to define similarity function of kernel. Some other kernel functions can be tried, such as dynamic time warping (dtw) which has been applied to gene expression data [20]. In practice, cross-validation method can be carried out to select the optimal kernel function definition when the sample size is sufficiently large. In this paper, when we compute multiple kernels for integrating different dynamic gene sets, there is no consideration about relationship between different kernels. However, different gene sets may have overlapping genes, which may influence the estimation of change point. We will consider the Bayesian integration model with correlation information in the future.

Methods

The main aim of the paper is to identify gene sets related to viral exposure response and meanwhile predict a person’s response using the dynamic relationships among genes in a gene set at early exposure stage. We assume that only some of the gene sets are informative about clinical responses. Firstly, the genes need to be organized into different gene sets based on some criteria. Here are some suggested ways to group genes. If there is prior biological knowledge, we can organize genes into different gene sets according to such knowledge. For example, for immune related diseases, the immune-related pathways in the database, MSigDB (http://software.broadinstitute.org/gsea/msigdb/index.jsp) [21], can be viewed as gene sets. Without prior knowledge, we may construct gene sets based on the observed data, e.g. genes with different expression levels at different time points. Secondly, the general framework of our method is summarized in Fig. 7. For each given gene set, time dependent networks are constructed. For example, at each time point, we can construct a fully connected network with genes as nodes and correlation coefficients between genes as weights. Because we hypothesize that the gene sets in which the relationships among genes change over time may be informative about the clinical response, we investigate whether the network has changed over time. The gene sets which change the network structures at early stage are candidate gene sets to predict clinical response. Then, we employ a Bayesian multiple kernel learning model to predict an individual’s response. The key for kernel learning is the definition of similarity. We use the overall relationship between genes to define the similarity between subjects. More details are provided in the “Models” and “Inference” sections.

Fig. 7
figure7

The workflow. Step 1, time-varying networks are constructed for each gene set and change point is detected for these dynamic networks. Step 2, different gene sets which change at early stage are integrated to build a multiple kernel learning model. More details are provided in the “Models” and “Inference” sections

Notations

Assume that there are N subjects, G genes, and T time points. Let

  • i {1,...,N} index the subjects,

  • g\(\in \{1,...,G\}\doteq \mathcal {G}\) index the genes,

  • t {1,...,T} index the time points where gene expression data are collected,

  • \(\boldsymbol {x}\in \mathbb {R}^{\mathrm {N}\times \mathrm {G}\times \mathrm {T}}\) represent the collection of expression values of all genes for all subjects at all time points,

  • yi{+1,−1} denote the response label of subject i,

  • \(O_{k} \subseteq \mathcal {G}\) be the kth subset of gene index set, where k is an integer satisfying 1 ≤k\(\leq 2^{|\mathcal {G}|}-1\) and \(|O_{k}|\doteq G_{k}\) and

  • \(s\in \mathcal {S}\) be the index of kernel, where \(\mathcal {S}\subseteq \left \{k: 1\leq k \leq 2^{|\mathcal {G}|}-1\right \}\).

For the elements in the array x, xigt represents the expression value of gene g at time t for subject i. The data set

$$\boldsymbol{x} = \bigotimes\limits_{i=1}^{N}\bigotimes\limits_{g=1}^{G}\bigotimes\limits_{t=1}^{T}x_{igt}, $$

where \(\bigotimes \) represents the Cartesian product. Note that xig·=(xig1,...,xigT)T is the time-series expression observation with length T of gene g for subject i. Similarly, xi·t=(xi1t,...,xiGt)T and x·gt=(x1gt,...,xNgt)T.

Models

For the kth gene set, the genes are collected in Ok and let Gk denote the number of genes in this set. At each time point, we can construct a network such as co-expression network, for genes in the set. Therefore, we can get T networks across the T time points, with these networks represented by T matrices {A1,...,AT}, where At(i,j), the (i,j)th entry of matrix At, is derived from h(x·it,x·jt), i,jOk and h is a function that defines the correlation or similarity between two genes in this set. The change point detection across these networks can be expressed as follow:

$$H_{0}: A_{t}\sim F_{0} ~~~for ~~1 \leq t \leq T, $$

vs

$$H_{1}: \exists \tau, ~where ~1 \leq \tau < T, ~~s.t.~~A_{t}\sim \left\{ \begin{array}{ll} F_{0}, ~~~for ~~1 \leq t \leq \tau, & \\ F_{1}, ~~~for ~~\tau < t \leq T, \end{array}\right. $$

where F0 and F1 are different probability measures on a nonzero measure set. Firstly, define the similarity between two matrices as

$$m\left(A_{t_{1}},A_{t_{2}}\right)= \left\|A_{t_{1}}-A_{t_{2}}\right\|_{2},~~~~for ~\forall ~t_{1}, ~t_{2}, $$

where ·2 is the spectral norm of a matrix [22]. The reason for using spectral norm to measure the similarity between two matrices is, for a symmetric matrix, the spectral norm equals to the spectral radius of this symmetric matrix. From a geometric point of view, the spectral radius of a matrix represents the degree of stretching along its corresponding direction. Secondly, we can construct a graph on {At:t=1,...,T}, i.e. the minimum spanning tree (MST), with the above definition of matrix similarity. Thirdly, we can detect the change point of {At:t=1,...,T}. We use the graph-based change point detection method [23] for statistical inference. More details about change point detection are provided in the Supplementary Materials [see Additional file 1]. We retain the gene sets that change at early stage to build predictive models.

After identifying gene sets with early change points, we use a Bayesian model to integrate dynamic information from multiple gene sets. Assume that the indices of the selected gene sets are collected in \(\mathcal {S}\). Each gene set indexed by s can define a kernel matrix Φs. Denote the kernel matrix set \(\Phi =\left \{\Phi _{s}: s\in \mathcal {S} \right \}\). We integrate all the \(|\mathcal {S}|\) gene sets by the following multiple kernel learning model [24],

$$ f\left(\boldsymbol{x}_{i}\right)=\boldsymbol{a}^{\mathrm{T}}\sum_{s\in\mathcal{S}} b_{s} \Phi_{s}^{i} + e, $$
(1)

where a=(a1,a2,...,aN)T is the sample weight vector, b is the kernel weight vector, e is bias and \(\Phi _{s}^{i}\) is the kernel vector which is the ith column of kernel matrix Φs. The (i,j)th element of Φs is defined by the similarity between subjects i and j. Φs(i,j)=ϕs(xi,xj), where the kernel function ϕs is defined as

$$\phi_{s}\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right)= \left\|\tilde{\Sigma}_{i}^{s}-\tilde{\Sigma}_{j}^{s}\right\|_{2}, ~~\forall i,~j\in\{1,2,...,N\}~and~s\in\mathcal{S}. $$

The (l,k)th entry of the matrix \(\tilde {\Sigma }_{i}^{s}\) is,

$$\tilde{\Sigma}_{i}^{s}(l,k)= \text{cov}\left(\boldsymbol{x}^{s}_{il\cdot},\boldsymbol{x}^{s}_{ik\cdot}\right), ~~~for~l,k\in O_{s}~and~s\in\mathcal{S}, $$

where \(\boldsymbol {x}^{s}_{il\cdot }\) represents the expression vector of gene l in gene set Os for subject i. In Eq. (1), f can be considered as a latent variable [25] connecting the observed expression data x and labels y. Through the estimation of parameter b, we can infer which gene sets have more contribution to the response label y.

Inference

The main aim of this section is to infer the parameters {a,b,e} in model (1). We adopt a Bayesian framework because of two advantages. Firstly, compared with general kernel-based methods [26, 27], kernel learning under a Bayesian framework reduces the requirement of kernel conditions, such as Mercer’s kernel condition [28, 29]. So we can select more flexible metrics to measure the similarity between subjects based on time series observations. Secondly, compared with general machine learning algorithms, such as SVMs, auxiliary parameters can also be inferred under a Bayesian framework [29]. Denote the priors {λ,γ,ω} corresponding to {a,b,e}, respectively. For computational convenience, we assume conjugate prior distributions [24] in the model. Let Ξ={αλ,βλ,αγ,βγ,αω,βω} denote the hyper-parameter set for {λ,γ,ω} and L be an intermediate output variable for the iteration of parameters. All priors and parameters in the model are denoted by \(\boldsymbol {\Theta }=\left \{\boldsymbol {\lambda },\boldsymbol {\gamma },\omega \right \}\bigcup \left \{\boldsymbol {a},\boldsymbol {b},e,\boldsymbol {f},\boldsymbol {L}\right \}\). Hence, the conjugate Bayesian priors for the parameters are

$$\begin{array}{*{20}l} \lambda\lambda_{i} \sim Gamma\left(\lambda_{i};\alpha_{\lambda},\beta_{\lambda}\right) \quad \quad \forall i,\\ a_{i} | \lambda_{i} \sim \mathcal{N}\left(a_{i};0,\lambda_{i}^{-1}\right) \quad \quad \forall i,\\ L_{{si}}| \boldsymbol{a},{\Phi}_{s}^{i}\sim\mathcal{N}\left(L_{{si}};\boldsymbol{a}^{\mathrm{T}}{\Phi_{s}^{i},1} \right) \quad \quad \forall (s,i),\\ \gamma_{s} \sim Gamma\left(\gamma_{s};\alpha_{\gamma},\beta_{\gamma}\right) \quad \quad \forall s,\\ b_{s} | \gamma_{s} \sim \mathcal{N}\left(b_{s};0,\gamma_{s}^{-1}\right) \quad \quad \forall s,\\ \omega \sim Gamma\left(\omega;\alpha_{\omega},\beta_{\omega}\right),\\ e | \omega \sim \mathcal{N}\left(e;0,\omega^{-1}\right),\\ f_{i} | \boldsymbol{b},e,{L}_{\cdot i} \sim \mathcal{N}\left(f_{i};\boldsymbol{b}^{\mathrm{T}{L}}_{\cdot i}+e,1\right) \quad \quad \forall i,\\ y_{i} | f_{i} \sim \delta\left(f_{i}y_{i}>\nu\right) \quad \quad \forall i, \end{array} $$

where δ(·) is the Kronecker delta function that returns 1 if the variable satisfies the restriction and 0 otherwise, and ν is a given margin parameter which is used to distinguish two categories. Next, we use variational approximation [30, chap.10] to estimate the parameters. The main idea of the algorithm is to approximate the marginal likelihood logp(y|x) by the lower bound \(\mathcal {L}\),

$$\log p\left(\boldsymbol{y}|\boldsymbol{x}\right)\geq \mathcal{L} \doteq E_{q(\boldsymbol{\Theta})}\left[\log p\left(\boldsymbol{y},\boldsymbol{\Theta}|\boldsymbol{x}\right)\right]-E_{q(\boldsymbol{\Theta})}\left[\log q(\boldsymbol{\Theta})\right], $$

where E represents the expectation of random variables and q(Θ) is the posterior distribution of Θ. The exact formulas of the lower bound \(\mathcal {L}\) are similar to those in the supplementary material of reference [24]. Hence, the approximate posterior distribution q(·) of each parameter can be computed by

$$ q(\cdot) \propto exp\left\{E_{q\left(\boldsymbol{\Theta}\backslash\cdot\right)}\left[\log p\left(\boldsymbol{y},\boldsymbol{\Theta}|\boldsymbol{x}\right)\right]\right\}, $$
(2)

where q(Θ·) is the distribution of Θ with the parameter (·) removed. Algorithm 1 summarizes the estimation process of model parameters {a,b,e,f,L}. After we obtain a trained model, the label for a new subject can be predicted by Eq. (1). More details about Algorithm 1 can be found in the Supplementary Materials [see Additional file 1].

Availability of data and materials

The datasets analysed during the current study are available from the corresponding author on reasonable request. The R code used during this study are included in this published article [see Additional file 2].

Abbreviations

DBN:

Dynamic Bayesian Network

TV-DBN:

Time-Varying Dynamic Bayesian Networks

dMMSB:

dynamic Mixed Membership Stochastic Block model

M-DMs:

Multiple Differential Modules

MSigDB:

Molecular Signatures Database

MST:

Minimum Spanning Tree

LR:

Logistic Regression

LDA:

Linear Discriminant Analysis

SVM:

Support Vector Machine

KNN:

K-Nearest Neighbor

CHP:

Change Point Position

ROC:

Receiver Operating Characteristic curve

FPR:

False Positive Rate

TPR:

True Positive Rate

AUC:

Area Under Curve

DTW:

Dynamic Time Warping

References

  1. 1

    Bar-Joseph Z, Gitter A, Simon I. Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet. 2012; 13(8):552–64.

    CAS  PubMed  Article  Google Scholar 

  2. 2

    Ernst J, Bar-Joseph Z. STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics. 2006; 7(1):191.

    PubMed  PubMed Central  Article  Google Scholar 

  3. 3

    Kim S, Imoto S, Miyano S. Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Biosystems. 2004; 75(1-3):57–65.

    CAS  PubMed  Article  Google Scholar 

  4. 4

    Woods CW, McClain MT, Chen M, Zaas AK, Nicholson BP, Varkey J, Veldman T, Kingsmore SF, Huang Y, Lambkin-Williams R, et al. A host transcriptional signature for presymptomatic detection of infection in humans exposed to influenza H1N1 or H3N2. PLoS ONE. 2013; 8(1):e52198.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  5. 5

    Proud D, Turner RB, Winther B, Wiehler S, Tiesman JP, Reichling TD, Juhlin KD, Fulmer AW, Ho BY, Walanski AA, et al. Gene expression profiles during in vivo human rhinovirus infection. Am J Respir Crit Care Med. 2008; 178(9):962–8.

    CAS  PubMed  Article  Google Scholar 

  6. 6

    Andres-Terre M, McGuire HM, Pouliot Y, Bongen E, Sweeney TE, Tato CM, Khatri P. Integrated, multi-cohort analysis identifies conserved transcriptional signatures across multiple respiratory viruses. Immunity. 2015; 43(6):1199–211.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  7. 7

    Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction, 2nd edition. New York: Springer; 2009.

    Google Scholar 

  8. 8

    Zaas AK, Chen M, Varkey J, Veldman T, Hero III AO, Lucas J, Huang Y, Turner R, Gilbert A, Lambkin-Williams R, et al. Gene expression signatures diagnose influenza and other symptomatic respiratory viral infections in humans. Cell Host Microbe. 2009; 6(3):207–17.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  9. 9

    Carin L, Hero A, Lucas J, Dunson D, Chen M, Heñao R, Tibau-Piug A, Zaas A, Woods CW, Ginsburg GS. High-dimensional longitudinal genomic data: an analysis used for monitoring viral infections. IEEE Signal Proc Mag. 2012; 29(1):108–23.

    Article  Google Scholar 

  10. 10

    Zou M, Conzen SD, A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics. 2005; 21(1):71–9.

    CAS  PubMed  Article  Google Scholar 

  11. 11

    Song L, Kolar M, Xing EP, Time-varying dynamic Bayesian networks. In: Advances in Neural Information Processing Systems 22. British Columbia, Canada: Vancouver: 2009.

  12. 12

    Dimitrakopoulou K, Tsimpouris C, Papadopoulos G, Pommerenke C, Wilk E, Sgarbas KN, Schughart K, Bezerianos A. Dynamic gene network reconstruction from gene expression data in mice after influenza A (H1N1) infection. J Clin Bioinforma. 2011; 1(1):27.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  13. 13

    Xing EP, Fu W, Song L, et al. A state-space mixed membership blockmodel for dynamic network tomography. Ann Appl Stat. 2010; 4(2):535–66.

    Article  Google Scholar 

  14. 14

    Ideker T, Krogan NJ. Differential network biology. Mol Syst Biol. 2012; 8:565.

    PubMed  PubMed Central  Article  Google Scholar 

  15. 15

    Ma X, Gao L, Karamanlidis G, Gao P, Lee CF, Garcia-Menendez L, Tian R, Tan K. Revealing pathway dynamics in heart diseases by analyzing multiple differential networks. PLoS Comput Biol. 2015; 11(6):e1004332.

    PubMed  PubMed Central  Article  Google Scholar 

  16. 16

    Gentile DA, Doyle WJ, Fireman P, Skoner DP. Effect of experimental influenza A infection on systemic immune and inflammatory parameters in allergic and nonallergic adult subjects. Ann Allergy Asthma Immunol. 2001; 87(6):496–500.

    CAS  PubMed  Article  Google Scholar 

  17. 17

    Huang Y, Zaas AK, Rao A, Dobigeon N, Woolf PJ, Veldman T, Øien NC, McClain MT, Varkey JB, Nicholson B, et al. Temporal dynamics of host molecular responses differentiate symptomatic and asymptomatic influenza A infection. PLoS Genet. 2011; 7(8):e1002234.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  18. 18

    Liu TY, Burke T, Park LP, Woods CW, Zaas AK, Ginsburg GS, Hero AO. An individualized predictor of health and disease using paired reference and target samples. BMC Bioinformatics. 2016; 17(1):1–15.

    Google Scholar 

  19. 19

    Carrat F, Vergu E, Ferguson NM, Lemaitre M, Cauchemez S, Leach S, Valleron A-J. Time lines of infection and disease in human influenza: a review of volunteer challenge studies. Am J Epidemiol; 167(7):775–85.

  20. 20

    Aach J, Church G. Aligning gene expression time series with time warping algorithms. Bioinformatics. 2001; 17:495–508.

    CAS  PubMed  Article  Google Scholar 

  21. 21

    Godec J, Tan Y, Liberzon A, Tamayo P, Bhattacharya S, Butte AJ, Mesirov JP, Haining WN. Compendium of immune signatures identifies conserved and species-specific biology in response to inflammation. Immunity. 2016; 44(1):194–206.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  22. 22

    Golub GH, Van Loan CF. Matrix Computations, 3rd edn. Baltimore: The Johns Hopkins University Press; 1996.

    Google Scholar 

  23. 23

    Chen H, Zhang N. Graph-based change-point detection. Ann Stat. 2015; 43(1):139–76.

    Article  Google Scholar 

  24. 24

    Gönen M. Bayesian efficient multiple kernel learning. In: Proceedings of the 29th International Conference on Machine Learning. Edinburgh, Scotland, The UK: 2012.

  25. 25

    Lawrence ND, Jordan MI. Semi-supervised learning via Gaussian processes In: Saul LK, Weiss Y, Bottou L, editors. Advances in Neural Information Processing Systems 17. MIT Press: 2005. p. 753–60. http://papers.nips.cc/paper/2605-semi-supervised-learning-via-gaussian-processes.pdf.

  26. 26

    Camps-Valls G, Bruzzone L. Kernel-based methods for hyperspectral image classification. IEEE Trans Geosci Remote Sens. 2005; 43(6):1351–62.

    Article  Google Scholar 

  27. 27

    Tuia D, Camps-Valls G, Matasci G, Kanevski M. Learning relevant image features with multiple-kernel classification. IEEE Trans Geosci Remote Sens. 2010; 48(10):3780–91.

    Article  Google Scholar 

  28. 28

    Mercer J. Functions of positive and negative type and their connection with the theory of integral equations. Phil Trans R Soc A. 1909; 209:415–46.

    Google Scholar 

  29. 29

    Tipping ME. The relevance vector machine. Adv Neural Inf Process Syst. 2000; 12:652–8.

    Google Scholar 

  30. 30

    Bishop CM. Pattern recognition and machine learning. New York: Springer; 2007.

    Google Scholar 

Download references

Acknowledgements

The datasets used for the analyses described in this manuscript were contributed by Drs. Ephraim Tsalik and Geoffrey Ginsburg from Duke University and Durham VA. They were obtained as part of the Respiratory Viral DREAM Challenge through Synapse ID [syn5647810]. And we thank all members from the SJTU-Yale Joint Center of Biostatistics and Data Science.

Funding

This work was partially supported by the Neil Shen SJTU Medical Research Fund, National Science Foundation of China [NSFC 11801316], Natural Science Foundation of Shandong Province [ZR2019QA002], and National Statistical Scientific Research Project [2018LY63]. The funding body did not play any role in the design of the study, collection, analysis, and interpretation of the data, and writing of the manuscript.

Author information

Affiliations

Authors

Contributions

FD contributed to the implementation of model, algorithm design, and manuscript writing. YH contributed to the simulation design and manuscript writing. TW and HL contributed to the implementation of model, real data analysis design and manuscript writing. DH and HZ contributed to the work direction, implementation of model, algorithm design, and manuscript writing. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Hongyu Zhao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1

Supplementary Materials include six sections: Section 1, Graph-based Change-point Detection; Section 2, Details of Algorithm 1; Section 3, More Simulations; Section 4, Analysis of the Effects of Gene Sets; Section 5, Challenge Results; and Section 6, Figures.

Additional file 2

An example of the R code used in the paper.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Dong, F., He, Y., Wang, T. et al. Predicting viral exposure response from modeling the changes of co-expression networks using time series gene expression data. BMC Bioinformatics 21, 370 (2020). https://doi.org/10.1186/s12859-020-03705-0

Download citation

Keywords

  • Change point
  • Kernel method
  • Time-series gene expression data
  • Co-expression networks
  • Dynamic information
  • Model interpretation