Predicting viral exposure response from modeling the changes of co-expression networks using time series gene expression data

Background Deciphering the relationship between clinical responses and gene expression profiles may shed light on the mechanisms underlying diseases. Most existing literature has focused on exploring such relationship from cross-sectional gene expression data. It is likely that the dynamic nature of time-series gene expression data is more informative in predicting clinical response and revealing the physiological process of disease development. However, it remains challenging to extract useful dynamic information from time-series gene expression data. Results We propose a statistical framework built on considering co-expression network changes across time from time series gene expression data. It first detects change point for co-expression networks and then employs a Bayesian multiple kernel learning method to predict exposure response. There are two main novelties in our method: the use of change point detection to characterize the co-expression network dynamics, and the use of kernel function to measure the similarity between subjects. Our algorithm allows exposure response prediction using dynamic network information across a collection of informative gene sets. Through parameter estimations, our model has clear biological interpretations. The performance of our method on the simulated data under different scenarios demonstrates that the proposed algorithm has better explanatory power and classification accuracy than commonly used machine learning algorithms. The application of our method to time series gene expression profiles measured in peripheral blood from a group of subjects with respiratory viral exposure shows that our method can predict exposure response at early stage (within 24 h) and the informative gene sets are enriched for pathways related to respiratory and influenza virus infection. Conclusions The biological hypothesis in this paper is that the dynamic changes of the biological system are related to the clinical response. Our results suggest that when the relationship between the clinical response and a single gene or a gene set is not significant, we may benefit from studying the relationships among genes in gene sets that may lead to novel biological insights.

(Continued from previous page)

Conclusions:
The biological hypothesis in this paper is that the dynamic changes of the biological system are related to the clinical response. Our results suggest that when the relationship between the clinical response and a single gene or a gene set is not significant, we may benefit from studying the relationships among genes in gene sets that may lead to novel biological insights.
Keywords: Change point, Kernel method, Time-series gene expression data, Co-expression networks, Dynamic information, Model interpretation

Background
In genomics studies, time-series gene expression data [1][2][3] often need to be processed and analyzed. In 2016, DREAM CHALLENGES released an open challenge called 'Respiratory Viral DREAM Challenge: Discovering dynamic molecular signatures in response to virus exposure' (https://www.synapse.org/#!Synapse:syn5647810/wiki/399108). The aim was to develop early predictors of susceptibility and contagiousness based on expression profiles collected prior to and at early time points following viral exposure. Some work reported the differences of transcriptomics [4][5][6] in the host response between symptomatic and asymptomatic subjects exposure to respiratory viruses. Additionally, as what were done by most participants (https://www.synapse.org/#!Synapse:syn5647810/wiki/ 402364), some common machine learning algorithms [7] can be used if we treat the challenge as a prediction problem. The challenge results [see Additional file 1 for parts of the challenge results] demonstrate that the prediction performance significantly depends on the participants' models. However, we need to average the time series data across time or only use cross-sectional data at a time to perform ensemble learning, and the dynamic information of the time series data is lost in these approaches. Moreover, in the early stage of infection (within 24 h), there is little separation of the trajectories of genes among subjects with different clinical responses. Previous studies [8,9] also showed that the individual responses after exposure to respiratory virus are influenced not only by the baseline immune status of the host but also by the dynamics of the early host immune response immediately following exposure. If we only consider a single gene, there is no distinct pattern in both cross-sectional and dynamic data. It is difficult to differentiate between positive and negative groups by gene expression levels at early stage. In this paper, we resort to gene sets analysis to correlate exposure response with dynamic gene expression patterns in gene sets. To consider multiple genes, some methods have been proposed to infer the relationship between genes. For example, the Dynamic Bayesian Network (DBN) was used to establish the dynamic regulatory network [10]. We note that a number of groups have studied time-varying dynamic Bayesian networks (TV-DBN) to model the varying network structures and reveal the dynamics of biological systems [11,12]. The dynamic mixed membership stochastic block model (dMMSB) helps to infer the biological functions of genes through modeling the dynamic tomography of networks [13]. The review of differential network biology [14] advocated that differential network mapping at large scales may provide a deeper understanding of complex biological phenomena. The work [15] analyzed multiple differential co-expression networks based on time-course RNA-Seq data. Through Multiple Differential Modules (M-DMs), they found that dynamic modules are associated with the development of heart failure. These results in the literature suggest that considering the dynamics of networks may help us to better understand disease onset and progression. However, how to extract useful dynamic information from time-series gene expression data to build predictive model remains a challenging problem.
To study the relationship between viral exposure response and time-series gene expression data, we hypothesize that the changes (i.e. dynamics) of the relationship between genes in gene sets may be informative about viral exposure response, and propose a statistical framework to characterize and integrate dynamic information for response prediction where the model parameters have clear biological interpretations. The main innovations of the paper are: Firstly, we use spectral norm to extract information of the difference between two networks. Secondly, we model the changes of dynamic coexpression networks based on the graph-based change point detection method. Thirdly, we measure the similarity between two subjects by the relationship between gene trajectories.
The rest of the paper is organized as follows: In the "Results" Section, we evaluate the performance of our method using both simulated and real data. The results include data description and preprocessing, preliminary analysis and inference results. This is followed by the "Discussion" and "Conclusions" Sections. The "Methods" Section first introduces the notations, then describes the statistical models and inference procedure proposed in this manuscript.

Simulations
In this section we assess the performance of the proposed algorithm on the data simulated as follows. For simplicity, we fixed the number of genes G = 80.

Change point detection and parameter inference
The results under different scenarios are shown in Table 1. We have 4 gene sets indexed 1, 2, 3, and 4, respectively and the inferred parameters of these 4 gene sets are b 1 , b 2 , b 3 , and b 4 , respectively. As discussed in the simulation models, the subject label is the result of different change points in gene sets 2 and 4. For comparability, the absolute value of parameter b is denoted by |b|. When ρ is greater than 0.3, |b 2 | and |b 4 | are the largest in the 4 parameters which is consistent with the model structure. So when the difference between 0 and 1 is large enough, our method can identify the gene sets which contribute more to the response label. In Table 1, 'CHP' represents the average value over 100 replications for the estimation of change-point position, with the standard deviation in the parentheses. 'P-value' is the average p-value over 100 replications using graph-based change point detection method. 'CHP(%)' represents the proportion of times the change point is precisely detected in 100 simulations. When ρ is less than 0.1, the structure difference between 0 and 1 is small, and the detected change point may not be statistically significant. When ρ is greater than 0.5, there is more than 90% chance to detect the change point. Table 1 Results under different scenarios   Fig. 1, where 'FPR' represents false positive rate and 'TPR' represents true positive rate. We can see that there is more advantage of our method with an increasing value of ρ. The average AUC values are summarized in Table 2. The proposed algorithm has the highest average AUC value of 100 simulations when ρ is greater than 0.5. Moreover, the ' AUC' row of Table 1 shows the classification performance for the test set. We can see that the value of AUC increases with the increase of ρ, which is consistent with our model hypothesis, as it is easier to infer the labels with a larger ρ.
As described in the "Methods" section, our algorithm requires the given gene sets as input. So the performance of our algorithm may be affected by the way of grouping genes. We evaluated the performance of our algorithm in different ways of grouping genes. The AUC and prediction accuracy may depend on the grouping method, where a higher enrichment of signals in the pre-defined gene sets will lead to better

Data description and preprocessing
In this section, we evaluate the performance of our proposed method through real data analysis. Some challenge results related to this paper are provided in the Supplementary Materials [see Additional file 1]. The complete results can be found at the URL (https://www.synapse.org/#!Synapse:syn5647810/wiki/402364) (note that only the registered users can log into the website). The time-series gene expression data for this challenge were collected from healthy volunteers exposed to a respiratory virus within a controlled experimental setting where some became ill and others did not despite the same exposure. Data were derived from seven viral challenge experiments in which volunteers were exposed to one of four different respiratory viruses (Influenza H1N1, Influenza H3N2, Respiratory Syncytial Virus, or Rhinovirus) in order to find gene expression profiling signatures of susceptibility. Peripheral blood gene expression profiling was made at 55 time points ranging between -30 h (pre-exposure) and 672 h (post-exposure). The released data include 125 subjects from seven study centers with time-series gene expression data for 22,277 probes in peripheral blood for each subject, with a total of 2371 samples. Additionally, clinical information was also available, such as age, gender, and the time of samples measured. To reduce noise, we removed 7 subjects who were injected interfering viruses, and removed probes corresponding to multiple genes, and averaged the multiple probes corresponding to the same gene. We considered a total of 12,532 genes. Therefore, we have N = 118, G = 12, 532, and T = 55 for this data set. There are 68 subjects with positive labels and 50 subjects with negative labels. The overall data can be visualized by the heat map as shown in Fig. 2. We can see that the genes can be grouped into distinct clusters, while samples can not be clustered according to response. Moreover, different study centers are clustered together, suggesting possible batch effects.
In the following analysis, we normalized the gene expression data according to each time point to remove batch effects.

Preliminary analysis
A number of studies [16,17] reported the differences in the host response between symptomatic and asymptomatic subjects challenged with respiratory viruses. For simplicity, we call the symptomatic response group the positive group and the asymptomatic response group the negative group. Firstly, we analyzed the cross-sectional data and performed differential expression analysis at a single time point. No significant difference in single gene expression level was found between the positive and negative groups before 40 h. We further investigated the relationship between gene trajectories and responses. Some papers [9,17,18] reported that OAS1, IFI44L, IRF7 and CCR1 may be associated with the response. From the expression trajectories of OAS1, IFI44L, IRF7 and CCR1 shown in Fig. 3, we can see that the changes of expression variances for OAS1, IFI44L, IRF7 occur at latter stage and expression variance of CCR1 does not change over time. That is as a single dynamic time series for these related genes, the positive and negative groups have significant differences after 45 h, however they do not exhibit differences at the early stage (within 24 h). Therefore, if we only consider a single gene, there is no distinct pattern in Fig. 2 Heat map of the baseline gene expression data across subjects. Genes are clearly classified into different groups. The data seem to suggest two groups with subtle differences, but different viral response groups are not clearly separated with these two groups. On the other hand, centers DEE3 H1N1 and Rhinovirus UVA are in the same group, whereas the remaining centers are in the other group suggesting center/batch effects both cross-sectional and dynamic data. Therefore, it is difficult to differentiate between positive and negative groups by gene expression levels at early stage. On the other hand, the paper [19] shows that viral shedding increases sharply between 0.5 and 1 day (within 24 h) after exposure and consistently peaks on day 2. We resort to gene sets analysis to correlate exposure response with dynamic gene expression patterns in gene sets. Firstly, we selected the gene sets that may be related to viral exposure responses. We consider "SYMPTOMATIC-SC2" as the response label which is a binary variable indicating post-exposure maximum symptom score greater than six and then screen out differentially expressed genes from each cross-sectional gene expression data at 55 time points, even if it is not significant. This led to 55 gene sets. Secondly, for each gene set, we represent it as an undirected weighted network and the weight is given by gene expression similarity, where we used the Pearson correlation coefficient of two genes to define their similarity. That is the function h in the "Models" section is Pearson correlation coefficient. We have tried a number of definitions of similarity and Pearson correlation had better performance overall. For each gene set, we obtained 55 time-dependent networks. And we detected change points for these networks using the method introduced in the "Models" section. After change point detection, all gene sets are sorted according to the time of change point. Thirdly, we set up multiple kernel prediction model based on the gene sets in which the relationships among genes change at early stage. Each gene set is integrated into a kernel.

Results
We randomly selected 70% subjects as the training set and the remaining as the test set. The training set contained 83 subjects (35 subjects with negative label and 48 subjects with positive label), 12,532 genes, and up to 55 time points. We want to test the biological hypothesis that the dynamic networks with early change point contribute more to the response label. Figure 4 shows the prediction performance of the model for the test set However it is difficult to distinguish the positive and negative groups in early stage (within 24 h). c For gene IRF7, the pattern is similar to that of gene IFI44L. d For gene OAS1, at the 50th hour or so, the positive and negative groups can be better distinguished. However there is little separation of the trajectories of these genes for the positive and negative groups in the early stage when we added the gene sets in the order of the detected change point time. It can be seen from Fig. 4 that at the early stage, with more gene sets included, AUC increased. However, after more than 12 gene sets were included, AUC started to decrease, which indicates that an increasing number of gene sets does not lead to an increase of prediction accuracy. This is consistent with our hypothesis that networks that change in the early stages are associated with the response label. Moreover, the curve in Fig. 4 has a turning point at the 35th gene set when AUC starts to increase again, suggesting that those unchanged gene sets may also have information on exposure response. This may be because those unchanged gene sets are markers of the asymptomatic group, which is consistent with the stable performance of the negative group in Fig. 3. Next, we investigated the learning parameter vector b. In terms of 55 gene sets, we consider those gene sets among the top 12 in which the relationships among genes change at early stage. The inferred parameters are summarized in Table 3. The results show that the 44th, 2nd, 34th and 35th gene sets contribute more to the response than the other gene sets. By enrichment analysis for these four gene sets, we can identify pathways related to viruses as shown in Fig. 5. It can be seen that the top pathways are associated with viruses. Finally, we visualize the gene sets It can be seen that the AUC increases at the beginning, and has the highest value when the 12th gene set is included. It then oscillates downward as more gene sets are added until the 35th gene set, when AUC starts to increase again associated with response in Fig. 6. Take the positive group as an example. In Fig. 6 each red line represents the change over time of the systematic feature of the gene expression values collected from randomly selecting 80% subjects from the positive group. That is to say, every point on the red line corresponds to the spectral norm of the corresponding matrix of the co-expression network constructed by the genes from the 35th gene set at a certain time t. The result shows that there is a clear difference between the positive and negative groups. More importantly, at the early stage, it is very difficult to distinguish the positive and negative groups from the trajectory of a single gene, as shown in Fig. 3. However that is more obvious in Fig. 6, which substantiates our hypothesis that the label of response is related to the dynamic nature of the changing of a system (gene set) but not a single gene. We visualized the co-expression networks of the 35th gene set at time points 0, 12, 24, 48, 96 and 146, respectively [see Additional file 1]. It seems that the connections of gene modules became closer after the samples were exposed to the virus.

Discussion
In this paper, we adopt a screening approach to find potential gene sets which may be related to response. For this screening step, we do not consider multiple testing when we detect change points of the dynamic networks. We further identify the gene sets related to   Fig. 3. After the 40th hours, the positive group rises suddenly the response through the proposed Bayesian model. The screening step can be considered as a variable selection step where no response information is used. In addition, when there is no simple relationship between the clinical response and a single gene or a gene set (therefore it is challenging to have statistically significant results for marginal analysis), a model that studies the changes of the relationships among genes in gene sets may offer novel biological insights.

Conclusions
We have proposed a novel approach of modeling time-series gene expression data for inferring an individual's response to viral exposure. The biological hypothesis in this paper is that the dynamic changes of the system are related to the clinical response. Compared with previous time series analysis methods, we showed that change point detection for dynamic networks may be informative for the relationship between the clinical response and dynamic nature of the system (gene sets). Joint consideration of multiple kernels based on gene sets with dynamic network structures not only can predict an individual's clinical response, but can also help elucidate the biological pathways involved. The effectiveness of the proposed method was demonstrated through the analyses of both simulated and real data.
In this paper, we construct the co-expression networks for the gene sets at each time point separately using Pearson correlation. We note that other methods may be used. For example, we can construct networks incorporating some prior knowledge such as regulatory network at each time point to improve network robustness. Some model-based methods such as TV-DBN [11] can be used to construct dynamic networks. Network reconstruction [12] incorporating the temporal nature of the data may help improve the performance of our model. On the other hand, the selection of matrix similarity may influence the change point detection for the networks. It is worth studying the different methods of change-point estimation for networks in the future. Additionally, we considered the case where the response variable is binary. If the response variable is continuous, we will consider a continuous response in the Bayesian model. In real data analysis, we used Pearson correlation coefficient to define similarity function of kernel. Some other kernel functions can be tried, such as dynamic time warping (dtw) which has been applied to gene expression data [20]. In practice, cross-validation method can be carried out to select the optimal kernel function definition when the sample size is sufficiently large. In this paper, when we compute multiple kernels for integrating different dynamic gene sets, there is no consideration about relationship between different kernels. However, different gene sets may have overlapping genes, which may influence the estimation of change point. We will consider the Bayesian integration model with correlation information in the future.

Methods
The main aim of the paper is to identify gene sets related to viral exposure response and meanwhile predict a person's response using the dynamic relationships among genes in a gene set at early exposure stage. We assume that only some of the gene sets are informative about clinical responses. Firstly, the genes need to be organized into different gene sets based on some criteria. Here are some suggested ways to group genes. If there is prior biological knowledge, we can organize genes into different gene sets according to such knowledge. For example, for immune related diseases, the immune-related pathways in the database, MSigDB (http://software.broadinstitute.org/gsea/msigdb/index.jsp) [21], can be viewed as gene sets. Without prior knowledge, we may construct gene sets based on the observed data, e.g. genes with different expression levels at different time points. Secondly, the general framework of our method is summarized in Fig. 7. For each given gene set, time dependent networks are constructed. For example, at each time point, we can construct a fully connected network with genes as nodes and correlation coefficients between genes as weights. Because we hypothesize that the gene sets in which the relationships among genes change over time may be informative about the clinical response, we investigate whether the network has changed over time. The gene sets which change the network structures at early stage are candidate gene sets to predict clinical response. Then, we employ a Bayesian multiple kernel learning model to predict an individual's response. The key for kernel learning is the definition of similarity. We use the overall relationship between genes to define the similarity between subjects. More details are provided in the "Models" and "Inference" sections.

Models
For the kth gene set, the genes are collected in O k and let G k denote the number of genes in this set. At each time point, we can construct a network such as co-expression network, for genes in the set. Therefore, we can get T networks across the T time points, with these networks represented by T matrices {A 1 , ..., A T }, where A t (i, j), the (i, j)th entry of matrix A t , is derived from h x ·it , x ·jt , i, j ∈ O k and h is a function that defines the correlation or similarity between two genes in this set. The change point detection across these networks can be expressed as follow: where F 0 and F 1 are different probability measures on a nonzero measure set. Firstly, define the similarity between two matrices as where · 2 is the spectral norm of a matrix [22]. The reason for using spectral norm to measure the similarity between two matrices is, for a symmetric matrix, the spectral norm equals to the spectral radius of this symmetric matrix. From a geometric point of view, the spectral radius of a matrix represents the degree of stretching along its corresponding direction. Secondly, we can construct a graph on After identifying gene sets with early change points, we use a Bayesian model to integrate dynamic information from multiple gene sets. Assume that the indices of the selected gene sets are collected in S. Each gene set indexed by s can define a kernel matrix s . Denote the kernel matrix set = { s : s ∈ S}. We integrate all the |S| gene sets by the following multiple kernel learning model [24], where a = (a 1 , a 2 , ..., a N ) T is the sample weight vector, b is the kernel weight vector, e is bias and i s is the kernel vector which is the ith column of kernel matrix s . The (i, j)th element of s is defined by the similarity between subjects i and j. s (i, j) = φ s x i , x j , where the kernel function φ s is defined as , ∀i, j ∈ {1, 2, ..., N} and s ∈ S.
The (l, k)th entry of the matrix˜ s i is, (1), f can be considered as a latent variable [25] connecting the observed expression data x and labels y. Through the estimation of parameter b, we can infer which gene sets have more contribution to the response label y.

Inference
The main aim of this section is to infer the parameters {a, b, e} in model (1). We adopt a Bayesian framework because of two advantages. Firstly, compared with general kernel-based methods [26,27], kernel learning under a Bayesian framework reduces the requirement of kernel conditions, such as Mercer's kernel condition [28,29]. So we can select more flexible metrics to measure the similarity between subjects based on time series observations. Secondly, compared with general machine learning algorithms, such as SVMs, auxiliary parameters can also be inferred under a Bayesian framework [29]. Denote the priors {λ, γ , ω} corresponding to {a, b, e}, respectively. For computational convenience, we assume conjugate prior distributions [24] in the model. Let = α λ , β λ , α γ , β γ , α ω , β ω denote the hyper-parameter set for {λ, γ , ω} and L be an intermediate output variable for the iteration of parameters. All priors and parameters in the model are denoted by = {λ, γ , ω}  a, b, e, f , L . Hence, the conjugate Bayesian priors for the parameters are where δ(·) is the Kronecker delta function that returns 1 if the variable satisfies the restriction and 0 otherwise, and ν is a given margin parameter which is used to distinguish two categories. Next, we use variational approximation [30, chap.10] to estimate the parameters. The main idea of the algorithm is to approximate the marginal likelihood log p y|x by the lower bound L, where E represents the expectation of random variables and q( ) is the posterior distribution of . The exact formulas of the lower bound L are similar to those in the supplementary material of reference [24]. Hence, the approximate posterior distribution q(·) of each parameter can be computed by q(·) ∝ exp E q( \·) log p y, |x , where q ( \·) is the distribution of with the parameter (·) removed. Algorithm 1 summarizes the estimation process of model parameters a, b, e, f , L . After we obtain a trained model, the label for a new subject can be predicted by Eq. (1). More details about Algorithm 1 can be found in the Supplementary Materials [see Additional file 1].

Algorithm 1 Approximate posterior distributions of parameters Input:
: kernel matrix set of training data; y: labels of the samples in the training set; : hyper-parameters; iter: number of iterations.

Output:
Posterior distribution of each parameter in a, b, e, f , L . 1: initial parameters: mean μ and covariance matrix ; 2: repeat 3: compute μ r a and r a and related parameters; 4: compute μ r L and r L for the intermediate output L ; 5: compute μ r (e,b) and r (e,b) and related parameters; 6: compute μ r f of latent variable f and related parameters; 7: until r equals to iter.