Inference of radio-responsive gene regulatory networks using the graphical lasso algorithm

Background Inference of gene regulatory networks (GRNs) from gene microarray expression data is of great interest and remains a challenging task in systems biology. Despite many efforts to develop efficient computational methods, the successful modeling of GRNs thus far has been quite limited. To tackle this problem, we propose a novel framework to reconstruct radio-responsive GRNs based on the graphical lasso algorithm. In our attempt to study radiosensitivity, we reviewed the literature and analyzed two publicly available gene microarray datasets. The graphical lasso algorithm was applied to expression measurements for genes commonly found to be significant in these different analyses. Results Assuming that a protein-protein interaction network obtained from a reliable pathway database is a gold-standard network, a comparison between the networks estimated by the graphical lasso algorithm and the gold-standard network was performed. Statistically significant p-values were achieved when comparing the gold-standard network with networks estimated from one microarray dataset and when comparing the networks estimated from two microarray datasets. Conclusion Our results show the potential to identify new interactions between genes that are not present in a curated database and GRNs using microarray datasets via the graphical lasso algorithm.


Background
In recent years, there has been a great interest in identifying radio-responsive genes across the whole genome using gene microarray data in the field of radiation oncology. To develop new biomarkers for radiation exposure, Templin et al. used whole genome microarray and miRNA data generated from blood samples of patients who underwent total body irradiation in preparation for stem cell transplantation [1]. Rieger and Chu utilized oligonucleotide microarrays with cell lines collected from 15 healthy individuals to investigate the transcriptional response of 10,000 genes in DNA damage to ionizing radiation (IR) and ultraviolet (UV) radiation [2]. In another study, Rieger et al. explored transcriptional responses to radiation in lymphoblastoid cells collected from 57 patients and found 20 IR-responsive and 4 UV radiation-responsive genes predictive of radiation toxicity [3]. Eschrich et al. employed systems biology modeling to better understand radiosensitivity by identifying novel radiation-specific biomarkers [4]. With gene expression profiles from 48 human cancer cell lines, this biomarker discovery platform resulted in a key radiosensitivity network with 10 hub genes. In our previous study [5], we identified important radio-responsive genes using literature review and gene microarray data analysis. With a systems biology approach, we found a core radio-responsive protein interaction network and its key biological processes using gene ontology analysis.
Gene regulatory networks (GRNs) provide simplified representations and easy interpretation of biological processes in an organism under given conditions [6]. However, inference of GRNs remains a major challenging problem in systems biology, although a number of approaches have been proposed [7]. Cai et al. employed sparse structural equation models (SEMs) that integrate gene expression and cis-expression quantitative trait loci data for improving inference accuracy and proposed a systematic inference method for SEM estimation [8]. Based on Bayesian analysis using a non-parametric Gaussian process modeling, a novel method for inferring GRNs was proposed by Aijö and Lähdesmäki [9]. This approach enables the use of both time-series and steady-state gene expression measurements to improve the inference of GRNs. Menéndez et al. used a Gaussian Markov Random Field (GMRF) to deal with the problem of reverse engineering in GRNs and applied the graphical lasso algorithm to effectively estimate undirected graphical models [10]. Applying a lasso penalty to the problem of inverse covariance matrix estimation facilitated a fast and efficient calculation. The graphical lasso algorithm, which uses a blockwise coordinate descent approach to estimate a sparse graphical model, was proposed by Friedman et al. [11].
In this study, we employed a multi-component filtering process, based on a systems biology approach that was proposed in our previous study [5], that narrows down the list of gene candidates and applied the graphical lasso algorithm to microarray datasets to infer radioresponsive GRNs. To estimate the accuracy of the network modeling, the estimated networks were compared with a reliable protein interaction network produced by a manually curated protein interaction database.

Datasets
In our previous study [5], we attempted to investigate all putative genes implicated in radiation response through literature review using PubMed and Scopus search engines and found a list of 221 genes associated with radiation response. In addition, to identify significant radio-responsive genes, biological processes, and pathways, further analysis was performed using two publicly available microarray datasets (GSE23393 [1] and GSE1977 [2]) downloaded from Gene Expression Omnibus (GEO) database. In GSE1977, lymphoblastoid cells collected from 15 healthy individuals were exposed to 5-Gy radiation doses and harvested 4 hours later. To examine the change in gene expression after IR, only mock and IR cases were used. In GSE23393, blood was obtained from eight radiotherapy patients treated at our institution immediately before irradiation and at 4 hours after total body irradiation with 1.25-Gy X-rays. In this study, to explore GRNs associated with radiation response, we used the resulting information obtained from our literature review and reanalyzed the two microarray datasets.

Identification of significant genes
In the preprocessing stage, gene expression measurements were log-base-2 transformed, followed by quantile normalization across all samples. In our previous study [5], to identify significant genes that have considerable differential expression between before and after irradiation, we used a permutation test. In this study, we employed Significant Analysis of Microarrays (SAM) that resulted in a false discovery rate (FDR) and fold-change for each gene [12]. To minimize the possibility of omitting interesting genes, significant genes identified in the two different techniques were combined for further analysis.

Pathway analysis
Significant genes identified in our analysis were entered into a manually curated pathway database (MetaCore™, GeneGo, Inc.). This system leads to the most probable protein interaction network based on a list of genes uploaded by the user. We converted this resulting network to an undirected network and used it as a goldstandard network to assess the GRNs estimated by the graphical lasso algorithm.

Graphical lasso algorithm
In recent years, increasing attention has been paid to the estimation of sparse inverse covariance using L 1 (lasso) regularization [11,13,14]. This approach has been efficiently applied to the investigation of sparse undirected graphical models. In the graphical model, a node represents a feature (gene or protein in this study) and an edge between two nodes represents the relationship between the two corresponding features. In particular, each nonzero off-diagonal element in the inverse covariance matrix indicates that there is a dependency between the two features. That is, as the number of zero off-diagonal elements in the inverse covariance matrix increases, sparser graphs are yielded.
The underlying assumption behind the graphical lasso is that a data matrix X n×p consisting of p features measured on n observations follows a multivariate Gaussian distribution with mean μ and covariance Σ [10]. Let = −1 be the precision matrix, and let S denote the covariance matrix of the data. The problem is to maximize the penalized log-likelihood over nonnegative definite matrices , taking the form 1 is the L 1 norm that is the sum of the absolute values of the elements of −1 , and λ is a nonnegative tuning parameter that controls the sparsity of the estimated network. More specifically, large values of λ lead to sparse networks due to the lasso-type penalty, whereas small values of λ lead to dense networks. Note that the problem that maximizes the penalized log-likelihood is convex [11]. The subgradient equation for Eq. (1) is where W = −1 . The block coordinate descent approach partitions the rows and columns such that the target column is always the last, cycling through all features in turn. The partition of , W, and S is defined as: where the size of 11 , W 11 , and S 11 is (p − 1) × (p − 1); the size of θ 12 , w 12, and s 12 is (p − 1) × 1; and θ 22 The lower-right block of Eq. (2) is Since W = −1 , we have from which we derive W 11 θ 12 + w 12 θ 22 = 0, Then, we have can be rewritten as We note that Eq. (6) is equivalent to the gradient equation of the regular lasso problem. At each partition, a lasso regression is fitted. Then, w 12 and w 22 are calculated and inserted into W before a new partition is made. This procedure is repeated until W is converged. Table 1 summarizes the graphical lasso algorithm.

Evaluation of estimated gene regulatory networks
To compare the GRNs estimated using the graphical lasso algorithm with a gold-standard network produced by MetaCore software, we used Recall, Precision, and f-score metrics defined as follows [15]: where TP indicates the true positives (correctly inferred edges); FP represents the false positives (edges inferred in the estimated network, but not present in the gold-standard network); and FN signifies the false negatives (edges present in the gold-standard network, but not inferred).

Identification of significant genes using microarray datasets
Using SAM, significant genes were identified for the GSE1977 and GSE23393 datasets. For GSE1977, 61 genes were identified, with a cutoff of FDR < 20%, including 44 induced genes and 17 repressed genes. For GSE23393, 64 genes were identified, with a cutoff of FDR < 20%, including 19 induced genes and 45 repressed genes. Note that more genes were induced in GSE1977, whereas more genes were repressed in GSE23393. We examined commonly found genes in the two datasets. Table 2 shows the 21 overlapping genes with their fold-change and FDR. For these genes, considerable fold-changes were observed: averaged fold-changes for induced and repressed genes were 3.02 and 0.53, respectively, in GSE1977 and 3.12 and 0.60, respectively, in GSE23393. There was no significant difference in fold-change between GSE1977 and GSE23393 (p = 0.64 using Wilcoxon signed-rank test). It is worthwhile to note that induced genes had relatively lower FDR than repressed genes in both datasets.
In our previous study [5], we identified 20 genes that were significant in both datasets using a permutation test. It was found that 15 genes out of 20 were re-identified in this study, with 5 genes (BTG2, CDKN1A, MDM2, MR1, and XPC) excluded in SAM analysis. To minimize the possibility of excluding important genes, we combined the two gene lists identified in the SAM analysis and the permutation test, resulting in a list of 26 genes. Note that all 26 genes are in the list of 221 genes found in our literature review [5].

Pathway analysis results
These 26 genes were fed into MetaCore software to identify a key interacting network. Figure 1 illustrates a directly connected protein-protein interaction network produced by MetaCore. This network consisted of 16 directly connected nodes and 28 edges. For the remaining 10 genes, there was no single connection. In this network, the MYC gene seems to play a key role as a hub gene with 12 connections. We assume that this network is reliable, considering it as a gold-standard network to assess the accuracy of the GRNs estimated by the graphical lasso algorithm. Due to the static nature of the microarray datasets and the unidirectional property of GRNs resulting from the graphical lasso algorithm, we removed the directionality from the network in Figure 1 to ease the comparison between the unidirectional gold-standard network and the estimated GRNs.

Identification of gene regulatory networks
For the 16 genes shown in Figure 1, expression measurements after irradiation were extracted from the GSE1977 and GSE23393 microarray datasets and were used in the graphical lasso algorithm. Using different λ values, a large range of candidate networks were yielded. Table 3 summarizes our experimental results. Table 3(A) and (B) show comparison results of networks estimated from GSE1977 and GSE23393 datasets, respectively, with the gold-standard network. To calculate a p-value of a result obtained for each λ in Table 3, a simulation was performed. For example, using −log 10 (λ) = 1.2 in GSE1977, the estimated network had 27 edges with precision = 0.41, recall = 0.39, and f-score = 0.4. The common number of edges between the estimated network and the gold-standard network was 11. For the simulation, we randomly created two networks, both having 16 nodes: one with 28 edges (the number of edges in the gold-standard network) and the other with 27 edges, as in the estimated network.
Then, a f-score was calculated between the two networks. This procedure was repeated 10,000 times. From our simulation results, the number of times that the f-score was larger than 0.4 (note that the f-score in the above example was 0.4) was 50. Therefore, its p-value was 50/10000 = 0.005. As shown in Table 3, the networks estimated from GSE1977 had larger f-scores than those estimated from GSE23393 and overall, their p-values were statistically significant. Table 3(C) shows results of the comparison between the networks estimated from GSE1977 and GSE23393 datasets. We note that the f-scores were greater than those between the gold-standard network and the networks estimated using GSE1977 or GSE23393. Their p-values were statistically significant when −log 10 (λ) ≥ 1.1 in GSE1977 and −log 10 (λ) ≥ 0.8 in GSE23393. However, as −log 10 (λ) values increased, the complexity of models (the number of edges in estimated networks) also increased. We compared the gold-standard network with estimated networks that have the number of edges that was similar to the gold-standard network (the number of edges: 28). In comparison between the gold-standard network and networks estimated from GSE1977 with −log 10 (λ) = 1.2, the number of edges was 27, f-score was 0.4, and p-value was 0.005. In comparison between the gold-standard network and networks estimated from GSE23393 with −log 10 (λ) = 1.04, the number of edges was 30, f-score was 0.24, and p-value was 0.337. In a comparison of the two networks estimated using GSE1977 (with −log 10 (λ) = 1.2) and GSE23393 (with −log 10 (λ) = 1.04), the f-score was 0.42 and p-value was 0.004. Figure 2 shows the change in f-scores calculated from two networks estimated using GSE1977 and GSE23393 with different λ values. Figure 3 illustrates the change in networks estimated when the graphical lasso algorithm was applied to GSE1977 with 6 different λ values.

Discussion
To identify radio-responsive GRNs, we employed the graphical lasso algorithm using a list of radio-responsive genes selected from literature review and microarray data analysis. For the identification of significant genes with two publicly available microarray datasets (GSE1977 and GSE23393), we used SAM analysis. As a result, we found 21 genes to be important with FDR < 20%. These 21 genes included 15 of 20 genes that we identified in our previous study using a permutation test. This result suggests that we may miss important genes by using only a single statistical approach. In this study, we combined the two gene sets for further analysis, resulting in a list of 26 genes, including genes related to DNA repair: GADD45A, XPC, DDB2, and PCNA [2].
These 26 genes were entered into MetaCore software for a systems biology analysis. We identified a proteinprotein interaction network that was used as a goldstandard network in this study. This network consisted of 16 nodes and 28 edges. Interestingly, the MYC gene had 12 connections, implying that this gene may play an important role in DNA-damage-related biological functions. It has been known that MYC is a key regulator of cell proliferation and apoptosis [16][17][18][19]. However, the role of MYC is not fully understood, due to its contradictory effect in enhancing or reducing radioresponsiveness [20,21].
The f-scores calculated to compare the gold-standard network with the networks estimated from GSE1977 were larger than those found in comparing the goldstandard network with the networks estimated from GSE23393. It was noted that the f-scores calculated from the two networks estimated from GSE1977 and GSE23393 were larger than those calculated from the gold-standard network and the networks estimated from GSE1977. This means that there are some common   edges between the two networks estimated from GSE1977 and GSE23393, which are not present in the gold-standard network, suggesting that these edges could represent unknown associations between the genes.
In this study, we compared the estimated networks with a gold-standard network to investigate the change in the accuracy and number of interactions between genes with different λ values. However, in general, one needs to find the best regularization parameter λ, taking into account a tradeoff between model prediction and model complexity [10,22]. Bayesian information criterion (BIC) and Akaike criterion (AIC) are widely used for model selection.
Despite the lack of available datasets regarding radiation response, we demonstrated the presented methodology has the potential to identify radio-responsive GRNs via the graphical lasso algorithm based on literature review and microarray data analysis.

Conclusions
We demonstrated that the graphical lasso algorithm can be a useful tool to reconstruct GRNs. We used a biomarker filtering method proposed in our previous study based on literature review and microarray data analysis. To evaluate the accuracy of radio-responsive GRNs estimated from two publicly available microarray datasets, we used a reliable protein interaction network generated from a curated database as a gold-standard network. It is expected that our proposed method can be efficiently used to identify not only significant radio-responsive genes, but also radio-responsive GRNs.