 Methodology Article
 Open Access
 Published:
Incorporating prior biological knowledge for networkbased differential gene expression analysis using differentially weighted graphical LASSO
BMC Bioinformatics volume 18, Article number: 99 (2017)
Abstract
Background
Conventional differential gene expression analysis by methods such as student’s ttest, SAM, and Empirical Bayes often searches for statistically significant genes without considering the interactions among them. Networkbased approaches provide a natural way to study these interactions and to investigate the rewiring interactions in disease versus control groups. In this paper, we apply weighted graphical LASSO (wgLASSO) algorithm to integrate a datadriven network model with prior biological knowledge (i.e., proteinprotein interactions) for biological network inference. We propose a novel differentially weighted graphical LASSO (dwgLASSO) algorithm that builds groupspecific networks and perform networkbased differential gene expression analysis to select biomarker candidates by considering their topological differences between the groups.
Results
Through simulation, we showed that wgLASSO can achieve better performance in building biologically relevant networks than purely datadriven models (e.g., neighbor selection, graphical LASSO), even when only a moderate level of information is available as prior biological knowledge. We evaluated the performance of dwgLASSO for survival time prediction using two microarray breast cancer datasets previously reported by Bild et al. and van de Vijver et al. Compared with the top 10 significant genes selected by conventional differential gene expression analysis method, the top 10 significant genes selected by dwgLASSO in the dataset from Bild et al. led to a significantly improved survival time prediction in the independent dataset from van de Vijver et al. Among the 10 genes selected by dwgLASSO, UBE2S, SALL2, XBP1 and KIAA0922 have been confirmed by literature survey to be highly relevant in breast cancer biomarker discovery study. Additionally, we tested dwgLASSO on TCGA RNAseq data acquired from patients with hepatocellular carcinoma (HCC) on tumors samples and their corresponding nontumorous liver tissues. Improved sensitivity, specificity and area under curve (AUC) were observed when comparing dwgLASSO with conventional differential gene expression analysis method.
Conclusions
The proposed networkbased differential gene expression analysis algorithm dwgLASSO can achieve better performance than conventional differential gene expression analysis methods by integrating information at both gene expression and network topology levels. The incorporation of prior biological knowledge can lead to the identification of biologically meaningful genes in cancer biomarker studies.
Background
Typically, a differential gene expression analysis (e.g., student’s ttest, SAM, Empirical Bayes, etc.) is performed to identify genes with significant changes between biologically disparate groups [1–3]. However, independent studies for the same clinical types of patients often lead to different sets of significant genes and had only few in common [4]. This may be attributed to the fact that genes are members of strongly intertwined biological pathways and are highly interactive with each other. Without considering these interactions, differential gene expression analysis will easily yield biased result and lead to a fragmented picture.
Networkbased methods provide a natural framework to study the interactions among genes [5]. Datadriven network model reconstructs biological networks solely based on statistical evidence. Relevance network is one common datadriven network model [6, 7]. It uses correlation or mutual information to measure the “relevance” between genes and sets a hard threshold to connect high relevant pairs. Relevance network has extensive application due to its simplicity and easy implementation. However, its drawback becomes significant when the variable number increases: it confounds direct and indirect associations [8]. For example, a strong correlation for gene pair XY and XZ will introduce a less strong but probably still statistically significant correlation for gene pair YZ. As a result, when the number of genes is large, relevance network tends to generate overcomplicated networks that contain overwhelming false positives. Bayesian network is another classic datadriven network model [9]. Unlike undirected graphs such as relevance networks, Bayesian networks generate directed acyclic graphs, in which each edge indicates a conditional dependence relationship between two genes given their parents. The benefits of using Bayesian networks are: 1) By modeling conditional dependence relationship, Bayesian networks only identify direct associations; 2) With directions in the graph, Bayesian networks allow to infer causal relationship. However, it’s challenging to apply Bayesian networks on highthroughput omic data since learning the structure of Bayesian networks for high dimensional data is timeconsuming and can be statistically unreliable. Additionally, Bayesian network cannot model cyclic structures, such as feedback loops, which are common in biological networks.
Recently, Gaussian graphical models (GGMs) have been increasingly applied on biological network inference [10–12]. Similar to Bayesian network, GGMs can remove the effect of indirect associations through estimation of the conditional dependence relationship. At the same time, they generate undirected graphs and have no limitation on modeling only acyclic structures. In GGMs, a connection between two nodes corresponds to a nonzero entry in the inverse covariance matrix (i.e., precision matrix), which indicates a conditional dependency between these two nodes given the others. GGMs dates back to early 1970s when Dempster introduced “covariance selection” problem [13]. The conventional approach to solve this problem relies on statistical test (e.g., deviation tests) and forward/backward selection procedure [14]. This is not feasible for highthroughput omic data when the number of genes is ranging from several hundred to thousands while the number of samples are only tens to hundreds. In addition, the “small n, large p” scenario for omic data (i.e., sample size is far less than the variable number), makes maximum likelihood estimation (MLE) of precision matrix not to exist because the sample covariance matrix is rank deficient. To deal with these issues, Schäfer et al. proposed to combine MoorePenrose pseudoinverse and bootstrapping technique to approximate the precision matrix [15]. Others applied ℓ _{1} regularization to get a sparse network [16–18]. Taking into account of the sparsity property of biological networks and the computational burden of bootstrapping, ℓ _{1} regularization methods are preferred. Among various ℓ _{1} regularization methods, Meinshausen et al. performed ℓ _{1} regularized linear regression (i.e., LASSO) for each node to select its “neighbors” [16]. Given all its neighbors, one node is conditionally independent with the remaining ones. Since LASSO is performed for each node, this ‘neighbor selection’ approach may face a consistency problem. For example, while gene X is selected as Y’s neighbor, gene Y may not be selected as X’s neighbor when performing LASSO for gene X and gene Y separately. Compared with neighbor selection method, a more reasonable approach is graphical LASSO, which directly estimates precision matrix by applying ℓ _{1} regulation on the elements of the precision matrix to obtain a sparse estimated precision matrix [17, 18]. We will pursuit the extension of graphical LASSO in this paper.
In additional to datadriven network models, there are many publicly available databases such as STRING (http://stringdb.org), KEGG (http://www.genome.jp/kegg), BioGRID(http://thebiogrid.org/), and ConsensusPathDB (http://consensuspathdb.org/), where one can extract various types of interactions including proteinprotein, signaling, and gene regulatory interactions [19–22]. Biological networks reconstructed from these databases have been reported useful. For example, Chuang et al. reconstructed proteinprotein interaction (PPI) network from multiple databases to help identify markers of metastasis for breast cancer studies using gene expression data [23]. They overlaid the gene expression value on its corresponding protein in the network and searched for subnetworks whose activities across all patients were highly discriminative of metastasis. By doing this, they found several hub genes related to known breast cancer mutations, while these genes were not found significant by conventional differential gene expression analysis. They also reported that the identified subnetworks are more reproducible between different breast cancer cohorts than individual gene markers. However, databases are far from being complete. Networks constructed purely based on the databases have a large number of false negatives. In addition, databases are seldom specific to a certain disease, so the interactions that exist in the databases may not be reflective of the patient population under study. In contrast, datadriven models are likely to have a large number of false positives due to background noise. Considering this, an appropriate approach to integrate the prior biological knowledge from databases and datadriven network model is desirable for more robust and biologically relevant network reconstruction [24].
Previously, prior biological knowledge has been incorporated into the neighbor selection method [25]. It relies on the Bayesian interpretation of LASSO and assigns two different prior distributions for connections that are present in the database and those are not. Recently, weighted graphical LASSO (wgLASSO) has been proposed to incorporate prior biological knowledge into graphical LASSO by assigning different weights to the entries of precision matrix [26]. In this work, we extend the original wgLASSO algorithm, explain this idea from a Bayesian perspective, and perform comprehensive comparisons between wgLASSO and competing datadriven network models (e.g., neighbor selection, graphical LASSO). Additionally, exploring the topological changes between biological disparate groups may lead to new discoveries that cannot be identified by conventional differential gene expression analysis [27–29]. For example, highdegree nodes (i.e., hubs) that only exist in one of the biologically disparate groups may indicate the regulatory rule of the hub genes only in that group. Knowledgefused differential dependency network (KDDN) is a recently proposed method to construct knowledge incorporated network that can show the rewiring connections between two groups [29]. An opensource Cytoscape app is available for easy implementation [30]. In this paper, we propose a novel algorithm called differentially weighted graphical LASSO (dwgLASSO) for networkbased differential gene expression analysis. This is achieved by building separate networks for biologically disparate groups using wgLASSO, exploring the topological changes between different groups, and prioritizing significant gene list from conventional differential gene expression analysis as shown in Fig. 1. Other previously reported methods include those that focus on integrating prior biological knowledge into datadriven network model to identify subnetworks that are related to the disease under study [31, 32]. Our work differs with these methods since we compute a differential network score for each gene and prioritize them for subsequent analysis rather than outputting a subnetwork list for biological interpretation. Also, methods that directly incorporate gene networks or prior biological knowledge into statistical models for classification and regression tasks have been reported [33, 34]. The rationale is that functionally linked genes tend to be coregulated and coexpressed, and therefore should be treated similarly in the statistical model. Our work leaves the statistical model untouched. Instead, it focuses on using the best set of gene biomarkers as an input to the statistical model. This is considered to have advantages over providing multiple linked genes from the network whose expression values have similar patterns. We show the application of dwgLASSO on two independent microarray datasets from breast cancer patients for survival time prediction, and on TCGA RNAseq data acquired from patients with hepatocellular carcinoma (HCC) for classification task between tumor samples and their corresponding nontumorous liver tissues.
The rest of the paper is organized as follows. “Methods” section introduces the extended wgLASSO algorithm and the proposed dwgLASSO for networkbased differential gene expression analysis. “Results and discussion” section presents the results of wgLASSO and dwgLASSO based on simulation, microarray and RNAseq data. Finally, “Conclusion” section summarizes our work and discusses possible future extensions.
Methods
Network inference using wgLASSO
Consider a centered and scaled data matrix \(\mathbf {X}_{n\times p} \left (\mathrm {i.e.}, \sum _{i=1}^{n} x_{ij}=0, \sum _{i=1}^{n} x_{ij}^{2}=1\right)\), it measures the intensities of p genes on n samples, from a pdimensional Gaussian distribution with zero means on each dimension and positive definite covariance matrix Σ _{ p×p } (i.e., \(\mathbf {X} \sim \mathcal {N}(\mathbf {0},\mathbf {\Sigma })\)). Suppose the sample size n is far less than the variable number p (i.e., n≪p), then the MLE of the precision matrix (i.e., Θ=Σ ^{−1}) does not exist since the sample covariance matrix S is rank deficient. If we further assume Θ is sparse, then a ℓ _{1} regularization term can be added to the negative loglikelihood function f(XΘ)=− log detΘ+tr(S Θ) for a sparse precision matrix estimation as shown in Eq. (1). Graphical LASSO is an algorithm to efficiently solve Eq. (1) by using block coordinate descent [8, 9]. Once the sparse precision matrix \(\hat {\boldsymbol {\Theta }}\) is obtained, a nonzero element in \(\hat {\boldsymbol {\Theta }}\) (i.e., \(\hat {\theta }_{ij}\neq 0\)) indicates a conditional dependence between x _{ i } and x _{ j } given the others. For network \(\mathcal {G}=\{(i,j);1\le i<j\le p\}\), we have \(\hat {\mathcal {G}}=\{(i,j):\hat {\theta }_{ij}\neq 0\}\).
where Θ is the precision matrix, Θ≻0 is the constraint that Θ has to be positive definite, S is the sample covariance matrix, tr denotes the trace, the sum of the diagonal elements in a matrix, ‖Θ‖_{1} represents the ℓ _{1} norm of Θ, the sum of the absolute values of all the elements in Θ, and λ is the tuning parameter controlling the sparsity of Θ.
LASSO based estimates have a Bayesian interpretation [35]. \(\hat {\boldsymbol {\Theta }}\) is the maximum a posteriori (MAP) estimate for the posterior distribution p(ΘX) with a Laplacian prior distribution p(Θ) as shown in Eq. (2). The LASSO term λ‖Θ‖_{1} in Eq. (1) is now part of p(Θ)= exp(−λ‖Θ‖_{1}) with zero means and a scaling parameter λ. From the Bayesian perspective, p(Θ) encodes the prior knowledge of the network topology. For a database that contains only binary information (connecting or not) for a given gene pair, a natural way is to assign two different scaling parameters λ _{1} and λ _{2} for connecting pairs and those are not connected, as shown in Eq. (3). For connecting pairs, their Laplacian prior distribution is diffused, while for nonconnecting pairs their Laplacian prior distribution is concentrated (i.e., λ _{1}≫λ _{2}). In another word, a larger penalty will be assigned to nonconnecting pairs to increase the chance of their corresponding entries in Θ to shrink to zero. In reality, tuning λ _{1} and λ _{2} at the same time involves two dimensional grid search, which is quite timeconsuming for highdimensional data. An extreme solution to set λ _{2}=0 links all the connecting gene pairs from the database in the graph, neglecting the fact that the database might contain some spurious connections for the disease under study.
Instead of using the binary information, a continuous confidence score is more suitable to incorporate prior biological knowledge into graphical LASSO. The confidence score can be obtained from multiple resources. For example, an estimated functional association score for PPIs is provided by STRING database. We scale this confidence score into the range [0,1] and create a weight matrix W _{ p×p }. In W, 1 indicates a complete trust for a gene pair to be connected, 0 represents that no evidence supports a gene pair to be connected. In this way, we can assign different penalties to different gene pairs as shown in Eq. (4). Compared to Eq. (3), (4) also gives larger penalty for less likely connecting gene pairs, but now there is only one tuning parameter λ. For a fixed λ, R package glasso can solve Eq. (4) efficiently given W [17].
where 1 is all 1 matrix, W is the weight matrix containing the confidence score for each gene pair and ∗ represents the elementwise multiplication between two matrices.
For LASSO based optimization problem as shown in Eq. (4), tuning the parameter λ is crucial since it controls the sparsity of the output \(\hat {\boldsymbol {\Theta }}\). Typically, λ is tuned by crossvalidation, Akaike information criterion (AIC), Bayesian information criterion (BIC), or stability selection [36]. Considering that AIC and BIC often lead to data underfitting (i.e., oversparse network) and stability selection requires extensive computational time, we prefer to use cross validation with one standard error rule to select the optimal tuning parameter λ ^{opt}. By using one standard error rule, we can achieve the simplest (most regularized) model whose error is within one standard deviation of the minimal error. Our wgLASSO algorithm is shown below.
Networkbased differential gene expression analysis using dwgLASSO
Figure 2 shows the framework of the proposed dwgLASSO algorithm for networkbased differential gene expression analysis. dwgLASSO prioritizes the significant list obtained from the conventional differential gene expression analysis based on the topological changes between the groupspecific networks built by wgLASSO. Specifically, dwgLASSO first performs differential gene expression analysis to obtain a list of significant genes whose expression values differ between the two biologically disparate groups. Then based on these significant genes, dwgLASSO builds group specific networks using wgLASSO. After the networks are constructed, dwgLASSO calculates a differential network score for each gene in the significant list based on the topological changes between the two groupspecific networks. In calculating the differential network score, dwgLASSO first computes the node degree for each gene in both networks, meaning the number of neighbors each gene is connected with. Then considering the size of the two networks are different, the node degrees are scaled into the range [0,1]. At last, the differential network score for one gene is computed as the absolute value of the difference between the two associated scaled node degrees from different groups. Finally, with the differential network scores, dwgLASSO prioritizes the significant list from the conventional differential gene expression analysis in a decreasing order. The prioritized gene list is used for subsequent analysis such as building classification or regression models. We believe dwgLASSO can help classification or regression models to achieve better prediction performance since the prioritized list integrates information at the gene expression and network structure levels. More than that, the incorporation of prior biological knowledge is more likely to identify biologically meaningful genes. Detailed algorithm for dwgLASSO is shown below.
Results and discussion
Simulation data
Biological networks are reported to be scalefree, which means the degree distribution of the network follows a power law [37]. We considered this scalefree property of biological network in generating simulation data using R package huge [38]. Using huge, a scalefree network was built by inputting the node number p. The sparsity of the network s is fixed, depending on p. For example, when the node number is 100, the sparsity of the network is 0.02, indicating only 2% of all possible connections (i.e., \(\frac {p\times (p1)}{2}\)) exist in the scalefree network. Once the scalefree network is built, huge creates the true precision matrix Θ _{ true } based on the network topology and the positive definite constraint Θ _{ true }≻0 so that Σ _{ true }=(Θ _{ true })^{−1} exists. At last, simulation data \(\mathbf {X}_{n\times p} \sim \mathcal {N}(\mathbf {0},\mathbf {\Sigma }_{true})\) was generated.
We created simulation datasets with various p and n, as seen in Table 1. The weight matrix W, which contains prior biological knowledge, was constructed based on Θ _{ true }. In reality, databases may also contain spurious connections for the disease under study. To evaluate how the incorrect connections in W will impact wgLASSO, we introduced an additional metric, acc. When a c c=60%, we randomly reassigned 40% incorrect connections in W. Specifically, W was created as follows. Initially, for zero entries in Θ _{ true }, the corresponding entries in W were also zero; for nonzero entries in Θ _{ true }, the corresponding entries in W were randomly generated from the uniform distribution \(\mathcal {U}(0,1)\). Then, we randomly assigned incorrect connections into W based on the acc value while keeping the total connections in W the same as those in Θ _{ true }. Under the assumption that incorrect entries in W should have lower confidence scores compared to those of correct entries, we generated incorrect entries from the uniform distribution \(\mathcal {U}(0,0.5)\).
We estimated the true network topology by using neighbor selection, graphical LASSO, and the proposed wgLASSO methods. For neighbor selection method, two strategies were applied to deal with the inconsistency problem. Neighbor selection with “or” operator accepted inconsistent connections while neighbor selection with “and” operator rejected them. To make a fair comparison, we tuned the regularization parameter in each method to ensure the output network has the same sparsity as the true network (i.e., s=0.02 for p=100, s=0.004 for p=500). For each n and p scenario, we regenerated X _{ n×p } 100 times, calculated the false positives and false negatives of connections for each method, and listed their means and standard deviations in Table 1. To evaluate how the incorrect connections in W would impact the performance of wgLASSO, we randomly reassigned 40% (a c c=60%) and 60% (a c c=40%) incorrect prior biological knowledge in W. From Table 1, we can conclude that the estimated network from wgLASSO has much less false positives and false negatives, compared with those from neighbor selection and graphical LASSO methods. A decrease of acc in W would lead to more false positives and false negatives from wgLASSO, but it still outperforms neighbor selection and graphical LASSO methods when the acc in W is only as moderate as 40%.
To make more comprehensive comparison, we plotted precision recall curve to evaluate the performance of neighbor selection, graphical LASSO and wgLASSO methods. We ran the above methods with p=100,n=50 and a c c=40% in W, computed the precision and recall, and generated the plot as shown in Fig. 3. From Fig. 3, wgLASSO displays a clear improvement over neighbor selection and graphical LASSO methods. This agrees with our expectation since wgLASSO considers whether the connection has supporting evidence from database and how well it fits the data in the model.
Microarray data
We applied the proposed dwgLASSO algorithm on two breast cancer microarray datasets: Bild et al. and van de Vijver et al. datasets [39, 40]. The former includes 158 patients with all their survival records, and was used for training. We excluded patients with less than 5year followup time. Among the remaining patients, 42 with less than 5year survival during the followup time were considered to form high risk group while the other 60 formed the low risk group. van de Vijver et al. dataset contains 295 breast cancer patients, together with their survival records, and was used for independent testing. Both datasets are available at PRECOG website (https://precog.stanford.edu), an online repository for querying cancer gene expression and clinical data, and have been preprocessed for subsequent statistical analysis [41]. The raw Bild et al. and van de Vijver et al. datasets are also available at Gene Expression Omnibus (GSE3143) and R package seventyGeneData, respectively [42].
Our interest is to obtain a prioritized significant gene list based on dwgLASSO for more accurate survival time prediction. The workflow is shown in Fig. 4. We first performed univariate analysis on Bild et al. dataset to select a list of statistically significant genes based on concordance index between the expression value and survival time [43]. This lead to a total of 58 genes whose adjusted pvalues were less than 0.05. The inflation of Type I error caused by multiple testing was controlled by the false discovery rate (FDR) using the BenjaminiHochberg procedure. The total 58 significant genes are included in Additional file 1: Table S1 along with their associated adjusted pvalues. We then applied wgLASSO algorithm to build two separate networks using the total 58 significant genes for the high risk and low risk groups, respectively. The weight matrix W was constructed based on the confidence scores from STRING database after inputting the 58 significant genes to investigate the PPIs among them. For gene pairs with no confidence scores from STRING, we assigned the corresponding entries in W to zeros. In wgLASSO, we performed 10fold cross validation and chose the optimal tuning parameter λ ^{opt} by one standard error rule. Fig. 5 shows our chose of λ ^{opt}: λ ^{opt}=0.223 for high risk group and λ ^{opt}=0.184 for low risk group. From the networks, we calculated the node degree for each gene in two groups \(\left (d_{i}^{h}, d_{i}^{l}\right)\), scaled them based on the network size \(\left (sd_{i}^{h}, sd_{i}^{l}\right)\), and computed the differential network score \(\left (dns_{i}=sd_{i}^{h}sd_{i}^{l}\right)\). At last, we prioritized the 58 significant genes based on the network differential scores in a decreasing order.
To evaluate whether dwgLASSO could lead to more accurate survival time prediction, we tested the prioritized gene list using different methods on the independent van de Vijver et al. dataset. The 295 patients were divided into high risk and low risk groups according to the risk scores calculated using multivariate Cox regression from the top 10 significant genes based on dwgLASSO, a competing prior knowledge incorporated network analysis method (i.e., KDDN), and conventional differential gene expression analysis (i.e., concordance index). Unlike dwgLASSO that builds groupspecific networks, KDDN generates only one network with all rewiring connections. From the network constructed by KDDN, we computed the node degree for each gene to help prioritize the significant gene list. KaplanMeier survival analysis was then performed to evaluate the performance of the above three scenarios. The resulting survival curves are shown in Figs. 6 a, b, and d. To evaluate how much the incorporation of prior biological knowledge contributes to the improved performance of dwgLASSO, we tested the top 10 significant genes selected based on dwgLASSO with no prior biological knowledge incorporated (i.e., W=0). The resulting survival curve is shown in Fig. 6 c. As expected, dwgLASSO with no prior biological knowledge incorporated is equivalent to using graphical LASSO in building group specific networks (Fig. 4). As illustrated in Fig. 6, the top 10 significant genes from dwgLASSO with prior biological knowledge incorporated yielded the best performance (p−value=7.01×10^{−7}, hazard ratio =3.325), compared to the top 10 significant genes from KDDN (p−value=7.46×10^{−7}, hazard ratio =3.304), the top 10 significant genes based on dwgLASSO with no prior biological knowledge incorporated (p−value=0.00031, hazard ratio =2.316), and the top 10 significant genes based on concordance index (p−value=0.002, hazard ratio =2.037). We believe the improved performance achieved by dwgLASSO and KDDN are due to the extra information provided from the topological changes between high risk and low risk groups. Also, dwgLASSO and KDDN benefit from incorporating prior biological knowledge to obtain more reliable and biologically relevant genes shared across independent datasets, leading to better prediction performance than those that do not use prior biological knowledge (Fig. 6). Table 2 presents the top 10 significant genes selected based on concordance index and dwgLASSO with prior biological knowledge incorporated, together with their adjusted pvalues. The top 10 genes from the other methods are presented in Additional files 2: Table S2.
Among the top 10 significant genes based on dwgLASSO in Table 2, UBE2S has been reported to be overexpressed in breast cancer [44]. The authors showed UBE2S knockdown suppressed the malignant characteristics of breast cancer cells, such as migration, invasion, and anchorageindependent growth. SALL2 has also been reported as a predictor of lymph node metastasis in breast cancer [45]. Unlike UBE2S, SALL2 was identified as a tumor suppressor gene that can suppress cell growth when overexpressed [46]. Additionally, XBP1 has been reported to be activated in triplenegative breast cancer and has a pivotal role in the tumorigenicity and progression of this breast cancer subtype [47]. KIAA0922 has also been reported as a novel inhibitor of Wnt signaling pathway, which is closely related to breast cancer [48]. None of UBE2S, SALL2, XBP1 and KIAA0922 is among the top 10 significant genes based on concordance index according to Table 2.
In Fig. 7, we showed the neighbors of UBE2S and SALL2 in the high risk and low risk groups based on the networks created by wgLASSO from Bild et al. dataset. UBE2S is overexpressed in the high risk group while SALL2 is underexpressed. This agrees with that UBE2S is a promoting breast cancer gene while SALL2 is a suppressor breast cancer gene [44, 46]. Additionally, UBE2S has higher scaled node degree in the high risk group while SALL2 has higher scaled node degree in the low risk group \(\left (sd_{UBE2S}^{h}=0.286, sd_{UBE2S}^{l}=0.778, sd_{SALL2}^{h}=\right.\left.1.0, sd_{SALL2}^{l}=0.444\right)\). This shows, as a promoting breast cancer gene, UBE2S is more actively connected with its neighbors in the high risk group while, the suppressor breast cancer gene, SALL2 is more actively connected with its neighbors in the low risk group. In Fig. 7, yellow edges represent connections that have been supported from STRING database. We can see that these connections based on prior biological knowledge are not always showing up from the output of wgLASSO. This is a nice property since prior biological knowledge only provides evidence. We still need the support from the data to make a connection. Therefore, by integrating prior biological knowledge into datadriven models, we expect to build more robust and biologically relevant networks. Table 3 shows the survival time prediction performance when the top 5, top 10 and top 15 significant genes are selected by each of the four methods as the inputs to the multivariate Cox regression model (Fig. 6). In all three cases, the proposed dwgLASSO algorithm with prior biological knowledge incorporated achieved the best performance, followed by KDDN and dwgLASSO without prior biological knowledge incorporated. The method that relies purely on concordance index had the least performance.
RNAseq data
Using UCSC Cancer Genomics Browser, we obtained TCGA RNAseq data (level 3) acquired from patients with HCC [49]. The RNAseq data was acquired by analysis of 423 liver tissues, including 371 primary tumor, 50 solid normal and 2 recurrent tumor samples based on Illumina HiSeq 2000 RNA Sequencing platform and mapped onto the human genome coordinates using UCSC cgData HUGO probeMap. Among the 371 primary tumor samples, 50 of them can find its corresponding solid normal samples. To evaluate dwgLASSO on RNAseq data, we apply a workflow shown in Fig. 8. We first picked out the 100 samples whose tumor tissues and their corresponding nontumorous tissues can both be found. Randomly, we selected 60 of them (30 tumor samples and their corresponding normal samples) as the training dataset. The remaining 40 samples (20 tumor samples and their corresponding normal samples) were used as testing dataset 1. Considering testing dataset 1 only contains 40 samples, we created testing dataset 2 by combining the above 40 samples and the remaining 321 tumor samples whose corresponding normal samples cannot be found. With testing datasets 1 and 2, we evaluated the performance of dwgLASSO on both balanced and large sample size datasets. Specifically, we preprocessed RNAseq data using R package DESeq2 on the training dataset [50]. From DESeq2, we selected statistically significant genes whose adjusted pvalues were less than 0.01 for subsequent analysis. At this step, the number of significant genes is typically between 1000 and 2000. We prioritized the significant gene list based on dwgLASSO. From the prioritized gene list, the top 5 genes were selected to train a logistic regression classifier to distinguish tumor and normal samples. The trained logistic regression classifier was finally evaluated on testing datasets 1 and 2. To compare dwgLASSO with other methods, we also prioritized the significant gene list based on adjusted pvalue from DESeq2, dwgLASSO without prior biological knowledge incorporated and KDDN, built logistic regression classifier using the top 5 genes on the prioritized list and evaluated the trained classifier on the testing datasets 1 and 2.
The above procedure was repeated 100 times and the means and standard deviations for sensitivity, specificity and area under curve (AUC) were calculated using testing datasets 1 and 2 as shown in Table 4. In agreement with microarray data, networkbased methods with prior biological knowledge incorporated yielded the best performance, followed by networkbased method without prior biological knowledge incorporated, and the conventional differential gene expression analysis method was the worst. This is expected since both dwgLASSO and KDDN methods take into account of the changes of genes at gene expression and network topology levels, and incorporate prior biological knowledge into their network models.
Conclusion
In this paper, we apply a novel network inference method, wgLASSO to integrate prior biological knowledge into a datadriven model. We also propose a new networkbased differential gene expression analysis method dwgLASSO for better identification of genes associated with biologically disparate groups. Simulation results show that wgLASSO can achieve better performance in building biologically relevant networks than purely datadriven models (e.g., neighbor selection and graphical LASSO) even when only a moderate level of information is available as prior biological knowledge. We demonstrate the performance of dwgLASSO in survival time prediction using two independent microarray breast cancer datasets previously published by Bild et al. and van de Vijver et al. The top 10 genes selected by dwgLASSO based on the dataset from Bild et al. dataset lead to a significantly improved survival time prediction on the dataset from van de Vijver et al., compared with the top 10 significant genes obtained by conventional differential gene expression analysis. Among the top 10 genes selected by dwgLASSO, UBE2S, SALL2, XBP1 and KIAA0922 have been previously reported to be relevant in breast cancer biomarker discovery study. We also tested dwgLASSO using TCGA RNAseq data acquired from patients with HCC on tumors samples and their corresponding nontumorous liver tissues. Improved sensitivity, specificity and AUC were observed when comparing dwgLASSO with conventional differential gene expression analysis method. Future research work will focus on applying dwgLASSO on other omic studies such as proteomics and metabolomics.
Abbreviations
 AIC:

Akaike information criterion
 AUC:

area under curve
 BIC:

Bayesian information criterion
 DEA:

differential gene expression analysis
 DN:

differential network
 dwgLASSO:

differentially weighted graphical LASSO
 FDR:

false discovery rate
 FP:

false positives
 FN:

false negatives
 GGMs:

Gaussian graphical models
 HCC:

hepatocellular carcinoma
 KDDN:

Knowledgefused differential dependency network
 LASSO:

least absolute shrinkage and selectioin operator
 MAP:

maximum a posteriori
 MLE:

maximum likelihood estimation
 PPI:

proteinprotein interaction
 wgLASSO:

weighted graphical LASSO
References
Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001; 98(9):5116–21.
Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol. 2001; 8(1):37–52.
Efron B, Tibshirani R, Storey JD, Tusher V. Empirical bayes analysis of a microarray experiment. J Am Stat Assoc. 2001; 96(456):1151–60.
EinDor L, Zuk O, Domany E. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci. 2006; 103(15):5923–8.
Zuo Y, Yu G, Zhang C, Ressom HW. A new approach for multiomic data integration. In: Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference On: 2014. p. 214–7. IEEE.
Butte AJ, Kohane IS. Unsupervised knowledge discovery in medical databases using relevance networks. In: Proceedings of the AMIA Symposium: 1999. p. 711. American Medical Informatics Association.
Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput. 2000; 5:418–29. Citeseer.
Zuo Y, Yu G, Tadesse MG, Ressom HW. Biological network inference using low order partial correlation. Methods. 2014; 69(3):266–73.
Friedman N, Linial M, Nachman I, Pe’er D. Using Bayesian networks to analyze expression data. J Comput Biol. 2000; 7(34):601–20.
Toh H, Horimoto K. Inference of a genetic network by a combined approach of cluster analysis and graphical gaussian modeling. Bioinformatics. 2002; 18(2):287–97.
Dobra A, Hans C, Jones B, Nevins JR, Yao G, West M. Sparse graphical models for exploring gene expression data. J Multivar Anal. 2004; 90(1):196–212.
Kishino H, Waddell PJ. Correspondence analysis of genes and tissue types and finding genetic links from microarray data. Genome Inform. 2000; 11:83–95.
Dempster AP. Covariance selection. Biometrics. 1972;:157–75.
Edwards D. Introduction to Graphical Modelling: Springer Science & Business Media; 2012.
Schäfer J, Strimmer K. An empirical bayes approach to inferring largescale gene association networks. Bioinformatics. 2005; 21(6):754–64.
Meinshausen N, Bühlmann P. Highdimensional graphs and variable selection with the lasso. Ann Stat. 2006;:1436–62.
Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008; 9(3):432–41.
Mazumder R, Hastie T. The graphical lasso: New insights and alternatives. Electron J Stat. 2012; 6:2125.
Snel B, Lehmann G, Bork P, Huynen MA. String: a webserver to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 2000; 28(18):3442–4.
Kanehisa M, Goto S. Kegg: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000; 28(1):27–30.
Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. Biogrid: a general repository for interaction datasets. Nucleic Acids Res. 2006; 34(suppl 1):535–9.
Kamburov A, Wierling C, Lehrach H, Herwig R. Consensuspathdb–a database for integrating human functional interaction networks. Nucleic Acids Res. 2009; 37(suppl 1):623–8.
Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Networkbased classification of breast cancer metastasis. Mol Syst Biol. 2007; 3(1).
Zuo Y, Yu G, Ressom HW. Integrating prior biological knowledge and graphical lasso for network inference. In: Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference On: 2015. p. 1543–7. IEEE.
Wang Z, Xu W, San Lucas FA, Liu Y. Incorporating prior knowledge into gene network study. Bioinformatics. 2013; 29(20):2633–40.
Li Y, Jackson SA. Gene network reconstruction by integration of prior biological knowledge. G3: Genes—Genomes—Genetics. 2015; 5(6):1075–9.
Ha MJ, Baladandayuthapani V, Do KA. Dingo: differential network analysis in genomics. Bioinformatics. 2015; 31(21):3413–20.
Zhang B, Li H, Riggins RB, Zhan M, Xuan J, Zhang Z, Hoffman EP, Clarke R, Wang Y. Differential dependency network analysis to identify conditionspecific topological changes in biological networks. Bioinformatics. 2009; 25(4):526–32.
Tian Y, Zhang B, Hoffman EP, Clarke R, Zhang Z, Shih IM, Xuan J, Herrington DM, Wang Y. Knowledgefused differential dependency network models for detecting significant rewiring in biological networks. BMC Syst Biol. 2014; 8(1):1.
Tian Y, Zhang B, Hoffman EP, Clarke R, Zhang Z, Shih IM, Xuan J, Herrington DM, Wang Y. Kddn: an opensource cytoscape app for constructing differential dependency networks with significant rewiring. Bioinformatics. 2015; 31(2):287–9.
Wei Z, Li H. A markov random field model for networkbased analysis of genomic data. Bioinformatics. 2007; 23(12):1537–44.
Chouvardas P, Kollias G, Nikolaou C. Inferring active regulatory networks from gene expression data using a combination of prior knowledge and enrichment analysis. BMC Bioinforma. 2016; 17(5):319.
Wei P, Pan W. Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics. 2008; 24(3):404–11.
Binder H, Schumacher M. Incorporating pathway information into boosting estimation of highdimensional risk prediction models. BMC Bioinforma. 2009; 10(1):1.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B (Methodol). 1996;:267–88.
Meinshausen N, Bühlmann P. Stability selection. J R Stat Soc Series B (Stat Methodol). 2010; 72(4):417–73.
Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004; 5(2):101–13.
Zhao T, Liu H, Roeder K, Lafferty J, Wasserman L. The huge package for highdimensional undirected graph estimation in r. J Mach Learn Res. 2012; 13(1):1059–62.
Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A, et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2006; 439(7074):353–7.
Van De Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, et al. A geneexpression signature as a predictor of survival in breast cancer. N Engl J Med. 2002; 347(25):1999–2009.
Gentles AJ, Newman AM, Liu CL, Bratman SV, Feng W, Kim D, Nair VS, Xu Y, Khuong A, Hoang CD, et al. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat Med. 2015; 21(8):938–45.
Marchionni L, Afsari B, Geman D, Leek JT. A simple and reproducible breast cancer prognostic test. BMC Genomics. 2013; 14(1):1.
Pencina MJ, D’Agostino RB. Overall c as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med. 2004; 23(13):2109–23.
Ayesha AK, Hyodo T, Asano E, Sato N, Mansour MA, Ito S, Hamaguchi M, Senga T. UBE2S is associated with malignant characteristics of breast cancer cells. Tumor Biol. 2016; 37(1):763–72.
Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, Bild A, Iversen ES, Liao M, Chen CM, et al. Gene expression predictors of breast cancer outcomes. Lancet. 2003; 361(9369):1590–6.
Liu H, Adler AS, Segal E, Chang HY. A transcriptional program mediating entry into cellular quiescence. PLoS Genet. 2007; 3(6):91.
Chen X, Iliopoulos D, Zhang Q, Tang Q, Greenblatt MB, Hatziapostolou M, Lim E, Tam WL, Ni M, Chen Y, et al. Xbp1 promotes triplenegative breast cancer by controlling the hif1 [agr] pathway. Nature. 2014; 508(7494):103–7.
Maharzi N, Parietti V, Nelson E, Denti S, RobledoSarmiento M, Setterblad N, Parcelier A, Pla M, Sigaux F, Gluckman JC, et al. Identification of tmem131l as a novel regulator of thymocyte proliferation in humans. J Immunol. 2013; 190(12):6187–97.
Zhu J, Sanborn JZ, Benz S, Szeto C, Hsu F, Kuhn RM, Karolchik D, Archie J, Lenburg ME, Esserman LJ, et al. The ucsc cancer genomics browser. Nat Methods. 2009; 6(4):239–40.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for rnaseq data with deseq2. Genome Biol. 2014; 15(12):1.
Acknowledgements
None.
Funding
This work is in part supported by the National Institutes of Health Grants U01CA185188, R01CA143420 and R01GM086746 awarded to HWR.
Availability of supporting data
The datasets supporting the results of this article are included within the article and its additional files, or from referenced sources.
Authors’ contributions
YZ designed and implemented the algorithms, conducted the synthetic simulation and real data application, and drafted the paper. YC collected the two microarray datasets and participated in generating the results for the synthetic simulation and real data application. GY and RL provided expertise in differential expression analysis. HWR directed the project and completed the paper. All authors reviewed and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Authors and Affiliations
Corresponding author
Additional files
Additional file 1
Table S1: The total 58 significant genes along with their associated adjusted pvalues. (CSV 1.09 kb)
Additional file 2
Table S2: The top 10 significant genes based on KDDN and dwgLASSO without prior biological knowledge incorporated. (CSV 4.00 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zuo, Y., Cui, Y., Yu, G. et al. Incorporating prior biological knowledge for networkbased differential gene expression analysis using differentially weighted graphical LASSO. BMC Bioinformatics 18, 99 (2017). https://doi.org/10.1186/s1285901715151
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901715151
Keywords
 Prior biological knowledge
 Gaussian graphical model
 Weighted graphical LASSO
 Networkbased differential gene expression analysis