A powerful score-based statistical test for group difference in weighted biological networks

Ji, Jiadong; Yuan, Zhongshang; Zhang, Xiaoshuai; Xue, Fuzhong

doi:10.1186/s12859-016-0916-x

Methodology article
Open access
Published: 12 February 2016

A powerful score-based statistical test for group difference in weighted biological networks

Jiadong Ji¹,
Zhongshang Yuan¹,
Xiaoshuai Zhang¹ &
…
Fuzhong Xue¹

BMC Bioinformatics volume 17, Article number: 86 (2016) Cite this article

3028 Accesses
10 Citations
1 Altmetric
Metrics details

Abstract

Background

Complex disease is largely determined by a number of biomolecules interwoven into networks, rather than a single biomolecule. A key but inadequately addressed issue is how to test possible differences of the networks between two groups. Group-level comparison of network properties may shed light on underlying disease mechanisms and benefit the design of drug targets for complex diseases. We therefore proposed a powerful score-based statistic to detect group difference in weighted networks, which simultaneously capture the vertex changes and edge changes.

Results

Simulation studies indicated that the proposed network difference measure (NetDifM) was stable and outperformed other methods existed, under various sample sizes and network topology structure. One application to real data about GWAS of leprosy successfully identified the specific gene interaction network contributing to leprosy. For additional gene expression data of ovarian cancer, two candidate subnetworks, PI3K-AKT and Notch signaling pathways, were considered and identified respectively.

Conclusions

The proposed method, accounting for the vertex changes and edge changes simultaneously, is valid and powerful to capture the group difference of biological networks.

Background

From the perspective of network medicine, a disease phenotype is rarely a consequence of an abnormality in a single biomolecule (e.g. RNA, protein, metabolite), but reflects various pathobiological processes that interact in a complex network [1]. One single factor can exert certain effects on disease when studying it alone, while this effect may be vanished when studying it within one network or pathway [2], and vice versa. Therefore, biomolecules should be studied in the context of biological systems they are involved in [3]. Perhaps the abstraction for a biological system is network, such as transcriptional regulatory networks, signal transduction networks, protein interaction networks and metabolic networks [4]. In the biological networks, the vertices represent biomolecules, and the edges represent functional, causal or physical interactions between the vertices. Different types of networks are often used to represent diverse types of biological processes, each of which stores information about levels and interactions related to specific biomolecules [5]. In fact, different physiological conditions may manifest as different networks. Moreover, complex disease are multi-factorial and analyzing the individual components is insufficient, so it is essential to dissect how these components interact with each other and weave into one network, and how these interactions differ with respect to disease status. Statistical comparison of group difference in biological networks or pathways can provide new insight into the underlying disease mechanism, and have extensive biomedical and clinical applications [6–10]. For instance, a better understanding of the effects of molecular interconnectedness on disease progression may lead to superior identification of disease related biomolecules and pathways, which may further offer more effective targets for drug development in a cost-effective and timely manner.

On the other hand, identifying biological and environmental causes of human diseases has always been one of the central concerns in epidemiology. However, traditional epidemiology has been pejoratively labeled as the “black box” epidemiology [11], and increasingly suffered from criticism partly due to the fact that too much attention has been paid to the identification of a single risk factor rather than the network or pathway related to a disease, which led to difficulty to deeply explore disease mechanism [12]. It is highly desirable to unlock the black box underlying observed associations and to illuminate the biological interaction mechanisms of disease-related components hiding behind the black box. There are unmet needs to access multi-level omics data on the population level. Thanks to the development of recent technological advances in high-throughput omics platforms, we can enable the acquisition of omics data at unprecedented speed and amounts, and further integrate various omics data with traditional epidemiology to promote the development of systems epidemiology [12, 13]. It offers the potential to provide new insight into the underlying disease mechanisms in breadth and depth at human population level. Under the framework of systems epidemiology, the focus has been shifted from identification of single factor to exploration of specific networks or pathways contributing to disease [14, 15].

In a word, it is in great needs to do statistical comparison of biological networks. So far, several methods have been proposed to utilize network topology information to carry out various biomedical tasks. Langfelder et al. [16] provided several measures for comparing network topologies for weighted correlation networks. Zhang et al. [17] proposed a differential dependency network analysis to detect topological changes in transcriptional networks between subclasses of breast cancer. Valcarcel et al. [18] introduced a formal statistical method for the differential analysis of molecular pair-wise associations via network representation. Recently, Yates et al. [19] developed an additive element-wise-based dissimilarity measure for biological network hypothesis tests. However, most of above methods mainly focus on the difference of network topology and are unable to account for the changes of vertices. Although in most situations, the differences of single vertices-wise or edges-wise may be weak, their aggregated differences can be quite strong. It will undoubtedly lose statistical power to only consider the connection with the topological difference between two networks. Meanwhile, non-parametric permutation procedures are commonly employed to perform analysis in most existed methods, which were inevitably time-consuming, especially for big data.

The premise for networks or pathways comparison is to make clear the cause of biological network difference. Generally, both changes in the nodes level (e.g. the magnitude of each gene’s expression change in regulation network), and changes in the edges (e.g. the strength of connection) can lead to the whole network difference. Reverter et al. [20] presented an analytical procedure to simultaneously identify genes that were differentially expressed (DE) as well as genes that are differentially connected (DC) for unweighted networks. Their methods depend heavily on the specific definition of DE and DC, and the two-component mixture of bi-variate normal distribution may be violated in other biological networks, though it may be reasonable in gene expression network. Furthermore, weighted (correlation-based) networks are commonly encountered and increasingly relevant in biological applications [16, 21–23]. Statistical methods for detecting the group difference in weighted biological networks are still in great demand.

In this article, we proposed a new score-based network difference measure (NetDifM) as a powerful test statistic to detect group difference in weighted networks, which simultaneously capture the difference of vertices and edges. Various simulations were conducted to evaluate its type I error and statistical power, compared with other existed method. Two real data sets about GWAS of leprosy and gene expression of ovarian cancer were further analyzed to show their performance in practice.

Methods

Statistical model

A weighted biological network can be modeled as an undirected graph G = (V, E), where V is the set of vertices (sometimes referred to nodes) and E is the set of edges (also called connections). Two vertices, representing biomolecules, are connected by an undirected edge if there is an association between them. Each edge can be assigned a weight resembling the strength of evidence for the association.

We denote the two networks in two groups (cases and controls) by G ^D and G ^C respectively, suppose both G ^D and G ^C have the same number of vertices (M) and edges (K), the null hypothesis test is H ₀ : G ^D = G ^C. Let V(G ^D) and E(G ^D) denote the set of all vertices and edges in G ^D, x ^D_i x ^D_j indicate the edge $ {x}_i^D-{x}_j^D $ (i ≠ j, i, j = 1, 2, ⋯, M), β ^D_ij represent the strength of association between x ^D_i and x ^D_j if x ^D_i x ^D_j existed. For individual l (l = 1, 2, ⋯, N), the trait value is denoted as Y _l, $ {Y}_l=\left\{\begin{array}{l}1,\kern0.4em l\in case\\ {}0\kern0.1em ,\kern0.2em l\in control\end{array}\right. $ and the i ^th vertex is denoted as x _li. Under H ₀, networks in two groups are identical not only in the average vertices levels but also in the connection strength. The score test vector of vertices is D ^V = (D ^V₁ , D ^V₂ , ⋯, D ^V_M )^T, where $ {D}_i^V={\displaystyle {\sum}_{l=1}^N\left({Y}_l-\overline{Y}\right){x}_{li}} $ measures the contribution of vertex x _i to the disease. Analogously, the score test vector of edges is D ^E = (D ^E₁ , D ^E₂ , ⋯, D ^E_K )^T, where $ {D}_k^E={\displaystyle {\sum}_{l=1}^N\left({Y}_l-\overline{Y}\right)\left({x}_{li}-{\overline{x}}_i\right)\left({x}_{lj}-{\overline{x}}_j\right)} $ measures the contribution of connection strength between x _i and x _j (i.e. the k ^th edge) to the disease. Then the proposed overall network difference measure can be defined as

$$ NetDifM={D}^T{\varSigma}^{-1}D, $$

where $ D=\left(\begin{array}{l}{D}^V\\ {}{D}^E\end{array}\right), $ Σ = cov(D) = (σ _pq)_{(M + K) × (M + K)}, p, q = 1, 2, ⋯, (M + K). The estimated covariance matrix of D can be represented as $ \left(\kern-.9em \begin{array}{cc}{\varSigma}_V& {\varSigma}_{VE}\\ {}\kern1em {\varSigma}_{VE}^T\kern1em & {\varSigma}_E\end{array}\right) $ and calculated as follows,

1) For Σ _V, p, q = 1, 2, ⋯, M,

$$ {\sigma}_{pq}={\displaystyle {\sum}_{l=1}^N{\left({Y}_l-\overline{Y}\right)}^2\operatorname{cov}\left({X}_p,{X}_q\right)},{X}_p=\left({x}_{1p},{x}_{2p},\cdots, {x}_{Np}\right); $$

2) For Σ _E, p, q = M + 1, M + 2, ⋯, M + K,

$$ {\sigma}_{pq}={\displaystyle {\sum}_{l=1}^N{\left({Y}_l-\overline{Y}\right)}^2\operatorname{cov}\left({Z}_p,{Z}_q\right)},{Z}_p=\left({X}_i-{\overline{X}}_i\right)\times \left({X}_j-{\overline{X}}_j\right); $$

3) For Σ _VE, p = 1, 2, ⋯, M , q = M + 1, M + 2, ⋯, M + K

$$ {\sigma}_{pq}={\displaystyle {\sum}_{l=1}^N{\left({Y}_l-\overline{Y}\right)}^2\operatorname{cov}\left({X}_p,{Z}_q\right)}. $$

Naturally, for a large sample size, NetDifM has a centered χ ²(M + K) distribution under the null hypothesis (The derivation of NetDifM see Additional file 1). When sample size is small, a permutation procedure can be conducted as follows to get the empirical P value and assess the statistical significance. (1) calculate the test statistic NetDifM from the original sample; (2) randomly assign subjects to one of two groups, the sample size for each group keeps the same as the original data; (3) perform the above steps Q times and calculate the test statistic for each repeated sample, NetDifM ^*_i , i = 1, 2, ⋯, Q; (4) estimate the P value according to $ p- value=\frac{1}{Q}{\displaystyle \sum_{i=1}^QI\left( NetDif{M}_i^{*}> NetDifM\right)} $, where I(•) is the indicator function.

Intuitively, considering the elements of one network are not more than vertices and edges, an element-wise measure may be expected to have the ability to identify the group difference in biological networks. A vertices and edges wise difference measure (VEWDM), through the simple summation of vertices difference and edges, can be constructed as

$$ VEWDM=\frac{1}{M}{\displaystyle \sum_{i=1}^M{T}_i^2+\frac{1}{K}{\displaystyle \sum_{i=1}^M{\displaystyle \sum_{j\ne i}^M{U}_{ij}^2}}} $$

where $ {T}_i=\frac{{\overline{x}}_i^D-{\overline{x}}_i^C}{\sqrt{\operatorname{var}\left({\overline{x}}_i^D\right)+\operatorname{var}\left({\overline{x}}_i^C\right)}} $, $ {\overline{x}}_i^D $ and $ {\overline{x}}_i^C $ indicate the sample mean of x _i in G ^D and G ^C respectively; $ {U}_{ij}=\frac{\beta_{ij}^D-{\beta}_{ij}^C}{\sqrt{\operatorname{var}\left({\beta}_{ij}^D\right)+\operatorname{var}\left({\beta}_{ij}^C\right)}} $, when the strength of edges are quantified by the Pearson correlations r _ij, $ {U}_{ij}=\left({z}_{ij}^D-{z}_{ij}^C\right)/\sqrt{\frac{1}{n_D-3}+\frac{1}{n_C-3}} $, z _ij are the Fisher-transforms of the correlations $ {z}_{ij}=\frac{1}{2} \ln \frac{1+{r}_{ij}}{1-{r}_{ij}} $, n _D and n _C are the corresponding sample sizes. The proposed VEWDM seems to be the linear combination of some chi-square statistics. Actually, the asymptotic theoretical properties have been explored for the linear combination of independent chi-squares [24]. Nevertheless, it is quite complex and difficult here to obtain the asymptotic distribution of VEWDM, since the correlations between different vertices and different edges statistics (T _i and U _ij) highly depend on the specific network structure. In other words, the asymptotic properties might be network-specific. To solve this problem, we adopted the strategy of a permutation test to make statistical inference.

Simulation

Simulations were designed to evaluate the type I error rate and statistical power, and to compare the performance of NetDifM, VEWDM and Yates’D (recently proposed dissimilarity measure in Yates et al. [19]) under different sample size and network topological structure. The statistical power is defined as the probability that the test correctly rejects the null hypothesis (H ₀) when the alternative hypothesis (H ₁) is true. It can be estimated from the empirical distribution as the proportion of observations for which the p-value is less than given nominal level (α = 0.05). For the specific network with M vertices and K edges, the simulated M-dimensional variables (vertices) were generated from a multivariate normal distribution N _M(μ, Σ) with mean vector μ and covariance matrix Σ using the R “mvtnorm” package. We specified the mean vector μ = (μ ₁, μ ₂, ⋯, μ _M) and covariance matrix Σ = (I _ij β _ij)_M × M, where $ {I}_{ij}=\left\{\begin{array}{l}1,\kern0.4em {x}_i{x}_j\in E(G)\\ {}0,\kern0.3em {x}_i{x}_j\notin E(G)\kern0.1em \end{array}\right. $, i ≠ j, i, j = 1, 2, ⋯, M was the indicator function.

Under the null hypotheses (H ₀), the data was generated by setting μ ^D = μ ^C and I ^D_ij β ^D_ij = I ^C_ij β ^C_ij . 1000 simulations were repeated to assess the type I error of the above methods given various sample sizes under different network scale, including network with ten vertices and 21 edges (Fig. 1a) and another one with 20 vertices and 45 edges (Fig. 1b). Under the alternative hypotheses, three scenarios were considered.

For scenario 1, only vertices (average levels) were different between G ^D and G ^C. We simulated two vertices difference with μ ^D₃ − μ ^C₃ = 0.2 and μ ^D₁₀ − μ ^C₁₀ = 0.2 under the network topological structure as in Fig. 1a. Three vertices difference with μ ^D₆ − μ ^C₆ = 0.1, μ ^D₈ − μ ^C₈ = 0.2 and μ ^D₁₇ − μ ^C₁₇ = 0.2 were also designed under the relative larger scale network as in Fig. 1b.

For scenario 2, only edges (connection strength) were different between G ^D and G ^C. We simulated three edges difference with β ^D₃₅ − β ^C₃₅ = − 0.2, β ^D₅₇ − β ^C₅₇ = 0.2 and β ^D_8,10 − β ^C_8,10 = 0.2 under the network topological structure as in Fig. 1a. Seven edges difference with β ^C_10,11 − β ^D_10,11 = − 0.2, β ^C_1,10 − β ^D_1,10 = β ^C_2,12 − β ^D_2,12 = β ^C_4,14 − β ^D_4,14 = β ^C_12,19 − β ^D_12,19 = 0.2 and β ^C₂₃ − β ^D₂₃ = β ^C_7,20 − β ^D_7,20 = 0.1 were also designed under the relative larger scale network as in Fig. 1b.

For scenario 3, both vertices and edges were designed to be different between G ^D and G ^C. Under the topology structure as in Fig. 1a, we combined the settings in scenario 1 and scenario 2 (the difference only existed for orange vertices and red edges), so as for the topology structure as in Fig. 1b.

For each scenario, 1000 replicates were used to evaluate statistical power. P-values of the proposed NetDifM were assessed using both the asymptotic distribution and the empirical null distribution obtained from 1000 times permutations.

It is necessary to assess the performance of the proposed statistics, given the deviation from the normal distribution. For the network with ten vertices and 21 edges, we designed the following two scenarios, (i) conduct the exponential transformation for five vertices randomly chosen among the ten vertices; (ii) do the exponential transformation for all ten vertices. For each scenario, we evaluate the type I error rate and statistical power under the same three scenarios mentioned as above.

Furthermore, we also provided estimated computing time under different network with sample size 200 and 1000 permutations, using one laptop as an Intel PentiumT4400 with a 2.2 GHz CPU and 2 GB RAM.

Application

GWAS data of leprosy

By Ingenuity Pathways Analysis, a plausible biologic network underlying susceptibility to leprosy was provided for depicting the functional relationship between some susceptibility genes identified from GWAS of leprosy [25]. Using the initial GWAS data with 706 cases and 514 controls, we attempted to detect the difference of the networks including genes CARD6, HLA-DRB1, RIPK2, CARD9 and IFNG. All participants provided written informed consent, and the study was approved by the ethics committees of Shandong Academy of Medical Science [25]. These five genes located on different chromosomes and totally contained 914 SNPs (see in Additional file 1: Table S1), with network structure given in Fig. 2a. Since each gene contained several SNPs, we first employed principal component analysis and conducted the statistical network comparison by treating the first principal component as the network vertices.

Gene expression data of ovarian cancer

Tothillet al. [26] used high-density expression oligonucleotide microarrays for profiling 285 well-annotated serous and endometrioid invasive ovarian, fallopian tube, and peritoneal cancers. The subjects were divided into a C1 subtype, with 83 patients, and a C2–C6 subtype, with 168 patients. Complete expression data are available on GEO (accession GSE9899). The proposed method was also applied to detect the network difference between these two groups (C1 versus C2–C6). Here we studied two specific pathways (PI3K-AKT signaling pathway and Notch signaling pathway) reported in the literatures [27–29] to be relevant to ovarian cancer. The subnetwork of PI3K-AKT signaling pathway from the KEGG pathway database with 7 genes contained 26 probe sets (see in Additional file 1: Table S2) was abstracted into network with topological structure shown in Fig. 2b. The subnetwork of Notch signaling pathway with 14 genes contained 45 probe sets (see in Additional file 1: Table S2) was abstracted into network with topological structure shown in Fig. 2c. All probe sets corresponding to the same gene symbol were first averaged to obtain gene-level expression measurements.

Results

Simulation

Shown in Table 1 are the estimated type I error rates of the proposed NetDifM, VEWDM, NetDifM based on permutation (NetDifMpm) and Yates’D under different sample sizes. It reveals that all type I error rates based on permutation procedure are close to given nominal level (α = 0.05). NetDifM tended to be slightly conservative under small sample size, while using the asymptotic distribution maintains a good control of type I error rate under large sample size.

Table 1 Type I error rates of four methods

Full size table

Figure 3a indicates the statistical power under scenario 1 when only vertices changed with the network topological structure demonstrates in Fig. 1a. As expected, Yates’D has no power due to that it can only capture the edge change. NetDifM is substantially more powerful than VEWDM, and it is slightly less powerful than its permutation-based type. Similar trend could also be found under the relative larger scale network (Fig. 3b).

Shown in Fig. 4 is the performance under scenario 2 (only edges change). The statistical power of all methods monotonically increases with sample size. NetDifM has much higher power than that of VEWDM and Yates’D. The power of NetDifM and Yates’D keep almost the same in the larger scale network (Fig. 4b).

Figure 5 illustrates the statistical power under the scenario 3 (both edges and vertices change). Both NetDifM and VEWDM are much more powerful than Yates’D, and NetDifM still has the best performance.

To evaluate the scalability and computational efficiency of the proposed methods, we also conducted simulations using a larger network with 40 vertices and 54 edges (see in Additional file 1: Figure S1). It is clear that the proposed NetDifM still have the best performance (see in Additional file 1: Figure S2; Additional file 1: Table S3).

Figure 6 indicates the results given the deviation from the normal distribution, where the proposed statistics still hold the relative better performance than other method.

Table 2 presents the estimated computing time. It indicates that the proposed NetDifM indeed runs fast, and the computational time increases as the network become larger.

Table 2 Computing time (seconds) of four methods with sample size 200 and 1000 permutations

Full size table

The results of application

Network difference analysis for both the GWAS of leprosy and gene expression data of ovarian cancer further confirm in practice that the proposed NetDifM captured the network changes. Shown in Table 3 are the results of the proposed NetDifM and other methods for detecting the network difference between two groups. The difference of gene interaction network with 5 genes can be detected significantly at α = 0.05 by NetDifM, NetDifMpm and VEWDM.

Table 3 P-values of the four methods for the two real data sets (1000 times permutation)

Full size table

Group difference of the subnetwork of PI3K-AKT signaling pathway was detected significantly at α = 0.05 by NetDifM, NetDifMpm and VEWDM. When applied to the subnetwork of Notch signaling pathway, all four methods can detect the network difference significantly (Table 3). Shown in Table 4 are the results of the proposed NetDifM and other methods for detecting the specific vertices, treating a vertex as well as its connected edge as a network, under1000 permutation times.

Table 4 P-values of the four methods for detecting the genes in the PI3K-AKT signaling pathway and Notch signaling pathway

Full size table

Discussion and conclusions

Complex disease is largely determined by a number of biomolecules interwoven into networks, rather than a single biomolecule. Group-level comparison of network properties (vertices level and the strength of connection between vertices) may shed light on underlying biological processes or disease mechanisms, and benefit the design of drug targets and drug combination for the therapy of complex diseases. Meanwhile, although the conventional single-based paradigm has successfully identified a list of risk factors, one common sense is that there still exist an intermediate “black box” between the exposures and the disease phenotypes (end point observations). In the “black box”, various risk factors weaved into complicated biological networks dominating the disease occurrence, development and prognosis. Recent advances in high-throughput technologies and omics resources are revolutionizing biomedical research, and allow a transition from the traditional paradigm for biological and epidemiological studies of complex diseases to a new paradigm based on systems epidemiology [12, 13, 15]. Under this framework, network-based analysis has been integrated into observational study designs to organize the interdependencies of biomolecules and other factors at a human population level, expecting to open the “black box”. A key but inadequately addressed issue is still to develop valid statistical method to test possible differences of the networks between two groups.

In our previous study [15], we have developed a statistical method for detecting the pathway effect contributing to disease, mainly under the framework of systems epidemiology. However, this method is limited to the pathway with chain structure, and can only capture the edge changes while omitting the vertex changes. Meanwhile, the nonparametric bootstrap method has to be used to obtain the significance. At present study, we proposed a score-based powerful statistical test to detect the significant changes in biological networks between two different conditions (e.g. health and disease). It can simultaneously capture the vertex changes and edge changes. Various simulations were conducted to assess the reliability and statistical power of the proposed method. It indicated that both NetDifM and VEWDM were much more powerful than Yates’D, and NetDifM kept the best performance under various scenarios (Figs. 3 and 6), and it can indeed capture the perturbation of vertices and edges in the network simultaneously. One strength for NetDifM is that we can obtain its theoretic property, and thus can avoid the high computation burden. As expected, the proposed NetDifM indeed runs fast (Table 2). The VEWDM was used Fisher r-to-z transformation to identify significant differences between two correlations. Fukushima et al. [30] also developed an R package to identify differential correlations between two conditions based on Fisher’s z-test which affords users a simple and effective framework in omics data. The VEWDM can be treated as a global measure to detect the group difference of networks between two conditions, accounting for not only edges difference but also vertices difference. Even though one is interested in testing particular vertex or edge rather than the whole network, its connected edge should also be considered.

Two real data sets analyses further highlighted that NetDifM had more advantage in practice. In the GWAS data of leprosy, we detected a candidate gene interaction network containing five genes. For the gene expression data of ovarian cancer, two candidate subnetworks, PI3K-AKT signaling pathway and Notch signaling pathway, respectively were considered and identified, suggesting that the proposed method is capable of identifying differential gene expression and gene-gene co-expression patterns, which are certainly helpful for us to further understand the underlying disease mechanism. Rao et al. [31] reported that combined overexpression of OVA66 and MDM2 promotes oncogenesis by enhancing activation of the IGF-1R–ERK1/2 signaling pathway, and JAG1 enhances ovarian cancer cell growth and cisplatin-resistance [32]. The expression of HES1 is confirmed to be strongly associated with the pathogenesis of ovarian endometriomas [33]. Meanwhile, decreased NOTCH2 expression is associated with the poorly differentiated serous epithelial ovarian carcinoma histology [34]. RBPJ underexpression in ovarian tumor tissue relative to matched normal tissue [35]. Moreover, ADAM17 is one of the several metalloproteinases that play a key role in epidermal growth factor receptor signalling and can be a potential target antigen to devise novel immunotherapeutic strategies against ovarian cancer [36]. The PI3K-AKT and Notch pathways and their abundant associated genes comprise complicated networks, which play a significant role in the progressive growth of tumor cells.

Network difference can result from not only changes of vertices but also changes of edges, and the changes of vertices-wise and edges-wise are often closely related. For instance, differential expression of genes may be due to either mutation of its own gene or the effects of expression changes of other genes in the network. However, the degree of differential expression of one gene due to its own mutation is often lower than affected by expressions of upstream genes in the network [37]. Reverter et al. [20] presented an analytical procedure to simultaneously identify differential gene expression and connectivity for unweighted gene network. In their work, an edge between two genes is established if the absolute value of the correlation coefficient exceeds a fixed threshold. Consequently, if we set the threshold less than 0.5, and the correlation coefficient between gene A and gene B is 0.9 in cases and 0.5 in controls, then the connection between gene A and gene B is treated as no difference between cases and controls. While in this situation, there exists a difference of the strength of connection between gene A and gene B among cases and controls, given our methods focus mainly on weighted biological networks.

Furthermore, the covariance structure between vertex changes and edge changes has been embedded into the proposed score-based network difference measure. In addition, one would be more interested in testing particular vertex or edge (genes or metabolites) rather than the whole network or pathway. Actually, a vertex as well as its connected edge can be treated as a subnetwork, and the proposed network difference method can easily be extended to identify the specific vertices. Even though some local interventions were often generated to prevent and cure a particular disease, it is essential to understand the global system. The ‘think globally, act locally’ paradigm should be strongly embedded into our mind [1].

One limitation in our paper is that we assume the network topological structure is fixed, and little attention has been paid on the network structure learning problem. Constructing network structure means determining every possible edge with highest degree of data matching, and often one joint probability distribution of a number of variables can reflect more than one network structure. Usually, combining experimental evidence with their experience, most biologists and clinical researchers have a growing awareness of the interplay between the biological components and can depict more or less the specific network or pathway for the corresponding biological process. Meanwhile, numerous databases (e.g. KEGG, GO, I2D) can be further borrowed to establish the network structure. The proposed NetDifM will do not work in its current version under the scenario when the covariance matrix is not invertible. One possible solution is to first apply a shrinkage strategy to simplify the network, and then adopt the proposed statistic. For instance, we can first remove those edges if the correlation between the two linked vertices is smaller than a predefined threshold, and then apply the proposed test to the remaining network.

Statistical comparisons of group difference in biological networks are highly desirable. The proposed network difference measure was valid and powerful to detect biological network difference. R code to perform NetDifM, NetDifMpm and VEWDM is provided in the Additional file 2.

Availability of supporting data

The gene expression data of ovarian cancer were downloaded from the GEO datasets (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9899).

Abbreviations

DC:: Differentially connected
DE:: Differentially expressed
GWAS:: Genome-wide association study
KEGG:: Kyoto encyclopedia of genes and genomes
NetDifM :: Network difference measure
NetDifMpm:: NetDifM based on permutation
VEWDM :: Vertices and edges wise difference measure

References

Barabasi AL, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011;12(1):56–68.
Article PubMed Central CAS PubMed Google Scholar
Bedelbaeva K, Snyder A, Gourevitch D, Clark L, Zhang XM, Leferovich J, et al. Lack of p21 expression links cell cycle control and appendage regeneration in mice. Proc Natl AcadSci U S A. 2010;107(13):5845–50.
Article CAS Google Scholar
Schadt EE. Molecular networks as sensors and drivers of common human diseases. Nature. 2009;461(7261):218–23.
Article CAS PubMed Google Scholar
Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004;5(2):101–13.
Article CAS PubMed Google Scholar
Albert R. Scale-free networks in cell biology. J Cell Sci. 2005;118(Pt 21):4947–57.
Article CAS PubMed Google Scholar
Wu X, Jiang R, Zhang MQ, Li S. Network-based global inference of human disease genes. Mol Syst Biol. 2008;4:189.
Article PubMed Central PubMed Google Scholar
Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, et al. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol. 2009;27(2):199–204.
Article CAS PubMed Google Scholar
Laenen G, Thorrez L, Bornigen D, Moreau Y. Finding the targets of a drug by integration of gene expression data with a protein interaction network. Mol Biosyst. 2013;9(7):1676–85.
Article CAS PubMed Google Scholar
Yang B, Zhang J, Yin Y, Zhang Y. Network-based inference framework for identifying cancer genes from gene expression data. Biomed Res Int. 2013;2013:401649.
PubMed Central PubMed Google Scholar
Wu B, Li C, Du Z, Yao Q, Wu J, Feng L, et al. Network based analyses of gene expression profile of LCN2 overexpression in esophageal squamous cell carcinoma. Sci Rep. 2014;4:5403.
PubMed Central CAS PubMed Google Scholar
Hafeman DM, Schwartz S. Opening the Black Box: a motivation for the assessment of mediation. Int J Epidemiol. 2009;38(3):838–45.
Article PubMed Google Scholar
Haring R, Wallaschofski H. Diving through the “-omics”: the case for deep phenotyping and systems epidemiology. OMICS. 2012;16(5):231–4.
Article PubMed Central CAS PubMed Google Scholar
Lund E, Dumeaux V. Systems epidemiology in cancer. Cancer Epidemiol Biomarkers Prev. 2008;17(11):2954–7.
Article PubMed Google Scholar
de la Fuente A. From ‘differential expression’ to ‘differential networking’ - identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010;26(7):326–33.
Article PubMed Google Scholar
Ji J, Yuan Z, Zhang X, Li F, Xu J, Liu Y, et al. Detection for pathway effect contributing to disease in systems epidemiology with a case-control design. BMJ Open. 2015;5(1):e006721.
Article PubMed Central PubMed Google Scholar
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559.
Article PubMed Central PubMed Google Scholar
Zhang B, Li H, Riggins RB, Zhan M, Xuan J, Zhang Z, et al. Differential dependency network analysis to identify condition-specific topological changes in biological networks. Bioinformatics. 2009;25(4):526–32.
Article PubMed Central PubMed Google Scholar
Valcarcel B, Wurtz P, Seicha BNK, Tukiainen T, Kangas AJ, Soininen P, et al. A differential network approach to exploring differences between biological states: an application to prediabetes. PLoS One. 2011;6(9):e24702.
Article PubMed Central CAS PubMed Google Scholar
Yates PD, Mukhopadhyay ND. An inferential framework for biological network hypothesis tests. BMC Bioinformatics. 2013;14:94.
Article PubMed Central PubMed Google Scholar
Reverter A, Ingham A, Lehnert SA, Tan SH, Wang Y, Ratnakumar A, et al. Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer. Bioinformatics. 2006;22(19):2396–404.
Article CAS PubMed Google Scholar
Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4(1):Article17.
PubMed Google Scholar
Gill R, Datta S, Datta S. A statistical framework for differential network analysis from microarray data. BMC Bioinformatics. 2010;11:95.
Article PubMed Central PubMed Google Scholar
Kim J, Wozniak JR, Mueller BA, Shen X, Pan W. Comparison of statistical tests for group differences in brain functional networks. Neuroimage. 2014;101:681–94.
Article PubMed Central PubMed Google Scholar
Fleiss JL. On the distribution of a linear combination of independent chi squares. J Am Stat Assoc. 1971.
Zhang FR, Huang W, Chen SM, Sun LD, Liu H, Li Y, et al. Genomewide association study of leprosy. N Engl J Med. 2009;361(27):2609–18.
Article CAS PubMed Google Scholar
Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res. 2008;14(16):5198–208.
Article CAS PubMed Google Scholar
Fresno VJA, Casado E, de Castro J, Cejas P, Belda-Iniesta C, Gonzalez-Baron M. PI3K/Aktsignalling pathway and cancer. Cancer Treat Rev. 2004;30(2):193–204.
Article Google Scholar
Rose SL. Notch signaling pathway in ovarian cancer. Int J Gynecol Cancer. 2009;19(4):564–6.
Article PubMed Google Scholar
Groeneweg JW, Foster R, Growdon WB, Verheijen R, Rueda BR. Notch signaling in serous ovarian cancer. J Ovarian Res. 2014;7(1):95.
Article PubMed Central PubMed Google Scholar
Fukushima A. DiffCorr: an R package to analyze and visualize differential correlations in biological networks. Gene. 2013;518(1):209–14.
Article CAS PubMed Google Scholar
Rao W, Li H, Song F, Zhang R, Yin Q, Wang Y, et al. OVA66 increases cell growth, invasion and survival via regulation of IGF-1R-MAPK signaling in human cancer cells. Carcinogenesis. 2014;35(7):1573–81.
Article CAS PubMed Google Scholar
Liu MX, Siu MK, Liu SS, Yam JW, Ngan HY, Chan DW. Epigenetic silencing of microRNA-199b-5p is associated with acquired chemoresistance via activation of JAG1-Notch1 signaling in ovarian cancer. Oncotarget. 2014;5(4):944–58.
Article PubMed Central PubMed Google Scholar
Wang LL, Cai HQ, Dong XQ, Zhang LW, Jiang SS, Zhao N, et al. Differentially expressed gene profiles in the serum before and after the ultrasound-guided ethanol sclerotherapy in patients with ovarian endometriomas. Clin Biochem. 2015;48(16-17):1131–7.
Article CAS PubMed Google Scholar
Galic V, Shawber CJ, Reeves C, Shah M, Murtomaki A, Wright J, et al. NOTCH2 expression is decreased in epithelial ovarian cancer and is related to the tumor histological subtype. Pathol Discov. 2013;1:4.
Article PubMed Central PubMed Google Scholar
Kulic I, Robertson G, Chang L, Baker JH, Lockwood WW, Mok W, et al. Loss of the Notch effector RBPJ promotes tumorigenesis. J Exp Med. 2015;212(1):37–52.
Article PubMed Central CAS PubMed Google Scholar
Sinnathamby G, Zerfass J, Hafner J, Block P, Nickens Z, Hobeika A, et al. ADAM metallopeptidase domain 17 (ADAM17) is naturally processed through major histocompatibility complex (MHC) class I molecules and is a potential immunotherapeutic target in breast, ovarian and prostate cancers. Clin Exp Immunol. 2011;163(3):324–32.
Article PubMed Central CAS PubMed Google Scholar
Xiong M, Feghali-Bostwick CA, Arnett FC, Zhou X. A systems biology approach to genetic studies of complex diseases. FEBS Lett. 2005;579(24):5325–32.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

This work was supported by grants from National Natural Science Foundation of China (grant number81573259 and 31200994). The funding body played no role in the design, writing or decision to publish this manuscript. We thank the leprosy and ovarian cancer investigators for access to their study data.

Author information

Authors and Affiliations

Department of Biostatistics, School of Public Health, Shandong University, PO Box 100, Jinan, 250012, Shandong, China
Jiadong Ji, Zhongshang Yuan, Xiaoshuai Zhang & Fuzhong Xue

Authors

Jiadong Ji
View author publications
You can also search for this author in PubMed Google Scholar
Zhongshang Yuan
View author publications
You can also search for this author in PubMed Google Scholar
Xiaoshuai Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Fuzhong Xue
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Fuzhong Xue.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JJ, ZY and FX conceived, designed the study and implemented the data analysis, JJ, ZY and XZ drafted the manuscript. All authors read and approved the final manuscript.

Additional files

Additional file 1: Table S1.

The location and SNP number for five susceptibility genes belonging to the network associated with leprosy. Table S2. The location and probe sets number for genes belonging to the subnetwork from PI3K-AKT signaling pathway and Notch signaling pathway. Table S3. Type I error rates of four methods (40 vertices and 54 edges in the network). Table S4. Type I error rates of four methods given the deviation from the normal distribution. Figure S1. A network including 40 vertices and 54 edges. Figure S2. The statistical power of the four methods under three scenarios. (a) only vertices change, (b) only edges change, (c) both vertices and edges change. (PDF 140 kb)

Additional file 2:

R code. (DOCX 13 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Ji, J., Yuan, Z., Zhang, X. et al. A powerful score-based statistical test for group difference in weighted biological networks. BMC Bioinformatics 17, 86 (2016). https://doi.org/10.1186/s12859-016-0916-x

Download citation

Received: 11 September 2015
Accepted: 29 January 2016
Published: 12 February 2016
DOI: https://doi.org/10.1186/s12859-016-0916-x

A powerful score-based statistical test for group difference in weighted biological networks