 Methodology article
 Open Access
 Published:
A powerful scorebased statistical test for group difference in weighted biological networks
BMC Bioinformatics volume 17, Article number: 86 (2016)
Abstract
Background
Complex disease is largely determined by a number of biomolecules interwoven into networks, rather than a single biomolecule. A key but inadequately addressed issue is how to test possible differences of the networks between two groups. Grouplevel comparison of network properties may shed light on underlying disease mechanisms and benefit the design of drug targets for complex diseases. We therefore proposed a powerful scorebased statistic to detect group difference in weighted networks, which simultaneously capture the vertex changes and edge changes.
Results
Simulation studies indicated that the proposed network difference measure (NetDifM) was stable and outperformed other methods existed, under various sample sizes and network topology structure. One application to real data about GWAS of leprosy successfully identified the specific gene interaction network contributing to leprosy. For additional gene expression data of ovarian cancer, two candidate subnetworks, PI3KAKT and Notch signaling pathways, were considered and identified respectively.
Conclusions
The proposed method, accounting for the vertex changes and edge changes simultaneously, is valid and powerful to capture the group difference of biological networks.
Background
From the perspective of network medicine, a disease phenotype is rarely a consequence of an abnormality in a single biomolecule (e.g. RNA, protein, metabolite), but reflects various pathobiological processes that interact in a complex network [1]. One single factor can exert certain effects on disease when studying it alone, while this effect may be vanished when studying it within one network or pathway [2], and vice versa. Therefore, biomolecules should be studied in the context of biological systems they are involved in [3]. Perhaps the abstraction for a biological system is network, such as transcriptional regulatory networks, signal transduction networks, protein interaction networks and metabolic networks [4]. In the biological networks, the vertices represent biomolecules, and the edges represent functional, causal or physical interactions between the vertices. Different types of networks are often used to represent diverse types of biological processes, each of which stores information about levels and interactions related to specific biomolecules [5]. In fact, different physiological conditions may manifest as different networks. Moreover, complex disease are multifactorial and analyzing the individual components is insufficient, so it is essential to dissect how these components interact with each other and weave into one network, and how these interactions differ with respect to disease status. Statistical comparison of group difference in biological networks or pathways can provide new insight into the underlying disease mechanism, and have extensive biomedical and clinical applications [6–10]. For instance, a better understanding of the effects of molecular interconnectedness on disease progression may lead to superior identification of disease related biomolecules and pathways, which may further offer more effective targets for drug development in a costeffective and timely manner.
On the other hand, identifying biological and environmental causes of human diseases has always been one of the central concerns in epidemiology. However, traditional epidemiology has been pejoratively labeled as the “black box” epidemiology [11], and increasingly suffered from criticism partly due to the fact that too much attention has been paid to the identification of a single risk factor rather than the network or pathway related to a disease, which led to difficulty to deeply explore disease mechanism [12]. It is highly desirable to unlock the black box underlying observed associations and to illuminate the biological interaction mechanisms of diseaserelated components hiding behind the black box. There are unmet needs to access multilevel omics data on the population level. Thanks to the development of recent technological advances in highthroughput omics platforms, we can enable the acquisition of omics data at unprecedented speed and amounts, and further integrate various omics data with traditional epidemiology to promote the development of systems epidemiology [12, 13]. It offers the potential to provide new insight into the underlying disease mechanisms in breadth and depth at human population level. Under the framework of systems epidemiology, the focus has been shifted from identification of single factor to exploration of specific networks or pathways contributing to disease [14, 15].
In a word, it is in great needs to do statistical comparison of biological networks. So far, several methods have been proposed to utilize network topology information to carry out various biomedical tasks. Langfelder et al. [16] provided several measures for comparing network topologies for weighted correlation networks. Zhang et al. [17] proposed a differential dependency network analysis to detect topological changes in transcriptional networks between subclasses of breast cancer. Valcarcel et al. [18] introduced a formal statistical method for the differential analysis of molecular pairwise associations via network representation. Recently, Yates et al. [19] developed an additive elementwisebased dissimilarity measure for biological network hypothesis tests. However, most of above methods mainly focus on the difference of network topology and are unable to account for the changes of vertices. Although in most situations, the differences of single verticeswise or edgeswise may be weak, their aggregated differences can be quite strong. It will undoubtedly lose statistical power to only consider the connection with the topological difference between two networks. Meanwhile, nonparametric permutation procedures are commonly employed to perform analysis in most existed methods, which were inevitably timeconsuming, especially for big data.
The premise for networks or pathways comparison is to make clear the cause of biological network difference. Generally, both changes in the nodes level (e.g. the magnitude of each gene’s expression change in regulation network), and changes in the edges (e.g. the strength of connection) can lead to the whole network difference. Reverter et al. [20] presented an analytical procedure to simultaneously identify genes that were differentially expressed (DE) as well as genes that are differentially connected (DC) for unweighted networks. Their methods depend heavily on the specific definition of DE and DC, and the twocomponent mixture of bivariate normal distribution may be violated in other biological networks, though it may be reasonable in gene expression network. Furthermore, weighted (correlationbased) networks are commonly encountered and increasingly relevant in biological applications [16, 21–23]. Statistical methods for detecting the group difference in weighted biological networks are still in great demand.
In this article, we proposed a new scorebased network difference measure (NetDifM) as a powerful test statistic to detect group difference in weighted networks, which simultaneously capture the difference of vertices and edges. Various simulations were conducted to evaluate its type I error and statistical power, compared with other existed method. Two real data sets about GWAS of leprosy and gene expression of ovarian cancer were further analyzed to show their performance in practice.
Methods
Statistical model
A weighted biological network can be modeled as an undirected graph G = (V, E), where V is the set of vertices (sometimes referred to nodes) and E is the set of edges (also called connections). Two vertices, representing biomolecules, are connected by an undirected edge if there is an association between them. Each edge can be assigned a weight resembling the strength of evidence for the association.
We denote the two networks in two groups (cases and controls) by G ^{D} and G ^{C} respectively, suppose both G ^{D} and G ^{C} have the same number of vertices (M) and edges (K), the null hypothesis test is H _{0} : G ^{D} = G ^{C}. Let V(G ^{D}) and E(G ^{D}) denote the set of all vertices and edges in G ^{D}, x ^{D}_{ i } x ^{D}_{ j } indicate the edge \( {x}_i^D{x}_j^D \) (i ≠ j, i, j = 1, 2, ⋯, M), β ^{D}_{ ij } represent the strength of association between x ^{D}_{ i } and x ^{D}_{ j } if x ^{D}_{ i } x ^{D}_{ j } existed. For individual l (l = 1, 2, ⋯, N), the trait value is denoted as Y _{ l }, \( {Y}_l=\left\{\begin{array}{l}1,\kern0.4em l\in case\\ {}0\kern0.1em ,\kern0.2em l\in control\end{array}\right. \) and the i ^{th} vertex is denoted as x _{ li }. Under H _{0}, networks in two groups are identical not only in the average vertices levels but also in the connection strength. The score test vector of vertices is D ^{V} = (D ^{V}_{1} , D ^{V}_{2} , ⋯, D ^{V}_{ M } )^{T}, where \( {D}_i^V={\displaystyle {\sum}_{l=1}^N\left({Y}_l\overline{Y}\right){x}_{li}} \) measures the contribution of vertex x _{ i } to the disease. Analogously, the score test vector of edges is D ^{E} = (D ^{E}_{1} , D ^{E}_{2} , ⋯, D ^{E}_{ K } )^{T}, where \( {D}_k^E={\displaystyle {\sum}_{l=1}^N\left({Y}_l\overline{Y}\right)\left({x}_{li}{\overline{x}}_i\right)\left({x}_{lj}{\overline{x}}_j\right)} \) measures the contribution of connection strength between x _{ i } and x _{ j } (i.e. the k ^{th} edge) to the disease. Then the proposed overall network difference measure can be defined as
where \( D=\left(\begin{array}{l}{D}^V\\ {}{D}^E\end{array}\right), \) Σ = cov(D) = (σ _{ pq })_{(M + K) × (M + K)}, p, q = 1, 2, ⋯, (M + K). The estimated covariance matrix of D can be represented as \( \left(\kern.9em \begin{array}{cc}{\varSigma}_V& {\varSigma}_{VE}\\ {}\kern1em {\varSigma}_{VE}^T\kern1em & {\varSigma}_E\end{array}\right) \) and calculated as follows,
1) For Σ _{ V }, p, q = 1, 2, ⋯, M,
2) For Σ _{ E }, p, q = M + 1, M + 2, ⋯, M + K,
3) For Σ _{ VE }, p = 1, 2, ⋯, M , q = M + 1, M + 2, ⋯, M + K
Naturally, for a large sample size, NetDifM has a centered χ ^{2}(M + K) distribution under the null hypothesis (The derivation of NetDifM see Additional file 1). When sample size is small, a permutation procedure can be conducted as follows to get the empirical P value and assess the statistical significance. (1) calculate the test statistic NetDifM from the original sample; (2) randomly assign subjects to one of two groups, the sample size for each group keeps the same as the original data; (3) perform the above steps Q times and calculate the test statistic for each repeated sample, NetDifM ^{*}_{ i } , i = 1, 2, ⋯, Q; (4) estimate the P value according to \( p value=\frac{1}{Q}{\displaystyle \sum_{i=1}^QI\left( NetDif{M}_i^{*}> NetDifM\right)} \), where I(•) is the indicator function.
Intuitively, considering the elements of one network are not more than vertices and edges, an elementwise measure may be expected to have the ability to identify the group difference in biological networks. A vertices and edges wise difference measure (VEWDM), through the simple summation of vertices difference and edges, can be constructed as
where \( {T}_i=\frac{{\overline{x}}_i^D{\overline{x}}_i^C}{\sqrt{\operatorname{var}\left({\overline{x}}_i^D\right)+\operatorname{var}\left({\overline{x}}_i^C\right)}} \), \( {\overline{x}}_i^D \) and \( {\overline{x}}_i^C \) indicate the sample mean of x _{ i } in G ^{D} and G ^{C} respectively; \( {U}_{ij}=\frac{\beta_{ij}^D{\beta}_{ij}^C}{\sqrt{\operatorname{var}\left({\beta}_{ij}^D\right)+\operatorname{var}\left({\beta}_{ij}^C\right)}} \), when the strength of edges are quantified by the Pearson correlations r _{ ij }, \( {U}_{ij}=\left({z}_{ij}^D{z}_{ij}^C\right)/\sqrt{\frac{1}{n_D3}+\frac{1}{n_C3}} \), z _{ ij } are the Fishertransforms of the correlations \( {z}_{ij}=\frac{1}{2} \ln \frac{1+{r}_{ij}}{1{r}_{ij}} \), n _{ D } and n _{ C } are the corresponding sample sizes. The proposed VEWDM seems to be the linear combination of some chisquare statistics. Actually, the asymptotic theoretical properties have been explored for the linear combination of independent chisquares [24]. Nevertheless, it is quite complex and difficult here to obtain the asymptotic distribution of VEWDM, since the correlations between different vertices and different edges statistics (T _{ i } and U _{ ij }) highly depend on the specific network structure. In other words, the asymptotic properties might be networkspecific. To solve this problem, we adopted the strategy of a permutation test to make statistical inference.
Simulation
Simulations were designed to evaluate the type I error rate and statistical power, and to compare the performance of NetDifM, VEWDM and Yates’D (recently proposed dissimilarity measure in Yates et al. [19]) under different sample size and network topological structure. The statistical power is defined as the probability that the test correctly rejects the null hypothesis (H _{0}) when the alternative hypothesis (H _{1}) is true. It can be estimated from the empirical distribution as the proportion of observations for which the pvalue is less than given nominal level (α = 0.05). For the specific network with M vertices and K edges, the simulated Mdimensional variables (vertices) were generated from a multivariate normal distribution N _{ M }(μ, Σ) with mean vector μ and covariance matrix Σ using the R “mvtnorm” package. We specified the mean vector μ = (μ _{1}, μ _{2}, ⋯, μ _{ M }) and covariance matrix Σ = (I _{ ij } β _{ ij })_{ M × M }, where \( {I}_{ij}=\left\{\begin{array}{l}1,\kern0.4em {x}_i{x}_j\in E(G)\\ {}0,\kern0.3em {x}_i{x}_j\notin E(G)\kern0.1em \end{array}\right. \), i ≠ j, i, j = 1, 2, ⋯, M was the indicator function.
Under the null hypotheses (H _{0}), the data was generated by setting μ ^{D} = μ ^{C} and I ^{D}_{ ij } β ^{D}_{ ij } = I ^{C}_{ ij } β ^{C}_{ ij } . 1000 simulations were repeated to assess the type I error of the above methods given various sample sizes under different network scale, including network with ten vertices and 21 edges (Fig. 1a) and another one with 20 vertices and 45 edges (Fig. 1b). Under the alternative hypotheses, three scenarios were considered.
For scenario 1, only vertices (average levels) were different between G ^{D} and G ^{C}. We simulated two vertices difference with μ ^{D}_{3} − μ ^{C}_{3} = 0.2 and μ ^{D}_{10} − μ ^{C}_{10} = 0.2 under the network topological structure as in Fig. 1a. Three vertices difference with μ ^{D}_{6} − μ ^{C}_{6} = 0.1, μ ^{D}_{8} − μ ^{C}_{8} = 0.2 and μ ^{D}_{17} − μ ^{C}_{17} = 0.2 were also designed under the relative larger scale network as in Fig. 1b.
For scenario 2, only edges (connection strength) were different between G ^{D} and G ^{C}. We simulated three edges difference with β ^{D}_{35} − β ^{C}_{35} = − 0.2, β ^{D}_{57} − β ^{C}_{57} = 0.2 and β ^{D}_{8,10} − β ^{C}_{8,10} = 0.2 under the network topological structure as in Fig. 1a. Seven edges difference with β ^{C}_{10,11} − β ^{D}_{10,11} = − 0.2, β ^{C}_{1,10} − β ^{D}_{1,10} = β ^{C}_{2,12} − β ^{D}_{2,12} = β ^{C}_{4,14} − β ^{D}_{4,14} = β ^{C}_{12,19} − β ^{D}_{12,19} = 0.2 and β ^{C}_{23} − β ^{D}_{23} = β ^{C}_{7,20} − β ^{D}_{7,20} = 0.1 were also designed under the relative larger scale network as in Fig. 1b.
For scenario 3, both vertices and edges were designed to be different between G ^{D} and G ^{C}. Under the topology structure as in Fig. 1a, we combined the settings in scenario 1 and scenario 2 (the difference only existed for orange vertices and red edges), so as for the topology structure as in Fig. 1b.
For each scenario, 1000 replicates were used to evaluate statistical power. Pvalues of the proposed NetDifM were assessed using both the asymptotic distribution and the empirical null distribution obtained from 1000 times permutations.
It is necessary to assess the performance of the proposed statistics, given the deviation from the normal distribution. For the network with ten vertices and 21 edges, we designed the following two scenarios, (i) conduct the exponential transformation for five vertices randomly chosen among the ten vertices; (ii) do the exponential transformation for all ten vertices. For each scenario, we evaluate the type I error rate and statistical power under the same three scenarios mentioned as above.
Furthermore, we also provided estimated computing time under different network with sample size 200 and 1000 permutations, using one laptop as an Intel PentiumT4400 with a 2.2 GHz CPU and 2 GB RAM.
Application
GWAS data of leprosy
By Ingenuity Pathways Analysis, a plausible biologic network underlying susceptibility to leprosy was provided for depicting the functional relationship between some susceptibility genes identified from GWAS of leprosy [25]. Using the initial GWAS data with 706 cases and 514 controls, we attempted to detect the difference of the networks including genes CARD6, HLADRB1, RIPK2, CARD9 and IFNG. All participants provided written informed consent, and the study was approved by the ethics committees of Shandong Academy of Medical Science [25]. These five genes located on different chromosomes and totally contained 914 SNPs (see in Additional file 1: Table S1), with network structure given in Fig. 2a. Since each gene contained several SNPs, we first employed principal component analysis and conducted the statistical network comparison by treating the first principal component as the network vertices.
Gene expression data of ovarian cancer
Tothillet al. [26] used highdensity expression oligonucleotide microarrays for profiling 285 wellannotated serous and endometrioid invasive ovarian, fallopian tube, and peritoneal cancers. The subjects were divided into a C1 subtype, with 83 patients, and a C2–C6 subtype, with 168 patients. Complete expression data are available on GEO (accession GSE9899). The proposed method was also applied to detect the network difference between these two groups (C1 versus C2–C6). Here we studied two specific pathways (PI3KAKT signaling pathway and Notch signaling pathway) reported in the literatures [27–29] to be relevant to ovarian cancer. The subnetwork of PI3KAKT signaling pathway from the KEGG pathway database with 7 genes contained 26 probe sets (see in Additional file 1: Table S2) was abstracted into network with topological structure shown in Fig. 2b. The subnetwork of Notch signaling pathway with 14 genes contained 45 probe sets (see in Additional file 1: Table S2) was abstracted into network with topological structure shown in Fig. 2c. All probe sets corresponding to the same gene symbol were first averaged to obtain genelevel expression measurements.
Results
Simulation
Shown in Table 1 are the estimated type I error rates of the proposed NetDifM, VEWDM, NetDifM based on permutation (NetDifMpm) and Yates’D under different sample sizes. It reveals that all type I error rates based on permutation procedure are close to given nominal level (α = 0.05). NetDifM tended to be slightly conservative under small sample size, while using the asymptotic distribution maintains a good control of type I error rate under large sample size.
Figure 3a indicates the statistical power under scenario 1 when only vertices changed with the network topological structure demonstrates in Fig. 1a. As expected, Yates’D has no power due to that it can only capture the edge change. NetDifM is substantially more powerful than VEWDM, and it is slightly less powerful than its permutationbased type. Similar trend could also be found under the relative larger scale network (Fig. 3b).
Shown in Fig. 4 is the performance under scenario 2 (only edges change). The statistical power of all methods monotonically increases with sample size. NetDifM has much higher power than that of VEWDM and Yates’D. The power of NetDifM and Yates’D keep almost the same in the larger scale network (Fig. 4b).
Figure 5 illustrates the statistical power under the scenario 3 (both edges and vertices change). Both NetDifM and VEWDM are much more powerful than Yates’D, and NetDifM still has the best performance.
To evaluate the scalability and computational efficiency of the proposed methods, we also conducted simulations using a larger network with 40 vertices and 54 edges (see in Additional file 1: Figure S1). It is clear that the proposed NetDifM still have the best performance (see in Additional file 1: Figure S2; Additional file 1: Table S3).
Figure 6 indicates the results given the deviation from the normal distribution, where the proposed statistics still hold the relative better performance than other method.
Table 2 presents the estimated computing time. It indicates that the proposed NetDifM indeed runs fast, and the computational time increases as the network become larger.
The results of application
Network difference analysis for both the GWAS of leprosy and gene expression data of ovarian cancer further confirm in practice that the proposed NetDifM captured the network changes. Shown in Table 3 are the results of the proposed NetDifM and other methods for detecting the network difference between two groups. The difference of gene interaction network with 5 genes can be detected significantly at α = 0.05 by NetDifM, NetDifMpm and VEWDM.
Group difference of the subnetwork of PI3KAKT signaling pathway was detected significantly at α = 0.05 by NetDifM, NetDifMpm and VEWDM. When applied to the subnetwork of Notch signaling pathway, all four methods can detect the network difference significantly (Table 3). Shown in Table 4 are the results of the proposed NetDifM and other methods for detecting the specific vertices, treating a vertex as well as its connected edge as a network, under1000 permutation times.
Discussion and conclusions
Complex disease is largely determined by a number of biomolecules interwoven into networks, rather than a single biomolecule. Grouplevel comparison of network properties (vertices level and the strength of connection between vertices) may shed light on underlying biological processes or disease mechanisms, and benefit the design of drug targets and drug combination for the therapy of complex diseases. Meanwhile, although the conventional singlebased paradigm has successfully identified a list of risk factors, one common sense is that there still exist an intermediate “black box” between the exposures and the disease phenotypes (end point observations). In the “black box”, various risk factors weaved into complicated biological networks dominating the disease occurrence, development and prognosis. Recent advances in highthroughput technologies and omics resources are revolutionizing biomedical research, and allow a transition from the traditional paradigm for biological and epidemiological studies of complex diseases to a new paradigm based on systems epidemiology [12, 13, 15]. Under this framework, networkbased analysis has been integrated into observational study designs to organize the interdependencies of biomolecules and other factors at a human population level, expecting to open the “black box”. A key but inadequately addressed issue is still to develop valid statistical method to test possible differences of the networks between two groups.
In our previous study [15], we have developed a statistical method for detecting the pathway effect contributing to disease, mainly under the framework of systems epidemiology. However, this method is limited to the pathway with chain structure, and can only capture the edge changes while omitting the vertex changes. Meanwhile, the nonparametric bootstrap method has to be used to obtain the significance. At present study, we proposed a scorebased powerful statistical test to detect the significant changes in biological networks between two different conditions (e.g. health and disease). It can simultaneously capture the vertex changes and edge changes. Various simulations were conducted to assess the reliability and statistical power of the proposed method. It indicated that both NetDifM and VEWDM were much more powerful than Yates’D, and NetDifM kept the best performance under various scenarios (Figs. 3 and 6), and it can indeed capture the perturbation of vertices and edges in the network simultaneously. One strength for NetDifM is that we can obtain its theoretic property, and thus can avoid the high computation burden. As expected, the proposed NetDifM indeed runs fast (Table 2). The VEWDM was used Fisher rtoz transformation to identify significant differences between two correlations. Fukushima et al. [30] also developed an R package to identify differential correlations between two conditions based on Fisher’s ztest which affords users a simple and effective framework in omics data. The VEWDM can be treated as a global measure to detect the group difference of networks between two conditions, accounting for not only edges difference but also vertices difference. Even though one is interested in testing particular vertex or edge rather than the whole network, its connected edge should also be considered.
Two real data sets analyses further highlighted that NetDifM had more advantage in practice. In the GWAS data of leprosy, we detected a candidate gene interaction network containing five genes. For the gene expression data of ovarian cancer, two candidate subnetworks, PI3KAKT signaling pathway and Notch signaling pathway, respectively were considered and identified, suggesting that the proposed method is capable of identifying differential gene expression and genegene coexpression patterns, which are certainly helpful for us to further understand the underlying disease mechanism. Rao et al. [31] reported that combined overexpression of OVA66 and MDM2 promotes oncogenesis by enhancing activation of the IGF1R–ERK1/2 signaling pathway, and JAG1 enhances ovarian cancer cell growth and cisplatinresistance [32]. The expression of HES1 is confirmed to be strongly associated with the pathogenesis of ovarian endometriomas [33]. Meanwhile, decreased NOTCH2 expression is associated with the poorly differentiated serous epithelial ovarian carcinoma histology [34]. RBPJ underexpression in ovarian tumor tissue relative to matched normal tissue [35]. Moreover, ADAM17 is one of the several metalloproteinases that play a key role in epidermal growth factor receptor signalling and can be a potential target antigen to devise novel immunotherapeutic strategies against ovarian cancer [36]. The PI3KAKT and Notch pathways and their abundant associated genes comprise complicated networks, which play a significant role in the progressive growth of tumor cells.
Network difference can result from not only changes of vertices but also changes of edges, and the changes of verticeswise and edgeswise are often closely related. For instance, differential expression of genes may be due to either mutation of its own gene or the effects of expression changes of other genes in the network. However, the degree of differential expression of one gene due to its own mutation is often lower than affected by expressions of upstream genes in the network [37]. Reverter et al. [20] presented an analytical procedure to simultaneously identify differential gene expression and connectivity for unweighted gene network. In their work, an edge between two genes is established if the absolute value of the correlation coefficient exceeds a fixed threshold. Consequently, if we set the threshold less than 0.5, and the correlation coefficient between gene A and gene B is 0.9 in cases and 0.5 in controls, then the connection between gene A and gene B is treated as no difference between cases and controls. While in this situation, there exists a difference of the strength of connection between gene A and gene B among cases and controls, given our methods focus mainly on weighted biological networks.
Furthermore, the covariance structure between vertex changes and edge changes has been embedded into the proposed scorebased network difference measure. In addition, one would be more interested in testing particular vertex or edge (genes or metabolites) rather than the whole network or pathway. Actually, a vertex as well as its connected edge can be treated as a subnetwork, and the proposed network difference method can easily be extended to identify the specific vertices. Even though some local interventions were often generated to prevent and cure a particular disease, it is essential to understand the global system. The ‘think globally, act locally’ paradigm should be strongly embedded into our mind [1].
One limitation in our paper is that we assume the network topological structure is fixed, and little attention has been paid on the network structure learning problem. Constructing network structure means determining every possible edge with highest degree of data matching, and often one joint probability distribution of a number of variables can reflect more than one network structure. Usually, combining experimental evidence with their experience, most biologists and clinical researchers have a growing awareness of the interplay between the biological components and can depict more or less the specific network or pathway for the corresponding biological process. Meanwhile, numerous databases (e.g. KEGG, GO, I2D) can be further borrowed to establish the network structure. The proposed NetDifM will do not work in its current version under the scenario when the covariance matrix is not invertible. One possible solution is to first apply a shrinkage strategy to simplify the network, and then adopt the proposed statistic. For instance, we can first remove those edges if the correlation between the two linked vertices is smaller than a predefined threshold, and then apply the proposed test to the remaining network.
Statistical comparisons of group difference in biological networks are highly desirable. The proposed network difference measure was valid and powerful to detect biological network difference. R code to perform NetDifM, NetDifMpm and VEWDM is provided in the Additional file 2.
Availability of supporting data
The gene expression data of ovarian cancer were downloaded from the GEO datasets (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9899).
Abbreviations
 DC:

Differentially connected
 DE:

Differentially expressed
 GWAS:

Genomewide association study
 KEGG:

Kyoto encyclopedia of genes and genomes
 NetDifM :

Network difference measure
 NetDifMpm:

NetDifM based on permutation
 VEWDM :

Vertices and edges wise difference measure
References
 1.
Barabasi AL, Gulbahce N, Loscalzo J. Network medicine: a networkbased approach to human disease. Nat Rev Genet. 2011;12(1):56–68.
 2.
Bedelbaeva K, Snyder A, Gourevitch D, Clark L, Zhang XM, Leferovich J, et al. Lack of p21 expression links cell cycle control and appendage regeneration in mice. Proc Natl AcadSci U S A. 2010;107(13):5845–50.
 3.
Schadt EE. Molecular networks as sensors and drivers of common human diseases. Nature. 2009;461(7261):218–23.
 4.
Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004;5(2):101–13.
 5.
Albert R. Scalefree networks in cell biology. J Cell Sci. 2005;118(Pt 21):4947–57.
 6.
Wu X, Jiang R, Zhang MQ, Li S. Networkbased global inference of human disease genes. Mol Syst Biol. 2008;4:189.
 7.
Taylor IW, Linding R, WardeFarley D, Liu Y, Pesquita C, Faria D, et al. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol. 2009;27(2):199–204.
 8.
Laenen G, Thorrez L, Bornigen D, Moreau Y. Finding the targets of a drug by integration of gene expression data with a protein interaction network. Mol Biosyst. 2013;9(7):1676–85.
 9.
Yang B, Zhang J, Yin Y, Zhang Y. Networkbased inference framework for identifying cancer genes from gene expression data. Biomed Res Int. 2013;2013:401649.
 10.
Wu B, Li C, Du Z, Yao Q, Wu J, Feng L, et al. Network based analyses of gene expression profile of LCN2 overexpression in esophageal squamous cell carcinoma. Sci Rep. 2014;4:5403.
 11.
Hafeman DM, Schwartz S. Opening the Black Box: a motivation for the assessment of mediation. Int J Epidemiol. 2009;38(3):838–45.
 12.
Haring R, Wallaschofski H. Diving through the “omics”: the case for deep phenotyping and systems epidemiology. OMICS. 2012;16(5):231–4.
 13.
Lund E, Dumeaux V. Systems epidemiology in cancer. Cancer Epidemiol Biomarkers Prev. 2008;17(11):2954–7.
 14.
de la Fuente A. From ‘differential expression’ to ‘differential networking’  identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010;26(7):326–33.
 15.
Ji J, Yuan Z, Zhang X, Li F, Xu J, Liu Y, et al. Detection for pathway effect contributing to disease in systems epidemiology with a casecontrol design. BMJ Open. 2015;5(1):e006721.
 16.
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559.
 17.
Zhang B, Li H, Riggins RB, Zhan M, Xuan J, Zhang Z, et al. Differential dependency network analysis to identify conditionspecific topological changes in biological networks. Bioinformatics. 2009;25(4):526–32.
 18.
Valcarcel B, Wurtz P, Seicha BNK, Tukiainen T, Kangas AJ, Soininen P, et al. A differential network approach to exploring differences between biological states: an application to prediabetes. PLoS One. 2011;6(9):e24702.
 19.
Yates PD, Mukhopadhyay ND. An inferential framework for biological network hypothesis tests. BMC Bioinformatics. 2013;14:94.
 20.
Reverter A, Ingham A, Lehnert SA, Tan SH, Wang Y, Ratnakumar A, et al. Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer. Bioinformatics. 2006;22(19):2396–404.
 21.
Zhang B, Horvath S. A general framework for weighted gene coexpression network analysis. Stat Appl Genet Mol Biol. 2005;4(1):Article17.
 22.
Gill R, Datta S, Datta S. A statistical framework for differential network analysis from microarray data. BMC Bioinformatics. 2010;11:95.
 23.
Kim J, Wozniak JR, Mueller BA, Shen X, Pan W. Comparison of statistical tests for group differences in brain functional networks. Neuroimage. 2014;101:681–94.
 24.
Fleiss JL. On the distribution of a linear combination of independent chi squares. J Am Stat Assoc. 1971.
 25.
Zhang FR, Huang W, Chen SM, Sun LD, Liu H, Li Y, et al. Genomewide association study of leprosy. N Engl J Med. 2009;361(27):2609–18.
 26.
Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res. 2008;14(16):5198–208.
 27.
Fresno VJA, Casado E, de Castro J, Cejas P, BeldaIniesta C, GonzalezBaron M. PI3K/Aktsignalling pathway and cancer. Cancer Treat Rev. 2004;30(2):193–204.
 28.
Rose SL. Notch signaling pathway in ovarian cancer. Int J Gynecol Cancer. 2009;19(4):564–6.
 29.
Groeneweg JW, Foster R, Growdon WB, Verheijen R, Rueda BR. Notch signaling in serous ovarian cancer. J Ovarian Res. 2014;7(1):95.
 30.
Fukushima A. DiffCorr: an R package to analyze and visualize differential correlations in biological networks. Gene. 2013;518(1):209–14.
 31.
Rao W, Li H, Song F, Zhang R, Yin Q, Wang Y, et al. OVA66 increases cell growth, invasion and survival via regulation of IGF1RMAPK signaling in human cancer cells. Carcinogenesis. 2014;35(7):1573–81.
 32.
Liu MX, Siu MK, Liu SS, Yam JW, Ngan HY, Chan DW. Epigenetic silencing of microRNA199b5p is associated with acquired chemoresistance via activation of JAG1Notch1 signaling in ovarian cancer. Oncotarget. 2014;5(4):944–58.
 33.
Wang LL, Cai HQ, Dong XQ, Zhang LW, Jiang SS, Zhao N, et al. Differentially expressed gene profiles in the serum before and after the ultrasoundguided ethanol sclerotherapy in patients with ovarian endometriomas. Clin Biochem. 2015;48(1617):1131–7.
 34.
Galic V, Shawber CJ, Reeves C, Shah M, Murtomaki A, Wright J, et al. NOTCH2 expression is decreased in epithelial ovarian cancer and is related to the tumor histological subtype. Pathol Discov. 2013;1:4.
 35.
Kulic I, Robertson G, Chang L, Baker JH, Lockwood WW, Mok W, et al. Loss of the Notch effector RBPJ promotes tumorigenesis. J Exp Med. 2015;212(1):37–52.
 36.
Sinnathamby G, Zerfass J, Hafner J, Block P, Nickens Z, Hobeika A, et al. ADAM metallopeptidase domain 17 (ADAM17) is naturally processed through major histocompatibility complex (MHC) class I molecules and is a potential immunotherapeutic target in breast, ovarian and prostate cancers. Clin Exp Immunol. 2011;163(3):324–32.
 37.
Xiong M, FeghaliBostwick CA, Arnett FC, Zhou X. A systems biology approach to genetic studies of complex diseases. FEBS Lett. 2005;579(24):5325–32.
Acknowledgements
This work was supported by grants from National Natural Science Foundation of China (grant number81573259 and 31200994). The funding body played no role in the design, writing or decision to publish this manuscript. We thank the leprosy and ovarian cancer investigators for access to their study data.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JJ, ZY and FX conceived, designed the study and implemented the data analysis, JJ, ZY and XZ drafted the manuscript. All authors read and approved the final manuscript.
Additional files
Additional file 1: Table S1.
The location and SNP number for five susceptibility genes belonging to the network associated with leprosy. Table S2. The location and probe sets number for genes belonging to the subnetwork from PI3KAKT signaling pathway and Notch signaling pathway. Table S3. Type I error rates of four methods (40 vertices and 54 edges in the network). Table S4. Type I error rates of four methods given the deviation from the normal distribution. Figure S1. A network including 40 vertices and 54 edges. Figure S2. The statistical power of the four methods under three scenarios. (a) only vertices change, (b) only edges change, (c) both vertices and edges change. (PDF 140 kb)
Additional file 2:
R code. (DOCX 13 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Ji, J., Yuan, Z., Zhang, X. et al. A powerful scorebased statistical test for group difference in weighted biological networks. BMC Bioinformatics 17, 86 (2016). https://doi.org/10.1186/s128590160916x
Received:
Accepted:
Published:
Keywords
 Network medicine
 Systems epidemiology
 Scorebased statistical test
 Network comparison