Skip to main content

A pseudo-value regression approach for differential network analysis of co-expression data

Abstract

Background

The differential network (DN) analysis identifies changes in measures of association among genes under two or more experimental conditions. In this article, we introduce a pseudo-value regression approach for network analysis (PRANA). This is a novel method of differential network analysis that also adjusts for additional clinical covariates. We start from mutual information criteria, followed by pseudo-value calculations, which are then entered into a robust regression model.

Results

This article assesses the model performances of PRANA in a multivariable setting, followed by a comparison to dnapath and DINGO in both univariable and multivariable settings through variety of simulations. Performance in terms of precision, recall, and F1 score of differentially connected (DC) genes is assessed. By and large, PRANA outperformed dnapath and DINGO, neither of which is equipped to adjust for available covariates such as patient-age. Lastly, we employ PRANA in a real data application from the Gene Expression Omnibus database to identify DC genes that are associated with chronic obstructive pulmonary disease to demonstrate its utility.

Conclusion

To the best of our knowledge, this is the first attempt of utilizing a regression modeling for DN analysis by collective gene expression levels between two or more groups with the inclusion of additional clinical covariates. By and large, adjusting for available covariates improves accuracy of a DN analysis.

Peer Review reports

Background

The rapid advancement of RNA-sequencing (RNA-seq) data from high-throughput sequencing technologies has provided clear advantages in gene expression studies. It has broadened our understanding of genetics and pathogenesis of human diseases [1, 2]. Compared to microarrays, RNA-seq has a wider dynamic range, the ability to detect novel transcripts, and often results in higher sensitivity and specificity of detection of differential gene expression [3, 4]. With the increase of gene expression studies, the statistical methods to analyze these gene expression data have also accordingly adapted and progressed. The regression modelling for RNA-seq differential expression (DE) analysis has been established to compare the number of DE genes under different biological or clinical states, including linear model based limma [5], negative binomial model based edgeR [6], or Poisson log-linear model approach [7]. The analysis of DE has a major limitation in that it looks at one gene at a time, even though a set of genes are often involved in the same biological process [8, 9]. In contrast, the differential network (DN) analysis complements the DE analysis [10] by looking at genes collectively.

The DN analysis identifies changes in measures of association (i.e. network properties or topologies) of the networks across biological conditions, which makes it distinct from a single network analysis. Several groups have proposed statistical methods for DN analysis [11,12,13,14]. In particular, DINGO [11] and dnapath [14] have developed methods for RNA-seq data, to find differentially connected (DC) genes in subnetworks corresponding to different pathways, between two groups of patients; e.g. ‘high-risk’ versus ‘low-risk’ or ‘long-term survivors’ vs. ‘short-term survivors.’ While these methods are convenient to use and applicable, they do not consider other observed covariates that may be associated with gene expression.

For instance, a previous study has shown that the expression levels of oxidative stress-associated genes were differentially expressed with smokers with chronic obstructive pulmonary disease (COPD) through gene set enrichment analysis using microarray data [15]. Let us suppose we want to carry out DN analysis on expression data that includes oxidative stress-associated genes and smoking status, which would be used as a grouping variable. In practice, clinicians would also want to include additional covariates such as patient history of cardiac arrhythmia [16] and lung carcinoma [17] to garner more information for better prognosis. However, there is no available direct regression modeling for DN analysis regressing gene expression level between the smoking statuses with the inclusion of additional clinical covariates described above.

The pseudo-value approach was first developed from the leave-one-out jackknife subsampling procedure, applied to a marginal quantity representing some aspect of a marginal distribution of the response variable. It was originally introduced by Andersen and his colleagues [18, 19] for multi-state survival models. Several studies [20, 21] purported that the pseudo-value regression has advantages that the pseudo-values derived from an asymptotically linear and unbiased estimator are approximately independent and identically distributed with the same conditional expectation. Ahn and Logan [22] and Ahn and Mendolia [23] showed that their pseudo-value approaches controlled the type I error while maintaining high power with clustered survival data. With these benefits, we propose a regression modeling method that regresses the jackknife pseudo-values [24] derived from a measure of connectivity of genes in a network to estimate the effects of predictors. Note that the grouping variable itself could also be included in the regression model along with additional clinical covariates while regressing the pseudo-values. We loosely refer to this as a “multivariate setting”, whereas in “univariate settings” only the group variable is utilized in a DN analysis.

Thus, in this paper, we introduce a Pseudo-value Regression Approach for Network Analysis (PRANA). This is a novel method of DN analysis that can adjust for additional covariates. We start from mutual information (MI) criteria, followed by pseudo-value calculations, which are then entered into a robust regression model. This article assesses the model performances of our pseudo-value approach in a multivariable setting, followed by a comparison to dnapath and DINGO in the univariable setting through simulations. Lastly, we employ our method in a real data application [25] from the Gene Expression Omnibus (GEO) database [26] to identify DC genes that are associated with COPD. All statistical analyses are performed in R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria).

Results

Simulation study

More details of the simulation setup are available in the “Materials” section below. We select \(p = 20, 50, 100\) genes to test our pseudo-value approach in smaller to larger gene networks. For each gene network, five different sample sizes \(n = 40, 100, 200, 500, 1000\) are considered and in each setting, 1000 Monte Carlo replicates. We draw 1000 random samples first, then take the subsamples from this pool for a simulation with a smaller sample size to reduce computational burden. A random network is generated at each simulation replicate in which a layer of randomness is imposed to account for biological variability of the network structure. For additional details on the generation of simulated RNA-seq data, see the “Materials” section. Simulations are repeated to show the performance of our method by altering the effect size from 5%, 10%, to 20% for simulation scenarios I and II.

Table 1 Scenario I simulation results of binary group variable in the multivariable robust regression model (continuous age and binary group) using pseudo-value approach with 1000 replicates
Table 2 Scenario II simulation results of binary group variable in the multivariable robust regression model (continuous age and binary group) using pseudo-value approach with 1000 replicates

Results are compared with the true parent network in order to compute the performance measures described in the “Performance Measures” section. In the true network setting, a gene is considered truly DC between groups if it has at least one DC edge connected to other genes. Tables 1 and 2 summarize simulation results in the multivariable setting for scenarios I and II, respectively, when the continuous variable is added as a covariate with the binary group variable in the regression. Table 2 incorporates the effect of covariate when generating random networks, whereas Table 1 does not. For both scenarios, results show that the pseudo-value regression method generally yields a high precision and recall across all specifications of network size, sample size, and effect size. The pseudo-value regression method maintains a high precision while having an acceptable recall, especially, when a smaller sample size is considered.

Table 3 Scenario I simulation results of binary group variable in the univariable robust regression model using pseudo-value approach with 1000 replicates
Table 4 Scenario II simulation results of binary group variable in the univariable robust regression model using pseudo-value approach with 1000 replicates
Table 5 Scenario III simulation results of binary group variable in the multivariable and univariable robust regression model using pseudo-value approach with 1000 replicates

Tables 3 and 4 summarize simulation results for scenarios I and II, respectively, when only the binary group variable is included in the model for pseudo-value calculation. Thus, age dependent networks are simulated for Table 4 but not for Table 3. Two competing univariable methods, dnapath and DINGO are included for these simulations.

Overall, a similar pattern is observed in the univariable setting; i.e., PRANA consistently reaches a high precision and recall. The performance improves as n increases, as to be expected. It is noteworthy that PRANA outperforms dnapath in simulation when the sample size is relatively small regardless of the network size. Our method also shows a better recall value and F1-score than dnapath with small sample sizes (\({n} = 40, 100\)). As DINGO requires substantially large computational time, it was considered for the gene network with smaller sample sizes only. To be more specific, simulations with larger sample sizes (\(n = 500, 1000\)) are stopped after 20 days for DINGO from the University of Florida Research Computing Linux server, HiPerGator 3.0 with 10 CPU cores and 10 GB of RAM per node. See Table S1 in Additional file 1 for more details.

Table 5 presents results of scenario III, where age acts as a confounder. That is, an observed difference in connectivity may be due to a difference in the distribution of age in the two groups. Higher precision values from the multivariable pseudo-value regression indicate that PRANA correctly identifies the DC genes, compared with dnapath and DINGO, neither of which accounts for the effects of age. By and large, PRANA has higher precision than DINGO and higher recall than dnapath.

Application study

Analysis of COPDGene data

23 out of 28 COPD-related genes are predicted to be DC between current and non-current smokers with PRANA while accounting for smoking pack years, age, gender, race, and FEV1. A complete list of DC genes found from the pseudo-value approach are CITED2, TESK2, AMZ1, DDX1, DMWD, MED13L, ZBTB38, EML4, HSPA4, ITGB8, TEPP, TNPO1, ARNTL, DTWD1, ADAMTSL3, THRA, SLMAP, DENND2D, STN1, SYN3, ASAP2, IER3, and MFHAS1.

We compared results of PRANA with dnapath [14] and DINGO [11]. With DINGO, a total of 19 out of 28 COPD-related genes were selected as DC genes between current and non-current smokers. A complete list of DC genes found in DINGO are the following: ARNTL, DDX1, HSPA4, ITGB8, SLMAP, SYN3, ASAP2, IER3, MFHAS1, VGLL4, CITED2, TESK2, CCDC69, EML4, ADAMTSL3, DENND2D, AMZ1, RASEF, and ZBTB38. Lastly, 3 genes were found DC between current smoking groups with dnapath, namely DTWD1, EML4, and TEPP.

Of the 23 DC genes from PRANA, 5 genes are found exclusive to PRANA (DMWD, MED13L, TNPO1, THRA, and STN1). Notably, DMWD is linked to myotonic dystrophy, a rare genetic muscular disorder [27]. Thyroid hormone receptor alpha (THRA) is related to congenital hypothyroidism [28]. These findings about additional genes will facilitate harnessing of the possible mechanisms at work in COPD exacerbation.

Fig. 1
figure 1

Network plots visualizing the gene network (\({p} = 20\)) without a covariate dependence structure that depends on binary group only (scenario I). The row represents group whereas the column represents age categories. The three networks in each row are identical, since there is no effect of age on the structure of network. The edges of the hub nodes are removed based on the effect size of the binary group

Fig. 2
figure 2

Network plots visualizing the gene network (\({p} = 20\)) with a covariate dependence structure that depends on age and group information (scenario II). The row represents group whereas the column represents age categories. All six networks have unique structure of the network. The edges of the hub nodes are firstly removed based on the effect size of the group, as shown in Fig. 1 above. For this scenario II, additional edges of nodes with greater number of connected edges are removed for each age category

Fig. 3
figure 3

Network plots visualizing the gene network (\({p} = 20\)) with a covariate dependence structure that depends on age and group information with unequal sampling proportions with respect to different distribution of the age in the two groups (scenario III). The row represents group whereas the column represents age categories. All six networks have unique structure of the network. The edges of the two hub nodes are removed for each age category. To employ the effect of group, \(10\%/10\%/80\%\) of the subjects in \(z = 1\) will have a network structure to each of the first, second, and third networks in the first row. In contrast, \(80\%/10\%/10\%\) of the subjects in \(z = 2\) will have a network structure to each of the first, second, and third networks in the second row

Heat shock protein family A (Hsp70) member 4 (HSPA4) is associated with gastric ulcer [29]. Multifunctional ROCO family signaling regulator 1 (MFHAS1) is linked to soft tissue tumor and cell cycle [30]. HSPA4 and MFHAS1 are DC genes identified in both PRANA and DINGO. Echinoderm microtubule-associated protein-like 4 (EML4) is found in all three methods. It has been studied for its association with lung cancer [30, 31]. A Venn diagram is provided to show the overlap between and among three methods (Fig. 4). In addition, a diagram is included to summarize the findings of this application study (Fig. 5).

Fig. 4
figure 4

A Venn diagram displaying the number of overlapping DC genes between and among univariable analysis such as DINGO and dnapath versus multivariable robust regression with pseudo-value approach using COPDGene study data from GEO database

Fig. 5
figure 5

A diagram summarizing results using each methods analyzing the COPDGene study data from GEO database. A full list of DCGs (differentially connected genes) are provided in each box

Discussion

Simulations and real-data analysis have elucidated that PRANA is superior to existing alternatives and a practical tool, which includes covariates in the model. To the best of our knowledge, this is the first attempt to develop a regression modeling in DN analysis. Our working objective is to propose a statistical method that determines whether a gene is significantly DC between groups with the covariate included in the model. In this paper, we have shown through simulations that PRANA reaches a consistently high degree of precision and recall to identify DC genes with varying simulation parameters such as network size, sample size, and effect size. We also analyzed a COPD-related gene expression data from the GEO database. When comparing results from our method to dnapath and DINGO, five COPD-related genes are additionally found DC between current versus non-current smokers: DMWD, MED13L, TNPO1, THRA, and STN1.

There are a number of limitations to be highlighted in this study. We have used the absolute value of the differences between the two adjacency matrices as a proxy to determine the true DC genes. Certainly, this is a practical way to detect differences in the number of edges for each genes in a network. The comparison of maximum values between adjacency matrices was also considered. However, we concluded that they are more useful describing the global characteristic of a network, which deviates from our objective, namely, a gene-specific characteristic of a network.

Another limitation is the inability to perturb simulated networks in a continuous way. Right now, we have discretized the effect of a covariate into three groups. Perhaps, there are other models where a truly continuous covariate could be incorporated.

Lastly, the Pearson correlation, partial correlation, and degree-weighted LASSO were also examined as alternatives to the ARACNE as a measure of association or connectedness, albeit not reported in the paper, due to relatively poor performance and heavy computational costs. It remains an interesting task for future studies to extend our work to other measures of association of a network which better assess different structural changes in the network. We conclude by foregrounding the future direction of the pseudo-value regression approach for the DN analysis, which are potentially extensible to other data types, such as the microbiome data.

Conclusion

The adjustment of covariate is an important step in differential network analysis. In this paper, we presented PRANA, a novel pseudo-value regression approach for the DN analysis, which can incorporate additional clinical covariates in the model. This is a direct regression modeling, and it is therefore computationally amenable for the most users.

Methods

Mutual information and ARACNE

Mutual information (MI) determines whether and how two genes interact. That is, it is a measure of their relatedness and calculated from their joint expression profiles. MI is zero if and only if the joint distribution between the expression level of gene j and gene k satisfies \(P(g_j, g_k) = P(g_j)P(g_k)\) for \(j \ne k\), or if \(j = k\); in other words, they are statistically independent. Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) which are proposed by [32, 33] estimates MI using a computationally efficient Gaussian kernel estimator. The estimate of MI is used to quantify the connectivity of each pair of genes in a network. Given a set of two genes measurements, \(\overrightarrow{u_i} \equiv ( g_{ij}, g_{ik} )\), \(i = 1, \ldots , n\), the joint probability distribution is approximated as \(f(\overrightarrow{u}) = 1/n \sum _i h^{-2} \Phi (h^{-1} (\overrightarrow{u} - \overrightarrow{u_i}))\), where \(\Phi\) is the bivariate standard normal density and h is the position-dependent kernel width. Then the MI can be expressed as [32]:

$$\begin{aligned} \hat{I}_{jk} = \frac{1}{n} \sum _i \log \frac{f(g_{ij}, g_{ik})}{f(g_{ij}) f(g_{ik})}, \end{aligned}$$

where \(f(g_j)\) and \(f(g_k)\) are the marginals of \(f(\overrightarrow{u})\). The matrix containing entries \(\hat{I}_{jk}\) is defined as the association matrix. The ARACNE algorithm copula-transforms the profiles for MI estimation because MI is reparameterization invariant; thus, the range of these transformed variables is between 0 and 1 [34].

Pseudo-value approach

Let \(\hat{I}_{jk}\) be the MI estimate for a pair of genes \(j,k \in \{1, \dots , p\}\) of an estimated network from n individuals. For each gene k, we sum the edges (MI estimates) around the gene by taking the column sum of the association matrix to obtain the total connectivity (which can be deemed as a continuous version of degree centrality) of gene k:

$$\begin{aligned} \hat{\theta }_{k} = \sum _{j=1}^{p} \hat{I}_{jk}, \end{aligned}$$

where \(k = 1, \dots , p\).

The jackknife pseudo-values [24] for the \(i^\text {th}\) individual and \(k^\text {th}\) gene are defined by:

$$\begin{aligned} \tilde{\theta }_{ik} = n\hat{\theta }_{k} - (n-1)\hat{\theta }_{k(i)}, \end{aligned}$$
(1)

where \(\hat{\theta }_{k(i)}\) is the column sum of a gene calculated from the re-estimated association matrix using the RNA-seq data without the \(i^\text {th}\) subject. For each gene k, the re-estimation process requires n such calculations with the data size of \(n-1\).

Let Z a binary group indicator. Let \(\mathcal {G}_1 = \{i: Z_{i} = 1 \}\), \(\mathcal {G}_2 = \{i: Z_{i} = 2 \}\), and \(n_z = |\mathcal {G}_z|\) is the sample size for the two groups \(z=1, 2\) and \(n = \sum n_{z}\). The jackknife pseudo-values are separately obtained within groups. Following the general formula above, for gene k and group z, we similarly define \(\hat{\theta }_{k}^z\) and \(\hat{\theta }_{k(i)}^z\), where \(i = 1,\ldots , n_{z}\). Then for each \(i \in \mathcal {G}_z\), the \(k^\text {th}\) gene jackknife pseudo-values are calculated by \(\tilde{\theta }_{ik} = n_z\hat{\theta }_{k}^z - (n_z-1)\hat{\theta }_{k(i)}^z\).

Next, a robust regression is applied to regress the pseudo-values on a set of covariates, including Z and \({\textbf {X}}\), where Z is the group indicator and \({\textbf {X}} = (X_{1}, \dots , X_{q}\)) are the potential confounders, such as age and gender. For the \(i^\text {th}\) individual and \(k^\text {th}\) gene, we posit the model

$$\begin{aligned} E[\tilde{\theta }_{ik} | Z_{i}, {\textbf {X}}_{i}] = \alpha _{k} + \beta _{k}Z_{i} + \gamma _{k1}X_{i1} + \dots + \gamma _{kq}X_{iq}, \end{aligned}$$
(2)

where \(\alpha _{k}\) is the intercept, \(\beta _{k}\) is the regression coefficient for Z, and \(\gamma _{k1}, \dots , \gamma _{kq}\) is the set of regression coefficients to be estimated for X. The main parameter of interest is \(\beta _{k}\) to test for the change in total connectivity (or degree centrality) of the \(k^{\text {th}}\) gene between groups. Least trimmed squares (LTS), also known as least trimmed sum of squares [35], is then implemented to perform a robust regression. The LTS estimator is defined by

$$\begin{aligned} \min _{\alpha _k, \beta _k, \gamma _{k1}, \ldots , \gamma _{kq}} \sum _{i=1}^h r_{(i)} (\alpha _k, \beta _k, \gamma _{k1}, \ldots , \gamma _{kq})^2, \end{aligned}$$

where \(r_{(i)}\) is the set of ordered absolute values of the residuals (in increasing order of absolute value) and h may depend on some pre-defined trimming proportion c, for instance by means of \(h = [n(1-c)] + 1\). In general, c is chosen between 0.5 and 1 [36].

Hypothesis testing

To test whether the true difference in total connectivity of \(k^\text {th}\) gene differs between groups, we test the null hypothesis of \(H_{0}: \beta _{k} = 0\) against the research hypothesis \(H_{1}: \beta _{k} \ne 0\). The t-statistic is computed by \(\hat{\beta }_{k}/SE(\hat{\beta }_{k})\), where \(SE(\hat{\beta }_{k})\) standard error of \(\hat{\beta }_{k}\), obtained using large sample theory, and which is the least-squares estimator of \(\beta _{k}\) for \(k = 1, \dots , p\) from the robust regression described in equation (2). P-values are calculated using a t-distribution as in robustbase R package [37, 38].

It is important to control the false discovery rate (FDR), since multiple hypothesis tests are conducted in the DN analysis. The FDR measures the proportion of false discoveries among a set of genes which are significantly DC between groups. The empirical Bayes screening (EBS) approach [39] has been applied to control the FDR, which is an extension of Westfall and Young step-down adjusted p-values [40]. The EBS procedure is robust against model mis-specification, as it utilizes nonparametric function estimation techniques for the estimation of the marginal density of the transformed p-values.

Materials

This section details step-by-step procedures how the simulation is performed. The performance of our proposed method is assessed by an extensive simulation study. Data are simulated with different number of genes p and sample size n. In this simulation, the regression model includes two covariates Z and X, where Z is the group indicator and \(X \sim N(55, 10)\) is a continuous covariate such as the age of a patient. Three different simulation scenarios are considered.

Data generation

Simulate weighted networks and RNA-seq data with a dependence structure that depends on Z and/or X using the SeqNet R package [41]. In this setting, there are total of six networks for the combination of two groups and three age categories (younger than 50, 50-60, and older than 60). We consider three different scenarios incorporating group information only (scenario I), age and group information (scenario II), and age and group information with unequal sampling proportions with different distributions of the age in the two groups (scenario III) (see Figures 1–3 for visual demonstrations).

Scenarios I (a–b) and II (a–c)

  1. a.

    Generate the first random network with p nodes for \(z = 1\). The \(p \times p\) adjacency matrix, where the diagonal elements are 0 and non-diagonal elements are in \(\{0, 1\}\), is extracted from this first graph. It is a symmetric matrix indicating whether a pair of nodes are connected by an edge. Take the column sum of the adjacency matrix to see the total number of connected edges to the node. Record the indices of this vector with column sum for the use of effect size adjustment in later step.

  2. b.

    Perturb the first network to generate the second network for \(z = 2\) by removing the edges around nodes using the indices obtained in previous step. To assess the effect size of group, the top \(5\%\), \(10\%\), and \(20\%\) of total nodes with the most number of edges in a network lose their edges (e.g. 2 nodes with the most number of edges for a network with \(p = 20\) for the effect size of \(10\%\), Fig. 1). This is the end of scenario I.

  3. c.

    For scenario II, further perturb remaining networks by removing edges of one additional node with the next most number of edges, coming after Step (b) above. This is to simulate networks with a covariate dependence structure on both age and group (see Fig. 2).

Scenario III

We created a scenario where age is acting like a confounder. In other words, for a given each category that two networks are the same, but the distributions of the age of the patients are different in the two groups. Therefore, there will be an observed difference in network connectivity, which is explained through age.

  1. a.

    Generate the first random network with p nodes for younger than 50 category. The \(p \times p\) adjacency matrix, where the diagonal elements are 0 and non-diagonal elements are either \(\{0, 1\}\), is extracted from this first graph. Record the indices of connected edges for the perturbation of network in later steps below.

  2. b.

    Perturb the first network to generate the second network for age 50–60 category by removing the edges of the two nodes with the most number of edges in a network lose all of their edges. In other words, we refer to the indices, recorded in the adjacency matrix from the earlier step, and remove all the connected edges around the two nodes.

  3. c.

    Next, we repeat the same to perturb the second network to obtain the third network for older than 60 category (see Fig. 3).

For all three scenarios, we assign a partial correlation to edges to obtain weighted networks [41]. Note that adjacency matrices of these weighted networks are used for the the true connection per gene. Generate RNA-seq samples based on weighted networks with equal sampling proportions for scenarios I and II. However, specifically for scenario III, a sampling proportion differs across age categories and groups. That is, \(10\%/10\%/80\%\) for \(z=1\) and \(80\%/10\%/10\%\) for \(z = 2\). The data generation involves with two major steps. Firstly, we generate gene expressions (Gaussian values) from a group-specific weighted network for each gene, denoted as \(\tilde{x}_{i} \sim N(0, 1)\). These Gaussian values are then mapped into RNA-seq data column-wise by using the inverse CDF of empirical distribution of the reference data using expression data with accession number GSE158699 [25] from the Gene Expression Omnibus (GEO) database [26]. We will have \(n_{z} \times p\) matrices for each group \(z = 1, 2\).

Algorithm

  1. 1

    Obtain an association matrix with ARACNE from the data generated in steps from the “Data Generation” section to fit an estimated network using minet [32, 42] for each group.

  2. 2

    For each gene k, calculate the column sums of association matrix for each group z separately, denoted by \(\hat{\theta }_{k}^{z}\).

  3. 3

    For each gene k and individual \(i \in \mathcal {G}_{z}\), compute \(\hat{\theta }_{k(i)}^{z}\) from the association matrix that is re-estimated based on RNA-seq data without the ith subject of \(n_{z} \times p\) data from the “Data Generation” section for each group z separately, where \(i = 1, \dots , n_{z}\).

  4. 4

    Calculate \(\tilde{\theta }_{ik}\) using Eq. (1) based on Step 2 and 3.

  5. 5

    For each gene k, fit a multivariable robust regression with binary group variable and continuous age variable to obtain the p values of the group variable, computed from the t test. These p values are used to compute the performance measures of simulation study. More details on the performance measures are stated next.

Performance measures

To evaluate the performance of our proposed method, precision, recall, and the F1 score are calculated. Let \(\Omega ^{z} \in \textbf{R}^{p \times p}\) be the adjacency matrix for group z, where

$$\begin{aligned} \Omega _{jk}^{z} = {\left\{ \begin{array}{ll} 1 &{} \quad \text {if } j^{\text {th}} \text { gene and } k^{\text {th}} \text { gene are connected}\\ 0 &{} \quad \text {otherwise}, \end{array}\right. } \end{aligned}$$

for \(z = 1, 2\). Then, for each gene k, we calculate

$$\begin{aligned} \eta _{k} = I\bigg ( \sum _{j=1}^{p} | \Omega _{jk}^{1} - \Omega _{jk}^{2} | \ge 1 \bigg ), \end{aligned}$$

where \(I(\cdot )\) is an indicator function to determine whether gene k has differential connectivity. The gene k is truly DC if \(\eta _{k} = 1\), and is not DC if \(\eta _{k} = 0\) for the true gene network. Similarly, for the covariate dependence structure, the following quantities are obtained

$$\begin{aligned} \eta _{k} = I\bigg ( \frac{1}{c} \sum _{c} \sum _{j=1}^{p} | \Omega _{jk}^{1,c} - \Omega _{jk}^{2, c} | \ge 1 \bigg ), \end{aligned}$$

where \(\Omega ^{z, c} \in \textbf{R}^{p \times p}\) be the adjacency matrix for group z and age category \(c=1, 2, 3\). Denote that S is the total number of Monte Carlo simulation replicates. Let \(q_{ks}\) be adjusted p value as in following procedure [39] of kth gene at the sth simulation replicate. \(\alpha\) represents the magnitude of error control, and 0.05 was used throughout the simulation.

  • Precision is the proportion of genes that are inferred to be significantly DC from the test which have true connection between two comparing groups:

    $$\begin{aligned} \text {Precision} = \frac{ \sum _{k = 1}^{p} \eta _{k} \, I(q_{ks}< \alpha ) }{\sum _{k = 1}^{p} I(q_{ks} < \alpha ) }. \end{aligned}$$
  • Recall is the proportion of genes that have true connection which are correctly inferred to be significantly DC between two comparing groups from the test:

    $$\begin{aligned} \text {Recall} = \frac{ \sum _{k = 1}^{p} \eta _{k} \, I(q_{ks} < \alpha ) }{\sum _{k = 1}^{p} \eta _{k} }. \end{aligned}$$
  • F1 is calculated based on the harmonic mean of precision and recall obtained from the simulation. A higher F1 score suggests lower false negative and false positive rate:

    $$\begin{aligned} \text {F1} = 2 \cdot \frac{\text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}}. \end{aligned}$$

COPDGene data

A recent genome-wide association study [43] identified 35 new COPD-related genes from the UK Biobank and International COPD Genetics Consortium data. Among these 35 COPD-related genes, 28 genes are available in the data from the COPDGene study for the analysis using PRANA and other DN analysis methods including dnapath and DINGO. The 28 COPD-related genes are the following: CITED2, TESK2, COL15A1, AMZ1, RASEF, DDX1, DMWD, MED13L, ZBTB38, CCDC69, EML4, HSPA4, ITGB8, TEPP, TNPO1, ARNTL, DTWD1, ADAMTSL3, RREB1, THRA, SLMAP, DENND2D, STN1, SYN3, ASAP2, IER3, MFHAS1, and VGLL4.

Among 2561 samples from the initial phenotype data, we have used 406 samples that were provided as the validation set in the analysis of the original study. For the analysis with PRANA, binary current smoking status variable is used as the grouping variable, and smoking pack years, age, gender, race, and FEV1 are included as additional covariates in a multivariable model. The binary current smoking status variable is used as the grouping variable for dnapath and DINGO.

Availability of data and materials

PRANA is available at (https://github.com/sjahnn/PRANA). The COPDGene data is available in the GEO database with accession number GSE158699 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158699). Please reach out to the author (Somnath Datta, somnath.datta@ufl.edu) if you have any further inquiries on the data or code.

Abbreviations

COPD:

Chronic obstructive pulmonary disease

DC:

Differentially connected

DE:

Differential expression

DN:

Differential network

GEO:

Gene Expression Omnibus

LASSO:

Least absolute shrinkage and selection operator

MI:

Mutual information

References

  1. Reuter JA, Spacek DV, Snyder MP. High-throughput sequencing technologies. Mol Cell. 2015;58(4):586–97.

    Article  CAS  Google Scholar 

  2. Gao M, Zhong A, Patel N, Alur C, Vyas D. High throughput RNA sequencing utility for diagnosis and prognosis in colon diseases. World J Gastroenterol. 2017;23(16):2819–25.

    Article  CAS  Google Scholar 

  3. Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.

    Article  CAS  Google Scholar 

  4. Zhao S, Fung-Leung W, Bittner A, Ngo K, Liu X. Comparison of RNA-seq and microarray in transcriptome profiling of activated T cells. PLoS ONE. 2014;9(1): e78644.

    Article  Google Scholar 

  5. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.

    Article  Google Scholar 

  6. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.

    Article  CAS  Google Scholar 

  7. Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2012;13(3):523–38.

    Article  Google Scholar 

  8. Ideker T, Krogan NJ. Differential network biology. Mol Syst Biol. 2012;8(565):1–9.

    Google Scholar 

  9. Kim Y, Hao J, Gautam Y, Mersha TB, Kang M. DiffGRN: differential gene regulatory network analysis. Int J Data Min Bioinform. 2018;20(4):362–79.

    Article  Google Scholar 

  10. de la Fuente A. From ‘differential expression’ to ‘differential networking’—identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010;26(7):326–33.

    Article  Google Scholar 

  11. Ha MJ, Baladandayuthapani V, Do KA. DINGO: differential network analysis in genomics. Bioinformatics. 2015;31(21):3413–20.

    Article  CAS  Google Scholar 

  12. McKenzie AT, Katsyv I, Song WM, Wang M, Zhang B. DGCA: a comprehensive R package for differential gene correlation analysis. BMC Syst Biol. 2016;10(106):1–25.

    Google Scholar 

  13. Gill R, Datta S, Datta S. A statistical framework for differential network analysis from microarray data. BMC Bioinform. 2010;11(95):1–10.

    Google Scholar 

  14. Grimes T, Potter SS, Datta S. Integrating gene regulatory pathways into differential network analysis of gene expression data. Sci Rep. 2019;9(1):1–12.

    Article  CAS  Google Scholar 

  15. Pierrou S, Broberg P, O’Donnell RA, Pawlowski K, Virtala R, Lindqvist E, et al. Expression of genes involved in oxidative stress responses in airway epithelial cells of smokers with chronic obstructive pulmonary disease. Am J Respir Crit Care Med. 2007;175(6):577–86.

    Article  CAS  Google Scholar 

  16. Rusinowicz T, Zielonka TM, Zycinska K. Cardiac arrhythmias in patients with exacerbation of COPD. Adv Exp Med Biol. 2017;1022:53–62.

    Article  Google Scholar 

  17. Durham AL, Adcock IM. The relationship between COPD and lung cancer. Lung Cancer. 2015;90(2):121–7.

    Article  CAS  Google Scholar 

  18. Andersen P, Klein J, Rosthøj S. Generalised linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika. 2003;90(1):15–27.

    Article  Google Scholar 

  19. Andersen PK, Klein JP. Regression analysis for multistate models based on a pseudo-value approach, with applications to bone marrow transplantation studies. Scand J Stat. 2007;34:3–16.

    Article  Google Scholar 

  20. Dutta S, Datta S, Datta S. Temporal prediction of future state occupation in a multistate model from high-dimensional baseline covariates via pseudo-value regression. J Stat Comput Simul. 2017;87(7):1363–78.

    Article  Google Scholar 

  21. Graw F, Gerds TA, Schumacher M. On pseudo-values for regression analysis in competing risks models. Lifetime Data Anal. 2009;15:241–55.

    Article  Google Scholar 

  22. Ahn KW, Logan BR. Pseudo-value approach for conditional quantile residual lifetime analysis for clustered survival and competing risks data with applications to bone marrow transplant data. Ann Appl Stat. 2016;10(2):618–37.

    Article  Google Scholar 

  23. Ahn KW, Mendolia F. Pseudo-value approach for comparing survival medians for dependent data. Stat Med. 2014;33(9):1531–8.

    Article  Google Scholar 

  24. Efron B, Tibshirani RJ. An introduction to the bootstrap. Philadelphia: Chapman & Hall/CRC; 1993.

    Book  Google Scholar 

  25. Wang Z, Masoomi A, Xu Z, Boueiz A, Lee S, Zhao T, et al. Improved prediction of smoking status via isoform-aware RNA-seq deep learning models. PLoS Comput Biol. 2021;17(10):e1009433.

    Article  CAS  Google Scholar 

  26. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets-update. Nucleic Acids Res. 2013;41:D991–5.

    Article  CAS  Google Scholar 

  27. Westerlaken JH, Van der Zee CE, Peters W, Wieringa B. The DMWD protein from the myotonic dystrophy (DM1) gene region is developmentally regulated and is present most prominently in synapse-dense brain areas. Brain Res. 2003;971(1):116–27.

    Article  CAS  Google Scholar 

  28. Tylki-Szymańska A, Acuna-Hidalgo R, Krajewska-Walasek M, Lecka-Ambroziak A, Steehouwer M, Gilissen C, et al. Thyroid hormone resistance syndrome due to mutations in the thyroid hormone receptor α gene (THRA). J Med Genet. 2015;52(5):312–6.

    Article  Google Scholar 

  29. Sakurai T, Kashida H, Hagiwara S, Nishida N, Watanabe T, Fujita J, et al. Heat shock protein A4 controls cell migration and gastric ulcer healing. Dig Dis Sci. 2015;60(4):850–7.

    Article  CAS  Google Scholar 

  30. Stelzer G, Rosen N, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, et al. The GeneCards suite: from gene data mining to disease genome sequence analyses. Curr Protoc Bioinform. 2016;54:1301–13033.

    Article  Google Scholar 

  31. Adib R, Montgomery JM, Atherton J, O’Regan L, Richards MW, Straatman KR, et al. Mitotic phosphorylation by NEK6 and NEK7 reduces the microtubule affinity of EML4 to promote chromosome congression. Sci Signal. 2019;12(594):eaaw2939.

    Article  Google Scholar 

  32. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla-Favera R, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinform. 2006;7 Suppl 1(Suppl 1):S7.

    Article  Google Scholar 

  33. Zoppoli P, Morganella S, Ceccarelli M. TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinform. 2010;11(154):1–15.

    Google Scholar 

  34. Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A. Reverse engineering of regulatory networks in human B cells. Nat Genet. 2005;37(4):382–90.

    Article  CAS  Google Scholar 

  35. Rousseeuw PJ. Least median of squares regression. J Am Stat Assoc. 1984;79(388):871–80.

    Article  Google Scholar 

  36. Pison G, Van Aelst S, Willems G. Small sample corrections for LTS and MCD. Metrika. 2002;55:111–23.

    Article  Google Scholar 

  37. Todorov V, Filzmoser P. An object-oriented framework for robust multivariate analysis. J Stat Soft. 2009;32(3):1–47.

    Article  Google Scholar 

  38. Maechler M, Rousseeuw P, Croux C, Todorov V, Ruckstuhl A, Salibian-Barrera M, et al. robustbase: Basic Robust Statistics; 2022. R package version 0.95-0. http://robustbase.r-forge.r-project.org/.

  39. Datta S, Datta S. Empirical Bayes screening of many p-values with applications to microarray studies. Bioinformatics. 2005;21(9):1987–94.

    Article  CAS  Google Scholar 

  40. Westfall PH, Young SS. Resampling-based multiple testing: examples and methods for p-value adjustment. New York: Wiley-Interscience; 1993.

    Google Scholar 

  41. Grimes T, Datta S. SeqNet: An R package for generating gene-gene networks and simulating RNA-seq data. J Stat Soft. 2021;98(21). https://doi.org/10.18637/jss.v098.i12.

  42. Meyer PE, Lafitte F, Bontempi G. minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinform. 2008;9(461).

  43. Sakornsakolpat P, Prokopenko D, Lamontagne M, et al. Genetic landscape of chronic obstructive pulmonary disease identifies heterogeneous cell-type and phenotype associations. Nat Genet. 2019;51(3):494–505.

    Article  CAS  Google Scholar 

Download references

Acknowledgements

None.

Funding

S.A. has been supported by the National Institute on Alcohol Abuse and Alcoholism of the National Institutes of Health under grants number [NIH T32AA025877-03].

Author information

Authors and Affiliations

Authors

Contributions

S.D. developed the original idea of the study. S.A. implemented the methodology, performed simulations and data analyses of the study. T.G. provided critical feedback. S.A. drafted the manuscript. T.G. and S.D. provided suggestions when writing the manuscript. All authors have reviewed and edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Somnath Datta.

Ethics declarations

Ethical approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Ahn, S., Grimes, T. & Datta, S. A pseudo-value regression approach for differential network analysis of co-expression data. BMC Bioinformatics 24, 8 (2023). https://doi.org/10.1186/s12859-022-05123-w

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-022-05123-w

Keywords

  • Pseudo-value
  • Differential network analysis
  • Regression method
  • Gene regulatory network
  • RNA-seq data