# Differential expression analysis with global network adjustment

- Jonathan A Gelfond
^{1}Email author, - Joseph G Ibrahim
^{2}, - Mayetri Gupta
^{3}, - Ming-Hui Chen
^{4}and - Jannine D Cody
^{5}

**14**:258

**DOI: **10.1186/1471-2105-14-258

© Gelfond et al.; licensee BioMed Central Ltd. 2013

**Received: **11 March 2013

**Accepted: **25 July 2013

**Published: **23 August 2013

## Abstract

### Background

Large-scale chromosomal deletions or other non-specific perturbations of the transcriptome can alter the expression of hundreds or thousands of genes, and it is of biological interest to understand which genes are most profoundly affected. We present a method for predicting a gene’s expression as a function of other genes thereby accounting for the effect of transcriptional regulation that confounds the identification of genes differentially expressed relative to a regulatory network. The challenge in constructing such models is that the number of possible regulator transcripts within a global network is on the order of thousands, and the number of biological samples is typically on the order of 10. Nevertheless, there are large gene expression databases that can be used to construct networks that could be helpful in modeling transcriptional regulation in smaller experiments.

### Results

We demonstrate a type of penalized regression model that can be estimated from large gene expression databases, and then applied to smaller experiments. The ridge parameter is selected by minimizing the cross-validation error of the predictions in the independent out-sample. This tends to increase the model stability and leads to a much greater degree of parameter shrinkage, but the resulting biased estimation is mitigated by a second round of regression. Nevertheless, the proposed computationally efficient “over-shrinkage” method outperforms previously used LASSO-based techniques. In two independent datasets, we find that the median proportion of explained variability in expression is approximately 25%, and this results in a substantial increase in the signal-to-noise ratio allowing more powerful inferences on differential gene expression leading to biologically intuitive findings. We also show that a large proportion of gene dependencies are conditional on the biological state, which would be impossible with standard differential expression methods.

### Conclusions

By adjusting for the effects of the global network on individual genes, both the sensitivity and reliability of differential expression measures are greatly improved.

## Background

A goal of systems biology is to understand how a perturbation affects a network of interrelationships between genes. There are many models for gene networks, but few have shown accurate predictions across many datasets [1]. There is a growing collection of gene perturbation experiments in which a subset of transcipts’ expression have been modulated, either through carcinogenesis, chemical treatments, or through siRNA [2]. These experiments comprise an information-rich dataset that allows us to construct global network models and test their predictive accuracy on the level of the full transcriptome.

Network models can be helpful for distinguishing between all differentially expressed genes and the genes that are immediately affected by a perturbation such as a gene deletion. There are clinically relevant reasons for this line of investigation. For instance, chromosomal abnormalities are the most common cause of mental retardation in the US, and the deletion of a chromosomal segment is a common subtype [3]. The deleted segments contain many genes, so that this is a type of sporadic, multiple gene knockout. The deletions are often *hemizygous*; that is, only one of the two (maternal or paternal) homologous chromosomes with the segment remains intact. Hence, the genes within the deleted segments are present with half of the normal copy number. Most genes exhibit some form of dosage-sensitive decrease in expression (e.g., DNA deletion tends to decrease mRNA expression), but there is a possibility that the gene’s regulatory network can compensate for the loss by selectively increasing the expression of the intact copy. The ability or inability of the regulatory networks to compensate with the intact copies on the homologous chromosome could explain which genes are responsible for common abnormal phenotypes. In this paper, we examined subjects with nearly identical somatic deleted segments of the q-arm of chromosome 18, and we performed comparative gene expression microarrays of these subjects with a parental control sample. Our goal is to: 1) identify the global gene expression differences, 2) to identify which genes are the most affected by the global expression differences relative to their regulatory networks, and 3) to identify genes that have different regulatory networks in case and control groups.

Various methods have been developed both for predicting gene expression and adjusting for the correlations between genes in differential expression analyses. Dahl et al [4] improved the detection of differential expression by grouping genes into co-expression clusters with a Dirichlet process mixture model. Leek at al [5] proposed a general framework for significance testing when a large-scale number of components have mutual dependency. This method may be applied to the differential expression problem and allows independent testing of hypotheses by conditioning on an orthogonal dependence kernel. These methods have shortcomings because they do not explicitly utilize the vast amount of data from prior experiments.

Some network methods do use extensive prior data such as Ruan et al [6] who used the cluster average of a set of co-expressed genes as a prediction for another gene within the cluster. This relatively simple method had similar or superior accuracy to models that used substantial auxiliary data including regulatory pathways and DNA-binding patterns of transcription factors such as the approach taken by Gustafsson et al [7] that used an “elastic-net" penalty [8] and information from multiple genomic modalities (microarray, ChIP-seq, etc.). Along the lines of Ruan et al [6], we present a simple and computationally efficient method for predicting gene expression. Our method extends the work of Congrove et al [9] who modeled a gene’s expression as linearly dependent on other genes, but we have added features that improve computational efficiency and robustness to variations in transcriptional networks.

## Methods

Suppose we have a series of microarrays represented by a matrix *Y* that is *G* × *N* where *G* is the number of the genes on the *N* arrays, and the expression measurements for each gene *g* will be the row vector *y*_{
g
}, *g* ∈ [1 … *G*]. We define the perturbation vector *X* as an *N* × 1 vector with elements *x*_{
i
} = *I*(*i*^{
th
}array is perturbed) where *I*(·) is the indicator function. Our goal is to estimate the effect of the perturbation conditional on the gene expression network *P*(*y*_{
g
}|*X*,*Y*_{−g}) where ${Y}_{-g}=\{{y}_{{g}^{\prime}}:{g}^{\prime}\ne g,{g}^{\prime}\in \phantom{\rule{0.3em}{0ex}}[1\dots G\left]\right\}$. That is, we want to predict the effects of the perturbation on a gene given that it is within a regulatory network consisting of all other genes. Typically, the sample size (*N*) of the experimental data *Y* is small ≤10, but we assume that a database of experiments *Y*_{
D
} (*G* × *N*_{
D
}) is also available with a substantially larger sample size *N*_{
D
} ≥ 100. Our method will use this database *Y*_{
D
} as an independent training set to estimate the regulatory network within *Y*.

*y*

_{ g }follows an approximately Gaussian distribution with mean

*β*

_{ g }

*Y*. A predictive model for each gene within

*Y*and

*Y*

_{ D }can be constructed by using Sparse Simultaneous Equations Model (SSEM) [9, 10]. Under this model, the gene expression of a transcript

*y*

_{ g }can be estimated by a weighted sum of the remaining transcripts. The key parameters are within

*β*

_{ g }(1 ×

*G*) where the ${\beta}_{g{g}^{\prime}}$ elements are the influence that gene

*g*

^{′}has on gene

*g*and represent the overall transcription network with

*β*

_{ gg }= 0 so that genes do not influence themselves. These

*β*

_{ g }parameters can be estimated from a database of related array experiments. Thus, we have the model

*ε*

_{ g }is an 1 ×

*N*vector of Gaussian errors. This may be expressed in matrix form as

*B*is

*G*×

*G*and composed of the ${\beta}_{g{g}^{\prime}}$ parameters such that

*B*has diagonal elements consisting of zeros, and

*E*is the

*G*×

*N*matrix of Gaussian errors with rows

*ε*

_{ g }. The intrinsic gene network can be described by

*B*. Cosgrove et al [9] model the effect of the perturbation of this network by introducing

*ϕ*

_{ g }

*G*×

*N*matrix composed of

*ϕ*

_{ g }and is the

*direct*effect of the perturbation

*X*on gene expression that is not accounted for by the gene network. Their goal is to estimate these direct actions Φ upon gene expression to elucidate, for example, the direct drug targets. They estimated the

*B*matrix by using

*compendia*or a large (

*N*> 100) database of experiments

*Y*

_{ D }, and estimate Φ using the residuals

Cosgrove et al [9] used *ϕ*^{
pert
} as the estimator of the direct action of the perturbation. Further, they did not take into account that the network *B* may not be the same in different cell types or biological conditions, which we consider in our approach below.

Although it was originally applied to large genomic databases, this model can be extended to apply to situations where the current experiment has insufficient data in order to accurately estimate the *B* parameter. Furthermore, estimating *B* with a large database *Y*_{
D
} and applying it to a new dataset will avoid the problem of using the data twice. For instance, if we have a large database of expression measures *Y*_{
D
}, we may estimate *B*_{
D
} such that *Y*_{
D
} = *B*_{
D
}*Y*_{
D
} + *E*, under the assumption that *B*_{
D
} ≈ *B*. Given our estimate of *B*_{
D
} with rows *β*_{
Dg
}, we may then construct a linear predictor of each gene *g* in the new dataset *Y* by *Y*_{Ng} = *β*_{
Dg
}*Y*.

*X*that applies to the individual samples, we propose the model

*β*

_{ Dg }

*Y*

*ν*

_{ g }=

*Y*

_{Ng}

*ν*

_{ g }is substituted for

*β*

_{ g }

*Y*and

*X*

^{′}

*δ*

_{ g }is substituted for

*ϕ*

_{ g }. The additional

*ν*

_{ g }scalar parameter represents a very important increase in model flexibility compared to Cosgrove et al [9] where

*ν*

_{ g }is constrained to be 1.0. In the new approach,

*ν*

_{ g }is estimated for each gene so that if the historical network is inconsistent with the current network and is not a good predictor for gene expression in the current experiment, then its effect on inference approaches zero as

*ν*

_{ g }does. The effect of the biological condition on gene

*g*is described by the scalar

*δ*

_{ g }, and this is distinct from the usual formulation of differential expression that tests the effect of

*X*(

*N*× 1) alone on

*y*

_{ g }[11, 12]. The

*δ*

_{ g }parameter corresponds to an element of the Φ parameter, but unlike

*ϕ*

_{ g }in Equation 3,

*δ*

_{ g }is

*jointly estimated*with

*ν*

_{ g }and undergoes formal statistical testing in our approach rather than acting merely as a basis for ranking candidates. Hence, the hypothesis

*H*

_{0}:

*δ*

_{ g }= 0 is the test of differential expression, conditional on the value of the historical gene network. The p-value can be derived through standard linear model theory, and the status of differential expression can be determined by setting a cutoff for p-values or false-discovery rates. The use of the database

*Y*

_{ D }to construct

*Y*

_{Ng}(1 ×

*N*) is key because it allows the network effect within

*Y*to be estimated with only one additional unknown parameter

*ν*

_{ g }, which is essential for datasets with relatively few biological replicates. It is noteworthy that the differential expression hypothesis using Equation 6 is an increase (or decrease) in a gene’s expression conditional on the gene-network, which is a somewhat different hypothesis than standard differential expression. Additionally, we further extend the previous model of [9] by considering interactions between the network prediction and the covariate matrix by defining the 1 ×

*N*element-wise multiplicative interaction term (

*Y*

_{Ng}×

*X*) within

The tests of the scalar *γ*_{
g
} = 0 correspond to the test of disruption of the gene network associated with the covariate *X*. By modeling the disruption of the network, we may test biologically interesting hypotheses and further improve the accuracy of predicting gene expression.

Even with large gene expression databases, regularization is required for accurate parameter estimation due to the condition *N* ≪ *G* because of the large number of parameters in the model for each gene. The prior model [9] used the Lasso [13] that has an *L*_{1} penalty and produces a parsimonious model with many regression coefficients close to 0. We observed that the Lasso produced unstable regression coefficients with split sample training and validation sets, but prediction accuracy was nevertheless maintained. In a parsimonious model, some transcripts from a set of correlated transcript predictors would have non-zero coefficients whereas the other transcripts in the set would have coefficients close to zero. In this manner, the predictive weight would be unevenly shared by the correlated transcripts. However, Ruan et al [6] obtained good success in predicting gene expression by using a k-nearest-neighbor (KNN) method. The KNN method used the average of co-expressed genes as a predictor, which gives equal predictive weight to a set of correlated transcripts. We suggest a balance between unequal, *parsimonious* (Lasso) and equal, *robust* (KNN) weight among correlated transcript predictors. One such compromise is the elastic net that combines penalties on the *L*_{1} and *L*_{2} norms. The elastic net, however, carries a substantial burden in computational cost and risk of overfitting by optimizing over two parameters.

*Y*

_{ D }and the new data of interest

*Y*are all centered at 0, which is numerically convenient. The gene expression variances are not scaled to all have equal variance because the scale of variability is biologically relevant. We are applying the network estimates from one array platform to another, which are on the log-scale, and we assume that fold-changes (i.e., a doubling of gene-expression) will be meaningful across platforms. After centering, the first step of the model fit is the estimation

*β*

_{ Dg }with penalty

*λ*based upon the database

*Y*

_{ D }. In this approach, the regularization penalty takes the same form as standard ridge regression, and the first objective function to be minimized for each gene is

^{2}is the Euclidean norm. The

*β*

_{ Dg }are the ridge regression coefficients with

*y*

_{ Dg }as the outcome and ${Y}_{D,-g}^{\prime}$ (

*Y*

_{D,−g}=

*Y*

_{ D }without row

*g*) as the covariate matrix. The identifiability of

*β*

_{ Dg }when

*λ*> 0 follows from the theory of ridge regression [14]. As

*λ*→

*∞*, the regression parameter estimates

*β*

_{ Dg }will tend towards 0 as the penalty becomes the dominant term. Hence, the larger the

*λ*, the more the

*β*

_{ Dg }

*Y*

_{ D }is biased towards the null compared to the observed database values

*y*

_{ Dg }. However, we introduce a second step and add another parameter

*ν*

_{ g }to counteract the bias towards the null in predicting the data from a small experiment

*Y*with rows

*y*

_{ g }. The second step is the estimation of

*ν*

_{ g }conditional on the

*β*

_{ Dg }that minimizes ∥

*y*

_{ g }−

*β*

_{ Dg }

*Y*

*ν*

_{ g }∥

^{2}using ordinary least squares. If we let

*Y*

_{ Ng }=

*β*

_{ Dg }

*Y*, then the estimation of

*ν*

_{ g }can be seen as the univariate OLS coefficient of

*Y*regressed onto

*Y*

_{ Ng }. That is, the second objective function operates on the new data

*Y*, and the estimate of row

*y*

_{ g }is

*β*

_{ Dg }

*Y*

*ν*

_{ g }where

*ν*

_{ g }minimizes

using ordinary least squares. The introduction of *ν*_{
g
} within a second stage of the regression has two effects. First, ${\beta}_{g}^{\mathit{\text{OSRR}}}={\beta}_{\mathit{\text{Dg}}}{\nu}_{g}$ is no longer biased towards the null as *λ* → *∞*. The larger *λ* might make *β*_{
Dg
}*Y* smaller than *y*_{
g
} by an order of magnitude, which is the motivation for the term “Over-Shrinkage". Second, larger values of *λ* will act to equalize the elements of *β*_{
g
} corresponding to correlated transcripts. This equalization property produces a similar effect as KNN’s equal weighting, but OSRR still allows data-driven deviation from equality and allows genes that are not the nearest neighbors to influence prediction. Hence, we call this a global network adjustment. An important difference between OSRR and approaches such as ridge regression, LASSO, and the elastic net is that these methods penalize the magnitude of the parameter *β* norms with some function *P*(*β*) whereas the OSRR approach considers a link function between the linear predictor *X* *β* and the outcome *Y*. The link function *g*(*X* *β*) is the second regression so that the model becomes *Y* = *g*(*X* *β*), which may serve to minimize the effect of poor prediction.

The robustness of the OSRR method to variable gene regulatory networks and applicability to smaller sample sizes are its attractive features. If the gene expression network estimate *B*_{
D
} from *Y*_{
D
} results in a prediction *Y*_{
Ng
} that is uncorrelated with the smaller independent dataset *Y*, then the regression parameter *ν*_{
g
} converges to 0 under standard linear model theory. This property is quite distinct from the use of the LASSO model by [9] that does not consider larger training sets for estimating *B* or variable gene-gene dependencies due to the perturbation or biological state. The selection of *λ* is based upon the maximum correlation *across all genes* between the database prediction *β*_{
Dg
}*Y* and the current experiment *Y*. Overfitting may result because the OSRR is applied to the same data it is tuned with (i.e., *λ* is selected based upon *Y*). OSRR largely avoids overfitting and gains simplicity by using a single tuning parameter *λ* for all transcripts. The method was implemented in the R statistical software [15], and we fitted the model in equation 6 with the lme4 R package [16]. A random intercept was used to account for the correlation within case-control pairs. The programs, example simulated data, and a demonstration of the penalty selection are available for download from the supplemental website as Additional file 1.

## Simulation studies

*y*(

*n*× 1) with two independent sets of correlated predictors

*X*

_{1}and

*X*

_{2}both with dimension (

*N*×

*G*). The

*N*= 201 rows of

*X*

_{1}and

*X*

_{2}are drawn from a compound-symmetric, multivariate Gaussian distribution with dimension

*G*= 100, mean 0 and variance 1.0 with correlation

*ρ*= 0.9. Both

*X*

_{1}and

*X*

_{2}represent two correlated sets of genes that have an influence on the expression of the gene

*y*. Within these sets, the influence of the genes is the same, which might reflect the activity of two pathways that are either associated with an increase or an inhibition of the expression of

*y*. That is,

*y*=

*X*

_{1}

*B*

_{1}+

*X*

_{2}

*B*

_{2}+

*ε*where ${B}_{1}={[\phantom{\rule{0.3em}{0ex}}{\beta}_{1}\dots {\beta}_{1}]}^{\prime}$ and ${B}_{2}={[{\beta}_{2}\dots {\beta}_{2}]}^{\prime}$ are

*G*× 1 with identical elements

*β*

_{1}= 1.0 and

*β*

_{2}= −1.0, respectively. The errors

*ε*have a Gaussian distribution with variance 1.0. The data were fit using 3 models: ordinary least squares (OLS), Elastic Net, and the OSRR with

*λ*= 10,000. The elastic net penalty was selected using K = 5 fold cross-validation. The results are shown in Figure 1, which is a network representation of the coefficient estimates. All models perform similarly in terms of prediction. The estimates of

*B*

_{1}and

*B*

_{2}are highly variable for OLS. The elastic net predicts as well as other models, but the estimated coefficients are highly variable with some coefficients clustered around 0. To offset the Lasso parameter estimates close to 0, the remaining coefficients are estimated to be larger in magnitude than the true values. The OSRR exhibits a strong smoothing property that equalizes the influence that each of the correlated predictors has, whereas the elastic net and the Lasso tend to reduce the number of influential predictors. This is an advantage in that it provides robustness, but when there are sparse predictors or there is sufficient sample size to estimate the coefficients more precisely, then we anticipate that the OLS, elastic net or Lasso are advantageous.

*Y*

_{ D }with

*G*= 200 and

*N*= 100. The transcripts were given a block diagonal correlation structure with 2 blocks of size 100 transcripts each with compound symmetry having variance 1 and correlation

*ρ*= 0.8. Although this is a relatively simple multivariate Gaussian model compared to actual gene networks, it still represents a large dependency between genes. The OSRR model was fitted as described to estimate

*B*

_{ D }with

*λ*= 1000 selected out of the set {10

^{2},10

^{3},10

^{4},10

^{5},10

^{6}} based upon the out-sample prediction from training and validation sets. We then simulated a smaller (

*N*= 20) dataset

*Y*with the same correlation structure. Ten of the arrays were under the control condition (

*x*

_{ i }= 0), and 10 were under a treatment condition (

*x*

_{ i }= 1). Differential expression was simulated by randomly selecting genes with probability 0.2 and then adding or subtracting

*δ*

_{ g }= 1.0 to those genes. We compared the performance of the model in equation 6 to a model without correction for the network relationships

*Y*

_{Ng}. This naive model reduces to the t-test. The results are shown in Figure 2. The histograms for p-values corresponding to the null hypothesis (

*δ*

_{ g }= 0) are given for both models, and the OSRR model clearly has smaller p-values for the differentially expressed (DE) genes compared to the naive model. Also, the naive model p-values do not have a uniform distribution for the not differentially expressed genes as the OSRR does. The lack of uniform distribution of the null p-values is due to the dependence structure and could inflate the Type I error. This is consistent with higher sensitivity and specificity of OSRR relative to naive methods that do not utilize prior information on gene networks.

## Empiricial studies

### Chromosomal deletion study

In our chromosome deletion study, we investigated the effects of 18q chromosomal deletions (deleted segments were the 60,021,550−76,117,153 bps of chromosome 18) on cell lines derived from blood samples called lymphoblastoid cell lines (LCLs). There were 10 case-control (c-c) pairs with the case subject having a deletion on chromosome 18 (18q-), and the control was a normal genotype, same-sex parent giving a total of *N* = 20 = 2 × 10 subjects. We assayed each c-c pair with 4 Agilent 44K expression arrays. We used a dye-swap design such that for 2 arrays the case was labeled with Cy3 (control labeled with Cy5) and 2 with case labeled with Cy5 (control labeled with Cy3). Each case had associated clinical phenotypes including height, weight, age, and various laboratory measures of growth hormone responsiveness. The log-expression data for a given subject and gene are denoted as *y*_{
grs
} where the indices *g*, *r*, and *s* represent gene, replicate, and spot respectively. The Cy3 and Cy5 channels were treated separately, and we did not use the log-ratio. For each subject, we averaged the replicates and the spots corresponding to the same gene yielding the *N* × 1 vector *y*_{
g
}. We did not model the dye factor because it was balanced and not predictive. There is a corresponding *N* × 1 covariate matrix *X* where *x*_{
i
} = *I*(*i*^{
th
}subject is a case). We also examined the p-values of the *δ*_{
g
} parameter from the model compared to a standard analysis with the LIMMA software [12].

For *Y*_{
D
}, we used a much larger dataset from [17] who studied similar LCL cells that were treated with a variety of chemotherapies. There were a total of 374 Affymetrix chip assays in this dataset. We included all genes that were within the deleted region, but otherwise restricted our analysis to genes that had gene symbols in common to both the Affymetrix and the Agilent platforms and were expressed in the Affymetrix data. This leaves a total of 5,035 genes to estimate *B*_{
D
}. The expression values of the database genes were centered, but not scaled. A key issue is how to select the penalty parameter *λ*, and the selection process can be extremely computationally intensive as [9] used 200 CPU days for their cross-validation Lasso approach. To greatly simplify the selection of the penalty and minimize the tuning and hypothesis testing on the same dataset, we chose one penalty parameter for all genes and took a split sample approach so that the database was divided into two sets of 187 arrays. Further, we took into account three aspects of goodness-of-fit in assessing cross-validation. First, we considered prediction among the split samples and second, we considered the stability of the regression parameters *B*_{
D
} between the split samples. Third, we examined the prediction of *B*_{
D
} with respect to our independent experiment. The computation time for selecting *λ* took approximately 151,445 CPU seconds (8GB RAM, 2 Ghz) or about a 100 fold decrease in computing time relative to the Lasso approach for a similar sized dataset and computing environment [9]. The Based upon these three criteria, we selected *λ* = 10^{4} out of the set {10^{2},10^{3},10^{4},10^{5},10^{6}}, and we applied the corresponding ${\widehat{B}}_{D}$ to construct predictors ${Y}_{\mathrm{N}g}={\widehat{\beta}}_{\mathit{\text{Dg}}}Y$ for *y*_{
g
} from our expression experiment. Lastly, to further assess the robustness and Type I error, we performed a permutation analysis comparing OSRR and a conventional method. We permuted the case-control labels 100 times, and compared the histograms of the p-values from both methods.

### Results

*Y*

_{Ng}based upon ${\widehat{B}}_{D}$ estimated from the database

*Y*

_{ D }were strongly correlated with new data

*y*

_{ g }. In Figure 3, the median correlation of

*Y*

_{Ng}with

*y*

_{ g }is 0.54 so that $25\%\approx \sqrt{0.54}$ of the variance of transcription regulation is explained. Note that the correlation of the observed expression with the network predictor is expected to be > 0, and the mean correlation is significantly greater than 0 (

*p*< 0.0001). The degree of prediction is remarkable because Cosgrove et al [9] originally studied bacteria, whereas these data indicate that more complex human transcriptional patterns even on different microarray platforms are largely predictable. Nevertheless, for some genes, there is a negative correlation consistent with random noise. This underscores the importance of the estimation of

*ν*

_{ g }in equation 6. That is, if the network model does not fit the smaller dataset, then

*ν*

_{ g }is estimated to be close to 0 so that its effects are mitigated.

*p*< 0.05) values, it also has a clear “spike" of small p-values towards 0. It is possible to model these types of distributions with a beta and uniform mixture (BUM) as suggested by Pounds et al [18] where the beta component models the spike of p-values that corresponds to differentially expressed transcripts and the uniform component corresponds to the transcripts that follow the null hypotheses. The BUM fit is shown in Figure 4, and the ratio of the the uniform (blue) component over the mixture density (green) is the empirical Bayes probability estimate of the false positive rate. The comparison of numbers of genes that at given p-values cutoffs is shown in Table 1. Clearly, the OSRR model identifies more than double (851 vs. 337) the number of significant transcripts, and those transcripts selected by the LIMMA approach are also selected by OSRR at a rate of 95% (319/337). The correlation between the − log(p-values) from LIMMA and OSRR is 0.88, which indicates fairly good agreement in ranking the genes by probability of differential expression. The mean value for the false discovery rate (FDR) of the 21 genes in the deleted region in the chromosome according to LIMMA was 0.51 compared to the mean FDR of the q-value adjusted OSRR p-values of 0.17, indicating that the OSRR model is more likely to identify the deleted genes as differentially expressed.

**Number of significant transcripts in standard LIMMA and OSRR models at different p-value cutoffs**

Up/Down | P-Value | LIMMA | OSRR | Both |
---|---|---|---|---|

Down | 0.0500 | 94 | 209 | 92 |

0.0100 | 25 | 79 | 25 | |

0.0010 | 9 | 19 | 9 | |

0.0001 | 5 | 8 | 5 | |

Up | 0.0500 | 241 | 642 | 227 |

0.0100 | 46 | 251 | 42 | |

0.0010 | 6 | 52 | 5 | |

0.0001 | 2 | 15 | 2 |

*δ*

_{ g }. The OSRR estimate was much more reproducible than the conventional estimate in independent samples with correlations of 0.321 (OSRR) vs. 0.201 (Standard) (95% CI for difference [0.054,0.183],

*p*< 0.001). Furthermore, the out-sample results clearly show that OSRR is more consistent with both OSRR and conventional estimates in independent split datasets.

**Reliability assessed by a repeated split-sample analysis**

In/Out Sample | Type | Mean correlation | Std Dev |
---|---|---|---|

In-Sample |
| 0.917 | 0.034 |

Out-Sample | ${\delta}_{g}^{\text{Std}}$ vs ${\delta}_{g}^{\text{Std}}$ | 0.203 | 0.115 |

Out-Sample |
| 0.321 | 0.073 |

Out-Sample |
| 0.313 | 0.089 |

We compared the performance of the elastic net predictions with OSRR using a subset of 500 genes because of the computational time of the elastic net in this context. We selected a gene-specific penalty as did Cosgrove et al [9] who used the Lasso. We compared the average correlation of the predictions with the expression of the 500 genes in our smaller dataset, and we found that the mean correlation was 0.01 95% CI [−0.01,0.03] (paired t-test: *p* = 0.43) higher for the OSRR model. This is consistent with OSRR allowing equal prediction at a fraction of the computational cost of the elastic net.

We also applied the interaction model in Equation 7. A substantial proportion (10%, 484/5035) of genes had interactions (*H*_{0}: *γ*_{
g
} = 0 with *p* < 0.01) between the dependence on the gene network and case-control status. This implies that the regulatory relationships are different for particular genes between cases and controls. The gene with the most significant interaction effect was the GARS gene, which has been linked to severe neurological conditions [19]. Interestingly, 93% of the 100 most significant interactions were negative (${\widehat{\gamma}}_{g}<0$), which implies that the expression predictions are substantially less positively correlated with the observed expression in the abnormal cases than controls. See Additional file 2. This is consistent with overall disruption of the regulatory networks in the cases. Such observations would be impossible using standard differential expression models that only examine mean differences.

For the purpose of comparison, we applied the method of Cosgrove et al [9] to this dataset without the use of the larger training dataset using the elastic net estimators. Their method was not intended to be applied to datasets of this size (2 conditions (case/control) and 10 paired samples), but we wanted to investigate the necessity of using the database in a real example. Despite 5-fold cross-validation for the selection of the tuning parameters, there was overfitting of the gene network so that the estimates *BY* from the fit had a median correlation with *Y* of > 0.95 and the analysis of the residuals suggested in Equation 5 yielded p-values that did not differ from a uniform distribution (Data not shown). No differentially expressed gene could be identified due to the confounding of the gene expression network *BY* and the case-control status. However, we emphasize that their method was not intended for small datasets.

### Circadian rhythm study

*B*matrix using a database

*Y*

_{ D }from an experiment with the same Affymetrix platform with 122 samples of mixed tissue types [21]. We assessed the 3,000 genes with the highest variance in the circadian dataset. We fit a sinusiodal model with and without network adjustment term

*Y*

_{Ng}

*ν*

_{ g }to identify genes associated with the circadian rhythm

*T*(1 ×

*N*) is the time covariate in radians,

*d*is an intercept, and the time dependencies are tested with scalars

*a*

_{ g }and

*b*

_{ g }(

*H*

_{0}:

*a*

_{ g }=

*b*

_{ g }= 0) using a log-likelihood test assuming that

*ε*is Gaussian noise. We did not use the information about the genotype of the mice within the models. Rather, we tested whether or not the OSRR model could account for the biological variation due to different genetic backgrounds and still identify time dependent genes. The results are shown in Figures 6 and 7. Figure 6 shows the unpermuted data analysis in both models with the OSRR model having a higher proportion of significant genes. Figure 7 shows the null distributions under permutations, and the null distributions of the two models are very similar. This indicates that the OSRR model has more power to detect time-dependent genes with a similar Type I error rate as the conventional model. We also compared the performance of the elastic net and the OSRR gene expression predictions by the average correlations of the network predictors

*Y*

_{Ng}with the outsample gene expression

*y*

_{ g }for a random subset of 200 genes. We found that the OSRR method had a 0.1 higher average correlation (95% CI [0.05,0.15] with a paired t-test:

*p*= 0.0003). That is, the OSRR had better prediction that the elastic net model for less computational costs; however, this advantage in prediction may be due to the smaller sample size of the training set (

*N*= 122) relative to the chromosomal deletion training set (

*N*= 374) for which the prediction of the two methods was approximately equivalent.

## Discussion

We have utilized the estimates of gene-networks from large databases in the analysis of small, independent datasets assayed on a different microarray type to successfully predict about 25% of the variation in transcript expression. The prediction is shown to markedly increase the sensitivity and the reliability of detecting differentially expressed transcripts in two different datasets. These inferences are different from the standard differential expression analyses because they reflect an adjustment based upon a regulatory model for each gene, and by testing an interaction term, one may make a statistical assessment of whether the regulatory network has changed between biological states. This method is approximately 100 times computationally faster than the previously reported method [9]. The majority of the computational time is spent fitting predictive models for each gene independently, and this time is multiplied by the density of the grid of the tuning parameter(s). OSRR has the advantages that the tuning parameter grid can be sparse and that the algorithms for fitting linear models are relatively well-optimized. The computation can be accelerated by parallelizing the fit, but parallel analyses are nontrivial because of large memory requirements of the operations (i.e., solving linear equations with 5,000 variables). Attention should also be paid to improving the computational efficiency of more complex models such as the elastic net.

Unlike the previous model, the use of the network predictor as a covariate in the differential expression regression model also provides robustness against poor prediction of particular transcripts. Given OSRR’s ease of implementation and its robustness, there is a broad set of potential applications to small sample size expression experiments that leverage the growing large-scale gene expression databases such as the Gene Expression Omnibus (GEO) [22]. We recommend that researchers select relevant datasets to construct network models by considering the species, type of tissue or cells, the microarray platform, and other sources of variation. We have seen that if the database *Y*_{
D
} used to estimate *B*_{
D
} has a different tissue type than the independent dataset *Y*, then the method tends to be less effective in accounting for variation.

## Conclusion

*Y*

_{N2 g}. To account for this, we propose

If the other network model is a superior predictor, then this will be reflected by the *ν*_{N2 g} parameter. However, the more terms within the model, the more biological replicates are required for model stability. We may include information a known network of specific genes in a similar manner. We could further address another aspect of the SSEM problem. That is, the structure among the rows of *B* was not modeled. For example, if a gene had a relatively large influence on many other genes, then one could use this pattern for better estimation of *B*. This may potentially be achieved using a similar approach of Friedman et al [24] in the construction of the graphical Lasso.

## Declarations

### Acknowledgements

Research reported in this publication (JAG, JGI, MHC, MG) was supported by the National Institute of the General Medical Sciences of the the National Institutes of Health under Award Number 2R01GM070335-13A1. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Funding for the experiment, JDC, and JAG were supported by The Chromosome 18 Registry and Research Society.

## Authors’ Affiliations

## References

- Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G: Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLOS ONE. 2010, 5 (2): e9202-10.1371/journal.pone.0009202.PubMed CentralView ArticlePubMedGoogle Scholar
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Holko M, Ayanbule O, Yefanov A, Soboleva A: NCBI GEO: archive for functional genomics data sets-10 years on. Nucleic Acids Res. 2011, 39 (Suppl 1): D1005—D1010-PubMed CentralPubMedGoogle Scholar
- Schoumans J, Ruivenkamp C, Holmberg E, Kyllerman M, Anderlid B, Nordenskjold M: Detection of chromosomal imbalances in children with idiopathic mental retardation by array based comparative genomic hybridisation (array-CGH). J Med Genet. 2005, 42 (9): 699-705. 10.1136/jmg.2004.029637.PubMed CentralView ArticlePubMedGoogle Scholar
- Dahl DB, Newton MA: Multiple hypothesis testing by clustering treatment effects. J Am Stat Assoc. 2007, 102 (478): 517-526. 10.1198/016214507000000211.View ArticleGoogle Scholar
- Leek JT, Storey JD: A general framework for multiple testing dependence. Proc Natl Acad Sci. 2008, 105 (48): 18718-18723. 10.1073/pnas.0808709105.PubMed CentralView ArticlePubMedGoogle Scholar
- Ruan J: A top-performing algorithm for the DREAM3 gene expression prediction challenge. PLOS ONE. 2010, 5 (2): e8944-10.1371/journal.pone.0008944.PubMed CentralView ArticlePubMedGoogle Scholar
- Gustafsson M, Hornquist M: Gene expression prediction by soft integration and the Elastic Net—Best performance of the DREAM3 gene expression challenge. PLoS ONE. 2010, 5 (2): e9134-10.1371/journal.pone.0009134.PubMed CentralView ArticlePubMedGoogle Scholar
- Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 2005, 67 (2): 301-320. 10.1111/j.1467-9868.2005.00503.x.View ArticleGoogle Scholar
- Cosgrove EJ, Zhou Y, Gardner TS, Kolaczyk ED: Predicting gene targets of perturbations via network-based filtering of mRNA expression compendia. Bioinformatics. 2008, 24 (21): 2482-2490. 10.1093/bioinformatics/btn476.PubMed CentralView ArticlePubMedGoogle Scholar
- Zellner A, Theil H: Three-stage least squares: simultaneous estimation of simultaneous equations. Econometrica. 1962, 30: 54-78. 10.2307/1911287.View ArticleGoogle Scholar
- Newton M, Noueiry A, Sarkar D, Ahlquist P: Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004, 5 (2): 155-176. 10.1093/biostatistics/5.2.155.View ArticlePubMedGoogle Scholar
- Wettenhall J, Smyth G: limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics. 2004, 20 (18): 3705-3706. 10.1093/bioinformatics/bth449.View ArticlePubMedGoogle Scholar
- Tibshirani R: Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B-Methodological. 1996, 58 (1): 267-288.Google Scholar
- Hoerl AE, Kennard RW: Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970, 12: 55-67. 10.1080/00401706.1970.10488634.View ArticleGoogle Scholar
- R Development Core Team: R: A Language and Environment for Statistical Computing. 2012, Vienna, Austria: R Foundation for Statistical ComputingGoogle Scholar
- Bates D, Maechler M, Bolker B: lme4: Linear Mixed-Effects Models Using S4 Classes. 2011, [R package version 0.999375-39]. [http://CRAN.R-project.org/package=lme4]. Accessed 01/03/2013,Google Scholar
- Choy E, Yelensky R, Bonakdar S, Plenge RM, Saxena R, De Jager PL, Shaw SY, Wolfish CS, Slavik JM, Cotsapas C, Rivas M, Dermitzakis ET, Cahir-McFarland E, Kieff E, Hafler D, Daly MJ, Altshuler D: Genetic analysis of human traits in vitro: drug response and gene expression in Lymphoblastoid cell lines. PLOS Genet. 2008, 4 (11): e1000287-10.1371/journal.pgen.1000287.PubMed CentralView ArticlePubMedGoogle Scholar
- Pounds S, Morris S: Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics. 2003, 19 (10): 1236-1242. 10.1093/bioinformatics/btg148.View ArticlePubMedGoogle Scholar
- Antonellis A, Ellsworth R, Sambuughin N, Puls I, Abel A, Lee-Lin S, Jordanova A, Kremensky I, Christodoulou K, Middleton L, Sivakumar K, Ionasescu V, Funalot B, Vance J, Goldfarb L, Fischbeck K, Green E: Glycyl tRNA synthetase mutations in Charcot-Marie-Tooth disease type 2D and distal spinal muscular atrophy type V. Am J Human Genet. 2003, 72 (5): 1293-1299. 10.1086/375039.View ArticleGoogle Scholar
- Siepka SM, Yoo SH, Park J, Song W, Kumar V, Hu Y, Lee C, Takahashi JS: Circadian mutant overtime reveals F-box protein FBXL3 regulation of CryptochromeandPeriodGene expression. Cell. 2007, 129: 1011-1023. 10.1016/j.cell.2007.04.030.PubMed CentralView ArticlePubMedGoogle Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004, 101 (16): 6062-6067. 10.1073/pnas.0400782101.PubMed CentralView ArticlePubMedGoogle Scholar
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles - database and tools update. Nucleic Acids Res. 2007, 35 (Sp. Iss. SI): D760-D765.PubMed CentralView ArticlePubMedGoogle Scholar
- Smith SC, Baras AS, Lee JK, Theodorescu D: The COXEN principle: translating signatures of In vitro chemosensitivity into tools for clinical outcome prediction and drug discovery in cancer. Cancer Res. 2010, 70 (5): 1753-1758. 10.1158/0008-5472.CAN-09-3562.PubMed CentralView ArticlePubMedGoogle Scholar
- Friedman J, Hastie T, Tibshirani R: Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008, 9 (3): 432-441. 10.1093/biostatistics/kxm045.PubMed CentralView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.