OpWise: Operons aid the identification of differentially expressed genes in bacterial microarray experiments

Background Differentially expressed genes are typically identified by analyzing the variation between replicate measurements. These procedures implicitly assume that there are no systematic errors in the data even though several sources of systematic error are known. Results OpWise estimates the amount of systematic error in bacterial microarray data by assuming that genes in the same operon have matching expression patterns. OpWise then performs a Bayesian analysis of a linear model to estimate significance. In simulations, OpWise corrects for systematic error and is robust to deviations from its assumptions. In several bacterial data sets, significant amounts of systematic error are present, and replicate-based approaches overstate the confidence of the changers dramatically, while OpWise does not. Finally, OpWise can identify additional changers by assigning genes higher confidence if they are consistent with other genes in the same operon. Conclusion Although microarray data can contain large amounts of systematic error, operons provide an external standard and allow for reasonable estimates of significance. OpWise is available at .


Introduction
Microarray measurements of gene expression have become a popular tool for studying bacterial physiology, and hundreds of such studies are being conducted each year. Generally, these studies compare a treatment, either environmental or genetic, to a control condition. After obtaining raw hybridization intensities by scanning the slides or chips, the next steps are to normalize the data to remove experi-mental artifacts and then to identify differentially expressed genes.
To assess the reliability of the microarray measurements and to distinguish significant changers from other genes, statisticians have analyzed the variation between replicate experiments [1][2][3][4][5][6][7][8]. Implicitly, assessing significance by testing replication error assumes that replication captures all of the error in the data, and that there are no systematic bi-ases. However, systematic errors have been observed due to many factors, including cross-hybridization, non-specific hybridization, dye incorporation bias, intensity-dependent effects, and spatial artifacts [1,[9][10][11]. Although normalization methods correct for some of these, systematic bias will likely remain; for example, most normalization methods cannot account for cross-hybridization or non-specific hybridization. To determine if systematic errors do remain after normalization, additional information besides the replicates is required.
For bacterial microarray experiments, we use operons to assess the amount of systematic error in the data. Bacterial genes are often co-transcribed in multi-gene operons, and genes in the same operon should, in principle, have the same expression pattern. Although genes in the same operon are often expressed at different levels due to the varying stability of different segments of the mRNA, in steady-state situations, this will not affect the ratio in expression levels between conditions. Because most mRNA half-lives are short (under 10 minutes [12,13]), mRNA levels will be near steady state both in sustained growth (e.g., log phase) or within 20-30 minutes of a stress (e.g., heat) being applied. Thus, the steady state approximation should generally hold, and expression ratios should be consistent across an operon. Another reason why expression patterns can vary within an operon is that some operons have internal promoters or differential regulation of mRNA stability that can lead to differences in expression patterns [14]. In practice, however, genes known to be in the same operon usually have very similar expression patterns, and expression patterns can be used to predict operons [15].
We assume that genes in the same operon have identical expression patterns, and infer that differences between the expression patterns of genes in the same operon are due to errors, which may be systematic or not. This assumption is somewhat conservative, because any true differences in expression patterns between genes in the same operon will be mistaken for errors, leading to overestimation of the amount of systematic error and conservative assessments of significance. In practice, however, this effect appears to be slight. Because the operon structure of most genes has not been experimentally determined, we rely on operon predictions that are available for all prokaryotes [16], along with estimates of their reliability [17,18].
Given this assumption about operons, we wish to estimate the amount of systematic bias in the data. One simple test is to ask how often two genes that are in the same operon have the same direction of change. However, even if one of the genes is a confident changer, and even if the operon prediction is highly confident, the measurement for the other gene in the operon may be noisy. In this case, the second gene will often report a change in the opposite direction from the first gene because of variation between the replicate measurements, and not because of systematic bias. Thus, interpreting the external information from operons requires us to have a model of the replication error. We extend linear models for microarray data with replicates [3,5,8] to include systematic errors, and present an empirical Bayes analysis of the overall amount of systematic error and of the significance of each gene. Because we have observed that even low-confidence changers show a significant amount of agreement with operons, we do not assume that a minority of genes are changers and that the rest of the genes do not change [5,8]. Instead, we will assume that all genes are changing, even if, for most of them, the magnitude of change is small and the direction of change cannot be determined with confidence. Consequently, rather than trying to distinguish the changers from the rest of the genes, we estimate for each gene the posterior distribution for the gene's fold-change given the data and the model. This can be summarized as a confidence interval, as the posterior probability that the gene's expression level went up (or down) in response to the treatment, or as the probability that the gene changed by 1.5-fold or more.
To test our method, we conducted simulations and also analyzed several experimental data sets. In simulations, the method correctly estimates the amount of systematic bias in the data and gives reasonable p-values even when some of the assumptions of the method are violated. On real data, we tested the agreement with operons of genes having varying levels of significance. For both two-color cDNA data and Affymetrix oligonucleotide data, our method finds significant amounts of systematic error and reports plausible p-values that show a gradual reduction in agreement with operons as significance decreases. In contrast, approaches based on replication error, including non-parametric approaches [4,6,7], often show low agreement with operons for confident changers (genes with > 99% probability of being true changers). Thus, methods that ignore systematic bias may be overstating significance dramatically.
We can also take advantage of operon structure to identify more changers. Intuitively, if two or three genes in the same operon all change in the same direction then they are unlikely to be false positives, but a changer that disagrees with the other genes in the same operon is suspect. Such reasoning is often used by biologists when examining microarray data. We derive a statistically sound "operon-wise" p-value, and show that these operon-wise p-values allow the identification of more changers at any specified level of significance than do single-gene p-values.

OpWise
We present "OpWise," an empirical Bayes method for estimating the significance of the changes reported for each gene. The key elements of OpWise are (i) a linear error model that includes systematic errors, (ii) an approach for estimating the parameters of the error model (the hyperparameters), and, in particular, for inferring the amount of systematic error from the agreement within operons, (iii) a mathematical solution for the posterior distribution of a gene's change in expression given the data for the gene and the parametrized error model, and (iv) an extension to the method to take other genes in the same operon into account when estimating the significance of each gene.
To describe the expression of each gene, we use normalized expression ratios, as these should be consistent within each operon. In practice, we use logratios (base 2) rather than raw ratios. Also, instead of assuming that only a small fraction of genes are changing, we assume that every gene is changing (but only a small fraction of them might be measured with high confidence). Furthermore, we assume that there is some unknown amount of systematic error in the measurement for each gene, so that errors will remain no matter the number of replicates. Then, given the data for a gene i, we estimate the posterior distribution for the true log-ratio µ i . This distribution can be summarized with a confidence interval or with the probability P (µ i > 0) that a gene's expression level went up in the treatment condition. This probability will be near zero for highly confident down-changers, near one for highly confident up-changers, and near 0.5 for low-confidence measurements.

A Linear Model with Systematic Errors
First consider a simple experimental design with direct comparisons, where the samples from the conditions being compared are hybridized to the same chip. Each gene i has an unknown true response µ i , systematic error i , and variance between replicates σ 2 i . The measurements x i for gene i are assumed to be normally distributed around µ i + i , and can be summarized by the observed mean m i = j x ij /n i , where n i is the number of measurements for gene i, and the total squared deviance s 2 i = j (x ij − m i ) 2 , so that the likelihood of the data for each gene i is given by Another popular experimental design is to compare two types of samples separately to an external standard, such as genomic DNA or pooled mRNA samples. In these types of experiments, there are two sets of measured log levels for each gene, and the difference between them gives the log ratio. We refer to these log levels as x 1i and x 2i , and summarize them with counts n 1i and n 2i , sample means m 1i and m 2i , and total squared deviances s 2 1i and s 2 2i . We assume that the true variance in measurements x 1i and x 2i is identical, and that the unknown systematic bias i affects the difference. We wish to estimate the distribution of µ i ≡ µ 1i − µ 2i . Using the summary statistics (2) which is the same form as the direct comparison case except that N i has replaced n i in the exponential.
In either case, we use the conjugate prior to make the problem analytically tractable (as in [5,8]). We first assume that the distribution of θ i ≡ 1/σ 2 i follows a chi-squared distribution (Eq. 3). Given σ 2 i for a gene, we then assume that the true mean µ i is normally distributed with variance proportionate to σ 2 i . This assumption fits our data better than the alternative assumption of a fixed variance of µ i across all genes (see Results), and previous work also used this proportionality [8]. We use the same proportionality for the systematic error i . Hence, our prior is: with hyperparameters α, ν, β, and γ. α is the scale of the chi-squared, ν is its degrees of freedom, 1/β determines the amount of true changes in expression, and 1/γ determines the amount of systematic error. We assume that the true means for the genes are independent, except that genes in the same operon have the same θ i and µ i (but independent bias i ). Genes in the same operon are co-regulated, so µ i should be similar. The assumption that θ i is identical is required because in our model µ i depends on θ i ; the effectiveness of this assumption will be tested in the Results. Because operon predictions are only 80-90% accurate, we use a method that estimates the probability P (Operon ij ) that two adjacent genes are co-transcribed [16], and treat the actual state of each potential operon pair as an unknown random variable. For example, the prediction method might estimate that two genes have a 90% probability of being in the same operon; in our model, we use this estimate as the true probability. We use only the likely operon pairs (those with P (Operon ij ) ≥ 0.5).

Solving A Simplified Model
We first describe how to solve a simplified model with systematic errors removed, so that γ = ∞ and thus all i = 0. We need to estimate the hyperparameters from the data, so that we have a fully specified prior distribution, and then we need to infer the posterior distribution of the log-fold-change µ i for each gene.
Estimating the hyperparameters. In this simplified model, we need to estimate the prior distribution for θ i (or σ 2 i ), which is determined by the scale α and degrees of freedom ν, and then the scale of variation for the true log-ratio µ i given the variance σ 2 i , which is given by 1/β. Although we assume that µ i is normally distributed for all genes, instead of being allowed to vary for a minority of genes, the variation between replicates in our model is the same as in [8]. As discussed by [8], log s 2 i (the log of the squared deviances) is approximately normally distributed, and its mean and variance can be written analytically. By fitting the hyperparameters α and ν to the observed mean and variance of log s 2 i , [8] derived the following estimator: where ψ() is the digamma function, ψ () is the trigamma function, andē is the mean of the e i . ν can be obtained by inverting the trigamma function, which can be preformed numerically by Newton iteration [8]. This leads to an estimate for α as well, and specifies the prior distribution of the true variances σ 2 i for each gene (Eq. 3). We then find the maximum likelihood estimate of β, which describes the prior distribution of the true means µ 2 i for each gene (Eq. 3). The likelihood of the data is This equation can be viewed as a product of tdistributions for the posterior probabilities of each gene's measurements. We choose β to maximize the (logarithm of) this likelihood, using a Newton iteration method (nlm in the R statistics package: http://www.r-project.org/).

Significance of individual genes.
Given estimates for the hyperparameters and the observed mean m i and total squared deviance s 2 i for a gene i, the posterior probability distribution for µ i is given by Intuitively, this distribution represents "shrunk" estimates of the mean and variance. m 2 i appears in the estimate of the variance σ 2 i because m 2 i contains information about the variance (in our model the expectation of µ 2 i is σ 2 i /β). The degrees of freedom for this t distribution includes both the observations n i and the prior knowledge about the variance ν.
Given this posterior distribution, we can use the standard t test to answer questions about the confidence of measurement for gene i, e.g., to give a 95% confidence interval for the log-change µ i or the posterior probability that the gene went up (P (µ i > 0)).

Accounting for Systematic Errors
The key advantage of our approach is to use biological knowledge (i.e., operon predictions) to take systematic errors into account. By definition, these systematic errors will not be eliminated by increasing the number of replicate measurements, but their size can be estimated from the variation between genes in the same operon. In this section, we add systematic errors to the above model (γ < ∞, i = 0) and describe how to account for such bias. Specifically, we show how to estimate the amount of bias and how take the bias into account when assessing significance.
Estimating the parameters. If we ignore the distinction between systematic error i and true variation µ i , then we can replace µ i with where 1/β ≡ 1/β + 1/γ, so that the form of the distribution of m i for a model with systematic errors is the same as that for a model without systematic errors, except that we replace β with β . The distribution of s 2 i is not affected by systematic errors. Thus, we can estimate α, ν and β using the method for the simplified model.
We then find the maximum likelihood estimate of γ, which controls the amount of bias, by using our assumption that genes in the same operon will have the same values of µ i and of θ i = 1/σ 2 i . The total likelihood of the data can be decomposed into terms for individual genes and pairwise terms for operon pairs: We have already taken into account the effect of γ on the single-gene likelihoods f ( x i ) by introducing β , which is now being held constant, so these terms do not need to be considered. To derive an equation for the pairwise likelihood ratios, we first note the possibility that the operon prediction is incorrect, in which case the genes are independent and the likelihood ratio is 1: The pairwise likelihood ratio for the operon case can be derived from to give and similarly for j, and and similarly for j. Although much of Eq. 14 has no simple intuitive explanation, and unfortunately the constant terms are required (e.g. see Eq. 10), the (X/2) −df /2 terms can be viewed as t distribution forms for the joint probability f ( x i , x j |Operon ij ) divided by similar forms for the independent probabilities f ( x i ) and f ( x j ). Given this solution for the likelihood of the data, we can use a Newton iteration method to find the value of γ that maximizes the product of the pairwise likelihood ratios given by Eq. 10.

Significance of individual genes.
If we ignore the information from other genes, then the posterior distribution of µ i is given by a t distribution with This is the same as case without systematic bias except that N i , which describes the amount of data and hence the reduction in uncertainty due to replication, has been replaced by the smaller term N i .

Significance taking Operons into Account
Although the method as described so far uses operon predictions to estimate the hyperparameters, it uses only the information for each gene when computing p-values. We will refer to these as "single-gene" pvalues. In this section, we describe 'operon-wise" p-values that use information from other genes in the same operon to improve our estimates of the significance of each gene. As we will show in the Results, using this additional information often allows increased confidence in the measurements. First, assume that we have two genes i and j that are known to be in the same operon, with the same (unknown) µ ij and θ ij but with differing biases i , j . Given measurements for the two genes, the posterior distribution for µ ij is a t distribution with It is straightforward to extend this formula to three or more genes. In practice, operon predictions are uncertain, and we need to take this uncertainty into account in estimating confidence. We use only the adjacent pairs that are predicted to be in the same operon (those with P (Operon ij ) ≥ 0.5), as non-adjacent pairs are less reliable. In the most complicated situation, we have genes i and k on either side of our target gene j and four possible cases: singleton transcript j, twogene operon ij, two-gene operon jk, or three-gene operon ijk. The posterior distribution of µ j is then a mixture of the corresponding four posterior distributions, and a specific probability such as P (µ j > 0) is determined from a linear combination of the probabilities from four t tests.
To determine the weight of the terms in the mixture, we do not use the input probabilities P (Operon ij ) and P (Operon jk ). Instead, we use the posterior operon probabilities given the data. That is, we use the microarray data to help estimate the likelihood that a pair of genes are cotranscribed. Using the posterior operon probabilities gives the rigorously correct posterior distribution for µ j (derivation not shown). Using the posterior operon probabilities also prevents the method from asserting that a gene went down when it in fact went up but other genes in the operon went down, because in this situation the posterior probability of the operon will be low.
Using Bayes' law, these posterior probabilities where P (Operon ij ) is the prior probability and the formula for the ratio on the right was given in Eq. 14. Given the individual pair probabilities and the mixture of four cases discussed above, the weight for each case is just its probability. For example, the weight for the three-gene operon case is

Data Sets
We tested OpWise on four data sets collected with a variety of measurement platforms (both glass sides and Affymetrix chips) that used different methods of controlling systematic bias (multiple probes per gene or dye swap) and from several different bacteria: • dvSalt30 -Desulfovibrio vulgaris salt shock at 30 minutes (Z. He and J. Zhou, personal communication). This data was collected using two-color glass slides with 70-mer probes. The experiment was an indirect comparison through a genomic control. There were three biological replications for each condition, measured with one slide each, and two spots per gene per slide, for a total of six replicate measurements for each gene and condition.
• ecox -A comparison of aerobic and anaerobic log-phase growth in Escherichia coli (GEO accession GDS680, [19]). This data was from Affymetrix oligonucleotide chips with three or four replicate hybridizations for each of the two conditions.
• shCold5 -Shewanella oneidensis cold shock at 5 minutes (Z. He and J. Zhou, submitted). This data was a direct comparison of two-color glass slides using cDNA probes. There were five biological replicates with one slide each and two spots per gene per slide (10 measurements per gene total), but no dye swap (the same dyes were used for the control and treatment samples throughout).
• shHeat5 -Shewanella oneidensis heat shock at 5 minutes [20]. This data was also a direct comparison of two-color cDNA probes. There were three biological replicates, with two replicate slides each and two spots per gene per slide (12 total measurements per gene), and with dye swap (Cy3 dye was used for the treatment in half of the slides and for the control in the other half of the slides).
For the two-color direct comparison data sets (shCold5 and shHeat5), we performed intensitydependent and then spatial normalization on each slide. Specifically, we first used a locally smooth estimator to remove intensity-dependent effects and then subtracted the median from each sector, similar to the recommendations of [6]. For the indirect comparison data set (dvSalt30), we treated the ratio of intensities between the channels corresponding to cDNA and to genomic DNA as a raw expression level. We first performed a global normalization for each slide so that the total expression level was the same for each slide, and then computed the average of the log-expression levels across slides from the two conditions. We then applied the intensitydependent and spatial normalization approaches to these log-levels. For all three of these data sets, we considered the different spots for each gene as independent sets of replicates. There was little difference between within-slide and between-slide variance (data not shown). For the Affymetrix data set (ecox), the data we downloaded had already been normalized with dChip [21], so we used the normalized expression levels provided; to prevent small values of expression level from giving extreme outliers for log ratios, we added a small constant (5) to the expression levels before taking a logarithm.
For each data set, we also performed 50 simulations using the parameters estimated for that data set by OpWise. Each simulation had the same proportion of missing data as the corresponding data set. For operons, we randomly assigned adjacent genes on the same strand to be in the same operon or not with the probabilities given by the prediction method, but only if the probability was 0.5 or greater. With these simulations of the OpWise model, we were able to test our assumptions about the distribution of means and variances. To emulate the heavy tails in ecox (see below), we performed 50 simulations where 10% of the genes had much higher variation in the mean (a much lower β) than the other genes. Finally, to test our assumptions that (i) the true mean and true variance are correlated and (ii) the true variance is correlated within each operon, for each data set we performed 50 "uncoupled" simulations where the mean was independent of variance (the mean was normal with a fixed width) and genes in the same operon had independent variances.

Results and Discussion
We first used simulations to test whether OpWise fit the data and whether OpWise was robust to deviations from its assumptions. We then tested for systematic bias in the real data and examined significance estimates from OpWise and other methods. Finally, we tested whether operon-wise tests were more powerful than single-gene tests.

Fit of Model to Data
To see how well the model fit the data, we inferred the hyperparameters for each data set, used these parameters to create simulated data, and compared the simulated data to the original data sets. The model's inverse chi-square distribution gave an excellent fit to the observed distribution of squared deviance s 2 i (Supplementary Figure 1). The simulated distribution of observed means had heavier tails than a normal distribution, due to the wide spread of deviances. The distribution of means fit the data fairly well for three of the data sets, but for the ecox data set, the true distribution had even heavier tails (Supplementary Figure 1).
To test our assumption that the variation in the true means depends on the true variances, we compared the correlations of observed means and squared deviances in the real data to simulations using the OpWise model and also using an uncoupled model in which the means and variances were independent. The observed mean and squared deviance were much more correlated than in the uncoupled model, except in the shCold5 data set (Supplementary Table 1). Similarly, within each operon the squared deviances were significantly correlated (Supplementary Table 1). However, the correlations were generally weaker than in the simulations, indicating deviations from the assumptions.

Robustness of OpWise in Simulations
To test OpWise, we created simulated data sets based on our statistical model. We wanted to verify that the estimated hyperparameters were accurate enough to give reasonable p-values. Because OpWise uses operons to estimate the overall reliability of the measurements, we also hypothesized that OpWise would be robust to the modest deviations from its assumptions. In particular, OpWise assumes that the variance in the true change of each gene depends on the variance of measurement for that gene. Because we found a weaker-than-expected relationship between observed deviances and means, we performed "uncoupled" simulations where the true means and variances were uncorrelated. Our statistical model also uses normal distributions. Although different genes can have widely varying variances of measurements, which allows the observed means to have somewhat heavy tails, even heavier tails were observed for the ecox data set. So, we also conducted heavy-tailed simulations (see Methods).
We examined the single-gene estimates of P (µ i > 0) for the simulated data (µ i is the true log-change for gene i). For the simulations using the OpWise model, we compared these p-values computed with estimated hyperparameters to "ideal" p-values computed with the true hyperparameters. For the "uncoupled" simulations with µ i independent of σ i , and for the heavy-tailed simulations, we compared the p-values to the actual sign of µ i for each gene.
When comparing the log odds of the estimated p-values to the log odds of the ideal p-values, we consistently observed a strongly linear relationship, with correlation coefficients above 0.9999 (see Figure 1A; logodds(p) ≡ log p 1−p ). In other words, the ordering and shape of the significance values was not affected, but the overall scale of significance could be. To summarize this linear relationship between the two sets of significance estimates, we used the slope of the ideal log odds as a function of the estimated log odds. As shown in Figure 1B, most simulations had slopes very close to the ideal value of 1.0. In a total of 200 simulations across 4 data sets, the most extreme aggressive slope was 1.12 (for shHeat5). This corresponds to reporting P (µ > 0) = 0.964 when the true P = 0.95.
For the uncoupled and heavy-tailed simulations, which violated the assumptions of our model, we did not have ideal p-values to compare to, so we instead used logistic regression (glm in R, http://rproject.org) to estimate the slope. Logistic regression identifies the multiplier for the estimated log odds that best fits the observed pattern of whether µ > 0 or not -see Figure 1C. As shown in Figure  1D, the accuracy of OpWise was not dramatically affected by uncoupling the mean from the variance. However, the heavy-tailed simulations for the ecox data set produced slopes around 1.2, with a maximum of 1.35. (There was also one simulation with a very low slope, but this was due to a few extreme and biologically implausible values of µ i that are not present in our genuine data sets.) A slope of 1.35, which corresponds to reporting P = 0.982 when the true P = 0.95, is not ideal, but as we will show, methods that do not account for systematic bias, including non-parametric methods, can perform dramatically worse. For all simulations, we also compared the operonwise p-values to either the ideal or true significance. These gave similar slopes as the single-gene p-values, but with consistently smaller deviations from 1.0 (data not shown). Overall, OpWise was largely insensitive to deviations from its assumptions.

Presence of Bias
OpWise identified large amounts of systematic bias, similar in magnitude to the true changes in gene levels and the replication error, in all four data sets (Table 1). Furthermore, the bias was statistically highly significant in all four data sets, as determined by a maximum likelihood ratio test (see Table 1).
One source of apparent bias might be correlation between the replicates. That is, if the replicate measurements are not truly independent and some of the replicates are correlated, then the noise in the average of the replicate measurements will be larger than expected. For example, the shHeat5 data set had a total of 12 measurements per gene (3 biological samples times two slides per sample with dyes reversed times two spots per gene on each slide). In this data set, the replicate measurements with the same dye assignment were more correlated than those with reversed dyes. To test the pattern of bias with fully independent replicates, we created two subsets of the data. First, we used only the first spot for each gene on the slides and a single biological replicate, leaving two replicates with different dye assignments. Sec-ond, we used only a single dye assignment and only the first spot per slide, leaving three replicates from different samples. In both cases, we still observed large amounts of bias (data not shown). We also verified that OpWise was not sensitive to correlations between replicates. We created an exact duplicate of each replicate, and this "doubled" data set gave significance values very similar to the original data set (results not shown).
We also considered the possibility that mRNA levels in shCold5 and shHeat5, which were measured only 5 minutes after the stress was applied, were far from steady-state and that some operons would have poor agreement because of differential mRNA decay. However, later time points from these same experiments showed similar amounts of bias (data not shown). Overall, these analyses confirmed that systematic bias is a major problem in real data sets. Next, we show that ignoring this bias can lead to overestimating the significance of individual genes.

OpWise Estimates Significance Correctly
To test the quality of the significance estimates on real data, we compared the confidence assigned by OpWise to the extent of agreement with operons. Although our p-values are single-tailed -they test only the hypothesis that µ i > 0 -we wanted a two-tailed notion of confidence, because this is more comparable to other methods. We defined the twotailed confidence as C = 2 · |p − 1/2|. For each data set, we sorted genes by confidence into eight groups. For each gene, we then identified other genes predicted to be in the same operon, and asked whether the two genes changed in the same direction. (We used only adjacent genes, as operon predictions for non-adjacent genes are less confident.) Intuitively, if a group of genes are 99% confident changers, then 99% of the time, the measurement for that gene is correct, and it will always have the same sign as other genes in the operon; the other 1% of the time, there is no information about the gene, and the genes will have the same sign, by chance, 50% of the time. That is, P (Agree) = C + (1 − C)/2, or 2 · P (Agree) − 1 = C. We also needed to correct for the possibility that the operon prediction is incorrect, which gives 2 · P (Agree) − 1 = C · P (Operon). Thus, we defined an adjusted measure of agreement, whose expectation ranges from 0 for data that is all noise to 1 for perfect data, as Adjusted = (2·Agree−1)/P (Operon), where Agree is 1 if true and 0 if false. This measure corrects for variations in the confidence of operon predictions between groups of genes -in some data sets, the most confident changers were, on average, in more confidently predicted operons (data not shown). Finally, even if the measurement for the first gene in the operon is highly confident and correct, the measurement for the other gene in the operon may be noisy, and the two genes may not agree. As there is no simple way to correct for this, we used the simulations described above, and compared the relationship between confidence and agreement in the real data to that in the simulations. The relationship between confidence and adjusted agreement with operons was approximately linear in all data sets ( Figure 2) and was largely consistent with simulations ( Figure 2 & Supplementary Figure 2).
Furthermore, for most groups of genes, including those with modest confidence values, the adjusted agreement with operons was much larger than zero. This suggests that the expression levels of all genes in these experiments were in fact changing, even if many individual genes could not be measured with confidence. In all four data sets, the top six of eight confidence groups had significantly more operon pairs that agreed with microarray data than not (all p < 0.05, binomial test). This confirmed our assumption that all genes are changers. Figure 2 also shows the relationship between confidence and operons for our model without considering bias (using γ = ∞). Naturally, the confidence estimates from the model without bias were higher. In the shHeat5 and shCold5 data sets, the bias-free estimates of confidence were much too high: the highest and second-highest confidence groups both had confidence levels very near one, but the secondhighest group had a much lower level of agreement with operons than the highest group. This also rules out one alternative explanation for why we detected significant bias in these data sets, which is that microarray data lacks bias but the operon predictions were flawed or systematically overconfident. In the latter case, the agreement with operons should have been lower for changers at every level of confidence, including the most confident changers. For dvSalt30, the bias-free confidence estimates appear to be more modestly over-confident, while for ecox, the differ-ence between models with and without bias was small.

Bias-Free Significance Estimates Are Unreasonable
We also compared the confidence estimates from our model to those from a popular non-parametric method, SAM version 1.21 [4]. For each gene, SAM tests the null hypothesis that the gene's expression level is identical in the two conditions. SAM uses a modified t statistic with a pseudovariance term in the denominator, but rather than using a t test, SAM estimates the null distribution for the modified t statistic by performing random permutations of the data. SAM then uses the proportion of genes with high p-values to estimate the proportion of genes that are non-changers, and hence the proportion of genes that are true changers (similar to [7]). Finally, it corrects for multiple testing and estimates the false discovery rate (FDR). (For each gene, the FDR is an estimate of the proportion of false positives among genes that are at that gene's significance level or more significant.) To compare significance values from SAM to the confidence levels from Op-Wise in Figure 2, we needed the proportion of false positives within each group, also known as the local false discovery rate -the confidence is 1 minus the local FDR. For the most significant group, the local FDR is simply the FDR for the least significant member of the group. For the less significant groups, the number of false positives can be estimated from the FDR by subtracting the false positives expected for the more significant groups (similar to [22]).
As shown in Figure 2, for the shHeat5 and shCold5 data sets, SAM is far too confident, and is similar to the parametric model without bias. For the shHeat5 data set, SAM estimated an FDR of under 10 −4 for 2,284 genes, representing three quarters of all genes! In contrast, OpWise estimated that this group of genes was only 80% confident, implying a false discovery rate of 20%. The modest agreement with operons of these genes suggests that OpWise's estimate is reasonable (Figure 2). Indeed, the subset of the SAM significant changers that were not considered significant by the single-gene OpWise method (those with confidence < 0.95) showed much lower agreement with operons than those that were considered confident (83% vs. 97% of operon pairs changed in the same direction, p < 10 −13 , Fisher exact test). Reporting a FDR of 10 −4 when the true value is around 0.2 is far worse an overstatement of p-values than we ever observed in the OpWise simulations, even in those that violated our distributional assumptions (it would correspond to a slope of 6.6 in Figure 1D).
For the dvSalt30 data set, which has a moderate amount of bias, SAM was also more confident than our model, at least for the more significant changers (the three right-most groups containing the top 1,300 genes). The SAM curve was also noticeably below the simulation curve, suggesting that it was (moderately) over-confident. Finally, for ecox, which has little bias and a heavy-tailed distribution, SAM performed well (see top right of curve), while OpWise was perhaps slightly over-confident. Overall, we concluded that the bias OpWise inferred in these data sets was genuine, and that ignoring this bias (i.e., assuming that errors will average out over replicates) leads to unreasonable p-values.

Operon-wise Tests Have Greater Power
We hypothesized that when genes in operons have consistent measurements, higher confidence can be assigned to those measurements. We calculated "operon-wise" p-values that, for each gene, take into account the data for other genes in the same operon (if such genes exist; otherwise the operon-wise and single-gene p-values are identical). To test whether operon-wise p-values were more powerful than singlegene p-values, we compared the distributions of the operon-wise significance values to that of the singlegene significance values. Significance was defined as 1 − C. As shown in Figure 3, the operon-wise significance estimates are much more confident in each of the data sets, and at a significance cutoff of 0.01, 2-10 times more genes can be identified.
To summarize the performance of the various methods considered here -SAM, single-gene Op-Wise p-values, operon-wise p-values, and single-gene OpWise with bias ignored -we report the number of putative changers identified at a confidence threshold of 0.05 and the agreement with operons of those changers ( Table 2). If bias is ignored, then single-gene OpWise generates similar results as SAM, but with bias accounted for, OpWise changers have much higher agreement with operons. This is probably because OpWise correctly identifies fewer genes as statistically reliable changers. The exception is the ecox data set, which has less bias (see Table 1), and hence all three methods give similar results. Compared to single-gene OpWise, the operonwise method identifies more genes, which also show excellent agreement with operons, as this is part of how they are selected.

Conclusions
We have described how operons can be used to detect systematic errors in measurements of prokaryotic gene expression patterns, to account for the bias when estimating significance, and to increase the confidence of measurements that are consistent within an operon. OpWise relies on the assumption that genes in the same operon have matching expression profiles. Although this assumption is only approximately correct, it is effective in practice, and is strongly preferable to ignoring the presence of systematic errors in the data. This assumption could be made more accurate by excluding from consideration those operon pairs that span an internal promoter or a partial terminator. Unfortunately, predicting alternative transcripts remains a challenging problem even in E. coli [23].
OpWise also relies on assumptions about the distributions of the true means and variances of the data. These assumptions are not entirely accurate, but without such assumptions, it would not be possible to distinguish low agreement within operons due to replication noise from that due to systematic bias. In simulations, OpWise was robust to the observed deviations from the assumptions.
In four data sets, OpWise identified significant and sometimes large amounts of systematic error. If this bias is not taken into account, as is generally the case with current approaches, then the statistical analysis can be far too aggressive. This bias is not an artefact arising from errors in operon predictions or from our distributional assumptions.
Likely sources for this bias include crosshybridization or non-specific hybridization of some probes [10,21].
Indeed the data set without large amounts of bias (ecox) was collected using Affymetrix gene chips that use 15 probe sets per gene, and was normalized with a method that attempts to identify "bad" probes and remove them from the data [21].
Irrespective of bias and for all four data sets, the operon-wise method identified many more genes at any desired level of significance than the single-gene method. Although we only tested the operon-wise approach with one method for assessing significance, in principle, operon-wise p-values can be computed using single-gene p-values from any method. However, operon-wise p-values should not be used to rank genes, because consistent operons with modest changes can be ranked highly, and these could be indirect effects that are of low biological interest. Instead, we recommend setting a confidence threshold and then ranking all genes (or operons) above that confidence level by their fold-change. In any case, the main benefit of the present work is not for ranking or other broad exploratory analyses but in the ability to obtain reasonable p-values for specific hypotheses of the form "was gene X or operon Y upregulated in this experiment?" We also note that the benefit of OpWise is in assessing the reliability of the measurement, and not in estimating the amount of change for any gene.
As microarray technology becomes less expensive, experiment designs with high amounts of replication are becoming common. We observed that the systematic error can be comparable to or even larger than the variation between replicates. If systematic error is large relative to replication error, then performing many replicate measurements may not be cost-effective, and using several different probes for each gene might be preferable.
Finally, although the method we describe here requires operons and is only applicable to prokaryotic data, a similar approach might be useful for eukaryotes if there is prior knowledge of pairs of genes that have matching expression patterns. For example, stable complexes in yeast are often co-expressed [24], and the worm C. elegans has operons (but their co-expression may be weak [25]). In any case, our finding that statistical confidence levels from single probes can be misleading because of systematic bias probably applies to eukaryotic data.   For each data set and for three methods of assessing significance (OpWise, OpWise without bias, and significance analysis of microarrays), we divided the changers into eight groups of genes with different levels of confidence. The x axis shows the average confidence within each group of genes. For each group, the y axis shows the adjusted agreement with operon pairs (the adjusted proportion of pairs which have the same sign of log-ratio), which ranges from 0 for random data to 1 for perfect measurements. We also show average results from simulations for each data set (simulated and analyzed with the OpWise model). The error bars give the 95% confidence interval (from a t test) for the mean agreement for each group from the OpWise significance values. The odd left side of the ecox SAM curve is due to noise in the local FDR.    Tables   Table 1: Systematic bias in four biological data sets The typical size of the bias in the apparent log 2 -ratio is the square root of its variance, or E(1/(θ i · γ)), where E(1/θ i ) = α/(ν − 1). The bias over the signal is the square root of the ratio of variances ( β/γ). The bias over the replicate error is also the square root of the ratio of variances ( 1/γ), and considers a single measurement (is not divided by the number of replicates). We also report the typical bias divided by the standard deviation of the observed log-changes m i . To show that the bias is statistically significant, we compared the likelihood ratio of the best-fitting model given systematic error to that without (with γ = ∞), using Eq. 10. Because we are testing whether γ lies at a boundary, in the absence of bias the distribution of 2 · log(ratio) approximates a 50:50 mixture of two chi-squared distributions with 0 and 1 degrees of freedom [26].  For OpWise, genes were selected if the two-tailed confidence was 95% or higher (P (µ i > 0) < 0.025 or P (µ i < 0) > 0.975). For SAM, genes were selected if the false discovery rate was 5% or lower. For each method and for each data set, we report how many genes were selected as significant changers and what percentage of the operon pairs that contain those genes changed in the same direction. This "agreement" should be 100% for perfect microarray data and perfect operon predictions and 50% for random data.

Additional Files
Supplementary Table 1