### Model for detecting differential expression

Our methods are applied to averages of normalized log ratios; we discuss the specification of these quantities in different experimental settings in the section on normalizations below. In this section we will refer to them simply as observed log ratios. We use logarithms to base 2.

Our model is a normal-uniform mixture model [15, 16]. We begin by dividing the genes into two groups: differentially expressed and non-differentially expressed. Each group is modeled by its own density, so the data as a whole are modeled by a weighted mixture of the two densities, where the weights are the prior probabilities of belonging to each group. Since genes that are not differentially expressed have a true log ratio of zero, we model the observed log ratios for these genes, after an appropriate transformation, as a group with a Gaussian density. The differentially expressed genes have log ratios that are, for the most part, "far" from this group, and so can be viewed as outliers from the main distribution of non-differentially expressed genes. We model them as uniformly distributed over an appropriately wide range.

The model is

*f* (*x*_{i}) = *π N* (*x*_{i}|*μ*, *σ*^{2}) + (1 - *π*) *U*_{[a,b]}(*x*_{i}),  *i* = 1,..., *N*,

where *x*_{i} is the observed log ratio for gene *i*, *π* is the prior probability that a gene is not differentially expressed, *N* (*x*|*μ*, *σ*^{2}) denotes a Gaussian distribution with mean *μ* and variance *σ*^{2}, *U*_{[a,b]}(*x*) denotes a uniform distribution on the interval [*a*, *b*], and *N* is the number of genes.

We estimate the model by maximum likelihood using the EM algorithm [17]. We define the unknown labels *z*_{i}, *i* = 1,..., *N*, where *z*_{i} is 0 if gene *i* is not differentially expressed and 1 if it is. There are two steps in the algorithm: the Expectation, or E step, where the labels are estimated given the current parameter estimates, and the Maximization, or M step, where the model parameters, *π*, *μ* and *σ*^{2}, are estimated given the current estimates of the labels. The maximum likelihood estimates of *a* and *b* are â = min{*x*_{i} : *i* = 1,..., *N*} and b̂ = max{*x*_{i} : *i* = 1,..., *N*}; these do not change during the algorithm. The steps in the algorithm are as follows:

#### Iteration k

##### Expectation Step

The labels are replaced by their expected values given the parameter estimates from the previous iteration:

*ẑ*_{i}^{(k)} = (1 - *π*^{(k-1)}) *U*_{[a,b]}(*x*_{i}) / [*π*^{(k-1)} *N* (*x*_{i}|*μ*^{(k-1)}, *σ*^{2(k-1)}) + (1 - *π*^{(k-1)}) *U*_{[a,b]}(*x*_{i})]

##### Maximization Step

The parameters are estimated given the current labels:

*π*^{(k)} = (1/*N*) Σ_{i=1}^{N} (1 - *ẑ*_{i}^{(k)})

*μ*^{(k)} = Σ_{i=1}^{N} (1 - *ẑ*_{i}^{(k)}) *x*_{i} / Σ_{i=1}^{N} (1 - *ẑ*_{i}^{(k)})

*σ*^{2(k)} = Σ_{i=1}^{N} (1 - *ẑ*_{i}^{(k)}) (*x*_{i} - *μ*^{(k)})^{2} / Σ_{i=1}^{N} (1 - *ẑ*_{i}^{(k)})

The likelihood for the model given the parameter estimates at iteration *k* is

*L*^{(k)} = Π_{i=1}^{N} [*π*^{(k)} *N* (*x*_{i}|*μ*^{(k)}, *σ*^{2(k)}) + (1 - *π*^{(k)}) *U*_{[a,b]}(*x*_{i})]
The above steps are iterated until convergence. Convergence can be checked by calculating the parameter estimates, the labels, and the logarithm of the likelihood at each step, given the current parameter estimates. Once the change in these quantities between steps is small enough, the algorithm is deemed to have converged. The increasing property of the EM algorithm guarantees that a local maximum of the likelihood is reached, but not a global maximum; which maximum is reached depends on the starting values. For the analyses in this paper, the starting value for the label *z*_{i} was 1 if gene *i*'s observed log ratio, standardized by subtracting the mean and dividing by the standard deviation of the values across all genes, was greater than 2 in absolute value, and 0 otherwise. This appeared to give good results.
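The fitting procedure can be sketched in Python. This is an illustrative implementation of the normal-uniform EM described above, not the authors' code; the function name and tolerances are our own.

```python
import numpy as np

def nudge_em(x, max_iter=200, tol=1e-8):
    """EM for the two-component normal-uniform mixture (illustrative sketch).

    Returns (pi, mu, sigma2, z), where z[i] is the posterior probability
    that gene i is differentially expressed.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    a, b = x.min(), x.max()          # MLEs of the uniform endpoints; held fixed
    u = 1.0 / (b - a)                # uniform density on [a, b]
    # Starting labels: |standardized log ratio| > 2 -> differentially expressed
    z = (np.abs((x - x.mean()) / x.std()) > 2).astype(float)
    loglik_old = -np.inf
    for _ in range(max_iter):
        # M step: estimate pi, mu, sigma2 from the current labels
        w = 1.0 - z                  # weight toward the null (Gaussian) group
        pi = w.mean()
        mu = np.sum(w * x) / np.sum(w)
        sigma2 = np.sum(w * (x - mu) ** 2) / np.sum(w)
        # E step: posterior probability of differential expression
        gauss = pi * np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        z = (1 - pi) * u / (gauss + (1 - pi) * u)
        # Log likelihood under the current parameter estimates
        loglik = np.sum(np.log(gauss + (1 - pi) * u))
        if loglik - loglik_old < tol:
            break
        loglik_old = loglik
    return pi, mu, sigma2, z
```

Because the uniform endpoints are the sample minimum and maximum, they never change across iterations, so only *π*, *μ* and *σ*^{2} are updated.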

The final label estimate for gene *i*, *ẑ*_{i}, is the posterior probability that it is differentially expressed, given the parameter estimates. The posterior probabilities do not need to be adjusted for multiple comparisons.

### Normalizations

There are two different types of experimental setup for which we will discuss normalization. The first is where the two different samples, say control and treatment, have each been labeled with a different color dye, say treatment with red (Cy5, R) and control with green (Cy3, G). In the second experimental setup, the treatment and control samples have replicates, with both control and treatment replicates being labeled with the same dye, say red (Cy5, R), and these are compared to a reference sample labeled with the other dye.

Two of the data sets analyzed in this paper, the HIV and the Like-like datasets, are of the first type of setup. The other data set analyzed in this paper is the Apo AI mouse data [1] which is of the second type of setup, with pooled control mRNA used as its reference sample. Since there are slightly different normalizations and quantities of interest used for analysis in these two cases, we will discuss them separately below, referring to the first experiment type as the log ratio experiment (since the log ratios are the quantities of interest), and to the second as the log ratio difference experiment (since the differences of log ratios between control and treatment samples are the quantities of interest).

#### Normalizations for the log ratio cDNA experiment

The main problem in applying the Normal-Uniform mixture model is that the data need to be normalized in order for this model to be appropriate. In the basic type of cDNA experiment, the log ratio of expressions in the two samples is the quantity of interest. There are dye and other effects that add a bias, making the mean of the non-differentially expressed log ratios non-zero (see the Like-like example in the Results section). Also, the variance of the log ratios depends on the log of the total intensity, where the total intensity is defined as the product of the red and green intensities. We need to ensure that any normalization does not "pull in" the differentially expressed genes.

#### Single slide normalizations

The normalization of single slide log ratios is a two-step process. In the first step, the observed log ratios are regressed nonparametrically on the log intensities, using the lowess regression smoother [18], and the fitted value is subtracted from the observed log ratios. In our implementation, a modification, the loess smoother [19], is used in place of lowess. Specifically,

*q* = log(*R*/*G*) - *c*(log(*RG*)) (4)

where *R* and *G* are the intensities in the red and green channels, and *c*(log(*RG*)) is the fitted value from loess regression of log(*R*/*G*) on log(*RG*), a situation we denote by *c*(log(*RG*)) = *loess*(log(*R*/*G*) ~ log(*RG*)).

We got good results with a loess span in the range 60% to 80%. This generally did a good job of normalizing the mean but not the spread.

The spread depends on the log intensity, log(*RG*), and we estimate a running mean absolute deviation by loess regression of the absolute mean-normalized log ratio on the log intensity. We then divide the mean-normalized log ratio, *q*, by the loess-estimated mean absolute deviation in order to get our final estimate,

*q*_{norm} = *q* / *d*(log(*RG*))

where *d*(log(*RG*)) = *loess*(|*q*| ~ log(*RG*)). We got good results with a span between 10% and 20%. As can be seen from the figures in the Results section, this does a good job of making the log ratios for non-differentially expressed genes approximately normal and homoscedastic.
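The two normalization steps can be sketched as follows. This is an illustration only: a crude nearest-neighbour local mean stands in for the loess smoother used in the paper, and all names are our own.

```python
import numpy as np

def running_smooth(x, y, frac):
    """Crude nearest-neighbour local mean, standing in for a loess
    smoother (illustration only; no local weighting or robustness)."""
    k = max(2, int(frac * x.size))
    fit = np.empty_like(y)
    for i in range(x.size):
        nearest = np.argsort(np.abs(x - x[i]))[:k]
        fit[i] = y[nearest].mean()
    return fit

def normalize_single_slide(R, G, mean_frac=0.7, var_frac=0.15):
    """Two-step normalization of single-slide log ratios (sketch)."""
    log_ratio = np.log2(R / G)       # base-2 logs, as in the paper
    log_intensity = np.log2(R * G)
    # Step 1: mean normalization -- subtract the smoothed trend of
    # log ratio on log intensity.
    q = log_ratio - running_smooth(log_intensity, log_ratio, mean_frac)
    # Step 2: variance normalization -- divide by a smoothed running
    # mean absolute deviation as a function of log intensity.
    mad = running_smooth(log_intensity, np.abs(q), var_frac)
    return q / mad
```

The `frac` arguments play the role of the loess spans (around 60-80% for the mean step, 10-20% for the spread step).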

#### Multiple slide normalizations with dye swap

In dye swap experiments, there is an even number of replicates, divided into two groups of equal size; in the second group, the assignment of dyes to samples is the reverse of that in the first. Log ratios are always taken with the same sample in the numerator and the same sample in the denominator (since the assigned dyes differ between the two groups, averaging must be done over the same ratio of samples, not the same ratio of dyes). In this case, mean normalization is unnecessary, because averaging the log ratios across replicates cancels out the dye effect; normalization of the variance is still required.
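The sign convention can be made concrete with a small sketch (names are our own): on dye-swapped slides log_2(*R*/*G*) is negated, so that every replicate's ratio has the same sample in the numerator before averaging.

```python
import numpy as np

def dye_swap_average(log_ratios, swapped):
    """Average per-gene log ratios across replicates, flipping the sign
    on dye-swapped slides so every ratio is treatment over control.

    log_ratios : shape (n_replicates, n_genes), each row log2(R/G)
    swapped    : boolean sequence of length n_replicates
    """
    signs = np.where(np.asarray(swapped), -1.0, 1.0)
    return (signs[:, None] * np.asarray(log_ratios)).mean(axis=0)
```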

#### Multiple slide normalizations without dye swap

Here we take the average of log ratios and log intensities across replicates for each of the genes and apply the mean lowess normalization, given by equation (4), with average ratios and intensities in place of the single replicate log ratios and intensities.

The variance normalization is not the same for multiple replicate slides as for a single slide. Because the average log ratios are not robust to outliers, even after mean normalization, we carry out a normalization based on variation across replicates rather than on variation depending on intensities, to downweight the influence of outlying observations. If the empirical standard deviation of the log ratios across replicates is greater than the absolute mean-normalized average log ratio for a gene, we divide its mean-normalized average log ratio by its standard deviation. If the empirical standard deviation of the log ratios across replicates is small, defined as smaller than the absolute mean-normalized average log ratio, we divide instead by a constant. The constant is chosen to be a high percentile (we use the 99th) of the distribution of the standard deviations of genes for which the absolute mean-normalized average log ratio is greater than the standard deviation. This avoids a gene being declared differentially expressed just because its empirical across-replicate standard deviation is small, as can easily happen by chance when there are few replicates. Thus the mean- and variance-normalized log ratio for a given gene is

*q*_{norm} = (1/*m*) Σ_{j=1}^{m} *q*_{j} / *s* if *s* ≥ |(1/*m*) Σ_{j=1}^{m} *q*_{j}|, and *q*_{norm} = (1/*m*) Σ_{j=1}^{m} *q*_{j} / *k* otherwise,

where *m* is the number of replicates, *q*_{j} is the mean-normalized log ratio of replicate *j*, *s* is the standard deviation of log ratios across replicates, and *k* is the chosen percentile of the distribution of standard deviations of genes whose absolute mean-normalized average log ratio is greater than their standard deviation.
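An illustrative vectorized sketch of this across-replicate normalization (our own function name; it assumes at least one gene falls in the small-standard-deviation group, so the percentile is well defined):

```python
import numpy as np

def variance_normalize_replicates(q, pct=99):
    """Across-replicate variance normalization (illustrative sketch).

    q : mean-normalized log ratios, shape (n_replicates, n_genes).
    """
    qbar = q.mean(axis=0)                 # average log ratio per gene
    s = q.std(axis=0, ddof=1)             # across-replicate standard deviation
    small = s < np.abs(qbar)              # genes whose spread looks too small
    # Constant divisor: a high percentile of the "small" standard deviations
    k = np.percentile(s[small], pct)
    return qbar / np.where(small, k, s)
```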

#### Normalizations for the log ratio difference cDNA experiment

Here the quantity of interest is the difference in log ratios between control and treatment replicates. We define

*M* = (1/*n*_{treatment}) Σ_{i=1}^{n_{treatment}} *q*_{treatment,i} - (1/*n*_{control}) Σ_{j=1}^{n_{control}} *q*_{control,j}

and *A* as the average of the log intensities, log(*RG*), across all *n* replicates, where *n*_{treatment} is the number of treatment replicates, *n*_{control} is the number of control replicates, *n* = *n*_{treatment} + *n*_{control}, *q*_{treatment,i} is the log ratio of treatment replicate *i* and *q*_{control,j} is the log ratio of control replicate *j*. With these definitions we give the multiple-replicate normalizations, defined analogously to those in the log ratio type experiment.

#### Multiple slide normalizations

We again use loess to allow the mean normalization of *M* to depend on *A*:

*M*_{norm} = *M* - *c*(*A*) (9)

where *c*(*A*) = *loess*(*M* ~ *A*), with the recommended span for the loess smoother being between 60% and 80%.

For the variance normalization we again use the information about the variance contained in the replicates to get a robust estimator of the overall variance. We calculate the variance of log ratios across the *n*_{control} replicates in the control dataset and call this *V*_{control}. Similarly, we calculate the variance of log ratios across the *n*_{treatment} replicates in the treatment dataset and call this *V*_{treatment}. Our estimate of the standard deviation, *s*, of *M* for each gene is given by

*s* = (*V*_{treatment}/*n*_{treatment} + *V*_{control}/*n*_{control})^{1/2}

We then develop the variance normalization similarly to the previous log ratio type experiment case. The variance normalization is given by

*M*_{norm}/*s* if *s* ≥ |*M*_{norm}|, and *M*_{norm}/*k* otherwise,

where *k* is again the chosen high percentile of the distribution of standard deviations of genes whose absolute mean-normalized value is greater than their standard deviation.

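A sketch of the log ratio difference statistic and its variance normalization (illustrative only; the function name is our own, the mean normalization against *A* is omitted, and at least one gene is assumed to fall in the small-standard-deviation group):

```python
import numpy as np

def normalize_log_ratio_difference(q_treat, q_ctrl, pct=99):
    """M statistic and across-replicate variance normalization for the
    log ratio difference experiment (illustrative sketch).

    q_treat : log ratios of treatment replicates, shape (n_treat, n_genes)
    q_ctrl  : log ratios of control replicates,   shape (n_ctrl, n_genes)
    """
    n_t, n_c = q_treat.shape[0], q_ctrl.shape[0]
    # Difference of average log ratios between treatment and control
    M = q_treat.mean(axis=0) - q_ctrl.mean(axis=0)
    # Standard deviation of M from the within-group variances
    s = np.sqrt(q_treat.var(axis=0, ddof=1) / n_t +
                q_ctrl.var(axis=0, ddof=1) / n_c)
    small = s < np.abs(M)
    k = np.percentile(s[small], pct)   # constant divisor for small-spread genes
    return M / np.where(small, k, s)
```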
### Methods for comparison with NUDGE

We now give brief descriptions of the methods for finding differentially expressed genes that will be used for comparison with NUDGE in the datasets examined in the Results section.

#### Rule of two

This simple but popular method, mentioned in [3], involves examining the ratios or average ratios of the two channels for each gene, and calling differentially expressed those genes with a ratio or average ratio greater than two or less than one half. It requires some initial normalization, and its performance can depend on the normalization used.
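The rule is trivial to express in code (function name our own; the input is assumed to be already-normalized ratios on the raw, not log, scale):

```python
import numpy as np

def rule_of_two(ratios):
    """Call a gene differentially expressed when its (average) expression
    ratio is greater than 2 or less than 1/2 (illustrative sketch)."""
    ratios = np.asarray(ratios, dtype=float)
    return (ratios > 2.0) | (ratios < 0.5)
```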

#### t test and adjusted t test

One of the most obvious first approaches to try for this problem is the classical *t* test, as used, for example, in [20]. A simple normalization consisting of centering the mean of the log ratios within each replicate is often used in this case. The gene-specific standard deviations also need to be estimated from the replicates.

Because of the large number of tests being run (thousands in the usual cDNA experiment setup), the standard *t* test needs to be modified to account for the multiple testing. Traditionally the most popular adjustment has been the Bonferroni correction, as mentioned in [21]. For the Bonferroni correction with *N* genes/tests and significance level *α*, we instead call each test significant only if it is significant at the *α*/*N* level, controlling the probability of one or more false positives.
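A sketch of the Bonferroni-adjusted per-gene *t* test, using SciPy's one-sample *t* test (the function name and defaults here are our own, not a specific published implementation):

```python
import numpy as np
from scipy.stats import ttest_1samp

def bonferroni_t_test(log_ratios, alpha=0.05):
    """Per-gene one-sample t test of mean log ratio = 0, Bonferroni-corrected:
    a gene is called significant at level alpha only if its p-value falls
    below alpha / N, where N is the number of genes (illustrative sketch).

    log_ratios : shape (n_replicates, n_genes), mean-centered per replicate.
    """
    n_genes = log_ratios.shape[1]
    _, pvals = ttest_1samp(log_ratios, 0.0, axis=0)
    return pvals < alpha / n_genes
```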

#### EBarrays

This follows a hierarchical Bayes approach for modeling the gene expression levels as detailed in [22]. As in our approach, the data are assumed to be generated by a two-component mixture model, one component for differentially expressed and the other for non-differentially expressed genes, each with their own distribution. The parameters specifying these distributions are estimated from the data, whence the name Empirical Bayes.

Results in this framework are given for two different parametric models in [22]. In the first model, the observed intensities for the replicates in each channel are assumed to be independently generated from a gamma distribution with a channel-specific scale parameter. The scale parameters are, in turn, assumed to have an inverse gamma distribution, whose parameters are estimated from the whole dataset. In the second model, the log ratios are assumed to be normally distributed, with gene-specific means that are themselves normally distributed. To normalize, the authors divided the log ratio for a given gene and replicate by the average log ratio across genes for that replicate.

#### Significance analysis of microarrays (SAM) [4]

The statistic used to test for differential expression is a regularized *t* statistic, i.e. the mean value divided by the sum of the standard deviation and a constant. SAM controls the False Discovery Rate (FDR), i.e. the expected proportion of genes declared differentially expressed that are not in truth differentially expressed.

For a fixed rejection region, SAM uses a permutation analysis to estimate the FDR. The user then chooses a rejection region giving an acceptable estimated FDR.