Skip to main content
  • Methodology article
  • Open access
  • Published:

Statistical methodology for the analysis of dye-switch microarray experiments

Abstract

Background

In individually dye-balanced microarray designs, each biological sample is hybridized on two different slides, once with Cy3 and once with Cy5. While this strategy ensures an automatic correction of the gene-specific labelling bias, it also induces dependencies between log-ratio measurements that must be taken into account in the statistical analysis.

Results

We present two original statistical procedures for the statistical analysis of individually balanced designs. These procedures are compared with the usual ML and REML mixed model procedures proposed in most statistical toolboxes, on both simulated and real data.

Conclusion

The UP procedure we propose as an alternative to usual mixed model procedures is more efficient and significantly faster to compute. This result provides some useful guidelines for the analysis of complex designs.

Background

DNA microarray technology is a high throughput technique by which the expression of the whole genome is studied in a single experiment. Experiments must be well organized and design issues are crucial, see [1, 2]. In dual label experiments Cy3 and Cy5 are used as fluorescent dyes allowing to compare two RNA samples on the same slide. It is now well known that there exists a differential effect of the two dyes [3, 4], that can be gene-specific. An efficient way to remove this technical artifact is to use balanced reverse dye designs [5]. Balanced reverse dye designs can be divided into three classes along a line of strengthening balancing constraints:

  1. 1.

    Balanced reverse dyes for which each biological sample is hybridized only one time and therefore present with only one dye, on only one array (Table 1.1). These designs are globally balanced but not individually balanced.

  2. 2.

    Individually-balanced design for which each biological sample is divided into two parts, one hybridized with Cy 3 on one array and the other with Cy 5 on another array. Each biological sample is hybridized exactly two times (Table 1.2).

  3. 3.

    Dye-swaps for which each couple of biological samples from two conditions are hybridized on two arrays with reverse dyes. Dye-swaps are constrained to be couple-balanced (Table 1.3).

Table 1 Three different balanced reverse dye designs for the comparison of 2 treatments

Dye-swap design is mostly used when the technical error is higher than the biological variability, either to reduce the technical variance, or when gene-specific dye-bias is of concern [6, 7]. When the biological variability is greater than the technical error, globally balanced designs are statistically more efficient [5]. However the number of biological samples is sometimes limited, therefore this design is not always possible in practice.

The term Dye-switch is used for the first and sometimes also for the second classes. Dye-switch designs of the second class are sometimes described and proposed in papers dealing with microarrays experiments. For example loop designs are often members of this class [8, 9], although the distinction between the first and the second class is not always clearly made.

A major point to notice is that the statistical analysis may be very different for the three classes of design. The analysis of the first and third classes is straightforward and well described in articles (see for example [4, 10, 11]): the experimental units are mutually independent (we consider as usual that the two conjugate arrays of the dye-swaps are summed up to only one experimental unit), and simple statistical procedures such as Student T-tests (or regularized T-tests) can be performed. On the contrary, if we consider the second class of designs, the experimental units are not independent, a feature that must (or must not) be accounted for. The literature about the statistical study of such designs is limited: some papers proposed some theoretical contributions for their analysis [12, 13], but simple guidelines for experimenters and practical considerations (computational burden, choice of a strategy for parameter estimation) are not available.

We consider here the simplest individually-balanced dye-switch design: two conditions A and B are compared in a two-color cDNA microarray experiment, with n biological samples for each condition. The design is the following: each RNA sample (A1 to A n for condition A, and B1 to B n for condition B) is divided into two parts, one labelled with Cy 5 and the second labelled with Cy 3. Then 2n microarrays are hybridized with respectively A1Cy 5 and B1Cy 3, B1Cy 5 and A2Cy 3, A2Cy 5 and B2Cy 3, and so on till B n Cy 5 and A1Cy 3, (see Table 1.2). There are 2n samples, 4n labelled samples, 2n microarrays, and each sample is hybridized two times (one with Cy 5 and one with Cy 3) on two different arrays. We propose a simple, efficient and robust method for the statistical analysis of this experiment.

Model on the measure of the expression of genes

After the normalization step, X i is the expression measure on the log-scale, for a given gene, corresponding to condition A on array i. Let j(i) denote the sample number corresponding to condition A and array i.

Similarly, Y i is the expression measure for the condition B sample on the same array, and j'(i) the sample number corresponding to condition B and array i. In the following the gene index is not present in order to simplify the mathematical expressions, but it is important to note that all the terms in the following models are gene-specific. Here we use an analysis of variance (ANOVA) model for the expression measure as introduced by [10].

The model for X i and Y i is the following:

X i = μ A + δ l ( i ) + B j ( i ) + M i + T i Y i = μ B + δ l ′ ( i ) + B j ′ ( i ) + M i + T ′ i , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeWabiqaaaqaaiabdIfaynaaBaaaleaacqWGPbqAaeqaaOGaeyypa0dcciGae8hVd02aaSbaaSqaaiabdgeabbqabaGccqGHRaWkcqWF0oazdaWgaaWcbaGaemiBaWMaeiikaGIaemyAaKMaeiykaKcabeaakiabgUcaRiabdkeacnaaBaaaleaacqWGQbGAcqGGOaakcqWGPbqAcqGGPaqkaeqaaOGaey4kaSIaemyta00aaSbaaSqaaiabdMgaPbqabaGccqGHRaWkcqWGubavdaWgaaWcbaGaemyAaKgabeaaaOqaaiabdMfaznaaBaaaleaacqWGPbqAaeqaaOGaeyypa0Jae8hVd02aaSbaaSqaaiabdkeacbqabaGccqGHRaWkcqWF0oazdaWgaaWcbaGafmiBaWMbauaacqGGOaakcqWGPbqAcqGGPaqkaeqaaOGaey4kaSIaemOqai0aaSbaaSqaaiqbdQgaQzaafaGaeiikaGIaemyAaKMaeiykaKcabeaakiabgUcaRiabd2eannaaBaaaleaacqWGPbqAaeqaaOGaey4kaSIafmivaqLbauaadaWgaaWcbaGaemyAaKgabeaakiabcYcaSaaaaaa@64A6@
(1)

where

  • μ A and μ B are the population mean expression measures for condition A and B.

  • δl(i)is a two-level fixed effect corresponding to the dye effect. δl(i)= δ1 (resp. δ2) for all the samples labelled with Cy5 (resp. Cy3). This term accounts for the gene-specific dye bias.

  • Bj(i)represents an independent gaussian random term with mean 0 and standard deviation σ B , corresponding to the random effect of sample j(i). This variable is specific to the biological sample and is called biological error, related to the variability of the biological material inside each population A and B.

  • M i represents an independent gaussian random term with mean 0 and standard deviation σ M . M i is the effect of the spot associated to the gene under concern in microarray i and has the same value for the two samples which are hybridized on array i. This error term takes into account the spatial heterogeneity in each array that affects both Cy 3 and Cy 5 measurements.

  • T i represents an independent gaussian random term with mean 0 and standard deviation σ T , corresponding to the technical variability, including the steps of labelling, hybridization and measure of intensity of fluorescence. This variable has a specific value for each combination gene×dye×sample, even if the samples are hybridized on the same array and at the same spot, so that T i and T ′ i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmivaqLbauaadaWgaaWcbaGaemyAaKgabeaaaaa@2E97@ are independent random variables. T i and M i are the two components of the so-called technical error.

Model on the difference of expression on one array

Let D i = X i - Y i , i = 1,...,2n. Using equation (1) we obtain:

D i = μ A − μ B + B j ( i ) − B j ′ ( i ) + δ l ( i ) − δ l ′ ( i ) + T i − T ′ i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemiraq0aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpiiGacqWF8oqBdaWgaaWcbaGaemyqaeeabeaakiabgkHiTiab=X7aTnaaBaaaleaacqWGcbGqaeqaaOGaey4kaSIaemOqai0aaSbaaSqaaiabdQgaQjabcIcaOiabdMgaPjabcMcaPaqabaGccqGHsislcqWGcbGqdaWgaaWcbaGafmOAaOMbauaacqGGOaakcqWGPbqAcqGGPaqkaeqaaOGaey4kaSIae8hTdq2aaSbaaSqaaiabdYgaSjabcIcaOiabdMgaPjabcMcaPaqabaGccqGHsislcqWF0oazdaWgaaWcbaGafmiBaWMbauaacqGGOaakcqWGPbqAcqGGPaqkaeqaaOGaey4kaSIaemivaq1aaSbaaSqaaiabdMgaPbqabaGccqGHsislcuWGubavgaqbamaaBaaaleaacqWGPbqAaeqaaaaa@5995@
(2)

which may be writtenD i = μ + BD i + (-1)i+1δ + TD i

where

  • μ = μ A - μ B is the true differential expression between conditions A and B for the gene under concern,

  • BD i = Bj(i)-Bj'(i)is a random variable with mean 0 and standard deviation 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaWaaOaaaeaacqaIYaGmaSqabaaaaa@2CE0@ σ B ,

  • TD i = T i - T ′ i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmivaqLbauaadaWgaaWcbaGaemyAaKgabeaaaaa@2E97@ is an independent random variable with mean 0 and standard deviation 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaWaaOaaaeaacqaIYaGmaSqabaaaaa@2CE0@ σ T ,

  • δ = δ1 - δ2 is the difference between the Cy3 and Cy5 dye effects. This term accounts for the gene-specific dye bias.

Each variable D i follows a Gaussian distribution with mean E(D i ) = μ + (-1)i+1δ and variance V ( D i ) = 2 σ B 2 + 2 σ T 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOvayLaeiikaGIaemiraq0aaSbaaSqaaiabdMgaPbqabaGccqGGPaqkcqGH9aqpcqaIYaGmiiGacqWFdpWCdaqhaaWcbaGaemOqaieabaGaeGOmaidaaOGaey4kaSIaeGOmaiJae83Wdm3aa0baaSqaaiabdsfaubqaaiabikdaYaaaaaa@3D36@ . All the covariances cov(D i , D j ) are equal to zero except the following ones:

cov ( D i , D i + 1 ) = σ B 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGagi4yamMaei4Ba8MaeiODayNaeiikaGIaemiraq0aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqWGebardaWgaaWcbaGaemyAaKMaey4kaSIaeGymaedabeaakiabcMcaPiabg2da9GGaciab=n8aZnaaDaaaleaacqWGcbGqaeaacqaIYaGmaaaaaa@3EEF@

with the convention that 2n + 1 = 1.

In this study, we present and compare different strategies for the statistical analysis of individually-balanced designs. The article is organized as follows. In the Results section, five statistical procedures to analyze individually balanced designs (Table 1.2) are compared on both simulated and real data. The Conclusion section draws the main conclusions and gives some useful guidelines for the analysis of individually-balanced designs. The details of the computation are given in the Methods section.

Results

Statistical procedure comparison

In this section, we investigate the efficiency of five test procedures for the differential analysis of datasets corresponding to the design of Table 1.2. The procedures are the following (see the Methods section for more details):

  • Naive Method NM: for each gene, the naive test statistic

    T N = 2 n D ¯ S 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemivaq1aaSbaaSqaaiabd6eaobqabaGccqGH9aqpdaGcaaqaaiabikdaYaWcbeaakiabd6gaULqbaoaalaaabaGafmiraqKbaebaaeaadaGcaaqaaiabdofatnaaCaaabeqaaiabikdaYaaaaeqaaaaaaaa@3649@

is computed.

  • Unbiased Paired Method (UP): for each gene, the unbiased paired statistic

    T U P = 2 n D ¯ ( S 2 + 2 C ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemivaq1aaSbaaSqaaiabdwfavjabdcfaqbqabaGccqGH9aqpdaGcaaqaaiabikdaYaWcbeaakiabd6gaULqbaoaalaaabaGafmiraqKbaebaaeaadaGcaaqaaiabcIcaOiabdofatnaaCaaabeqaaiabikdaYaaacqGHRaWkcqaIYaGmcqWGdbWqcqGGPaqkaeqaaaaaaaa@3C15@

is computed. Notice that from the Methods section, the theoretical value of C must be positive. In practice, the estimated value may be negative. In such a case, C is truncated at 0.

  • Unbiased Unpaired Method (UU): for each gene, the unbiased unpaired statistic

    T U U = n X ¯ − Y ¯ S X 2 + S Y 2 − 2 C X Y MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemivaq1aaSbaaSqaaiabdwfavjabdwfavbqabaGccqGH9aqpdaGcaaqaaiabd6gaUbWcbeaajuaGdaWcaaqaaiqbdIfayzaaraGaeyOeI0IafmywaKLbaebaaeaadaGcaaqaaiabdofatnaaDaaabaGaemiwaGfabaGaeGOmaidaaiabgUcaRiabdofatnaaDaaabaGaemywaKfabaGaeGOmaidaaiabgkHiTiabikdaYiabdoeadnaaBaaabaGaemiwaGLaemywaKfabeaaaeqaaaaaaaa@4412@

is computed. As for the previous method, the value of C XY must be positive. If not, C XY is truncated at 0. Furthermore, the unbiased variance estimator is S X 2 + S Y 2 − 2 C X Y MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4uam1aa0baaSqaaiabdIfaybqaaiabikdaYaaakiabgUcaRiabdofatnaaDaaaleaacqWGzbqwaeaacqaIYaGmaaGccqGHsislcqaIYaGmcqWGdbWqdaWgaaWcbaGaemiwaGLaemywaKfabeaaaaa@3967@ . Since C XY is non-negative, the variance estimator may have a negative value. In such a case, the variance can be fixed at a given threshold (0.001 in the following).

  • Mixed Model with ML estimation (ML): for each gene, model (1) is adjusted with the Maximum Likelihood algorithm.

  • Mixed Model with REML estimation (REML): for each gene, model (1) is adjusted with the Restricted Maximum Likelihood algorithm.

It is important to consider both the ML and REML algorithms for the mixed model since each algorithm has its own advantages. While ML is known to provide biased estimates of the variance components, computations are faster and the algorithm does converge. REML gives unbiased estimates of the parameters, but may not converge if the number of observations is small. Both ML and REML computations were performed using the R package Maanova [10].

Simulations

To study the behavior of the 5 procedures, we performed a simulation study using model (1). We considered 3 different values for σ B 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaacciGae83Wdm3aa0baaSqaaiabdkeacbqaaiabikdaYaaaaaa@2FC9@ (0.5, 1, 2) and σ M 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaacciGae83Wdm3aa0baaSqaaiabd2eanbqaaiabikdaYaaaaaa@2FDF@ (1, 2, 5), 4 values for the number of samples in one condition (5, 10, 20, 30) and 5 possible values for the differential expression μ = μ A - μ B (0, 1, 2, 3, 4). The parameter σ T was fixed at 1. For each combination of the parameters, 10,000 genes were simulated.

Control of the Type I error rate

We first consider the case μ = 0. Table 2 shows the actual Type I error rate level of the 5 test procedures, when the requested nominal level is 5%. Different behaviors can be observed: NM and ML result in a type I error rate higher than the nominal level, and procedure UU is conservative. UP results in an actual level that is close to the expected one, whatever the conditions. In most cases, REML enables an efficient control of the type I error. Yet, when the biological variability is high and the number of samples is low, REML yields a high type I error because of inconsistent estimations of the variance (see the next section). When σ B 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaacciGae83Wdm3aa0baaSqaaiabdkeacbqaaiabikdaYaaaaaa@2FC9@ = 2 and n = 5, the discrepancy between the theoretical and the actual level is even worse for REML than for the other methods.

Table 2 Actual level of the 5 test procedures in one simulation of 10 000 genes

From these first observations we conclude that we can discard procedures NM and ML, since in differential analysis an effective control of the Type I error rate is necessary.

Performance analysis

We now compare the performance of the 3 remaining procedures to detect differentially expressed genes. Table 3 shows the proportion of detected differentially expressed genes, for different values of the parameter set. It clearly appears that the power of procedure UU is low compared with procedures UP and REML. This may be the consequence of the Student approximation (each test statistic is compared with the quantile of a Student distribution with 2n - 2 degrees of freedom), that could be more erroneous in the case of the UU statistic.

Table 3 Power of the UU, UP and REML test procedures

An interesting point is that UP results are more stable than the REML results. If we consider sample sizes n larger than 20, we observe that the absolute values of the approximate REML T-test range from 0 to 32, except for some genes where the absolute value is larger than 400. These outliers come from an erroneous estimation of the variance of the mean difference, that is evaluated to be (almost) 0. This does not happen with (UP) since the estimated variance is max(S2, S2 + 2C), i.e. the variance is overestimated to avoid outliers. Notice that despite this overestimation in many cases the power of UP is larger than the power of REML.

Computational burden and convergence

We now consider the important question of computational time for the 2 competitive procedures UP and REML. Since microarray experiments can involve hundreds of thousands of genes, it becomes critical to use efficient algorithms for the statistical analysis of the data. Table 4 gives the user CPU time associated to each procedure for the complete analysis of 10,000 genes. While the computational time is constant whatever the condition for the (UP) procedure, (REML) is 8 to 330 times longer than (UP), depending on the number of samples.

Table 4 CPU times of procedures UP and REML

Furthermore, REML can result in inconsistent estimates of the variance, as shown in the previous sections, or may not converge. Table 4 provides the number of genes for which the REML algorithm did not converge.

Embriogenomic data

The impact of pregnancy on the cattle endometrium transcriptom is investigated in [14]. In Mammals, the implantation of the embryo is a key event in the establishment of a pregnancy. A microarray experiment has been made to analyze the gene expression of the bovine pregnant endometrium and determine key pathways that control the endometrium physiology during the implantation process. The expression of 13300 genes in the endometrium of cows (n = 5) has been investigated. Only 5 animals were available for each condition so that the dye-switch design of Table 1.2 was used. Gene profiling has been established to analyze the impact of pregnancy by comparing the endometrium of cyclic (day 20 of cycle) versus pregnant animals (day 20 of pregnancy). In the following, the results of the five statistical procedures defined above are compared using this dataset.

The Venn diagram of Figure 1 shows the number of genes declared differentially expressed (DE) by 4 methods using the Bonferroni method with a 5% level. The UU method gives the least number of DE genes (4) and is not presented in the diagram. REML (which did not converge for 3 genes) gives the greater number of DE genes (93), among which 23 are also found by the other methods, and 70 are specifically found by REML (70 REML specific genes). 70 genes are found DE by ML (22 ML specific genes), and 58 by the naive method (9 Naive specific). Finally 33 genes are declared DE by UP, and all of them are also found by one, two or all of its competitors. Therefore UP provides the less discordant results. The higher number of DE genes obtained with the naive and the ML methods was expected, since it is known from the theory and the simulation study that these methods yield more false positives than the nominal risk. Figure 2 (right) shows that the ML and UP estimates of the standard error are coherent but that the ML estimate are lower than the ones obtained by the UP method. This point is in keeping with the statistical theory which assesses that the UP estimate of the variance is unbiased while the ML estimate has a negative bias.

Figure 1
figure 1

Venn diagram for the embriogenomics experiment. Comparison of the DE genes obtained by four methods. Vertical right rectangle: REML, horizontal low rectangle: UP, bone: N and circle: ML.

Figure 2
figure 2

Comparison of the standard errors obtained with ML, REML and UP for the REML-DE genes of the embriogenomics experiment. Left: REML estimates (y-axis) versus UP estimates (x-axis) of the standard error. Center: REML estimates versus ML estimates. Right: UP estimates versus ML estimates.

The high number of DE genes specifically found with REML is odd. Figures 2 and 3 show that this comes from very low estimates of variance for some genes, so that these genes are declared DE not because the mean difference of expression between the two conditions is high, but because this mean difference is divided by a very low standard error. So most of the 70 genes only found by the REML method are due to too low estimates of the gene variance obtained by the REML algorithm. This observation is in keeping with the results of the simulation study. Therefore the UP method or the naive method should be preferred in this particular experiment. The use of REML without a sharp biological analysis of the results gene by gene would be misleading.

Figure 3
figure 3

Mean difference versus standard error for the REML-differentially expressed genes of the embriogenomics experiment. Standard error of the difference obtained by REML (y-axis) versus mean difference between the two conditions (x-axis). Black points are not found DE by other methods than REML.

Teleost fish dataset

An important application of the methodology proposed in the previous section is the analysis of loop design experiments. Loop and interwoven loop designs were initially proposed in [2] to compare p treatments, where p is 3 or higher. Figure 4 displays a particular interwoven loop design where 3 different 2-by-2 loop comparisons of treatments are combined in a single experiment. The 3 loop comparisons are

Figure 4
figure 4

The Teleofish experiment design.

  • N 1 → S 1 → N 3 → S 3 → N 5 → S 5 → N 2 → S 2 → N 4 → S 4 → N 1

  • S 1 → G 1 → S 3 → G 3 → S 5 → G 5 → S 2 → G 2 → S 4 → G 4 → S 1

  • N 1 → G 2 → N 3 → G 4 → N 5 → G 1 → N 2 → G 3 → N 4 → G 5 → N 1

Each of these comparisons corresponds to the design of Table 1.2 discussed in the previous section. Such experimental designs have been studied both theoretically [15] and practically [8, 9]. Here, we briefly present the Teleost fish data of [8].

The Teleost fish experiment aims to compare 3 populations of fish (Northern Fundulus heteroclitus, Southern Fundulus heteroclitus and Fundulus grandis). Five individuals were examined in each population to determine the variation in gene expression between populations. Each individual is used to probe four cDNA microarrays, according to the design of Figure 4. The raw data consist of 120 measurements (15 individuals × 4 slides × 2 duplicates per slide) for 907 genes.

In [8], the signal is modelled as follows (after per slide duplicate averaging):Y ijkg = m + A i + D j + (AD) ij + G g + (AG) ig + (DG) jg + (V G) kg + e ijkg ,

where A, D, G and V stand for Array, Dye, Gene and Variety, respectively. Then the 4 measurements corresponding to a given individual are averaged, and an F statistic is computed per gene to check whether the variety effect is significant or not.

This strategy roughly amounts to the UU test procedure of section when the number of treatments is higher than 2. The main difference is that in model (4), the model does not include the array random effect which takes into account the dependency between two measures on the same array. According to the results of section, this implies that the variance estimator is biased, leading to a loss of power.

As an alternative, we perform the statistical analysis using the UP procedure. Each pairwise comparison between 2 varieties is made, and a gene is declared differentially expressed if at least 2 of the 3 tests are significant. Each test is performed at the level 0.02, meaning that for a given gene, the nominal level is roughly 0.001 (3 × 0.022 for 2 of the 3 tests to be significant under H0 at level 0.02). This is a good compromise between the 0.01 threshold adopted in the original articles with no correction for multiple testing, and the 0.5 × 10-4 (0.05/907) threshold given by a 5% level per test combined with a Bonferronni multiple testing correction. While the drawback of our strategy is to replace one test by three, the advantage is that the variance estimate is unbiased.

Table 5 gives the Oleksiak original list of differentially expressed genes found with the original method and the UP list of genes found with the UP procedure.

Table 5 Lists of genes for the Teleofish experiment

Among the 15 genes originally identified, 5 are also declared differentially expressed with the (UP) method. At a first glance, the (UP) procedure seems less powerful than the original method since only 9 genes are found here compared with the 15 genes of the original article. But due to the threshold adopted by the authors in [8], the expected number of false positives is 9 for the Oleksiak list, whereas for the (UP) list we expect only 1 false positive. Therefore most of the 10 extra genes found in [8] may be false positives. To examine the discriminant effect of the 9 genes of the (UP) list, we performed as in the original publication a clustering of the individuals, according to the significative genes. The corresponding tree is given in Figure 5. A cutoff of the tree at 0.15 gives the following 3 classes :

Figure 5
figure 5

Clustering tree for the Teleofish dataset. Clustering tree for the Teleofish dataset, obtained from the second list of differentially expressed genes of Table 5.

{S 1, S 2, S 4, S 5, N 1},   {G 1, G 2, G 3, G 4, G 5, S 3},   {N 2, N 5, N 3, N 4}.

These 3 classes roughly correspond to the three populations of interest, up to 2 misclassified observations. In the original article, the partition in 3 classes gave{N 1, N 2, N 3, N 4, N 5},   {S 1, S 4},   {G 1, G 2, G 3, G 4, G 5, S 2, S 3, S 5}.

With only 9 genes (rather than 15), the classification obtained with (UP) is improved compared with the classification of the original method.

Discussion

Random terms taking into account the array and the sample effects must be included in the statistical model at the gene level for dye-switch experiments. We showed on simulations that the naive paired T-test, which does not take into account the biological sample effect, leads to more false positives than expected, especially when the biological sample effect is high. It may be safely used only when the biological variance is lower than the technical variance. The REML estimate for mixed model provides an approximatively correct false positive rate, at the price of high computational complexity, lack of convergence for low or medium sample sizes and sometimes spurious results. To the contrary, the UP method we propose is easy to implement and not computationally intensive. The method is protected against spurious results, leading to a more robust and powerful analysis than REML when the biological variability is high and the number of samples low, an usual situation in microarray experiments.

For small sample size experiments, it is advised to use regularized T-test, see [16–19]. Regularization strategies are based on statistical methods that take the individual variance of each gene as input and give a regularized variance for each gene as output. The UP procedure proposed in this paper gives an estimate for the variance of the differential expression for each gene, so it allows a further application of all these regularization methods.

Conclusion

In this paper the proposed estimate of the variance of the differential expression is assessed for the comparison between two conditions in a dye-switch design. The same methodology could be extended to more complex designs involving more than two conditions and duplicate hybridizations of the same biological sample on different arrays.

Methods

Paired test procedure

According to expression (2), an unbiased estimator of μ is D ¯ = 1 2 n ∑ D i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmiraqKbaebacqGH9aqpjuaGdaWcaaqaaiabigdaXaqaaiabikdaYiabd6gaUbaakmaaqaeabaGaemiraq0aaSbaaSqaaiabdMgaPbqabaaabeqab0GaeyyeIuoaaaa@3692@ . The variance V D ¯ = V ( D ¯ ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOvay1aaSbaaSqaaiqbdseaezaaraaabeaakiabg2da9iabdAfawjabcIcaOiqbdseaezaaraGaeeykaKcaaa@337C@ of this estimator is

V D ¯ = V ( 1 2 n ∑ D i ) = 1 4 n 2 [ ∑ V ( D i ) + 2 ∑ cov ( D i , D i + 1 ) ] = 1 4 n 2 [ 2 n ( 2 σ B 2 + 2 σ T 2 ) + 4 n σ B 2 ] = 1 2 n ( 4 σ B 2 + 2 σ T 2 ) = 1 2 n ( V ( D ) + 2 cov ( D i , D i + 1 ) ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeaabuWaaaaabaGaemOvay1aaSbaaSqaaiqbdseaezaaraaabeaaaOqaaiabg2da9aqaaiabdAfawjabcIcaOKqbaoaalaaabaGaeGymaedabaGaeGOmaiJaemOBa4gaaOWaaabqaeaacqWGebardaWgaaWcbaGaemyAaKgabeaaaeqabeqdcqGHris5aOGaeiykaKcabaaabaGaeyypa0dabaqcfa4aaSaaaeaacqaIXaqmaeaacqaI0aancqWGUbGBdaahaaqabeaacqaIYaGmaaaaaOWaamWaaeaadaaeabqaaiabdAfawjabcIcaOiabdseaenaaBaaaleaacqWGPbqAaeqaaOGaeiykaKIaey4kaSIaeGOmaiZaaabqaeaacyGGJbWycqGGVbWBcqGG2bGDcqGGOaakcqWGebardaWgaaWcbaGaemyAaKgabeaakiabcYcaSiabdseaenaaBaaaleaacqWGPbqAcqGHRaWkcqaIXaqmaeqaaOGaeiykaKcaleqabeqdcqGHris5aaWcbeqab0GaeyyeIuoaaOGaay5waiaaw2faaaqaaaqaaiabg2da9aqaaKqbaoaalaaabaGaeGymaedabaGaeGinaqJaemOBa42aaWbaaeqabaGaeGOmaidaaaaakmaadmaabaGaeGOmaiJaemOBa4MaeiikaGIaeGOmaidcciGae83Wdm3aa0baaSqaaiabdkeacbqaaiabikdaYaaakiabgUcaRiabikdaYiab=n8aZnaaDaaaleaacqWGubavaeaacqaIYaGmaaGccqGGPaqkcqGHRaWkcqaI0aancqWGUbGBcqWFdpWCdaqhaaWcbaGaemOqaieabaGaeGOmaidaaaGccaGLBbGaayzxaaaabaaabaGaeyypa0dabaqcfa4aaSaaaeaacqaIXaqmaeaacqaIYaGmcqWGUbGBaaGccqGGOaakcqaI0aancqWFdpWCdaqhaaWcbaGaemOqaieabaGaeGOmaidaaOGaey4kaSIaeGOmaiJae83Wdm3aa0baaSqaaiabdsfaubqaaiabikdaYaaakiabcMcaPaqaaaqaaiabg2da9aqaaKqbaoaalaaabaGaeGymaedabaGaeGOmaiJaemOBa4gaaOGaeiikaGIaemOvayLaeiikaGIaemiraqKaeiykaKIaey4kaSIaeGOmaiJagi4yamMaei4Ba8MaeiODayNaeiikaGIaemiraq0aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqWGebardaWgaaWcbaGaemyAaKMaey4kaSIaeGymaedabeaakiabcMcaPiabcMcaPaaaaaa@A693@
(5)

To perform a statistical test on parameter μ, we need to estimate V D ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOvay1aaSbaaSqaaiqbdseaezaaraaabeaaaaa@2E5D@ .

Naive variance estimate

The naive estimate of V D ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOvay1aaSbaaSqaaiqbdseaezaaraaabeaaaaa@2E5D@ is

1 2 n − 1 ∑ i ( D i − D ¯ ) 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqaIXaqmaeaacqaIYaGmcqWGUbGBcqGHsislcqaIXaqmaaGcdaaeqbqaaiabcIcaOiabdseaenaaBaaaleaacqWGPbqAaeqaaOGaeyOeI0IafmiraqKbaebacqGGPaqkdaahaaWcbeqaaiabikdaYaaaaeaacqWGPbqAaeqaniabggHiLdaaaa@3CE9@

which is used to perform paired T-tests. But in a dye-switch experiment the variables D i - D ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmiraqKbaebaaaa@2CFC@ are not centered, since the means of D i and D ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmiraqKbaebaaaa@2CFC@ are μ + (-1)i+1δ and μ, respectively. Hence we consider the alternative estimator

S 2 = 1 2 n − 2 ∑ i ( D i − D ¯ ( i ) ) 2 , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4uam1aaWbaaSqabeaacqaIYaGmaaGccqGH9aqpjuaGdaWcaaqaaiabigdaXaqaaiabikdaYiabd6gaUjabgkHiTiabikdaYaaakmaaqafabaGaeiikaGIaemiraq0aaSbaaSqaaiabdMgaPbqabaGccqGHsislcuWGebargaqeamaaBaaaleaacqGGOaakcqWGPbqAcqGGPaqkaeqaaOGaeiykaKYaaWbaaSqabeaacqaIYaGmaaaabaGaemyAaKgabeqdcqGHris5aOGaeiilaWcaaa@4476@

where

D ¯ i = 1 n ∑ i  is odd D i if  i  is odd , D ¯ ( i ) = 1 n ∑ i  is even D i otherwise . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeaabiGaaaqaaiqbdseaezaaraWaaSbaaSqaaiabdMgaPbqabaGccqGH9aqpjuaGdaWcaaqaaiabigdaXaqaaiabd6gaUbaakmaaqafabaGaemiraq0aaSbaaSqaaiabdMgaPbqabaaabaGaemyAaKMaeeiiaaIaeeyAaKMaee4CamNaeeiiaaIaee4Ba8MaeeizaqMaeeizaqgabeqdcqGHris5aaGcbaGaeeyAaKMaeeOzayMaeeiiaaIaemyAaKMaeeiiaaIaeeyAaKMaee4CamNaeeiiaaIaee4Ba8MaeeizaqMaeeizaqMaeiilaWcabaGafmiraqKbaebadaWgaaWcbaGaeiikaGIaemyAaKMaeiykaKcabeaakiabg2da9KqbaoaalaaabaGaeGymaedabaGaemOBa4gaaOWaaabuaeaacqWGebardaWgaaWcbaGaemyAaKgabeaaaeaacqWGPbqAcqqGGaaicqqGPbqAcqqGZbWCcqqGGaaicqqGLbqzcqqG2bGDcqqGLbqzcqqGUbGBaeqaniabggHiLdaakeaacqqGVbWBcqqG0baDcqqGObaAcqqGLbqzcqqGYbGCcqqG3bWDcqqGPbqAcqqGZbWCcqqGLbqzcqGGUaGlaaaaaa@7505@

The expectation of this alternative estimator is

E [ 1 2 n − 2 ∑ i ( D i − D ¯ ( i ) ) 2 ] = 1 2 n − 2 ∑ i [ E ( D i 2 ) − E ( D ¯ ( i ) 2 ) ] = 1 2 n − 2 ∑ i [ ( 2 σ B 2 + 2 σ T 2 ) − 1 n ( 2 σ B 2 + 2 σ T 2 ) ] = 1 2 n − 2 × 2 n × n − 1 n ( 2 σ B 2 + 2 σ T 2 ) = 2 σ B 2 + 2 σ T 2 . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeaabqWaaaaabaGaemyrau0aamWaaeaajuaGdaWcaaqaaiabigdaXaqaaiabikdaYiabd6gaUjabgkHiTiabikdaYaaakmaaqafabaGaeiikaGIaemiraq0aaSbaaSqaaiabdMgaPbqabaGccqGHsislcuWGebargaqeamaaBaaaleaacqGGOaakcqWGPbqAcqGGPaqkaeqaaOGaeiykaKYaaWbaaSqabeaacqaIYaGmaaaabaGaemyAaKgabeqdcqGHris5aaGccaGLBbGaayzxaaaabaGaeyypa0dabaqcfa4aaSaaaeaacqaIXaqmaeaacqaIYaGmcqWGUbGBcqGHsislcqaIYaGmaaGcdaaeqbqaamaadmaabaGaemyrauKaeiikaGIaemiraq0aa0baaSqaaiabdMgaPbqaaiabikdaYaaakiabcMcaPiabgkHiTiabdweafjabcIcaOiqbdseaezaaraWaa0baaSqaaiabcIcaOiabdMgaPjabcMcaPaqaaiabikdaYaaakiabcMcaPaGaay5waiaaw2faaaWcbaGaemyAaKgabeqdcqGHris5aaGcbaaabaGaeyypa0dabaqcfa4aaSaaaeaacqaIXaqmaeaacqaIYaGmcqWGUbGBcqGHsislcqaIYaGmaaGcdaaeqbqaamaadmaabaGaeiikaGIaeGOmaidcciGae83Wdm3aa0baaSqaaiabdkeacbqaaiabikdaYaaakiabgUcaRiabikdaYiab=n8aZnaaDaaaleaacqWGubavaeaacqaIYaGmaaGccqGGPaqkcqGHsisljuaGdaWcaaqaaiabigdaXaqaaiabd6gaUbaakiabcIcaOiabikdaYiab=n8aZnaaDaaaleaacqWGcbGqaeaacqaIYaGmaaGccqGHRaWkcqaIYaGmcqWFdpWCdaqhaaWcbaGaemivaqfabaGaeGOmaidaaOGaeiykaKcacaGLBbGaayzxaaaaleaacqWGPbqAaeqaniabggHiLdaakeaaaeaacqGH9aqpaeaajuaGdaWcaaqaaiabigdaXaqaaiabikdaYiabd6gaUjabgkHiTiabikdaYaaakiabgEna0kabikdaYiabd6gaUjabgEna0MqbaoaalaaabaGaemOBa4MaeyOeI0IaeGymaedabaGaemOBa4gaaOGaeiikaGIaeGOmaiJae83Wdm3aa0baaSqaaiabdkeacbqaaiabikdaYaaakiabgUcaRiabikdaYiab=n8aZnaaDaaaleaacqWGubavaeaacqaIYaGmaaGccqGGPaqkaeaaaeaacqGH9aqpaeaacqaIYaGmcqWFdpWCdaqhaaWcbaGaemOqaieabaGaeGOmaidaaOGaey4kaSIaeGOmaiJae83Wdm3aa0baaSqaaiabdsfaubqaaiabikdaYaaakiabc6caUaaaaaa@B444@

S2 is a downward biased estimator of V( D ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmiraqKbaebaaaa@2CFC@ ). The higher σ B 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaacciGae83Wdm3aa0baaSqaaiabdkeacbqaaiabikdaYaaaaaa@2FC9@ compared with σ T 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaacciGae83Wdm3aa0baaSqaaiabdsfaubqaaiabikdaYaaaaaa@2FED@ , the higher the bias:

V ( D ¯ ) = E ( S 2 ) [ 1 + σ B 2 / σ T 2 1 + σ B 2 / σ T 2 ] . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOvayLaeiikaGIafmiraqKbaebacqGGPaqkcqGH9aqpcqWGfbqrcqGGOaakcqWGtbWudaahaaWcbeqaaiabikdaYaaakiabcMcaPmaadmaabaGaeGymaeJaey4kaSscfa4aaSaaaeaaiiGacqWFdpWCdaqhaaqaaiabdkeacbqaaiabikdaYaaacqGGVaWlcqWFdpWCdaqhaaqaaiabdsfaubqaaiabikdaYaaaaeaacqaIXaqmcqGHRaWkcqWFdpWCdaqhaaqaaiabdkeacbqaaiabikdaYaaacqGGVaWlcqWFdpWCdaqhaaqaaiabdsfaubqaaiabikdaYaaaaaaakiaawUfacaGLDbaacqGGUaGlaaa@4F12@

From this naive estimate of the variance we can derive a first T-test statistic to be used for the differential analysis:

T N = 2 n D ¯ S 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemivaq1aaSbaaSqaaiabd6eaobqabaGccqGH9aqpdaGcaaqaaiabikdaYiabd6gaUbWcbeaajuaGdaWcaaqaaiqbdseaezaaraaabaWaaOaaaeaacqWGtbWudaahaaqabeaacqaIYaGmaaaabeaaaaaaaa@363F@
(6)

Unbiased variance estimate

Let C = 1 2 n − 4 ∑ i ( D i − D ¯ ( i ) ) ( D i + 1 − D ¯ ( i + 1 ) ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4qamKaeyypa0tcfa4aaSaaaeaacqaIXaqmaeaacqaIYaGmcqWGUbGBcqGHsislcqaI0aanaaGcdaaeqaqaaiabcIcaOiabdseaenaaBaaaleaacqWGPbqAaeqaaOGaeyOeI0IafmiraqKbaebadaWgaaWcbaGaeiikaGIaemyAaKMaeiykaKcabeaakiabcMcaPiabcIcaOiabdseaenaaBaaaleaacqWGPbqAcqGHRaWkcqaIXaqmaeqaaOGaeyOeI0IafmiraqKbaebadaWgaaWcbaGaeiikaGIaemyAaKMaey4kaSIaeGymaeJaeiykaKcabeaakiabcMcaPaWcbaGaemyAaKgabeqdcqGHris5aaaa@4DF6@ . We have

E ( C ) = 1 2 n − 4 ∑ i E [ D i D i + 1 − D i D ¯ ( i + 1 ) − D i + 1 D ¯ ( i ) + D ¯ ( i ) D ¯ ( i + 1 ) ] = 1 2 n − 4 ∑ i [ σ B 2 − 2 n σ B 2 − 2 n σ B 2 + 1 n 2 × n × 2 σ B 2 ] = σ B 2 . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeaabmWaaaqaaiabdweafjabcIcaOiabdoeadjabcMcaPaqaaiabg2da9aqaaKqbaoaalaaabaGaeGymaedabaGaeGOmaiJaemOBa4MaeyOeI0IaeGinaqdaaOWaaabuaeaacqWGfbqrcqGGBbWwcqWGebardaWgaaWcbaGaemyAaKgabeaakiabdseaenaaBaaaleaacqWGPbqAcqGHRaWkcqaIXaqmaeqaaOGaeyOeI0Iaemiraq0aaSbaaSqaaiabdMgaPbqabaGccuWGebargaqeamaaBaaaleaacqGGOaakcqWGPbqAcqGHRaWkcqaIXaqmcqGGPaqkaeqaaOGaeyOeI0Iaemiraq0aaSbaaSqaaiabdMgaPjabgUcaRiabigdaXaqabaGccuWGebargaqeamaaBaaaleaacqGGOaakcqWGPbqAcqGGPaqkaeqaaOGaey4kaSIafmiraqKbaebadaWgaaWcbaGaeiikaGIaemyAaKMaeiykaKcabeaakiqbdseaezaaraWaaSbaaSqaaiabcIcaOiabdMgaPjabgUcaRiabigdaXiabcMcaPaqabaGccqGGDbqxaSqaaiabdMgaPbqab0GaeyyeIuoaaOqaaaqaaiabg2da9aqaaKqbaoaalaaabaGaeGymaedabaGaeGOmaiJaemOBa4MaeyOeI0IaeGinaqdaaOWaaabuaeaadaWadaqaaGGaciab=n8aZnaaDaaaleaacqWGcbGqaeaacqaIYaGmaaGccqGHsisljuaGdaWcaaqaaiabikdaYaqaaiabd6gaUbaakiab=n8aZnaaDaaaleaacqWGcbGqaeaacqaIYaGmaaGccqGHsisljuaGdaWcaaqaaiabikdaYaqaaiabd6gaUbaakiab=n8aZnaaDaaaleaacqWGcbGqaeaacqaIYaGmaaGccqGHRaWkjuaGdaWcaaqaaiabigdaXaqaaiabd6gaUnaaCaaabeqaaiabikdaYaaaaaGccqGHxdaTcqWGUbGBcqGHxdaTcqaIYaGmcqWFdpWCdaqhaaWcbaGaemOqaieabaGaeGOmaidaaaGccaGLBbGaayzxaaaaleaacqWGPbqAaeqaniabggHiLdaakeaaaeaacqGH9aqpaeaacqWFdpWCdaqhaaWcbaGaemOqaieabaGaeGOmaidaaOGaeiOla4caaaaa@9996@

From this and equation (5) we can deduce the following unbiased estimate of V D ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOvay1aaSbaaSqaaiqbdseaezaaraaabeaaaaa@2E5D@ :

S D ¯ 2 = 1 2 n ( S 2 + 2 C ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4uam1aa0baaSqaaiqbdseaezaaraaabaGaeGOmaidaaOGaeyypa0tcfa4aaSaaaeaacqaIXaqmaeaacqaIYaGmcqWGUbGBaaGccqGGOaakcqWGtbWudaahaaWcbeqaaiabikdaYaaakiabgUcaRiabikdaYiabdoeadjabcMcaPaaa@3B84@

Finally, the "unbiased paired t-statistic" for testing the null hypothesis H0 = {μ1 = μ2} is

T U P = 2 n D ¯ ( S 2 + 2 C ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemivaq1aaSbaaSqaaiabdwfavjabdcfaqbqabaGccqGH9aqpdaGcaaqaaiabikdaYiabd6gaUbWcbeaajuaGdaWcaaqaaiqbdseaezaaraaabaWaaOaaaeaacqGGOaakcqWGtbWudaahaaqabeaacqaIYaGmaaGaey4kaSIaeGOmaiJaem4qamKaeiykaKcabeaaaaaaaa@3C0B@

which is approximately distributed as a Student distribution with 2n - 2 df under H0.

Unpaired test procedure

Let X ¯ j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmiwaGLbaebadaWgaaWcbaGaemOAaOgabeaaaaa@2EAD@ (respectively Y ¯ j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmywaKLbaebadaWgaaWcbaGaemOAaOgabeaaaaa@2EAF@ ) be the mean of the 2 results obtained with the same biological sample (in 2 different arrays and with the 2 dyes) for condition A (respectively condition B). From model (1) one obtains

X ¯ j = μ A + ( δ 1 + δ 2 ) / 2 + B j + M i ( j ) + M i ′ ( j ) ) / 2 + ( T i ( j ) + T i ′ ( j ) ) / 2 Y ¯ j = μ B + ( δ 1 + δ 2 ) / 2 + B ′ j + M i ( j ) + M i ′ ( j ) ) / 2 + ( T ′ i ( j ) + T ′ i ′ ( j ) ) / 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeaabiqaaaqaaiqbdIfayzaaraWaaSbaaSqaaiabdQgaQbqabaGccqGH9aqpiiGacqWF8oqBdaWgaaWcbaGaemyqaeeabeaakiabgUcaRiabcIcaOiab=r7aKnaaBaaaleaacqaIXaqmaeqaaOGaey4kaSIae8hTdq2aaSbaaSqaaiabikdaYaqabaGccqGGPaqkcqGGVaWlcqaIYaGmcqGHRaWkcqWGcbGqdaWgaaWcbaGaemOAaOgabeaakiabgUcaRiabd2eannaaBaaaleaacqWGPbqAcqGGOaakcqWGQbGAcqGGPaqkaeqaaOGaey4kaSIaemyta00aaSbaaSqaaiqbdMgaPzaafaGaeiikaGIaemOAaOMaeiykaKcabeaakiabcMcaPiabc+caViabikdaYiabgUcaRiabcIcaOiabdsfaunaaBaaaleaacqWGPbqAcqGGOaakcqWGQbGAcqGGPaqkaeqaaOGaey4kaSIaemivaq1aaSbaaSqaaiqbdMgaPzaafaGaeiikaGIaemOAaOMaeiykaKcabeaakiabcMcaPiabc+caViabikdaYaqaaiqbdMfazzaaraWaaSbaaSqaaiabdQgaQbqabaGccqGH9aqpcqWF8oqBdaWgaaWcbaGaemOqaieabeaakiabgUcaRiabcIcaOiab=r7aKnaaBaaaleaacqaIXaqmaeqaaOGaey4kaSIae8hTdq2aaSbaaSqaaiabikdaYaqabaGccqGGPaqkcqGGVaWlcqaIYaGmcqGHRaWkcuWGcbGqgaqbamaaBaaaleaacqWGQbGAaeqaaOGaey4kaSIaemyta00aaSbaaSqaaiabdMgaPjabcIcaOiabdQgaQjabcMcaPaqabaGccqGHRaWkcqWGnbqtdaWgaaWcbaGafmyAaKMbauaacqGGOaakcqWGQbGAcqGGPaqkaeqaaOGaeiykaKIaei4la8IaeGOmaiJaey4kaSIaeiikaGIafmivaqLbauaadaWgaaWcbaGaemyAaKMaeiikaGIaemOAaOMaeiykaKcabeaakiabgUcaRiqbdsfauzaafaWaaSbaaSqaaiqbdMgaPzaafaGaeiikaGIaemOAaOMaeiykaKcabeaakiabcMcaPiabc+caViabikdaYaaaaaa@98DD@

where j is the biological sample index (recall that sample j is different for the two conditions), i(j) and i'(j) are the arrays on which sample j has been hybridized. X ¯ j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmiwaGLbaebadaWgaaWcbaGaemOAaOgabeaaaaa@2EAD@ and Y ¯ ′ j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmywaKLbaeHbauaadaWgaaWcbaGaemOAaOgabeaaaaa@2EBA@ may be correlated as a result of a possible common array effect. X ¯ j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmiwaGLbaebadaWgaaWcbaGaemOAaOgabeaaaaa@2EAD@ and X ¯ ′ j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmiwaGLbaeHbauaadaWgaaWcbaGaemOAaOgabeaaaaa@2EB8@ are uncorrelated because the two different biological samples of the same condition cannot be present together on the same array. From result (5) we have:

V ( X ¯ − Y ¯ ) = V D ¯ = 1 2 n ( 4 σ B 2 + 2 σ T 2 ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOvayLaeiikaGIafmiwaGLbaebacqGHsislcuWGzbqwgaqeaiabcMcaPiabg2da9iabdAfawnaaBaaaleaacuWGebargaqeaaqabaGccqGH9aqpjuaGdaWcaaqaaiabigdaXaqaaiabikdaYiabd6gaUbaakiabcIcaOiabisda0GGaciab=n8aZnaaDaaaleaacqWGcbGqaeaacqaIYaGmaaGccqGHRaWkcqaIYaGmcqWFdpWCdaqhaaWcbaGaemivaqfabaGaeGOmaidaaOGaeiykaKIaeiOla4caaa@48A0@

The usual unpaired estimate of V ( X ¯ − Y ¯ ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOvayLaeiikaGIafmiwaGLbaebacqGHsislcuWGzbqwgaqeaiabcMcaPaaa@324B@ is equal to ( S X 2 + S Y 2 ) / n MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaeiikaGIaem4uam1aa0baaSqaaiabdIfaybqaaiabikdaYaaakiabgUcaRiabdofatnaaDaaaleaacqWGzbqwaeaacqaIYaGmaaGccqGGPaqkcqGGVaWlcqWGUbGBaaa@37D6@ , where S X 2 = 1 n − 1 ∑ j ( X ¯ j − X ¯ ) 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4uam1aa0baaSqaaiabdIfaybqaaiabikdaYaaakiabg2da9KqbaoaalaaabaGaeGymaedabaGaemOBa4MaeyOeI0IaeGymaedaaOWaaabeaeaacqGGOaakcuWGybawgaqeamaaBaaaleaacqWGQbGAaeqaaOGaeyOeI0IafmiwaGLbaebacqGGPaqkdaahaaWcbeqaaiabikdaYaaaaeaacqWGQbGAaeqaniabggHiLdaaaa@406C@ and S Y 2 = 1 n − 1 ∑ j ( Y ¯ j − Y ¯ ) 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4uam1aa0baaSqaaiabdMfazbqaaiabikdaYaaakiabg2da9KqbaoaalaaabaGaeGymaedabaGaemOBa4MaeyOeI0IaeGymaedaaOWaaabeaeaacqGGOaakcuWGzbqwgaqeamaaBaaaleaacqWGQbGAaeqaaOGaeyOeI0IafmywaKLbaebacqGGPaqkdaahaaWcbeqaaiabikdaYaaaaeaacqWGQbGAaeqaniabggHiLdaaaa@4072@ , whose common mean (under the homoscedastic model (1)) is equal to

σ B 2 + 1 2 ( σ T 2 + σ M 2 ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaacciGae83Wdm3aa0baaSqaaiabdkeacbqaaiabikdaYaaakiabgUcaRKqbaoaalaaabaGaeGymaedabaGaeGOmaidaaOGaeiikaGIae83Wdm3aa0baaSqaaiabdsfaubqaaiabikdaYaaakiabgUcaRiab=n8aZnaaDaaaleaacqWGnbqtaeaacqaIYaGmaaGccqGGPaqkcqGGUaGlaaa@3F27@

Therefore

E [ ( S X 2 + S Y 2 ) ] = 2 σ B 2 + σ T 2 + σ M 2 . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemyrau0aamWaaeaacqGGOaakcqWGtbWudaqhaaWcbaGaemiwaGfabaGaeGOmaidaaOGaey4kaSIaem4uam1aa0baaSqaaiabdMfazbqaaiabikdaYaaakiabcMcaPaGaay5waiaaw2faaiabg2da9iabikdaYGGaciab=n8aZnaaDaaaleaacqWGcbGqaeaacqaIYaGmaaGccqGHRaWkcqWFdpWCdaqhaaWcbaGaemivaqfabaGaeGOmaidaaOGaey4kaSIae83Wdm3aa0baaSqaaiabd2eanbqaaiabikdaYaaakiabc6caUaaa@49A0@

This method overestimates the true variance

V D ¯ = 1 2 n [ 4 σ B 2 + 2 σ T 2 ] . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOvay1aaSbaaSqaaiqbdseaezaaraaabeaakiabg2da9KqbaoaalaaabaGaeGymaedabaGaeGOmaiJaemOBa4gaaOWaamWaaeaacqaI0aaniiGacqWFdpWCdaqhaaWcbaGaemOqaieabaGaeGOmaidaaOGaey4kaSIaeGOmaiJae83Wdm3aa0baaSqaaiabdsfaubqaaiabikdaYaaaaOGaay5waiaaw2faaiabc6caUaaa@4162@

The overestimation is more dramatic as σ M 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaacciGae83Wdm3aa0baaSqaaiabd2eanbqaaiabikdaYaaaaaa@2FDF@ increases. This estimate may be corrected: σ M 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaacciGae83Wdm3aa0baaSqaaiabd2eanbqaaiabikdaYaaaaaa@2FDF@ may be estimated using the empirical covariance between X ¯ j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmiwaGLbaebadaWgaaWcbaGaemOAaOgabeaaaaa@2EAD@ and Y ¯ j ′ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmywaKLbaebadaWgaaWcbaGafmOAaOMbauaaaeqaaaaa@2EBB@ . Let

C X Y = 1 n − 2 ( ∑ j = 1 n ( X ¯ j − X ¯ ) ( Y ¯ j − Y ¯ ) + ∑ j = 1 n ( X ¯ j − X ¯ ) ( Y ¯ j − 1 − Y ¯ ) ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aaSbaaSqaaiabdIfayjabdMfazbqabaGccqGH9aqpjuaGdaWcaaqaaiabigdaXaqaaiabd6gaUjabgkHiTiabikdaYaaakmaabmaabaWaaabCaeaacqGGOaakcuWGybawgaqeamaaBaaaleaacqWGQbGAaeqaaOGaeyOeI0IafmiwaGLbaebacqGGPaqkcqGGOaakcuWGzbqwgaqeamaaBaaaleaacqWGQbGAaeqaaOGaeyOeI0IafmywaKLbaebacqGGPaqkaSqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6gaUbqdcqGHris5aOGaey4kaSYaaabCaeaacqGGOaakcuWGybawgaqeamaaBaaaleaacqWGQbGAaeqaaOGaeyOeI0IafmiwaGLbaebacqGGPaqkcqGGOaakcuWGzbqwgaqeamaaBaaaleaacqWGQbGAcqGHsislcqaIXaqmaeqaaOGaeyOeI0IafmywaKLbaebacqGGPaqkaSqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6gaUbqdcqGHris5aaGccaGLOaGaayzkaaaaaa@6364@

with the convention that Y ¯ 0 = Y ¯ n MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmywaKLbaebadaWgaaWcbaGaeGimaadabeaakiabg2da9iqbdMfazzaaraWaaSbaaSqaaiabd6gaUbqabaaaaa@3234@ . The mean of the first sum is

1 n − 2 ∑ j = 1 n E [ ( X ¯ j − X ¯ ) ( Y ¯ ′ j − Y ¯ ) ] = 1 n − 2 ∑ j = 1 n E [ ( ( M i ( j ) + M i ′ ( j ) ) / 2 − M ¯ ) ( ( M i ( j ) + M i ″ ( j ) ) / 2 − M ¯ ) ] = 1 n − 2 ∑ j = 1 n σ M 2 4 − 2 σ M 2 4 n = σ M 2 4 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeaabmWaaaqaaKqbaoaalaaabaGaeGymaedabaGaemOBa4MaeyOeI0IaeGOmaidaaOWaaabCaeaacqWGfbqrdaWadaqaaiabcIcaOiqbdIfayzaaraWaaSbaaSqaaiabdQgaQbqabaGccqGHsislcuWGybawgaqeaiabcMcaPiabcIcaOiqbdMfazzaaryaafaWaaSbaaSqaaiabdQgaQbqabaGccqGHsislcuWGzbqwgaqeaiabcMcaPaGaay5waiaaw2faaaWcbaGaemOAaOMaeyypa0JaeGymaedabaGaemOBa4ganiabggHiLdaakeaacqGH9aqpaeaajuaGdaWcaaqaaiabigdaXaqaaiabd6gaUjabgkHiTiabikdaYaaakmaaqahabaGaemyrau0aamWaaeaacqGGOaakcqGGOaakcqWGnbqtdaWgaaWcbaGaemyAaKMaeiikaGIaemOAaOMaeiykaKcabeaakiabgUcaRiabd2eannaaBaaaleaacuWGPbqAgaqbaiabcIcaOiabdQgaQjabcMcaPaqabaGccqGGPaqkcqGGVaWlcqaIYaGmcqGHsislcuWGnbqtgaqeaiabcMcaPiabcIcaOiabcIcaOiabd2eannaaBaaaleaacqWGPbqAcqGGOaakcqWGQbGAcqGGPaqkaeqaaOGaey4kaSIaemyta00aaSbaaSqaaiqbdMgaPzaagaGaeiikaGIaemOAaOMaeiykaKcabeaakiabcMcaPiabc+caViabikdaYiabgkHiTiqbd2eanzaaraGaeiykaKcacaGLBbGaayzxaaaaleaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGUbGBa0GaeyyeIuoaaOqaaaqaaiabg2da9aqaaKqbaoaalaaabaGaeGymaedabaGaemOBa4MaeyOeI0IaeGOmaidaaOWaaabCaeaajuaGdaWcaaqaaGGaciab=n8aZnaaDaaabaGaemyta0eabaGaeGOmaidaaaqaaiabisda0aaakiabgkHiTKqbaoaalaaabaGaeGOmaiJae83Wdm3aa0baaeaacqWGnbqtaeaacqaIYaGmaaaabaGaeGinaqJaemOBa4gaaaWcbaGaemOAaOMaeyypa0JaeGymaedabaGaemOBa4ganiabggHiLdaakeaaaeaacqGH9aqpaKqbagaadaWcaaqaaiab=n8aZnaaDaaabaGaemyta0eabaGaeGOmaidaaaqaaiabisda0aaaaaaaaa@A250@

It is easy to see that the second sum in C XY has the same mean. Therefore an unbiased estimate of V D ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOvay1aaSbaaSqaaiqbdseaezaaraaabeaaaaa@2E5D@ is 1 n ( S X 2 + S Y 2 − 2 C X Y ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqaIXaqmaeaacqWGUbGBaaGccqGGOaakcqWGtbWudaqhaaWcbaGaemiwaGfabaGaeGOmaidaaOGaey4kaSIaem4uam1aa0baaSqaaiabdMfazbqaaiabikdaYaaakiabgkHiTiabikdaYiabdoeadnaaBaaaleaacqWGybawcqWGzbqwaeqaaOGaeiykaKcaaa@3E20@ , and the approximate unpaired t-statistic is

T U U = n X ¯ − Y ¯ S X 2 + S Y 2 − 2 C X Y . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemivaq1aaSbaaSqaaiabdwfavjabdwfavbqabaGccqGH9aqpdaGcaaqaaiabd6gaUbWcbeaajuaGdaWcaaqaaiqbdIfayzaaraGaeyOeI0IafmywaKLbaebaaeaadaGcaaqaaiabdofatnaaDaaabaGaemiwaGfabaGaeGOmaidaaiabgUcaRiabdofatnaaDaaabaGaemywaKfabaGaeGOmaidaaiabgkHiTiabikdaYiabdoeadnaaBaaabaGaemiwaGLaemywaKfabeaaaeqaaaaakiabc6caUaaa@4500@
(7)

References

  1. Yang Y, Speed T: Design issues for cDNA microarray experiments. Nat Rev Genet 2002, 3(8):579–88.

    CAS  PubMed  Google Scholar 

  2. Churchill G: Fundamentals of experimental design for cDNA microarrays. Nat Genet 2002, 32: 490–495. 10.1038/ng1031

    Article  CAS  PubMed  Google Scholar 

  3. Yang Y, Dudoit S, Luu P, Speed T: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nuclear Acids Res 2002, 30(4):e15. 10.1093/nar/30.4.e15

    Article  Google Scholar 

  4. Kerr M, Afshari C, Bennett L, Bushel P, Martinez J, Walker N, Churchill G: Statistical Analysis of a gene expression microarray experiment with replication. Statistica Sinica 2002, 12: 203–217.

    Google Scholar 

  5. Dobbin K, Shih J, Simon R: Statistical design of reverse dye microarrays. Bioinformatics 2003, 19(7):803–810. 10.1093/bioinformatics/btg076

    Article  CAS  PubMed  Google Scholar 

  6. Martin-Magniette M, Aubert J, Cabannes E, Daudin J: Evaluation of the gene-specific dye bias in cDNA microarray experiments. Bioinformatics 2005, 21(9):1995–2000. 10.1093/bioinformatics/bti302

    Article  CAS  PubMed  Google Scholar 

  7. Dobbin K, Kawasaki E, Petersen D, Simon R: Characterizing dye bias in microarray experiments. Bioinformatics 2005, 21(10):2430–2437. 10.1093/bioinformatics/bti378

    Article  CAS  PubMed  Google Scholar 

  8. Oleksiak M, Churchill G, Crawford D: Variation in gene expression within and among natural populations. Nat Genet 2002, 32: 261–266. 10.1038/ng983

    Article  CAS  PubMed  Google Scholar 

  9. Whitehead A, Crawford D: Neutral and adaptive variation in gene expression. Proc Natl Acad Sci USA 2006, 103(14):5425–5430. 10.1073/pnas.0507648103

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  10. Kerr M, Martin M, Churchill G: Analysis of variance for gene expression microarray data. J Comput Biol 2000, 7: 819–837. 10.1089/10665270050514954

    Article  CAS  PubMed  Google Scholar 

  11. Wit E, McClure J: Statistics for Microarrays: Design, Analysis and Inference. Chichester: Wiley; 2004.

    Book  Google Scholar 

  12. Landgrebe J, Bretz F, Brunner E: Efficient two-sample designs for microarray experiments with biological replications. Silico Biology 2004., 4:

    Google Scholar 

  13. Landgrebe J, Bretz F, Brunner E: Efficient design and analysis of two colour factorial microarray experiments. Comput Stat & Data Anal 2006, 50(2):499–517. 10.1016/j.csda.2004.08.014

    Article  Google Scholar 

  14. Mansouri N, Sandra O, Aubert J, Everts R, Galio L, Heyman Y, Audouart C, Degrelle S, Hue I, Yang X, Lewin H, JP R: Identification of differentially regulated genes in the endometrium of cyclic and pregnant cows using a high-throughput transcriptome analysis. American Journal of Reproductive Immunology 2007, 58(5):204–220.

    Google Scholar 

  15. Wit E, Nobile A, Khanin R: Near-optimal designs for dual channel microarray studies. J R Stat Soc C 2005, 54(5):817–830. 10.1111/j.1467-9876.2005.00519.x

    Article  Google Scholar 

  16. Baldi P, Long A: A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized t-Test and Statistical Inferences of Gene Changes. Bioinformatics 2001, 17: 509–519. 10.1093/bioinformatics/17.6.509

    Article  CAS  PubMed  Google Scholar 

  17. Delmar P, Robin S, Daudin J: VarMixt: efficient variance modelling for the differential analysis of replicated gene expression data. Bioinformatics 2005, 21(4):502–508. 10.1093/bioinformatics/bti023

    Article  CAS  PubMed  Google Scholar 

  18. Smyth G: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 2004, 3: Article 3. 10.2202/1544-6115.1027

    Article  Google Scholar 

  19. Tusher V, Tibshirani R, Chu C: Significance analysis of microarrays applied to transcriptional response to ionizing radiations. PNAS 2001, 98: 5116–5121. 10.1073/pnas.091062498

    Article  PubMed Central  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

The authors thank Douglas L. Crawford who provided the Teleofish dataset.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Tristan Mary-Huard.

Additional information

Authors' contributions

TMH and JJD conceived the method and prepared the manuscript. TMH and JA implemented part of the software and performed the statistical analysis. NM made the Embriogenomics experiment under the direction of OS. All authors contributed to the discussion and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Mary-Huard, T., Aubert, J., Mansouri-Attia, N. et al. Statistical methodology for the analysis of dye-switch microarray experiments. BMC Bioinformatics 9, 98 (2008). https://doi.org/10.1186/1471-2105-9-98

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-9-98

Keywords