The Illumina BeadArray platform is a microarray technology offering highly replicable measurements of gene expression in a biological sample. Each probe is measured on an average of thirty to sixty beads randomly distributed on the surface of the array, which avoids spatial artifacts, and the reported probe intensity is the robust mean of the bead measurements. The fluorescence intensity measured on each bead is subject to several sources of noise (non-specific binding, optical noise, …). The intensities produced by the microarray therefore require a background correction to account for measurement error. For that purpose, the Illumina microarray design includes a set of non-specific negative control probes, which provide an estimate of the background noise distribution.
In genome-wide microarrays, the observed intensity of a probe is usually modeled as the sum of a signal and a background noise. Namely, let X be the observed intensity of a given probe; we assume that
    X = S + B,
where S is the true signal, which accounts for the abundance of the probe's complementary sequence in the target sample and is independent of the background noise B. Only X is observed, but the quantity of interest is the signal S. A background correction adjusting for the effect of noise on the true signal is therefore necessary to enhance the biological validity of the results. In this context, knowledge of both the signal and noise distributions provides a background correction procedure: the signal S is estimated by the conditional expectation of S given the observation X=x and given the distributions of B and S. Under parametric assumptions on B and S, the problem reduces to the estimation of the parameters. Moreover, in many experimental contexts involving measurement error, the noise is assumed to be normally distributed. Specific arguments for microarray data find their origins in analytical chemistry (see e.g. ).
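As an illustration of the conditional-expectation estimator described above, the following minimal sketch approximates E[S | X = x] by numerical integration for arbitrary signal and noise densities. The function names, the exponential signal density, and all parameter values are assumptions chosen for the example; they are not taken from the paper.

```python
import math


def normal_pdf(z, mu, sigma):
    """Density of a N(mu, sigma^2) background noise evaluated at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def exponential_pdf(s, mean):
    """Density of an exponential signal with the given mean (illustrative choice)."""
    return math.exp(-s / mean) / mean if s >= 0 else 0.0


def conditional_signal(x, signal_pdf, noise_pdf, s_max, n_steps=20000):
    """Approximate E[S | X = x] with the midpoint rule on [0, s_max].

    Since X = S + B with S and B independent, the conditional density of S
    given X = x is proportional to signal_pdf(s) * noise_pdf(x - s).
    """
    h = s_max / n_steps
    numerator = 0.0
    denominator = 0.0
    for i in range(n_steps):
        s = (i + 0.5) * h
        weight = signal_pdf(s) * noise_pdf(x - s)
        numerator += s * weight
        denominator += weight
    return numerator / denominator


# Hypothetical values: noise N(100, 20^2), exponential signal with mean 200.
estimate = conditional_signal(
    x=150.0,
    signal_pdf=lambda s: exponential_pdf(s, mean=200.0),
    noise_pdf=lambda b: normal_pdf(b, mu=100.0, sigma=20.0),
    s_max=2000.0,
)
print(round(estimate, 2))
```

The upper bound `s_max` must be large enough that the integrand is negligible beyond it; here it is set by hand for the chosen parameters.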
Background correction of Affymetrix and two-color microarray data has been widely developed in the literature (see  for a review and a comparison). Irizarry et al proposed a parametric model for Affymetrix data, called the normexp model, based on an exponential distribution of the signal. Several estimation procedures have been developed for this model. The first parameter estimation procedure, still popular today, is the Robust Multi-array Average (RMA) procedure. Maximum Likelihood Estimation (MLE), incorporating the negative controls, was proposed later and is considered to be more sensitive to the true parameter values (see ). These procedures can be found in Bioconductor packages, including limma.
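For a normal noise and an exponential signal, the conditional expectation of the signal has a closed form, because the conditional distribution of S given X = x is a normal distribution truncated to the positive half-line. The sketch below assumes the parametrisation B ~ N(mu, sigma^2) and S exponential with mean alpha; the function name and the numerical values are illustrative, and the code is not claimed to match the implementation of any particular package.

```python
import math


def normexp_signal(x, mu, sigma, alpha):
    """E[S | X = x] when B ~ N(mu, sigma^2) and S ~ Exponential(mean alpha).

    Given X = x, S follows a N(mu_sx, sigma^2) distribution truncated to
    [0, +inf), with mu_sx = x - mu - sigma^2 / alpha.
    Note: not numerically robust when mu_sx / sigma is extremely negative,
    because the normal CDF then underflows to zero.
    """
    mu_sx = x - mu - sigma ** 2 / alpha
    z = mu_sx / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return mu_sx + sigma * pdf / cdf


# Hypothetical parameter values, for illustration only.
for x in (50.0, 100.0, 150.0, 1000.0):
    print(x, round(normexp_signal(x, mu=100.0, sigma=20.0, alpha=200.0), 2))
```

The corrected value stays positive even when the observed intensity lies below the background mean, which is the practical appeal of model-based correction over plain subtraction.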
The Illumina design differs from those of Affymetrix and two-color microarrays by including a set of negative probes which do not specifically target any regular probe. Aside from non-specific hybridization, these negative probes do not hybridize and thus have signals close to zero. Their observed intensity is therefore X=B. As all probes from a given array correspond to the same biological sample and are subject to the same technical steps during the analysis process, the noise is generally assumed to be identically distributed on an array, and the negative probes provide a sample from its distribution.
The background correction implemented in the Illumina BeadStudio software is the subtraction of the estimated mean of the negative probe distribution. However, it creates a large number of probes with negative intensities, which are unusable in further analysis. The deletion of these probes is considered in some studies as an opportunity to gain statistical power when the number of strongly differentially expressed genes is large, but it can lead to a substantial loss of information. Ding et al illustrate this phenomenon in their mouse leukemia study: a large number of corrected values are negative in only one group, suggesting that the corresponding probes have discriminating ability. This issue is confirmed by Dunning et al on spike-in data.
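The effect of plain mean subtraction can be reproduced on simulated data. The following sketch draws negative control and regular probe intensities from hypothetical distributions (normal noise, exponential signal; all values are assumptions made for the example), estimates the background from the negative controls, and counts how many corrected intensities fall below zero.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# Hypothetical simulation parameters, not taken from any real array.
MU, SIGMA = 100.0, 20.0   # background noise B ~ N(MU, SIGMA^2)
SIGNAL_MEAN = 50.0        # signal S ~ Exponential(mean SIGNAL_MEAN)

# Negative controls carry no signal, so they are a sample of the noise: X = B.
negative_controls = [random.gauss(MU, SIGMA) for _ in range(800)]

# Regular probes follow X = S + B.
regular_probes = [
    random.expovariate(1.0 / SIGNAL_MEAN) + random.gauss(MU, SIGMA)
    for _ in range(5000)
]

# Background parameters estimated from the negative controls only.
background_mean = statistics.mean(negative_controls)
background_sd = statistics.stdev(negative_controls)

# Correction by subtraction of the estimated background mean.
corrected = [x - background_mean for x in regular_probes]
n_negative = sum(1 for value in corrected if value < 0)

print(f"estimated background: mean={background_mean:.1f}, sd={background_sd:.1f}")
print(f"{n_negative} of {len(corrected)} corrected intensities are negative")
```

Any probe whose signal is small relative to the noise standard deviation has a sizeable chance of being pushed below zero, which is why the proportion of negative values grows with the share of weakly expressed probes.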
To avoid this problem, parametric models have been used on Illumina data, with parameter estimation procedures that take into account the specific design of Illumina microarrays. In this context, the normexp model was the first to be adapted. Ding et al use Maximum Likelihood Estimation (MLE) based on a Markov chain Monte Carlo approximation and compare their method to an Illumina-adapted RMA procedure that uses an ad hoc rule of thumb to estimate the parameters. Xie et al compare the normexp methods in detail on experimental and simulated data. Lin et al present a variance stabilizing transformation (VST) based on a model involving both additive and multiplicative noise, which simultaneously denoises and transforms the data. Replacing the classical log-transformation, VST produces less directly interpretable results and tends to produce very small fold changes, as underlined by Shi et al, who propose an original approach to compare methods offering different bias-precision trade-offs by aligning the innate offset generated by each pre-processing strategy. They conclude in favor of the normexp model with robust ’non-parametric’ parameters, associated with a quantile normalization using control and regular probes. In addition, Chen et al propose a gamma parametrisation of the background noise distribution to handle the departure from normality observed on negative probes, associated with an exponential distribution of the signal. They emphasize that the gamma-exponential model can provide an improvement in terms of differential analysis, but might not be an adequate parametrisation in some cases.
How widely each background correction is used among Illumina users is hard to evaluate, since many authors do not describe precisely the pre-processing steps performed in their study. Nevertheless, the normexp model, which is examined in particular in this paper, is included in several widely used Bioconductor packages such as lumi and limma.
Despite its popularity, the normexp model does not properly fit Illumina microarray data. This issue was raised by Wang and Ye, who estimate the density of the signal on an Illumina microarray with a kernel-based deconvolution procedure. The shape of the estimated signal density does not present the characteristics of an exponential distribution, and a gamma model seems more appropriate. We confirm these findings by implementing the kernel-based estimator by Wang and Wang  available in the R package decon (the results are displayed in Additional file 1, Section 1). The signal density estimate exhibits a heavy tail which cannot be fitted by an exponential density. Nevertheless, kernel-based density estimators do not appear efficient at recovering breakpoints in the density and present instabilities, which limits the interpretation of the signal density estimate in the microarray context. In this paper, we emphasize that the normal-exponential model is not flexible enough to model the signal-noise decomposition on Illumina microarrays, by showing that the distance between the density reconstructed from the estimated parameters and the distribution of the observed intensities is large.
We propose an alternative model, hereafter called the “normal-gamma model”, which addresses this lack of fit. In our model, the noise is assumed to be normally distributed and the signal on one array is assumed to be gamma distributed. As the exponential distribution is a special case of the gamma distribution, this model extends the normexp model. The potential of such a generalization was already suggested by Xie et al in their discussion. We derive the necessary estimation procedure by likelihood maximization. The good quality of fit is demonstrated on two types of Illumina microarrays. The associated background correction is compared to methods based on the normexp model in terms of the quality of estimation of the signal, and checked for robustness on simulated data. The characteristics of the background correction procedures are compared on a set of spike-in data, and a parallel is drawn with the same characteristics studied on normal-gamma simulated data. Finally, the normexp and normal-gamma background corrections are compared on two dilution data sets.
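The nesting of the normexp model inside the normal-gamma model can be checked numerically. The sketch below is only an illustration under assumed parameter values: it evaluates E[S | X = x] for a gamma-distributed signal by numerical integration (this is not the estimation procedure of the NormalGamma package, and the shape/scale parametrisation is an assumption of the example) and compares the shape-one case with the closed-form normexp value.

```python
import math


def normal_pdf(z, mu, sigma):
    """Density of the N(mu, sigma^2) noise at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def gamma_pdf(s, shape, scale):
    """Density of a gamma signal (shape/scale form), computed on the log scale."""
    if s <= 0:
        return 0.0
    log_density = ((shape - 1.0) * math.log(s) - s / scale
                   - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_density)


def normal_gamma_signal(x, mu, sigma, shape, scale, s_max, n_steps=20000):
    """Midpoint-rule approximation of E[S | X = x] on [0, s_max].

    For shape < 1 the gamma density is unbounded at 0 and this simple rule
    loses accuracy; a dedicated quadrature would be needed in that case.
    """
    h = s_max / n_steps
    numerator = 0.0
    denominator = 0.0
    for i in range(n_steps):
        s = (i + 0.5) * h
        weight = gamma_pdf(s, shape, scale) * normal_pdf(x - s, mu, sigma)
        numerator += s * weight
        denominator += weight
    return numerator / denominator


def normexp_signal(x, mu, sigma, alpha):
    """Closed-form E[S | X = x] for normal noise and exponential signal (mean alpha)."""
    mu_sx = x - mu - sigma ** 2 / alpha
    z = mu_sx / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return mu_sx + sigma * pdf / cdf


# Hypothetical values, for illustration only.
x, mu, sigma = 150.0, 100.0, 20.0
as_exponential = normal_gamma_signal(x, mu, sigma, shape=1.0, scale=200.0, s_max=2000.0)
closed_form = normexp_signal(x, mu, sigma, alpha=200.0)
shape_two = normal_gamma_signal(x, mu, sigma, shape=2.0, scale=100.0, s_max=2000.0)
print(as_exponential, closed_form, shape_two)
```

With shape equal to one, the numerical value agrees with the normexp closed form up to integration error; other shapes give a different correction for the same observed intensity, which is the extra flexibility the gamma signal brings.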
The paper is organized as follows. The experimental and simulated data sets as well as the estimation procedures are presented in Section “Methods”: the notations and the general model-based background correction formula are gathered in Section “General model-based background correction formula”; the previous models developed for Illumina microarray background correction, including the normexp model, are summarized in Section “Previous modelings”; Section “A new modeling: the normal-gamma model” presents the proposed alternative parametric model built with normal noise and gamma distributed signal, as well as a parametric estimation procedure and its associated background correction. The performance of this new model is evaluated on simulated, spike-in and dilution data sets in Section “Results and discussion”. The impact of this more flexible parametrisation on background correction, as well as the perspectives for further pre-processing analyses, are discussed in Section “Conclusions”. The normal-gamma parameter estimation and the associated background correction are implemented in the R package NormalGamma. The scripts used to produce the tables and figures are available in Additional file 2.