 Research article
 Open access
 Published:
Preprocessing of gene expression data by optimally robust estimators
BMC Bioinformatics volume 11, Article number: 583 (2010)
Abstract
Background
The preprocessing of gene expression data obtained from several platforms routinely includes the aggregation of multiple raw signal intensities to one expression value. Examples are the computation of a single expression measure based on the perfect match (PM) and mismatch (MM) probes for the Affymetrix technology, the summarization of bead level values to bead summary values for the Illumina technology or the aggregation of replicated measurements in the case of other technologies including realtime quantitative polymerase chain reaction (RTqPCR) platforms. The summarization of technical replicates is also performed in other "omics" disciplines like proteomics or metabolomics.
Preprocessing methods like MAS 5.0, Illumina's default summarization method, RMA, or VSN show that the use of robust estimators is widely accepted in gene expression analysis. However, the selection of robust methods seems to be mainly driven by their high breakdown point and not by efficiency.
Results
We describe how optimally robust radiusminimax (rmx) estimators, i.e. estimators that minimize an asymptotic maximum risk on shrinking neighborhoods about an ideal model, can be used for the aggregation of multiple raw signal intensities to one expression value for Affymetrix and Illumina data. With regard to the Affymetrix data, we have implemented an algorithm which is a variant of MAS 5.0.
Using datasets from the literature and MonteCarlo simulations we provide some reasoning for assuming approximate lognormal distributions of the raw signal intensities by means of the Kolmogorov distance, at least for the discussed datasets, and compare the results of our preprocessing algorithms with the results of Affymetrix's MAS 5.0 and Illumina's default method.
The numerical results indicate that when using rmx estimators an accuracy improvement of about 1020% is obtained compared to Affymetrix's MAS 5.0 and about 15% compared to Illumina's default method. The improvement is also visible in the analysis of technical replicates where the reproducibility of the values (in terms of Pearson and Spearman correlation) is increased for all Affymetrix and almost all Illumina examples considered. Our algorithms are implemented in the R package named RobLoxBioC which is publicly available via CRAN, The Comprehensive R Archive Network (http://cran.rproject.org/web/packages/RobLoxBioC/).
Conclusions
Optimally robust rmx estimators have a high breakdown point and are computationally feasible. They can lead to a considerable gain in efficiency for wellestablished bioinformatics procedures and thus, can increase the reproducibility and power of subsequent statistical analysis.
Background
Affymetrix microarrays consist of a number of probe cells, each probe cell containing a unique probe. There are two types of probes, perfect match (PM) and mismatch (MM) occurring as pairs. The sequences for PM and MM are almost identical. The difference consists of a single base change in the middle of the PM probe sequence to the WatsonCrick complement for the MM probe sequence. A series of such probe pairs forms a probe set which represents a transcript [1].
Hence, it is part of the preprocessing of Affymetrix arrays to compute a single expression value for the different probe sets. One of the most popular algorithms for this purpose is MAS 5.0, developed by Affymetrix [1]. It is the algorithm that, for instance, was most frequently applied within the framework of phase II of the microarray quality control (MAQC) project [2].
MAS 5.0 uses PM and Ideal Match (IM) to compute the expression values where, for probe set i and probe pair j,
with default values τ_{1} = 0.03 (contrast tau) and τ_{2} = 10 (scale tau). The specific background (SB _{ i } ) is determined using Tukey's biweight onestep estimator (T_{bi}) where Affymetrix's version of Tukey's biweight disregards all observations outside of median ±5 median absolute deviation (MAD) (i.e. unstandardized MAD) leading to a very robust estimator:
Then, the signal log value for each probe set is obtained via
with probe value PV _{ i, j } = log_{2}(V_{ i, j } ) and V_{ i, j } = max{PM _{ i, j }  IM _{ i, j } , δ}. The constant δ with default value δ = 2^{20} is introduced for numerical stability.
However, as table three in Section 2.6 of Hampel et al. (1986) [3] suggests there are more efficient robust estimators than Tukey's biweight, e.g. the Tanhestimator. In addition, Table eight.five in Section 8.7 of Kohl (2005) [4] shows that in the infinitesimal robust setup for normal location and scale one may lose up to 54.9% asymptotic efficiency if one uses Tukey's biweight in combination with MAD (TuMad) instead of the optimally robust estimator. Consequently, we implemented an algorithm along the l ines of MAS 5.0 where we substituted the Tukey biweight onestep location estimator by an in infinitesimally robust radiusminimax (rmx) kstep (k ≥ 1) location and scale estimator [5].
Illumina BeadArrays are based on 3micron silica beads that are randomly positioned on fiber optic bundles or planar silica slides. Each bead is covered with hundreds of thousands of copies of a specific oligonucleotide sequence assigning the bead to a certain bead type, where each bead type is represented in high redundancy with more than 30 replicates on average. The intensity values of the single beads are called bead level data. It is part of the preprocessing to aggregate the bead level data to socalled bead summary data leading to a single expression value for each bead type. In this paper we only consider data from singlechannel gene expression BeadArrays. Besides that, BeadArrays can also be used for SNP genotyping, methylation profiling, and copy number variation analysis [6].
In Illumina's proprietary BeadStudio Software an outlier rejection method (median ±3 × MAD) combined with mean and standard deviation is used to aggregate the bead level data to bead summary data. The MAD in this setup is standardized by 1.4826 to be consistent at the normal model. That is, the location part of Illumina's estimator is a Hubertype skipped mean and is very close to estimator X42 of Hampel (1985) [7], which uses 3.03 × MAD. Quoting Hampel et al. (1986) [3], p. 69, the X42 estimator is "frequently quite reasonable, according to present preliminary knowledge". However, MonteCarlo studies show that there may be an efficiency loss of 525% compared to more sophisticated robust estimators [7]. Hence, we implemented an algorithm which uses rmx kstep estimators for summarizing bead level data. The goal of this paper is to demonstrate that the application of optimally robust rmx estimators can lead to a considerable efficiency gain and increased reproducibility for wellestablished preprocessing algorithms. First, we argue for normal location and scale as the ideal model, at least for the logtransformed values of some publicly available Affymetrix and Illumina data sets, using minimum Kolmogorov distance. Second, we use MonteCarlo simulations and those publicly available datasets to compare MAS 5.0 and Illumina's default method with our modified algorithms based on rmx estimators. The proposed preprocessing algorithms are implemented in the R package RobLoxBioC[8, 9]. A brief overview of infinitesimal robustness is given in the Methods section at the end of the manuscript.
Results and Discussion
Affymetrix Data
We replace Tukey's biweight which only provides a location estimate by an rmx estimator for normal location and scale \mathcal{N}(\mu ,{\sigma}^{\text{2}}). The Tukey biweight and the rmx estimator are constructed as onestep and kstep estimates based on median and MAD, respectively. As both estimators are so called asymptotically linear (AL) estimators, a straightforward way to compare these estimators is to observe the corresponding influence curves/functions (ICs) displayed in Figure 1.
In both cases, the location part of the IC is redescending. In contrast to Tukey's biweight rejecting observations more than about 3.35 (standardized) MAD times away from the median, the rmx estimator only downweights large observations. Moreover, the plot shows that Tukey's biweight is mostly affected by undiscoverable (very likely to occur in the normal model) gross errors located around 1.51 (standardized) MADtimes away from the median where the IC of Tukey's biweight (and the MAD) is maximal, whereas the rmx estimator is mostly deflected by large observations where the Euclidean length of the location and scale IC is maximal.
However, for applying the rmx estimator for normal location and scale, one first should check if it is plausible to assume normal location and scale as the ideal model for the M_{ i, j } := log_{2}(PM _{ i, j } /MM _{ i, j } ) values. As we can not test for approximate normality (there is no such statistical test), we use the minimum Kolmogorov distance for this purpose. That is, we minimize, in μ ∈ ℝ and σ ∈ (0, ∞),
where Φ _{ μ,σ } is the cumulative distribution function of \mathcal{N}(\mu ,{\sigma}^{\text{2}}) and {\widehat{F}}_{{M}_{i}} is the empirical distribution function of the sample M_{ i, j },..., M_{ i, ni }. Working with a rightcontinuous empirical distribution function the above supremum is equal to
where M_{i,(1),}..., M_{i,(ni)}is the increasingly sorted sample. In particular, the minimal possible Kolmogorov distance for sample size n is (2n)^{1}.
Of course, it would be possible to use some other distance (e.g. Cramér von Mises) or the test statistic of some test for normality for this purpose. However, we decided to use the Kolmogorov distance since there is a certain connection between Kolmogorov neighborhoods and the grosserror model in infinitesimal robustness (see Rieder (1994), Lemma 4.2.8 and Proposition 5.3.3 [10]) and the Kolmogorov distance can be computed efficiently via equation (5). Nevertheless, the computations take more than 100 minutes for the HGU95A dataset and more than 130 minutes for the HGU133A dataset, which consist of 59 and 42 GeneChips, respectively, using function KolmogorovMinDist of our package RobLoxBioC on an Intel P9500 (64 bit Linux, 8 GByte RAM). For more details on these Latin square spikein datasets we refer to Cope et al. (2004) [11] and Irizarry et al. (2006) [12].
Table 1 shows the number of probe sets per number of probe level pairs for the HGU95A and HGU133A GeneChips. Figure 2 displays the minimum Kolmogorov distances for the HGU95A and HGU133A Latin square datasets as well as for normal random samples (50000 MonteCarlo replications for each sample size) where we selected only those probe level pairs with a considerable number of probe sets. In Table 2 we recorded the differences of the medians of the minimum Kolmogorov distances between the Latin square datasets and corresponding normal random samples. The results for 95% and 99% quantiles are very similar. Based on these results it is very reasonable to assume normal location and scale as the ideal model for M_{ i, j } expecting only minor deviations from normality.
To get a rough estimate of the corresponding size of the contamination neighborhood (i.e. Tukey's grosserror model [13]), which is required for selecting an appropriate rmx estimator, we use the following heuristics: for Kolmogorov (U_{ k } ), total variation (U_{ v } ) and contamination neighborhoods (U_{ c } ) of size s ∈ (0,1) it holds
In addition, at least in the onedimensional case and under symmetry, the optimally robust ICs for U_{ c } (2s) and U_{ v } (s) coincide. Moreover, if the optimally robust IC is monotone, then it is also the solution for U_{ k } (s) [10]. Based on these considerations we multiply the median difference between the observed and the simulated Kolmogorov distance by two, leading us to a neighborhood size s ∈ [0, 0.05].
We use the corresponding rmx 3step estimator instead of Tukey's biweight to compute the specific background values SB _{ i } and the signal log values SigLogVal _{ i } . Asymptotically (i.e. classical firstorder asymptotics) speaking it makes no difference which k we choose to construct the rmx estimator and differences only occur if one takes a look at higherorder asymptotics as shown by unpublished results of P. Ruckdeschel. However, to date, there are no finitesample results indicating an optimal choice for k if there is any. The use of three steps is motivated by the observation that in all situations we considered so far, the estimates were stable and did not change very much after the third iteration.
The results in Section 8.7 of Kohl (2005) [4] show that, in the infinitesimal robust setup and for known contamination radius, the optimally robust AL estimators clearly outperform the TuMad estimator for the estimation of normal location and scale with respect to the asymptotic maximum MSE, where the maximum efficiency loss is 54.9%. Moreover, the results of the following MonteCarlo study, which is in the spirit of the Princeton robustness study [14], indicate that this is also true for the rmx estimator in the case of an unknown neighborhood radius and finitesample size. Due to the finitesample size and the shrinkage of the neighborhoods, we use a finitesample correction of the neighborhood radius determined by a large simulation study. The finitesample correction leads to a larger neighborhood radius; i.e., to a more conservative estimation procedure. It can be computed with function finiteSampleCorrection of the R package RobLox[15].
For the simulations we chose a sample size n = 11 as most of the probes sets have this number of probe pairs on HGU133A GeneChips (cf. Table 1) and performed M = 10^{5} MonteCarlo replications. As the ideal model we used \mathcal{N}(0,1) which is no restriction due to equivariance of the location and scale model. As contaminating (gross errors generating) distributions we employed:

\mathcal{N}(0,9)
, t_{3} and Cauchy(0, 1) leading to an increased variance

\mathcal{N}(3,1)
and \mathcal{N}(10,1)causing a positive bias

Dirac measures at 1.51 (D_{1.51}) and 1000 (D_{1000}), which are approximations for the least favorable contaminations (i.e. leading to maximum risk) for the Tukey and the rmx estimator, respectively.
We selected s ∈ {0.01, 0.02, 0.04} as sizes of the gross error models. The results for other contaminating distributions or amounts of gross errors can easily be computed with function AffySimStudy of our R package RobLoxBioC.
Since there is no estimator yielding reliable results if there are 50% or more gross errors, we wanted to admit only random samples with less than 50% contamination. The probability for rejecting a sample is ≤ exp{2n(0.5  s)^{2}} by Theorem 2 of Hoeffding (1963) [16]; i.e., decays exponentially in the sample size n. At n = 11 and s ∈ {0.01, 0.02, 0.04} a replacement of a sample is necessary with probability 4.4 · 10^{10}, 2.7 · 10^{8} and 1.6 · 10^{6}, respectively. Hence, unsurprisingly, there was no single sample that had to be replaced in our MonteCarlo study.
The results in Table 3 show that the rmx location estimate in all situations considered has a smaller empirical MSE than Tukey's biweight. The efficiency loss of Tukey's biweight in nearly all situations is about 1520%.
Next, we present some results demonstrating the accuracy of the two procedures for the HGU95A and HGU133A Latin square datasets. For the computations we also used the rmx 3step estimator for s ∈ [0, 0.05] which is implemented as default in function RobLoxBioC of our R package RobLoxBioC. The results for the MAS 5.0 with Tukey's biweight were determined with function mas5 of Bioconductor package affy[17, 18]. In addition to the availability of different robust estimators, the implementation in RobLoxBioC is more efficient. The normalization using RobLoxBioC on an Intel P9500 (64 bit Linux, 8 GByte RAM) requires about 1 minute in contrast to about 9 minutes for mas5.
In Figure 3 analogously to Figure 2 in Cope et al. (2004) [11], the mean standard deviations (SDs) are plotted against the mean logexpression values for the two datasets. The curves were determined by smoothing the resulting scatterplots, which include SDs and mean log expressions for each gene not spikedin. These plots indicate an improvement of the accuracy of MAS 5.0 when using the rmx estimator instead of Tukey's biweight as the variability in terms of mean SD for the rmx estimator is clearly smaller especially for HGU95A. Some quantiles of the computed SD values are given in Table 4. The results for the log foldchanges observed for nondifferentially expressed genes (null logfc) i.e., genes not spikedin  confirm these results; see Table 4. Overall we expect that using the rmx estimator increases the accuracy of MAS 5.0 by 1020%.
The comparisons of the two robust procedures were performed with the Bioconductor package affycomp[19]. The full assessments of Cope et al. (2004) [11] and Irizarry et al. (2006) [12] can be computed using the R code specified in the file AffymetrixExample.R in the scripts folder of our package RobLoxBioC. The simulation study can be recomputed by the R code given in the file AffymetrixSimStudy.R also provided in the scripts folder.
As the following results indicate, the higher accuracy of rmx estimators increases the reproducibility of gene expression analyses. We analyzed a random subset of the MAQCI study [20] provided by the Bioconductor package MAQCsubsetAFX[21]. Regarding the Affymetrix platform, a total of 120 U133 Plus 2.0 GeneChips have been generated and four different reference RNAs have been used. (A) 100% of Stratagene's Universal Human Reference RNA, (B) 100% of Ambion's Human Brain Reference RNA, (C) 75% of A and 25% of B and (D) 25% of A and 75% of B. Each reference has been repeated five times on six different test sites. The datasets refA,..., refD provided by package MAQCsubsetAFX consist of the data of six randomly chosen U133 Plus 2.0 GeneChips (one for each test site) for each reference RNA. As Figure 4 shows, the assumption of approximate normality is fulfilled. We measured the reproducibility in terms of the Spearman correlation of the normalized data and the Pearson correlation of the log2transformed normalized data. In all cases the correlation was found to be higher for the rmx estimators. The relative increase is 0.61.2% (absolute. 0.0060.011) in the case of Spearman correlation and 1.21.9% (absolute. 0.0110.017) in the case of Pearson correlation.
The results can be recomputed using the R code specified in the file AffymetrixReproducibility.R in the scripts folder of the package RobLoxBioC.
Illumina Data
Since we intend to apply the rmx estimators for normal location and scale to summarize the bead level data, we first checked whether the normal model is appropriate for these data. We use the spikein data investigated in Dunning et al. (2008) [22] consisting of eight customized Mouse6 version 1 BeadChips hybridized with a complex mouse background. Each BeadChip contains six BeadArrays each made up of two strips on the chip surface. In total each of the BeadArrays includes 49283 bead types. The raw bead level values were sharpened and background subtracted [22]. Due to the random positioning of the beads, the number of beads per bead type varies from array to array. In Figure 5 and 6 we have depicted those bead types with 30 to 50 replicates and 15 to 65 replicates, respectively. The results were obtained using function KolmogorovMinDist of the R package RobLoxBioC where the computations took about 9 hours for the spikein dataset. Both figures indicate that the assumption of normal location and scale as the ideal model is more appropriate for the logtransformed bead level data. Hence, we propose to use the rmx 3step estimator for normal location and scale and s ∈ [0, 0.05] in combination with the already mentioned finitesample correction instead of Illumina's method, which is a Hubertype skipped mean and standard deviation [7], to summarize the logtransformed bead level data. The use of three steps and the choice s ∈ [0, 0.05] is driven by the same heuristics as in the Affymetrix case.
In a further step we have performed a simulation study using a very similar setup as in the Affymetrix case. Due to the higher redundancy of the Illumina data, we chose a sample size of 30 instead of 11. Moreover, we replaced the Dirac measure at 1.51 by the Dirac measure at 3 (D_{3}) which is an approximation for the least favorable contamination for Illumina's default method. The results for other contaminations can easily be computed with the function IlluminaSimStudy of the R package RobLoxBioC. As in the Affymetrix setup, we applied the modification that less than 50% of the observations contained in a sample may be contaminated where again no single sample had to be modified. The results in Table 5 show that the two estimators perform similarly with a slight advantage for the rmx estimator. Due to the outlier rejection step included in the Illumina method, it is unsurprising, that it performs especially well if the contaminating distribution puts mass on large values. In contrast, the rmx estimator outperforms the Illumina method in situations where the outliers are less obvious like in the case of t_{3} or \mathcal{N}\left(\text{3},\text{1}\right). Furthermore, looking at the maximum empirical risk for the simultaneous estimation of location and scale Illumina's method shows an efficiency loss of about 15% compared to the rmx estimator. In view of these results it is no surprise that Illumina's method performs best in Figure 2 of Dunning et al. (2008) [22] where outliers at 2^{16} were used and average bias and log2 variance are plotted versus percentage of simulated outliers for several summary methods. Besides that, the approach of Dunning et al. (2008) [22] contains a flaw from a statistical point of view. the original data is contaminated irrespective of the bead type. That is, one grosserror model for the whole dataset was used instead of a grosserror model for each bead type in the dataset. This approach might reflect the way contamination occurs in practice but, already at moderate contamination rates, one obtains many bead types where 50% or more of the bead values are contaminated and consequentially no reliable estimator exists. We postulate that this is the reason why the reported breakdown points are clearly smaller than the "real" breakdown points of the considered estimators (10% trimmed mean and SD: 5% vs. 10%, median and MAD: 30% vs. 50%, Illumina method: 30% vs. 50%).
Next, we report some results representing the accuracy of the two procedures for the spikein data of Dunning et al. (2008) [22]. For the computations we again used the rmx 3step estimator for s ∈ [0, 0.05] which is the default in function RobLoxBioC of the R package RobLoxBioC. The results for Illumina's method were determined with the function createBeadSummaryData of the Bioconductor package beadarray[23]. The computations of the bead summary values take about 100 seconds and about 500 seconds using createBeadSummaryData and RobLoxBioC, respectively. So far, the rmx estimator is implemented in interpreted R code. By switching to compiled code (e.g., C/C++ or FORTRAN) we probably could compete with createBeadSummaryData which is calling C code.
For the comparisons we use the approach of Cope et al. (2004) [11] and Irizarry et al. (2006) [12] as in the case of the Affymetrix data. In a first step we plotted the mean SDs against the mean logexpression values for those genes not spikedin, results are depicted in Figure 7. The plot indicates a slight improvement of the accuracy by using the rmx estimator instead of Illumina's default method as the mean SD is slightly smaller for the rmx estimator. Some quantiles for the SD values are given in Table 6. Secondly, we took a look at the log foldchanges observed for the nondifferentially expressed genes (null logfc). This statistic also confirms the above results; see Table 6. Moreover, we added the results for the other methods implemented in package beadarray which are mean and SD, median and MAD, 5% trimmed mean and SD as well as 5% winsorized mean and SD. The numerical results show that these other methods perform worse than Illumina's default method and the rmx estimator. Overall the rmx estimator performs the best and we expect an increase in accuracy of at least 15% by using the rmx estimator instead of the other methods.
The results mentioned can be recomputed via the R code provided in files IlluminaExample.R and IlluminaSimStudy.R which are included in the scripts folder of our R package RobLoxBioC.
As the following results indicate, the higher accuracy of our rmx estimators is reflected in an increased reproducibility of gene expression analyses. We have again used the spikein data of Dunning et al. (2008) [22] which can be divided into twelve sets each including four technical replicates. For these twelve sets we measured the reproducibility in terms of the Spearman and Pearson correlation of the summarized log2transformed data, overall leading to 72 pairwise comparisons. In 69 (Spearman correlation) and 66 (Pearson correlation) cases respectively, the correlation was higher for the rmx estimators. As before, the differences between the two procedures were found to be small and remained well below 0.5% in all cases.
The results can be recomputed using the R code given in the file IlluminaReproducibility.R in the scripts folder of our package RobLoxBioC.
Conclusions
As the variability of the estimation is clearly reduced as well as the reproducibility is increased when we apply rmx estimators for preprocessing, it is reasonable to assume a higher power for subsequent statistical analyses.
In the case of Illumina data the rmx summarization method can be combined with different preprocessing methods that can be applied to bead summary data, e.g. the variancestabilizing transformation (VST) of Lin et al. (2008) [24].
In the case of Affymetrix data there are several other wellknown normalization methods based on parametric models e.g. the robust multiarray average (RMA [25]) or the variance stabilization and calibration (VSN [26]) which can be used. The RMA procedure is based on a linear additive model where one uses median polish [27] for parameter estimation. A replacement of the median polish by a corresponding rmx polish may further improve the algorithm. In the case of VSN a possible modification could consist of replacing the least trimmed sum of squares (LTS) regression [28] by an rmx estimator for regression [4, 10]. As the above results and the results in Chapters 7 and 8 in Kohl (2005) [4] indicate, these modifications lead to an increased accuracy. At the same time we can retain the high breakdown point of the already available robust estimators by using the kstep construction in combination with bounded rmx ICs [29].
The reported results and the universality of the infinitesimal robustness approach suggest that optimally robust rmx estimators should also be of interest for other bioinformatics applications.
Methods
Infinitesimal Robustness
The approach of HuberCarol (1970) [30], Rieder (1978) [31], Bickel (1981) [32] and Rieder (1980) [33], Rieder (1994) [10] to robust testing and estimation employs shrinking neighborhoods of the parametric model, where the shrinking rate n^{1/2}, as the sample size n →∞, may be deduced in a testing setup [34]. Due to the shrinkage of the neighborhoods and the asymptotics involved this approach to robustness is called infinitesimal. A brief comparison and distinction to the robustness approaches of Huber (1981) [35], Hampel et al. (1986) [3] and Maronna et al. (2006) [36] is given in the Introduction of Kohl et al. (2010) [29].
Denoting by {\mathcal{M}}_{\text{1}}(\mathcal{A}) the set of all probability measures on some measurable space (\Omega ,\mathcal{A}), one considers a smoothly parameterized model
whose parameter space Θ is an open subset of some finitedimensional ℝ ^{k} , and which is dominated. dP_{ θ } = p_{ θ }d μ (θ_{∈} Θ). The smoothness of the model \mathcal{P}, at any fixed θ ∈ Θ is characterized by the requirement of L_{2} differentiability (also called differentiability in quadratic mean); see Section 2.3 of [10]. The ℝ ^{k} valued L_{2} derivative is denoted by {\Lambda}_{\theta}\in {L}_{2}^{k}({P}_{\theta}) and its covariance {\mathcal{I}}_{\theta}={\text{E}}_{\theta}{\Lambda}_{\theta}{{\Lambda}^{\prime}}_{\theta} underP_{ θ } is the Fisher information of \mathcal{P} at θ which is assumed of full rank k. This type of differentiability is for instance implied by continuous differentiability of p_{ θ }and continuity of I_{ θ } with respect to θ and then {\Lambda}_{\theta}=\frac{\partial}{\partial \theta}\mathrm{log}{p}_{\theta} the classical scores; see Lemma A.3 of Hájek (1972) [37].
Given the socalled ideal model\mathcal{P} one defines asymptotically linear (AL) estimators S to be any sequence of estimators S_{ n }: Ω^{n}→ ℝ^{k}such that
for some, necessarily unique, influence curve (IC)ψ_{ θ }∈ Ψ(θ), where
Here we used the stochastic Landaunotation of Pfanzagl (1994) [38], i.e. {\text{op}}_{\theta}^{n}({n}^{0})\to 0 in product {P}_{\theta}^{n} probability as n →∞, and ℐ _{ k } denotes the k\times k identity matrix. For more details we refer to Rieder (1994), Section 4.2 [10].
In infinitesimal robustness, the i.i.d. observations x_{1},..., x_{ n } may follow any law Q in some shrinking neighborhood about P_{ θ } . In this article, we consider the (convex) contamination neighborhood system {\mathcal{U}}_{c}(\theta ) which consists of all contamination neighborhoods, at size 0 ≤ s ≤ 1,
Subsequently, s = s_{ n } = rn^{1/2} for starting radius r ∈ [0, ∞) and n →∞.
Given this setup, the aim is to minimize the asymptotic maximum risk
with continuous loss function ℓ: ℝ ^{k} → [0, ∞). Throughout this article we will use square loss ℓ (z) = z^{2} which leads to the (asymptotic maximum) mean squared error MSE _{ θ } (ψ_{ θ }, r).
To simplify notation, the fixed θ will be dropped from notation henceforth.
The optimally robust ψ*, the unique solution to minimize MSE(ψ, r) among all ψ ∈ Ψ for given radius r, is given in Theorem 5.5.7 of Rieder (1994) [10]: there exist some vector z ∈ ℝ^{k}and matrix A ∈ ℝ^{k × k}, A positive definite, such that
where
and
Conversely, form (12)  (15) suffices for ψ* to be the solution. The minimax solution to more general risks is derived in Ruckdeschel and Rieder (2004) [39].
In applications, the starting radius r for the neighborhoods U_{ c }(θ, rn^{1/2}) is unknown or only known to belong to some interval [r_{lo}, r_{up}) ⊂ [0, ∞). For this situation Rieder et al. (2008) [5] propose to consider the relative MSE of {\psi}_{s}^{\star} at radius r, defined as
and IC {\psi}_{{r}_{0}}^{\star} which minimizes
among all s ∈ [r_{lo}, r_{up}) is called radiusminimax (rmx).
Given the rmx IC the corresponding rmx estimator can then be determined via the onestep construction respectively, an iterated onestep  that is, kstep (k ≥ 1) construction
based on a suitable starting estimate {S}_{n}^{(0)}[29].
The normal location and scale model, i.e. {P}_{\theta}=\mathcal{N}(\mu ,\phantom{\rule{0.1em}{0ex}}{\sigma}^{2})\phantom{\rule{0.1em}{0ex}} with θ = (μ, σ)', μ ∈ ℝ, σ ∈ (0, ∞) forms an L_{2} differentiable exponential family. As starting estimator one can use median and median absolute deviation (MAD) as justified by Kohl (2005), Section 2.3.4 [4]. Since the rmx IC in this model is bounded, the breakdown point of the starting estimator, which is 50% for median and MAD, is inherited to the rmx onestep estimator [29].
References
Affymetrix, Inc: Statistical Algorithms Description Document. Affymetrix, Santa Clara; 2002.
MAQCConsortium, Shi L, Campbell G, Jones WD, Campagne F, Wen Z, Walker SJ, Su Z, Chu TM, Goodsaid FM, Pusztai L, Shaughnessy JDJ, Oberthuer A, Thomas RS, Paules RS, Fielden M, Barlogie B, Chen W, Du P, Fischer M, Furlanello C, Gallas BD, Ge X, Megherbi DB, Symmans WF, Wang MD, Zhang J, Bitter H, Brors B, Bushel PR, Bylesjo M, Chen M, Cheng J, Cheng J, Chou J, Davison TS, Delorenzi M, Deng Y, Devanarayan V, Dix DJ, Dopazo J, Dorff KC, Elloumi F, Fan J, Fan S, Fan X, Fang H, Gonzaludo N, Hess KR, Hong H, Huan J, Irizarry RA, Judson R, Juraeva D, Lababidi S, Lambert CG, Li L, Li Y, Li Z, Lin SM, Liu G, Lobenhofer EK, Luo J, Luo W, McCall MN, Nikolsky Y, Pennello GA, Perkins RG, Philip R, Popovici V, Price ND, Qian F, Scherer A, Shi T, Shi W, Sung J, ThierryMieg D, ThierryMieg J, Thodima V, Trygg J, Vishnuvajjala L, Wang SJ, Wu J, Wu Y, Xie Q, Yousef WA, Zhang L, Zhang X, Zhong S, Zhou Y, Zhu S, Arasappan D, Bao W, Lucas AB, Berthold F, Brennan RJ, Buness A, Catalano JG, Chang C, Chen R, Cheng Y, Cui J, Czika W, Demichelis F, Deng X, Dosymbekov D, Eils R, Feng Y, Fostel J, FulmerSmentek S, Fuscoe JC, Gatto L, Ge W, Goldstein DR, Guo L, Halbert DN, Han J, Harris SC, Hatzis C, Herman D, Huang J, Jensen RV, Jiang R, Johnson CD, Jurman G, Kahlert Y, Khuder SA, Kohl M, Li J, Li L, Li M, Li QZ, Li S, Li Z, Liu J, Liu Y, Liu Z, Meng L, Madera M, MartinezMurillo F, Medina I, Meehan J, Miclaus K, Moffitt RA, Montaner D, Mukherjee P, Mulligan GJ, Neville P, Nikolskaya T, Ning B, Page GP, Parker J, Parry RM, Peng X, Peterson RL, Phan JH, Quanz B, Ren Y, Riccadonna S, Roter AH, Samuelson FW, Schumacher MM, Shambaugh JD, Shi Q, Shippy R, Si S, Smalter A, Sotiriou C, Soukup M, Staedtler F, Steiner G, Stokes TH, Sun Q, Tan PY, Tang R, Tezak Z, Thorn B, Tsyganova M, Turpaz Y, Vega SC, Visintainer R, von Frese J, Wang C, Wang E, Wang J, Wang W, Westermann F, Willey JC, Woods M, Wu S, Xiao N, Xu J, Xu L, Yang L, Zeng X, Zhang J, Zhang L, Zhang M, Zhao C, Puri RK, Scherf U, Tong W, Wolfinger RD: The MicroArray Quality Control (MAQC)II study of common practices for the development and validation of microarraybased predictive models. Nature Biotechnology 2010, in press.
Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA: Robust statistics. The approach based on influence functions. New York: Wiley; 1986.
Kohl M: Numerical Contributions to the Asymptotic Theory of Robustness. PhD thesis. University of Bayreuth; 2005.
Rieder H, Kohl M, Ruckdeschel P: The cost of not knowing the radius. Stat Meth & Appl 2008, 17: 13–40.
Kuhn K, Baker SC, Chudin E, Lieu MH, Oeser S, Bennett H, Rigault P, Barker D, McDaniel TK, Chee MS: A novel, highperformance random array platform for quantitative gene expression profiling. Genome Res 2004, 14: 2347–56.
Hampel FR: The breakdown points of the mean combined with some rejection rules. Technometrics 1985, 27: 95–107.
R Development Core Team:R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2010. [ISBN 3–900051–07–0] [http://www.Rproject.org] [ISBN 3900051070]
Kohl M: RobLoxBioC: Infinitesimally robust estimators for preprocessing omics data. R Foundation for Statistical Computing, Vienna, Austria; 2010. [R package version 0.7.1] [R package version 0.7.1]
Rieder H: Robust Asymptotic Statistics. New York: Springer; 1994.
Cope LM, Irizarry RA, Jaffee HA, Wu Z, Speed TP: A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 2004, 20: 323–31.
Irizarry RA, Wu Z, Jaffee HA: Comparison of Affymetrix GeneChip expression measures. Bioinformatics 2006, 22: 789–94.
Tukey JW: A survey of sampling from contaminated distributions. In Contributions to Probability and Statistics I. Edited by: Olkin I. Stanford: Stanford University Press; 1960:448–485.
Andrews DF, Bickel PJ, Hampel FR, Huber PJ, Rogers WH, Tukey JW: Robust estimates of location: survey and advances. Princeton, Princeton University Press; 1972.
Kohl M: RobLox: Optimally robust influence curves and estimators for location and scale. R Foundation for Statistical Computing, Vienna, Austria; 2009. [R package version 0.7] [R package version 0.7]
Hoeffding W: Probability inequalities for sums of bounded random variables. J Am Stat Assoc 1963, 58: 13–30.
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 2004, 5: R80.
Gautier L, Cope L, Bolstad BM, Irizarry RA: affy  analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20: 307–15.
Irizarry RA, Wu Z: affycomp: Graphics Toolbox for Assessment of Affymetrix Expression Measures. 2009. [R package version 1.19.4] [R package version 1.19.4]
Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, ThierryMieg J, Wang C, Wilson M, Wolber PK, Zhang L, Amur S, Bao W, Barbacioru CC, Lucas AB, Bertholet V, Boysen C, Bromley B, Brown D, Brunner A, Canales R, Cao XM, Cebula TA, Chen JJ, Cheng J, Chu TM, Chudin E, Corson J, Corton JC, Croner LJ, Davies C, Davison TS, Delenstarr G, Deng X, Dorris D, Eklund AC, Fan XH, Fang H, FulmerSmentek S, Fuscoe JC, Gallagher K, Ge W, Guo L, Guo X, Hager J, Haje PK, Han J, Han T, Harbottle HC, Harris SC, Hatchwell E, Hauser CA, Hester S, Hong H, Hurban P, Jackson SA, Ji H, Knight CR, Kuo WP, LeClerc JE, Levy S, Li QZ, Liu C, Liu Y, Lombardi MJ, Ma Y, Magnuson SR, Maqsodi B, McDaniel T, Mei N, Myklebost O, Ning B, Novoradovskaya N, Orr MS, Osborn TW, Papallo A, Patterson TA, Perkins RG, Peters EH, Peterson R, Philips KL, Pine PS, Pusztai L, Qian F, Ren H, Rosen M, Rosenzweig BA, Samaha RR, Schena M, Schroth GP, Shchegrova S, Smith DD, Staedtler F, Su Z, Sun H, Szallasi Z, Tezak Z, ThierryMieg D, Thompson KL, Tikhonova I, Turpaz Y, Vallanat B, Van C, Walker SJ, Wang SJ, Wang Y, Wolfinger R, Wong A, Wu J, Xiao C, Xie Q, Xu J, Yang W, Zhang L, Zhong S, Zong Y, Slikker WJ: The MicroArray Quality Control (MAQC) project shows inter and intraplatform reproducibility of gene expression measurements. Nature Biotechnology 2006, 24(9):1151–61.
Gatto L:MAQCsubsetAFX: MAQC data subset for the Affymetrix platform. 2010. [R package version 1.0.3] [http://www.slashhome.be/MAQCsubsetAFX.php] [R package version 1.0.3]
Dunning MJ, BarbosaMorais NL, Lynch AG, Tavaré S, Ritchie ME: Statistical issues in the analysis of Illumina data. BMC Bioinformatics 2008, 9: 85.
Dunning MJ, Smith ML, Ritchie ME, Tavare S: beadarray: R classes and methods for Illumina beadbased data. Bioinformatics 2007, 23: 2183–4.
Lin SM, Du P, Huber W, A KW: Modelbased variancestabilizing transformation for Illumina microarray data. Nucleic Acids Res 2008, 36(2):e11.
Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 2003, 4: 249–64.
Huber W, von Heydebreck A, Sueltmann H, Poustka A, Vingron M: Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression. Bioinformatics 2002, 18: 96–104.
Tukey JW: Exploratory Data Analysis. Reading, Mass.: AddisonWesley Publishing Company; 1977.
Rousseeuw PJ, Leroy AM: Robust Regression and Outlier Detection. New York: John Wiley and Sons; 1987.
Kohl M, Ruckdeschel P, Rieder H: Infinitesimally Robust Estimation in General Smoothly Parametrized Models. Stat Meth & Appl 2010, 19: 333–354.
HuberCarol C: Étude asymptotique de tests robustes. PhD thesis. ETH Zürich; 1970.
Rieder H: A robust asymptotic testing model. Ann Stat 1978, 6: 1080–94.
Bickel PJ: Quelques aspects de la statistique robuste. New York: Springer; 1981. [Ecole d'ete de probabilites de SaintFlour IX1979, Lect. Notes Math. 876] [Ecole d'ete de probabilites de SaintFlour IX1979, Lect. Notes Math. 876]
Rieder H: Estimates derived from robust tests. Ann Stat 1980, 8: 106–115.
Ruckdeschel P:A Motivation for 1/\sqrt{n}ShrinkingNeighborhoods. Metrika 2006, 63: 295–307.
Huber PJ: Robust statistics. New York: Wiley; 1981.
Maronna RA, Martin RD, Yohai VJ: Robust Statistics: Theory and Methods. New York: Wiley; 2006.
Hájek J: Local asymptotic minimax and admissibility in estimation. Proc. 6th Berkeley Sympos. math. Statist. Probab., Univ. Calif. 1970 1972, 1: 175–194.
Pfanzagl J: Parametric statistical theory. Berlin: De Gruyter Textbook; 1994.
Ruckdeschel P, Rieder H: Optimal influence curves for general loss functions. Stat Decis 2004, 22: 201–223.
Acknowledgements
We thank Dr. David Enot for valuable comments and careful proofreading of the manuscript. We also thank two referees and the editor for helpful comments. The authors have no conflicts of interest.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
MK implemented the algorithms, performed the computations and contributed to the manuscript. HPD contributed to the manuscript writing and discussion. Both authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Kohl, M., Deigner, HP. Preprocessing of gene expression data by optimally robust estimators. BMC Bioinformatics 11, 583 (2010). https://doi.org/10.1186/1471210511583
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1471210511583