Preprocessing of gene expression data by optimally robust estimators

Background The preprocessing of gene expression data obtained from several platforms routinely includes the aggregation of multiple raw signal intensities to one expression value. Examples are the computation of a single expression measure based on the perfect match (PM) and mismatch (MM) probes for the Affymetrix technology, the summarization of bead level values to bead summary values for the Illumina technology or the aggregation of replicated measurements in the case of other technologies including real-time quantitative polymerase chain reaction (RT-qPCR) platforms. The summarization of technical replicates is also performed in other "-omics" disciplines like proteomics or metabolomics. Preprocessing methods like MAS 5.0, Illumina's default summarization method, RMA, or VSN show that the use of robust estimators is widely accepted in gene expression analysis. However, the selection of robust methods seems to be mainly driven by their high breakdown point and not by efficiency. Results We describe how optimally robust radius-minimax (rmx) estimators, i.e. estimators that minimize an asymptotic maximum risk on shrinking neighborhoods about an ideal model, can be used for the aggregation of multiple raw signal intensities to one expression value for Affymetrix and Illumina data. With regard to the Affymetrix data, we have implemented an algorithm which is a variant of MAS 5.0. Using datasets from the literature and Monte-Carlo simulations we provide some reasoning for assuming approximate log-normal distributions of the raw signal intensities by means of the Kolmogorov distance, at least for the discussed datasets, and compare the results of our preprocessing algorithms with the results of Affymetrix's MAS 5.0 and Illumina's default method. The numerical results indicate that when using rmx estimators an accuracy improvement of about 10-20% is obtained compared to Affymetrix's MAS 5.0 and about 1-5% compared to Illumina's default method. The improvement is also visible in the analysis of technical replicates where the reproducibility of the values (in terms of Pearson and Spearman correlation) is increased for all Affymetrix and almost all Illumina examples considered. Our algorithms are implemented in the R package named RobLoxBioC which is publicly available via CRAN, The Comprehensive R Archive Network (http://cran.r-project.org/web/packages/RobLoxBioC/). Conclusions Optimally robust rmx estimators have a high breakdown point and are computationally feasible. They can lead to a considerable gain in efficiency for well-established bioinformatics procedures and thus, can increase the reproducibility and power of subsequent statistical analysis.


Background
Affymetrix microarrays consist of a number of probe cells, each probe cell containing a unique probe. There are two types of probes, perfect match (PM) and mismatch (MM) occurring as pairs. The sequences for PM and MM are almost identical. The difference consists of a single base change in the middle of the PM probe sequence to the Watson-Crick complement for the MM probe sequence. A series of such probe pairs forms a probe set which represents a transcript [1].
Hence, it is part of the preprocessing of Affymetrix arrays to compute a single expression value for the different probe sets. One of the most popular algorithms for this purpose is MAS 5.0, developed by Affymetrix [1]. It is the algorithm that, for instance, was most frequently applied within the framework of phase II of the microarray quality control (MAQC) project [2]. MAS with default values τ 1 = 0.03 (contrast tau) and τ 2 = 10 (scale tau). The specific background (SB i ) is determined using Tukey's biweight one-step estimator (T bi ) where Affymetrix's version of Tukey's biweight disregards all observations outside of median ±5 median absolute deviation (MAD) (i.e. unstandardized MAD) leading to a very robust estimator: Then, the signal log value for each probe set is obtained via with probe value PV i, j = log 2 (V i, j ) and V i, j = max {PM i, j -IM i, j , δ}. The constant δ with default value δ = 2 -20 is introduced for numerical stability.
However, as table three in Section 2.6 of Hampel et al. (1986) [3] suggests there are more efficient robust estimators than Tukey's biweight, e.g. the Tanh-estimator. In addition, Table eight.five in Section 8.7 of Kohl (2005) [4] shows that in the infinitesimal robust setup for normal location and scale one may lose up to 54.9% asymptotic efficiency if one uses Tukey's biweight in combination with MAD (TuMad) instead of the optimally robust estimator. Consequently, we implemented an algorithm along the l ines of MAS 5.0 where we substituted the Tukey biweight one-step location estimator by an in infinitesimally robust radius-minimax (rmx) kstep (k ≥ 1) location and scale estimator [5].
Illumina BeadArrays are based on 3-micron silica beads that are randomly positioned on fiber optic bundles or planar silica slides. Each bead is covered with hundreds of thousands of copies of a specific oligonucleotide sequence assigning the bead to a certain bead type, where each bead type is represented in high redundancy with more than 30 replicates on average. The intensity values of the single beads are called bead level data. It is part of the preprocessing to aggregate the bead level data to so-called bead summary data leading to a single expression value for each bead type. In this paper we only consider data from single-channel gene expression BeadArrays. Besides that, BeadArrays can also be used for SNP genotyping, methylation profiling, and copy number variation analysis [6].
In Illumina's proprietary BeadStudio Software an outlier rejection method (median ±3 × MAD) combined with mean and standard deviation is used to aggregate the bead level data to bead summary data. The MAD in this setup is standardized by 1.4826 to be consistent at the normal model. That is, the location part of Illumina's estimator is a Huber-type skipped mean and is very close to estimator X42 of Hampel (1985) [7], which uses 3.03 × MAD. Quoting Hampel et al. (1986) [3], p. 69, the X42 estimator is "frequently quite reasonable, according to present preliminary knowledge". However, Monte-Carlo studies show that there may be an efficiency loss of 5-25% compared to more sophisticated robust estimators [7]. Hence, we implemented an algorithm which uses rmx k-step estimators for summarizing bead level data. The goal of this paper is to demonstrate that the application of optimally robust rmx estimators can lead to a considerable efficiency gain and increased reproducibility for well-established preprocessing algorithms. First, we argue for normal location and scale as the ideal model, at least for the log-transformed values of some publicly available Affymetrix and Illumina data sets, using minimum Kolmogorov distance. Second, we use Monte-Carlo simulations and those publicly available datasets to compare MAS 5.0 and Illumina's default method with our modified algorithms based on rmx estimators. The proposed preprocessing algorithms are implemented in the R package RobLoxBioC [8,9]. A brief overview of infinitesimal robustness is given in the Methods section at the end of the manuscript.

Affymetrix Data
We replace Tukey's biweight which only provides a location estimate by an rmx estimator for normal location and scale  ( , )   2 . The Tukey biweight and the rmx estimator are constructed as one-step and k-step estimates based on median and MAD, respectively. As both estimators are so called asymptotically linear (AL) estimators, a straightforward way to compare these estimators is to observe the corresponding influence curves/functions (ICs) displayed in Figure 1.
In both cases, the location part of the IC is redescending. In contrast to Tukey's biweight rejecting observations more than about 3.35 (standardized) MAD times away from the median, the rmx estimator only downweights large observations. Moreover, the plot shows that Tukey's biweight is mostly affected by undiscoverable (very likely to occur in the normal model) gross errors located around 1.51 (standardized) MAD-times away from the median where the IC of Tukey's biweight (and the MAD) is maximal, whereas the rmx estimator is mostly deflected by large observations where the Euclidean length of the location and scale IC is maximal.
However, for applying the rmx estimator for normal location and scale, one first should check if it is plausible to assume normal location and scale as the ideal model for the M i, j := log 2 (PM i, j /MM i, j ) values. As we can not test for approximate normality (there is no such statistical test), we use the minimum Kolmogorov distance for this purpose. That is, we minimize, in μ ℝ and s (0, ∞), where Φ μ,s is the cumulative distribution function of where M i,(1), ..., M i,(ni) is the increasingly sorted sample. In particular, the minimal possible Kolmogorov distance for sample size n is (2n) -1 .
Of course, it would be possible to use some other distance (e.g. Cramér von Mises) or the test statistic of some test for normality for this purpose. However, we decided to use the Kolmogorov distance since there is a certain connection between Kolmogorov neighborhoods and the gross-error model in infinitesimal robustness (see Rieder (1994), Lemma 4.2.8 and Proposition 5.3.3 [10]) and the Kolmogorov distance can be computed efficiently via equation (5). Nevertheless, the computations take more than 100 minutes for the HGU95A dataset and more than 130 minutes for the HGU133A dataset, which consist of 59 and 42 GeneChips, respectively, using function KolmogorovMinDist of our   [12]. Table 1 shows the number of probe sets per number of probe level pairs for the HGU95A and HGU133A GeneChips. Figure 2 displays the minimum Kolmogorov distances for the HGU95A and HGU133A Latin square datasets as well as for normal random samples (50000 Monte-Carlo replications for each sample size) where we selected only those probe level pairs with a considerable number of probe sets. In Table 2 we recorded the differences of the medians of the minimum Kolmogorov distances between the Latin square datasets and corresponding normal random samples. The results for 95% and 99% quantiles are very similar. Based on these results it is very reasonable to assume normal location and scale as the ideal model for M i, j expecting only minor deviations from normality.
To get a rough estimate of the corresponding size of the contamination neighborhood (i.e. Tukey's grosserror model [13]), which is required for selecting an   appropriate rmx estimator, we use the following heuristics: for Kolmogorov (U k ), total variation (U v ) and contamination neighborhoods (U c ) of size s (0,1) it holds In addition, at least in the one-dimensional case and under symmetry, the optimally robust ICs for U c (2s) and U v (s) coincide. Moreover, if the optimally robust IC is monotone, then it is also the solution for U k (s) [10]. Based on these considerations we multiply the median difference between the observed and the simulated Kolmogorov distance by two, leading us to a neighborhood size s [0, 0.05].
We use the corresponding rmx 3-step estimator instead of Tukey's biweight to compute the specific background values SB i and the signal log values SigLog-Val i . Asymptotically (i.e. classical first-order asymptotics) speaking it makes no difference which k we choose to construct the rmx estimator and differences only occur if one takes a look at higher-order asymptotics as shown by unpublished results of P. Ruckdeschel. However, to date, there are no finite-sample results indicating an optimal choice for k if there is any. The use of three steps is motivated by the observation that in all situations we considered so far, the estimates were stable and did not change very much after the third iteration.
The results in Section 8.7 of Kohl (2005) [4] show that, in the infinitesimal robust setup and for known contamination radius, the optimally robust AL estimators clearly outperform the TuMad estimator for the estimation of normal location and scale with respect to the asymptotic maximum MSE, where the maximum efficiency loss is 54.9%. Moreover, the results of the following Monte-Carlo study, which is in the spirit of the Princeton robustness study [14], indicate that this is also true for the rmx estimator in the case of an unknown neighborhood radius and finite-sample size. Due to the finite-sample size and the shrinkage of the neighborhoods, we use a finite-sample correction of the neighborhood radius determined by a large simulation study. The finite-sample correction leads to a larger neighborhood radius; i.e., to a more conservative estimation procedure. It can be computed with function finiteSampleCorrection of the R package RobLox [15].
For the simulations we chose a sample size n = 11 as most of the probes sets have this number of probe pairs on HGU133A GeneChips (cf. Table 1) and performed M = 10 5 Monte-Carlo replications. As the ideal model we used  ( , ) 0 1 which is no restriction due to equivariance of the location and scale model. As contaminating (gross errors generating) distributions we employed: •  ( , ) 0 9 , t 3 and Cauchy(0, 1) leading to an increased variance •  ( , ) 3 1 and  ( , ) 10 1 causing a positive bias • Dirac measures at 1.51 (D 1.51 ) and 1000 (D 1000 ), which are approximations for the least favorable contaminations (i.e. leading to maximum risk) for the Tukey and the rmx estimator, respectively.
We selected s {0.01, 0.02, 0.04} as sizes of the gross error models. The results for other contaminating distributions or amounts of gross errors can easily be computed with function AffySimStudy of our R package RobLoxBioC.
Since there is no estimator yielding reliable results if there are 50% or more gross errors, we wanted to admit only random samples with less than 50% contamination. The probability for rejecting a sample is ≤ exp{-2n(0.5s) 2 } by Theorem 2 of Hoeffding (1963) [16]; i.e., decays exponentially in the sample size n. At n = 11 and s {0.01, 0.02, 0.04} a replacement of a sample is necessary with probability 4.4 · 10 -10 , 2.7 · 10 -8 and 1.6 · 10 -6 , respectively. Hence, unsurprisingly, there was no single sample that had to be replaced in our Monte-Carlo study.
The results in Table 3 show that the rmx location estimate in all situations considered has a smaller empirical MSE than Tukey's biweight. The efficiency loss of Tukey's biweight in nearly all situations is about 15-20%.
Next, we present some results demonstrating the accuracy of the two procedures for the HGU95A and HGU133A Latin square datasets. For the computations we also used the rmx 3-step estimator for s [0, 0.05] which is implemented as default in function RobLox-BioC of our R package RobLoxBioC. The results for the MAS 5.0 with Tukey's biweight were determined with function mas5 of Bioconductor package affy [17,18]. In addition to the availability of different robust estimators, the implementation in RobLoxBioC is more efficient. The normalization using RobLoxBioC on an Intel P9500 (64 bit Linux, 8 GByte RAM) requires about 1 minute in contrast to about 9 minutes for mas5.
In Figure 3 analogously to Figure 2 in Cope et al. (2004) [11], the mean standard deviations (SDs) are plotted against the mean log-expression values for the two datasets. The curves were determined by smoothing the resulting scatterplots, which include SDs and mean log expressions for each gene not spiked-in. These plots indicate an improvement of the accuracy of MAS 5.0 when using the rmx estimator instead of Tukey's biweight as the variability in terms of mean SD for the rmx estimator is clearly smaller especially for HGU95A. Some quantiles of the computed SD values are given in Table 4. The results for the log fold-changes observed for non-differentially expressed genes (null log-fc) -i.e., genes not spiked-in -confirm these results; see Table 4. Overall we expect that using the rmx estimator increases the accuracy of MAS 5.0 by 10-20%.
The comparisons of the two robust procedures were performed with the Bioconductor package affycomp [19]. The full assessments of Cope et al. (2004) [11] and Irizarry et al. (2006) [12] can be computed using the R code specified in the file AffymetrixExample.R in the scripts folder of our package RobLoxBioC. The simulation study can be recomputed by the R code given in the file AffymetrixSimStudy.R also provided in the scripts folder.
As the following results indicate, the higher accuracy of rmx estimators increases the reproducibility of gene expression analyses. We analyzed a random subset of the MAQC-I study [20] provided by the Bioconductor package MAQCsubsetAFX [21]. Regarding the  Figure 4 shows, the assumption of approximate normality is fulfilled. We measured the reproducibility in terms of the Spearman correlation of the normalized data and the Pearson correlation of the log2-transformed normalized data. In all cases the correlation was found to be higher for the rmx estimators. The relative increase is 0.6-1.2% (absolute. 0.006-0.011) in the case of Spearman correlation and 1.2-1.9% (absolute. 0.011-0.017) in the case of Pearson correlation. The results can be recomputed using the R code specified in the file AffymetrixReproducibility.R in the scripts folder of the package RobLoxBioC.

Illumina Data
Since we intend to apply the rmx estimators for normal location and scale to summarize the bead level data, we first checked whether the normal model is appropriate  for these data. We use the spike-in data investigated in Dunning et al. (2008) [22] consisting of eight customized Mouse-6 version 1 BeadChips hybridized with a complex mouse background. Each BeadChip contains six BeadArrays each made up of two strips on the chip surface. In total each of the BeadArrays includes 49283 bead types. The raw bead level values were sharpened and background subtracted [22]. Due to the random positioning of the beads, the number of beads per bead type varies from array to array. In Figure 5 [7], to summarize the log-transformed bead level data. The use of three steps and the choice s [0, 0.05] is driven by the same heuristics as in the Affymetrix case.
In a further step we have performed a simulation study using a very similar setup as in the Affymetrix case. Due to the higher redundancy of the Illumina data, we chose a sample size of 30 instead of 11. Moreover, we replaced the Dirac measure at 1.51 by the Dirac measure at 3 (D 3 ) which is an approximation for the least favorable contamination for Illumina's default method. The results for other contaminations can easily be computed with the function IlluminaSimStudy of the R package RobLoxBioC. As in the Affymetrix setup, we applied the modification that less than 50% of the observations contained in a sample may be contaminated where again no single sample had to be modified. The results in Table 5 show that the two estimators perform similarly with a slight advantage for the rmx estimator. Due to the outlier rejection step included in the Illumina method, it is unsurprising, that it performs especially well if the contaminating distribution puts   [22] where outliers at 2 16 were used and average bias and log2 variance are plotted versus percentage of simulated outliers for several summary methods. Besides that, the approach of Dunning et al. (2008) [22] contains a flaw from a statistical point of view. the original data is contaminated irrespective of the bead type. That is, one gross-error model for the whole dataset was used instead of a gross-error model for each bead type in the dataset. This approach might reflect the way contamination occurs in practice but, already at moderate contamination rates, one obtains many bead types where 50% or more of the bead values are contaminated and consequentially no reliable estimator exists. We postulate that this is the reason why the reported breakdown points are clearly smaller than the "real" breakdown points of the considered estimators (10% trimmed mean and SD: 5% vs. 10%, median and MAD: 30% vs. 50%, Illumina method: 30% vs. 50%). Next, we report some results representing the accuracy of the two procedures for the spike-in data of Dunning et al. (2008) [22]. For the computations we again used the rmx 3-step estimator for s [0, 0.05] which is the default in function RobLoxBioC of the R package RobLoxBioC. The results for Illumina's method were determined with the function createBeadSummary-Data of the Bioconductor package beadarray [23]. The computations of the bead summary values take about 100 seconds and about 500 seconds using cre-ateBeadSummaryData and RobLoxBioC, respectively. So far, the rmx estimator is implemented in interpreted R code. By switching to compiled code (e.g., C/C++ or FORTRAN) we probably could compete with createBeadSummaryData which is calling C code.
For the comparisons we use the approach of  the case of the Affymetrix data. In a first step we plotted the mean SDs against the mean log-expression values for those genes not spiked-in, results are depicted in Figure 7. The plot indicates a slight improvement of the accuracy by using the rmx estimator instead of Illumina's default method as the mean SD is slightly smaller for the rmx estimator. Some quantiles for the SD values are given in Table 6. Secondly, we took a look at the log fold-changes observed for the non-differentially expressed genes (null log-fc). This statistic also confirms the above results; see Table 6. Moreover, we added the results for the other methods implemented in package beadarray which are mean and SD, median and MAD, 5% trimmed mean and SD as well as 5% winsorized mean and SD. The numerical results show that these other methods perform worse than Illumina's default method and the rmx estimator. Overall the rmx estimator performs the best and we expect an increase in accuracy of at least 1-5% by using the rmx estimator instead of the other methods.
The results mentioned can be recomputed via the R code provided in files IlluminaExample.R and IlluminaSimStudy.R which are included in the scripts folder of our R package RobLoxBioC.
As the following results indicate, the higher accuracy of our rmx estimators is reflected in an increased reproducibility of gene expression analyses. We have again used the spike-in data of Dunning et al. (2008) [22] which can be divided into twelve sets each including four technical replicates. For these twelve sets we measured the reproducibility in terms of the Spearman and Pearson correlation of the summarized log2-transformed data, overall leading to 72 pairwise comparisons. In 69 (Spearman correlation) and 66 (Pearson correlation) cases respectively, the correlation was higher for the rmx estimators. As before, the differences between the two procedures were found to be small and remained well below 0.5% in all cases.
The results can be recomputed using the R code given in the file IlluminaReproducibility.R in the scripts folder of our package RobLoxBioC.

Conclusions
As the variability of the estimation is clearly reduced as well as the reproducibility is increased when we apply rmx estimators for preprocessing, it is reasonable to assume a higher power for subsequent statistical analyses.
In the case of Illumina data the rmx summarization method can be combined with different preprocessing methods that can be applied to bead summary data, 20 [24].
In the case of Affymetrix data there are several other well-known normalization methods based on parametric models e.g. the robust multi-array average (RMA [25]) or the variance stabilization and calibration (VSN [26]) which can be used. The RMA procedure is based on a linear additive model where one uses median polish [27] for parameter estimation. A replacement of the median polish by a corresponding rmx polish may further improve the algorithm. In the case of VSN a possible modification could consist of replacing the least trimmed sum of squares (LTS) regression [28] by an rmx estimator for regression [4,10]. As the above results and the results in Chapters 7 and 8 in Kohl (2005) [4] indicate, these modifications lead to an increased accuracy. At the same time we can retain the high breakdown point of the already available robust estimators by using the k-step construction in combination with bounded rmx ICs [29].
The reported results and the universality of the infinitesimal robustness approach suggest that optimally robust rmx estimators should also be of interest for other bioinformatics applications.

Infinitesimal Robustness
The approach of Huber-Carol (1970) [30], Rieder (1978) [31], Bickel (1981) [32] and Rieder (1980) [33], Rieder (1994) [10] to robust testing and estimation employs shrinking neighborhoods of the parametric model, where the shrinking rate n -1/2 , as the sample size n ∞, may be deduced in a testing setup [34]. Due to the shrinkage of the neighborhoods and the asymptotics involved this approach to robustness is called infinitesimal. A brief comparison and distinction to the robustness approaches of Huber (1981) whose parameter space Θ is an open subset of some finite-dimensional ℝ k , and which is dominated. dP θ = p θ d μ (θ Θ). The smoothness of the model  , at any fixed θ Θ is characterized by the requirement of L 2 differentiability (also called differentiability in quadratic mean); see Section 2.3 of [10]. The ℝ k -valued L 2 derivative is denoted by Λ   ∈ L P k 2 ( ) and its covariance   [11] and [12] (smaller is better) for the rmx estimator for s [0, 0.05], Illumina's default method as well as mean and SD (mean), median and MAD (median), 5% trimmed mean and SD (5% trim) and 5% winsorized mean and SD (5% winsorize).
underP θ is the Fisher information of  at θ which is assumed of full rank k. This type of differentiability is for instance implied by continuous differentiability of p θ and continuity of I θ with respect to θ and then Λ    = ∂ ∂ log p the classical scores; see Lemma A.3 of Hájek (1972) [37]. Given the so-called ideal model  one defines asymptotically linear (AL) estimators S to be any sequence of estimators S n : Ω n ℝ k such that for some, necessarily unique, influence curve (IC)ψ θ Ψ(θ), where Here we used the stochastic Landau-notation of Pfanzagl (1994) [38], i.e. op  n n ( ) 0 0 → in product P n  probability as n ∞, and ℐ k denotes the k × k identity matrix. For more details we refer to Rieder (1994), Section 4.2 [10].
In infinitesimal robustness, the i.i.d. observations x 1 ,..., x n may follow any law Q in some shrinking neighborhood about P θ . In this article, we consider the (convex) contamination neighborhood system  c ( )  which consists of all contamination neighborhoods, at size 0 ≤ s ≤ 1, among all s [r lo , r up ) is called radius-minimax (rmx). Given the rmx IC the corresponding rmx estimator can then be determined via the one-step construction respectively, an iterated one-step -that is, k-step (k ≥ 1) -construction based on a suitable starting estimate S n ( ) 0 [29].
The normal location and scale model, i.e. P    =  ( , ) 2 with θ = (μ, s)', μ ℝ, s (0, ∞) forms an L 2 differentiable exponential family. As starting estimator one can use median and median absolute deviation (MAD) as justified by Kohl (2005), Section 2.3.4 [4]. Since the rmx IC in this model is bounded, the breakdown point of the starting estimator, which is 50% for median and MAD, is inherited to the rmx one-step estimator [29].