
Preprocessing of gene expression data by optimally robust estimators

Abstract

Background

The preprocessing of gene expression data obtained from several platforms routinely includes the aggregation of multiple raw signal intensities to one expression value. Examples are the computation of a single expression measure based on the perfect match (PM) and mismatch (MM) probes for the Affymetrix technology, the summarization of bead level values to bead summary values for the Illumina technology or the aggregation of replicated measurements in the case of other technologies including real-time quantitative polymerase chain reaction (RT-qPCR) platforms. The summarization of technical replicates is also performed in other "-omics" disciplines like proteomics or metabolomics.

Preprocessing methods like MAS 5.0, Illumina's default summarization method, RMA, or VSN show that the use of robust estimators is widely accepted in gene expression analysis. However, the selection of robust methods seems to be mainly driven by their high breakdown point and not by efficiency.

Results

We describe how optimally robust radius-minimax (rmx) estimators, i.e. estimators that minimize an asymptotic maximum risk on shrinking neighborhoods about an ideal model, can be used for the aggregation of multiple raw signal intensities to one expression value for Affymetrix and Illumina data. With regard to the Affymetrix data, we have implemented an algorithm which is a variant of MAS 5.0.

Using datasets from the literature and Monte-Carlo simulations, we provide some reasoning, by means of the Kolmogorov distance, for assuming approximately log-normal distributions of the raw signal intensities, at least for the datasets discussed, and we compare the results of our preprocessing algorithms with those of Affymetrix's MAS 5.0 and Illumina's default method.

The numerical results indicate that when using rmx estimators an accuracy improvement of about 10-20% is obtained compared to Affymetrix's MAS 5.0 and about 1-5% compared to Illumina's default method. The improvement is also visible in the analysis of technical replicates where the reproducibility of the values (in terms of Pearson and Spearman correlation) is increased for all Affymetrix and almost all Illumina examples considered. Our algorithms are implemented in the R package named RobLoxBioC which is publicly available via CRAN, The Comprehensive R Archive Network (http://cran.r-project.org/web/packages/RobLoxBioC/).

Conclusions

Optimally robust rmx estimators have a high breakdown point and are computationally feasible. They can lead to a considerable gain in efficiency for well-established bioinformatics procedures and thus, can increase the reproducibility and power of subsequent statistical analysis.

Background

Affymetrix microarrays consist of a number of probe cells, each probe cell containing a unique probe. There are two types of probes, perfect match (PM) and mismatch (MM), which occur in pairs. The sequences of PM and MM are almost identical: they differ only by a single base in the middle of the probe sequence, which is changed to its Watson-Crick complement for the MM probe. A series of such probe pairs forms a probe set which represents a transcript [1].

Hence, it is part of the preprocessing of Affymetrix arrays to compute a single expression value for the different probe sets. One of the most popular algorithms for this purpose is MAS 5.0, developed by Affymetrix [1]. It is the algorithm that, for instance, was most frequently applied within the framework of phase II of the microarray quality control (MAQC) project [2].

MAS 5.0 uses PM and Ideal Match (IM) to compute the expression values where, for probe set i and probe pair j,

IM_{i,j} = \begin{cases} MM_{i,j}, & MM_{i,j} < PM_{i,j} \\ PM_{i,j}\,/\,2^{SB_i}, & MM_{i,j} \ge PM_{i,j} \text{ and } SB_i > \tau_1 \\ PM_{i,j}\,/\,2^{\tau_1 \tau_2/(\tau_1 + \tau_2 - SB_i)}, & MM_{i,j} \ge PM_{i,j} \text{ and } SB_i \le \tau_1 \end{cases} \quad (1)

with default values τ_1 = 0.03 (contrast tau) and τ_2 = 10 (scale tau). The specific background SB_i is determined using Tukey's biweight one-step estimator (T_bi), where Affymetrix's version of Tukey's biweight disregards all observations outside of median ± 5 MAD (i.e. the unstandardized median absolute deviation), leading to a very robust estimator:

SB_i = T_{bi}\big(\log_2(PM_{i,j}/MM_{i,j}),\; j = 1, \ldots, n_i\big) \quad (2)

Then, the signal log value for each probe set is obtained via

SigLogVal_i = T_{bi}\big(PV_{i,j},\; j = 1, \ldots, n_i\big) \quad (3)

with probe value PV_{i,j} = log_2(V_{i,j}) and V_{i,j} = max{PM_{i,j} - IM_{i,j}, δ}. The constant δ with default value δ = 2^{-20} is introduced for numerical stability.
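The following minimal R sketch carries out the computations (1)-(3) for a single probe set. It is an illustration only, not Affymetrix's reference implementation; in particular, tbi_onestep is a simplified stand-in for Affymetrix's one-step Tukey biweight and all helper names are ours.

```r
## Minimal sketch of the MAS 5.0 summarization for one probe set, following
## equations (1)-(3); illustrative only, not Affymetrix's reference code.
## 'pm' and 'mm' are numeric vectors with the PM and MM intensities.

tbi_onestep <- function(x, c = 5, eps = 1e-4) {
  ## simplified one-step Tukey biweight: observations outside of
  ## median +/- c * (unstandardized) MAD receive weight zero
  m <- median(x)
  s <- median(abs(x - m))
  u <- (x - m) / (c * s + eps)
  w <- ifelse(abs(u) < 1, (1 - u^2)^2, 0)
  sum(w * x) / sum(w)
}

mas5_signal <- function(pm, mm, tau1 = 0.03, tau2 = 10, delta = 2^-20) {
  SB <- tbi_onestep(log2(pm / mm))                     # equation (2)
  IM <- ifelse(mm < pm, mm,                            # equation (1)
          ifelse(SB > tau1,
                 pm / 2^SB,
                 pm / 2^(tau1 * tau2 / (tau1 + tau2 - SB))))
  PV <- log2(pmax(pm - IM, delta))                     # probe values
  tbi_onestep(PV)                                      # equation (3): SigLogVal
}
```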

However, as Table 3 in Section 2.6 of Hampel et al. (1986) [3] suggests, there are more efficient robust estimators than Tukey's biweight, e.g. the Tanh-estimator. In addition, Table 8.5 in Section 8.7 of Kohl (2005) [4] shows that in the infinitesimal robust setup for normal location and scale one may lose up to 54.9% asymptotic efficiency if one uses Tukey's biweight in combination with the MAD (TuMad) instead of the optimally robust estimator. Consequently, we implemented an algorithm along the lines of MAS 5.0 where we replaced the Tukey biweight one-step location estimator by an infinitesimally robust radius-minimax (rmx) k-step (k ≥ 1) location and scale estimator [5].

Illumina BeadArrays are based on 3-micron silica beads that are randomly positioned on fiber optic bundles or planar silica slides. Each bead is covered with hundreds of thousands of copies of a specific oligonucleotide sequence assigning the bead to a certain bead type, where each bead type is represented in high redundancy with more than 30 replicates on average. The intensity values of the single beads are called bead level data. It is part of the preprocessing to aggregate the bead level data to so-called bead summary data leading to a single expression value for each bead type. In this paper we only consider data from single-channel gene expression BeadArrays. Besides that, BeadArrays can also be used for SNP genotyping, methylation profiling, and copy number variation analysis [6].

In Illumina's proprietary BeadStudio software, an outlier rejection method (median ± 3 × MAD) combined with mean and standard deviation is used to aggregate the bead level data to bead summary data. The MAD in this setup is standardized by 1.4826 to be consistent at the normal model. That is, the location part of Illumina's estimator is a Huber-type skipped mean and is very close to estimator X42 of Hampel (1985) [7], which uses 3.03 × MAD. Quoting Hampel et al. (1986) [3], p. 69, the X42 estimator is "frequently quite reasonable, according to present preliminary knowledge". However, Monte-Carlo studies show that there may be an efficiency loss of 5-25% compared to more sophisticated robust estimators [7]. Hence, we implemented an algorithm which uses rmx k-step estimators for summarizing bead level data.

The goal of this paper is to demonstrate that the application of optimally robust rmx estimators can lead to a considerable efficiency gain and increased reproducibility for well-established preprocessing algorithms. First, we argue for normal location and scale as the ideal model, at least for the log-transformed values of some publicly available Affymetrix and Illumina datasets, using the minimum Kolmogorov distance. Second, we use Monte-Carlo simulations and those publicly available datasets to compare MAS 5.0 and Illumina's default method with our modified algorithms based on rmx estimators. The proposed preprocessing algorithms are implemented in the R package RobLoxBioC [8, 9]. A brief overview of infinitesimal robustness is given in the Methods section at the end of the manuscript.
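For illustration, the default summarization described above (median ± 3 × MAD rejection followed by mean and standard deviation) amounts, for the values of a single bead type, to the following short R sketch; it mirrors the described rule and is not Illumina's proprietary code.

```r
## Sketch of the described default summarization for one bead type:
## discard bead values outside median +/- 3 * MAD (MAD standardized by
## 1.4826) and return mean, SD and number of remaining beads.
illumina_summary <- function(x) {
  keep <- abs(x - median(x)) <= 3 * mad(x)   # mad() standardizes by 1.4826
  c(mean = mean(x[keep]), sd = sd(x[keep]), n.beads = sum(keep))
}
```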

Results and Discussion

Affymetrix Data

We replace Tukey's biweight, which only provides a location estimate, by an rmx estimator for normal location and scale N(μ, σ²). The Tukey biweight and the rmx estimator are constructed as one-step and k-step estimates based on median and MAD, respectively. As both estimators are so-called asymptotically linear (AL) estimators, a straightforward way to compare them is to inspect the corresponding influence curves/functions (ICs) displayed in Figure 1.

Figure 1. Location and scale ICs. Comparison of location and scale ICs for Tukey's biweight, the rmx estimator (rmx IC) and the maximum likelihood estimator (MLE).

In both cases, the location part of the IC is redescending. In contrast to Tukey's biweight, which rejects observations more than about 3.35 (standardized) MADs away from the median, the rmx estimator only downweights large observations. Moreover, the plot shows that Tukey's biweight is mostly affected by undiscoverable gross errors (values that are very likely to occur in the normal model) located around 1.51 (standardized) MADs away from the median, where the IC of Tukey's biweight (and of the MAD) is maximal, whereas the rmx estimator is mostly deflected by large observations, where the Euclidean length of the location and scale IC is maximal.

However, before applying the rmx estimator for normal location and scale, one should first check whether it is plausible to assume normal location and scale as the ideal model for the values M_{i,j} := log_2(PM_{i,j}/MM_{i,j}). As we cannot test for approximate normality (there is no such statistical test), we use the minimum Kolmogorov distance for this purpose. That is, we minimize, in μ ∈ ℝ and σ ∈ (0, ∞),

d_\kappa\big(N(\mu, \sigma^2), \hat F_{M_i}\big) = \sup_x \big|\Phi_{\mu,\sigma}(x) - \hat F_{M_i}(x)\big| \quad (4)

where Φ_{μ,σ} is the cumulative distribution function of N(μ, σ²) and \hat F_{M_i} is the empirical distribution function of the sample M_{i,1}, ..., M_{i,n_i}. Working with a right-continuous empirical distribution function, the above supremum is equal to

\max_{j = 1, \ldots, n_i} \left\{ \Phi_{\mu,\sigma}(M_{i,(j)}) - \tfrac{j-1}{n_i},\; \tfrac{j}{n_i} - \Phi_{\mu,\sigma}(M_{i,(j)}) \right\} \quad (5)

where M_{i,(1)}, ..., M_{i,(n_i)} is the increasingly sorted sample. In particular, the minimal possible Kolmogorov distance for sample size n is (2n)^{-1}.
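The following R sketch illustrates the computation of the minimum Kolmogorov distance (4)-(5) for a single sample by minimizing over μ and σ with a general-purpose optimizer; the function KolmogorovMinDist of package RobLoxBioC is the implementation actually used below, so this is only an illustrative stand-in.

```r
## Illustrative minimum Kolmogorov distance (equations (4)-(5)) for one sample.
min_kolmogorov_dist <- function(x) {
  x <- sort(x)
  n <- length(x)
  kd <- function(par) {                               # distance at (mu, sigma)
    p <- pnorm(x, mean = par[1], sd = exp(par[2]))    # log-scale keeps sigma > 0
    max(p - (seq_len(n) - 1) / n, seq_len(n) / n - p) # equation (5)
  }
  optim(c(median(x), log(mad(x))), kd)$value          # >= 1/(2n) by construction
}

## Example: a normal and a clearly non-normal sample of size 11
set.seed(1)
min_kolmogorov_dist(rnorm(11))
min_kolmogorov_dist(rexp(11))
```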

Of course, it would be possible to use some other distance (e.g. Cramér von Mises) or the test statistic of some test for normality for this purpose. However, we decided to use the Kolmogorov distance since there is a certain connection between Kolmogorov neighborhoods and the gross-error model in infinitesimal robustness (see Rieder (1994), Lemma 4.2.8 and Proposition 5.3.3 [10]) and the Kolmogorov distance can be computed efficiently via equation (5). Nevertheless, the computations take more than 100 minutes for the HGU95A dataset and more than 130 minutes for the HGU133A dataset, which consist of 59 and 42 GeneChips, respectively, using function KolmogorovMinDist of our package RobLoxBioC on an Intel P9500 (64 bit Linux, 8 GByte RAM). For more details on these Latin square spike-in datasets we refer to Cope et al. (2004) [11] and Irizarry et al. (2006) [12].

Table 1 shows the number of probe sets per number of probe pairs for the HGU95A and HGU133A GeneChips. Figure 2 displays the minimum Kolmogorov distances for the HGU95A and HGU133A Latin square datasets as well as for normal random samples (50000 Monte-Carlo replications for each sample size), where we selected only those numbers of probe pairs that occur for a considerable number of probe sets. In Table 2 we recorded the differences of the medians of the minimum Kolmogorov distances between the Latin square datasets and the corresponding normal random samples. The results for the 95% and 99% quantiles are very similar. Based on these results it is very reasonable to assume normal location and scale as the ideal model for M_{i,j}, expecting only minor deviations from normality.

Table 1 Number of probe sets
Figure 2. Minimum Kolmogorov distance for Affymetrix data. Minimum Kolmogorov distances for the HGU95A and HGU133A datasets as well as for normal (pseudo) random samples (50000 Monte-Carlo replications). The grey boxes indicate the mode of the number of probes in a probe set. The figure indicates that the considered Affymetrix data is in good agreement with the normal location and scale model.

Table 2 Minimum Kolmogorov distances for Affymetrix data

To get a rough estimate of the corresponding size of the contamination neighborhood (i.e. Tukey's gross-error model [13]), which is required for selecting an appropriate rmx estimator, we use the following heuristics: for Kolmogorov (U_κ), total variation (U_v) and contamination neighborhoods (U_c) of size s ∈ (0, 1) it holds that

U_c(s) \subset U_v(s) \subset U_\kappa(s) \quad (6)

In addition, at least in the one-dimensional case and under symmetry, the optimally robust ICs for U_c(2s) and U_v(s) coincide. Moreover, if the optimally robust IC is monotone, then it is also the solution for U_κ(s) [10]. Based on these considerations we multiply the median difference between the observed and the simulated Kolmogorov distances by two, leading us to a neighborhood size s ∈ [0, 0.05].
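In code, the heuristic amounts to the following sketch, assuming kd.obs holds the observed minimum Kolmogorov distances for probe sets with n probe pairs and min_kolmogorov_dist is the illustrative function from above (both names are ours):

```r
## Rough neighborhood size: twice the median excess of the observed minimum
## Kolmogorov distances over those of ideal normal samples of the same size.
set.seed(1)
kd.norm <- replicate(1000, min_kolmogorov_dist(rnorm(n)))  # 50000 replications in the paper
s.hat   <- 2 * (median(kd.obs) - median(kd.norm))
```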

We use the corresponding rmx 3-step estimator instead of Tukey's biweight to compute the specific background values SB_i and the signal log values SigLogVal_i. Asymptotically speaking (i.e. in terms of classical first-order asymptotics), it makes no difference which k we choose to construct the rmx estimator; differences only occur if one takes a look at higher-order asymptotics, as shown by unpublished results of P. Ruckdeschel. However, to date, there are no finite-sample results indicating an optimal choice for k, if there is any. The use of three steps is motivated by the observation that in all situations we have considered so far, the estimates were stable and did not change very much after the third iteration.

The results in Section 8.7 of Kohl (2005) [4] show that, in the infinitesimal robust setup and for known contamination radius, the optimally robust AL estimators clearly outperform the TuMad estimator for the estimation of normal location and scale with respect to the asymptotic maximum MSE, the maximum efficiency loss being 54.9%. Moreover, the results of the following Monte-Carlo study, which is in the spirit of the Princeton robustness study [14], indicate that this also holds for the rmx estimator in the case of an unknown neighborhood radius and finite sample size. Due to the finite sample size and the shrinkage of the neighborhoods, we use a finite-sample correction of the neighborhood radius determined by a large simulation study. The finite-sample correction leads to a larger neighborhood radius, i.e. to a more conservative estimation procedure. It can be computed with the function finiteSampleCorrection of the R package RobLox [15].
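A hedged usage sketch (the function name is stated above; the argument names r and n are assumptions that should be checked against the RobLox documentation):

```r
## Finite-sample corrected neighborhood radius for sample size n = 11;
## argument names are assumed, see ?finiteSampleCorrection in package RobLox.
library(RobLox)
r.fi <- finiteSampleCorrection(r = 0.4, n = 11)
r.fi   # corrected (larger) radius used to compute the rmx estimator
```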

For the simulations we chose a sample size of n = 11, as most of the probe sets have this number of probe pairs on HGU133A GeneChips (cf. Table 1), and performed M = 10^5 Monte-Carlo replications. As the ideal model we used N(0, 1), which is no restriction due to the equivariance of the location and scale model. As contaminating (gross errors generating) distributions we employed:

  • N(0, 9), t_3 and Cauchy(0, 1), leading to an increased variance

  • N(3, 1) and N(10, 1), causing a positive bias

  • Dirac measures at 1.51 (D1.51) and 1000 (D1000), which are approximations for the least favorable contaminations (i.e. leading to maximum risk) for the Tukey and the rmx estimator, respectively.

We selected s ∈ {0.01, 0.02, 0.04} as sizes of the gross error models. The results for other contaminating distributions or amounts of gross errors can easily be computed with the function AffySimStudy of our R package RobLoxBioC.
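The sampling scheme of the study can be sketched as follows: each observation is drawn from the ideal model N(0, 1) with probability 1 - s and from the contaminating distribution with probability s (the function names are ours):

```r
## Generate one contaminated sample of size n from the gross error model
## (1 - s) * N(0, 1) + s * contaminating distribution.
rcontam <- function(n, s, rcont) {
  gross <- rbinom(n, size = 1, prob = s) == 1
  x <- rnorm(n)
  x[gross] <- rcont(sum(gross))
  x
}
set.seed(123)
x1 <- rcontam(11, s = 0.02, rcont = function(m) rnorm(m, mean = 0, sd = 3))  # N(0, 9)
x2 <- rcontam(11, s = 0.04, rcont = function(m) rep(1000, m))                # D1000
```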

Since there is no estimator yielding reliable results if there are 50% or more gross errors, we wanted to admit only random samples with less than 50% contamination. The probability of rejecting a sample is ≤ exp{-2n(0.5 - s)²} by Theorem 2 of Hoeffding (1963) [16], i.e. it decays exponentially in the sample size n. At n = 11 and s ∈ {0.01, 0.02, 0.04}, a replacement of a sample is necessary with probability 4.4 · 10^{-10}, 2.7 · 10^{-8} and 1.6 · 10^{-6}, respectively. Hence, unsurprisingly, not a single sample had to be replaced in our Monte-Carlo study.
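These replacement probabilities and the Hoeffding bound can be checked directly:

```r
## P(at least 50% of the n = 11 observations are gross errors), i.e. the
## probability that a sample has to be replaced, and the Hoeffding bound.
n <- 11
s <- c(0.01, 0.02, 0.04)
pbinom(ceiling(n / 2) - 1, size = n, prob = s, lower.tail = FALSE)  # ~4.4e-10, 2.7e-8, 1.6e-6
exp(-2 * n * (0.5 - s)^2)                                           # Hoeffding upper bound
```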

The results in Table 3 show that, in all situations considered, the rmx location estimate has a smaller empirical MSE than Tukey's biweight. The efficiency loss of Tukey's biweight is about 15-20% in nearly all situations.

Table 3 Simulation study: Tukey's biweight versus rmx estimator

Next, we present some results demonstrating the accuracy of the two procedures for the HGU95A and HGU133A Latin square datasets. For the computations we also used the rmx 3-step estimator for s ∈ [0, 0.05], which is implemented as the default in function RobLoxBioC of our R package RobLoxBioC. The results for MAS 5.0 with Tukey's biweight were determined with the function mas5 of the Bioconductor package affy [17, 18]. In addition to the availability of different robust estimators, the implementation in RobLoxBioC is more efficient: the normalization using RobLoxBioC on an Intel P9500 (64 bit Linux, 8 GByte RAM) requires about 1 minute, in contrast to about 9 minutes for mas5.
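A hedged usage sketch for an AffyBatch object; the exact name and arguments of the summarization function in RobLoxBioC are assumptions to be checked against the package help:

```r
## MAS 5.0 versus rmx-based summarization for an AffyBatch; illustrative only.
library(affy)
library(RobLoxBioC)
abatch    <- ReadAffy()                              # CEL files in the working directory
eset.mas5 <- mas5(abatch)                            # Tukey biweight (MAS 5.0)
eset.rmx  <- robloxbioc(abatch, eps.upper = 0.05)    # rmx 3-step estimator, s in [0, 0.05]
```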

In Figure 3, analogously to Figure 2 in Cope et al. (2004) [11], the mean standard deviations (SDs) are plotted against the mean log-expression values for the two datasets. The curves were determined by smoothing the resulting scatterplots, which include SDs and mean log-expressions for each gene not spiked-in. These plots indicate an improvement of the accuracy of MAS 5.0 when using the rmx estimator instead of Tukey's biweight, as the variability in terms of mean SD is clearly smaller for the rmx estimator, especially for HGU95A. Some quantiles of the computed SD values are given in Table 4. The results for the log fold-changes observed for non-differentially expressed genes (null log-fc), i.e. genes not spiked-in, confirm these findings; see Table 4. Overall, we expect that using the rmx estimator increases the accuracy of MAS 5.0 by 10-20%.

Table 4 Accuracy measures: Tukey's biweight versus rmx estimator
Figure 3. Tukey's biweight versus rmx estimator. Mean standard deviation (SD) versus mean log expression for Tukey's biweight and the rmx estimator for s ∈ [0, 0.05]. The curves were determined by smoothing the resulting scatterplots, which include SDs and mean log expressions for each gene not spiked-in. As the variability in terms of mean SD for the rmx estimator is clearly smaller, especially for HGU95A, these plots indicate an improvement of the accuracy of MAS 5.0 by using the rmx estimator instead of Tukey's biweight.

The comparisons of the two robust procedures were performed with the Bioconductor package affycomp [19]. The full assessments of Cope et al. (2004) [11] and Irizarry et al. (2006) [12] can be computed using the R code specified in the file AffymetrixExample.R in the scripts folder of our package RobLoxBioC. The simulation study can be recomputed with the R code given in the file AffymetrixSimStudy.R, also provided in the scripts folder.

As the following results indicate, the higher accuracy of rmx estimators increases the reproducibility of gene expression analyses. We analyzed a random subset of the MAQC-I study [20] provided by the Bioconductor package MAQCsubsetAFX [21]. Regarding the Affymetrix platform, a total of 120 U133 Plus 2.0 GeneChips were generated and four different reference RNAs were used: (A) 100% of Stratagene's Universal Human Reference RNA, (B) 100% of Ambion's Human Brain Reference RNA, (C) 75% of A and 25% of B, and (D) 25% of A and 75% of B. Each reference was repeated five times on six different test sites. The datasets refA, ..., refD provided by package MAQCsubsetAFX consist of the data of six randomly chosen U133 Plus 2.0 GeneChips (one for each test site) for each reference RNA. As Figure 4 shows, the assumption of approximate normality is fulfilled. We measured the reproducibility in terms of the Spearman correlation of the normalized data and the Pearson correlation of the log2-transformed normalized data. In all cases the correlation was found to be higher for the rmx estimators. The relative increase is 0.6-1.2% (absolute: 0.006-0.011) for the Spearman correlation and 1.2-1.9% (absolute: 0.011-0.017) for the Pearson correlation.
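For two replicate arrays, the reproducibility measures used here reduce to the following sketch, where x and y are vectors of summarized (normalized) expression values for the same probe sets:

```r
## Reproducibility of two technical replicates: Spearman correlation of the
## normalized data and Pearson correlation of the log2-transformed data.
reproducibility <- function(x, y) {
  c(spearman = cor(x, y, method = "spearman"),
    pearson  = cor(log2(x), log2(y), method = "pearson"))
}
```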

Figure 4. Minimum Kolmogorov distance for MAQC-I data. Minimum Kolmogorov distances for randomly selected MAQC-I Affymetrix datasets (refA, ..., refD) as well as for normal (pseudo) random samples (50000 Monte-Carlo replications). Only probe sets with a considerable number of probes are depicted. The boxplots show that the data is in good agreement with the normal location and scale model.

The results can be recomputed using the R code specified in the file AffymetrixReproducibility.R in the scripts folder of the package RobLoxBioC.

Illumina Data

Since we intend to apply the rmx estimators for normal location and scale to summarize the bead level data, we first checked whether the normal model is appropriate for these data. We use the spike-in data investigated in Dunning et al. (2008) [22], consisting of eight customized Mouse-6 version 1 BeadChips hybridized with a complex mouse background. Each BeadChip contains six BeadArrays, each made up of two strips on the chip surface. In total, each of the BeadArrays includes 49283 bead types. The raw bead level values were sharpened and background subtracted [22]. Due to the random positioning of the beads, the number of beads per bead type varies from array to array. In Figures 5 and 6 we depict those bead types with 30 to 50 replicates and 15 to 65 replicates, respectively. The results were obtained using the function KolmogorovMinDist of the R package RobLoxBioC, where the computations took about 9 hours for the spike-in dataset. Both figures indicate that the assumption of normal location and scale as the ideal model is more appropriate for the log-transformed bead level data. Hence, we propose to use the rmx 3-step estimator for normal location and scale and s ∈ [0, 0.05], in combination with the already mentioned finite-sample correction, instead of Illumina's method, which is a Huber-type skipped mean and standard deviation [7], to summarize the log-transformed bead level data. The use of three steps and the choice s ∈ [0, 0.05] are driven by the same heuristics as in the Affymetrix case.

Figure 5. Minimum Kolmogorov distance for Illumina data. Minimum Kolmogorov distances for the 48 Mouse-6 version 1 BeadArrays as well as for normal (pseudo) random samples (50000 replications). The boxplots indicate that the log-transformed bead level data is in good agreement with the normal location and scale model.

Figure 6. Quantiles of minimum Kolmogorov distance for Illumina data. 50% and 99% quantiles of minimum Kolmogorov distances for the 48 Mouse-6 version 1 BeadArrays as well as for normal (pseudo) random samples (50000 replications). The plot confirms that the log-transformed bead level data is in good agreement with the normal location and scale model.

In a further step we performed a simulation study using a setup very similar to the Affymetrix case. Due to the higher redundancy of the Illumina data, we chose a sample size of 30 instead of 11. Moreover, we replaced the Dirac measure at 1.51 by the Dirac measure at 3 (D3), which is an approximation for the least favorable contamination for Illumina's default method. The results for other contaminations can easily be computed with the function IlluminaSimStudy of the R package RobLoxBioC. As in the Affymetrix setup, we required that less than 50% of the observations contained in a sample may be contaminated, where again not a single sample had to be replaced. The results in Table 5 show that the two estimators perform similarly, with a slight advantage for the rmx estimator. Due to the outlier rejection step included in the Illumina method, it is unsurprising that it performs especially well if the contaminating distribution puts mass on large values. In contrast, the rmx estimator outperforms the Illumina method in situations where the outliers are less obvious, as in the case of t_3 or N(3, 1). Furthermore, looking at the maximum empirical risk for the simultaneous estimation of location and scale, Illumina's method shows an efficiency loss of about 15% compared to the rmx estimator.

In view of these results it is no surprise that Illumina's method performs best in Figure 2 of Dunning et al. (2008) [22], where outliers at 2^16 were used and average bias and log2 variance are plotted versus the percentage of simulated outliers for several summary methods. Besides that, the approach of Dunning et al. (2008) [22] contains a flaw from a statistical point of view: the original data is contaminated irrespective of the bead type. That is, one gross-error model for the whole dataset was used instead of a gross-error model for each bead type in the dataset. This approach might reflect the way contamination occurs in practice but, already at moderate contamination rates, one obtains many bead types where 50% or more of the bead values are contaminated and consequently no reliable estimator exists. We postulate that this is the reason why the reported breakdown points are clearly smaller than the "real" breakdown points of the considered estimators (10% trimmed mean and SD: 5% vs. 10%, median and MAD: 30% vs. 50%, Illumina method: 30% vs. 50%).
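A small back-of-the-envelope computation illustrates the last point: with roughly 30 beads per bead type and a global contamination rate p, the probability that at least half of the beads of a given type are contaminated grows quickly with p, and multiplied by the roughly 49000 bead types per array this already yields many bead types for which no reliable estimate exists (the numbers below are illustrative):

```r
## Probability that >= 15 of 30 beads of a bead type are contaminated when
## the whole dataset is contaminated at rate p, and the expected number of
## such "broken" bead types among 49283 bead types.
p <- c(0.05, 0.10, 0.20, 0.30)
p.broken <- pbinom(14, size = 30, prob = p, lower.tail = FALSE)
round(cbind(rate = p, p.broken = p.broken, expected.types = 49283 * p.broken), 4)
```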

Table 5 Simulation study: Illumina's default method versus rmx estimator

Next, we report some results representing the accuracy of the two procedures for the spike-in data of Dunning et al. (2008) [22]. For the computations we again used the rmx 3-step estimator for s ∈ [0, 0.05], which is the default in function RobLoxBioC of the R package RobLoxBioC. The results for Illumina's method were determined with the function createBeadSummaryData of the Bioconductor package beadarray [23]. The computations of the bead summary values take about 100 seconds using createBeadSummaryData and about 500 seconds using RobLoxBioC. So far, the rmx estimator is implemented in interpreted R code; by switching to compiled code (e.g. C/C++ or FORTRAN) we could probably compete with createBeadSummaryData, which calls C code.

For the comparisons we use the approach of Cope et al. (2004) [11] and Irizarry et al. (2006) [12], as in the case of the Affymetrix data. In a first step we plotted the mean SDs against the mean log-expression values for those genes not spiked-in; the results are depicted in Figure 7. The plot indicates a slight improvement of the accuracy when using the rmx estimator instead of Illumina's default method, as the mean SD is slightly smaller for the rmx estimator. Some quantiles of the SD values are given in Table 6. In a second step, we took a look at the log fold-changes observed for the non-differentially expressed genes (null log-fc). This statistic also confirms the above results; see Table 6. Moreover, we added the results for the other methods implemented in package beadarray, which are mean and SD, median and MAD, 5% trimmed mean and SD, as well as 5% winsorized mean and SD. The numerical results show that these other methods perform worse than Illumina's default method and the rmx estimator. Overall, the rmx estimator performs best, and we expect an increase in accuracy of at least 1-5% by using the rmx estimator instead of the other methods.

Figure 7. Illumina's default method versus rmx estimator. Mean standard deviation (SD) versus mean log expression for Illumina's default method and the rmx estimator for s ∈ [0, 0.05]. The plot indicates a slight improvement of the accuracy by using the rmx estimator instead of Illumina's default method, as the mean SD is slightly smaller for the rmx estimator.

Table 6 Accuracy measures: Illumina's default method versus rmx estimator

The results mentioned can be recomputed via the R code provided in files IlluminaExample.R and IlluminaSimStudy.R which are included in the scripts folder of our R package RobLoxBioC.

As the following results indicate, the higher accuracy of our rmx estimators is reflected in an increased reproducibility of gene expression analyses. We again used the spike-in data of Dunning et al. (2008) [22], which can be divided into twelve sets each including four technical replicates. For these twelve sets we measured the reproducibility in terms of the Spearman and Pearson correlation of the summarized log2-transformed data, overall leading to 72 pairwise comparisons. In 69 (Spearman correlation) and 66 (Pearson correlation) of these cases the correlation was higher for the rmx estimators. As before, the differences between the two procedures were found to be small and remained well below 0.5% in all cases.

The results can be recomputed using the R code given in the file IlluminaReproducibility.R in the scripts folder of our package RobLoxBioC.

Conclusions

As the variability of the estimates is clearly reduced and the reproducibility is increased when rmx estimators are applied for preprocessing, it is reasonable to expect a higher power for subsequent statistical analyses.

In the case of Illumina data the rmx summarization method can be combined with different preprocessing methods that can be applied to bead summary data, e.g. the variance-stabilizing transformation (VST) of Lin et al. (2008) [24].

In the case of Affymetrix data there are several other well-known normalization methods based on parametric models e.g. the robust multi-array average (RMA [25]) or the variance stabilization and calibration (VSN [26]) which can be used. The RMA procedure is based on a linear additive model where one uses median polish [27] for parameter estimation. A replacement of the median polish by a corresponding rmx polish may further improve the algorithm. In the case of VSN a possible modification could consist of replacing the least trimmed sum of squares (LTS) regression [28] by an rmx estimator for regression [4, 10]. As the above results and the results in Chapters 7 and 8 in Kohl (2005) [4] indicate, these modifications lead to an increased accuracy. At the same time we can retain the high breakdown point of the already available robust estimators by using the k-step construction in combination with bounded rmx ICs [29].

The reported results and the universality of the infinitesimal robustness approach suggest that optimally robust rmx estimators should also be of interest for other bioinformatics applications.

Methods

Infinitesimal Robustness

The approach of Huber-Carol (1970) [30], Rieder (1978) [31], Bickel (1981) [32], Rieder (1980) [33] and Rieder (1994) [10] to robust testing and estimation employs shrinking neighborhoods of the parametric model, where the shrinking rate n^{-1/2}, as the sample size n → ∞, may be deduced in a testing setup [34]. Due to the shrinkage of the neighborhoods and the asymptotics involved, this approach to robustness is called infinitesimal. A brief comparison and distinction to the robustness approaches of Huber (1981) [35], Hampel et al. (1986) [3] and Maronna et al. (2006) [36] is given in the Introduction of Kohl et al. (2010) [29].

Denoting by M_1(A) the set of all probability measures on some measurable space (Ω, A), one considers a smoothly parameterized model

\mathcal{P} = \{P_\theta \mid \theta \in \Theta\} \subset M_1(A) \quad (7)

whose parameter space Θ is an open subset of some finite-dimensional ℝ^k, and which is dominated: dP_θ = p_θ dμ (θ ∈ Θ). The smoothness of the model P at any fixed θ ∈ Θ is characterized by the requirement of L_2 differentiability (also called differentiability in quadratic mean); see Section 2.3 of [10]. The ℝ^k-valued L_2 derivative is denoted by Λ_θ ∈ L_2^k(P_θ), and its covariance I_θ = E_θ Λ_θ Λ_θ' under P_θ is the Fisher information of P at θ, which is assumed to be of full rank k. This type of differentiability is, for instance, implied by continuous differentiability of p_θ and continuity of I_θ with respect to θ; in this case Λ_θ = ∂_θ log p_θ is the classical scores function; see Lemma A.3 of Hájek (1972) [37].

Given the so-called ideal model P, one defines asymptotically linear (AL) estimators S to be any sequence of estimators S_n: Ω^n → ℝ^k such that

\sqrt{n}\,(S_n - \theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \psi_\theta(x_i) + o_{P_\theta^n}(n^0) \quad (8)

for some, necessarily unique, influence curve (IC) ψ_θ ∈ Ψ(θ), where

\Psi(\theta) = \big\{\psi_\theta \in L_2^k(P_\theta) \;\big|\; E_\theta\,\psi_\theta = 0,\; E_\theta\,\psi_\theta \Lambda_\theta' = I_k\big\} \quad (9)

Here we used the stochastic Landau notation of Pfanzagl (1994) [38], i.e. o_{P_θ^n}(n^0) → 0 in product P_θ^n probability as n → ∞, and I_k denotes the k × k identity matrix. For more details we refer to Rieder (1994), Section 4.2 [10].

In infinitesimal robustness, the i.i.d. observations x_1, ..., x_n may follow any law Q in some shrinking neighborhood about P_θ. In this article, we consider the (convex) contamination neighborhood system U_c(θ), which consists of all contamination neighborhoods, at size 0 ≤ s ≤ 1,

U_c(\theta, s) = \{(1 - s)\,P_\theta + s\,Q \mid Q \in M_1(A)\} \quad (10)

Subsequently, s = s_n = r\,n^{-1/2} for starting radius r ∈ [0, ∞) and n → ∞.

Given this setup, the aim is to minimize the asymptotic maximum risk

\lim_{n \to \infty}\; \sup_{Q \in U_c(\theta,\, r n^{-1/2})} \int \ell\big(n^{1/2}(S_n - \theta)\big)\, dQ^n \quad (11)

with continuous loss function ℓ: ℝ^k → [0, ∞). Throughout this article we will use squared loss ℓ(z) = |z|², which leads to the (asymptotic maximum) mean squared error MSE_θ(ψ_θ, r).
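For convex contamination neighborhoods and bounded ICs, this asymptotic maximum MSE takes the well-known explicit form (cf. Rieder (1994) [10])

\mathrm{MSE}_\theta(\psi_\theta, r) = E_\theta\,|\psi_\theta|^2 + r^2\, \sup\nolimits_x |\psi_\theta(x)|^2

i.e. the trace of the asymptotic covariance plus r² times the squared bias bound, since the maximal asymptotic bias over the neighborhood is r times the supremum of |ψ_θ|.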

To simplify notation, the fixed θ will be dropped henceforth.

The optimally robust ψ*, the unique solution minimizing MSE(ψ, r) among all ψ ∈ Ψ for given radius r, is given in Theorem 5.5.7 of Rieder (1994) [10]: there exist some vector z ∈ ℝ^k and matrix A ∈ ℝ^{k×k}, A positive definite, such that

\psi^* = A\,(\Lambda - z)\,w \quad (12)
w = \min\big\{1,\; b\,|A\,(\Lambda - z)|^{-1}\big\} \quad (13)

where

r^2 b = E\big(|A\,(\Lambda - z)| - b\big)_+ \quad (14)

and

0 = E\,(\Lambda - z)\,w \quad (15)
A^{-1} = E\,(\Lambda - z)(\Lambda - z)'\,w \quad (16)

Conversely, the form (12)-(15) suffices for ψ* to be the solution. The minimax solution for more general risks is derived in Ruckdeschel and Rieder (2004) [39].
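As a small illustration of the structure of the solution, the following R sketch evaluates the IC (12)-(13) for given scores Λ and given A, z and b; the Lagrange quantities A, z and b themselves have to be determined from (14)-(16) and are filled with arbitrary illustrative values here:

```r
## Optimally robust IC psi* = A (Lambda - z) w with weight (13); A, z and b
## must satisfy (14)-(16), here arbitrary values are used for illustration.
rmx_ic <- function(Lambda, A, z, b) {
  function(x) {
    Y <- as.vector(A %*% (Lambda(x) - z))
    w <- min(1, b / sqrt(sum(Y^2)))      # weight (13); w = 1 if Y = 0
    Y * w                                # equation (12)
  }
}

Lambda <- function(x) c(x, x^2 - 1)      # scores of N(mu, sigma^2) at (0, 1)
psi <- rmx_ic(Lambda, A = diag(2), z = c(0, 0), b = 2)
psi(0.5)                                 # small observation: essentially linear
psi(5)                                   # large observation: clipped to length b
```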

In applications, the starting radius r for the neighborhoods U_c(θ, r n^{-1/2}) is unknown or only known to belong to some interval [r_lo, r_up) ⊂ [0, ∞). For this situation Rieder et al. (2008) [5] propose to consider the relative MSE of ψ_s at radius r, defined as

\mathrm{relMSE}(\psi_s, r) := \mathrm{MSE}(\psi_s, r)\,/\,\mathrm{MSE}(\psi_r, r) \quad (17)

and the IC ψ_{r_0} which minimizes

\sup_{r \in [r_{\mathrm{lo}},\, r_{\mathrm{up}})} \mathrm{relMSE}(\psi_s, r) \quad (18)

among all s ∈ [r_lo, r_up) is called radius-minimax (rmx).

Given the rmx IC, the corresponding rmx estimator can then be determined via the one-step construction or, respectively, via an iterated one-step, that is, k-step (k ≥ 1), construction

S_n^{(k)} = S_n^{(k-1)} + \frac{1}{n} \sum_{i=1}^{n} \psi_{S_n^{(k-1)},\, r_0}(x_i) \quad (19)

based on a suitable starting estimate S_n^{(0)} [29].
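A generic R sketch of the k-step construction (19); purely for illustration, the (unbounded) classical IC of the normal location and scale model is plugged in, starting from median and MAD. Plugging in the bounded rmx IC instead yields the rmx k-step estimator.

```r
## k-step construction: repeatedly add the sample mean of the IC, evaluated
## at the current parameter value, to the current estimate (equation (19)).
kstep <- function(x, start, IC, k = 3) {
  est <- start
  for (i in seq_len(k)) {
    est <- est + rowMeans(sapply(x, IC, est = est))
  }
  est
}

## Classical (unbounded) IC of N(mu, sigma^2); for illustration only.
IC.class <- function(xi, est) {
  mu <- est[1]; s <- est[2]
  c(xi - mu, ((xi - mu)^2 - s^2) / (2 * s))
}

set.seed(1)
x <- rnorm(30, mean = 2, sd = 1.5)
kstep(x, start = c(median(x), mad(x)), IC = IC.class, k = 3)
```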

The normal location and scale model, i.e. P_θ = N(μ, σ²) with θ = (μ, σ)', μ ∈ ℝ, σ ∈ (0, ∞), forms an L_2 differentiable exponential family. As starting estimator one can use the median and the median absolute deviation (MAD), as justified by Kohl (2005), Section 2.3.4 [4]. Since the rmx IC in this model is bounded, the breakdown point of the starting estimator, which is 50% for median and MAD, is inherited by the rmx one-step estimator [29].

References

  1. Affymetrix, Inc: Statistical Algorithms Description Document. Affymetrix, Santa Clara; 2002.


  2. MAQC-Consortium, Shi L, Campbell G, Jones WD, Campagne F, Wen Z, Walker SJ, Su Z, Chu TM, Goodsaid FM, Pusztai L, Shaughnessy JDJ, Oberthuer A, Thomas RS, Paules RS, Fielden M, Barlogie B, Chen W, Du P, Fischer M, Furlanello C, Gallas BD, Ge X, Megherbi DB, Symmans WF, Wang MD, Zhang J, Bitter H, Brors B, Bushel PR, Bylesjo M, Chen M, Cheng J, Cheng J, Chou J, Davison TS, Delorenzi M, Deng Y, Devanarayan V, Dix DJ, Dopazo J, Dorff KC, Elloumi F, Fan J, Fan S, Fan X, Fang H, Gonzaludo N, Hess KR, Hong H, Huan J, Irizarry RA, Judson R, Juraeva D, Lababidi S, Lambert CG, Li L, Li Y, Li Z, Lin SM, Liu G, Lobenhofer EK, Luo J, Luo W, McCall MN, Nikolsky Y, Pennello GA, Perkins RG, Philip R, Popovici V, Price ND, Qian F, Scherer A, Shi T, Shi W, Sung J, Thierry-Mieg D, Thierry-Mieg J, Thodima V, Trygg J, Vishnuvajjala L, Wang SJ, Wu J, Wu Y, Xie Q, Yousef WA, Zhang L, Zhang X, Zhong S, Zhou Y, Zhu S, Arasappan D, Bao W, Lucas AB, Berthold F, Brennan RJ, Buness A, Catalano JG, Chang C, Chen R, Cheng Y, Cui J, Czika W, Demichelis F, Deng X, Dosymbekov D, Eils R, Feng Y, Fostel J, Fulmer-Smentek S, Fuscoe JC, Gatto L, Ge W, Goldstein DR, Guo L, Halbert DN, Han J, Harris SC, Hatzis C, Herman D, Huang J, Jensen RV, Jiang R, Johnson CD, Jurman G, Kahlert Y, Khuder SA, Kohl M, Li J, Li L, Li M, Li QZ, Li S, Li Z, Liu J, Liu Y, Liu Z, Meng L, Madera M, Martinez-Murillo F, Medina I, Meehan J, Miclaus K, Moffitt RA, Montaner D, Mukherjee P, Mulligan GJ, Neville P, Nikolskaya T, Ning B, Page GP, Parker J, Parry RM, Peng X, Peterson RL, Phan JH, Quanz B, Ren Y, Riccadonna S, Roter AH, Samuelson FW, Schumacher MM, Shambaugh JD, Shi Q, Shippy R, Si S, Smalter A, Sotiriou C, Soukup M, Staedtler F, Steiner G, Stokes TH, Sun Q, Tan PY, Tang R, Tezak Z, Thorn B, Tsyganova M, Turpaz Y, Vega SC, Visintainer R, von Frese J, Wang C, Wang E, Wang J, Wang W, Westermann F, Willey JC, Woods M, Wu S, Xiao N, Xu J, Xu L, Yang L, Zeng X, Zhang J, Zhang L, Zhang M, Zhao C, Puri RK, Scherf U, Tong W, Wolfinger RD: The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology 2010, in press.

  3. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA: Robust statistics. The approach based on influence functions. New York: Wiley; 1986.

  4. Kohl M: Numerical Contributions to the Asymptotic Theory of Robustness. PhD thesis. University of Bayreuth; 2005.

  5. Rieder H, Kohl M, Ruckdeschel P: The cost of not knowing the radius. Stat Meth & Appl 2008, 17: 13–40.

  6. Kuhn K, Baker SC, Chudin E, Lieu MH, Oeser S, Bennett H, Rigault P, Barker D, McDaniel TK, Chee MS: A novel, high-performance random array platform for quantitative gene expression profiling. Genome Res 2004, 14: 2347–56.

  7. Hampel FR: The breakdown points of the mean combined with some rejection rules. Technometrics 1985, 27: 95–107.

  8. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2010. [ISBN 3-900051-07-0] [http://www.R-project.org]

  9. Kohl M: RobLoxBioC: Infinitesimally robust estimators for preprocessing omics data. 2010. [R package version 0.7.1]

  10. Rieder H: Robust Asymptotic Statistics. New York: Springer; 1994.

  11. Cope LM, Irizarry RA, Jaffee HA, Wu Z, Speed TP: A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 2004, 20: 323–31.

  12. Irizarry RA, Wu Z, Jaffee HA: Comparison of Affymetrix GeneChip expression measures. Bioinformatics 2006, 22: 789–94.

  13. Tukey JW: A survey of sampling from contaminated distributions. In Contributions to Probability and Statistics I. Edited by: Olkin I. Stanford: Stanford University Press; 1960:448–485.

  14. Andrews DF, Bickel PJ, Hampel FR, Huber PJ, Rogers WH, Tukey JW: Robust estimates of location: survey and advances. Princeton: Princeton University Press; 1972.

  15. Kohl M: RobLox: Optimally robust influence curves and estimators for location and scale. 2009. [R package version 0.7]

  16. Hoeffding W: Probability inequalities for sums of bounded random variables. J Am Stat Assoc 1963, 58: 13–30.


  17. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 2004, 5: R80.

  18. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy - analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20: 307–15.

  19. Irizarry RA, Wu Z: affycomp: Graphics Toolbox for Assessment of Affymetrix Expression Measures. 2009. [R package version 1.19.4]

  20. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, Thierry-Mieg J, Wang C, Wilson M, Wolber PK, Zhang L, Amur S, Bao W, Barbacioru CC, Lucas AB, Bertholet V, Boysen C, Bromley B, Brown D, Brunner A, Canales R, Cao XM, Cebula TA, Chen JJ, Cheng J, Chu TM, Chudin E, Corson J, Corton JC, Croner LJ, Davies C, Davison TS, Delenstarr G, Deng X, Dorris D, Eklund AC, Fan XH, Fang H, Fulmer-Smentek S, Fuscoe JC, Gallagher K, Ge W, Guo L, Guo X, Hager J, Haje PK, Han J, Han T, Harbottle HC, Harris SC, Hatchwell E, Hauser CA, Hester S, Hong H, Hurban P, Jackson SA, Ji H, Knight CR, Kuo WP, LeClerc JE, Levy S, Li QZ, Liu C, Liu Y, Lombardi MJ, Ma Y, Magnuson SR, Maqsodi B, McDaniel T, Mei N, Myklebost O, Ning B, Novoradovskaya N, Orr MS, Osborn TW, Papallo A, Patterson TA, Perkins RG, Peters EH, Peterson R, Philips KL, Pine PS, Pusztai L, Qian F, Ren H, Rosen M, Rosenzweig BA, Samaha RR, Schena M, Schroth GP, Shchegrova S, Smith DD, Staedtler F, Su Z, Sun H, Szallasi Z, Tezak Z, Thierry-Mieg D, Thompson KL, Tikhonova I, Turpaz Y, Vallanat B, Van C, Walker SJ, Wang SJ, Wang Y, Wolfinger R, Wong A, Wu J, Xiao C, Xie Q, Xu J, Yang W, Zhang L, Zhong S, Zong Y, Slikker WJ: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology 2006, 24(9):1151–61.


  21. Gatto L: MAQCsubsetAFX: MAQC data subset for the Affymetrix platform. 2010. [R package version 1.0.3] [http://www.slashhome.be/MAQCsubsetAFX.php]

  22. Dunning MJ, Barbosa-Morais NL, Lynch AG, Tavaré S, Ritchie ME: Statistical issues in the analysis of Illumina data. BMC Bioinformatics 2008, 9: 85.

  23. Dunning MJ, Smith ML, Ritchie ME, Tavaré S: beadarray: R classes and methods for Illumina bead-based data. Bioinformatics 2007, 23: 2183–4.

  24. Lin SM, Du P, Huber W, Kibbe WA: Model-based variance-stabilizing transformation for Illumina microarray data. Nucleic Acids Res 2008, 36(2):e11.

  25. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 2003, 4: 249–64.


  26. Huber W, von Heydebreck A, Sueltmann H, Poustka A, Vingron M: Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression. Bioinformatics 2002, 18: 96–104.


  27. Tukey JW: Exploratory Data Analysis. Reading, Mass.: Addison-Wesley Publishing Company; 1977.


  28. Rousseeuw PJ, Leroy AM: Robust Regression and Outlier Detection. New York: John Wiley and Sons; 1987.


  29. Kohl M, Ruckdeschel P, Rieder H: Infinitesimally Robust Estimation in General Smoothly Parametrized Models. Stat Meth & Appl 2010, 19: 333–354.


  30. Huber-Carol C: Étude asymptotique de tests robustes. PhD thesis. ETH Zürich; 1970.


  31. Rieder H: A robust asymptotic testing model. Ann Stat 1978, 6: 1080–94.


  32. Bickel PJ: Quelques aspects de la statistique robuste. New York: Springer; 1981. [École d'été de probabilités de Saint-Flour IX-1979, Lect. Notes Math. 876]

  33. Rieder H: Estimates derived from robust tests. Ann Stat 1980, 8: 106–115.


  34. Ruckdeschel P: A Motivation for 1/√n-Shrinking-Neighborhoods. Metrika 2006, 63: 295–307.

  35. Huber PJ: Robust statistics. New York: Wiley; 1981.


  36. Maronna RA, Martin RD, Yohai VJ: Robust Statistics: Theory and Methods. New York: Wiley; 2006.


  37. Hájek J: Local asymptotic minimax and admissibility in estimation. Proc. 6th Berkeley Sympos. math. Statist. Probab., Univ. Calif. 1970 1972, 1: 175–194.


  38. Pfanzagl J: Parametric statistical theory. Berlin: De Gruyter Textbook; 1994.


  39. Ruckdeschel P, Rieder H: Optimal influence curves for general loss functions. Stat Decis 2004, 22: 201–223.



Acknowledgements

We thank Dr. David Enot for valuable comments and careful proofreading of the manuscript. We also thank two referees and the editor for helpful comments. The authors have no conflicts of interest.

Author information


Corresponding author

Correspondence to Matthias Kohl.

Additional information

Authors' contributions

MK implemented the algorithms, performed the computations and contributed to the manuscript. HPD contributed to the manuscript writing and discussion. Both authors read and approved the final manuscript.


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Kohl, M., Deigner, HP. Preprocessing of gene expression data by optimally robust estimators. BMC Bioinformatics 11, 583 (2010). https://doi.org/10.1186/1471-2105-11-583
