
From hybridization theory to microarray data analysis: performance evaluation



Several preprocessing methods are available for the analysis of Affymetrix GeneChip arrays. The most popular algorithms analyze the measured fluorescence intensities with statistical methods. Here we focus on a novel algorithm, AffyILM, available from Bioconductor, which relies on inputs from hybridization thermodynamics and uses an extended Langmuir isotherm model to compute transcript concentrations. These concentrations are then employed in the statistical analysis. We compared the performance of AffyILM and other traditional methods on both the old and the newest generations of GeneChips.


Tissue mixture and Latin Square datasets (provided by Affymetrix) were used to assess the performance of the differential expression analysis as a function of the preprocessing strategy. A correlation analysis conducted on the tissue mixture data reveals that the median-polish algorithm best summarizes the AffyILM concentrations computed at the probe level. The resulting correlations are equivalent to the best correlations observed with popular preprocessing methods relying on intensity values. The performance of each tested preprocessing algorithm was quantified on the Latin Square HG-U133A dataset by comparing the differential analysis results with the list of spiked genes. The resulting figures of merit illustrate that the performance of AffyILM(medianpolish), inferred from the present statistical analysis, is comparable to that of the best performing strategies previously reported.


By converting probe intensities to estimates of target concentrations prior to the statistical analysis, AffyILM(medianpolish) is one of the best performing strategies currently available. According to hybridization theory, probe-level estimates of target concentrations should be identically distributed. In the future, a probe-level multivariate analysis of the concentrations should be compared to the univariate analysis of probe-set summarized expression data.


During the last decade, high-throughput technologies have been extensively used to monitor the expression profiles of known and predicted genes. These studies made it possible to compare healthy and pathological tissues or cell types, to monitor the effects of several drugs, and to describe sets of genes based on their involvement in several biological processes [1-9]. As several microarray platforms are available (e.g. single color, dual color, high density arrays), different preprocessing and statistical analysis tools were developed, according to the specificities of each system. Here, we focus on the Affymetrix GeneChip product family. The old generation chips used two types of probes to monitor expression: Perfect Match (PM) probes and Mismatch (MM) probes, the latter intended to characterize non-specific binding through the introduction of a mismatch at the central position of the probe. Irizarry et al. showed that the MM signal also depends on the PM target concentration, leading to a more complex situation in which each MM probe signal must be interpreted in terms of both specific and non-specific binding [10, 11]. The original idea of subtracting the MM signal from the PM signal before statistical analysis thus evolved in different directions: several groups recommended using the PM signal only, whereas others used an adjusted MM signal [10, 11]. MM probes were also used as weak specific binders to estimate the background and saturation levels [12]. Note that in its latest GeneChip expression arrays, Affymetrix uses only PM probes and has completely redesigned the probe sequences. In the newest chips, the probes are selected from the whole transcript and not only from its 3'-end, as used to be the case. The non-specific binding is assumed to be common to all probes, and is assessed with a set of anti-genomic probes. Such novelties in the design call for a new critical evaluation of preprocessing methods, which is the aim of this paper.

When comparing several samples, algorithms need to take discrepancies between arrays into account, such as slight differences in the amount of hybridized material (biological and technical replicates, precision of the pipetting device, hybridization time ...). The study of probe intensities thus traditionally involves a normalization step and/or the design of an appropriate statistical model. From a statistical point of view, several probes targeting the same transcript would ideally be analyzed jointly, using a multifactorial procedure such as the ANOVA-2 procedure, in order to compare the expression of a transcript between two conditions (e.g. healthy/disease) and across several samples (replicates). However, the probe-specific signals are not identically distributed, so the iid requirement of ANOVA-2 procedures is not met. Discrepancies between PM probes targeting the same sequence have been reported by several authors. Depending on probe-specific sequences, several processes have been proposed to explain the substantial biases observed in PM signals [13, 14]. First, thanks to more refined knowledge of transcript sequences and their locations in the genome, many probeset definitions have been revised and many probes have been redesigned in the successive generations of arrays. To correctly analyse old-generation arrays, several authors used alternative chip definition files and probe annotation tables [15-19]. Other sources of outlying effects have also been reported. For instance, probes targeting distinct transcripts were found to be systematically correlated across a large number of experiments, while failing to correlate with the rest of the probes designed to target the same transcript (these are probes containing G-stacks, T7-primer sequences ...) [13, 14]. Cross-hybridization effects may also occur, as a probe sequence may be complementary to parts of the sequences of other transcripts.
Listing all the factors that lead to outlying probe signals is not the purpose of this paper, but these examples illustrate the complexity of the models needed to analyse the data appropriately. Most preprocessing methods address this issue by summarizing the probe-level signals into a single probeset-level score, and this step is performed by excluding outliers or by giving them a smaller weight than the other probes (Average Difference in MAS 4.0, 1-step Tukey-Biweight in MAS 5.0, Medianpolish in RMA and GC-RMA) [11, 20-22].

The aim of this paper is to provide a general test of performance of some preprocessing methods for GeneChips, focusing in particular on affyILM, an algorithm recently made available as a Bioconductor package and developed by us. This algorithm is based on physical modelling of the hybridization process and uses a generalized Langmuir isotherm [23] to estimate the concentration of transcripts from the raw data, using thermodynamic principles. The inputs are hybridization free energies obtained from experimental values which measure the affinity of each probe for its complementary transcript. Several papers in the past decade discussed the physical modeling of the hybridization process in GeneChips [23-30]. AffyILM is fully based on the underlying thermodynamics of hybridization, and the basic principles were discussed in previous publications (see e.g. [31, 32]). In this test of performance we considered old- and new-generation (i.e. PM-only) chips and compared affyILM, combined with several popular summarization methods (median, average difference, 1-step Tukey-Biweight, MBEI, medianpolish), with other preprocessing methods (MAS 5.0, RMA, GCRMA, PLIER, FARMS, dChip) [11, 20-22, 33-35].

Results and Discussion

Preprocessing of microarray data is commonly performed in four steps: background correction, PM/MM correction, normalization and summarization. Several algorithms have been reported for each of those steps, and this field of research has evolved in a combinatorial way in search of the best strategy [36]. To put the results described in this article into context, we compared the performance of affyILM with that of MAS 5.0 and PLIER (provided by Affymetrix), RMA and GCRMA, dChip and FARMS [11, 20-22, 33-35]. These software packages have been the most widely used for over ten years, and have been examined in several performance studies [36-39].

Our aim is to provide a preprocessing algorithm based on physical chemistry. We used the extended Langmuir model, which is discussed in Ref. [23], to compute the physical concentration of the targets associated to each of the probes in the chip. Assuming that the differences between arrays are due mainly to the variable amount of sample which is hybridized, we performed a global multiplicative rescaling of the estimated concentrations, so that the pairwise comparison of concentrations is characterized by a slope equal to one between arrays.
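The global rescaling described above can be sketched as follows. This is a minimal illustration, not the affyILM implementation: it assumes that the slope is taken from a least-squares line through the origin against one reference array, and the function names and example values are hypothetical.

```python
def slope_through_origin(x, y):
    """Least-squares slope of y against x for a line forced through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def rescale_to_reference(arrays, reference=0):
    """Multiply each array by a single factor so that its slope against
    the reference array becomes one (assumed scaling rule)."""
    ref = arrays[reference]
    scaled = []
    for conc in arrays:
        factor = 1.0 / slope_through_origin(ref, conc)
        scaled.append([factor * c for c in conc])
    return scaled


# Hypothetical probe-level concentrations for two arrays; the second array
# received roughly twice as much hybridized material as the first.
arrays = [
    [1.0, 2.0, 4.0, 8.0],
    [2.1, 3.9, 8.2, 15.8],
]
scaled = rescale_to_reference(arrays)
print(slope_through_origin(scaled[0], scaled[1]))
```

Because a single multiplicative factor is applied per array, ratios between probes within an array are left unchanged.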

The estimation of non specific binding implies platform-generation-specific features. The use of the Langmuir Isotherm to estimate concentrations is presented here, first ignoring this background-correction step, so that interpretation of the results does not depend on a difference between algorithms used with regard to the specificities related to each generation of array. As a consequence, the variability of probe-specific concentrations may be somewhat over-estimated (a constant background noise introduced in the Langmuir Isotherm may lead to a variable over-estimation of the concentration, as this part of the signal should not be associated to the probe-specific free energy). For this reason, the individual analysis reported here has been performed using both the Student t-test and variants of this test designed to stabilize the variance, at the probe-set level, using shrunken estimates of variability (regularized t-test and window t-test) [38, 40].
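The effect of an uncorrected background can be illustrated with the plain (not extended) Langmuir isotherm. The sketch below is an assumption-laden toy model, not the model of Ref. [23]: the saturation intensity `A`, the value of `RT`, the free energies (taken as positive binding strengths) and the background level are all hypothetical.

```python
import math

RT = 0.616      # kcal/mol, roughly 37 degrees C; illustrative value only
A = 10000.0     # hypothetical saturation intensity


def langmuir_intensity(conc, delta_g, background=0.0):
    """Plain Langmuir isotherm: I = A*c*K / (1 + c*K) + background,
    with K = exp(delta_g / RT) and delta_g > 0 for stronger binding."""
    k = math.exp(delta_g / RT)
    return A * conc * k / (1.0 + conc * k) + background


def langmuir_concentration(intensity, delta_g):
    """Invert the isotherm, ignoring any background (valid for intensity < A)."""
    k = math.exp(delta_g / RT)
    return intensity / (k * (A - intensity))


true_conc = 1e-5   # same target concentration for both probes (arbitrary units)
estimates = {}
for delta_g in (5.0, 7.0):   # a weak and a strong probe
    observed = langmuir_intensity(true_conc, delta_g, background=50.0)
    estimates[delta_g] = langmuir_concentration(observed, delta_g)
    print(delta_g, estimates[delta_g] / true_conc)
```

In this toy setting the same constant background inflates the estimate more for the weakly binding probe than for the strongly binding one, which is the probe-dependent over-estimation mentioned above.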

Correlation of differential expression analysis

To test the performance of affyILM, we selected the Affymetrix Tissue Mixture Study (available on Affymetrix's website) [41]. This study involves two different GeneChips: the old HG-U133 Plus 2 and the new (PM-only) Hugene 1.0 ST array. Table 1 summarizes the design of the tissue mixture study: heart and brain samples were mixed in various ratios. Such a design allows two types of comparisons: (i) within-platform comparisons quantify the robustness of the methods against biological noise and (ii) between-platform comparisons quantify the correlations between HG-U133Plus2 and Hugene 1.0 ST arrays. To quantify the performance of the differential expression analysis as a function of the preprocessing strategy, we computed the Pearson, Kendall and Spearman correlation coefficients on the log10 of the p-values resulting from the differential expression tests being compared. All three correlation coefficients provided similar results. In general, the Pearson correlation coefficient spanned the widest range of values, best highlighting the differences between the methods, and was therefore used in this paper.

Table 1 Description of the tissue mixture study

Differential expression analysis was performed between samples reported in Table 1, using the pure Brain and Heart samples as a reference (see Methods). The correlation analysis between mixtures and reference samples quantifies the decrease in performance due to an increase in biological noise. Highest correlation coefficients highlight the robustness of the algorithms against biological noise.

The results of the analysis are reported in Tables 2, 3, 4, 5, 6 and 7. Tables 2 and 3 report the combination of affyILM with several summarization methods on old HG-U133Plus2 and new Hugene 1.0 ST arrays, respectively. For each combination, three differential expression analysis methods were used (Student t-test, Regularized t-test and Window t-test). More details can be found in Methods.

Table 2 Comparison of the differential expression analysis on HG-U133Plus2 arrays preprocessed with affyILM and popular summarization procedures
Table 3 Comparison of the differential expression analysis on Hugene 1.0 ST arrays preprocessed with affyILM and popular summarization procedures
Table 4 Comparison of the differential expression analysis between HG-U133Plus2 and Hugene 1.0 ST arrays preprocessed with affyILM and several summarization procedures
Table 5 Comparison of the differential expression analysis on HG-U133Plus2 arrays preprocessed with popular methods
Table 6 Comparison of the differential expression analysis on Hugene 1.0 ST arrays preprocessed with several popular algorithms
Table 7 Comparison of the differential expression analysis between HG-U133Plus2 and Hugene 1.0 ST arrays preprocessed with several popular algorithms

The first three columns illustrate the correlations obtained when biological noise is added in equal amounts to both samples. The following four columns illustrate the correlations computed when the analysis is performed between one sample (Brain or Heart) and the mixtures with 75% and 25% of each sample, respectively. The last six columns refer to the correlations measured on the comparisons between one sample (Brain or Heart) and the 3 mixtures with 50% of each sample. The analysis shows that, whatever the differential expression analysis strategy, the best performance is obtained when the probe-level data are summarized using the medianpolish algorithm, followed by the 1-step Tukey-biweight algorithm. In addition, whatever the summarization method, the best results are obtained when the data are analyzed with variance-stabilizing methods (Regularized t-test and Window t-test) rather than with the classic Student t-test. As expected, the correlation decreases with increasing amounts of biological noise (column 1 to 3, column 4 to 8, 10, 12 to 6, and column 5 to 9, 11, 13 to 7). affyILM concentrations analyzed with or without the scaling step lead to similar results: the observed correlation is higher with scaled data for some comparisons and lower for others. The same conclusions hold for both generations of arrays, as shown by comparing Table 2 and Table 3. The two types of arrays lead to similar performance results.

Table 4 summarizes the cross-platform correlation between the HG-U133Plus2 and Hugene 1.0 ST chips. For this purpose, we used the mapping table provided by Affymetrix (best match) [42] (see Methods). The correlations in Table 4 are lower than those reported in Tables 2 and 3, which means that the noise inherent to the platform discrepancies is higher than the noise observed between mixtures. Those discrepancies can have several causes. First, the definition of the probesets and the design of the probes do not follow the same strategy in the 3' expression arrays and in the recent Human Gene 1.0 array. Second, the mapping between the two arrays is not perfect, as a probeset from one array may be associated with several probesets from the other one. In the present study, we computed the scores using the best match mapping table provided by Affymetrix [42]. We only used unique mappings between the arrays, and discarded all the transcripts that are covered by only one of the two arrays. As a consequence, the cross-platform comparison is not perfect.

Using Table 4 to assess the performances of the individual methods leads to the same conclusions as from Tables 2 and 3: best results are obtained when affyILM is used in combination with the medianpolish summarization.

Tables 5, 6 and 7 report the correlations computed when the analysis is performed with current preprocessing methods, using the same 3 differential expression analysis methods. The highest correlations are observed when data are preprocessed using either RMA or GCRMA, both of which use the medianpolish summarization. The two methods provided by Affymetrix (MAS 5 and PLIER) lead to lower correlations, and PLIER seems more appropriate than MAS 5 on the most recent array type (Tables 5 and 6). MAS 5 leads to better results on HG-U133Plus2 only when the mismatch probes are ignored (Table 5). Taken together, these correlation tests suggest that the best methods are RMA, GCRMA and affyILM(medianpolish), whatever the array type.

As expected, a decrease in correlation is observed for increasing amounts of biological noise. However, the comparisons between the mixes and the Brain/Heart samples reveal an unexpected behavior: the 50/50 mixes (mix5a, b, c) lead to higher correlation values when compared with the Brain sample (mix9) than when compared with the Heart sample (mix1). One explanation for this effect could be that the 50/50 mixes are enriched in Heart sample, so that the difference between the mixes and the Heart sample (mix1) is lower than expected, and the difference with the Brain sample (mix9) is higher. This effect can also be observed with the 75% - 25% mixes (mix4 and mix6) when compared to the Brain and Heart samples (compare columns m1m6 with m4m9 or m1m4 with m6m9). As a consequence, we suspect that either the concentration of the Heart samples was under-estimated, or the concentration of the Brain samples was over-estimated, prior to the design of the mixes. Alternatively, this effect could be due to a difference in the quality of the Heart and Brain samples, with a higher level of biological noise in the Brain sample than in the Heart sample (contamination, sample degradation, ...).

In the correlation study reported here, we tested affyILM in combination with the summarization steps used by other methods, to provide a fair comparison. The medianpolish summarization is associated with the best correlation coefficients, whichever preprocessing strategy is used. Giorgi et al. recently reported that the medianpolish procedure induces inter-array correlation, and introduced tRMA, in which the roles of rows and columns are inverted with respect to the medianpolish used in RMA and GCRMA (transposed medianpolish). Their study also showed that the magnitude of the artifact associated with the medianpolish increases when the number of replicates is small, and is affected by whether the number of replicates is odd or even [43]. In addition, in RMA and GCRMA, the medianpolish is performed on log-values (in accordance with the underlying model of RMA and GCRMA). We extended our correlation studies on differential expression results to compare the performance of the medianpolish and the transposed medianpolish, in combination with affyILM, and applied these procedures to the concentrations and to the log of the concentrations. The results of the analysis are supplied in additional files 1, 2 and 3 (respectively for HG-U133plus2, Hugene 1.0 ST and the between-array comparison). The transposed medianpolish and the medianpolish perform similarly, in agreement with Giorgi's statement that the reported artifact should not affect differential expression analysis. Our study also shows that the medianpolish leads to higher correlation values when performed on log-values, in agreement with the model used by RMA and GCRMA.
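For readers unfamiliar with the medianpolish summarization discussed above, the following is a minimal sketch of Tukey's median polish applied to a probes-by-arrays matrix of log-values. It is an illustration written for this text, not the code of the affy package; the matrix values are hypothetical, with one deliberately outlying probe measurement.

```python
from statistics import median


def median_polish(matrix, max_iter=10, tol=1e-9):
    """Fit matrix[i][j] = overall + row_eff[i] + col_eff[j] + resid[i][j]
    by alternately sweeping out row and column medians."""
    resid = [row[:] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


# Rows are probes of one probeset, columns are arrays (hypothetical log2 values).
# The middle value of the third probe is an outlier.
log_conc = [
    [5.0, 6.0, 7.0],
    [6.0, 7.0, 8.0],
    [4.0, 9.0, 6.0],
    [5.5, 6.5, 7.5],
]
overall, row_eff, col_eff, resid = median_polish(log_conc)
summaries = [overall + c for c in col_eff]   # one expression value per array
print(summaries)
```

In this example the outlying measurement ends up in the residuals rather than in the per-array summaries, which is the robustness property that motivates the use of the medianpolish.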

Performance evaluation on Latin-Square data

Correlation results on the differential expression analysis are not sufficient to discriminate between those three methods. To better characterize the performance of the differential expression tests, we used the well-known Latin Square HG-U133A spike-in dataset, which offers the opportunity to compare the most significant probesets with the precise knowledge of the true list of spiked RNAs. Each pairwise differential expression analysis was performed, leading to 91 distinct comparisons (see Methods) [44]. The results from the 91 pairwise comparisons were merged into a single list of p-values, and compared to the associated list of spiked genes to compute the sensitivity (= TP/(TP+FN)) and the false discovery rate (FDR = 1-Precision = FP/(FP+TP)), depicted in Figure 1 using alternative ROC curves for each tested preprocessing and differential analysis method. Those graphs are equivalent to Precision/Recall curves and compare the ability of the methods to find the truth with the price to pay for it, quantified by the proportion of errors in the top list, for increasing sizes of top lists. The alternative ROC curves (Figure 1) can be used to best discriminate between methods and can be interpreted similarly to traditional ROC curves (Figure 2, X axis = 1-specificity = FP/(TN+FP)), using the same Y axis, where the best performing methods are closer to the upper left corner of the graph (finding the truth without errors).
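The construction of the sensitivity versus FDR curve can be sketched as below. This is a minimal illustration under stated assumptions: the function name and the toy p-values are hypothetical, and ties between p-values are not handled specially.

```python
def sensitivity_fdr_curve(p_values, is_spiked):
    """For top lists of increasing size (ranked by p-value), return
    (sensitivity, FDR) pairs with sensitivity = TP/(TP+FN), FDR = FP/(FP+TP)."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    total_spiked = sum(is_spiked)   # TP + FN is constant along the curve
    tp = fp = 0
    curve = []
    for i in order:
        if is_spiked[i]:
            tp += 1
        else:
            fp += 1
        curve.append((tp / total_spiked, fp / (tp + fp)))
    return curve


# Toy example: five probesets, three of which are truly spiked.
p_values = [0.001, 0.2, 0.004, 0.03, 0.6]
is_spiked = [True, False, True, False, True]
curve = sensitivity_fdr_curve(p_values, is_spiked)
for sensitivity, fdr in curve:
    print(round(sensitivity, 3), round(fdr, 3))
```

Each point corresponds to one top-list size, so a method whose curve stays close to the Y axis detects spiked probesets before accumulating false positives.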

Figure 1

Performance evaluation on Latin Square Data HG-U133A: Sensitivity VS False Discovery Rate. The sensitivity (= TP/(TP+FN)) is compared to the false discovery rate (= 1-Precision = FP/(TP+FP)), using affyILM(medianpolish) and other popular preprocessing methods. A, B, and C respectively report the performance evaluation when the differential expression analysis is conducted with the student t-test, the window t-test, and the regularized t-test. D, E and F are zooms of A, B and C, in the lowest FDR region (up to 20%). Each curve is computed from the analysis of 91 available pairwise comparisons between 3 replicates of latin square samples.

Figure 2

Performance evaluation on Latin Square Data HG-U133A: ROC curves. The sensitivity (= TP/(TP+FN)) is compared to 1-specificity (= FP/(TN+FP)), using affyILM(medianpolish) and other popular preprocessing methods. A, B, and C respectively report the performance evaluation when the differential expression analysis is conducted with the student t-test, the window t-test, and the regularized t-test. D, E and F are zooms of A, B and C, in the lowest 1-Specificity region (up to 20%). Each curve is computed from the analysis of 91 available pairwise comparisons between 3 replicates of latin square samples.

As demonstrated in several papers, variants of the Student t-test designed to stabilize the variance estimates outperform the classic Student t-test, whichever preprocessing strategy is used [37, 39, 40, 45-47]. The best performing individual analysis reported in our study is conducted with the window t-test. Comparing MAS5, RMA, GCRMA, PLIER and affyILM(medianpolish) reveals that the best performance for the window t-test and the regularized t-test is obtained with RMA, affyILM(medianpolish) and GCRMA, as shown in Figure 1. Most of the spiked probesets are detected with less than 5%-10% of error in the top list with the window t-test or the regularized t-test, and the remaining probesets are progressively detected. No method is able to detect all spiked RNAs. Using RMA, affyILM(medianpolish) or GCRMA in combination with the window t-test leads to the detection of more than 80% of the spiked genes with less than 5% of error in the top list. In conjunction with the window t-test, RMA and affyILM display a better top list than GCRMA, as the beginning of their curves is closer to the Y-axis. In the second part of those curves, the progressive detection of the remaining genes can be tracked by the error associated with the detection of 90% of the spiked genes: the corresponding FDR is close to 20%, 30% and 90% for RMA, affyILM(medianpolish) and GCRMA, respectively. Using the classic Student t-test, the performance of affyILM(medianpolish) decreases and is lower than that of GCRMA, but still remains higher than that of PLIER and MAS 5.0.

Figure 2 illustrates the performance of the analysis with regard to the specificity (traditional ROC curves). All the methods quickly reach a high level of sensitivity. However, affyILM(medianpolish), RMA and FARMS are able to reach a higher sensitivity level than GCRMA, followed by dChip. The most interesting part of those curves, featuring low 1-specificity values (high specificity), illustrates that the best performance is obtained with RMA, affyILM and FARMS (Figure 2D-F).

The AffyComp assessment makes it possible to compare the performance of preprocessing methods by monitoring several descriptive statistics. Although affyILM lacks an appropriate background correction algorithm, we submitted affyILM(medianpolish) to the AffyComp III assessment. The results of the assessment will serve as a basis for the evaluation of future developments of affyILM. The aim of this paper was to assess the performance of affyILM with regard to differential expression. However, we provide the AffyComp reports for affyILM(medianpolish) in additional files 4 and 5, respectively for the Latin Squares HG-U95a and HG-U133. Additional files 6 and 7 summarize the AffyComp III scores of affyILM and selected methods. The scores obtained reveal that the methods can be split into two categories, and that the best methods are RMA, GCRMA, FARMS and affyILM, in accordance with our performance evaluation.


The aim of this paper was to perform a thorough performance analysis of affyILM, a Bioconductor package designed to preprocess Affymetrix GeneChip expression data. This model relies on the thermodynamics of hybridization and avoids complex statistical transformations such as the normalization steps used by other methods. To avoid biases due to the variability of the amount of hybridized material, the concentrations are scaled with an array-specific factor (selected to get a slope of 1 between pairwise array comparisons). To avoid biases due to the need for platform-specific background estimation algorithms, the background correction step has been ignored in our study. The study reported here addresses two main goals: evaluating the performance of affyILM with respect to differential expression analysis, and selecting the best summarization strategy to avoid outlier-associated bias.

Our correlation study on mixtures of two biological tissue samples first highlights that the medianpolish summarization leads to the best results in conjunction with the extended Langmuir Isotherm, followed by the 1-step Tukey-Biweight algorithm and MBEI, as seen from the data shown in Tables 2, 3 and 4. The comparison with other methods reveals that the correlations observed in our study are similar to those of the best performing methods. The performance is similar for HG-U133 plus 2 and the recent Hugene 1.0 ST array type. The performance evaluation was completed by an analysis of the HG-U133 Latin Square experiment, which allows the most significant genes to be compared with the true knowledge of spiked RNAs. The package affyILM, used in combination with variants of the Student t-test that stabilize the variance, provides a better top list than GCRMA, and is close to RMA for this dataset. The three best methods rely on the medianpolish summarization, highlighting the importance of the summarization step. Using the traditional t-test, the performance of affyILM is lower, in agreement with our expectations, due to the absence of a background-correction step.

According to the statistical tests performed in this paper, the accuracy of affyILM is similar to that of the best performing preprocessing algorithms (RMA, GCRMA...). The advantage of affyILM is that it is entirely based on physical principles and does not rely on excessive parameter fitting. It runs equally well on a single experiment and does not require heavy normalization, apart from a global rescaling of the concentration levels. In addition, it provides an error estimate for each measured expression level [28], which is useful to discriminate robustly determined expression levels from those with high error rates. The experience with other types of arrays [48] suggests that the performance of affyILM could be further increased with a better parametrization of the hybridization free energies, which are used by affyILM to compute the concentrations from the Langmuir isotherm. So far the free energy values used are taken from the data of Sugimoto et al. [49], obtained from experiments of hybridization between RNA and DNA strands in solution.

In the future, our efforts will be split between two objectives. First, we will try to further improve the strategy by including a background-correction step. In our current implementation of affyILM, this step is not performed, causing a variable over-estimation of the concentration. Second, we will try to simplify the analysis by testing an appropriate weighted multivariate analysis strategy on probe concentrations, instead of summarizing them. Transcript concentrations are estimated from each probe, in order to remove the dependence of the intensity on the sequence-specific hybridization free energies. As estimated target concentrations are analysed in place of probe intensities, all probes specifically targeting the same transcript should provide the same information and share the same biological variability. Multivariate analysis procedures are typically used to analyse such data. This strategy should be more powerful, as a multivariate analysis uses more values in the test than a univariate analysis of summarized values. This strategy was previously used by Barrera et al. on intensity values and proved to be efficient [50]. However, the univariate analysis reported here on summarized values shows that affyILM performs best in combination with the medianpolish, which highlights the impact of outlying probes during the summarization step. Outlying probes reveal the presence of unexpected behavior (cross-hybridization, errors in probeset definition or probe sequence...). To avoid biases during multivariate analysis in the presence of outliers, we will focus on the definition of appropriate weighting factors for each probe.



We first selected the tissue mixture study dataset provided by Affymetrix in order to characterize the correlation results of the differential expression analysis between brain and heart samples, and several mixtures of the two samples, as described in Table 1. Each tissue mixture was hybridized on two distinct generations of expression arrays, namely the 3' HG-U133Plus2 expression array and the Human Gene 1.0 ST v1 array. Each sample/mixture was hybridized on 3 arrays (triplicates) [41]. In order to characterize the performance of several preprocessing algorithms (in combination with several differential expression analysis methods), we followed two distinct strategies. In the first strategy, we computed Pearson's correlation coefficient between the significance of the differential expression analysis performed on the optimal comparison (Brain vs Heart = Mix1 vs Mix9) and on several mixture comparisons (Mix.x vs Mix.y), thus comparing the results with increasing biological noise. The differential expression analysis, as well as the preprocessing strategy, are described in the Procedures section hereunder. In the second strategy, the correlations were computed between the two generations of arrays, for each available pairwise mixture comparison. This second strategy was used to compare the performance of each analysis strategy/preprocessing method in terms of its ability to extract common information from both array types.

To best characterize the performances of the selected preprocessing/analysis strategies, we selected the latin-square HG-U133A spike-in dataset. The design of the latin-square experiment relies on the definition of 14 sets of 3 probesets, leading to 42 probesets spiked with known concentrations of RNA. Using those RNAs, 14 triplicated hybridizations were performed with increasing concentrations (0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 pM). Each set of 3 probesets is spiked with a specific concentration. The latin-square experiment leads to 91 possible pairwise comparisons between triplicate experiments [44]. The Bioconductor package AffyComp [36] provides a list of probe sets which potentially cross-hybridize with the spiked transcripts of the Latin square HGU-133A experiment. To avoid cross-hybridization biases, these probesets have been discarded from our analysis, in agreement with the procedure implemented in the AffyComp package. Some of these probesets are expected to hybridize to the spiked sequences, implying that we must consider them as true positives if analyzed, but this would be a violation of the latin-square design (14 groups of 3 probesets spiked in 14 concentrations). As an example, the spiked clone sequence of probeset AFFX-ThrX-3_at aligns with probes defining probesets AFFX-ThrX-5_at, AFFX-ThrX-M_at, AFFX-r2-Bs-thr-3_s_at, AFFX-r2-Bs-thr-5_s_at and AFFX-r2-Bs-thr-M_s_at. The Affymetrix description document of the Latin Square HG-U133a also refers to 3 probesets known to cross-hybridize (included in the probesets listed by AffyComp).
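The figure of 91 pairwise comparisons quoted above follows from choosing 2 of the 14 triplicated experiments. A short check, in which the experiment labels 1 to 14 are placeholders rather than the dataset's own identifiers:

```python
from itertools import combinations

# Placeholder labels for the 14 triplicated hybridizations of the Latin Square.
experiments = list(range(1, 15))

# Every unordered pair of distinct experiments is one differential comparison.
pairwise_comparisons = list(combinations(experiments, 2))
print(len(pairwise_comparisons))
```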


The raw data provided by Affymetrix were preprocessed using the R programming environment and several Bioconductor packages [51, 52]. The estimation of probe-level concentrations from the Langmuir isotherm was performed using affyILM. The summarization of probe-level concentrations was performed using internal functions of the affy package [53], by first creating an AffyBatch object containing the probe-level concentrations. GCRMA was performed using the gcrma package. The GCRMA methodology makes use of MM probes to estimate the background distribution for each array [11]. For the Human Gene 1.0 ST data (PM-only arrays), the background distribution was computed using the set of anti-genomic probes designed by Affymetrix. MAS 5.0, RMA and dChip were performed with the expresso function of the affy package. PLIER and FARMS were performed with the R/Bioconductor packages plier and farms, respectively.
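
To make the summarization step concrete, here is a minimal sketch of Tukey's median polish applied to one probeset. It is a generic illustration, not the code of the affy package: it assumes a matrix of log-transformed probe-level concentrations with probes in rows and arrays in columns, and the function name `median_polish` is hypothetical.

```python
from statistics import median


def median_polish(matrix, max_iter=10, tol=1e-6):
    """Decompose matrix[i][j] into overall + row_eff[i] + col_eff[j] + residual."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [list(row) for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the row (probe) medians out of the residuals.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            resid[i] = [value - row_med[i] for value in resid[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep the column (array) medians out of the residuals.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
            col_eff[j] += col_med[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes the residuals.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


# Toy probeset: 3 probes (rows) x 3 arrays (columns), purely additive.
toy = [[5, 7, 6], [6, 8, 7], [7, 9, 8]]
overall, row_eff, col_eff, resid = median_polish(toy)
# One summarized value per array: overall level plus the array effect.
summary = [overall + c for c in col_eff]
print(summary)
```

The per-array summary (overall level plus column effect) plays the role of the probeset-level expression value; the row effects absorb probe-specific differences.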

Differential expression analysis was performed using PEGASE, an R package written by Berger et al. that applies several methods and allows their results to be compared [54]. Probesets containing only NA values in at least one of the 2 subsets of a pairwise comparison were removed. The selected methods were the classic Student t-test, the Regularized t-test [40], and the Window t-test [38].
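
As an illustration of the filtering rule and of the simplest of the three tests, the sketch below drops a probeset when one group contains only missing values and computes the pooled-variance Student t statistic. It is not the PEGASE implementation; the helper names are hypothetical, each group is assumed to keep at least two non-missing values with non-zero pooled variance, and the p-value (which requires the t distribution, for example `scipy.stats.t.sf`) is left out to stay within the standard library.

```python
import math
from statistics import mean, variance


def is_missing(value):
    """True for None or NaN, the stand-ins used here for R's NA."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def keep_probeset(group_a, group_b):
    """Drop a probeset when either group holds only missing values."""
    return not (all(map(is_missing, group_a)) or all(map(is_missing, group_b)))


def student_t(group_a, group_b):
    """Pooled-variance (classic Student) t statistic and degrees of freedom."""
    a = [v for v in group_a if not is_missing(v)]
    b = [v for v in group_b if not is_missing(v)]
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return t, df


t_stat, dof = student_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t_stat, 4), dof)
```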

For each available pairwise comparison in the Tissue Mixture study, the lists of p-values generated by PEGASE were used to compute a correlation score using two strategies.

In the first strategy, we defined the Mix1 vs Mix9 (Brain vs Heart) comparison as the reference list of p-values, for each type of array and for each analysis strategy. Pearson's correlation coefficient was then computed between the log10 of the reference p-values and the log10 of the p-values associated with any other pairwise comparison between mixtures. This procedure was repeated for each tested combination of preprocessing and differential expression analysis methodologies.
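
The correlation score of the first strategy can be sketched as follows. The function names are hypothetical, the two lists are assumed to be ordered by the same probesets, and all p-values are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log_pvalue_correlation(reference_p, other_p):
    """Correlate log10 p-values of the reference and of another comparison."""
    reference_log = [math.log10(p) for p in reference_p]
    other_log = [math.log10(p) for p in other_p]
    return pearson(reference_log, other_log)


# Made-up p-values: the second list is a weakened version of the reference.
reference = [1e-6, 1e-3, 0.5, 0.04]
weakened = [p ** 0.5 for p in reference]
score = log_pvalue_correlation(reference, weakened)
print(round(score, 6))
```

Because taking the square root only halves each log10 p-value, the toy example gives a correlation of 1; real mixture comparisons add noise and lower the score.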

The aim of the second strategy is to compare results from the HG-U133Plus2 and Human Gene 1.0 ST v1 arrays. First, we retrieved the mapping table between the probesets designed in the two generations of arrays, available in the support documentation of the supplier. We selected the best-match mapping table and used it to subset the lists of p-values obtained from both types of array [42]. For a given pairwise comparison between mixtures, Pearson's correlation coefficient was then computed on the mapped subsets, between the log10 of the lists of p-values computed from both generations of arrays. The procedure was repeated for each tested combination of preprocessing and differential expression analysis methods compatible with the two types of arrays, and for each available pairwise comparison.
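
A minimal sketch of the subsetting step is given below, assuming the best-match table has been loaded as a dictionary from HG-U133Plus2 probeset identifiers to Human Gene 1.0 ST identifiers. The identifiers, p-values and the function name `paired_log_pvalues` are made up for illustration; the resulting aligned lists can be passed to any Pearson correlation routine.

```python
import math


def paired_log_pvalues(p_u133, p_hugene, best_match):
    """Align log10 p-values of both platforms through a best-match mapping.

    Only probesets present in the mapping and in both result lists are kept.
    """
    u133_log, hugene_log = [], []
    for u133_id, hugene_id in best_match.items():
        if u133_id in p_u133 and hugene_id in p_hugene:
            u133_log.append(math.log10(p_u133[u133_id]))
            hugene_log.append(math.log10(p_hugene[hugene_id]))
    return u133_log, hugene_log


# Hypothetical identifiers and p-values, for illustration only.
best_match = {"u133_A": "st_1", "u133_B": "st_2", "u133_C": "st_3"}
p_u133 = {"u133_A": 1e-4, "u133_B": 0.1, "u133_D": 0.5}
p_hugene = {"st_1": 1e-3, "st_2": 0.01, "st_3": 0.2}

u133_log, hugene_log = paired_log_pvalues(p_u133, p_hugene, best_match)
print(len(u133_log))  # 2 probesets are shared through the mapping
```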

The performance evaluation on the Latin square data was done using the PEGASE R package [54]. As 14 samples define the Latin square experiment, 91 pairwise comparisons can be performed between triplicates (see the Datasets section above). The differential expression analysis was performed with PEGASE (Student t-test, Regularized t-test and Window t-test), and 91 lists of p-values were generated for each method. To quantify the performance of the analysis in one step, the 91 lists of p-values were concatenated into a single list for each combination of preprocessing and analysis methods. In addition, the known concentrations of spiked RNAs were used to compute 91 lists of expected fold-change (FC) values. The other probesets defined on the array are not expected to vary between samples and were assigned a fold-change value of 1. The 91 lists of expected fold-changes were then converted into 91 binary lists (True if FC < 1 or FC > 1; False if FC = 1) and concatenated into a single binary list. For each combination of preprocessing and differential expression analysis methodologies, the final binary list of spiked probesets and the full list of p-values were used with PEGASE to compute the sensitivity (= recall = TP/(TP+FN)), the false discovery rate (FDR = 1-precision = FP/(FP+TP)), and 1-specificity (= FP/(TN+FP)) for increasing thresholds.
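
The three figures of merit can be sketched as follows for a single threshold. This is an illustration of the formulas above rather than the PEGASE code: the function names are hypothetical, a probeset is called differentially expressed when its p-value is at or below the threshold, and the FDR is set to 0 by convention when nothing is called.

```python
def truth_labels(expected_fold_changes):
    """True for spiked probesets (expected FC != 1), False otherwise."""
    return [fc != 1 for fc in expected_fold_changes]


def rates_at_threshold(p_values, is_spiked, threshold):
    """Return (sensitivity, FDR, 1 - specificity) at one p-value threshold."""
    tp = fp = tn = fn = 0
    for p, spiked in zip(p_values, is_spiked):
        called = p <= threshold
        if called and spiked:
            tp += 1
        elif called and not spiked:
            fp += 1
        elif spiked:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fdr = fp / (fp + tp) if (fp + tp) else 0.0
    one_minus_specificity = fp / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, fdr, one_minus_specificity


# Made-up concatenated lists: 2 spiked and 3 unchanged probesets.
expected_fc = [2, 1, 0.5, 1, 1]
p_values = [0.001, 0.2, 0.01, 0.8, 0.03]
labels = truth_labels(expected_fc)
for threshold in (0.005, 0.05, 0.5):
    print(threshold, rates_at_threshold(p_values, labels, threshold))
```

Sweeping the threshold over the sorted p-values gives the points needed for sensitivity versus FDR and sensitivity versus 1-specificity (ROC) curves.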


  1. Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006, 7: 55–65.
  2. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstråle M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 2003, 34(3):267–73.
  3. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102(43):15545–50.
  4. Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004, 20: 93–9.
  5. Mansmann U, Meister R: Testing differential gene expression in functional groups. Goeman's global test versus an ANCOVA approach. Methods Inf Med 2005, 44(3):449–53.
  6. Hummel M, Meister R, Mansmann U: GlobalANCOVA: exploration and assessment of gene group effects. Bioinformatics 2008, 24: 78–85.
  7. Dinu I, Liu Q, Potter JD, Adewale AJ, Jhangri GS, Mueller T, Einecke G, Famulsky K, Halloran P, Yasui Y: A biological evaluation of six gene set analysis methods for identification of differentially expressed pathways in microarray data. Cancer Inform 2008, 6: 357–68.
  8. Liu Q, Dinu I, Adewale AJ, Potter JD, Yasui Y: Comparative evaluation of gene-set analysis methods. BMC Bioinformatics 2007, 8: 431.
  9. Berger F, De Meulder B, Gaigneaux A, Depiereux S, Bareke E, Pierre M, De Hertogh B, Delorenzi M, Depiereux E: Functional analysis: evaluation of response intensities-tailoring ANOVA for lists of expression subsets. BMC Bioinformatics 2010, 11: 510.
  10. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249–64.
  11. Wu Z, Irizarry R, Gentleman R, Martinez-Murillo F, Spencer F: A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association 2004, 99(468):909–917.
  12. Binder H, Preibisch S: "Hook"-calibration of GeneChip-microarrays: Theory and algorithm. Algorithms Mol Biol 2008, 3: 12.
  13. Sanchez-Graillet O, Rowsell J, Langdon WB, Stalteri M, Arteaga-Salas JM, Upton GJG, Harrison AP: Widespread existence of uncorrelated probe intensities from within the same probeset on Affymetrix GeneChips. J Integr Bioinform 2008, 5(2).
  14. Upton GJG, Sanchez-Graillet O, Rowsell J, Arteaga-Salas JM, Graham NS, Stalteri MA, Memon FN, May ST, Harrison AP: On the causes of outliers in Affymetrix GeneChip data. Brief Funct Genomic Proteomic 2009, 8(3):199–212.
  15. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005, 33(20):e175.
  16. Ferrari F, Bortoluzzi S, Coppe A, Sirota A, Safran M, Shmoish M, Ferrari S, Lancet D, Danieli GA, Bicciato S: Novel definition files for human GeneChips based on GeneAnnot. BMC Bioinformatics 2007, 8: 446.
  17. de Leeuw WC, Rauwerda H, Jonker MJ, Breit TM: Salvaging Affymetrix probes after probe-level re-annotation. BMC Res Notes 2008, 1: 66.
  18. Liu H, Zeeberg BR, Qu G, Koru AG, Ferrucci A, Kahn A, Ryan MC, Nuhanovic A, Munson PJ, Reinhold WC, Kane DW, Weinstein JN: AffyProbeMiner: a web resource for computing or retrieving accurately redefined Affymetrix probe sets. Bioinformatics 2007, 23(18):2385–90.
  19. Lu J, Lee JC, Salit ML, Cam MC: Transcript-based redefinition of grouped oligonucleotide probe sets using AceView: high-resolution annotation for microarrays. BMC Bioinformatics 2007, 8: 108.
  20. Affymetrix Statistical Algorithms Description Document. 2002.
  21. Hubbell E: Gene Logic Workshop on Low Level Analysis of Affymetrix GeneChip data. Estimating signal with next generation Affymetrix software. 2001.
  22. Wu Z, Irizarry RA: Stochastic models inspired by hybridization theory for short oligonucleotide arrays. J Comput Biol 2005, 12(6):882–93.
  23. Carlon E, Heim T: Thermodynamics of RNA/DNA hybridization in high-density oligonucleotide microarrays. Physica A: Statistical Mechanics and its Applications 2006, 362(2):433–449.
  24. Held GA, Grinstein G, Tu Y: Modeling of DNA microarray data by using physical properties of hybridization. PNAS 2003, 100: 7575–7580.
  25. Naef F, Magnasco MO: Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys Rev E Stat Nonlin Soft Matter Phys 2003, 68(1 Pt 1):011906.
  26. Binder H: Thermodynamics of competitive surface adsorption on DNA microarrays. J Phys: Condens Matt 2006, 18: S491.
  27. Halperin A, Buhot A, Zhulina EB: On the hybridization isotherms of DNA microarrays: the Langmuir model and its extensions. J Phys: Condens Matt 2006, 18: S463.
  28. Mulders GCWM, Barkema GT, Carlon E: Inverse Langmuir method for oligonucleotide microarray analysis. BMC Bioinformatics 2009, 10: 64.
  29. Pozhitkov AE, Boube I, Brouwer MH, Noble PA: Beyond Affymetrix arrays: expanding the set of known hybridization isotherms and observing pre-wash signal intensities. Nucleic Acids Res 2010, 38(5):e28.
  30. Burden CJ, Binder H: Physico-chemical modelling of target depletion during hybridization on oligonucleotide microarrays. Phys Biol 2010, 7: 016004.
  31. Mulders GCWM, Barkema GT, Carlon E: Inverse Langmuir method for oligonucleotide microarray analysis. BMC Bioinformatics 2009, 10: 64.
  32. Kroll KM, Barkema GT, Carlon E: Linear model for fast background subtraction in oligonucleotide microarrays. Algorithms Mol Biol 2009, 4: 15.
  33. Hochreiter S, Clevert DA, Obermayer K: A new summarization method for Affymetrix probe level data. Bioinformatics 2006, 22(8):943–9.
  34. Affymetrix Technical Note: Guide to Probe Logarithmic Intensity Error (PLIER) Estimation. 2005.
  35. Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA 2001, 98: 31–6.
  36. Cope LM, Irizarry RA, Jaffee HA, Wu Z, Speed TP: A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 2004, 20(3):323–31.
  37. Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol 2005, 6(2):R16.
  38. Berger F, De Hertogh B, Pierre M, Gaigneaux A, Depiereux E: The "Window t test": a simple and powerful approach to detect differentially expressed genes in microarray datasets. Central European Journal of Biology 2008, 3(3):327–344.
  39. De Hertogh B, De Meulder B, Berger F, Pierre M, Bareke E, Gaigneaux A, Depiereux E: A benchmark for statistical microarray data analysis that preserves actual biological and technical variance. BMC Bioinformatics 2010, 11: 17.
  40. Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001, 17(6):509–19.
  41. Affymetrix Data Resource Center: Gene 1.0 ST Array Data Set.
  42. Affymetrix Support: Microarray Technical Documentations - Array Comparison.
  43. Giorgi FM, Bolger AM, Lohse M, Usadel B: Algorithm-driven artifacts in median Polish summarization of microarray data. BMC Bioinformatics 2010, 11: 553.
  44. Affymetrix Data Resource Center: Latin Square Data for Expression Algorithm Assessment.
  45. Cui X, Hwang JTG, Qiu J, Blades NJ, Churchill GA: Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 2005, 6: 59–75.
  46. Opgen-Rhein R, Strimmer K: Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach. Stat Appl Genet Mol Biol 2007, 6: Article9.
  47. Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3: Article3.
  48. Hooyberghs J, Van Hummelen P, Carlon E: The effects of mismatches on hybridization in DNA microarrays: determination of nearest neighbor parameters. Nucleic Acids Res 2009, 37(7).
  49. Sugimoto N, Nakano S, Katoh M, Matsumura A, Nakamuta H, Ohmichi T, Yoneyama M, Sasaki M: Thermodynamic Parameters To Predict Stability of RNA/DNA Hybrid Duplexes. Biochemistry 1995, 34: 11211–11216.
  50. Barrera L, Benner C, Tao YC, Winzeler E, Zhou Y: Leveraging two-way probe-level block design for identifying differential gene expression with high-density oligonucleotide arrays. BMC Bioinformatics 2004, 5: 42.
  51. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2010. [ISBN 3-900051-07-0]
  52. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
  53. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy-analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20(3):307–15.
  54. Berger F, De Hertogh B, Pierre M, Bareke E, Gaigneaux A, Depiereux E: PHOENIX, a web interface for (re)analysis of microarray data. Central European Journal of Biology 2009, 4(4):603–618.



We acknowledge financial support from KULeuven grant STRT1/09/042. Discussions with K. M. Kroll are gratefully acknowledged.

Author information



Corresponding authors

Correspondence to Fabrice Berger or Enrico Carlon.

Additional information

Authors' contributions

FB scripted the procedures and ran the analysis of the data. Both authors contributed to the analysis and interpretation of the results and to the writing of the paper. Both authors read and approved the final manuscript.

Electronic supplementary material


Additional file 1: HG-U133Plus2-medianpolish-comp.xls summarizes the performances of affyILM with the transposed medianpolish on the HG-U133 Plus 2 arrays (Tissue mixture correlation study). (XLS 52 KB)

Additional file 2: Hugene10st-medianpolish-comp.xls summarizes the performances of affyILM with the transposed medianpolish on the Hugene 1.0 ST arrays (Tissue mixture correlation study). (XLS 31 KB)

Additional file 3: HG-U133Plus2-VS-Hugene10st-medianpolish-comp.xls summarizes the performances of affyILM with the transposed medianpolish using cross-platform comparisons between HG-U133 Plus 2 and Hugene 1.0 ST arrays (Tissue mixture correlation study). (XLS 33 KB)

Additional file 4: affycomp-LS95-report.pdf provides the report of the affycomp III evaluation on the Latin square HG-U95 experiment. (PDF 1 MB)

Additional file 5: affycomp-LS133-report.pdf provides the report of the affycomp III evaluation on the Latin square HG-U133 experiment. (PDF 2 MB)

Additional file 6: affycomp-LS95-scores.xls reports the scores assessed by the affycomp III evaluation of selected methods on the Latin square HG-U95 experiment. (XLS 46 KB)

Additional file 7: affycomp-LS133-scores.xls reports the scores assessed by the affycomp III evaluation of selected methods on the Latin square HG-U133 experiment. (XLS 45 KB)


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Berger, F., Carlon, E. From hybridization theory to microarray data analysis: performance evaluation. BMC Bioinformatics 12, 464 (2011).



  • Outlying Probe
  • Differential Expression Analysis
  • Preprocessing Method
  • Heart Sample
  • Biological Noise