A probetreatmentreference (PTR) model for the analysis of oligonucleotide expression microarrays
 Huanying Ge^{1},
 Chao Cheng^{1} and
 Lei M Li^{1, 2}Email author
https://doi.org/10.1186/147121059194
© Ge et al; licensee BioMed Central Ltd. 2008
Received: 06 June 2007
Accepted: 14 April 2008
Published: 14 April 2008
Abstract
Background
Microarray preprocessing usually consists of normalization and summarization. Normalization aims to remove nonbiological variations across different arrays. The normalization algorithms generally require the specification of reference and target arrays. The issue of reference selection has not been fully addressed. Summarization aims to estimate the transcript abundance from normalized intensities. In this paper, we consider normalization and summarization jointly by a new strategy of reference selection.
Results
We propose a ProbeTreatmentReference (PTR) model to streamline normalization and summarization by allowing multiple references. We estimate parameters in the model by the Least Absolute Deviations (LAD) approach and implement the computation by median polishing. We show that the LAD estimator is robust in the sense that it has bounded influence in the threefactor PTR model. This model fitting, implicitly, defines an "optimal reference" for each probeset. We evaluate the effectiveness of the PTR method by two Affymetrix spikein data sets. Our method reduces the variations of nondifferentially expressed genes and thereby increases the detection power of differentially expressed genes.
Conclusion
Our results indicate that the reference effect is important and should be considered in microarray preprocessing. The proposed PTR method is a general framework to deal with the issue of reference selection and can readily be applied to existing normalization algorithms such as the invariantset, subarray and quantile method.
Background
Microarray is one of the most successful techniques in the field of functional genomics. As a high throughput approach, it provides a global gene expression profile of a living cell under certain conditions. The Affymetrix expression microarray is the most widelyused platform. It uses 11–20 probes which have 25 oligonucleotide bases, to represent one gene, and as a whole they are called a probeset. Associated with each perfect match (PM) probe, a mismatch (MM) probe that differs only in the middle (13^{ th }) base is included in some expression arrays.
From each array, we get fluorescence intensity of each probe after image processing. The estimation of gene expression from probe intensities is a statistical problem where much effort has been made. Among them are Affymetrix's MAS 5.0 [1], Li and Wong's dChip [2–4], and RMA [5–7]. Each method mainly consists of two modules: normalization and summarization. Normalization aims to reduce nonbiological variations that are introduced during sample preparation, hybridization and instrumental reading. Summarization combines normalized signal values in each probeset and produces a final abundance estimate for the corresponding gene.
Normalization requires the specifications of reference and target arrays. We can normalize target arrays against a reference, no matter whether it is a raw microarray or an artificiallydefined one. In the invariantset normalization, which is part of the dChip software, the median intensity of each array is calculated, and the array with the median of these overall medians is chosen as the reference; however, other options are also allowed. The normalization is carried out according to the smooth curve fitted from the rankinvariant intensity set between the reference and target array [2]. The quantile normalization in RMA uses complete data information and defines a pseudoreference by the averaged quantiles. The normalization is carried out based on the quantilequantile transformation [5, 8]. In the subarray normalization, once a reference and a target are given, it divides the whole array into subarrays and normalizes probe intensities within each subarray using least trimmed squares (LTS) regression method [9–11]. However, the issue of reference selection has not been well addressed.
In this paper, we propose a ProbeTreatmentReference (PTR) model that takes into account the referenceeffect. The method based on this model aims to streamline normalization and summarization by allowing multiple references. Instead of one reference array, it uses a reference set to do microarray preprocessing. Compared to the twofactor model in [6] and [12], which contains probe and arrayspecific effect (later we refer to it as the PA model), we reformulate the arrayspecific factor by a treatment factor and a reference factor in the PTR model. We estimate the parameters in the model by the least absolute deviations (LAD) approach, which is robust in the sense of bounded influence [13].
The proposed PTR method tries to integrate normalization with summarization in a unified framework. It can be applied to several existing normalization methods, and is particularly useful for the subarray normalization which aims to reduce spatial variation. Because the referencespecific effect is adjusted at the probeset level, implicitly, from the reference set, our model defines an "optimal reference" for each probeset, which may not come from the same raw array. We show that the foldchange estimates from the PTR method are resistant to a bad reference array. It reduces the variation of nondifferentially expressed genes and thereby increases the detection power of differentially expressed genes.
Results
Microarray data sets
The Affymetrix HGU133A data set [14] contains triplicates of 14 experiments with 42 spikein genes. These spikein genes are organized into 14 gene groups, each of which contains 3 genes of the same concentration. The concentrations range from 0–512 pM according to a 14 × 14 Latin square design. In this paper, we analyze experiment 3 and 4 particularly. Experiment 3 contains the triplicate: Exp03_R1, Exp03_R2 and Exp03_R3 ; experiment 4 contains the triplicate: Exp04_R1, Exp04_R2 and Exp04_R3. The concentrations of 36 spikein genes in experiment 4 are twofold higher; the concentrations of the remaining 6 spikein genes are 0 in either experiment 3 or 4. Later we refer to these six arrays as data set "Expt34". Besides, we generate a perturbed data set by adding noise to array Exp03_R1 and keeping the other five arrays unchanged. The noise value is defined by δ_{ i }= max(0, x_{ i }), where x_{ i }is a normally distributed random variable with a zero mean and variance equal to that of the array Exp03_R1. We denote this perturbed array by Exp03_R1*.
We use another data set, the "Golden spike" data set of Choe et al. in 2005 [15], to assess the detection power of differentially expressed genes. The data set contains six Affymetrix DrosGenome1 chips, three replicates for control and three for treatment. A total of 1309 genes are differentially expressed with predefined fold changes from the set, {1.2, 1.5, 1.7, 2.0, 2.5, 3.0, 3.5, 4.0}, between the treatment and control group.
The PTR method

The allpairwise strategy: all arrays are included in the reference set; for each reference, all the other arrays are its targets.

The cross strategy: all arrays are included in the reference set; for each reference from control, we take replicates from treatment as its target arrays; conversely, for each reference from treatment, we take replicates from control as its target arrays.
Second, we use existing algorithms such as the invariantset, subarray and quantile, to do normalization between each reference and its target arrays. In this paper, we use the function normalize.AffyBatch.invariantset in the BioConductor's affy package [16] to carry out the invariantset normalization. We keep the default parameter settings but allow the reference to be selected manually. We use the procedure described in [9] for the subarray normalization. The window size is 25, the overlapping size is 10 and the trimming factor is 0.55 for the HGU133A dataset.
We write a new R function for the quantile normalization. The algorithm is not identically the same as the one [8] used in the RMA package. That is, we normalize each target array against a raw reference array as follows: sort the reference and the target array respectively in the ascending order; replace the sorted intensity values in each target array with those in the reference array; rearrange intensities in each target array according to their original orders [10].
Finally, for each probeset, we summarize the arrays from logtransformed PM intensities according to the threefactor PTR model:
log_{2}(Intensity) = global term + treatment effect + probe effect + reference effect + error.
The model states that the variation of logarithmic probe signal values could be sufficiently explained by treatments, probe affinities and references used in normalization. The global term and treatment effects are taken to be the final transcript abundance estimates; the reference effects and the probe effects could be used for diagnosis.
Perturbed data set
The top, middle, and bottom row show the results from the invariantset, quantile, and subarray normalization respectively. In A1B1C1, we take the perturbed array Exp03_R1* as the reference; in A2C2, we take the array Exp03_R2 as the reference, while in B2, we take the average quantiles as the pseudoreference; in A3B3C3, we simply use the PTR method coupled with the reference set of all six arrays. We see that the invariantset is more sensitive to the perturbed reference array Exp03_R1* (A1) and it performs well if an appropriate reference is selected (A2). The quantile method shows a similar but less severe phenomenon when the quantilequantile transformation is based on the array Exp03_R1*. In contrast, the quantile normalization using average quantiles as the pseudoreference is more resistant to this perturbation. For the subarray normalization, the perturbation does not affect the ratio estimates, since the normalization transformations are done in each subwindow which greatly reduces the effects of this global noise.
The PTR method improves the performance of all three normalization algorithms. We see that the logratios of spikein genes are closer to the expected logratio value M = 1, and nonspikein genes have smaller variations along M = 0. The PTR method deals with the single perturbed reference array Exp03_R1* well, and it estimates the final transcript abundance robustly.
Variation reduction by the PTR method
It is shown that the PTR method achieves smaller variances for the nonspikein genes in the MA plots of the perturbed data set. To further measure the effectiveness of the PTR method in variation reduction, we compare it with other preprocessing methods using "Expt34" data set. The nonspikein genes are expected to have the same expression in both experiment 3 and 4, and the differences between them are due to the nonbiological factors. We preprocess six arrays with different methods of normalization and summarization and estimate the expression values for both experiment 3 and 4. In the MA plots, we fit LOESS curves [17] to the absolute values of M against A using nonspikein genes only. The curves shown in Figure 3 measure the variations of nonspikein genes between two experiments. The left, middle, and right plot respectively show the results from the invariantset, quantile, and subarray normalization method. Except for the PTR method, we summarize the normalized results using the expresso function in the affy package [16], with either the LiWong's MBEI [3] or the RMA's median polish [6] method. From the plots, we see that the PTR method achieves the smallest variations for each normalization algorithm. In the left plot, we use the invariantset normalization for both the PTR method and the other singlereference methods. The plot shows that the same preprocessing method results in different variation curves with different reference choices. The PTR method reduces the background noise significantly.
In the right plot of Figure 3, we use the subarray normalization, and once again the PTR method improves the performance. Furthermore, the curves from the single reference results can be clustered into two categories. Each reference array from experiment 3 performs better at the lowintensity range but worse in the highintensity range; whereas each from experiment 4 performs slightly better in the highintensity range but worse in the lowintensity range. A similar yet less prominent pattern also exists in the results using the invariantset and quantile normalization. Replicates within either experiment 3 or 4 are similar, but arrays from two experiments lead to more different results. The PTR method takes reference arrays from both experiments, combines two types of expression results and gives the smallest variation in both low and highintensity ranges.
The LOESS assessment has demonstrated that the PTR method can reduce the variation significantly for the invariantset, quantile and subarray normalization method. This variation reduction, especially in the lowintensity range, is particularly valuable in the measurement of gene expressions.
Improvement on detection of differentially expressed genes
In addition, we make the similar comparisons on the "golden spike" data set. We set the differentially expressed genes using fold change threshold to 1.2 or 1.7. The ROC curves are shown at the bottom of the Figure 4. We can see that the PTR method performs better with the invariantset normalization than the quantile normalization. However, both of them outperform the other preprocessing methods.
The assessment on both data sets demonstrates the higher specificity and sensitivity achieved by the PTR method. The noise reduction improves the detection of deferentially expressed genes.
Implicit reference selection in the PTR modelfitting
Discussion
We explicitly address the reference issue in preprocessing expression microarray and propose the PTR method to carry out normalization and summarization jointly. The PTR method can be applied to existing normalization methods such as the invariantset, subarray, and quantile. Particularly, it is the first time we propose a practical scheme to implement the subarray normalization. The subarray takes into account the spatial pattern of hybridization, and can properly normalize the majority of probe intensities even in the presence of scratches or serious nonhomogeneous hybridization. The PTR method enhances this ability using multiple references in normalization and implicitly selecting an "optimal reference" for each probeset during summarization. In general, the PTR method is applicable to preprocessing biological measurements from arraytype instruments such as exon arrays, beads arrays, pathway arrays, tissue arrays and other customized arrays.
The proposed PTR method primarily aims to measure expressions by Affymetrix chips with technical replicates as in many designed microarray experiments. In the normalization step, the sample information is not required although the processing involves multiple pairwise comparisons. In the summarization step, the sample information is necessary for estimating expression differentiation. In some situations, especially the class discovery and class prediction in the cancer study, no technical replicates are available for each sample. To apply the PTR method, we need to assign different treatment labels to different samples that are not replicates in any sense. In this case, one microarray has served as two functional parts: array and referencespecific effect. For now we could not rule out the possibility of partial confounding of array and referencespecific effect in general. But we argue that, according to our scheme, normalization is carried out by pooling all probesets while summarization is carried out for each probeset separately.
Consequently, we expect that the referencespecific parameter in each probeset reflects reference array's block effect in the present of other probesets. The association between array and referencespecific effect should not be significant. It is particular true in the randomized designed Affymetrix chips.
We have not discussed the background correction issue, but the existing background correction methods can be implemented before the normalization step of the PTR method. Interestingly, we have shown that the PTR method reduces the variation of nondifferentially expressed genes, especially in the low intensity range. As a consequence, we improve the signaltonoise ratio and increase the detection power of the deferentially expressed genes. The improvement is achieved without any additional information.
We include all arrays in the reference set and use the cross strategy to select the corresponding target arrays for the data set in this paper. We are aware that they may not be the optimal strategy for reference and target selection. However, the PTR method can protect against and detect abnormal arrays. First, by using the robust LAD method in the summarization, we can accurately estimate the transcript abundance as long as the majority of normalization results are right. Second, we can evaluate the array quality by the reference effect distribution, as shown in Figure 5. The array showing distinct distribution from other arrays is likely to be a bad candidate for reference and may be excluded from the reference set in the nextround PTR processing.
The computational complexity of the PTR method can be decomposed into two parts: normalization and summarization. In an experiment with N arrays, at most we need to carry out N(N1) pairwise normalization. Thus the complexity is approximately N times of that of singlereference normalization. The median polishing algorithm used in the summarization is an iterative procedure. Compared to the algorithm used in RMA, the median polishing in the PTR is threeway instead of twoway with approximately N times the memory. According to our experience, it does not need substantially more iterations to achieve the same accuracy.
The current implementation of the PTR method needs the R running environment for data input/output and visualization. The normalization part could be carried out by C++ or R depending on the selection of normalization methods. The core computation of the PTR summarization is written in C++ and is called by an R interface. Take the "Expt34" for example, if we use the allpairwise reference selection strategy, the normalization procedure costs about 2 minutes for the quantile algorithm and 5 minutes for the invariantset algorithm in a computer with Intel Pentium 2.8 GHz and 4 GB of RAM. The summarization part takes about 10 minutes on the same machine. Due to the specification of memory allocation in R, our current implementation can process a data set up to total 20 arrays in about one hour on the same machine mentioned above. This capability can be expanded if we code the PTR method entirely in C/C++. For a large dataset, we need to adopt algorithms that allocate memory more efficiently. We are now working along this direction to improve the implementation of the PTR method. We also notice that the examples we have examined in the article involve at most three replicates per treatment group. In the case of more samples per treatment group such as 10 or 20, the reference selection may not be a serious issue due to the large sample size. The performance of the PTR method is unclear and the RMA method is a safer choice.
Conclusion
We propose the ProbeTreatmentReference (PTR) model to deal with the referencespecific effect. The PTR method streamlines normalization and summarization by allowing multiple references in normalization. It can readily be applied to existing referencedependent normalization methods. We evaluate the effectiveness of the PTR method on two Affymetrix spikein data sets. The method reduces the variations of nondifferentially expressed genes whereas increases the detection power of the differentially expressed genes.
Methods
Probetreatmentreference model
The scheme of the PTR method is shown in Figure 1. We carry out the PTR modelfitting on each probeset. The inputs of the fitting are normalized results using multiple references. We note that only PM probes are used in summarization even if the MM probes are available. Let variable y_{ ijkl }be the j^{ th }normalized probe intensity value of the l^{ th }replicate for the i^{ th }treatment, with respect to the k^{ th }reference array, where i = 1, ..., I; j = 1, ..., J; k = 1, ..., K; l = 1, ..., L. We postulate that log_{2}(y_{ ijkl }) follows the threefactor model as below:
log_{2}(y_{ ijkl }) = μ + α_{ i }+ β_{ j }+ γ_{ k }+ ε_{ ijkl }, (1)
where μ is the global term; α_{ i }represents the i^{ th }treatmentspecific effect; β_{ j }represents the j^{ th }probespecific effect; and γ_{ k }represents the k^{ th }referencespecific effect. It is neither simple nor obvious to specify the distribution form for ε_{ ijkl }. In addition, they are correlated. In order to make the model identifiable, we impose one of the following constraints:

Sum constraints: ∑_{ i }α_{ i }= 0, ∑_{ j }β_{ j }= 0, ∑_{ k }γ_{ k }= 0;

Median constraints: median({α_{ i }}) = 0, median({β_{ j }}) = 0, median({γ_{ k }}) = 0.
We take μ + α_{ i }as the final transcript abundance estimate for the i^{ th }treatment.
The PTR model is an extension of the RMA's twofactor PA model. Rather than the arrayspecific effect in the twofactor model, the treatment and referencespecific effects are used instead in the PTR model. Suppose that we have two treatments and each has three replicates. The twofactor model has 22 (1 + 5 + 10) parameters, assuming the probeset has 11 PM probes. After the modelfitting, the expression values for each treatment are calculated using the medians or averages across the three replicates. The PTR model has 23 (1 + 1 + 10 + 5) parameters, if we include all six arrays in the reference set. The expression values for each treatment are calculated directly. Meanwhile, it does not necessarily lead to overfitting if we use more reference arrays, since the number of normalized array increases accordingly. For example, using the cross strategy, three references generate 132 (4 × 11 × 3) normalized PM intensity values, while six references generate 264 (4 × 11 × 6) values. Of course, we are aware that these normalized intensities do correlated to one another.
Least absolute deviations (LAD) estimation and its computation
Consequently, we can minimize the sum of absolute deviations by the "median polish" approach [21]. Namely, in each step, we sweep out the median for each level of one categorical variable. In each iteration, we run the median polishing through all categorical variables. The SAE is reduced during each step and we stop the iteration once it becomes stable. Our experiences show that at most 20 iterations can give satisfactory results.
Robustness of LAD estimator
means that the signal value y_{ j }is from the 2^{ nd }treatment, 11^{ th }probe and 6^{ th }reference.
where f(0) is the density function of ε at zero and the last term is the sign function, taking value 1, 0, or 1. Since f(0) > 0, the influence function of the PTR model is bounded and the largest influence on estimates only depends on f(0). We have check the influence of treatment effect estimate in the HGU133A spikein data set, and they are less than $\frac{1}{6}$ for most probesets.
Availability and requirements
Software package for the PTR method is available on http://leililab.cmb.usc.edu/yeastaging/projects/ptr/.
Declarations
Acknowledgements
The work is supported by the grant R01 GM7530801 from NIH.
Authors’ Affiliations
References
 Affymetrix: Affymetrix Microarray Suite User Guide, version 5. 2001, Santa Clara, CAGoogle Scholar
 Li C, Wong WH: Modelbased analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology. 2001, 2 (8):Google Scholar
 Li C, Wong WH: Modelbased analysis of oligonucleotide arrays: expression index computation and outlier detection. PNAS. 2001, 98: 316. 10.1073/pnas.011404098.PubMed CentralView ArticlePubMedGoogle Scholar
 dChip software: User's Manual. 2005, [http://biosun1.harvard.edu/complab/dchip]Google Scholar
 Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19 (2): 185193. 10.1093/bioinformatics/19.2.185.View ArticlePubMedGoogle Scholar
 Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31 (4):Google Scholar
 Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4 (2): 249264. 10.1093/biostatistics/4.2.249.View ArticlePubMedGoogle Scholar
 Bolstad BM: Probe Level Quantile Normalization of High Density Oligonucleotide Array Data. 2001, [http://bmbolstad.com/stuff/qnorm.pdf]Google Scholar
 Cheng C, Li LM: Subarray normalization subject to differentiation. Nucleic Acids Res. 2005, 33 (17): 55655573. 10.1093/nar/gki844.PubMed CentralView ArticlePubMedGoogle Scholar
 Li LM, Cheng C: Differentiation Detection in Microarray Normalization, Chapter 2 in "Methods in Microarray Normalization. 2008, CRC, Phillip StaffordGoogle Scholar
 Cheng C, Fabrizio P, Ge H, Wei M, Longo VD, Li LM: Significant and Systematic Expression Differentiation in LongLived Yeast Strains. PLoS ONE. 2007, 2 (10): e109510.1371/journal.pone.0001095.PubMed CentralView ArticlePubMedGoogle Scholar
 Holder D, Raubertas RF, Pikounis VB, Svetnik V, Soper K: Statistical analysis of high density oligonucleotide arrays: a SAFER approach. Proceedings of the ASA Annual Meeting. 2001, . Atlanta, GAGoogle Scholar
 Bloomfield P, Steiger WL: Least absolute deviations: theory, applications, and algorithms. 1983, BirkhauserGoogle Scholar
 Affymetrix: HUMAN GENOME U133 DATA SET. [http://www.affymetrix.com/support/technical/sample_data/datasets.affx]
 Choe SE, Boutros M, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology. 2005, 6 (2):Google Scholar
 Gautier L, Cope L, Bolstad BM, Irizarry RA: affyanalysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004, 20:Google Scholar
 Cleveland WS: Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association. 1979, 74 (368): 829836. 10.2307/2286407.View ArticleGoogle Scholar
 Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5 (10):Google Scholar
 Koenker RW, D'Orey V: Algorithm AS 229: Computing Regression Quantiles. Applied Statistics. 1987, 36 (3): 383393. 10.2307/2347802.View ArticleGoogle Scholar
 Barrodale I, Roberts F: An improved algorithm for discrete L1 linear approximation. SIAM Journal on Numerical Analysis. 1973, 10 (5): 839848. 10.1137/0710069.View ArticleGoogle Scholar
 Tukey JW: Exploratory Data Analysis. 1977, AddisonWesleyGoogle Scholar
 Narula SC, Wellington JF: The Minimum Sum of Absolute Errors Regression: A State of the Art Survey. International Statistical Review. 1982, 50 (3): 317326.View ArticleGoogle Scholar
 Boldin MV, Galina SI, Tyurin YN: SignBased Methods in Linear Statistical Models. 1997, American Mathematical SocietyGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.