Graphical technique for identifying a monotonic variance stabilizing transformation for absolute gene intensity signals

Background The usefulness of log2 transformation for cDNA microarray data has led to its widespread application to Affymetrix data. For Affymetrix data, where absolute intensities are indicative of number of transcripts, there is a systematic relationship between variance and magnitude of measurements. Application of the log2 transformation expands the scale of genes with low intensities while compressing the scale of genes with higher intensities thus reversing the mean by variance relationship. The usefulness of these transformations needs to be examined. Results Using an Affymetrix GeneChip® dataset, problems associated with applying the log2 transformation to absolute intensity data are demonstrated. Use of the spread-versus-level plot to identify an appropriate variance stabilizing transformation is presented. For the data presented, the spread-versus-level plot identified a power transformation that successfully stabilized the variance of probe set summaries. Conclusion The spread-versus-level plot is helpful to identify transformations for variance stabilization. This is robust against outliers and avoids assumption of models and maximizations.


Background
Microarrays measure the abundance of thousands of mRNA transcripts in one experiment. Currently, two different microarray technologies dominate gene expression research efforts, namely, custom spotted arrays and Affymetrix GeneChips ® . Custom spotted arrays are characterized by long single strands of complimentary DNA (cDNA) affixed to a solid substrate in spots to which two different fluorescently labelled samples are hybridized. Affymetrix GeneChips ® are characterized by the use of several (11)(12)(13)(14)(15)(16)(17)(18)(19)(20) short oligonucleotide (25-mers) probes to interrogate for a single gene and to which one fluorescently labelled sample is hybridized. For both technologies, the fluorescence intensity for each probe/spot is assumed to be indicative of the amount of mRNA transcript in the sample. Since two samples are hybridized to custom spotted arrays, the ratios of the experimental signal relative to a control signal of the resulting intensities are analyzed. On the other hand, for the one sample hybridized to an Affymetrix GeneChip ® , the absolute intensities indicative of the number of transcripts are the resulting gene expression measures.
To satisfy assumptions required for statistical analyses, data from both technologies are often transformed by a suitable function [1]. Specifically, the reasons for applying a transformation to a dataset include to achieve stability in variance, or to achieve linearity, additivity and/or normality. Sometimes a transformation is applied because it facilitates interpretation and induces symmetry. Such is the case when applying the log 2 transformation to custom spotted microarray (cDNA) data. This transformation defines the relative abundance of a transcript in an experimental sample in comparison to the control sample as the unit of analysis. For genes where the experimental intensity is greater than the control intensity, the ratio could take values in the range (1, ∞); for genes where the experimental intensity is less than the control intensity, the ratio is compressed to the range (0, 1). The log 2 transformation of ratio data promotes symmetry by treating under and over expressed genes similarly. A typical example used to elucidate this concept is the one that considers two genes, one with a ratio of 2.0 representing a doubling of intensity and another with a ratio of 0.5 represents a halving of intensity; on a log 2 scale these values are symmetric about zero (no change) with values 1 and -1 respectively [1]. The application of the log 2 transformation to cDNA data thus has proven useful for achieving symmetry as well as ease of interpretation.
The usefulness of the log 2 transformation for cDNA microarray data has led to its widespread application to Affymetrix GeneChip ® data as well [2][3][4]. While this transformation is appealing to the biologists due to the reasons stated above for cDNA arrays, it neither facilitates interpretation nor does it necessarily render true the assumptions required for statistical analysis, such as equal variance, normality, etc., in Affymetrix GeneChip ® data. When data are amounts or counts, as with Affymetrix GeneChip ® data, where the intensities are assumed to represent amounts of mRNA transcripts, there is often a systematic relationship between the variance of the measurements and magnitude of the measurements. The log 2 transformation expands the scale of genes with low absolute intensities while compressing the scale of genes with higher intensities; it essentially reverses the direction of the relationship between the variance and the mean expression level. That is, after the transformation lowly expressed genes have a higher variance and highly expressed genes have a lower variance [5].
Recently, other variance stabilizing transformation methods for microarray gene expression data have been introduced [6][7][8]. The gene expression data have been modelled as y = α + µe η + ε, (1) where y represents the measured intensity, α represents the average background, µ represents the true gene expression level, with normally distributed error terms η and ε with zero mean and differing non-zero variances [6]. For this model, when µ is large, y is distributed approximately as a lognormal random variable. Therefore, a log transformation of y results in observations with constant variance when µ is large. In this case the generalized logarithm transformation is (2) where , stabilizes the asymptotic variance [6] for large samples. Rocke and Durbin [7] compared three logarithmic-based transformations (the generalized logarithm, the started logarithm, and the log-linear hybrid) using a simulation study and in application to an existing dataset. They found the generalized logarithm resulted in better overall performance in achieving variance stabilization.
In this paper, an Affymetrix GeneChip ® HG-U133A dataset consisting of 16 technical replicates (QAQC Dataset), where the Microarray Suite Software (version 5.0) was used to derive the expression summaries for all probe sets, is used to demonstrate some of the problems associated with applying the log 2 transformation to absolute intensity data. Another approach to identify an appropriate variance stabilizing transformation using the spread-versuslevel plot [9] is proposed. The spread-versus-level plot plots the log of the median on the x-axis ( ) against the log of the fourth-spread on the y-axis. The slope of the spread-versus-level plot (b) can be used to suggest a power transformation that will most appropriately stabilize the variance. When using a spread-vs-level plot, the suggested power of the transformation is p = 1 -b.
The effectiveness of using the spread-versus-level plot to identify a suitable monotone transformation that appropriately stabilizes the variance for probe set expression summaries is demonstrated using the QAQC dataset.

Results
To demonstrate the dependence of the variance on level of gene expression, a plot of the mean for each probe set across the 16 technical replicates by the associated variance was constructed ( Figure 1). It is apparent from the plot that the variance increases with increasing gene expression. Figure 2 represents the plot of the mean versus the corresponding variance for the log 2 transformation. This plot demonstrates, as previous authors have found [4,5], that the log transformation increases the variability on the lower end of the range while compressing the variability among the highly expressed genes. In other words, clearly, the log transformation has not stabilized the variance in this case. Figure 3 the spread-versus-level plot for the same data is presented. The estimated slope of the linear regression model fit to the spread-vs-level plot was = 0.569. Therefore, the estimate of the power transformation turned out to be, = 1 -0.569 = 0.431. Typically, for simplicity the parameter estimate is rounded to the nearest half-integer, which in this case would lead to = 0.5. The family of power transformations is defined as where, x is the observed data, p is a number to be determined and Φ is the monotonic transformation. Substituting the p obtained from the plot yields, Applying this transformation to the signal intensities in the QAQC dataset and plotting mean versus the variance as before (Figure 4) shows that stabilization of the variance is achieved. There are a few outlying probe sets identifiable in Figure 4. One of the advantages of using the estimated slope from the spread-versus-level plot to identify a power transformation is that this plot uses robust measures of location (median) and spread (fourthspread). For this dataset, the "outlying" probe sets were defined as those with a variance greater than 50. The estimated slope after removing the outliers was 0.568; as expected, the slope of the linear regression fit to the spread-versus-level plot is fairly robust against outliers. Thus, the transformation identified by the spread-versuslevel plot is not affected by these few outlying probe sets.  [10]. This plot more clearly demonstrates the improvement in variance stabilization achieved by this transformation. The fitted lowess regression curve is overlaid, where the span is 1/3, is roughly horizontal, indicating constant variance.

Mean versus variance plot
As noted in this figure, there are still a few probe sets at the highest intensity level that suffer from increased variability. These probe sets are among those expression summary values that are greater than the expression summary values for the most concentrated hybridization spike-in control (that is, the AFFX-CreX and AFFX-r2-P1-cre probe sets, with a mean of 2796 on MAS 5.0 scale, 103.7 on the transformed scale). From our quality assessment of the hybridizations, there is linearity over the range of concentrations covered by all the hybridization spike-in controls.
However, for signal intensities outside of this range, linearity cannot be determined. In a statistical analysis, these may be deemed outliers and may need to be examined separately from the remaining probe sets.
After applying the generalized logarithm transformation [6,8] to the QAQC Dataset using the Bioconductor vsn() library [8] and plotting the rank of the median probe set signal intensities by the associated fourth-spread ( Figure  6) reveals the retention of increased variability among genes with smaller signal intensities, though vastly improved in comparison to the log 2 transformation. The fitted lowess regression curve is overlaid where the span is 1/3. The coefficient of variation for the fitted values from the lowess regression for the transformed data was 0.15 while the coefficient of variation of the fitted values from the lowess regression for the generalized logarithm transformation was 0.22. Bootstrap percentilebased confidence intervals for the coefficient of variation were estimated by resampling the paired median and 2 2 x − spread observations, fitting a lowess regression to the bootstrapped data, then estimating the coefficient of variation using the lowess fitted values. The 99% confidence interval for the vsn() transformed data was (0.2109, 0.2265); the 99% confidence interval for the transformed data was (0.1431, 0.1644). Thus, it may be concluded that the variance was better stabilized by the transformation.

Discussion
Reasons for transforming data often include the desire to find a scaling of the data that would simplify the analysis, for example, by making the data meet assumptions such as symmetry, equal variance, straight line relationship, and/or additivity of effect. One common assumption of many traditional statistical methods is that the variance is constant over the levels of gene expression. Since the absolute intensities are positive, this assumption is generally not valid because the variance increases with increasing intensities. The log 2 transformation, which is routinely applied to absolute intensities that result from Affymetrix GeneChip ® experiments [2,3], unfortunately does not achieve the desired results. As shown ( Figure 2) the transformation only reverses the mean-variance relationship. To overcome this drawback of the log 2 transformation, methods such as the Local Pooled Error test have been suggested [4] to more accurately estimate the variance for lowly expressed log 2 signal intensities. In this article, as one alternative, the spread-versus-level plot to identify more appropriate monotonic transformations that stabilize the variance was proposed. The spread-vs-level plot proved useful for identifying a transformation useful for simplifying an analysis by meeting assumptions such as symmetry and equal variance.
For the QAQC Dataset, the spread-versus-level plot led to a square root transformation, rather than the log 2 transformation. This dataset consisted of 16 technical replicates; since the spread-vs-level plot is used to empirically identify a power transformation, the most appropriate power transformation to stabilize the variance is completely data

Spread versus level plot
Therefore, the transformation suggested should be treated as an illustration and is not necessarily the most appropriate one for all data sets. In fact, there may be situations in which our method may lead to the more commonly used log transformation itself, while for other datasets it may not lead to any power transformation that could stabilize the variance. We note, however, that in a serial dilution experiment conducted to assess linearity of signal [11], the authors found the square root transformation by trial and error to be an appropriate transformation. Thus this empirically derived transformation obtained through the use of the spread-versus-level plot confirmed a similar finding.

Conclusions
The estimated slope from the spread-versus-level plot could be used to identify a power transformation that will appropriately transform gene expression data to stabilize the variance. It is computationally easy to implement as it avoids the requirement of a statistical model and maximization of a likelihood. Moreover, since this plot uses robust measures of location and spread, the slope of the linear regression fit to the spread-versus-level plot is fairly robust against outliers.

Transformation
One of the most commonly used families of variance stabilizing transformations is the Box-Cox transformation [12]. This family of transformations is defined by equation (3). Box and Cox selected p as the value that maximized the likelihood under any given model. A graphical technique based on the spread-versus-level plot may also be used to empirically estimate p [9]. This method is robust against outliers and avoids the need for maximization of the likelihood. The spread-versus-level plot plots the log of the median (x-axis) by the log of the fourthspread (y-axis). The slope b of the spread-versus-level plot identifies the approximate value of the exponent for the power transformation, where p = 1 -b. When the slope b is close to 1, p is close to 0 and therefore the transformation would be ln(x).

Spread-versus-level plot
To construct a spread-versus-level plot, three order statistics, namely the median (M) the upper fourth (F U ) and the lower fourth (F L ) must be identified. These order statistics may be found using the following algorithm. First, sort the data in ascending order. On the basis of ordering, the rank of an observation can be defined as an upward rank (determined by counting up from smallest to largest) or a downward rank (determined by counting down from largest to smallest). Considering both the upward and the downward ranks together, the depth of a data value in a sample is defined as the smaller of its upward rank and its downward rank. For example, the depth of the median is (n + 1)/2. (When the number of observations n is even, the median is interpolated to lie between the two middle values.) The fourths are those values with a depth equal to ([depth of median] + 1)/2. Therefore, each fourth is halfway between the median and the corresponding extreme [13]. The fourth-spread (d F ) is a robust measure of spread and is estimated by (upper fourth) -(lower fourth), denoted by d F = F U -F L . The interquartile range is nearly the same as the fourth-spread but differs slightly when the number of observations is even.
To understand why a spread-versus-level plot is useful in identifying an appropriate variance stabilizing transformation [9], let X be a real random variable from a population with median M X , lower fourth F L , upper fourth F U , and fourth spread d F . The spread-versus-level plot identifies a transformation y = Φ(x) for which the fourth-spread of Y is constant. Assuming the second derivative Φ''(x) and higher order derivatives exist for all x, a Taylor M X to F L . Therefore, the median and upper and lower fourths for the data transformed by y = Φ(x) are Substituting these expressions into equation (4) to get (F U ) Y and (F L ) Y , then taking the difference to get the fourth-spread for the transformed data yields, Recall that λ(M X ) represent the distance from M X to F U as a proportion of the distance d F ; thus the term [2* λ(M X ) -1] will range from [0,1]. Moreover, if the upper and lower fourths are roughly equal distant from the median (i.e., x is roughly symmetric), this quantity will be zero. Addi-tionally, since Φ''(M X ) is a measure of concavity of the tangent line at the median, it will generally be small. Therefore, the quadratic term of (d F ) Y is generally small enough to be negligible compared to the leading term, thus the approximation is reasonable. Now, if the right-hand side of this equation is held constant, it leads to a transformed variable Y with constant variance. Therefore, supposing that the relationship between spread d F and level X is a power transformation of the form d F (x) = k·x b for some constants k and b, The spread-versus-level plot plots the log of the median on the x-axis ( ) against the log of the fourthspread on the y-axis. (Recall, d F (x) = k· so that log(d F ) Rank versus spread plot of generalized logarithm transformed data Figure 6 Rank versus spread plot of generalized logarithm transformed data. Plot of the rank of the median probe set signal intensities after applying the generalized logarithm transformation by the associated fourth-spread across the 16 HG-U133A GeneChips ® . The fitted lowess regression curve is overlaid where the span is 1/3. ). Therefore, the slope of the spread-versus-level plot estimates b. Hence the suggested power transformation would be x p = x 1 -b .

Data
An experiment using 16 technical replicates in which data were obtained using the Affymetrix GeneChip ® technology is considered to illustrate the transformation method suggested in the previous section. Five identical aliquots of 40 µg of the Universal Human Reference RNA (Stratagene, La Jolla, CA), isolated from 10 human cell lines, were used as template for cDNA synthesis, cRNA in vitro transcription, and labelling reactions. The final fragmented cRNA aliquots were pooled together. The pooled Reference RNA was hybridized to 16 HG-U133A GeneChips ® (Dumur et al, personal communication) to study GeneChip ® reproducibility. Hybridization was performed using the same lot number of the HG-U133A arrays, which contains 22,283 probe sets. Immediately after the scanning of every chip, a visual quality control of the image was performed in order to ensure a successful hybridization. Moreover, several parameters produced by the Affymetrix Microarray Suite Software (version 5.0) were monitored for quality control purposes, including: the scaling factor, which is used to scale all probes sets to a predetermined target value (in this study the target value was set at 100); the percentage of probe sets called 'Present' using the Affymetrix Detection Call algorithm [14]; and the 3'/5' ratios of signal intensity values for two housekeeping genes, GAPDH and β-actin. The Microarray Suite Software (version 5.0) was also used to derive the expression summaries for all probe sets. The present study was performed on the 22,215 probe sets designed to match human transcripts; we eliminated those probe sets that query for bacterial spike-in controls from our analyses.