# Graphical technique for identifying a monotonic variance stabilizing transformation for absolute gene intensity signals

- Kellie J Archer
^{1, 2}Email author, - Catherine I Dumur
^{3}and - Viswanathan Ramakrishnan
^{1}

**5**:60

https://doi.org/10.1186/1471-2105-5-60

© Archer et al; licensee BioMed Central Ltd. 2004

**Received: **19 February 2004

**Accepted: **17 May 2004

**Published: **17 May 2004

## Abstract

### Background

The usefulness of log_{2} transformation for cDNA microarray data has led to its widespread application to Affymetrix data. For Affymetrix data, where absolute intensities are indicative of number of transcripts, there is a systematic relationship between variance and magnitude of measurements. Application of the log_{2} transformation expands the scale of genes with low intensities while compressing the scale of genes with higher intensities thus reversing the mean by variance relationship. The usefulness of these transformations needs to be examined.

### Results

Using an Affymetrix GeneChip^{®} dataset, problems associated with applying the log_{2} transformation to absolute intensity data are demonstrated. Use of the spread-versus-level plot to identify an appropriate variance stabilizing transformation is presented. For the data presented, the spread-versus-level plot identified a power transformation that successfully stabilized the variance of probe set summaries.

### Conclusion

The spread-versus-level plot is helpful to identify transformations for variance stabilization. This is robust against outliers and avoids assumption of models and maximizations.

## Background

Microarrays measure the abundance of thousands of mRNA transcripts in one experiment. Currently, two different microarray technologies dominate gene expression research efforts, namely, custom spotted arrays and Affymetrix GeneChips^{®}. Custom spotted arrays are characterized by long single strands of complimentary DNA (cDNA) affixed to a solid substrate in spots to which two different fluorescently labelled samples are hybridized. Affymetrix GeneChips^{®} are characterized by the use of several (11–20) short oligonucleotide (25-mers) probes to interrogate for a single gene and to which one fluorescently labelled sample is hybridized. For both technologies, the fluorescence intensity for each probe/spot is assumed to be indicative of the amount of mRNA transcript in the sample. Since two samples are hybridized to custom spotted arrays, the ratios of the experimental signal relative to a control signal of the resulting intensities are analyzed. On the other hand, for the one sample hybridized to an Affymetrix GeneChip^{®}, the absolute intensities indicative of the number of transcripts are the resulting gene expression measures.

To satisfy assumptions required for statistical analyses, data from both technologies are often transformed by a suitable function [1]. Specifically, the reasons for applying a transformation to a dataset include to achieve stability in variance, or to achieve linearity, additivity and/or normality. Sometimes a transformation is applied because it facilitates interpretation and induces symmetry. Such is the case when applying the log_{2} transformation to custom spotted microarray (cDNA) data. This transformation defines the relative abundance of a transcript in an experimental sample in comparison to the control sample as the unit of analysis. For genes where the experimental intensity is greater than the control intensity, the ratio could take values in the range (1, ∞); for genes where the experimental intensity is less than the control intensity, the ratio is compressed to the range (0, 1). The log_{2} transformation of ratio data promotes symmetry by treating under and over expressed genes similarly. A typical example used to elucidate this concept is the one that considers two genes, one with a ratio of 2.0 representing a doubling of intensity and another with a ratio of 0.5 represents a halving of intensity; on a log_{2} scale these values are symmetric about zero (no change) with values 1 and -1 respectively [1]. The application of the log_{2} transformation to cDNA data thus has proven useful for achieving symmetry as well as ease of interpretation.

The usefulness of the log_{2} transformation for cDNA microarray data has led to its widespread application to Affymetrix GeneChip^{®} data as well [2–4]. While this transformation is appealing to the biologists due to the reasons stated above for cDNA arrays, it neither facilitates interpretation nor does it necessarily render true the assumptions required for statistical analysis, such as equal variance, normality, etc., in Affymetrix GeneChip^{®} data. When data are amounts or counts, as with Affymetrix GeneChip^{®} data, where the intensities are assumed to represent amounts of mRNA transcripts, there is often a systematic relationship between the variance of the measurements and magnitude of the measurements. The log_{2} transformation expands the scale of genes with low absolute intensities while compressing the scale of genes with higher intensities; it essentially reverses the direction of the relationship between the variance and the mean expression level. That is, after the transformation lowly expressed genes have a higher variance and highly expressed genes have a lower variance [5].

Recently, other variance stabilizing transformation methods for microarray gene expression data have been introduced [6–8]. The gene expression data have been modelled as

*y* = *α* + *μ* *e*^{
η
}+ *ε*, (1)

where *y* represents the measured intensity, *α* represents the average background, *μ* represents the true gene expression level, with normally distributed error terms *η* and *ε* with zero mean and differing non-zero variances [6]. For this model, when *μ* is large, *y* is distributed approximately as a lognormal random variable. Therefore, a log transformation of *y* results in observations with constant variance when *μ* is large. In this case the generalized logarithm transformation is

*g*(*y*) = *ln* (*y* - *α* + *sqrt* [(*y* - *α*)^{2} + *c*]), (2)

In this paper, an Affymetrix GeneChip^{®} HG-U133A dataset consisting of 16 technical replicates (QAQC Dataset), where the Microarray Suite Software (version 5.0) was used to derive the expression summaries for all probe sets, is used to demonstrate some of the problems associated with applying the log_{2} transformation to absolute intensity data. Another approach to identify an appropriate variance stabilizing transformation using the spread-versus-level plot [9] is proposed. The spread-versus-level plot plots the log of the median on the x-axis (
) against the log of the fourth-spread on the y-axis. The slope of the spread-versus-level plot (*b*) can be used to suggest a power transformation that will most appropriately stabilize the variance. When using a spread-vs-level plot, the suggested power of the transformation is *p* = 1 - *b*.

The effectiveness of using the spread-versus-level plot to identify a suitable monotone transformation that appropriately stabilizes the variance for probe set expression summaries is demonstrated using the QAQC dataset.

## Results

_{2}transformation. This plot demonstrates, as previous authors have found [4, 5], that the log transformation increases the variability on the lower end of the range while compressing the variability among the highly expressed genes. In other words, clearly, the log transformation has not stabilized the variance in this case.

where, x is the observed data, *p* is a number to be determined and Φ is the monotonic transformation. Substituting the *p* obtained from the plot yields,

Φ_{0.5}(x) = (*x*^{0.5} - 1)/0.5 =
.

^{®}(Figure 5) is considered [10]. This plot more clearly demonstrates the improvement in variance stabilization achieved by this transformation. The fitted lowess regression curve is overlaid, where the span is 1/3, is roughly horizontal, indicating constant variance.

As noted in this figure, there are still a few probe sets at the highest intensity level that suffer from increased variability. These probe sets are among those expression summary values that are greater than the expression summary values for the most concentrated hybridization spike-in control (that is, the AFFX-CreX and AFFX-r2-P1-cre probe sets, with a mean of 2796 on MAS 5.0 scale, 103.7 on the transformed scale). From our quality assessment of the hybridizations, there is linearity over the range of concentrations covered by all the hybridization spike-in controls. However, for signal intensities outside of this range, linearity cannot be determined. In a statistical analysis, these may be deemed outliers and may need to be examined separately from the remaining probe sets.

_{2}transformation. The fitted lowess regression curve is overlaid where the span is 1/3. The coefficient of variation for the fitted values from the lowess regression for the transformed data was 0.15 while the coefficient of variation of the fitted values from the lowess regression for the generalized logarithm transformation was 0.22. Bootstrap percentile-based confidence intervals for the coefficient of variation were estimated by resampling the paired median and spread observations, fitting a lowess regression to the bootstrapped data, then estimating the coefficient of variation using the lowess fitted values. The 99% confidence interval for the vsn() transformed data was (0.2109, 0.2265); the 99% confidence interval for the transformed data was (0.1431, 0.1644). Thus, it may be concluded that the variance was better stabilized by the transformation.

## Discussion

Reasons for transforming data often include the desire to find a scaling of the data that would simplify the analysis, for example, by making the data meet assumptions such as symmetry, equal variance, straight line relationship, and/or additivity of effect. One common assumption of many traditional statistical methods is that the variance is constant over the levels of gene expression. Since the absolute intensities are positive, this assumption is generally not valid because the variance increases with increasing intensities. The log_{2} transformation, which is routinely applied to absolute intensities that result from Affymetrix GeneChip^{®} experiments [2, 3], unfortunately does not achieve the desired results. As shown (Figure 2) the transformation only reverses the mean-variance relationship. To overcome this drawback of the log_{2} transformation, methods such as the Local Pooled Error test have been suggested [4] to more accurately estimate the variance for lowly expressed log_{2} signal intensities. In this article, as one alternative, the spread-versus-level plot to identify more appropriate monotonic transformations that stabilize the variance was proposed. The spread-vs-level plot proved useful for identifying a transformation useful for simplifying an analysis by meeting assumptions such as symmetry and equal variance.

For the QAQC Dataset, the spread-versus-level plot led to a square root transformation, rather than the log_{2} transformation. This dataset consisted of 16 technical replicates; since the spread-vs-level plot is used to empirically identify a power transformation, the most appropriate power transformation to stabilize the variance is completely data specific. Therefore, the
transformation suggested should be treated as an illustration and is not necessarily the most appropriate one for all data sets. In fact, there may be situations in which our method may lead to the more commonly used log transformation itself, while for other datasets it may not lead to any power transformation that could stabilize the variance. We note, however, that in a serial dilution experiment conducted to assess linearity of signal [11], the authors found the square root transformation by trial and error to be an appropriate transformation. Thus this empirically derived transformation obtained through the use of the spread-versus-level plot confirmed a similar finding.

## Conclusions

The estimated slope from the spread-versus-level plot could be used to identify a power transformation that will appropriately transform gene expression data to stabilize the variance. It is computationally easy to implement as it avoids the requirement of a statistical model and maximization of a likelihood. Moreover, since this plot uses robust measures of location and spread, the slope of the linear regression fit to the spread-versus-level plot is fairly robust against outliers.

## Methods

### Transformation

One of the most commonly used families of variance stabilizing transformations is the Box-Cox transformation [12]. This family of transformations is defined by equation (3). Box and Cox selected p as the value that maximized the likelihood under any given model. A graphical technique based on the spread-versus-level plot may also be used to empirically estimate *p* [9]. This method is robust against outliers and avoids the need for maximization of the likelihood. The spread-versus-level plot plots the log of the median (x-axis) by the log of the fourth-spread (y-axis). The slope *b* of the spread-versus-level plot identifies the approximate value of the exponent for the power transformation, where *p* = 1 - *b*. When the slope *b* is close to 1, *p* is close to 0 and therefore the transformation would be ln(*x*).

### Spread-versus-level plot

To construct a spread-versus-level plot, three order statistics, namely the median (*M*) the upper fourth (*F*_{
U
}) and the lower fourth (*F*_{
L
}) must be identified. These order statistics may be found using the following algorithm. First, sort the data in ascending order. On the basis of ordering, the rank of an observation can be defined as an upward rank (determined by counting up from smallest to largest) or a downward rank (determined by counting down from largest to smallest). Considering both the upward and the downward ranks together, the depth of a data value in a sample is defined as the smaller of its upward rank and its downward rank. For example, the depth of the median is (*n* + 1)/2. (When the number of observations *n* is even, the median is interpolated to lie between the two middle values.) The fourths are those values with a depth equal to ([depth of median] + 1)/2. Therefore, each fourth is halfway between the median and the corresponding extreme [13]. The fourth-spread (*d*_{
F
}) is a robust measure of spread and is estimated by (upper fourth) - (lower fourth), denoted by *d*_{
F
}= *F*_{
U
}- *F*_{
L
}. The interquartile range is nearly the same as the fourth-spread but differs slightly when the number of observations is even.

To understand why a spread-versus-level plot is useful in identifying an appropriate variance stabilizing transformation [9], let *X* be a real random variable from a population with median *M*_{
X
}, lower fourth *F*_{
L
}, upper fourth *F*_{
U
}, and fourth spread *d*_{
F
}. The spread-versus-level plot identifies a transformation *y* = Φ(*x*) for which the fourth-spread of *Y* is constant. Assuming the second derivative Φ''(*x*) and higher order derivatives exist for all *x*, a Taylor series expansion of Φ(*x*) around *M*_{
X
}is

Let *λ*(*M*_{
X
}) represent the distance from *M*_{
X
}to *F*_{
U
}as a proportion of the distance *d*_{
F
}, or *λ*(*M*_{
X
}) = (*F*_{
U
}- *M*_{
X
})/*d*_{
F
}; then [1 - *λ*(*M*_{
X
})] would represent the proportionate distance from *M*_{
X
}to *F*_{
L
}. Therefore, the median and upper and lower fourths for the data transformed by *y* = Φ(*x*) are

*M*_{
Y
}= Φ(*M*_{
X
}), (5)

(*F*_{
U
})_{
Y
}= Φ(*F*_{
U
}) = Φ [*M*_{
X
}+ *λ*(*M*_{
X
})*d*_{
F
}], (6)

(*F*_{
L
})_{
Y
}= Φ(*F*_{
L
}) = Φ{*M*_{
X
}- [1 - *λ*(*M*_{
X
})]*d*_{
F
}}. (7)

Substituting these expressions into equation (4) to get (*F*_{
U
})_{
Y
}and (*F*_{
L
})_{
Y
}, then taking the difference to get the fourth-spread for the transformed data yields,

Recall that *λ*(*M*_{
X
}) represent the distance from *M*_{
X
}to *F*_{
U
}as a proportion of the distance *d*_{
F
}; thus the term [2* *λ*(*M*_{
X
}) - 1] will range from [0,1]. Moreover, if the upper and lower fourths are roughly equal distant from the median (i.e., *x* is roughly symmetric), this quantity will be zero. Additionally, since Φ''(*M*_{
X
}) is a measure of concavity of the tangent line at the median, it will generally be small. Therefore, the quadratic term of (*d*_{
F
})_{
Y
}is generally small enough to be negligible compared to the leading term, thus the approximation

(*d*_{
F
})_{
Y
}≈ *d*_{
F
}· Φ'(*M*_{
X
}) (9)

is reasonable. Now, if the right-hand side of this equation is held constant, it leads to a transformed variable *Y* with constant variance. Therefore, supposing that the relationship between spread *d*_{
F
}and level *X* is a power transformation of the form *d*_{
F
}(*x*) = *k*·*x*^{
b
}for some constants *k* and *b*,

*d*

_{ F }(

*x*) =

*k*· so that log(

*d*

_{ F }) = log(

*k*) +

*b*· ). Therefore, the slope of the spread-versus-level plot estimates

*b*. Hence the suggested power transformation would be

*x*

^{ p }=

*x*

^{1 - b}.

### Data

An experiment using 16 technical replicates in which data were obtained using the Affymetrix GeneChip^{®} technology is considered to illustrate the transformation method suggested in the previous section. Five identical aliquots of 40 *μ* g of the Universal Human Reference RNA (Stratagene, La Jolla, CA), isolated from 10 human cell lines, were used as template for cDNA synthesis, cRNA *in vitro* transcription, and labelling reactions. The final fragmented cRNA aliquots were pooled together. The pooled Reference RNA was hybridized to 16 HG-U133A GeneChips^{®} (Dumur et al, personal communication) to study GeneChip^{®} reproducibility. Hybridization was performed using the same lot number of the HG-U133A arrays, which contains 22,283 probe sets. Immediately after the scanning of every chip, a visual quality control of the image was performed in order to ensure a successful hybridization. Moreover, several parameters produced by the Affymetrix Microarray Suite Software (version 5.0) were monitored for quality control purposes, including: the scaling factor, which is used to scale all probes sets to a predetermined target value (in this study the target value was set at 100); the percentage of probe sets called 'Present' using the Affymetrix Detection Call algorithm [14]; and the 3'/5' ratios of signal intensity values for two housekeeping genes, GAPDH and β-actin. The Microarray Suite Software (version 5.0) was also used to derive the expression summaries for all probe sets. The present study was performed on the 22,215 probe sets designed to match human transcripts; we eliminated those probe sets that query for bacterial spike-in controls from our analyses.

## Declarations

### Acknowledgements

This research was supported by the Commonwealth Technology Research Fund (CTRF #SE2002_02). The authors would like to thank the two anonymous reviewers for their helpful insights and recommendations.

## Authors’ Affiliations

## References

- Quackenbush J:
**Microarray data normalization and transformation.***Nature Genetics*2002,**32(Supplement):**496–501. 10.1038/ng1032View ArticleGoogle Scholar - Hubbell E, Liu W-M, Mei R:
**Robust estimators for expression analysis.***Bioinformatics*2002,**18(12):**1585–1592. 10.1093/bioinformatics/18.12.1585View ArticleGoogle Scholar - Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP:
**Summaries of Affymetrix GeneChip probe level data.***Nucleic Acids Research*2003,**31:**e15. 10.1093/nar/gng015PubMed CentralView ArticlePubMedGoogle Scholar - Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee JK:
**Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays.***Bioinformatics*2003,**19:**1945–1951. 10.1093/bioinformatics/btg264View ArticlePubMedGoogle Scholar - Kohane IS, Kho AT, Butte AJ:
**Microarrays for an integrative genomics.**Cambridge, MA: MIT Press; 2003.Google Scholar - Durbin BP, Hardin JS, Hawkins DM, Rocke DM:
**A variance-stabilizing transformation for gene-expression microarray data.***Bioinformatics*2002,**18(Suppl. 1):**S105-S110.View ArticleGoogle Scholar - Rocke D, Durbin B:
**Approximate variance-stabilizing transformations for gene-expression microarray data.***Bioinformatics*2003,**19(8):**966–972. 10.1093/bioinformatics/btg107View ArticleGoogle Scholar - Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M:
**Variance stabilization applied to microarray data calibration and to the quantification of differential expression.***Bioinformatics*2002,**18(Suppl. 1):**S96-S104.View ArticleGoogle Scholar - Emerson JD:
**Mathematical Aspects of Transformation.***Understanding Robust and Exploratory Data Analysis*. Edited by Hoaglin DC, Mosteller F, Tukey JW. New York: John Wiley & Sons; 1987, 247–282.Google Scholar - Geller SC, Gregg JP, Hagerman P, Rocke D:
**Transformation and normalization of oligonucleotide microarray data.***Bioinformatics*2003,**19(14):**1817–1823. 10.1093/bioinformatics/btg245View ArticleGoogle Scholar - Ramdas L, Coombes KR, Baggerly K, Abruzzo L, Highsmith WE, Krogmann T, Hamilton SR, Zhang W:
**Sources of nonlinearity in cDNA microarray expression measurements.***Genome Biology*2001,**2(11):**RESEARCH0047.Google Scholar - Box GEP, Cox DR:
**An analysis of transformations.***Journal of the Royal Statistical Society, Series B*1964,**26(2):**211–252.Google Scholar - Hoaglin DC:
**Letter Values: A Set of Selected Order Statistics.***Understanding Robust and Exploratory Data Analysis**(Edited by: Hoaglin DC, Mosteller F and Tukey JW).*New York, John Wiley & Sons 1987, 33–57.Google Scholar - Liu W-M, Mei R, Di X, Ryder TB, Hubbell E, Dee S, Webster TA, Harrington CA, Ho MH, Baid J, Smeekens SP:
**Analysis of high density expression microarrays with signed-rank call algorithms.***Bioinformatics*2002,**18(12):**1593–1599. 10.1093/bioinformatics/18.12.1593View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.