One of the primary goals of experiments involving DNA microarrays is to find genes which are somehow similar across various experimental conditions. "Similar" is usually taken to mean co-expressed, but it can be measured in several different ways. The distance (usually one minus similarity) measure most commonly used is Pearson correlation, though Euclidean distance, cosine-angle metric, Spearman rank correlation, and jackknife correlation are also used frequently. (Note that correlation and cosine-angle metrics do not fulfill the triangle inequality, so they are not true distance metrics. However, they are used to measure distance in many applications.) For example, [1–4] use Pearson correlation in their gene network analysis; [5–13] use Pearson correlation (or a modification) to cluster gene expression data. Once the similarity or distance measure is chosen, the relationship between the genes is given by some sort of clustering algorithm (e.g., k-means, hierarchical clustering, *k* nearest neighbors) or gene network analysis.

Clustering results can be highly dependent on the choice of similarity measure (particularly when comparing genes whose similarities are based on tens of samples instead of comparing samples whose similarities are based on thousands of genes); one or two outlying values can produce large changes in the value of some similarity measures. Outlying data points can be real or noise, though microarray data are known to have substantial noise. The noise can occur during any of the stages in the experimental process, and the effect can be in any direction. For example, a large outlier might cause co-expressed genes to seem dissimilar while a different large outlier might cause dissimilar genes to look co-expressed. Sometimes the outlying value is meaningful and important in which case the data should be included in the correlation. Our flagging procedure lets the practitioner determine whether or not a flagged value should be removed.

The goal in our paper is to give a resistant correlation measure that can be used as a distance metric in any clustering or gene network algorithm which calls for some type of distance or similarity measure in order to identify the relationship between a pair of genes, across gene modules, or within a cluster of genes. Tukey's biweight [14] has been well established as a resistant measure of location and scale for multivariate data [15–17]. When considering 2 genes on *n* samples, the 2 × 2 biweight covariance matrix that results from the biweight measurement of multivariate scale can be thought of as a resistant covariance between two genes (or of *n* points in dimension 2). Translating a 2 × 2 biweight covariance matrix into a biweight correlation measure is simply a matter of taking the biweight covariance divided by the product of the individual gene biweight standard deviations (analogous to computing the Pearson correlation from a standard covariance matrix.) Tukey's biweight is a type of M-estimate, a class of estimators which has been used in robust correlation estimates (for example, Mosteller and Tukey defined the cob [18] and Wilcox defined the percentage bend correlation [19].) M-estimates are consistent estimates of multivariate location and shape, so the biweight correlation is estimating the same parameter as the Pearson correlation. We show that our robust correlation based on the biweight M-estimate is intuitive, flexible, and performs well under a variety of data distributions.

When considering the correlation between each pair of genes, we find that, although the biweight correlation and the Pearson correlation usually agree, when they do not agree, there are often problems with the gene's (or genes') data which may indicate to the biologist that the gene should be removed from further study. Our biweight correlation method provides two novel applications particularly suited to microarray analysis: 1. We have created a similarity measure that is resistant to outlying data points (an important feature in analyzing microarray data), and 2. By investigating gene pairs that have discrepant correlation values, we create a diagnostic procedure to identify values which may need to be flagged (i.e., removed or else further investigated.)

In the remainder of the paper, we provide details of the method and results. First, in section 1.1 we discuss microarrays and their particular need for resistant measures. In section 1.2 we explain Tukey's biweight (its computation is given in the appendix, section 8.) We give our results in section 2, showing that the biweight correlation can be used as a resistant similarity measure or a diagnostic procedure for flagging data. We then demonstrate, in section 2.4, that our method is more efficient than Spearman correlation (another resistant correlation method.) In section 2.5 we show that the biweight correlation has empirically low bias and is superior to other robust measures. We conclude with some ideas of how to further develop our methods for other microarray applications.

### 1.1 Why resistance is important in microarray analysis

Microarray technology requires biologists and statisticians to work side by side in analyzing gene expression information. Gene microarray chips measure, simultaneously, the expression levels of thousands of genes in an organism. Comprehensive gene expression data is useful if one wishes to find clusters of genes with similar function. Microarrays have been used to study the gene expression trends (for example across time) in diseases and even to classify and diagnose different types of diseases, such as cancerous tumors [20]. For some organisms, microarrays enable biologists to monitor the entire genome of interest on a single chip in order to create a large picture of the interactions among thousands of genes simultaneously. A microarray is an orderly arrangement of spots that provides a medium for measuring known and unknown DNA pieces (genes) based on base-pairing rules. Each microarray measures thousands of genes simultaneously, so resulting microarray data is typically on the order of thousands of genes by tens of samples.

Although microarray technology has been very useful in discovering changes in gene expression, limitations of the technology have been observed: dye bias and relative gene expression levels having different sample variances due to differences in experimental conditions [21]; differences due to laboratories and platforms [22, 23]; pixel saturation [24]; low signal/noise ratio [25]; and differences due to image analysis techniques [26–29]. Researchers have worked to address the particular problems inherent in microarray analyses, but even after novel techniques (of, for example, normalization or filtering) have been applied, microarray data remain noisy [30, 31].

Some work has been done showing the need for resistant correlation metrics as similarity measures. In particular, Heyer et al. give a jackknife correlation that is more resistant than the Pearson correlation. However, as they state in their paper, the jackknife correlation is only resistant to single outliers [32].

### 1.2 Biweight as a resistant correlation measure

Tukey's biweight has been used as a resistant estimator of location and scatter as well as a resistant estimator of regression parameters in a wide range of applications (see [33] for an overview of Tukey's work in resistant statistics). The former approach has been used by Affymetrix to normalize microarray data [34] but not in applications of data distances.

M-estimators are a class of estimators of multi-dimensional location and scatter that provide for flexibility, efficiency, and resistance. The key to M-estimation is the ability of the estimator to down-weight points that are far from the data center with respect to the data scatter. Because of the weighting, M-estimates are more resistant to outlying values than standard estimates (like the mean or the Pearson correlation.) Additionally, M-estimates use the actual data values in constructing location and scatter estimates and are therefore more efficient than estimators based on rank (like the median or the Spearman rank correlation.) M-estimators are defined iteratively using a weight function which down-weights data values that are far from the center of the data. We use the M-estimate of 2-dimensional scatter (i.e., covariance) to calculate a biweight correlation. Details for the biweight are given in the appendix, and R code for the biweight is available from the corresponding author (some of the R code is taken from Wilcox [19].)

An important aspect of M-estimators is their resistance to outlying data values. One measure of the resistance of an estimator is its replacement breakdown, which is the smallest fraction of a data set that one could replace with corrupt data in such a way as to take the estimator over all bounds [17]. Unlike the mean (breakdown = 0) or the median (breakdown close to $\frac{1}{2}$), the biweight is parameterized so that the breakdown can be adjusted over a range of values. Adjusting the breakdown value will have implications in flagging data values (discussed further in section 2.3).

The results of the biweight iteration scheme are a multivariate location estimate,

$\tilde{T}$, and shape estimate,

$\tilde{S}$. Letting

${\tilde{s}}_{jl}$ be the (

*j, l*)

^{
th
}element of

$\tilde{S}$, we can think of

${\tilde{s}}_{jl}$ as a resistant estimate of cov (

**X**_{
j
},

**X**_{
l
}) (where

**X**_{
j
}and

**X**_{
l
}are two

*n*-vectors of interest.) Consequently,

${\tilde{r}}_{jl}=\frac{{\tilde{s}}_{jl}}{\sqrt{{\tilde{s}}_{jj}{\tilde{s}}_{ll}}}$

(1)

is the biweight correlation between vectors *j* and *l* and is a more resistant estimate of correlation than the Pearson correlation (denoted by *r*_{
jl
}.) Because the components (center and shape parameters) are estimated using resistant techniques (unlike the Pearson correlation), we know the biweight correlation will be more resistant than the Pearson correlation. Note that |${\tilde{r}}_{jl}$| ≤ 1.

Using the biweight correlation ($\tilde{r}$) as a resistant estimate of the correlation measure, we can incorporate $\tilde{r}$ into clustering algorithms which depend on similarities or 1 - $\tilde{r}$ into clustering algorithms that depend on distances. In the next section we will demonstrate that the biweight correlation is clearly a better choice for a distance (or similarity) measure than the Pearson correlation (*r*).