### Similarity measures

This section discusses three similarity measures and their properties: the *Euclidean distance* and the *Pearson correlation*, which are commonly used measures in gene expression clustering [6, 37–39], and the proposed *mutual information* (MI) measure. These measures quantify a pairwise distance between expression profiles over *n* conditions that are represented by the two vectors **x** = (*x*_{1},..., *x*_{n}) and **y** = (*y*_{1},..., *y*_{n}).

#### Euclidean Distance and Pearson Correlation

The Euclidean distance between two expression profiles is given by

E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}. \quad (1)

It measures similarity according to positive linear correlation between expression profiles, which may identify similar or identical regulation [6]. The measure is highly influenced by the magnitude of changes in the measured expression profiles. Therefore, it should be used mainly for expression data that are suitably normalized. When such normalization is used, the Euclidean distance and the Pearson correlation are monotonically related, as indicated below.

Numerous biological studies (e.g., [17, 28, 40, 41]) have used the Euclidean distance as a similarity measure for gene expression analysis. Most of these publications analyzed similar expression trends, i.e., simultaneously up-regulated or down-regulated expression levels. From a biological viewpoint, a relative up/down-regulation of gene expression is often considered more important than the absolute amplitude of the changes [28].
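As a minimal sketch (the function name and example profiles are ours, not from the cited studies), the distance in Eq. (1) can be computed directly:

```python
import math

def euclidean_distance(x, y):
    # Eq. (1): Euclidean distance between two expression profiles
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Same shape, doubled magnitude: the distance is non-zero (about 5.48)
# even though the regulation trend is identical, illustrating the
# measure's sensitivity to magnitude.
x = [1.0, 2.0, 3.0, 4.0]
print(euclidean_distance(x, [2.0 * v for v in x]))
```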

The Pearson correlation coefficient between two expression patterns (e.g., [1, 10–14]) is defined as

R(x, y) = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2 \sum_{i=1}^{n} (y_i - \overline{y})^2}}, \quad (2)

where \overline{x} and \overline{y} denote the average expression levels of the two patterns.

The Pearson correlation reflects the degree of linear relationship between two patterns. It ranges from -1 to +1, where -1 (+1) reflects a perfect negative (positive) linear relationship between the patterns. A zero correlation value implies that there is no linear relationship between the two patterns, yet it gives no indication of nonlinear relationships that might exist between them.

The correlation coefficient is invariant under positive linear transformations (scaling and shifting) of the data. Accordingly, two expression profiles that have "identical" shapes with different magnitudes will obtain a correlation value of 1. The ability to measure (dis)similarities according to positive and negative correlations can help to identify control processes that antagonistically regulate downstream pathways [6]. Nonetheless, the majority of publications use only the positive correlation range, while others map the entire range of the correlation coefficient to obtain values between 0 and 1.
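A direct translation of Eq. (2) into code (a sketch under our own naming, not the authors' implementation) also illustrates the scaling invariance:

```python
import math

def pearson_correlation(x, y):
    # Eq. (2): Pearson correlation coefficient between two profiles
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) *
                    sum((yi - my) ** 2 for yi in y))
    return num / den

# Identical shape, tenfold magnitude -> correlation of 1.0:
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(pearson_correlation(x, y))  # 1.0
```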

Gene expression measurements, like other empirical measurements, suffer from noise effects. Variations in the measurements might come from many sources: intrachip defects, variation within a single lot of chips, variation within an experiment, and biological variation for a particular gene [42]. Both the Pearson correlation and the Euclidean distance are sensitive to noise effects and outliers. A single outlier can make the Euclidean distance arbitrarily large, while shifting the Pearson correlation to any value between -1 and 1 [43]. Both measures are easily distorted when the expression levels are not uniformly distributed across the expression pattern. For example, two expression patterns with one high measured value at the same cellular condition will obtain a high correlation coefficient score, regardless of the expression values at the other cellular conditions [44]. Similarly, a large difference in a single expression level at the same cellular condition will lead to a high Euclidean distance, regardless of the other expression levels. In this way, outlying points can bias the correlation coefficient and the Euclidean distance.
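The outlier sensitivity can be illustrated with a small sketch (the profiles and values here are invented for illustration): two perfectly anti-correlated patterns are flipped to a near-perfect positive correlation by one shared outlier.

```python
import math

def pearson(x, y):
    # Pearson correlation (Eq. 2), restated here for the illustration
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

x = [1.0, 2.0, 1.0, 2.0, 1.0]
y = [2.0, 1.0, 2.0, 1.0, 2.0]
print(pearson(x, y))  # perfectly anti-correlated: -1.0

# A single shared outlier dominates both sums and flips the
# correlation to nearly +1:
print(pearson(x + [100.0], y + [100.0]))
```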

Both the Pearson correlation and the Euclidean distance require complete gene expression profiles as input. However, gene-expression microarray experiments often generate datasets with missing expression values. Therefore, another source of uncertainty when implementing these measures is the need to use methods for estimating missing data, such as *row average* or *singular value decomposition* [26, 36].
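As a sketch of the simplest of the estimation methods mentioned above (the function name and the use of `None` to mark missing values are our assumptions), *row average* imputation fills each gap with the mean of the observed values in the same profile:

```python
def impute_row_average(profile):
    # Replace missing entries (marked None) by the average of the
    # observed values in the same expression profile (row average).
    observed = [v for v in profile if v is not None]
    avg = sum(observed) / len(observed)
    return [avg if v is None else v for v in profile]

print(impute_row_average([1.0, None, 3.0, None]))  # [1.0, 2.0, 3.0, 2.0]
```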

#### Mutual Information

The *mutual information* (MI) provides a general measure for dependencies in the data, in particular positive, negative and nonlinear correlations. It is a well-known measure in *information theory* [45] that has been used to analyze gene-expression data [6, 16, 17, 44, 46]. The MI measure used here requires the expression patterns to be represented by discrete random variables. Given two random variables *X*, *Y* with respective ranges *x*_{i} ∈ *A*_{x}, *y*_{j} ∈ *A*_{y} and probability distribution functions *P*(*X* = *x*_{i}) ≡ *p*_{i}, *P*(*Y* = *y*_{j}) ≡ *p*_{j}, the mutual information between two expression patterns, represented by the random variables *X* and *Y*, is given by

I(X; Y) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{p_i p_j}. \quad (3)

The MI is always non-negative. It equals zero if and only if *X* and *Y* are statistically independent, meaning that *X* contains no information about *Y* and vice versa. A zero MI indicates that the patterns do not follow *any kind* of dependence, an indication which is impossible to obtain from the Pearson correlation or the Euclidean distance [15]. This property makes the MI a generalized measure of correlation, which is advantageous in gene expression analysis. For example, if one gene acts as a transcription factor only when it is expressed at a midrange level, then the scatter plot between this transcription factor and the other genes might closely resemble a normal distribution rather than a linear model. The Pearson correlation coefficient under this scenario will obtain a low score, while the MI measure can obtain a high score [44].
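As an illustrative sketch (the function name and the toy joint tables are ours), Eq. (3) can be evaluated directly on a discrete joint distribution, confirming both boundary cases:

```python
import math

def mutual_information(p_joint):
    # Eq. (3): MI from a joint probability table p_joint[i][j], in bits.
    px = [sum(row) for row in p_joint]            # marginal of X
    py = [sum(col) for col in zip(*p_joint)]      # marginal of Y
    mi = 0.0
    for i, row in enumerate(p_joint):
        for j, pij in enumerate(row):
            if pij > 0:                           # 0 * log 0 treated as 0
                mi += pij * math.log2(pij / (px[i] * py[j]))
    return mi

# Statistically independent variables -> MI of 0:
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0

# Perfectly dependent variables -> MI equals the marginal entropy (1 bit):
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))  # 1.0
```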

Another important feature of the MI is its robustness to missing expression values. In fact, the MI can be estimated from datasets of different sizes. This is advantageous in analyzing expression datasets that contain a certain amount (up to 25%) of missing values [46].

The MI between a pair of expression patterns is upper bounded by their marginal entropies. Accordingly, the MI measure exhibits a low value if the marginal entropies are low, even if the patterns are completely correlated. Therefore, the MI measure needs to be normalized so that it gives a high score to highly correlated sequences, independent of their marginal entropies. There are several ways to carry out such normalization. Michaels et al. (1998) [17] normalize the MI by dividing it by the maximal marginal entropy of the considered sequences. Steuer et al. (2002) [16] suggest a rank-ordering procedure. We use a partitioning method with equal-probability bins, where each bin contains approximately the same number of data points and the width of each bin is determined by the local density of the measured expression levels. Besides providing the desired normalization, the proposed method also offers protection against outliers: the MI treats each expression level equally, regardless of its actual value, and is thus less biased by outliers.

As noted above, using the discrete form of the MI measure requires the discretization of the continuous expression values. The most straightforward and commonly used discretization technique is a histogram-based procedure [16, 18, 44]. We use a two-dimensional histogram to approximate the joint probability density function of two expression patterns. The same number of bins is used for all expression patterns, but the bins of each pattern are determined independently according to the density of its expression values. The joint probabilities are then estimated by the corresponding relative frequencies of expression values in each bin of the two-dimensional histogram. This estimation requires sorting the expression values, with a computational complexity of *O*(*n* log *n*), where *n* is the number of expression values; no such sorting is required when calculating the Pearson coefficient or the Euclidean distance. The number of bins must be chosen with care to allow good estimates of the probability function: if it is too small or too large, the joint distributions of all pairs of expression patterns will be similar and will lead to the same MI value. There is no universally optimal choice for the number of bins, since it depends on the data normalization and on the particular biological application [18]. Consequently, the number of bins is often chosen heuristically. We follow Sturges (1926) [47] and Law and Kelton (1991) [48] and use the following simple lower/upper bounds on the number of bins:

*M*_{l} = ⌊1 + log_{2} *n*⌋ and *M*_{u} = \sqrt{n}. In the Results section we show that, within this range for the number of bins, the MI measure outperforms the other distance measures.
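One possible sketch of the equal-probability binning and the histogram-based MI estimate described above (function names, the tie-breaking rule for equal values, and the truncation of the bounds to integers are our assumptions, not the paper's exact procedure):

```python
import math

def num_bins_bounds(n):
    # Bounds on the number of bins: M_l = floor(1 + log2 n), M_u = sqrt(n)
    return math.floor(1 + math.log2(n)), math.isqrt(n)

def equal_probability_bins(values, m):
    # Assign each value a bin index in 0..m-1 so that every bin holds
    # approximately the same number of points (ties broken by position).
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * m // len(values)
    return bins

def estimate_mi(x, y, m):
    # Estimate MI (in bits) via a 2D histogram over equal-probability bins.
    n = len(x)
    bx, by = equal_probability_bins(x, m), equal_probability_bins(y, m)
    joint = [[0] * m for _ in range(m)]
    for i in range(n):
        joint[bx[i]][by[i]] += 1
    px = [sum(row) / n for row in joint]
    py = [sum(col) / n for col in zip(*joint)]
    mi = 0.0
    for a in range(m):
        for b in range(m):
            if joint[a][b]:
                pab = joint[a][b] / n
                mi += pab * math.log2(pab / (px[a] * py[b]))
    return mi

# A nonlinear but monotone dependence scores a high MI; the number of
# bins (7) lies inside the bounds for n = 100, which are (7, 10).
x = list(range(100))
print(num_bins_bounds(100))
print(estimate_mi(x, [v * v for v in x], 7))
```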

### Assessment of clustering quality

The homogeneity and the separation functions are often used to determine the quality of a clustering solution when the true solution is unknown [5]. When considering similarity measures like the MI and the Pearson correlation, high homogeneity implies that elements in the same cluster are very similar to each other, while low separation implies that elements from different clusters are very dissimilar. The two criteria are widely used in gene expression analysis, as well as in other fields [12, 21, 28, 31, 39, 49].

Consider a set of *N* elements (genes or samples represented by expression patterns or profiles) X ≡ {*X*_{1}, *X*_{2}, ..., *X*_{N}} divided into *k* clusters. Denote by *X*_{i} and *C*(*X*_{i}) the expression pattern of element *i* and the expression pattern of its cluster, respectively; then the homogeneity is given by

Hm = \frac{1}{N} \sum_{X_i \in X} D(X_i, C(X_i)), \quad (4)

where *D*(·) represents a given similarity measure, i.e., the Euclidean distance, the Pearson correlation or the mutual information. The solution separation score is evaluated by the weighted average similarity between cluster expression patterns: denote the expression patterns of clusters *t*_{1},...,*t*_{k} by *C*_{t1},...,*C*_{tk}; then the average separation is given by

Sp = \frac{\sum_{i \ne j} N_i N_j \, D(C_{t_i}, C_{t_j})}{\sum_{i \ne j} N_i N_j}, \quad (5)

where *N*_{i}, *N*_{j} are the numbers of elements in clusters *t*_{i} and *t*_{j}, respectively. The homogeneity and the separation are typically conflicting functions – usually, the better the homogeneity of a solution, the worse its separation, and vice versa.
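Eqs. (4) and (5) translate directly into code. The sketch below (our naming; the toy clusters and the use of the Euclidean distance as *D* are for illustration only) takes the similarity measure as a parameter:

```python
import math

def euclidean(a, b):
    # Example similarity measure D (here the Euclidean distance, Eq. 1)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def homogeneity(patterns, cluster_of, cluster_patterns, D):
    # Hm (Eq. 4): average similarity of each element to its cluster pattern
    return sum(D(x, cluster_patterns[cluster_of[i]])
               for i, x in enumerate(patterns)) / len(patterns)

def separation(cluster_patterns, sizes, D):
    # Sp (Eq. 5): size-weighted average similarity between distinct clusters
    num = den = 0.0
    for i in range(len(cluster_patterns)):
        for j in range(len(cluster_patterns)):
            if i != j:
                w = sizes[i] * sizes[j]
                num += w * D(cluster_patterns[i], cluster_patterns[j])
                den += w
    return num / den

# Toy data: two well-separated clusters of two profiles each.
patterns = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
cluster_of = [0, 0, 1, 1]
centroids = [[0.0, 1.0], [10.0, 1.0]]  # cluster expression patterns
print(homogeneity(patterns, cluster_of, centroids, euclidean))  # 1.0
print(separation(centroids, [2, 2], euclidean))                 # 10.0
```

With a distance measure such as the Euclidean, low Hm and high Sp indicate a good solution; with similarity measures such as the MI or the Pearson correlation, the desired directions are reversed, as noted above.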

### Known Clustering Algorithms

The underlying concepts and the parameters of each of the clustering algorithms are given in additional file 1, section 2. The *K-means* algorithm was implemented using Matlab procedures, the SOM algorithm using *GeneCluster 2.0* [20, 49], *Click* using *Expander* [50, 51], and *sIB* using *IBA_1.0* [52]. Further information regarding common unsupervised clustering and learning methods can be found in [53–55].