### Distance measures

After selecting a clustering method one usually has to determine which distance will be employed between objects, given that most clustering methods are based on distance calculations [55, 56]. In gene expression one usually seeks similarity in shape or trend between objects [15]. For this reason, correlation coefficients have been popular choices [3, 10]. As a matter of fact, the well-known Spearman and Pearson correlation coefficients, alongside the traditional Euclidean distance, have found great applicability in gene expression, as highlighted by several authors, e.g., [1–3, 10, 32, 34, 57]. There are, however, a number of less-known distance measures that remain practically unexplored to this date. Bearing this in mind, we describe the 15 distances that we consider for evaluation in this study. We begin by describing 6 correlation coefficients. Afterwards, we review 4 measures which we refer to as *traditional measures*. Finally, we review 5 distances that were tailored for clustering short gene time-series.

#### Correlation coefficients

Correlation coefficients are popular choices for clustering microarray data, with values in the [*−* 1, 1] interval. Since the sign of the correlation is important for gene expression data, one minus the value of the correlation provides the distance we use for clustering in our experiments (as is usual in the gene expression literature). In the following, both **x** and **y** are sequences of real numbers in the form **x** = (*x*_{1}, . . . , *x*_{n}) and **y** = (*y*_{1}, . . . , *y*_{n}).

*Pearson:* Pearson [58], which is given by Equation (2), is probably one of the most popular correlation coefficients in the literature, allowing one to identify linear relationships between variables. Previous studies have reported that Pearson is sensitive to outliers [3, 15]. In such cases, variables that are not truly similar (i.e., variables that are similar just because they contain outliers) can end up as false positives, i.e., with a large correlation. Its computation is straightforward, with linear running time.

PE(x,y)=\frac{{\displaystyle {\sum}_{i=1}^{n}({x}_{i}-\overline{x})({y}_{i}-\overline{y})}}{\sqrt{{\displaystyle {\sum}_{i=1}^{n}{({x}_{i}-\overline{x})}^{2}}}\sqrt{{\displaystyle {\sum}_{i=1}^{n}{({y}_{i}-\overline{y})}^{2}}}}

(2)
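As an illustration, Equation (2) and the resulting distance can be sketched in a few lines of Python (function names are ours; the sketch assumes non-constant sequences, since a zero standard deviation would make the denominator vanish):

```python
def pearson(x, y):
    # Equation (2): covariance normalized by the product of the
    # standard deviations; each sum is a single pass, giving O(n) time.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) ** 0.5
           * sum((b - my) ** 2 for b in y) ** 0.5)
    return num / den

def pearson_distance(x, y):
    # As is usual in the gene expression literature: distance = 1 - PE.
    return 1.0 - pearson(x, y)
```

Two perfectly correlated profiles thus have distance 0, and two perfectly anti-correlated profiles have distance 2.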

*Goodman-Kruskal:* The Goodman-Kruskal [59] correlation coefficient is a rank-based correlation coefficient. In order to introduce it, let us first define three different types of pairs of values with respect to sequences **x** and **y**, namely: concordant, discordant, and neutral pairs. We define as concordant those pairs of values that obey the same order, i.e., *x*_{i} *< x*_{j} and *y*_{i} *< y*_{j}, or *x*_{i} *> x*_{j} and *y*_{i} *> y*_{j}. We call discordant all the pairs for which *x*_{i} *< x*_{j} and *y*_{i} *> y*_{j}, or *x*_{i} *> x*_{j} and *y*_{i} *< y*_{j}. Pairs that are neither concordant nor discordant are defined as neutral. Based on these three definitions, the Goodman-Kruskal correlation coefficient is given by Equation (3), in which *P*_{+} and *P*_{−} correspond to the total number of concordant and discordant pairs in sequences **x** and **y**, respectively. The Goodman-Kruskal correlation has *O*(*n* log *n*) running time [60].

GK\left(x,y\right)=\frac{{P}_{+}-{P}_{-}}{{P}_{+}+{P}_{-}}

(3)
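A direct implementation of Equation (3) follows. It counts pairs naively in O(n²) rather than the O(n log n) achievable with a merge-sort-based algorithm [60], but it makes the pair definitions explicit (the function name is ours; a pair of sequences with no concordant or discordant pairs would leave the coefficient undefined):

```python
def goodman_kruskal(x, y):
    # Equation (3): (P+ - P-) / (P+ + P-), counting pairs naively in O(n^2).
    p_plus = p_minus = 0
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                p_plus += 1    # concordant: both orderings agree
            elif s < 0:
                p_minus += 1   # discordant: the orderings disagree
            # s == 0 marks a neutral pair, which is simply skipped
    return (p_plus - p_minus) / (p_plus + p_minus)
```

Replacing the denominator by n(n − 1)/2 turns the same pair counts into the Kendall coefficient of Equation (4).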

*Kendall:* Kendall [61], which is given by Equation (4), is also a rank-based correlation coefficient. It follows the same definitions previously introduced for Goodman-Kruskal. In Equation (4), the denominator accounts for the total number of pairs of values in **x** and **y**. Due to this different normalization, Kendall can achieve its extreme values only when the sequences under evaluation have no neutral pairs. It is easy to observe that Kendall has the same time complexity as Goodman-Kruskal, that is, *O*(*n* log *n*).

KE\left(x,y\right)=\frac{{P}_{+}-{P}_{-}}{n\left(n-1\right)/2}

(4)

*Spearman:* If the values of each sequence are replaced by their respective ranks, the Spearman correlation coefficient is also given by Equation (2). Given that the actual values of the sequences are replaced by their ranks, Spearman tends to be less sensitive to outliers than its counterpart, Pearson [3]. Due to the need of obtaining ranks for the values in each sequence (the sequences need to be sorted), Spearman has an *O*(*n* log *n*) running time.
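Spearman can thus be sketched as Pearson over ranks (the sketch below assumes no ties; tied values would receive midranks):

```python
def spearman(x, y):
    # Spearman = Pearson (Equation 2) applied to the ranks of each sequence.
    def to_ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank  # 1-based position in sorted order
        return r
    rx, ry = to_ranks(x), to_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) ** 0.5
           * sum((b - my) ** 2 for b in ry) ** 0.5)
    return num / den
```

Note how a monotone but non-linear relationship, e.g. `spearman([1, 2, 3], [1, 2, 300])`, still yields a correlation of 1, whereas Pearson would not.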

*Rank-Magnitude:* In order to correlate a sequence of ranks with a sequence of real values, [60] introduced the measure called Rank-Magnitude, which in its original version is an asymmetric correlation coefficient. Its asymmetric definition is given by Equation (5), in which mi{n}^{rank}={\sum}_{i=1}^{n}{y}_{i}\left(n-i+1\right) and ma{x}^{rank}={\sum}_{i=1}^{n}i{y}_{i}, given that **y** is sorted in increasing order of values.

\widehat{r}\left(x,y\right)=\frac{2{\sum}_{i=1}^{n}Rank\left({x}_{i}\right){y}_{i}-ma{x}^{rank}-mi{n}^{rank}}{ma{x}^{rank}-mi{n}^{rank}}

(5)

Given that for gene expression data both sequences contain real values (rather than one sequence of ranks and one of magnitudes), we use here a symmetric adaptation of Rank-Magnitude [41, 62], which we call RM for short. This symmetric version is easily obtained as RM\left(x,y\right)=\left(\widehat{r}\left(x,y\right)+\widehat{r}\left(y,x\right)\right)/2. Note that although this measure is symmetric, it still captures both the ranks and the magnitudes of the sequences. Both versions of Rank-Magnitude have an *O*(*n* log *n*) running time.
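Under the definitions above, the asymmetric \widehat{r} and its symmetric adaptation RM can be sketched as follows (a sketch under our assumptions: no ties in the rank-providing sequence and a non-constant **y**, which keeps the denominator non-zero; function names are ours):

```python
def r_hat(x, y):
    # Asymmetric Rank-Magnitude (Equation 5): correlates the ranks of x
    # with the magnitudes of y.
    n = len(x)
    rank = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: x[i]), start=1):
        rank[i] = r
    ys = sorted(y)  # the bounds below assume y in increasing order
    min_rank = sum(v * (n - i) for i, v in enumerate(ys))  # i is 0-based here
    max_rank = sum(v * (i + 1) for i, v in enumerate(ys))
    s = sum(rank[i] * y[i] for i in range(n))
    return (2 * s - max_rank - min_rank) / (max_rank - min_rank)

def rm(x, y):
    # Symmetric adaptation used in this study: average of both directions.
    return (r_hat(x, y) + r_hat(y, x)) / 2
```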

*Weighted Goodman-Kruskal:* The measure referred to as Weighted Goodman-Kruskal, introduced by [60], also considers in its formulation both the magnitudes and the ranks of the sequences under evaluation. It is defined by Equation (6), in which {\widehat{\omega}}_{ij} is given by Equation (7). In the latter Equation, {\widehat{\omega}}_{ij}^{x} and {\widehat{\omega}}_{ij}^{y} account for the (signed) percentage difference between the *i* th and *j* th elements of their respective sequences and are given by Equation (8). Finally, *ω*_{ij} is given by Equation (9), where {\omega}_{ij}^{x}=sign\left({x}_{i}-{x}_{j}\right) and {\omega}_{ij}^{y}=sign\left({y}_{i}-{y}_{j}\right). The Weighted Goodman-Kruskal running time is *O*(*n*^{2}).

WGK\left(x,y\right)=\frac{{\sum}_{i=1}^{n-1}{\sum}_{j=i+1}^{n}{\widehat{\omega}}_{ij}}{{\sum}_{i=1}^{n-1}{\sum}_{j=i+1}^{n}\left|{\omega}_{ij}\right|}

(6)

{\widehat{\omega}}_{ij}=\left\{\begin{array}{cc}1\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}{\widehat{\omega}}_{ij}^{x}={\widehat{\omega}}_{ij}^{y}=0\hfill \\ max\left\{\frac{{\widehat{\omega}}_{ij}^{x}}{{\widehat{\omega}}_{ij}^{y}},\frac{{\widehat{\omega}}_{ij}^{y}}{{\widehat{\omega}}_{ij}^{x}}\right\}\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}{\widehat{\omega}}_{ij}^{x}\phantom{\rule{0.3em}{0ex}}{\widehat{\omega}}_{ij}^{y}>0\hfill \\ min\left\{\frac{{\widehat{\omega}}_{ij}^{x}}{{\widehat{\omega}}_{ij}^{y}},\frac{{\widehat{\omega}}_{ij}^{y}}{{\widehat{\omega}}_{ij}^{x}}\right\}\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}{\widehat{\omega}}_{ij}^{x}\phantom{\rule{0.3em}{0ex}}{\widehat{\omega}}_{ij}^{y}<0\hfill \\ 0\hfill & \mathsf{\text{otherwise}}\hfill \end{array}\right.

(7)

{\widehat{\omega}}_{ij}^{x}=\left\{\begin{array}{cc}\frac{{x}_{i}-{x}_{j}}{ma{x}_{x}-mi{n}_{x}}\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}ma{x}_{x}\ne mi{n}_{x}\hfill \\ 0\hfill & \mathsf{\text{otherwise}}\hfill \end{array}\right.

(8)

{\omega}_{ij}=\left\{\begin{array}{cc}1\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}{\omega}_{ij}^{x}=0\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{and}}\phantom{\rule{2.77695pt}{0ex}}{\omega}_{ij}^{y}=0\hfill \\ {\omega}_{ij}^{x}/{\omega}_{ij}^{y}\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}{\omega}_{ij}^{x}\ne 0\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{and}}\phantom{\rule{2.77695pt}{0ex}}{\omega}_{ij}^{y}\ne 0\hfill \\ 0\hfill & \mathsf{\text{otherwise}}\hfill \end{array}\right.

(9)
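Putting Equations (6)-(9) together, WGK can be sketched as below (an O(n²) sketch; the function name is ours and the piecewise cases follow our reading of Equations (7) and (9)):

```python
def wgk(x, y):
    # Weighted Goodman-Kruskal (Equation 6): the numerator sums the
    # weighted pair agreements of Equation (7), the denominator the
    # absolute sign ratios of Equation (9).
    n = len(x)
    rx, ry = max(x) - min(x), max(y) - min(y)
    num = den = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            # Equation (8): signed percentage differences
            wx = (x[i] - x[j]) / rx if rx else 0.0
            wy = (y[i] - y[j]) / ry if ry else 0.0
            # Equation (7)
            if wx == 0 and wy == 0:
                w_hat = 1.0
            elif wx * wy > 0:
                w_hat = max(wx / wy, wy / wx)
            elif wx * wy < 0:
                w_hat = min(wx / wy, wy / wx)
            else:
                w_hat = 0.0
            # Equation (9), with sign(x_i - x_j) and sign(y_i - y_j)
            sx, sy = (wx > 0) - (wx < 0), (wy > 0) - (wy < 0)
            if sx == 0 and sy == 0:
                w = 1.0
            elif sx != 0 and sy != 0:
                w = sx / sy
            else:
                w = 0.0
            num += w_hat
            den += abs(w)
    return num / den
```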

#### Traditional distance measures

In order to provide a broad view of distance measures, we also review and evaluate "traditional" distances from the clustering literature. We consider four such distance measures, all of which have linear running time, i.e., *O*(*n*).

*Minkowski:* The distance measures known as Manhattan (MAN), Supremum (SUP), and Euclidean (EUC) are particular cases of the more general Minkowski family of metric distances [23], defined in Equation (10). These distances are obtained with different values of *λ* in Equation (10): for the three particular cases we consider in this work, i.e., MAN, SUP, and EUC, we have *λ* = 1, *λ* = *∞*, and *λ* = 2, respectively.

Minkowski\left(x,y\right)={\left({\displaystyle \sum _{i=1}^{n}}|{x}_{i}-{y}_{i}{|}^{\lambda}\right)}^{1/\lambda}

(10)
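The three cases can be sketched directly from Equation (10), with the Supremum distance handled separately as the *λ* → ∞ limit (function names are ours):

```python
def minkowski(x, y, lam):
    # Equation (10): lam = 1 gives Manhattan (MAN), lam = 2 Euclidean (EUC).
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1 / lam)

def supremum(x, y):
    # The lam -> infinity limit (SUP): the largest coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))
```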

*Cosine:* Cosine is a measure similar to the Pearson correlation coefficient [10]. The only difference between these two measures is that Pearson centers each variable by its mean before comparison, whereas Cosine does not, measuring instead the angle between the two vectors with respect to the origin. With these considerations, Cosine is given by Equation (11).

co{s}_{sim}\left(x,y\right)=\phantom{\rule{0.3em}{0ex}}\frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{\sqrt{{\sum}_{i=1}^{n}{\left({x}_{i}\right)}^{2}}\sqrt{{\sum}_{i=1}^{n}{\left({y}_{i}\right)}^{2}}}

(11)

Note that Equation (11) defines a similarity. The Cosine dissimilarity, or simply *COS*, is obtained as one minus the value produced by Equation (11).
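A minimal sketch of COS, assuming neither vector is all-zero (the function name is ours):

```python
def cosine_distance(x, y):
    # COS = 1 - cosine similarity (Equation 11). Unlike Pearson, the
    # vectors are not mean-centered: the angle is measured at the origin.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return 1.0 - dot / (norm_x * norm_y)
```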

#### Time-series specific measures

In the following, distances tailored for short gene time-series are reviewed. Before reviewing such measures, let us define the timestamps at which the values of the features of each gene are measured as **t** = (*t*_{1}, . . . , *t*_{n}).

*Son and Baek dissimilarities:* Although correlation coefficients can identify sequences with the same trend, they are invariant to swaps in the values of both sequences, i.e., changing the ordering of features in the same way for both sequences does not alter the final correlation value. Considering this fact, [37] proposed two measures, called YS1 and YR1, that consider the correlation between sequences but also take into account other relevant information from the time-series under comparison (such as the positions of their maxima and minima and the agreement among their slopes).

Given that a time-series with *n* features has *n −* 1 slopes, the slopes of two time-series can be compared with the use of Equation (12), with Equation (13) providing the definition of *Incl* and \mathcal{I}, in Equation (12), yielding 1 for agreement and 0 otherwise. The slope of a given time-series at a given feature number (timestamp) can be readily obtained with Equation (14).

A\left(x,y\right)={\displaystyle \sum _{i=1}^{n-1}}\frac{\mathcal{I}\left(Incl\left(x,i\right)=Incl\left(y,i\right)\right)}{n-1}

(12)

Incl\left(a,i\right)=\left\{\begin{array}{cc}0\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}slope\left(a,i\right)=0\hfill \\ -1\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}slope\left(a,i\right)<0\hfill \\ 1\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}slope\left(a,i\right)>0\hfill \end{array}\right.

(13)

slope\left(a,i\right)=\frac{{a}_{i+1}-{a}_{i}}{{t}_{i+1}-{t}_{i}}

(14)

Along with the slope information previously defined, the authors consider whether the minimum and maximum values of the time-series under comparison occur at the same feature (timestamp). This concept is defined in Equation (15).

M\left(x,y\right)=\left\{\begin{array}{cc}0\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}ma{x}^{{t}_{x}}\ne ma{x}^{{t}_{y}}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{and}}\phantom{\rule{2.77695pt}{0ex}}mi{n}^{{t}_{x}}\ne mi{n}^{{t}_{y}}\hfill \\ 0.5\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}ma{x}^{{t}_{x}}=ma{x}^{{t}_{y}}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{or}}\phantom{\rule{2.77695pt}{0ex}}mi{n}^{{t}_{x}}=mi{n}^{{t}_{y}}\hfill \\ 1\hfill & \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}ma{x}^{{t}_{x}}=ma{x}^{{t}_{y}}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{and}}\phantom{\rule{2.77695pt}{0ex}}mi{n}^{{t}_{x}}=mi{n}^{{t}_{y}}\hfill \end{array}\right.

(15)
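The building blocks of YS1 and YR1, i.e., Equations (12)-(15), can be sketched as follows (function names are ours; for M we assume the extrema positions are unique, since `list.index` returns only the first occurrence):

```python
def slope(a, t, i):
    # Equation (14): slope of series a between timestamps t[i] and t[i+1]
    # (0-based indices here).
    return (a[i + 1] - a[i]) / (t[i + 1] - t[i])

def agreement(x, y, t):
    # Equation (12): fraction of the n-1 slopes whose signs agree, with
    # Equation (13) mapping each slope to -1, 0 or 1.
    def incl(a, i):
        s = slope(a, t, i)
        return (s > 0) - (s < 0)
    n = len(x)
    return sum(incl(x, i) == incl(y, i) for i in range(n - 1)) / (n - 1)

def min_max_match(x, y):
    # Equation (15): 1 if both extrema occur at the same timestamps,
    # 0.5 if only one of them does, 0 otherwise.
    same_max = x.index(max(x)) == y.index(max(y))
    same_min = x.index(min(x)) == y.index(min(y))
    return (same_max + same_min) / 2
```

YS1 of Equation (16) is then the weighted sum *θ*_{1}·`agreement` + *θ*_{2}·`min_max_match` + *θ*_{3}·(1 + Spearman)/2.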

YS1 and YR1 take into account Equations (12) and (15) alongside information provided by correlation measures. YS1, which is given by Equation (16), combines the previously introduced information with the Spearman correlation coefficient, whereas YR1, Equation (17), employs the Pearson correlation coefficient. In these Equations, Spearman and Pearson are rescaled to [0, 1], respectively, as *S*(**x**, **y**) = (1 + *SP* (**x**, **y**))*/* 2 and *R*(**x**, **y**) = (1 + *PE*(**x**, **y**))*/* 2.

Y\phantom{\rule{2.77695pt}{0ex}}S1\left(x,y\right)={\theta}_{1}A\left(x,y\right)+{\theta}_{2}M\left(x,y\right)+{\theta}_{3}S\left(x,y\right)

(16)

Y\phantom{\rule{0.3em}{0ex}}R1\left(x,y\right)={\theta}_{1}A\left(x,y\right)+{\theta}_{2}M\left(x,y\right)+{\theta}_{3}R\left(x,y\right)

(17)

Note that Equations (16) and (17) are weighted summations, for which one should have *θ*_{1} + *θ*_{2} + *θ*_{3} = 1. Given the high cost associated with the estimation of such weights [37], we employed fixed values in order to compare these measures. In all our experiments we employed *θ*_{1} = 1*/* 4, *θ*_{2} = 1*/* 4, and *θ*_{3} = 1*/* 2, as in [37]. The running time of each measure is that of the correlation coefficient it employs, i.e., *O*(*n* log *n*) for *YS* 1 and *O*(*n*) for *YR* 1.

*Short Time-Series dissimilarity:* Taking into account the fact that a time-series is composed of *n −* 1 slopes (where *n* is the number of features in the time-series), [36] introduced a measure called Short Time-Series dissimilarity (STS), which is defined in Equation (18). The measure takes into account the time difference between the biological collection of samples (timestamps). In this sense, shorter intervals have a greater impact on the final value of the measure. STS has *O*(*n*) running time.

STS\left(x,y\right)=\sqrt{{\displaystyle \sum _{i=1}^{n-1}}{\left(\frac{{y}_{i+1}-{y}_{i}}{{t}_{i+1}-{t}_{i}}-\frac{{x}_{i+1}-{x}_{i}}{{t}_{i+1}-{t}_{i}}\right)}^{2}}

(18)
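STS is essentially the Euclidean distance between the slope sequences of the two series, as a short sketch makes clear (the timestamp vector **t** is passed explicitly; the function name is ours):

```python
def sts(x, y, t):
    # Equation (18): squared differences between corresponding slopes,
    # so with unequal timestamp spacing short intervals weigh more heavily.
    total = 0.0
    for i in range(len(x) - 1):
        slope_x = (x[i + 1] - x[i]) / (t[i + 1] - t[i])
        slope_y = (y[i + 1] - y[i]) / (t[i + 1] - t[i])
        total += (slope_y - slope_x) ** 2
    return total ** 0.5
```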

*Jackknife:* The so-called Jackknife correlation coefficient [15] was introduced with the aim of reducing the number of false positives produced by Pearson. This reduction is sought by removing values from both sequences during the computation of the Pearson correlation coefficient: false positive sequences tend to have a high correlation that vanishes when outlier values are removed. Therefore, Jackknife takes as its final value the smallest Pearson correlation obtained over the sequences when each of their features is removed, one at a time. The Jackknife correlation coefficient is formally defined in Equation (19). In this Equation, *PE*^{i}(**x**, **y**) stands for PE computed without the *i*^{th} feature of both **x** and **y** (*PE*^{0}(**x**, **y**) accounts for no feature removal).

JK\left(x,y\right)={\displaystyle \underset{0\le i\le n}{\mathsf{\text{min}}}}P{E}^{i}\left(x,y\right)

(19)

Although it was proposed for short gene time-series clustering, Jackknife can also be employed in other scenarios (note that it only involves feature removal). For this reason we employed it in all our experiments in this paper. It is easy to verify that the Jackknife correlation coefficient has *O*(*n*^{2}) running time, which can become prohibitive for data with a large number of features (as is the case for cancer data).
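Equation (19) can be sketched directly; the quadratic cost comes from recomputing Pearson for each of the *n* leave-one-out versions (the sketch assumes every leave-one-out subsequence remains non-constant; function names are ours):

```python
def jackknife(x, y):
    # Equation (19): minimum Pearson correlation over the original
    # sequences (i = 0) and all n leave-one-out versions.
    def pe(u, v):
        n = len(u)
        mu, mv = sum(u) / n, sum(v) / n
        num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
        den = (sum((a - mu) ** 2 for a in u) ** 0.5
               * sum((b - mv) ** 2 for b in v) ** 0.5)
        return num / den
    best = pe(x, y)  # i = 0: no feature removed
    for i in range(len(x)):
        best = min(best, pe(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
    return best
```

For example, for x = (1, 2, 3, 100) and y = (3, 1, 2, 100) the shared outlier drives plain Pearson close to 1, while Jackknife falls back to the weak correlation obtained after removing it.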

*Local Shape-based Similarity:* The measure called Local Shape-based Similarity (LSS), introduced by [35], considers the fact that similarities between genes can occur locally, in a subspace of the features of the time-series. The authors also consider the possibility that such local similarities may be transposed in one of the genes. Therefore, Local Shape-based Similarity searches for local and transposed alignments between the sequences that have a high score. The alignment with the highest score is taken as the final similarity value, given that it represents the best local (possibly transposed) similarity between the two time-series. The measure is given by Equations (20) and (21), in which *S* accounts for the similarity between any two size-*k* subsequences of **x** and **y**. The authors suggest a *min*_{k} of *n −* 2 (*n* is the number of features in the original series) [35].

LSS\left(x,y\right)={\displaystyle \underset{mi{n}_{k}\le k\le n}{\mathsf{\text{max}}}}Similarit{y}_{k}\left(x,y\right)

(20)

Similarit{y}_{k}\left(x,y\right)={\displaystyle \underset{1\le i,j\le n+1-k}{\mathsf{\text{max}}}}S\left(x\left[i,i-1+k\right],y\left[j,j-1+k\right]\right)

(21)

It is important to note that in order to obtain the final value of the Local Shape-based Similarity one has to compute similarities among sequences of different sizes (for any two sequences of the *same* length, LSS uses the Spearman correlation). Given that the probability of obtaining high similarity values is greater for shorter sequences, LSS employs this probabilistic rationale to obtain its final similarity value. With these considerations, *S* is defined as the probability associated with the correlation value of the subsequences being compared (which relates to their sizes). Details on these calculations can be found in [35]. Local Shape-based Similarity has *O*(*n*^{3}) running time, which according to its authors can be decreased by employing an approximate version [35].

### Datasets

We consider a total of 52 gene expression datasets in our study. These datasets are both from cancer and gene time-series experiments, as we detail in the following.

*Datasets from cancer studies:* We adopt the benchmark set of 35 datasets compiled by [4] in order to evaluate distance measures for the clustering of cancer data. From these datasets, 14 were obtained with cDNA microarrays, whereas 21 were produced with Affymetrix microarrays. Cancer benchmark data is summarized by Table 1. Please, consult [4] for full details regarding this benchmark set.

*Datasets from short gene time-series studies:* For this type of data we adopt the benchmark set of 17 datasets compiled by [34]. All the datasets from this benchmark set, which come from three independent studies involving yeast, i.e., *Saccharomyces cerevisiae*, were produced employing cDNA microarrays. These datasets are summarized by Table 2. Please, consult [34] for full details regarding this benchmark set.

### Clustering methods

We employed four different clustering methods in our comparison, which are briefly reviewed in the sequel.

The k-medoids clustering method [25] is similar to the more popular k-means [63]. The only difference between these two clustering methods is that, in k-medoids, each cluster is summarized by a medoid, i.e., an actual object that minimizes its distance to all the remaining objects that belong to the cluster. The k-medoids method has three main steps: (i) for a given number *k* of clusters, *k* randomly chosen objects are selected as cluster medoids; (ii) each object in the dataset is assigned to the cluster with the closest medoid; and (iii) cluster medoids are updated, i.e., for each cluster the new medoid is the object that has the lowest distance to the remaining objects of its cluster. Steps (ii) and (iii) are repeated until a fixed number of iterations is exceeded or changes in cluster memberships are no longer observed. It is important to note that k-medoids is not a deterministic method, i.e., different initializations may produce different outputs. For this reason, for each dataset, number of clusters, and distance adopted, the method is initialized 50 times.

Hierarchical clustering methods are fairly common in the gene expression literature. We consider three different variants of agglomerative hierarchical clustering [23], i.e., Average-Linkage, Complete-Linkage, and Single-Linkage. These methods take as input a proximity matrix generated from a dataset and produce as output a hierarchy of partitions, usually referred to as a dendrogram. Hierarchical clustering methods have two main steps: (i) each one of the objects is assigned to a singleton cluster, i.e., a cluster with a single object; and (ii) the two closest clusters are merged into a new cluster comprising their objects. Step (ii) is then repeated until a single cluster is obtained. The differences among Average-Linkage, Complete-Linkage, and Single-Linkage lie in how the distance between clusters is computed in order to identify the two closest clusters. For Average-Linkage this distance is given by the mean distance among all objects belonging to different clusters. For Complete-Linkage it is given by the largest distance between objects in different clusters. In Single-Linkage it is given by the smallest distance among objects belonging to different clusters. To obtain partitions with distinct numbers of clusters we just have to "cut" the resulting dendrogram at the desired level.

Finally, the interval \left[2,\lceil\sqrt{o}\rceil\right] (where *o* denotes the number of objects in the dataset), which comprehends the numbers of clusters considered during our second and third evaluation scenarios, is chosen due to its common usage in the clustering literature [64, 65].

### Clustering validity

In the following we briefly describe the two *traditional* clustering validity criteria employed in order to assess the quality of partitions. Note that for gene time-series datasets we also employed a biologically driven validation methodology, as we already detailed.

#### Adjusted rand index

For cases in which a reference partition is available one can employ external validation measures to quantify the quality of the results. Due to its correction that takes into account conformities between partitions found by chance [66], we choose the Adjusted Rand [23, 47], defined by Eq. (22), to evaluate clustering results. The greater its value, the greater is the concordance between the two partitions under comparison, with values close to 0 indicating conformities found by chance. Given a partition \mathcal{U} and a reference partition \mathcal{V}, in Eq. (22), (*a*) accounts for the total number of object pairs belonging to the same cluster in both \mathcal{U} and \mathcal{V}; (*b*) represents the total number of object pairs in the same cluster in \mathcal{U} and in different clusters in \mathcal{V}; (*c*) is the total number of object pairs that are in different clusters in \mathcal{U} and in the same cluster in \mathcal{V}; and (*d*) is the total number of object pairs that are in different clusters in both \mathcal{U} and \mathcal{V}.

AR=\frac{a-\frac{\left(a+b\right)\left(a+c\right)}{a+b+c+d}}{\frac{\left(a+b\right)+\left(a+c\right)}{2}-\frac{\left(a+b\right)\left(a+c\right)}{a+b+c+d}}

(22)
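The pair counts (*a*), (*b*), (*c*), (*d*) and Eq. (22) can be sketched as follows (an O(m²) sketch over all object pairs; the function name is ours and the label encodings are arbitrary):

```python
def adjusted_rand(u, v):
    # u and v are cluster labels of the same m objects. Count the four
    # pair types described in the text, then apply Eq. (22).
    a = b = c = d = 0
    m = len(u)
    for i in range(m - 1):
        for j in range(i + 1, m):
            same_u, same_v = u[i] == u[j], v[i] == v[j]
            if same_u and same_v:
                a += 1
            elif same_u:
                b += 1
            elif same_v:
                c += 1
            else:
                d += 1
    total = a + b + c + d
    expected = (a + b) * (a + c) / total  # agreement expected by chance
    return (a - expected) / (((a + b) + (a + c)) / 2 - expected)
```

Identical partitions score 1 regardless of how their labels are named.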

#### Silhouette index

To estimate the number of clusters in our third evaluation scenario, a relative index of comparison between partitions is also employed. The Silhouette index is defined by Eq. (23), considering a partitioning of *m* objects into *k* disjoint clusters. In Eq. (23), *u*(*i*) represents the average distance between object **x**_{i} and all the remaining objects of its cluster. The value *v*(*i*) is obtained as follows: for object **x**_{i}, the average distance between **x**_{i} and all the objects of a given cluster is computed; this process is repeated for the *k −* 1 clusters other than the one to which **x**_{i} belongs, and the lowest of these mean values is assigned to *v*(*i*). In other words, *v*(*i*) stands for the mean distance between **x**_{i} and its neighbor (closest) cluster. Silhouette, which is a maximization measure, has its values within [*−* 1, 1].

S=\frac{1}{m}{\displaystyle \sum _{i=1}^{m}}\frac{v\left(i\right)-u\left(i\right)}{max\left\{v\left(i\right),u\left(i\right)\right\}}

(23)
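Given a precomputed distance matrix and cluster labels, Eq. (23) can be sketched as follows (the function name is ours; the sketch assumes at least two clusters and, by one common convention, gives a singleton cluster *u*(*i*) = 0):

```python
def silhouette(dist, labels):
    # dist is an m x m distance matrix; labels holds the cluster of each
    # object. For each object: u = mean distance within its own cluster,
    # v = mean distance to the closest other cluster (Eq. 23).
    m = len(labels)
    clusters = set(labels)
    total = 0.0
    for i in range(m):
        own = [dist[i][j] for j in range(m) if j != i and labels[j] == labels[i]]
        u = sum(own) / len(own) if own else 0.0
        v = min(
            sum(dist[i][j] for j in range(m) if labels[j] == c) /
            sum(1 for j in range(m) if labels[j] == c)
            for c in clusters if c != labels[i]
        )
        total += (v - u) / max(v, u)
    return total / m
```

Two tight, well-separated clusters yield a value close to 1.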

We choose the Silhouette based on its superior results in comparison to other relative criteria, as demonstrated by [49, 67, 68]. We also note that the Silhouette has already been successfully employed to estimate the number of clusters for gene expression data, e.g., [69–71].

Finally, we would like to note that by using the Silhouette index we simulate a real application in which the user does not have any a priori information regarding the number of clusters present in the data. It is important to make clear that the use of relative indexes (such as the Silhouette) is just one part of the more general procedure that comprises the whole clustering analysis, i.e., (i) pre-processing, (ii) clustering, and (iii) validation [72]. In a real application, relative indexes may, in turn, help the user to choose the "best" partition or the "best" number of clusters for a given dataset (according to the criterion). For a review of clustering validation techniques in gene expression, please refer to [72].

### Friedman and Nemenyi statistical tests

Statistical tests were employed to assess the significance of the results obtained during our experimental evaluation. Based on the work of [73], we use the Friedman [74] and Nemenyi [75] tests (at a 0.05 significance level), given that they are more appropriate for evaluating the results of a collection of methods obtained over different datasets.