We first briefly present the k-means algorithm as a standard algorithm for clustering numerical values into two groups. We also describe a model-based clustering approach. Then we introduce different measures for assessing if the shape of a distribution supports the assumption that two clear subgroups can be identified. In other words, we present measures for testing bimodality of distributions. We briefly describe the logrank test for quantifying the statistical significance of survival time differences between two groups of patients, and we present the Kolmogorov-Smirnov test for the enrichment analysis comparing the prognostic genes with the genes with bimodal distributions.
Cluster Algorithms
K-means
A straightforward method to assign patients to two groups based on their gene expression values is to use a clustering algorithm. One popular method is the k -means algorithm.
In the univariate case, let x1,..., x
n
be a set of observations. Given k < n initial distinct cluster centers the k-means algorithm partitions the set into k groups by minimizing the Within Cluster Sum of Squares
, where C
j
, j = 1,..., k, is the j th cluster, n
j
is the number of elements in cluster C
j
and
is the cluster center. The algorithm is an iterative process that alternates between assigning data points to clusters based on their distances to cluster prototypes and updating the prototypes based on new cluster assignments.
K-means is sensitive to the initially randomly selected cluster centers [13]. As we want to discriminate a group with low expression values from a group with high expression values we choose for each gene its minimum and its maximum observed expression value as initial centers.
Model-based clustering
The assumption for this approach is that the distribution Y of a gene with bimodal expression can be expressed as a mixture of two gaussian distributions Y1 and Y2 with parameters μ1,
and μ2,
, respectively.
where Δ ∈ {0, 1} with P(Δ = 1) = π. π is the probability of belonging to Y1. Let ϕ
θ
denote the normal density with parameters θ = (μ, σ2). Then the density of Y is given by
The parameters of the mixture model are determined using the EM-algorithm.
To fit the models to our data we make use of the R package mclust[14, 15]. Typically equality of variances is not assumed and we fit models with unequal variances. However, in cases with only one outlier in the expression distribution a model with unequal variances is not suitable since one component has variance 0. In this case we fit a two component model with equal variances.
Tests for bimodality
The first two scores for testing bimodality quantify whether the partitioning of the expression values into two groups is adequate. We want to quantify whether assuming two clusters is more appropriate than assuming one cluster. The tests are based on the decomposition of the Total Sum of Squares
as
where BSS is the Between Cluster Sum of Squares and WSS the Within Cluster Sum of Squares.
Variance reduction score
The variance reduction score (VRS) is defined as the ratio of WSS and TSS:
VRS measures the proportion of variance reduction when splitting the data into two clusters. The value of this score lies in the interval [0; 1], and a low score indicates an informative split.
Weighted variance reduction score
The weighted variance reduction score (WVRS) measures the variance reduction independent of the cluster sizes. In the numerator we calculate the mean of the two within cluster variances. The denominator is the sample variance as above for the VRS. We define
In this case the variance of a cluster with few observations has the same influence as the variance of a large cluster. The value of this score can be larger than 1. Again, a low score reflects bimodality. WVRS has the ability to also identify splits into two clusters with extremely unequal sample sizes.
Dip Test of Unimodality
The Dip Test of Unimodality was suggested by Hartigan and Hartigan (1985) [7]. The dip statistic is defined as the maximum difference between an empirical distribution function and the unimodal distribution function that minimizes that maximum difference.
For two arbitrary bounded functions F and G define ρ(F, G):= sup
x
|F(x)-G(x)|. Define
for a class
of bounded functions. Let
denote the class of unimodal distribution functions. Then the dip of a distribution function F is defined by
. To test the null hypothesis that F has a unimodal density Hartigan and Hartigan proposed the statistic D(F
n
), where F
n
is the empirical distribution function of a random sample of size n. The distribution of D(F
n
) is compared with the distribution of D(F), where F is the uniform distribution on [0; 1]. Hartigan and Hartigan [7] showed that the dip is asymptotically larger for the uniform distribution than for any other unimodal distribution.
To calculate the dip statistic we make use of the R package diptest[16], which also provides a table of empirical percentage points of the dip statistic D(F
n
) for some specified sample sizes n. This distribution can be used to calculate p-values for the dip test, where the null hypothesis is a unimodal distribution.
Kurtosis
Teschendorff et al. (2006) [8] proposed an approach to identify genes with bimodal density based on model-based clustering and kurtosis. The general approach is based on two steps. To select the optimal number of clusters and to find bimodal expression profiles a clustering algorithm is used together with a model selection criterion. Then the kurtosis can be used to rank the bimodal genes according to if they define major subgroups or outlier subgroups. Kurtosis is related to the fourth central moment and can be defined by
Given a gene's expression values x = (x1,..., x
n
) an unbiased estimate for the kurtosis is given by
A gaussian distribution has kurtosis K = 0, whereas most non-gaussian distributions have either K > 0 or K < 0. Specifically, a mixture of two approximately equal mass normal distributions must have negative kurtosis since the two modes on either side of the center of mass effectively flatten out the distribution. A mixture of two normal distributions with highly unequal mass must have positive kurtosis since the smaller distribution lengthens the tail of the more dominant normal distribution [8]. If there is an 80%-20% split of the samples into two groups, then the kurtosis is close to 0 [9]. Therefore biologically interesting genes might be missed.
We calculate the kurtosis for all genes without a prior feature selection step.
Likelihood Ratio
Using the likelihood ratio of a normal model and a mixture normal model to identify bimodal distributions was suggested by Ertel and Tozeren (2008) [6]. We fit a normal model and a mixture normal model to the expression values for each gene using the mclust package and calculate the likelihood ratio
where L1 is the likelihood of the normal model with one component and L2 is the likelihood of a normal mixture model with two components with unequal variance.
Small ratios indicate that the distribution is unimodal, whereas large ratios suggest that the expression values are bimodally distributed.
Bimodality Index
The Bimodality Index introduced by Wang et al. (2009) [9] is a criterion to identify and rank bimodal signatures from gene expression data. It is assumed that the distribution of a gene with bimodal expression can be expressed as a mixture of two normal distributions with means μ1 and μ2 and equal standard deviation σ. The standardized distance δ between the two populations is given by
For identifying genes with bimodal distribution the null hypothesis is δ = 0. Then the bimodality index (BI) is defined by
where π is the proportion of observations in the first component.
For a given data set δ and π can be estimated using mclust. Then BI can be calculated using the estimated values. Larger values of BI correspond to bimodal distributions where the two components are easier to distinguish. Wang et al. recommend BI = 1.1 as a cutoff to select bimodally distributed genes.
Outlier-Sum Statistic
Another approach to group genes by expression values is based on the calculation of outlier sums as proposed by Tibshirani and Hastie (2007) [12]. This method is able to detect genes with unusually high or low expression in some but not all samples. Tibshirani and Hastie assume that the samples fall into a reference and a disease group. However, as we only consider tumor samples we modified the method for our purpose.
Let med
i
and mad
i
be the median and the median absolute deviation of the expression values of gene i. First, the expression values x
ij
of each gene are standardized as follows:
Let q
r
(i) be the r th percentile of the
values and IQR(i) = q75(i) - q25(i) the interquartile range. All values smaller than IQR(i) - q25(i) or greater than IQR(i) + q75(i) are defined to be outliers. The outlier-sum statistic for positive outliers is defined as
W
i
is large if there are many outliers in the data or few extreme outliers with high values, and W
i
is zero if there are no outliers. Analogously the outlier-statistic for negative outliers is defined as
The outlier-sum statistic is the maximum of W
i
and W'
i
in absolute value.
For the survival analysis we need to split patients into two groups. We consider the outliers that are used to calculate the outlier sum as one group and all other observations as the other group.
Testing for prognostic relevance
For each gene we compare the hazard rates of two patient subgroups obtained by the different methods. The goal is to determine whether the survival in the two groups differs. We test the hypothesis H0: h1(t) = h2(t), for all t ≤ τ, against the alternative that the hazard rates differ for some t ≤ τ. τ is the largest time at which both groups have at least one subject at risk. We make use of the logrank test which is a nonparametric test for right-censored data first proposed by Mantel (1966) [17].
Let t
1
< t2 < ... < t
D
be the distinct death times in the pooled sample. Then d
ij
is the number of events in the j th sample out of Y
ij
individuals at risk at time t
i
, j = 1, 2, i = 1,..., D. Let d
i
= di 1+ di 2and Y
i
= Yi 1+ Yi 2be the number of events and the number at risk in the combined sample at time t
i
. The test statistic Z of the logrank test is given by
When the null hypothesis is true and the sample sizes in the two groups are similar, asymptotically Z has a standard normal distribution. In the case of highly unequal sample sizes, especially if one group contains less than five patients, the asymptotic distribution under the null hypothesis is not appropriate anymore [18, 19]. Thus we do not make use of the normal distribution for obtaining significance values. Instead, we use a permutation test. We calculate the test statistic for 100.000 random permutations of the patients and determine the percentage of values more extreme than the observed value of the test statistic. Using k-means we always get two groups of patients. However, using the model-based clustering approach and the outlier sum approach it is possible that for some genes all patients are in the same group so that the logrank test can not be applied. As these genes rather have unimodal expression profiles we decided to set the corresponding p-value 1.
Adjustment for multiple testing
We use the logrank test described in the previous Section to test for prognostic relevance for each of the genes. Here, we have to take the multiple testing problem into account. The significance level α of a hypothesis test controls the Type I error, i.e. the probability of rejecting a true null hypothesis. If m independent true hypotheses are tested simultaneously one would expect m · α p-values p < α. Therefore one has to adjust for multiple testing.
There are different correction methods. The Bonferroni-Holm method[20] controls the Familywise Error Rate (FWER), i.e. the probability of rejecting at least one true null hypothesis.
The method suggested by Benjamini and Hochberg (1995) [21] controls the False Discovery Rate (FDR), which is the expected proportion of falsely rejected hypotheses among all rejected hypotheses. This procedure is less conservative than the Bonferroni-Holm method and often more appropriate for high-dimensional microarray data.
Let p(1), . . ., p(m)denote the ordered p-values of m hypothesis tests, i.e. p(i)≤ p(i+1). Then the adjusted p-values, also referred to as q-values, can be calculated iteratively as follows
where q(m)= p(m).
Analyzing gene lists for enrichment with prognostic genes
To determine whether there is an enrichment of the top-scoring bimodality genes with prognostic genes, we rank the genes based on the different bimodality scores. We then select the 250 genes with smallest p-values of the logrank test. These genes differ for the different approaches (i.e. k-means, model-based clustering, outlier sum). In this context, we call these genes prognostic genes.
An enrichment of the top-scoring bimodality genes with prognostic genes corresponds to small ranks of prognostic genes in the ranked list. We test the null hypothesis that the ranks of the prognostic genes are uniformly distributed among the bimodality-ranked genes with a one-sample Kolmogorov-Smirnov test, which is a nonparametric test of equality with a given one-dimensional reference function. The Kolmogorov-Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution.
We perform the one-sided Kolmogorov-Smirnov test for testing the hypothesis that the cumulative distribution function of the ranks of the prognostic genes is not greater than the distribution function of a uniform distribution. Let F
n
denote the empirical distribution function of the sample and F the distribution function of the reference distribution, then the test statistic is given by
The null distribution of this statistic is calculated under the null hypothesis that the sample is drawn from the reference distribution. If F is continuous the distribution of
does not depend on F but just on the sample size n. For n > 40 approximative p-values can be derived from the asymptotical distribution of
[22].
Patients and gene array analysis
The Mainz study cohort consists of 194 node-negative breast cancer patients who were treated at the Department of Obstetrics and Gynecology of the Johannes Gutenberg University Mainz between the years 1988 and 1998 [1]. All patients underwent surgery and did not receive any systemic therapy in the adjuvant setting. Gene expression profiling of the patients' RNA was performed using the Affymetrix HG-U133A array, containing 22283 probe sets, and the GeneChip System [23]. The normalization of the raw data was done using RMA [24] from the Bioconductor [25] package affy[26]. The raw .cel files are deposited at the NCBI GEO data repository [27] with accession number GSE11121.