The three stages of heritable clustering are laid out in detail here. First, we will describe an algorithm for determining the number of clusters. Then, we discuss two new procedures for clustering that are appropriate for our problem. Finally, we provide an algorithm that organizes the clusters into a pathway network to capture tumor progression epigenetically and phenotypically. Except for the likelihood approach, which is based on probabilistic modeling, all the other methods considered here make use of a distance metric. Thus, we describe our chosen distance, or the equivalent – similarity measure, next before we detail the three stages of heritable clustering.

### Similarity measures for epigenotypes and phenotypes

For our purposes, the data generated by methylation microarray are interpreted as categorical in nature (henceforth described as epigenotype) – 1: hypermethylated; 0: unmethylated. Methylation progression patterns among tumor samples are integral to the inheritance property of our model; hence, the capacity to capture such patterns is a requisite of any clustering method employed. Under these constraints, we choose to design our algorithm based on the concept of *ε*-similarity [22], which defines distance and similarity measures suitable for our analysis. Specifically, the Hamming distance [23] defines the distance between two binary vectors of equal length as the number of elements that have different bits. This distance measure is adopted for describing the distance between the epigenotypes of two tumor samples. For each tumor *t*, *t* = 1,… , *T*, let X
_{
t
}= {*X*
_{
tg
}, *g* = 1,… , *G*} be the epigenotype vector at *G* loci. The epigenotype distance between two tumor samples *t*
_{
i
}and *t*
_{
j
}is then defined as

Since most phenotype data (e.g., clinical stage, histological grade, or hormone receptor status) are categorical in nature, we assume that each tumor phenotype is a discrete ordinal, or can be ordered sensibly beginning from 0 to *K*
_{
p
}- 1, where *K*
_{
p
}is the number of the categories for phenotype *p*. Similar to the notation for epigenotypes, we use Y
_{
t
}= {*Y*
_{
tp
}, *p* = 1,… , *P*} to denote the vector of phenotypes for tumor *t*. Then the phenotype distance between two tumors *t*
_{
i
}and *t*
_{
j
}is

Finally, the similarity measure between two tumors *t*
_{
i
}and *t*
_{
j
}is defined as

*S*(*t*_{
i
}, *t*_{
j
}) = 1 - (*w*·*d*_{
p
}(*i*, *j*) + (1 - *w*)·*d*_{
g
}(*i*, *j*)),

where 0 ≤ *w* ≤ 1 is a weight parameter to balance the contributions from epigenotype and phenotype similarities. Two tumors, *t*
_{
i
}and *t*
_{
j
}, are said to be *ε*-similar if and only if *S*(*t*
_{
i
}, *t*
_{
j
}) ≥ *ε*, where 0 ≤ *ε* ≤ 1 represents the level of similarity. In the proposed heritable clustering method, if two tumors are sufficiently similar, they will be clustered into the same group. The selection of an appropriate *ε* depends on the desired degree of similarity within a cluster. The lower the *ε* value, the less similarity (i.e. more variation) within each cluster is allowed. To balance the contributions from epigenotypes and phenotypes, and to guarantee a reasonable level of similarities among tumor samples within each cluster, we suggest considering the parameters *w* and *ε* in the following ranges: 0.2 ≤ *w* ≤ 0.8 and 0.5 = *ε* ≤ 1.

### Determination of number of clusters

For each combination of weight and similarity (*w*, *ε*) (e.g., 0.2 ≤ *w* ≤ 0.8,0.5 ≤ *ε* ≤ 1), we use the following steps to determine the number of clusters.

1. Begin with two tumors {*t*
_{
i
}, *t*
_{
j
}} that maximize the similarity measure *S*(*t*
_{
i
}, *t*
_{
j
}). If the maximal value of S is less than *ε*, assign each tumor into separate clusters and stop. Otherwise let *C*
_{1} = {*t*
_{
i
}, *t*
_{
j
}} and go to the next step.

2. Suppose there exist *K* clusters *C*
_{1},… , *C*
_{
k
}. Let *n*
_{
k
}be the number of tumors in *C*
_{
k
}and *t*
_{
ki
}be the *i*-th tumor to be added to *C*
_{
k
}, *k* = 1,… , *K*, and *i* = 1,… , *n*
_{
k
}. Let *t* ∈ *T* be a tumor sample that has not yet been assigned to any of the clusters. The similarity score between *t* and each of the existing cluster is defined as:

Let (*t**, *k**) = argmax{*S*(*t*, C_{
k
}), *t* ∈ *T*, *k* = 1,… , *K*}. If *S*(*t**, *C*
_{k*}) ≥ *ε*, then *C*
_{k* }= *C*
_{k* }∪ {*t**}; otherwise create a new cluster *C*
_{K+1 }= {*t**}.

3. Repeat step 2 until all tumor samples are assigned to clusters. Then calculate the total similarity score

where
is the average similarity in cluster *k* and *K*(*w*, *ε*) is the corresponding number of clusters with parameters *w* and *ε*.

What remains is to find the optimal cluster number *K* and its corresponding parameter pair (*w*, *ε*). In general, if the number of clusters *K* is large, then *T*
_{
S
}is also large, and consequently log(*T*
_{
S
}) is also large. This leads us to propose a model selection criterion following the formulation of Akaike's Information Criterion (AIC) [24]. The main idea is to maximize total similarity subject to a penalty term for over stratification. Specifically, we seek (*K*, *w*, *ε*) that satisfies

where *P* and *G* denote the number of phenotypes and epigenotypes, respectively, as defined before. Note that the second term in the objective function *f* is to penalize over estimation of the number of clusters. It is designed to balance the number of clusters and total similarity, as in AIC.

A different clustering algorithm other than the one described above may be used to determine the number of clusters for each *w* and *ε*. However, existing algorithms cannot easily accommodate the requirement of *ε*-similarity, which is what prompted us to devise the above algorithm.

### Clustering samples

We then turn to clustering algorithms to group samples into *K* clusters, where *K* is the optimal number of clusters determined from the previous stage. Here we describe two novel algorithms, one based on distance (similarity) and the other based on likelihood. Both methods are iterative procedures like that of k-means, and therefore it is worth noting that the first stage of heritable clustering also produces clusters as a by-product, which can be conveniently used as initial clusters here. For the three existing distance-based algorithms that we also consider, K-means, PAM, and hierarchical clustering, the distance measure used is the same as that for SIM, to be described, which uses both methylation status and phenotypes. We note that this distance measure differs from what usually used in these popular algorithms (e.g., Euclidean distance or correlation based) as we are dealing with discrete data and want to take both epigenotype and phenotype information into consideration. However, the proposed method is equally applicable to continuous data or a combination of continuous/discrete data from an operational point of view, albeit with a different distance measure suitable for the data type(s).

#### Distance-based similarity approach

This can be regarded as a hybrid clustering algorithm that combines the essence of K-means and Silhouette (principal criterion in PAM) to balance within-cluster similarity and between-cluster distinctiveness. Let *a*(*t*) denote the average similarity between tumor *t* and all the other tumors in cluster *C*
_{k* }to which *t* belongs. For any of the other clusters *C*
_{
k
}, *k* ≠ *k**, let *S*(*t*, *C*
_{
k
}) be the average similarity of *t* to all samples in *C*
_{
k
}. We denote by *b*(*t*) = max_{k ≠ k* }
*S*(*t*, *C*
_{
k
}) the similarity between *t* and its nearest "neighbor" cluster. If *ab*(*t*) = *a*(*t*) - *b*(*t*) ≥ 0, we say that *t* is correctly assigned to its current cluster, *C*
_{k*}, otherwise, it is a candidate for switching its cluster membership.

Algorithm: SIM

1. Calculate *ab*(*t*) for each *t* and let *abmin* = min_{
t
}
*ab*(*t*).

2. If *abmin* < 0, move the corresponding *t* to its "neighbor" cluster.

3. Repeat 1–2 until *abmin* ≥ 0.

#### Likelihood-based method

Unlike the SIM algorithm or most other existing algorithms in the literature, the likelihood-based method proposed here does not depend on any measure of distance. This leads to greater flexibility, in that the approach can deal with both discrete and continuous data, missing observations for some of the variables, and dependencies among the variables. The idea is similar, in spirit, to that reported in Cai et al. [25] for SAGE data. It also shares the commonality of a parametric clustering formulation with Siegmund et al. [26] for analyzing methylation data, although the two address two completely different problems. We assume that the epigenotype vectors (X
_{
t
}) for tumor *t* follows a common parametric family of distributions with its own parameter vector θ
_{
tG
}. Analogously, we use θ
_{
tp
}to denote the parameter vector for the distribution of the clinical phenotype vector Y
_{
t
}. That is,

X_{
t
}= {*X*_{t1}, *X*_{t2},… , *X*_{
tG
}}~*P*(. | *θ*_{
tG
}),

Y_{
t
}= {*Y*_{t1}, Y_{t2},… , *Y*_{
tP
}}~*P*(.| *θ*_{
tP
}).

Thus, X
_{
t
}and Y
_{
t
}are jointly distributed as

(X
_{
t
}, Y
_{
t
}) ~ *P*(.|*θ*
_{
t
}= (*θ*
_{
tG
}, *θ*
_{
tP
})).

If X
_{
t
}and Y
_{
t
}are assumed to be independent, then

*P*(X_{
t
}, Y_{
t
}| *θ*_{
tG
}, *θ*_{
tP
}) = *P*(X_{
t
}| *θ*_{
tG
})*P*(Y_{
t
}| *θ*_{
tP
}).

The goal is to group tumors with similar epigenotypes and phenotypes according to their parameter vectors. That is, we assume that tumors within a cluster (*C*
_{
k
}) share the common distributional parameter vector
, which represents the cluster profile. Let *I*
_{
k
}(*t*) = 1 if tumor *t* is in cluster *C*
_{
k
}, otherwise it is 0, in the current iteration. Then, the joint likelihood is

where *K* is the number of clusters, and *T* is the number of tumor samples. Suppose that
and
are the maximum likelihood estimate of
and
, respectively, *k* = 1,… , *K*, then it is natural to evaluate how well a particular tumor sample fits into the assigned cluster by computing

*k** = argmin_{
k
}{- log *P*(X_{
t
}, Y_{
t
}|
,
); *k* = 1,… , *K*}.

If *C*
_{k* }is not the same as its currently assigned cluster, then tumor *t* is a candidate for switching cluster membership. This basic idea may lead to various clustering algorithms, including the one below used for our primary breast tumor data.

The above formulation of the likelihood clustering approach provides a general setting in which dependencies among epigenotypes (e.g., hypermethylated promoter regions binded to the same transcription factor) and phenotypes (e.g, tumor grade and metastasis status) can be easily incorporated. In the breast tumor example, we have discrete epigenotypes (hypermethylated or not) and phenotypes (ordinal), therefore binomial and multinomial are the natural choice of parametric families for the distributions of the variables. However, the framework can be flexibly adapted to any other type of data, such as continuous measurements of methylation intensities. Finally, the approach can make use of tumor samples that have missing data on some of the variables; the contribution to the corresponding likelihood from such a sample will be set to unity by convention.

Algorithm: LH

1. For each tumor *t* ∈ *C*
_{k*}, calculate *L*
_{
t
}(*k**) = -log *P*(X
_{
t
}, Y
_{
t
}|
) and *L*(*k* ≠ *k**) = min_{k≠k*}{-log *P*(X
_{
t
}, Y
_{
t
}|
)}.

2. If max_{
t
}{*L*
_{
t
}(*k**)/*L*
_{
t
}(*k* ≠ *k**)}*>* 1, move the corresponding *t* to cluster
where
= argmin_{k≠k*}{-log *P*(X
_{
t
}, Y
_{
t
}
)}.

3. Repeat 1–2 until max_{
t
}≤ 1.

### Building progression pathway

In the final stage of heritable clustering, clusters generated from the previous stage will be assembled into a pathway structure to represent the pathway of tumorigenesis. We first describe the concepts of cluster centers and scores, which are essential for our pathway discovery (PD) algorithm.

The clusters as previously described become the nodes of the tumor progression model. In order to derive pathways between nodes, a vector representative of the epigenotype and phenotype signatures of the tumors within a given node needs to be defined. Node centers and scores (both epigenotypic and phenotypic) derived from each cluster are used to define such a vector and is referred to as the node label. The epigenotype center of a cluster is determined by the epigenotype status common to the majority of tumors in the cluster. Let *V*
_{
kg
}denote the set of epigenotype statuses at locus *g* over all tumors in cluster *C*
_{
k
}, and *P*(*V*
_{
kg
}) be the number of 1s in *V*
_{
kg
}. Then the *epigenetic node center* (*ENC*) for locus *g* in cluster *C*
_{
k
}is defined as:

where card{*V*
_{
kg
}} is the cardinality of the set *V*
_{
kg
}. The epigenotype score, or degree, of the node *k* is then defined based on the calculated node centers as follows:

*i.e*., the proportion of 1s in the set of ENCs for the node. The epigenotype score of a node is interpreted as measuring the extent of methylation of the tumors within the cluster.

With the definition of epigenotype centers and scores, it is now possible to define heritability of a progeny *C*
_{
j
}from a parental node *C*
_{
i
}:

The value of *H*(*C*
_{
i
}, *C*
_{
j
}), which is between 0 and 1 and meaningful only if *GS*
_{
i
}≤ *GS*
_{
j
}, is the degree of heritability. Strict inheritance is defined when *H* = 1. Under this condition, all hypermethylated loci in a parental node are inherited by its progeny nodes. Note that the heritability is defined on the loci methylation signature of the ENC and not the methylation signature of the tumors that comprise the node. Such a definition of heritability is faithful to the recapitulation nature of the method. In an analogous manner, phenotype centers and scores are used to capture the clinical progression in tumorigenesis. The center of a phenotype in a cluster is taken to be the weighted average of the phenotype values of the samples in the cluster rounded to the nearest integer. Let *n*
_{
p
}be the number of categories for phenotype *p* and *c*
_{
ki
}= card{*Y*
_{
tp
}= *i*| *t* ∈ *k*} be the count of category *i* in cluster *C*
_{
k
}, *i* = 1,… , *n*
_{
p
}. Then the phenotypic node center of phenotype *p* in cluster *C*
_{
k
}is

where ⌊ ⌋ is the floor of the value being bracketed.

The phenotype score for cluster *C*
_{
k
}is then calculated as

where *K*
_{
p
}, as defined before, is the number of categories for phenotype *p*. This score can be interpreted as measuring the average phenotypic value of the tumors in the cluster, with a larger score being indicative of more advanced tumors. Analogous to the concept of epigenotype heritability, we assume that phenotypic scores follow a temporal order. That is, a node with a small score represents a tumor that occurred temporally before a tumor represented by a node with a larger score. Our PD algorithm is built to capture this chronological characteristic.

Algorithm: PD

1. Sort nodes in ascending order according to their epigenotypic scores with ties determined by their phenotypic scores. In the unlikely case that at least two nodes have the same epigenotypic and phenotypic scores, their ordering is determined randomly. Assume, with possible relabeling, that the set of ordered nodes is *C* = {1, 2, … , *K*}. Set node 1 as the initial node in the pathway network.

2. Suppose *C*
^{
N
}is the set of nodes already used to construct the pathway. The ordering of *C* determined in the previous step is inherited by *C*
^{
N
}. Let *C** = *C*
^{
N
}, and also note that *C** will be reset at each iteration in Step 3. Add the node *k* = min{*C*\*C*
^{
N
}} to the pathway (i.e., to *C*
^{
N
}but not to *C**) and finding all possible directed paths to it from other nodes already in the pathway in Step 3.

3. While *C** ≠ ∅:

Let *j* = max{*C**}. If (a) *H*(*j*, *k*) ≥ *h*, and (b) for each phenotype *p*, *PC*
_{
jp
}≤ *PC*
_{
kp
}, then *k* is added as a downstream node to node *j*; then the node *j* and all nodes upstream of it (*i.e*., nodes *i* such that there exists a directed path from *i* to *j*) are removed from the set *C**. Otherwise, only *j* is removed from the set *C**.

4. Repeat Steps 2 and 3 until all nodes have been added to the pathway.

5. If there does not exist a node that is upstream of all other nodes, then a pseudo-origin is created by defining a node *C*
_{0} with both phenotype and epigenotype score of zero. All nodes in *C*
^{
N
}without upstream nodes are added as downstream adjacent nodes of *C*
_{0}.

In step 1 of the PD algorithm, different ordering of the tied nodes will not lead to altered pathways. In fact, step 1 is needed only for designing an efficient algorithm. One may work with the nodes in any order, but then one would also need to check whether a new node to be added is a upstream node of a node already in the pathway (in addition to checking whether it is a downstream node). In this way, regardless of the ordering, the same pathway will result.