### Comparing segmentations

We first give some intuition for our segmentation-comparison method and provide some basic definitions. Consider a sequence *S* with length |*S*| = *N* and a segmentation *P* of *S*. The segmentation partitions the sequence into *k* non-overlapping and contiguous intervals, called *segments*, that span the whole sequence. A segmentation *P* with *k* segments can be fully defined using (*k* + 1) segment boundaries *p*_{0},...,*p*_{k}, where *p*_{i} ∈ *S*, *p*_{i} < *p*_{i+1} for every *i*, and *p*_{0} = 0 and *p*_{k} = *N*. The *i*-th segment of *P*, denoted by *P*_{i}, is defined to be the interval *P*_{i} = (*p*_{i-1}, *p*_{i}]. Each segment *P*_{i} consists of |*P*_{i}| points, which corresponds to the length of the segment.

Consider now a segmentation *P* of *S* consisting of *k* segments *P*_{1},...,*P*_{k}. If we randomly pick a point *x* on the sequence, then the probability that *x* ∈ *P*_{i} is Pr(*P*_{i}) = |*P*_{i}|/*N*. Since the segments cover the whole sequence we have Σ_{i=1}^{k} Pr(*P*_{i}) = 1. Therefore, we can define the *entropy* of a segmentation *P* to be

*H* (*P*) = - Σ_{i=1}^{k} Pr(*P*_{i}) log Pr(*P*_{i}).

The maximum value that the entropy of a segmentation can have is log *N*, attained by the segmentation with *N* unit-length segments.
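The entropy above is easy to compute from the boundary list alone. The sketch below is ours (not from the paper): segmentations are represented as boundary lists [*p*_{0},...,*p*_{k}], and we fix the logarithm base to 2.

```python
from math import log2

def entropy(bounds):
    """Entropy of a segmentation given its boundaries 0 = p_0 < ... < p_k = N.

    A random point lands in segment i with probability |segment_i| / N.
    """
    n = bounds[-1]
    lengths = [b - a for a, b in zip(bounds, bounds[1:])]
    return -sum((l / n) * log2(l / n) for l in lengths)

# A segmentation of a length-8 sequence into two equal halves:
print(entropy([0, 4, 8]))        # 1.0
# N unit-length segments reach the maximum, log N:
print(entropy(list(range(9))))   # 3.0 = log2(8)
```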

Consider now a pair of segmentations *P* and *Q* of sequence *S*. Assume that *P* and *Q* have *k*_{p} and *k*_{q} segments, respectively, such that *P* = *P*_{1},...,*P*_{k_p} and *Q* = *Q*_{1},...,*Q*_{k_q}. The *conditional entropy* [24] of *P* given *Q* is defined as follows.

*H* (*P*|*Q*) = - Σ_{j=1}^{k_q} Pr(*Q*_{j}) Σ_{i=1}^{k_p} Pr(*P*_{i}|*Q*_{j}) log Pr(*P*_{i}|*Q*_{j}),     (1)

where Pr(*P*_{i}|*Q*_{j}) = |*P*_{i} ∩ *Q*_{j}|/|*Q*_{j}| is the probability that a random point of segment *Q*_{j} lies in segment *P*_{i}.

That is, the conditional entropy of segmentation *P* given segmentation *Q* is the expected amount of information we need to identify the segment of *P* a point belongs to, given that we know the segment of this point in *Q*.
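Translating the definition directly into code makes the "expected amount of information" reading concrete. This is our own illustrative sketch (boundary-list representation and names are ours), computing Pr(*P*_{i}|*Q*_{j}) via interval overlaps:

```python
from math import log2

def cond_entropy(p_bounds, q_bounds):
    """H(P|Q) computed directly from the definition.

    Both segmentations are boundary lists 0 = p_0 < ... < p_k = N.
    Pr(P_i | Q_j) is the overlap |P_i ∩ Q_j| divided by |Q_j|.
    """
    n = p_bounds[-1]
    assert q_bounds[-1] == n, "both segmentations must cover the same sequence"
    h = 0.0
    for qa, qb in zip(q_bounds, q_bounds[1:]):
        pr_q = (qb - qa) / n
        for pa, pb in zip(p_bounds, p_bounds[1:]):
            overlap = min(pb, qb) - max(pa, qa)
            if overlap > 0:
                pr = overlap / (qb - qa)
                h -= pr_q * pr * log2(pr)
    return h

# Knowing the segment of Q = {(0,4], (4,8]} still leaves a point's
# segment in P = {(0,2], (2,4], (4,6], (6,8]} undetermined between two halves:
print(cond_entropy([0, 2, 4, 6, 8], [0, 4, 8]))  # 1.0
```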

The following lemma gives an efficient algorithm for computing the conditional entropies between two segmentations. The algorithm runs in time *O* (*k*_{p} + *k*_{q}).

**Lemma 1**. *Let P and Q be two segmentations. Denote by U their union, i.e., the segmentation defined by the segment boundaries that appear in P or in Q. The conditional entropy of P given Q, H (P|Q), can be computed using the following closed formula*

*H* (*P*|*Q*) = *H* (*U*) - *H* (*Q*).

*Proof*. Assume that segmentation *P* has *k*_{p} segments *P*_{1},...,*P*_{k_p} and segmentation *Q* has *k*_{q} segments *Q*_{1},...,*Q*_{k_q}. Using Equation (1) we can obtain the desired result. That is,

*H* (*P*|*Q*) = - Σ_{j} Pr(*Q*_{j}) Σ_{i} Pr(*P*_{i}|*Q*_{j}) log Pr(*P*_{i}|*Q*_{j})

= - Σ_{j} Σ_{i} Pr(*P*_{i} ∩ *Q*_{j}) (log Pr(*P*_{i} ∩ *Q*_{j}) - log Pr(*Q*_{j}))

= - Σ_{j} Σ_{i} Pr(*P*_{i} ∩ *Q*_{j}) log Pr(*P*_{i} ∩ *Q*_{j}) + Σ_{j} Pr(*Q*_{j}) log Pr(*Q*_{j})

= *H* (*U*) - *H* (*Q*),

where the last step follows because the nonempty intersections *P*_{i} ∩ *Q*_{j} are exactly the segments of the union segmentation *U*. This completes the proof.
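Lemma 1 turns the quadratic double sum of the definition into two entropy evaluations over merged boundary lists. A sketch of the resulting *O*(*k*_{p} + *k*_{q}) computation, under our own boundary-list representation (helper names are ours):

```python
from math import log2

def entropy(bounds):
    n = bounds[-1]
    return -sum((b - a) / n * log2((b - a) / n)
                for a, b in zip(bounds, bounds[1:]))

def merge_bounds(p, q):
    """Linear-time merge of two sorted boundary lists, dropping duplicates."""
    u, i, j = [], 0, 0
    while i < len(p) and j < len(q):
        v = min(p[i], q[j])
        if p[i] == v:
            i += 1
        if q[j] == v:
            j += 1
        u.append(v)
    u.extend(p[i:])
    u.extend(q[j:])
    return u

def cond_entropy(p_bounds, q_bounds):
    """H(P|Q) = H(U) - H(Q) by Lemma 1; overall O(k_p + k_q)."""
    return entropy(merge_bounds(p_bounds, q_bounds)) - entropy(q_bounds)

print(cond_entropy([0, 2, 4, 6, 8], [0, 4, 8]))  # 1.0, matching the direct definition
```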

Intuitively, the entropy of *P* given *Q* tells us how much information we obtain about *P* if we know that we are in a specific segment of *Q*. The more information *Q* reveals about the structure of *P*, the more *similar* segmentations *P* and *Q* are, and the smaller the value of *H* (*P*|*Q*).

The single value *H* (*P*|*Q*) does not give the whole picture of segmentation similarity, however. For example, consider the case where segmentation *Q* consists of a single segment. Then, using Lemma 1 we can verify our intuition that knowledge about *Q* gives us no information about *P*, i.e., *H* (*P*|*Q*) = *H* (*P*). However, notice that *H* (*Q*|*P*) = 0, for any *P*.

Consider also the case where *Q* consists of *N* segments where each segment has length 1. In this case, *H* (*P*|*Q*) = 0, that is, *Q* gives lots of information about *P*, irrespective of the structure of *P*. As before, observe that *H* (*Q*|*P*) = log *N* - *H* (*P*). Thus, if *P* has low entropy, the value of *H* (*Q*|*P*) is large.
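Both degenerate cases are easy to verify numerically. The sketch below assumes our own boundary-list helpers (not the paper's code), with `cond_entropy` computed via Lemma 1:

```python
from math import log2

def entropy(bounds):
    n = bounds[-1]
    return -sum((b - a) / n * log2((b - a) / n)
                for a, b in zip(bounds, bounds[1:]))

def cond_entropy(p, q):
    # Lemma 1: H(P|Q) = H(U) - H(Q), U = union of the boundary sets.
    u = sorted(set(p) | set(q))
    return entropy(u) - entropy(q)

N = 8
P = [0, 1, 4, 8]             # an arbitrary segmentation with unequal segments
one = [0, N]                 # Q with a single segment
unit = list(range(N + 1))    # Q with N unit-length segments

assert cond_entropy(P, one) == entropy(P)   # single-segment Q reveals nothing
assert cond_entropy(one, P) == 0.0          # ...yet H(Q|P) = 0 for any P
assert cond_entropy(P, unit) == 0.0         # unit-segment Q reveals everything
assert abs(cond_entropy(unit, P) - (log2(N) - entropy(P))) < 1e-9
print("both degenerate cases behave as claimed")
```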

The above examples show that the similarity of two segmentations *P* and *Q* cannot be judged by the single value *H* (*P*|*Q*) or *H* (*Q*|*P*) alone. Rather, we can conclude that segmentations *P* and *Q* are similar only if both *H* (*P*|*Q*) and *H* (*Q*|*P*) are small. Even the *entropy distance* between two segmentations (see [25]), defined as *D*_{H}(*P*, *Q*) = *H* (*P*|*Q*) + *H* (*Q*|*P*), can take small values for segmentations that are quite different. We show in the experimental section that considering the two conditional entropies separately gives more accurate results than using their sum.

### Randomization techniques

Consider a segmentation algorithm that given as input sequence *S* outputs a segmentation *P*. The plethora of segmentation algorithms and segmentation criteria naturally raises the question of how good and how informative segmentation *P* is. Assume that we a priori know a ground-truth segmentation *T* of *S*. Then, we can say that segmentation *P* is good if *P* is similar to *T*. Thus, using the definitions in the Methods section, *P* is a good segmentation if *H* (*P*|*T*) and *H* (*T*|*P*) are small. However, a natural question is how small is small enough? Or, is there a threshold in the values of the conditional entropies below which we can characterize segmentation *P* as being correct or interesting? Finally, can we set this threshold universally for all segmentations? In this section we describe a set of randomization techniques that we devise in order to provide an answer to these questions.

Our generic methodology is the following. Given a segmentation *P* and a ground-truth segmentation *T* of the same sequence, we first compute *H* (*P*|*T*) and *H* (*T*|*P*). We compare the values of these conditional entropies with the values of the conditional entropies *H* (*R*|*T*) and *H* (*T*|*R*) for a random segmentation *R*. We conclude that *P* is similar to *T*, and thus interesting, if the values of *H* (*P*|*T*) (and *H* (*T*|*P*)) are small compared to the values of *H* (*R*|*T*) (and *H* (*T*|*R*)) for a large majority of random segmentations *R*.

Consider a class 𝒞 of segmentations for sequences of length *N*. Then, the randomization test is conducted as follows. Pick random segmentations *R* ∈ 𝒞. For each such *R* compute *H* (*T*|*R*), and compare *H* (*T*|*P*) against the distribution of the values *H* (*T*|*R*). Similarly, compute the values of *H* (*R*|*T*) for a large number of segmentations *R* ∈ 𝒞 and compare these values with the value of *H* (*P*|*T*). In general, if segmentations *T* and *P* have a very different number of segments, one of *H* (*T*|*R*) and *H* (*R*|*T*) will be large for any *R* from 𝒞. The randomization method we describe is best suited for the case when *T* and *P* have about the same number of segments.
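The test described above can be sketched as an empirical tail comparison: draw many random segmentations from the chosen class via a caller-supplied sampler and report the fraction that come at least as close to *T* as *P* does. The helper names and boundary-list representation are ours, not the paper's:

```python
import random
from math import log2

def entropy(bounds):
    n = bounds[-1]
    return -sum((b - a) / n * log2((b - a) / n)
                for a, b in zip(bounds, bounds[1:]))

def cond_entropy(p, q):
    # Lemma 1: H(P|Q) = H(U) - H(Q), U = union of the boundary sets.
    u = sorted(set(p) | set(q))
    return entropy(u) - entropy(q)

def randomization_test(t, p, sample, trials=1000):
    """Fraction of random segmentations R with H(T|R) <= H(T|P).

    `sample` draws one segmentation from the chosen class C; a small
    returned fraction suggests P is closer to T than chance would allow.
    """
    observed = cond_entropy(t, p)
    hits = sum(cond_entropy(t, sample()) <= observed for _ in range(trials))
    return hits / trials

# Toy usage: candidate equals the ground truth, random R has one interior boundary.
random.seed(0)
t = [0, 4, 8]
frac = randomization_test(t, t, lambda: [0, random.randint(1, 7), 8], trials=500)
print(frac)  # small: few random R match T as well as P = T does
```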

We still need to specify the class 𝒞 of segmentations from which the random segmentations are picked. We define two classes of segmentations. Intuitively, the first class is used for checking whether the candidate segmentation *P* is significantly closer to *T* than random segmentations *R* with the same number of segments as *T*. Imagine that the segmentation procedure that generated *P* has knowledge of the number of segments in *T*. By using this class we find out whether it is enough to guess a segmentation as close to *T* as *P* is by just randomly placing the correct number of segment boundaries. The other class is used similarly, for checking whether knowledge of *T*'s segment-length distribution is enough to generate segmentations as close to *T* as *P*. An analog is found in classification problems, where the true class labels in *T* are permuted to check whether the candidate classification *P* offers more insight into *T* than we would expect from guessing a random classification *R*.

In the first case, if *T* has *k* segments, then we restrict the random segmentations to those that have *k* segments as well. We denote by 𝒞_{k} the class of all segmentations with *k* segments that partition sequences of length *N*. We have |𝒞_{k}| = (*N* - 1 choose *k* - 1), since there are that many ways to choose the *k* - 1 free segment boundaries from the *N* - 1 interior points of the sequence (the first and the last boundary are always fixed). We call the randomization test in which the random segmentations are drawn as *R* ∈ 𝒞_{k} a *k-randomization test*.
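Sampling uniformly from 𝒞_{k} amounts to choosing *k* - 1 distinct interior boundaries. A minimal sketch under our boundary-list representation (function name is ours):

```python
import random

def random_k_segmentation(n, k):
    """Draw uniformly from C_k: pick k-1 distinct interior boundaries
    among the n-1 candidate positions 1..n-1 (0 and n are fixed)."""
    interior = random.sample(range(1, n), k - 1)
    return [0] + sorted(interior) + [n]

print(random_k_segmentation(10, 4))
```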

We also introduce the *ℓ-randomization test*, specified as follows. Consider a segmentation *T* with segments *T*_{1},...,*T*_{k}. Each segment *T*_{i} has length |*T*_{i}|, and these lengths define the segment-length distribution of the segmentation. There are at most *k*! segmentations that have the same segment lengths as *T* does (this maximum is attained when all segments in *T* have different lengths). For a given distribution of segment lengths *ℓ* we denote by 𝒞_{ℓ} the class of segmentations with *k* segments and lengths *ℓ*. Obviously, 𝒞_{ℓ} ⊂ 𝒞_{k} and |𝒞_{ℓ}| ≤ *k*!, since the segmentations in 𝒞_{ℓ} differ only in the order in which the segments with different lengths appear. Note that for a random segmentation *R* ∈ 𝒞_{ℓ}, the conditional entropies w.r.t. segmentation *T* are equal, i.e., *H* (*T*|*R*) = *H* (*R*|*T*). This follows from the fact that *H* (*T*) = *H* (*R*), since both segmentations contain exactly the same segment lengths. The *ℓ*-randomization restricts our attention to random segmentations whose segment-length distribution is the same as that of the ground-truth segmentation *T*. That is, the significance of *H* (*T*|*P*) and *H* (*P*|*T*) for a candidate segmentation *P* is evaluated under the assumption that the segment-length distribution is known. In the special case where all segments in *T* have the same length, the segment-length distribution uniquely characterizes the segmentations *R* ∈ 𝒞_{ℓ}, so that *R* = *T*. In this case *H* (*T*|*R*) = *H* (*R*|*T*) = 0. Moreover, any segmentation *P* that does not have equal-length segments has *H* (*P*|*T*) > 0 and *H* (*T*|*P*) > 0, and is thus considered far from the ground truth w.r.t. the *ℓ*-randomization test.
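Sampling from 𝒞_{ℓ} is a matter of shuffling the segment lengths of *T* and rebuilding the boundaries. A sketch under our boundary-list representation (function name is ours; note this samples orderings uniformly, which coincides with uniform sampling of 𝒞_{ℓ} only when the repeated lengths are treated as interchangeable):

```python
import random

def random_l_segmentation(t_bounds):
    """Draw from C_l: permute the segment lengths of T and rebuild the
    boundaries, so R has exactly T's segment-length distribution."""
    lengths = [b - a for a, b in zip(t_bounds, t_bounds[1:])]
    random.shuffle(lengths)
    bounds, pos = [0], 0
    for l in lengths:
        pos += l
        bounds.append(pos)
    return bounds

print(random_l_segmentation([0, 1, 4, 8]))
```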