A naïve approach to segment modeling simply enumerates all possible segment configurations. Every combination of segment boundaries is considered by varying the value of each boundary indicator variable *B*_{i} ∈ {0, 1}, and an error function is computed for each resulting segment set. However, this requires enumerating 2^{m} possible segment configurations, where *m* is the number of boundary variables *B*_{i}, which quickly becomes impractical as *m* grows.

To compute the optimal k-mer logistic regression model, segment boundaries must first be identified. As these are unknown, we started with an initial presumption of the methylation-susceptible and methylation-resistant segments and then used an iterative improvement procedure to search for both the segment definition and the best-fitting logistic regression model. The major steps of the segment modeling algorithm are as follows:

1. **Initialization of a configuration:** Define a boundary variable *B*_{i} = 1 at every genomic position where the labels (+ or -) of the two adjacent CpG sites around the position differ. A segment is the DNA region between two boundary variables set to 1. This yields a starting configuration of the smallest possible segments. By merging segments in many different ways and re-calculating the logistic regression model, the algorithm attempts to find the best segment configuration. This is how INITIALCONFIGURATION() is implemented in the HillClimbingConfigurationSearch in Algorithm 1.

2. **Computing a logistic regression model:** Given the k-mer occurrences and a segment configuration, compute a logistic regression model by (1). This is how COMPUTEMODEL() is implemented in the HillClimbingConfigurationSearch in Algorithm 1.

3. **Computing an error of a segment configuration:** Errors in the segment set are measured by (2).
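The initialization in step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes CpG sites are given as an ordered list of "+" / "-" labels, and it indexes boundaries by the gap between neighbouring sites rather than by genomic coordinate. The function names `initial_configuration` and `segments_from_boundaries` are hypothetical.

```python
from typing import List, Tuple


def initial_configuration(labels: List[str]) -> List[int]:
    """Return boundary indicators for the gaps between adjacent CpG sites.

    boundaries[i] == 1 means the labels of site i and site i + 1 differ,
    so a segment boundary is placed between them.
    """
    return [int(a != b) for a, b in zip(labels, labels[1:])]


def segments_from_boundaries(boundaries: List[int]) -> List[Tuple[int, int]]:
    """Convert boundary indicators into inclusive (first_site, last_site) pairs."""
    segments = []
    start = 0
    for i, b in enumerate(boundaries):
        if b == 1:
            segments.append((start, i))
            start = i + 1
    # The final segment always ends at the last site.
    segments.append((start, len(boundaries)))
    return segments


# Hypothetical labels for seven CpG sites.
labels = ["+", "+", "-", "-", "-", "+", "-"]
boundaries = initial_configuration(labels)
segments = segments_from_boundaries(boundaries)
print(boundaries)  # [0, 1, 0, 0, 1, 1]
print(segments)    # [(0, 1), (2, 4), (5, 5), (6, 6)]
```

Each run of identically labelled sites becomes one segment, which is the smallest-segment starting point that the later merging steps coarsen.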

#### The random binary segment merging algorithm

Given the current segment configuration {*B*_{i}}, a segment is chosen at random using a distribution derived from per-segment errors measured by a weighted square error. For a segment *B*_{j}, the weighted square error *e*_{j} is computed from the weight of the segment, the predicted methylation level of segment *j*, and the actual methylation level *t*_{j} of segment *j*. A segment is then drawn by random sampling using the segment error vector < *e*_{1}, . . . , *e*_{n} >, where *n* is the number of segments in the current segment configuration. Sampling from this vector favors segments with a higher prediction error while still keeping the choice random. Note that segments that have already been considered for merging are excluded from the next round of sampling (see the use of visit[] in the HillClimbingConfigurationSearch in Algorithm 1).
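The error-guided sampling described above could look roughly like the following Python sketch. It assumes the per-segment errors have already been computed and are passed in as a plain list. The name `select_at_random` mirrors the pseudocode, but its signature here and the uniform fallback for all-zero errors are assumptions made for illustration.

```python
import random
from typing import List, Optional


def select_at_random(errors: List[float], visit: List[bool],
                     rng: Optional[random.Random] = None) -> int:
    """Pick an unvisited segment index with probability proportional to its error."""
    rng = rng or random.Random()
    candidates = [i for i, seen in enumerate(visit) if not seen]
    if not candidates:
        raise ValueError("all segments have already been considered")
    weights = [errors[i] for i in candidates]
    if sum(weights) <= 0:
        # Assumption: fall back to uniform sampling when every remaining error is zero.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical error vector; segment 1 has already been considered.
errors = [0.02, 0.40, 0.10, 0.0]
visit = [False, True, False, False]
rng = random.Random(0)
draws = [select_at_random(errors, visit, rng) for _ in range(1000)]
print(sorted(set(draws)))
```

Segment 1 is never returned because it is marked as visited, and among the remaining segments the one with the larger error is drawn more often.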

Once a segment *B*_{j} is chosen, it is tentatively merged with the adjacent segment *B*_{j+1}. A logistic regression model is then re-calculated. The merge is accepted only if it reduces the weighted squared error (equation 2); otherwise the merge is rejected and the original segment configuration is retained. A segment *B*_{j} that has been considered for merging is marked so that it will not be chosen again in subsequent steps. This sampling and marking of segments is repeated until all segments in the current configuration have been considered for merging.
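A single tentative merge with its accept/reject test can be sketched as follows. This is an illustration under stated assumptions: segments are inclusive `(first_site, last_site)` pairs, and `refit_and_score` is a hypothetical callable standing in for re-fitting the logistic regression model and evaluating equation 2. The toy score used in the example is not the paper's error function.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int]   # inclusive (first_site, last_site)
Config = List[Segment]


def merge_adjacent(config: Config, j: int) -> Config:
    """Return a copy of config with segments j and j + 1 fused into one."""
    merged = (config[j][0], config[j + 1][1])
    return config[:j] + [merged] + config[j + 2:]


def try_merge(config: Config, j: int, error: float,
              refit_and_score: Callable[[Config], float]) -> Tuple[Config, float, bool]:
    """Tentatively merge segment j with j + 1; keep the merge only if the error drops."""
    if j + 1 >= len(config):
        return config, error, False          # the last segment has no right neighbour
    candidate = merge_adjacent(config, j)
    candidate_error = refit_and_score(candidate)
    if candidate_error < error:
        return candidate, candidate_error, True
    return config, error, False


# Toy stand-in for "refit the model, then compute the error" (an assumption):
# a fixed cost per segment plus the within-segment squared deviation.
levels = [0.9, 0.8, 0.85, 0.1, 0.2]


def toy_score(config: Config) -> float:
    total = 0.05 * len(config)
    for first, last in config:
        values = levels[first:last + 1]
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


config = [(0, 1), (2, 2), (3, 4)]
error = toy_score(config)
merged_0 = try_merge(config, 0, error, toy_score)   # similar levels: accepted
merged_1 = try_merge(config, 1, error, toy_score)   # dissimilar levels: rejected
print(merged_0[0], merged_0[2])
print(merged_1[0], merged_1[2])
```

Returning a new configuration instead of mutating the old one makes rejection trivial, since the original configuration is simply kept.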

**Input**: A set of pre-selected k-mers *K* = {*x*_{i}}; occurrences of *K*; methylation levels at CpG sites

**Output**: A logistic regression model; A segment configuration.

**HillClimbingConfigurationSearch**(*N*)

**begin**

(*C**, *M**, *E**) = RandomConfigurationSearch()

**for** *i* ← 2 **to** *N* **do**

(*C*, *M*, *E*) = RandomConfigurationSearch()

**if** *E* < *E** **then**

*C** = *C*; *M** = *M*; *E** = *E*

**end**

**end**

**report** (*C**, *M**, *E**)

**end**

**RandomConfigurationSearch**()

**begin**

*C* = InitialConfiguration(); *E* = 1.0 //Reset configuration; see text.

**while** *true* **do**

(*C*', *M*', *E*') = RandomBinaryMerging(*C*)

**if** (*E* - *E*') ≤ *δ* **then break**

*C* = *C*'; *M* = *M*'; *E* = *E*'

**end**

**return** (*C*, *M*, *E*)

**end**

**RandomBinaryMerging**(**configuration** *C*)

**begin**

*M* = computeModel(*C*, *K*) //Equation 1; training stage only.

*E* = computeError(*C*, *M*) //Equation 2.

**bool** *visit*[*n*] = {**false**} //Mark that no segments are considered.

**while** ∃*i* such that *visit*[*i*] == **false** **do**

*j* = selectAtRandom(*visit*) //See text.

*visit*[*j*] = **true** //*s*_{j} is the merge candidate.

*C*' = *C*

Set the boundary between *s*_{j} and *s*_{j+1} in *C*' to **false** //Merge *s*_{j} and *s*_{j+1}.

*M*' = computeModel(*C*', *K*) //Equation 1; training stage only.

*E*' = computeError(*C*', *M*') //Equation 2.

**if** *E*' < *E* **then**

*C* = *C*'; *M* = *M*'; *E* = *E*'; *visit*[*j* + 1] = **true** //Accept *C*'.

**else**

Restore the boundary between *s*_{j} and *s*_{j+1} in *C*' to **true** //Reject *C*'.

**end**

**end**

**return** (*C*, *M*, *E*)

**end**

**Algorithm 1:** Hill climbing configuration search algorithm. The algorithm tries to merge two adjacent segments chosen at random until all segments have been considered for merging. A new configuration is accepted only when the re-fitted logistic regression model reduces the error; hence it is a hill climbing algorithm.
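To show how the three procedures of Algorithm 1 fit together, here is a compact Python sketch. It is a simplified illustration under several assumptions rather than the authors' code. The hypothetical `score` callable returns one error per segment and stands in for both computeModel and computeError, so no logistic regression is actually fitted. Segments are tracked by their first site instead of a fixed `visit[]` array. The inner search starts from the actual error of the initial configuration rather than from 1.0.

```python
import random
from typing import Callable, List, Tuple

Config = List[Tuple[int, int]]            # inclusive (first_site, last_site) pairs
Score = Callable[[Config], List[float]]   # hypothetical: one error per segment


def random_binary_merging(config: Config, score: Score,
                          rng: random.Random) -> Tuple[Config, float]:
    """One pass in which every segment is offered once for a merge to its right."""
    config, errors, visited = list(config), score(config), set()
    while True:
        candidates = [j for j, seg in enumerate(config) if seg[0] not in visited]
        if not candidates:
            return config, sum(errors)
        weights = [errors[j] for j in candidates]
        if sum(weights) > 0:
            j = rng.choices(candidates, weights=weights, k=1)[0]
        else:
            j = rng.choice(candidates)    # assumption: uniform when all errors are zero
        visited.add(config[j][0])
        if j + 1 == len(config):
            continue                      # the last segment has no right neighbour
        trial = config[:j] + [(config[j][0], config[j + 1][1])] + config[j + 2:]
        trial_errors = score(trial)
        if sum(trial_errors) < sum(errors):   # accept only if the error is reduced
            visited.add(config[j + 1][0])
            config, errors = trial, trial_errors


def random_configuration_search(initial: Config, score: Score, delta: float,
                                rng: random.Random) -> Tuple[Config, float]:
    """Repeat merging passes until the improvement is at most delta."""
    config, error = list(initial), sum(score(initial))
    while True:
        new_config, new_error = random_binary_merging(config, score, rng)
        if error - new_error <= delta:
            return config, error
        config, error = new_config, new_error


def hill_climbing_search(initial: Config, score: Score, n_restarts: int = 5,
                         delta: float = 1e-6, seed: int = 0) -> Tuple[Config, float]:
    """Run the randomized search n_restarts times and keep the best result."""
    rng = random.Random(seed)
    best = random_configuration_search(initial, score, delta, rng)
    for _ in range(n_restarts - 1):
        result = random_configuration_search(initial, score, delta, rng)
        if result[1] < best[1]:
            best = result
    return best


# Toy per-segment score (an assumption, not equation 2 of the paper).
levels = [0.9, 0.8, 0.85, 0.1, 0.2, 0.15, 0.9]


def toy_score(config: Config) -> List[float]:
    out = []
    for first, last in config:
        values = levels[first:last + 1]
        mean = sum(values) / len(values)
        out.append(0.05 + sum((v - mean) ** 2 for v in values))
    return out


initial = [(i, i) for i in range(len(levels))]
best_config, best_error = hill_climbing_search(initial, toy_score)
print(best_config, round(best_error, 4))
```

The restarts matter because each pass depends on the random order in which segments are offered for merging, so different runs can end in different local optima when a real model is used in place of the toy score.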