 Proceedings
 Open access
A novel k-mer mixture logistic regression for methylation susceptibility modeling of CpG dinucleotides in human gene promoters
BMC Bioinformatics volume 13, Article number: S15 (2012)
Abstract
Background
DNA methylation is essential for normal development and differentiation and plays a crucial role in the development of nearly all types of cancer. Aberrant DNA methylation patterns, including genome-wide hypomethylation and region-specific hypermethylation, are frequently observed and contribute to the malignant phenotype. A number of studies have recently identified distinct features of genomic sequences that can be used for modeling specific DNA sequences that may be susceptible to aberrant CpG methylation in both cancer and normal cells. Although it is now possible, using next-generation sequencing technologies, to assess human methylomes at base resolution, no reports currently exist on modeling cell type-specific DNA methylation susceptibility. Thus, we conducted a comprehensive modeling study of cell type-specific DNA methylation susceptibility at three different resolutions: CpG dinucleotides, CpG segments, and individual gene promoter regions.
Results
Using a k-mer mixture logistic regression model, we effectively modeled DNA methylation susceptibility across five different cell types. Further, at the segment level, we achieved an AUC of up to 0.75 in a 10-fold cross-validation study using a mixture of k-mers.
Conclusions
The significance of these results is threefold: 1) this is the first report to indicate that CpG methylation susceptible "segments" exist; 2) our model demonstrates the significance of certain k-mers for the mixture model, potentially highlighting DNA sequence features (k-mers) of differentially methylated promoter CpG island sequences across different tissue types; 3) as only 3 or 4 bp patterns had previously been used for modeling DNA methylation susceptibility, ours is the first demonstration that 6-mer modeling can be performed without loss of accuracy.
Background
DNA methylation is the chemical modification of DNA bases, mostly on cytosines that precede a guanosine in the DNA sequence, i.e., the CpG dinucleotides. This epigenetic modification involves the addition of a methyl group to the number 5 carbon of the cytosine pyrimidine ring. DNA methylation is essential for cellular growth, development and differentiation [1], playing a fundamental role in the activation of genes at the transcriptional level. In cancer cells, aberrant DNA methylation patterns, such as genome-wide hypomethylation and region-specific hypermethylation, are frequently observed [2]. CpG islands, short CpG-rich regions of DNA often located around gene promoters and normally protected from DNA methylation, become hypermethylated in cancer, contributing to transcriptional silencing [3, 4]. As CpG island methylation patterns have been shown to differ across cancer types, recent studies have revealed that some CpG islands are "methylation sensitive", while others are "resistant" to DNA methylation [5]. Recent technological breakthroughs make it possible, for the first time, to measure human methylomes at base resolution [6], providing unprecedented opportunities for understanding the phenomenon of methylation susceptibility.
Previous work
Several recent studies have attempted to predict CpG island methylation patterns in normal and cancer cells. DNA pattern recognition and supervised learning techniques were used by Feltus et al. to discriminate methylation-prone (MP) and methylation-resistant (MR) CpG islands based on seven DNA sequence patterns [7]. McCabe et al. then developed a classifier (PatMAn) based on the frequencies of those seven patterns in cancer [8], followed by "SUPER-PatMAn" for predicting methylation susceptible CpG islands using both local sequence context and trans-acting factors such as SUZ12 [9]. In addition, Feltus et al. used motifs related to 28 MP and MR CpG islands to predict DNA methylation susceptibility [10], and Keshet et al. showed evidence of instructive mechanisms in cancer cells, finding common sequence motifs in the regions of promoters whose genes show tumor-specific "methylation susceptibility" [11]. A prediction method for finding a minority class in an imbalanced data setting (which is the case for DNA methylation data), called "cluster_boost", was recently developed by Goh et al. and used to identify novel hypermethylated genes in cancer [12]. Fang et al. developed "MethCGI" to predict the methylation status of CpG islands using a support vector machine and both local sequence context and transcription factor binding sites [13]. Finally, a prediction method using DNA sequence features of various types, including sequence, repeats, predicted structure, CpG islands, and genes, was developed by Bock et al. to predict binding sites, conservation, and single nucleotide polymorphisms [14].
While the focus of the above studies was on CpG island methylation susceptibility, recent experiments have convincingly demonstrated that methylation levels of CpG sites, i.e., the genomic locations of CpG dinucleotides, within a CpG island can be highly variable. For example, Handa et al. found that certain sequence features flanking CpG sites were associated with high- and low-methylation CpG sites in an in vitro DNMT1 overexpression model [15]. Moreover, at single base pair resolution, Zhang et al. demonstrated that DNA methylation levels frequently differ within a CpG island [16]. To investigate the role of DNA methylation during development in human embryonic stem cells, Brunner et al. developed Methyl-seq, which assays DNA methylation at more than 90,000 regions throughout the genome [17]. Using bisulfite sequencing data, Lister et al. determined the first genome-wide, single-base-resolution maps of methylated cytosines in mammalian genomes (human embryonic stem cells (ESC) and fetal fibroblasts) [18]. By using "ultra-deep" sequencing data from Taylor et al. [19], we demonstrated that CpG flanking sequences can be used to model methylation susceptible CpG sites [20]. Finally, Previti et al. analyzed tissue-specific CpG island methylation status in terms of profiles created by probabilistically combining two sources of independent clusters (clusters from methylation data in 12 tissues and clusters from CGI attributes), and demonstrated the predictive power of their method with a decision tree classifier [21]. Those investigators categorized profiles into four classes: constitutively unmethylated, constitutively methylated, unmethylated in sperm, and differentially methylated [21].
Motivation
Previous CpG island methylation susceptibility prediction studies have not considered cell type-specific methylation status. Considering the variations in DNA methylation level even in the same genomic regions of different types of cells, we asked the question: can cell type-specific DNA methylation susceptibility be modeled? The significance of exploring this question is based on evidence supporting the strong association of genomic sequence features with DNA methylation status. Furthermore, recent studies strongly indicate the existence of methylation sensitive/resistant CpG islands in different cancer types [5]. In this paper, we performed a comprehensive DNA methylation susceptibility modeling study in five different cell lines at three different levels: CpG sites, entire promoter regions, and short DNA segments. We focused on DNA methylation in the context of CpG dinucleotides in adult cells (we are aware of a recent study [18] reporting non-CpG methylation in ESC).
Methods
The problem: methylation susceptible DNA segment modeling problem
The need for segment modeling
Bisulfite sequencing data clearly demonstrate that methylation levels, even within a single gene promoter, can be highly variable. Furthermore, a figure in Additional file 1 shows highly variable methylation of the same promoter sequence in five different cell lines, i.e., cell type-specific DNA methylation susceptibility (bisulfite sequencing data obtained from [16]).
Definition of the problem
The following notation is used to formally define the problem. Let x = {x_{ i }} be a small set of preselected k-mers, where a k-mer is a sequence of a fixed number k of DNA bases. Labels t = {t_{ j }} on the data are assigned as + or − depending on the methylation level p_{ j } of each sample.
For each cell type, a k-mer mixture logistic regression model (equation 1) was built using a small set of preselected patterns, i.e., k-mers. To select the best logistic model, the methylation predicted at a CpG site (based on the logistic model under consideration) was compared with the actual CpG methylation obtained from the bisulfite sequencing data. To make the comparison, we calibrated the predicted methylation level between 0 and 1 using the logistic function:
\hat{y}\left(\mathbf{x}\right)=\frac{1}{1+{e}^{-f\left(\mathbf{x}\right)}} (1)
where f\left(\mathbf{x}\right)={\sum}_{i}{\beta}_{i}{x}_{i} and the β_{ i }'s are parameters to be learned for the machine learning predictor.
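As a minimal sketch of how such a model turns k-mer occurrences into a methylation level between 0 and 1, the following Python snippet counts k-mers in a sequence and applies the logistic function to f(x). The k-mers, coefficients, and sequence are hypothetical values chosen only for illustration; they are not taken from the study, and no intercept term is used because the definition of f(x) above has none.

```python
import math

def kmer_counts(seq, kmers):
    """Count (possibly overlapping) occurrences of each k-mer in seq."""
    return [
        sum(1 for i in range(len(seq) - len(k) + 1) if seq[i:i + len(k)] == k)
        for k in kmers
    ]

def predict_methylation(x, beta):
    """Logistic response 1 / (1 + exp(-f(x))) with f(x) = sum_i beta_i * x_i."""
    f = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-f))

# Hypothetical k-mers and coefficients (assumptions, not values from the paper).
kmers = ["CGCG", "TATA"]
beta = [0.8, -0.5]
x = kmer_counts("ACGCGCGTATAT", kmers)  # occurrence vector for one sequence
p = predict_methylation(x, beta)        # predicted methylation level in (0, 1)
```

In practice the coefficients would be fitted on the training data only, as described in the sections that follow.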
The kmer mixture modeling problem
Our goal was to test whether methylation susceptibility can be modeled by a logistic regression model using a small set of k-mers. Although using k-mers for DNA methylation modeling is not entirely new, to our knowledge, only short k-mers (3 or 4 bp in length) were used in previous studies [14]. As short k-mers can occur in almost every DNA sequence, modeling using 3 or 4 bp patterns relies on k-mer frequency.

1. First, we attempted to use longer k-mers (up to 6 bp) to utilize those that only occur in methylation susceptible sequences (vs. frequency for short k-mers, described above).

2. Our goal of determining whether machine learning predictors can be built by using k-mers required that we address two important issues: overfitting and generalizability of prediction beyond the test data. The overfitting problem was addressed by selecting a small number of k-mers from the training data set (using a larger number of k-mers can easily overfit the training data). Cross-validation was used to test the generalizability of the predictive power. We selected k-mers and built machine learning predictors by using only the training data set. We then assessed the predictor on the test data set, which was not used for either selecting k-mer features or building predictors.
Two k-mer feature selection methods
We used a selected set of k-mers for DNA methylation susceptibility modeling in the different cell types. The research question explored in this paper is the feasibility of modeling methylation susceptible segments given a set of k-mers. As selection of the "best set" of k-mers for modeling was not explored (a solution to the combined problem was too difficult), we used two standard pattern selection methods for a two-class data set.

1. Feature selection with t-test: A popular t-test method was used to select k-mers because of its simplicity and applicability to all modeling approaches. For each attribute a, occurrences of a were counted in positive samples and negative samples. Then, the P-value of a was measured by t-test. A fixed number of patterns was selected from a list of k-mers ordered by P-value. Alternatively, patterns with a P-value below a threshold were selected.

2. Feature selection with the random forest technique: The RF algorithm [22] can be used for feature selection. The usefulness of the RF-based feature selection method was clearly demonstrated by Yi-Wei Chen and Chih-Jen Lin at the NIPS 2003 feature selection challenge [23]. We used an extended version of the RF-based feature selection method. Multiple rounds of the RF-based feature selection were performed using a balanced data set of methylation-susceptible and non-susceptible sequences. We performed k RF runs, where each RF run used n random trees; only the top N attributes with z-scores > 0 were collected. After k RF runs, the subset of attributes that had appeared in at least p% of the runs was selected. The values were set to k = 30, n = 100, N = 100, and p = 90 for the k-mer feature selection.
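The t-test selection in item 1 can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: it computes a pooled two-sample (Student) t statistic by hand and ranks k-mers by its absolute value. When every k-mer is tested on the same numbers of positive and negative samples, the degrees of freedom are identical, so ordering by |t| is equivalent to ordering by two-sided P-value. The function names and the toy counts are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(pos, neg):
    """Student's two-sample t statistic with pooled variance."""
    n1, n2 = len(pos), len(neg)
    sp2 = ((n1 - 1) * variance(pos) + (n2 - 1) * variance(neg)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return (mean(pos) - mean(neg)) / se if se > 0 else 0.0

def rank_kmers(counts_pos, counts_neg, top_n):
    """Return the top_n k-mers ordered by decreasing |t|.

    counts_pos / counts_neg map each k-mer to its per-sample occurrence
    counts in the positive / negative class.
    """
    scores = {k: abs(pooled_t(counts_pos[k], counts_neg[k])) for k in counts_pos}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy occurrence counts (hypothetical data for illustration only).
counts_pos = {"CGCG": [3, 4, 5, 4], "ATAT": [1, 2, 1, 2]}
counts_neg = {"CGCG": [0, 1, 0, 1], "ATAT": [1, 2, 2, 1]}
selected = rank_kmers(counts_pos, counts_neg, top_n=1)
```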
In both methods, we extracted a set of patterns from the balanced data set. First, centered at each CpG site, we extracted a flanking sequence of length l, where we set l = 100. A label of + or − was assigned to each CpG site depending on its methylation level. Then, we balanced the data so that the + and − classes had equal numbers of samples. The set of all k-mers obtained from sliding windows over each sequence was used for k-mer feature selection.
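The extraction step can be illustrated with a short Python sketch. It assumes that the window is centered between the C and the G of each CpG and that windows near the ends of the sequence are simply truncated; neither detail is specified in the text, and the function names are hypothetical.

```python
def cpg_flanks(seq, flank_len=100):
    """Yield (position, window) for every CpG in seq.

    The window has length flank_len and is centered between the C and the G;
    it is truncated at the sequence ends (an assumption of this sketch).
    """
    half = flank_len // 2
    for pos in range(len(seq) - 1):
        if seq[pos:pos + 2] == "CG":
            start = max(0, pos + 1 - half)
            yield pos, seq[start:pos + 1 + half]

def sliding_kmers(seq, k):
    """All k-mers obtained by sliding a window of width k over seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Tiny example with a 4 bp window instead of l = 100.
flanks = list(cpg_flanks("AACGTT", flank_len=4))
kmers = sliding_kmers(flanks[0][1], 2)
```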
Modeling methylation levels of DNA segments
Definition. A boundary variable B_{ i } at a genomic sequence position is an indicator variable that is defined where two adjacent CpG sites have different labels. A value of 1 for B_{ i } denotes that the genomic position is a boundary, and a value of 0 denotes that the position is not a boundary. A DNA segment S is defined by two boundary variables B_{ a } and B_{ z }, where B_{ a } = 1, B_{ z } = 1, and B_{ i } = 0 for all a < i < z. Figure 1 illustrates how boundary variables are used to define 10 segments. We call a set of DNA segments defined by the boundary variables a configuration.
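To make the definition concrete, the sketch below derives boundary variables and the resulting segments from a list of per-site labels. As a simplifying assumption, boundaries are indexed by the gap between consecutive CpG sites rather than by genomic coordinate, and the function names are hypothetical.

```python
def boundaries(labels):
    """B[i] = 1 if CpG sites i and i+1 carry different labels, else 0."""
    return [int(a != b) for a, b in zip(labels, labels[1:])]

def segments(B):
    """Group site indices 0..len(B) into segments, splitting where B[i] == 1."""
    segs, current = [], [0]
    for i, b in enumerate(B):
        if b:
            segs.append(current)
            current = []
        current.append(i + 1)
    segs.append(current)
    return segs

labels = ["+", "+", "-", "+", "+"]
B = boundaries(labels)      # one indicator per gap between adjacent sites
config = segments(B)        # the initial (finest) configuration
```

Setting one entry of B from 1 to 0 and calling segments again merges the two neighbouring segments, which is the basic move used by the merging algorithm described later.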
Labeling data
Given a segment S_{ i }, the methylation probability p_{ i } of the segment was defined as the ratio of the number of CpG sites with the + label to the total number of CpG sites in the segment. The label t_{ i } of S_{ i } was assigned + if p_{ i } was greater than 0.5; otherwise, t_{ i } was assigned −.
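The labeling rule is simple enough to state directly in code. The following sketch (with a hypothetical function name) returns both the methylation probability and the label of a segment from the labels of its CpG sites.

```python
def segment_label(site_labels):
    """Return (p, label): the fraction of '+' sites and the segment label."""
    p = sum(1 for s in site_labels if s == "+") / len(site_labels)
    return p, ("+" if p > 0.5 else "-")

p_major, t_major = segment_label(["+", "+", "-"])  # majority of sites methylated
p_tie, t_tie = segment_label(["+", "-"])           # exactly half: not greater than 0.5
```

Note that a segment with exactly half of its sites labeled + receives the − label, because the rule requires p to be strictly greater than 0.5.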
Attributes for modeling
K-mer occurrences in segments in the training data set were used as attributes. A small subset of k-mer features x was selected from all k-mers using the feature selection methods.
Modeling
A single logistic regression model was used to model all DNA segments for each cell line, using attributes x and labels t.
Segment-level modeling challenges: exponential search space
Although the methylation status of a DNA segment is defined by an aggregation of the methylation status of individual promoter regions (as we did for the whole promoter region-modeling approach), how to define methylation susceptible DNA segments is currently unknown. For example, consider a short DNA segment with five CpG sites {s 1, s 2, s 3, s 4, s 5} and assume that three sites, s 1, s 2, s 4, are methylation susceptible and the other two sites, s 3, s 5, are methylation resistant. By definition, the DNA segment is methylation susceptible, as the majority of sites (three) are methylation susceptible. However, if we divide the segment into two subsegments, {s 1, s 2} and {s 3, s 4, s 5}, there will be one segment that is susceptible to methylation and one that is resistant. To determine which segment definition can be better modeled for methylation susceptibility, all possible segment configurations must be enumerated and, for each configuration, a "best fit" logistic model must be computed for the methylation data in a cell line. The complexity of this problem can be discussed in terms of the well-known "counting the number of parenthesizations" problem [24], because a pair of parentheses can define a segment of CpG sites. The number of parenthesizations P (n) for n CpG sites is P (1) = 1; P\left(n\right)={\sum}_{i=1}^{n-1}P\left(i\right)P\left(n-i\right) for n ≥ 2. Because P (n) is known to be Ω (2^{n}) [24], an optimal solution is unlikely to be found by an exhaustive search algorithm. Thus, we developed a heuristic algorithm that uses random segment merging, starting from the finest definition of segments.
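The growth of P(n) can be checked numerically with a direct, memoized implementation of the recurrence. This is a small illustrative sketch; the function name is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def parenthesizations(n):
    """P(1) = 1; P(n) = sum_{i=1}^{n-1} P(i) * P(n - i) for n >= 2."""
    if n == 1:
        return 1
    return sum(parenthesizations(i) * parenthesizations(n - i) for i in range(1, n))

first_values = [parenthesizations(n) for n in range(1, 6)]  # P(1) .. P(5)
```

The values are the Catalan numbers shifted by one index, and they exceed 2^n for sufficiently large n, which is why exhaustive enumeration quickly becomes impractical.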
A random binary segment merging algorithm
A naive approach to segment modeling simply enumerates all possible segment configurations. Every combination of segment boundaries is considered by changing the values of the boundary indicator variables B_{ i } ∈ {0, 1}, and an error function is computed for each segment configuration. However, this requires the enumeration of 2^{m} possible segment configurations, where m is the number of boundary variables B_{ i }. To compute the optimal k-mer logistic regression model, segment boundaries must first be identified; however, as these are unknown, we started with an initial presumption of the methylation susceptible and resistant segments. We then used an iterative improvement procedure in search of both the segment definition and the best fitting logistic regression model. The major steps of the segment modeling algorithm are as follows:

1. Initialization of a configuration: Define a boundary variable B_{ i } = 1 at every genomic position where the labels (+ or −) of the two adjacent CpG sites around the position are different. Define a segment as a DNA region between two boundary variables set to 1. By taking this approach, we start with a configuration of the smallest possible segments. By merging segments in many different ways and recalculating the logistic regression model, the algorithm attempts to find the best segment configuration. This is how InitialConfiguration() is implemented in the HillClimbingConfigurationSearch in Algorithm 1.

2. Computing a logistic regression model: Given the k-mer occurrences and a segment configuration, compute a logistic regression model using equation (1). This is how computeModel() is implemented in the HillClimbingConfigurationSearch in Algorithm 1.

3. Computing an error of a segment configuration: Errors in the segment set \mathcal{S} are measured by (2).
O\left(\mathcal{S}\right)=\sum _{i=1}^{\left|\mathcal{S}\right|}{w}_{i}{\left({\hat{y}}_{i}-{t}_{i}\right)}^{2} (2)
where \left|\mathcal{S}\right| is the total number of segments, {\hat{y}}_{i} is the predicted methylation level of segment i, t_{ i } is the actual methylation level of segment i, and w_{ i } is the weight of each segment. The segment weight is defined as {w}_{i}=\overline{\left|S\right|}/\left|{S}_{i}\right|, where \overline{\left|S\right|} is the average count of CpG sites over all segments and \left|{S}_{i}\right| is the count of CpG sites in segment i. The weight w_{ i } of each segment is thus inversely proportional to its size relative to the average segment size. In this way, large segments are penalized less, and vice versa. This is how computeError() is implemented in the HillClimbingConfigurationSearch in Algorithm 1.
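The error of a configuration in equation (2) can be computed in a few lines. The sketch below takes the number of CpG sites per segment together with predicted and actual methylation levels; the function name and the example numbers are hypothetical.

```python
def configuration_error(seg_sizes, y_hat, t):
    """Weighted squared error O(S) = sum_i w_i * (y_hat_i - t_i)^2,
    with w_i = (mean segment size) / (size of segment i)."""
    mean_size = sum(seg_sizes) / len(seg_sizes)
    return sum(
        (mean_size / size) * (yh - ti) ** 2
        for size, yh, ti in zip(seg_sizes, y_hat, t)
    )

# Two segments with 2 and 4 CpG sites: the weights are 1.5 and 0.75.
err = configuration_error([2, 4], y_hat=[0.9, 0.2], t=[1.0, 0.0])
```

In this example the larger segment has the larger squared deviation (0.04 versus 0.01) but the smaller weight, which illustrates how large segments are penalized less.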
The random binary segment merging algorithm
Given the current segment configuration {B_{ i }}, a segment is randomly chosen using a distribution of errors measured by a weighted square error. For a segment S_{ j }, the weighted square error is defined by {e}_{j}={\beta}_{j}{\left({\hat{y}}_{j}-{t}_{j}\right)}^{2}, where the weight of the segment is {\beta}_{j}=\left|{S}_{j}\right|/\overline{\left|S\right|}, {\hat{y}}_{j} is the predicted methylation level of segment j, and t_{ j } is the actual methylation level of segment j. A segment is chosen by random sampling using the segment error vector <e_{1}, . . . , e_{ n }>, where n is the number of segments in the current segment configuration. Sampling from this error vector favors segments with a higher prediction error while still ensuring that the choice is random. Note that segments that have already been considered for merging are excluded from the next round of sampling (see the use of visit[] in the HillClimbingConfigurationSearch in Algorithm 1).
Once a segment S_{ j } is chosen, it is tentatively merged with the adjacent segment S_{ j+1 }. Then the logistic regression model is recalculated. The merge is accepted only if it reduces the weighted squared error (equation 2). Otherwise, the merge is rejected and the original segment configuration is retained. A segment S_{ j } considered for merging is marked so that it will not be chosen again in a later step. This sampling and marking of segments is repeated until all segments in the current configuration have been considered for merging.
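The error-guided sampling step can be sketched as below. This is an illustrative reading of the description, not the original implementation: unvisited segments are drawn with probability proportional to their weighted error, and a uniform draw is used as a fallback when all remaining errors are zero (an assumption made here so that the draw is always defined). The function name is hypothetical, and the caller is expected to stop once every segment is marked as visited.

```python
import random

def select_segment(errors, visited, rng=random):
    """Pick an unvisited segment index with probability proportional to its error."""
    candidates = [j for j, seen in enumerate(visited) if not seen]
    weights = [errors[j] for j in candidates]
    if sum(weights) == 0:
        # Fallback (assumption): uniform choice when no candidate has any error.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded generator so the example is reproducible
j = select_segment([0.5, 0.0, 0.2], [False, False, True], rng)
```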
Input : A set of preselected k-mers K = {x_{ i }}; occurrences of K; methylation levels at CpG sites
Output: A logistic regression model; a segment configuration.
HillClimbingConfigurationSearch(N)
begin
    (C*, M*, E*) = RandomConfigurationSearch()
    for i ← 2 to N do
        (C, M, E) = RandomConfigurationSearch()
        if E < E* then
            C* = C; M* = M; E* = E
        end
    end
    report (C*, M*, E*)
end
RandomConfigurationSearch()
begin
    C = InitialConfiguration(); E = 1.0 // Reset configuration; see text.
    while true do
        (C', M', E') = RandomBinaryMerging(C)
        if (E − E') ≤ δ then break
        C = C'; M = M'; E = E'
    end
    return (C, M, E)
end
RandomBinaryMerging(configuration C)
begin
    M = computeModel(C, K) // Equation 1; training stage only
    E = computeError(C, M) // Equation 2
    bool visit[n] = {false} // No segment has been considered yet.
    while ∃i such that visit[i] == false do
        j = selectAtRandom(visit) // See text.
        visit[j] = true // s_{ j } is the merge candidate.
        C' = C
        {B}_{i}^{{C}^{\prime}} = false // Merge s_{ j } and s_{ j+1 } by removing the boundary i between them.
        M' = computeModel(C', K) // Equation 1; training stage only
        E' = computeError(C', M') // Equation 2
        if E' < E then
            C = C'; M = M'; E = E'; visit[j + 1] = true // Accept C'.
        else
            {B}_{i}^{{C}^{\prime}} = true // Reject C'.
        end
    end
    return (C, M, E)
end
Algorithm 1: Hill climbing configuration search algorithm. The algorithm tries to merge two segments at random until all segments have been considered for merging. A new configuration is accepted only when the error is reduced with a new logistic regression model; thus, it is a hill climbing algorithm.
Results
Data set
We used data from Zhang et al. [16] for DNA methylation patterns in chromosome 21 (297 amplicons from 190 gene promoters, using bisulfite conversion, subcloning and sequencing of DNA as the major experimental methods). The bisulfite sequencing data were collected in five cell types: human peripheral blood (primarily leukocytes), fibroblasts, the human embryonic kidney cell line HEK293, the human hepatocellular liver carcinoma cell line HepG2, and fibroblast cells derived from a patient with Down syndrome (trisomy 21). Methylation patterns differed widely and were specific to each cell type.
Experimental setup
10-fold cross-validation (described above) was used to compare the performance of the three modeling approaches. For each round of 10-fold validation, one of the 10 subsets was set aside for testing, and the k-mer features were selected only from the training set, ensuring that the test data would have no influence on the k-mer feature selection. Also, regression coefficients were computed only in the training stage. We measured the area under the ROC curve (AUC) score for performance comparison.
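The evaluation protocol can be sketched with two small helpers: a fold generator and a rank-based AUC. Both are illustrative assumptions rather than the code used in the study. The fold generator uses a deterministic interleaved split without shuffling, and the AUC is computed as the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k interleaved folds over n samples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores (labels are 1/0)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

splits = list(k_fold_indices(20, k=10))
score = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Feature selection and coefficient fitting would be run inside the loop on the train indices only, with the AUC computed on the held-out test indices.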
Effectiveness of the segment modeling approach
We extensively tested the effectiveness of the segment modeling algorithm using 4-mer, 5-mer, and 6-mer patterns. For each of the experiments, the AUC score was measured from 10-fold cross-validation for the initial segment definition vs. the final segment definition. The RF-based algorithm with 100 trees was used for k-mer feature selection. For each k-mer selection procedure, 30 random experiments were performed, and k-mers with z-score > 0 that appeared in at least 90% of experiments were selected as k-mer features. Using the set of k-mers, the optimal logistic regression model was computed.
10-fold cross-validation experiments
The performance comparison between the initial segments and the final segments in the test set is shown in Figure 2. Bars between adjacent dotted lines show the improvement in AUC score between the prediction results of the two models, one with the initial segment setting and one with the final segment setting. We measured the performance improvement using 4-mer, 5-mer, and 6-mer features. For each cell type, the segment modeling algorithm identified significantly improved segment definitions. Five panels in each plot correspond to tissue types: (A) Fibroblast, (B) HEK293, (C) HepG2, (D) Leukocytes, and (E) Trisom 21. Our algorithm achieved approximately 10% improvement in most cell types, illustrating the effectiveness of the segment modeling algorithm.
Search behavior
The search behavior of the segment modeling algorithm is shown in Figure 3. In this experiment, we used the whole data set to show the algorithmic convergence of our approach. The learning error (equation 2) was reduced at each iteration of segment merging and model recalculation. Our random segment sampling algorithm converged in all 15 cases, covering 5 different cell lines with 4-, 5-, and 6-mers.
Discussion on the predictive power of the model
The predictive power of the model measured by 10-fold cross-validation is encouraging. For 6-mers, the predictive accuracy (AUC) was 0.69 for Fibroblast, 0.70 for HEK293, 0.54 for HepG2, 0.73 for Leukocytes, and 0.65 for Trisom 21. These prediction accuracies using 6-mers cannot be achieved in random data sets, where the expected prediction accuracy is 0.5. Variations in the prediction accuracy for the five cell types, especially for HepG2, may be due to cell type-specific characteristics. On the other hand, the data obtained from [16] were of low coverage: amplicons covered less than 0.2% of the entire chromosome 21. Thus, variations in the prediction accuracy may be due to the low coverage of the data used. We were not able to further verify why the prediction accuracy varied. In fitting the whole data set, as opposed to 10-fold cross-validation, the final model was able to accurately predict methylation susceptibility.
Effect of the number of k-mers used for prediction
We conducted comprehensive modeling of cell type-specific DNA methylation susceptibility at three different resolutions: individual CpG sites, CpG segments, and promoter regions. The three modeling approaches were compared in terms of the AUC obtained by 10-fold cross-validation. The methods for modeling at individual CpG sites and at promoter regions are described in Additional file 2. To measure the effect of the number of k-mer patterns used for modeling, 10-fold cross-validations were performed with the number of k-mer patterns varying from 10 to 100 (in increments of 10 k-mers). P-values from t-tests were used to select the k-mers. The experimental results are illustrated in Figure 4. Only the segment modeling approach was effective for all 4-, 5-, and 6-mer experiments. Interestingly, the number of k-mers used for modeling had little impact on the prediction result, demonstrating that the prediction accuracy did not derive from overfitting the data and indicating that a small number of selected k-mers can effectively model methylation susceptibility without a loss of prediction power. Moreover, when longer k-mers were used (up to 6-mers), the prediction accuracy did not decrease. This finding is highly encouraging because, on average, a 6-mer is unlikely to occur by chance in a short (274 bp) DNA segment. Thus, a set of 6-mers can be used to model DNA methylation susceptibility.
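The claim about chance occurrences can be checked with a back-of-the-envelope calculation. Assuming, purely for illustration, that the four bases are independent and equally likely (an idealization that real promoter sequences do not satisfy), the expected number of occurrences of a specific k-mer in a segment of length L is (L − k + 1) / 4^k.

```python
def expected_occurrences(k, seg_len):
    """Expected count of one fixed k-mer in a random sequence of length seg_len,
    assuming independent, uniformly distributed bases."""
    return (seg_len - k + 1) / 4 ** k

six_mer = expected_occurrences(6, 274)    # a specific 6-mer in a 274 bp segment
three_mer = expected_occurrences(3, 274)  # a specific 3-mer in the same segment
```

Under this idealization a given 6-mer is expected about 0.066 times per 274 bp segment, whereas a given 3-mer is expected more than four times, which is consistent with the observation that short k-mers occur in almost every sequence while 6-mers rarely occur by chance.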
Conclusion
We conducted a comprehensive modeling study of cell type-specific DNA methylation susceptibility. By performing extensive computational experiments on data from five distinct cell types, we show that DNA methylation susceptibility can be accurately modeled at the segment level, achieving an AUC of up to 0.75 in a 10-fold cross-validation study. The two-step iterative segment modeling algorithm successfully identified segments that can be modeled by a logistic regression model using a set of k-mers. Our model further shows the significance of certain k-mers for the mixture model, which can potentially highlight DNA sequence features (k-mers) of differentially methylated promoter CpG island sequences in different cells and tissues, including malignancies. As only 3 or 4 bp patterns were used in previous modeling studies of DNA methylation susceptibility, this is the first report to show that k-mer modeling can be performed using up to 6-mers without loss of modeling accuracy.
Abbreviations
 • AUC: area under the ROC curve
 • DNA: deoxyribonucleic acid
 • MP: methylation-prone
 • MR: methylation-resistant
 • RF: random forest
 • YY: Youngik Yang
 • SK: Sun Kim
 • KN: Ken Nephew
References
1. Bird A: DNA methylation patterns and epigenetic memory. Genes Dev. 2002, 16: 6-21. 10.1101/gad.947102.
2. Jones PA, Laird PW: Cancer-epigenetics comes of age. Nat Genet. 1999, 21 (2): 163-167. 10.1038/5947.
3. Ting AH, McGarvey KM, Baylin SB: The cancer epigenome - components and functional correlates. Genes Dev. 2006, 20 (23): 3215-3231. 10.1101/gad.1464906.
4. Herman JG, Baylin SB: Gene silencing in cancer in association with promoter hypermethylation. N Engl J Med. 2003, 349 (21): 2042-2054. 10.1056/NEJMra023075.
5. Costello JF, Frühwald MC, Smiraglia DJ, Rush LJ, Robertson GP, Gao X, Wright FA, Feramisco JD, Peltomäki P, Lang JC, Schuller DE, Yu L, Bloomfield CD, Caligiuri MA, Yates A, Nishikawa R, Su Huang H, Petrelli NJ, Zhang X, O'Dorisio MS, Held WA, Cavenee WK, Plass C: Aberrant CpG-island methylation has non-random and tumour-type-specific patterns. Nat Genet. 2000, 24 (2): 132-138. 10.1038/72785.
6. Laird PW: Principles and challenges of genome-wide DNA methylation analysis. Nat Rev Genet. 2010, 11 (3): 191-203.
7. Feltus FA, Lee EK, Costello JF, Plass C, Vertino PM: Predicting aberrant CpG island methylation. Proc Natl Acad Sci USA. 2003, 100 (21): 12253-12258. 10.1073/pnas.2037852100.
8. Prüfer K, Stenzel U, Dannemann M, Green RE, Lachmann M, Kelso J: PatMaN: rapid alignment of short sequences to large databases. Bioinformatics. 2008, 24 (13): 1530-1531. 10.1093/bioinformatics/btn223.
9. McCabe MT, Lee EK, Vertino PM: A multifactorial signature of DNA sequence and polycomb binding predicts aberrant CpG island methylation. Cancer Res. 2009, 69: 282-291. 10.1158/0008-5472.CAN-08-3274.
10. Feltus FA, Lee EK, Costello JF, Plass C, Vertino PM: DNA motifs associated with aberrant CpG island methylation. Genomics. 2006, 87 (5): 572-579. 10.1016/j.ygeno.2005.12.016.
11. Keshet I, Schlesinger Y, Farkash S, Rand E, Hecht M, Segal E, Pikarski E, Young RA, Niveleau A, Cedar H, Simon I: Evidence for an instructive mechanism of de novo methylation in cancer cells. Nat Genet. 2006, 38 (2): 149-153. 10.1038/ng1719.
12. Goh L, Murphy SK, Muhkerjee S, Furey TS: Genomic sweeping for hypermethylated genes. Bioinformatics. 2007, 23 (3): 281-288. 10.1093/bioinformatics/btl620.
13. Fang F, Fan S, Zhang X, Zhang MQ: Predicting methylation status of CpG islands in the human brain. Bioinformatics. 2006, 22 (18): 2204-2209. 10.1093/bioinformatics/btl377.
14. Bock C, Paulsen M, Tierling S, Mikeska T, Lengauer T, Walter J: CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure. PLoS Genet. 2006, 2 (3): e26. 10.1371/journal.pgen.0020026.
15. Handa V, Jeltsch A: Profound flanking sequence preference of Dnmt3a and Dnmt3b mammalian DNA methyltransferases shape the human epigenome. J Mol Biol. 2005, 348 (5): 1103-1112. 10.1016/j.jmb.2005.02.044.
16. Zhang Y, Rohde C, Tierling S, Jurkowski TP, Bock C, Santacruz D, Ragozin S, Reinhardt R, Groth M, Walter J, Jeltsch A: DNA methylation analysis of chromosome 21 gene promoters at single base pair and single allele resolution. PLoS Genet. 2009, 5 (3): e1000438. 10.1371/journal.pgen.1000438.
17. Brunner AL, Johnson DS, Kim SW, Valouev A, Reddy TE, Neff NF, Anton E, Medina C, Nguyen L, Chiao E, Oyolu CB, Schroth GP, Absher DM, Baker JC, Myers RM: Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver. Genome Res. 2009, 19 (6): 1044-1056. 10.1101/gr.088773.108.
18. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM, Edsall L, Antosiewicz-Bourget J, Stewart R, Ruotti V, Millar AH, Thomson JA, Ren B, Ecker JR: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009, 462 (7271): 315-322. 10.1038/nature08514.
19. Taylor KH, Kramer RS, Davis WJ, Guo J, Duff DJ, Xu D, Caldwell CW, Shi H: Ultra-deep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Res. 2007, 67 (18): 8511-8518. 10.1158/0008-5472.CAN-07-1016.
20. Kim S, Li M, Paik H, Nephew K, Shi H, Kramer R, Xu D, Huang TH: Predicting DNA methylation susceptibility using CpG flanking sequences. Pac Symp Biocomput. 2008, 315-326.
21. Previti C, Harari O, Zwir I, Val CD: Profile analysis and prediction of tissue-specific CpG island methylation classes. BMC Bioinformatics. 2009, 10: 116. 10.1186/1471-2105-10-116.
22. Breiman L: Random forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.
23. NIPS: Feature selection challenge. 2003, [http://www.nipsfsc.ecs.soton.ac.uk]
24. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. 2003, McGraw-Hill Science/Engineering/Math, [http://www.amazon.com/exec/obidos/redirect?tag=citeulike0720\&path=ASIN/0072970545], 2nd edition.
25. Zhang Y, Rohde C, Tierling S, Jurkowski TP, Bock C, Santacruz D, Ragozin S, Reinhardt R, Groth M, Walter J, Jeltsch A: amplicon 193 @ONLINE. 2010, [http://biochem.jacobsuniversity.de/name21/presentation/amplicon_summaries/193_amplicon_summary.html]
Acknowledgements and funding
This work was supported by NIH U54 CA1130002 (Interrogating Epigenetic Changes in Cancer Genomes) to SK and KN and by Korea National Research Foundation 054320110016 to SK.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 3, 2012: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/13/S3.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
YY designed the computational framework, conducted simulation, and wrote the manuscript. KN gave critical input on biological discussion of this work, and drafted the manuscript. SK led the project, designed the algorithm and tests, and drafted the manuscript.
Electronic supplementary material
12859_2012_5093_MOESM1_ESM.pdf
Additional file 1: DNA methylation level variation. A figure in the file shows DNA methylation level variation in an amplicon from 5 cell types. (PDF 90 KB)
12859_2012_5093_MOESM2_ESM.pdf
Additional file 2: Competing modeling approaches. Two competing modeling approaches, CpG site-specific modeling and promoter region modeling, are described and compared with segment modeling. (PDF 471 KB)
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Yang, Y., Nephew, K. & Kim, S. A novel k-mer mixture logistic regression for methylation susceptibility modeling of CpG dinucleotides in human gene promoters. BMC Bioinformatics 13 (Suppl 3), S15 (2012). https://doi.org/10.1186/1471-2105-13-S3-S15
DOI: https://doi.org/10.1186/1471-2105-13-S3-S15