CpGcluster: a distance-based algorithm for CpG-island detection
© Hackenberg et al; licensee BioMed Central Ltd. 2006
Received: 22 June 2006
Accepted: 12 October 2006
Published: 12 October 2006
Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content.
Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome.
CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides. Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions.
Given the inherent mutability of methylated cytosine, the human genome has only a fraction (≈ 20%) of the CpG dinucleotides expected on the basis of its G+C content [1, 2]. However, the resulting scarcity of CpGs is not uniform throughout the chromosome: there are many DNA tracts (CpG islands or CGIs), totaling 1% of the genome, where CpGs are abundant [3–6]. The lack of methylation at CGIs, together with their elevated G+C content relative to the genome average, results in a frequency of CpG dinucleotides that is about 10-fold higher than in bulk DNA [5, 6]. About 60% of all genes have a CGI, normally unmethylated, in their promoter region [2, 6, 7]. However, in some physiological or pathological situations promoter-associated CGIs can be methylated, then provoking a change in the expression of the associated gene [8–11]. The maintenance of a particular genomic pattern of methylated CpGs provides an epigenetic means for differential regulation of gene expression [2, 7, 12].
Approximately 80% of all CpGs are methylated in human and mouse genomes, which makes the hypomethylated and GC-rich CGIs an outstanding genomic property. Given their putative function in gene regulation and their importance as genomic markers in promoter prediction, over recent years there has been a considerable effort to predict CGIs in silico. Current algorithms (newcpgreport; cpg; CpGProD; CpGIS [16, 17]; CpGIE; CpGED) rely on the ad hoc thresholds of length, CpG O/E ratio and G+C content originally defined by Gardiner-Garden and Frommer. These three thresholds lead to a parameter space which is relatively large and difficult to explore completely. Consequently, in many publications, these parameters have been fine-tuned in different ways, for example to filter out spurious Alu elements or to restrict the prediction to putative promoter CGIs. However, in every fine-tuning, "valid" CGIs are also filtered out, as a consequence of using the same parameters for both prediction and filtering; this suggests the use of different parameters in the two steps, as proposed below in the CpGcluster algorithm. Another shortcoming, shared by the algorithms using the conventional moving-window approach (newcpgreport, CpGProD, CpGIS and CpGIE), but not by the cpg script (which uses compositional segmentation) or CpGED (which uses a sliding double window), is that the island boundaries cannot be accurately defined to single base-pair resolution. As is well known, the methods using a moving window add another level of subjectivity in choosing both the length of the window and the step size. In contrast, the CpGcluster algorithm predicts the island boundaries to single base-pair resolution by definition.
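The three quantities on which these thresholds act can be computed directly from a sequence. The following is a minimal Python sketch, assuming the commonly used form of the observed/expected ratio, (number of CpGs × length) / (number of C × number of G). The function name `island_stats` is hypothetical and is not taken from any of the finders mentioned above.

```python
def island_stats(seq):
    """Return (length, G+C fraction, CpG O/E ratio) for a DNA sequence.

    Assumption: O/E = (#CpG * length) / (#C * #G), the form commonly
    attributed to Gardiner-Garden and Frommer.
    """
    seq = seq.upper()
    length = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    # "CG" cannot overlap itself, so str.count finds every CpG.
    n_cpg = seq.count("CG")
    gc_fraction = (n_c + n_g) / length if length else 0.0
    # Guard against sequences without C or without G.
    cpg_oe = (n_cpg * length) / (n_c * n_g) if n_c and n_g else 0.0
    return length, gc_fraction, cpg_oe


if __name__ == "__main__":
    print(island_stats("ACGT"))
    print(island_stats("CGCGCG"))
```

A threshold-based finder would compare these three numbers with fixed cutoffs inside a moving window, which is exactly the parameter space that the distance-based approach avoids.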
Bulk CpGs are thought to be in a dynamic equilibrium between the decay of methylated CpGs and the generation of new ones due to point mutations. This is a random process and therefore the CpG distance distributions should be strikingly different for bulk and island CpGs, which motivated our approach. In particular, the distances between consecutive bulk CpGs, as the result of a random process, should follow the geometric distribution, while the distance distribution for in-island CpGs must contain information on the high local clustering. Taking advantage of this high local clustering of CpG dinucleotides at CGIs, CpGcluster directly predicts clusters of CpGs on the chromosome. Predicted clusters with high enough statistical significance can then be identified as CGIs (see Methods).
Benchmarking of CpGcluster
Algorithm | Sn ± SD | Sp ± SD | CC ± SD | Hit* [%] ± SD
[label missing] | 0.545 ± 0.002 | 0.973 ± 0.002 | 0.725 ± 0.005 | 87.000 ± 0.540
[label missing] | 0.918 ± 0.003 | 0.657 ± 0.003 | 0.772 ± 0.006 | 94.675 ± 0.808
[label missing] | 0.832 ± 0.003 | 0.756 ± 0.007 | 0.789 ± 0.013 | 86.675 ± 1.528
[label missing] | 0.910 ± 0.002 | 0.667 ± 0.003 | 0.775 ± 0.006 | 94.650 ± 0.810
[label missing] | 0.819 ± 0.013 | 0.584 ± 0.004 | 0.685 ± 0.005 | 84.075 ± 1.191
CpGcluster (d_t = median, or 44 bp) | 0.655 ± 0.003 | 0.976 ± 0.005 | 0.797 ± 0.009 | 95.475 ± 0.870
CpGcluster (d_t = 75th percentile, or 94 bp) | 0.866 ± 0.006 | 0.832 ± 0.009 | 0.846 ± 0.006 | 95.050 ± 0.643
For these results, the median distance between neighboring CpGs and a p-value cutoff of 10⁻⁵ were used to run CpGcluster. As shown in the last row of Table 1, the raising of the distance threshold to the 75th percentile, thereby obtaining longer islands, increased sensitivity by more than 20% while only minimally improving overall accuracy. However, this led to a smaller fraction of CGI overlapping with PhastCons (shown below). On the other hand, lowering the p-value threshold beyond 10⁻⁵ slightly increased Sp but also clearly decreased Sn, thus lowering overall global accuracy (not shown). Consequently, the median distance was used as the only parameter for the island prediction and the 10⁻⁵ cutoff for the filtering in all subsequent analyses.
Finally, we examined another accuracy indicator, the Hit percentage, which gives the proportion of experimental CGIs which have been at least partially overlapped by the predicted islands. Table 1 shows that CpGcluster "hits" a higher number of islands than any other algorithm. This highest partial overlap (at least the core region of the CGI is predicted), together with the highest specificity mentioned above, might indicate an advantage of CpGcluster over the other tested algorithms.
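The accuracy indicators reported in Table 1 can be illustrated with a small Python sketch. It assumes nucleotide-level counting, with Sn as the proportion of island nucleotides correctly predicted and Sp as the proportion of predicted nucleotides that are truly island. The correlation coefficient is computed here in the form commonly used for gene-structure prediction, which is an assumption about how CC was obtained. The function names and the half-open `(start, end)` interval convention are hypothetical.

```python
from math import sqrt


def covered(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def accuracy(true_islands, predicted, seq_len):
    """Nucleotide-level Sn, Sp, CC and island-level Hit percentage."""
    t, p = covered(true_islands), covered(predicted)
    tp = len(t & p)
    fp = len(p - t)
    fn = len(t - p)
    tn = seq_len - tp - fp - fn
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    # An experimental island counts as "hit" if any of its bases is predicted.
    hits = sum(1 for s, e in true_islands if any(x in p for x in range(s, e)))
    hit = 100.0 * hits / len(true_islands) if true_islands else 0.0
    return {"Sn": sn, "Sp": sp, "CC": cc, "Hit": hit}


if __name__ == "__main__":
    print(accuracy([(10, 20), (50, 60)], [(15, 25)], 100))
```

The set-based bookkeeping is only practical for short examples; on chromosome-scale data an interval-sweep would be the more natural choice.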
Statistical analysis of predicted islands in human and mouse genomes
Basic statistics of CpGcluster and CpGProD islands
[Column headers are missing from this text; the four value columns are shown in their original order.]
Genome length (without N-runs, bp) | 2.85E+09 | 2.85E+09 | 2.51E+09 | 2.51E+09
Total number of CpGs | [values missing]
CpG-dinucleotides in CpG-islands (%) | [values missing]
Number of islands predicted | [values missing]
*Island coverage (%) | [values missing]
Island length (bp) | 273.5 ± 246.7 | 1043.8 ± 761.7 | 314.0 ± 293.8 | 1030.3 ± 560.0
Average island GC-content (%) | 63.76 ± 7.51 | 54.58 ± 6.12 | 61.58 ± 10.03 | 54.62 ± 5.17
Average CpG O/E ratio | 0.855 ± 0.265 | 0.636 ± 0.089 | 0.956 ± 0.428 | 0.652 ± 0.103
[row label missing] | 0.087 ± 0.041 | 0.047 ± 0.016 | 0.097 ± 0.084 | 0.048 ± 0.015
Overlap with PhastCons and MAGE genes
Overlap with TSS of MAGE genes
% of overlap with
Average length ± SD
271.0 ± 18.4
1,314.3 ± 525.1
800.0 ± 243.3
1,093.0 ± 476.1
730.5 ± 320.3
258.3 ± 100.8
The minimum length of a functional CGI is a difficult question, but insights can be derived from recent advances in mapping functional promoters. The shortest island in our prediction which overlaps with a TSS from DBTSS is 33 bp in length. When functional promoters are determined through ChIP-on-chip technology, that length goes down to 13 bp. Finally, when promoters are determined by using the cap analysis of gene expression (CAGE) approach, the minimum island length is just 11 bp, thus approaching the minimum lengths observed in both DerLab and CpGcluster databases. Thus, it seems that even very short islands may be functional. Also, it bears mentioning that short islands (<200 bp) predicted by CpGcluster which overlap with a TSS also show a very high degree of overlap (37%) with conserved elements (see below), thus suggesting probable biological relevance.
Further insight into the possible role of short CGIs is suggested by the recent finding of CpG "islets", genomic regions that are not conventionally classified as CpG islands because of their short length (<200 bp), but have a GC content and observed-to-expected CpG ratio that is characteristic of a CpG island. CpG islets may be non-methylated, corresponding to sites of active transcription and/or boundaries that separate major centromeric chromatin sub-domains.
All in all, these data support the possibility that genomic tracts with GC content and CpG Obs/Exp ratios typical of CGIs, but below the detection threshold of conventional CGI finders, may have a functional role in the genome. CpGcluster represents a new tool that may help to uncover such regions.
Minimal overlap between CGIs predicted by CpGcluster and Alu retrotransposons
A major source of uncertainty in CGI prediction is the interference of Alu retrotransposons. These elements, abundant in primate genomes, have often been falsely identified as CGIs by conventional CGI finders. To cope with this problem, some authors [15, 16, 18] have proposed a simple increment in the value of some of the thresholds used. The drawback of such a strategy is that some CGIs associated with genes would also be excluded under these more stringent criteria. Even so, the fraction of Alu overlap shown by the islands predicted by most programs is still rather large, while CpGcluster's CGIs demonstrate the least amount of overlap with Alu elements (Table 3). We wish to stress especially that CpGcluster does not need any minimum-length criterion to exclude a higher proportion of Alu elements than any of the previously existing algorithms tested.
Highest degree of overlap between CpGcluster islands and phylogenetic conserved elements (PhastCons) from vertebrates
Functional genomic elements, being under natural selection, are expected to be highly conserved during evolution. Therefore, if the predicted CGIs truly play a functional role, they should show a high degree of overlap with vertebrate PhastCons. Taking advantage of the 'most conserved' track (based on the best-in-genome pairwise alignments between human and seven other vertebrate genomes) at the UCSC Genome Browser, we computed the percentage of overlap between PhastCons and the CGIs predicted by the different finders. As seen in Table 3, the islands predicted by CpGcluster show the highest degree of overlap with PhastCons, thus indicating that our algorithm predicts a higher proportion of evolutionarily conserved, functionally relevant CGIs than does any other tested algorithm.
Promoter and CpG island co-location
For a further quality assessment of the islands predicted by CpGcluster, we assigned them to five classes according to their co-location with annotated genes from the RefSeq database. The classification proposed by Ioshikhes and Zhang was improved by using exon boundaries (instead of absolute positions) to define the different classes. Accordingly, we divided CGIs into five classes defined as follows: L1, the island overlaps with the TSS; L2, the island does not overlap with the TSS but is located somewhere between 2 kb upstream of the TSS and the end of the first exon; L3, the island is located somewhere between the end of the first exon and the start of the last exon; L4, the island is located between the start of the last exon and 2 kb downstream of the Transcription End Site (TES); NG, the island is outside the gene environment.
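The five classes can be expressed as a short decision procedure. The sketch below is only an illustration under several assumptions that the text does not settle: the gene lies on the plus strand, coordinates are closed intervals, "located between" is read as "overlaps the region", and ties are resolved in the order L1, L2, L3, L4. The function names are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def classify_island(start, end, tss, first_exon_end, last_exon_start, tes,
                    flank=2000):
    """Assign an island to L1, L2, L3, L4 or NG for a plus-strand gene.

    Assumptions: closed coordinates, overlap-based membership and the
    priority order L1 > L2 > L3 > L4 (none of these is stated in the text).
    """
    if start <= tss <= end:
        return "L1"  # island overlaps the TSS
    if overlaps(start, end, tss - flank, first_exon_end):
        return "L2"  # 2 kb upstream of the TSS to the end of the first exon
    if overlaps(start, end, first_exon_end + 1, last_exon_start - 1):
        return "L3"  # end of the first exon to the start of the last exon
    if overlaps(start, end, last_exon_start, tes + flank):
        return "L4"  # start of the last exon to 2 kb downstream of the TES
    return "NG"      # outside the gene environment


if __name__ == "__main__":
    gene = dict(tss=10000, first_exon_end=10500,
                last_exon_start=15000, tes=16000)
    for island in [(9900, 10100), (8500, 9000), (12000, 12200),
                   (15500, 15800), (30000, 30100)]:
        print(island, classify_island(*island, **gene))
```

For a minus-strand gene the same logic applies after mirroring the coordinates, so that "upstream" again means smaller values.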
Location of CpGcluster islands
[Row labels and the "# CpG islands" and "PhastCons overlap (%)" columns are missing from this text; rows are shown in their original order.]
Length ± SD | CpG Density ± SD | Obs/Exp Ratio ± SD | %GC ± SD
672.3 ± 398.7 | 0.10 ± 0.02 | 0.89 ± 0.13 | 68.4 ± 5.7
256.3 ± 213.1 | 0.09 ± 0.04 | 0.89 ± 0.23 | 64.6 ± 7.8
230.5 ± 173.0 | 0.08 ± 0.04 | 0.84 ± 0.27 | 63.1 ± 7.1
247.8 ± 212.8 | 0.09 ± 0.04 | 0.85 ± 0.23 | 65.8 ± 7.3
266.0 ± 238.2 | 0.09 ± 0.04 | 0.85 ± 0.27 | 63.5 ± 7.5
745.7 ± 373.8 | 0.09 ± 0.02 | 0.83 ± 0.13 | 65.3 ± 5.2
302.7 ± 257.4 | 0.09 ± 0.07 | 0.92 ± 0.38 | 62.0 ± 9.3
230.6 ± 190.7 | 0.10 ± 0.09 | 0.97 ± 0.45 | 61.4 ± 10.4
284.6 ± 232.6 | 0.08 ± 0.06 | 0.87 ± 0.35 | 61.3 ± 8.3
284.0 ± 257.6 | 0.10 ± 0.09 | 0.98 ± 0.45 | 61.1 ± 10.5
Table 4 also shows that both in humans and in mice CpGcluster predicts drastically different islands as a function of genomic location: promoter CGIs (L1) are longer and have lower p-values than do the rest of the classes. Another important observation is that both in humans and mice, promoter CGIs are much richer in vertebrate PhastCons elements than are non-genic islands (NG).
In addition, two surprising observations were made in the mouse genome: 1) promoter islands have smaller CpG densities and CpG fractions than non-genic islands have; and 2) L4 islands, located mainly in the 3' untranslated regions (3' UTRs), show a high proportion of PhastCons overlap. It is well known that when mouse and human orthologous genes are compared, the mouse line shows a net loss of CpG islands. This probably indicates a higher "pressure" on CGIs in mice, which may account for these findings.
Location of CpGProD islands
[Row labels and the "# CpG islands" and "PhastCons overlap (%)" columns are missing from this text; rows are shown in their original order.]
Length ± SD | CpG Density ± SD | Obs/Exp Ratio ± SD | %GC ± SD
1,831.5 ± 875.8 | 0.06 ± 0.01 | 0.74 ± 0.09 | 59.3 ± 5.0
933.0 ± 599.3 | 0.04 ± 0.01 | 0.62 ± 0.08 | 53.6 ± 5.8
814.5 ± 436.2 | 0.04 ± 0.01 | 0.61 ± 0.07 | 53.4 ± 5.6
1,097.2 ± 775.8 | 0.05 ± 0.02 | 0.63 ± 0.08 | 56.2 ± 6.8
988.3 ± 739.8 | 0.05 ± 0.02 | 0.63 ± 0.09 | 54.2 ± 6.1
1,463.4 ± 576.0 | 0.06 ± 0.01 | 0.72 ± 0.11 | 58.1 ± 4.2
1,007.7 ± 560.1 | 0.05 ± 0.01 | 0.64 ± 0.11 | 55.0 ± 5.2
807.2 ± 374.0 | 0.04 ± 0.01 | 0.61 ± 0.08 | 53.1 ± 4.5
997.4 ± 501.5 | 0.05 ± 0.01 | 0.64 ± 0.09 | 55.1 ± 5.1
924.4 ± 502.7 | 0.05 ± 0.01 | 0.64 ± 0.10 | 53.2 ± 5.0
Statistics of CpG distances and %G+C in human and mouse chromosomes
"Stretches of DNA with a high G+C content, and a frequency of CpG dinucleotides close to the expected value, appear as CpG clusters within the CpG-depleted bulk DNA, and are now generally known as CpG islands". This original description of CpG islands by Gardiner-Garden and Frommer in 1987 formulates the basic idea underlying the present work: CpG dinucleotides appear clustered within the CpG-depleted bulk DNA, and it should be possible to associate these clusters with CpG islands. In the same work, these authors also proposed a criterion for CpG islands based on thresholds, which later became the basic principle of practically all existing CpG island finders. They justified these criteria by assuming that CpG-rich regions over 200 bp in length are unlikely to have occurred by chance alone, which points to another important property of CpG islands implemented in this work: statistical significance. Some years before, McClelland and Ivarie had introduced a Chi-square test to assign a statistical significance to CpG islands. Therefore, our approach is probably more closely related to the original perception of CpG islands as statistically significant CpG clusters within CpG-depleted regions.
The first consequence of the difference between the distance and threshold approaches is that, on average, CpGcluster islands are shorter. However, they show higher mean G+C content, CpG density, and CpG fractions than do any of the other five tested algorithms (Table 2). The lower values shown by these threshold-based algorithms may be an inherited consequence of the general approach shared by most of them. To some extent, the chosen thresholds predetermine the statistical properties of the islands, since these usually become enlarged as long as the thresholds are not violated. This threshold-dependent enlargement in the search process may also lead to the observed over-prediction of CpG islands and high Alu overlap shown by most threshold-based algorithms. In contrast, CpGcluster overcomes this drawback, since statistical properties of the CGIs, such as G+C content or CpG fraction, are not used as search parameters. Note furthermore that the p-value is a crucial filter parameter to sort out spurious Alu elements. Young Alus have p-values around 10⁻⁷ (with slight variations among chromosomes); therefore, the high substitution rates on the Alu CpG sites produce a fast loss of statistical significance, which explains the low overlap with spurious Alu elements shown by the islands predicted by CpGcluster.
Finally, we wish to discuss briefly the lack of any length filter in CpGcluster, which allows the prediction of extremely short islands and which, at first glance, could be interpreted as a disadvantage. It should be noted that in all of the previous algorithms the length is not used for prediction purposes, and is considered only in the final filtering process. In fact, the original idea of the length threshold was to guarantee that the predicted islands are not a mere product of chance. Instead, we replace the length filter with a statistically stricter criterion: the p-value. In this way, all predicted CGIs are statistically significant CpG clusters. We are aware that the putative functional CGIs are on average very long (as for example the L1 class in Table 4). However, it is important to stress the conceptual difference between the detection of CpG clusters and the subsequent filtering for a particular subset (e.g. promoter-overlapping CGIs). These two steps should be clearly distinguished.
The distance-based CGI-finder algorithm described here presents three outstanding features: i) all the predicted CGIs start and end with a CpG dinucleotide; ii) all the computations needed use integer arithmetic, thus leading to a fast and computationally efficient CGI finder; and iii) a p-value is associated with each of the predicted islands. When compared to other CGI finders, CpGcluster is able to predict CGIs with the highest global accuracy and specificity, thus indicating a low rate of false-positive predictions. Short but fully functional CGIs are also predicted by CpGcluster. Furthermore, the degree of overlap of predicted CGIs with Alu retrotransposons is minimal, while the overlap with vertebrate PhastCons is maximal. The promoter CGIs predicted by CpGcluster also show the highest statistical significance, thus qualifying CpGcluster as a valuable tool to distinguish functional CGIs from the remaining islands in the bulk genome.
The CpGcluster algorithm presented in this work consists of two main steps: i) a distance-based algorithm searches for clusters of CpGs in the chromosome sequence; ii) a p-value is associated with each of these clusters, and only those clusters with large enough statistical significance (i.e. with p-values below the selected threshold) are predicted as CGIs. These two steps are explained in detail in the next subsections.
CpG cluster-searching algorithm
The cluster-searching method is based on the statistical properties of the physical distances between neighboring CpG dinucleotides on the DNA sequence. In principle, if CpGs are distributed totally at random along the chromosome sequence, the distances between neighboring CpG dinucleotides should follow the geometric distribution:
$$P(d) = (1 - p)^{d-1}\,p \qquad (1)$$
where P(d) represents the probability of finding a distance d between neighboring CpGs and p corresponds to the probability of CpGs in the sequence, calculated as the ratio between CpGs and the total number of dinucleotides in the DNA sequence.
The working hypothesis behind the cluster-searching algorithm is that the abundant CpGs in CGIs may be separated by shorter distances (thereby forming clusters) than the distances between bulk CpGs, which in principle should follow the geometric distribution [Eq. 1].
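As an illustration of Eq. 1, the sketch below estimates p as the ratio between CpGs and the total number of dinucleotides, as described above, and evaluates the geometric probability of a given distance. The function names are hypothetical; comparing these expected probabilities with an observed distance histogram is one way to see the excess of short distances that the working hypothesis predicts.

```python
def cpg_probability(seq):
    """Ratio between the number of CpGs and the number of dinucleotides."""
    seq = seq.upper()
    n_dinucleotides = len(seq) - 1
    if n_dinucleotides <= 0:
        return 0.0
    return seq.count("CG") / n_dinucleotides


def geometric_pmf(d, p):
    """Eq. 1: probability of a distance d (d >= 1) under random placement."""
    if d < 1:
        raise ValueError("distances start at 1")
    return (1.0 - p) ** (d - 1) * p


if __name__ == "__main__":
    p = cpg_probability("ACGTCG")
    print(p, [geometric_pmf(d, p) for d in (1, 2, 3)])
```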
1. The chromosome sequence is scanned for CpG dinucleotides, recording the positions occupied by the 'C': x_1, x_2, ..., x_N, N being the total number of CpGs in the sequence. The sequence is normally scanned in the 5' → 3' direction; trivially, the reverse scan (3' → 5') produces the same results.
2. As a convention, the physical distance separating two neighboring CpGs is defined as:
$$d_i = x_{i+1} - x_i - 1,$$
so that the minimal distance between two neighboring CpGs (i.e. CGCG) is equal to 1.
3. In the course of the scan, the first distance below a given threshold (d_t) identifies the first CpG cluster. The threshold d_t can be conveniently derived from the distribution of distances between neighboring CpGs in the chromosome sequence. The median distance (Figs. 1, 2) often gives the best results because the median of the observed distribution lies approximately at the transition point between the over-represented (intra-cluster) small distances and the under-represented intermediate ones. This is not an exclusive property of chromosome 1, as it is shared by all the chromosomes (see Additional files 2, 3, 4, 5), thus indicating that the median distance can be chosen in general as a good threshold (d_t).
4. We then try to extend this first cluster downstream (→ 3') by adding the next CpG while the distances are below d_t. When a distance exceeding d_t is found, the cluster is completed, and the search for a new one continues downstream.
5. Steps 3 and 4 are iterated until all the CpG clusters in the sequence are identified.
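The scan described in the steps above can be sketched in a few lines of Python. This is not the published Perl implementation (Additional file 7), only a minimal illustration. It assumes 0-based coordinates and treats a distance equal to d_t as still belonging to the cluster, since the text ends a cluster only when a distance exceeds d_t. All function names are hypothetical.

```python
from statistics import median


def cpg_positions(seq):
    """0-based positions of the 'C' of every CpG dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def neighbor_distances(positions):
    """d_i = x_(i+1) - x_i - 1, so CGCG gives a distance of 1."""
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


def find_clusters(positions, d_t):
    """Return (start, end, n_cpgs) per cluster; end is the last 'G', inclusive.

    Assumption: a distance equal to d_t keeps the cluster open.
    """
    clusters = []
    first = None  # index of the first CpG of the currently open cluster
    for i, d in enumerate(neighbor_distances(positions)):
        if d <= d_t:
            if first is None:
                first = i
        elif first is not None:
            clusters.append((positions[first], positions[i] + 1, i - first + 1))
            first = None
    if first is not None:  # close a cluster that runs to the last CpG
        last = len(positions) - 1
        clusters.append((positions[first], positions[last] + 1, last - first + 1))
    return clusters


if __name__ == "__main__":
    seq = "CGCG" + "A" * 10 + "CGACG"
    pos = cpg_positions(seq)
    d_t = median(neighbor_distances(pos))
    print(pos, d_t, find_clusters(pos, d_t))
```

Only integer comparisons on the position list are needed once d_t is fixed, and every reported cluster starts at a 'C' and ends at a 'G' by construction.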
Note that this algorithm acquires two important and distinctive features by construction. First, all predicted CGIs start and end with a CpG dinucleotide, which seems appropriate. Secondly, the algorithm uses only integer arithmetic, thus being computationally efficient. No other CGI searching algorithm shares these two important properties.
Assigning p-values to CpG-clusters
Once all the CpG clusters are found in the sequence following the algorithm described above, the next step is to associate a p-value with each one – i.e. the probability of such a cluster appearing by chance in a random sequence. Such a probability can be estimated either numerically by a randomization test on the DNA sequence or by means of a theoretical probability function (both cases shown in Additional file 6). For the latter case, the negative binomial distribution (also known as Pascal or Pólya distribution) can be conveniently tailored to the requirements of CpG clusters. In general, this distribution can be applied to experiments with dichotomous outcomes (either success or failure) and gives the probability of having a certain number of failures when the number of successes was fixed in advance, taking into account that the experiment must always end with a success. By translating these requirements to a genomic context, the successes were equated with CpG dinucleotides and the failures with non-CpGs (all other 15 possible dinucleotides). One prerequisite is that all trials be independent, which is not automatically fulfilled when dealing with overlapping dinucleotides (note that a CpG dinucleotide will always be followed by a non-CpG [GN] dinucleotide). Therefore, these "forced" non-CpGs need to be considered when calculating the success probabilities. Thus, the probability for a cluster with a number (N) of CpGs is given by
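The displayed expression is not present in this text. Assuming the standard negative binomial form, with N − 1 successes (CpGs after the first one), n_f failures (independent non-CpGs) and success probability p, it would read:

$$P_{N,p}(n_f) = \binom{N - 2 + n_f}{n_f}\, p^{\,N-1}\,(1 - p)^{n_f}$$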
This formula takes into account that all our clusters start with a CpG, and therefore the number of successes is N − 1 (instead of N, the number of CpGs in the cluster). The number of independent non-CpGs (n_f) in a cluster (failures) can be calculated as:
$$n_f = L - 2N$$
L being the cluster length (in nucleotides). The success probability p (probability of finding a CpG) is calculated as:
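The expression itself is missing here. A form consistent with the definitions given next, offered as a reconstruction rather than a quotation, is the ratio of the CpG count to the number of independent dinucleotides:

$$p = \frac{N_s}{n}$$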
N_s being the number of CpG dinucleotides in the sequence and n the number of independent dinucleotides (i.e. including the CpGs but excluding the forced non-CpGs). The theoretical probabilities determined by this analytical method agree well with those found by numerical simulation, as shown in Additional file 6. Given its lower computational cost, the theoretical approach was implemented in our software.
The negative binomial is a two-tailed distribution (Additional file 6). The left tail indicates a high local CpG clustering (accumulation of CpGs) while the right tail comprises CpG-depleted regions. Therefore, the probability that an observed local CpG frequency is significantly higher than expected under random conditions (CpG clustering) is given by the cumulative distribution function of the CpG cluster at point n_f, which can therefore be taken as its p-value:
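The expression is not present in this text. Assuming the standard negative binomial with N − 1 successes, the cumulative probability up to the observed number of failures n_f would be:

$$p\text{-value} = \sum_{j=0}^{n_f} \binom{N - 2 + j}{j}\, p^{\,N-1}\,(1 - p)^{j}$$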
The use of this latter expression allows us to discriminate between the clusters found in the first step of the algorithm: those clusters with a p-value below a given threshold (usually 10⁻⁵, see Section "Benchmarking of CpGcluster") are predicted as CGIs, while the rest of the clusters are discarded.
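The filtering step can be sketched as follows. The code assumes the standard negative binomial with N − 1 successes and n_f = L − 2N failures, summed from 0 to n_f. The helper `success_probability` further assumes that the number of independent dinucleotides is the total number of dinucleotides minus one forced non-CpG per CpG, which is one reading of the description above and not a confirmed detail of the published software. All names are hypothetical.

```python
from math import comb


def success_probability(n_cpgs_in_sequence, sequence_length):
    """p = N_s / n, assuming n = (sequence_length - 1) - N_s (see lead-in)."""
    n = (sequence_length - 1) - n_cpgs_in_sequence
    return n_cpgs_in_sequence / n if n > 0 else 0.0


def cluster_p_value(n_cpgs, length, p):
    """Cumulative negative binomial probability of at most n_f failures."""
    if n_cpgs < 2:
        raise ValueError("a cluster needs at least two CpGs")
    n_f = length - 2 * n_cpgs  # independent non-CpGs in the cluster
    if n_f < 0:
        raise ValueError("length is too short for this many CpGs")
    successes = n_cpgs - 1     # the first CpG only opens the cluster
    total = 0.0
    for j in range(n_f + 1):
        total += comb(successes - 1 + j, j) * p ** successes * (1.0 - p) ** j
    return min(total, 1.0)


def filter_clusters(clusters, p, cutoff=1e-5):
    """Keep (start, end, n_cpgs) clusters whose p-value is below the cutoff."""
    kept = []
    for start, end, n_cpgs in clusters:
        if cluster_p_value(n_cpgs, end - start + 1, p) < cutoff:
            kept.append((start, end, n_cpgs))
    return kept


if __name__ == "__main__":
    print(cluster_p_value(2, 4, 0.1), cluster_p_value(3, 7, 0.1))
```

Plain floating-point summation is adequate for short clusters; very long clusters would call for log-space arithmetic or a library routine for the negative binomial distribution.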
Assembling test sequences containing CpG islands
1. The full list of DerLab CGIs was retrieved. These experimental CGIs can be quite short; the minimum length is actually 8 bp. Out of the 6235 CGIs, 1612 (or 26%) were shorter than 200 bp. The experimental islands were then divided into two groups: those that overlapped with the TSS and those that did not. The TSS coordinates were taken from the DBTSS database. These two groups differed significantly in their mean length, CpG density, and CpG fraction, with all values being higher for the TSS group. In assembling the test sequences, we exclusively used the TSS group, which had a greater average length.
2. In addition, non-island sequences (the sequences located between the CGIs of chromosome 22, as specified by the UCSC annotation) were extracted.
3. To further ensure a random background for the CGIs in our test sequences, all non-island segments were randomly shuffled using an algorithm that preserves dinucleotide frequencies. As non-island segments could contain some non-annotated CGIs, this step ensures the randomness of the non-island segments, while conserving nucleotide and dinucleotide compositions. This setup is the least biased, as none of the finders is expected to predict CGIs on randomized sequences.
4. The shuffled non-island segments were then alternately combined with 400 island segments overlapping with TSS, chosen at random from the DerLab sample, thus assembling a test sequence of approximately 18 Mb in length.
5. Using the assembling process described in step (4), we generated a set of 10 test sequences containing experimental CGIs alternating with shuffled non-island segments.
Availability and requirements
Project name: CpGcluster
Project home pages: http://bioinfo2.ugr.es/CpGcluster
Operating system(s): platform independent
Programming language: Perl 5 (see Additional file 7 for source code)
Licence: open source
- CpG O/E ratio: Ratio between observed and expected CpG frequencies
- G+C content (%G+C): Molecular fraction of guanine and cytosine
- PhastCons: Phylogenetic Conserved Elements
- Sn: Sensitivity – the proportion of island nucleotides which have been correctly predicted as islands
- Sp: Specificity – the proportion of predicted island nucleotides that are actually islands
- promoter of SERPINB5 gene
- TES: Transcription End Site
- TSS: Transcription Start Site
Helpful comments by Antonio Marín and two anonymous reviewers are greatly appreciated. This work was supported by the Spanish Government (BIO2005-09116-C03-01 to JLO, MH, PC and CP and BIO2002-04014-C03-03 to PLE and JMA) and Plan Andaluz de Investigación (CVI-162 and FQM-322). MH and CP acknowledge the grants from the University of Granada (Spain) and the German Academic Exchange Service (DAAD), respectively. CP is grateful for the use of the facilities of the German Cancer Research Center (Heidelberg, Germany). The help of David Nesbitt with the English version of the manuscript is also appreciated.
- Sved J, Bird A: The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc Natl Acad Sci USA 1990, 87(12):4692–6. 10.1073/pnas.87.12.4692PubMed CentralView ArticlePubMedGoogle Scholar
- Antequera F: Structure, function and evolution of CpG island promoters. Cell Mol Life Sci 2003, 60(8):1647–58. 10.1007/s00018-003-3088-6View ArticlePubMedGoogle Scholar
- McClelland M, Ivarie R: Asymmetrical distribution of CpG in an 'average' mammalian gene. Nucleic Acids Res 1982, 10(23):7865–77.PubMed CentralView ArticlePubMedGoogle Scholar
- Cooper DN, Taggart MH, Bird AP: Unmethylated domains in vertebrate DNA. Nucleic Acids Res 1983, 11(3):647–58.PubMed CentralView ArticlePubMedGoogle Scholar
- Bird AP: CpG-rich islands and the function of DNA methylation. Nature 1986, 321(6067):209–13. 10.1038/321209a0View ArticlePubMedGoogle Scholar
- Antequera F, Bird A: Number of CpG islands and genes in human and mouse. Proc Natl Acad Sci USA 1993, 90(24):11995–9. 10.1073/pnas.90.24.11995
- Bird AP: DNA methylation patterns and epigenetic memory. Genes Dev 2002, 16:6–21. 10.1101/gad.947102
- Antequera F, Boyes J, Bird A: High levels of de novo methylation and altered chromatin structure at CpG islands in cell lines. Cell 1990, 62(3):503–14. 10.1016/0092-8674(90)90015-7
- Esteller M, Corn PG, Baylin SB, Herman JG: A gene hypermethylation profile of human cancer. Cancer Res 2001, 61(8):3225–9.
- Baylin SB, Esteller M, Rountree MR, Bachman KE, Schuebel K, Herman JG: Aberrant patterns of DNA methylation, chromatin formation and gene expression in cancer. Hum Mol Genet 2001, 10(7):687–92. 10.1093/hmg/10.7.687
- Issa JP: CpG island methylator phenotype in cancer. Nat Rev Cancer 2004, 4(12):988–93. 10.1038/nrc1507
- Saxonov S, Berg P, Brutlag DL: A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci USA 2006, 103(5):1412–7. 10.1073/pnas.0510310103
- Larsen F, Gundersen G, Lopez R, Prydz H: CpG islands as gene markers in the human genome. Genomics 1992, 13(4):1095–107. 10.1016/0888-7543(92)90024-M
- Li W, Bernaola-Galván PA, Haghighi F, Grosse I: Applications of recursive segmentation to the analysis of DNA sequences. Comput Chem 2002, 26:491–509. 10.1016/S0097-8485(02)00010-4
- Ponger L, Mouchiroud D: CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics 2002, 18(4):631–3. 10.1093/bioinformatics/18.4.631
- Takai D, Jones PA: Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc Natl Acad Sci USA 2002, 99(6):3740–5. 10.1073/pnas.052410099
- Takai D, Jones PA: The CpG island searcher: a new WWW resource. In Silico Biol 2003, 3(3):235–40.
- Wang Y, Leung FC: An evaluation of new criteria for CpG islands in the human genome as gene markers. Bioinformatics 2004, 20(7):1170–7. 10.1093/bioinformatics/bth059
- Luque-Escamilla PL, Martinez-Aroza J, Oliver JL, Gomez-Lopera JF, Roman-Roldan R: Compositional searching of CpG islands in the human genome. Phys Rev E Stat Nonlin Soft Matter Phys 2005, 71(6 Pt 1):061925.
- Gardiner-Garden M, Frommer M: CpG islands in vertebrate genomes. J Mol Biol 1987, 196(2):261–82. 10.1016/0022-2836(87)90689-9
- Li W: Delineating relative homogeneous G+C domains in DNA sequences. Gene 2001, 276(1–2):57–72. 10.1016/S0378-1119(01)00672-2
- Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34(3):353–67. 10.1006/geno.1996.0298
- Stancheva I, El-Maarri O, Walter J, Niveleau A, Meehan RR: DNA methylation at promoter regions regulates the timing of gene activation in Xenopus laevis embryos. Dev Biol 2002, 243(1):155–65. 10.1006/dbio.2001.0560
- Futscher BW, Oshiro MM, Wozniak RJ, Holtan N, Hanigan CL, Duan H, Domann FE: Role for DNA methylation in the control of cell type specific maspin expression. Nat Genet 2002, 31(2):175–9. 10.1038/ng886
- De Smet C, Lurquin C, Lethe B, Martelange V, Boon T: DNA methylation is the primary silencing mechanism for a set of germ line- and tumor-specific genes with a CpG-rich promoter. Mol Cell Biol 1999, 19(11):7327–35.
- Kim TH, Barrera LO, Qu C, Van Calcar S, Trinklein ND, Cooper SJ, Luna RM, Glass CK, Rosenfeld MG, Myers RM, Ren B: Direct isolation and identification of promoters in the human genome. Genome Res 2005, 15(6):830–9. 10.1101/gr.3430605
- Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC, Forrest AR, Alkema WB, Tan SL, Plessy C, Kodzius R, Ravasi T, Kasukawa T, Fukuda S, Kanamori-Katayama M, Kitazume Y, Kawaji H, Kai C, Nakamura M, Konno H, Nakano K, Mottagui-Tabar S, Arner P, Chesi A, Gustincich S, Persichetti F, Suzuki H, Grimmond SM, Wells CA, Orlando V, Wahlestedt C, Liu ET, Harbers M, Kawai J, Bajic VB, Hume DA, Hayashizaki Y: Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 2006, 38(6):626–35. 10.1038/ng1789
- Wong NC, Wong LH, Quach JM, Canham P, Craig JM, Song JZ, Clark SJ, Choo KH: Permissive transcriptional activity at the centromere through pockets of DNA hypomethylation. PLoS Genet 2006, 2(2):e17. 10.1371/journal.pgen.0020017
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15(8):1034–50. 10.1101/gr.3715005
- UCSC Genome Browser [http://genome.ucsc.edu]
- The RefSeq Database [http://www.ncbi.nih.gov/RefSeq]
- Ioshikhes IP, Zhang MQ: Large-scale human promoter mapping using CpG islands. Nat Genet 2000, 26(1):61–3. 10.1038/79189
- Heisler LE, Torti D, Boutros PC, Watson J, Chan C, Winegarden N, Takahashi M, Yau P, Huang TH, Farnham PJ, Jurisica I, Woodgett JR, Bremner R, Penn LZ, Der SD: CpG Island microarray probe sequences derived from a physical library are representative of CpG Islands annotated on the human genome. Nucleic Acids Res 2005, 33(9):2952–61. 10.1093/nar/gki582
- Yamashita R, Suzuki Y, Wakaguri H, Tsuritani K, Nakai K, Sugano S: DBTSS: DataBase of Human Transcription Start Sites, progress report 2006. Nucleic Acids Res 2006, 34(Database issue):D86–9. 10.1093/nar/gkj129
- Altschul SF, Erickson BW: Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol 1985, 2(6):526–38.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.