Estimation of CpG coverage in whole methylome next-generation sequencing studies
© van den Oord et al.; licensee BioMed Central Ltd. 2013
Received: 25 July 2012
Accepted: 8 February 2013
Published: 12 February 2013
Methylation studies are a promising complement to genetic studies of DNA sequence. However, detailed prior biological knowledge is typically lacking, so methylome-wide association studies (MWAS) will be critical to detect disease relevant sites. A cost-effective approach involves the next-generation sequencing (NGS) of single-end libraries created from samples that are enriched for methylated DNA fragments. A limitation of single-end libraries is that the fragment size distribution is not observed. This hampers several aspects of the data analysis such as the calculation of enrichment measures that are based on the number of fragments covering the CpGs.
We developed a non-parametric method that uses isolated CpGs to estimate sample-specific fragment size distributions from the empirical sequencing data. Through simulations we show that our method is highly accurate. Whereas the traditional (extended) read count methods resulted in severely biased coverage estimates and introduced artificial inter-individual differences, the use of the estimated fragment size distributions removed these biases almost entirely. Furthermore, we found correlations of 0.999 between coverage estimates obtained using fragment size distributions that were estimated with our method versus those that were “observed” in paired-end sequencing data.
We propose a non-parametric method for estimating fragment size distributions that is highly precise and can improve the analysis of cost-effective MWAS that sequence single-end libraries created from samples that are enriched for methylated DNA fragments.
Methylation studies are a promising complement to genetic studies of variation in DNA sequence and structure. Most intensively studied is the methylation of DNA cytosine residues at the carbon 5 position (5meC). Methylation is typically associated with transcriptional repression [1, 2]. This direct link to gene expression means that methylation studies can potentially capture more individual variation in disease susceptibility. Methylation studies can also shed a unique light on disease mechanisms and clinical phenomena [2, 3] such as sex differences [4, 5], genotype-environment interactions [3, 6], and age-related patterns associated with the disease course. Finally, methylation sites are appealing from a translational perspective because they are modifiable by pharmacological interventions and are easy to measure using cost-effective assays in readily available biosamples.
For most common, complex diseases, detailed prior biological knowledge is typically lacking. Therefore, genome-wide approaches that proved fruitful in the context of sequence variants will also be critical to detect disease relevant methylation sites [11–13]. Next-generation sequencing (NGS) is an appealing technology for such methylome-wide association studies (MWAS). Compared to arrays, NGS provides better coverage of all possible methylation sites in the human genome. Furthermore, the relatively low amounts of starting material will reduce errors and bias caused by sample preparation and amplification. Finally, the availability of fast semi-automated sample preparation, the increase in the amount of data generated per run, and the decrease in reagent costs have already made NGS a cost-effective option for a comprehensive interrogation of the methylome.
The most comprehensive method for ascertaining methylation (5meC) status at each nucleotide position is bisulfite sequencing, where unmethylated cytosines in genomic DNA are converted to uracil and then converted to thymine in post-bisulfite PCR. The single base resolution is attractive because it allows precise mapping of disease relevant sites. However, due to the combination of high costs of sequencing entire genomes and the large numbers of samples needed to provide adequate statistical power, whole-genome bisulfite sequencing is not currently economically feasible as a screening tool for disease association studies. A commonly used cost-effective alternative aims to sequence only the methylated part of the genome. Here, DNA is first fragmented and the methylated fragments are bound to antibodies or other proteins with high affinity for methylated DNA. The unmethylated genomic fraction is washed away, and the methylation-enriched portion of the sample is then collected and sequenced [18–21].
Knowledge of the fragment size distributions in enrichment-based MWAS is important for several aspects of the data analysis. A clear example involves the calculation of enrichment measures. DNA methylation is most often, although not exclusively, found in the sequence context CpG. Certain enrichment protocols (e.g. MBD-based capture, which uses the methyl binding domain of methyl binding proteins) can detect only CpG methylation. Given that we know exactly where the CpGs are located, there is no need to search for enrichment peaks using methods commonly used in ChIP-seq experiments [22–26]. Although there are other ways to quantify enrichment, a commonly used approach is to count the number of fragments covering each CpG. If the fragment sizes are unknown, the number of reads covering the CpG is typically counted instead, where the read length is sometimes extended to the expected fragment length. However, this is a rough approximation. First, due to the stochastic nature of DNA fragmentation, one cannot assume an equal size for all fragments. Second, the expected fragment size may be mis-specified if the sequenced fragment pool differs from the one obtained after fragmentation. This could arise, for example, if smaller fragments are more likely to be pulled down in the enrichment step. Third, there will be variation in the fragment size distribution across samples despite standardized lab protocols. The possible implication of using the read count approximation is therefore that coverage estimates may become biased and imprecise.
A second example illustrating the importance of the fragment size distribution is that failure to account for differences in fragment size distributions between samples may create artificial inter-individual differences in coverage estimates. To illustrate this, assume that in sample A all DNA fragments are exactly 50 bases in length and that we sequence at 50 bp read length. When using a read count to calculate coverage for the CpG that caused the enrichment, all reads will contribute to this count because the start positions of all aligned reads will be within 50 bp of that methylated CpG. Now assume a second sample B that has identical methylation levels at the target CpG. However, for this sample all fragments are 200 bp long, but we still sequence only 50 bp of each fragment. As only a proportion of the reads would now start within 50 bases of the CpG, the read count will be less than for sample A and underestimate the number of fragments covering the target CpG.
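The bias described in this two-sample example can be demonstrated with a small simulation. The sketch below (in Python, for illustration only; positions, counts, and the helper name are ours) draws fragments that all cover a single methylated CpG and counts how many 50 bp reads reach it.

```python
import random

random.seed(1)
C = 1000             # position of the methylated CpG
READ_LEN = 50        # sequenced read length
N_FRAGMENTS = 10000  # fragments covering the CpG in each sample

def read_count(frag_len):
    """Count reads whose 50 sequenced bases cover the CpG at C.

    Every simulated fragment covers the CpG, so its start position is
    uniform in [C - frag_len + 1, C]; only the first READ_LEN bases of
    each fragment are sequenced.
    """
    hits = 0
    for _ in range(N_FRAGMENTS):
        start = random.randint(C - frag_len + 1, C)
        if start + READ_LEN - 1 >= C:  # the read itself reaches the CpG
            hits += 1
    return hits

count_a = read_count(50)   # sample A: all fragments 50 bp
count_b = read_count(200)  # sample B: all fragments 200 bp
print(count_a, count_b)    # sample B counts only ~25% of its fragments
```

Despite identical methylation (10,000 covering fragments in both samples), sample A's read count captures every fragment while sample B's captures roughly a quarter of them.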
Rather than using an estimation procedure, we can obtain the fragment size distribution directly by sequencing paired-end libraries and subtracting the start positions of successfully aligned read pairs. However, paired-end sequencing has disadvantages. First, paired-end libraries have only recently become available, so legacy data are typically single-end, and not all sequencing platforms currently support paired-end libraries. Second, paired-end libraries are more expensive and almost double the sequencing run time. These disadvantages may be justified in studies that require single base resolution, such as calling DNA sequence variants or estimating the percentage of methylation at specific bases after bisulfite conversion. In these scenarios, the number of reads covering each base is a critical determinant of data quality. However, for the enrichment-based methylation studies considered in this paper, it is the number of sequenced fragments that determines data quality. As the use of paired-end libraries does not increase the number of fragments, one could argue that it is better to spend the additional resources on sequencing more fragments using single-end libraries.
The goal of our investigation is to develop a method that uses single-end sequencing data to estimate fragment size distributions. Due to the nature of the sequencing technology as well as specific lab procedures to optimize the assays (e.g. size selection on fragments prior to sequencing), it is difficult to make strong parametric assumptions about this distribution. Therefore, we propose a non-parametric method. To validate our method, we performed simulation studies and made comparisons with NGS studies of paired-end libraries, which provide a benchmark by allowing an empirical determination of the fragment size distribution.
A detailed exposition of the proposed method to estimate CpG coverage can be found in the supplemental material; here we confine ourselves to a summary. In contrast to, for example, ChIP-seq data, in methylation studies there are often many sites that are located close to each other and that can all affect the enrichment. This complicates the estimation of the fragment size distribution. For example, using all reads in the neighborhood of a CpG that has other CpGs nearby will give imprecise estimates of the fragment size distribution because part of the enrichment at that locus will be the result of the nearby CpGs. To address this problem, our method uses only isolated CpGs. An isolated CpG is defined as a site C for which the interval [C-d, C+d] contains no other CpGs but C, where d is larger than the longest possible fragment size. For these isolated CpGs it is reasonable to assume that, given the fragment size X = x, the possible nucleotide positions R where the reads can start are independently and uniformly distributed (see Additional file 1: Figure S3 and Results section for empirical support of this assumption).
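The definition of an isolated CpG can be sketched as a simple scan over sorted CpG coordinates; the helper below is illustrative only (Python; the authors' implementation is in C++).

```python
def isolated_cpgs(positions, d):
    """Return the CpGs that have no other CpG within d bases on either
    side. positions: sorted CpG coordinates on one chromosome; d should
    exceed the longest possible fragment size (e.g. 350 bp here)."""
    isolated = []
    for i, c in enumerate(positions):
        left_ok = i == 0 or c - positions[i - 1] > d
        right_ok = i == len(positions) - 1 or positions[i + 1] - c > d
        if left_ok and right_ok:
            isolated.append(c)
    return isolated

# 2000 and 2100 are only 100 bp apart, so neither is isolated
print(isolated_cpgs([100, 500, 2000, 2100, 5000], 350))  # [100, 500, 5000]
```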
Because the enrichment is imperfect, not all sequenced fragments will contain methylated CpGs. Under the assumption that the start positions of such “noise” reads are uniformly distributed in the [C-d, C+d] interval, they will not bias estimates because Pr(x) in (2) is calculated from the difference between the numbers of reads starting at adjacent bases.
For example, if the read length is 50, Pr(X > r) = 1.0 for reads starting within 50 bp of the CpG, but Pr(X > r) < 1.0 for reads starting further away, as part of the tagged fragments will be too short to cover the CpG. Once we have determined for every read the probability that the fragment it tags covers the CpG, these probabilities can be summed over all reads to obtain a coverage estimate for the CpG.
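This summation step can be sketched as follows; surv is a toy survival function Pr(X > r) standing in for the estimated fragment size distribution, and all positions and function names are illustrative (Python, for illustration only).

```python
def coverage_estimate(read_starts, cpg_pos, surv):
    """Sum, over all reads, the probability that the tagged fragment
    covers the CpG: a read starting r bases before the CpG covers it
    with probability Pr(X > r) = surv(r)."""
    return sum(surv(cpg_pos - s) for s in read_starts)

def surv(r):
    # toy survival function: fragment lengths uniform on 50..200 bp
    if r < 50:
        return 1.0
    return max(0, 200 - r) / 151.0

# reads starting 10, 40, 100 and 150 bases before a CpG at position 1000;
# the first two reads (within the 50 bp read length) contribute 1.0 each
est = coverage_estimate([990, 960, 900, 850], 1000, surv)
print(round(est, 3))
```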
Equation (4) shows that E(cov) depends on the fragment sizes. The implication is that coverage estimates will differ across samples if the fragment sizes differ across these samples. However, because the fragment size distribution is determined by the lab protocol and is not directly related to the amount of methylation, this difference represents an artifact. To avoid such differences it may be necessary to standardize the coverage estimates using this expected contribution. We calculated the required coverage standardization factor for each sample as the mean of the expected read contributions across all samples in the study divided by E(cov) in (4) for that specific sample.
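A minimal sketch of this standardization follows; the uniform read start assumption, the maximum distance, and the helper names are our simplifications for illustration, not equation (4) itself (Python).

```python
def expected_contribution(surv, max_r):
    """Stand-in for E(cov): expected contribution of a random read,
    assuming (for illustration) read starts uniform over distances
    0..max_r from the CpG, with surv(r) = Pr(X > r)."""
    return sum(surv(r) for r in range(max_r + 1)) / (max_r + 1)

def standardization_factors(survs, max_r):
    """Per-sample factor: mean E(cov) across samples / sample's E(cov)."""
    e = [expected_contribution(s, max_r) for s in survs]
    mean_e = sum(e) / len(e)
    return [mean_e / ei for ei in e]

s_short = lambda r: 1.0 if r < 100 else 0.0  # all fragments 100 bp
s_long = lambda r: 1.0 if r < 200 else 0.0   # all fragments 200 bp
factors = standardization_factors([s_short, s_long], 349)
print([round(f, 2) for f in factors])  # [1.5, 0.75]
```

Multiplying each sample's coverage estimates by its factor puts the two samples on a common scale, removing the artifact caused by their different fragment sizes.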
As the read start counts will show sampling fluctuations, we “smooth” the data prior to calculating Pr(x) with formula (2). Our first method involved the Nadaraya-Watson estimator [28, 29]. This smoother takes for each position the m nearest neighbors and estimates the number of read starts by averaging the values across this window using a kernel as a weighting function. As it essentially uses the mean, this kernel-based smoother assumes that the underlying function is locally constant. To provide an alternative we also used a cubic spline method that fits a more flexible local regression model instead. We illustrate results obtained with the two methods in Figure 1.
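The kernel smoothing step can be illustrated with a minimal Nadaraya-Watson estimator using a Gaussian kernel; the paper itself uses the R functions ksmooth and smooth.spline, so this Python sketch and its toy data are for illustration only.

```python
import math

def nw_smooth(y, bandwidth):
    """Nadaraya-Watson estimate at each position: a kernel-weighted
    average of the read start counts at neighboring positions."""
    out = []
    for i in range(len(y)):
        wsum = ysum = 0.0
        for j, v in enumerate(y):
            w = math.exp(-0.5 * ((i - j) / bandwidth) ** 2)  # Gaussian kernel
            wsum += w
            ysum += w * v
        out.append(ysum / wsum)
    return out

noisy = [10, 12, 8, 11, 30, 9, 10, 11]  # read start counts with one spike
smoothed = nw_smooth(noisy, 1.0)
print([round(v, 1) for v in smoothed])  # the spike at index 4 is damped
```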
Step 2 results in a set of estimates of Pr(x) that are used in step 3 to calculate the coverage function (3). The true underlying coverage function (3) is monotone descending but in practice this may not hold due to sampling fluctuations. To ensure monotonicity, as a final smoothing step we used the procedure proposed by Dette et al. [30, 31].
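As a simple illustration of enforcing a monotone descending fit, the pool-adjacent-violators algorithm below computes the least-squares non-increasing sequence; note this is a stand-in for, not an implementation of, the procedure of Dette et al. used in the paper.

```python
def antitonic_fit(y):
    """Least-squares non-increasing fit to y via pool-adjacent-violators:
    fit a non-decreasing sequence to the reversed data, then reverse."""
    blocks = []  # list of (mean, weight) blocks
    for v in reversed(y):
        mean, w = float(v), 1
        # merge with the previous block while monotonicity is violated
        while blocks and blocks[-1][0] > mean:
            m2, w2 = blocks.pop()
            mean = (mean * w + m2 * w2) / (w + w2)
            w += w2
        blocks.append((mean, w))
    fit = []
    for mean, w in blocks:
        fit.extend([mean] * w)
    return fit[::-1]

# sampling noise makes 0.95 follow 0.9; the fit restores monotonicity
print(antitonic_fit([1.0, 0.9, 0.95, 0.5]))  # [1.0, 0.925, 0.925, 0.5]
```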
To obtain the coverage functions and standardization factors we wrote an R function that also summarizes and plots the (smoothed) data. For the kernel-based smoother we used the R function ksmooth. For the cubic spline we used smooth.spline, the design of which parallels the smooth.spline function of Chambers and Hastie. The function monoProc was used in the final step to obtain monotone descending coverage functions. We also wrote a program to generate the input data, which is a table with the counts of the read starts around isolated CpGs. Prior to calculating this table, the program allows user-specified quality control (QC) of the reads. Because of the size of the data files, this program was coded in C++. The source code, Windows and Linux executables, and documentation are freely available from http://www.people.vcu.edu/~ejvandenoord/.
To validate our method we sequenced 50 bp + 35 bp paired-end libraries in 8 inbred adult C57BL/6 male mice (see supplemental material for details on the laboratory methods, quality control, and data processing). The number of reads per sample was on average 53.6 million. We could map 87% of the reads. Using d = 350 bp, the total number of isolated CpGs was 287,493 which corresponds to 1.4% of all CpGs in the C57BL/6 genome (build 9/NCBI37). In terms of uniquely mapped reads, an average of 184,853 reads per sample mapped to isolated CpGs.
Fifty-two percent of the mapped reads (or 45% of total reads) satisfied our criteria for high quality read pairs meaning that they aligned uniquely with the right orientation and acceptable fragment size. We obtained the fragment size distributions from the paired-end data by subtracting the start positions of the successfully aligned read pairs. Although these fragment size distributions may not be perfect (e.g. only a proportion of the read pairs are used and the fragments tagged by these high quality pairs may not be completely representative of all sequenced fragments), they should provide a good opportunity to validate findings as the distributions are “observed” and do not require estimation.
To test our estimation procedure through simulations, we generated 10,000 random samples based on each of the three distributions in Figure 2. The number of reads with start positions close to isolated CpGs equaled 10,000, 25,000, 50,000, 75,000, or 100,000. The condition that assumes 100,000 reads is comparable to what we observe in our empirical data. The other conditions enable us to get a sense of the robustness of our method if fewer reads were available. To assess the precision of our estimator, we first calculated the mean difference and absolute mean difference between the estimated coverage function and the real coverage function used to simulate the data, and then averaged these differences across all possible read start positions. When subsequently averaged across the 10,000 simulated samples, the mean difference provides information about whether there are systematic differences (=bias) between estimated and true coverage functions. To obtain a measure of the variability of the estimated coverage functions, we also calculated the standard deviation of the mean difference in the 10,000 simulations. Finally, the mean of the absolute differences across the 10,000 simulations provides an overall measure of precision that incorporates both systematic differences and the variability of the estimates.
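These precision measures can be sketched as follows; the coverage functions and the number of replicates are toy values for illustration only (Python).

```python
def precision_summary(true_fn, estimates):
    """Bias (mean difference), its SD across simulated samples, and the
    mean absolute difference between estimated and true coverage
    functions, each first averaged over read start positions."""
    n = len(true_fn)
    mean_d, abs_d = [], []
    for est in estimates:
        diffs = [e - t for e, t in zip(est, true_fn)]
        mean_d.append(sum(diffs) / n)                # signed: bias
        abs_d.append(sum(abs(x) for x in diffs) / n)  # absolute: precision
    bias = sum(mean_d) / len(mean_d)
    sd = (sum((m - bias) ** 2 for m in mean_d) / len(mean_d)) ** 0.5
    return bias, sd, sum(abs_d) / len(abs_d)

true_fn = [1.0, 0.8, 0.5]                  # toy true coverage function
ests = [[1.1, 0.9, 0.6], [0.9, 0.7, 0.4]]  # two simulated estimates
bias, sd, mad = precision_summary(true_fn, ests)
print(bias, sd, mad)  # approximately 0.0, 0.1, 0.1
```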
Summary of simulation results comparing estimated and true coverage curves with different numbers of reads
Additional file 1: Figure S3 shows the plots with read start position distributions for each of the 8 samples. These distributions show systematic outliers at the very beginning of the read (positions 0–4). However, after these initial positions, the frequencies of the read start positions do not show a systematic trend until the read length is reached. The decay after that point is expected and caused by parts of fragments becoming too short to cover the CpG (see Formula 1), which essentially forms the basis of our estimator. In other data, we have sometimes observed a decay that starts before the read length is reached. However, this was the result of some fragments being shorter than the read length. Such fragments can occur when the instrument initially sequences part of the adaptor. These reads are then “trimmed” during alignment. Thus, when the methylated CpG is at the very beginning of the read, the assumption of uniform read start distribution does not hold. However, as our estimator only uses the data starting from approximately the minimum read length, these outliers do not affect the estimation. The absence of a systematic trend until the minimum fragment length is reached suggests that the assumption of a uniform distribution for position-level read counts is reasonable for the range from which the data are used.
Before estimating the coverage function using the empirical sequencing data, we first eliminated one read from each pair so as to create single-end read input data. Empirical data will comprise multi- and duplicate-reads. Many reads map to multiple locations of the genome. Often a single alignment can be selected because it is clearly better than the others. In the case of multi-reads, multiple alignments are about equally good. Selecting only the best alignment for each multi-read carries the danger of alignment errors (e.g. alignments to regions with SNPs are less likely to be best alignments because SNPs cause mismatches). On the other hand, excluding all multi-reads may negatively affect accuracy. Duplicate-reads are reads that start at the same nucleotide position. When sequencing a whole genome, duplicate-reads often arise from template preparation or amplification artifacts. In our context of sequencing an enriched genomic fraction, duplicate-reads are increasingly likely to occur by chance because reads are expected to align to a much smaller fraction of the genome.
We examined empirically whether it would be better to allow for a limited set of high quality multi- and duplicate reads or exclude all such reads. To select high quality multi-reads, any read that mapped to more than 10 loci was excluded from further consideration. From the remaining multi-reads, we selected those that aligned almost equally well to only a few loci. Specifically, we selected the multi-reads that had fewer than five alignments with alignment scores (read length ‒ 3 × the number of mismatches) within five points of the best score. To avoid disproportionate representation, multi-reads were weighted in proportion to the number of alignments in the coverage calculations. In all instances where >3 (duplicate) reads started at the same position, we reset the read count to 1 for the coverage calculations, assuming that these reads all tagged a single fragment. If 2 or 3 reads started at the same position, we looked for other reads in the neighborhood of ±25 bp. If other reads mapped to this area, we retained the read count of 2 or 3 in the coverage calculations, assuming that the duplicate reads occurred by chance due to enrichment of fragments caused by a methylated CpG in the region. If no other reads were found, we assumed that the duplicate reads were artifacts and reset the read count to 1 for the coverage calculations.
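The duplicate-read rules above can be sketched as follows; the helper name and the dictionary representation of read start counts are illustrative, not the authors' C++ implementation (Python; the multi-read weighting is omitted).

```python
def retained_count(start_counts, pos, window=25):
    """Read count used in the coverage calculation for reads starting
    at pos: >3 duplicates are reset to 1; 2-3 duplicates are kept only
    when other reads start within +/- window bp, else reset to 1.
    start_counts maps start position -> number of reads starting there."""
    n = start_counts.get(pos, 0)
    if n <= 1:
        return n
    if n > 3:
        return 1  # assume the duplicates all tag a single fragment
    # 2 or 3 duplicates: look for supporting reads in the neighborhood
    supported = any(p != pos and abs(p - pos) <= window
                    for p, c in start_counts.items() if c > 0)
    return n if supported else 1

counts = {1000: 3, 1010: 1, 2000: 2, 5000: 7}  # start position -> reads
print(retained_count(counts, 1000),  # 3: supported by the read at 1010
      retained_count(counts, 2000),  # 1: no other reads within 25 bp
      retained_count(counts, 5000))  # 1: more than 3 duplicates
```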
Precision coverage function estimated after different QC procedures for duplicate- and multi-reads
Comparison of coverage estimates obtained using different methods (table; methods compared: read count (50 bp), read count extended (150 bp), individual kernel function, mean paired-end function, and mean kernel function)
We developed a non-parametric method that uses isolated CpGs to estimate sample-specific fragment size distributions from data obtained by sequencing single-end libraries. An important application of the proposed method is to quantify the amount of methylation by estimating the number of fragments covering a CpG. To optimize coverage estimation, we studied several variations. Although it is possible that the optimal approach varies somewhat across settings, we found that the two smoothing methods had very similar overall performance. Furthermore, the inclusion of particularly high-quality multi-reads, rather than merely using uniquely mapped reads, improved the precision of the estimated coverage function. This finding is consistent with other reports showing that multi-reads contain information and should not automatically be discarded. Finally, the use of a mean coverage function across all samples may result in a loss of precision. This is because the reduced sampling fluctuations may not outweigh the biases that are introduced when a mean function is used to approximate fragment size distributions that are likely to vary across samples, even if stringent lab protocols are used to minimize these differences.
Our data suggested that taking the fragment size distribution into account may be important to obtain unbiased coverage estimates even when the standard (extended) read count method is used for coverage calculations. Thus, using the mouse sequence data, we showed that even after careful size selection and the use of a standardized protocol to fragment DNA, differences in fragment size distributions can occur that can create artificial inter-individual differences in coverage estimates. To avoid these biases we proposed a standardization factor that can be calculated from the estimated fragment size distributions.
Further applications of the estimated fragment size distributions are conceivable as well. For example, enrichment-based methods are semi-quantitative in the sense that they do not yield direct estimates of methylation levels. For the purpose of assessing methylation levels of sites, methods have been developed to remedy this problem by normalizing the data based on local CpG density [34, 35]. However, the optimal definition of CpG density depends on the fragment size distribution. For example, the local CpG density of a site will be higher if the fragments are larger. Thus, the proposed method can yield a more refined measure of CpG density.
Our estimator uses data from isolated CpG sites, which correspond to a modest proportion of all CpGs (e.g. 1.4% in C57BL/6 mice). It is possible that the fragment length distribution differs for the remaining CpG sites. It is important to note that such a bias would affect our method but not the coverage function derived from the paired-end sequencing data, which considers all fragments. The fact that we observed a correlation of 0.999 between coverage calculations based on our estimate versus those based on the paired-end coverage function suggests that a possible bias does not interfere with the precision of our method. A somewhat related point is that enrichment protocols may be less efficient for CpG-poor versus CpG-rich regions, and that the enrichment will depend on the extent to which isolated CpGs are methylated. As the precision of our method depends on the successful enrichment of fragments with a single methylated CpG, it may not work as well with protocols that mainly enrich for CpG-dense regions or in samples where isolated CpGs are not methylated.
Methylation studies are a promising complement to genetic studies of DNA sequence. However, detailed prior biological knowledge is typically lacking, so methylome-wide association studies will be critical to detect disease relevant sites. A cost-effective approach involves sequencing single-end libraries created from samples that are enriched for methylated DNA fragments. A limitation of single-end libraries is that the fragment size distribution is not observed, which hampers several aspects of the data analysis. In this article we developed a non-parametric method that uses isolated CpGs to estimate sample-specific fragment size distributions. We show that our method is highly accurate and can improve the analysis of cost-effective MWAS that sequence single-end libraries created from samples that are enriched for methylated DNA fragments.
MWAS: Methylome-wide association studies
This work was supported by the National Human Genome Research Institute (Grants R01 HG004240 and HG004240-02S1), the National Institute on Drug Abuse (Grant R25 DA026119), and the National Institute of Mental Health (Grant RC2 MH089996).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.